Friday, August 24, 2018

Mind Your State for Your State of Mind

This article by Pat Helland appeared at ACM Queue on July 2018. Pat is a self-confessed apostate. He is also a database philosopher; look at the title of his recent publications: Standing on distributed shoulders of giants, Life beyond distributed transactions, Immutability changes everything, Heisenberg was on the "write" track, consistently eventual, etc.

This "mind your state for your state of mind" article looks at the history of interactions of applications and storage/databases, and charts their co-evolution as they move into the distributed and scalable world.

The evolution of state, storage, and computing

Storage has evolved from disks directly attached to your computer to shared appliances such as SANs (storage area networks) leading the way to storage clusters of commodity servers contained in a network.

Computing evolved from a single-process on a single server, to multiple processes communicating on a single server, to RPCs (remote procedure calls) across a tiny cluster of servers. In the 2000s, the concept of SOA (service oriented architecture) emerged to provide trust isolation so that the distrusted outsider cannot modify the data. As the industry started running services at huge scale, it learned that breaking a service into smaller microservices provides advantages through better modularity/decoupling and through stateless and restartable operation. Today microservices have emerged as the leading technique to support scalable applications.

Databases also evolved tremendously. Before database transactions, there were complexities in updating data even in a single computer, especially if failures happened. Database transactions dramatically simplified the life of application developers. But as solutions scaled beyond a single database, life got more challenging. First, we tried to make multiple databases look like one database. Then, we started hooking multiple applications together using SOA; each service had its own discrete database with its own transactions but used messaging to coordinate across boundaries. Key-value stores offered more scale but less declarative functionality for processing the application's data.  Multirecord transactions were lost as scale was gained. Finally, when we started using microservices, a microservice instance did not have its own data but reached directly to a distributed store shared across many separate services. This scaled better—if you got the implementation right.

Minding durable and session state in microservices

Durable state is stuff that gets remembered across requests and persists across failures. Durable state is not usually kept in microservices. Instead, it is kept in back-end databases and key-value stores. In the next section we look at some of these distributed stores and their example use cases.

Session state is the stuff that gets remembered across requests in a session but not across failures. Session state exists within the endpoints associated with the session, and is hard to maintain when the session is smeared across service instances: the next message to the service pool may land at a different service instance.

Without session state, you can't easily create transactions crossing requests. Typically, microservice environments support a transaction within a single request but not across multiple requests. Furthermore, if a microservice accesses a scalable key-value store as it processes a single request, the scalable key-value store will usually support only atomic updates to a single key. Programmers are on their own when changing values tied to multiple keys. I will discuss the implications of this in MAD questions section at the end.

Different stores for different uses

As we mentioned as part of evolution of databases, to cope with scalable environments, data had to be sharded into key values. Most of these scalable key-value stores ensured linearizable, strongly consistent updates to their single keys. Unfortunately, these linearizable stores would occasionally cause delays seen by users. This led to the construction of nonlinearizable stores with the big advantage that they have excellent response times for reads and writes. In exchange, they sometimes give a reader an old value.


Different applications demand different behaviors from durable state. Do you want it *right* or do you want it *right now*? Applications usually want the latter and are tolerant of stale versions. We review some example application patterns below.


Workflow over key-value. This pattern demonstrates how applications perform workflow when the durable state is too large to fit in a single database. The workflow implemented by careful replacement will be a mess if you can't read the last value written. Hence, this usage pattern will stall and not be stale. This is the "must be right" even if it's not "right now" case.

Transactional blobs-by-ref. This application runs using transactions and a relational database and stores big blobs such as documents, photos, PDFs etc at a data store. To modify a blob, you always create a new blob to replace the old one. Storing immutable blobs in a nonlinearizable database does not have any problems with returning a stale version: since there's only one immutable version, there are no stale versions. Storing immutable data in a nonlinearizable store enjoys the best of both worlds: it's both right and right now.

E-Commerce shopping cart. In e-commerce, each shopping cart is for a separate customer. There's no need or desire for cross-cart consistency. Customers are very unhappy if their access to a shopping cart stalls. Shopping carts should be right now even if they're not right.

E-Commerce product catalog. Product catalogs for large e-commerce sites are processed offline and stuffed into large scalable caches. This is another example of the business needing an answer right now more than it needs the answer to be right.

Search. In search, it is OK to get stale answers, but the latency for the response must be short. There's no notion of linearizable reads nor of read-your-writes.


Each application pattern shows different characteristics and tradeoffs. As a developer, you should first consider your application's requirements carefully.
  • Is it OK to stall on reads?
  • Is it OK to stall on writes?
  • Is it OK to return stale versions?
You can't have everything!

If you don't carefully mind your state, it will bite back, and degrade your state of mind.

The Cosmos DB take

I am doing my sabbatical at Microsoft Cosmos DB. So I try to put things in context based on what I see/work on here. This is how Cosmos DB fits in this picture and provides answers to these challenges.

We had talked about this in a previous post. "Clients should be able to choose their desired consistency. The system cannot possibly predict or determine the consistency that is required by a given application or client."

Most real-world application scenarios do not fall nicely into the two extreme choices of consistency models that are typically offered by databases, e.g., strong and eventual consistency. Cosmos DB offers five well-defined and practical consistency models to accommodate real-life application needs. Of the 5 well-defined consistency models to choose from, customers have been overwhelmingly selecting the relaxed consistency models (e.g. Session, Bounded-Staleness and Consistent Prefix). Cosmos DB brings a tunable set of well-defined consistency models that users can set at the click of a button ---there is no need to deal with messy datacenters, databases, and quorum configurations. Users can later override the consistency level they selected on each individual request  if they want to. Cosmos DB is also the 1st commercial database system to have exposed Probabilistic Bounded Staleness (PBS)  as a metric for customers to determine "how eventual is eventual consistency".


I will have a series of posts on global distribution/replication in Cosmos DB soon. Below is just an appetizer :-)

Even for a deployment over multiple regions/continents, the write operation for session, consistent prefix, and eventual consistency levels are acknowledged by the write region without blocking for replies from other regions. Bounded consistency write often gets replied quickly from the write region without waiting for replication to other regions: the bound of staleness is not reached means, the region can give the write a green light locally. Finally, in all these cases including strong consistency, with the multimaster general availability rollout, the write region is always the closest region to the client performing the write.

Latency of the reads are guaranteed to be fast backed up by 99.99% SLAs.  Reads are always answered within the region (nearest region) contacted by the client. While serving reads from the nearest region, the consistency level selected (a global property!) is guaranteed by the clever use of logical sequence numbers to check that the provided data guarantees the selected consistency level. Within the region, a read is answered by one replica (without waiting for other replica or region) for consistent prefix and eventual consistency levels. Reads may occasionally wait for the second read replica (with potentially waiting for fresh data to arrive from another region) for session consistency. Finally, reads are answered by a read quorum of 2 out of 4 for bounded and strong consistency.

In other words, while the tradeoffs between predictable read latency, predictable write latency, and read your writes (session consistency) inherently exists, Cosmos DB handles them lazily/efficiently but surely by involving client-library (or gateway) collaboration. As we mentioned in a previous post, it is possible to achieve more efficiency with lazy state synchronization behind the curtain of operation consistency exposed to the client.

MAD questions


1. What is the next trend in the co-evolution of computing and storage?

The pattern of software-based innovation has always been to virtualize a physical thing (say typewriters, libraries, publishing press, accountants), and then improve on it every year thanks to the availability of exponentially more computing/querying power. The cloud took this to another level with its virtually infinite computing and storage resources provided on demand.

Software is eating the world. By 2025, we will likely have a virtual personal assistant, virtual nanny, virtual personalized teacher, and a virtual personalized doctor accessible through our smartphones.

If you think about it, the trend for providing X as a service derives and benefits from the trend of virtualizing X. Another term for virtualization is making it software-defined. We had seen software-defined storage, software-defined networks, software-defined radios, etc. (Couple years ago I was joking about when we will see software-defined software, now I joke about software-defined software-defined software.)

This trend also applies to and shapes the cloud computing architecture.

  • First virtual machines (VMs) came and virtualized and shared the hardware so multiple VMs can colocate on the same machine. This allowed consolidation of machines, prevented the server sprawl problem, and reduced costs as well as improving manageability. 
  • Then containers came and virtualized and shared the operating system, and avoided the overheads of VMs. They provided faster startup times for application servers. 
  • "Serverless" took the virtualization a step ahead. They virtualize and share the runtime, and now the unit of deployment is a function. Applications are now defined as a set of functions (i.e., lambda handlers) with access to a common data store. 

A new trend is developing for utilizing serverless computing even for the long running analytics jobs. Here are some examples.

The gravity of virtualization pulls for disaggregation of services as well. The paper talked about the trend about disaggregation of services, e.g., computing from storage. I think this trend will continue because this is fueled by the economies of scale cloud computing leverage on.

Finally, I had written a distributed systems perspective prediction of new trends for research earlier here.


2. But, you didn't talk about decentralization and blockchains?

Cloud and modern datacenter computing for that matter provides trust isolation. The premise of blockchain and decentralized systems is to provide trustless computation. They take trustless as an axiom.  While it is clear that trust isolation is a feature organizations/people care (due to security and fraud protection), it is not clear if trustless computing is a feature organizations/people care.

Finally, as I wrote about this earlier several times, logical centralization (i.e., the cloud model) has a lot of advantages over decentralization in terms of efficiency, scalability, ease of coordination, and even fault-tolerance (via ease of coordination and availability of replication). Centralization benefits from the powerful hand of economy of scale. Decentralized is up against a very steep cliff.

Here is High Scalability blog's take on it as well.


3. How do we start to address the distributed coordination challenge of microservices?

Despite the fact that transactional solutions do not work well with microservices architectures, we often need to provide some of the transactional guarantees to operations that span multiple (sometimes dozens of) microservices. In particular, if one of the microservices in the request is not successful, we need to revert the state of the microservices that have already changed their states. As we discussed it is hard to maintain session state with microservices. These corrective actions are typically written at the coordinator layer of the application in an ad-hoc manner and are not enforced by some specialized protocol.

This is a real problem, compounded with the tendency of microservice instances to appear/disappear on demand.

One thing that helps is to use distributed sagas pattern to instill some discipline on undoing the side-effects of a failed operation that involves many microservices.

I had proposed that self-stabilization has a role to play here. Here is my report on this; section 4.1 is the most relevant part to this problem.

Last Friday I had attended @cmeik's end of internship talk at Microsoft Research. It was on building a prototype middleware that helps with the fault-tolerance of across-microservices operations.

No comments: