Friday, August 17, 2018

Replicated Data Consistency Explained Through Baseball

I had mentioned this report in my first overview post about my sabbatical at Cosmos DB. This is a 2011 technical report by Doug Terry, who was working at Microsoft Research at the time. The paper explains the consistency models through publishing of baseball scores via multiple channels. (Don't worry, you don't have to know baseball rules or scoring to follow the example; the example works as well for publishing of, say, basketball scores.)

Many consistency models have been proposed for distributed and replicated systems. The reason for exploring different consistency models is that there are fundamental tradeoffs between consistency, performance, and availability.

While seeing the latest written value for an object is desirable and reasonable to provide within a datacenter, offering strong consistency for georeplicated services across continents results in lower performance and reduced availability for reads, writes, or both.

On the opposite end of the spectrum, with eventual consistency, the data returned by a read operation is the value of the object at some point in the past, yet the performance and availability dimensions are greatly improved.

The paper argues that eventual consistency is insufficient for most use cases, but that strong consistency is not needed either; most use cases benefit from some intermediate consistency guarantee.

This concurs with Cosmos DB production workloads. As I mentioned in my first overview post, about 73% of Cosmos DB tenants use session consistency and 20% prefer bounded staleness. This leaves strong consistency and eventual consistency at the fringes. After I introduce the model and consistency guarantees the paper considers, I will talk about how these map to the consistency levels used in Cosmos DB.

Read consistency guarantees

The paper considers 6 consistency guarantees. These are all examples of the read-write centric operation consistency we discussed in the "Many Faces of Consistency" paper.

The paper considers a simple abstract model for the data store. In this model, clients perform write operations at a master node in the data store. Writes are serialized and eventually performed/replicated, in the same order as at the master, at all the servers (a.k.a. replicas). Clients perform reads from the servers/replicas. Reads return the values of data objects that were previously written, though not necessarily the latest values. This is because the entire set of writes at the master may not yet be reflected, in order and in its entirety, at the servers.

The 6 consistency guarantees below are defined by which set of previous writes are visible to a read operation.

Strong consistency ensures that a read operation returns the value that was last written for a given object. In other words, a read observes the effects of all previously completed writes: if write operations can modify or extend portions of a data object, such as appending data to a log, then the read returns the result of applying all writes to that object. (Note that in order to achieve strong consistency in the presence of crashes, the write operation at the master must proceed in lockstep with a quorum of replicas to permanently record writes, requiring synchronous replication!)

Eventual consistency is the weakest of the guarantees, so it allows the greatest set of possible return values. Such a read can return results from a replica that has received an arbitrary subset of the writes to the data object being read.

By requesting a consistent prefix, a reader is guaranteed to observe an ordered sequence of writes starting with the first write to a data object. In other words, the reader sees a version of the data store that existed at the master at some time in the past.

Bounded staleness ensures that read results are not too stale. That is, the read operation is guaranteed to see at least all writes that completed d time (or number of updates) before the read started. The read may potentially see some more recently written values.

Monotonic Reads (sometimes called a session guarantee) is a property that applies to a sequence of read operations performed by a given client. It states that if the client issues a read operation and then later issues another read to the same object(s), the second read will return the same value(s) or the results of later writes.

Read My Writes is also a session property specific to a client. It guarantees that the effects of all writes that were performed by the client are visible to the client's subsequent reads. If a client writes a new value for a data object and then reads this object, the read will return the value that was last written by the client (or some other value that was later written by a different client).

None of the last four read guarantees is stronger than the others, so applications may want to combine several of these guarantees. For example, a client could request monotonic reads and read my writes so that it observes a data store that is consistent with its own actions.
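To make the session guarantees concrete, here is a minimal client-side sketch that combines monotonic reads with read my writes by tracking the highest version seen per object. The versioned replica interface is my own invention for illustration, not an API from the paper or from Cosmos DB.

```python
class Replica:
    """Toy versioned key-value replica."""
    def __init__(self):
        self.data = {}                          # key -> (value, version)

    def write(self, key, value):
        _, version = self.data.get(key, (None, 0))
        self.data[key] = (value, version + 1)
        return version + 1

    def read(self, key):
        return self.data.get(key, (None, 0))


class SessionClient:
    """Rejects replica answers older than what this session already saw."""
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas
        self.min_version = {}                   # key -> lowest acceptable version

    def write(self, key, value):
        version = self.master.write(key, value)
        self.min_version[key] = version         # read my writes: must see this

    def read(self, key):
        for replica in self.replicas + [self.master]:
            value, version = replica.read(key)
            if version >= self.min_version.get(key, 0):
                self.min_version[key] = version  # monotonic reads: never go back
                return value
        raise RuntimeError("no sufficiently fresh replica")
```

A replica that has not yet received the client's write is simply skipped; a client that never writes, and hence has an empty version map, can read from any replica, which degenerates to eventual consistency.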

The table shows the performance and availability typically associated with each consistency guarantee. Strong consistency is desirable from a consistency viewpoint but offers the worst performance and availability since it generally requires reading from a majority of replicas. Eventual consistency, on the other hand, allows clients to read from any replica, but offers the weakest consistency. The table illustrates the tradeoffs involved as each guarantee offers a unique combination of consistency, performance, and availability.

Cosmos DB consistency levels

Cosmos DB allows developers to choose among 5 well-defined consistency models along the consistency spectrum. While the definitions of strong, eventual, and consistent prefix are the same as the ones discussed in the report, Cosmos DB strengthens the definitions for bounded staleness and session consistency, making them more useful for the clients.

More specifically, Cosmos DB's bounded staleness is strengthened to offer total global order except within the "staleness window". In addition, monotonic read guarantees exist within a region both inside and outside the "staleness window".

Cosmos DB's session consistency (again scoped to a client session) is strengthened to include consistent prefix, monotonic writes, read my writes, and write-follows-reads in addition to the monotonic read property. As such, it is ideal for all scenarios where a device or user session is involved. (You can check the clickable consistency map by Kyle Kingsbury to read about the definitions of monotonic writes and write-follows-reads, which impose order on the writes.)

It is possible to sign up for a free trial (no credit card or commitment required) to test these guarantees on Cosmos DB. Once you create a Cosmos DB resource, you can also see an animation of the consistency guarantees after selecting the default consistency option from the tab. I am pasting screenshots of the strong consistency and bounded staleness animations below.

While animations are nice for intuitively understanding consistency levels, inside Cosmos DB, the TLA+ specification language is used for specifying these models precisely and model checking them with the global distribution protocols considered. In the coming weeks, as I promised, I will try to sanitize and publish from a client-side perspective the TLA+ specifications for these consistency levels.

To put its money where its mouth is, Cosmos DB offers comprehensive 99.99% SLAs which guarantee throughput, consistency, availability, and latency for Cosmos DB database accounts scoped to a single Azure region configured with any of the five consistency levels, or database accounts spanning multiple regions configured with any of the four relaxed consistency levels. Furthermore, independent of the choice of consistency level, Cosmos DB offers a 99.999% SLA for read and write availability for database accounts spanning two or more regions. I will dedicate a separate blog post to SLAs in the coming weeks, as I start learning more about them.

Baseball analogy

This toy example assumes that the score of the game is recorded in the key-value store in two objects, one for the number of runs scored by the "visitors" and one for the "home" team's runs. When a team scores a run, a read operation is performed on its current score, the returned value is incremented by one, and the new value is written back to the key-value store.
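The read-increment-write cycle can be sketched as below (the store interface is a hypothetical stand-in, not an API from the paper). Note that the read must observe the writer's own earlier writes; otherwise the increment could operate on a stale value and silently lose a run.

```python
class ScoreStore:
    """Toy stand-in for the key-value store holding the two score objects."""
    def __init__(self):
        self.data = {"visitors": 0, "home": 0}

    def read(self, team):
        return self.data[team]

    def write(self, team, runs):
        self.data[team] = runs


def record_run(store, team):
    runs = store.read(team)      # must observe this writer's earlier writes,
    store.write(team, runs + 1)  # otherwise a run could be silently lost
```

This is exactly why the scorekeeper, the only writer in the example, needs the read my writes guarantee rather than strong consistency.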

This sequence of writes is from a hypothetical baseball game; the figure shows the inning-by-inning line score. The game is currently in the middle of the seventh inning, and the home team is winning 2-5.

Different read guarantees may result in clients reading different scores for this game in progress. The table below lists the complete set of scores that could be returned by reading the visitors' and home scores with each of the 6 consistency guarantees. The visitors' score is listed first, and different possible return values are separated by commas.

A strong consistency read can only return one result, the current score. On the other hand, an eventual consistency read can return one of 18 possible scores, many of which were never the actual score. The consistent prefix property limits the results to scores that actually existed at some time. The results that can be returned by a bounded staleness read depend on the desired bound.
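Since the line-score table is an image here, the snippet below uses a hypothetical write sequence consistent with the 2-5 score to enumerate the possible read results under two of the guarantees. With eventual consistency each object's replica can lag independently, giving 3 x 6 = 18 combinations; with consistent prefix only scores that actually existed at the master are possible.

```python
from itertools import product

# Hypothetical write sequence at the master (visitors and home are
# separate objects); the game currently stands at visitors 2, home 5.
writes = [("home", 1), ("visitors", 1), ("home", 2), ("home", 3),
          ("visitors", 2), ("home", 4), ("home", 5)]

def eventual_scores(writes):
    # each object independently reflects some prefix of its own writes
    v_vals = [0] + [v for key, v in writes if key == "visitors"]
    h_vals = [0] + [v for key, v in writes if key == "home"]
    return set(product(v_vals, h_vals))

def consistent_prefix_scores(writes):
    # only (visitors, home) pairs that existed at the master at some time
    scores, v, h = {(0, 0)}, 0, 0
    for key, value in writes:
        if key == "visitors":
            v = value
        else:
            h = value
        scores.add((v, h))
    return scores
```

For this sequence, eventual consistency admits 18 possible scores while consistent prefix admits only 8, and a strong read admits just (2, 5).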

The paper considers 6 hypothetical participants querying the baseball database for scores: the scorekeeper, umpire, radio reporter, sportswriter, statistician, and stat-watcher. The table lists the consistencies that these participants use. Of course, each participant would be okay with strong consistency, but, by relaxing the consistency requested for her reads, she will likely observe better performance and availability. Additionally, the storage system may be able to better balance the read workload across servers, since it has more flexibility in selecting servers to answer weak consistency read requests.

The toy example is meant to illustrate that the desired consistency depends as much on who is reading the data as on the type of data. All of the 6 presented consistency guarantees are useful, because each guarantee appears at least once in the participant needs table. That is, different clients may want different consistencies even when accessing the same data.


The report (here is the talk by Doug Terry on the report if you would like a more immersive/extensive discussion of the topic) concludes as follows:
Clients should be able to choose their desired consistency. The system cannot possibly predict or determine the consistency that is required by a given application or client. The preferred consistency often depends on how the data is being used. Moreover, knowledge of who writes data or when data was last written can sometimes allow clients to perform a relaxed consistency read, and obtain the associated benefits, while reading up-to-date data. This could be of practical significance since the inherent trade-offs between consistency, performance, and availability are tangible and may become more pronounced with the proliferation of georeplicated services. This suggests that cloud storage systems should at least consider offering a larger choice of read consistencies.

As I discussed above, Cosmos DB fits the bill. It provides 5 consistency level choices in an all-in-one package. It is easy to configure the consistency level on the fly, and the effects take place quickly. Cosmos DB also gives you the flexibility to override the default consistency level configured for your account on a specific read request. It turns out only about 2% of Cosmos DB tenants override consistency levels on a per-request basis.

MAD questions

1. What are the effects of granularity of writes on consistency?
The paper said: "We assume that the score of the game is recorded in a key-value store in two objects, one for the number of runs scored by the visitors and one for the home team's runs."

Why keep two separate objects, "home" and "visitors", though? Instead, what if we used just one object called "score" that consists of a tuple <visitors, home>? Then the reads would be more consistent by design; even an eventual consistency read would not return a score that was not an actual score at some point in the game.

Sometimes you don't get to design the data storage/access schemes, but when you get to decide this, you can act smartly and improve consistency. This reminds me of techniques used in self-stabilizing systems for compacting the state space to forbid bad states by construction.
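A quick sketch of that point, using the same hypothetical game: when each write stores the whole (visitors, home) tuple in a single "score" object, every value a lagging replica can return is a score that actually occurred.

```python
# Each write replaces the single "score" object with the full tuple;
# this is the hypothetical game ending at visitors 2, home 5.
tuple_writes = [(0, 1), (1, 1), (1, 2), (1, 3), (2, 3), (2, 4), (2, 5)]

# A replica returns the value of the last write it has applied (or the
# initial score), so the possible reads are exactly the actual past scores:
possible_reads = {(0, 0)} | set(tuple_writes)
```

That is 8 possible results instead of the 18 that two independently lagging objects allow, and none of them is a phantom score.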

2. What are the theoretical limits on the tradeoffs among consistency levels?
As we have seen, each proposed consistency model occupies some point in the complex space of tradeoffs. The CAP theorem shows a coarse tradeoff between consistency and availability. The PACELC model tries to capture further tradeoffs between consistency, availability, and latency.

More progress has been made in exploring the tradeoff space from the theoretical perspective since then. The causal consistency result showed that natural causal consistency, a strengthening of causal consistency that respects the real-time ordering of operations, provides a tight bound on the consistency semantics that can be enforced without compromising availability and convergence. There has been a plethora of recent papers on improving causally consistent georeplicated datastores, and I hope to summarize the prominent ones in the coming weeks.

3. Is baseball scoring also nonlinear?
Baseball is still foreign to me, even though I have been in the US for 20 years now. While I try to be open-minded and eager to try new things, I can be stubborn about not learning some things. Here is another example where I was unreasonably close-minded. Actually, though I had seen this paper earlier, I didn't read it then because I thought I would have to learn the baseball rules to understand it. Funny? No! I wonder what other opportunities I miss because of being peculiarly close-minded about certain things.

Well, to win a battle against my obstinate and peculiar ignorance of baseball, I just watched this 5-minute explanation of baseball rules. Turns out, it is not that complicated. Things never turn out as scary/complicated/bad as I make them out to be in my mind.

I had written about how bowling scoring is nonlinear and consequences of that. Does baseball scoring also have a nonlinear return? It looks like you can get a lot of runs scored in a single inning. So maybe that counts as nonlinearity as it can change the game score quickly, at any inning.

Tuesday, August 14, 2018

Schema-Agnostic Indexing with Azure Cosmos DB

This VLDB'15 paper is authored by people from the Cosmos DB team and Microsoft research. The paper has "DocumentDB" in the title because DocumentDB was the project that evolved into Azure Cosmos DB. The project started in 2010 to address the developer pain-points inside Microsoft for supporting large scale applications. In 2015 the first generation of the technology was made available to Azure developers as Azure DocumentDB. It has since added many new features and introduced significant new capabilities, resulting in Azure Cosmos DB!

This paper describes the schema-agnostic indexing subsystem of Cosmos DB. By being fully schema-agnostic, Cosmos DB provides many benefits to developers. Keeping the database schema and indexes in sync with an application's schema is especially painful for globally distributed apps. But with a schema-agnostic database, you can iterate your app quickly without worrying about schemas or indexes. Cosmos DB automatically indexes all the data --no schema, no indexes required-- and serves queries very fast. Since no schema and index management is required, you don't have to worry about application downtime while migrating schemas either.

To achieve these, Cosmos DB natively supports JSON (JavaScript Object Notation) data model and JavaScript language directly within its database engine. This approach enables the following:

  • The query language (rooted in JavaScript's type system, expression evaluation and function invocation model) supports rich relational and hierarchical queries and is exposed to developers as a SQL dialect. 
  • By default, the database engine automatically indexes all documents without requiring schema or secondary indexes from developers.
  • The database supports transactional execution of application logic provided via stored procedures and triggers, written entirely in JavaScript. 

Of course the challenge is to do all of these while (1) providing predictable performance guarantees and (2) remaining cost effective. For (1), the challenges are in updating the index efficiently and synchronously against high write rates and in hitting the SLA performance guarantees as well as the configured consistency levels. For (2), the challenges are in resource governance for providing multitenancy under extremely frugal budgets: the index updates must be performed within a strict budget of system resources (CPU, memory, storage), and load balancing and relocation of replicas should be performed such that the worst-case on-disk storage overhead of the index is bounded and predictable. Please see my overview post for a high-level overview of Cosmos DB. I will be diving into the subsystems in the coming days.

The resource model

A tenant starts by provisioning a database account using an Azure subscription (you can also try it for free). The database account manages one or more databases, each of which manages a set of resources: users, permissions, and containers. In Cosmos DB, a container can be projected as a collection, table, or graph to support the SQL, Table, and Gremlin APIs. In addition, a container also manages stored procedures, triggers, user defined functions (UDFs), and attachments.

Each resource is uniquely identified by a stable logical URI and is represented as a JSON document. Developers can interact with resources via HTTP (and over a stateless TCP protocol) using the standard HTTP verbs for CRUD (create, read, update, delete), queries, and stored procedures.

All the data within an Azure Cosmos DB container (e.g. collection, table, graph) is horizontally partitioned, and Cosmos DB transparently manages those resource partitions behind the scenes. Tenants can elastically scale a resource by simply creating new resources, which get partitioned (or distributed) across resource partitions automatically.

A resource partition is made highly available by a replica set across Azure regions. I talk about this global distribution next.

Global distribution

The service is deployed worldwide across multiple Azure regions (50+ of them). Cosmos DB is classified as a foundational service in Azure --which means it is automatically available in every Azure region, including public, sovereign, and government clouds. Cosmos DB runs on clusters of machines, each with dedicated local SSDs --other storage options for petabyte-level data are also configurable.

A stamp consists of racks, each of which is built out as a separate fault domain with redundant networking and power; each stamp has its own fabric controller. I will summarize the Microsoft fabric paper later.

Each machine hosts replicas corresponding to various resource partitions, which are placed and load balanced across machines in the federation. Each replica hosts an instance of the database engine, which manages the resources (e.g. documents) as well as the associated index. The database engine contains a replicated state machine for coordination, the JavaScript language runtime, the query processor, and the storage and indexing subsystems responsible for transactional storage and indexing of documents.

While I won't be able to detail it in this post, Cosmos DB's unique design of resource governance underlies the core of the system: on any single machine there could be hundreds of tenants co-existing, each properly isolated (without any noisy neighbor symptoms) and resource-governed. In addition, Cosmos DB provides a consistent programming model to provision throughput across a heterogeneous set of database operations for each of the tenants at each region at any point in time while maintaining the SLAs.

Schema-agnostic indexing

The schema of a document describes the structure and the type system of the document. In contrast to XML, JSON's type system is simple: it is a strict subset of the type systems of JavaScript as well as of many other programming languages.

The Cosmos DB database engine operates directly at the level of JSON grammar, remaining agnostic to the concept of a document schema and blurring the boundary between the structure and instance values of documents. This, in turn, enables it to automatically index documents without requiring schema or secondary indices as we see below.

Documents as trees

By representing documents as trees, Cosmos DB normalizes both the structure and the instance values across documents into the unifying concept of a dynamically encoded path structure. In this representation, each label in a JSON document (including both the property names and their values) becomes a node of the tree. In other words, the values become first-class citizens, at the same level as the schema labels. Note that the leaves contain the actual values and the intermediate nodes the schema information.

The figure above shows the trees created for Document 1 and Document 2. The labels are inferred from the document. A (pseudo) root node is created to parent the rest of the (actual) nodes corresponding to the labels in the document underneath. The nested data structures drive the hierarchy in the tree. Intermediate artificial nodes labeled with numeric values (e.g. 0, 1, ...) are employed to represent array indices.


With automatic indexing, every path in a document tree is indexed (unless the developer configures it to exclude certain path patterns). Therefore, it is important to use a normalized path representation to ensure that the cost of indexing and querying a deeply nested document, say 10 levels, is the same as that of a flat document consisting of key-value pairs at depth 1.
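As a rough sketch of this flattening (the function and the path syntax below are my own, not Cosmos DB's internal representation), a document maps to a set of normalized (path, leaf value) pairs, with array positions becoming the artificial numeric labels:

```python
def flatten(node, prefix=""):
    """Flatten a JSON-like value into (normalized path, leaf value) pairs."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        # arrays contribute artificial numeric labels (0, 1, ...)
        items = ((str(i), v) for i, v in enumerate(node))
    else:
        return [(prefix, node)]             # a leaf holds an instance value
    pairs = []
    for label, value in items:
        child = f"{prefix}/{label}" if prefix else label
        pairs.extend(flatten(value, child))
    return pairs
```

For example, flatten({"headquarters": "Belgium", "exports": [{"city": "Moscow"}]}) yields [("headquarters", "Belgium"), ("exports/0/city", "Moscow")].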

There are two possible mappings of document and the paths: a forward index mapping, which keeps a map of (document id, path) tuples, as we saw above, and an inverted index mapping, which keeps a map of (path, document id) tuples.

The inverted index provides a very efficient representation for querying. The index tree is a document which is constructed out of the union of all of the trees representing individual documents within the collection. The index tree grows over time as new documents get added or updated to the collection. In contrast to relational database indexing, there is no need for restarting indexing from scratch as new fields are introduced.

Each node of the index tree is an index entry containing the label and position values, called the term, and the ids of the documents, called the postings. The postings in the curly brackets (e.g. {1,2}) in the inverted index figure correspond to the documents (e.g., Document1 and Document2) containing the given label value. An important implication of treating both the schema labels and the instance values uniformly is that everything is packed inside one big index. An instance value (still at the leaves) is not repeated: it may play different roles across documents, under different schema labels, but it remains the same value.
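A minimal sketch of the two mappings over toy documents (the document contents are made up, and in the real system terms are compactly encoded rather than stored as plain tuples):

```python
# Hypothetical flattened documents: (normalized path, leaf value) pairs.
docs = {
    1: [("headquarters", "Belgium"), ("exports/0/city", "Moscow")],
    2: [("headquarters", "Italy"), ("exports/0/city", "Moscow")],
}

# Forward index: document id -> its terms.
forward = {doc_id: set(terms) for doc_id, terms in docs.items()}

# Inverted index: term -> postings (ids of the documents containing it).
inverted = {}
for doc_id, terms in docs.items():
    for term in terms:
        inverted.setdefault(term, set()).add(doc_id)

def point_query(*terms):
    # documents satisfying all (path, value) predicates: intersect postings
    return set.intersection(*(inverted.get(t, set()) for t in terms))
```

Here inverted[("exports/0/city", "Moscow")] is {1, 2}, the postings shared by both documents, and point_query(("headquarters", "Belgium")) returns {1}.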

The inverted index looks to me similar to the indexing structures used by search engines in the information retrieval domain. In a sense, Cosmos DB lets you search your database for anything added to it, regardless of its schema structure.

For the normalized path representation, the service uses a partial forward path representation (consisting of 3 segments) for paths where range support is needed, and a partial reverse path representation for paths needing equality (hash) support.


Developers can query the collections using queries written in SQL and JavaScript. Both SQL and JavaScript queries get translated to an internal intermediate query language (Query IL). The Query IL supports projections, filters, aggregates, sort, flatten operators, expressions (arithmetic, logical, and various data transformations), and user defined functions (UDFs).

The query model attempts to strike a balance between functionality, efficiency, and simplicity. The database engine natively compiles and executes the SQL query statements. You can query a collection using the REST APIs or any of the client SDKs. The .NET SDK comes with a LINQ provider. Here is a query playground where you can experiment.

The figures below illustrate point querying and range querying. The inverted index allows the query to identify the documents that match the query predicate quickly. Another important implication of treating both the schema and instance values uniformly in terms of paths is that the inverted index is also a tree. Thus, the index and the results can be serialized to a valid JSON document and returned as documents themselves, as they are returned in the tree representation. This enables recursing over these results with additional querying.

For the second query, a range query, GermanTax is a user defined function executed as part of query processing.

Scratching the surface

Here is the summary of the stuff so far. The database engine is designed to be schema-agnostic by representing documents as trees. This enables supporting automatic indexing of documents, serving consistent queries in the face of sustained write volumes under an extremely frugal resource budget in a multitenant environment. (This 2015 talk by Dharma provides a nice summary of these concepts. This page provides code examples for these concepts.)

In this post, I only scratched the surface. I haven't started to write about the technical meat of the paper in the logical index organization and the physical index organization sections. These sections introduce a novel logical index layout and a latch-free log-structured storage with blind incremental updates in order to meet the strict requirements of performance (SLA-bounded) and cost effectiveness (in a multitenant environment). The Bw-tree, a B-tree variant introduced by Microsoft Research, is employed as part of the physical index organization.

In order to write about those sections more confidently, I will track down the experts here on the logical and physical index organization and will try to get a crash course on these.

MAD questions

1. How do NoSQL and NewSQL indexing methods compare/contrast with those for relational databases?
I realized that I am completely ignorant on this topic. But, it also appears that there aren't good explanations on this when I search for it. I will try to find more about this.

2. How do these design decisions fit with the historical trend/direction of database systems?
I found this recent article by Pat Helland very interesting: "Mind Your State for Your State of Mind". So I will plan to summarize it and try to analyze how today's design decisions fit with the historical trends/directions.
"Applications have had an interesting evolution as they have moved into the distributed and scalable world. Similarly, storage and its cousin databases have changed side by side with applications. Many times, the semantics, performance, and failure models of storage and applications do a subtle dance as they change in support of changing business requirements and environmental challenges. Adding scale to the mix has really stirred things up. This article looks at some of these issues and their impact on systems."

3. Can there be a ranking component to the search/querying that enables ranking with respect to user-defined priorities? That could be useful for long-running or realtime streaming queries.

Thursday, August 9, 2018

The many faces of consistency

This is a lovely paper (2016) from Marcos Aguilera and Doug Terry. It is an easy read. It manages to be technical and enlightening without involving analysis, derivation, or formulas.

The paper talks about consistency. In distributed/networked systems (which includes pretty much any practical computer system today), data sharing and replication brings up a fundamental question: What should happen if a client modifies some data items and simultaneously, or within a short time, another client reads or modifies the same items, possibly at a different replica?

The answer depends on the context/application. Sometimes eventual consistency is what is desired (e.g., DNS), and sometimes you need strong consistency (e.g., reservations, accounting).

The paper points out two different types of consistency, state consistency and operation consistency, and focuses mainly on comparing/contrasting these two approaches.

When I see the terms state versus operation consistency, without reading the rest of the paper, my gut reaction is to conclude that this is a matter of abstraction. If you are implementing the system, you work at the state consistency level. If you are exposing the system to clients, since they do not need to know its internals and only care about the operations they invoke, you work at the operation consistency level. State consistency can help support and provide operation consistency to the clients.

My gut reaction turns out to be pretty accurate, yet the paper goes into more details and elaborates on some subtleties. I am glad to read about the two approaches in more detail. I think, going forward, it would be important to relate/bridge/reconcile these two approaches.

1. State consistency

State consistency pertains to the state of the system (which comprises the current values of the data items). These are properties that the system should satisfy despite concurrent access and the existence of multiple replicas.

A major subcategory of state consistency is one defined by an invariant ---a predicate on the state that must evaluate to true. For example, in a concurrent program, a singly linked list must not contain cycles. As another example, in a primary-backup system, the mutual consistency invariant requires that replicas have the same state when there are no outstanding updates.

It is possible to weaken/loosen invariants to include error bounds, probabilistic guarantees, and eventual satisfaction.

2. Operation consistency

State consistency is limited to the properties on state, but in many cases clients care little about the system state and more about the results that they obtain from the system. This calls for a different form of consistency, operation consistency, which pertains to the behavior that clients observe from interacting with the system.

Operation consistency has subcategories based on the different ways to define the consistency property.

2.1 Sequential equivalence

This subcategory defines the permitted operation results of a concurrent execution in terms of the permitted operation results in a sequential execution. Some examples are as follows.

Linearizability is a strong form of consistency. Each operation must appear to occur at an instantaneous point between its start time (when the client submits it) and its finish time (when the client receives the response), and the execution at these instantaneous points forms a valid sequential execution. More precisely, there must exist a legal total order T of all operations with their results, such that (1) T is consistent with the partial order <, meaning that if op1 finishes before op2 starts then op1 appears before op2 in T, and (2) T defines a correct sequential execution.

Sequential consistency is weaker than linearizability. It requires that operations execute as if they were totally ordered in a way that respects the order in which *each client* issues operations. More precisely, the partial order < is this time defined as: op1 < op2 iff both operations are executed by *the same client* and op1 finishes before op2 starts. In other terms, while linearizability imposes across-clients system-level ordering, sequential consistency is content with imposing a client-centric ordering.
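These two definitions can be checked mechanically on tiny histories by brute force. The sketch below (a toy single-register checker of my own, feasible only for a handful of operations) searches for a legal total order respecting the required partial order; the history at the bottom is sequentially consistent but not linearizable, because the read, issued by a different client, returns the initial value 0 after the write of 1 has already finished.

```python
from itertools import permutations

# An operation: (client, kind, value, start time, finish time).
# Reads carry the value they returned; the register starts at 0.

def _legal(ops, perm):
    # replay the register: every read must return the latest written value
    value = 0
    for i in perm:
        _, kind, v, _, _ = ops[i]
        if kind == "w":
            value = v
        elif v != value:
            return False
    return True

def _respects(perm, must_precede):
    position = {i: k for k, i in enumerate(perm)}
    return all(position[i] < position[j] for i, j in must_precede)

def linearizable(ops):
    # real-time order: i must precede j whenever i finishes before j starts
    order = [(i, j) for i in range(len(ops)) for j in range(len(ops))
             if ops[i][4] < ops[j][3]]
    return any(_legal(ops, p) and _respects(p, order)
               for p in permutations(range(len(ops))))

def sequentially_consistent(ops):
    # only each client's own program order must be respected
    order = [(i, j) for i in range(len(ops)) for j in range(len(ops))
             if ops[i][0] == ops[j][0] and ops[i][4] < ops[j][3]]
    return any(_legal(ops, p) and _respects(p, order)
               for p in permutations(range(len(ops))))

# Client A writes 1 during [0,1]; client B later reads 0 during [2,3].
history = [("A", "w", 1, 0, 1), ("B", "r", 0, 2, 3)]
```

If both operations had come from the same client, the cross-client reordering would be forbidden and the history would fail both checks.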

The next examples pertain to systems that support transactions. Intuitively, a transaction is a bundle of one or more operations that must be executed as a whole. The system provides an isolation property, which ensures that transactions do not significantly interfere with one another. There are many isolation properties: serializability, strong session serializability, order-preserving serializability, snapshot isolation, read committed, repeatable reads, etc. All of these are forms of operation consistency, and several of them (e.g., serializability, strong session serializability, and order-preserving serializability aka strict-serializability) are of the sequential equivalence subcategory.

2.2 Reference equivalence

Reference equivalence is a generalization of sequential equivalence. Examples of this occur often in database systems.

Snapshot isolation requires that transactions behave identically to the following reference implementation. When a transaction starts, it gets assigned a monotonic start timestamp. When the transaction reads data, it reads from a snapshot of the system as of the start timestamp. When a transaction T1 wishes to commit, the system obtains a monotonic commit timestamp and verifies whether there is some other transaction T2 such that (1) T2 updates some item that T1 also updates, and (2) T2 has committed with a commit timestamp between T1’s start and commit timestamp. If so, then T1 is aborted; otherwise, T1 is committed and all its updates are applied instantaneously as of the time of T1’s commit timestamp.
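The reference implementation above is short enough to sketch directly. The toy store below is my own encoding of the description (not from the paper): monotonic timestamps, reads from the snapshot at the start timestamp, and an abort when another transaction updating a common item committed in between.

```python
import itertools

class SnapshotStore:
    """Toy reference implementation of snapshot isolation."""
    def __init__(self):
        self.clock = itertools.count(1)   # monotonic timestamp source
        self.versions = {}                # key -> list of (commit_ts, value)
        self.committed = []               # list of (commit_ts, written_keys)

    def begin(self):
        return {"start": next(self.clock), "writes": {}}

    def read(self, txn, key):
        # read from the snapshot as of the transaction's start timestamp
        snapshot = [(ts, v) for ts, v in self.versions.get(key, [])
                    if ts <= txn["start"]]
        return max(snapshot)[1] if snapshot else None

    def write(self, txn, key, value):
        txn["writes"][key] = value        # buffered until commit

    def commit(self, txn):
        commit_ts = next(self.clock)
        for cts, keys in self.committed:
            # abort if a transaction updating a common item committed
            # between our start and commit timestamps
            if txn["start"] < cts < commit_ts and keys & txn["writes"].keys():
                return False
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((commit_ts, value))
        self.committed.append((commit_ts, set(txn["writes"])))
        return True
```

Two concurrent transactions writing the same key then behave exactly as described: the first to commit succeeds, and the second is aborted.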

Another example is one-copy serializability, where the replicated system must behave like a single node (no replicas!) that provides serializability.

2.3 Read-write centric

The read-write centric subcategory applies to systems with two very specific operations: read and write. Consistency is defined with respect to the set of writes that could have potentially affected the read.

Read-my-writes requires that a read by a client sees at least all writes previously executed by the same client, in the order in which they were executed. This is relevant when clients expect to observe their own writes, but can tolerate delays before observing the writes of others.

Bounded staleness bounds the time (or the number of updates) by which reads may lag behind writes: a read must see at least all writes that completed d time units (or d updates) before the read started.
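Read-my-writes can be checked mechanically against an execution history. Here is a minimal sketch with a history encoding of my own choosing: each event is either a write by a client or a read listing the values that client observed, and each client's earlier writes must appear in its reads as an in-order subsequence.

```python
def check_read_my_writes(history):
    """history is a time-ordered list of events:
    ("w", client, value) or ("r", client, values_seen)."""
    my_writes = {}
    for event in history:
        if event[0] == "w":
            _, client, value = event
            my_writes.setdefault(client, []).append(value)
        else:
            _, client, seen = event
            # the client's own earlier writes must appear in seen, in order
            it = iter(seen)
            if not all(w in it for w in my_writes.get(client, [])):
                return False
    return True

ok_history = [("w", "A", 1), ("w", "A", 2), ("r", "A", [1, 2]), ("r", "B", [1])]
bad_history = [("w", "A", 1), ("r", "A", [])]   # A misses its own write
```

The `w in it` idiom checks that the client's writes form a subsequence of the observed values, so out-of-order observations are rejected too. Client B's read is unconstrained here, matching the guarantee: delays before seeing the writes of others are tolerated.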

3. State versus operation consistency

Operation consistency is, well..., operational! As part of my PhD training in distributed systems, I was inculcated to avoid operational reasoning because it fails to work for concurrent execution. This is also what I teach my students in the first week of my distributed systems class. Operational reasoning does not scale, since there are too many corner cases to check. For reasoning about distributed systems, we use invariant-based reasoning, which lends itself better to state consistency.

However, the truth is multidimensional. While unsuitable for reasoning/verification, operation consistency has its advantages.

On which type of consistency to use, the paper suggests the following:
"First, think about the negation of consistency: what are the inconsistencies that must be avoided? If the answer is most easily described by an undesirable state (e.g., two replicas diverge), then use state consistency. If the answer is most easily described by an incorrect result to an operation (e.g., a read returns stale data), then use operation consistency. 
A second important consideration is application dependency. Many operation consistency and some state consistency properties are application independent (e.g., serializability, linearizability, mutual consistency, eventual consistency). We recommend trying to use such properties, before defining an application-specific one, because the mechanisms to enforce them are well understood. If the system requires an application specific property, and state and operation consistency are both natural choices, then we recommend using state consistency due to its simplicity."

Notice that while the operation consistency section listed more than a dozen consistency levels, the state consistency section named just a couple of invariant types, including a vaguely named mutual consistency invariant. This is because state consistency is implementation-specific, a whitebox approach. It is more suitable for distributed systems designers/builders than for users/clients. By restricting the domain to specific operations (such as read and write), operation consistency is able to treat the system as a blackbox and provide a reusable abstraction to the users/clients.

4. What is really behind the curtain?

Here is another advantage of the operation consistency approach. It provides an abstraction (i.e., a curtain, a veil) that you can leverage in your implementation. Behind this curtain, you can pull tricks. The paper gives this simple example.
"An interesting example is a storage system with three servers replicated using majority quorums, where (1) to write data, the system attaches a monotonic timestamp and stores the data at two (a majority of) servers, and (2) to read, the system fetches the data from two servers; if the servers return the same data, the system returns the data to the client; otherwise, the system picks the data with the highest timestamp, stores that data and its timestamp in another server (to ensure that two servers have the data), and returns the data to the client. This system violates mutual consistency, because when there are no outstanding operations, one of the servers deviates from the other two. However, this inconsistency is not observable in the results returned by reads, since a read filters out the inconsistent server by querying a majority. In fact, this storage system satisfies linearizability, one of the strongest forms of operation consistency."

This is sort of a lazy evaluation idea. You don't always need to expend work/energy to keep the database consistent; you tidy up the database only when it is queried, only when it needs to produce a result.

The "TAPIR: Building Consistent Transactions with Inconsistent Replication (SOSP'15)" paper also builds on this premise. Irene puts it this way:
"Today, systems that want to provide strong guarantees use Paxos (or, if you are hipper than me, RAFT), and everyone else uses something cheaper. Paxos enforces a strict serial ordering of operations across replicas, which is useful, but requires coordination across replicas on every operation, which is expensive. 
What we found in the TAPIR project is that Paxos is too strong for some strong system guarantees and, as a result, is wasting work and performance for those systems. For example, a lock server wants mutual exclusion, but Paxos provides a strict serial ordering of lock operations. This means that a lock server built using Paxos for replication is coordinating across replicas even when it is not necessary to ensure mutual exclusion. 
Even more interesting, a transactional storage system wants strictly serializable transactions, which requires a linearizable ordering of transactions but only requires a partial ordering of operations (because not all transactions touch all keys). With some careful design in TAPIR, we are able to enforce a linearizable ordering of transactions with no ordering of operations."

5. What role do these concepts play in Cosmos DB?

Cosmos DB provides 5 well-defined operation consistency properties to the clients: strong, bounded staleness, session, consistent prefix, and eventual consistency. These consistency models were chosen because they are practical and useful, as signaled by customers' actual usage of them in production workloads.

And, of course, in order not to leave performance on the table, instead of imposing strong state-based consistency, Cosmos DB's global distribution protocols employ several optimizations behind the curtain, while still managing to provide the requested operation consistency levels to the clients.

Thus, state consistency still plays a major role in Cosmos DB from the perspective of the core layer developers/designers. The core layer backend team uses TLA+ to identify and check the weakest state-consistency invariants that support and imply the desired operation consistency levels. These weak invariants are efficient and do not require costly state synchronization.

As I mentioned in my previous post, I will post about TLA+/PlusCal translation of consistency levels provided by Cosmos DB here. I also hope to talk about some of the state-consistency invariants and efficiency techniques employed when I start describing the global distribution at the Cosmos DB core layer in my upcoming blog posts.

MAD questions

1. Is it possible to bridge the two approaches?

These two approaches can be complementary like two sides of the same coin.

I think it is easier to go from state-consistency of the system to inferring and designing operation consistency properties.

How about the other way? While checking operation consistency requires analyzing an execution log, restricting the domain to specific operations (such as read and write) makes it possible to achieve application independence and treat the system as a blackbox. The Jepsen library achieves that for distributed systems testing. If the system is a blackbox, or was developed without invariant/state consistency in mind, this operation-consistency-based testing approach can help identify problems. But it is still unclear how to infer/design state consistency properties/invariants that can help in fixing the problem.

The development of better tools for observability/auditability approaches (such as Retroscope) can help in bridging this gap.

Another effort to help bridge the gap could be to identify/name and create an ontology of invariants used in state consistency approaches. I don't know of much work in that direction except this one from VLDB15.

2. Do these two approaches converge at the eventual consistency position?
The paper states the following. "Operational eventual consistency is a variant of eventual consistency (a form of state consistency) defined using operation consistency. The requirement is that each write be eventually seen by all reads, and if clients stop executing writes then eventually every read returns the same latest value."

Eventual consistency is likely a natural convergence point for state and operation consistency. This reminds me of the "Conflict-free replicated data types" paper.

3. Are terms/definitions about consistency consistent yet?
Consistency is an important concept, so it emerged and developed in different domains (distributed systems, databases, and computer architecture) simultaneously. And of course, different domains used different terminology, and confusion arose.

I am not surprised. Even in the distributed systems community, on the restricted topic of distributed consensus, researchers have been using inconsistent terminology for many decades before consensus (pun intended) arose and the terminology converged and standardized.

Consistency definitions can be overwhelming because there are many parameters involved, even without considering transactions.

Kyle Kingsbury recently provided a simplified clickable map of major consistency models (including transactional consistency models).

Tuesday, August 7, 2018

Azure Cosmos DB

I started my sabbatical work with the Microsoft Azure Cosmos DB team recently. I have been in talks and collaboration with the Cosmos DB people, and specifically with Dharma Shukla, for over 3 years. I have been very impressed with what they were doing and decided that this would be the best place to spend my sabbatical year.

The travel and settling down took time. I will write about those later. I will also write about my impressions of the greater Seattle area as I discover more about it. This was a big change for me after having stayed in Buffalo for 13 years. I love the scenery: everywhere I look I see a gorgeous lake or hill/mountain scene. And, oh my God, there are blackberries everywhere! It looks like the Himalayan blackberry is an invasive species here, but I can't complain. As I go on an evening stroll with my family, we snack on blackberries growing along the sidewalk. It seems like we are the only ones doing so ---people in the US are not much used to eating fruit right off trees/bushes.

Ok, coming back to Cosmos DB... My first impressions of Cosmos DB, from an insider perspective this time, are also very positive and overwhelming. It is hard not to get overwhelmed. Cosmos DB provides a global, highly-available, low-latency all-in-one database/storage/querying/analytics service to heavyweight demanding businesses. Cosmos DB is used ubiquitously within Microsoft systems/services, and is also one of the fastest-growing services used by Azure developers externally. It manages 100s of petabytes of indexed data, serves 100s of trillions of requests every day from thousands of customers worldwide, and enables customers to build highly-responsive mission-critical applications.

I find that there are a lot of things to learn before I can start contributing in a meaningful and significant way to the Cosmos DB team. So I will use this blog to facilitate, speed up, and capture my learning. The process of writing helps me detach, see the big picture, and internalize stuff better. Moreover my blog also serves as my augmented memory and I refer back to it for many things.

Here is my first attempt at an overview post. As I get to know Cosmos DB better, I hope to give you other more-in-depth overview posts.

What is Cosmos DB?

Cosmos DB is Azure's cloud-native database service.

(The term "cloud-native" is a loaded key term, and the team doesn't use it lightly. I will try to unpack some of it here, and I will revisit this in my later posts.)

It is a database that offers frictionless global distribution across any number of Azure regions ---50+ of them! It enables you to elastically scale throughput and storage worldwide on-demand quickly, and you pay only for what you provision. It guarantees single-digit-millisecond latencies at the 99th percentile, supports multiple consistency models, and is backed by comprehensive service level agreements (SLAs).

I am most impressed with its all-in-one capability. Cosmos DB seamlessly supports many APIs, data formats, consistency levels, and needs across many regions. This alleviates data integration pains, which are a major problem for all businesses. The all-in-one capability also eliminates the developer effort wasted on keeping multiple systems with different-yet-aligned goals in sync with each other. I had written earlier about the Lambda versus Kappa architectures, and how the pendulum has swung all the way to Kappa. Cosmos DB's all-in-one capability gives you the Kappa benefits.

This all-in-one capability backed with global-scale distribution enables new computing models as well. The datacenter-as-a-computer paper from 2009 talked about the vision of warehouse-scale machines. By providing a frictionless globe-scale replicated database, Cosmos DB opens the way to thinking about the globe-as-a-computer. One of the use cases I heard from some Cosmos DB customers amazed me. Some customers allocate a spare region (say, Australia) where they have no read/write clients as an analytics region. This spare region still gets consistent data replication, stays very up-to-date, and is employed for running analytics jobs without jeopardizing the access latencies of real read-write clients. Talk about disaggregated computation and storage! This is disaggregated storage, computing, analytics, and serverless across the globe. Under this model, the globe becomes your playground.

This disaggregated yet all-in-one computing model also manifests itself in how customers are acquired and settle into Cosmos DB. Customers often come for the query serving level, which provides high throughput and low latency via SSDs. Then they get interested in and invest into the lower-throughput but larger/cheaper storage options to store terabytes and petabytes of data. They then diversify and enrich their portfolio further with the analytics, event-driven lambda, and real-time streaming capabilities provided in Cosmos DB.

There is a lot to discuss, but in this post I will only make a brief introduction to the issues/concepts, hoping to write more about them later. My interests are of course at the bottom of the stack at the core layer, so I will likely dedicate most of my coming posts to the core layer.

Core layer

The core layer provides capabilities that the other layers build upon. These include global distribution, horizontally and independently scalable storage and throughput, guaranteed single-digit millisecond latency, tunable consistency levels, and comprehensive SLAs.

Resource governance is an important and pervasive component of the core layer. Request units (allocating CPU, memory, and throughput) are the currency used to provision resources. Provisioning a desired level of throughput under dynamically changing access patterns and across a heterogeneous set of database operations presents many challenges. To meet the stringent SLA guarantees for throughput, latency, consistency, and availability, Cosmos DB automatically employs partition splitting and relocation. This is challenging to achieve because Cosmos DB also handles fine-grained multi-tenancy, with 100s of tenants sharing a single machine and 1000s of tenants sharing a single cluster, each with diverse workloads and isolated from the rest. Adding even more to the challenge, Cosmos DB supports scaling database throughput and storage independently, automatically, and swiftly to address the customer's dynamically changing requirements/needs.
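To make the resource-governance idea concrete, here is a toy token-bucket model of per-partition request-unit provisioning. This is purely illustrative ---the class, refill policy, and numbers are my own assumptions, not Cosmos DB's actual admission-control mechanism: requests are charged in request units, the budget refills toward the provisioned RU/s, and over-budget requests are throttled.

```python
import time

class RequestUnitBucket:
    """Illustrative token bucket: refills continuously toward the
    provisioned RU/s capacity; over-budget requests are throttled."""
    def __init__(self, provisioned_rus):
        self.capacity = float(provisioned_rus)
        self.tokens = self.capacity
        self.last_refill = time.monotonic()

    def try_charge(self, request_units):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.capacity)
        self.last_refill = now
        if request_units <= self.tokens:
            self.tokens -= request_units
            return True                   # admit the request
        return False                      # throttle; caller should back off
```

A real system layers much more on top (per-operation RU costs, split/relocate decisions when a partition's budget saturates), but the per-request admission decision has this shape.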

To provide another important piece of functionality, global distribution, Cosmos DB enables you to configure regions as "read", "write", or "read/write" regions. Using Azure Cosmos DB's multi-homing APIs, the app always knows where the nearest region is (even as you add and remove regions to/from your Cosmos DB database) and sends requests to the nearest datacenter. All reads are served from a quorum local to the closest region to provide low-latency access to data anywhere in the world.

Cosmos DB allows developers to choose among five well-defined consistency models along the consistency spectrum. (Yay, consistency levels!) You can configure the default consistency level on your Cosmos DB account (and later override the consistency on a specific read request). About 73% of Azure Cosmos DB tenants use session consistency and 20% prefer bounded staleness. Only 2% of Azure Cosmos DB tenants override consistency levels on a per-request basis. In Cosmos DB, reads served at session, consistent prefix, and eventual consistency cost half as much as reads with strong or bounded staleness consistency.

This lovely technical report explains the consistency models through publishing of baseball scores via multiple channels. I will write a summary of this paper in the coming days. The paper concludes: "Even simple databases may have diverse users with different consistency needs. Clients should be able to choose their desired consistency. The system cannot possibly predict or determine the consistency that is required by a given application or client. The preferred consistency often depends on how the data is being used. Moreover, knowledge of who writes data or when data was last written can sometimes allow clients to perform a relaxed consistency read, and obtain the associated benefits, while reading up-to-date data."

Data layer

Cosmos DB supports and projects multiple data models (documents, graphs, key-value, table, etc.) over a minimalist type system and core data model: the atom-record-sequence (ARS) model.

A Cosmos DB resource container is a schema-agnostic container of arbitrary user-generated JSON items and JavaScript based stored procedures, triggers, and user-defined-functions (UDFs). Container and item resources are further projected as reified resource types for a specific type of API interface. For example, while using document-oriented APIs, container and item resources are projected as collection and document resources respectively. Similarly, for graph-oriented API access, the underlying container and item resources are projected as graph, node/edge resources respectively.

The overall resource model of an application using Cosmos DB is a hierarchical overlay of the resources rooted under the database account, and can be navigated using hyperlinks.

API layer

Cosmos DB supports three main classes of developers: (1) those who are familiar with relational databases and prefer the SQL language, (2) those who are familiar with dynamically typed modern programming languages (like JavaScript) and want a dynamically typed, efficiently queryable database, and (3) those who are already familiar with popular NoSQL databases and want to transfer their applications to the Azure cloud without a rewrite.

In order to meet developers wherever they are, Cosmos DB supports the SQL, MongoDB, Cassandra, Gremlin, and Table APIs, with SDKs available in multiple languages.

The ambitious future

There is a palpable buzz in the air in the Cosmos DB offices due to the imminent multimaster general-availability rollout (which I will also write about later). The team members keep to themselves, working intensely most of the time, but also have frequent meetings and occasional bursty standup discussions. This is my first stint in a big team/company, so I am trying to take this in as well. (Probably a post on that is coming up as well.)

It looks like the Cosmos DB team has caught good momentum. The team wants to make Cosmos DB the prominent cloud database and even the go-to all-in-one cloud middleware. Better analytics support and better OLAP/OLTP integration are in the works to support more demanding, more powerful next-generation applications.

Cosmos DB already has great traction in enterprise systems. I think it will be getting more love from independent developers as well, since it provides serverless computing and an all-in-one system with many APIs. It is possible to try it for free. To keep up to date with the latest news and announcements, you can follow @AzureCosmosDB and #CosmosDB on Twitter.

My work at Cosmos DB

In the short term, as I learn more about Cosmos DB, I will write more posts like this one. I will also try to learn and write about customer use cases, workloads, and operations issues without revealing details. I think learning about real-world use cases and problems will be one of the most important benefits I get from my sabbatical.

In the medium term, I will work on TLA+/PlusCal translation of consistency levels provided by Cosmos DB and share them here. Cosmos DB uses TLA+/PlusCal to specify and reason about the protocols. This helps prevent concurrency bugs, race conditions, and helps with the development efforts. TLA+ modeling has been instrumental in Cosmos DB's design which integrated global distribution, consistency guarantees, and high-availability from the ground-up. (Here is an interview where Leslie Lamport shares his thoughts on the foundations of Azure Cosmos DB and his influence in the design of Azure Cosmos DB.) This is very dear to my heart, as I have been employing TLA+ in my distributed systems classes for the past 5 years.

Finally, as I gain a better mastery of Cosmos DB internals, I would like to contribute to protocols for multimaster multirecord transaction support. I would also like to learn more about and contribute to Cosmos DB's automatic failover support during one or more regional outages. Of course, these protocols will all be modeled and verified with TLA+.

MAD questions

1. What would you do with a frictionless cloud middleware? Which new applications can this enable?
Here is something that comes to mind. Companies are already continuously uploading IoT sensor data from cars to Azure Cosmos DB. The next step would be to build more ambitious applications that make sense of correlated readings and use them to coordinate actions. Applications of this could be in traffic shaping/management of self-driving car squadrons. These applications will likely combine OLAP (including powerful machine-learning processes) and OLTP (including real-time streaming and actions).

2. What are some patterns in globe-as-a-computer paradigm?
Is this a truly new paradigm, or is it just a degree of convenience? Is it possible to argue that this is only incremental and not transformational? But then, a great degree of incremental capability translates into transformational change.

3. What are other things you like to learn about at a global-scale database?
Let me know in the comments.