Friday, August 31, 2018

TLA+ specification of the bounded staleness and strong consistency guarantees

In my previous post, I presented a TLA+ model of a distributed data store that provides the consistent prefix property. In this post, I extend this model slightly to specify the bounded staleness and strong consistency guarantees. In fact, the strong consistency specification falls out when we take the Delta on bounded staleness to be 1.

The TLA+ (well, PlusCal to be more accurate) code for this model is available at

The system model

As in the previous post, we assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.
WriteRegion == 1
ReadRegions == 2..NumRegions
chan = [n \in 1..NumRegions |-> <<>>]; 

We use D to denote the Delta on the bounded staleness consistency. Bounded staleness ensures that read results are not too stale: a read operation is guaranteed to see at least all writes older than the D most recent updates at the time the read started. The read may potentially see some more recently written values as well.

Strong consistency ensures that a read operation returns the value that was last written for a given object. This is achieved by using D=1.

The write region actions

The write region has 3 actions to perform: Commit the write in the write region, Multicast the write to the read regions, and Receive/process an ack message from a read region.

These actions can be performed in any arbitrary order inside the process loop, and the model checker will methodically explore all possible combinations of these actions interleaved with the read region actions to expose any flaws in the protocol.

The first action selection is to commit a write in the write region. (We don't get into how that is implemented in the region; maybe commit is done by replicating the update to a write quorum in the region.) As a result the CommittedLSN is incremented. Note that, in contrast to the consistent prefix model, in the bounded staleness model the write is throttled so that it does not advance more than D ahead of any of the read regions.
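As a sketch, this throttled commit can be written in PlusCal roughly as follows (a hedged reconstruction from the description above; the guard and variable names are my assumptions, and the linked model may differ in details):

```tla
CommitWrite:
  \* Throttle: stay within D of the slowest read region's acknowledged progress
  await CommittedLSN < CompletedLSN + D;
  CommittedLSN := CommittedLSN + 1;
```

Notice that with D=1 this guard degenerates to CommittedLSN = CompletedLSN: a new write cannot commit until the previous one has been acknowledged by all read regions, which is exactly the strong consistency behavior.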

The second action selection is to multicast a committed write to the read regions through the FIFO channels. This replication is asynchronous: messages may be sent whenever this action is chosen, and the send is not blocked waiting for acknowledgements from the read regions.

This action is exactly the same as in the consistent prefix model. The SentLSN variable denotes the last CommittedLSN write forwarded to the read regions and is used to ensure ordered replication.

The final action selection is to receive an Ack message from the channel from a read region. The Progress variable keeps track of the Ack messages, and the CompletedLSN is updated to reflect the highest write that is acknowledged by all the read regions. Harking back to action 1, notice that the write region lazily disseminates this CompletedLSN information to the read regions by piggybacking it on the commit-write messages. In this model the read regions do not utilize this CompletedLSN information, but as I start to explore in the MAD questions, it can be useful.
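A sketch of this ack-processing action in PlusCal (the message format <<region, ackedLSN>> and the Min helper are my own assumptions, not necessarily what the linked model uses):

```tla
Min(S) == CHOOSE x \in S : \A y \in S : x =< y

ReceiveAck:
  await chan[WriteRegion] # <<>>;
  with msg = Head(chan[WriteRegion]) do    \* msg = <<region, ackedLSN>>
    chan[WriteRegion] := Tail(chan[WriteRegion]);
    Progress[msg[1]] := msg[2];
  end with;
  \* CompletedLSN: the highest write acknowledged by ALL read regions
  CompletedLSN := Min({Progress[r] : r \in ReadRegions});
```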

The read region actions

The read regions only react to the replicated messages from the write region. The first action selection is to receive a message pending in the channel from the write region. The second action selection is to send back an Ack message for any replication message that is not acknowledged yet. The actions are almost the same as in the consistent prefix model, except for the line updating the CompletedLSN at the read region.
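In PlusCal, the two read-region actions look roughly like this sketch (variable names are assumptions based on the description; self is the read region's ID):

```tla
Receive:
  await chan[self] # <<>>;
  with msg = Head(chan[self]) do           \* msg = <<lsn, completedLSN>>
    chan[self] := Tail(chan[self]);
    CommittedLSN[self] := msg[1];
    CompletedLSN[self] := msg[2];          \* the only line added over the prefix model
  end with;

SendAck:
  await AckedLSN[self] < CommittedLSN[self];
  AckedLSN[self] := CommittedLSN[self];
  chan[WriteRegion] := Append(chan[WriteRegion], <<self, CommittedLSN[self]>>);
```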

Invariant checking and testing

The consistent prefix invariant still holds for this model as we refined that model to obtain this one.
CP == [][\A i \in ReadRegions: 
               CommittedLSN'[i] = CommittedLSN[i] 
            \/ CommittedLSN'[i] = CommittedLSN[i] + 1]_vars

The BoundedC invariant checks that the read regions always stay within the staleness bound of the most recent CommittedLSN. (Since I used the CommittedLSN variable for both read and write regions, the TLA+ translator assigned CommittedLSN_ "with underscore" to the write region's version to distinguish it from that of the read regions.)
BoundedC  == \A i \in ReadRegions : 
                      CommittedLSN[i] =< CommittedLSN_[1] 
                   /\ CommittedLSN[i] >= CommittedLSN_[1] - D

The SyncStep invariant checks the relationship between the CompletedLSN at the write region and the copies maintained at the read regions.
SyncStep  == \A i \in ReadRegions : 
                      CompletedLSN[i] =< CompletedLSN_[1]
                   /\ CompletedLSN[i] > CompletedLSN_[1] - D

I first wrote this predicate with "CompletedLSN[i] > CompletedLSN_[1] - 1", but the model checker was quick to tell me I was wrong. The difference is bounded by D, not 1, because the receive operations at the read regions can lag asynchronously within the D staleness bound. In the counterexample trace, the write region received Acks for two of its commits back to back, so the CompletedLSN at the write region was 2 versions ahead of those in the read regions.

MAD questions

1. Did we explore the design space thoroughly within this simple model?

No; it turns out there are still surprisingly interesting and useful tricks we can pull within this simple model.

As I mentioned in my review of the "Many Faces of Consistency" paper, there are the "operational definitions of consistency" exposed to the client, and there are the "state-based definitions of consistency" used by the developers. In the gap between the two you can play interesting tricks and efficiently expose the operational consistency the client cares about.

In our model we approached things from the state consistency perspective and made sure everything works safely. We can still add a lot of efficiency and functionality by slightly changing how we expose things to the client from an operational consistency perspective. Azure Cosmos DB performs many interesting tricks under the hood to implement many consistency models in concert. More on this later...

Wednesday, August 29, 2018

TLA+ specification of the consistent prefix guarantee

We covered consistency properties in the recent posts. Now, to keep things concrete and to give you a taste of TLA+, I present a model of a distributed data store that provides the consistent prefix property.

In my next post, I will extend this model slightly to present bounded and strong consistency properties. And, in the post after that, I will add a client model to show how this data store can be extended to provide session consistency. The TLA+ (well, PlusCal to be more accurate) code for this model is available here. 

The system model

We assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.
WriteRegion == 1
ReadRegions == 2..NumRegions
chan = [n \in 1..NumRegions |-> <<>>]; 

The write region actions

The write region has 3 actions to perform:

  • Commit the write in the write region
  • Multicast the write to the read regions
  • Receive/process an ack message from a read region

These three actions can be performed in any arbitrary order inside the process loop, and the model checker will ruthlessly try all possible combinations of these actions, interleaved with the read region actions, to check whether any of the supplied invariant properties is violated.

The first action selection is to commit a write in the write region. (We don't get into how that is implemented in the region; maybe commit is done by replicating the update to a write quorum in the region.) As a result the CommittedLSN is incremented. The CommittedLSN keeps track of the write operations. We do not model the content of a write; we model each write only by its CommittedLSN and try to get the writes replicated to the read regions in the same order.

The second action selection is to multicast a committed write to the read regions. To send the writes in order through the FIFO channels, the write region maintains a SentLSN variable to denote the last CommittedLSN write that was forwarded to the read regions. If there are CommittedLSN writes that are yet to be forwarded, these are multicast in order of LSN, without any gaps. Note, however, that this is asynchronous replication: the messages may be sent whenever this action is chosen, and the send is not blocked waiting for acknowledgements from the read regions.
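A sketch of this multicast action in PlusCal (hedged; SentLSN and chan follow the text above, but the exact message format is my assumption):

```tla
Multicast:
  if SentLSN < CommittedLSN then
    SentLSN := SentLSN + 1;                \* next LSN in order, no gaps
    chan := [n \in 1..NumRegions |->
               IF n \in ReadRegions
               THEN Append(chan[n], SentLSN)
               ELSE chan[n]];
  end if;
```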

The final action selection is to receive an Ack message from a read region. This model does not use the Ack messages for updating anything, but they can be used for tracking the progress of the read regions. This will become important for the bounded staleness and strong consistency properties.

The loop is bounded by a constant MAXN so that the model checking space stays limited; MAXN=5 is an acceptable number. If you have worked with a model checker, you know that checking 5 repetitions of the operation without breaking the invariant is a very good indication that the protocol works. Granted, this is not an induction proof, but since the model checker tries everything that could possibly go wrong, it would have found a counterexample (e.g., a race condition that violates an invariant) even with 3 repetitions or so.

The read region actions

The read regions only react to the replicated messages from the write region. So they have only two actions.

The first action selection is to receive a message pending in the channel from the write region. The CommittedLSN variable is then updated to match the LSN in the message. Here we don't check for gaps, but since the channels are FIFO and the messages are sent by the write region in increasing LSN order without gaps, the consistent prefix property holds ---as we later confirm when we model-check the invariant.
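This receive action can be sketched in PlusCal as follows (a hedged reconstruction; the actual model may differ):

```tla
Receive:
  await chan[self] # <<>>;
  \* FIFO channels + gap-free in-order sends imply this is CommittedLSN[self]+1
  CommittedLSN[self] := Head(chan[self]);
  chan[self] := Tail(chan[self]);
```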

The second action selection is to send back an Ack message for any replication message that is not acknowledged yet. Since this action is also chosen asynchronously, the acknowledgement can be cumulative: it can skip over LSNs and acknowledge the highest LSN seen, and that's OK.

Invariant checking and testing

Our invariant is short and sweet. It checks that at any read region, the CommittedLSN --if updated-- is always incremented by 1 over its previous value. That is, there are no gaps in the commit sequence.
CP == [][\A i \in ReadRegions: 
               CommittedLSN'[i] = CommittedLSN[i] 
            \/ CommittedLSN'[i] = CommittedLSN[i] + 1]_vars

An alternative would be to insert a PrefixConsistency boolean variable in the specification, which gets set to FALSE at the read region if the replica receives an out-of-order commit request. But that makes the model ugly; it is not nice to entangle the model and property to be checked.

Another alternative would be to write a client, and to check for operation consistency among the client reads. But that is cumbersome, because you need to introduce a client and a read operation at the read regions. Furthermore, that will also cause the model checker to take longer to complete.

What are some other properties we can test here?

In the consistent prefix guarantee the write region can commit asynchronously and freely without waiting for the read regions to catch up. The read regions catch up at their own pace.

Let's write some conditions, "fake invariants", to test that this is the case.
SyncStep  == \A i \in ReadRegions : 
                   CommittedLSN[i] > CommittedLSN_[1] - 3
SyncStep2 == \A i,j \in ReadRegions : 
                   CommittedLSN[i] # CommittedLSN[j] - 3

The model checker is quick to find counterexamples for these conditions. For the SyncStep the error trace provides a short counterexample where the write region raced ahead to commit 3 updates without broadcasting any updates to the read regions. (Since I used CommittedLSN variable for both read and write regions, the TLA translator assigned CommittedLSN_ "with underscore" to the write region's version to distinguish it from that of the read regions.)

For the SyncStep2, the error trace is longer because it needs some setup. Here the write region performed 3 writes and broadcast them to the read regions, but only read region 2 performed the updates and got to CommittedLSN=3, while read region 3 had not performed any receive action from its channel.

Things I tried to speed up the TLA model checking

I believe it is important to share the mistakes committed as well as the end result, so others can learn better.

My first model was large. My improvements involved trimming and simplifying that model.
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. --Antoine de Saint-Exupery.
To keep model checking feasible I used MAXN. But I found that the model checker would still be crunching scenarios after ten minutes. I knew that this is a simple protocol and it shouldn't take that long to model check. Then I noticed the problem: I was bounding the run on the wrong variable. I was bounding the CompletedLSN variable with MAXN, but the CommittedLSN could still race unbounded above MAXN. After I fixed my mistake, the model checker took only a couple of seconds for MAXN=5.

What is CompletedLSN? In my earlier version of the model, the write region used CompletedLSN to keep track of the progress of the read regions, but I realized this variable was unnecessary for checking the consistent prefix, so I took that out entirely.

I also performed other simplifications to reduce the model checking state space. Instead of sending messages to the regions one by one, I modified the write region to queue the messages at once via the multicast operation.

MAD questions

1. How do you relax this for eventual consistency?
For eventual consistency, we don't need to maintain CommittedLSN and a sequencing of writes. The write region can maintain a set of writes/updates, and multicast any write/update from that set at random. The read region just receives an update and performs it. There is also no need for acknowledgements from the read regions. The repeated multicasts eventually ensure that every write/update is replicated.
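A minimal PlusCal sketch of this eventual consistency variant (all names here are my own illustration, not from the linked model):

```tla
\* Write region: pick any update from the set and multicast it, repeatedly.
Gossip:
  with u \in Updates do
    chan := [n \in 1..NumRegions |->
               IF n \in ReadRegions THEN Append(chan[n], u) ELSE chan[n]];
  end with;

\* Read region: apply whatever arrives; no ordering, no acknowledgements.
Apply:
  await chan[self] # <<>>;
  Applied[self] := Applied[self] \union {Head(chan[self])};
  chan[self] := Tail(chan[self]);
```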

2. How would you extend this to the stronger consistency guarantees?
In my next post, I will extend this model to specify bounded staleness and strong consistency properties. As part of this extension, the write region would need to keep track of the updates performed at the read regions and slow down to make sure it is not going too far ahead of the read regions.

Since session consistency is specific to a client, we will write a client model and let the client share/copy LSN numbers and get served using that LSN number. The "LSN" will be our session state.

3. Was this too simple an example to be worth modeling?
As I wrote above, I had several blunders before I could simplify this model to what I presented. And I learned a lot from developing this simple model. So, this was definitely worth modeling and working on.

I had written about the benefits of modeling before here.

Friday, August 24, 2018

Mind Your State for Your State of Mind

This article by Pat Helland appeared at ACM Queue in July 2018. Pat is a self-confessed apostate. He is also a database philosopher; look at the titles of his recent publications: Standing on distributed shoulders of giants, Life beyond distributed transactions, Immutability changes everything, Heisenberg was on the "write" track, consistently eventual, etc.

This "mind your state for your state of mind" article looks at the history of interactions of applications and storage/databases, and charts their co-evolution as they move into the distributed and scalable world.

The evolution of state, storage, and computing

Storage has evolved from disks directly attached to your computer to shared appliances such as SANs (storage area networks) leading the way to storage clusters of commodity servers contained in a network.

Computing evolved from a single-process on a single server, to multiple processes communicating on a single server, to RPCs (remote procedure calls) across a tiny cluster of servers. In the 2000s, the concept of SOA (service oriented architecture) emerged to provide trust isolation so that the distrusted outsider cannot modify the data. As the industry started running services at huge scale, it learned that breaking a service into smaller microservices provides advantages through better modularity/decoupling and through stateless and restartable operation. Today microservices have emerged as the leading technique to support scalable applications.

Databases also evolved tremendously. Before database transactions, there were complexities in updating data even in a single computer, especially if failures happened. Database transactions dramatically simplified the life of application developers. But as solutions scaled beyond a single database, life got more challenging. First, we tried to make multiple databases look like one database. Then, we started hooking multiple applications together using SOA; each service had its own discrete database with its own transactions but used messaging to coordinate across boundaries. Key-value stores offered more scale but less declarative functionality for processing the application's data.  Multirecord transactions were lost as scale was gained. Finally, when we started using microservices, a microservice instance did not have its own data but reached directly to a distributed store shared across many separate services. This scaled better—if you got the implementation right.

Minding durable and session state in microservices

Durable state is stuff that gets remembered across requests and persists across failures. Durable state is not usually kept in microservices. Instead, it is kept in back-end databases and key-value stores. In the next section we look at some of these distributed stores and their example use cases.

Session state is the stuff that gets remembered across requests in a session but not across failures. Session state exists within the endpoints associated with the session, and is hard to maintain when the session is smeared across service instances: the next message to the service pool may land at a different service instance.

Without session state, you can't easily create transactions crossing requests. Typically, microservice environments support a transaction within a single request but not across multiple requests. Furthermore, if a microservice accesses a scalable key-value store as it processes a single request, the scalable key-value store will usually support only atomic updates to a single key. Programmers are on their own when changing values tied to multiple keys. I will discuss the implications of this in the MAD questions section at the end.

Different stores for different uses

As we mentioned as part of the evolution of databases, to cope with scalable environments, data had to be sharded into key-values. Most of these scalable key-value stores ensured linearizable, strongly consistent updates to their single keys. Unfortunately, these linearizable stores would occasionally cause delays seen by users. This led to the construction of nonlinearizable stores with the big advantage of excellent response times for reads and writes. In exchange, they sometimes give a reader an old value.

Different applications demand different behaviors from durable state. Do you want it *right* or do you want it *right now*? Applications usually want the latter and are tolerant of stale versions. We review some example application patterns below.

Workflow over key-value. This pattern demonstrates how applications perform workflow when the durable state is too large to fit in a single database. The workflow implemented by careful replacement will be a mess if you can't read the last value written. Hence, this usage pattern will stall and not be stale. This is the "must be right" even if it's not "right now" case.

Transactional blobs-by-ref. This application runs using transactions and a relational database and stores big blobs such as documents, photos, PDFs etc at a data store. To modify a blob, you always create a new blob to replace the old one. Storing immutable blobs in a nonlinearizable database does not have any problems with returning a stale version: since there's only one immutable version, there are no stale versions. Storing immutable data in a nonlinearizable store enjoys the best of both worlds: it's both right and right now.

E-Commerce shopping cart. In e-commerce, each shopping cart is for a separate customer. There's no need or desire for cross-cart consistency. Customers are very unhappy if their access to a shopping cart stalls. Shopping carts should be right now even if they're not right.

E-Commerce product catalog. Product catalogs for large e-commerce sites are processed offline and stuffed into large scalable caches. This is another example of the business needing an answer right now more than it needs the answer to be right.

Search. In search, it is OK to get stale answers, but the latency for the response must be short. There's no notion of linearizable reads nor of read-your-writes.

Each application pattern shows different characteristics and tradeoffs. As a developer, you should first consider your application's requirements carefully.
  • Is it OK to stall on reads?
  • Is it OK to stall on writes?
  • Is it OK to return stale versions?
You can't have everything!

If you don't carefully mind your state, it will bite back, and degrade your state of mind.

The Cosmos DB take

I am doing my sabbatical at Microsoft Cosmos DB, so I try to put things in context based on what I see/work on here. Here is how Cosmos DB fits in this picture and provides answers to these challenges.

We had talked about this in a previous post. "Clients should be able to choose their desired consistency. The system cannot possibly predict or determine the consistency that is required by a given application or client."

Most real-world application scenarios do not fall nicely into the two extreme consistency models typically offered by databases, i.e., strong and eventual consistency. Cosmos DB offers five well-defined and practical consistency models to accommodate real-life application needs, and customers have been overwhelmingly selecting the relaxed ones (e.g., Session, Bounded Staleness, and Consistent Prefix). Cosmos DB brings a tunable set of well-defined consistency models that users can set at the click of a button ---there is no need to deal with messy datacenter, database, and quorum configurations. Users can later override the selected consistency level on each individual request if they want to. Cosmos DB is also the first commercial database system to expose Probabilistic Bounded Staleness (PBS) as a metric for customers to determine "how eventual is eventual consistency".

I will have a series of posts on global distribution/replication in Cosmos DB soon. Below is just an appetizer :-)

Even for a deployment over multiple regions/continents, write operations for the session, consistent prefix, and eventual consistency levels are acknowledged by the write region without blocking for replies from other regions. A bounded staleness write also often gets acknowledged quickly by the write region without waiting for replication to the other regions: as long as the staleness bound has not been reached, the region can give the write a green light locally. Finally, in all these cases including strong consistency, with the multimaster general availability rollout, the write region is always the region closest to the client performing the write.

Read latency is guaranteed to be low, backed by 99.99% SLAs. Reads are always answered within the (nearest) region contacted by the client. While serving reads from the nearest region, the selected consistency level (a global property!) is guaranteed by clever use of logical sequence numbers to check that the provided data satisfies the selected consistency level. Within the region, a read is answered by one replica (without waiting for another replica or region) for the consistent prefix and eventual consistency levels. Reads may occasionally wait for a second read replica (potentially waiting for fresh data to arrive from another region) for session consistency. Finally, reads are answered by a read quorum of 2 out of 4 replicas for bounded staleness and strong consistency.

In other words, while the tradeoffs between predictable read latency, predictable write latency, and read-your-writes (session consistency) inherently exist, Cosmos DB handles them lazily/efficiently but surely by involving client-library (or gateway) collaboration. As we mentioned in a previous post, it is possible to achieve more efficiency with lazy state synchronization behind the curtain of the operational consistency exposed to the client.

MAD questions

1. What is the next trend in the co-evolution of computing and storage?

The pattern of software-based innovation has always been to virtualize a physical thing (say typewriters, libraries, publishing press, accountants), and then improve on it every year thanks to the availability of exponentially more computing/querying power. The cloud took this to another level with its virtually infinite computing and storage resources provided on demand.

Software is eating the world. By 2025, we will likely have a virtual personal assistant, virtual nanny, virtual personalized teacher, and a virtual personalized doctor accessible through our smartphones.

If you think about it, the trend of providing X as a service derives from and benefits from the trend of virtualizing X. Another term for virtualization is making it software-defined. We have seen software-defined storage, software-defined networks, software-defined radios, etc. (A couple of years ago I joked about when we would see software-defined software; now I joke about software-defined software-defined software.)

This trend also applies to and shapes the cloud computing architecture.

  • First virtual machines (VMs) came and virtualized and shared the hardware so multiple VMs can colocate on the same machine. This allowed consolidation of machines, prevented the server sprawl problem, and reduced costs as well as improving manageability. 
  • Then containers came and virtualized and shared the operating system, and avoided the overheads of VMs. They provided faster startup times for application servers. 
  • "Serverless" took virtualization a step further. It virtualizes and shares the runtime, and now the unit of deployment is a function. Applications are now defined as a set of functions (i.e., lambda handlers) with access to a common data store. 

A new trend is developing of utilizing serverless computing even for long-running analytics jobs. Here are some examples.

The gravity of virtualization pulls for disaggregation of services as well. The article talked about the trend of disaggregating services, e.g., computing from storage. I think this trend will continue, because it is fueled by the economies of scale that cloud computing leverages.

Finally, I had written a distributed systems perspective prediction of new trends for research earlier here.

2. But, you didn't talk about decentralization and blockchains?

The cloud, and modern datacenter computing for that matter, provides trust isolation. The premise of blockchains and decentralized systems is to provide trustless computation; they take trustlessness as an axiom. While it is clear that trust isolation is a feature organizations/people care about (due to security and fraud protection), it is not clear whether trustless computing is a feature organizations/people care about.

Finally, as I have written several times before, logical centralization (i.e., the cloud model) has a lot of advantages over decentralization in terms of efficiency, scalability, ease of coordination, and even fault-tolerance (via ease of coordination and availability of replication). Centralization benefits from the powerful hand of economies of scale. Decentralization is up against a very steep cliff.

Here is High Scalability blog's take on it as well.

3. How do we start to address the distributed coordination challenge of microservices?

Despite the fact that transactional solutions do not work well with microservices architectures, we often need to provide some of the transactional guarantees to operations that span multiple (sometimes dozens of) microservices. In particular, if one of the microservices in the request is not successful, we need to revert the state of the microservices that have already changed their states. As we discussed, it is hard to maintain session state with microservices. These corrective actions are typically written at the coordinator layer of the application in an ad hoc manner and are not enforced by a specialized protocol.

This is a real problem, compounded with the tendency of microservice instances to appear/disappear on demand.

One thing that helps is to use the distributed sagas pattern to instill some discipline on undoing the side effects of a failed operation that involves many microservices.

I had proposed that self-stabilization has a role to play here. Here is my report on this; section 4.1 is the most relevant part to this problem.

Last Friday I had attended @cmeik's end of internship talk at Microsoft Research. It was on building a prototype middleware that helps with the fault-tolerance of across-microservices operations.

Tuesday, August 21, 2018

Logical index organization in Cosmos DB

This post zooms into the logical indexing subsystem mentioned in my previous post on "Schema-Agnostic Indexing with Azure Cosmos DB". 

With the advent of big data, we face a big data-integration problem. It is very hard to enforce a schema (structure/type system) on data, and irregularities and entropy are a fact of life. You will be better off accepting this as a given, rather than pretending that you are very organized, that you can foresee all required fields in your application/database, and that every branch of your organization will be disciplined enough to use the same format to collect/store data.

A patch employed by relational databases is to add sparse new columns to accommodate future possibilities and to store a superset of the schemas. However, after you invoke an alter-table on a big data set, you realize this doesn't scale well, and you start searching for schema-agnostic solutions.

Achieving schema agnosticism

As we discussed in the previous post, JSON provides a solution for easier schema management. JSON's type system is simple and lightweight (in contrast to XML) and self-documenting. JSON supports a strict subset of the type systems of JavaScript and many modern programming languages. Today it is the lingua franca of the Internet and is natively supported in most languages.

Using JSON helps NoSQL datastores operate without a schema, both for data ingestion and for possible application changes in the future. But this doesn't automatically make them fully schema-agnostic: for querying the database, those solutions still require a schema. The user is typically asked to provide the indexing fields, and queries are performed over those indexes.

The Cosmos DB approach to achieving full schema agnosticism is by automatically indexing everything upon data ingest and allowing the users to query for anything without having to deal with schema or index management.

The question then becomes: what is a good indexing structure to solve the fully schema-agnostic querying problem?

Be the tree

Relational databases have been doing indexing for half a century, but indexing there is highly optimized for fixed-schema databases and has limitations. Often a B-tree index per column is employed. While this achieves very fast read and query performance, it becomes inadequate for high-volume writes on big data. Newly inserted data needs to be indexed for each column in the schema using B-trees, which causes write amplification problems. A newly inserted column or a change in schema would lead to updating all the leaves.

Instead of creating an index tree for each column, Cosmos DB employs one index for the whole database account, i.e., the Cosmos DB container (e.g., a table, a collection, or a graph). This one-index-tree-to-rule-them-all grows as new documents get added to the container. Since the schema variance is often not very wild, the number of shared paths over intermediate schema nodes remains small compared to the number of leaf nodes (instance values). This way the index tree achieves efficiency for updates upon new data/schema insertion and enables searching (range or point queries) for any arbitrary schema or value in the container.

I try to explain how this works in the rest of the post. First, I'd like to clarify that we restrict ourselves to the logical organization of the indexing and don't go down the stack to discuss the physical organization of the index structures. At the logical layer, we don't have to think about the various B-tree implementations in the physical layer: we just treat the index organization as a sorted-map structure (e.g., as in the sorted map in Java). At the physical organization layer, to squeeze out even more efficiency, Cosmos DB employs the Bw-tree data structure to implement this logical index on flash/SSDs. There are many other efficient implementations of B-trees for different storage devices and scenarios, based on write-ahead-log and log-structured-merge-tree ideas.
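To make the sorted-map treatment concrete, here is a minimal Python sketch; the terms and postings are made up for illustration, and `range_query` is my own helper, not a Cosmos DB API. Byte comparison on the sorted keys is what makes range lookups work.

```python
from bisect import bisect_left

# Hypothetical terms (encoded index-tree paths) mapped to postings (doc ids).
index = {
    b"headquarters/Belgium": [2],
    b"headquarters/France": [1, 3],
    b"headquarters/Germany": [4],
}

def range_query(index, lo, hi):
    """Return doc ids for all terms in [lo, hi), using byte compare on keys."""
    keys = sorted(index)                       # sorted map: keys in byte order
    i, j = bisect_left(keys, lo), bisect_left(keys, hi)
    docs = set()
    for k in keys[i:j]:
        docs.update(index[k])
    return sorted(docs)

# All headquarters lexicographically below "Germany":
print(range_query(index, b"headquarters/", b"headquarters/Germany"))  # [1, 2, 3]
```

The same sorted-map shape supports point lookups (exact key) and prefix scans with no extra machinery.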

I would like to thank Shireesh Thota at Cosmos DB for giving me a crash course on the logical indexing topic. Without his clear explanations, I would be grappling with these concepts for a long long time.

Logical indexing

In our previous post, we discussed how the tree representation of the JSON documents allows the database engine to treat the structure of the document as well as the instance values homogeneously.

We also introduced the index tree that is constructed out of the union of all of the trees representing the individual documents within the container. Each node of the index tree is an index entry containing the label and position values, called the term, and the ids of the documents containing the term, called the postings.

The logical index organization

For cost-effective persistence and lookup, the index tree needs to be converted into a storage-efficient representation. At the logical indexing layer, Cosmos DB maps the paths in the index tree to key-value tuples. The value consists of the postings list of the encoded document (or document-fragment) ids. The key consists of the term representing the encoded path information of the node/path in the index tree, concatenated with a postings entry selector (PES) that helps partition the postings horizontally.

This way, the terms are mapped to the corresponding doc ids (i.e., postings) containing them. The resulting sorted map enables the query processing to identify the documents that match the query predicates very quickly. On this sorted map, byte comparison is employed to enable range queries. There is also a reverse path representation to enable efficient point queries. As we'll see below, the logical indexing has a direct impact on what kind of queries the database can support.

We discuss how Cosmos DB represents the terms and the postings efficiently in the next two sections.

Representing terms efficiently

Cosmos DB uses a combination of a partial forward path representation to enable range querying support, and a partial reverse path representation to enable equality/hash support.

The terms for forward paths are byte encoded to be able to enable range queries such as SELECT * FROM root r WHERE r.Country < "Germany". Yes, you read that right, you can compare at the string level, because strings are byte-encoded to allow that.

The terms for the reverse paths are hash encoded for efficient point querying such as SELECT * FROM root r WHERE r.location[0].country = "France".

Finally, the path representations also allow wildcard queries such as SELECT c FROM c JOIN w IN c.location WHERE w = "France". This is achieved by bunching the forward and backward paths into segments of 3, such as location/0/city and 0/city/"Paris", rather than using the full path $/location/0/city/"Paris". This is like the n-gram idea that search engines use. It also reduces the storage cost of the index.
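The segment bunching can be sketched in a few lines of Python; the helper name is mine, but the input and output paths follow the example above:

```python
def three_segment_paths(full_path):
    """Break a full index path into overlapping 3-segment chunks,
    n-gram style, instead of storing the full path."""
    segs = full_path.split("/")[1:]          # drop the $ root
    return ["/".join(segs[i:i + 3]) for i in range(len(segs) - 2)]

print(three_segment_paths('$/location/0/city/"Paris"'))
# ['location/0/city', '0/city/"Paris"']
```

Because every document contributes only these short overlapping chunks, a wildcard query only needs to match one 3-segment term rather than a full root-to-leaf path.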

Partial forward path encoding scheme. To enable efficient range and spatial querying, the partial forward path encoding is done differently for numeric and non-numeric labels. For non-numeric values, each of the 3-segment paths is encoded based on all of its characters. The least significant byte of the resultant hash is assigned to the first and second segments. For the last segment, lexicographical order is preserved by storing the full string or a smaller prefix, based on the precision specified for the path.
For a numeric segment appearing as the first or second segment, a special hash function is applied to optimize for non-leaf numeric values. This hash function exploits the fact that most non-leaf numeric values (e.g., enumerations, array indices) are frequently concentrated between 0-100 and rarely contain negative or large values. A numeric segment occurring in the third position is treated specially: the most significant n bytes (where n is the numeric precision specified for the path) of the 8-byte hash are used, to preserve order.
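Here is a hedged sketch of the non-numeric case only; the hash function and precision are stand-ins, not the actual Cosmos DB scheme. The point is that the first two segments each contribute a single hash byte, while the last segment keeps an order-preserving prefix, so plain byte comparison on terms reflects the order of the leaf values.

```python
import hashlib

def forward_term(segments, precision=4):
    """Illustrative partial forward path encoding for non-numeric segments."""
    assert len(segments) == 3
    term = b""
    for seg in segments[:2]:
        term += hashlib.sha1(seg.encode()).digest()[-1:]  # least significant byte
    return term + segments[2].encode()[:precision]        # order-preserving prefix

# Terms that share the first two segments compare by leaf value:
print(forward_term(["0", "city", "Berlin"]) < forward_term(["0", "city", "Paris"]))  # True
```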

Partial reverse path encoding scheme. To enable point querying, the term is generated in reverse order, with the leaf (which gets the higher number of bits in the term) placed first. This scheme also serves wildcard queries, like finding any node that contains the value "Paris", since the leaf node is the first segment.

Representing posting lists efficiently

The postings list captures the document ids of all the documents that contain the given term. The postings list is bitmap compressed for efficient querying/retrieval as well. In order to represent a postings list dynamically (i.e., without a fixed-size/static scheme or pre-reserved space), compactly, and in a manner amenable to fast set operations (e.g., to test for document presence during query processing), Cosmos DB uses the two techniques below.

Partitioning a postings list. Each insertion of a new document into a container is assigned a monotonically increasing document id. The postings list for a given term consists of a variable-length list of postings entries, partitioned by a postings entry selector (PES). A PES is a variable-length (up to 7 bytes) offset into the postings entry. The number of PES bytes is a function of the number of documents in a container. The number of postings entries --for a given size of a PES-- is a function of document frequency for the document id range which falls within the PES range. Document ids within 0-16K will use the first postings entry, document ids from 16K-4M will use the next 256 postings entries, and so on. For instance, a container with 2M documents will not use more than 1 byte of PES and will only ever use up to 128 postings entries within a postings list.
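The arithmetic behind these ranges can be sketched as follows; this is an assumed layout that keeps only the 16K-per-entry granularity quoted above, not the full PES encoding:

```python
ENTRY_SPAN = 16 * 1024   # each postings entry covers a 16K range of document ids

def postings_entry_index(doc_id):
    """Which postings entry a (monotonically assigned) document id falls into."""
    return doc_id // ENTRY_SPAN

def posting_word(doc_id):
    """Within its entry, a document is a 14-bit offset (16384 values)."""
    return doc_id % ENTRY_SPAN

# A container with 2M documents uses at most 128 postings entries:
print(postings_entry_index(2 * 1024 * 1024 - 1) + 1)  # 128
```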

Dynamic encoding of postings entries. Within a single partition (pointed to by a PES), each document needs only 14 bits, which can be captured with a short word. However, Cosmos DB optimizes this further. Depending on the distribution, postings words within a postings entry are encoded dynamically using a set of encoding schemes, including (but not restricted to) various bitmap encoding schemes inspired primarily by WAH (Word-Aligned Hybrid). The core idea is to preserve the best encoding for dense distributions (like WAH) while also working efficiently for sparse distributions (unlike WAH).
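The density-adaptive idea can be illustrated with a toy encoder; this is my simplification, not WAH or the actual Cosmos DB scheme. A sparse entry stores explicit 14-bit offsets, while a dense entry switches to a plain bitmap once the bitmap becomes the smaller representation.

```python
def encode_entry(offsets):
    """Toy density-adaptive encoding for one postings entry (offsets 0..16383)."""
    if len(offsets) * 14 > 16 * 1024:        # the 16K-bit bitmap is now smaller
        bits = bytearray(2 * 1024)           # 2KB bitmap covers 16384 offsets
        for o in offsets:
            bits[o // 8] |= 1 << (o % 8)
        return ("bitmap", bytes(bits))
    return ("list", sorted(offsets))

kind, _ = encode_entry({5, 9, 4000})
print(kind)  # list: 3 offsets are cheaper as an explicit list
```

Real WAH-style schemes go further by run-length encoding uniform words, so they never pay the full 2KB even for dense entries.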

Customizing the index

The default indexing policy automatically indexes all properties of all documents. Developers can choose certain documents to be excluded from or included in the index at the time of inserting or replacing them in the container. Developers can also choose to include or exclude certain paths (including wildcard patterns) from indexing across documents.

Cosmos DB also supports configuring the consistency of indexing on a container.

Consistent indexing is the default policy. Here the queries on a given container follow the same consistency level as specified for the point-reads (i.e. strong, bounded-staleness, session or eventual). The index is updated synchronously as part of the document update (i.e. insert, replace, update, and delete of a document in a container). Consistent indexing supports consistent queries at the cost of possible reduction in write throughput. This reduction is a function of the unique paths that need to be indexed and the consistency level. The consistent indexing mode is designed for "write quickly, query immediately" workloads.

To allow maximum document-ingestion throughput, a container can be configured with lazy indexing, meaning queries are eventually consistent. The index is updated asynchronously when a given replica of a container's partition is quiescent. For "ingest now, query later" workloads requiring unhindered document ingestion, the lazy indexing mode is more suitable.

MAD questions

1. Is this too specialized information?
I am a distributed systems/algorithms person. Logical indexing is a specialized database topic. Does understanding this help me become a better distributed systems researcher?

I would argue yes.  First of all, developing expertise in multiple branches, being a Pi-shaped academician, provides advantages. Aside from that, learning new things stretches your brain and makes it easier to learn other things.

2. How is filtering done within a document?
Cosmos DB also represents documents as binary encodings for efficient storage and querying. When a query returns documents that match the query predicates, instead of filtering records inside the document, Cosmos DB uses the binary-encoding features and performs byte comparisons to skim within the document and jump/skip over irrelevant parts quickly. A lot of deduplication is also employed in these encodings. In the coming weeks, I may delve into the physical organization of the index and documents, but I need to track down another expert to help me with that.

For topics that are far outside my expertise, it is very helpful to get a first introduction from an expert. Learning from Shireesh was very fun. An expert makes even the most complicated topics look easy and understandable. This leads to an interesting epiphany, if you haven't had it already: when you don't understand a topic, often the problem is that it is not presented competently. The corollary is that if you are unable to explain something simply and in an accessible way, you haven't mastered it yet.

Friday, August 17, 2018

Replicated Data Consistency Explained Through Baseball

I had mentioned this report in my first overview post about my sabbatical at Cosmos DB. This is a 2011 technical report by Doug Terry, who was working at Microsoft Research at the time. The paper explains the consistency models through publishing of baseball scores via multiple channels. (Don't worry, you don't have to know baseball rules or scoring to follow the example; the example works as well for publishing of, say, basketball scores.)

Many consistency models have been proposed for distributed and replicated systems. The reason for exploring different consistency models is that there are fundamental tradeoffs between consistency, performance, and availability.

While seeing the latest written value for an object is desirable and reasonable to provide within a datacenter, offering strong consistency for georeplicated services across continents results in lower performance and reduced availability for reads, writes, or both.

On the opposite end of the spectrum, with eventual consistency, the data returned by a read operation is the value of the object at some point in the past, yet the performance and availability dimensions are greatly improved.

The paper argues that eventual consistency is insufficient for most of the use cases, but strong consistency is not needed either; most use-cases benefit from some intermediate consistency guarantees.

This concurs with Cosmos DB production workloads. As I mentioned in my first overview post, about 73% of Cosmos DB tenants use session consistency and 20% prefer bounded staleness. This limits strong-consistency and eventual-consistency to the fringes. After I introduce the model and consistency guarantees the paper considers, I will talk about how these map to the consistency levels used in Cosmos DB.

Read consistency guarantees

The paper considers 6 consistency guarantees. These are all examples of the read-write centric operation consistency we discussed in the "Many Faces of Consistency" paper.

The paper considers a simple abstract model for the data store. In this model, clients perform write operations against a master node in the data store. Writes are serialized and eventually performed/replicated, in the same order as at the master, at all the servers (a.k.a. replicas). Clients perform reads from the servers/replicas. Reads return the values of data objects that were previously written, though not necessarily the latest values. This is because the entire set of writes at the master may not yet be reflected, in order and in entirety, at the servers.

The 6 consistency guarantees below are defined by which set of previous writes are visible to a read operation.

Strong consistency ensures that a read operation returns the value that was last written for a given object. In other words, a read observes the effects of all previously completed writes: if write operations can modify or extend portions of a data object, such as appending data to a log, then the read returns the result of applying all writes to that object. (Note that in order to achieve strong consistency in the presence of crashes, the write operation at the master should go in lockstep with a quorum of replicas to permanently record writes, requiring synchronous replication!)

Eventual consistency is the weakest of the guarantees, so it allows the greatest set of possible return values. Such a read can return results from a replica that has received an arbitrary subset of the writes to the data object being read.

By requesting a consistent prefix, a reader is guaranteed to observe an ordered sequence of writes starting with the first write to a data object. In other words, the reader sees a version of the data store that existed at the master at some time in the past.

Bounded staleness ensures that read results are not too stale. That is, the read operation is guaranteed to see at least all writes that completed d time (or number of updates) before the read started. The read may potentially see some more recently written values.

Monotonic Reads (sometimes also called a session guarantee) is a property that applies to a sequence of read operations performed by a given client. It states that if the client issues a read operation and later issues another read to the same object(s), the second read will return the same value(s) or the results of later writes.

Read My Writes is also a session property specific to a client. It guarantees that the effects of all writes that were performed by the client are visible to the client's subsequent reads. If a client writes a new value for a data object and then reads this object, the read will return the value that was last written by the client (or some other value that was later written by a different client).

None of these last four read guarantees is stronger than the others, so applications may want to combine several of these guarantees. For example, a client could request monotonic reads and read my writes so that it observes a data store that is consistent with its own actions.

The table shows the performance and availability typically associated with each consistency guarantee. Strong consistency is desirable from a consistency viewpoint but offers the worst performance and availability since it generally requires reading from a majority of replicas. Eventual consistency, on the other hand, allows clients to read from any replica, but offers the weakest consistency. The table illustrates the tradeoffs involved as each guarantee offers a unique combination of consistency, performance, and availability.

Cosmos DB consistency levels

Cosmos DB allows developers to choose among 5 well-defined consistency models along the consistency spectrum. While the definitions of strong, eventual, and consistent prefix are the same as those discussed in the report, Cosmos DB strengthens the definitions of bounded staleness and session consistency, making them more useful for clients.

More specifically, Cosmos DB's bounded staleness is strengthened to offer total global order except within the "staleness window". In addition, monotonic read guarantees exist within a region both inside and outside the "staleness window".

Cosmos DB's session consistency (again scoped to a client session) is strengthened to include consistent prefix, monotonic writes, read my writes, and write-follows-reads in addition to the monotonic read property. As such, it is ideal for all scenarios where a device or user session is involved. (You can check the clickable consistency map by Kyle Kingsbury to read about the definitions of monotonic writes and write-follows-reads, which impose order on the writes.)

It is possible to sign up for a free trial (with no credit card and commitment) to test these guarantees on Cosmos DB. Once you create a Cosmos DB resource, you can also see an animation of the consistency guarantees after selecting the default consistency option from the tab. I am pasting screenshots for strong consistency and bounded staleness animations below.

While animations are nice for intuitively understanding consistency levels, inside Cosmos DB the TLA+ specification language is used to specify these models precisely and model check them together with the global distribution protocols. In the coming weeks, as promised, I will try to sanitize and publish, from a client-side perspective, the TLA+ specifications for these consistency levels.

To put their money where their mouth is, Cosmos DB offers comprehensive 99.99% SLAs which guarantee throughput, consistency, availability, and latency for Cosmos DB database accounts scoped to a single Azure region configured with any of the five consistency levels, or database accounts spanning multiple regions, configured with any of the four relaxed consistency levels. Furthermore, independent of the choice of a consistency level, Cosmos DB offers a 99.999% SLA for read and write availability for database accounts spanning two or more regions. I will dedicate a separate blog post on SLAs in the coming weeks, as I start learning more about them.

Baseball analogy

This toy example assumes that the score of the game is recorded in the key-value store in two objects, one for the number of runs scored by the "visitors" and one for the "home" team's runs. When a team scores a run, a read operation is performed on its current score, the returned value is incremented by one, and the new value is written back to the key-value store.

This sequence of writes is from a hypothetical baseball game with the inning-by-inning line score, and the game is currently in the middle of the seventh inning, and the home team is winning 2-5.

Different read guarantees may result in clients reading different scores for this game that is in progress. The table below lists the complete set of scores that could be returned by reading the visitors and home scores with each of the 6 consistency guarantees. The visitors' score is listed first, and different possible return values are separated by commas.

A strong consistency read can only return one result, the current score. On the other hand, an eventual consistency read can return one of 18 possible scores, many of which were never the actual score. The consistent prefix property limits the result to scores that actually existed at some time. The results that can be returned by a bounded staleness read depend on the desired bound.
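The count of 18 for eventual consistency is easy to reproduce: with the visitors having written the totals 1 and 2, and the home team the totals 1 through 5, an eventual read may independently return the initial 0 or any previously written total for each object. A short Python check (the write lists are my reading of the 2-5 line score):

```python
visitor_writes = [1, 2]           # visitors' run totals written so far
home_writes = [1, 2, 3, 4, 5]     # home team's run totals written so far

def eventual_reads(v_writes, h_writes):
    """Eventual consistency: each object independently returns its initial
    value (0) or any value written to it so far."""
    return {(v, h) for v in [0] + v_writes for h in [0] + h_writes}

scores = eventual_reads(visitor_writes, home_writes)
print(len(scores))        # 18 possible (visitors, home) scores
print((2, 5) in scores)   # True: the current score is among them
```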

The paper considers 6 hypothetical participants querying the baseball database for scores: the scorekeeper, umpire, radio reporter, sportswriter, statistician, and the stat-watcher. The table lists the consistencies that these participants use. Of course, each participant would be okay with strong consistency, but by relaxing the consistency requested for her reads, she will likely observe better performance and availability. Additionally, the storage system may be able to better balance the read workload across servers, since it has more flexibility in selecting servers to answer weak consistency read requests.

The toy example is meant to illustrate that the desired consistency depends as much on who is reading the data as on the type of data. All of the 6 presented consistency guarantees are useful, because each guarantee appears at least once in the participant needs table. That is, different clients may want different consistencies even when accessing the same data.


The report (here is the talk by Doug Terry on the report if you like to get a more immersive/extensive discussion of the topic) concludes as follows:
Clients should be able to choose their desired consistency. The system cannot possibly predict or determine the consistency that is required by a given application or client. The preferred consistency often depends on how the data is being used. Moreover, knowledge of who writes data or when data was last written can sometimes allow clients to perform a relaxed consistency read, and obtain the associated benefits, while reading up-to-date data. This could be of practical significance since the inherent trade-offs between consistency, performance, and availability are tangible and may become more pronounced with the proliferation of georeplicated services. This suggests that cloud storage systems should at least consider offering a larger choice of read consistencies.

As I discussed above, Cosmos DB fits the bill. It provides 5 consistency level choices in an all-in-one package. It is easy to configure the consistency level on the fly, and the effects take place quickly. Cosmos DB also allows you the flexibility to override the default consistency level you configured for your account on a specific read request. It turns out only about 2% of Cosmos DB tenants override consistency levels on a per-request basis.

MAD questions

1. What are the effects of granularity of writes on consistency?
The paper said: "We assume that the score of the game is recorded in a key-value store in two objects, one for the number of runs scored by the visitors and one for the home team's runs."

Why keep two separate objects, "home" and "visitor", though? Instead, what if we used just one object called "score" that consists of a tuple <visitors, home>? Then the reads would be more consistent by design; even an eventual consistency read would not return a score that was not an actual score at some point in the game.
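A quick sketch of the difference (the tuple write sequence below is a hypothetical game, not the paper's actual line score): with two objects, an eventual read can pair any visitor total with any home total; with one tuple object, only values that were atomically written --real scores-- can ever be returned.

```python
# Two separate objects: reads may combine values that never coexisted.
visitor_totals = [0, 1, 2]
home_totals = [0, 1, 2, 3, 4, 5]
mixed = {(v, h) for v in visitor_totals for h in home_totals}

# One tuple object "score": a hypothetical sequence of atomic writes.
score_writes = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 3), (2, 4), (2, 5)]

print(len(mixed))         # 18 readable combinations, most never a real score
print(len(score_writes))  # 7: only scores that actually occurred can be read
```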

Sometimes you don't get to design the data storage/access schemes, but when you get to decide this, you can act smartly and improve consistency. This reminds me of techniques used in self-stabilizing systems for compacting the state space to forbid bad states by construction.

2. What are the theoretical limits on the tradeoffs among consistency levels?
As we have seen, each proposed consistency model occupies some point in the complex space of tradeoffs. The CAP theorem shows a coarse tradeoff between consistency and availability. The PACELC model tries to capture further tradeoffs between consistency, availability, and latency.

More progress has been made since then in exploring the tradeoff space from the theoretical perspective. The causal consistency result showed that natural causal consistency, a strengthening of causal consistency that respects the real-time ordering of operations, provides a tight bound on the consistency semantics that can be enforced without compromising availability and convergence. There has been a plethora of papers recently on causal consistency improvements in georeplicated datastores, and I hope to summarize the prominent ones in the coming weeks.

3. Is baseball scoring also nonlinear?
Baseball is still foreign to me, even though I have been in the US for 20 years now. While I try to be open-minded and eager to try new things, I can be stubborn about not learning some things. Here is another example where I was unreasonably close-minded: though I had seen this paper earlier, I didn't read it then, because I thought I would have to learn the baseball rules to understand it. Funny? No! I wonder what other opportunities I miss by being peculiarly close-minded about certain things.

Well, to win a battle against my obstinate and peculiar ignorance of baseball, I just watched this 5-minute explanation of the baseball rules. It turns out it is not that complicated. Things never turn out as scary/complicated/bad as I make them out to be in my mind.

I had written about how bowling scoring is nonlinear and consequences of that. Does baseball scoring also have a nonlinear return? It looks like you can get a lot of runs scored in a single inning. So maybe that counts as nonlinearity as it can change the game score quickly, at any inning.

Tuesday, August 14, 2018

Schema-Agnostic Indexing with Azure Cosmos DB

This VLDB'15 paper is authored by people from the Cosmos DB team and Microsoft Research. The paper has "DocumentDB" in the title because DocumentDB was the project that evolved into Azure Cosmos DB. The project started in 2010 to address the developer pain points inside Microsoft for supporting large-scale applications. In 2015 the first generation of the technology was made available to Azure developers as Azure DocumentDB. It has since added many new features and introduced significant new capabilities, resulting in Azure Cosmos DB!

This paper describes the schema-agnostic indexing subsystem of Cosmos DB. By being fully schema-agnostic, Cosmos DB provides many benefits to developers. Keeping the database schema and indexes in sync with an application's schema is especially painful for globally distributed apps. But with a schema-agnostic database, you can iterate your app quickly without worrying about schemas or indexes. Cosmos DB automatically indexes all the data --no schema, no indexes required-- and serves queries very fast. Since no schema and index management is required, you don't have to worry about application downtime while migrating schemas either.

To achieve these, Cosmos DB natively supports JSON (JavaScript Object Notation) data model and JavaScript language directly within its database engine. This approach enables the following:

  • The query language (rooted in JavaScript's type system, expression evaluation, and function invocation model) supports rich relational and hierarchical queries and is exposed to developers as a SQL dialect. 
  • By default, the database engine automatically indexes all documents without requiring schema or secondary indexes from developers.
  • The database supports transactional execution of application logic provided via stored procedures and triggers, written entirely in JavaScript. 

Of course the challenge is to do all of this while (1) providing predictable performance guarantees and (2) remaining cost effective. For (1), the challenges are in updating the index efficiently and synchronously against high write rates, and in hitting the SLA performance guarantees as well as the configured consistency levels. For (2), the challenges are in resource governance for providing multitenancy under extremely frugal budgets: the index updates must be performed within the strict budget of system resources (CPU, memory, storage), and load balancing and relocation of replicas should be performed such that the worst-case on-disk storage overhead of the index is bounded and predictable. Please see my overview post for a high-level overview of Cosmos DB. I will be diving into the subsystems in the coming days.

The resource model

A tenant starts by provisioning a database account using an Azure subscription (you can try it for free). The database account manages one or more databases, each of which manages a set of resources: users, permissions, and containers. In Cosmos DB, a container can be projected as a "collection, table, graph" for supporting the "SQL, Table, Gremlin" APIs. In addition, a container also manages stored procedures, triggers, user defined functions (UDFs), and attachments.

Each resource is uniquely identified by a stable logical URI and is represented as a JSON document. Developers can interact with resources via HTTP (and over a stateless TCP protocol) using the standard HTTP verbs for CRUD (create, read, update, delete), queries, and stored procedures.

All the data within an Azure Cosmos DB container (e.g. collection, table, graph) is horizontally partitioned and Cosmos DB transparently manages those resource partitions behind the scene. Tenants can elastically scale a resource by simply creating new resources which get partitioned (or distributed) across resource partitions automatically.

A resource partition is made highly available by a replica set across Azure regions. I talk about this global distribution next.

Global distribution

The service is deployed worldwide across multiple Azure regions (50+ of them). Cosmos DB is classified as a foundational service in Azure --which means it is automatically available in every Azure region, including public, sovereign, and government clouds. Cosmos DB runs on clusters of machines, each with dedicated local SSDs; other storage options for petabyte-level data are also configurable.

Each stamp has its own fabric controller. A stamp consists of racks, each of which is built out as a separate fault domain with redundant networking and power. I will summarize the Microsoft fabric paper later.

Each machine hosts replicas corresponding to various resource partitions, which are placed and load balanced across machines in the federation. Each replica hosts an instance of the database engine, which manages the resources (e.g., documents) as well as the associated index. The database engine contains a replicated state machine for coordination, the JavaScript language runtime, the query processor, and the storage and indexing subsystems responsible for transactional storage and indexing of documents.

While I won't be able to detail it in this post, Cosmos DB's unique design of resource governance underlies the core of the system: on any single machine there could be hundreds of tenants co-existing, each properly isolated (without any noisy neighbor symptoms) and resource-governed. In addition, Cosmos DB provides a consistent programming model to provision throughput across a heterogeneous set of database operations for each of the tenants at each region at any point in time while maintaining the SLAs.

Schema-agnostic indexing

The schema of a document describes the structure and the type system of the document. In contrast to XML's, JSON's type system is simple: it is a strict subset of the type systems of JavaScript as well as many other programming languages.

The Cosmos DB database engine operates directly at the level of JSON grammar, remaining agnostic to the concept of a document schema and blurring the boundary between the structure and instance values of documents. This, in turn, enables it to automatically index documents without requiring schema or secondary indices as we see below.

Documents as trees

By representing documents as trees, Cosmos DB normalizes both the structure and the instance values across documents into the unifying concept of a dynamically encoded path structure. In this representation, each label in a JSON document (including both the property names and their values) becomes a node of the tree. In other words, the values become first-class citizens, at the same level as the schema labels. Note that the leaves contain the actual values and the intermediate nodes contain the schema information.

The figure above shows the trees created for Document 1 and Document 2. The labels are inferred from the document. A (pseudo) root node is created to parent the rest of the (actual) nodes corresponding to the labels in the document underneath it. The nested data structures drive the hierarchy in the tree. Intermediate artificial nodes labeled with numeric values (e.g., 0, 1, ...) are employed to represent enumerations and array indices.
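To make this concrete, here is a quick sketch (my own illustrative code, not Cosmos DB's; the sample document is made up) that flattens a JSON document into the root-to-leaf paths of its tree, with array indices as artificial numeric nodes and the instance values as leaves:

```python
def paths(node, prefix=("$",)):
    """Yield every root-to-leaf path of a JSON value as a tuple of labels."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from paths(child, prefix + (label,))
    elif isinstance(node, list):
        # artificial numeric nodes represent array indices
        for i, child in enumerate(node):
            yield from paths(child, prefix + (i,))
    else:
        # the leaf holds the instance value itself
        yield prefix + (node,)

doc = {"headquarters": "Belgium",
       "exports": [{"city": "Moscow"}, {"city": "Athens"}]}
for p in paths(doc):
    print("/".join(map(str, p)))
# $/headquarters/Belgium
# $/exports/0/city/Moscow
# $/exports/1/city/Athens
```

Note that the property names and the values land in the same path structure: "Belgium" is a node just like "headquarters" is.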


With automatic indexing, every path in a document tree is indexed (unless the developer configures it to exclude certain path patterns). Therefore, it is important to use a normalized path representation to ensure that the cost of indexing and querying a document with a deeply nested structure, say 10 levels, is the same as that of a flat one consisting of key-value pairs of 1 level depth.

There are two possible mappings of document and the paths: a forward index mapping, which keeps a map of (document id, path) tuples, as we saw above, and an inverted index mapping, which keeps a map of (path, document id) tuples.

The inverted index provides a very efficient representation for querying. The index tree is a document which is constructed out of the union of all of the trees representing individual documents within the collection. The index tree grows over time as new documents get added or updated to the collection. In contrast to relational database indexing, there is no need for restarting indexing from scratch as new fields are introduced.

Each node of the index tree is an index entry containing the label and position values, called the term, and the ids of the documents, called the postings. The postings in the curly brackets (e.g., {1,2}) in the inverted index figure correspond to the documents (e.g., Document1 and Document2) containing the given label value. An important implication of treating both the schema labels and instance values uniformly is that everything is packed inside one big index. An instance value (still in the leaves) is not repeated: it can appear in different roles across documents, with different schema labels, but it is the same value.
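The term/postings idea can be sketched as follows (illustrative code under my own simplifications, not Cosmos DB's actual index layout: here a term is a whole flattened path rather than a partial path segment):

```python
from collections import defaultdict

def flatten(node, prefix=("$",)):
    """Yield root-to-leaf paths; schema labels and values are both segments."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from flatten(child, prefix + (label,))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from flatten(child, prefix + (i,))
    else:
        yield prefix + (node,)

def build_inverted_index(docs):
    index = defaultdict(set)  # term (path) -> postings (set of doc ids)
    for doc_id, doc in docs.items():
        for term in flatten(doc):
            index[term].add(doc_id)
    return index

docs = {1: {"country": "Germany", "city": "Berlin"},
        2: {"country": "Germany", "city": "Bonn"}}
index = build_inverted_index(docs)
print(index[("$", "country", "Germany")])  # {1, 2}
```

Adding a new document with a brand new field just adds new terms to the index; nothing gets rebuilt from scratch, which matches the incremental growth property described above.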

The inverted index looks similar to the indexing structures used by search engines in the information retrieval domain. In a sense, Cosmos DB lets you search your database for anything added to it, regardless of its schema structure.

For the normalized path representation, the service uses a combination of a partial forward path representation (consisting of 3 segments) for paths where range support is needed, and a partial reverse path representation for paths needing equality (hash) support.


Developers can query the collections using queries written in SQL and JavaScript. Both SQL and JavaScript queries get translated to an internal intermediate query language (Query IL). The Query IL supports projections, filters, aggregates, sort, flatten operators, expressions (arithmetic, logical, and various data transformations), and user defined functions (UDFs).

The query model attempts to strike a balance between functionality, efficiency, and simplicity. The database engine natively compiles and executes the SQL query statements. You can query a collection using the REST APIs or any of the client SDKs. The .NET SDK comes with a LINQ provider. Here is a query playground where you can experiment.

The figures below illustrate point querying and range querying. The inverted index allows the query to identify the documents that match the query predicate quickly. Another important implication of treating both the schema and instance values uniformly in terms of paths is that the inverted index is also a tree. Thus, the index and the results can be serialized to a valid JSON document and returned as documents themselves, as they are returned in the tree representation. This enables recursing over these results with additional querying.
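Here is a toy illustration of the two query shapes (hand-built index, made-up paths and doc ids; real terms are partial paths, and ranges are served by the forward-path representation mentioned earlier): a point query is a single postings lookup, while a range query scans terms sharing a path prefix and filters their leaf values.

```python
# index: term (tuple path) -> postings (set of doc ids)
index = {
    ("$", "price", 5): {1},
    ("$", "price", 15): {2},
    ("$", "price", 25): {3},
    ("$", "city", "Berlin"): {1, 3},
}

def point_query(path_tuple):
    # equality (hash) lookup: one postings fetch
    return index.get(path_tuple, set())

def range_query(prefix, lo, hi):
    # scan terms sharing the forward-path prefix, keep leaves in [lo, hi)
    hits = set()
    for term, postings in index.items():
        if term[:-1] == prefix and isinstance(term[-1], (int, float)) \
                and lo <= term[-1] < hi:
            hits |= postings
    return hits

print(point_query(("$", "city", "Berlin")))   # {1, 3}
print(range_query(("$", "price"), 10, 30))    # {2, 3}
```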

For the second query, a range query, GermanTax is a user defined function executed as part of query processing.

Scratching the surface

Here is the summary of the stuff so far. The database engine is designed to be schema-agnostic by representing documents as trees. This enables supporting automatic indexing of documents, serving consistent queries in the face of sustained write volumes under an extremely frugal resource budget in a multitenant environment. (This 2015 talk by Dharma provides a nice summary of these concepts. This page provides code examples for these concepts.)

In this post, I only scratched the surface. I haven't yet written about the technical meat of the paper: the logical index organization and the physical index organization sections. These sections introduce a novel logical index layout and a latch-free log-structured storage with blind incremental updates in order to meet the strict requirements of performance (SLA-bounded) and cost effectiveness (in a multitenant environment). Bw-trees, a latch-free variant of B-trees introduced by Microsoft Research, are employed as part of the physical index organization.

In order to write about those sections more confidently, I will track down the experts here on the logical and physical index organization and will try to get a crash course on these.

MAD questions

1. How do NoSQL and NewSQL indexing methods compare/contrast with those for relational databases?
I realized that I am completely ignorant on this topic. But, it also appears that there aren't good explanations on this when I search for it. I will try to find more about this.

2. How do these design decisions fit with the historical trend/direction of database systems?
I found this recent article by Pat Helland very interesting: "Mind Your State for Your State of Mind". So I will plan to summarize it and try to analyze how today's design decisions fit with the historical trends/directions.
"Applications have had an interesting evolution as they have moved into the distributed and scalable world. Similarly, storage and its cousin databases have changed side by side with applications. Many times, the semantics, performance, and failure models of storage and applications do a subtle dance as they change in support of changing business requirements and environmental challenges. Adding scale to the mix has really stirred things up. This article looks at some of these issues and their impact on systems."

3. Can there be a ranking component to the search/querying that enables ranking with respect to user-defined priority? That can be useful for long running or realtime streaming query.

Thursday, August 9, 2018

The many faces of consistency

This is a lovely paper (2016) from Marcos Aguilera and Doug Terry. It is an easy read. It manages to be technical and enlightening without involving analysis, derivation, or formulas.

The paper talks about consistency. In distributed/networked systems (which includes pretty much any practical computer system today), data sharing and replication brings up a fundamental question: What should happen if a client modifies some data items and simultaneously, or within a short time, another client reads or modifies the same items, possibly at a different replica?

The answer depends on the context/application. Sometimes eventual consistency is what is desired (e.g., DNS), and sometimes you need strong consistency (e.g., reservations, accounting).

The paper points out two different types of consistency, state consistency and operation consistency, and focuses mainly on comparing/contrasting these two approaches.

When I see the terms state versus operation consistency, without reading the rest of the paper, my gut reaction is to conclude that this is a matter of abstraction. If you are implementing the system, you work at the state consistency level. If you are exposing this to the clients, since they do not need to know the internals of the system and would only care about the operations they invoke, then you work at the operational consistency level. The state consistency can help support and provide operational consistency to the clients.

My gut reaction turns out to be pretty accurate, yet the paper goes into more details and elaborates on some subtleties. I am glad to read about the two approaches in more detail. I think, going forward, it would be important to relate/bridge/reconcile these two approaches.

1. State consistency

State consistency pertains to the state of the system (which comprises the current values of the data items). These are properties that the system should satisfy despite concurrent access and the existence of multiple replicas.

A major subcategory of state consistency is one defined by an invariant ---a predicate on the state that must evaluate to true. For example, in a concurrent program, a singly linked list must not contain cycles. As another example, in a primary-backup system, the mutual consistency invariant requires that replicas have the same state when there are no outstanding updates.

It is possible to weaken/loosen invariants to include error bounds, probabilistic guarantees, and eventual satisfaction.

2. Operation consistency

State consistency is limited to the properties on state, but in many cases clients care little about the system state and more about the results that they obtain from the system. This calls for a different form of consistency, operation consistency, which pertains to the behavior that clients observe from interacting with the system.

Operation consistency has subcategories based on the different ways to define the consistency property.

2.1 Sequential equivalence

This subcategory defines the permitted operation results of a concurrent execution in terms of the permitted operation results in a sequential execution. Some examples are as follows.

Linearizability is a strong form of consistency. Each operation must appear to occur at an instantaneous point between its start time (when the client submits it) and finish time (when the client receives the response), and execution at these instantaneous points form a valid sequential execution. More precisely, there must exist a legal total order T of all operations with their results, such that (1) T is consistent with the partial order <, meaning that if op1 finishes before op2 starts then op1 appears before op2 in T, and (2) T defines a correct sequential execution.
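The definition above can be turned into a brute-force checker for the simplest case, a single read/write register (my own illustrative sketch, not from the paper; it searches for a legal total order T over a handful of timed operations):

```python
from itertools import permutations

def linearizable(ops):
    """ops: list of (kind, value, start, end); kind is 'w' or 'r'."""
    for order in permutations(ops):
        # (1) T must respect real time: if a finishes before b starts,
        # a must appear before b in T
        if any(order[q][3] < order[p][2]
               for p in range(len(order)) for q in range(p + 1, len(order))):
            continue
        # (2) T must be a correct sequential execution of a register:
        # every read returns the most recently written value
        val, legal = None, True
        for kind, v, _, _ in order:
            if kind == "w":
                val = v
            elif v != val:
                legal = False
                break
        if legal:
            return True
    return False

# a write of 1 overlapping a read that returns 1: a legal T exists
assert linearizable([("w", 1, 0, 2), ("r", 1, 1, 3)])
# a read of the old value after the write completed: no legal T
assert not linearizable([("w", 1, 0, 2), ("r", None, 3, 4)])
```

Weakening step (1) to order only operations of the same client gives a checker for sequential consistency instead.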

Sequential consistency is weaker than linearizability. It requires that operations execute as if they were totally ordered in a way that respects the order in which *each client* issues operations. More precisely, the partial order < is this time defined as: op1 < op2 iff both operations are executed by *the same client* and op1 finishes before op2 starts. In other terms, while linearizability imposes across-clients system-level ordering, sequential consistency is content with imposing a client-centric ordering.

The next examples pertain to systems that support transactions. Intuitively, a transaction is a bundle of one or more operations that must be executed as a whole. The system provides an isolation property, which ensures that transactions do not significantly interfere with one another. There are many isolation properties: serializability, strong session serializability, order-preserving serializability, snapshot isolation, read committed, repeatable reads, etc. All of these are forms of operation consistency, and several of them (e.g., serializability, strong session serializability, and order-preserving serializability aka strict-serializability) are of the sequential equivalence subcategory.

2.2 Reference equivalence

Reference equivalence is a generalization of sequential equivalence. Examples of this occur often in database systems.

Snapshot isolation requires that transactions behave identically to the following reference implementation. When a transaction starts, it gets assigned a monotonic start timestamp. When the transaction reads data, it reads from a snapshot of the system as of the start timestamp. When a transaction T1 wishes to commit, the system obtains a monotonic commit timestamp and verifies whether there is some other transaction T2 such that (1) T2 updates some item that T1 also updates, and (2) T2 has committed with a commit timestamp between T1’s start and commit timestamp. If so, then T1 is aborted; otherwise, T1 is committed and all its updates are applied instantaneously as of the time of T1’s commit timestamp.
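The commit-time validation of this reference implementation can be sketched as follows (an illustrative first-committer-wins check of my own, not database code; reads-from-snapshot is omitted):

```python
class SnapshotIsolationChecker:
    def __init__(self):
        self.clock = 0
        self.committed = []  # list of (commit_ts, write_set)

    def begin(self):
        self.clock += 1
        return self.clock  # monotonic start timestamp

    def try_commit(self, start_ts, write_set):
        self.clock += 1
        commit_ts = self.clock
        # abort if some T2 updated an overlapping item and committed
        # between our start and commit timestamps
        for ts, ws in self.committed:
            if start_ts < ts < commit_ts and ws & write_set:
                return None  # abort
        self.committed.append((commit_ts, write_set))
        return commit_ts

si = SnapshotIsolationChecker()
t1 = si.begin(); t2 = si.begin()          # concurrent transactions
assert si.try_commit(t1, {"x"}) is not None   # T1 commits
assert si.try_commit(t2, {"x"}) is None       # T2 aborts: write-write conflict
```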

Another example is one-copy serializability where the replicated system must behave like a single node (no replicas!) and provides serializability.

2.3 Read-write centric

The read-write centric subcategory applies to systems with two very specific operations: read and write. Consistency is defined with respect to the set of writes that could have potentially affected the read.

Read-my-writes requires that a read by a client sees at least all writes previously executed by the same client, in the order in which they were executed. This is relevant when clients expect to observe their own writes, but can tolerate delays before observing the writes of others.

Bounded staleness bounds the time (or number of updates) it takes for writes to be seen by reads: A read must see at least all writes that complete d time (or number of updates) before the read started.
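Measured in number of updates, this property is a simple predicate over an execution log. A minimal sketch (my own convention: with bound D, a read that starts after N writes have committed must return at least write N - D + 1, so D = 1 coincides with strong consistency):

```python
def satisfies_bounded_staleness(reads, D):
    """reads: list of (writes_committed_when_read_started, version_returned)."""
    return all(version >= committed - D + 1
               for committed, version in reads)

# with D = 2, returning the next-to-last write is still acceptable
assert satisfies_bounded_staleness([(10, 9), (10, 10)], D=2)
# with D = 1 (strong consistency), only the latest write is acceptable
assert not satisfies_bounded_staleness([(10, 9)], D=1)
```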

3. State versus operation consistency

Operation consistency is, well..., operational! I was inculcated, as part of my PhD training in distributed systems, to avoid operational reasoning as it fails to work for concurrent execution. This is also what I teach my students within the first week of my distributed systems class. Operational reasoning does not scale since there are too many corner cases to check. For reasoning about distributed systems, we use invariant-based reasoning, which lends itself better to state consistency.

However, the truth is multidimensional. While unsuitable for reasoning/verification, operational consistency has its advantages.

On which type of consistency to use, the paper suggests the following:
"First, think about the negation of consistency: what are the inconsistencies that must be avoided? If the answer is most easily described by an undesirable state (e.g., two replicas diverge), then use state consistency. If the answer is most easily described by an incorrect result to an operation (e.g., a read returns stale data), then use operation consistency. 
A second important consideration is application dependency. Many operation consistency and some state consistency properties are application independent (e.g., serializability, linearizability, mutual consistency, eventual consistency). We recommend trying to use such properties, before defining an application-specific one, because the mechanisms to enforce them are well understood. If the system requires an application specific property, and state and operation consistency are both natural choices, then we recommend using state consistency due to its simplicity."

Notice that while the operation consistency section listed more than a dozen consistency levels, the state consistency section just named a couple of invariant types, including a vaguely named mutual consistency invariant. This is because state consistency is implementation-specific, a whitebox approach. It is more suitable for the distributed systems designers/builders than for the users/clients. By restricting the domain to specific operations (such as read and write), operation consistency is able to treat the system as a blackbox and provides a reusable abstraction to the users/clients.

4. What is really behind the curtain?

Here is another advantageous usecase for operation consistency approach. It provides you an abstraction (i.e., a curtain, a veil) that you can leverage in your implementation. Behind this curtain, you can pull tricks. The paper gives this simple example.
"An interesting example is a storage system with three servers replicated using majority quorums, where (1) to write data, the system attaches a monotonic timestamp and stores the data at two (a majority of) servers, and (2) to read, the system fetches the data from two servers; if the servers return the same data, the system returns the data to the client; otherwise, the system picks the data with the highest timestamp, stores that data and its timestamp in another server (to ensure that two servers have the data), and returns the data to the client. This system violates mutual consistency, because when there are no outstanding operations, one of the servers deviates from the other two. However, this inconsistency is not observable in the results returned by reads, since a read filters out the inconsistent server by querying a majority. In fact, this storage system satisfies linearizability, one of the strongest forms of operation consistency."

This is sort of a lazy evaluation idea. You don't always need to expend work/energy to keep the database consistent, you tidy up the database only when it is queried, only when it needs to perform.

The "TAPIR: Building Consistent Transactions with Inconsistent Replication (SOSP'15)" paper also builds on this premise. Irene puts it this way:
"Today, systems that want to provide strong guarantees use Paxos (or, if you are hipper than me, RAFT), and everyone else uses something cheaper. Paxos enforces a strict serial ordering of operations across replicas, which is useful, but requires coordination across replicas on every operation, which is expensive. 
What we found in the TAPIR project is that Paxos is too strong for some strong system guarantees and, as a result, is wasting work and performance for those systems. For example, a lock server wants mutual exclusion, but Paxos provides a strict serial ordering of lock operations. This means that a lock server built using Paxos for replication is coordinating across replicas even when it is not necessary to ensure mutual exclusion. 
Even more interesting, a transactional storage system wants strictly serializable transactions, which requires a linearizable ordering of transactions but only requires a partial ordering of operations (because not all transactions touch all keys). With some careful design in TAPIR, we are able to enforce a linearizable ordering of transactions with no ordering of operations."

5. What role do these concepts play in Cosmos DB?

Cosmos DB provides 5 well defined operation consistency properties to the clients: strong, bounded, session, consistent prefix, and eventual consistency. These consistency models were chosen because they are practical and useful as signaled by their actual usage by the customers for their production workloads.

And, of course, in order not to leave performance on the table, instead of imposing strong state-based consistency, Cosmos DB global distribution protocols employ several optimizations behind the curtain while still managing to provide the requested operation consistency levels to the clients.

Thus, state consistency still plays a major role in Cosmos DB, from the core layer developers/designers perspective. The core layer backend team uses TLA+ to identify and check the weakest state-consistency invariants that support and imply the desired operation consistency levels. These weak invariants are efficient and do not require costly state synchronization.

As I mentioned in my previous post, I will post about TLA+/PlusCal translation of consistency levels provided by Cosmos DB here. I also hope to talk about some of the state-consistency invariants and efficiency techniques employed when I start describing the global distribution at the Cosmos DB core layer in my upcoming blog posts.

MAD questions

1. Is it possible to bridge the two approaches?

These two approaches can be complementary like two sides of the same coin.

I think it is easier to go from state-consistency of the system to inferring and designing operation consistency properties.

How about the other way? While checking operation consistency requires analyzing an execution log, by restricting the domain to specific operations (such as read and write), it is possible to achieve application independence and treat the system as a blackbox. The Jepsen library achieves that for distributed systems testing. If the system is a blackbox, or was developed without invariant/state consistency, this operation-consistency-based testing approach can help in identifying problems. But it is still unclear how to infer/design state consistency properties/invariants that can help in fixing the problem.

The development of better tools for observability/auditability approaches (such as Retroscope) can help in bridging this gap.

Another effort to help bridge the gap could be to identify/name and create an ontology of invariants used in state consistency approaches. I don't know much work in that direction except this one from VLDB15.

2. Do these two approaches converge at the eventual consistency position?
The paper states the following. "Operational eventual consistency is a variant of eventual consistency (a form of state consistency) defined using operation consistency. The requirement is that each write be eventually seen by all reads, and if clients stop executing writes then eventually every read returns the same latest value."

Eventual consistency is likely a natural convergence point for state and operation consistency. This reminds me of the "Conflict-free replicated data types" paper.

3. Are terms/definitions about consistency consistent yet?
Consistency is an important concept so it emerged and developed in different domains (distributed systems, databases, and computer architecture) simultaneously. And of course different domains used different terminology and confusion arose.

I am not surprised. Even in the distributed systems community, on the restricted topic of distributed consensus, researchers have been using inconsistent terminology for many decades before consensus (pun intended) arose and the terminology converged and standardized.

Consistency definitions can be overwhelming because there are many parameters involved, even without considering transactions.

Kyle Kingsbury recently provided a simplified clickable map of major consistency models (including transactional consistency models).
