Understanding Inconsistency in Azure Cosmos DB with TLA+

This paper, by Finn Hackett, Joshua Rowe, Markus Kuppe, appeared in International Conference on Software Engineering 2023. It presents a specification of Azure Cosmos DB consistency behavior as exposed to the clients. 

During my sabbatical at CosmosDB in 2018, I was involved in a specification of CosmosDB as exposed to the clients.  The nice thing about these specs is that they didn't need to model internal implementation but just captured the consistency semantics for clients precisely, rather than ambiguously like English explanations do. They aimed to answer the question of what kind of behavior should a client be able to witness while interacting with the service? The feedback at that time was that customers found this very useful.

This 2023 paper improves on our preliminary specs from 2018. It has this great opening paragraph, which echoes the experience of everyone that has painstakingly specified a distributed/concurrent system behavior.

Consistency guarantees for distributed databases are notoriously hard to understand. Not only can distributed systems inherently behave in unexpected and counter-intuitive ways due to internal concurrency and failures, but they can also lull their users into a false sense of functional correctness: most of the time, users of a distributed database will witness a much simpler and more consistent set of behaviors than what is actually possible. Only timeouts, fail-overs, or other rare events will expose the true set of behaviors a user might witness. Testing for these scenarios is difficult at best: reproducing them reliably requires controlling complex concurrency factors, latency variations, and network behaviors. Even just producing usable documentation for developers is fundamentally challenging and explaining these subtle consistency issues via documentation comes as an additional burden to distributed system developers and technical writers alike.

I will focus on the TLA+ models developed by this work, as they are very instructive for intermediate/advanced TLA+ users. Also, the specifications provided in the paper generalizes well, so it is a good place to start with when modeling datastores. 


This spec models the multi-region Cosmos DB as a single log, which regulate the replication and failure behavior on this abstract log. I first took issue with this single log modeling. I thought this would be too simple, and would not be able to capture the behavior of a multi-regional deployment, with logs in each region. But when I challenged myself to find a client-exposed behavior of the multiple log system that would not be captured by the single log spec here, I was at a loss. Such is the power of nondeterminism. The single log, when instantiated nondeterministically is able to capture all those behaviors. For example, if the client in one region reads with eventual consistency, it is possible that the read value will later disappear if it wasn't part of the globally-committed log. Well, the single log spec allows truncation of non-committed suffix on the log, and captures this behavior.

CosmosDB.tla leverages the IncreaseReadIndexAndOrCommitIndex and TruncateLog actions to model this log. The commitIndex is the point in the log where all previous entries are replicated at a global majority of replicas, and are therefore durable due to consensus. The readIndex trails the commitIndex in the log potentially arbitrarily, and indicates the value is available at every region. Occasionally the spec will increase the readIndex and commitIndex of the log. TruncateLog can delete the suffix of the log upto reaching the commitIndex.

Note that the write action and read action are not exercised from this spec; those come in the `CosmosDBProps.tla` and `CosmosDBLinearizability.tla` specs that instantiate this spec, and use it to check consistency.


This spec provides a brief linearizability test for the CosmosDB spec. In the first part of the file, it defines LinSpec which instantiates CosmosDB.tla spec and invokes CosmosDB!Init and CosmosDB!Next actions to exercise it. LinNext also introduces the write action, finally, to be exercised on CosmosDB spec.

The second part of the file imposes DictSpec as a property for LinSpec to satisfy. DictSpec lists all linearizable behavior of a simple dictionary writing values to keys and exposing results. By giving DictSpec in the Properties section of CosmosLinearizability.cfg, we are claiming that LinSpec refines DictSpec.

The interesting thing here is the dictView. We defined it almost like a materialized view in LinSpec figure. And here in this figure, we seed that DictSpec maintains another dictView expression. DictView is not even a variable, but it gets updated through the DictSpec. If the two dictViews contradict, this is flagged as property violation by the TLA model checker, and the trace is given as a counterexample.


This spec exercises write and read actions on CosmosDB.tla which it extends, and checks strong-consistency (aka linearizability) and causal-consistency properties on the MDB.tla spec.  

The first part of the spec extends the spec in the CosmosDB.tla with write action. The second part is about invariants and properties this spec satisfies for strong-consistency, bounded-staleness, session-consistency, and consistent-prefix/eventual consistency. Session-consistency read/writes is accomplished through session-tokens that are written to writeHistory variable in this spec. The properties checked are pretty general.  


Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

Always Measure One Level Deeper

Dude, where's my Emacs?

There is plenty of room at the bottom

Know Yourself