TLA+ specification of the bounded staleness and strong consistency guarantees

In my previous post, I had presented a TLA+ modeling of distributed data store that provides the consistent prefix property. In this post, I extend this model slightly to build bounded and strong consistency. In fact the strong consistency specification is achieved when we take the Delta on the bounded consistency as 1.

The TLA+ (well, PlusCal to be more accurate) code for this model is available at

The system model

As in the previous post, we assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.
WriteRegion == 1
ReadRegions == 2..NumRegions
chan = [n \in 1..NumRegions |-> <<>>]; 

We use D to denote the Delta on the bounded staleness consistency. Bounded staleness ensures that read results are not too stale. That is, the read operation is guaranteed to see at least all writes that precedes D  number of updates before the read started. The read may potentially see some more recently written values.

Strong consistency ensures that a read operation returns the value that was last written for a given object. This is achieved by using D=1.

The write region actions

The write region has 3 actions to perform: Commit the write in the write region, Multicast the write to the read regions, and Receive/process an ack message from a read region.

These actions can be performed in any arbitrary order inside the process loop, and the model checker will methodically explore all possible combinations of these actions interleaved with the read region actions to expose any flaws in the protocol.

The first action selection is to commit a write in the write region. (We don't get into how that is implemented in the region; maybe commit is done by replicating the update to a write quorum in the region.) As a result the CommittedLSN is incremented. Note that, in contrast to prefix consistency model, in the bounded staleness model the write is throttled to not advance more than D ahead of any of the read regions.

The second action selection is to multicast a committed write to the read regions through the FIFO channels. and this is an asynchronous replication. These may be sent whenever this action is chosen, and is not blocked on waiting any acknowledgements from the read regions.

This action is exactly the same as in the prefix consistency model. The SentLSN variable denotes the last CommittedLSN write that is forwarded to the read regions and is used for ensuring ordered replication.

The final action selection is to receive an Ack message from the channel from a read region. The Progress variable keeps track of the Ack messages, and the CompletedLSN is updated to reflect the highest write that is acknowledged by all the read regions. Harking back to action 1, notice that the write region lazily disseminates this CompletedLSN information with the read regions by piggybacking this to the commit-write messages. In this model the read-regions do not utilize this CompletedLSN information, but as I start to explore in the MAD questions, this can be useful.

The read region actions

The read regions only react to the replicated messages from the write region. The first action selection is to receive a message pending in the channel from the write region. The second action selection is to send back an Ack message for any replication message that is not acknowledged yet. The actions are almost the same except for the line updating the CompletedLSN at the read region.

Invariant checking and testing

The consistent prefix invariant still holds for this model as we refined that model to obtain this one.
CP == [][\A i \in ReadRegions: 
               CommittedLSN'[i] = CommittedLSN[i] 
            \/ CommittedLSN'[i] = CommittedLSN[i] + 1]_vars

The BoundedC invariant is to check that the read regions are always maintained to be within the staleness bound of the most recent CommittedLSN.  (Since I used CommittedLSN variable for both read and write regions, the TLA translator assigned CommittedLSN_ "with underscore" to the write region's version to distinguish it from that of the read regions.)
BoundedC  == \A i \in ReadRegions : 
                      CommittedLSN[i]=< CommittedLSN_[1] 
                   /\ CommittedLSN[i]>= CommittedLSN_[1] -D

The SyncStep invariant is to check the relationship between the CompletedLSN at the write region and the copies maintained at the read regions.
SyncStep  == \A i \in ReadRegions : 
                      CompletedLSN[i] =< CompletedLSN_[1]
                   \/ CompletedLSN[i] > CompletedLSN_[1] -D

I first wrote this predicate with "CompletedLSN[i] > CompletedLSN_[1] -1" but the model checker was quick to tell me I was wrong. This is bounded by D and "not 1" as receive operations at the read regions can be asynchronous within the D staleness bound. Here the write region received Acks for its two commits back to back so the CompletedLSN at the write region was 2 versions ahead of those in the read regions.

MAD questions

1. Did we explore the design space thoroughly within this simple model?

No, it turns out, there is still surprisingly interesting and useful tricks we can pull within this simple model.

As I mentioned in the review of the "Many Faces of Consistency" paper, there is the "operational definitions of consistency" exposed to the client and there is the "state based definitions consistency" used by the developers, and there is a gap between the two where you can play interesting tricks and expose the client operational consistency it cares about in an efficient way.

In our model we approached things from the state consistency perspective and made sure everything works safely. We can still add a lot of efficiency and functionality by slightly changing how we expose things to the client from an operational consistency perspective. Azure Cosmos DB performs many interesting tricks under the hood to implement many consistency models in concert. More on this later...


Anonymous said…

Thank you for posting about TLA+, you're one of several people who contributed to educate me on this powerful tool, which I now use in my day job. :)

A question: in your spec, the multicast macro atomically appends a message to all chans: all processes receive messages in the same order (that isn't the case in a less sophisticated network). To me this would mean that you (implicitly?) rely on some atomic broadcast system (like Virtual Synchrony). Is that on purpose? Does it matter?
Anonymous said…
Replying to myself: there is only one process doing multicast so it doesn't matter. :)

PS: I didn't see a definition for send?

Popular posts from this blog

I have seen things

SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

PigPaxos: Devouring the communication bottlenecks in distributed consensus

Frugal computing

Learning about distributed systems: where to start?

Fine-Grained Replicated State Machines for a Cluster Storage System

My Distributed Systems Seminar's reading list for Spring 2020

Cross-chain Deals and Adversarial Commerce

Book review. Tiny Habits (2020)

Zoom Distributed Systems Reading Group