TLA+ specification of the consistent prefix guarantee

We had covered consistency properties in the recent posts. Now to keep it more concrete and to give you a taste of TLA+, I present a model of a distributed data store that provides the consistent prefix property.

In my next post, I will extend this model slightly to present bounded and strong consistency properties. And, in the post after that, I will add a client model to show how this data store can be extended to provide session consistency. The TLA+ (well, PlusCal to be more accurate) code for this model is available here. 

The system model

We assume there is a write region (with ID=1) and read regions (with IDs 2 through NumRegions). The write region performs the write and copies it to the read regions. There are FIFO communication channels between the write and read regions.
WriteRegion == 1
ReadRegions == 2..NumRegions
chan = [n \in 1..NumRegions |-> <<>>]; 

The write region actions


The write region has 3 actions to perform:

  • Commit the write in the write region
  • Multicast the write to the read regions
  • Receive/process an ack message from a read region

These three actions can be performed in any arbitrary order inside the process loop, and the model checker will ruthlessly try all possible combinations of these actions interleaved with the read region actions to check if there is a violation of the invariant properties entered.

The first action selection is to commit a write in the write region. (We don't get into how that is implemented in the region; maybe commit is done by replicating the update to a write quorum in the region.) As a result the CommittedLSN is incremented. CommittedLSN keeps track of the write operation. We are not modeling the content of the write, we will model the write only with its CommittedLSN and try to get the writes replicated in the read regions in the same order.

The second action selection is to multicast a committed write to the read regions. To send the writes in order through the FIFO channels, the write region maintains a SentLSN variable to denote the last CommittedLSN write that is forwarded to the read regions. If there are some CommittedLSN writes that are yet to be forwarded,  these are multicasted in order of the LSN numbers without any gap. However, note that, this is an asynchronous replication. The messages may be sent whenever this action is chosen, and this send action is not blocked on waiting any acknowledgements from the read regions.

The final action selection is to receive an Ack message from a read region. This model does not use the Ack messages for updating anything, but they can be used for tracking the progress of the read regions. This will become important for the bounded consistency and strong consistency properties.

The loop is bounded by a constant MAXN so that the model checking space is limited. MAXN=5 is an acceptable number. If you have worked with a model checker, you know that checking 5 repeat of the operation without the invariant broken is a very good indication it works. Yes, this is not induction proof, but since the model checker tries everything that could possibly go wrong, it would have found a counterexample (e.g., race-condition that violates an invariant) even with 3 repeats or so.

The read region actions


The read regions only react to the replicated messages from the write region. So they have only two actions.

The first action selection is to receive a message pending in the channel from the write region. CommittedLSN variable is then updated to match the LSN in the message. Here we don't check for the gaps, but since the channels are FIFO and the messages are sent by the write region in the increasing LSN order without gaps, the consistent prefix property holds ---as we later confirm when we model-check with the invariant.

The second action selection is to send back an Ack message for any replication message that is not acknowledged yet. Since this action is also asynchronously chosen, the acknowledgement can be cumulative, it can skip over LSNs and acknowledge the highest seen, and that's OK.

Invariant checking and testing

Our invariant is short and sweet. It checks that at any read region, the CommittedLSN --if updated-- is always incremented by 1 over its previous value. That is, there are no gaps in the commit sequence.
CP == [][\A i \in ReadRegions: 
               CommittedLSN'[i] = CommittedLSN[i] 
            \/ CommittedLSN'[i] = CommittedLSN[i] + 1]_vars

An alternative would be to insert a PrefixConsistency boolean variable in the specification, which gets set to FALSE at the read region if the replica receives an out-of-order commit request. But that makes the model ugly; it is not nice to entangle the model and property to be checked.

Another alternative would be to write a client, and to check for operation consistency among the client reads. But that is cumbersome, because you need to introduce a client and a read operation at the read regions. Furthermore, that will also cause the model checker to take longer time to complete.

What are some other properties we can test here?

In the consistent prefix guarantee the write region can commit asynchronously and freely without waiting for the read regions to catchup. The read regions can catchup on their own pace.

Let's write some conditions, "fake invariants", to test that this is the case.
SyncStep  == \A i \in ReadRegions  : 
                   CommittedLSN[i]> CommittedLSN_[1] -3
SyncStep2 == \A i,j \in ReadRegions: 
                   CommittedLSN[i]# CommittedLSN[j]  -3

The model checker is quick to find counterexamples for these conditions. For the SyncStep the error trace provides a short counterexample where the write region raced ahead to commit 3 updates without broadcasting any updates to the read regions. (Since I used CommittedLSN variable for both read and write regions, the TLA translator assigned CommittedLSN_ "with underscore" to the write region's version to distinguish it from that of the read regions.)


For the SyncStep2, the error trace is longer because it needs some setup. Here the write region performed 3 writes and broadcasted these to the read regions, but only the read region 2 performed the updates and get to CommittedLSN=3, while the read region 3 has not performed any receive action from its channel.


Things I tried to speed up the TLA model checking

I believe it is important to share the mistakes committed as well as the end result, so others can learn better.

My first model was large. My improvements involved trimming and simplifying that model.
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. --Antoine de Saint-Exupery.
To keep model checking feasible I used MAXN. But I found that the model checker would still be crunching scenarios after ten minutes. I knew that this is a simple protocol and it shouldn't take so long to model check. Then I noticed that the problem is with bounding the run on the wrong thing: I was bounding the CompletedLSN variable with MAXN, but the CommittedLSN was still able to race unbounded above the MAXN. After I noticed my mistake the model checker took only a couple seconds for MAXN=5.

What is CompletedLSN? In my earlier version of the model, the write region used CompletedLSN to keep track of the progress of the read regions, but I realized this variable was unnecessary for checking the consistent prefix, so I took that out entirely.

I also performed other simplifications to reduce the model checking state space. Instead of sending messages to the regions one by one, I modified the write region to queue the messages at once via the multicast operation.


MAD questions

1. How do you relax this for eventual consistency?
For eventual consistency, we don't need to maintain CommittedLSN and a sequencing of writes. The write region can maintain a set for writes/updates, and multicast any write/update from that set at random. The read region just receives an update and performs it. There is also no need for acknowledgement from the read region. The repeated multicasts will eventually establish that every write/update is replicated.

2. How would you extend this to the stronger consistency guarantees?
In my next post, I will extend this model to specify bounded staleness and strong consistency properties. As part of this extension, the write region would need to keep track of the updates performed at the read regions and slow down to make sure it is not going too far ahead of the read regions.

Since session consistency is specific to a client, we will write a client model and let the client share/copy LSN numbers and get served using that LSN number. The "LSN" will be our session state.

3. Was this too simple an example to be worth modeling?
As I wrote above, I had several blunders before I could simplify this model to what I presented. And I learned a lot from developing this simple model. So, this was definitely worth modeling and working on.

I had written about the benefits of modeling before here.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book