TLA+ modeling of a single replicaset transaction modeling

For some time I had been playing with transaction modeling and most recently with replicaset modeling by way of a single log. While playing with these, I realized I can build something cool on top of these building blocks. I just finished building snapshot isolated transaction modeling that sit on top of a replicaset log. This is also a high level approximation of MongoDB-style snapshot isolated transactions on a single replicaset. I put this up on github at https://github.com/muratdem/MDBTLA/tree/main/SingleShardTxn

Next, I will work on extending this to modeling MongoDB multishard transactions. That will be a whole lot of more work, but I think I have a good foundation to take on the challenge.  


Modeling process

My starting point was these components.

The entire modeling process was like climbing. After I had some small cool feature, I was able to see the next cool feature to add leveraging this. If you like to follow everything in this model, you should be checking the above building blocks first. But by just reading this post, you can still appreciate what is being done here.


The first basic model

First I adopted/simplified the CosmosDB spec to model MDB single replicaset log behavior. I adopted this with minimal modifications, and instantiated the MDB module from my main spec.  

Then I modified the KVSnap transaction process to work on this replicaset log. This is cool because previously the KVsnap did not sit on a replicaset; it was using a single atomic global kv-store variable, called store. Now, the transactions sit on a database log: they commit their updates to the log, and read their snapshots off of that log. Think of transactions as threads running on the primary node of the replicaset.

The model checking quickly showed me that all updates from a transaction has to be committed all-or-nothing/atomically to the replicaset log. (This involves updating commitIndex to include all updates from this transaction at once.) Failing that, another transaction that happen to start can see fractured writes and take its snapshot there, leading to snapshot isolation violation. Here is this commit.

But this model was limited because here the transaction readset and writeset were given in advance. Each transaction was doing all its reads first atomically, then do all its writes atomically. In other words, this modeled one-shot transactions rather than interactive transactions. This was not able to show a read-after-a-write that reads what this transaction has last written locally for that key in its snapshot copy.


The interactive transactions model

To model interactive transactions, I added a router to feed in the read and write statements gradually to the transactions. The router adds statements to each transaction. I hard-coded each transaction to have STMTS=3 statements, so that if the transaction is not aborted after processing the third statement, it commits itself to the database log. In this model, it is possible have a read follow a write, and showcase read-your-write inside the transaction's snapshotted version.

The router fed the transactions to the rlog, and I used a global variable "aborted" for a transaction process to signal the router not to waste more time on feeding the rlog for this transaction. The "aborted" flag is a hack, but I am not too interested in modeling this router interaction, so it suffices.


I then moved the atomic readset and writeset ops writing in the KVSnap model  to be piecemeal for each statement seen. As expected, this piecemeal processing of transaction created a lot of juicy snapshot isolation violations. I learned that I had to keep better track of "overlapping transactions". Previously, for the the single-shot transaction version, it was sufficient to just check with the active transaction set "tx" at commit-time and update their missed-update sets. Now, since the writeset is not available to the transactions in advance (and rather grow incrementally), I have to keep track of any transaction that overlapped with this one for detecting conflicts on keys.

To do an update on a key, the transaction process has to check that among the non-aborted overlapping transactions (some of which may have committed already) none of them updated this key first. Since that would be a violation of snapshot isolation, the process aborts itself as there is no point in proceeding after that.

But how do I track the overlapping transactions? This was not an easy task due to many different scheduling/yielding times among processes. The big insight that let me solve these problems was to realize that the overlap set needs to be symmetric. If t1 comes first and misses observing t2 in active transaction set tx, then t2 when it starts will observe t1 in tx, and update both of them as overlapping. And vice-versa, so we are covered.  

Here is the first part of the transaction process. It waits to receive a statement from the router. If this is the first statement of the transaction, it adds itself to active transaction set, and takes a snapshot of the database by reading from the replicaset log with MDB!Read. It also updates the transactions it is overlapping with. If this is a read statement, read is performed and logged in the ops log for the clientcentric module to be able to do its snapshot isolation checking. 


Here is the UPDATE part of the transaction for when the statement is a write transaction. If this write would not conflict with any non-aborted overlapping transactions (some of which may have committed already), the write is performed on snapshot copy, and logged in ops-log for snapshot isolation checking. If there is a conflict, the transaction is aborted, resetting its ops-log and updated keys set to stop spurious aborts of other trsanactions. Otherwise if this is the last statement, the transaction is committed. 


To my delight, the ClientCentric snapshot isolation checking covered the case where a transaction reads from its own snapshot a key it modified earler. (This is the internal consistency bullet mentioned here: within a transaction, reads observe that transaction’s most recent writes--if any. It is there in the Crooks' paper in the definition of reads-states as well.) So the isolation checking module was fine with the log growing incrementally. Another thing I realized is that if two transactions both commit satisfying snapshot isolation, then at any step of their past ops-logs, they don't have any snapshot isolation violation. 

Here is that commit:

https://github.com/muratdem/MDBTLA/tree/d66d84d4a6074511b135ef7e2a149bc613e92730/SingleShardTxn

And here are the invariants I check. Serialization is supposed to break because we implement snapshot isolated transactions.


I had put the serialization in properties label and was surprised when it didn't break. After going crazy for more than 15 minutes, I realized that I had to put Serialization under Invariants and not Properties. 



Comments

Tim said…
Really nice to see usage of my Client Centric TLA+ formalization. :D
This kind of add-on usage is exactly where it was designed for.

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book