TLA+ modeling of MongoDB logless reconfiguration

This post walks through the TLA+ specs for the MongoDB logless reconfiguration protocol, which we reviewed recently.

The specs are available in the https://github.com/will62794/logless-reconfig repo, provided by Will Schultz, Siyuan Zhou, and Ian Dardik.

I really admire how these specs provide a modular composition of the reconfiguration protocol and the Raft-based replication protocol. I figured I would explain how this works here, since walkthroughs of advanced/intermediate TLA+ specifications, especially for composed systems, are rare.

I will cover the structure of the two protocols (CSM and OSM) briefly, before diving into how they are composed.

At the end I will also show that, using the tla-web application (developed by Will Schultz), you can interactively explore executions of the combined spec right in your web browser, without installing any TLA+ tooling at all. This is a great way to share specifications and counterexample traces with colleagues who don't dabble in TLA+, and I am excited about it.


CSM: MongoLoglessDynamicRaft 

This file is at https://github.com/will62794/logless-reconfig/blob/master/MongoLoglessDynamicRaft.tla if you'd like to follow along.

  • Server is the set of all nodes.
  • state denotes whether a node is Primary or Secondary.
  • currentTerm is the Raft term of a node.
  • config is the current config the node knows of.
  • configVersion is the version associated with the config.
  • configTerm is the term associated with that config.

Refer back to our post on MongoDB logless reconfiguration to see how configVersion and configTerm play out. To ensure safety, the protocol uses (configTerm, configVersion) as a pair, and requires that the CSM commits the config in the most recent configTerm, and that the OSM follows the committed configs in sequential order.

The CSM protocol has four actions: Reconfig, SendConfig, UpdateTerms, and BecomeLeader.

The Reconfig action does not seem to constrain newConfig, appearing to allow any possible subset of the Server set, as if an arbitrary next reconfiguration were permitted. But if you look at the preconditions of the Reconfig action on line 99, the QuorumsOverlap check requires that every quorum of newConfig share at least one node with every quorum of the current config. Phew, sanity restored. If we comment out the QuorumsOverlap condition, we encounter a safety violation. You can try this yourself, after I teach you about the tla-web app at the end of this post.
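As a mental model, the QuorumsOverlap check can be sketched in Python. This is a brute-force illustration of the condition, not the TLA+ definition from the repo:

```python
from itertools import combinations

def quorums(config):
    """All majority subsets (quorums) of a set of nodes."""
    n = len(config)
    return [set(q) for k in range(n // 2 + 1, n + 1)
            for q in combinations(sorted(config), k)]

def quorums_overlap(old_config, new_config):
    """QuorumsOverlap: every quorum of new_config must share at least
    one node with every quorum of old_config."""
    return all(q_new & q_old
               for q_new in quorums(new_config)
               for q_old in quorums(old_config))
```

Note that this condition by itself still allows drastic moves such as shrinking {s1, s2} down to {s1}, since the singleton quorum {s1} intersects the old quorum {s1, s2}; that is exactly why the additional OplogCommitment guard (discussed below) matters.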

SendConfig is used for sharing the config between any two servers. The newer config, determined by higher configTerm or, if the configTerms are equal, by higher configVersion, dominates and gets adopted by the other server.
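This (configTerm, configVersion) ordering is just a lexicographic comparison. A sketch of the SendConfig adoption rule, with nodes modeled as plain dicts for illustration (not the spec's actual representation):

```python
def is_newer(term_a, version_a, term_b, version_b):
    """Config A is newer than config B under the (configTerm, configVersion)
    order: higher term wins; equal terms are broken by higher version."""
    return (term_a, version_a) > (term_b, version_b)

def send_config(sender, receiver):
    """SendConfig sketch: the receiver adopts the sender's config only if
    the sender's (configTerm, configVersion) pair is newer."""
    if is_newer(sender["configTerm"], sender["configVersion"],
                receiver["configTerm"], receiver["configVersion"]):
        for field in ("config", "configTerm", "configVersion"):
            receiver[field] = sender[field]
```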

UpdateTerms updates the currentTerm variable, and BecomeLeader lets a server become Primary by getting a quorum of votes from the others and incrementing its currentTerm. These two actions are shared with the OSM protocol, and we will later see how the composition spec insists that they are executed jointly by both CSM and OSM, superposing the two protocols.


OSM: MongoStaticRaft

This file is at https://github.com/will62794/logless-reconfig/blob/master/MongoStaticRaft.tla if you'd like to follow along. This is the pull-based replication protocol based on Raft that we covered in an earlier post.

Many of these variables/constants are shared with the CSM above, so let's talk about the diff.

  • log is the oplog for the OSM protocol.
  • committed is the set of committed (majority replicated) entries in the log.

The OSM protocol has six actions: BecomeLeader, UpdateTerms, ClientRequest, GetEntries, RollbackEntries, and CommitEntry.

BecomeLeader is for becoming a Raft Primary, and UpdateTerms propagates the Raft currentTerm variable between two nodes. (These two actions are in fact the same as the identically named actions in the CSM spec, MongoLoglessDynamicRaft.) ClientRequest is for the Primary to accept a request for replication. GetEntries performs pull-based replication of oplog entries. RollbackEntries performs oplog rollback when warranted upon a primary change, and CommitEntry commits majority-replicated entries. Standard Raft stuff.
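For intuition, the commit rule can be sketched as follows. This is a hypothetical simplification: the spec's actual CommitEntry action also checks that the entry's term matches the primary's currentTerm:

```python
def majority_replicated(primary, index, logs, config):
    """CommitEntry sketch: the entry at 1-based `index` in the primary's log
    becomes commit-eligible once a majority of the current config stores the
    same entry at that index."""
    entry = logs[primary][index - 1]
    acked = {s for s in config
             if len(logs[s]) >= index and logs[s][index - 1] == entry}
    return 2 * len(acked) > len(config)
```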


MongoRaftReconfig

This file is at https://github.com/will62794/logless-reconfig/blob/master/MongoRaftReconfig.tla

The MongoRaftReconfig file composes MongoLoglessDynamicRaft (CSM) and MongoStaticRaft (OSM), and regulates the superposed execution of the two specs.

  • CSM and OSM share the BecomeLeader and UpdateTerms actions.
  • OSM reads the config from CSM and uses it as the process set for executing the replication/commit and leader election protocol.
  • CSM reads the OplogCommitment condition from OSM. (Note that the composition spec guards the CSM's Reconfig action with OplogCommitment; commenting this out also leads to a safety violation, as we will explore below.)
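Roughly, OplogCommitment requires that everything committed so far is also durable in the primary's current configuration before that configuration may be abandoned. A hypothetical Python rendering of this guard (a simplification; the spec's actual OplogCommitment definition also reasons about terms):

```python
def oplog_commitment(committed, logs, config):
    """Reconfig guard sketch: every previously committed entry must be stored
    on a majority of the primary's current config before the primary may adopt
    a new config. `committed` is a hypothetical set of (index, entry) pairs."""
    for (index, entry) in committed:
        acked = {s for s in config
                 if len(logs[s]) >= index and logs[s][index - 1] == entry}
        if 2 * len(acked) <= len(config):
            return False
    return True
```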

Because BecomeLeader and UpdateTerms are shared between CSM and OSM, the composition spec restricts the possible computations/executions of each, so that each remains a refinement of its respective original spec.

Note that the initial states of CSM and OSM also need to agree on their shared variables, which are currentTerm, state, and config. The variables of the composition spec are the union of the variables of CSM and OSM.


The composition spec defines OSMNext and CSMNext, with OplogCommitment guarding the CSM's Reconfig action. It also restricts the shared actions BecomeLeader and UpdateTerms to execute jointly, meaning that both take effect in their respective specs in the same step. This keeps the CSM and OSM states synchronized and prevents the two state machines from diverging from each other's views.
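The joint-execution discipline can be sketched as a tiny dispatch table (a hypothetical Python model of which sub-state-machine steps on each action, not code from the repo):

```python
# Action labels grouped by which spec owns them (names taken from the specs).
OSM_ACTIONS = {"ClientRequest", "GetEntries", "RollbackEntries", "CommitEntry"}
CSM_ACTIONS = {"Reconfig", "SendConfig"}
SHARED_ACTIONS = {"BecomeLeader", "UpdateTerms"}

def next_step(action):
    """Return which sub-state-machines take a step for a given action label."""
    if action in SHARED_ACTIONS:
        return {"OSM", "CSM"}   # JointNext: both specs step together
    if action in OSM_ACTIONS:
        return {"OSM"}          # OSMNext: CSM variables stay unchanged
    if action in CSM_ACTIONS:
        return {"CSM"}          # CSMNext: OSM variables stay unchanged
    raise ValueError(f"unknown action {action}")
```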


The Next relation then defines a step of the composition spec as either OSMNext, CSMNext, or JointNext.


State exploration

It is time for us to reach for the tla-web app and play around with an interactive version of this spec. The spec here is equivalent to the compositional spec we reviewed above, except that it is combined into a single file rather than specified compositionally.

If you are ready, just follow my directions. This will be fun. 

Open this link in another tab. This readily loads the joint TLA+ spec. You can view the spec by clicking the Spec button at the top right. You can even modify the spec by commenting out a line.

And that is what we are going to do. The best way to learn something is to break it, see why it breaks, and then fix it back. Go to line 259 and comment out the OplogCommitment precondition check by putting \* at the beginning of the line. Then click the Details button up top to come back to the initial page.

Now, the prompt there asks us to choose a possible initial state. Choose the one with config={s1,s2}, meaning we have two replicas in the initial configuration; this is the second choice from the top. Inspect the variable assignments in this initial state and click on it. Note that the right side of the page changes to show the history of the trace. We are only one step into this execution, so that is what we see on the right. The left pane also changes to list the actions/transitions that are enabled in the state resulting from our initial choice.

Choose the transition where node 1 becomes the leader. This is the BecomeLeader action with s1 becoming the leader using the quorum {s1,s2}. Note that there are two sub-buttons enabled; the other option is for s2 to become the leader. So, follow me, and choose the first button to make s1 the leader.

Note that our trace grew by another state transition on the right, and we have a new set of options opened up for us. Choose SendConfig s1, s2. Now our trace is of length 3.

Oh, I almost forgot: in the text entry at the top of the right pane, enter LeaderCompleteness and press the AddTraceExpression button next to it. This lets us monitor the LeaderCompleteness invariant, which is on line 320 in the Spec tab if you'd like to check. It evaluates to TRUE now, but since we introduced a bug in the spec by commenting out the OplogCommitment precondition, we should expect to see it violated soon.

Let's go faster now.

Choose Reconfig s1, {s1} so that node 1 reconfigures the system to drop down to only one replica, that is itself.

Choose the ClientRequest s1 option, and all the while monitor how the state variables evolve on the right as we take the system through this path of execution. Notice that the log variable on s1 becomes <<1>>.

CommitEntry s1 {s1}. Note that s1 did not need another node to commit: since the reconfiguration took the system down to just one node, it can commit locally.

Now, choose Reconfig s1, {s1,s2}.

SendConfig s1, s2

Reconfig s1, {s1,s2,s3}

SendConfig s1, s3

SendConfig s1, s2

BecomeLeader s2 {s2,s3}. Wow, notice that LeaderCompleteness became FALSE. This is because s2 does not have what s1 committed (locally), so we lost a commit. Let's continue to see what more trouble this can cause. In the right pane, add the trace expression StateMachineSafety, which is on line 325 in the spec. Note that this still shows TRUE.

SendConfig s2, s3

ClientRequest s2, so s2 accepts another value into the same slot (the 1st slot in the log), making its log <<2>>.

GetEntries s3, s2

CommitEntry s2 {s2, s3}. Wow! Notice that StateMachineSafety is violated now. This invariant says that no two different entries can be committed at the same log index.
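A minimal sketch of this invariant check (hypothetical Python paraphrasing the spec's StateMachineSafety, with committed entries modeled as (index, term, value) records):

```python
def state_machine_safety(committed):
    """No two different entries may be committed at the same log index."""
    by_index = {}
    for (index, term, value) in committed:
        if index in by_index and by_index[index] != (term, value):
            return False  # two distinct commits at the same index: violation
        by_index[index] = (term, value)
    return True
```

In our trace, s1 committed <<1>> at index 1 in its one-node config, and s2 later committed <<2>> at the same index, which is exactly this kind of violation.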

By modifying the spec and introducing a bug, we violated this invariant. We have a 16-step counterexample for this violation. By copying the URL, or using the Copy Trace Link button, we can share this counterexample with our colleagues.

But there is a catch in our case. We made our modification in our local buffer, not at the spec URL, so this link will load the correct version and will only be able to follow the trace up to step 8, where the faulty version diverges from the unmodified correct version. If you'd like the full 16-step counterexample, you can put the modified version at a publicly accessible link, load that version, and share the steps from that faulty version.

