Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+.

The two-phase commit protocol is practical and is used in many distributed systems today. Yet it is still brief enough that we can model it quickly, and learn a lot from modeling it. In fact, we see that through this example we get to illustrate a fundamental impossibility result in distributed systems directly.

The two-phase commit problem

A transaction is performed over resource managers (RMs). All RMs must agree on whether the transaction is committed or aborted.

The transaction manager (TM) finalizes the transaction with a commit or abort decision. For the transaction to be committed, each participating RM must be prepared to commit it. Otherwise, the transaction must be aborted.

Some notes about modeling

We perform this modeling in the shared memory model, rather than in message passing, to keep things simple. This also ensures that the model checking is fast. But we add nonatomicity to the "read from neighbor and update your state" actions in order to capture the interesting behaviors that would happen in a message passing implementation.

A RM can only read the TM's state and read/update its own state. It cannot read other RM's state.  A TM can read all RM nodes' state and read/update its own state.


Lines 9-10 set the initial rmState for each RM to "working" and that of the TM to "init".

The canCommit predicate is defined to be true iff all RMs are "prepared" (they are ready for a commit decision). If there exists an RM with "abort" state, canAbort becomes truthified.

TM model

The TM modeling is straightforward. TM checks if it can commit or can abort, and updates tmState accordingly.

TM can also fail making tmState "unavailable", but this gets exercised only if the constant TMMAYFAIL is set to true before the model checking starts. In TLA+, the labels determine the granularity of atomicity. By putting the labels F1 and F2, we denote that the corresponding statements are executed nonatomically (after some indeterminate time passes) with respect to the previous statements.

RM model

The RM model is also simple. Since "working" and "prepared" states are nonterminal states, the RM nondeterministically chooses among the enabled actions until a terminal state is reached. A "working" RM may transition to an "aborted" or "prepared" state. A "prepared" RM waits to hear the commit and abort decision from the TM and acts accordingly. The figure below shows the state transitions possible for one RM. But note that we have multiple RMs, each of which is going through their state transitions at their pace without the knowledge of the state of another RM.

Model checking the two-phase commit

We are interested in checking that our two-phase commit is consistent: there are no two RMs such that one says "committed" and another "aborted".

The predicate Completed checks that the protocol does not hang on forever: eventually each RM reaches to a terminal "committed" or "aborted" state.

With that, we are ready to model check this protocol. Initially we set TMMAYFAIL=FALSE, RM=1..3 to run the protocol with three RMs and one TM that is reliable. The model checker takes 15 seconds and tells us that there are no errors. Both Consistency and Completed are satisfied by any possible execution of the protocol with different interleaving of enabled RM actions and TM actions.

Now, we set  TMMAYFAIL=TRUE and restart the model checker. The model checker is quick to give us a counterexample trace where the RMs in this execution is stuck waiting to hear back from the TM which has become unavailable.

We see that on State=4 the RM2 transitions to aborted, on State=7 the RM3 transitions to aborted, on State=8 the TM transitions to "abort", and crashes on State=9. On State=10, the system is stuck because the RM1 is in the prepared state waiting forever to learn a decision from the TM which has crashed.

BTM modeling

In order to avoid any downtime for many transactions the TM may be commandeering, we add a Backup TM  (BTM) to quickly take over when the TM becomes unavailable. The BTM uses the same logic as the TM to make decisions. And for simplicity we assume the BTM does not fail.

When we model check with the BTM process added, we get a new error message.

BTM cannot rule for a commit, because our original, canCommit condition asserted that all RMstates should be "prepared" and didn't account for the condition when some RMs already learned the "commit" decision from the original TM before the BTM takes over. So we revise our canCommit condition to account for this.

Success! When we check the model, we find that both Consistency and Completed are satisfied, as the BTM can take over and finalizes the transaction when TM fails. Here is the 2PCwithBTM model in TLA+ (initially the BTM and the second line of canCommit commented out). Here is the pdf corresponding to the pdf.

What if RMs could also fail?

We assumed the RMs are reliable. Now we relax that to see how the protocol behaves in the presence of RM failure. We add an "unavailable" state to model for a crash. In order to explore more behavior and model intermittent loss of availability, we allow a crashed RM to recover back and continue the protocol by reading its state from its log. Here is the RM state transition diagram again with the newly added "unavailable" state and transitions marked with red. And below that is the revised model for the RMs.

We also need to strengthen canAbort to take into account the unavailable state. The TM can rule for "abort" if any of the RMs is in "aborted" or "unavailable" state. If we omit this condition, an RM that has crashed (and never recovers) may make the transaction lose progress. Of course precaution should be taken to ensure that BTM cannot rule for an abort if there exists a RM that has learned of a "committed" decision from the original TM.

Model checking

When we check this model, we discover an inconsistency problem! How did this happen? Let's follow the execution trace for the counterexample.

On State=6, all the RMs are in prepared state, the TM had made a commit decision, and RM1 has seen that decision and moved to label RC, meaning that it is ready to change its state to "committed". (Keep RM1 in mind, this gun will fire at the last act.) Unfortunately, the TM crashes in State=7, and RM2 becomes =unavailable= in State=8. In State 9, the BTM takes over reads the RMstates as <prepared, unavailable, prepared> and rules for an abort in State=10. Remember RM1, it finally acts on the commit decision it saw from the original TM, and transitions to committed in State=11. In State=13, RM3 heeds the abort decision from the BTM and transitions to aborted, and we have the Consistency violated with RM1 deciding for committed and RM3 for aborted.

In this case, the BTM made a decision that violated consistency. On the other hand, if we had made BTM to wait until there are no "unavailable" RMs, it could be waiting forever on a crashed node, and this would then violate the progress condition.

The updated TLA+ model file is available here, and the corresponding pdf here.

FLP impossibility

So, what happened?  We hit the Fischer, Lynch, Paterson (FLP) impossibility result: In an asynchronous system, it is impossible to solve consensus (satisfy both safety and progress conditions) in the presence of crash faults.

In our example, the BTM cannot correctly decide whether RM2 is crashed or not, and rules incorrectly for an abort. If there was only one decision maker, just the TM, the inaccurate failure detector would not be an issue. The RMs would follow whatever the TM decides and both Consistency and Progress would be satisfied.

What surfaces and illustrates the problem is that we have two decision makers TM and BTM looking at the state of the RMs at different times, and making different decisions. This asymmetry of information is the root of all evil in distributed systems.

This problem doesn't go away even with the three-phase commit extension. Here is the three-phase commit extension modeled in TLA+ (here is the pdf version), and below is the error trace that shows this time Progress is violated. (The wikipedia page on three-phase commit says in a timeout after receiving the precommit order the RMs, i.e., RM2 and RM3, should go ahead and commit, but that would instead cause a consistency problem in this case.)

Paxos, making the world a better place

Not all hope is lost. We have Paxos. Paxos treads carefully within bounds of the FLP result. The innovation in Paxos is that it is always safe (even in presence of inaccurate failure detectors, asynchronous execution, faults), and it eventually makes progress when consensus gets in the realm of possibility.

You can emulate the TM with a Paxos cluster of 3 nodes and that will solve the inconsistent TM/BTM problem. Or as Gray and Lamport show in the transaction commit paper, if the RMs use the Paxos box to store their decisions concurrently with replying to the TM, this shaves off the extra one message delay from the obvious protocol.


Joel Dedrick said…
Paxos (which I admit I haven't studied) is everybody's go-to consensus algorithm. But I thought it was fascinating that John Ousterhout wrote a paper about finding a different one, not because there's anything wrong with Paxos per-se, but because it's so hard to understand that almost nobody who uses it actually knows how it works.

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The end of a myth: Distributed transactions can scale

There is plenty of room at the bottom

Distributed Transactions at Scale in Amazon DynamoDB

Dude, where's my Emacs?