TLA+/Pluscal modeling of 2-phase commit transactions

For the second project in my distributed systems class Fall 17, I assigned modeling of the two-phase transaction commit protocol. I ask students to model what happens when the initiator/transaction manager (TM) fails, how would a backup (TM) take over, and what type of problems could arise.

Here is the description of the project. I will post the solution later on.

2 phase commit

In a distributed system, a transaction is performed by a collection of processes called resource managers (RMs), each executing on a different node. The transaction ends when the transaction manager (TM) issues a request either to commit or to abort the transaction. For the transaction to be committed, each participating RM must be willing to commit it. Otherwise, the transaction must be aborted. The fundamental requirement is that all RMs must eventually agree on whether the transaction is committed or aborted.

Here is a model of 2-phase commit. (This was built by modifying the /P2TCommit.tla/ at Lamport's site to add a TM agent). I decided to stay with the shared memory model rather than the message passing model to keep the project simple. The interesting scenarios is still possible under the shared memory model.

Here are the constraints on the shared memory communication. A RM node can only read tmState and read/update its own rmState. It cannot read other RM's rmState. A TM node can read all RM nodes' rmState and read/update tmState. A BTM node can read all RM nodes' rmState and read/update tmState.

2 phase commit modeling and validation

If no faults occur, the 2-phase commit algorithm is correct. In the presence of a crash fault, however, problems can arise. In the questions below, we will use TLA+/PlusCal to explore what problems may arise, and how to properly design the protocol to overcome those problems.

Part 1.

  • Fill in the reducted PlusCal code. (The /macro Fail/ models RM failure.)
  • Add /Consistency/ and /Termination/ properties.
  • Model check /Consistency/ and /Termination/ with no failures (RMMAYFAIL=FALSE and TMMAYFAIL=FALSE). You should see no errors.
  • Model check with RM failure (RMMAYFAIL=TRUE and TMMAYFAIL=FALSE). You should see no errors. 

Part 2.

  • Model check with RMMAYFAIL=FALSE and TMMAYFAIL=TRUE. (No need to recover a failed TM.)
  • Write in the comments section, after the "==================" line, your findings/observations. Comment whether the /Termination/ property is violated by a TM failure. 

Part 3.

  • Add a backup TM process to take over if primary crashes. (Assume the BTM cannot fail. TMMAYFAIL can only affect TM not BTM.)
  • Test satisfaction of /Consistency/ and /Termination/ properties with no TM or RM failures. Make sure BackupTM terminates, so the /Termination/ property is also satisfied as well as /Consistency/ property. 
  • Model check with both TM and RM failure allowed (RMMAYFAIL=TRUE and TMMAYFAIL=TRUE). Write down your observations. (No need to recover a failed RM or TM.)

Here is the synchronized consensus problem I assigned as the first project in the class.

Here is some previous discussion/context about why I started assigning TLA+/PlusCal modeling projects in distributed systems classes.


Popular posts from this blog

Graviton2 and Graviton3

Foundational distributed systems papers

Learning a technical subject

Learning about distributed systems: where to start?

Strict-serializability, but at what cost, for what purpose?

CockroachDB: The Resilient Geo-Distributed SQL Database

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Warp: Lightweight Multi-Key Transactions for Key-Value Stores

Anna: A Key-Value Store For Any Scale

Your attitude determines your success