Modeling a Replicated Storage System in TLA+, Project 1

- November 24, 2016

Why a TLA+ project?

The first project assignment in my distributed systems class this semester was modeling a replicated storage system in TLA+. Assigning a TLA+ project makes me a rarity among distributed systems professors. A common project would be a MapReduce programming assignment or a project to implement a simple distributed service (such as a key-value store) in Java.

I think that a MapReduce deployment project does not teach much about distributed systems, because MapReduce is a very crude abstraction and hides all things distributed under the hood. Using MapReduce for the distributed systems class project would be like handing people a mechanics certification upon successful completion of a driving test.

Implementing a simple distributed service, on the other hand, would teach students that indeed programming and debugging distributed systems is very hard. However, I would suspect that much of the hardship in that project would be due to accidental complexities of the implementation language rather than the intrinsic complexity of reasoning about distributed systems in the presence of concurrent execution and failures. In a distributed systems deployment, it would be hard for the students to test exhaustively for race conditions, concurrency bugs, and exercise many possible combinations of node failures and message losses. That is notoriously hard for even the professionals, as shown by this survey of 104 distributed concurrency bugs from Cassandra, HBase, Hadoop MapReduce, and ZooKeeper.

I am of course not against assigning an implementation project in a distributed systems class. I see the utility and necessity in giving students hands-on programming experience on distributed systems. I think an implementation project would be suitable for an advanced distributed systems class. The class I am teaching is the first distributed systems class for the students, so an implementation project would be unnecessarily complicated and burdensome for these students, who are also taking 3 other classes.

I teach the distributed systems class with emphasis on reasoning about the correctness of distributed algorithms, so a TLA+ is a good fit and complement for my course. Integrating TLA+ to the class gave students a way to get a hands-on experience in algorithms design and dealing with the intrinsic complexities of distributed systems: concurrent execution, asymmetry of information, concurrency bugs, and a series of untimely failures.

TLA+ has a lot to offer for practice and implementation of distributed systems. At Amazon, the engineers used TLA+ for modeling S3, DynamoDB, and some other production systems that see a lot of updates and new features. TLA+ helped the engineers find several critical bugs introduced by updates/features in the design stage, which if not found would have resulted in large amount of engineering effort later on. These are detailed in a couple of articles from Amazon engineers. I have been hearing other TLA+ adoption cases in the industry, and hope to see increasingly more adoption in the coming years.

Modeling Voldemort replicated storage system with client-side routing in TLA+

I wanted to assign the TLA+ modeling project on a practical useful application. So I chose modeling of a replicated storage system. I assigned the students to model Voldemort with client-side routing as their first project.

Here is the protocol. The client reads the highest version number "hver" from the read quorum (ReadQ). The client then writes to the write quorum nodes (WriteQ) the store the updated record with "hver+1" version number. The storage nodes can crash or recover, provided that no more than FAILNUM number of nodes are crashed at any moment. Our WriteQ and ReadQ selection will consist of the lowest id storage nodes that are up (currently not failed).

I asked the students to model check with different combinations of ReadQ, WriteQ, and FAILNUM, and figure out the relation that needs to be satisfied among these configuration parameters in order to ensure that the protocol satisfies the single-copy consistency property. I wanted my students to see how consistency can be violated as a result of a series of unfortunate events (such as untimely node death and recoveries). The model checker is very good for producing counterexamples where consistency is violated.

Simplifying assumptions and template to get the students started

I tried to keep the first project easy. We simplified things by modeling the storing (and updating) of just a single data item, so we didn't have to model the hashing part. We also used shared memory. The client directly writes (say via an RPC) to the db of the storage nodes. (It is possible to model a message-box at each process, and I assigned that for the second project.)

I also gave the students the template for the model. This helps my TAs a lot in grading. Without any template the students can go in wildly different directions with their modeling. Below is the template. In case you want to give this a try and see how you do with it, I will wait a week or so before I share my solution for the project1.

Extending to the second project

I have assigned the second project again on the modeling of a replicated storage system. But this time instead of a quorum, "chain replication" is used for ensuring persistence and consistency of storage. In the second project, the replication is done by server-side routing, and the modeling includes message passing.