Tuesday, August 12, 2014

Using TLA+ for teaching distributed systems

I am teaching CSE 4/586 Distributed Systems class again this Fall (Fall 2014). This is the course I have most fun teaching. (I would like to think my students also feel that way :-) I teach the course with emphasis on reasoning about the correctness of distributed algorithms. Here are the topics I cover in sequence:

  1. Introduction, Syntax and semantics for distributed programs, predicate calculus
  2. Safety and progress properties
  3. Proof of program properties
  4. Time: logical clocks, State: distributed snapshots
  5. Mutual exclusion, Dining philosophers
  6. Consensus, Paxos
  7. Fault-tolerance, replication, rollback recovery, self-stabilization
  8. Programming support for distributed systems
  9. Data center computing and cloud computing 
  10. CAP theorem and NOSQL systems
  11. Distributed/WAN storage systems

I put emphasis on reasoning about distributed algorithms because concurrency is very tricky; it truly humbles human brain. More than 3 actions in a distributed program and your intuitions will fail, you won't be able to hand-wave and apply operational reasoning on the program. You may think you could, but you would be very wrong (I know from first-hand experience).

I use invariant-based reasoning of program properties for the first 4 weeks exclusively. But this becomes less applicable when we move into more involved protocols in weeks 5 and beyond. This is where I give up being rigorous and make tell the class: "We could push things down the most rigorous invariant-based reasoning and predicate calculus level but we don't. Instead we give arguments in English, with the appreciation of how these arguments correspond to the proof rules in previous chapters." Yes, this is not very satisfying, but I didn't have much choice.


So for these reasons, the AWS TLA+ article got my attention recently. The article talked about how AWS successfully used invariant-based reasoning and formal methods (in particular TLA) for building robust distributed systems. TLA is a tool for specifying distributed algorithms/protocols and model checking them. AWS used TLA in many key projects: S3, DynamoDB, EBS, and a distributed lock manager. Here is the technical report by AWS. It is a very good read.

TLA+ is Leslie Lamport's brainchild. Of course you know Lamport if you are working on distributed systems. Lamport got a Turing award in 2013; he is famous for logical clocks, Paxos, and several other influential results in distributed systems. As a side-project, he wrote a wrapper around Knuth's TeX, called LaTeX ("La" for Lamport?), which is still the typesetting tool for almost all math/CSE academic papers. Lamport has always been a firm proponent of invariant-based reasoning for distributed algorithms and it seems like he has been dedicating most of his effort on prostelyzing TLA in recent years.

There are other successful model checkers (Spin, SMV, Promela), but TLA is more focused on supporting distributed algorithms reasoning. In addition, the PlusCal language (in the TLA+ toolkit) provides a high-level pseudo language to write distributed algorithms easily.

How I went about learning TLA

This was a straightforward and easy process. This is the main page for TLA, where the other pages can be reached. To download the toolkit, I first went to this page which forwards to this download page.

Then I downloaded the Hyperbook and started following it. The chapters were all straightforward for me, because this is very similar to the material I teach in my 486/586 class for invariant-based reasoning of distributed algorithms. The hyperbook has a lot of examples and is the best place to start learning TLA.

For the PlusCal language reference I downloaded this.

After I got the hang of it,  I decided to get my hands dirty with my own toy programs. I wrote TLA+ specifications for some simple coffee bean problems.  Then using PlusCal, I wrote specifications for Dijkstra's stabilizing token ring algorithm. First without using process abstraction, then with the process abstraction when I finished Chapter 7 in Hyperbook. Finally I wrote specifications for Dijkstra's 3-state and 4-state token ring algorithms, which progressed very smoothly. Next, I will use it on Paxos (here is a TLA+ specification of epaxos) and my own work.


The guarded-command language I use for teaching 4/586 translates very easily to PlusCal, so TLA+ is a good fit for my course. I will start using it in my 4/586 class this coming semester. I think the students will enjoy having hands-on experience with reasoning about non-toy distributed protocols.


My experience with using TLA+ in distributed systems class

No comments: