Saturday, January 31, 2015

Dijkstra's stabilizing token ring algorithm

One of the classical algorithms I teach in my distributed systems class is Dijkstra's stabilizing token ring algorithm. This algorithm has started the self-stabilization field as a subfield of fault-tolerance. And, it still receives interest even after 40 years. There has been probably hundreds of self-stabilization papers that revisits Dijkstra's stabilizing token ring algorithm as part of a solution or as part of a case study. This algorithm never gets old for me as well. I still enjoy talking about this algorithm in class and thinking about it once in a while. I guess this is because it is a very elegant algorithm.

Linked is Dijkstra's original paper introducing the algorithm. This paper also includes two variants of the stabilizing token ring algorithm, 3-state and 4-state token ring algorithms. Dijkstra would later do a followup writeup, titled, "A belated proof of self-stabilization", where he gave a proof of stabilization for 3-state token ring program.

As a side note, when I asked my distributed class whether they heard Dijkstra's name before, almost all hands went up. And almost all of them had heard his name due to the "all pairs shortest path algorithm". Interesting. Probably, in Dijkstra's mind, that algorithm was a minor contribution. He came up with that algorithm at a cafe in 20 minutes without using pen and paper. None of the students heard the "Go-to considered harmful" paper, which may be closer to what Dijkstra was aiming with his career. Dijkstra's later work focused on predicate calculus for enabling  structured/principled verifiably-correct programming.

Stabilizing fault-tolerance

Stabilization is at the heart of the Dijkstra's token ring algorithm. Stabilization is a type of fault tolerance that advocates dealing with faults in a principled unified approach instead of on a case by case basis. Instead of trying to figure out how much faults can disrupt the system stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization advocates designing recovery actions that takes the program back to invariant states starting from any arbitrary. Anish Arora's paper "Practical Self-Stabilization for Tolerating Unanticipated Faults in Networked Systems" provides a great overview of stabilization concepts.

Token Ring stabilization

Let's start with the simple token ring concept. The processes are arranged in a ring fashion and there is a unique token circulating in this ring. (You can think of the token may be providing mutual exclusion to the processes; whichever process has the token can access the critical-section/shared-resource).

In the invariant states there is exactly one token. So in the bad states, there are two possible cases: 1) we end up with a ring with no tokens,  2) we end up with a ring with multiple tokens.

Let's consider how Dijkstra's token ring algorithm deals with case 1. Dijkstra's token ring algorithm maps the concept of a physical token to a relation between two neighboring processes. This mapping is done skillfully to limit the state space so the case 1 faults cannot possibly happen in that state space. In the mapped state space, any arbitrary state  is guaranteed to contain 1 or more tokens.

How does this work? Dijstra's algorithm encodes a token with the relation: "the x variable of a process is DIFFERENT than the x variable of its predecessor in the ring". In this case, if all x variables are the same, there won't be any token in the system. To fix this Dijkstra's algorithm encodes the token differently for process 0: "There is a token at process 0, if its x variable is the SAME as that of its predecessor, process N". Encoded this way, we cannot not have 0 tokens in any arbitrary state in the state space.

A good analogy for bounding the corruption space is the use of the carousal door. For a normal door, you have two possible states: door is open or door is closed. For a carousal door, there is only one possible space: the door is closed. (Yet you still make it through that door by revolving through it.)

In order to deal with case 2, Dijkstra's algorithm uses some token overwriting mechanisms. First let's observe that no process action introduces an extra token to the system. (Of course, a fault action can corrupt state arbitrarily, and can throw the system to a state with many extra tokens.) In the normal/fault-free case, by taking an action a process consumes a token that exists between its predecessor and itself and passes this token to its successor. In the faulty case, this is also the worst damage a process can do, because it will always consume the token at itself, and since its state change affects only its successor, it may or may not pass the token to its successor. If the process execution does not pass the token to its successor, this is actually good: an extra token is removed from the system. And if enough of these happen, the system will stabilize to exactly one token case.

But what if this leads to a system with no tokens? Recall that this cannot happen due to the discussion regarding Case 1.

Stabilizing token ring algorithm in TLA+ 

After explaining the program actions, in order to help students internalize what is going on, I call 5 students to stand up in front of the class. The students form a ring. And they use hands up or down to represent x=1 or x=0, and enact a simulation of the program starting from some arbitrarily corrupted states. The students first get to observe that there is indeed no configuration possible with 0 tokens. Then as they enact several scenarios, they observe how no new tokens are introduced by a process action, but in some cases an extra token is consumed.

Then, I challenge them to come up with a scenario where this binary token ring formed does not get to converge.  The corruption continues to live on in the system.  Here is such a run in a ring of 4:
1010 --> tokens at nodes 1,2,3
1011 --> tokens at nodes 1,2, and 0
1001 --> tokens at nodes 0,1,3
1101 --> tokens at nodes 0,2,3
0101 --> tokens at nodes 1,2,3
0100 --> tokens at nodes 1,2,0
0110 --> tokens at nodes 0,1,3
0010 --> tokens at nodes 0,2,3
1010 --> tokens at nodes 1,2,3
and this gets us to the first state. The system can repeat this loop forever without stabilizing.

Of course, when you run the model checker for this setup (where x is limited to 0 and 1) the TLA+ model checker provides you this counterexample.

The M>=N requirement 

So, what went wrong? Wasn't Dijkstra's token ring supposed to be stabilizing? The first time around we discussed the algorithm since x was an unbounded integer, the algorithm was stabilizing. When we considered the binary token ring example, x was bounded by 2, and that failed to stabilize in a ring of size 3 or more. There is a relation on M, the bound on x, and the algorithm's ability to stabilize for a ring of size N. We had seen in the counter example that for M<N there is a way to form a loop in the execution trace and keep the corruption going in the system.

For M>=N, the node 0 acts as a gate. Consider the case where node 0 never fires. In this case the other nodes fire, and eventually they all copy the x value at node 0, and in that configuration the system reduces to one token (at node 0) and stabilization is achieved. For M>=N, we can use the pigeon-hole principle to show that the system  eventually reaches the case where node 0 does not fire until all other nodes  copy its value. This is because, for M>=N, at any configuration there is a value "v" of x that is not currently present in the ring. Eventually node 0 will hit "v", and then node 0 will not be enabled until v is copied by all nodes. This is how node 0 acts as a gateway.

Concluding remarks  

I may talk about Dijkstra's 3-state and 4-state token ring algorithms in another post. Those are also interesting and elegant algorithms.  In a paper I wrote in 2002, I used Dijkstra's classic, 3-state, and 4-state token ring algorithms as case studies. If you like to learn more about this, you can read that paper here.

Related links

Crash only software 
The role of distributed state

Thursday, January 29, 2015

My Distributed Systems Seminar reading list for Spring 2015

Below is the list of papers I plan to discuss in my distributed systems seminar this semester. If you have some suggestions on other good/recent papers to cover, please let me know in the comments.
  1. Perspectives on the CAP Theorem
  2. Consistency, availability, and convergence.
  3. Salt: combining ACID and BASE in a distributed database
  4. Extracting More Concurrency from Distributed Transactions
  5. Eidetic Systems
  6. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
  7. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
  8. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
  9. GraphX: Graph Processing in a Distributed Dataflow Framework
  10. Architecture of a database system
  11. Arrakis: The Operating System is the Control Plane
  12. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism
  13. Project Adam: Building an Efficient and Scalable Deep Learning Training System

My experience with using TLA+ in distributed systems class

I used TLA+ in my distributed system class in Fall 2014. (To learn the backstory on this, read my pre-semester TLA+ post.)

In short, I loved the experience and I am hooked. Integrating TLA+ to the class gave students a way to get a hands-on experience in algorithms design and correctness verification. Even for a sophisticated algorithm, such as Paxos, I can refer the students to TLA+ to practice and play with the algorithm, so they can internalize what is going on. Throughout the semester as I had more chance to work with TLA+, I started growing a library of TLA+ modeling of distributed algorithms. Next time I teach the class, I hope to provide the students with a TLA+ model of each algorithm I cover in the class.

The student feedback was also positive. The students liked TLA+ since it gave them a way to experiment and supported them in reasoning about the algorithms. Several students complained of the learning curve. They wanted me to cover TLA+ in more detail in the class. (The TLA+ manual can be very dense for the students.) Next semester I plan to allocate more time getting the students acquainted with TLA+. (Next time around, I will also work on devising automated ways to grade the TLA+ projects. My TA had do manual testing of the projects, not a pleasant task for a class of 60 students.)

The TLA+ mini-projects I used

Throughout the semester, I assigned 3 TLA+ mini-projects with increasing complexity.

The aim of the first mini-project was to get the students started with TLA+. I asked the students to model a bean can problem (white and black beans with rules to remove beans form the can). This was simple and there was only one process. But the students got to learn how they can check safety and liveness conditions on their program.
The second mini-project required them to deal with multiple processes. I asked them to model the logical physical clocks we had introduced in our research.  Using TLA+ model checker the students found the counterexample in the naive algorithm, where the logical clock gradually but unboundedly gets ahead of the physical clock. This counterexample is hard to find and involves 30 steps in the algorithm. So this was an example were TLA+ proved its value.  (Try to figure out this counterexample with unit tests, I dare you.)
The final mini-project gave them more practice on modeling distributed algorithms. I asked them to model Dijkstra's classical stabilizing token ring algorithm as well as Dijkstra's lesser known 4-state stabilizing token ring algorithm. Students learned about self-stabilization and how it deals with arbitrary memory corruption.

Lessons learned (the good, the bad, and the ugly)

I learned that PlusCal language is the right abstraction for modeling distributed algorithms. TLA+ is too low level for writing (and reading) distributed algorithms. PlusCal allows you to include TLA+ expressions and definitions, so you benefit from all the expressive power of TLA+ without going full TLA+. PlusCal is not a separate entity, but just a more convenient way to write TLA+ code. In the TLA+ toolkit you can write the algorithm in PlusCal and hit the "translate to TLA+" to produce the corresponding TLA+ code. When I use the term "modeling an algorithm in TLA+", I really mean writing the model in PlusCal, auto-translating it to TLA+ and model checking it.

Learning to model algorithms using TLA+ takes time. The language is powerful and very general; it is Math after all. You need to study many examples to learn about the RIGHT ways of doing things. I experienced this as a novice first hand. I had written a PlusCal code for Paxos within a couple days. My code worked, but it used procedures, and it was very procedural/operational rather than declarative, the way Math should be. And so my code was very slow to model check. Model checking with 2 acceptors worked, but I gave up on model checking with 3 acceptors since it took more than an hour. Then I found Lamport's PlusCal code about Byzantine agreement. I rewrote my Paxos implementation, imitating his style. I used macros, instead of procedures. My code become more declarative than operational. And, for message passing, I used Lamport's trick of using record typed messages which are written just once to a shared message board; it was the acceptors responsibility to react to them (by just reading, not consuming them). My new refactored code translated to ~150 LOC TLA+, whereas my first version had translated to ~800 LOC TLA. And the new version model checked in a minute for 3 acceptors, leading to a couple of magnitudes of order improvement in efficiency.

While PlusCal is nice, the crutch of using labels in PlusCal algorithms (for enabling TLA+ translation) is a bit ugly. The labels in PlusCal are not just aesthetic. They determine the granularity of atomic execution of blocks of code, so it is critical to place the labels right. If you fail to do this right, you will run into deadlocks due to co-dependent await conditions in different blocks of code. Inserting a label reduces the granularity of block of code from atomic to break at the label. Labels were a little unintuitive for me. But after some practice, I was able to get a hang of them.

Another ugly part is the lack of data structure support for passing messages and parsing them. Using tuples without field names is flexible, but it becomes intractable. Using records help somewhat, but manipulating field names in records also gets complicated quickly. TLA+ doesn't have static/strict typing, and I got stuck with runtime errors several times. In one case, I realized I had to find a way to flatten out a set inside of another set; after a lot of head banging, I found about the UNION operator, which solved the problem.

Good engineers use good tools. And TLA+ is a great tool for modeling distributed algorithms. Modeling an algorithm with TLA+ is really rewarding as it makes you "grok" the algorithm.

Useful Links:

There is a vibrant Google Groups forum for TLA+ :!forum/tlaplus

My previous post has links to useful documentation for TLA+/PlusCal

Clicking on label "tla" at the end of the post you can reach all my posts about TLA+