Dijkstra's stabilizing token ring algorithm

One of the classical algorithms I teach in my distributed systems class is Dijkstra's stabilizing token ring algorithm. This algorithm has started the self-stabilization field as a subfield of fault-tolerance. And, it still receives interest even after 40 years. There has been probably hundreds of self-stabilization papers that revisits Dijkstra's stabilizing token ring algorithm as part of a solution or as part of a case study. This algorithm never gets old for me as well. I still enjoy talking about this algorithm in class and thinking about it once in a while. I guess this is because it is a very elegant algorithm.

Linked is Dijkstra's original paper introducing the algorithm. This paper also includes two variants of the stabilizing token ring algorithm, 3-state and 4-state token ring algorithms. Dijkstra would later do a followup writeup, titled, "A belated proof of self-stabilization", where he gave a proof of stabilization for 3-state token ring program.

As a side note, when I asked my distributed class whether they heard Dijkstra's name before, almost all hands went up. And almost all of them had heard his name due to the "all pairs shortest path algorithm". Interesting. Probably, in Dijkstra's mind, that algorithm was a minor contribution. He came up with that algorithm at a cafe in 20 minutes without using pen and paper. None of the students heard the "Go-to considered harmful" paper, which may be closer to what Dijkstra was aiming with his career. Dijkstra's later work focused on predicate calculus for enabling  structured/principled verifiably-correct programming.

Stabilizing fault-tolerance

Stabilization is at the heart of the Dijkstra's token ring algorithm. Stabilization is a type of fault tolerance that advocates dealing with faults in a principled unified approach instead of on a case by case basis. Instead of trying to figure out how much faults can disrupt the system stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization advocates designing recovery actions that takes the program back to invariant states starting from any arbitrary. Anish Arora's paper "Practical Self-Stabilization for Tolerating Unanticipated Faults in Networked Systems" provides a great overview of stabilization concepts.

Token Ring stabilization

Let's start with the simple token ring concept. The processes are arranged in a ring fashion and there is a unique token circulating in this ring. (You can think of the token may be providing mutual exclusion to the processes; whichever process has the token can access the critical-section/shared-resource).

In the invariant states there is exactly one token. So in the bad states, there are two possible cases: 1) we end up with a ring with no tokens,  2) we end up with a ring with multiple tokens.

Let's consider how Dijkstra's token ring algorithm deals with case 1. Dijkstra's token ring algorithm maps the concept of a physical token to a relation between two neighboring processes. This mapping is done skillfully to limit the state space so the case 1 faults cannot possibly happen in that state space. In the mapped state space, any arbitrary state  is guaranteed to contain 1 or more tokens.

How does this work? Dijstra's algorithm encodes a token with the relation: "the x variable of a process is DIFFERENT than the x variable of its predecessor in the ring". In this case, if all x variables are the same, there won't be any token in the system. To fix this Dijkstra's algorithm encodes the token differently for process 0: "There is a token at process 0, if its x variable is the SAME as that of its predecessor, process N". Encoded this way, we cannot not have 0 tokens in any arbitrary state in the state space.

A good analogy for bounding the corruption space is the use of the carousal door. For a normal door, you have two possible states: door is open or door is closed. For a carousal door, there is only one possible space: the door is closed. (Yet you still make it through that door by revolving through it.)

In order to deal with case 2, Dijkstra's algorithm uses some token overwriting mechanisms. First let's observe that no process action introduces an extra token to the system. (Of course, a fault action can corrupt state arbitrarily, and can throw the system to a state with many extra tokens.) In the normal/fault-free case, by taking an action a process consumes a token that exists between its predecessor and itself and passes this token to its successor. In the faulty case, this is also the worst damage a process can do, because it will always consume the token at itself, and since its state change affects only its successor, it may or may not pass the token to its successor. If the process execution does not pass the token to its successor, this is actually good: an extra token is removed from the system. And if enough of these happen, the system will stabilize to exactly one token case.

But what if this leads to a system with no tokens? Recall that this cannot happen due to the discussion regarding Case 1.

Stabilizing token ring algorithm in TLA+ 

After explaining the program actions, in order to help students internalize what is going on, I call 5 students to stand up in front of the class. The students form a ring. And they use hands up or down to represent x=1 or x=0, and enact a simulation of the program starting from some arbitrarily corrupted states. The students first get to observe that there is indeed no configuration possible with 0 tokens. Then as they enact several scenarios, they observe how no new tokens are introduced by a process action, but in some cases an extra token is consumed.

Then, I challenge them to come up with a scenario where this binary token ring formed does not get to converge.  The corruption continues to live on in the system.  Here is such a run in a ring of 4:
1010 --> tokens at nodes 1,2,3
1011 --> tokens at nodes 1,2, and 0
1001 --> tokens at nodes 0,1,3
1101 --> tokens at nodes 0,2,3
0101 --> tokens at nodes 1,2,3
0100 --> tokens at nodes 1,2,0
0110 --> tokens at nodes 0,1,3
0010 --> tokens at nodes 0,2,3
1010 --> tokens at nodes 1,2,3
and this gets us to the first state. The system can repeat this loop forever without stabilizing.

Of course, when you run the model checker for this setup (where x is limited to 0 and 1) the TLA+ model checker provides you this counterexample.

The M>=N requirement 

So, what went wrong? Wasn't Dijkstra's token ring supposed to be stabilizing? The first time around we discussed the algorithm since x was an unbounded integer, the algorithm was stabilizing. When we considered the binary token ring example, x was bounded by 2, and that failed to stabilize in a ring of size 3 or more. There is a relation on M, the bound on x, and the algorithm's ability to stabilize for a ring of size N. We had seen in the counter example that for M<N there is a way to form a loop in the execution trace and keep the corruption going in the system.

For M>=N, the node 0 acts as a gate. Consider the case where node 0 never fires. In this case the other nodes fire, and eventually they all copy the x value at node 0, and in that configuration the system reduces to one token (at node 0) and stabilization is achieved. For M>=N, we can use the pigeon-hole principle to show that the system  eventually reaches the case where node 0 does not fire until all other nodes  copy its value. This is because, for M>=N, at any configuration there is a value "v" of x that is not currently present in the ring. Eventually node 0 will hit "v", and then node 0 will not be enabled until v is copied by all nodes. This is how node 0 acts as a gateway.

Concluding remarks  

I may talk about Dijkstra's 3-state and 4-state token ring algorithms in another post. Those are also interesting and elegant algorithms.  In a paper I wrote in 2002, I used Dijkstra's classic, 3-state, and 4-state token ring algorithms as case studies. If you like to learn more about this, you can read that paper here.

Related links

Crash only software 
The role of distributed state


Brent M. Spell said…
I was interested in your PlusCal specification for the token ring algorithm, and you mentioned that you could get the model checker to report the counterexample when M<N. I have been trying to do so with the TLA+ tools for M = 2 and N = 3 without much luck. Do you know where I might get a copy of your PlusCal specification in text form, or do you have any hints on where I might be going wrong? Thanks for any help.
Murat said…
Brent, email me, and I will email you the spec.
You are in fact using N+1 processes right?

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)