Modeling the hygienic dining philosophers algorithm in TLA+

This post on hygienic dining philosophers algorithm is the continuation of the post on dining philosophers. If you haven't seen that one, it is better to read that first before you continue with this one.

The hygienic dining philosophers algorithm (due to Chandy & Misra 1984) generalizes the ring topology of the dining philosophers algorithm to an arbitrary undirected graph. The most important contribution of the hygienic philosophers algorithm is that it introduces a priority concept that ensures "No Starvation" even under weak fairness. (NoStarvation == \A p \in Procs: state[p]="Hungry" ~> state[p]="Eating") The scheme deprioritizes a process that has eaten so the process cannot greedily beat its neighbors to eating again. This is achieved by flagging the forks of a process that has eaten as "dirty". Dirty forks denote that the process has just eaten, therefore that process needs to provide the fork to any of its neighbors that asked for the dirty fork.

PlusCal modeling of the algorithm

Here is my PlusCal algorithm modeling of the hygienic dining philosophers problem.

Initialization and Modeling

I model the Topology as a set. Since we are looking at a undirected graph, the edges are bidirectional. Our definition of "Edge" predicate, and consequentially that of "GetNbrs", reflect the bidirectionality of the edges.


I modeled "forks" and "request tokens" as sets as well. The modeling of forks and requests are directional, and the definitions of "Forkat" and "Reqat" reflects this point. I also modeled "clean forks" as a set: if a fork is clean, the fork is listed in the set "Clean". (I might have gone wild with using sets for modeling all the data structures of the algorithm. But in my defense, sets are one of the simplest data structures, and are flexible and useful.)

Now let's talk about the initialization of the forks, request-tokens, and clean forks set. The initial placement of forks and their marking as clean or dirty determines the initial priority relation among the processes. So, when we initialize, we must be careful not to introduce a cycle, since that would result in a deadlock. Therefore, a structure should be imposed on the graph such that it is a directed acyclic graph (DAG).

The recommended initialization is that initially all the forks are dirty (i.e., Clean={}) and each process is "Thinking". If a process becomes hungry, it can then get the forks it requests (since they are already dirty) without waiting on other process (which may not have the inclination) to eat first. For simplicity, I decided to assign the dirty forks to the lower id process of an edge, so I initialized the fork set to be equal to the Topology set. (Of course in the fork set the edges are directional: the process that comes earlier in the edge tuple denotes that fork resides at that side of the edge.) Since I had to make sure that the request token is not on the side of the dirty fork (because that would be misinterpreted as the other node asking for the fork in the initial state) I initialized the request-token set to be equal to the reverse of the Topology set.

Process actions

Each process is initialized to start in Thinking state. The process then loops through checking appropriate actions for its current state, and perform state transitions when conditions are met.



But first the process checks if it needs to respond to any requests send to it for forks and does so if its state is not "Eating". The requests for the dirty forks are answered inside a while loop iteration. That a fork is dirty indicates the process has a lower priority and must defer. It does so by sending the fork to the other side, and "cleaning" the fork while it is being passed to the neighbor.

Else if the process is hungry and has some forks to request, the process starts requesting its missing forks. For this, the process needs to iterate through its neighbors via a while loop. It sets Q to be the set of neighbors for which a fork needs to be requested. Then, in each iteration of the loop, a neighbor is picked out from this set, and the request is sent to that neighbor's side. The while loop is completed when all neighbors are covered ---set Q becomes an empty set.

Otherwise, if the process is "Hungry", and satisfies all the conditions to eat, it transitions to the "Eating" state. The conditions are: the process has all the forks in adjacent edges on its side, and each fork is either clean, or the dirty forks have not received requests yet. In other words, a process can eat with some of the forks being dirty, provided that the corresponding neighbors for those dirty forks are not hungry and have not send requests for those forks. Since a process eats only when it has all the forks on its side, we are guaranteed that when a process is eating none of its neighbors can be eating. (Otherwise a fork would be in both sides of the edge, which would violate TypeOkFork predicate.)

If a process is Eating, it can transition to Thinking. For this, the process marks all its forks as dirty, iterating over a while loop.

Finally if a process is Thinking, it can transition to Hungry anytime.

A note about the use of labels

As I alluded to in the previous posts, the labels are not there for cosmetic reasons. They determine the granularity of steps. The TLA+ conversion of the PlusCal code uses the labels. So if you forget to put a label where it is needed, TLA+ will warn you about it. For example, TLA+ dictates that a while loop has a label. This means that each iteration of the while loop can be atomic, but the entire while loop cannot be executed as one atomic step. TLA+ does not allow for updating the same variable twice in one step. So TLA+ complained and required me to add "Hii:" and "Aii" labels.

I found that this corresponds reasonably well to the atomicity distributed shared memory model, where a process can write to one neighbor in one step, but cannot write to multiple neighbors at once.

Bugs 

I found three bugs in my earlier versions of the code through model checking. The first was on the guard of the "Hungry" to "Eating" transition. I had wrote "Edge(self,k)" as a conjunction. After TLA toolkit trace showed me my mistake, I corrected that to "Edge(self,k)=>".

The second bug was much more subtle. When a hungry process was requesting tokens, I had written the request token subtraction outside the if clause. (There was an if clause checking neighbor should be sent the request. And I had the request addition (req:= req \union {<<q,self>>};) inside the if, request subtraction outside the if (Hii: req:= req \ {<<self,q>>};). So the model checker was maliciously timing the other process to send the fork, and then send a request for that fork back. Then when the line "Hii: req:= req \ {<<self,q>>}" is executed the removal of the request token would be removing the genuine request send back from that other process, not the ghost of the request token that the process had sent in the first place. At that point I hadn't written the TypeOK predicates in my model, so I didn't catch this quickly. Instead, I had to figure this subtle bug by inspecting a trace of 61 steps (the smallest number of steps needed to create this bug). The lesson I learned is not to procrastinate about checking for the TypeOK predicates for ensuring sound updating of the data structures.



Finally, the third bug was also very subtle. I had a "doneReq" flag to avoid unnecessary requesting of forks, if the forks were requested before. Again the model checker was being a jerk and finding a rare sequencing of events to surface a rare bug. The trace was 55 steps, repeating after step 43. The process 3 kept eating, while 0, 1, and 2 starved. Processes 0, 1, and 2 are hungry, and since their doneReq flag was true, they didn't send additional requests. They didn't actually send request to process 3 because they already had the dirty token on their side. But then process 3 send them the request and got those tokens. And process 0,1,2 are not sending the request again, because they think they are done with requests. I fixed the problem by removing the doneReq flag but instead checking for the need for requesting. This also sped up model checking, as less state is maintained.

Improving the model checking complexity

This was a small model with 4 processes, but the model checking took about 50 minutes and used about 2GB of storage to exhaustively search the entire plausible execution state space. To get that space back, make sure you manually delete the model checking folder, folder ending with ".toolbox". You don't need to worry about space too much though: if you start a new model check it writes over the previous one.

My initial attempts had took about 4GB of space and progressed slowly, which led me stopping them. To speed up the model checking, I improved the answering of requests for forks by making the GetReqNbrs function more selective by adding the second line to that check. Seeing that it helped speed up model checking, I also wrote a more selective GetForkNbrs function. These reduced the state space, and the execution time. But I still can't try checking with more than 4 processes as that takes a long time.

Concluding remarks

So there we go. Hygienic philosophers algorithm extend the topology of dining philosophers solution to an arbitrary graph, and more importantly, provides "NoStarvation" guarantee even under a weakly-fair scheduler. The algorithm ensures that only one node executes critical section in a neighborhood, and this is useful for coordinating critical action executions in a distributed system. If all nodes execute without coordination, we have maximum parallel semantics which cannot provide serializability (sequential consistency) guarantees. Using dining philosophers for coordination provides serializability guarantees. So, hygienic dining philosophers algorithm find applications in transforming programs written at a shared memory model with serializability assumptions to execute correctly with the serializability guarantees at the message passing model.

Logistics

The PlusCal code for the programs are accessible through the below links. When you want to model check the program, start a new model, and enter 4 for N, so the model checker is ready to start. In the model checker tab, click on Invariant pane and enter ME then TypeOkFork then TypeOkReq, so that these three predicates are checked against any violation, and in the Properties pane enter NoStarvation so that the progress property is checked for satisfaction. If you make a modification to the PlusCal code, make sure you click on File, Translate Algorithm, so the PlusCal code gets translated to the TLA+ code, which is then used for the model-checking. See the TLA+ posts below about how this works.

Here is the final correct version of the model

Here is the version that has bug 3

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book