Tuesday, September 17, 2019

Modeling a read-write version of Texel: an asynchronous consensus algorithm without rounds

In the previous post I gave a model of atomic Texel, where a node can atomically read all other nodes' decision and update its own decision. Here is a refined version of that, where a node can atomically read the state of *one* other node and update its decision. This refined model shows why it is important for the nodes to read from consistent cuts, and how when multiple nodes are experimenting they can violate this requirement, and Agreement property is violated as a result.

The model

This builds and extends over the previous model. N stands for number of nodes, and F denotes the number of nodes that can crash. We use f to keep track of actual number of nodes that crash. In addition to the *decision* array that tracks the decision of each node, we now have an *exp* array that denotes the experimentation status of each node. Initially each node is in the experimenting state.

Each node starts with t=FALSE (the decision is not finalized), pollSet= Procs \{self} (the node can poll all nodes except self), and tally=<<0,0>> (the number of votes from the polled nodes for "a" and "b" is initally 0 and 0).

Each node has three actions that it can choose from and execute as long as the node has not finalized its decision or crashed.

The first action starts in Line 21. This action is enabled if the node is in experimenting state. It picks (and removes) a node k from its pollSet. If k's decision is "a", it increases the tally for "a" and if k's decision is "b", it increases the tally for "b". After this, if any of these tallies is a supporting decision, i.e., is greater than F (which means it is a majority of N-F nodes), then the node adopts it as its own decision.

Line 32 starts the second action. If f, the actual number of crashes is still less than F, the allowed number of classes, then a process can crash, by setting its decision to crash permanently.

Line 36 starts the third action. If a node finds that its current decision is shared by at least N-F processes, then that decision is "anchored", and the node can finalize its decision by setting its t=TRUE. If no such anchor is in place, and the node is not in experimenting state, the node switches to experimenting state (resetting pollSet and tally). By experimenting again, the node can potentially change its decision to another supporting decision, which may lead to progress and finalization of consensus.

Safety violation

When I model-check this protocol for N=4, F=1, this model violates the Agreement property. Two nodes can finalize their decisions with different values, because they experiment concurrently, and one of them reads from an inconsistent cut. In the trace below node 2 builds its supporting decision on an inconsistent snapshot involving 1, which changes its state after being read by 2.

Here are the steps to the violation of Agreement. Initially the decision array of nodes is "a","b","b","a".
1. Node 1 reads from node 2 the value "b".
2. Node 2 reads from node 1 the value "a". (Note that two nodes are concurrently experimenting and reading state from each other which will get inconsistent soon.)
3. Node 1 reads from node 3 the value "b", and since tally for "b" is more than F=1, node 1 changes its decision to "b", and concludes its experimentation.
4. Node 1 finalizes its decision of "b", because it sees an anchored quorum (cardinality >= N-F) for "b".
5. Node 2 reads node 4 the value "a", and since tally for "a" is more than F=1 (including the now invalid vote from Node 1), node 2 changes its decision to "a", and concludes its experimentation.
6. Node 3 reads from node 2 the value "a".
7. Node 3 reads node 4 the value "a", and since tally for "a" is more than F=1, node 3 changes its decision to "a", and concludes its experimentation.
8. Node 2 finalizes its decision of "a", because it sees an anchored quorum (cardinality >= N-F) for "a". This decision violates Agreement, because Node 1 has finalized its decision to "b", and we have conflicting decisions.

To fix the safety violation, we should disallow concurrent experimentation when it may lead to reading from inconsistent snapshots. This is possible by making the reads preemptive/destructive. (If, instead of using preemptive reads, we try constraining the nodes to read from only non-experimenting nodes, deadlock would happen.) In the above trace, when node 2 reads from node 1, this should have halted node 1's already ongoing experiment. This is easy to achieve by extending/modifying the model above, and when I fixed this problem, I found that safety is always satisfied for N>=3*F+1. (I don't provide that model, because I am considering assigning modeling Texel and Ben-Or as a TLA+ project in my distributed systems class.)

Liveness violation

Liveness is also an interesting story. Even with F=0 and starting state of <<"a","b","b","b">>, we can have a liveness violation. With F=0, reading from one node is enough to change your vote. So the value of "a" may be circulated in the system, since it can keep getting adopted by another minority of processes. The system may not be able to anchor the majority value as the consensus value, and as a result cannot finalize a decision. Side note: When you appoint a leader for consensus (as in Paxos) this vote looping does not become an issue, because the leader will break the symmetry by dictating the value it picks (or a suitable value) to the other nodes for adoption.

In that same setup (with <a,b,b,b>), if I make it F=1, liveness is satisfied, because no node will copy a, as it will need to see another node with a before passing threshold. So, in this case, increasing F did help for liveness. This suggests that maybe we should introduce another free parameter to serve as threshold for value adoption, and not tie that strictly to F the potential number of faults.

By restricting the problem to binary (with only two values) consensus and with proper selection of the threshold for adoption, Texel may solve the problem of having a minority value circulating in the system forever and breaking progress. But even then, we have another progress violation problem. When we introduce experiment cancellation to satisfy the Agreement property, nodes that keep interfering and canceling each others' experiments will violate progress.

This is another place where having a leader for consensus provides advantage. The leader by definition reads from a consistent state, as for that round the other nodes are passive. When you have each node polling for itself, coordinate these distributed transactions to read from clean consistent states becomes very difficult (maybe requires consensus itself).