Monday, June 10, 2019

Is this consensus?

The specification for consensus is as follows. The first two are safety properties, the last one a liveness property.
  • Agreement: No two node can commit different decisions.
  • Validity (Non-triviality): If all initial values are same, nodes must commit that value.
  • Termination: Nodes commit eventually.
Below I talk about whether ABD or chain replication solve consensus, and whether it would be possible to implement state machine replication using them.

Does ABD solve consensus?



No, ABD does not solve consensus. I had written a summary of the ABD protocol in a 2012 post. And I had talked about why ABD is not consensus in a 2015 post. Below is a short recap of that followed by a discussion of whether ABD can still be employed to solve the state machine replication problem.

Consensus is prone to the FLP impossibility result, and it may lose progress under FLP conditions. In particular, for Paxos, if we can't determine whether the incumbent leader has failed in an asynchronous environment, then we may run into the dueling leader problem, which may continue until failure detection and consequently leader election stabilizes.

In contrast, ABD is not affected by the FLP result. This is because ABD is memoryless and hedonistic. ABD is happy with unresolved, partial acceptances in the past. Heck, it will completely overwrite a value that is accepted by all nodes if another write comes with a higher timestamp.

In comparison, Paxos is obsessed with the past. For each consensus instance, Paxos clings onto the value a node has seen (with the highest ballot). In the first and second phase of write in ABD there is no rejection/restart of the operation. In Paxos, both in the first phase and second phase, a leader/candidate may receive a rejection and goes back to retry phase 1.

Since ABD is memoryless and hedonistic, we can not implement replicated state machines (RSMs) with ABDs. But let's push on this a bit. So far, we were treating timestamps as ballots within the same instance. What if we treat the timestamps in ABD as slots in multiple instances of accepting values? By using "timestamp = counter + leaderID", we can implement total order on slots and have linearizability and strong consistent reads with multiple clients.

Can we then implement RSM over ABD participant nodes using these slot numbers? No! Because ABD skips past some slot numbers as unresolved since it is always eager to move ahead in a hedonistic manner. ABD goes with the highest timestamp of majority read, and it is not possible to go back to earlier slots and resolve them in case of ambiguity. In ABD, the nodes accept independently, and there is no commit/resolution phase, and the logs in the nodes diverge. We can't make a resolution about a slot's outcome even in God-view mode, where we can look at the values in a majority of the nodes (not all nodes, because up to a minority of nodes may be unavailable per fault-model). Let's say we see a value at node A, and no value at the other nodes constituting majority for this slot. The ambiguity that remains is as follows. Maybe node A was part of the majority that had this value and the other nodes are not reachable now. Or maybe A was the only node that had received this value. We cannot determine the difference. If we go with the first possibility, it is possible that, another God-view to a different majority may find that indeed the opposite was the case.

In contrast,  Paxos has a commit phase that marks that the value for that slot is resolved and finalized. Any commit (even read from one node's committed value) is a valid commit because the leader has observed that majority has accepted/stored it. So it is guaranteed that other nodes will also know (or learn) that commit. So Paxos relearns and does not leave any gap in the slot numbers while committing, because those slots numbers get executed as they are committed.

As the closing word on ABD, we should note that ABD is still useful for storage and linearizability, it solves the atomic storage problem. Here comes the difference between stateless operations (register operations put and get) versus stateful operations (commands in general that mutate state, which by definition depends on the state they are invoked/executed). For storage, we don't need stateful operations. Using ABD we achieve linearizability, and can serve strong-consistency reads via using ABD even with multiple clients.

Does chain replication solve consensus?



I had written up a description of the chain replication protocol in a 2011 post. Chain replication employs a Paxos-backed configuration box that maintains the configuration/topology of the chain nodes, and the chain nodes just replicate the values in a streamlined fashion. The beauty here is that Paxos is kept out of the data path, so it is not involved with any replication request. Paxos is employed in the control path, and is consulted only when a fault happens.

Does chain replication solve consensus? I haven't seen this question addressed in previous work neither in the original chain replication work, nor in any of the followup work. The answer is, no, chain replication does not solve the consensus problem! This is a trickier point to appreciate than the ABD case.

Chain replication does not violate agreement/safety property: for a given instance, no two nodes will have different commits because they copy the head of the chain. But chain replication will violate progress for the consensus instance in that slot if the chain topology changes. Let's say only the head and another node committed a value and they died or get disconnected, and as a result the chain topology is reconfigured by the config-box. No other node can commit another value for that instance because the epoch-number has changed with the configuration decision from the config-box. This is both good news and bad news. Safety is not violated but we lost progress/termination for that slot: the remaining nodes are not able to resurrect and resolve this particular consensus instance to termination. So although chain replication solves consensus in the absence of failures, in the presence of failures it deserts the consensus instance without culminating it to resolution and moves on. After the config-box appoints a new chain topology, the progress and safety are both satisfied for the next consensus instance (with the incremented epoch number).

To recap, chain replication gets things resolved/finalized and keeps the same log in the absence of faults, but in the presence of faults, the logs in participants may diverge. Consider a node that accepts a value, and then due to failure and chain reconfiguration it has been pushed out of the chain. How does that node learn whether what it has accepted before it crushed is skipped over or finalized? There is no commit in chain replication (of course the ack-backpropagation in the CRAQ optimization may work as commit)... Even with the plain chain replication, we can argue that, that node is now an incorrect node as marked by the config-box, so we don't care about its consistency. And if that node joins the chain again, it will join as tail, learn the same log as the other nodes. From that perspective, and by seeding off of the Paxos-backed config-box, we can argue that RSM can be implemented over chain replication.

Acknowledgment

The way to understand these things is by sparring with colleagues. I am grateful for Ailidani Ailijiang and Aleksey Charapko for the discussion. It is not easy to reason about distributed systems --but is certainly rewarding after the fact. It took us two or three animated discussion sessions over coffee to get to the bottom of this.

No comments:

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...