Paper review. A generalized solution to distributed consensus

This paper (by Heidi Howard and Richard Mortier) proposes a framework for consensus that uses immutable state to simplify the exposition. They show that both Paxos and Fast Paxos are certain instantiations of this general consensus framework. Finally, they outline some new instances of the consensus framework which provide nice benefits in certain setups.

I should note and caution you that this paper considers single instance/decree consensus rather than multiple instances/slots back-to-back consensus. So if you are interested in the latter, there are gaps to fill before you can implement these algorithms to solve multi-decree consensus and maintain RSMs. Moreover while immutability via the write-once registers is great for exposition, extra work needs to be done for achieving implementation efficiency of this abstraction.

The consensus problem

An algorithm solves consensus if it satisfies these requirements:
  • Non-triviality. All output values must have been the input value of a client.
  • Agreement. All clients that output a value must output the same value.
  • Progress. All clients must eventually output a value if the system is reliable and synchronous for a sufficient period.
If we have only one server, the solution is straightforward. The server has a single persistent write-once register, R0, to store the decided value. Clients send requests to the server with their input value. If R0 is unwritten, the value received is written to R0 and is returned to the client. If R0 is already written, then the value in R0 is read and returned to the client. The client then outputs the returned value. This algorithm achieves consensus but requires the server to be available for clients to terminate. To overcome this limitation requires deployment of more than one server, so we now consider how to generalise to multiple servers.
And holy moly! By considering more than one node, we open Pandora's box! It is very difficult to provide distributed consensus in a fault-tolerant manner to the face of bouts of asynchrony, node failures, and message losses.

The reason Paxos is so popular is because Paxos and its variants have excellent fault-tolerant properties that cover all possible corner cases. Paxos preserves consistency even when the system is asynchronous, all timing assumptions are wrong, all failure detectors are wrong, processes crash in an undetected ways, and messages can be lost arbitrarily, etc. The beauty of Paxos is that safety (agreement) is preserved under any condition, and only the progress condition depends on the availability of weakest failure detector (which eventually stabilizes to ensure that the processes do not suspect a single up process as failed). In other words, whenever the setup is out of the impossibility realm of FLP and attacking general results, progress is guaranteed. And even when the setup has not moved beyond that point, Paxos can still provide progress probabilistically.

Generalized consensus framework

Each node (i.e., server) has an infinite series of write-once, persistent registers, {R0, R1, ...}. Clients read and write registers on servers and, at any time, each register is in one of the three states:
  • unwritten, the starting state for all registers
  • contains a value,
  • contains nil.
A register set, i, is the set comprised of the register Ri from each server. Each register set i can be pre-configured (hardwired) with a set of quorums.

Figure 3 describes the generalized consensus framework by giving four rules governing how clients interact with registers.

I like rule 1: quorum agreement. According to this rule,  the clients/proposers need not be competing, they can in fact be collaborating and strengthening each other if they write the same values in the same register set across nodes. This is different than Lamport's formulation. Lamport also gives similar rules for an abstract Vote algorithm as an intermediate step to deriving the Paxos algorithm. But by divorcing the register-sets abstraction from ballots, this framework achieves more generality.

A register-set corresponds roughly to a round or view in the Paxos algorithm. The framework still allows client-restricted register-sets, where a register-set is allocated for use of only a certain client where only one proposal is voted on. These client-restricted register sets can be implemented using ballot-numbers. But it is also possible to have non client-restricted register-sets for which there is no need for clients to use different ballot-numbers. Multiple clients can then write concurrently to the same register-set, and it may even possible to consider them as cooperating if they write the same value.

From its local decision table, each client can track whether decisions have been reached or could be reached by previous quorums. At any given time, each quorum is in one of four decision states:
  • Any: Any value could be decided by this quorum.
  • Maybe v: If this quorum reaches a decision, then value v will be decided.
  • Decided v: The value v has been decided by this quorum; a final state.
  • None: This quorum will not decide a value; a final state.

Figure 5 describes how clients can use decision tables to implement the four rules for correctness. If a client reads a (non-nil) value v from the register r on server s, it learns that:
  • If r is client restricted then all quorums in r must decide v if they reach a decision (Rule 3)
  • If any quorum of register sets 0 to r − 1 reaches a decision then value v is decided (Rule 4)

Instantiating to Paxos

The paper shows that the framework can be instantiated to produce the single-decree Paxos algorithm. All the register-sets are client-restricted and all quorums in a register-set are assigned to be majority quorums. Note that this Paxos instance Figure 8 satisfies the four rules in the general consensus framework introduced, so it is an instance of that framework. On the other hand, the algorithm is still not straightforward even after understanding the general framework. There is still a leap to be made from the framework to the rules to the single-decree Paxos algorithm. Why two phases? What should be accomplished in Phase-1? How does Phase-2 relate and build on Phase-1? Similar issues were also present  going from the Vote algorithm to Paxos in Lamport's exposition of Paxos. As I have been telling students, Paxos looks deceptively simple, but there is a lot of depth and levels to it, and you should consider other ways to make this work (and mostly fail) to appreciate why the algorithm is that way.

Again, remember that this is only single instance/decree Paxos, and the paper does not provide a MultiPaxos instantiation. For this single decree Paxos the clients may be competing. Two clients may be sending their proposals to different replicas in the register-sets assigned to them via their unique ballot-numbers. If you consider the client colocated with one of the replicas  then this setup maps better to the popular descriptions of Paxos with a leader that goes through Phase-1 and Phase-2.

A new consensus algorithm

In Section 6, the paper gives example of a new consensus algorithm that has some advantages for certain environments.
Co-located consensus. Consider a configuration which uses a quorum containing all servers for the first k register sets and majority quorums afterwards, as shown in Figure 12b. All register sets are client restricted. Participants in a system may be deciding a value between themselves, and so a server and client are co-located on each participant. A client can therefore either achieve consensus in one round trip to all servers (if all are available) or two round trips to any majority (in case a server has failed).
Ok, what does this mean?

All the k clients are colocated with the k replicas/servers. If only one client/replica has something to propose that client/replica can write its value in the register set allocated to it at all other nodes in an uncontested manner and succeed in one round.

If multiple client/replicas has a value to propose, there is still chance that the highest id client/replica x succeed in achieving consensus in one round if all replicas are available. When x writes to its client-restricted register set, it writes nil to its register for all previous rounds and thus invalidates consensus attempts of all the replica/clients with smaller id register sets making their decision=NONE. This satisfies the 4 rules in Figure 3 as well as the client decision table rules in Figure 5.

Another way to understand this, in a manner closer to a message-passing implementation, is to consider the flexible quorums result applied to Paxos protocol. Here all the replicas are competing with ballot-numbers (all register sets are client-restricted), and phase 1 quorum is just one node, which is the replica/proposer itself. So the phase 1 is skipped since it does not require consent from another node. In order for phase 2 to intersect phase 1, the phase 2 quorum is the set of all replicas. And in this set up, if multiple replicas have something to propose, the replica with the highest id can achieve consensus in one phase (that is phase 2) if all replicas are available.

If a replica is crushed or unresponsive, the algorithm still offers fault-tolerance, because after their first try with flexible quorum, the replicas switch to majority phase-1 & phase-2 quorums and achieve consensus if a majority of nodes are available.

MAD questions

1. Is this just a different presentation of the same old stuff or is there more here?

Yes, there is more here than just re-packaging consensus with immutable write-once registers.

As I wrote above, I really liked how this enables the nodes to collaborate in non client-restricted register sets. When the register set is non client-restricted, the clients do not compete with ballot-numbers and multiple clients writing the same value can in fact strengthen that value and improve consensus.

I also liked that it is possible to pre-assign different quorums to different rounds. Many different combinations can be possible this way. Also, as I mentioned, the algorithms in the last section are interesting.

These being said, even if you didn't get anything new here and labeled this as just a different packaging/presentation of the Paxos consensus idea, I would argue this is still worth reading. Because by learning about the same topic from a different perspective, you will strengthen your understanding of the topic.


Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The end of a myth: Distributed transactions can scale

There is plenty of room at the bottom

Distributed Transactions at Scale in Amazon DynamoDB

Dude, where's my Emacs?