State-Machine Replication for Planet-Scale Systems (Eurosys 2020)

- June 21, 2020

This paper appeared at Eurosys 2020, and is authored by Vitor Enes, Carlos Baquero, Tuanir França Rezende, Alexey Gotsman, Matthieu Perrin, and Pierre Sutra.

The paper introduces Atlas, a consensus protocol similar to EPaxos, that builds on the Fast Paxos idea. Most people call these protocols leaderless, but I think that is a misnomer. I call these protocols as opportunistic/any-leader protocols. In these protocols, any node can propose a command by becoming an opportunistic leader and using a fast path. If there is no other simultaneously proposed command by another opportunistic leader or if this command commutes with all simultaneously proposed commands, the fast path is enough for achieving consensus. In the presence of concurrent non-commuting commands, the protocol will have to take a slow path, the purpose of which is to establish (teach to the other nodes) the dependencies learned for this command in the fast path attempt.

This opportunistic/any-leader family of protocols do not keep the same log (where the same slot number has the same command committed) at each node. There is no total order imposed in logs at different processes, because if a pair of commands commute with each other, why care about whether they are ordered the same way at different replicas. (We had seen this idea exploited also in the CURP paper, that we recently read.) Instead the requirement is that the commands that conflict are to be ordered the same way with respect to each other satisfying the dependencies at each node.

Ok, what is new in Atlas? Atlas improves on EPaxos in several aspects.

Atlas makes the maximum number of sites that can fail (f) configurable independently of the overall number of sites (n). In EPaxos f= $\lfloor{n/2}\rfloor$ . In Atlas f can be less, opening the opportunity to trade higher f, fault-tolerance, with higher scalability, as we discuss next.
Atlas uses a small fast quorum, FQ= $\lfloor{n/2}\rfloor+f$ , compared to EPaxos fast quorum of 3n/4. Of course EPaxos can tolerate f= $\lfloor{n/2}\rfloor$ with its larger fast quorum, but Atlas considers small f, e.g. f=1 or f=2, arguing that only a couple datacenters may be down at most, and reaps the benefit of this by using smaller FQ.
Atlas shows how we can apply the flexible quorums result to fast Paxos family of protocols. It shows that thanks to flexible quorums, the slow quorum can be made also very small, SQ= $f+1$ , provided that we make the recovery quorums RQ= $n-f$ . This is a reasonable tradeoff, because recovery quorum is exercised rarely, only when faults happen.
Atlas processes a high percentage of accesses in a single round trip, even when these conflict. But this requires advance explanation and optimization, so we will come back to this below.
Atlas provides a simpler recovery round. Recovery is still hard, mind you, but it is an improvement over EPaxos recovery, which should caution you about the trickiness of recovery in opportunistic/any leader protocols.

Side remark. Here is the paper's argument about using small f:

Running a typical protocol over 13 data centers would tolerate 6 of them failing. However, natural disasters leading to the loss of a data center are rare, and planned downtime can be handled by reconfiguring the unavailable site out of the system [15, 31]. Furthermore, temporary data center outages (e.g., due to connectivity issues) typically have a short duration [20], and, as we confirm experimentally in §5, rarely happen concurrently. For this reason, industry practitioners assume that the number of concurrent site failures in a geo-distributed system is low, e.g. 1 or 2.

Here is the video of presentation of this paper in our Zoom Dist-Sys Reading Group. You can also find the first author's (Vitor Enes) presentation video and slides by following this link.

Atlas protocol

Let me put this here in advance for easy referral. In Atlas

the fast quorum size is $\lfloor{n/2}\rfloor+f$ ,
the slow quorum size is $f+1$ , and
the recovery quorum size is $n-f$ .

An opportunistic coordinator of a command can avoid consensus via slow path when it can ensure that any process performing recovery will propose the same set of dependencies to consensus. In Figure 1, this is the case for process 5 coordinating command b. This is reflected in the fast path of Atlas, in which a command is committed after a single round trip to the closest fast quorum (line 16).

In order to take the fast path, previous state machine replication (SMR) protocols, such as Generalized Paxos and EPaxos, required fast-quorum replies to match exactly. One of the key innovations of Atlas is that it is able to take the fast path even if this is not the case, e.g., when conflicting commands are submitted concurrently. Instead the coordinator can still take the fast path if every dependency reported by some fast-quorum process is actually reported by at least f such processes. This is illustrated in Figure 2.

In the slow path, Atlas implements consensus using single-decree (Flexible) Paxos. This is similar to phase-2 in classical Paxos: after the coordinator acquired dependencies in its failed attempt for the fast path, it now needs to teach this to enough number of other nodes. For each identifier ballot numbers is allocated to processes in a round-robin fashion, with ballot i reserved for the initial coordinator i and ballots higher than n for processes that try to take over. The size of the slow quorum is only

$f+1$ , rather than a majority like in classical Paxos. This is compensated for using larger quorums,

$n-f$ in recovery.

Finally, a command is eventually executed only after all its dependencies are in the commit or execute phases. There could be equivalence classes in this directed acyclic dependency graphs, and the paper calls them batches. The execution conditions also need to be tracked by nodes, which make opportunistic/any-leader protocols hard to implement and use.

In summary, we have this as the overview of Atlas versus EPaxos and FPaxos.

Recovery

OK, let's talk recovery. A recovery is triggered when a node suspect failure.

In order to find out if a decision on the dependencies of id has been reached in consensus, the new coordinator first performs an analog of Paxos Phase 1. It picks a ballot number it owns higher than any it participated in so far (line 32) and sends an MRec message with this ballot to all processes.

In the MRecAck handler (line 44), the new coordinator computes its proposal given the information provided by processes and sends this proposal in an MConsensus message to all processes. Line 45 requires that response from recovery quorum of $n-f$ should be received. This guarantees that, if a quorum of

$f+1$ processes accepted an MConsensus message with a proposal (which could have thus been sent in an MCommit message), the new coordinator will find out about this proposal. If indeed some node knows the commit, the coordinator will learn and propose this as the consensus value.

If that is not the case (i.e., if no consensus proposal has been accepted before), the new coordinator checks whether any of the processes that replied has seen the initial MCollect message, by looking for any non-empty fast quorum (line 49). If the fast quorum is known, depending on whether the initial coordinator replied or not, there are two possible cases.

The initial coordinator replies to the new coordinator: This means fast path is definitely not taken before recovery. The new coordinator collects dependencies from n – f nodes and proposes these as the consensus proposal via the slow path.
The initial coordinator does not reply: This means that fast path may have been taken before recovery. If a fast path was taken, that implies that the dependencies were seen by at least f processes. Recall that a fast path has $\lfloor{n/2}\rfloor+f$ processes. Even if f of those processes crash, including the initiator (initiator crash is tolerated here thanks to line 8), the system can still recover the dependencies by taking the union of the dependencies sent in MCollectAck by at least n/2 fast-quorum processes. Otherwise, if the new coordinator finds that fast-path was not taken, it proposes no-op (line 52) on slow path.

Optimizations

The paper lists two optimizations, both of which are very interesting and useful, but hard to follow intuitively. Fortunately proofs are provided in an extended version

The first optimization is for pruning dependencies in the slow path, and it enables avoiding slow path when this condition holds: Commands that been reported by less than f fast-quorum processes ({id | count(id) < f }) can be pruned from dependencies. For example, using this optimization, process 1 can avoid executing a slow path in Figure 1, all the way up in this post.

The second optimization is for allowing reads with a smaller fast quorum. For a read operation, the coordinator selects a plain majority as a fast quorum (line 4), independently of the value of f. Then, at the end of the collect phase, it immediately commits id, setting dep[id] to the union of all dependencies returned by this quorum (line 16). This optimization accelerates the execution of linearizable reads and reduces their impact in the protocol stack.

Evaluation

The evaluation shows that Atlas is up to two times faster than Flexible Paxos with identical failure assumptions, and more than doubles the performance of Egalitarian Paxos in the YCSB benchmark.

What does Mad Danny say?

I asked Mad Danny about his views. This is what he had to say.

I really like the technical advances in the Atlas protocol. It improved over EPaxos by marrying it with Flexible Paxos, and using better handling of dependencies. I also like the optimized reads idea. Atlas provided very solid technical contribution.

But, on the practical side of things, I am not very fond of the wide area network application of Atlas, and EPaxos for that matter. For globe spanning consensus applications, sharded many-leader Paxos deployments, like static sharding in CockroachDB or dynamic sharding in WPaxos, works better to solve the problem. This is due to the following reasons.

The scalability of EPaxos like approaches is limited due to firstly due to conflicts (throughput plummets quickly when conflict rate increases), and secondly because every consensus instance is communicated to every node at least for commit and execution.
Even the fast path of $\lfloor{n/2}\rfloor+f$ spans half the globe in a world-wide deployment. Due to this, the benefit you get from proposing the command from any-leader close to you is not much. Sharded deployments, after brunting the cost of initial communication to the corresponding leader can perform phase-2 from nearby datacenters (not using the half the globe). Moreover sharded many-leader Paxos protocols can solve the initial cost to going to the leader by adopting to the traffic by repositioning the leader, which is done much quickly using a dynamic many-leader sharding solution like WPaxos. The evaluation compares with FPaxos, but it is a single-leader solution, where the single leader becomes a bottleneck both in terms of distance and throughput. With many-leader sharding solutions both problems are addressed and even with a totally random access pattern and no locality-adaptation, they could outperform opportunistic/any-leader approaches, thanks to their nearby consensus quorums. And they are much easier to implement.