Efficient Replication via Timestamp Stability (EuroSys 2021)
This paper introduces Tempo, a leaderless Paxos variant for implementing a state machine replication (SMR) protocol. Leaderless means there is no dedicated/stable Paxos leader; instead any node can become a proposer at any time (allowing many concurrent proposers) and get consensus on a proposal opportunistically. In this blog, we have discussed many examples of leaderless and multileader Paxos variants, including Mencius, MDCC, ePaxos, wPaxos, SDPaxos, Hermes, and Atlas. And there are several more that we didn't get to yet. It seems like I've been proven right when I told my PhD students in 2016 that we would be seeing a Paxos renaissance.
On the flip side, you may ask, why do we need another Paxos variant? My selfish answer is that I love Paxos variants. A more practical answer is that this one simplifies/streamlines the leaderless SMR ideas a bit more. Recently we have been seeing a convergence of the leaderless SMR ideas with loosely synchronized clocks. This is bringing us closer to reaping the benefits of the synchronized system model, while still being safe in the face of bouts of full asynchrony in the system. You know where I stand on this. Ensure safety foremost and provably by leveraging Paxos, but don't leave any performance on the table. This paper brings us one step closer to there.
The big idea in Tempo is this. Using a loosely synchronized clock at each node, Tempo timestamps each application command and executes it only after the timestamp becomes stable, i.e., all commands with a lower timestamp are known. Both the timestamping and stability detection mechanisms are decentralized (quorum-based).
Tempo generalizes in a straightforward way to partial replication, where each replica holds a shard of the service state. This way it achieves superior throughput and offers predictable performance even in contended workloads.
Motivation
Most popular SMR protocols, such as Multi-Paxos and Raft, rely on a leader that defines the order in which client commands are executed at the replicas. Unfortunately, this leader is a single point of failure and contention, and a source of higher latency for clients located far from it. A lot of effort has focused on leaderless protocols, which distribute the task of ordering commands among nodes and allow a client to contact the closest node instead of the leader. Compared to centralized solutions, leaderless SMR offers lower average latency, fairer latency distribution with respect to client locations, and higher availability.
(Side remark: of course, leaderless Paxos variants are not the only option here. We have developed many techniques to scale leader-based Paxos, including PigPaxos, wPaxos, and PQR. The Compartmentalized Paxos work shows how these techniques can be used together to resolve bottlenecks at each component. For this post, we will take the leaderless path with Tempo.)
Drawbacks of leaderless protocols. Existing leaderless SMR protocols have drawbacks due to how they order commands. They maintain explicit dependencies between commands: a replica may execute a command only after all of its dependencies get executed. These dependencies may form arbitrarily long chains. In practice, this means that the performance of these protocols is unpredictable, with high tail latency.
What is this duality about ordering and execution? In SMR, a service is defined by a deterministic state machine, and each site maintains its own local replica of the machine. An SMR protocol coordinates the execution of commands at the sites to ensure linearizability. Execution of the commands comes after this ordering of commands in a log. Execution proceeds along the log slots without skipping over slots, at least with respect to the dependencies of a given slot, so unresolved dependencies delay execution of the log. The original ePaxos paper somewhat swept this under the rug; the recent ePaxos Revisited paper discusses it in detail.
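To make the problem concrete, here is a toy sketch (mine, not from any of the papers) of dependency-based execution: a command can only execute after its dependencies have executed, so a single missing/uncommitted command stalls every command that transitively depends on it.

```rust
use std::collections::{HashMap, HashSet};

// Toy model of dependency-based execution (as in EPaxos-style protocols):
// a command executes only after all of its dependencies have executed.
fn executable(deps: &HashMap<&str, Vec<&str>>, executed: &HashSet<&str>, cmd: &str) -> bool {
    deps[cmd].iter().all(|d| executed.contains(d))
}

fn main() {
    // c -> b -> a: a chain of dependencies. Command `a` is not yet committed.
    let deps = HashMap::from([("b", vec!["a"]), ("c", vec!["b"])]);
    let executed: HashSet<&str> = HashSet::new();
    // Neither b nor c can execute until a arrives: the whole chain stalls,
    // which is where the unpredictable tail latency comes from.
    assert!(!executable(&deps, &executed, "b"));
    assert!(!executable(&deps, &executed, "c"));
}
```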
Tempo. Tempo addresses these limitations by leveraging loose time synchronization to assign a scalar timestamp to each command and executing commands in the order of these timestamps. To determine when a command can be executed, each replica waits until the command's timestamp is stable, i.e., all commands with a lower timestamp are known. (As a sneak preview, this is referred to as the seniority idea in the decoupled transactions paper.)
This guarantees progress under a synchronous network. Tempo also achieves low tail latency even in contended workloads, ensuring predictable performance. It delivers higher throughput than prior solutions such as EPaxos and Janus.
Tempo protocol overview
The basic protocol is simple.
To submit a command, a client sends it to the closest node, which acts as its coordinator. The coordinator computes a timestamp for the command by forwarding it to a quorum of replicas, each of which makes a timestamp proposal, and taking the maximum of these proposals.
If enough replicas in the quorum make the same proposal, then the timestamp is decided immediately (fast path).
If not, the coordinator does an additional round trip to the replicas to persist the timestamp (slow path); this may happen when commands are submitted concurrently.
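To make the fast/slow path decision concrete, here is a minimal sketch (my own illustration, not the fantoch code) of how the coordinator could combine the quorum's proposals; the function names and types are assumptions for illustration.

```rust
// Sketch of the coordinator's decision after gathering timestamp proposals
// from the fast quorum. Names and types are illustrative, not from fantoch.
fn coordinator_decision(proposals: &[u64], f: usize) -> (u64, bool) {
    // The command's timestamp is the highest of all quorum proposals.
    let t = *proposals.iter().max().expect("quorum is non-empty");
    // Fast path: the highest proposal was made by at least f quorum members,
    // so it survives the loss of the coordinator plus further failures.
    let fast = proposals.iter().filter(|&&p| p == t).count() >= f;
    (t, fast)
}

fn main() {
    // Three quorum proposals 5, 7, 7 with f = 1: the max 7 was proposed by
    // two processes, so the fast path is taken.
    let (t, fast) = coordinator_decision(&[5, 7, 7], 1);
    assert_eq!((t, fast), (7, true));
    println!("commit t = {t} via {}", if fast { "fast path" } else { "slow path" });
}
```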
Recall that execution comes after SMR ordering via the basic protocol. In Tempo, the execution algorithm is worth considering immediately after the basic protocol. Since we are making timestamps do some heavy lifting, we should make sure safety is not compromised. So here is the execution protocol. To execute a command, a replica needs to determine when its timestamp is stable, i.e., when it knows about all commands with lower timestamps. To check the stability of a timestamp t, each process i tracks the timestamp proposals issued by other processes. Once the Clocks at any majority of the processes pass t, process i can be sure that new commands will get higher timestamps: these are computed as the maximal proposal from at least a majority, and any two majorities intersect.
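Here is a rough sketch of that stability check as I understand it; it only covers the "no new command can get a lower timestamp" half (process i must also have locally committed every command with timestamp at or below t). The data structure and names are my own illustration, not the paper's algorithm.

```rust
use std::collections::HashMap;

// Sketch of stability detection at process i. `highest_seen[j]` is the highest
// timestamp proposal that i has observed from process j (illustrative structure).
fn is_stable(t: u64, highest_seen: &HashMap<u32, u64>, r: usize) -> bool {
    // Once the clocks of a majority of processes have passed t, any new command
    // gets a higher timestamp: its timestamp is the max over at least a majority
    // of proposals, and any two majorities intersect.
    let passed = highest_seen.values().filter(|&&c| c > t).count();
    passed >= r / 2 + 1
}

fn main() {
    let highest_seen = HashMap::from([(1, 9), (2, 12), (3, 11), (4, 7), (5, 13)]);
    // With r = 5, three processes (2, 3, 5) have clocks past 10, so 10 is stable.
    assert!(is_stable(10, &highest_seen, 5));
    assert!(!is_stable(12, &highest_seen, 5));
}
```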
Tempo protocol zoom in
Each partition is replicated at r processes, of which at most f may fail. Following Flexible Paxos, f can be any value between 1 and r-1. This allows using small values of f regardless of the replication factor r, which is appropriate in geo-replication.
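To see why f can be chosen independently of the replication factor, here is a small sanity check (my own arithmetic) using the quorum sizes that come up later in the post: fast quorums of $\lfloor r/2 \rfloor + f$, slow quorums of f+1, and recovery quorums of r-f.

```rust
// Quick sanity check of Tempo's flexible quorum sizes (my own arithmetic,
// using the quorum sizes discussed below in the post).
fn main() {
    for (r, f) in [(3usize, 1usize), (5, 1), (5, 2), (7, 3)] {
        let fast = r / 2 + f;  // fast quorum
        let slow = f + 1;      // slow quorum (Flexible Paxos phase 2)
        let recovery = r - f;  // recovery quorum (Flexible Paxos phase 1)
        // Slow and recovery quorums intersect: (f + 1) + (r - f) = r + 1 > r.
        assert!(slow + recovery > r);
        // A fast quorum contains at least a majority: r/2 + f >= r/2 + 1.
        assert!(fast >= r / 2 + 1);
        println!("r = {r}, f = {f}: fast = {fast}, slow = {slow}, recovery = {recovery}");
    }
}
```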
$I$ is the set of all processes. $I_p$ denotes the set of processes replicating a partition p. $I_c$ is the set of processes that replicate the partitions accessed by a command c.
Here is the entire SMR commit protocol.
Start phase. When a process receives an MSubmit message, it starts serving as the command coordinator (line 5). The coordinator first computes its timestamp proposal for the command as Clock + 1. After computing the proposal, the coordinator sends an MPropose message to the fast quorum $Q[p]$ of $\lfloor r/2 \rfloor + f$ processes (including itself) and an MPayload message to the remaining processes.
Payload phase. Upon receiving an MPayload message (line 9), a process simply saves the command payload in a mapping cmd and sets the command's phase to payload. It also saves Q in a mapping quorums. This is necessary for the recovery mechanism to know the fast quorum used for the command.
Propose phase. Upon receiving an MPropose message (line 12), a fast-quorum process also saves the command payload and fast quorums, but sets its phase to propose. Then the process computes its own timestamp proposal using the function proposal and stores it in a mapping ts. Finally, the process replies to the coordinator with an MProposeAck message, carrying the computed timestamp proposal.
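My reading of the process-side timestamping is essentially "bump your clock past the coordinator's proposal and return the result". The sketch below captures that reading; the exact proposal function is in the paper's Algorithm 1, so treat the names and the precise bumping rule here as assumptions.

```rust
// Illustrative sketch of a fast-quorum process handling MPropose: it bumps
// its local Clock to at least the coordinator's proposal and returns the
// resulting value as its own timestamp proposal.
struct Process {
    clock: u64,
}

impl Process {
    // Compute this process's timestamp proposal for a command, given the
    // coordinator's proposal `t` (a sketch, not the paper's exact function).
    fn proposal(&mut self, t: u64) -> u64 {
        self.clock = self.clock.max(t);
        self.clock
    }
}

fn main() {
    let mut p = Process { clock: 9 };
    // Coordinator proposed Clock + 1 = 6; this process's clock is already at 9,
    // so it proposes 9, and the coordinator will take the max over the quorum.
    assert_eq!(p.proposal(6), 9);
    // A later command with coordinator proposal 12 moves the clock to 12.
    assert_eq!(p.proposal(12), 12);
}
```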
Commit phase. Once the coordinator receives an MProposeAck message from all the processes in the fast quorum (line 17), it computes the command's timestamp as the highest of all timestamp proposals. Then the coordinator decides to either take the fast path (line 20) or the slow path (line 21). Both paths end with the coordinator sending an MCommit message containing the command's timestamp.
Fast path
Since the fast quorum consists of $\lfloor r/2 \rfloor + f$ processes (including the coordinator itself), we are ensured that a committed timestamp is computed over (at least) a majority of processes. The fast path can be taken if the highest proposal t is made by at least f processes: even if the coordinator fails before sending all the MCommit messages, t can be recovered by reading the remaining $\lfloor r/2 \rfloor$ quorum members.
This is not the case for the example in Table 1 b). Here A receives $t_A = 6$, $t_B = 7$, $t_C = 11$ and $t_D = 6$, and computes $t = \max\{6, 7, 11, 6\} = 11$. Since count(11) = 1 < f, the coordinator cannot take the fast path: timestamp 11 was proposed solely by C and would be lost if both this process and the coordinator fail.
The examples in Table 1 c) and d) consider f = 1, and the fast path is taken in both, independently of the timestamps proposed. This is because the fast-path condition trivially holds with f = 1, and thus Tempo with f = 1 always takes the fast path.
Slow path
When the fast-path condition does not hold, the timestamp computed by the coordinator is not yet guaranteed to be persistent: if the coordinator fails before sending all the MCommit messages, a process taking over its job may compute a different timestamp. To maintain Property 1 in this case, the coordinator first reaches an agreement on the computed timestamp with other processes replicating the same partition. This is implemented using single-decree Flexible Paxos.
For each identifier Tempo allocates ballot numbers to processes round-robin, with ballot i reserved for the initial coordinator i and ballots higher than r for processes performing recovery. Every process stores for each identifier id the ballot bal[id] it is currently participating in and the last ballot abal[id] in which it accepted a consensus proposal (if any). When the initial coordinator i decides to go onto the slow path, it performs an analog of Paxos Phase 2: it sends an MConsensus message with its consensus proposal and ballot i to a slow quorum that includes itself. Following Flexible Paxos, the size of the slow quorum is only f+1, rather than a majority like in classical Paxos.
As usual in Paxos, a process accepts an MConsensus message only if its bal[id] is not greater than the ballot in the message (line 27). Then it stores the consensus proposal, sets bal[id] and abal[id] to the ballot in the message, and replies to the coordinator with MConsensusAck. Once the coordinator gathers f + 1 such replies (line 31), it is sure that its consensus proposal will survive the allowed number of failures f, and it thus broadcasts the proposal in an MCommit message.
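A minimal sketch of the accept logic described here, with the per-identifier state kept as in the text (bal, abal, and the accepted timestamp); this is my paraphrase for illustration, not the paper's pseudocode.

```rust
// Sketch of a process handling MConsensus(id, t, ballot) on the slow path.
// State per command id: current ballot (bal), last accepted ballot (abal),
// and the accepted consensus proposal (a timestamp).
struct SlotState {
    bal: u64,
    abal: u64,
    accepted_ts: Option<u64>,
}

impl SlotState {
    // Accept only if we have not promised a higher ballot (Paxos phase 2).
    fn on_mconsensus(&mut self, ballot: u64, t: u64) -> bool {
        if self.bal > ballot {
            return false; // already moved on to a higher ballot; reject
        }
        self.bal = ballot;
        self.abal = ballot;
        self.accepted_ts = Some(t);
        true // reply with MConsensusAck(ballot)
    }
}

fn main() {
    let mut s = SlotState { bal: 0, abal: 0, accepted_ts: None };
    assert!(s.on_mconsensus(3, 11));  // accepted at ballot 3
    assert!(!s.on_mconsensus(2, 9));  // stale ballot, rejected
    assert_eq!(s.accepted_ts, Some(11));
}
```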
Properties
These are the invariants that Tempo satisfies. Proofs are given in the appendix of the paper.
Property 1 (Timestamp agreement). Two processes cannot commit the same command with different timestamps. (But it is OK for two different commands to be committed with the same timestamp.)
Property 2 (Timestamp stability). Consider a command c committed at process i with timestamp t. Process i can only execute c after its timestamp is stable, i.e., every command with a timestamp lower than or equal to t is also committed at i.
Property 3. For any message MCommit(id, t), there is a set of processes Q such that $|Q| \geq \lfloor r/2 \rfloor + 1$ and $t = \max\{t_j : j \in Q\}$, where $t_j$ is the output of the function proposal(id, _) previously called at process $j \in Q$.
Property 4. Any timestamp committed on the fast path can be obtained by selecting the highest proposal sent in MPropose by at least $\lfloor r/2 \rfloor$ fast-quorum processes distinct from the initial coordinator.
Execution protocol at node i
For completeness, this is the execution protocol at node i.
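The actual algorithm listing is in the paper; below is only my rough sketch of the idea, assuming execution picks up committed commands whose timestamps are stable and runs them in timestamp order, with command ids as a deterministic tie-break (the tie-break rule is my assumption here).

```rust
// Rough sketch (my own, not the paper's algorithm) of the execution loop at
// node i: execute committed commands up to the highest stable timestamp, in
// (timestamp, id) order. The id tie-break is an assumption for illustration.
#[derive(Clone)]
struct Committed {
    id: u64,
    ts: u64,
    payload: String,
}

fn execute_stable(committed: &mut Vec<Committed>, stable_up_to: u64) -> Vec<String> {
    // Take every committed command whose timestamp is stable.
    let mut ready: Vec<Committed> =
        committed.iter().filter(|c| c.ts <= stable_up_to).cloned().collect();
    committed.retain(|c| c.ts > stable_up_to);
    // Execute in timestamp order, breaking ties deterministically by id.
    ready.sort_by_key(|c| (c.ts, c.id));
    ready.into_iter().map(|c| c.payload).collect()
}

fn main() {
    let mut committed = vec![
        Committed { id: 2, ts: 5, payload: "w(x)".into() },
        Committed { id: 1, ts: 5, payload: "w(y)".into() },
        Committed { id: 3, ts: 9, payload: "w(z)".into() },
    ];
    // Timestamps up to 5 are stable: execute the two ts = 5 commands in id order.
    assert_eq!(execute_stable(&mut committed, 5), vec!["w(y)", "w(x)"]);
    assert_eq!(committed.len(), 1); // the ts = 9 command waits for stability
}
```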
Now that you have read the protocol summary, you should jump to the YouTube presentation to reinforce your understanding and see Tempo applied to various command-proposal arrival scenarios.
Multi-Partition Protocol
Let's start with a quick primer about partitioned SMRs. This means the service state is split into a set of partitions, each stored at a group of replicas. A client command can access multiple partitions, and the SMR protocol ensures that the system is still linearizable, i.e., behaves as if the commands are executed by a single machine storing a complete service state. This approach allows implementing services that are too big to fit onto a single machine.
Multi-partition commit is achieved by submitting a multi-partition command to each of the partitions it accesses using Algorithm 1.
Once committed with some timestamp at each of these partitions, the command’s final timestamp is computed as the maximum of the committed timestamps.
A command is executed once it is stable at all the partitions it accesses, following the timestamp order.
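A tiny sketch of the combination step just described (my own illustration; the structures and names are assumptions): the final timestamp is the max of the per-partition committed timestamps, and execution waits until that timestamp is stable at every partition the command accesses.

```rust
use std::collections::HashMap;

// Sketch of combining per-partition commits for a multi-partition command.
// `committed_ts[p]` is the timestamp committed at partition p (illustrative).
fn final_timestamp(committed_ts: &HashMap<&str, u64>) -> u64 {
    // The command's final timestamp is the max of the per-partition timestamps.
    *committed_ts.values().max().expect("accesses at least one partition")
}

fn is_executable(committed_ts: &HashMap<&str, u64>, stable_up_to: &HashMap<&str, u64>) -> bool {
    // Executable once the final timestamp is stable at every accessed partition.
    let t = final_timestamp(committed_ts);
    committed_ts.keys().all(|p| stable_up_to[p] >= t)
}

fn main() {
    let committed = HashMap::from([("p1", 8), ("p2", 11)]);
    assert_eq!(final_timestamp(&committed), 11);
    // p1 has only declared timestamps up to 9 stable, so the command must wait.
    let stable = HashMap::from([("p1", 9), ("p2", 12)]);
    assert!(!is_executable(&committed, &stable));
}
```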
Based on the fast-path discussion above, I had an idea to extend the protocol as a low-hanging fruit. The key to fast decisions is to propose a time far enough in the future that each region receiving the proposal can promise it as-is, and hence we get a fast-quorum decision.
A couple of days after I read Tempo and had this idea, I caught the end of a talk on the Cassandra Enhancement Proposal "CEP-15: General Purpose Transactions", and they seem to be doing exactly that.
Recovery
Recovery is the Achilles heel of leaderless protocols. Having a separate protocol for recovery is undesirable. Recovery should be embedded in the normal operation of the protocol, as in Paxos; otherwise it is exercised seldom and becomes a source of bugs.
Compared to the long and often hairy recovery protocols in previous leaderless Paxos solutions, the recovery protocol for Tempo is /relatively/ simple. But unfortunately, the paper does not contain any evaluation with failure recovery. This is understandable, because failure recovery is hard to test and present. But this is also why separate recovery code doesn't get exercised and becomes a source of problems later, especially for leaderless Paxos variants.
For the recovery protocol, I copy the explanation verbatim from the paper.
A process takes over as the coordinator for some command with identifier id by calling function recover(id) at line 72. Only a process with id \in pending can take over as a coordinator (line 73): this ensures that the process knows the command payload and fast quorums. In order to find out if a decision on the timestamp of id has been reached in consensus, the new coordinator first performs an analog of Paxos Phase 1. It picks a ballot number it owns higher than any it participated in so far (line 74) and sends an MRec message with this ballot to all processes.
As is standard in Paxos, a process accepts an MRec message only if the ballot in the message is greater than its bal[id] (line 77). If bal[id] is still 0 (line 78), the process checks the command's phase to decide if it should compute its timestamp proposal for the command. If phase[id] = payload (line 79), the process has not yet computed a timestamp proposal, and thus it does so at line 80. It also sets the command's phase to recover-r, which records that the timestamp proposal was computed in the MRec handler. Otherwise, if phase[id] = propose (line 82), the process has already computed a timestamp proposal at line 15. In this case, the process simply sets the command's phase to recover-p, which records that the timestamp proposal was computed in the MPropose handler. Finally, the process sets bal[id] to the new ballot and replies with an MRecAck message containing the timestamp (ts), the command's phase (phase) and the ballot at which the timestamp was previously accepted in consensus (abal).
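To keep track of the branching in that paragraph, here is my compressed sketch of the MRec handler (the line numbers and full details are in the paper's algorithm; the names and the proposal-bumping rule here are illustrative).

```rust
// Compressed sketch of a process handling MRec(id, ballot) during recovery.
// Phases follow the text: Payload, Propose, RecoverR, RecoverP.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Phase { Payload, Propose, RecoverR, RecoverP }

struct Slot {
    bal: u64,
    abal: u64,
    ts: u64,
    phase: Phase,
    clock: u64, // local clock, used if we still need to make a proposal
}

impl Slot {
    fn on_mrec(&mut self, ballot: u64) -> Option<(u64, Phase, u64)> {
        if ballot <= self.bal {
            return None; // ballot not higher than bal[id]: ignore
        }
        if self.bal == 0 {
            match self.phase {
                Phase::Payload => {
                    // No proposal made yet: make one now, and record that it
                    // was computed in the MRec handler (recover-r).
                    self.clock += 1;
                    self.ts = self.clock;
                    self.phase = Phase::RecoverR;
                }
                Phase::Propose => {
                    // Proposal already made in the MPropose handler (recover-p).
                    self.phase = Phase::RecoverP;
                }
                _ => {}
            }
        }
        self.bal = ballot;
        // Reply MRecAck(ts, phase, abal) to the new coordinator.
        Some((self.ts, self.phase, self.abal))
    }
}

fn main() {
    let mut s = Slot { bal: 0, abal: 0, ts: 7, phase: Phase::Propose, clock: 7 };
    assert_eq!(s.on_mrec(6), Some((7, Phase::RecoverP, 0)));
    assert_eq!(s.on_mrec(5), None); // lower ballot rejected
}
```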
In the MRecAck handler (line 86), the new coordinator computes the command's timestamp given the information in the MRecAck messages and sends it in an MConsensus message to all processes. As in Flexible Paxos, the new coordinator waits for r - f such messages. This guarantees that, if a quorum of f + 1 processes accepted an MConsensus message with a timestamp (which could have thus been sent in an MCommit message), the new coordinator will find out about this timestamp.
If no consensus proposal has been accepted before, the new coordinator first computes at line 92 the set of processes I that belong both to the recovery quorum Q and the fast quorum quorums[id][p]. Then, depending on whether the initial coordinator replied and in which handler the processes in I have computed their timestamp proposal, there are two possible cases.
- The initial coordinator replies or some process in I has computed its timestamp proposal in the MRec handler (s = true, line 93).
- The initial coordinator does not reply and all processes in I have computed their timestamp proposal in the MPropose handler (s = false, line 93).
Evaluation
To improve the fairness of the comparison, all protocols are implemented in the same framework, which consists of 33K lines of Rust and contains common functionality necessary to implement and evaluate the protocols: a networking layer, an in-memory key-value store, dstat monitoring, and a set of benchmarks (e.g., YCSB). The source code is openly accessible at https://github.com/vitorenesduarte/fantoch.
The paper evaluates Tempo in three environments: a simulator, a controlled cluster environment, and a multi-region deployment on Amazon EC2. They show that Tempo improves throughput over existing SMR protocols by 1.8-5.1x, while lowering tail latency with respect to prior leaderless protocols by an order of magnitude. This advantage is maintained in partial replication, where Tempo outperforms Janus by 1.2-16x.
Notice that the latency axis is given in log scale in the graphs. This highlights how much improvement the timestamp-stability approach provides.
While batching can boost leader-based SMR protocols, its benefits are limited for leaderless ones. However, because leaderless protocols already efficiently balance resource usage across replicas, they can match or even outperform leader-based protocols, as seen in Figure 8.
For evaluating partial replication deployment, they compare Tempo with Janus using the YCSB+T benchmark. They define a shard as a set of several partitions co-located in the same machine, with each partition containing a single YCSB key. Each shard holds 1M keys and is replicated at 3 sites (Ireland, N. California and Singapore) emulated in the cluster. Clients submit commands that access two keys picked at random following the YCSB access pattern (a zipfian distribution).
Tempo provides nearly the same throughput as the best-case scenario for Janus (w = 0%). Moreover, its performance is virtually unaffected by the increased contention.
Discussion
There were some parts I didn't understand in the paper, and some parts where I wondered about extension opportunities.
The paper says the clocks are based on local physical clocks. But physical clocks have microsecond (or finer) granularity, so the operation Clock := Clock + 1 doesn't make sense to me. I believe Tempo imposes fixed blocks/intervals (a quantization) on the clocks.
UPDATE (2/16/22): Turns out Tempo was using logical clocks, not physical clocks. I read the whole paper assuming synchronized physical clocks, and didn't even suspect it could be just logical clocks. But the CEP-15 paper shows this is extendable to using loosely synchronized physical clocks.
Does Tempo always need to recover proposals? If you don't do recovery and simply abort the proposal without more work, does anything get broken, do you lose progress? If a slot in the SMR is not recovered but left empty, and time has moved past it, can we just carry on in some cases? For one-shot updates that overwrite the value of keys, do we have a chance to pull this off?
Can Tempo be relaxed to use snapshot isolation semantics, so it can always make progress at the rate of real-time progression, unaffected by failures and the need to recover slots (aborting some commands as noops in the meanwhile, if needed)?
I will check the "CEP-15: General Purpose Transactions" to see if they have some ideas that could be useful here.