Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks

Nezha (VLDB'23) is a consensus protocol that leverages synchronized clocks to decrease latency and increase throughput. There is also a GitHub repo with the implementation of Nezha and a TLA+ model of the protocol.

Nezha's approach is to offload the traditional leader- or sequencer-based ordering to synchronized clocks, achieving decentralized coordination without relying on network routers or sequencers. Here, time synchronization is leveraged on a best-effort basis, with no impact on correctness. You guessed it right: there is a fast path where the best-effort message ordering works, and the client waits for a super-majority quorum of replies ordered consistently. And then there is a slow path that covers the case where that fails.

The evaluation suggests that Nezha outperforms previous protocols significantly, including an order of magnitude improvement in throughput. But the evaluations are performed under ideal conditions and overlook metastability effects. While time synchronization doesn't affect correctness, it does lead to some modality: one misordered entry ruins/invalidates the well-ordered ones following it. That means that once the slow path is hit, the system may stay stuck in the slow path for the following requests, as it may not get enough slack to recover back to fast-path mode.


Nezha's contribution

This paper follows the research trend of utilizing time synchronization to enhance consensus performance. This is a timely trend. We now have very precise time synchronization available in datacenters, thanks to advances in atomic clocks and better time sync infrastructure, so it is no longer unreasonable to assume good time sync there. We had talked about Tempo and Accord and their use of time. Nezha seems more practical and more immediately applicable.

Nezha makes a couple of simplifications that reduce the big modality gap between the fast and slow paths. In contrast to traditional Fast Paxos protocols, in Nezha there is always a dedicated stable leader, which is required to be included in both fast and slow quorums. I love the simplification this stable leader brings. Each replica follows the log of the leader, rather than trying to piece together logs across multiple leaderless nodes itself. The speculative execution at the leader is a bonus. Watch out for these points in the rest of the summary.

Keep in mind that the modality gap is still not entirely closed, as I complained above about the lack of evaluation of metastability. But this is an improvement, and I am happy to take it.


The reordering problem and quantifying it

Ensuring the same consistent order across all receivers is a significant challenge in distributed systems. While TCP maintains order consistency for a single receiver, this cannot be used to guarantee the same message order across different receivers.

Reordering across receivers occurs due to the different paths/routers that messages take to the receivers. To quantify the extent of the reordering problem, the paper introduces the reordering score.

For two receivers, R1 and R2, receiving multicast messages from various senders, we take the sequence of messages received by R1 as the reference. Each message gets a sequence number corresponding to its order of arrival at R1. Using these sequence numbers, the reordering score measures the extent of reordering at R2. This score leverages the length of the longest increasing subsequence (LIS) of R2's sequence.
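To make this concrete, here is a minimal sketch in Python of how such a score could be computed. I am assuming the score is normalized as the fraction of R2's messages that fall outside the LIS (so 0 means perfectly ordered); the paper's exact formula may differ in detail.

```python
import bisect

def reordering_score(r2_seq):
    """Reordering score for receiver R2.

    r2_seq: sequence numbers (assigned by arrival order at the
    reference receiver R1) in the order they arrived at R2. Messages
    in the longest increasing subsequence (LIS) are "in order"
    relative to R1; everything else counts as reordered.
    """
    if not r2_seq:
        return 0.0
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k+1
    for s in r2_seq:
        i = bisect.bisect_left(tails, s)
        if i == len(tails):
            tails.append(s)
        else:
            tails[i] = s
    return 1 - len(tails) / len(r2_seq)

# Example: R2 received messages 1,3,2,4 (in R1's numbering).
# The LIS has length 3 (e.g., 1,2,4), so 1 of 4 messages is reordered.
print(reordering_score([1, 3, 2, 4]))  # 0.25
```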

Deadline-Ordered Multicast (DOM)

Nezha uses the DOM primitive for best-effort consistent message ordering across receivers. DOM assigns a deadline timestamp (in global time, using synchronized clocks) to each request, and delivers a request only after its deadline has passed, in deadline-timestamp order. The intuition here is that the deadline acts as a buffer: by holding a message m' until its deadline (instead of delivering it immediately), we get a chance to receive any earlier message m that m' may have overtaken on the way to this receiver, and so we are able to deliver them in the right order, m followed by m'.

DOM is a best-effort primitive: a sequence of messages is processed in order at a receiver if they all arrive before their deadlines, but DOM does not guarantee that messages arrive reliably at all receivers either before the deadline or at all.

Figure 3 shows different percentiles (i.e., 50th, 75th, 90th, and 95th) for DOM to use in deciding its deadlines. A higher percentile means less reordering. However, a higher percentile also means a longer holding delay for messages in DOM, which in turn eats into Nezha's latency savings. The paper uses the 50th percentile in Nezha to strike a balance.
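Here is a minimal sketch of the DOM mechanics as I understand them, with hypothetical class and field names; the real implementation differs (in particular, time.time() below stands in for a Huygens-style synchronized clock, and the deadline is send time plus a percentile of measured one-way delay).

```python
import heapq
import itertools
import time
from dataclasses import dataclass

@dataclass
class Msg:
    payload: str
    deadline: float = 0.0

class DomSender:
    """Stamps each outgoing multicast message with a deadline:
    send time plus a percentile of the measured one-way delay
    (Nezha picks the 50th percentile to balance reordering
    against holding delay)."""
    def __init__(self, owd_samples, percentile=0.50):
        self.owd = sorted(owd_samples)  # one-way delay measurements, in seconds
        self.percentile = percentile

    def stamp(self, msg):
        est = self.owd[int(self.percentile * (len(self.owd) - 1))]
        # time.time() stands in for a synchronized global clock here.
        msg.deadline = time.time() + est
        return msg

class DomReceiver:
    """Holds each on-time message until its deadline, then releases
    held messages in deadline-timestamp order. Best effort: a message
    that arrives after its deadline is flagged as late instead."""
    def __init__(self):
        self._held = []                  # min-heap keyed on (deadline, seqno)
        self._tiebreak = itertools.count()

    def on_arrival(self, msg):
        if time.time() > msg.deadline:
            return [(msg, "late")]       # missed deadline: ordering not guaranteed
        heapq.heappush(self._held, (msg.deadline, next(self._tiebreak), msg))
        return []

    def release_due(self):
        """Deliver every held message whose deadline has now passed."""
        out, now = [], time.time()
        while self._held and self._held[0][0] <= now:
            out.append((heapq.heappop(self._held)[2], "on-time"))
        return out
```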


Fast path

Although the figure shows a single proxy, this is a multiple-proxy system. We get scaling because each proxy has access to synchronized clocks, so each can send its DOM requests using its local clock, without needing to communicate with the other proxies. The time synchronization ensures that the requests the proxies send have an agreed-upon order, and that the replicas will deliver them in that order.

If the time synchronization and message delivery work well (that is, if messages are delivered before their deadlines at the leader and at a fast-quorum number of replicas), then we have 1-RTT consensus.

The leader is not an inbound I/O bottleneck because it doesn't need to aggregate messages from the replicas. The leader and each replica do the same work: 1 receive, 1 send. Time synchronization takes care of ordering messages at the replicas without a sequencer. The proxy collects the replies from the replicas and checks whether the fast quorum is achieved. The fast quorum is achieved if the leader's reply is received alongside another f+f/2 replica responses (out of 2f+1 total nodes, of which at most f can fail) indicating that they delivered the same message id as the leader did. This is checked through hashing, and the paper also has a nice optimization in Section 7.1 for doing this incrementally (reducing the check to set equivalence rather than list equivalence, because delivery at each replica is done in timestamp order).
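A sketch of the proxy-side fast-quorum check under these rules follows. The log_hash field and function names are my own, and I read f+f/2 as f plus the ceiling of f/2.

```python
import math

def fast_quorum_ok(leader_reply, replica_replies, f):
    """Fast-path commit check at the proxy (a sketch, not the paper's code).

    Commit on the fast path if the leader replied and f + ceil(f/2)
    replica replies carry the same log-state hash as the leader's
    (2f+1 nodes total, of which at most f can fail). An order-
    insensitive incremental digest (e.g., XOR of per-message hashes)
    suffices here, because each replica delivers in timestamp order,
    so set equality implies list equality (cf. Section 7.1).
    """
    if leader_reply is None:
        return False
    needed = f + math.ceil(f / 2)
    matching = sum(1 for r in replica_replies
                   if r.log_hash == leader_reply.log_hash)
    return matching >= needed
```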

Only the leader executes the request; the replicas just respond saying that they delivered the same message, and are ready to serve as fault-tolerance replicas. The replicas do not execute the request. Well, at least not immediately. It is ok for replicas to execute requests later, after they confirm the leader's order.

The execution at the leader is said to be speculative, because the leader doesn't know yet whether the replicas also delivered this message in order. But from one perspective, this is not very speculative: the leader knows that its ordering will take effect (unless, of course, it is dethroned before this ordering is replicated to a majority of replicas).


Slow path


If the fast-path condition fails (that is, if the super-majority quorum does not have the same value as the leader), the slow path picks up the slack.

This is more asynchronous in nature than the fast-path execution. The replicas stream the log from the leader, and they reply back to the proxy. For this, only f+1 replies (one of which must be from the leader) are sufficient.
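The corresponding slow-path check at the proxy is simpler; again a sketch with hypothetical field names, in the same style as the fast-path check above.

```python
def slow_quorum_ok(replies, f):
    """Slow-path commit check at the proxy (sketch).

    f+1 replies suffice, provided one of them is the leader's;
    the replicas here are echoing the log order streamed from
    the leader, so no further hash agreement is needed.
    """
    return len(replies) >= f + 1 and any(r.from_leader for r in replies)
```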


Discussion about the protocol

Nezha pays some extra latency by requiring that the leader's message also be included in the super-majority quorums of the fast path. But this helps bridge the gap between fast and slow paths. Having a dedicated leader, and insisting on following the order of that leader, reduces the pain of dropping to the slow path and the pain of recovering proposals. Recovery of proposals becomes just recovery of the leader, which is a better-known problem and a more exercised code path. I like this continuity from the fast path to the slow path, thanks to both relying on the same stable leader. This is not as big a modality jump as in leaderless protocols like Tempo and Accord.

But, still, there is a modality drop. I don't like that one misordered entry ruins the delivery of the well-ordered ones following it. After going to the slow path once, it would take some time for the system to recover and get back to fast-path replies. This is because one misordered entry handled via the slow path invalidates the order of the fast-path-delivered entries following it, as we lost the order in the prefix of the log. Those entries would also likely need to follow the slow path, if the problem is not resolved before their deadlines expire.

There is a metastability risk here: this recovery may never happen if the system keeps playing catch-up and gets overwhelmed with the busy work needed to get back to fast-path deliveries. I think commutative operations may help cut Nezha some slack, but I don't think there are any guarantees there. I am optimistic, though, because the slow path does not create extra looped-in traffic and does not load the system beyond normal. So catch-up is likely if there are enough idle periods in between requests. Unfortunately, the paper does not have experiments on this in the evaluation section.

I think for faster recovery, we can take a cue from the metastability paper. It may be best to shed load fast upon falling to the slow path, recover, and only then accept load again; otherwise the system can get stuck grinding on the slow path. Once the replicas detect that the slow path is in use and the system is grinding, they can inform the proxies to initiate some load shedding.

Nezha does not prescribe anything new for leader change and reconfiguration (adding/removing nodes in the node set). They default to existing techniques here, which I think is a good thing.

Finally, there are some subtle differences in the leader role from vanilla MultiPaxos. Having the same log entry at a quorum of replicas does not guarantee anchoring that entry for commit. Nezha is very leader-centric: for an entry to be anchored for commit, the leader must also have appended the corresponding entry to its log at the same position. Even if all replicas share the same entry at log position K, if that entry differs from the leader's entry at position K, the leader can revert those log entries at the replicas in the slow path. This may happen because the replicas delivered the message via DOM before its deadline, whereas the message reached the leader after its deadline expired, so the leader holds it in a different order.
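Here is how I would state that commit rule as a sketch. The names are hypothetical, and the "leader plus f replicas" threshold is my reading of the f+1 slow-path quorum.

```python
def anchored_for_commit(leader_log, replica_logs, k, f):
    """Is the entry at log position k anchored for commit? (sketch)

    Leader-centric rule: the entry anchors only if the leader has
    appended it at position k AND at least f replicas agree with the
    leader there (leader + f = majority of 2f+1). Replicas that
    disagree with the leader at k get reverted in the slow path,
    even if they all agree with each other.
    """
    if k >= len(leader_log):
        return False  # the leader never appended here; nothing anchors
    entry = leader_log[k]
    agreeing = sum(1 for log in replica_logs
                   if k < len(log) and log[k] == entry)
    return agreeing >= f
```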


Discussion about time synchronization and applications of DOM

The authors had proposed a very nice protocol in 2018, Huygens, for tight clock synchronization without requiring dedicated networking support. I had also liked the practicality of the approach used in Huygens. For a description of Huygens, see the second session heading in my NSDI'18 post. (For a broader discussion of time synchronization, see this post.)

Nezha builds on that line of work. The paper mentions that, with its use of proxies, Nezha can serve as a drop-in replacement for Raft/Multi-Paxos and for metadata stores like etcd and ZooKeeper. They also list a fair-access financial exchange system for the cloud as an application.

Nezha showcased DOM in use with single-conflict-domain consensus. I am excited about the applications of DOM to other consensus deployments and Paxos variants. There are many opportunities here.

This is an exciting time. We are seeing microsecond-accuracy time sync become available in the cloud, and database products like Aurora Limitless making use of it to improve the performance of distributed transaction processing across shards. (Of course, we do not forget Spanner's original introduction of TrueTime in distributed transaction processing.) We are likely to see more adoption of tight clock synchronization for distributed systems in the coming years.


Implementation and evaluation

Nezha outperforms the four baselines (Multi-Paxos, Fast Paxos, NOPaxos, Raft) by 1.9–20.9x in throughput and by 1.3–4.0x in latency. The GitHub repo for the code is available at https://github.com/Steamgjk/Nezha.







