PigPaxos: Devouring the communication bottlenecks in distributed consensus

This is our most recent work, started and led by Aleksey Charapko. (This is a joint post with him.) You can get the paper at arxiv.org. The paper is currently under submission to a journal.

The story

One day I challenged Aleksey to give me a ballpark number on how much he thinks we can scale Paxos vertically. While sharding --as in CockroachDB and Spanner-- helps for scaling Paxos deployments horizontally, vertical scaling is about how many nodes you can cram in a single Paxos cluster, with a single conflict domain.

Aleksey, who is not known for being an optimist, said that we can scale Paxos to several hundreds of nodes! He said this may be possible by employing intermediate proxy nodes to relay the communication between the leader and followers, as this would relieve the communication bottleneck at the leader.

I thought "yeah, it is a neat trick, but maybe not that impressive, because it is very simple". Surely others must have tried this, and there must be a catch/drawback. We checked but couldn't find any previous work on this. At the Sigmod'19 reception at Van Gogh museum, I mentioned this idea to Miguel Castro (of PBFT fame among others). He liked the idea and said he couldn't think of anyone that studied this before.

The central idea in PigPaxos is to decouple the communication from the decision-making at the leader. PigPaxos revises the communication flow to replace the direct communication between the leader and followers in Paxos with a relay based communication flow. PigPaxos chooses relays randomly from follower clusters at each communication round to reduce contention and improve scalability of throughput. 

When Aleksey started evaluating PigPaxos, we were surprised by the  effectiveness of this simple technique. This shouldn't have been too much of a surprise because in our recent Sigmod paper we showed that leader bottleneck is the culprit behind the scalability problems of Paxos family of protocols and quantified on this bottleneck with back-of-the-envelope formulas. Still, the results from PigPaxos were beyond our expectations. We repeated our experiments many times, and double-checked everything before we could allow ourselves to believe these results. Employing relay nodes for relieving the leader, and randomly rotating the relay nodes for communication bottleneck shedding did wonders for the performance. We found that PigPaxos improved the throughput limit more than 3 folds over Paxos with negligible latency deterioration at 25 nodes. Even for as low as 9 nodes, we were able to see 1.5 folds throughput improvement with the same latency as Paxos.

This was a very fun paper to write. The writing went easy and quick. The analytical results section at the end of the paper added a lot to the paper. We put that section after the evaluation section to show that a simple back-of-the-envelope analysis formula can explain the evaluation results nicely.

What is up with the name?

When we started working on the idea, we were calling it BigPaxos, because our ultimate goal was to scale to hundreds of nodes. One day while shooting the shit about BigPaxos on Slack, Aleksey made a typo and wrote "PigPaxos". He didn't realize he made the typo (even though it is very hard to confuse b and p on the keyboard). I teased Aleksey for a couple of minutes by putting various pig emojis in my responses to his comments. He still didn't get the hint, and then I pointed the typo to him. We got a good laugh out of it.

I kept referring to the protocol as PigPaxos in later conversations. With the submission deadline looming, we were unable to prepare for 100 node experiments and add the optimizations we had in mind. I told Aleksey that we should call this protocol PigPaxos officially, and reserve the name BigPaxos for the large scale protocol we can show in the next paper.

Aleksey was not very receptive to the idea, thinking PigPaxos would look too informal in a research paper. He also argued that we don't have a good way to connect the name to the algorithm, all we had was a silly typo. The connection occurred to me on Slack again. In the protocol, the relay nodes wait for the followers’ responses from its cluster and piggyback them together into a single message. Well, this was a close enough association, so we went with the name PigPaxos.

Where can we use PigPaxos?

Paxos protocols are most commonly deployed with 3 and 5 nodes. But there are also several applications that require vertically scaling Paxos to run on a large number of nodes, all within the same conflict domain. One example is consistent cloud configuration management. Configuration management is required for gating new product features, conducting experiments (A/B tests), performing application-level traffic control, performing topology setup and load balancing, monitoring and remediation, updating machine learning models (which vary from KBs to GBs), controlling applications’ behaviors (related to caching, batching, prefetching etc), and controlling chain replication topologies as in Physalia.

Another example is geo-replicated databases. A consensus group in a geo-replicated database may consist of dozens of nodes across many regions around the globe. As we show in our evaluation, PigPaxos increases throughput scalability significantly across WAN deployments with a large number of nodes. Even for Paxos clusters with a small number of (say 5) nodes, large messages (such as database replication messages as in CockroachDB and Spanner) trigger a communication bottleneck at the leader. PigPaxos's randomized relaying technique can help with those bottlenecks as well.

While the idea in PigPaxos is simple and similar aggregation-based approaches have been employed in the context of weak-consistency replication protocols, PigPaxos is novel because it shows how these aggregation-based approaches can be effectively and safely integrated into the strong consistency distributed consensus protocols. The PigPaxos technique, being a simple general technique, is applicable to many Paxos-variant protocols, including Raft, Zab, WPaxos, etc.


Summary of the paper

OK, here is where the more technical content starts. Ready? I promise this will be good. You need a break from reading Covid-19 news anyways.

There are many optimizations possible over the basic scheme we outline below, but we relegate that discussion to the paper.

Background and related work

Paxos family of protocols are employed by many cloud computing services and distributed databases due to their excellent fault-tolerance properties. Unfortunately, current Paxos deployments do not scale for more than a dozen nodes due to the communication bottleneck at the leader.

In the most generic form, Paxos runs in 3 phases. Phase-1 establishes some node as the leader, phase-2 lets the leader to  impose its will onto the followers by telling what command to accept, and phase-3 finalizes the commitment by informing the followers that consensus has been reached.

The basic protocol is rather inefficient with these three communication phases, and Multi-Paxos optimization is often adopted to cut down the unnecessary pleasantries. Multi-Paxos elects one node as a stable leader for some prolonged time, and repeats the phase-2 however many times possible under the same leader, without needing to perform another phase-1. Phase-3 also gets piggybacked to some future phase-2 to reduce communication even more.

It is evident that the leader bears the brunt of the load in Paxos and MultiPaxos. In previous work, Mencius relieved leaders' workload by rotating them.  Recent blockchain consensus protocol LibraBFT from Facebook Libra also used pipelining to improve throughput (the original reason for pipelining was to reduce the effects of a Byzantine leader on the protocol). In contrast, the pipelining in PigPaxos employs random rotation of relay nodes, rather than leader rotation, and improves the throughput scalability significantly without any side effects. Since this is a very simple technique, it is more easily applicable and implementable.

PigPaxos communication flow

As shown in Figure 1&2, the communication in Paxos is direct between the leader and the followers with a fan-out to send messages and fan-in to collect the replies. PigPaxos observes that it is possible to employ intermediate nodes/proxies to help relay and aggregate the messages in this communication pattern. Instead of sending the fan-out messages to all of the followers, the leader transmits these to a small set of relay nodes, which propagate the messages to the remaining followers. The relay nodes also act as aggregators for the fan-in communication of the followers’ responses, and pass the combined results to the leader.

For simplicity sake, PigPaxos divides the entire cluster into a small static number of  relay groups, and a single relay node is chosen from each relay group randomly for every round trip communication round.  We use PigPaxos with the MultiPaxos optimization so only the phase-2 communication is performed in the normal case.

The randomized rotation of the relay nodes provide a big relief from communication bottlenecks. True, a relay node has its work cut out if it needs to aggregate responses from 10 nodes in its cluster. But since the relay nodes randomly rotate, a particular relay node will be off the hook for the several consequent rounds, and will be able to process these messages and send to the leader without getting overwhelmed.

The randomized rotation of the relay nodes also help for improving liveness when a relay crash occurs. Moreover, to guard against crashed or sluggish follower nodes, a timeout is used for setting a time threshold for followers in the group to reply. When the timeout occurs, the relay node acks to the leader with a partial aggregation.

Paxos to PigPaxos mapping

PigPaxos generalizes the communication implementation of the Paxos protocol. Paxos has $N-1$ groups, where each group has one element and the groups do not intersect with each other. In contrast in PigPaxos there are $p$ groups where $p \in \{ 1..N-1\}$.

We note that the safety and liveness proofs of Paxos do not depend on the communication implementation between the leader and follower nodes. In Paxos, maintaining correctness in spite of failures is guaranteed by quorum size and the information exchanged within the quorums, and the proofs are oblivious to how communication of this information is achieved. Therefore, PigPaxos preserves the safety and liveness properties of Paxos, as it only modifies the communication implementation. For reasoning about liveness, the message flow pattern and the use of relay nodes requires special attention, as failure of a relay node has disproportionate impact compared to the failure of a regular follower. Since PigPaxos uses random selection of relay/aggregator nodes at each round, it circumvents this problem and retains liveness.

Evaluation

We implemented PigPaxos in our Paxi framework with almost no changes to the core Paxos code, as we focused only on the message passing layer and relay group orchestration. The entire protocol was implemented in just 1200 lines of code. For evaluation we deployed the protocols on a cluster of up to 25 AWS EC2 m5a nodes with 2 vCPUs and 8 GB of RAM.

The PigPaxos leader communicates with only a handful of nodes for each consensus instance. Naturally, we wanted to see if relay groups get overwhelmed by the extra communication burden. To our surprise, PigPaxos with just 2 relay groups performed the best in a 25 node deployment. This suggests that the leader still remains a bottleneck and the relay groups still have resources to spare. As shown in Figure 7, for a cluster of “25 nodes divided into 2 relay groups”, PigPaxos had nearly ~25% advantage in maximum throughput compared to a 5-relay group setup.


The performance relative to Multi-Paxos on a 25-node cluster is astonishing (Figure 8). PigPaxos shows nearly 3.5 fold improvement in throughput compared to Paxos and 10 folds improvement over EPaxos.


The benefits of PigPaxos do not stop at the large clusters. To our surprise, we observed better throughput from PigPaxos on just a 5 node cluster. Yes, PigPaxos has slightly higher latency due to the additional network hop, but it is still able to edge out  some throughput improvements.

Analysis

We back our empirical results with simple back-of-the-envelope load formulas. Based on our work in the SIGMOD paper, we use a simple count of messages processed by each node as a heuristic for the node’s relative load.  To start with the leader, its communication is no longer dependent on N, or the number of nodes in the cluster, and instead it is a linear function of r, the number of relay groups, plus 2 messages (incoming and outgoing) talking to the client:

Computing the relative message load on the follower is more involved, as we need to account the roles the follower can take on and the probability of each role:


Plugging our experimental configuration into these simple formulas shows us that the relay nodes are never a bottleneck (regardless of both the number of nodes N and number of relay groups r), and keeping the number of relay nodes small can move the entire system closer to the load parity between the leader and followers. The reason the relay nodes don’t become a bottleneck is because the random alternation of the relay nodes shields them from becoming hotspots: the extra traffic load a relay node incurs in one round is offset in consecutive rounds when the node no longer serves as relay.


The drawback with using a very small number of relay nodes (say only 1 relay group in the extreme) is that the performance becomes too fragile: few sluggish or crushed nodes may force the system to wait the timeouts at the relay groups. Using more relay groups mean the leader will receive majority confirmation even in the presence of sluggish/crashed nodes holding back the response from some relay groups. Finally the load that randomly rotated relays can manage is likely to be limited in practice due to hw/sw limitations at the node,  and that is also a factor to be explored.

Future work

There are several optimizations to help the Pig go Big. In our implementation, the relay group partitioning was static to keep things simple. Using dynamic relay groups we may add even more randomness to the communication process, and can deal with failures and sluggish nodes more smoothly. Another optimization is to make the relay nodes reply to the leader before collecting all the responses from peers: collecting one fewer response is not likely to affect the ability to reach majority quorum at the leader, but it prevents incurring a timeout due to one sluggish node in the relay group. More experiments are underway to see how this Pig squeals.

Here is a link to our paper again, if you are looking for some quarantine reading.

Comments

Ineiti said…
The following paper, where I'm a co-author, has an alternative Paxos consensus that is fully BFT. It looks quite similar to what you are proposing here. Have a look ;)

OmniLedger
Anonymous said…
Why scale a consensus group so large though? Having a consensus group with so many members just seems a bit unnecessary. You don't need that much fault tolerance, so rather you can keep the consensus group small and just replicate the updates / state machine snapshots to other members outside of the consensus protocol. The smaller consensus group takes care of the consensus and fault tolerance (3-7 members typically) and you can focus on deploying those members in an optimal way for supporting all the other members outside of the consensus group. For replicating to the members outside of consensus you can use the communication techniques you describe to make it more efficient, so that you don't have hundreds of members receiving updates from a small set of consensus nodes.
Anonymous said…
How it is related to zookeper's hierarchical quorums?

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Foundational distributed systems papers

Advice to the young

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Designing Data Intensive Applications (DDIA) Book