Rabia: Simplifying State-Machine Replication Through Randomization (SOSP'21)

This paper appeared in SOSP'21. I took notes and screen-snapshots during the presentation of this paper, and decided to put together a summary of what I understood from it. The paper has a simple idea and a somewhat unexpected result. It will be interesting to dive deep and explore to the extent this idea can be applied in practice. 

Here is the idea.


State-machine replication through Paxos, more accurately MultiPaxos, is commonly used in practice. (Yes, that includes state-machine replication through Raft, if I have to spell this out to the Rafters among us.) The MultiPaxos leader drives the protocol. It is basically one round, phase-2 execution of "accept this!" "yes, boss" between the leader and followers, where phase3 commit can be piggybacked to phase-2 message.

This is as simple and efficient as it gets. You can't beat it!

Or can you?


The paper argues that, although the happy path of MultiPaxos based state-machine replication is simple and efficient, that goes out the window when there is a failure. Then the state-machine replication implementation has to deal with corner cases. While the protocols for doing this are there, the implementation is still complex. (Sort of like how Paxos implementation was complex even when the abstract protocol had been available for more than a decade.)



The paper offers something radical: Use Ben-Or for state-machine replication rather than dealing with trying to implement the complicated recovery/leader-election path in Paxos. This is crazy because:

  1. While Paxos gives you arbitrary value consensus, Ben-Or only gives you binary-value (yes/no) consensus
  2. While Paxos is O(N) cost, Ben-Or is a O(N*N) cost because all nodes talk to all nodes. 

I talked about Ben-Or and all of this in a previous post, actually through a series of posts.



But the paper mentions two tricks for adopting Ben-Or for state-machine replication and argues it can be made a competitor to MultiPaxos state-machine replication.


Trick 1. Stable network makes randomized consensus fast. In a stable network, a majority of replicas see the same set of messages. Replicas order the messages they see with respect to the request timestamps. If all replicas propose the same request-order, agreement can be achieved in 3 communication delays. They use Ben-Or, which only solves binary consensus, to come to consensus on this yes/no question among the nodes: Can you agree on a certain order? If they can from the messages they see, they say yes.


Trick 2. When opinions diverge, fail-fast with a No-Op decision, and hope to do fast consensus in the next slot. Randomized consensus takes a long time when replicas propose different requests, so when the nodes notice this is the case, they propose and quickly come to consensus on a no-op decision, because in the next slot with nodes seeing the same requests and ordering them, they will be able to get to a consensus fast with the same proposals coming from majority of the nodes. 


They call the resulting adoption of Ben-Or for state-machine replication, Rabia: RAndomized BInary Agreement protocol. They report experiments where this can be competitive with MultiPaxos state-machine replication.




So this is an interesting paper and it generated a lot of discussion after the presentation. 

In the Question&Answer session, they mentioned that Multi-Paxos saturates earlier than Rabia. But in Ben-Or every node does as much work as a leader, and this did not sound right. It turns out they are somehow distributing the responsibility of responding back to clients among nodes in Rabia.

Rabia, in its implementation, doesn't allow pipelining of consensus instances. Instead they tune the batching to improve the performance for Rabia.


I am not sure how practical the Rabia idea is yet. I haven't read the paper. It is possible that the concessions the paper makes for adopting Ben-Or is not applicable well in practice.

It is worth mentioning they are comparing Rabia against an unoptimized vanilla MultiPaxos deployment. You can use PigPaxos and Paxos Quorum Reads and in general the compartmentalization techniques for making MultiPaxos based state-machine replication, make more effective. I don't think there is even a competition. As I argued in 2014, there is a lot of efficiency/scalability advantages that logically centralized control approach has over the fully decentralized approach. 

On the other hand, due to dealing with the recovery-path in a uniform way, Ben-Or based state-machine replication may find a niche to fill in for some applications. Using an all node symmetric consensus protocol can save a lot of headache by avoiding the asymmetric view problem among nodes. 

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Foundational distributed systems papers

Advice to the young

Linearizability: A Correctness Condition for Concurrent Objects

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book