FlexiRaft: Flexible Quorums with Raft

This paper appeared in CIDR23 and is from Meta (wow, this is the first time I used the new name without needing to mention it is in fact Facebook... wait.. goddammit). The paper talks about how they applied Raft to MySQL replication, and used the flexible quorums in the process.

This is not a technically deep paper, but it was interesting to see a practical application of flexible quorums idea to Raft rather than Paxos. The most technically interesting part is the adoption of flexible quorums to Raft rather than Paxos. What is the difference? Flexible quorums idea was developed for Paxos's two phases. But, Raft needs to impose an extra requirement on quorums in order to guarantee Leader Completeness: "the new leader must already have all log entries replicated by a majority of nodes in the previous term."

The paper does not call this explicitly. This is what the paper says: "For state machine safety, every data commit quorum needs to intersect with every leader election quorum. If this constraint is violated, two disjoint sets of servers would exist where one set would be able to commit transactions without the knowledge of any member in the other set which is capable of independently electing a new leader (data loss). Similarly, since our setup only allows for at most one leader at any given point in time, every pair of valid leader election quorums should intersect as well. If this were not the case, there would exist two disjoint sets of servers wherein each set is capable of independently electing a new leader without the knowledge of any member in the other set (split brain)."

Yeah, the first part is required for flexible quorums even for Paxos. But the second part is not required for Paxos. Even without the second part, Paxos is doing fine without any split brain, thank you. That second part is there only for Raft for ensuring LeaderCompleteness. I am disappointed that the paper doesn't call this distinction out. 

Ok, let's carry on with the rest of the paper.


The MySQL problem

MySQL serves as Meta's primary transactional datastore for relational workloads, handling petabytes of data. Before Raft, the consensus mechanism was split between the MySQL server and various automation tools, making code modifications error-prone. Modifying the code was error prone because the logic to support leader elections and data commit was spread across multiple bespoke automation tools. 

In the semisynchronous replication setup, a primary replica and its two binlog servers formed the core unit. Transactions were committed when acknowledged by one of the primary's binlog servers. What is a binlog server? It is a special server storing only recent binlogs (write-ahead logs) rather than a full database copy that the primary possesses.

Other replicas applied updates asynchronously. Note that binlog servers R1.1 and R1.2 (respectively R2.1 and R2.2) get updates from R1 (resp. R2) and not from R0 directly.

If the primary failed, another replica (R1 or R2) would take over, using the failed primary's binlog servers (R0.1 and R0.2) to recover any missed operations.

As mentioned in the first paragraph, this system had some correctness problems because the leader election was externally managed and the separation of data commit and leader election logic made maintenance difficult. To address this issue Meta (yay!) transitioned MySQL from semisynchronous replication to a modified Raft consensus protocol called FlexiRaft.

Maybe a short aside here to discuss about why Raft but not Paxos. Raft has many opensource implementations for one. But why is Raft more popular in industry, and why is it a better fit here for MySQL log replication than Paxos. I don't think the answer is because Raft is simpler. I think the reason is because Raft is designed to be log-centric. It makes the most of the log abstraction and leveraging the log abstraction it provides stronger guarantees than Paxos (e.g., the Leader Completeness we discussed above). Paxos is more general, and a more abstract algorithm that lends itself better to many different ways of customizations. The log is not the central abstraction in Paxos.


FlexiRaft

A straightforward application of Raft would address correctness but would introduce new challenges. It would increase cross-region network utilization, as the leader had to send updates directly to all replica set members. It would also increase data commit latency and reduced throughput due to the requirement for a majority agreement across regional boundaries.

To overcome these limitations, FlexiRaft introduced flexible quorums, allowing users to customize data commit quorums based on their specific needs for latency, throughput, and fault tolerance. Let's remember the flexible quorums result. The idea for flexible quorums was introduced in a paper in 2016 summer. This simple and surprising result says "majority agreement is not required in Paxos and the sets of acceptors required to participate in agreement (known as quorums) do not even need to intersect with each other".

So now, we talk about two types of quorums.

  • Leader election quorum: The minimal set of servers that must accept a candidate as leader for it to safely assert leadership over the entire replica set.
  • Data commit quorum: The minimal set of servers (including MySQL and binlog servers) that must acknowledge a transaction before it can be committed.

Let's revisit what I said in the introduction. The paper says: "For state machine safety, every data commit quorum needs to intersect with every leader election quorum. If this constraint is violated, two disjoint sets of servers would exist where one set would be able to commit transactions without the knowledge of any member in the other set which is capable of independently electing a new leader (data loss). Similarly, since our setup only allows for at most one leader at any given point in time, every pair of valid leader election quorums should intersect as well. If this were not the case, there would exist two disjoint sets of servers wherein each set is capable of independently electing a new leader without the knowledge of any member in the other set (split brain)."

Yes, the first part is required for flexible quorums even for Paxos. This is required for ensuring that every data commit quorum intersects with every leader election quorum. And this is sufficient for state-machine safety. But the second part, which states that leader election quorums should be at least majority quorums, is there only for Raft for ensuring LeaderCompleteness: "the new leader must already have all log entries replicated by a majority of nodes in the previous term." I am disappointed that the paper doesn't call this distinction out.   

Ok, moving on. FlexiRaft supports two modes for choosing data commit quorums: static and dynamic. So let's cover these next.


Static quorums

Static quorums can be configured to span multiple geographical regions to increase fault tolerance.

For example, we can specify a data commit quorum as:  Majority in 2 out of 5 groups: {G1, G2, ..., G5} OR Majority in 2 out of 3 groups: {G6, G7, G8}

Here, G1, G2, G3, G4 and G5 are disjoint groups of servers in the United States and G6, G7 and G8 are disjoint groups of servers in Europe. all servers in a group (replicaset) belong to the same datacenter. The corresponding leader election quorum would be the following: Majority in 4 out of 5 groups: {G1, G2, ..., G5} AND Majority in 2 out of 3 groups: {G6, G7, G8}


Here is another example, where the data commit quorum is: Majority in 2 out of 5 groups: {G1, G2, ..., G5} AND Majority in 2 out of 3 groups: {G6, G7, G8}.  The corresponding leader election quorum which gets automatically computed is: Majority in 4 out of 5 groups: {G1, G2, ..., G5} AND Majority in 2 out of 3 groups: {G6, G7, G8}


Dynamic quorums

The dynamic mode, which is popular in Meta's current setup, limits the data commit quorum to a single geographical group. Both data commit and leader election quorums are expressed as a majority within the group containing the current leader. Unlike static quorums which are defined independent on the leader's location, dynamic quorums are leader location dependent.

This approach optimizes for local operations but is less available. Dynamic quorums may not be resilient to the failure of a majority of servers in one group, particularly if that group contains the leader. Even if that is not the current leader group, through a combination of leader failover and previous leader information not being available through voting history, unavailability may result.

Let's explain this by first defining the Pessimistic leader election quorum as "A majority of servers from every constituent group in a replica set, guaranteeing intersection with every valid leader election or data commit quorum."

The FlexiRaft algorithm first checks if the the configured data commit (and hence, the leader election) quorum is static and returns if the quorum has been satisfied based on the votes received thus far. In dynamic mode, the algorithm checks for the satisfaction of the pessimistic quorum. The pessimistic quorum for a replica set is defined as a majority within all groups of servers individually. By definition, satisfaction of the pessimistic quorum guarantees the satisfaction of the leader election quorum (since it is defined as majority in a subset of the groups). If the candidate does not have any knowledge of the last known leader, the only way to win an election is to satisfy the pessimistic quorum. If the term in which the candidate is conducting an election immediately succeeds that of the last known leader, getting a majority from the group of the last known leader is sufficient to win the election. Otherwise, the candidate now needs to learn about the leader (if it exists) with the highest term and solicit majority votes from that leader’s group. Unavailability may result if this is not learnable through the voting history.


Evaluation

In all of our experiments, FlexiRaft was deployed in the dynamic mode with data commit quorum restricted to a single group. Each test machine was configured with a dual socket Intel Xeon Gold 6138 CPU and 256 GB DDR4 RAM. Each socket offered 20 physical cores.

Figure 3 shows the throughput as they increase the number of clients writing to the database. Latency measurements are specified in Table 1 and were computed across a million uniform commit samples for each algorithm. It can be seen that the average numbers for FlexiRaft (with dynamic mode) and semisynchronous algorithms are similar.

Figure 4 shows a graph between average latency observed by clients for each of these algorithms. Raft’s latency goes up with an increase in the replica set size because of the corresponding growth in data commit quorum size. However, for semisynchronous and FlexiRaft (dynamic mode), it remains stable since the quorum is confined to a single group of servers belonging to the same datacenter.

The paper says: "[With FlexiRaft] Tail commit latencies became independent of the number of replicas in the cluster." But this has nothing to do with tail latency, but to do with WAN latency. In dynamic quorum the system can avoid waiting for WAN RTT.


Discussion

The implementation of the algorithm (linked here) was extracted from the Apache Kudu open source project. There is a TLA+ specification of the algorithm here as well.

The paper mentions that pre-voting (or pre-election) became more important with the addition of flexible quorums. "In the pre-vote algorithm, a candidate only increments its term if it first learns from the memebers of a leader election quorum that they would be willing to grant the candidate their votes. This helps reduce the likelihood of disruptions, especially in the dynamic quorum mode where coordinated failures of the leader and a group not containing the leader can cause availability loss for the replica set."


Finally, there are several interesting directions for future work. 

The paper identifies one as follows. "Another significant avenue for improvement is when strict serializability and consistency can be relaxed in favor of low latency and high throughput. The ParallelRaft algorithm implemented by PolarFS makes this tradeoff. It relaxes Raft’s strict serialization by allowing out-of-order acknowledgement, commit and application of logs. Since MySQL doesn’t care about the I/O sequence of underlying storage, exploring this strategy for FlexiRaft to further provide a throughput boost for applications could be a fruitful endeavor in the future."

At one point the paper mentions cross-region network traffic, and says: "The mitigation strategy used to reduce cross-region network utilization is beyond the scope of this paper." Our PigPaxos work would be useful here to reduce WAN costs and bottlenecks at the leader.

In previous work, we had also developed WPaxos, a wide-area-network Paxos protocol that employs flexible quorums. The goal there was to distribute the workload across multiple leaders to achieve scalability. WPaxos employed an adaptive object stealing protocol to match the workload access locality. WPaxos was developed for Paxos, rather than Raft, but it should be adaptable to Raft as well.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book