Cabinet: Dynamically Weighted Consensus Made Fast

This paper (to appear in VLDB'25) proposes a consensus algorithm called "Cabinet", which dynamically adjusts node weights based on responsiveness.

As in the past three weeks, Aleksey and I live-recorded our first blind read and discussion of the paper to capture in real time how experts approach and dissect a paper. Below is the link to the video. The paper I annotated during our read is available here. Frankly, we regretted reading this paper, as it had big flaws.

Motivation

The paper began by revisiting the role of consensus protocols in distributed systems: ensuring consistent state across replicas. It claimed that consensus often struggles with scalability because majority quorums scale poorly. But this is misleading, because it omits that flexible quorums already address this issue. The paper all but ignores the flexible quorums work, mentioning it in only one sentence. Ironically, as we will see later, flexible quorums resurface and undermine the paper's entire premise.

Carrying on with the introduction, the paper confused us by asserting that Google Spanner uses quorums of hundreds of nodes. Wait, wut!? "For instance, in Google’s Spanner's evaluation [15], despite the system scaling from tens to hundreds of nodes, the quorum size keeps growing (e.g., 51 replies in 100 nodes), leading to low performance, especially with high latency, though the probability of half of the nodes failing in the system is exceedingly low."

This is incorrect. Spanner uses replica sets of 3–5 nodes, then relies on two-phase commit (2PC) across the primaries of those replica sets. 2PC requires unanimity ("all Aye") from participants and does not use quorums. So Spanner only employs flat quorums, and only at the replica set level. Unfortunately, the paper doubles down on this misunderstanding, further claiming in the related work section that Spanner uses hierarchical quorums. Wait, wut?! "When scaling a system, sharding—partitioning a large group of nodes into hierarchical quorums—is often applied, as exemplified by Google Spanner [16]. Compared to majority-quorum consensus, hierarchical quorum consensus (HQC) allows consensus decisions to be made simultaneously within groups. Once a consensus decision is reached at the child-level groups, parent groups aggregate these decisions up to the root group (shown in Figure 2)."

The paper seemed desperate to argue that large replica sets are used in practice and that majority quorums fail to scale, since that would motivate the need for weighted consensus. They even cited a 2004 survey for this claim (really, 2004?) but could not identify any large replica set deployments in use in practice. That is not surprising, because large replica sets make little sense: tolerating more than two failures (F>2) is uncommon, and large replica sets are not economical. Some geo-replication papers mention large quorums, but the remote replicas usually act as learners, not as quorum participants. Again, those are just papers, not practically deployed systems. The largest quorum I know of is from Physalia, which uses N=7 and quorum size 4. Even there, the choice was made to reduce blast radius and improve partition tolerance, and the cost was justified because a Physalia deployment serves as the configuration store for many chain-replication groups at once.

So Cabinet's motivating framing, that inefficiency stems solely from majority quorums, is simplistic and unconvincing. Even in theoretically plausible large deployments, easier mitigations exist. For example, communication overlays, such as those we proposed in PigPaxos (2020), can address scaling bottlenecks directly.


Weighted Consensus Algorithm

Let's move past the shaky motivation and examine the algorithm itself. Cabinet introduces a weighted consensus approach for "fast agreement in heterogeneous systems." The paper allows a customizable failure threshold t, just as flexible Paxos does, and requires only t+1 responses to commit a decision, again, just as flexible Paxos does.

The paper recasts the quorum intersection property in terms of weights. There was a lot of weighted voting work in the 1980s, and the paper draws on it. Traditional weighted voting sets the consensus threshold (CT) at half of the total weight. But the paper points out that this alone cannot guarantee both safety and liveness. To fix this, the authors propose two invariants:

  • I1: The sum of the t+1 highest weights (the cabinet members) must be greater than the consensus threshold (CT).
  • I2: The sum of the t highest weights (cabinet members minus the lowest member) must be less than the consensus threshold.

In other words, a valid weight scheme must satisfy Equation 2.
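
For concreteness, here is a plausible reconstruction of what Equation 2 says, written directly from I1 and I2 (the notation is mine, with w_(1) ≥ w_(2) ≥ … ≥ w_(n) denoting the weights in descending order; the paper's exact formulation may differ):

```latex
\[
\underbrace{\sum_{i=1}^{t} w_{(i)}}_{\text{top } t \text{ weights}}
\;<\; \mathrm{CT} \;<\;
\underbrace{\sum_{i=1}^{t+1} w_{(i)}}_{\text{top } t+1 \text{ weights (the cabinet)}}
\]
```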

Notice that this is just recasting the quorum intersection property in terms of weights. Neat. The weight allocation mechanism was the most interesting part of the paper. Cabinet observes that assigning weights from a geometric sequence satisfies Equation 2 formulaically for a given t (where 1 ≤ t ≤ (n−1)/2). With a common ratio 1 < r < 2, the sequence increases monotonically but never lets any single term exceed the sum of the prior terms, preserving quorum intersection.
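
To make this concrete, here is a minimal Python sketch (my own, not the paper's code) that generates geometric weights and checks I1 and I2, under the assumption that CT is half of the total weight as in traditional weighted voting; the paper may set CT differently. It also illustrates that the common ratio r cannot be chosen freely: for a given n and t, too large an r concentrates weight at the top and breaks I2.

```python
# Sketch: check Cabinet-style weight invariants I1 and I2 for geometric weights.
# Assumption (mine): CT = half of the total weight, as in traditional weighted voting.

def geometric_weights(n: int, r: float) -> list[float]:
    """Weights r^0, r^1, ..., r^(n-1), returned in descending order."""
    return sorted((r ** i for i in range(n)), reverse=True)

def satisfies_invariants(weights: list[float], t: int) -> bool:
    """weights must be sorted in descending order."""
    ct = sum(weights) / 2          # consensus threshold (assumed to be half the total)
    top_t = sum(weights[:t])       # t highest weights
    top_t1 = sum(weights[:t + 1])  # t+1 highest weights (the "cabinet")
    return top_t < ct < top_t1     # I2 and I1, respectively

if __name__ == "__main__":
    n, t = 7, 2
    for r in (1.1, 1.5):           # illustrative ratios in (1, 2)
        w = geometric_weights(n, r)
        print(f"r={r}: invariants hold? {satisfies_invariants(w, t)}")
    # Under these assumptions, r=1.1 satisfies both invariants for n=7, t=2,
    # while r=1.5 violates I2 because the top two weights already exceed CT.
```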



Initially, nodes are assigned weights in descending order of their IDs: higher IDs get lower initial weights. These initial weights are then redistributed dynamically based on node responsiveness, prioritizing faster nodes as cabinet members. Importantly, no new weights are created; the leader merely redistributes existing ones among nodes. The paper claims that fast nodes are also reliable, but its only justification is again that 2004 survey, which is hardly convincing.

The idea behind dynamic reassignment is to adapt quickly to performance fluctuations and reach the quorum weight threshold with low latency, but this introduces significant overhead. The paper mentions that the weights need to be communicated and maintained through log replication, which leads to logging overhead. Worse, the reweighting is performed for every consensus instance. (What puzzles me is how this could possibly interact with pipelining in MultiPaxos and Raft, but that is another story.) That seems like overkill.

Algorithm 2 shows the mechanics. The leader maintains a weight clock (W_clock) and assigns weight values (W_i) per node to track consensus rounds and update weights. I guess the reasoning is that this must be logged and communicated to followers to ensure safety during leader changes. Otherwise, a newly elected leader might form non-intersecting quorums and violate safety. The result is weight metadata stored with every message. Uff. (Again, how does this reconcile with pipelined/Multi-Paxos style communication?)
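
To illustrate the kind of bookkeeping this implies, here is a hedged sketch of a leader-side reweighting step in the spirit of Algorithm 2 as described above. The data structures, the latency-based ordering, and the metadata format are my own illustration, not the paper's exact mechanism.

```python
# Sketch (my illustration, not the paper's Algorithm 2 verbatim): the leader
# redistributes a fixed pool of weights by observed responsiveness and versions
# the assignment with a weight clock, so that it can be logged and replicated
# alongside the consensus instance.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LeaderWeightState:
    weight_pool: List[float]                                  # fixed multiset, e.g. geometric weights
    weights: Dict[int, float] = field(default_factory=dict)   # node_id -> current weight
    w_clock: int = 0                                          # bumped on every reassignment

    def reassign(self, latencies: Dict[int, float]) -> dict:
        """Hand the largest weights to the most responsive nodes.

        No new weight is created; the same pool is redistributed. The returned
        metadata is what would have to ride along with (and be logged for)
        each consensus instance.
        """
        fastest_first = sorted(latencies, key=latencies.get)  # node IDs, fastest first
        self.weights = dict(zip(fastest_first, sorted(self.weight_pool, reverse=True)))
        self.w_clock += 1
        return {"w_clock": self.w_clock, "weights": dict(self.weights)}

# Example: five nodes with geometric weights; node 3 is currently the fastest,
# so it receives the highest weight, and node 5 (slowest) the lowest.
state = LeaderWeightState(weight_pool=[1.0, 1.1, 1.21, 1.331, 1.4641])
meta = state.reassign({1: 12.0, 2: 9.5, 3: 3.2, 4: 7.1, 5: 20.0})
```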


Flexible Quorums with a vengeance

This is where things fall apart, in Section 4.1.3.

This subsection says that Cabinet adopts Raft's leader election mechanism but does not apply weighting there; for simplicity, weighted consensus is used only in replication. But, the paper continues, to compensate for the t+1 replication quorums, Raft's leader election quorum size must be set to n-t: a candidate must receive votes from n-t nodes to win.

But this is exactly how flexible quorums work: if phase-2 (commit) quorums have size t+1, then phase-1 (leader election) quorums must have size n-t (see the quick derivation below). Cabinet reduces to flexible quorums. Unfortunately, several things fall apart because of this.
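
For reference, the n-t requirement drops straight out of the flexible quorum intersection condition (standard FPaxos-style notation; the symbols are mine, not the paper's):

```latex
\[
|Q_1| + |Q_2| > n, \quad |Q_2| = t + 1
\;\Longrightarrow\;
|Q_1| \ge n - |Q_2| + 1 = n - (t + 1) + 1 = n - t
\]
```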

This reduction undercuts the entire framing. The paper claims tolerance of at least t failures in the worst case and up to n-t-1 in the best case. But leader election forces the system into the flexible quorum model, where fault tolerance is capped at exactly t, and the range promised by Cabinet is gone.

Here is another kicker: flexible Paxos already uses the fastest t+1 responses, without bothering with weights. It doesn't need to guess who the fastest nodes will be; it simply waits for the quickest t+1 replies. All nodes are equal and no weight-tracking is needed, so there is no overhead for tracking weights and communicating them through the log. So what was the point of all this? Defaulting to flexible Paxos simplifies everything and is more efficient: no weight bookkeeping, no per-slot weight metadata. Cabinet's methods add complexity and overhead without providing benefits beyond flexible Paxos.

Finally, why does Cabinet focus only on making replication/commit quorums intersect? The answer seems to be that it missed the key lesson from the flexible quorum paper. Flexible quorums weakened Paxos's requirement that all quorums intersect to the more efficient rule that only phase-1 (Q1) and phase-2 (Q2) quorums must intersect. In other words, it is enough for any Q1 quorum (used for election) to intersect with any Q2 quorum (used for commits), and Q2 quorums themselves do not need to intersect with each other. Cabinet, however, spent considerable effort designing a weighted system to ensure intersection among all Q2 quorums, only to later fall back on flexible quorum logic in Section 4.1.3. As a result, the work on enforcing Q2–Q2 intersection ended up achieving nothing.

Let's do a crosscheck on this. Cabinet requires maintaining and logging weights for every consensus instance. Yet Raft's strong leader property already ensures that a newly elected leader cannot overwrite previously committed values in a disjoint Q2 quorum, regardless of quorum composition. Even if the newly elected leader forms a non-intersecting quorum, safety is preserved because Raft enforces log continuity by electing the node with the most up-to-date log in the Q1 quorum (and the extra condition for flexible Raft is that Q1 quorums must also intersect each other). Under this model, storing and propagating weight assignments per slot becomes unnecessary overhead. (In Paxos, the newly elected leader learning all previously accepted values before switching to MultiPaxos mode achieves the same goal.) The insistence on tracking weights appears rooted in a misunderstanding of how leader election and log safety interact. This undermines the central justification for Cabinet's complexity, revealing that the elaborate weight bookkeeping brings no tangible benefit.


Can this be salvaged?

During our blind read, we initially thought the paper might explore probabilistic failure detection and weighted fault probabilities. That would have been a more compelling direction. Recall the HotOS'25 paper we reviewed: "Real Life Is Uncertain. Consensus Should Be Too!" Pivoting in that direction could rescue some of these ideas. 

Another promising salvage path could be non-leader linearizable reads. If weights were cached and relatively stable, they could be exploited to form smaller read quorums (in the style of our PQR work), avoiding the need for majorities or full Q2 complements. This could deliver practical efficiency gains, especially in read-heavy workloads.

Finally, it might be worth exploring weight-encoding mechanisms that explicitly target Q1–Q2 quorum intersection. This would likely involve assigning two weight sets per node: one for Q1 participation and another for Q2. Such a design could provide a cleaner, purpose-built use of weights rather than forcing them into roles already covered by flexible quorums.
