Real Life Is Uncertain. Consensus Should Be Too!

Aleksey and I sat down to read this paper on Monday night. This was an experiment aimed at sharing how experts read papers in real time. We hadn't read the paper beforehand, to keep things raw. As it goes with research, we ended up arguing with the paper (and with each other) back and forth. It was messy, and it was also awesome. We had a lot of fun. Check out our discussion video below (please listen at 1.5x, I sound less horrible at that speed; ah, also this thing is 2 hours long). The paper I annotated during our discussion is also available here.

This paper appeared in HotOS 2025, so it is very recent. It's a position paper arguing that the traditional F-threshold fault model in consensus protocols is outdated and even misleading.

Yes, the F-threshold fault model does feel like training wheels we never took off. In his essay "The Joy of Sects", Pat Helland brings up this topic to tease distributed systems folks: "Distributed systems folks. These people vacillate between philosophers and lawyers. No one else can talk so fluently about total order without discussing the scope of totality. Availability is always couched by assuming a loss of, at most, F servers, while never veering into what happens when an operator logs into the system. Integral to their psyche is the belief in deterministic outcomes in the face of nondeterministic environments. Sometimes these discussions get seriously F'd up!" I am proud to report that, during a meeting with Pat, I came up with the pun that this is "F'ed up, for F=1 and N=3". Although I must concede that Pat is the real A-tell-a The Pun!

So, the premise is agreeable: the clean abstraction of "up to F faults" is too simplistic for how real distributed systems fail. The authors argue that treating all faults equally (as the F-model does) hides real-world complexities. Machines don't fail uniformly. Failures evolve over time (bathtub curve), cluster around software updates, and depend on hardware and even on location in the datacenter (temperature). This 6-page paper has 77 references, so the authors seem to have done an extensive literature search on this topic.

Building on this observation, the paper advocates a paradigm shift: replacing the fixed F-model with a probabilistic approach based on per-node failure probabilities, derived from telemetry and predictive modeling. While the paper doesn't propose a new algorithm, it suggests that such a model enables cost-reliability tradeoffs. For instance, it mentions that running Raft on nine unreliable nodes could match the reliability of three high-end ones at a third of the cost. (It is unclear whether this accounts/prorates for throughput differences.)
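To ground our own intuition about that 9-vs-3 claim, here is a back-of-the-envelope sketch (ours, not the paper's model): assume each node fails independently with some probability over a fixed window, and compute the probability that a majority quorum is up. The per-node failure probabilities below are made up for illustration.

```python
# Back-of-the-envelope sketch (ours, not the paper's model): probability that a
# majority quorum is available when each node fails independently with
# probability p over some fixed window. The per-node numbers are made up.
from math import comb

def majority_available(n: int, p: float) -> float:
    """P(at least floor(n/2)+1 nodes are up), with i.i.d. per-node failure probability p."""
    need = n // 2 + 1
    return sum(comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(need, n + 1))

print(majority_available(3, 0.001))  # 3 pricey, reliable nodes
print(majority_available(9, 0.01))   # 9 cheaper, flakier nodes
```

Even this toy version shows the flavor of the tradeoff, but it says nothing about throughput, correlated failures, or what window p is measured over, which is exactly the kind of ambiguity we kept running into.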

But the deeper we read into the paper, the more we found ourselves asking: what exactly is a fault curve (p_u), and what is this new model? Is this p_u = 1% per year, per month, per quorum formation? The paper never quite defines fault curves, even though they are central to the argument.

We got even more confused by the paper's conflation of safety and liveness for Raft. FLP (1985) taught us to keep safety and liveness separate. Raft and Paxos are known for prioritizing safety above all: even when there are more crash failures than F, safety is not violated. So when the paper reports “Raft is only 99.97% safe and live,” the precision is misleading. What does that number even mean? How was it calculated? Also, there is a big difference between "safe OR live" and "safe AND live". Why were the two bunched together in Table 2 for Raft? What is meant here?

The paper says: "As faults are probabilistic, it is always possible for the number of faults to exceed F. Thus no consensus protocol can offer a guarantee stronger than probabilistic safety or liveness." Again, I suspect that "or" between safety and liveness is carrying a lot of load. The safety of the Paxos family of protocols relies on the quorum intersection property, so even when F is exceeded, safety is not violated, although liveness could be lost. The paper says "Violating quorum intersection invariants triggers safety violations." But quorum intersection is calculated a priori: the sum of any two quorum sizes has to be bigger than N, so intersection is guaranteed by arithmetic, not probabilistically. We had to hypothesize a lot about why the paper seems to claim some safety violation: Is it maybe some durability loss? Is this assuming Byzantine failure? We still don't have an answer.
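For concreteness, here is the arithmetic we mean, as a tiny sanity check (assuming plain majority quorums over the same N nodes):

```python
# Tiny sanity check of the arithmetic: any two majority quorums over the same
# N nodes must share at least one node, because |Q1| + |Q2| > N (pigeonhole).
from itertools import combinations

def majority_quorums(n: int):
    q = n // 2 + 1
    return [set(c) for c in combinations(range(n), q)]

for n in range(1, 8):
    qs = majority_quorums(n)
    assert all(q1 & q2 for q1 in qs for q2 in qs), n
print("every pair of majority quorums intersects, for N = 1..7")
```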

The paper does better with PBFT, separating safety and liveness in its reliability table. But even there, the model feels underspecified. There's a leap from "fault curves exist" to "this quorum configuration gives X nines of safety" without laying out a mathematical foundation.

Another counterintuitive point in the paper was the idea that more nodes can make things worse, probabilistically. For instance, the paper claims that a 5-node PBFT deployment could be significantly more reliable than a 7-node one, because quorum intersection becomes more fragile as the system scales with unreliable nodes. We couldn't really make sense of this claim either, as there was not much explanation for it.
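Here is the kind of calculation we were trying to reconstruct for this claim (again our sketch, with a uniform i.i.d. per-node fault probability that the paper presumably does not assume, since its whole point is that fault curves differ per node):

```python
# Our attempted reconstruction (not the paper's model): probability that no
# more than f nodes are faulty, with f = floor((n-1)/3) as in PBFT, assuming
# each node is faulty independently with probability p.
from math import comb

def at_most_f_faulty(n: int, p: float) -> float:
    f = (n - 1) // 3
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(f + 1))

print(at_most_f_faulty(5, 0.01))  # n=5 tolerates f=1
print(at_most_f_faulty(7, 0.01))  # n=7 tolerates f=2
```

Under a uniform p, the 7-node setup comes out ahead, so the paper's 5-beats-7 claim must hinge on the extra nodes having worse fault curves (or on some other modeling choice we could not pin down).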


Is giving up the F faults abstraction worth it?

This is a position paper, and it plays that role well. It surfaces important observations, challenges sacred abstractions, and provokes discussion. It aims to bring consensus modeling into a more probabilistic/complex (real?) world where failure rates vary, telemetry exists, and tradeoffs matter. It advocates for getting rid of the rigid F upper bound for fault tolerance. But complexity cuts both ways. A richer, more complex model may capture more nuance, but it can also make reasoning and safety proofs much harder. And clarity, simplicity, and guaranteed fault tolerance are essential for consensus.

Actually, the paper made me appreciate the F abstraction for faults even more. It is simple, but it makes reasoning simpler in return. It is still possible to be probabilistic and do all that analysis when selecting F. These days, due to constant software rollouts, many systems go with F=2 and N=5, or even higher numbers. And the nice thing about the Paxos family of protocols is that, thanks to quorum intersection, safety is always guaranteed, non-probabilistically, even when the F limit is exceeded by extra crash faults (in reality, network faults and partitions also bunch in here). And there has been good progress in decoupling F from N (thanks to the flexible quorums result), which addresses some of the complaints in the paper (e.g., "Linear size quorums can be overkill").  Moreover, heterogeneous deployments are already considered, and leader selection heuristics exist.
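As a reminder of what flexible quorums buy (a minimal sketch of the FPaxos condition, not anything from this paper): the phase-1 and phase-2 quorums only need to intersect each other, i.e. |Q1| + |Q2| > N, so you can shrink the common-case phase-2 quorum by growing the phase-1 quorum used for leader election.

```python
# Minimal sketch of the flexible quorums (FPaxos) condition: a phase-1 quorum
# and a phase-2 quorum must intersect, which simple counting quorums guarantee
# whenever |Q1| + |Q2| > N.
def flexible_quorums_ok(n: int, q1: int, q2: int) -> bool:
    return q1 + q2 > n

print(flexible_quorums_ok(5, 4, 2))  # True: bigger election quorum, smaller write quorum
print(flexible_quorums_ok(5, 3, 2))  # False: a 3-node and a 2-node quorum can miss each other
```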

If the goal is to replace the F abstraction, more thought should be put into what new abstraction is proposed to take its place. Abstractions are at the core of Computer Science and Distributed Systems. As one of my favorite Dijkstra quotes says: "The purpose of abstraction is not to be vague, but to create a new semantic level where one can be absolutely precise."
