Posts

Showing posts with the label fault-tolerance

Real Life Is Uncertain. Consensus Should Be Too!

Image
Aleksey and I sat down to read this paper on Monday night. This was an experiment which aimed to share how experts read papers in real time. We haven't read this paper before to keep things raw. As it is with research, we ended up arguing with the paper (and between each other) back and forth. It was messy, and it was also awesome. We had a lot of fun. Check our discussion video below (please listen at 1.5x, I sound less horrible at that speed, ah also this thing is 2 hours long). The paper I annotated during our discussion is also available here. This paper appeared in HotOS 2025 , so it is very recent. It's a position paper arguing that the traditional F-threshold fault model in consensus protocols is outdated and even misleading. Yes, the F-threshold fault model does feel like training wheels we never took off. In his essay " the joy of sects" , Pat Helland bring this topic to tease distributed systems folk: " Distributed systems folks. These people vacilla...

Analyzing Metastable Failures in Distributed Systems

Image
So it goes: your system is purring like a tiger, devouring requests, until, without warning, it slumps into existential dread. Not a crash. Not a bang. A quiet, self-sustaining collapse. The system doesn’t stop. It just refuses to get better. Metastable failure is what happens when the feedback loops in the system go feral. Retries pile up, queues overflow, recovery stalls. Everything runs but nothing improves. The system is busy and useless. In an earlier post, I reviewed the excellent OSDI ’22 paper on metastable failures , which dissected real-world incidents and laid the theoretical groundwork. If you haven’t read that one, start there. This HotOS ’25 paper picks up the thread. It introduces tooling and a simulation framework to help engineers identify potential metastable failure modes before disaster strikes. It’s early stage work. A short paper. But a promising start. Let’s walk through it. Introduction Like most great tragedies, metastable failure doesn't begin with villain...

DDIA: Chp 8. The Trouble with Distributed Systems

Image
This is a long chapter. It touches on so many things. Here is the table of contents. Faults and partial failures Unreliable networks detecting faults timeouts and unbounded delays sync vs async networks Unreliable clocks monotonic vs time-of-day clocks clock sync and accuracy relying on sync clocks process pauses Knowledge truth and lies truth defined by majority byzantine faults system model and reality I don't know if listing a deluge of problems is the best way to approach this chapter. Reading these is fine, but it doesn't mean you learn them. I think you need time and hands on work to internalize these. What can go wrong? Computers can crash. Unfortunately they don't fail cleanly.   Fail-fast is failing fast! And again unfortunately, partial failures (limping computers) are very difficult to deal with. Even worse, with the transistor density so high, we now need to deal with silent failures. We have memory corruption and even silent faults from CPUs. The HPTS'24 s...

SRDS Day 2

Image
Ok, continuing on the SRDS day 1 post , I bring you SRDS day 2. Here are my summaries from the keynote, and from the talks for which I took some notes.  Mahesh Balakrishnan's Keynote Mahesh 's keynote was titled "An Hourglass Architecture for Distributed Systems" . His talk focused on the evolution of his perspective on distributed systems research and the importance of abstraction in managing complexity. He began by reflecting on how in his PhD days, he believed the goal was to develop better protocols with nice properties. He said, he later realized that the true challenge in distributed systems research lies in creating new abstractions that simplify these complex systems. Complexity creeps into distributed systems through failures, asynchrony, and change. Mahesh also confessed that he didn't realize the extent to the importance of managing change until his days in industry.  While other fields in computer science have successfully built robust abstractions (su...

SRDS Day 1

Image
This week, I was at the 43rd International Symposium on Reliable Distributed Systems (SRDS 2024) at Charlotte, NC. The conference center was at the UNC Charlotte, which has a large and beautiful campus. I was the Program Committee chair for SRDS'24 along with Silvia Bonomi. Marco Vieira and Bojan Cukic were the general co-chairs. A lot of work goes in to organizing a conference. Silvia and I recruited the reviewers (27 reviewers, plus external reviewers), oversaw the reviewing process (using hotcrp conference review management system), and came up with the conference program for the accepted papers. We got 87 submissions. 2 papers have been desk-rejected. Each paper received 3 reviews, which brought it up to 261 reviews in total. 27 papers were finally accepted (acceptance rate of ~30%), 1 has been withdrawn after the notification. Marco and Bojan took care of everything else including all local arrangements, web-site management, which are a lot more involved than you would guess....

HPTS'24 day 1, part 2

Image
This is part 2 of day 1 of HPTS'24. (You can tell I did some lisp programming back in the day, huh?) Here is the first part of day 1 , you should check that out as well.  There were 2 sessions each with 4 talks in the afternoon of day 2. After dinner, we had a gong show presentation on miscellaneous topics as well.  Session 3: DBOS Virtual Memory: a Huge Step Backwards- Daniel Bittman (Elephance) Daniel argued that virtual memory, despite its widespread use, is fundamentally flawed. He contends that virtual memory presents an abstraction of memory as  flat, uniform, contiguous, infinite, isolated, and fast - all of which are untrue.  He pointed to significant performance issues with virtual memory, noting that virtual memory adds substantial overhead to every memory access. Every memory access costs 4 extra memory accesses and 100s of nanoseconds of wasted CPU time, which is an eternity in CPU world. He criticized the Transaction Look-Aside Buffer (TLB) for its compl...

Fault tolerance (Transaction processing book)

Image
This is Chapter 3 from the Transaction Processing Book Gray/Reuter 1992 . Why does the fault-tolerance discussion come so early in the book? We haven't even started talking about transactional programming styles, concurrency theory, concurrency control. The reason is that the book uses dealing with failures as a motivation for adopting transaction primitives and a transactional programming style. I will highlight this argument now, and outline how the book builds to that crescendo in about 50 pages. The chapter starts with an astounding observation. I'm continuously astounded by the clarity of thinking in this book: "The presence of design faults is the ultimate limit to system availability; we have techniques that mask other kinds of faults." In the coming sections, the book introduces the concepts of faults, failures, availability, reliability, and discusses hardware fault-tolerance through redundancy. It celebrates wins in hardware reliability through several examp...

Fault-Tolerant Replication with Pull-Based Consensus in MongoDB

Image
This paper, from NSDI 2021, presents the design and implementation of strongly consistent replication in MongoDB using a consensus protocol derived from Raft. Raft provides fault-tolerant state-machine-replication (SMR) over asynchronous networks. Raft (like most SMR protocols) uses push-based replication. But MongoDB uses pull-based replication scheme, so when integrating/invigorating MongoDB's SMR with Raft, this caused challenges. The paper focuses on examining and solving these challenges, and explaining the resulting MongoSMR protocol (my term, not the paper's).  The paper restricts itself to the strongest consistency level, linearizability, but it also talks about how serving weaker models interact/shape decisions made in MongoDB's replication protocol. The paper talks about extensions/optimizations of MongoDB SMR protocol, but I skip those for brevity. I also skip the evaluation section, and just focus on the core of the SMR protocol. Design Background Unlike conve...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems