DDIA: Chp 8. The Trouble with Distributed Systems

This is a long chapter. It touches on so many things. Here is the table of contents.

Faults and partial failures

Unreliable networks

  • detecting faults
  • timeouts and unbounded delays
  • sync vs async networks

Unreliable clocks

  • monotonic vs time-of-day clocks
  • clock sync and accuracy
  • relying on sync clocks
  • process pauses

Knowledge, truth, and lies

  • truth defined by majority
  • byzantine faults
  • system model and reality


I don't know if listing a deluge of problems is the best way to approach this chapter. Reading about these problems is fine, but that doesn't mean you learn them. I think you need time and hands-on work to internalize them.

What can go wrong? Computers can crash. Unfortunately, they don't fail cleanly. Fail-fast is failing fast! And, again unfortunately, partial failures (limping computers) are very difficult to deal with. Even worse, with transistor densities so high, we now need to deal with silent failures: memory corruption and even silent faults from CPUs. The HPTS'24 session on failures was bleak indeed.

That was only the beginning. Then there are network failures. And clock synchronization failures. Did we leave anything out?

There are metastable failures, which are emergent behavior in distributed systems. We can then venture into more malicious cases, like Byzantine failures. If you count security breaches/hacking as failures, there are many more problems to list. There are hurricanes, natural disasters, and even cyberattacks during natural disasters.

The failure modes are too many to enumerate. So maybe, instead of enumerating and explaining all of them in detail, it is better to give the fundamental impossibility results haunting distributed systems: the Coordinated Attack impossibility and the FLP (Fischer-Lynch-Paterson) impossibility.

I like these results because they abstract things away to simplify (only a single process crash, only message loss), combine that with the inherent complexity of distributed systems (asynchrony, asymmetry of information), and arrive at powerful and damning impossibility results.

But, believe it or not, the distributed systems field prospered after researchers identified these damning impossibility results. They had a freeing effect on the field. Researchers finally achieved clarity of thinking when faced with them, and they found ways to circumvent them, having learned where to buckle.

They categorized system properties as safety and liveness properties. Knowing that the two cannot always be satisfied together in all executions, they found ways to preserve safety under all conditions and to achieve liveness only when conditions move beyond the realm of impossibility. Paxos is a great example of this: you would not be able to really understand Paxos without understanding these impossibility results. The impossibility results are in fact where I suggest people start when learning distributed systems. You don't need to understand the proofs of the impossibility results; they are technical. But understanding what the impossibility results imply, and what they don't imply, is important.
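To make the safety-first idea concrete, here is a toy sketch (my own illustration, not from the book, and certainly not Paxos itself): a value commits only if a majority of replicas accept it; when a majority is unreachable, the system stalls (gives up liveness) rather than risk committing inconsistently (violating safety).

```python
# Toy illustration of "preserve safety always, achieve liveness when you can":
# commit a value only if a majority of replicas accept it. Without a majority,
# refuse to commit (lose liveness) instead of risking two divergent commits
# (losing safety). This captures the spirit of majority quorums, not Paxos.

class Replica:
    def __init__(self):
        self.reachable = True
        self.accepted = None

    def accept(self, value):
        if not self.reachable:
            return False          # simulate a crash or partition
        self.accepted = value
        return True


def try_commit(replicas, value):
    acks = sum(r.accept(value) for r in replicas)
    if acks > len(replicas) // 2:
        return True               # majority reached: safe to commit
    return False                  # no majority: stall/retry, never commit


if __name__ == "__main__":
    replicas = [Replica() for _ in range(5)]
    print(try_commit(replicas, "v1"))   # True: majority reachable
    for r in replicas[:3]:
        r.reachable = False             # partition away 3 of 5 replicas
    print(try_commit(replicas, "v2"))   # False: safety kept, liveness given up
```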


A couple of things from the chapter

The book has a great illustration of the (check-write) problem that fencing tokens solve. The figure explains it all. Without the fencing tokens, the last unchecked write would cause mayhem.
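Here is a minimal sketch of the mechanism (hypothetical names, not the book's code): the lock service hands out a monotonically increasing token with each lease, and the storage service rejects any write carrying a token older than the highest one it has seen, so a client that was paused past its lease cannot clobber newer writes.

```python
# Hypothetical sketch of fencing tokens: leases come with strictly increasing
# tokens, and the storage service fences off writes with stale tokens.

class LockService:
    def __init__(self):
        self.next_token = 0

    def acquire(self):
        # Each lease carries a strictly increasing fencing token.
        self.next_token += 1
        return self.next_token


class StorageService:
    def __init__(self):
        self.max_token_seen = 0
        self.data = {}

    def write(self, key, value, token):
        # Reject stale writers (e.g., a client that was paused past its lease).
        if token < self.max_token_seen:
            raise RuntimeError(f"stale fencing token {token}, "
                               f"already saw {self.max_token_seen}")
        self.max_token_seen = token
        self.data[key] = value


if __name__ == "__main__":
    lock, storage = LockService(), StorageService()
    t1 = lock.acquire()                   # client 1 gets token 1, then pauses
    t2 = lock.acquire()                   # lease expires; client 2 gets token 2
    storage.write("x", "from-client-2", t2)
    try:
        storage.write("x", "from-client-1", t1)   # client 1 wakes up late
    except RuntimeError as e:
        print("rejected:", e)                     # the unchecked write is fenced off
```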

The book has a good discussion of why you should not use time-of-day clocks for measuring elapsed time (e.g., timeouts), but rather monotonic clocks. This is a basic but important point. If you also want to avoid clock-skew problems across processes and still get causal consistency, use hybrid logical clocks (HLC)!
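As a small illustration (Python standard library, my own example): measure elapsed time and enforce timeouts with the monotonic clock, because the time-of-day clock can be stepped backward or forward by NTP.

```python
import time

# Time-of-day clock: fine for human-readable timestamps, but NTP can step it
# backward or forward, so elapsed-time math done with it can be wrong.
wall_start = time.time()

# Monotonic clock: only moves forward, so it is the right tool for measuring
# durations, timeouts, and deadlines within a single process.
mono_start = time.monotonic()

time.sleep(0.1)  # stand-in for the operation being timed

print("monotonic elapsed:", time.monotonic() - mono_start)  # always >= 0
print("wall-clock elapsed:", time.time() - wall_start)      # may be skewed

# Timeout loop driven by the monotonic clock:
deadline = time.monotonic() + 0.5
while time.monotonic() < deadline:
    time.sleep(0.05)  # poll / retry some condition here
print("deadline reached")
```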


Additional Links

We recently reviewed the Gray-Reuter book. It has a great chapter on fault-tolerance early on; I had written a long summary/review of that chapter, and that discussion is relevant here.

Jim Waldo's "A Note on Distributed Computing" is relevant. "It is futile to paper over the distinction between local and remote objects...  such a masking will be impossible" 

Unfortunately, the self-stabilization approach to fault-tolerance hasn't gotten the attention/recognition it deserves. I think there are a lot of good ideas there: converging to the invariant, soft state, and stateless design.

For fault-tolerance in cloud computing I had written this in 2017. 

That was a long time ago, though. With more experience in cloud computing systems, I may want to write a follow-up to that post. In cloud computing systems, the control plane is where you design most of your fault-tolerance. Fault-tolerance does not work like a black box; it needs to be custom-fitted to your application/service. So you end up with customized control plane designs. Control planes almost always use a centralized controller to make decisions. This simplifies things a lot, but it isn't the best approach in terms of latency. It may be advantageous to have autonomous low-layer quick decisions to fix (self-stabilize) problems for low-latency recovery, and to augment these with centralized high-level decisions to check/intervene further when needed. There hasn't been much innovation in this domain. It would be a good area to research.
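Here is a rough sketch of that two-level idea, with entirely hypothetical names: a fast node-local reconciler self-stabilizes the common failures immediately, while a slower centralized controller keeps the global view and intervenes only when local repair doesn't converge.

```python
import time

# Hypothetical sketch of a two-level control plane: a fast local reconciler
# handles common-case failures immediately, while a slower centralized
# controller checks global state and intervenes when needed.

class LocalReconciler:
    """Runs on each node; detects and repairs local faults with low latency."""
    def __init__(self, node):
        self.node = node

    def run_once(self):
        if not self.node.get("healthy", True):
            # Self-stabilize: quick local fix without waiting for the controller.
            self.node["healthy"] = True
            print(f"[local] restarted worker on {self.node['name']}")


class CentralController:
    """Holds the global view; makes slower, higher-level decisions."""
    def __init__(self, nodes):
        self.nodes = nodes

    def run_once(self):
        unhealthy = [n["name"] for n in self.nodes if not n.get("healthy", True)]
        if unhealthy:
            # Escalation path: local repair did not converge, act globally.
            print(f"[central] re-placing load away from {unhealthy}")


if __name__ == "__main__":
    nodes = [{"name": "n1", "healthy": True}, {"name": "n2", "healthy": False}]
    local = [LocalReconciler(n) for n in nodes]
    central = CentralController(nodes)
    for _ in range(2):              # a couple of control-loop ticks
        for r in local:
            r.run_once()            # fast path: runs every tick
        central.run_once()          # slow path: would run less frequently
        time.sleep(0.1)
```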

Finally, here are some fault-tolerance tips from my "Hints for distributed systems design" post. I like this one: "Feel the pain".
