DDIA: Chp 8. The Trouble with Distributed Systems
This is a long chapter that touches on many things. Here is the table of contents.

Faults and partial failures
Unreliable networks
    Detecting faults
    Timeouts and unbounded delays
    Sync vs. async networks
Unreliable clocks
    Monotonic vs. time-of-day clocks
    Clock sync and accuracy
    Relying on sync clocks
    Process pauses
Knowledge, truth, and lies
    Truth defined by majority
    Byzantine faults
    System model and reality

I don't know if listing a deluge of problems is the best way to approach this chapter. Reading about these problems is fine, but it doesn't mean you learn them. I think you need time and hands-on work to internalize them.

What can go wrong?

Computers can crash. Unfortunately, they don't fail cleanly. Fail-fast is failing fast! And, again unfortunately, partial failures (limping computers) are very difficult to deal with. Even worse, with transistor density so high, we now need to deal with silent failures. We have memory corruption and even silent faults from CPUs. The HPTS'24 s