How Complex Systems Fail

This is a 4-page report about the nature of failures in complex systems. It is a gloomy report. It says that complex systems are always riddled with faults, and will fail when some of these faults conspire and cluster. In other words, complex systems constantly dwell on the verge of failure, outage, or accident.

The writing of the report is peculiar. It is written as a list of 18 items (ooh, everyone loves lists). But the items are not independent. For example, it is hard to understand items 1 and 2 until you read item 3. Items 1 and 2 are in fact laying the foundations for item 3.

The report is written by an MD, and is primarily focused on healthcare-related complex systems, but I think almost all of the points also apply to other complex systems, and in particular to cloud computing systems. In two recent posts (Post1, Post2), I covered papers that investigate failures in cloud computing systems, so I thought this report would be a nice complement to them.

1) Complex systems are intrinsically hazardous systems.
I think the right wording here should be "high-stakes" rather than "hazardous". For example, cloud computing is not "hazardous" but it is definitely "high-stakes".

2) Complex systems are heavily and successfully defended against failure.
Is there an undertone here which implies that these defense mechanisms contribute to making these high-stakes systems even more complex?

3) Catastrophe requires multiple failures – single point failures are not enough.
This is because the anticipated failure modes are already well guarded.

4) Complex systems contain changing mixtures of failures latent within them.
"Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident. The failures change constantly because of changing technology, work organization, and efforts to eradicate failures." This is pretty much the lesson from the cloud outages study. Old services fail as much as new services, because the playground keeps changing.

5) Complex systems run in degraded mode.
"A corollary to the preceding point is that complex systems run as broken systems."

6) Catastrophe is always just around the corner.

7) Post-accident attribution to a ‘root cause’ is fundamentally wrong.
"Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident."
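
To make the "necessary but individually insufficient" idea concrete, here is a minimal toy sketch of my own (it is not from the report). The three fault names are hypothetical, and the assumption that an accident needs all of them at once is a deliberate simplification.

```python
from itertools import combinations

# Hypothetical latent faults, invented purely for illustration.
LATENT_FAULTS = ("stale_config", "disabled_alarm", "overloaded_replica")


def accident_occurs(active_faults):
    """Toy model: an accident happens only when every contributor is present."""
    return set(LATENT_FAULTS) <= set(active_faults)


def minimal_causes():
    """Return the smallest fault sets that are sufficient for an accident."""
    for size in range(len(LATENT_FAULTS) + 1):
        hits = [set(c) for c in combinations(LATENT_FAULTS, size)
                if accident_occurs(c)]
        if hits:
            return hits
    return []


if __name__ == "__main__":
    # No single fault is sufficient on its own ...
    for fault in LATENT_FAULTS:
        print(fault, "alone causes accident:", accident_occurs([fault]))
    # ... and the only minimal sufficient set contains all of them.
    print("minimal sufficient sets:", minimal_causes())
```

In this toy model, asking "which one was the root cause?" has no good answer: removing any one of the three faults would have prevented the accident, so each has an equal claim.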

8) Hindsight biases post-accident assessments of human performance.
The "everything is obvious in hindsight" fallacy was covered well in this book.

9) Human operators have dual roles: as producers & as defenders against failure.
10) All practitioner actions are gambles.
11) Actions at the sharp end resolve all ambiguity.

12) Human practitioners are the adaptable element of complex systems.
What about software agents? They can also react adaptively to developing situations, and with machine learning and deep learning, even more so today.

13) Human expertise in complex systems is constantly changing.
14) Change introduces new forms of failure.
The cloud outages survey showed that updates, configuration changes, and human factors together account for more than one third of outages.

15) Views of ‘cause’ limit the effectiveness of defenses against future events.
Case-by-case addition of fault-tolerance is not very effective. "Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system."

16) Safety is a characteristic of systems and not of their components.
Safety is a system-level property; unit testing of components is not enough.
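
A small sketch of how this can play out in software, again my own example rather than something from the report: a retry wrapper that looks perfectly reasonable under a unit test, yet multiplies load when several such layers are stacked. The names `make_layer` and `FlakyBackend` are hypothetical.

```python
def make_layer(downstream, attempts=3):
    """Wrap a callable so it is retried up to `attempts` times on failure."""
    def call():
        for _ in range(attempts):
            try:
                return downstream()
            except RuntimeError:
                continue  # retry on failure
        raise RuntimeError("layer gave up")
    return call


class FlakyBackend:
    """Fails the first `failures_before_success` calls, then succeeds."""

    def __init__(self, failures_before_success):
        self.calls = 0
        self.failures_before_success = failures_before_success

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures_before_success:
            raise RuntimeError("backend unavailable")
        return "ok"


if __name__ == "__main__":
    # Component view: one layer masks a transient failure, as intended.
    backend = FlakyBackend(failures_before_success=2)
    print(make_layer(backend)(), "after", backend.calls, "calls")

    # System view: three stacked layers hammer a backend that stays down.
    down = FlakyBackend(failures_before_success=10**9)
    stacked = make_layer(make_layer(make_layer(down)))
    try:
        stacked()
    except RuntimeError:
        pass
    print("calls against the failing backend:", down.calls)
```

Each layer passes its own test, but the amplification (attempts multiplied across layers) only shows up when the layers are composed, which is the sense in which safety belongs to the system rather than to any component.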

17) People continuously create safety.
18) Failure-free operations require experience with failure.
What doesn't kill you makes you stronger. In order to grow, you need to push the limits, and stress the system. Nassim Taleb's book about antifragility makes similar points.
And this short video on resilience is simply excellent.

