Paper review. An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems

Crashes happen. In fact they occur so commonly that they are classified as anticipated faults. In a large cluster of several hundred machines, you will have one node crashing every couple of hours. Unfortunately, as this paper shows, we are still not very competent at handling crash failures.

This paper from 2018 presents a comprehensive empirical study of 103 crash recovery bugs from 4 popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For all the studied bugs, they analyze their root causes, triggering conditions, bug impacts and fixing.

Summary of the findings

Crash recovery bugs are caused by five types of bug patterns:
  • incorrect backup (17%)
  • incorrect crash/reboot detection (18%)
  • incorrect state identification (16%)
  • incorrect state recovery (28%) 
  • concurrency (21%)

Almost all (97%) of crash recovery bugs involve no more than four nodes. This finding indicates that we can detect crash recovery bugs in a small set of nodes, rather than thousands.

A majority (87%) of crash recovery bugs require a combination of no more than three crashes and no more than one reboot. It suggests that we can systematically test almost all node crash scenarios with very limited crashes and reboots.

Crash recovery bugs are difficult to fix. 12% of the fixes are incomplete, and 6% of the fixes only reduce the possibility of bug occurrence. This indicates that new approaches to validate crash recovery bug fixes are necessary.

A generic crash recovery model 

The study uses this model for categorizing parts of crash-recovery of a system.

The study leverages on the existing cloud bug study database, CBS. CBS contains 3,655 vital issues from six distributed systems (ZooKeeper, Hadoop MapReduce, Cassandra, HBase, HDFS and Flume), reported from January 2011 to January 2014. The dataset of the 103 crash recovery bugs this paper analyzes is available at

Root cause

Finding 1: In 17/103 crash recovery bugs, in-memory data are not backed up, or backups are not properly managed.

Finding 2: In 18/103 crash recovery bugs, crashes and reboots are not detected or not timely detected. (E.g., when a node crashes, some other relevant nodes may access the crash node without perceiving that the node has crashed, and they may then hang or throw errors. Or, if a crash node reboots very quickly, then the crash may be overlooked by the crash detection component based on timeout. Thus, the crash node may contain corrupted states, and the system has nodes whose state is out-of-synch, violating invariants.)

Finding 3: In 17/103 crash recovery bugs, the states after crashes/reboots are incorrectly identified. (E.g., the recovery process mistakenly considers wrong states as incorrect: the node may think it is still the leader after recovery.)

Finding 4: The states after crashes/reboots are incorrectly recovered in 29/103 crash recovery bugs. Among them, 14 bugs are caused by no handling or untimely handling of certain leftovers.

Finding 5: The concurrency caused by crash recovery processes is responsible for 22/103 crash recovery bugs. (Concurrency between a recovery process and a normal execution amounts to 13 bugs, concurrency between two recovery processes amount to 4 bugs, and concurrency within one recovery process amounts to 5 bugs.)

Finding 6: All the seven recovery components in our crash recovery model can be incorrect and introduce bugs. About one third of bugs are caused in crash handling component. (Overall, the crash handling component is the most error-prone (34%). The next top three components, backing up (19%), local recovery (17%) and crash detection (13%) also occupy a large portion.)

Finding 7: In 15/103 crash recovery bugs, new relevant crashes occur on/during the crash recovery process, and thus trigger failures.

So it seems that for 85 out of 103 bugs (excluding the 17 bugs from Finding 1) state-inconsistency across nodes is the culprit. I don't want to be the guy who always plugs "invariant-based design" but, come on... Invariant-based design is what we need here to prevent the problems that arose from operational reasoning. In operational reasoning, you start with a "happy path", and then you try to figure out "what could go wrong?" and how to prevent them. Of course, you always fall short in that enumeration of problem scenarios and overlook corner cases, race conditions, and cascading failures. In contrast, invariant-based reasoning focuses on "what needs to go right?" and how to ensure this properties as invariants of your system at all times. Invariant-based reasoning takes a principled state-based rather than operation/execution-based view of your system. To attain invariant-based reasoning, we specify safety and liveness properties for our models. Safety properties specify "what the system is allowed to do" and ensures "nothing bad happens". For example, at all times, all committed data is present and correct. Liveness properties specify "what the system should eventually do" and ensures "something good eventually happens". For example, whenever the system receives a request, it must eventually respond to that request.

Here is another argument for using invariant-based design, and as an  example, you can check my post on the two-phase commit protocol.

Triggering conditions

Finding 8: Almost all (97%) crash recovery bugs involve four nodes or fewer.

Finding 9: No more than three crashes can trigger almost all (99%) crash recovery bugs. No more than one reboot can trigger 87% of the bugs. In total, a combination of no more than three crashes and no more than one reboot can trigger 87% (90 out of 103) of the bugs.

Finding 10: 63% of crash recovery bugs require at least one client request, but 92% of the bugs require no more than 3 user requests.

Finding 11: 38% of crash recovery bugs require complicated input conditions, e.g., special configurations or background services.

Finding 12: The timing of crashes/reboots is important for reproducing crash recovery bugs.

Bug impact

Finding 13: Crash recovery bugs always have severe impacts on reliability and availability of distributed systems. 38% of the bugs can cause node downtimes, including cluster out of service and unavailable nodes.

Finding 14: Crash recovery bugs are difficult to fix. 12% of the fixes are incomplete, and 6% of the fixes only reduce the possibility of bug occurrence.

Finding 15: Crash recovery bug fixing is complicated. Amounts of developer efforts were spent on these fixes.

MAD questions

1. Why is this not a solved problem?
The crash faults have been with us for a long time. They have become even more relevant with the advent of containers in cloud computing, which may be shutdown or migrated for resource management purposes. If so, why are we still not very competent at handling crash-recovery?

It even seems like we have some tool support as well. There are a lot of write-ahead-logging available to deal with the bugs in Finding 1 related to backing up in-memory data. (Well, that is assuming you have a good grasp on which data are important to backup via write-ahead-logging.) We can use ZooKeeper for keeping configuration information, so that the nodes involved don't have differing opinions about which node is down. While keeping the configuration in ZooKeeper helps alleviate some of the state-inconsistency problems, we still need invariant-based design (and model-checking the protocol) to make sure the protocol does not suffer from state-inconsistency problems.

This is a sincere question, and not a covert way to suggest that developers are being sloppy. I know better than that and I have a lot of respect for the developers. That means the problem is more sticky hairy at the implementation level, and our high-level abstractions are leaking. There has been relevant work "On the complexity of crafting crash-consistent applications" in OSDI'14. There is also recent followup work on "Protocol-Aware Recovery for Consensus-Based Distributed Storage" which I like to read soon.

Finally, John Ousterhout had work on write-ahead-logging and recovery in in-memory systems as part of RamCloud project, and I should check recent work from that group.

2. How does this relate to crash-only software?
It is unfortunate that the crash-only software paper has not been cited and discussed by this paper, because I think crash-only software suggested a good, albeit radical, way to handle crash-recovery. As the findings in this paper show a big part of the reason the bugs occur is because the crash-recovery paths are not exercised/tested enough during development and even normal use. The "crash only software" work had the insight to exercise crashes as part of normal use: "Since crashes are unavoidable, software must be at least as well prepared for a crash as it is for a clean shutdown. But then --in the spirit of Occam's Razor-- if software is crash-safe, why support additional, non-crash mechanisms for shutting down? A crash-only system makes it affordable to transform every detected failure into component-level crashes; this leads to a simple fault model, and components only need to know how to recover from one type of failure."

3. How does this compare with TaxDC paper? 
The TaxDC paper studied distributed coordination (DC) bugs in the cloud and showed that more than 60% of DC bugs are triggered by a single untimely message delivery that commits order violation or atomicity violation, with regard to other messages or computation. (Figure 1 shows possible triggering patterns.) While this claim sounds like a very striking and surprising finding, it is actually straightforward. What is a DC bug? It is a manifestation of state inconsistency across processes. What makes the state inconsistency into a bug? A communication/message-exchange between the two inconsistent processes.

Compared to the TaxDC paper, this paper focuses on a small set of bugs, only the crash-recovery bugs. In comparison to the TaxDC paper, the paper states that crash recovery bugs are more likely to cause fatal failures than DC bugs. In contrast to 17% of DC bugs, a whopping 38% of the crash-recovery bugs caused additional node downtimes, including cluster out of service and unavailable nodes.


Popular posts from this blog

Graviton2 and Graviton3

Foundational distributed systems papers

Learning a technical subject

Your attitude determines your success

Learning about distributed systems: where to start?

Progress beats perfect

CockroachDB: The Resilient Geo-Distributed SQL Database

Warp: Lightweight Multi-Key Transactions for Key-Value Stores

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Anna: A Key-Value Store For Any Scale