Crash-only software, HOTOS'03

Here is the summary for the second paper we covered in the first class of our seminar. The paper has a provocative title, and in fact there is a wikipedia entry on this title: "Crash-only software refers to computer programs that handle failures by simply restarting, without attempting any sophisticated recovery. Correctly written components of crash-only software can microreboot to a known-good state without the help of a user. Since failure-handling and normal startup use the same methods, this can increase the chance that bugs in failure-handling code will be noticed..." (We had previously seen this observation in the last section of Ousterhout's "The role of distributed state" paper.)

Motivation
Since crashes are unavoidable, software must be at least as well prepared for a crash as it is for a clean shutdown. But then --in the spirit of Occam's Razor-- if software is crash-safe, why support additional, non-crash mechanisms for shutting down? A frequent reason is the desire for higher performance. For example, to avoid slow synchronous disk writes, many UNIX file systems cache metadata updates in memory. As a result, when a UNIX workstation crashes, the file system reaches an inconsistent state that takes a lengthy fsck to repair, an inconvenience that could have been avoided by shutting down cleanly. This captures the design tradeoff that improves steady state performance at the expense of shutdown and recovery performance. But, if the cost of such performance enhancements is dependability, perhaps it's time to reevaluate our design strategy.

The major benefit of a crash-only design is the following: A crash-only system makes it affordable to transform every detected failure into component-level crashes; this leads to a simple fault model, and components only need to know how to recover from one type of failure. If we state invariants about the system's failure behavior and make such behavior predictable, we are effectively coercing reality into a small universe governed by well-understood laws. If we don't use a crash-only system and allow any fault and recovery behavior, the resulting set of states becomes very big and complex due to many possible states.

Requirements of Crash-Only Software
To make components crash-only, the crash-only approach requires that all important non-volatile state be kept in dedicated crash-only state stores, leaving applications with just program logic. Specialized state stores (e.g., databases, file system appliances, distributed data structures, non-transactional hashtables, session state stores, etc.) are much better suited to manage state than code written by developers with minimal training in systems programming. Applications become stateless clients of the state stores, which allows them to have simpler and faster recovery routines. (This requirement will hurt the system performance unavoidably.)

To make a system of interconnected components crash-only, it must be designed so that components can tolerate the crashes and temporary unavailability of their peers. Thus, the approach prescribes strong modularity with relatively impermeable component boundaries, timeout-based communication and lease-based resource allocation, and self-describing requests that carry a time-to-live and information on whether they are idempotent. Many Internet systems today have some subset of these properties as they are built from many heterogenous components, accessed over standard request-reply protocols such as HTTP, and serve workloads that consist of large numbers of relatively short tasks that frame state updates.

What can go wrong here?
The paper mentions the following potential problem: The dynamics of loosely coupled systems can sometimes be surprising. For example, resubmitting requests to a component that is recovering can overload it and make it fail again; so the RetryAfter exceptions provide an estimated time-to-recover. To prevent reboot cycles and other unstable conditions during recovery, it is possible to quiesce the system when a set of components is being crash-rebooted. This can be done at the communication/RPC layer, or for the system as a whole. In our prototype, we use a stall proxy in front of the web tier to keep new requests from entering the system during the recovery process.

Taking things further, I can think of more scary scenarios. A component restart may take some time and, hence, may trigger restarts or problems at other components that depend on this component. As a result one restart may trigger a system-wide restart storm. The silver-lining in this cloud is that, hopefully, by designing thr system to be crash-only, you will test for and notice these problems before deployment as these cases will be exercised often.

Another complaint is that the crash-only approach puts some burden on the developers. It requires the components are designed to be very loosely coupled. It requires the developer to write code to keep track of lost requests, and retry the lost requests if they are idempotent, else either perform a rollback recovery or apply a compensating operation for it or somehow tolerate the inconsistency. While these code are for the component-level recovery (and thankfully not system-wide recovery), the code may still grow too complex for the developer to handle. Again, this may be unavoidable, fault-tolerance comes with a cost.

My another concern about the crash-only approach is a continuous cyclic corruption of components (this case is different than "the recorruption of a component during recovery" mentioned two paragraphs above). In this scenario, faults in component A leak and corrupts component B, and after A restarts and corrects itself, this time faults leak from component B to re-contaminate/corrupt A. Rinse repeat the above loop, and we have a vicious cycle of on-going corruption/contamination. My advisor Anish Arora had a solution to this problem for composition of self-stabilizing components. The solution used dependency graphs on the corruption and correction relations among components and accordingly prescribed some blocking wrappers to freeze contamination and break cycles. I found that the authors of the crash-only approach has a simpler solution to this, which they provide in their recursive-restartability paper. Their idea is to first attempt recovery of a small subset of components (say only one component), and if restart proves ineffective, the subsequent attempts recover progressively larger subsets. In other words this technique chases fault through successive boundaries, and can thus break the corruption cycle. It is not very efficient but it is simple and it works eventually.

The paper is realistic about the limitations of the crash-only approach, and does not paint an all rosy picture. Here are some quotes from the paper on this:
  • Building crash-only systems is not easy; the key to widespread adoption of our approach will require employing the right architectural models and having the right tools.
  • We are focusing initially on applications whose workloads can be characterized as relatively short-running tasks that frame state updates. Substantially all Internet services fit this description, in part because the nature of HTTP has forced designers into this mold. We expect there are many applications outside this domain that could not easily be cast this way, and for which deriving a crash-only design would be impractical or infeasible.
  • We expect throughput to suffer in crash-only systems, but this concern is secondary to the high availability and predictability we expect in exchange.
Comparison to self-stabilization
I think the crash-only approach is a special (more blackbox and more scalable) instance of self-stabilizing system design.

The crash-only approach uses the component crash abstraction to abstract away different fault-types and different arrival sequence of these faults. Similarly, the self-stabilization approach uses the arbitrary state corruption abstraction to avoid the need to characterize the effects of different types of faults. Of course the difference between the stabilization approach and crash-only approach is that stabilization requires access to the code and requires figuring out the good states of the system to converge to. In other words, stabilization is a chisel and crash-only is a large hammer. But, for scalability of adding fault-tolerance to large systems, you may be better off using a hammer than a chisel.

State-store approach and keeping application logic stateless also connects back to the self-stabilization approach. Building a stateless system is a trivial way to make a system self-stabilizing. If you don't have any state to corrupt, the system is trivially self-stabilizing. (Similarly, if the system is built to be soft-state, then the corrupted state expires with time and new/correct information is written in the new state, so the system is again easily shown to be self-stabilizing.)

Final remarks
The paper includes the following sentence in the conclusions section, which I think is a very good summary/evaluation of the approach: "Writing crash-only components may be harder, but their simple failure behavior can make the assembly of such components into large systems easier."

Yes, crash-only approach provides some compositionality of fault-tolerance because it uses restarts to transform arbitrary faults into crashes and each crash-only component is written anticipating that it will interact with other crash-only components so it can tolerate crashes of other components. However, things are not always easy in practice and there are caveats as we mentioned above. Some faults can leak through the crash-only abstraction and contaminate other components. Also there could be hidden/emergent interactions/couplings between components that the developer needs to tune.

Comments

Anonymous said…
This comment has been removed by a blog administrator.

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book

Use of Time in Distributed Databases (part 2): Use of logical clocks in databases