Paper review. Gray Failure: The Achilles' Heel of Cloud-Scale Systems

- September 13, 2019

This paper (by Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao) occurred in HotOS 2017. The paper is an easy read at 5 pages, and considers the fault-tolerance problem of cloud scale systems.

Although cloud provides redundancy for masking and replacing failed components, this becomes useful only if those failures can be detected. But some failures that are partial and suble failures remain undetected and these "gray failures" lead to major availability breakdowns and performance anomalies in cloud environments. Examples of gray failures are performance degradation, random packet loss, flaky I/O, memory thrashing/leaking, capacity pressure, and non-fatal exceptions.

The paper identifies a key feature of gray failure as differential observability. Consider the setup in Figure 2. Within a system, an observer gathers information about whether the system is failing or not. Based on the observations, a reactor takes actions to recover the system. The observer and reactor are considered part of the system. A system is defined to experience gray failure when at least one app makes the observation that system is unhealthy, but observer observes that system is healthy.

This setup may seem appealing, but I take issue with it. Before buying into this setup did we consider a simpler and more general explanation? How about this one? Gray failures occur due to a gap/omission in the specification of the system. If we had formally defined the "healthy" behaviors we want from the system, then we would be able to notice that our detectors/observers are not able to detect the "unhealthy" behaviors sufficiently. And we would look into strengthening the observers with more checks or by adding some end-to-end observers to detect the unhealthy behaviors. In other words, if we had cared about those unhealthy behaviors, we should have specified them for detection, and developed observers for them.

The paper states that the realization about differential observability as the cause of gray failures implies that, "to best deal with them, we should focus on bridging the gap between different components' perceptions of what constitutes failure."

Even for cases where the underlying problem is simply that the observer is doing a poor job of detecting failures (so the gray failures are extrinsic and could be avoided by fixing the observer), such distributed observation can also be helpful. Where to conduct such aggregation and inference is an interesting design question to explore. If it is done too close to the core of a system, it may limit what can be observed. If it is near the apps, the built-in system fault-tolerance mechanisms that try to mask faults may cause differential observability to be exposed too late. We envision an independent plane that is outside the boundaries of the core system but nevertheless connected to the observer or reactor.

But this is a tall order. And this doesn't narrow the problem domain, or contribute to the solution. This is the classic instrumentation and logging dilemma. What should we log to be able to properly debug a distributed system? The answer depends on the system and what you care about the system, what you define as healthy behavior for the system.

I think for dealing with gray failures, one guiding principle should be this: don't ever mask a fault silently. If you mask a fault, do complain about it (raise exceptions and log it somewhere). If you mask failures without pointing out problems, you are in for an unexpected breakdown sooner or later. (This is also the case in human relationships.) Communication is the key. Without complaining and giving feedback, backpressure, the system may be subject to a boiling frog problem, as the baselines gradually slip and degrade.

Cloud anomalies with gray failures

In Section 2, the paper gives examples of cloud anomalies with gray failures, but I have problems with some of these examples.

The Azure IaaS service provides VMs to its customers using highly complex subsystems including compute, storage, and network. In particular, VMs run in compute clusters but their virtual disks lie in storage clusters accessed over the network. Even though these subsystems are designed to be fault-tolerant, parts of them occasionally fail. So, occasionally, a storage or network issue makes a VM unable to access its virtual disk, and thus causes the VM to crash. If no failure detector detects the underlying problem with the storage or network, the compute-cluster failure detector may incorrectly attribute the failure to the compute stack in the VM. For this reason, such gray failure is challenging to diagnose and respond to. Indeed, we have encountered cases where teams responsible for different subsystems blame each other for the incidents since no one has clear evidence of the true cause.

Isn't this a classic example of under-specification of healthy behavior. The VM should have specified its rely/expectation from the storage for its correct execution and accordingly included a detector for when that expectation is violated?

I also have a problem with "the high redundancy hurts" example.

As a consequence, we sometimes see cases where increasing redundancy actually lowers availability. For example, consider the following common workload pattern: to process a request, a frontend server must fan out requests to many back-end servers and wait for almost all of them to respond. If there are n core switches, the probability that a certain core switch is traversed by a request is $1-\frac{n-1}{n}*m$, where m is the fan-out factor. This probability rapidly approaches 100% as m becomes large, meaning each such request has a high probability of involving every core switch. Thus a gray failure at any core switch will delay nearly every front-end request. Consequently, increasing redundancy can counter-intuitively hurt availability because the more core switches there are, the more likely at least one of them will experience a gray failure. This is a classic case where considering gray failure forces us to re-evaluate the common wisdom of how to build highly available systems.

I think this is not a redundancy problem, this is the "you hit cornercases faster in big deployments" problem. This setup is in fact an abuse of distribution as it is the exact opposite of providing redundancy. This setup provides a conjunctive distribution as it makes success depend on success of each component, rather than a disjunctive distribution to make success depend on success of some components.

MAD questions

1. Is this related to robust-yet-fragile concept?
This notion of masked latent faults later causing big disruptions reminds be of the robust-yet-fragile systems concept. Robust-yet-fragile is about highly optimized tolerance. If you optimize your tolerance only for crash failures but not partial/gray failures, you will be very disappointment when you are faced with this unanticipated fault type.

A good example here is the glass. Glass (think of automobile glasses or gorilla glass) is actually very tough/robust material. You can throw pebbles, and even bigger rocks at it, and it won't break or scratch, well, up to a point that is. The glass is very robust to the anticipated faults (stressor) up to a point. But, exceed that point, and then the glass is in shambles. That shows an unanticipated stressor (a black swan event in Taleb's jargon) for the glass: a ninja stone. The ninja stone is basically a piece of ceramic that you take from the spark plug, and is denser than glass. So if you gently throw this very little piece of ceramic to your car window, it breaks in shambles.

This is called a robust-yet-fragile structure, and this is actually why we had the Titanic disaster. Titanic, the ship, had very robust panels, but again upto a point. When Titanic exceeded that point a little bit (with the iceberg hitting it), the panels broke into shambles, very much like the glass meeting ninja stone. Modern ships after Titanic, went for resilient, instead of robust (yet fragile) panels. The resilient panels bend easier, but they don't break as miserably. They still hold together to the face of an extreme stressor. Think of plastic; it is less robust but more resilient than glass.

The robust-yet-fragile effect is also known as highly optimized tolerance. If you optimize tolerance for one anticipated stressor, you become very vulnerable to another unanticipated fault. (Much like the closed Australian ecosystem.)

2. Is fuzzy logic applicable here?
It seems like instead of binary detectors which output healthy or failed, it is better to have detectors that give probabilities and confidence to their decisions. So I thought the phi accrual detectors should be a relevant work to consider for detecting gray failures. I don't know if there are any other fuzzy detectors work for identifying latent failures.

Update: Ryan Huang, one of the authors of the work, left a comment with insightful response to my questions. In the response, he includes links to followup work as well. https://docs.google.com/document/d/18Du33J1v3wOhqj-Vcuv5-wPnaweGhvnWFzenmHoxVcc/edit

Comments

Ryan Huang said…

Hi Murat,

Thank you for your post about our paper. I made a few comments but they exceed the limit, so I post them in a Google doc here. Would appreciate if you could take a look and let me know if it helps clarify some of your questions in the post.

Thanks!
Ryan

September 15, 2019 at 4:12 PM