Metastable failures in the wild

This paper appeared in OSDI'22. There is a great summary of the paper by Aleksey (one of the authors and my former PhD student, go Aleksey!). There is also a great conference presentation video from Lexiang.

Below I will provide a brief overview of the paper followed by my discussion points. This topic is very interesting and important, so I hope you have fun learning about this.

Metastability concept and categories

Metastable failure is defined as permanent overload with low throughput even after the fault-trigger is removed. It is an emergent behavior of a system, and it naturally arises from the optimizations for the common case that lead to sustained work amplification.

In this paper, the authors are able to capture/abstract the system behavior of interest in terms of two parameters, the load and capacity. If the load is above capacity, you have work piling up, right? Or if the capacity drops under the sustained load level, the same effect, right? Both of these create  a temporary pile up, and grows the queues. Hopefully this queue is going to decrease as the system catches up and you'll be good. But metastable failure is when this temporary pile up becomes permanent, and you don't get back to the stable state even after the original triggering event is removed. In other words, you get stuck in a metastable state/attractor and you are not able to get back to your big attractor, your stable/homeostasis state. (The term metastability comes from physics.) This leads to bad performance and unavailability because your system is grinding but not doing useful work. It is doing busy-work without sufficient goodput.

Figure 1 explores possible metastability categories, capturing them in four quadrants. Again, this is all about load and capacity. And we have two axes: x-axis is the triggering factor (either load increase or capacity decrease), and the y-axis is the sustaining factor (either load increase or capacity decrease). The combination results in the four quadrants.

The top-left quadrant is when a load-spike triggered the event, and created the temporary overload, and then timeout requests did retries and sustained the effect through load amplification again.

The top-right category is where the trigger is a capacity drop, and sustaining effect is workload increase. An example could be from a replicated state machine with Raft. Initially, the primary has a lapse/glitch and gets behind in processing requests. Then the followers keep retrying the pull, and this increases the load and sustains the metastable state of reduced goodput.

The bottom-left category is where the trigger comes from load-increase and sustaining effect comes from capacity decrease. They show/replicate this via an actual subsystem in Twitter, where the garbage collection kicks in due to quickly increased queue lengths, and degrades the performance of the system, and the system is stuck in the metastability rut without ever getting back to the normal stability state.

Finally, the bottom-right category is where the trigger comes from a capacity drop (say your cache is wiped out, and your serving capacity decreased) and the sustaining effect also comes from capacity drop (your database is overwhelmed by the sudden herd effect due to cache misses, and times out serving requests, and is not able to repopulate the cache). Marc Brooker had a great post about this in 2021.

The paper has a very nice evaluation section, where they replicate each of these categories by controlling the trigger duration/size and show the sustaining effect kicking in and keeping the system in metastable state until manual intervention happens. They are able to identify a stable region, vulnerable region, and thee metastable region, and often with a thin um margin, as shown in Figure 5. Also checkout 4.c and 6.c. These are impressive to witness! (The code is available at

Metastability induced outages in real world

The paper looks at 600 public postmortem incident reports at many companies, big and small, and identifies 21 metastable outages as shown in Table 1.

In about 35%, the trigger is due to load spikes. Buggy configuration or code deployments, and latent bugs are responsible for 45% of triggers. About 45% of the cases involve multiple triggers.

Retries induced load increase constitutes over 50% of the sustaining effects. Other factors include expensive error handling, lock contention, and performance degradation due to leader election churn.

Recovering from metastable failure is possible by breaking the sustaining effect cycle. This comes by way of intervention in the form of direct load-shedding, throttling, dropping requests, changing workload parameters, and sometimes through indirect load-shedding via reboots and policy changes.

My encoding of this

I really love this paper. I was involved at the initial state of the HotOS'21 paper that laid out the metastability concept (Brooker has a great post on that paper as well),  but I stopped my involvement due to work commitments. I guess I had to learn Mohamed's advice the hard way. Haha.

So here is my read and some lessons I draw from this paper.  

1. Being in the vulnerable region is not necessarily bad. Vulnerability is a spectrum. You want to be in the less risky part of vulnerable states, because you want to push you system for efficient use of resources. You cannot always afford to be in the very safe stable region, because of the cost/expense involved.

2. But that means, you run the risk of crossing over to the metastability region at any time. So what can you do about that? Feel the pain! Don't mask the pain, feel the pain, and attribute the pain to the correct subsystem and shed load quickly so you do not trip over to the metastable state. If you get stuck in the metastability state, you elongate the unavailability, and need to shed load in even at a bigger scale, and need to do big reset, because this runs the risk of cascading to other subsystems and bringing them down. How do you feel the pain, and act quickly? What should you monitor? Rebecca Isaacs suggests monitoring the rate/derivative of queue length increase is useful, but this is not commonly monitored.

3. A meta lesson is, don't DOS yourself! Design your system so it doesn't inadvertently launch a denial of service on itself. That is avoid an asymmetric request to response ratio. An easy thing to observe is to be careful about retries. Don't blindly retry, because you are causing work/load amplification. But there is a more general principle behind this meta lesson. See the next point.

4. Don't overoptimize one part of your system/protocol to the detriment of creating an asymmetric work for response (work amplification) in other cases. You can see this disproportional overoptimization at play in the cache example. When your system is thrown out of the overoptimized common case of 80-90% cache hit rate to 0% hit rate, it becomes unable to handle the new work and fails to recover from this overwhelmed metastable state. As Aleksey Charapko puts it "maintain a performance gradient" in your system. When you overoptimize one part you are making yourself more susceptible to metastable failure. (This is similar to the robust-yet-fragile design versus resilient design approach.)

Related to this fourth point, I want to mention DynamoDB's excellent treatment of the issue. This comes from "Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service (USENIX ATC 2022)". To improve system stability, DynamoDB prioritizes predictability over absolute efficiency. While components such as caches can improve performance, DynamoDB does not allow them to hide the work that would be performed in their absence, ensuring that the system is always provisioned to handle the unexpected.

"When a router received a request for a table it had not seen before, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changes, the cache hit rate was approximately 99.75 percent. The downside is that caching introduces bimodal behavior. In the case of a cold start where request routers have empty caches, every DynamoDB request would result in a metadata lookup, and so the service had to scale to serve requests at the same rate as DynamoDB. A new partition map cache was deployed on each request router host to avoid the bi-modality of the original request router caches. In the new cache, a cache hit also results in an asynchronous call to MemDS to refresh the cache. Thus, the new cache ensures the MemDS fleet is always serving a constant volume of traffic regardless of cache hit ratio. The constant traffic to the MemDS fleet increases the load on the metadata fleet compared to the conventional caches where the traffic to the backend is determined by cache hit ratio, but prevents cascading failures to other parts of the system when the caches become ineffective."

Before I finish, I want to discuss couple other points below.

Self-stabilization and metastability

Self-stabilization is about the system's ability to make forward progress from any faulty state to converge back to the good states. Instead of trying to figure out how much faults can disrupt the system's operation, stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization then advocates designing recovery actions that takes the program back to invariant states starting from any arbitrary state. In other words, we design the good states (invariant states) to be the only attractor, and wait for the system converges to that. You can see my mentions of stabilization in the blog here, and in relation to cloud computing here.

It looks like there is a relation here to metastability. Could the stabilization work help for metastability? I guess the answer is, no, not directly. Stabilization theory is concerned with state-space corruption and stabilization to good states. It doesn't capture (or even register) performance degradation as a fault. Only state corruption/missynchronization is seen as a fault. Stabilization theory doesn't recognize goodput reduction and busy work as a failure.

There has been control theory approaches to self-stabilization, like Lyapunov stability. Maybe Lyapunov stability, a continuous form of stabilization idea, can produce a formulation that could be useful for alleviating metastability.

Static stability and metastability

Inside Amazon/AWS, static stability is a fault-tolerance design principle that is well-known and applied. The origin of the idea comes from mechanical engineering. When applied in the context of cloud computing systems, static stability refers to the characteristics for the data plane to remain at equilibrium in response to disturbances in the data plane without coordination through control plane. In other words, it is the ability for systems to continue to operate as best they can (with predefined fault-tolerance action/fallback) when they aren't able to coordinate.

What does static stability mean for metastable failures? My guess is that since it enables reaction quickly (without involving the control plane, with hardwired resilience/robustness) it may intercept the journey in vulnerable region towards the direction metastable region, and pull the system back to less vulnerable region.


Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

Always Measure One Level Deeper

Dude, where's my Emacs?

There is plenty of room at the bottom

Know Yourself