Analyzing Metastable Failures in Distributed Systems
So it goes: your system is purring like a tiger, devouring requests, until, without warning, it slumps into existential dread. Not a crash. Not a bang. A quiet, self-sustaining collapse. The system doesn’t stop. It just refuses to get better. Metastable failure is what happens when the feedback loops in the system go feral. Retries pile up, queues overflow, recovery stalls. Everything runs but nothing improves. The system is busy and useless.
In an earlier post, I reviewed the excellent OSDI ’22 paper on metastable failures, which dissected real-world incidents and laid the theoretical groundwork. If you haven’t read that one, start there.
This HotOS ’25 paper picks up the thread. It introduces tooling and a simulation framework to help engineers identify potential metastable failure modes before disaster strikes. It’s early-stage work. A short paper. But a promising start. Let’s walk through it.
Introduction
Like most great tragedies, metastable failure doesn't begin with villainy; it begins with good intentions. Systems are built to be resilient: retries, queues, timeouts, backoffs. An immune system for failure, so to speak. But occasionally that immune system misfires and attacks the system itself. Retries amplify load. Timeouts cascade. Error handling makes more errors. Feedback loops go feral and you get an Ouroboros, a snake that eats its tail in an eternal cycle. The system gets stuck in degraded mode, and refuses to get better.
This paper takes on the problem of identifying where systems are vulnerable to such failures. It proposes a modeling and simulation framework to give operators a macroscopic view: where metastability can strike, and how to steer clear of it.
Overview
The paper proposes a modeling pipeline that spans levels of abstraction: from queueing theory models (CTMC), to simulations (DES), to emulations, and finally, to stress tests on real systems. The further down the stack you go, the more accurate and more expensive the analysis becomes.
The key idea is a chain of simulations: each stage refines the previous one. Abstract models suggest where trouble might be, and concrete experiments confirm or calibrate. The pipeline is bidirectional: data from low-level runs improves high-level models, and high-level predictions guide where to focus concrete testing.
The modeling is done using a Python-based DSL. It captures common abstractions: thread pools, queues, retries, service times. Crucially, the authors claim that only a small number of such components are needed to capture the essential dynamics of many production services. Business logic is abstracted away as service-time distributions.
Figure 2 shows a simple running example used throughout the paper: a single-threaded server handling API requests at 10 RPS, serving a client that sends requests at 5 RPS, with a 5s timeout and five retries. The queue bound is 150. The goal is to understand when this setup tips into metastability and how to tune parameters to avoid that.
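To make the setup concrete, here is a rough sketch of how such a model description might look in Python. The class and field names below are my own invention for illustration; the paper's actual DSL is not spelled out at this level of detail.

```python
# Hypothetical sketch of the running example as a Python model description.
# Component and field names are invented; only the parameter values come from Figure 2.
from dataclasses import dataclass

@dataclass
class Server:
    threads: int          # worker threads pulling from the queue
    service_rate: float   # requests per second the server can process
    queue_bound: int      # requests beyond this are rejected

@dataclass
class Client:
    arrival_rate: float   # open-loop request rate in RPS
    timeout_s: float      # give up on a request after this long
    max_retries: int      # retries re-enter the queue, amplifying load

# The running example, expressed with the invented types above.
server = Server(threads=1, service_rate=10.0, queue_bound=150)
client = Client(arrival_rate=5.0, timeout_s=5.0, max_retries=5)
```

The point of the abstraction is that business logic collapses into a service-time distribution, and everything that matters for metastability lives in these few parameters.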
Continuous-Time Markov Chains (CTMC)
CTMC provides an abstract average-case view of a system, eliding the operational details of the constructs. Figure 3 shows a probabilistic heatmap of queue length vs. retry count (called orbit). Arrows show the most likely next state; lighter color means higher probability. You can clearly see a tipping point: once the queue exceeds 40 and retries hit 30, the system is likely to spiral into a self-sustaining feedback loop. Below that threshold, it trends toward recovery. This model doesn't capture fine-grained behaviors like retry timers, but it's useful for quickly flagging dangerous regions of the state space.
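To give a flavor of what such a model computes, here is a toy drift calculation over (queue, orbit) states. The transition structure is my guess at the kind of retrial-queue model involved, with rates borrowed from the running example; the paper's actual CTMC is surely more detailed.

```python
# Rough sketch of a CTMC-style drift analysis over (queue, orbit) states.
# Transition structure and retry/timeout rates are assumptions for illustration.
LAMBDA = 5.0          # fresh arrivals (RPS)
MU = 10.0             # service rate (RPS)
TIMEOUT_RATE = 1 / 5  # per-request rate of timing out and moving to the orbit
RETRY_RATE = 1 / 5    # per-orbiting-request rate of retrying back into the queue
QUEUE_BOUND = 150

def transitions(q, o):
    """Enabled transitions out of state (queue=q, orbit=o), with their rates."""
    out = []
    if q < QUEUE_BOUND:
        out.append((LAMBDA, (q + 1, o)))                  # fresh arrival joins the queue
        if o > 0:
            out.append((o * RETRY_RATE, (q + 1, o - 1)))  # an orbiting request retries
    if q > 0:
        out.append((MU, (q - 1, o)))                      # the server finishes a request
        out.append((q * TIMEOUT_RATE, (q - 1, o + 1)))    # a queued request times out
    return out

def most_likely_next(q, o):
    """The 'arrow' in the heatmap: the highest-rate enabled transition."""
    ts = transitions(q, o)
    return max(ts, key=lambda t: t[0])[1] if ts else (q, o)

print(most_likely_next(20, 10))  # below the threshold: service dominates, queue drains
print(most_likely_next(60, 40))  # above it: timeouts outpace service, work piles into the orbit
```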
Simulation (DES)
Discrete event simulation (DES) hits a sweet spot between abstract math and real-world mess. It validates CTMC predictions but also opens up the system for inspection. You can trace individual requests, capture any metric, and watch metastability unfold. The paper claims that operators often get their "aha" moment here, seeing exactly how retries and queues spiral out of control.
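Here is a minimal, from-scratch flavor of such a simulation, using heapq rather than the paper's framework. The load-spike trigger and the retry-on-timeout/retry-on-drop semantics are my own assumptions; the point is just how little machinery it takes to watch a temporary surge leave the queue saturated.

```python
# A from-scratch discrete-event simulation of the running example.
# The surge trigger and retry semantics below are assumptions for illustration.
import heapq, random

LAMBDA, MU = 5.0, 10.0           # client arrival rate and server service rate (RPS)
TIMEOUT, MAX_RETRIES = 5.0, 5    # client timeout and retry budget
QUEUE_BOUND = 150
SURGE = (50.0, 60.0, 30.0)       # (start, end, extra RPS): a temporary trigger
SIM_TIME = 300.0

random.seed(1)
events, queue = [], []           # event heap; FIFO of (req, attempt, deadline)
seq = 0                          # tie-breaker so the heap never compares payloads
done = set()                     # attempts that completed
busy = False
served = goodput = retries = dropped = 0

def push(t, kind, payload):
    global seq
    heapq.heappush(events, (t, seq, kind, payload)); seq += 1

def submit(now, req, attempt):
    """Client sends an attempt; it queues or is dropped, and the client
    retries after TIMEOUT unless the attempt completes first."""
    global busy, dropped
    push(now + TIMEOUT, "timeout", (req, attempt))
    if len(queue) >= QUEUE_BOUND:
        dropped += 1
        return
    queue.append((req, attempt, now + TIMEOUT))
    if not busy:
        busy = True
        push(now + random.expovariate(MU), "done", None)

push(random.expovariate(LAMBDA), "arrive", 0)
while events:
    now, _, kind, payload = heapq.heappop(events)
    if now > SIM_TIME:
        break
    if kind == "arrive":
        submit(now, payload, 0)
        lo, hi, extra = SURGE
        rate = LAMBDA + (extra if lo <= now < hi else 0.0)
        push(now + random.expovariate(rate), "arrive", payload + 1)
    elif kind == "done":
        req, attempt, deadline = queue.pop(0)
        done.add((req, attempt))
        served += 1
        goodput += now <= deadline          # served in time vs. wasted work
        if queue:
            push(now + random.expovariate(MU), "done", None)
        else:
            busy = False
    elif kind == "timeout":
        req, attempt = payload
        if (req, attempt) not in done and attempt < MAX_RETRIES:
            retries += 1
            submit(now, req, attempt + 1)   # retry re-enters the system

print(f"served={served} goodput={goodput} retries={retries} "
      f"dropped={dropped} queue at end={len(queue)}")
```

If the final queue length is still pinned near the bound long after the surge ended, you are looking at the metastable state: the trigger is gone, but retries keep the server underwater.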
Emulation
Figure 4 shows the emulator results. This stage runs a stripped-down version of the service on production infrastructure. This is not the real system, but its lazy cousin. It doesn't do real work (it just sleeps on request) but it behaves like the real thing under stress. The emulator confirms that the CTMC and DES models are on track: the fake server fails in the same way as the real one.
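As a sketch of the idea (not the paper's emulator), a sleep-based stand-in might look like this: same single worker, same bounded queue, same ~100 ms mean service time, none of the business logic.

```python
# A sleep-based stand-in for the real service: it keeps the shape of the system
# (single worker, bounded queue, service-time distribution) and fakes the work.
# My own minimal asyncio sketch, not the paper's emulator.
import asyncio, random

MU, QUEUE_BOUND = 10.0, 150      # service rate (RPS) and queue bound from the example

async def worker(queue):
    # The single "thread": serve one queued request at a time by sleeping.
    while True:
        writer = await queue.get()
        await asyncio.sleep(random.expovariate(MU))   # fake the service time
        writer.write(b"200 ok\n")
        await writer.drain()
        writer.close()

async def handle(reader, writer, queue):
    await reader.readline()                # read and ignore the request
    try:
        queue.put_nowait(writer)           # join the bounded queue...
    except asyncio.QueueFull:
        writer.write(b"503 overloaded\n")  # ...or get rejected immediately
        await writer.drain()
        writer.close()

async def main():
    queue = asyncio.Queue(maxsize=QUEUE_BOUND)
    worker_task = asyncio.create_task(worker(queue))   # keep a reference alive
    server = await asyncio.start_server(
        lambda r, w: handle(r, w, queue), "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```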
Testing
The final stage is real stress tests on real servers. It's slow, expensive, and mostly useless unless you already know where to look. And that's the point of the whole pipeline: make testing less dumb. Feed it model predictions, aim it precisely, and maybe catch the metastable failure before it catches you.
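A model-guided stress test then boils down to driving the system into the region the models flagged and checking whether it recovers once the trigger is gone. The sketch below targets the toy endpoint from the previous sketch; the phase schedule, timeout, and host/port are placeholders you would instead take from the CTMC/DES predictions and your real deployment.

```python
# A model-guided stress test sketch: baseline load, a surge past the predicted
# tipping point, then baseline again to see whether goodput and latency recover.
# Endpoint and phase schedule are assumptions for illustration.
import asyncio, random, time

HOST, PORT = "127.0.0.1", 8080
PHASES = [(5.0, 50.0), (35.0, 10.0), (5.0, 120.0)]   # (RPS, duration in seconds)

async def one_request(stats, latencies):
    start = time.monotonic()
    try:
        reader, writer = await asyncio.open_connection(HOST, PORT)
        try:
            writer.write(b"ping\n")
            await writer.drain()
            reply = await asyncio.wait_for(reader.readline(), timeout=5.0)
            stats["ok" if reply.startswith(b"200") else "shed"] += 1
        finally:
            writer.close()
    except Exception:
        stats["fail"] += 1
    latencies.append(time.monotonic() - start)

async def main():
    for rps, duration in PHASES:
        stats, latencies, tasks = {"ok": 0, "shed": 0, "fail": 0}, [], []
        t_end = time.monotonic() + duration
        while time.monotonic() < t_end:
            tasks.append(asyncio.create_task(one_request(stats, latencies)))
            await asyncio.sleep(random.expovariate(rps))   # open-loop arrivals
        await asyncio.gather(*tasks)
        lat = sorted(latencies)
        p99 = lat[int(0.99 * (len(lat) - 1))] if lat else float("nan")
        print(f"{rps:5.1f} RPS for {duration:5.1f}s -> "
              f"ok={stats['ok']} shed={stats['shed']} fail={stats['fail']} p99={p99:.2f}s")

asyncio.run(main())
```

Whether the third phase recovers or stays degraded depends on the retry behavior of the real clients; that comparison, aimed at the region the models flagged, is the test.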
Discussion
There may be a connection between metastability and self-stabilization. If we think in terms of a global variant function (say, system stress), then metastability is when that function stops decreasing and the system slips into an attractor basin from which recovery is unlikely. Real-world randomness might kick the system out of the basin, but sometimes it is stuck too deep for that. Probabilistic self-stabilization once explored this terrain, and it may still have lessons here.
The paper nods at composability, but doesn't deliver. In practice, feedback loops cross component boundaries, and metastability often emerges from these interdependencies; component-wise analysis may miss the whole. As we know from self-stabilization, composition is hard: it works by layering or superposition, not naive composition.
The running example in the paper is useful but tiny. The authors claim this generalizes, but don't show how. For a real system, like Kafka or Spanner, how many components do you need to simulate? What metrics matter? What fidelity is enough? This feels like a "marking territory" paper that maps a problem space.
There's also a Black Swan angle here. Like Taleb's rare, high-impact events, metastable failures are hard to predict, easy to explain in hindsight, and often ignored until it's too late. As with Black Swans, I think metastability is less about prediction and more about preparation: structuring our systems and minds to notice fragility before it breaks. The paper stops at identifying potential metastability risks; recovery is not considered. Load shedding would likely help, but we need theoretical and analytical guidance; otherwise it is too easy to do harm rather than aid recovery. Which actions would help nudge the system out of the metastable basin? In what order, so as not to cause further harmful feedback loops? What runtime signals suggest you're close?