A Case for Simulation-Driven Resilience in Agentic Data Systems
As I mentioned in my previous post, I traveled to San Jose at the end of May for the ACM CAIS conference. On Day 0, I gave a very short talk at the Supporting our AI Overlords (SAO) workshop. This post is the promised summary of our paper, "A Case for Simulation-Driven Resilience in Agentic Data Systems", joint work with Aleksey Charapko (University of New Hampshire) and Akshat Vig (MongoDB).
Metastability is critical for building the next generation of distributed systems
Our story starts with metastability. Metastability is the failure mode where the mechanisms built to protect the system (retries, queues, timeouts, load shedding) turn into amplifiers. Even after the trigger that caused the overload goes away, the system stays behind, churning through busy work, perpetually trying to catch up with the remnants of failed and behind-schedule tasks. It's a bit like missing some foundational math in high school. You spend so long backfilling the old gaps that you never keep up with the new material piling on top, so you stay permanently behind. The catch-up work is what keeps you behind, which requires more catch-up work later, which keeps you behind. (This is presumably why I'm not a rich machine learning scientist at DeepMind today.)
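To make the retry amplifier concrete, here is a toy discrete-time model in Python. It is only a sketch under loud assumptions, not anything from the paper: a single FIFO server with fixed capacity per tick, clients that time out after a fixed number of ticks and resend, and stale requests that stay queued as zombie work. The function name `simulate` and every parameter are made up for illustration.

```python
from collections import deque

def simulate(ticks=200, capacity=10, base_load=8, timeout=5,
             max_retries=3, spike_load=30, spike_start=20, spike_end=30):
    """Return goodput per tick for a FIFO server with retrying clients.

    Hypothetical toy model: all names and numbers are assumptions.
    """
    queue = deque()  # FIFO buckets of [enqueue_tick, attempt, count]
    goodput = []
    for t in range(ticks):
        # 1. New requests arrive (the spike is the temporary trigger).
        load = spike_load if spike_start <= t < spike_end else base_load
        queue.append([t, 0, load])

        # 2. Requests that have waited `timeout` ticks are given up on:
        #    the client resends, but the stale copy stays in the queue.
        retries = [[t, attempt + 1, count]
                   for (enq, attempt, count) in queue
                   if enq == t - timeout and attempt < max_retries]
        queue.extend(retries)

        # 3. Serve up to `capacity` requests, oldest first.
        budget, good = capacity, 0
        while budget > 0 and queue:
            enq, attempt, count = queue[0]
            served = min(budget, count)
            budget -= served
            if t - enq < timeout:      # client is still waiting: useful work
                good += served
            if served == count:
                queue.popleft()
            else:
                queue[0][2] = count - served
        goodput.append(good)
    return goodput

baseline = simulate(spike_load=8)   # no spike: load stays at base_load
spiked = simulate()                 # ten-tick spike, then back to base_load
print(baseline[-5:], spiked[-5:])
```

In this model the server keeps up indefinitely at the base load, but after a short spike it spends all its capacity on requests whose clients have already given up, so goodput stays at zero long after the spike has ended.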
Avoiding and tolerating metastable failures is critical to building the next generation of reliable distributed systems. Aleksey's Metastable Failures in the Wild study (OSDI'22) cataloged these failures from real production incidents. They matter so much because they are the hard faults that remain. We have largely learned to deal with the straightforward ones (crashes, corruptions, dropped packets), which leaves emergent performance failures as the final boss. Metastable failures are responsible for a disproportionate share of critical cloud unavailability incidents, with no single broken part to point at and no obvious way to fix/reset them when they emerge. The cloud economics make the problem worse. Providers and operators have every incentive to run with the absolute minimum of excess capacity and to trim slack and "waste". But this thin margin is exactly where metastability thrives.
Agents supercharge the feedback loops and put metastability on steroids
As AI agents are replacing humans as the primary clients of modern data systems, they are bringing a qualitative shift in workload characteristics. Agents retry aggressively while mutating the query on each attempt. They fan out into bursty parallel sub-tasks. They hold transactions open while they wait on an external LLM to provide the next step. The "Supporting our AI overlords" paper showed that agents create ~20x more branches and perform ~50x more rollbacks than humans do. These behaviors violate assumptions baked into every layer of a modern data system. Execution control assumes stationary arrivals. Caching assumes temporal locality. Concurrency control assumes bounded hold times. Agents break all three at once!
We propose simulation-driven resilience to address this problem
Simulation can enable us to systematically explore the agent-database boundary, and discover/prevent metastable failures before a production incident forces a reactive (and nonworking) patch. We propose a simulation-based approach because only simulation is cheap enough to sweep an enormous trigger space, and deterministic enough to replay and dissect every failure it finds. The alternatives each fall short:
- Benchmarks measure steady state. But metastability is transient and emergent: it lives in the sequence of rare events, not in the average.
- Queueing theory assumes mostly stationary independent arrivals. Agents break these assumptions.
- Testing with the production system is hopeless for design sweeps and gives you almost no observability. When your database falls over under load, you still don't know what went wrong, or how it went wrong, and you can't explore the design space because you have no feedback to explore it with.
MESSI finds failures in the seams
Metastability hides in the seams where subsystems are composed and interact, so we need a tool that lets us look there. Aleksey developed MESSI (MEtaStability SImulator), a discrete-event simulation framework for exploring metastability dynamics in distributed systems. MESSI models any (sub)system, or a composition of (sub)systems, as a directed graph. Logic Nodes implement routing and state policy. Processors model the physical resource constraints (queues, I/O delays, network latency). Individual work items, QItems, carry state as they traverse the graph. There is a clear separation of roles: policy lives in the nodes, resource contention lives in the edges, and you can vary each independently.
Because this is a simulator, it is deterministic and replayable, and it exposes the full internal state of every component at every tick. A metastable trigger you discover once can then be re-run against alternate designs to see which ones survive.
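To give a feel for the policy/resource split and for why a discrete-event simulator is replayable, here is a tiny Python sketch. It is not MESSI's API: the class names borrow the vocabulary above, but every method, field, and parameter is a hypothetical stand-in, and the processors are simplified to single FIFO servers with a fixed service time.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class QItem:
    """A work item that carries its own state through the graph."""
    item_id: int
    trace: list = field(default_factory=list)  # (processor name, finish time)

class Processor:
    """Resource model: one FIFO server with a fixed service time (assumed)."""
    def __init__(self, name, service_time):
        self.name = name
        self.service_time = service_time
        self.free_at = 0.0

    def finish_time(self, now):
        start = max(now, self.free_at)   # wait if the server is busy
        self.free_at = start + self.service_time
        return self.free_at

class LogicNode:
    """Policy model: sends each item through a fixed pipeline of processors."""
    def __init__(self, pipeline):
        self.pipeline = pipeline

    def route(self, item):
        step = len(item.trace)
        return self.pipeline[step] if step < len(self.pipeline) else None

def run(node, arrivals):
    """Process events in time order; return (completion time, item) pairs."""
    events = [(t, i, QItem(i)) for i, t in enumerate(arrivals)]
    heapq.heapify(events)
    done = []
    while events:
        now, seq, item = heapq.heappop(events)
        proc = node.route(item)
        if proc is None:                 # no more hops: the item is finished
            done.append((now, item))
            continue
        finish = proc.finish_time(now)
        item.trace.append((proc.name, finish))
        heapq.heappush(events, (finish, seq, item))
    return done

node = LogicNode([Processor("cpu", 2.0), Processor("disk", 3.0)])
for when, item in run(node, arrivals=[0.0, 0.0]):
    print(when, item.item_id, item.trace)
```

There is no wall-clock time or randomness in the loop, so the same arrivals always produce the same trace. Swapping the `LogicNode` policy while keeping the processors fixed (or the reverse) is the kind of independent variation described above.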
Our findings from the Execution Control System (ECS) simulations
Using MESSI, we performed an analysis of the Execution Control System (ECS), because it is the critical first domino: when the ECS fails, every subsystem downstream of it (caches, buffer pools, lock managers) fails after it. The ECS sits between admitted requests and the execution engine and decides who runs, in what order, and with how many resources. It is the component that mediates resource contention at the backend. The usual design hands out a bounded pool of execution tickets (one ticket buys one worker thread), sorts admitted requests into a few priority queues with different ticket budgets, and dispatches them to worker threads and I/O slots. But unlike an OS scheduler, which aims for fairness and wants every thread to eventually run, the ECS has to make decisive choices. It needs to prioritize the latency-sensitive short queries and shed the excess, as it aims to push the cost of waiting off the server and back onto the client. This is, of course, exactly what closes the feedback loop with a retrying agent.
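The ticket mechanism described above can be sketched in a few lines of Python. This is an illustrative toy, not the design from the paper: the class `TicketScheduler`, its methods, and the `max_queue` shedding rule are all assumptions made for the example.

```python
from collections import deque

class TicketScheduler:
    """Toy ECS: per-queue ticket budgets, strict priority, bounded queues."""

    def __init__(self, budgets, max_queue):
        # `budgets` maps queue name -> ticket budget; dict order is
        # treated as priority order (highest priority first).
        self.budgets = dict(budgets)
        self.in_use = {name: 0 for name in self.budgets}
        self.waiting = {name: deque() for name in self.budgets}
        self.max_queue = max_queue

    def submit(self, queue, request):
        """Enqueue a request, or shed it (return False) if the queue is full."""
        if len(self.waiting[queue]) >= self.max_queue:
            return False  # the cost of waiting moves back to the client
        self.waiting[queue].append(request)
        return True

    def dispatch(self):
        """Hand free tickets to waiting requests, highest priority first."""
        started = []
        for name in self.budgets:
            while self.waiting[name] and self.in_use[name] < self.budgets[name]:
                self.in_use[name] += 1   # one ticket buys one worker slot
                started.append((name, self.waiting[name].popleft()))
        return started

    def release(self, queue):
        """Return a ticket when a request finishes or yields."""
        self.in_use[queue] -= 1

ecs = TicketScheduler({"short": 2, "long": 1}, max_queue=2)
print(ecs.submit("short", "q1"), ecs.submit("short", "q2"), ecs.submit("short", "q3"))
ecs.submit("long", "scan1")
ecs.submit("long", "scan2")
print(ecs.dispatch())
ecs.release("long")
print(ecs.dispatch())
```

The third short query is shed because the short queue is already full. If the client behind it is a retrying agent, that rejection is what feeds the loop.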
We found two interesting results in our analysis.
Two reasonable policies may compose into a metastable loop
A natural ECS design uses two queues, a high-priority one for short tasks and a low-priority one for long tasks, with a probing policy on each that nudges its ticket count up or down to chase a performance metric. Each policy is sensible in isolation. But when you compose them, they can occasionally interfere with each other and lock into a metastable loop. The long queue probes for more tickets to improve its own throughput. More long-task tickets mean more threads contending for the same cores, which steals CPU from the short queue, which then escalates its ticket count to keep up. Now both queues are inflating until they slam into their hard limits. The trap is that ticket acquisition rate, the metric each policy is optimizing, stops predicting actual progress under overload. A queue full of waiting tasks can churn tickets at a furious rate (grab a ticket, do 1 ms of work, yield, repeat) while getting almost nothing done. The metric looks healthy while the system performance collapses.
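The following Python sketch illustrates the trap with a deliberately crude model. Everything in it is an assumption chosen for illustration rather than taken from our simulations: the hill-climbing rule, the formula for ticket acquisition rate (which grows with oversubscription because tasks churn), and the linear contention overhead on useful work.

```python
def probe_loop(steps=40, cores=8, hard_limit=32, overhead=0.015):
    """Two hill-climbing probes chasing ticket acquisition rate (toy model)."""
    tickets = {"short": 4, "long": 4}
    direction = {"short": 1, "long": 1}
    last_rate = {"short": 0.0, "long": 0.0}
    history = []
    for _ in range(steps):
        total = sum(tickets.values())
        oversub = max(0, total - cores)
        # Assumed: useful work shrinks as extra threads fight over the cores.
        goodput = min(total, cores) * max(0.0, 1.0 - overhead * oversub)
        for name in tickets:
            # Assumed metric: acquisitions per tick rise with the ticket
            # count, and churn under oversubscription inflates them further.
            rate = tickets[name] * (1.0 + 0.1 * oversub)
            # Hill climbing: reverse direction only if the metric got worse.
            if rate < last_rate[name]:
                direction[name] = -direction[name]
            tickets[name] = min(hard_limit, max(1, tickets[name] + direction[name]))
            last_rate[name] = rate
        history.append((dict(tickets), goodput))
    return history

history = probe_loop()
print("first step:", history[0])
print("last step: ", history[-1])
```

Under these assumptions the metric never gets worse from either probe's point of view, so both queues climb to the hard limit while goodput falls to a fraction of its starting value.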
Admission control and the ECS destructively interfere
Putting an admission controller in front of the ECS sounds like defense in depth, but this can also backfire when done naively. Admission control drops requests indiscriminately. It sits at the network edge, and when it sees elevated latency, it starts rejecting a fraction of everything, short and long tasks alike. In simulations, we found that a workload spike can trigger admission control immediately, even though the ECS, left alone, would have rebalanced its tickets and absorbed the load after a brief adjustment. But the drops from the admission control prevent the ECS from rebalancing. With work being shed out from under it, the ECS never gets the signal it needs to adjust allocations and priorities, and the system stays parked in reduced-goodput mode until the workload subsides on its own. Under agents this gets worse, because admission control can't tell an agent's first attempt from its fifth retry, and a rejection would usually cause the agent to escalate things.
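A back-of-the-envelope calculation shows why retry-blind rejection is risky. The sketch below assumes, purely for illustration, that each attempt is rejected independently with probability `p` and that a client resends immediately up to `max_retries` times. Real agents may do worse, since they can also mutate and fan out their retries. The function names are hypothetical.

```python
def expected_attempts(p, max_retries):
    """Expected attempts per request: 1 + p + p^2 + ... + p^max_retries."""
    return sum(p ** i for i in range(max_retries + 1))

def offered_load(base_rate, p, max_retries):
    """Attempts per second hitting admission control, retries included."""
    return base_rate * expected_attempts(p, max_retries)

print(expected_attempts(0.5, 4))
print(offered_load(1000, 0.5, 4))
```

Under these assumptions, rejecting half of all attempts with up to four retries yields 1.9375 attempts per request, so the rejections nearly double the traffic the controller is trying to shed.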
The full ECS design study is in our companion paper, "Towards Designing an Execution Control System with Metastability Resilience" (to appear at IEEE ICCCN 2026).