Mitigating Application Resource Overload with Targeted Task Cancellation
The Atropos paper (SOSP'25) argues that overload-control systems are built on a flawed assumption. They monitor global signals (like queue length or tail latency) and adjust admission control by throttling new arrivals or dropping random requests. This works when the bottleneck is CPU or network, but it fails when the real problem is inside the application. These systems see only the symptoms, not the source, so they end up dropping the victims rather than the culprits.
Real systems often run into overload because one or two badly timed requests monopolize an internal logical resource (like buffer pools, locks, and thread-pool queues). These few rogue whales have nonlinear effects. A single ill-timed dump query can thrash the buffer pool and cut throughput in half. A single backup thread combined with a heavy table scan can stall writes in MySQL, as seen in Figure 3. CPU metrics will not show any of this.
Atropos proposes a simple fix to this problem. Rather than throttling or dropping victims at the front of the system, it continuously monitors how tasks use internal logical resources and cancels the ones most responsible for the collapse. The name comes from the three Greek Fates. Clotho spins the thread of life, Lachesis measures its length, and Atropos cuts it when its time is up. The system plays the same role: it cuts the harmful task to protect the others.
The first interesting point in the paper is that their survey of 151 real applications (databases, search engines, and distributed systems) shows that almost all of them already contain safe cancellation hooks. Developers have already built the ability to stop a task cleanly. So the concerns about cancellation being too dangerous or too invasive turn out to be outdated. The support is already there; what's missing is a systematic way to decide which tasks to cancel in the first place.
To identify which tasks are the rogue whales, Atropos introduces a lightweight way to track logical resource usage. It instruments three operations: acquiring a resource, releasing it, and waiting on it. These are cheap counters and timestamps. For memory this wraps code that gets or frees buffer pages. For locks it wraps lock acquisition and lock waits. For queues it wraps enqueue and dequeue. The runtime can now trace who is touching which resource and who is blocking whom.
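Here is a minimal Python sketch of what such instrumentation might look like. The class and method names are my own, not from the paper; the point is just that each hook is a cheap counter update or timestamp, keyed by (task, resource).

```python
import time
from collections import defaultdict

# Hypothetical sketch of Atropos-style tracing: three hooks (acquire,
# release, wait) maintain per-(task, resource) counters and timestamps.
class ResourceTracer:
    def __init__(self):
        self.held = defaultdict(int)         # (task, resource) -> units currently held
        self.wait_time = defaultdict(float)  # (task, resource) -> total seconds blocked
        self._wait_start = {}                # in-flight waits

    def on_acquire(self, task, resource, units=1):
        # e.g. a buffer page pinned, a lock taken, an item enqueued
        self.held[(task, resource)] += units

    def on_release(self, task, resource, units=1):
        # e.g. a page unpinned, a lock released, an item dequeued
        self.held[(task, resource)] -= units

    def on_wait_begin(self, task, resource):
        self._wait_start[(task, resource)] = time.monotonic()

    def on_wait_end(self, task, resource):
        start = self._wait_start.pop((task, resource))
        self.wait_time[(task, resource)] += time.monotonic() - start
```

With these three hooks wrapped around the buffer pool, lock manager, and queues, the runtime can see who holds what and who is blocked on whom, without touching the application's request path otherwise.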
The map above (from Nano Banana Pro) lays out the Atropos design. Let's walk through its main pieces.
The runtime borrows the "overload detection" from prior work (Breakwater). When latency rises above the SLO but throughput stays flat, overload is present. At that moment Atropos inspects the resource traces and identifies hot spots. It computes two measures per resource per task.
- The first measure is the contention level: how much time tasks spend waiting because of that resource. High eviction ratios, long lock-wait ratios, and long queue-wait times all signal a contended resource worth inspecting for a rogue whale.
- The second measure is the resource gain. This estimates how much future load would be relieved if a task were canceled. This is the clever idea. The naive approach counts past usage, but a task that has consumed many pages but is almost finished does not pose much future harm. A task that has barely begun a huge scan is the real whale to watch, because its future thrashing potential is far larger than what its small past footprint suggests. The system uses progress estimates to scale current usage by remaining work. Databases usually track rows read vs rows expected. Other systems provide analogous application-specific counters for an estimate of future demand. This allows the controller to avoid killing nearly finished tasks.
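The two measures can be sketched as follows. This is my own illustrative rendering, not the paper's exact formulas: contention is wait time normalized over a measurement window, and gain scales a task's current footprint by its estimated remaining work (e.g. rows read vs. rows expected in a database).

```python
# Hypothetical sketch of the two per-resource measures described above.

def contention_level(total_wait_time, window_seconds, num_tasks):
    # Fraction of the window that tasks spent blocked on this resource,
    # averaged over tasks. Higher means a hotter resource.
    return total_wait_time / (window_seconds * max(num_tasks, 1))

def resource_gain(current_usage, progress):
    # progress in (0, 1] is an application-specific estimate of completion.
    # A task at 10% progress holding 100 pages projects ~900 more pages of
    # future demand; a task at 95% poses little future harm.
    remaining_ratio = (1.0 - progress) / max(progress, 1e-9)
    return current_usage * remaining_ratio
```

The progress term is what lets the controller spare nearly finished tasks: their past footprint may be large, but their projected future demand is small.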
The policy engine considers all resources together. Some tasks stress memory, others stress locks, and others queues. A single-resource policy would make narrow decisions. Instead Atropos identifies the non-dominated set of tasks across all resources and computes a weighted score, where the weights are the contention levels of the resources. The resultant score is the expected gain from canceling that task. The task with the highest score is canceled using the application's own cancellation mechanism.
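A minimal sketch of that selection logic, under my own assumptions about the data shapes (per-task gain maps and per-resource contention levels; the paper's actual scoring may differ in detail):

```python
# Hypothetical sketch of the multi-resource policy: keep only tasks that
# are not dominated on every resource, then rank them by contention-weighted
# gain and cancel the top-scoring one.

def non_dominated(gains):
    # gains: {task: {resource: gain}}
    tasks = list(gains)
    resources = {r for g in gains.values() for r in g}

    def dominates(a, b):
        ga, gb = gains[a], gains[b]
        return (all(ga.get(r, 0) >= gb.get(r, 0) for r in resources)
                and any(ga.get(r, 0) > gb.get(r, 0) for r in resources))

    return [t for t in tasks
            if not any(dominates(o, t) for o in tasks if o != t)]

def pick_victim(gains, contention):
    # contention: {resource: level}. The weighted score approximates the
    # expected relief from canceling the task.
    frontier = non_dominated(gains)
    score = lambda t: sum(contention.get(r, 0) * g
                          for r, g in gains[t].items())
    return max(frontier, key=score)
```

Weighting by contention keeps the decision focused: a huge gain on an uncontended resource scores low, while a moderate gain on the resource everyone is blocked on scores high.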
The paper's evaluation is strong. This is a prototypical SOSP paper. The authors reproduce sixteen real overload bugs in systems like MySQL, Postgres, Elasticsearch, Solr, Apache, and etcd. These include buffer pool thrashing, lock convoys, queue stalls, and indexing contention. Across these cases Atropos restores throughput close to the baseline: the median result is around 96% of normal throughput under overload. Tail latency stays near normal. The cancellation rate is tiny: less than one in ten thousand requests! Competing approaches must drop far more requests and still fail to restore throughput. The key here is the nonlinear effect of freeing internal logical resources. Canceling the right task unblocks a crowd of others.
As usual, Aleksey and I did our live blind read of the paper, which you can watch below. My annotated copy of the paper is also available here.
What I like about the paper is its clarity. The motivating examples are strong and concrete. The design is small, understandable, and modular. The progress-based future estimation is a good idea, and the policy avoids naive heuristics. This is not a general overload controller for all situations: it does not manage CPU saturation or network overload. Instead it solves overload caused by contention on internal logical resources by killing the rogue whale and quickly restoring normal behavior.