Mitigating Application Resource Overload with Targeted Task Cancellation
The Atropos paper (SOSP'25) argues that overload-control systems are built on a flawed assumption. They monitor global signals (such as queue length or tail latency) and respond with admission control (throttling new arrivals or dropping random requests). This works when the bottleneck is CPU or network, but it fails when the real problem lies inside the application. These systems react to the symptoms rather than the source, so they end up dropping the victims rather than the culprits.
Real systems often run into overload because one or two unluckily timed requests monopolize an internal logical resource (such as a buffer pool, a lock, or a thread-pool queue). These few rogue whales have nonlinear effects. A single ill-timed dump query can thrash the buffer pool and cut throughput in half. A single backup thread combined with a heavy table scan can stall writes in MySQL, as seen in Figure 3. CPU metrics will not show this. Atropos proposes a simple fix to this problem. Rather than throttling or dropping ...
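The difference between dropping victims and cancelling culprits can be illustrated with a toy model. The sketch below assumes a single shared buffer pool whose occupancy is tracked per task. The names `Task`, `BufferPool`, `shed_newest`, and `cancel_culprit`, as well as the page counts, are hypothetical. The culprit-selection rule (cancel whoever holds the most pages) is a deliberately simple stand-in, not the policy Atropos actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    pages_held: int  # hypothetical: buffer-pool pages this task currently occupies


@dataclass
class BufferPool:
    capacity: int
    tasks: list[Task] = field(default_factory=list)  # ordered by arrival time

    def used(self) -> int:
        return sum(t.pages_held for t in self.tasks)

    def overloaded(self) -> bool:
        return self.used() > self.capacity


def shed_newest(pool: BufferPool) -> list[str]:
    """Symptom-based control: drop the most recent arrivals until the pool fits."""
    dropped = []
    while pool.overloaded() and pool.tasks:
        victim = pool.tasks.pop()  # newest arrival sits at the end of the list
        dropped.append(victim.name)
    return dropped


def cancel_culprit(pool: BufferPool) -> list[str]:
    """Targeted cancellation (toy rule): cancel the task holding the most pages."""
    cancelled = []
    while pool.overloaded() and pool.tasks:
        culprit = max(pool.tasks, key=lambda t: t.pages_held)
        pool.tasks.remove(culprit)
        cancelled.append(culprit.name)
    return cancelled


def make_pool() -> BufferPool:
    # Hypothetical scenario: four small queries plus one large dump query.
    return BufferPool(capacity=100, tasks=[
        Task("q1", 10), Task("q2", 10), Task("dump", 80),
        Task("q3", 10), Task("q4", 10),
    ])


print(shed_newest(make_pool()))     # ['q4', 'q3']
print(cancel_culprit(make_pool()))  # ['dump']
```

Shedding new arrivals removes two small, innocent queries and leaves the dump query holding most of the pool. Cancelling the largest holder frees the pool in one step. The toy model ignores how a culprit is identified at runtime and what cancelling it costs.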