TLA+ Modeling of AWS outage DNS race condition
As I mentioned, the enactor process has three actions. It's not a single atomic block but three subactions. The receive step is simple: EnactorReceiveStep(self) models an enactor taking a new plan from the shared queue. If the queue isn't empty and the enactor is idle (pc=0), it assigns itself the plan at the head of the queue, removes that plan from the queue, advances its program counter to 1 (ready to apply), and leaves all other state unchanged.
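Here is a minimal sketch of how this action could be written. The variable names (plan_queue, processing, pc, current_plan, highest_applied, plan_deleted) are my placeholders, not necessarily the identifiers used in the actual spec:

```tla
\* Sketch (assumed names): an idle enactor dequeues the plan at the head of the queue.
EnactorReceiveStep(self) ==
    /\ pc[self] = 0                    \* enactor is idle
    /\ plan_queue # <<>>               \* a plan is waiting in the shared queue
    /\ processing' = [processing EXCEPT ![self] = Head(plan_queue)]
    /\ plan_queue' = Tail(plan_queue)
    /\ pc' = [pc EXCEPT ![self] = 1]   \* ready to apply
    /\ UNCHANGED <<current_plan, highest_applied, plan_deleted>>
```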
The EnactorApplyStep action models how an enactor applies a plan. If the plan it is processing hasn't been deleted, the enactor makes that plan the current plan, updates the highest plan applied, and advances its counter to 2. If the plan has been deleted, it resets its state (processing and pc back to 0).
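A sketch of what this could look like, under the same assumed variable names (plans are modeled here as increasing integers for simplicity). Note that the deletion check and the install happen in this one action, but the plan being checked was picked up in an earlier step:

```tla
\* Sketch: apply the held plan only if cleanup hasn't deleted it in the meantime.
EnactorApplyStep(self) ==
    /\ pc[self] = 1
    /\ IF processing[self] \notin plan_deleted
       THEN /\ current_plan' = processing[self]   \* install as the active plan
            /\ highest_applied' = IF processing[self] > highest_applied
                                  THEN processing[self] ELSE highest_applied
            /\ pc' = [pc EXCEPT ![self] = 2]
            /\ UNCHANGED processing
       ELSE /\ processing' = [processing EXCEPT ![self] = 0]  \* give up on the deleted plan
            /\ pc' = [pc EXCEPT ![self] = 0]
            /\ UNCHANGED <<current_plan, highest_applied>>
    /\ UNCHANGED <<plan_queue, plan_deleted>>
```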
The EnactorCleanupStep runs when an enactor has finished applying a plan (pc=2). It updates plan_deleted via the Cleanup operator, marking plans older than PLAN_AGE_THRESHOLD as deleted. It then resets that enactor's state (processing and pc back to 0) and leaves all other variables unchanged.
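A sketch of the cleanup step and one possible Cleanup operator, again using assumed names; here staleness is measured relative to the highest plan applied:

```tla
\* Sketch: mark plans that lag the newest applied plan by more than the threshold.
Cleanup(deleted, highest) ==
    deleted \cup {p \in 1..highest : p < highest - PLAN_AGE_THRESHOLD}

EnactorCleanupStep(self) ==
    /\ pc[self] = 2
    /\ plan_deleted' = Cleanup(plan_deleted, highest_applied)
    /\ processing' = [processing EXCEPT ![self] = 0]   \* back to idle
    /\ pc' = [pc EXCEPT ![self] = 0]
    /\ UNCHANGED <<plan_queue, current_plan, highest_applied>>
```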
Next defines the system’s state transitions: either the planner generates a new plan or an enactor executes one of its steps (receive, apply, cleanup).
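In the spec this would look roughly as follows; PlannerGeneratePlan and Enactors are my assumed names for the planner action and the set of enactor process identifiers:

```tla
\* Sketch: the next-state relation interleaves the planner and the enactors' subactions.
Next ==
    \/ PlannerGeneratePlan
    \/ \E self \in Enactors :
          \/ EnactorReceiveStep(self)
          \/ EnactorApplyStep(self)
          \/ EnactorCleanupStep(self)
```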
NeverDeleteActive asserts that the currently active plan must never be marked deleted. This invariant is violated because the enactor's operation isn't executed as one atomic step but is split into three subactions for performance reasons. Splitting the operation allows parallelism and avoids long blocking while waiting for slow parts, such as applying configuration changes, to complete. This design trades atomicity for throughput and responsiveness.
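Under the assumed state variables above, the invariant could be as short as this sketch:

```tla
\* Sketch: the plan currently in effect must never appear in the deleted set.
NeverDeleteActive == current_plan \notin plan_deleted
```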
Anyone familiar with concurrency control or distributed systems can foresee how this race condition unfolds and leads to a NeverDeleteActive violation. The root cause is a classic time-of-check-to-time-of-use (TOCTOU) flaw.
The trace goes as follows. The planner issues several consecutive plans. Enactor 1 picks up one of them but lags before performing its next two steps. Enactor 2 arrives and moves faster: it takes the next plan, applies it, and performs cleanup. Because the staleness threshold hasn't been reached yet, Enactor 1's plan survives this cleanup. On the next round, Enactor 2 gets another plan and applies it. Only then does Enactor 1 finally execute its Apply step, overwriting Enactor 2's newly installed plan with its older one. When Enactor 2's cleanup runs again, it deletes Enactor 1's plan, which is now stale by the threshold yet is the active one, violating the NeverDeleteActive invariant.
You can explore this violation trace by following this link to a browser-based TLA+ trace explorer that Will Schultz built. The Spectacle tool loads the TLA+ spec from GitHub, interprets it with a JavaScript interpreter, and visualizes the state changes step by step. You can step backward and forward using the buttons and explore the enabled actions. This makes model-checking output accessible to engineers unfamiliar with TLA+, and you can share a violation trace simply by sending a link, as I did above.



