TLA+ Modeling of AWS outage DNS race condition
As I mentioned, the enactor process has three actions. It's not a single atomic block but three subactions. The receive step is simple: EnactorReceiveStep(self) models an enactor taking a new plan from the shared queue. If the queue isn't empty and the enactor is idle (pc=0), it assigns itself the plan at the head of the queue, removes that plan from the queue, advances its program counter to 1 (ready to apply), and leaves all other state unchanged.
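Here is a minimal sketch of how this action could be written. The variable names (plan_queue, processing, pc, current_plan, highest_applied, plan_deleted) are my placeholders, not necessarily the identifiers used in the actual spec:

```tla
\* Sketch (assumed names): an idle enactor dequeues the plan at the head of the queue.
EnactorReceiveStep(self) ==
    /\ pc[self] = 0                    \* enactor is idle
    /\ plan_queue # <<>>               \* a plan is waiting in the shared queue
    /\ processing' = [processing EXCEPT ![self] = Head(plan_queue)]
    /\ plan_queue' = Tail(plan_queue)
    /\ pc' = [pc EXCEPT ![self] = 1]   \* ready to apply
    /\ UNCHANGED <<current_plan, highest_applied, plan_deleted>>
```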
The EnactorApplyStep action models how an enactor applies a plan. If the plan it is processing hasn't been deleted, the enactor makes that plan the current plan, updates the highest plan applied, and advances its counter to 2. If the plan has been deleted, it resets its state (processing and pc back to 0).
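A sketch of what this could look like, under the same assumed variable names (plans are modeled here as increasing integers for simplicity). Note that the deletion check and the install happen in this one action, but the plan being checked was picked up in an earlier step:

```tla
\* Sketch: apply the held plan only if cleanup hasn't deleted it in the meantime.
EnactorApplyStep(self) ==
    /\ pc[self] = 1
    /\ IF processing[self] \notin plan_deleted
       THEN /\ current_plan' = processing[self]   \* install as the active plan
            /\ highest_applied' = IF processing[self] > highest_applied
                                  THEN processing[self] ELSE highest_applied
            /\ pc' = [pc EXCEPT ![self] = 2]
            /\ UNCHANGED processing
       ELSE /\ processing' = [processing EXCEPT ![self] = 0]  \* give up on the deleted plan
            /\ pc' = [pc EXCEPT ![self] = 0]
            /\ UNCHANGED <<current_plan, highest_applied>>
    /\ UNCHANGED <<plan_queue, plan_deleted>>
```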
The EnactorCleanupStep runs when an enactor has finished applying a plan (pc=2). It updates plan_deleted via the Cleanup operator, marking plans older than PLAN_AGE_THRESHOLD as deleted. It then resets that enactor's state (processing and pc back to 0) and leaves all other variables unchanged.
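A sketch of the cleanup step and one possible Cleanup operator, again using assumed names; here staleness is measured relative to the highest plan applied:

```tla
\* Sketch: mark plans that lag the newest applied plan by more than the threshold.
Cleanup(deleted, highest) ==
    deleted \cup {p \in 1..highest : p < highest - PLAN_AGE_THRESHOLD}

EnactorCleanupStep(self) ==
    /\ pc[self] = 2
    /\ plan_deleted' = Cleanup(plan_deleted, highest_applied)
    /\ processing' = [processing EXCEPT ![self] = 0]   \* back to idle
    /\ pc' = [pc EXCEPT ![self] = 0]
    /\ UNCHANGED <<plan_queue, current_plan, highest_applied>>
```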
Next defines the system’s state transitions: either the planner generates a new plan or an enactor executes one of its steps (receive, apply, cleanup).
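In the spec this would look roughly as follows; PlannerGeneratePlan and Enactors are my assumed names for the planner action and the set of enactor process identifiers:

```tla
\* Sketch: the next-state relation interleaves the planner and the enactors' subactions.
Next ==
    \/ PlannerGeneratePlan
    \/ \E self \in Enactors :
          \/ EnactorReceiveStep(self)
          \/ EnactorApplyStep(self)
          \/ EnactorCleanupStep(self)
```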
NeverDeleteActive asserts that the currently active plan must never be marked deleted. This invariant is violated because the enactor's operation isn't executed as one atomic step but is split into three subactions for performance reasons. Splitting the operation allows parallelism and avoids long blocking while waiting for slow parts, such as applying configuration changes, to complete. This design trades atomicity for throughput and responsiveness.
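Under the assumed state variables above, the invariant could be as short as this sketch:

```tla
\* Sketch: the plan currently in effect must never appear in the deleted set.
NeverDeleteActive == current_plan \notin plan_deleted
```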
Anyone familiar with concurrency control or distributed systems can foresee how this race condition unfolds and leads to a NeverDeleteActive violation. The root cause is a classic time-of-check-to-time-of-use (TOCTOU) flaw.
The trace goes as follows. The planner issues several consecutive plans. Enactor 1 picks up one of them but lags before performing its next two steps. Enactor 2 arrives and moves faster: it takes the next plan, applies it, and performs cleanup. Because the staleness threshold hasn't been reached yet, Enactor 1's plan survives this cleanup. On the next round, Enactor 2 gets another plan and applies it. Only then does Enactor 1 finally execute its Apply step, overwriting Enactor 2's newly installed plan with its older one. When Enactor 2's cleanup runs again, it deletes Enactor 1's plan, which is now stale by the threshold yet is the active one, violating the NeverDeleteActive invariant.
You can explore this violation trace by following this link to a browser-based TLA+ trace explorer that Will Schultz built. The Spectacle tool loads the TLA+ spec from GitHub, interprets it with a JavaScript interpreter, and visualizes the state changes step by step. You can step backward and forward using the buttons and explore the enabled actions. This makes model-checking output accessible to engineers unfamiliar with TLA+, and you can share a violation trace simply by sending a link, as I did above.



