Modeling Token Buckets in PlusCal and TLA+

Retry storms are infamous in distributed systems, and it is easy to run into one. Inevitably, a downstream service experiences a hiccup, so your clients automatically retry their failed requests. Those retries add more load to the struggling service, causing more failures, which trigger more retries. Before you know it, a tiny blip of unavailability cascades into a full-blown, self-inflicted denial of service.

The token bucket is a popular technique for gracefully avoiding retry storms. Here is how the algorithm works for retries:

  • Sending is always free. When a client sends a brand-new request, it doesn't need a token. It just sends it.
  • Successes give you credit. Every time a request succeeds, the client deposits a small fraction of a token into its bucket (up to a maximum capacity).
  • Retries cost you credit. If a request fails, the client must spend a whole token (or a large fraction of one) to attempt a retry.

If the downstream service is healthy, the bucket stays full. But if the service goes down, the initial failures quickly drain the bucket. Once the bucket is empty, the client stops retrying. It gracefully degrades, protecting the downstream service until it recovers and new requests start succeeding again.
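As a sketch, the bucket arithmetic described above could be written as a few TLA+ operators. All names and the integer scaling here are assumptions of this sketch, not part of any official spec:

```tla
\* Token amounts are scaled by 10 so we can stay in integers:
\* a success deposits 0.1 token, a retry costs 1.0 token.
CONSTANT Capacity            \* e.g. 100, i.e. 10 whole tokens

SuccessReward == 1           \* 0.1 token, scaled
RetryCost     == 10          \* 1.0 token, scaled

\* Deposit after a success, clamped at the bucket's capacity.
Deposit(tokens) == IF tokens + SuccessReward > Capacity
                   THEN Capacity
                   ELSE tokens + SuccessReward

\* A retry is affordable only if a whole token is available.
CanRetry(tokens) == tokens >= RetryCost
Spend(tokens)    == tokens - RetryCost
```

The integer scaling is a common trick: TLC works most comfortably with naturals, so fractional token deposits are represented as scaled integers.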

It is an elegant solution. But when implementing it in a client driver, there is a subtle concurrency trap that is easy to fall into. We model it below to show exactly where things go wrong.


The Sequential Processing Trap

When an engineer sits down to write a client driver, they usually think sequentially: pull a request from a queue, send it, check the response, and retry if it fails. We can model this sequential control flow using PlusCal (the pseudo-algorithm DSL that compiles down to TLA+). In this first model, we simulate a thread iterating over a queue of requests. If a request fails, the thread checks the token bucket, and if the bucket is empty, the thread naturally blocks on an await until tokens become available.
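Since the spec itself is not reproduced in this post, here is a minimal sketch of what such a sequential PlusCal model might look like. Apart from server_succeeds and the ProcessResponse label, every identifier below is an assumption of this sketch:

```tla
---- MODULE RetryBucketSeq ----
EXTENDS Naturals

\* Assumed constants, with token amounts scaled to integers,
\* e.g. Capacity = 10, RetryCost = 10, SuccessReward = 1.
CONSTANTS NumRequests, Capacity, RetryCost, SuccessReward

(* --algorithm RetryBucketSeq
variables
    tokens = Capacity,       \* the bucket starts full
    server_succeeds = TRUE;  \* outcome of the last request

fair process Client = "client"
variables idx = 1;
begin
Loop:
    while idx <= NumRequests do
Send:
        \* Sending a brand-new request is free; the server's
        \* outcome is chosen nondeterministically.
        with outcome \in {TRUE, FALSE} do
            server_succeeds := outcome;
        end with;
ProcessResponse:
        \* This label splits the network call and the response
        \* handling into two distinct atomic steps.
        if server_succeeds then
            tokens := IF tokens + SuccessReward > Capacity
                      THEN Capacity
                      ELSE tokens + SuccessReward;
        else
RetryWait:
            \* BUG: when the bucket is empty this thread blocks
            \* forever, because only it can earn new tokens.
            await tokens >= RetryCost;
            tokens := tokens - RetryCost;
        end if;
NextRequest:
        idx := idx + 1;
    end while;
end process;
end algorithm; *)
====
```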


(Aside: Before we look at the results, notice the placement of labels in the model. In PlusCal, labels are not just aesthetic markers or simple goto targets. They are structurally vital because they define the boundaries of atomic steps: label to label is the unit of atomic execution, so we have to place them very deliberately. For example, making a request to a server and evaluating its response cannot be a single, instantaneous action; that would violate the physical reality of the network boundary. By placing the ProcessResponse: label right before the if server_succeeds check, we correctly split this into two atomic actions. As a fun piece of TLA+ trivia: if we omitted that label, the model checker would evaluate the network call and the await statement in a single atomic step. To avoid getting stuck at an invalid await state, TLC would magically "time travel" and force the server to succeed every single time, completely hiding our bug!)

If we run this model, the TLC model checker immediately throws a Termination violation. Why?

We accidentally built a Head-of-Line (HOL) blocking bug. This is how it happens. A burst of failures drains the bucket. The thread hits the "await condition" and goes to sleep, waiting for tokens. But because the thread is blocked, it can't loop around to send new requests. Since only new requests can earn tokens, the bucket will never refill, and the client driver is permanently deadlocked.


The Fix: Fail-Fast

To fix this, we have to change how the client driver reacts to an empty bucket. Instead of waiting, it must fail fast. So we remove the await and simply drop the request when we can't afford the retry, unblocking the queue so the thread can move on to the next request. TLC confirms that this model passes with flying colors: no starvation, no deadlocks.
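A sketch of how the failure branch might change under fail-fast, with the usual caveat that these identifiers (apart from server_succeeds) are assumptions, not the post's actual spec:

```tla
ProcessResponse:
        if server_succeeds then
            tokens := IF tokens + SuccessReward > Capacity
                      THEN Capacity
                      ELSE tokens + SuccessReward;
        elsif tokens >= RetryCost then
            tokens := tokens - RetryCost;  \* pay for the retry
        else
            skip;  \* fail fast: drop the request, don't wait
        end if;
```

With the await gone, every step of the process is always enabled, so the thread can never get stuck waiting on a bucket that only it could refill.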


The Pure TLA+ Perspective

It is worth noting that if we had modeled this system in pure TLA+ from the start, we wouldn't have stumbled into this specific deadlock. Instead of sequential thread logic, pure TLA+ models state machines and event-driven logic. In the TLA+ model below, Send, ServerRespond, Retry, and Drop are independent actions. If the bucket is empty, the Retry action simply becomes disabled; but because there is no while loop tying the actions together, the Send action remains perfectly enabled for any new incoming requests. This guarded-command style naturally avoids the head-of-line blocking problem.
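The action names Send, ServerRespond, Retry, and Drop come from the post; the variables and constants in this sketch are an assumed reconstruction of what such a guarded-command model might look like:

```tla
---- MODULE RetryBucketActions ----
EXTENDS Naturals

CONSTANTS Capacity, RetryCost, SuccessReward, MaxInflight
VARIABLES tokens, inflight, failed

Init == /\ tokens = Capacity
        /\ inflight = 0
        /\ failed = 0

\* A new request never needs a token. (MaxInflight just keeps
\* the state space finite for TLC.)
Send == /\ inflight < MaxInflight
        /\ inflight' = inflight + 1
        /\ UNCHANGED <<tokens, failed>>

\* The server answers an in-flight request: success earns a
\* fractional token (clamped), failure records a failed request.
ServerRespond ==
    /\ inflight > 0
    /\ inflight' = inflight - 1
    /\ \/ /\ tokens' = IF tokens + SuccessReward > Capacity
                       THEN Capacity
                       ELSE tokens + SuccessReward
          /\ UNCHANGED failed
       \/ /\ failed' = failed + 1
          /\ UNCHANGED tokens

\* Retry is guarded: it is simply disabled when the bucket is
\* too empty, without blocking anything else.
Retry == /\ failed > 0
         /\ tokens >= RetryCost
         /\ tokens' = tokens - RetryCost
         /\ failed' = failed - 1
         /\ inflight' = inflight + 1

\* Dropping a failed request is always possible.
Drop == /\ failed > 0
        /\ failed' = failed - 1
        /\ UNCHANGED <<tokens, inflight>>

Next == Send \/ ServerRespond \/ Retry \/ Drop
====
```

Notice how the head-of-line blocking bug cannot even be expressed here: an empty bucket disables Retry, but Send and Drop carry on regardless.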

To explore this TLA+ model in the Spectacle web tool, simply click this link. Spectacle loads the TLA+ spec from GitHub, interprets it with a JavaScript interpreter, and visualizes step-by-step state changes as you press a button corresponding to an enabled action. You can step backwards and forwards and explore different executions, which makes model outputs accessible to engineers unfamiliar with TLA+. You can share a trace simply by sending a URL, as I did above.

This brings us to an important lesson about formal verification and system design: the paradigm gap. Pure TLA+ is a beautiful event-driven way to describe the mathematically correct state of your system. However, the environments where these systems actually live (Java, Go, C++, or Rust) are fundamentally built around sequential threads, loops, and queues, just like our PlusCal model. The impedance mismatch between an event-driven specification and a sequential implementation introduces the risk of HOL blocking. Because modern programming languages make it so effortless to pause a thread and wait for a resource, it is incredibly easy for a system to fall into the blocking trap. We should be cognizant of this pitfall when implementing our designs. 
