TLA+ as a Design Accelerator: Lessons from the Industry
After 15+ years of using TLA+, I now think of it as a design accelerator. One of the purest intellectual pleasures is finding a way to simplify and cut out complexity. TLA+ is a thinking tool that lets you do that.
TLA+ forces us out of implementation-shaped and operational reasoning into mathematical declarative reasoning about system behavior. Its global state-transition model and its deliberate fiction of shared memory make complex distributed behavior manageable. Safety and liveness become clear and compact predicates over global state. This makes TLA+ powerful for design discovery. It supports fast exploration of protocol variants and convergence on sound designs before code exists.
TLA+ especially shines for complex distributed/concurrent systems. In such systems, complexity exceeds human intuition very quickly. (I often point to very simple interleaving/nondeterministic execution puzzles to show how much we suck at reasoning about such systems.) Testing is inadequate for catching subtle design errors in complex distributed/concurrent systems. Code may faithfully implement a design, but the design itself can fail in rare concurrent scenarios. TLA+ provides an exhaustively testable design, catches design errors before code exists, and enables rapid "what if" design explorations and aggressive protocol optimizations, safely.
This is why there are so many cases of TLA+ modeling used in industry, including Amazon/AWS, Microsoft Azure, MongoDB, Oracle Cloud, Google, LinkedIn, Datadog, Nike, Intel.
In this post, I will talk about the TLA+ modeling projects I worked on, mostly from industry. (Ok, I counted: eight projects. I will tell you about eight projects.)
1. WPaxos (2016)
This is the only non-industry experience I will mention. It is from my work on WPaxos at SUNY Buffalo, with my PhD students.
WPaxos adapts Paxos for geo-distributed systems by combining flexible quorums with many concurrent leaders. It uses two quorum types: a phase-1 quorum (Q1) used to establish or steal leadership for an object, and a phase-2 quorum (Q2) used to commit updates. Flexible quorum rules require only that every Q1 intersect every Q2, not that all quorums intersect each other. WPaxos exploits this by placing Q2 quorums largely within a single region, so the common case of committing updates happens with low local latency, while the rarer Q1 leader changes span zones. Nodes can steal leadership of objects via Q1 when they observe demand elsewhere, in order to migrate the object's ownership toward the region issuing the most writes. Safety is assured because any attempted commit must pass through a Q2 quorum that intersects prior Q1 decisions, preventing conflicting updates despite failures, network delays, and concurrent leaders.
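To make the flexible-quorum requirement concrete, here is a minimal Python sketch (my own illustration, not code from the WPaxos spec) using grid-style quorums over a made-up three-zone deployment: Q2 quorums are zone-local "columns" and Q1 quorums are cross-zone "rows". The zone layout and quorum shapes are assumptions for the example.

```python
from itertools import product

# Hypothetical grid deployment: 3 zones, 3 nodes per zone.
zones = {z: [f"{z}{r}" for r in range(3)] for z in ("a", "b", "c")}

# Q2 quorums: all nodes of a single zone (a "column") -- cheap, zone-local commits.
q2_quorums = [set(members) for members in zones.values()]

# Q1 quorums: one node from each zone (a "row") -- rarer, cross-zone leader steals.
q1_quorums = [set(row) for row in product(*zones.values())]

# Flexible quorums only require every Q1 to intersect every Q2;
# Q2 quorums need not intersect each other (and here they don't).
assert all(q1 & q2 for q1 in q1_quorums for q2 in q2_quorums)
assert any(not (a & b) for a in q2_quorums for b in q2_quorums)  # Q2s can be disjoint
```

The second assertion is the point: plain Paxos majorities would force every pair of quorums to intersect, but here only the Q1-Q2 intersections matter, which is what lets commits stay local to one zone.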
I explained the basic WPaxos protocol here, if you'd like to read more. (Sadly there was never a part 2 for that post. I don't know if anyone is using WPaxos in production, but it is a really good idea, and I hope to hear about deployments in the wild.) As for our use of TLA+ for the protocol, it came early on. After we had the intuitive idea for the protocol, we knew we needed strong modeling support in order to get this complex thing completely right. The modeling also helped us sharpen our definitions: it is not straightforward to define quorums across zones while getting the intersections right. The TLA+ modeling was so useful, in fact, that we used TLA+/PlusCal snippets in our paper to explain concepts (a model-checking-validated spec, rather than the hail-Mary pseudocode everyone else uses). The definitions in the paper also came from our TLA+ formal definitions.
The lesson learned: Model early! Like we predicted, we got a lot of mileage by modeling in TLA+ early on in this project.
2. CosmosDB (2018)
During my sabbatical at Microsoft Azure CosmosDB, I helped specify the database's client-facing consistency semantics. The nice thing about these specs was that they didn't need to model internal implementation. The goal was to capture the consistency semantics for clients in a precise manner, rather than providing ambiguous English explanations. The model aimed to answer the question: What kind of behavior should a client be able to witness while interacting with the service?
The anti-pattern here would be to try to model the distributed database engine. Trying to show how the replication/coordination works would have led to immediate state-space explosion and an unreadable/unusable specification. Instead we modeled at a high level and homed in on the "history as a log" abstraction to represent/capture the user-facing concurrency.
The model is available here. The "history" variable records all client operations (reads and writes) in order, and each consistency level validates different properties against it:
- Strong Consistency enforces Linearizability: reads must always see the latest write globally across all regions
- Bounded Staleness uses ReadYourWrite: clients see their own writes, plus bounded staleness by K operations
- Session uses MonotonicReadPerClient: a client's reads are monotonic (never go backward)
- Consistent Prefix uses MonotonicWritePerRegion: writes in a region appear in order
- Eventual uses Eventual Convergence: reads eventually see writes that exist in the database.
The replication macro was particularly high-level and clever. When region d replicates from region s, it merges both write histories, sorts them, and deduplicates to get a consistent, monotonically increasing sequence of writes. After replication, Data[d] is set to the last value in the merged database, ensuring regions eventually converge to the same state.
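In Python terms, the replication step described above can be sketched roughly as follows. This is my paraphrase of the idea, not the spec's actual macro; the variable names (`Database`, `Data`) and the `(version, value)` write encoding are assumptions for the illustration.

```python
# Each region keeps a history of writes, ordered by a monotonically
# increasing version number (an assumption for this sketch).
Database = {
    "us-east": [(1, "a"), (2, "b")],
    "eu-west": [(1, "a"), (3, "c")],
}
Data = {"us-east": "b", "eu-west": "c"}

def replicate(d, s):
    """Region d pulls from region s: merge both histories, sort, dedupe."""
    merged = sorted(set(Database[d]) | set(Database[s]))
    Database[d] = merged
    Data[d] = merged[-1][1]  # converge to the value of the latest write

replicate("us-east", "eu-west")
replicate("eu-west", "us-east")
assert Database["us-east"] == Database["eu-west"] == [(1, "a"), (2, "b"), (3, "c")]
assert Data["us-east"] == Data["eu-west"] == "c"
```

The merge-sort-dedupe step is what makes anti-entropy idempotent and commutative here: replicating in either direction, any number of times, converges both regions to the same history.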
The lesson here is to model minimalistically. A model does not have to capture everything to be highly valuable; it just needs to capture the part/behavior that matters.
This minimalistic model served as precise documentation for outside-facing behavior, replacing ambiguous English explanations, and became foundational enough that a 2023 academic paper built/improved on it. I talked about this improved model here. The history/log was again the main abstraction in that model. The 2023 paper accompanying the model has this great opening paragraph, which echoes the experience of everyone who has painstakingly specified a distributed/concurrent system's behavior:
"Consistency guarantees for distributed databases are notoriously hard to understand. Not only can distributed systems inherently behave in unexpected and counter-intuitive ways due to internal concurrency and failures, but they can also lull their users into a false sense of functional correctness: most of the time, users of a distributed database will witness a much simpler and more consistent set of behaviors than what is actually possible. Only timeouts, fail-overs, or other rare events will expose the true set of behaviors a user might witness. Testing for these scenarios is difficult at best: reproducing them reliably requires controlling complex concurrency factors, latency variations, and network behaviors. Even just producing usable documentation for developers is fundamentally challenging and explaining these subtle consistency issues via documentation comes as an additional burden to distributed system developers and technical writers alike."
3. AWS DSQL (2022)
I worked on AWS DSQL in 2022 and 2023. Aurora DSQL is a WAN-distributed SQL database. It decomposes the database into independent services: stateless SQL compute nodes, durable storage, a replicated journal, and transaction adjudicators. Transactions execute optimistically and mostly locally. Reads use MVCC snapshots, and writes are buffered without coordination. Only at commit does the system perform global conflict validation and ordering, using adjudicators and the journal to finalize the transaction. This design pushes almost all distributed coordination to commit time, allowing statements inside a transaction to run with low latency while still providing strong transactional guarantees.
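The "coordinate only at commit" idea can be illustrated with a toy optimistic-concurrency commit check. To be clear, this is a generic OCC sketch of my own, not DSQL's actual adjudicator logic; all names here are invented for the example.

```python
store = {}          # key -> (version, value)
committed_at = {}   # key -> version of the last committed write to that key

def begin(snapshot_version):
    # A transaction carries its read snapshot and a local write buffer.
    return {"snapshot": snapshot_version, "writes": {}}

def write(txn, key, value):
    txn["writes"][key] = value  # buffered locally, no coordination yet

def commit(txn, clock):
    # Validation happens only now: abort if any key we wrote was
    # committed by someone else after our snapshot.
    for key in txn["writes"]:
        if committed_at.get(key, 0) > txn["snapshot"]:
            return False  # conflict detected at commit time
    version = clock()
    for key, value in txn["writes"].items():
        store[key] = (version, value)
        committed_at[key] = version
    return True

tick = iter(range(1, 100)).__next__
t1, t2 = begin(0), begin(0)
write(t1, "x", "t1")
write(t2, "x", "t2")
assert commit(t1, tick) is True
assert commit(t2, tick) is False  # write-write conflict, t2 must retry
```

Both transactions ran their statements without any coordination; only the commit step had to consult the global `committed_at` state, which is the latency win the design description above is after.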
I wrote a first version of the TLA+ model of this system. This was great for building confidence in the protocol. After writing the model, I had a better understanding of the invariants. The model also served as a communication tool, keeping people on the same page. When we were trying to get more formal methods support, the TLA+ models sped up the process and anchored the communications. This was a surprising thing I learned working as part of a big team: what a big challenge it is to keep everyone aligned. Brooker had banged out a 100+ page book on the design, which really did help. He had also written PLang models of the system. As far as I know, both models eventually gave way to closer-to-implementation Rust models/testing. I am not able to share the TLA+ models for DSQL, but I hope that when a DSQL publication eventually comes out, it will be accompanied by some TLA+ models.
The lesson here is that TLA+ also works well as a communication tool and as scaffolding for further formal methods and testing support.
4. StableEmptySet (2022)
I worked on this problem for a short time, but I find it still worth mentioning. We needed to implement a distributed set that, once empty, remains empty permanently. This is a crucial property for safely garbage-collecting a set: it ensures we don't add symbolic links to a record that has already been deleted, and later lose durability when garbage collection kicks in.
Ok, but why don't we just check IsEmpty during an add operation, and disallow the addition if it is being done to an empty set? You don't get such simple luxuries in a distributed system. This is a set implemented in a distributed manner, so we do not have an atomic check for IsEmpty. Think of a LIST scan across many machines: that is inherently a non-atomic check.
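A tiny Python sketch shows why a shard-by-shard scan cannot be trusted: operations that interleave with the scan can make it report "empty" even though the set was never actually empty at any instant. The shard layout and the interleaving hook are invented for this illustration.

```python
shards = [set(), {"k"}]  # element "k" lives on shard 1

def scan_is_empty(step_hook=lambda i: None):
    """Non-atomic IsEmpty: visit shards one by one, like a LIST scan."""
    for i, shard in enumerate(shards):
        if shard:
            return False
        step_hook(i)  # a point where concurrent operations may interleave
    return True

def interleave(i):
    if i == 0:                  # after the scanner saw shard 0 empty:
        shards[0].add("k2")     # a concurrent add lands on shard 0,
        shards[1].discard("k")  # and "k" is removed from shard 1

assert scan_is_empty(interleave) is True  # the scan concludes "empty"...
assert any(shards)                        # ...yet the set is not empty
```

Note that between the add and the discard both elements exist, so the set was nonempty throughout; the scan's "empty" verdict is purely an artifact of its non-atomicity. This is exactly the kind of interleaving a model checker enumerates for you.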
In a distributed system, concurrency is our nemesis, and it makes implementing this StableEmpty protocol very tricky due to many corner cases. Ernie Cohen designed a brilliant/elegant protocol to solve this. Ernie is a genius who lives in an abstract astral plane many miles above us. My role was simply to translate his protocol into TLA+ to bring it down to a concrete plane where mere mortals like me could understand it. Again, sorry that I cannot disclose the model.
The reason I am mentioning this problem is that it radically expanded my horizons on how far we can/should push the fine granularity of atomic actions. Of course the IsEmpty check is non-atomic, but Ernie also pushed the update/communication steps of the algorithm to be as fine-grained as possible, so that there would be no need for distributed locking and the implementation could scale. The problem is that when you develop an algorithm with extremely fine-grained actions, the surface area for operation interleaving and interference explodes. That is why modeling the protocol in TLA+ and model-checking it for correctness becomes non-negotiable.
The anti-pattern here would be to attempt to implement pseudo-atomic checks via distributed locking, or to handle concurrent additions and deletions as ad-hoc operational edge cases; both approaches are doomed to fail at large scale.
The Lesson: Choose atomicity carefully and push for the finest possible granularity using TLA+ modeling as an exploration tool. TLA+ forces you to define exactly what operations are atomic and helps you to model-check if the interleavings of these operations are safe.
5. PowerSet Paxos (2022)
I helped briefly with this model when my colleague (Chuck Carman) hit an exponential state-space explosion with his distributed transaction protocol, PowerSet Paxos. For a change, this time we have a testimonial from Chuck in his own words.
"The first time I made a distributed transaction protocol, I did it by sitting in a coffee shop with an excel sheet trying to come up with sets of rules to evolve the rows over time. This took weeks after work. I hit on the core idea: metadata encoding partial causal orders writing overlapping sets of keys. I didn't trust my algorithm at all. It took another three months to refine the algorithm using a tool to make sure it was correct (TLA+). It took a week to translate the TLA+ algorithm to code. It took way more time to write the test code. Maybe 75-80% of the code is testing all the invariants the TLA+ spec had in it. The long pole there was creating a DSL in Java land to effectively test all of the invariants TLA+ checked. That took a month or two.
For PowerSet Paxos... We haven't had a transaction corrupted yet, and the team is learning Pluscal to apply it to the rest of the system where there are errors around state machine transitions. Much regret has been expressed around not modeling those parts in TLA+ so that the main implementations would be mostly error free."
The Lesson: Code is cheap, testing broken algorithms is expensive.
I hope Chuck can share this model someday, alongside a publication on this protocol.
6. Secondary Index (2023)
When starting development on a secondary index for DSQL, the engineer who drafted the initial protocol wrote a six-page document, as is Amazon's tradition. But he kept finding corner cases with the indexing. At first, this problem did not look like a good fit for TLA+ modeling: the indexing happens centrally at a node, and the concurrency comes from ongoing operations on the database while the indexing tries to catch up to existing work. That sounds more data-centric than operation/protocol-centric. Here is the description, simplified from the patent that was eventually awarded for the final protocol:
"We initiate the creation of a unique secondary index on a live database without interrupting ongoing operations. To achieve this, the system backfills historical data up to a specific point in time while simultaneously applying all new, incoming updates directly to the index. Once the backfill finishes, we perform a final evaluation across the entire index to ensure no duplicate entries slipped through during this highly concurrent phase. If any unique constraint violations are detected, the system immediately flags the error and reports the exact cause."
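The steps in that description can be sketched in a few lines of Python. This is a generic illustration of the backfill-plus-dual-write pattern, not the patented protocol; the table contents and names are made up.

```python
from collections import defaultdict

table = {1: "alice", 2: "bob"}  # primary rows: id -> indexed column value
index = defaultdict(set)        # secondary index: value -> set of row ids

def apply_write(row_id, value):
    """New, incoming updates dual-write to the index immediately."""
    table[row_id] = value
    index[value].add(row_id)

def backfill(snapshot):
    """Backfill historical data as of a specific point in time."""
    for row_id, value in snapshot.items():
        index[value].add(row_id)

snapshot = dict(table)    # the point-in-time view the backfill reads
apply_write(3, "alice")   # a concurrent write lands during the backfill
backfill(snapshot)

# Final sweep: detect unique-constraint violations that slipped through
# the highly concurrent phase, and report the exact cause.
violations = {v: ids for v, ids in index.items() if len(ids) > 1}
assert violations == {"alice": {1, 3}}
```

The sketch shows why the final evaluation is necessary: neither the backfill nor the concurrent writer alone sees the duplicate, so uniqueness can only be confirmed by a sweep over the completed index.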
I was visiting the Seattle offices, and I told him that we should try modeling this in TLA+, given the large number of corner cases popping up. I then wrote a very crude initial model, and apparently that was enough to get him started. I was surprised to find that over the weekend, he had written variations of the model and made improvements to it, without any prior TLA+ experience.
The anti-pattern here would be to design/grow the protocol by thinking in control flow and patching corner cases one by one as they arise. Using TLA+ forced a more declarative mathematical approach. It acted as a design accelerator, because it is faster to fix a conceptual model than to whack-a-mole corner cases in code.
The lesson: break out of the implementation mindset and search the protocol solution space.
7. LeaseGuard: Raft Leases Done Right (2024)
I joined MongoDB Research in 2024. MongoDB has a 10+ year history of TLA+ success, including the logless dynamic reconfiguration, pull-based consensus replication, and extreme modeling projects.
Leader lease design was my first project at MongoDB. We designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash. We wrote about it here. Please go read it, it is a really good read.
Since we are TLA+ fans, we knew the importance of starting early on modeling the algorithm in TLA+. And this paid off big time: we discovered our two optimizations while writing the TLA+ spec for our initial, crude concept of the algorithm. The inherited lease reads optimization was especially surprising to us; we probably wouldn't have realized it was possible if TLA+ hadn't been helping us think. We also used TLA+ to check that LeaseGuard guarantees Read Your Writes and other correctness properties.
The Lesson: Design discovery over verification. TLA+ is useful not just for checking the correctness of a completed design, but for revealing new insights while the design is in progress. Modeling in TLA+ actively informed our protocol and we discovered surprising optimizations by exploring the protocol in TLA+.
8. MongoDB Distributed Transactions Modeling (2025)
This was my second project at MongoDB. We also wrote a blog post on this here, so I am going to just cut to the chase.
In this project, we developed the first modular TLA+ specification of MongoDB's distributed transaction protocol. The model separates the sharded transaction logic from the underlying storage and replication behavior, which lets us reason about the protocol at a higher level while still capturing key cross-layer interactions. Using the model, we validated isolation guarantees, explored anomalies under different ReadConcern and WriteConcern settings, and clarified subtle issues such as interactions with prepared transactions. Our spec is available publicly on GitHub.
This effort brought much-needed clarity to a big, complex distributed transactions protocol. I believe this is the most detailed TLA+ model of a distributed transactions protocol in existence. While database/systems papers occasionally feature a TLA+ transaction model, those typically focus on a very narrow slice. I am not aware of any other model that captures distributed transactions at this level of granularity. A big value of our model is that it serves as a reference for answering questions about a protocol that spans many teams, many years of development, and several software/service layers.
Furthermore, we used the TLA+ model traces from our spec to generate thousands of unit tests that exercise the actual WiredTiger implementation. This successfully bridged the gap between formal mathematical specification and concrete code.
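The trace-to-test idea can be sketched as follows. The trace format here is invented for the example (real TLC traces would need a proper parser, and the mapping from spec actions to WiredTiger API calls is much richer than this dictionary stand-in).

```python
# A model-checker trace, rendered as a list of abstract actions.
trace = [
    {"action": "Write", "key": "x", "value": 1},
    {"action": "Write", "key": "x", "value": 2},
    {"action": "Read", "key": "x", "expect": 2},
]

def run_trace(trace, db):
    """Replay each spec action against the implementation under test,
    checking that observed reads match what the model predicted."""
    for step in trace:
        if step["action"] == "Write":
            db[step["key"]] = step["value"]
        elif step["action"] == "Read":
            assert db.get(step["key"]) == step["expect"], step

db = {}  # stand-in for the real storage engine under test
run_trace(trace, db)
```

Each trace produced by the model checker becomes one deterministic unit test: the model supplies both the sequence of operations and the expected observations, so any divergence points directly at a spec-vs-implementation gap.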
The Lesson: Models can add value even retroactively, and can have a life beyond the initial design phase.
When I started writing this post this morning, I was originally planning to also talk about how to go about starting with your TLA+ modeling, and how things are (or might be) changing in the post-LLM world. But this post has already gotten very long, and I will leave that for next time.