Modular verification of MongoDB Transactions using TLA+
Joint work with Will Schultz.
A transaction groups multiple operations into an all-or-nothing logical box, reducing the surface area exposed to concurrency control and fault recovery and simplifying the application programmer's job. Transactions support the ACID guarantees: atomicity, consistency, isolation, durability. Popular isolation levels include Read Committed (RC), Snapshot Isolation (SI), and Serializability (SER), which offer increasingly strong protection against concurrency anomalies.
MongoDB Transactions
MongoDB’s transaction model has evolved incrementally.
- v3.2 (2015): Introduced single-document transactions using MVCC in the WiredTiger storage engine.
- v4.0 (2018): Extended support to multi-document transactions within a replica set (aka shard).
- v4.2 (2019): Enabled fully distributed transactions across shards.
Replica Set Transactions. All transaction operations are first performed on the primary using the WiredTiger transaction workflow. Before the transaction commits, all of its updates are Raft-replicated to the secondaries using the assigned timestamp, ensuring consistent ordering. MongoDB uses Hybrid Logical Clocks (HLCs): the read timestamp reflects the latest stable snapshot, and the commit timestamp is issued atomically and advances the cluster time. The default ReadConcern="snapshot" ensures reads reflect a majority-committed snapshot at a given timestamp without yielding, and WriteConcern="majority" guarantees writes are durably replicated to a majority.
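For intuition, here is a minimal hybrid logical clock sketch in Python. This is not MongoDB's cluster-time implementation; the class and method names below are made up for illustration. The point is that timestamps track physical time but never go backwards, and merging a remote timestamp preserves causal ordering.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class HLC:
    wall: int      # physical component (e.g., seconds)
    counter: int   # logical component to break ties within the same wall time

class HybridLogicalClock:
    """Toy hybrid logical clock: timestamps follow physical time but never
    move backwards, and merging a remote timestamp preserves causality."""

    def __init__(self, now=lambda: int(time.time())):
        self.now = now
        self.latest = HLC(0, 0)

    def tick(self) -> HLC:
        """Advance the clock for a local event or an outgoing message."""
        pt = self.now()
        if pt > self.latest.wall:
            self.latest = HLC(pt, 0)
        else:
            self.latest = HLC(self.latest.wall, self.latest.counter + 1)
        return self.latest

    def observe(self, remote: HLC) -> HLC:
        """Merge a timestamp carried on an incoming message."""
        pt = self.now()
        wall = max(pt, self.latest.wall, remote.wall)
        if wall == self.latest.wall and wall == remote.wall:
            counter = max(self.latest.counter, remote.counter) + 1
        elif wall == self.latest.wall:
            counter = self.latest.counter + 1
        elif wall == remote.wall:
            counter = remote.counter + 1
        else:
            counter = 0
        self.latest = HLC(wall, counter)
        return self.latest
```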
Distributed Transactions. MongoDB transactions are general interactive transactions, rather than the limited one-shot transactions of DynamoDB. Clients execute a transaction through mongos, the transaction router. Mongos assigns the transaction a cluster-wide read timestamp and dispatches its operations to the relevant shard primaries. Each shard primary sets its local WiredTiger read timestamp and handles the operations. If a conflict (e.g., a write-write race) occurs, the shard aborts the transaction and informs mongos.
If the transaction is not aborted and the client asks for a commit, mongos asks the first shard contacted in the transaction to coordinate the two-phase commit (2PC). Handing off commit coordination to a Raft-replicated primary helps ensure the durability/recoverability of the transaction. The coordinating shard primary then runs a standard 2PC (sketched in code after the list):
- Sends "prepare" to all participant shards.
- Each shard Raft-replicates a prepare oplog entry with a local prepare timestamp.
- The coordinator picks the max prepare timestamp returned from the shards as the global commit timestamp.
- Participant shards Raft-replicate commit at this timestamp and acknowledge back.
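To make the commit-timestamp rule concrete, here is a hedged Python sketch of the coordinator's decision step. `ParticipantStub`, `prepare()`, and `commit_at()` are illustrative stand-ins rather than MongoDB interfaces; in reality the coordinator is itself a Raft-replicated state machine on the shard primary.

```python
class ParticipantStub:
    """Stand-in for a participant shard primary in this sketch."""
    def __init__(self, local_prepare_ts):
        self.local_prepare_ts = local_prepare_ts
        self.committed_at = None

    def prepare(self):
        # The real shard Raft-replicates a prepare oplog entry, then returns
        # the local prepare timestamp it chose.
        return self.local_prepare_ts

    def commit_at(self, commit_ts):
        # The real shard Raft-replicates the commit at this timestamp
        # before acknowledging.
        self.committed_at = commit_ts

def coordinate_commit(participants):
    """Two-phase commit decision: commit at the max of the prepare timestamps.
    `participants` maps shard id -> ParticipantStub."""
    prepare_ts = {sid: shard.prepare() for sid, shard in participants.items()}
    commit_ts = max(prepare_ts.values())
    for shard in participants.values():
        shard.commit_at(commit_ts)
    return commit_ts

# Example: two shards prepare at timestamps 17 and 21; the transaction commits at 21.
assert coordinate_commit({"S1": ParticipantStub(17), "S2": ParticipantStub(21)}) == 21
```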
Described in Alex Miller's Execution, Validation, Ordering, Persistence (EVOP) framework, MongoDB's distributed transactions look as in the figure below. MongoDB overlaps execution and validation: execution stages local changes at each participant shard, while validation checks for write-write and prepare conflicts. Ordering and persistence come later. The global commit timestamp provides the ordering for the transactions, and shards expose changes atomically at this timestamp.
MongoDB provides an MVCC+OCC flavor of concurrency control that prefers aborts over blocking. WiredTiger acquires a lock on a key at the first write access, causing later conflicting transactions to abort rather than wait. Avoiding waits reduces contention and improves throughput under high concurrency.
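A toy model of this abort-over-block behavior (not WiredTiger's actual data structures, and all names below are illustrative): the first in-flight transaction to write a key reserves it, and a later conflicting writer fails immediately instead of waiting.

```python
class WriteConflict(Exception):
    """Raised instead of blocking when two in-flight transactions write the same key."""

class ToyOCCStore:
    def __init__(self):
        self.write_owner = {}        # key -> id of the transaction that reserved it

    def stage_write(self, txn_id, key, value, staged_writes):
        owner = self.write_owner.get(key)
        if owner is not None and owner != txn_id:
            # First writer wins; the later conflicting transaction aborts.
            raise WriteConflict(f"txn {txn_id} conflicts with txn {owner} on key {key!r}")
        self.write_owner[key] = txn_id
        staged_writes[key] = value   # updates stay private to the transaction until commit

    def release(self, txn_id):
        """Called on commit or abort to release the keys this transaction reserved."""
        for key in [k for k, o in self.write_owner.items() if o == txn_id]:
            del self.write_owner[key]
```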
TLA+ Modeling
Distributed transactions are difficult to reason about. MongoDB's protocol evolved incrementally, tightly coupling the WiredTiger storage layer, replication, and the sharding infrastructure. Other sources of complexity include aligning time across the cluster, speculative majority reads, the recovery protocol for router failure, chunk migration by the catalog, interactions with DDL operations, and fault tolerance.
To reason formally about MongoDB's distributed transaction protocol, we developed the first TLA+ specification of multi-shard database transactions at scale. Our spec is publicly available on GitHub, and it captures the transaction behavior and isolation guarantees precisely.
Our TLA+ model is modular. It consists of MultiShardTxn, which encodes the high-level sharded transaction protocol, and Storage, which models replication and storage behavior at each shard. This modularity pays off big-time as we discuss in the model-based verification section below.
We validate isolation using a state-based checker based on the theory of Crooks et al. (PODC'17) and the TLA+ library implemented by Soethout. The library takes a log of transaction operations and checks whether the transactions satisfy snapshot isolation, read committed, and so on. This is a huge boost for checking/validating transaction isolation.
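For intuition, here is a much-simplified snapshot isolation check in Python. Unlike the client-centric commit tests of Crooks et al., which work from client-observable operations alone, this sketch assumes every committed transaction carries explicit start and commit timestamps, and it ignores transactions reading their own writes.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CommittedTxn:
    start_ts: int
    commit_ts: int
    reads: Dict[str, int]    # key -> value the transaction observed
    writes: Dict[str, int]   # key -> value the transaction wrote

def value_at(history: List[CommittedTxn], key: str, ts: int, initial=0) -> int:
    """Value of `key` in the snapshot at timestamp `ts`: the write of the
    latest transaction that committed at or before `ts`."""
    value, latest = initial, 0
    for t in history:
        if key in t.writes and latest < t.commit_ts <= ts:
            value, latest = t.writes[key], t.commit_ts
    return value

def satisfies_snapshot_isolation(history: List[CommittedTxn]) -> bool:
    for t in history:
        # 1. Every read sees the snapshot as of the transaction's start timestamp.
        for key, observed in t.reads.items():
            if observed != value_at(history, key, t.start_ts):
                return False
        # 2. First-committer-wins: concurrent transactions must not write the same key.
        for other in history:
            concurrent = (other is not t and
                          not (other.commit_ts <= t.start_ts or t.commit_ts <= other.start_ts))
            if concurrent and set(t.writes) & set(other.writes):
                return False
    return True
```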
Our TLA+ model helps us explore how the ReadConcern/WriteConcern (RC/WC) selection for MongoDB's tunable consistency levels affects transaction isolation guarantees. As MongoDB already tells its customers, "ReadConcern: majority" does not guarantee snapshot isolation. If you use it instead of "ReadConcern: snapshot", you may get fractured reads: a transaction may observe some, but not all, of another transaction's effects.
Let's illustrate this with a simplified two-shard, two-transaction model from an earlier version of the spec. T1 writes K1 and K2 (sharded to S1 and S2, respectively) and commits via two-phase commit. T2 reads K1 before T1 writes it, and reads K2 after T1 has committed. Under `readConcern: majority`, T2 sees the old value of K1 and the new value of K2, violating snapshot isolation. The read is fractured.
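Here is that interleaving replayed as a tiny Python script. The per-shard dictionaries and the "old"/"new" values are purely illustrative.

```python
# Toy replay of the fractured read. Each shard's data is just a dict.
s1 = {"K1": "old"}   # K1 lives on shard S1
s2 = {"K2": "old"}   # K2 lives on shard S2

t2 = {}

t2["K1"] = s1["K1"]          # T2 reads K1 on S1 before T1 writes it -> "old"

s1["K1"] = "new"             # T1 writes K1 and K2 and commits via 2PC;
s2["K2"] = "new"             # both shards now expose the new, majority-committed values

t2["K2"] = s2["K2"]          # with readConcern "majority" (no cluster-wide snapshot),
                             # T2's later read on S2 returns the new value -> "new"

# T2 observed part of T1's effects but not all of them: a fractured read.
assert t2 == {"K1": "old", "K2": "new"}
```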
You can explore this violation trace using a browser-based TLA+ trace explorer that Will Schultz built by following this link. The Spectacle tool loads the TLA+ spec from GitHub, interprets it with a JavaScript interpreter, and visualizes the state changes step by step. You can step backwards and forwards using the buttons and explore the enabled actions. This makes model outputs accessible to engineers unfamiliar with TLA+, and you can share a violation trace simply by sending a link.
Modeling also helped us clarify another subtlety: handling prepared but uncommitted transactions in the storage engine. If the transaction protocol ignored a prepare conflict, T2's read at its start-timestamp snapshot might see a value that appears valid at that timestamp but is later overwritten by T1's commit at an earlier timestamp, violating snapshot semantics. A read must therefore wait on prepared transactions to avoid this problem. This is an example of the cross-layer interaction between the transaction protocol and the underlying WiredTiger storage that we mentioned earlier.
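A hedged sketch of that rule in Python: a read at timestamp `read_ts` that hits a key written by a prepared-but-undecided transaction whose prepare timestamp is at or below `read_ts` must wait for the outcome rather than skip over the prepared write. The registry below is a toy stand-in, not WiredTiger's mechanism.

```python
import threading

class PreparedKeyRegistry:
    """Toy registry of keys written by prepared-but-undecided transactions."""

    def __init__(self):
        self._cond = threading.Condition()
        self._prepared = {}                 # key -> prepare timestamp

    def mark_prepared(self, key, prepare_ts):
        with self._cond:
            self._prepared[key] = prepare_ts

    def decide(self, key):
        """Called once the prepared transaction commits or aborts."""
        with self._cond:
            self._prepared.pop(key, None)
            self._cond.notify_all()

    def wait_if_prepare_conflict(self, key, read_ts):
        """A read at read_ts must not skip a prepared write that might still
        commit at or before read_ts; it blocks until the outcome is known."""
        with self._cond:
            while key in self._prepared and self._prepared[key] <= read_ts:
                self._cond.wait()
```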
Finally, some performance stats. Checking snapshot isolation and read committed with the TLA+ model on an EC2 `m6g.2xlarge` instance takes around 10 minutes for small instances of the problem. With just two transactions and two keys, the state space is large but manageable. Bugs, if they exist, tend to show up even in these small instances.
Model-based Verification
We invested early in modularizing the storage model (a discipline Will proposed), and it paid off. With a clean storage API between the transaction layer and the storage engine, we can generate test cases from TLA+ traces that exercise the real WiredTiger code, rather than just validating traces against the model. This bridges the gap between model and implementation.
WiredTiger, being a single-node embedded store with a clean API, is easy to steer and test. We exploit this by generating unit tests directly from the model. Will built a Python tool (sketched loosely after the list below) that:
- Parses the state graph from TLC,
- Computes a minimal path cover,
- Translates each path into Python unit tests,
- Verifies that implementation values conform to the model.
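We won't reproduce Will's tool here, but its rough shape is easy to sketch. Everything below is illustrative: the graph format, the helper names, and the emitted test body. The sketch also settles for covering every edge with some path instead of computing a truly minimal path cover.

```python
from collections import defaultdict, deque

def edge_covering_paths(edges, initial_state):
    """Cover every transition (src, action, dst) of a state graph with a path
    from the initial state. The real tool computes a minimal path cover; this
    sketch just prefixes each edge with a shortest path to its source state."""
    out = defaultdict(list)
    for src, action, dst in edges:
        out[src].append((src, action, dst))

    # BFS tree of shortest paths from the initial state.
    parent_edge = {initial_state: None}
    queue = deque([initial_state])
    while queue:
        state = queue.popleft()
        for edge in out[state]:
            if edge[2] not in parent_edge:
                parent_edge[edge[2]] = edge
                queue.append(edge[2])

    def path_to(state):
        path = []
        while parent_edge.get(state) is not None:
            edge = parent_edge[state]
            path.append(edge)
            state = edge[0]
        return list(reversed(path))

    return [path_to(src) + [(src, a, dst)]
            for (src, a, dst) in edges if src in parent_edge]

def emit_unit_test(path, name):
    """Turn one path into a Python unit test body: drive the implementation
    with each action and assert its state matches the model's next state."""
    lines = [f"def test_{name}(self):"]
    for _, action, next_state in path:
        lines.append(f"    self.apply_model_action({action!r})")
        lines.append(f"    self.assert_matches_model_state({next_state!r})")
    return "\n".join(lines)
```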
This approach is powerful: our handwritten unit tests number in the thousands, but the generator produces over 87,000 tests in 30 minutes. Each test exercises the precise contract defined in the model, systematically linking high-level protocol behavior to the low-level storage layer. These tests bridge model and code, turning formal traces into executable validations.
Permissiveness
We also use TLA+ to evaluate the permissiveness of MongoDB's transaction protocol: the degree of concurrency it allows under a given isolation level without violating correctness. Higher permissiveness translates to fewer unnecessary aborts and better throughput. Modeling lets us quantify how much concurrency is sacrificed for safety, and where the implementation might be overly conservative.
To do this, we compare the behaviors our protocol accepts against the abstract commit tests from the Crooks et al. PODC'17 paper. Comparing the transaction protocol's behavior to these idealized isolation specs lets us locate overly strict choices and explore safe relaxations.
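Over a finite model, permissiveness reduces to a ratio. A hedged sketch, with illustrative names:

```python
def permissiveness(candidate_histories, protocol_accepts, isolation_allows):
    """Over a finite enumeration of candidate histories (e.g., the behaviors of
    a small model), the fraction of isolation-allowed histories that the
    protocol also admits. `protocol_accepts` and `isolation_allows` are
    predicates over a history; 1.0 means maximally permissive."""
    allowed = [h for h in candidate_histories if isolation_allows(h)]
    accepted = [h for h in allowed if protocol_accepts(h)]
    return len(accepted) / len(allowed) if allowed else 1.0
```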
For example, for read committed, MongoDB's transaction protocol (computed over our finite model configurations) accepts around 76% of the behaviors allowed by the isolation spec. One source of restriction for read committed is prepare conflicts, a mechanism that prevents certain races; disabling it raises permissiveness to 79%. In one such case, a transaction reads the same key twice and sees different values: a non-repeatable read. Snapshot isolation forbids this, but read committed allows it. MongoDB blocks it, perhaps unnecessarily. If relaxing this constraint improves performance without violating correctness, it may be worth reconsidering.
Future Work
Our modular TLA+ specification brings formal clarity to a complex, distributed transaction system. But work remains on the following fronts:
- Model catalog behavior to ensure correctness during chunk migrations.
- Extend multi-grain modeling to other protocols.
- Generate test cases directly from TLA+ to bridge spec and code.
- Analyze and optimize permissiveness to improve concurrency.