Multi-Grained Specifications for Distributed System Model Checking and Verification

This EuroSys 2025 paper wrestles with the messy interface between formal specification and implementation reality in distributed systems. The case study is ZooKeeper. The trouble with verifying something big like ZooKeeper is that the spec and the code don’t match. Spec wants to be succinct and abstract; code has to be performant and dirty. 

For instance, a spec might say, “this happens atomically.” But the actual system says, “yeah, buddy, right.” Take the case of FollowerProcessNEWLEADER: the spec bundles updating the epoch and accepting the leader’s history into a single atomic step. But in the real system, those steps are split across threads and separated by queuing, I/O, and asynchronous execution. Modeling them as atomic would miss real, observable intermediate states, and real bugs.
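To make the gap concrete, here is a minimal TLA+ fragment (an illustrative sketch, not the paper's actual spec; the variable names are made up) of the coarse, atomic view of this step:

VARIABLES currentEpoch, history   \* simplified follower state

\* Coarse-grained view: the epoch update and the history acceptance happen
\* in one indivisible transition, so no other action can ever observe a
\* state where one has happened but not the other.
FollowerProcessNEWLEADER(newEpoch, leaderHistory) ==
    /\ currentEpoch' = newEpoch
    /\ history' = leaderHistory

The fine-grained version splits this into the steps the implementation actually takes; a sketch of that split appears below, with the Figure 8 discussion.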

To bridge this model-code gap, the authors use modularization and targeted abstraction. Don’t write one spec, write multi-grained specs. Different parts of the system are modeled at different granularities, depending on what you're verifying. Some parts are modeled in fine detail; others are coarse and blurry. Like reading a novel and skimming the boring parts. These modules, written at different levels of abstraction, are composed into a mixed-grained spec, tailored to the verification task at hand.

Central to the method is the interaction-preserving principle, a key idea borrowed from a 2022 paper by some of the same authors. Coarsening is allowed, but only if you preserve interaction variables: the shared state that crosses module boundaries. Modules can lie to themselves internally to abstract details away, but they cannot lie to each other. Other modules should not be able to distinguish whether they are interacting with the original module or its coarsened version. The paper formalizes this using dependency relations over actions and their enabling conditions. This sounds a lot like rely-guarantee reasoning, but encoded for TLA+.
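Roughly, the idea looks like this in TLA+ (an illustrative fragment with made-up names, not the paper's formalism). Here channel is an interaction variable read by another module, while pending is internal:

EXTENDS Sequences

VARIABLES pending, channel   \* pending is internal; channel is an interaction variable

\* Fine-grained: two steps, exposing an intermediate state where the
\* message is buffered internally but not yet visible on the channel.
FinePrepare(m) ==
    /\ pending' = m
    /\ UNCHANGED channel

FineSend ==
    /\ channel' = Append(channel, pending)
    /\ UNCHANGED pending

\* Coarse-grained: one step. The internal intermediate state disappears,
\* but the effect on the interaction variable channel is the same, so a
\* module that only reads channel cannot tell which version it is
\* composed with.
CoarseSend(m) ==
    /\ channel' = Append(channel, m)
    /\ UNCHANGED pending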

What makes this paper stand out is its application. This is a full-on, hard-core TLA+ verification effort against a real-world messy codebase. It’s a good guide for how to do formal work practically, without having to rewrite the system in a verification-oriented language. This paper is not quite a tutorial, as it skips a lot of the details, but the engineering is real, and the bugs are real too.


Technical details

The system decomposition follows the Zab protocol phases: Election, Discovery, Synchronization, Broadcast. This is a clever hack. Rather than inventing new module boundaries (say through roles or layers), the authors simply split along the natural phase boundaries in the protocol. The next-state action in the TLA+ spec is already a disjunction across phase actions; this decomposition just formalizes and exploits that.
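In TLA+ terms, the skeleton looks something like this (the action names are illustrative, not the paper's):

\* The top-level next-state relation is already a disjunction over phases,
\* so choosing a granularity per module is a local substitution.
Next ==
    \/ ElectionAction          \* kept coarse: abstracts votes-in-flight
    \/ DiscoveryAction         \* kept coarse
    \/ SynchronizationAction   \* refined into per-thread sub-steps
    \/ BroadcastAction         \* refined

A mixed-grained spec plugs coarse or fine variants of these phase actions into the same disjunction.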

They focus verification on log replication involving the Synchronization and Broadcast phases, while leaving Election and Discovery mostly coarse. They say that's where the bugs are. They also mention that leader election is a tangle of nondeterminism and votes-in-flight, and its models take days to check. Log replication, by contrast, is where local concurrency hits hard, and where many past bugs and failed fixes appear.

Table 1 summarizes the spec configurations: baseline (system spec), coarsened, and fine-grained. The fine-grained specs are where the action is: they capture atomicity violations, thread interleavings, and missing transitions. These bugs don’t show up when model checking with the baseline spec. It all comes down to how fine a granularity you are willing to model to capture the implementation's behavior. Targeted fine-granularity checking is what keeps this practical.

Table 5 shows model checking performance. Coarse specs are fast but miss bugs. Fine specs find bugs but blow up. Mixed specs get the best of both. For instance, mSpec-3 finds deep bugs in seconds, while the full system spec doesn’t terminate within 24 hours. The bottleneck is the leader election spec, which shows again that targeted coarsening pays off when it preserves interactions.

Bug detection is driven by model checking with TLC, which reports invariant violations as model-level traces. These violations are then confirmed via code-level deterministic replay. The authors built a system (Remix) that instruments ZooKeeper, injects RPC hooks via AspectJ, and replays TLA+-level traces: a centralized coordinator schedules and controls the interleaving of code-level actions, guided by a developer-provided mapping from model actions to code events. This deterministic replay lets them validate that a model-level bug can actually manifest in real runs.

Figure 8 ties back to the core atomicity problem in FollowerProcessNEWLEADER. Many of these bugs arise when the update of the epoch and the logging of the history are not properly sequenced or observed. The spec treats them as one unit, but the implementation spreads them across threads and queues. By splitting the spec action into three, the authors model each thread’s contribution separately and capture the inter-thread handoff. With this, they catch data loss, inconsistency, and state corruption bugs that had previously escaped detection, or worse, had been "fixed" and re-broken.
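Here is a sketch of what such a split might look like in TLA+ (again illustrative; the names and the exact division of labor across ZooKeeper's threads are simplified guesses, not the paper's spec):

VARIABLES queued, currentEpoch, history
CONSTANTS NewEpoch, LeaderHistory   \* carried by the NEWLEADER message (simplified)

FollowerQueueNEWLEADER ==      \* I/O thread receives and enqueues the message
    /\ ~queued
    /\ queued' = TRUE
    /\ UNCHANGED <<currentEpoch, history>>

FollowerUpdateEpoch ==         \* processing thread updates the epoch
    /\ queued
    /\ currentEpoch' = NewEpoch
    /\ UNCHANGED <<queued, history>>

FollowerAcceptHistory ==       \* logging path persists the leader's history
    /\ queued
    /\ history' = LeaderHistory
    /\ UNCHANGED <<queued, currentEpoch>>

Nothing here forces an order between the epoch update and the history acceptance, and TLC can interleave crashes or other events between them. That is exactly the window the atomic version hides.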

The authors didn’t stop at finding bugs. They submitted fixes upstream to ZooKeeper, validating them with the same fine-grained specifications. To fix the existing bugs and also make correct implementation easier, they dropped the atomicity requirement on the two updates from the Zab protocol but required a specific order: the follower updates its history before updating its epoch. The patched version passed their model checks, and the PR was merged. This tight loop (spec, bug, fix, verify) is what makes this method stand out: formal methods not as theory, but as a workflow.
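In terms of the sketch above, the fix reads as an enabling condition rather than atomicity (again an illustrative rendering, not the actual patch or spec):

\* The epoch update is enabled only after the leader's history has been
\* accepted, so the history-then-epoch order is enforced without requiring
\* the two updates to happen atomically.
FollowerUpdateEpochFixed ==
    /\ queued
    /\ history = LeaderHistory
    /\ currentEpoch' = NewEpoch
    /\ UNCHANGED <<queued, history>>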


Conclusions

The results back the thesis: fine-grained modeling is essential to catch real bugs. Coarsening is essential to make model checking scale. Modularity and compositionality are a feasible way to manage verification of a complex, concurrent, evolving system. It's not push-button verification. But it's doable and useful.

The work opens several directions. Could this technique support test generation from the mixed model? If you already have deterministic replay and instrumentation hooks, generating inputs for fault-injection testing seems within reach.

More speculatively, are there heuristics to suggest good modular cuts? The current modularization is protocol-phase aligned, but could role-based, data-centric, or thread-boundary modularization give better results? That seems like a good area for exploration.
