Smart Casual Verification of the Confidential Consortium Framework
This paper (NSDI'25) applies lightweight formal methods (hence the pun "smart casual" in contrast to formal attire) to the Confidential Consortium Framework (CCF). CCF is an open-source platform for trustworthy cloud applications, used in Microsoft's Azure Confidential Ledger service. The authors combine formal specification, model checking, and automated testing to validate CCF's distributed protocols, specifically its custom consensus protocol.
Background
CCF uses trusted execution environments (TEEs) and state machine replication. Its consensus protocol is based on Raft but has major modifications: signature transactions (Merkle tree root over the whole log signed by the leader), optimistic acknowledgments, and partition leader step-down. CCF also avoids RPCs, using a uni-directional messaging layer. These changes add complexity, requiring formal verification. At the time of writing, CCF's implementation spans 63 kLoC in C++.
The authors define five requirements for CCF verification: verifying safety properties, documenting system behavior, increasing confidence in implementation, integrating with the codebase, and evolving with the system. They use TLA+ for formal specification and model checking. TLA+ specifies system behavior using temporal logic, verifying safety and liveness properties via model checking and simulation. However, verifying the protocol does not guarantee the implementation conforms to it.
Bridging this gap requires conformance checking, which can be done in two ways: test-case generation from the spec or trace validation of the implementation. MongoDB's VLDB'20 paper titled "eXtreme Modelling in Practice" explored both approaches, and is worth a look. The last four years brought significant progress in the applicability of both methods. I think deterministic simulations getting mature and gaining popularity helped both approaches.
Ok, the first approach, test-case generation, uses spec executions to create tests, which are then run against the implementation. Conformance is verified by comparing the implementation's results to those of the spec. This approach requires a highly controllable and observable implementation, effectively eliminating all non-determinism. This is most applicable and effective for single-node implementation components. Recently, at MongoDB Research, we leveraged our distributed transactions specs to generate test cases for the WiredTiger storage component testing. This compositional spec work, led by Will Schultz, will be submitted for publication soon.
This current Microsoft paper focuses on the second approach, trace validation, which works in reverse. It begins with an implementation trace and verifies whether that trace is permitted by a spec execution. This method presents its own challenges, often requiring the spec to be sufficiently detailed to align with the implementation. Consequently, it typically necessitates a "medium-level" spec rather than a more abstract, high-level one.
Distributed Consensus Specification
The TLA+ consensus specification models CCF's protocol with 17 actions and 13 variables. (TLA+ specifications from the project are available on GitHub.) The authors extend the spec for simulation and model checking to explore a wide range of behaviors. Simulation weights actions to increase coverage of rare but critical scenarios.
Trace validation checks if implementation traces match the system’s specification. Since collecting traces from a live production distributed system is difficult, the authors use traces from existing CCF tests. CCF had extensive unit, functional, and end-to-end tests before this work. Consensus testing used a scenario driver that serialized execution deterministically and mocked unrelated components, allowing controlled network faults and observability.
Trace validation verifies whether the behaviors in a trace (T) intersect with those allowed by the spec (S), ensuring T ∩ S ≠ ∅. TLC (TLA's explicit state model checker) constructs S, but cannot generate T from a trace. To address this, the authors wrote a new TLA+ spec, "Trace," reusing the spec actions but constraining them to observed events. The actions are only enabled iff the current event in the trace matches an action, and the actions are parameterized by the values taken from the trace.
The paper reports that aligning spec actions with trace events was still challenging and they had to address granularity mismatches. Invalid traces could still result from incorrect logging, faulty mappings, or genuine spec-implementation mismatches. Unlike model checking, failed validation provided no counterexample, but behaviors in T helped diagnose the issue.
State-space explosion made validating traces computationally expensive. To address this, the authors used depth-first search (DFS) in TLC instead of breadth-first search (BFS). This significantly improving performance. For example, validating a consistency trace took under a second with DFS, compared to an hour with BFS.
Client Consistency Specification
The client consistency spec models interactions between clients and CCF, verifying guarantees like linearizability for read-write transactions and serializability for read-only transactions. TLA+ formalizes these guarantees, and trace validation ensures the implementation adheres to them. This process mirrored the distributed consensus verification effort, so details are omitted. After seeing the method applied to consensus, CCF engineers applied the method themselves with minimal involvement from formal methods specialists. The effort took about one engineer-week spread over two weeks.
Here's a bug they found: Linearizability requires transaction execution to respect real-time ordering. The OBSERVEDROINV property states that committed read-only transactions must reflect all prior committed read-write transactions. Model checking found a 12-step counterexample in four seconds: a read-only transaction was processed by an old but active leader that had not logged recent write transactions from the new leader. This scenario requires a falsely replaced leader still handling requests before retiring. They chose not to change the behavior, as serializability suffices for most applications, but the spec helps clarify guarantees for developers.
Here is an interactive exploration of this bug using the Spectacle (nee tla-web) tool. (For some reason, the trace exploration works only after you press "Reset" first.)
Results
The verification process found bugs in election quorum tallying, commit index advancement, and premature node retirement. These bugs were detected through model checking, simulation, and trace validation. Table 2 lists the most serious consensus protocol bugs found. These affected both safety and liveness and were detected at various verification stages.
One notable bug is the Commit advance on AE-NACK. The spec required a leader’s matchIndex to remain unchanged after receiving an AE-NACK, but the implementation allowed it to decrease. This came from aggressively applying an optimization from the Raft paper. A one-line code change aligned the spec with the implementation, but subsequent simulation found a 34-state counterexample violating a key correctness property. Manual translation of this counterexample into a 150 LoC functional test confirmed that the implementation could advance its commit index incorrectly. Further refinement of the spec revealed that matchIndex should never decrease except after leader elections. Adding this property allowed model checking to find a shorter counterexample. The authors report that the combination of functional testing and model checking allowed a quick and confident fix.
Many bugs were first identified during spec development and alignment with the implementation. Writing the consensus and consistency specs forced deeper thinking about protocol invariants. This process also enabled bug discoveries that would have been difficult to confirm otherwise, especially in complex reconfiguration and failure-handling scenarios.
Comments