Vive la Difference: Practical Diff Testing of Stateful Applications
This Google paper (to appear in VLDB'25) is about not blowing up your production system. That is harder than it sounds, especially for stateful applications, which have memories. When rolling out new versions of stateful applications, the "shared, persistent, mutable" data means bugs can easily propagate across versions. Modern rollout tricks (canaries, blue/green deployments) don't save you from this. Subtle cross-version issues often slip through pre-production testing and surface in production, sometimes days or weeks later.
These bugs can be severe, and the paper categorizes them as data corruption, data incompatibility, and false data assumptions. The paper mentions real-world incidents from Google and open-source projects to emphasize these bugs' long detection and resolution times, and the production outages and revenue loss they cause.
So, we need tooling that directly tests v1/v2 interactions on realistic data before a rollout. The paper delivers a prototype of a stateful diff testing framework, which extends the familiar idea of diff testing (comparing outputs between two implementations) to handle persistent state. So, you take v1 (the old code) and v2 (the shiny new code). You feed them the same requests on the same kind of data you'll see in production. Then you compare two things: the answers they give, and the state they leave behind. This framework aims to solve three problems:
- Comparing versions on production-scale data without risking live systems.
- Testing subtle cross-version interactions, including request interleavings that occur during rollouts.
- Capturing and efficiently reasoning about changes to persistent state.
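To make the core idea concrete, here is a minimal Go sketch of what stateful diff testing compares. This is my own illustration, not the paper's API: the Version, Request, Response, and StateDiff names are hypothetical stand-ins for "a binary running against its own branch of the snapshot."

```go
// A minimal sketch (not the paper's API) of what stateful diff testing compares:
// both the responses each version returns and the state each version leaves behind.
package difftest

import "reflect"

// Version abstracts a deployable binary (v1 or v2) running against its own
// writable branch of the same initial database snapshot. Names are hypothetical.
type Version interface {
	Handle(req Request) Response // serve one client request
	StateDiff() map[string][]Row // rows changed relative to the initial snapshot, per table
}

type Request struct{ Payload string }
type Response struct {
	Body string
	Err  error
}
type Row map[string]any

// Report captures both kinds of divergence the framework looks for.
type Report struct {
	ResponseDiffs []int // indices of requests whose responses differ
	StateDiffers  bool  // true if the final states diverge
}

// DiffTest replays the same request sequence on v1 and v2 and compares
// (1) the answers they give and (2) the state they leave behind.
func DiffTest(v1, v2 Version, reqs []Request) Report {
	var rep Report
	for i, r := range reqs {
		if !reflect.DeepEqual(v1.Handle(r), v2.Handle(r)) {
			rep.ResponseDiffs = append(rep.ResponseDiffs, i)
		}
	}
	rep.StateDiffers = !reflect.DeepEqual(v1.StateDiff(), v2.StateDiff())
	return rep
}
```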
The framework (code available on GitHub) takes three inputs: an initial database state (preferably a snapshot from production, restored in a test environment), a set of representative requests (sampled, synthetic, or hand-crafted), and two binaries, v1 (current production) and v2 (candidate release).
The framework operates across three stages: branching, replaying, diffing.
Branching
They bolt a branching layer onto PostgreSQL to create instant, isolated, writable branches of the database without copying the entire dataset. Two delta tables (R⁺, R⁻) track the branch's changes, and a derived view (R′) stands in for the original table from the application's point of view. Triggers on the view intercept every insert, delete, and update, keeping the base table untouched. Common integrity constraints (primary key, uniqueness, non-null, foreign key) are enforced per branch, but certain features (CHECK constraints, pre-existing triggers) are unsupported, limiting coverage for some schemas. The approach avoids changes to Postgres internals or application queries, making it minimally invasive.
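Here is how I picture the mechanism for a single table r, as a rough Go sketch that issues the SQL. The names (r_plus, r_minus, r_prime, r_prime_write) are my stand-ins for R⁺, R⁻, R′; per-branch constraint enforcement is omitted, and the real framework would have to generate these statements per table from the schema.

```go
// A rough sketch (my reconstruction, not the paper's code) of the bolt-on branching
// idea for one table r: the base table stays untouched, r_plus/r_minus record the
// branch's changes, r_prime is the view the application queries, and a trigger on
// the view redirects writes into the delta tables.
package branching

import "database/sql"

// CreateBranch installs the delta tables, view, and trigger for table r.
// Constraint enforcement (PK, uniqueness, etc.) is handled separately in the
// paper and omitted here for brevity.
func CreateBranch(db *sql.DB) error {
	stmts := []string{
		// Rows added (or new versions of updated rows) in this branch.
		`CREATE TABLE r_plus (LIKE r INCLUDING DEFAULTS)`,
		// Primary keys of base rows deleted (or superseded by an update) in this branch.
		`CREATE TABLE r_minus (id BIGINT PRIMARY KEY)`,
		// The branch's logical view of r: base rows minus deletions, plus additions.
		`CREATE VIEW r_prime AS
		   SELECT * FROM r WHERE id NOT IN (SELECT id FROM r_minus)
		   UNION ALL
		   SELECT * FROM r_plus`,
		// Redirect writes on the view into the delta tables, leaving r untouched.
		`CREATE FUNCTION r_prime_write() RETURNS trigger AS $$
		 BEGIN
		   IF TG_OP = 'INSERT' THEN
		     INSERT INTO r_plus SELECT NEW.*;
		     RETURN NEW;
		   ELSIF TG_OP = 'DELETE' THEN
		     DELETE FROM r_plus WHERE id = OLD.id;
		     INSERT INTO r_minus VALUES (OLD.id) ON CONFLICT DO NOTHING;
		     RETURN OLD;
		   ELSE -- UPDATE: old version is logically deleted, new version goes to r_plus
		     DELETE FROM r_plus WHERE id = OLD.id;
		     INSERT INTO r_minus VALUES (OLD.id) ON CONFLICT DO NOTHING;
		     INSERT INTO r_plus SELECT NEW.*;
		     RETURN NEW;
		   END IF;
		 END;
		 $$ LANGUAGE plpgsql`,
		`CREATE TRIGGER r_prime_redirect
		   INSTEAD OF INSERT OR UPDATE OR DELETE ON r_prime
		   FOR EACH ROW EXECUTE FUNCTION r_prime_write()`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			return err
		}
	}
	return nil
}
```

Creating a branch is just a handful of DDL statements, which is why it is fast and independent of table size.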
I liked this approach a lot. It leans on monotonicity principles: the base tables are never mutated, so a branch can be created instantly, and the work is proportional to the size of the diffs rather than the size of the whole table. This is a bolt-on approach, so they don't get the full benefits/features of versioned databases, but they can also boast that the approach carries over to other databases as well (MySQL, Oracle, MongoDB).
Replaying
Requests are executed serially, in three patterns: entirely on v1, entirely on v2, and in interleaved sequences between the two versions. Interleaving is crucial for catching compatibility issues where one version reads or writes data in a way the other cannot handle. The framework generates multiple random interleavings for each request sequence, trading exhaustive coverage for practical runtime. Concurrency bugs are explicitly out of scope, since the heavy (exponential) nondeterminism they induce is beyond what this framework tries to explore.
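A small sketch of how the replay schedules could be generated (assumed names, not the paper's code): each request sequence gets an all-v1 run, an all-v2 run, and a handful of random v1/v2 interleavings, each executed serially against a fresh branch.

```go
// A minimal sketch of the replay stage: version assignments for one request
// sequence, covering all-v1, all-v2, and several random interleavings.
package replay

import "math/rand"

type Request struct{ Payload string }

// Executor abstracts "run this request on version v against the current branch".
type Executor func(version int, req Request)

// Schedules returns version assignments for n requests: all-v1, all-v2,
// and numRandom random interleavings.
func Schedules(n, numRandom int, rng *rand.Rand) [][]int {
	all := func(v int) []int {
		s := make([]int, n)
		for i := range s {
			s[i] = v
		}
		return s
	}
	scheds := [][]int{all(1), all(2)}
	for k := 0; k < numRandom; k++ {
		s := make([]int, n)
		for i := range s {
			s[i] = 1 + rng.Intn(2) // each request randomly goes to v1 or v2
		}
		scheds = append(scheds, s)
	}
	return scheds
}

// Replay runs the requests serially under one schedule.
func Replay(reqs []Request, schedule []int, exec Executor) {
	for i, r := range reqs {
		exec(schedule[i], r)
	}
}
```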
Diffing
The database diffing algorithm is designed for efficiency: rather than scanning full tables, it inspects only the delta tables (R⁺, R⁻), making runtime proportional to the number of changed rows. It produces three-way diffs: original state, v1 result, v2 result. This gives crucial context absent in two-way comparisons: not only what changed, but how each version diverged from the common starting point.
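Here is how I imagine the delta-driven three-way diff, again in my own formulation rather than the paper's code: only keys touched by either branch are visited, so the cost scales with the size of the deltas, and each entry shows the original row alongside what v1 and v2 left behind.

```go
// A sketch of delta-based three-way diffing: each branch's net changes (its
// R+ and R- contents) drive the diff, so the base table is consulted only for
// the keys that were actually touched.
package diffing

type Row map[string]any

// Delta is one branch's net effect on a table, keyed by primary key.
type Delta struct {
	Deleted  map[int64]bool // keys recorded in that branch's R-
	Upserted map[int64]Row  // rows recorded in that branch's R+
}

// Entry is one line of the three-way diff: the original row (nil if absent)
// and what each version left behind (nil if that version deleted it).
type Entry struct {
	Key      int64
	Original Row
	AfterV1  Row
	AfterV2  Row
}

// ThreeWayDiff visits only keys touched by either branch; lookupBase fetches
// the original row for a touched key from the unmodified base table.
func ThreeWayDiff(v1, v2 Delta, lookupBase func(int64) Row) []Entry {
	touched := map[int64]bool{}
	for k := range v1.Deleted {
		touched[k] = true
	}
	for k := range v1.Upserted {
		touched[k] = true
	}
	for k := range v2.Deleted {
		touched[k] = true
	}
	for k := range v2.Upserted {
		touched[k] = true
	}

	// after computes a branch's final value for key k, given the base row.
	after := func(d Delta, k int64, base Row) Row {
		if r, ok := d.Upserted[k]; ok {
			return r
		}
		if d.Deleted[k] {
			return nil
		}
		return base // untouched by this branch
	}

	var out []Entry
	for k := range touched {
		base := lookupBase(k)
		out = append(out, Entry{k, base, after(v1, k, base), after(v2, k, base)})
	}
	return out
}
```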
Evaluation
The authors implemented the framework in Go and tested it using synthetic workloads and the Bank of Anthos, a friendly little demo application. In these tests, they injected bugs from the three categories described earlier and showed that the framework catches them and produces clear diffs explaining the discrepancies.
They present extensive performance benchmarks showing that branch creation takes ~100 ms independent of database size, read/write overheads are modest, and diff computation time scales with the size of the diff rather than the full dataset. Compared to naive Postgres SQL queries, their diffing is up to 100x faster. Compared to Dolt (a versioned database), branching is slower but read/write performance is better for large tables.
Discussion
As in the past two weeks, Aleksey and I live-recorded our reading and discussion of this paper to show the thought process and mechanics of how experts read papers in real time. You can watch the discussion video here (1.5x speed is best). The paper I annotated during our live-read is also available here.
Below are some of the main points from our discussion.
We initially expected the paper to be more of an experience report on bug-catching at Google scale. Instead, the evaluation presented a toy example and focused on microbenchmarks measuring the latency of framework components. In that sense, the evaluation did not connect tightly to the production-safety problem. Something to look forward to in a follow-up paper.
The approach applies to a single-node database, but Google operates large-scale distributed databases. How would this method translate to Spanner? Although the current framework does not address concurrency, if one could capture the exact sequence of production inputs and replay them in a test setting, could that provide meaningful results?
The paper says: "The exact procedure by which inputs (the sequence of client requests) are generated is out of scope for this paper. Typically, they are generated randomly or drawn from a pool of hand-crafted inputs." But this sidesteps the important question: how do you generate the right inputs to catch bugs? And how do you measure coverage, improve it, and gain confidence before rolling out to production?
We also wanted a more explicit, qualitative comparison between their delta-table branching method and alternative techniques such as copy-on-write branching.
Another concern was that manual diff inspection could become a bottleneck. Could LLMs help triage diffs and surface the most suspicious ones?
Overall, the paper is in a strong place. It bridges the database and testing fields, and even touches cloud operations, making it well-positioned for long-term impact. It would be good to check on this prediction in 5-10 years.
Finally, a shout-out to a great guy, Michael Whittaker. We had collaborated on the "Compartmentalized Paxos" paper, and we were impressed with his clear on-the-fly iPad illustrations of protocols/ideas in our Zoom meetings. He is likely responsible for the clean figures in this paper. Fun fact: he skipped formal slides for his PhD thesis defense and instead drew the protocols live on his iPad (a skill that comes from hundreds of hours of prior practice in meetings).