tag:blogger.com,1999:blog-84363307621363443792024-03-18T20:09:27.521-04:00MetadataOn distributed systems broadly defined and other curiosities. The opinions on this site are my own.Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.comBlogger660125tag:blogger.com,1999:blog-8436330762136344379.post-16993303226251821752024-03-15T10:03:00.003-04:002024-03-15T13:18:43.073-04:00 The demise of coding is greatly exaggerated<p><a href="https://twitter.com/Carnage4Life/status/1761483377365152234">NVIDIA CEO Jensen Huang recently made very controversial remarks</a>:</p><p><i>"Over the course of the last 10 years, 15 years, almost everybody who sits on a stage like this would tell you that it is vital that your children learn computer science, and everybody should learn how to program. And in fact, it’s almost exactly the opposite.</i></p><p><i>It is our job to create computing technology such that nobody has to program and that the programming language is human. Everybody in the world is now a programmer. </i><i>This is the miracle of artificial intelligence."</i></p><p>I am not going to wisecrack and say that this is power poisoning, and that this is what happens when your company's valuation more than triples in a year and surpasses Amazon and Google. (Although I don't discount this effect completely.)</p><p>Jensen is very smart and also has <a href="https://x.com/RachaelRad/status/1767612555063955891">some great wisdom</a>, so I think we should give this the benefit of the doubt and try to respond in a thoughtful manner. </p><p>A response is warranted because this statement got a lot of publicity and created confusion for a wide range of people, as it comes with some authority behind it. My brother asked me about this, presumably because he wanted to see how he might want to direct the education of his children.</p><p>My response is not motivated by turf-defending or by job security concerns about the rise of AI. 
I am a researcher, and my day-to-day job is not coding/programming. I don't feel threatened one bit by the proliferation of AI tools.</p><p><br /></p><h1 style="text-align: left;"><a href="https://en.wikipedia.org/wiki/The_king_is_dead,_long_live_the_king!">Coding is dead, long live coding</a></h1><p>With every new advancement in programming languages and technology, this concern came up anew. Some people declared coding dead, and some people freaked out. What ended up happening, over and over, is that we just got a higher-level specification/abstraction, and the demand for coding/programming went up thanks to those new developments. Moreover, old programming languages and their niches stayed mostly undisturbed. After more than six decades, <a href="https://en.wikipedia.org/wiki/COBOL">Cobol</a> is still widely used in applications deployed on mainframe computers, such as large-scale batch and transaction processing jobs.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgpqBNNdWhvVkv5p047uCpQ-0C1D8DiF8wpiSAeRYc-1mN_ACz1IIk4o4H0hrpY7IGWrv4pjzLX3TKHJ-jBt4j-Pceq9TJtvzf0Ql-tRv7J1ClFMj9jn0Jus44TOSgYmnpObAZTSNjZV81hax5nGPZJctZLWjYcTfYSBa0kkIC1Oqc9R5ogEZt9qJdJQqE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="613" data-original-width="650" height="605" src="https://blogger.googleusercontent.com/img/a/AVvXsEgpqBNNdWhvVkv5p047uCpQ-0C1D8DiF8wpiSAeRYc-1mN_ACz1IIk4o4H0hrpY7IGWrv4pjzLX3TKHJ-jBt4j-Pceq9TJtvzf0Ql-tRv7J1ClFMj9jn0Jus44TOSgYmnpObAZTSNjZV81hax5nGPZJctZLWjYcTfYSBa0kkIC1Oqc9R5ogEZt9qJdJQqE=w640-h605" width="640" /></a></div><p></p><p><a href="https://www.commitstrip.com/en/2016/08/25/a-very-comprehensive-and-precise-spec/">This comic strip, from CommitStrip (2016)</a>, sums it up well. There will always be coding. 
The abstraction level may go up, and we may start using domain-specific languages (DSLs), but we will still need to be precise and comprehensive in our specifications to solve real-world problems. The world is very complicated, and <a href="https://muratbuffalo.blogspot.com/2018/10/debugging-designs-with-tla.html">there are corner cases everywhere.</a></p><p><a href="https://www.cs.utexas.edu/users/EWD/transcriptions/EWD06xx/EWD667.html">Natural language is ambiguous and not suitable for programming</a>. LLMs still need to generate code to get things done. If not inspected carefully, this incurs tech debt at the monumental speed of computers. Natural language prompts are not repeatable/deterministic; they are subject to breaking at any time. This makes "natural language programming" unsuitable even for small-sized projects, let alone medium to large projects. </p><p>Moreover, some things are inherently very hard: they are AI-complete (to adapt the term NP-complete to the occasion: the hardest problems, whose solutions can be verified quickly but not necessarily found in any reasonable time). <a href="https://muratbuffalo.blogspot.com/2023/09/beyond-code-tla-and-art-of-abstraction.html">I use TLA+ for modeling and designing distributed systems and algorithms</a>, and I don't see AI replacing that anytime soon. 
There is simply too much subtlety, and a great deal of intelligence and expertise is required to work on the design of distributed systems and algorithms.</p><p>As my final argument (<a href="https://muratbuffalo.blogspot.com/2021/03/defending-computer-science-engineering.html">borrowing from one of my previous posts</a>), I would like to mention that a career in computer science and software technology (practicing coding) gives you vital and generally applicable skills: hacking, debugging, abstract thinking, quick learning/adaptation, and organizational skills.</p><p>Being supported by AI tools is not a substitute for <a href="https://muratbuffalo.blogspot.com/2018/03/master-your-tools.html">mastering these skills</a>. You cannot borrow skills/wisdom; you need to earn and own them. As the Turkish proverb says: "You cannot drive a water mill with hand-carried buckets of water". Or as the Amazonian proverb says: "There is no compression algorithm for (hands-on) experience".</p><p><br /></p><h1 style="text-align: left;">What next?</h1><p><a href="https://www.youtube.com/watch?v=vMKNUylmanQ#t=720s">Innovation begets innovation.</a> The emergence of new problems and domains is a great equalizer. As we discover things, we find new terrains opening up. And a new terrain is a good opportunity to make an impact without needing immense resources. AI is taking off (with a long, arduous journey ahead), so this is a great opportunity to take up computer science and coding. AI is software, and one day it will start producing software, so this only means it is a ripe opportunity to learn and work on software. </p><p>For the future, Jensen Huang suggested that "students should focus more on fields like biology, teaching, industry, or farming." This is bad advice again. Let people pursue their passion. (Unlike Calvin Newport, I am strongly in the passion camp.) If any of biology, teaching, industry, or farming is your passion (you will know if it is; it won't be ambiguous), pursue them. 
But it is very misguided to direct people away from computer science and software technology by saying that AI will take care of that and make it obsolete.</p><p>I think it is time to double down on computer science and software technology. I think we will start seeing computer science and software technology going further into the K-12 school curriculum. We will start to see more Pi-shaped people, who have depth in two areas and who pursue generalist applications. After building some depth, being a <a href="https://muratbuffalo.blogspot.com/2019/06/book-review-range-why-generalists.html">generalist</a> is a <a href="https://muratbuffalo.blogspot.com/2024/01/recent-reads.html">good strategy</a>.</p><p>Finally, let me air a grievance about a pet peeve of mine. Imagine the breakthroughs we could achieve, if only we could channel 1% of the resources/effort/interest being directed to researching/developing machine learning into <a href="https://youtu.be/dY7QNxXbziA?t=14">researching/developing human learning.</a></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com2tag:blogger.com,1999:blog-8436330762136344379.post-30916329397250030352024-03-14T17:21:00.001-04:002024-03-18T10:01:41.719-04:00Checking Causal Consistency of MongoDB<p><a href="https://hengxin.github.io/papers/2022-JCST-MongoDB-CCC.pdf">This paper</a> declares the Jepsen testing of MongoDB for causal consistency a bit lacking in rigor, and <a href="https://github.com/Tsunaou/Checking-Causal-Consistency-of-MongoDB">goes on to test MongoDB</a> against three variants of causal consistency (CC, CCv, and CM) under node failures, data movement, and network partitions. They also add <a href="https://github.com/hengxin/tla-causal-consistency">TLA+ specifications of causal consistency</a> and their checking algorithms, verify them using the TLC model checker, and discuss how the TLA+ specifications can be related to Jepsen testing.</p><p>This is a journal paper, so it is long. 
The writing could have been better, but it is apparent that a lot of effort went into the paper.</p><p>One thing I didn't like in the paper was the wall of formalism around defining causal consistency in Section 2: Preliminaries. This was hard to follow. And I was upset about the use of operational definitions, such as "bad patterns", for testing. Why couldn't they define this in a state-centric manner, as in <a href="https://muratbuffalo.blogspot.com/2022/06/seeing-is-believing-client-centric.html">the client-centric database isolation paper</a>?</p><p>It turns out that Section 2 did not introduce any new content; rather, it was just reviewing the previous model established by the <a href="https://arxiv.org/abs/1611.00580">"On verifying causal consistency (POPL'17)" paper</a> by Bouajjani et al. So the reason the Preliminaries section fell flat was that it was trying to compress that 22-page Bouajjani paper into a couple of pages without much context. By quickly skimming that POPL'17 paper, I got sufficient context about the area. And boy, did I learn some interesting things about causal consistency.</p><p>So we will first look at that POPL'17 paper, and then we will come back to this paper for the TLA+ specs for causal consistency and the Jepsen testing of MongoDB.</p><p><br /></p><h2 style="text-align: left;">On Verifying Causal Consistency (POPL'17)</h2><p><a href="https://arxiv.org/abs/1611.00580">This is a good old distributed systems theory paper </a>without any experiments, and it is an immensely useful and practical paper as well. I learned a lot by just skimming this in an hour --yeah, I call that skimming, not reading.</p><p>The paper opens with a stunner: verifying whether all the executions of an implementation are causally consistent is undecidable! 
They explain this neatly in one paragraph: "This undecidability result might be surprising, since it is known that linearizability (stronger than CC) and eventual consistency (weaker than CC) are decidable to verify in that same setting. This result reveals an interesting aspect in the definition of causal consistency. Intuitively, two key properties of causal consistency are that (1) it requires that the order between operations issued by the same site to be preserved globally at all the sites, and that (2) it allows an operation o1 which happened arbitrarily sooner than an operation o2 to be executed after o2 (if o1 and o2 are not causally related). Those are the essential ingredients that are used in the undecidability proofs (that are based on encodings of the Post Correspondence Problem). In comparison, linearizability does not satisfy (2) because for a fixed number of sites/threads, the reordering between operations is bounded (since only operations which overlap in time can be reordered), while eventual consistency does not satisfy (1)."</p><p>But fear not: they also show that in practice this is not a problem. "We prove that reasoning about causal consistency w.r.t. the RWM abstraction becomes tractable under the natural assumption of data independence (i.e., read and write instructions is insensitive to the actual data values that are read or written)."</p><p>Data independence implies that it is sufficient to consider executions where each value is written at most once, i.e., differentiated histories. Differentiated histories mean we can determine, only by looking at the operations of a history, which write each read is reading from. There is no ambiguity, as each value can only be written once on each variable. In practice this comes from a timestamp or versionstamp attached to the data by the write operation.</p><p>I have seen this differentiated histories trick simplify and improve the checking of linearizability in practice. 
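</p><p>To make the trick concrete, here is a minimal sketch (my own illustration, not code from the paper) of how the read-from relation falls out of a differentiated history; the tuple encoding of operations is made up for the example.</p>

```python
def read_from(history):
    """Map each read op to the unique write it reads from.

    history: list of (op_id, kind, var, value) tuples, kind in {"w", "r"}.
    Assumes a differentiated history: each value is written at most once
    per variable, so (var, value) identifies at most one write.
    """
    writes = {}
    for op_id, kind, var, value in history:
        if kind == "w":
            assert (var, value) not in writes, "history is not differentiated"
            writes[(var, value)] = op_id
    # Each read's source is now unambiguous -- no search over candidate writes.
    return {op_id: writes.get((var, value))   # None models a read of the initial value
            for op_id, kind, var, value in history if kind == "r"}

h = [(1, "w", "x", 1), (2, "w", "x", 2), (3, "r", "x", 2), (4, "r", "x", 1)]
print(read_from(h))  # {3: 2, 4: 1}
```

<p>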
And the paper leverages this for checking causal consistency, and explains the importance of differentiated histories as follows: "In this characterization, the fact that we consider only differentiated executions is crucial. The reason is that all relations used to express bad patterns include the read-from relation that associates with each read operation the write operation that provides its value. This relation is uniquely defined for differentiated executions, while for arbitrary executions where writes are not unique, reads can take their values from an arbitrarily large number of writes. This is actually the source of complexity and undecidability in the non-data independent case." </p><p>A bad pattern is a set of operations occurring within an execution in some particular order, corresponding to a causal consistency violation. They show that, for a given execution, checking whether it contains a bad pattern can be done in polynomial time. They also show that for each bad pattern it is possible to effectively construct an observer (a state machine of sorts) that is able, when running in parallel with an implementation, to detect all the executions containing the bad pattern. The efficiency insight here is that proving causal consistency for any given implementation with differentiated histories reduces to proving its causal consistency for a bounded data domain. </p><p>The paper then defines different flavors of causal consistency and relates them to each other formally.</p><p></p><ul style="text-align: left;"><li>CC: allows non-causally dependent operations to be executed in different orders by different sites, and decisions about these orders to be revised by each site. This models mechanisms for resolving conflicts between non-causally dependent operations, where each site speculates on an order between such operations and possibly rolls back some of them if needed later in the execution, e.g., 
Bayou and <a href="https://muratbuffalo.blogspot.com/2012/09/dont-settle-for-eventual-scalable.html">COPS</a>.</li><li>CCv: assumes that there is a total order between non-causally dependent operations and each site can execute operations only in that order (when it sees them). Therefore, a site is not allowed to revise its ordering of non-causally dependent operations, and all sites execute in the same order the operations that are visible to them, e.g., Gentle-Rain, Bolt-On Causal Consistency.</li><li>CM: a site is allowed to diverge from another site on the ordering of non-causally dependent operations, but is not allowed to revise its ordering later on.</li></ul><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgED7bVUKG4BJGG_Qy1DNcwcfFync3QD-AQEuVsRpqXimPwPz_wphxNb0bt2xrmGCQe6vx9waFow5PJs6SHYWdq5aEruOu09HiJGsGMDfZOmKEMcgWPmPkSVsTzntbpEY0vPqn6bDz7LSmLRTcYK0B9JK88hVqmS9Cvj5OPFJpmw02z7plS-09R2RZjaHg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1576" data-original-width="1132" src="https://blogger.googleusercontent.com/img/a/AVvXsEgED7bVUKG4BJGG_Qy1DNcwcfFync3QD-AQEuVsRpqXimPwPz_wphxNb0bt2xrmGCQe6vx9waFow5PJs6SHYWdq5aEruOu09HiJGsGMDfZOmKEMcgWPmPkSVsTzntbpEY0vPqn6bDz7LSmLRTcYK0B9JK88hVqmS9Cvj5OPFJpmw02z7plS-09R2RZjaHg=s16000" /></a></div><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgvysg8W_uweO1jiVFK5Lbhvf6mlh6cuk3-ICEu6NCNUkZzC2kvl1AWTdCEfuBPDXoqAELfCq2-AARljm7f5V2Nd06J31y0Sg2-4hsDAOljLZnI_tn-dTbs78hvKF361NnxtxcO7RkgGVoSwgYfaObcApGjdX6mO1ikFYxqY-JtWU27A81l5hVx9em6Rj0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="624" data-original-width="1132" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEgvysg8W_uweO1jiVFK5Lbhvf6mlh6cuk3-ICEu6NCNUkZzC2kvl1AWTdCEfuBPDXoqAELfCq2-AARljm7f5V2Nd06J31y0Sg2-4hsDAOljLZnI_tn-dTbs78hvKF361NnxtxcO7RkgGVoSwgYfaObcApGjdX6mO1ikFYxqY-JtWU27A81l5hVx9em6Rj0=s16000" /></a></div><p></p><p>Both CCv and CM strengthen CC in independent and incomparable ways. And here are the bad patterns for checking whether a trace satisfies CC, CCv, or CM. Below, RF is the ReadFrom relation. PO is the program order, e.g., from operations executing in the same thread. The relation CO, defined as (PO \union RF)^+, represents the smallest causality order possible.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhVAWtew73tCo7Dchgg4foTR5b4MujcSwHOPjGsF4boGU4PAYggcXH-2RimsE1FhI_tR9qQyzbtOSce8UKJhbvA0iLiQnf_wsk9KbPnYcsq8k3Er9XKRaToyNfuLaIl8nEqMmLQUX-jDRVDlDzvxSeaUcpQIvqmEud-GOO5vrSSyLgjccNIgZiL0upvsqw" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1460" data-original-width="1132" src="https://blogger.googleusercontent.com/img/a/AVvXsEhVAWtew73tCo7Dchgg4foTR5b4MujcSwHOPjGsF4boGU4PAYggcXH-2RimsE1FhI_tR9qQyzbtOSce8UKJhbvA0iLiQnf_wsk9KbPnYcsq8k3Er9XKRaToyNfuLaIl8nEqMmLQUX-jDRVDlDzvxSeaUcpQIvqmEud-GOO5vrSSyLgjccNIgZiL0upvsqw=s16000" /></a></div><br /><p><br /></p><h1 style="text-align: left;">TLA+ specifications</h1><p>Ok, back to the paper at hand. The paper presents <a href="https://github.com/hengxin/tla-causal-consistency">TLA+ specifications for CC, CCv, and CM</a>. They mention that model checking histories against CC, CCv, or CM as defined in Fig. 7 is prohibitively inefficient. 
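</p><p>To ground the terminology, here is a toy version (my own sketch, not the paper's tooling) of checking a single execution for the most basic kind of bad pattern: a cycle in CO. Since CO is supposed to be a causality order, any cycle in PO ∪ RF is already a violation.</p>

```python
def has_cyclic_co(ops, po, rf):
    """Detect a cycle in CO = (PO ∪ RF)^+ via depth-first search.

    ops: iterable of operation ids; po, rf: sets of (a, b) edges.
    """
    adj = {o: [] for o in ops}
    for a, b in set(po) | set(rf):
        adj[a].append(b)
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on the DFS stack / done
    color = {o: WHITE for o in ops}

    def dfs(u):
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                return True               # back edge: CO has a cycle
        color[u] = BLACK
        return False

    return any(color[o] == WHITE and dfs(o) for o in ops)

# Program order at two sites contradicts the read-from edges between them:
print(has_cyclic_co([1, 2, 3, 4], po={(1, 2), (3, 4)}, rf={(2, 3), (4, 1)}))  # True
```

<p>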
So they go through some optimizations to improve the checking time, which culminates in implementing an efficient partial order enumeration algorithm in Python, and letting TLC call it when necessary.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi2hVQK4ymvJvNzbzBcnpg-KjNM_1WzKhEVIHcw_yGnC3xbZ3U5qxpakX_c83GiviAyR4fms1M8zM4nWCXc7rDRR6b-vDK-mSewave3_FcfZkU4wF7Wur5_U9ONXF2QbDavp8XIvwxQh4ZmXcBCYoDkjOB_e914oGV_hnsnKlg79qJFhIK6Z3YhoIYJizA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1546" data-original-width="1466" src="https://blogger.googleusercontent.com/img/a/AVvXsEi2hVQK4ymvJvNzbzBcnpg-KjNM_1WzKhEVIHcw_yGnC3xbZ3U5qxpakX_c83GiviAyR4fms1M8zM4nWCXc7rDRR6b-vDK-mSewave3_FcfZkU4wF7Wur5_U9ONXF2QbDavp8XIvwxQh4ZmXcBCYoDkjOB_e914oGV_hnsnKlg79qJFhIK6Z3YhoIYJizA=s16000" /></a></div><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiHw7Hv9YcKfTakTBo5gMF0RyQVJxb6wPWkxbvWXi58pTHx2w38y5CnFmYKyRRrwkqajvQIgQr-VrXdqjbav4emxNd-NdBHdBUnhhbCkyBaNPuNUfTyBoIjPKNJ8HUxW2QtrDYm2HUE9fgVzeWnN1Pf0OYSL_dkuU6zQMG5ryLhrpo56nHaHefJ-EVOaTo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1488" data-original-width="1466" src="https://blogger.googleusercontent.com/img/a/AVvXsEiHw7Hv9YcKfTakTBo5gMF0RyQVJxb6wPWkxbvWXi58pTHx2w38y5CnFmYKyRRrwkqajvQIgQr-VrXdqjbav4emxNd-NdBHdBUnhhbCkyBaNPuNUfTyBoIjPKNJ8HUxW2QtrDYm2HUE9fgVzeWnN1Pf0OYSL_dkuU6zQMG5ryLhrpo56nHaHefJ-EVOaTo=s16000" /></a></div><p>They use very small traces to check with TLA+ model checking. But they also mention that it is possible to combine/apply this to traces gathered from directing Jepsen to MongoDB. 
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgAylbPdBw6gtHkamkjKux2c_1uPtfc5G3duafqVOz7MsowxKhch3cFURX3kYX4F25XS0inXqyrbtkHvJ8QhqLkXz2yDN726qGa4Fm3rPx6ZBIjT1AImpiBQ77PVPF1S_3iDUVSlp6Pa4y0SU1IRCs7nBjVkJwevfAvGU1C8N6u2HvQ3rLTw9JqGCmrjxA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1096" data-original-width="1466" src="https://blogger.googleusercontent.com/img/a/AVvXsEgAylbPdBw6gtHkamkjKux2c_1uPtfc5G3duafqVOz7MsowxKhch3cFURX3kYX4F25XS0inXqyrbtkHvJ8QhqLkXz2yDN726qGa4Fm3rPx6ZBIjT1AImpiBQ77PVPF1S_3iDUVSlp6Pa4y0SU1IRCs7nBjVkJwevfAvGU1C8N6u2HvQ3rLTw9JqGCmrjxA=s16000" /></a></div><br /><p><br /></p><h1 style="text-align: left;">Extended Jepsen testing</h1><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiAzDsY9g2v91hCuh7WdmRegYoJQPhpVthM7eYmOKaWfRd2mnffL7npKhDIaxBodjosHsRYEmgCUYLbVXb7LR8JLMhD4TC6faBWAVDFNXjALWsvULHMvpgrLBEkmo7huMJLFget-zEosF_4XsHJ0kCIzezC3T03hAMk3PmPLtAAr027s6RReEcj5UotXxc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="406" data-original-width="2160" src="https://blogger.googleusercontent.com/img/a/AVvXsEiAzDsY9g2v91hCuh7WdmRegYoJQPhpVthM7eYmOKaWfRd2mnffL7npKhDIaxBodjosHsRYEmgCUYLbVXb7LR8JLMhD4TC6faBWAVDFNXjALWsvULHMvpgrLBEkmo7huMJLFget-zEosF_4XsHJ0kCIzezC3T03hAMk3PmPLtAAr027s6RReEcj5UotXxc=s16000" /></a></div><p>They use Jepsen to get histories from MongoDB, and apply <a href="https://github.com/Tsunaou/Checking-Causal-Consistency-of-MongoDB">a Java-based implementation of CC, CCv, and CM to check causal consistency of these histories.</a></p><p>The experimental results confirm the claim in <a href="https://www.mongodb.com/docs/manual/core/causal-consistency-read-write-concerns/">MongoDB’s documentation</a> that in the presence of node failures or network partitions, causally consistent sessions guarantee causal consistency only for 
reads with majority readConcern and writes with majority writeConcern. In the absence of node failures or network partitions, causal consistency is guaranteed by all configurations, including read concern local and write concern w1.</p><p>Testing all of the CC variants gave identical results: there was no configuration where the CC, CCv, and CM checks disagreed with each other. So it was not clear whether studying these variants explicitly bought us anything. </p><p><br /></p><h1 style="text-align: left;">Discussion</h1><p>Ok, let me return to the question I had at the beginning. Is it possible to define causal consistency in a state-centric manner, rather than in an operational manner using bad patterns?</p><p>I checked <a href="https://muratbuffalo.blogspot.com/2022/06/seeing-is-believing-client-centric.html">the client-centric database isolation paper,</a> and it does not include anything close to causal consistency (which is fair, as this is a consistency property and not an isolation property). The paper presents SER, SI, and ReadCommitted. And of course ReadCommitted could be arbitrarily stale, and that is not good. But I still don't see why it would not be possible to have a state-centric definition for causal consistency. Maybe the problem is how much metadata that would involve, and whether there would be a convenient way to represent/refer to that metadata.</p><p><br /></p><p>Ok, moving on to the next discussion point. Causal consistency is interesting. It provides a nice tradeoff point in the consistency space, similar to how <a href="https://muratbuffalo.blogspot.com/2024/01/scalable-oltp-in-cloud-whats-big-deal.html">SnapshotIsolation</a> does in the transactional isolation space. Causal consistency is the strictest consistency level where you can still make progress in the presence of partitions, as discussed in the <a href="http://www.cs.cornell.edu/lorenzo/papers/cac-tr.pdf">"Consistency, Availability, and Convergence" paper</a>. 
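</p><p>In MongoDB's terms, that tradeoff is opted into per session. Here is a configuration sketch (mine, not from the paper; it assumes PyMongo, a running replica set, and made-up database and document names) of the setup the experiments above validated: a causally consistent session with majority read/write concerns.</p>

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.get_database(
    "shop",                                  # hypothetical names for illustration
    read_concern=ReadConcern("majority"),    # the combination that holds up
    write_concern=WriteConcern("majority"),  # under failures and partitions
)

# Causal consistency is per-session: the session threads cluster/operation
# times through, so the read below observes the preceding write.
with client.start_session(causal_consistency=True) as s:
    db.orders.insert_one({"_id": 1, "state": "placed"}, session=s)
    doc = db.orders.find_one({"_id": 1}, session=s)  # read-your-writes
```

<p>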
And causal consistency provides "Read your own writes", which is very helpful and sought after for developing applications. That sounds like a good tradeoff point, no?</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgOjjrxBCBNTu2l8XzsSPYGewJvSQhNfwXROaPt07SAXfTanEnDoRapXmyjRnKKHAuXNvpKLZnvnsEc9cReYICq9hCZZc6B8csSDLliKHVLR7_8LB-rTeClqrqPzIfUipurC3RDrb3QEFqRRu06gGFCUWb0AcyQh7z6BTSEPrEqTcbVbuAE2wT2ZyOHEao" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1032" data-original-width="1514" height="436" src="https://blogger.googleusercontent.com/img/a/AVvXsEgOjjrxBCBNTu2l8XzsSPYGewJvSQhNfwXROaPt07SAXfTanEnDoRapXmyjRnKKHAuXNvpKLZnvnsEc9cReYICq9hCZZc6B8csSDLliKHVLR7_8LB-rTeClqrqPzIfUipurC3RDrb3QEFqRRu06gGFCUWb0AcyQh7z6BTSEPrEqTcbVbuAE2wT2ZyOHEao=w640-h436" width="640" /></a></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-38317690788303427402024-03-08T21:54:00.004-05:002024-03-08T22:06:37.865-05:00 Transaction Processing Monitors (Chapter 5. Transaction processing book)<p>"Transaction Processing Monitors" is chapter 5 in <a href="https://muratbuffalo.blogspot.com/2024/02/transaction-processing-book-grayreuter.html">our transaction processing book reading journey. </a></p><p>This chapter has been the hardest chapter to read and understand. It went into implementation concerns, in contrast to the previous chapters, which discussed design principles and concepts. It turns out the 1980s were a very different time, and it is very hard to relate to the systems and infrastructure of that era. The people in our reading group were also lost, and found this chapter very hard to engage with. 
</p><p><br /></p><h1 style="text-align: left;">1980s</h1><p><a href="https://www.pcworld.com/article/423125/9-awesome-photos-of-school-computer-labs-from-the-1980s.html"><img alt="1980s lab 09" src="https://images.techhive.com/images/article/2015/08/1980s_lab_09-100608917-large.jpg?auto=webp&quality=85,70" /></a></p><p>Ok, we are back in the year 1990, when this book was being written. This was when the internet and the web were still obscure. Client-server processing was the modern thing then, and had just started to replace time-sharing for databases and transaction processing.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgbfB0orkNF7UDrNspXovumiKLdURnjNGOIDXcYUSiA2UwFEw9PtReZRfMjmKX38BTrfiNil9WcVhJ-b_P7_eYPiXgpbU0KwFVv7SVItq3qyhLWJiyMNCvJ2-LvP_wXED4Xigzv_uE0TedmH0ZT_F23dyOsPlr_Q1LQF9S59vyBERWGdtmLTYemP1_Ovf8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="828" data-original-width="1200" src="https://blogger.googleusercontent.com/img/a/AVvXsEgbfB0orkNF7UDrNspXovumiKLdURnjNGOIDXcYUSiA2UwFEw9PtReZRfMjmKX38BTrfiNil9WcVhJ-b_P7_eYPiXgpbU0KwFVv7SVItq3qyhLWJiyMNCvJ2-LvP_wXED4Xigzv_uE0TedmH0ZT_F23dyOsPlr_Q1LQF9S59vyBERWGdtmLTYemP1_Ovf8=s16000" /></a></div><p>This chapter talks about transaction-oriented processing. If I squint at it, I can see the concepts of SLAs and shared responsibility models between a cloud customer and provider in these descriptions.</p><p><i>"Despite these additional responsibilities, the performance requirements of TP systems are similar to those of real-time systems. Even though no maximum response time for all requests of a certain type must be guaranteed, the usual requirement is that around 90% of all requests have a response time less than x seconds. This qualifies transaction processing systems as soft real-time systems."</i></p><p><i>"System does recovery. 
Because of the use of shared data, there must be formal guarantees of consistency that are automatically maintained. After a crash, all users must be informed about the current state of their environment, which functions were executed, which were not, and so on. The guiding principle here is determined by the ACID properties of transactions."</i></p><p><br /></p><h2 style="text-align: left;">Transaction processing monitor (TP monitor)</h2><p>This was the term we found very confusing. We were not able to see which concept in modern databases the TP monitor corresponds to. This seems to be a fuzzy term even at that time, as the book warns: <i>"In a contest for the least well-defined software term, TP monitor would be a tough contender."</i></p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgYG1qHvtFBX6ljz0uxIpsRwKOpRlIS8nlvLBr2oRqknFC6drCrmh1Twt1V02ruo5yXP4zpkQ9bpPItSBXKqqvmo-nzJ6onZtbCUNFLvTIiAUITwi-0f4uyMH-kbi6L3kg5wMgGRzwS-cthJPMU1No39KXpIQ8RPq0jmWrKw9XtBiYgrlCE9LkYlPRBjyY" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1508" data-original-width="1172" src="https://blogger.googleusercontent.com/img/a/AVvXsEgYG1qHvtFBX6ljz0uxIpsRwKOpRlIS8nlvLBr2oRqknFC6drCrmh1Twt1V02ruo5yXP4zpkQ9bpPItSBXKqqvmo-nzJ6onZtbCUNFLvTIiAUITwi-0f4uyMH-kbi6L3kg5wMgGRzwS-cthJPMU1No39KXpIQ8RPq0jmWrKw9XtBiYgrlCE9LkYlPRBjyY=s16000" /></a></div><p>Do you notice the TP monitor box on the left? It interacts with the application programs: it dispatches them and gets notified when they are done. The application program invokes transactions, which are managed by a transaction model. 
This is a very confusing diagram, and it is hard to make anything of the TP monitor, since we are missing context about the workloads and software architectures of the time.</p><p><br /></p><p>This is where I emailed <a href="https://pathelland.substack.com/">Pat Helland</a>, <a href="https://muratbuffalo.blogspot.com/search?q=pat+helland">the apostate philosopher of database systems</a>, and my dear friend, for help. I imagine he had a smile emerge on his lips, and he looked into the distance out of his office window, and traveled down memory lane. Below is what he wrote back. This was very helpful.</p><p>TLDR: <b>A tp-monitor is an application execution environment for stylized applications focused on connecting humans to databases in an OLTP environment.</b></p><p>-----</p><p>Most computing in the 1960s & 1970s was memo-post batch work:</p><p></p><ul style="text-align: left;"><li>Load the state of the business from last night's mag tapes</li><li>Load the changes to "post" against last night's state of the system</li><li>Process the changes</li><li>Dump tonight's state on mag tapes.</li></ul><p></p><p>In the 1960s, 1970s, & early 80s, a "transaction" had two broad meanings:</p><p></p><ul style="text-align: left;"><li>An interaction with a human</li><li>A type of computation to support that interaction with a human</li><ul><li>Example 1) a "debit-credit" transaction (e.g., TPC-B) meant the work a teller at a bank did to deposit or withdraw</li><li>Example 2) an airline reservation transaction would book a flight through the Sabre system </li></ul></ul><p></p><p>There were a number of application execution environments helping run "transactions" -- the code for work by humans against the computer.</p><p></p><ul style="text-align: left;"><li>CICS (Customer Information Control System) : This worked with different databases to store data -- notably IMS/DB</li><li>IMS : Both IMS/DB (a hierarchical database) and IMS/DC (a combination of an app execution environment and 
data communications to front-end terminals; this was later rebranded IMS/TM)</li><li>Tandem NonStop Pathway: A Tandem-specific TP monitor for block-mode terminal interaction with the NonStop SQL database</li><ul><li>Handled requests from block mode terminals</li><li>Processed a transaction per terminal interaction against the database</li><li>Designed for transparent fault tolerance with idempotent (transactional) processing of the work.</li></ul><li>Lots more...</li></ul><p>So, in summary, a tp-monitor is an application execution environment for stylized applications focused on connecting humans to databases in an OLTP environment. </p><p>----</p><p><br /></p><h2 style="text-align: left;">The application database duality</h2><p>The application and the database were one at the beginning. In the 1970s, a "database" included apps and managing directly attached block mode terminals. Networking meant hooking up to your terminals. These were data management systems that were more like workflow systems, where the data management parts were interleaved within the application logic. </p><p>Then the database (as a concept) gradually got peeled out of the integrated whole, as abstractions like the relational model and transactions started to emerge. But for performance reasons, a new kind of coupling started. Stored procedures were commonly employed to execute application code/logic in the database. We saw the application reach into the database.</p><p>Since then we have seen the pendulum swing back and forth between these directions. <a href="https://a16z.com/the-modern-transactional-stack/">This post provides a recent discussion of the workflow-centric versus database-centric transactions concept.</a> </p><p><i>"As Ian Livingstone (who provided feedback on this piece) put it, “It’s the classic ‘Do you bring the application logic to the database, or the database to the application logic?’ playing out again ... 
this time brought on by breaking up the monolith.” Having had that dichotomy for decades, it’s clear both models will persist in the short term. It’s far less clear that’ll remain the case in the long run. "</i></p><p><br /></p><p>Finally, the transaction processing book mentions an interesting direction: "<i>Some believe it would be best if the operating systems just swallowed the TP monitor, thus making transactions a basic system service. This issue is reconsidered in Chapter 6, after the similarities between operating systems and TP systems have become clear."</i> </p><p>I think today <a href="https://dbos-project.github.io/blog/intro-blog.html">the DBOS project</a> is trying to do research in this direction.</p><p><br /></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-50798923530015864462024-03-08T14:32:00.002-05:002024-03-08T15:08:36.097-05:00 Why I blog<p>My blog has been going for 14 years now, and has just passed 4 million pageviews. Yay! I remember <a href="https://muratbuffalo.blogspot.com/2017/02/1-million-pageviews.html">the 1 million pageviews moment in 2017</a>!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjt2T5Z2mnxbHL2npS1HmIgm7vsoFQltWbBjh9E_jsel6sjo2Du9o3pfXYZmQLVFDgooiL0dL_hGun9cljKuknkdvpOsnNMh3P_eB8UdzHsfilG55SYSSPTlYINwbQS0EqcohVDQJBNed3RYljx-nE2GevqI9-rbKMvSO8fCURT6NeJIzBE0hODJO-geh4" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="608" data-original-width="1964" src="https://blogger.googleusercontent.com/img/a/AVvXsEjt2T5Z2mnxbHL2npS1HmIgm7vsoFQltWbBjh9E_jsel6sjo2Du9o3pfXYZmQLVFDgooiL0dL_hGun9cljKuknkdvpOsnNMh3P_eB8UdzHsfilG55SYSSPTlYINwbQS0EqcohVDQJBNed3RYljx-nE2GevqI9-rbKMvSO8fCURT6NeJIzBE0hODJO-geh4=s16000" /></a></div><p>The main reason I was able to persist for so long is that I blog for selfish reasons.
Let me try to unpack why I blog, and why I keep blogging.</p><p><br /></p><h1 style="text-align: left;">I write for myself</h1><p>The audience I have in mind is myself. I blog to clarify my understanding and thinking about a topic. </p><p>Reading a research/technical paper is already time-consuming. <a href="https://muratbuffalo.blogspot.com/2022/02/deep-reading.html">I can't do it in less than 4 hours.</a> Period.</p><p><a href="https://muratbuffalo.blogspot.com/2021/12/learning-technical-subject.html">I love learning.</a> And I am fortunate that I get to read research papers as part of my work. I double-dip on this effort to blog about them, to improve my understanding of these papers. <a href="https://muratbuffalo.blogspot.com/2013/07/how-i-read-research-paper.html">Writing a blog post is the final step in my pipeline for reading a paper.</a></p><p>I think my blog reviews of papers hit a good niche. Research papers are written for the wrong audience (or rather maybe the right audience but for the wrong reason): <a href="https://muratbuffalo.blogspot.com/2020/02/how-to-write-papers-so-they-get-accepted.html">they are written to please 3 specific expert reviewers who are overwhelmingly from academia</a>. Thus much of the benefit from the research and writing goes to waste. If we didn't have this objective of having to look impressive for peer review (and the resulting <a href="https://en.wikipedia.org/wiki/Handicap_principle">costly signaling effect</a>), I believe we would be able to learn way more from the research papers. The authors would aim to educate rather than impress. They would not need to be defensive about their work, and would introspect about their learnings and<a href="https://muratbuffalo.blogspot.com/2012/01/tell-me-about-your-thought-process-not.html"> their thought processes</a>.
In effect, this is what I do on their behalf when I write a blog review for my understanding of the papers.</p><p>Writing for myself also keeps the voice/tone of the posts natural. Not condescending, not too pedantic. Inquisitive and somewhat playful. This puts that small bit of personal touch there. </p><p><br /></p><h1 style="text-align: left;">I set a goal of blogging once a week</h1><p>I set this goal not for the quantity of the output (more than 650 posts total, yay!), but for keeping myself in the writing frame-of-mind. What I refer to as the writing frame-of-mind is basically noticing the world around me, starting with the sphere of distributed and database systems and expanding outwards and inwards. Staying in a writing frame-of-mind helps me notice things, and be mindful. The writing act following the noticing helps me process my understanding and sometimes my feelings.</p><p>I find that I need to have the intention to write before I can find something worthy of writing. Sometimes to satisfy my one-post-per-week goal, I say "OK, fine, this topic is not interesting, but I can attempt to write something about it". And holy moly, am I surprised to find a fount of interesting things about that topic when I start to write. Many posts that I would not have written without this weekly goal turned out to be very insightful. I don't know what I think about something till I write about it: "Writing is nature's way of telling you how sloppy your thinking is."</p><p>The weekly posting rule helps me maintain a cadence and prevents my writing muscles from atrophying. This also helps me lower my standards for writing, which paradoxically maintains and raises my standards.</p><p>Another point of blogging regularly is to put myself out there in public with my understanding of a topic.
This is a low-stakes risk, but even this accountability works for improving the quality of my reasoning and understanding, and ensures that I learn something worth sharing every week.</p><p><br /></p><h1 style="text-align: left;">I don't fret the mechanics </h1><p>Well, I write on blogspot (thank you Google, and let me knock on wood!). Any platform that is open and usable would work. I don't spend time thinking about this, because I primarily blog for myself. The post being public, rather than remaining on the local disk, is important for putting myself out in the open and taking accountability.</p><p>I write on emacs, using org-mode, and then copy it to blogspot. I don't have a git pipeline to push to the blog, etc. Copying it and doing the final edits take me 5-10 minutes. </p><p>Emacs org-mode is my playground. I can be a text-wrangler there. And play with ideas. Again, I don't know what I think about till I write about it.</p><p><br /></p><h1 style="text-align: left;">Thank you for reading! I really appreciate it</h1><p>I don't ask anybody to review my posts and give me feedback before I post. My writing is personal; I write for myself with a low bar, and I am ok if no one reads it. Of course I am happy when people read it and find it useful. Or even when they criticize something about it, which leads me to compare/contrast their way of thinking with mine. But I don't ask for this or rely on this at all. </p><p>I have been continuously impressed by the reach my blog has had. Around 2016-18, I was surprised when people I met at conferences mentioned they read and liked my blog posts. I first thought of this as a fluke, but this kept coming up more frequently since then. Holy moly, people read blogs!</p><p>I think the reason is that textbooks are not that common anymore. Research papers are not very accessible, and they definitely lack the personal touch. Accessible information with a personal touch comes from blog posts.
</p><p>I really like doing some community service via these blog posts, even though they primarily serve to scratch my own itch. Boy, did I get lucky or what. By being selfish, I also help others, and I do enjoy the <a href="https://en.wikipedia.org/wiki/Ubuntu_philosophy">ubuntu</a> arising from this. I love learning together.</p><p>I was worried that other people reading would make me too shy to post. I think this may have curbed some half-baked thoughts or opinion pieces (I should work on this). But overall I don't think I have been affected too much. I know that I need to write for myself, and I try to post once a week to keep me going.</p><p><br /></p><h1 style="text-align: left;">Learnings</h1><p>I should be writing more. And I should be more open.</p><p>I can only be myself. And I like it better when I can be myself.</p><p>Ok, on that point of being more open and transparent, and maybe to criticize this post myself, here is another thing. This is what I think is my reason for blogging. But maybe I am just retroactively rationalizing why I blog. Maybe I am <a href="https://muratbuffalo.blogspot.com/2018/05/misc-rambling.html">predestined</a> or <a href="https://muratbuffalo.blogspot.com/2020/07/the-great-work-of-your-life-by-stephen.html">predetermined</a> to blog. I get an occasional creative itch, and blogging comes as a way to serve this. And I suspect I am wired to be a bit oblivious, and I don't mind learning in the open and sharing. So maybe these are just rationalizations rather than useful advice. But give it a try before you dispense it, no? We could use more people blogging and sharing their learning.
Transaction processing book)<p>Atomicity does not mean that something is executed as one instruction at the hardware level with some magic in the circuitry preventing it from being interrupted. Atomicity merely conveys the impression that this is the case, for it has only two outcomes: the specified result or nothing at all, which means in particular that it is free of side effects. Ensuring atomicity becomes trickier when faults and failures are involved.</p><p>Consider the disk write operation, which comes in four quality levels:</p><p></p><ul style="text-align: left;"><li>Single disk write: when something goes wrong, the outcome of the action is neither all nor nothing, but something in between.</li><li>Read-after-write: This implementation of the disk write first issues a single disk write, then rereads the block from disk and compares the result with the original block. If the two are not identical, the sequence of writing and rereading is repeated until the block is successfully written. This has problems: there is no abort path, no termination guarantee, and no guarantee against partial execution.</li><li>Duplexed write: In this implementation each block has a version number, which is increased upon each invocation of the operation. Each block is written to two places, A and B, on disk. First, a single disk write is done to A; after this operation is successfully completed, a single disk write is done to B. Of course, when using this scheme, one also has to modify the disk read. The block is first read from position A; if this operation is successful, it is assumed to be the most recent, valid version of the block. If reading from A fails, then B is read.</li><li>Logged write: The old contents of the block are first read and then written to a different place (on a different storage device), using the single disk write operation. Then the block is modified, and eventually a single disk write is performed to the old location.
To protect against unreported transient errors, the read-after-write technique could be used. If writing the modified block was successful, the copy of the old value can be discarded.</li></ul><p></p><p>Maybe due to their cost, these higher levels of atomicity were often not implemented. The book says:<i> "Many UNIX programmers have come to accept as a fact of life the necessity of running FSCHK after restart and inquiring about their files at the lost-and-found. Yet it need not be a fact of life; it is just a result of not making things atomic that had better not be interrupted by, for example, a crash."</i></p><p>I guess things improved only after journaled/log-based filesystems got deployed.</p><p>The book looks specifically at how to organize complex applications into units of work that can be viewed as atomic actions. The book provided some guidelines in 4.2.2, with the categorization of unprotected actions, protected actions, and real world actions.</p><p></p><ul style="text-align: left;"><li><b>Unprotected actions.</b> These actions lack all of the ACID properties except for consistency. Unprotected actions are not atomic, and their effects cannot be depended upon. Almost anything can fail.</li><li><b>Protected actions.</b> These are actions that do not externalize their results before they are completely done. Their updates are commitment controlled; they can roll back if anything goes wrong before the normal end. They have ACID properties.</li><li><b>Real actions.</b> These actions affect the real, physical world in a way that is hard or impossible to reverse.</li></ul><p></p><p>The book gives some guidelines.</p><p><i>Unprotected actions must either be controlled by the application environment, or they must be embedded in some higher-level protected action. If this cannot be done, no part of the application, including the users, must depend on their outcome.</i></p><p><i>Protected actions are the building blocks for reliable, distributed applications.
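Going back to the four disk write quality levels above, here is a toy sketch of a duplexed write (my own simplification in Python, not code from the book): each block carries a version number and a checksum, is written to place A and then to place B, and the read falls back to B when the copy at A turns out to be damaged.

```python
import zlib

class DuplexedStore:
    """Toy model of the book's duplexed write: two copies per block."""

    def __init__(self):
        self.place_a = {}  # block_id -> (version, data, checksum)
        self.place_b = {}

    def write(self, block_id, data):
        version = self.place_a.get(block_id, (0, b"", 0))[0] + 1
        record = (version, data, zlib.crc32(data))
        self.place_a[block_id] = record  # single disk write to A first...
        self.place_b[block_id] = record  # ...then, after A succeeds, to B

    def read(self, block_id):
        # Read A first; if it is damaged (checksum mismatch), fall back to B.
        for place in (self.place_a, self.place_b):
            record = place.get(block_id)
            if record is not None:
                version, data, checksum = record
                if zlib.crc32(data) == checksum:
                    return data
        raise IOError("both copies are damaged")

store = DuplexedStore()
store.write("b1", b"account balance: 100")
# Simulate a torn write that corrupts copy A:
store.place_a["b1"] = (1, b"garbled", 0)
assert store.read("b1") == b"account balance: 100"  # served from copy B
```

This is of course only the failure-masking half of the story; the version numbers matter when a crash happens between the write to A and the write to B, so a reader can tell which copy is current.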
Most of this text deals with them. Protected actions are easier to deal with than Real actions. They can always be implemented in such a way that repeated execution (during recovery) yields the same result; they can be made idempotent.</i></p><p><i>Real actions need special treatment. The system must be able to recognize them as such, and it must make sure that they are executed only if all enclosing protected actions have reached a state in which they will not decide to roll back. In other words, they suggested using real world actions only after the transaction is checked/committed. When doing something that is very hard or impossible to revoke, try to make sure that all the prerequisites of the critical action are fulfilled.</i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjvBeqNB1l9xtiO5UuGghbhNpvaSF_n7CnW_TpQsCkarg6ffoniceVm6w7B3Vagc7j0cJjYnbxFGmR4WZ23e2Xpo2DfP5p6j58lyiYSlqLCmCQxU_X9OydFE1CnzfawjT8_qQyzg6BZHe9TLMYAhGb3VFXWRq8SHdFtcBtyu9rot4w7lEEfzGu5ZgkJZds" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="884" data-original-width="1112" height="318" src="https://blogger.googleusercontent.com/img/a/AVvXsEjvBeqNB1l9xtiO5UuGghbhNpvaSF_n7CnW_TpQsCkarg6ffoniceVm6w7B3Vagc7j0cJjYnbxFGmR4WZ23e2Xpo2DfP5p6j58lyiYSlqLCmCQxU_X9OydFE1CnzfawjT8_qQyzg6BZHe9TLMYAhGb3VFXWRq8SHdFtcBtyu9rot4w7lEEfzGu5ZgkJZds=w400-h318" width="400" /></a></div><br /><p>The book then talks about flat transactions, and defines the ACID components. For isolation, the book emphasizes the client observable behavior. 
A great paper that revisits isolation from observable behavior is the <a href="https://muratbuffalo.blogspot.com/2022/06/seeing-is-believing-client-centric.html">"Seeing is Believing: A Client-Centric Specification of Database Isolation" paper</a>.</p><p><i>Isolation simply means that a program running under transaction protection must behave exactly as it would in single-user mode. That does not mean transactions cannot share data objects. Like the other definitions, the definition of isolation is based on observable behavior from the outside, rather than on what is going on inside.</i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjUqmd1TbXgocPTTC5XmU2EjxQsA26kT2fLTy1SYjAppEn6Ikl_tOwi41loRd5k4CwdbA1sPGHXt_0LhXQKdx8Vrlgfwa4JEc85aDQvE4WNinOm1wfo2p49pCwV7vAr6hJHfW_Q1gRn1rCj4eBm3eJGn2EnWDgRWxEcSlFjLkW9Y3jY0moocCgmeXfoNoo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="734" data-original-width="1112" height="264" src="https://blogger.googleusercontent.com/img/a/AVvXsEjUqmd1TbXgocPTTC5XmU2EjxQsA26kT2fLTy1SYjAppEn6Ikl_tOwi41loRd5k4CwdbA1sPGHXt_0LhXQKdx8Vrlgfwa4JEc85aDQvE4WNinOm1wfo2p49pCwV7vAr6hJHfW_Q1gRn1rCj4eBm3eJGn2EnWDgRWxEcSlFjLkW9Y3jY0moocCgmeXfoNoo=w400-h264" width="400" /></a></div><br /><p>As an example of flat transactions, the book gives the classic credit/debit example. 
Strong smart contract energy emanates from this SQL program!</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj7sUT4fPNcg4z_CLEqWUJrag_hbCdOghe_wzONBpcWQL4OOb8Nqu0aI-xm5wd03pQ6Q5mfC4zsc2JBwFZsY05IuUVCiG5TIKwxbsuFUldgLD7GK-F92DWcJI_l8xx4I4rbkRLT9qmQBqRnegspyBud-f4GnOSojsA5049GIwrKuReR908q6ptD8NRDWiA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1012" data-original-width="1112" height="364" src="https://blogger.googleusercontent.com/img/a/AVvXsEj7sUT4fPNcg4z_CLEqWUJrag_hbCdOghe_wzONBpcWQL4OOb8Nqu0aI-xm5wd03pQ6Q5mfC4zsc2JBwFZsY05IuUVCiG5TIKwxbsuFUldgLD7GK-F92DWcJI_l8xx4I4rbkRLT9qmQBqRnegspyBud-f4GnOSojsA5049GIwrKuReR908q6ptD8NRDWiA=w400-h364" width="400" /></a></div><p></p><p>The book then considers generalizations of flat transactions, arguing that examples like trip planning and bulk updates are not adequately supported by flat transactions. The book acknowledges that these generalizations, like chained transactions, nested transactions, generalized transactions, are more a topic of theory than of practice. It says: Flat transactions and the techniques to make them work account for more than 90% of this book. No matter which extensions prove to be most important and useful in the future, flat transactions will be at the core of all the mechanisms required to make these more powerful models work.</p><p>It says that this generalization discussion serves a twofold purpose: to aid in understanding flat transactions and their ramifications through explanation of extended transaction models, and to clarify why any generalized transaction model has to be built from primitives that are low-level, flat, system transactions.</p><p>This is where we go through the notion of spheres of control, which sounds like the realm of self-improvement books.
</p><p>The key concept for this was proposed in the early 1970s and actually triggered the subsequent development of the transaction paradigm: the notion of spheres of control. Interestingly, Bjork and Davies started their work by looking at how large human organizations structure their work, how they contain errors, how they recover, and what the basic mechanisms are. At the core of the notion of spheres of control is the observation that controlling computations in a distributed multiuser environment primarily means:</p><p></p><ol style="text-align: left;"><li>Containing the effects of arbitrary operations as long as there might be a necessity to revoke them, and</li><li>Monitoring the dependencies of operations on each other in order to be able to trace the execution history in case faulty data are found at some point. </li></ol><p></p><p>Any system that wants to employ the idea of spheres of control must be structured into a hierarchy of abstract data types. Whatever happens inside an abstract data type (ADT) is not externalized as long as there is a chance that the result might have to be revoked for internal reasons.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgsvXUmkKZjrX94HcvzMXSsX3zUqd43GUpgVYpt5pLYGIlv27hZKRWWCyzG-R3UdHQw62lEuY5d5S0KpcuP0DIPqtY1peK-Dwu-iX2eRTXZhmYy6J8WdaPuPBVmUKNutRPm4kMgGDJARZ4jkvuhsmx1gFIvslS-tXEuz9X8Hm7LoB0ydOiOCNgK6ee8VHc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="988" data-original-width="1112" height="356" src="https://blogger.googleusercontent.com/img/a/AVvXsEgsvXUmkKZjrX94HcvzMXSsX3zUqd43GUpgVYpt5pLYGIlv27hZKRWWCyzG-R3UdHQw62lEuY5d5S0KpcuP0DIPqtY1peK-Dwu-iX2eRTXZhmYy6J8WdaPuPBVmUKNutRPm4kMgGDJARZ4jkvuhsmx1gFIvslS-tXEuz9X8Hm7LoB0ydOiOCNgK6ee8VHc=w400-h356" width="400" /></a></div><p>This is actually a great guideline for writing transactional applications.
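The containment idea can be made concrete with a tiny sketch (my own toy Python, not code from the book): effects produced inside a sphere are held back and externalized only at commit, so they remain revocable until then.

```python
class Sphere:
    """Toy sphere of control: contains effects until they can no longer be revoked."""

    def __init__(self, externalize):
        self.externalize = externalize  # callback that makes an effect visible outside
        self.pending = []               # contained effects, still revocable

    def do(self, effect):
        self.pending.append(effect)     # contain the effect; nothing leaves yet

    def commit(self):
        for effect in self.pending:     # only now do effects cross the boundary
            self.externalize(effect)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()            # revoke: the outside world saw nothing

outside_world = []
sphere = Sphere(outside_world.append)
sphere.do("ship order")
sphere.rollback()                       # contained, so revocation is trivial
assert outside_world == []
sphere.do("ship order")
sphere.commit()
assert outside_world == ["ship order"]
```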
The book also prescribes preparing compensating actions for earlier externalized effects. In general, I think it is very beneficial to show more application/pattern recipes for organizing applications into units of work that can be viewed as atomic actions.</p><p>The book differentiates between structural and dynamic dependencies. It wasn't clear to me if this differentiation becomes consequential later on. But structural dependencies may become useful for query/transaction planning, and dynamic dependencies for concurrency control. </p><p></p><ul style="text-align: left;"><li>Structural dependencies. These reflect the hierarchical organization of the system into abstract data types of increasing complexity.</li><li>Dynamic dependencies. As explained previously, this type of dependency arises from the use of shared data.</li></ul><p></p><p>The book introduces a graphical notation. I guess it makes for some pretty diagrams. But I would have preferred a formal or pseudocode notation.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiyPw4Fcg_kMELo3EwoUQm7QFJ6IienmKxYfucCfGwfVxq0vSUwYN7nemKnoWaJcIBSFjVf-Rs4TchP5AVgNZm5CV3-_WybtIoFSlcxB0V_xqX_VwsXY70Uh45wfsNG3m17_ShQug3jQKg5Lt15mzaOeai05QGDncajG-uGN34B8b9A5VWLBdQ1JvtIzfE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="798" data-original-width="1112" height="288" src="https://blogger.googleusercontent.com/img/a/AVvXsEiyPw4Fcg_kMELo3EwoUQm7QFJ6IienmKxYfucCfGwfVxq0vSUwYN7nemKnoWaJcIBSFjVf-Rs4TchP5AVgNZm5CV3-_WybtIoFSlcxB0V_xqX_VwsXY70Uh45wfsNG3m17_ShQug3jQKg5Lt15mzaOeai05QGDncajG-uGN34B8b9A5VWLBdQ1JvtIzfE=w400-h288" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a
href="https://blogger.googleusercontent.com/img/a/AVvXsEgWj6ADdxyjzr3Q23Q12-31tmkBsCEzmav_NXego0KR03cQTUSji8ShKo9SRCA1RPTQJaSk9WQH2mpjjcaaYSyu86DeUixayPwSVISqQ-tj0Ky9LfuC26rbsOpMlNYP9MZB6tTd4mz5UZdE-6INi28UXh605DECHk3qCkcwRzjJs9mAT9Ret9ZJvcZzXg8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1144" data-original-width="1112" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEgWj6ADdxyjzr3Q23Q12-31tmkBsCEzmav_NXego0KR03cQTUSji8ShKo9SRCA1RPTQJaSk9WQH2mpjjcaaYSyu86DeUixayPwSVISqQ-tj0Ky9LfuC26rbsOpMlNYP9MZB6tTd4mz5UZdE-6INi28UXh605DECHk3qCkcwRzjJs9mAT9Ret9ZJvcZzXg8=w388-h400" width="388" /></a></div><p>To settle this problem for the moment, we adopt the following view. If a transaction in the chain aborts during normal processing, then the chain is broken, and the application has to determine how to fix that. However, if a chain breaks because of a system crash, the last transaction, after having been rolled back, should be restarted. This interpretation is shown graphically in Figure 4.11.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgaJqXvQMpM2Rid7VwVk2jtk92sR83AmYb3wR9CneBGcWL0ihIjNCil2qzRH-3UxCx2Ir2SvG1BzMIG7XQQxigedMs3-mq0GgP3cnUuTP6LXTSK1JT1oTbyCqyphe9QbBB1uAEThpNkT8172QkoYuA7cmYUNa0vaOAozDOJP9NrQ7ygXUi30oAlZpsa56w" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="828" data-original-width="1112" height="297" src="https://blogger.googleusercontent.com/img/a/AVvXsEgaJqXvQMpM2Rid7VwVk2jtk92sR83AmYb3wR9CneBGcWL0ihIjNCil2qzRH-3UxCx2Ir2SvG1BzMIG7XQQxigedMs3-mq0GgP3cnUuTP6LXTSK1JT1oTbyCqyphe9QbBB1uAEThpNkT8172QkoYuA7cmYUNa0vaOAozDOJP9NrQ7ygXUi30oAlZpsa56w=w400-h297" width="400" /></a></div><br /><p><b>Nested Transactions: </b>Nested transactions are a generalization of savepoints. 
Whereas savepoints allow organizing a transaction into a sequence of actions that can be rolled back individually, nested transactions form a hierarchy of pieces of work.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhKCWRMVFRGFEFhlKbpD96kktG3VxiN1ct9zE8krWEFZtTdmP0yyiaXlldQO48JdccEAvEUEKsAzSB9FX34PsGvKxJahF-1-_lk5UZhKiI_Ujsd62XKt7Dtgl20Lh9XaKo52KjcBZM7-1IakiF3AYaqn2_QyQG_ZHn7QS9uyMcx4IQ7Y5KGp8hqb_2u8HM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="978" data-original-width="1112" height="352" src="https://blogger.googleusercontent.com/img/a/AVvXsEhKCWRMVFRGFEFhlKbpD96kktG3VxiN1ct9zE8krWEFZtTdmP0yyiaXlldQO48JdccEAvEUEKsAzSB9FX34PsGvKxJahF-1-_lk5UZhKiI_Ujsd62XKt7Dtgl20Lh9XaKo52KjcBZM7-1IakiF3AYaqn2_QyQG_ZHn7QS9uyMcx4IQ7Y5KGp8hqb_2u8HM=w400-h352" width="400" /></a></div><br /><p><b>Distributed Transactions:</b> A distributed transaction is typically a flat transaction that runs in a distributed environment and therefore has to visit several nodes in the network, depending on where the data is. The conceptual difference between a distributed transaction and a nested transaction can be put as follows: The structure of nested transactions is determined by the functional decomposition of the application, that is, by what the application views as spheres of control. The structure of a distributed transaction depends on the distribution of data in a network. In other words, even for a flat transaction, from the application's point of view a distributed transaction may have to be executed if the data involved are scattered across a number of nodes. Distributed subtransactions normally cannot roll back independently, either; their decision to abort also affects the entire transaction. 
This all means that the coupling between subtransactions and their parents is much stronger in our model.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjWOItSpws11SkiOz1MHVMxqJ6xgxX96Gcp0WkqXjlmcFybkEytiMSZxq36hueMyTLEGKFPHuV4rFMBYqtT_YNpT8s_ks8e3_85AIxT2_mi7AziWcE1kvnMOkl-H0TcbMuXhnIFNKHY3sURHqemBgExXe6YpLEc94a55zwqFYxsQi6fkXF_e77r9-IYIlc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1052" data-original-width="1112" height="378" src="https://blogger.googleusercontent.com/img/a/AVvXsEjWOItSpws11SkiOz1MHVMxqJ6xgxX96Gcp0WkqXjlmcFybkEytiMSZxq36hueMyTLEGKFPHuV4rFMBYqtT_YNpT8s_ks8e3_85AIxT2_mi7AziWcE1kvnMOkl-H0TcbMuXhnIFNKHY3sURHqemBgExXe6YpLEc94a55zwqFYxsQi6fkXF_e77r9-IYIlc=w400-h378" width="400" /></a></div><p>The book also talks about sagas. The concept is in two respects an extension of the notion of chained transactions:</p><p></p><ol style="text-align: left;"><li>It defines a chain of transactions as a unit of control; this is what the term saga refers to.</li><li>It uses the compensation idea from multi-level transactions to make the entire chain atomic.</li></ol><p></p><p>For all the transactions executed before, there must be a semantic compensation, because the updates have already been committed.</p><p><br /></p><p>Finally, the spheres of control framework's generality helps think out of the box as well. </p>
<blockquote class="twitter-tweet"><p dir="ltr" lang="en">Not only Gray has described simulations and chaos engineering, but also event sourcing, in a short off-the-cuff remark. Wondering how many valuable things from that book have gotten overlooked over the years. <a href="https://t.co/YCNsp2vppX">pic.twitter.com/YCNsp2vppX</a></p>— Alex P (@ifesdjeen) <a href="https://twitter.com/ifesdjeen/status/1762012801847960047?ref_src=twsrc%5Etfw">February 26, 2024</a></blockquote> <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
<p><br /></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-26035782209628429322024-02-26T21:38:00.013-05:002024-02-26T21:44:59.019-05:00 Recent reads (Feb 2024)<p>Here are the three books I have read recently. Argh, I wish I took some notes while going through these books. It feels hard to write these reviews in retrospect.</p><p><br /></p><h1 style="text-align: left;">The Culture Code: The Secrets of Highly Successful Groups (2018)</h1><p>"What do Pixar, Google and the San Antonio Spurs basketball team have in common?" That was the pitch for the Culture Code book when it came out. That didn't age well, for Google's case at least. Well, the Google example was not about teamwork, but rather Jeff Dean fixing search for AdWords over a weekend, so this is neither here nor there. We can forgive the book for trying to choose sensational examples.</p><p>I did like the book overall. It identifies three things to get the culture right: 1) creating belonging, 2) sharing vulnerability, and 3) establishing purpose.</p><p>Creating belonging is about safety/security. Maslow's hierarchy emphasizes safety and security as fundamental human needs. In a work environment where we feel judged or constantly need to prove ourselves, we struggle to be our authentic selves and contribute at a high level. Without a sense of belonging, even constructive feedback becomes difficult to receive. We need to feel supported and valued, not judged, to truly benefit from feedback. So creating belonging is the foundation of it all.</p><p>Sharing vulnerability goes hand-in-hand with creating safety and trust. It is important to establish this to prevent fail-silent failures in teams/companies.
<a href="https://muratbuffalo.blogspot.com/2019/12/on-advisor-mentee-relationship.html">As I previously discussed in a blog post</a>: "The worse thing that can happen is a fail-silent fault: masking problems/issues and pretending everything is fine, and then failing the other party in the project/effort with little heads up." Sharing vulnerability also helps for establishing intellectual honesty, which is important for any knowledge work.</p><p>Finally, establishing purpose is important for aligning everyone and giving them a greater meaning for their work. The prototypical example is that of the cleaning staff in the hospital viewing their job as being associates in helping nurse patients back to health. It is important to overcommunicate purpose and mission, if possible using slogans and catchy phrases/posters. A study by Once Inc. magazine asked executives at 600 companies to roughly estimate the percentage of their employees who could name the company’s top 3 priorities. Their answer was 64 percent. The truth was 2 percent.</p><p><br /></p><h1 style="text-align: left;">Skunk Works: A Personal Memoir of My Years at Lockheed (1994)</h1><p>This book was such a hard contrast to the Culture Code book. This book talks about extreme team productivity between 1940-1990, when <a href="https://en.wikipedia.org/wiki/Skunk_Works">Skunk Works engineers</a> developed the U-2, SR-71 Blackbird, F-117 Nighthawk, F-22 Raptor, and F-35 Lightning II. The book comes from trenches, it doesn't try to codify productivity advice, rather shows how Skunk Work people worked hard, took ownership, and showed pride in their craftsmanship. Kelly Johnson and Ben Rich were quite the characters. They led by example, and the book also establishes principles of extreme-productivity team work by showing not telling. 
Believe me, I have seen creating belonging, sharing vulnerability, and establishing purpose throughout the Skunk Works stories.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhjWCFecgYvgBfJ2NujGL9DgpS6wueVJD7xBDnT1kHGi9jaOKf58vdBRK5ZTy4aEDjl4-t6ir53HcWhzDbh8aZ4yGAeTFHsldBG0hsV3gFVfd3zXgadX0IrtPsHGMlx9t0vy_8NjGuTB3nE5KIJIGfMTLp8OgvYbh92po6LEDE3UTqvG_7fCYPXFGh8aNE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1200" data-original-width="1920" height="200" src="https://blogger.googleusercontent.com/img/a/AVvXsEhjWCFecgYvgBfJ2NujGL9DgpS6wueVJD7xBDnT1kHGi9jaOKf58vdBRK5ZTy4aEDjl4-t6ir53HcWhzDbh8aZ4yGAeTFHsldBG0hsV3gFVfd3zXgadX0IrtPsHGMlx9t0vy_8NjGuTB3nE5KIJIGfMTLp8OgvYbh92po6LEDE3UTqvG_7fCYPXFGh8aNE" width="320" /></a></div><br /><p></p><p>There is basic science in the book as well. In 1962, Petr Ufimtsev published a theoretical math physics paper titled "The method of edge waves in the physical theory of diffraction." Around 1972, Denys Overholser in Skunk Works stumbled on Petr's paper translated from Russian, and they started putting this theory into practice, which resulted in the F-117 stealth fighter. The F-117 was so far ahead of its time in terms of stealth technology that the equivalent nowadays would probably be producing UFOs that zip around with antigravity engines. </p><p>I loved this book. It was extremely engaging and inspiring. I highly recommend it!</p><p><br /></p><h1 style="text-align: left;">How to Know a Person: The Art of Seeing Others Deeply and Being Deeply Seen (2023)</h1><p>I also highly recommend this book. So-called "soft skills" and "people skills" are very important in living a successful, fulfilling, and meaningful life. It is a pity these are not taught in school. We expect people to know or intuit them, but most people get these wrong most of the time. Only a select few master them, and only at an advanced age.
</p><p>It is best to learn these skills not from an emotionally-gifted feeler but from a nerd who has had to master them painfully over time. This reasoning is somewhat captured <a href="https://muratbuffalo.blogspot.com/2012/01/tell-me-about-your-thought-process-not.html">by the Haruki Murakami quote</a>.</p><blockquote style="border: medium; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;">Gifted writers write without effort; wherever they touch the ground, water pours out. Other writers have to strive (he gives himself as an example); they have to learn to dig wells to get to the water. But when the water dries up (inspiration leaves) for the gifted writer (which happens sooner or later), he becomes stuck and clueless because he has not trained for this. On the other hand, in the same situation, the other type of writer knows how to keep going and succeed.</p></blockquote><p>And for this purpose, David Brooks fits the bill. This is a great book, which I hope you will make time to read. You, your family, and your friends will thank me later. If you need more convincing and want to sample some lessons from the book, do watch this one-hour talk from David Brooks. </p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/YwENbKn3tqI" width="320" youtube-src-id="YwENbKn3tqI"></iframe></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p>A couple of nights ago I watched "Everything Everywhere All At Once". It is a great movie. I would put it up there with The Matrix. At its core, this crazy 2-hour-20-minute movie is about <b>how to know/see a person</b>. 
"Of all the places I could be, I just want to be here with you."</p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/wxN1T1uxQ2g" width="320" youtube-src-id="wxN1T1uxQ2g"></iframe></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-73778871586400671672024-02-22T17:14:00.009-05:002024-02-23T09:35:22.369-05:00 TLA+ modeling of MongoDB logless reconfiguration<p>Here we do a walkthrough of the TLA+ specs for <a href="https://muratbuffalo.blogspot.com/2024/02/design-and-analysis-of-logless-dynamic.html">the MongoDB logless reconfiguration protocol we have reviewed recently.</a></p><p>The specs are available at the <a href="https://github.com/will62794/logless-reconfig">https://github.com/will62794/logless-reconfig</a> repo provided by Will Schultz, Siyuan Zhou, and Ian Dardik. </p><p></p><ul style="text-align: left;"><li><a href="https://github.com/will62794/logless-reconfig/blob/master/MongoLoglessDynamicRaft.tla">This is the protocol model for managing logless reconfiguration.</a> Let's call this the "config state machine" (CSM).</li><li><a href="https://github.com/will62794/logless-reconfig/blob/master/MongoStaticRaft.tla">This is the protocol model for the static MongoDB replication protocol based on Raft.</a> Let's call this the "oplog state machine" (OSM). </li><li><a href="https://github.com/will62794/logless-reconfig/blob/master/MongoRaftReconfig.tla">Finally, this model composes the above two protocols</a> so they work in a superimposed manner.</li></ul><p></p><p>I really admire how these specs provide a modular composition of the reconfiguration protocol and the Raft-based replication protocol. 
I figured I would explain how this works here, since walkthroughs of advanced/intermediate TLA+ specifications, especially for composed systems, are rare.</p><p>I will cover the structure of the two protocols (CSM and OSM) briefly, before diving into how they are composed.</p><p>At the end I will also show you that by <a href="https://will62794.github.io/tla-web">using the tla-web application</a> (developed by Will Schultz), it is possible to interactively explore executions of the combined spec without installing any TLA+ tooling at all, just using the web browser. This is a great way to share specifications and counterexample traces with colleagues who don't dabble in TLA+, and I am excited about this.</p><p><br /></p><h1 style="text-align: left;">CSM: MongoLoglessDynamicRaft </h1><p>This file is at <a href="https://github.com/will62794/logless-reconfig/blob/master/MongoLoglessDynamicRaft.tla">https://github.com/will62794/logless-reconfig/blob/master/MongoLoglessDynamicRaft.tla</a> if you'd like to follow along.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgmx3D7zHGoizeyJuoA2lq4w6bRK5RxcHAMTc_92sNLjn5OMNS6gEfFVHBhMhkgfjfArbLZakgKNoiTGwLxmMOOQePzcReltDSK-y-kuKN8QZxisq8t67Z4MfijSmw14KZ4ogj-9MdnkGKYaF8w1Ly_RvT8ft5ho9nKKQc4pmcrhdMESHgcBNiG7R2pM20" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="330" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEgmx3D7zHGoizeyJuoA2lq4w6bRK5RxcHAMTc_92sNLjn5OMNS6gEfFVHBhMhkgfjfArbLZakgKNoiTGwLxmMOOQePzcReltDSK-y-kuKN8QZxisq8t67Z4MfijSmw14KZ4ogj-9MdnkGKYaF8w1Ly_RvT8ft5ho9nKKQc4pmcrhdMESHgcBNiG7R2pM20=s16000" /></a></div><p></p><p></p><ul style="text-align: left;"><li>Server is the set of all nodes.</li><li>state denotes whether a node is Primary or Secondary.</li><li>currentTerm is the Raft term of a node.</li><li>config is the current config the node knows 
of.</li><li>configVersion is the version associated with the config.</li><li>configTerm is the term associated with that config. </li></ul><p><a href="https://muratbuffalo.blogspot.com/2024/02/design-and-analysis-of-logless-dynamic.html">Refer back to our post on MongoDB logless reconfiguration</a> to see how configVersion and configTerm play out. For ensuring safety, the protocol uses (configTerm, configVersion) as a pair, and requires that the CSM commits the config on the most recent configTerm, and that the OSM follows the committed configs in sequential order. </p><p>These are the four possible actions in the CSM protocol.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjju1lJULAZbQKtO1LpHo2yCrospWBJcmQ5xeCBtT79wM9h40DxbHgYTWoEC6VfH100Aka-joyfOBXRx02Tr9LAsgcl9Z7rTNLAudh2o0NcNUN_sO8TzDF1cCeKd9Mg-h3YpF8l52UQJYBVZgv0C2il18VVfDzt0hi_4_0two6AbU9s0JBeAdSSN3vlV1U" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="206" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEjju1lJULAZbQKtO1LpHo2yCrospWBJcmQ5xeCBtT79wM9h40DxbHgYTWoEC6VfH100Aka-joyfOBXRx02Tr9LAsgcl9Z7rTNLAudh2o0NcNUN_sO8TzDF1cCeKd9Mg-h3YpF8l52UQJYBVZgv0C2il18VVfDzt0hi_4_0two6AbU9s0JBeAdSSN3vlV1U=s16000" /></a><span style="text-align: left;"> </span></div><p></p><p>The first action does not seem to constrain newConfig, appearing to allow it to be any subset of the Server set, suggesting an unrestricted/arbitrary next reconfiguration. But if you look at the preconditions of the <b>Reconfig</b> action on line 99, the <b>QuorumsOverlap</b> check requires that all quorums of newConfig share at least one overlapping node with all quorums of the current config. Phew, sanity restored. If we comment out the QuorumsOverlap condition, we would encounter a safety violation. 
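For intuition, here is a rough Python sketch of the majority-quorum and overlap checks, and of the (configTerm, configVersion) ordering. This is my own rendering of the idea; the authoritative definitions live in the TLA+ spec.

```python
from itertools import combinations

def quorums(config):
    """All majority subsets of a config (a set of server names)."""
    n = len(config)
    majority = n // 2 + 1
    return [set(q) for k in range(majority, n + 1)
            for q in combinations(sorted(config), k)]

def quorums_overlap(config1, config2):
    """Every quorum of config1 must intersect every quorum of config2."""
    return all(q1 & q2 for q1 in quorums(config1) for q2 in quorums(config2))

def is_newer_config(a, b):
    """Configs are ordered by (configTerm, configVersion) lexicographically."""
    return (a["configTerm"], a["configVersion"]) > (b["configTerm"], b["configVersion"])
```

For example, moving from {s1, s2, s3} to {s1, s2} passes the overlap check, while jumping from {s1} to {s2, s3} does not: the lone quorum {s1} shares no node with the quorum {s2, s3}, which is exactly the kind of reconfiguration that loses committed state.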
You can try this after I teach you about the tla-web app at the end of this post.</p><p><b>SendConfig</b> is used for sharing config between any two servers. The newer config (the one with the higher configTerm, or, when configTerms are equal, the higher configVersion) dominates and gets adopted by the other server.</p><p><b>UpdateTerms</b> is used for updating the currentTerm variable. And <b>BecomeLeader</b> is for one server to become a Primary by getting a quorum of votes from others and increasing its currentTerm. These two actions are shared with the OSM protocol, and we will later see how the composition model insists that these actions get jointly executed by both CSM and OSM for a superimposed composition of the two protocols.</p><p><br /></p><h1 style="text-align: left;">OSM: MongoStaticRaft</h1><p>This file is at <a href="https://github.com/will62794/logless-reconfig/blob/master/MongoStaticRaft.tla">https://github.com/will62794/logless-reconfig/blob/master/MongoStaticRaft.tla</a> if you'd like to follow along. 
This is <a href="https://muratbuffalo.blogspot.com/2024/01/fault-tolerant-replication-with-pull.html">the pull-based replication protocol based on Raft that we covered in an earlier post</a>.</p><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjc9KyDgJZFR1j4BTWigfN0y9ksL4J8AV2cgmlC28Ln8w05eYXy1uSwa6NTsF3DEtb6tOUJ2JT_h0ShLSoaVfJPrtgoE5Odyg1EbrD4QvPu85h2DhEsu2kfv1vEODkqIJJLcqSK7jZbyFddl-RxXakFz6JKatzM3IASjvOv82CV0ZZu_IjiNYOc0fsg91A" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="360" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEjc9KyDgJZFR1j4BTWigfN0y9ksL4J8AV2cgmlC28Ln8w05eYXy1uSwa6NTsF3DEtb6tOUJ2JT_h0ShLSoaVfJPrtgoE5Odyg1EbrD4QvPu85h2DhEsu2kfv1vEODkqIJJLcqSK7jZbyFddl-RxXakFz6JKatzM3IASjvOv82CV0ZZu_IjiNYOc0fsg91A=s16000" /></a></div></div><p>Many of these variables/constants are shared with the CSM above, so let's talk about the diff.</p><p></p><ul style="text-align: left;"><li>log is the oplog for the OSM protocol.</li><li>committed is the set of committed (majority replicated) entries in the log.</li></ul><p></p><p>Here are the actions of this OSM protocol.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgQEX9ZZ9N3tyuwX5sTLiKlchg134bkPU4vTuhzXeFKgzhvQWb2ywo3DGAwumJleNl9kEIdwDCSKr-GCuTpetNW8hn6gP91wHrHTyF3du4RMy5fR9jCz3mH6vV3Mh0b-tm7LXgY3ymgzlNCapedI9Xfwd_utnHthbAhHzcycdURm-vFiGvqxfoHANWQEZw" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="282" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEgQEX9ZZ9N3tyuwX5sTLiKlchg134bkPU4vTuhzXeFKgzhvQWb2ywo3DGAwumJleNl9kEIdwDCSKr-GCuTpetNW8hn6gP91wHrHTyF3du4RMy5fR9jCz3mH6vV3Mh0b-tm7LXgY3ymgzlNCapedI9Xfwd_utnHthbAhHzcycdURm-vFiGvqxfoHANWQEZw=s16000" /></a><span style="text-align: left;"> </span></div><p></p><p><b>BecomeLeader</b> is 
for becoming a Raft Primary, and <b>UpdateTerms</b> is for propagating the Raft currentTerm variable between two nodes. (These two actions are in fact the same as the identically named actions in the CSM: MongoLoglessDynamicRaft spec.) <b>ClientRequest</b> is for the Primary to accept a request for replication. <b>GetEntries</b> performs pull-based replication of oplog entries. <b>RollbackEntries</b> performs oplog rollback when warranted upon a primary change, and <b>CommitEntry</b> commits majority-replicated entries. Standard Raft stuff.</p><p><br /></p><h1 style="text-align: left;">MongoRaftReconfig</h1><p>This file is at <a href="https://github.com/will62794/logless-reconfig/blob/master/MongoRaftReconfig.tla">https://github.com/will62794/logless-reconfig/blob/master/MongoRaftReconfig.tla</a></p><p>The MongoRaftReconfig file composes MongoLoglessDynamicRaft (CSM) and MongoStaticRaft (OSM), and regulates the superimposed execution of the two specs.</p><p></p><ul style="text-align: left;"><li>CSM and OSM share the BecomeLeader and UpdateTerms actions.</li><li>OSM reads the config from CSM and uses it as the process set for executing the replication/commit and leader election protocol.</li><li>CSM reads the OplogCommitment condition from OSM. (Note that the composition model guards the CSM's Reconfig action with OplogCommitment, and commenting this out will also lead to a safety violation, as we will explore below.)</li></ul><p></p><p>Because BecomeLeader and UpdateTerms are shared actions, the composition spec restricts both CSM's and OSM's possible computations/executions to become refinements of their respective original specs.</p><p>Note that the initial states of CSM and OSM also need to agree on their shared variables, which are currentTerm, state, and config. 
The variables of the composition spec are the union of the variables of CSM and OSM.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgtoMwRjBPUBPicPLSnrHyneLToxXw-sNEVNwZsfecQ9O1AHvAM9fYJ3tTlymekBN4jsWhHInCUCF0kn8_xSAJjrlVtRGi0NBx0E5D_J-lR44PEUvz3HeUsKByKNDcj3gWlATKNS7sLUcOBeRNmjACpr4BDrMbih2--i-k4qKwOczzF51mFdFr0ivLiDGE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="414" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEgtoMwRjBPUBPicPLSnrHyneLToxXw-sNEVNwZsfecQ9O1AHvAM9fYJ3tTlymekBN4jsWhHInCUCF0kn8_xSAJjrlVtRGi0NBx0E5D_J-lR44PEUvz3HeUsKByKNDcj3gWlATKNS7sLUcOBeRNmjACpr4BDrMbih2--i-k4qKwOczzF51mFdFr0ivLiDGE=s16000" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhMcb5WV3l9R-9ykzotLqByq__9bpi_HxGV6tsZq9hFH9af79-uTiL-VKhv0MYctXHzzyfAlOo6mP0oY_VVYe1FawUSsgugy_3UXBbi_veJd4gH7-5mE-SvifNHVsNa7PLyEhqNwtIIr0XvgGqf6TOBIWzNRzc-E_wnlV1-MBXv0z30d4SgSgPhZNXc9Bc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="128" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEhMcb5WV3l9R-9ykzotLqByq__9bpi_HxGV6tsZq9hFH9af79-uTiL-VKhv0MYctXHzzyfAlOo6mP0oY_VVYe1FawUSsgugy_3UXBbi_veJd4gH7-5mE-SvifNHVsNa7PLyEhqNwtIIr0XvgGqf6TOBIWzNRzc-E_wnlV1-MBXv0z30d4SgSgPhZNXc9Bc=s16000" /></a></div><p></p><p>The composition spec defines OSMNext and CSMNext as follows. Note the OplogCommitment guarding the CSM reconfig action. The composition spec also restricts the shared actions BecomeLeader and UpdateTerms to execute jointly, meaning that both actions take effect in the corresponding specs at the same step. 
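To make the shape of this composition concrete, here is a toy Python sketch (my own simplification, not the actual TLA+): one combined state dict where an OSM-only action leaves the CSM variables untouched, a CSM-only action leaves the OSM variables untouched, and a shared action updates the shared variables exactly once for both sub-specs.

```python
# Toy model of the composed spec. OSM owns "log", CSM owns "configVersion",
# and the two sub-specs share "currentTerm" and "state".

def osm_next(s):
    # An OSM-only action (think ClientRequest): CSM variables are unchanged.
    return {**s, "log": s["log"] + [s["currentTerm"]]}

def csm_next(s):
    # A CSM-only action (think Reconfig): OSM variables are unchanged.
    return {**s, "configVersion": s["configVersion"] + 1}

def joint_next(s):
    # A shared action (think BecomeLeader): both sub-specs take the step
    # together, so the shared variables change exactly once and stay in sync.
    return {**s, "currentTerm": s["currentTerm"] + 1, "state": "Primary"}
```

A step of the composed system is then one of these three choices, which mirrors how the Next relation of the composition spec is assembled.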
This keeps the CSM and OSM states in sync and prevents the two state machines from diverging from each other's views.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgx1jkVdg_hKEDJvELeHo1DO2MOflXydjJQEjQfcyYND7XgWS7QFp_F2OCa4YT9mEWUWgK5rIyUkt9hPO3STKtWuge2t58M39REKq3YoTyoh_ITQ6nXuO1Td8KrQFxU0ng4NuYmbD5kBWutWXmuoZ4E83jRUgrlDzFqVSUkPNbJoSuq9KuQT6ShfbxzmrY" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="242" data-original-width="1268" src="https://blogger.googleusercontent.com/img/a/AVvXsEgx1jkVdg_hKEDJvELeHo1DO2MOflXydjJQEjQfcyYND7XgWS7QFp_F2OCa4YT9mEWUWgK5rIyUkt9hPO3STKtWuge2t58M39REKq3YoTyoh_ITQ6nXuO1Td8KrQFxU0ng4NuYmbD5kBWutWXmuoZ4E83jRUgrlDzFqVSUkPNbJoSuq9KuQT6ShfbxzmrY=s16000" /></a></div><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgJUsryBH6Ck68tygZEPqk9V1hVgE7QkH_voFG-bd_jtqlenlvkwBs1SOnJONnWoHlFT8l8UbS6f9umAG1X3CTrcHM0ztb4vLVQAI6UljmvMDrj8cZVn8y72RkC4ldwU_Ycc71FqhjAa6zuF1QhnIEjfCkFPrPJ4KPHxImLWM2G5SdOuiEOZ4Z0q1omd0k" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="916" data-original-width="1314" src="https://blogger.googleusercontent.com/img/a/AVvXsEgJUsryBH6Ck68tygZEPqk9V1hVgE7QkH_voFG-bd_jtqlenlvkwBs1SOnJONnWoHlFT8l8UbS6f9umAG1X3CTrcHM0ztb4vLVQAI6UljmvMDrj8cZVn8y72RkC4ldwU_Ycc71FqhjAa6zuF1QhnIEjfCkFPrPJ4KPHxImLWM2G5SdOuiEOZ4Z0q1omd0k=s16000" /></a></div><br />The Next relation here defines a step of the composition spec as either OSMNext, CSMNext, or JointNext. <p></p><p><br /></p><h1 style="text-align: left;">State exploration</h1><p>It is time to reach out to the tla-web app and play around with an interactive version of this spec. 
The spec here is equivalent to the compositional spec we reviewed above, except that it is not specified in the compositional manner but rather combined into a single file.</p><p>If you are ready, just follow my directions. This will be fun. </p><p><a href="https://will62794.github.io/tla-web/#!/home?specpath=./specs/MongoRaftReconfig.tla&constants%5BServer%5D=%7B%22s1%22,%22s2%22,%22s3%22%7D&constants%5BSecondary%5D=%22Secondary%22&constants%5BPrimary%5D=%22Primary%22&constants%5BNil%5D=%22Nil%22">Open this link</a> in another tab. This will readily load the joint TLA+ spec. And you can see the spec by clicking the spec button at top right. You can even modify the spec by commenting out a line.</p><p>And that is what we are going to do. The best way to learn something is to break it and see why it breaks, and then fix it back. Go to line 259, and comment out the precondition check for OplogCommitment by putting <b>\*</b> at the beginning of the line. Click the details button up top to come back to the initial page.</p><p>Now, the prompt there is asking us to choose a possible initial state. Choose the one where <b>config={s1,s2}</b>, which means we have two replicas in the initial configuration. This is the second choice from the top. Inspect the variable assignments in this initial state and click on it. Note that the right side of the page changed to show the history of the trace. We only got one step into this execution, so that is what we see on the right. The left pane also changed to list the new actions/transitions that are enabled in the next state resulting from our initial state choice.</p><p>Choose the transition where node 1 becomes the leader. This is the <b>BecomeLeader action</b> with s1 becoming the leader using the quorum s1,s2. Note that there are two sub-buttons enabled; the other option is to choose s2 to become the leader. 
So, follow me, choose the first button to make s1 the leader.</p><p>Note that our trace grew by another state transition on the right, and we have a new set of options opened up for us. Choose<b> SendConfig s1, s2</b>. Now our trace is of length 3.</p><p>Oh, I almost forgot: on the right pane, in the text entry at the top, enter LeaderCompleteness and press the AddTraceExpression button next to it. This will let us monitor the invariant LeaderCompleteness, which is on line 320 in the Spec tab if you'd like to check. This evaluates to TRUE now, but since we introduced a bug in the spec by commenting out the OplogCommitment precondition, we should expect to see it violated soon. </p><p>Let's go faster now.</p><p>Choose<b> Reconfig s1, {s1}</b> so that node 1 reconfigures the system down to only one replica, itself.</p><p>Choose the <b>ClientRequest s1</b> option. And all the while, monitor how the state variables evolve on the right as we take the system through this path of execution. Notice that the log variable on s1 becomes <<1>>.</p><p><b>CommitEntry s1 {s1}</b>. Note that s1 did not need another node to commit: since the reconfiguration took the system down to a single node, it can commit locally.</p><p>Now, choose <b>Reconfig s1, {s1,s2}.</b></p><p><b>SendConfig s1, s2</b></p><p><b>Reconfig s1, {s1,s2,s3}</b></p><p><b>SendConfig s1, s3</b></p><p><b>SendConfig s1, s2</b></p><p><b>BecomeLeader s2 {s2,s3}.</b> Wow, notice that LeaderCompleteness became FALSE. This is because s2 does not have what s1 committed (locally), so we lost a commit. Let's continue to see what more trouble this can cause. On the right pane, add the trace expression StateMachineSafety, which is on line 325 in the spec. Note that this still shows as TRUE.</p><p><b>SendConfig s2, s3</b></p><p><b>ClientRequest s2</b>, so s2 accepts another value, <<2>>, into the same slot (the first slot in the log).</p><p><b>GetEntries s3, s2</b></p><p><b>CommitEntry s2 {s2, s3}. </b>Wow! 
Notice that StateMachineSafety is violated now. This invariant said:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjzRHaQHizJwm_UFota8v0EGBVa652kWGih8cUlJOl4fCMWjveYXXJ40IfP6F3RUxCLv_GS1O7we9qYORmIDAVTMBFSumiCLGuhkXzplTwM4JxA1vGUU7hX4V5xp61H7OyAiqNfiDn_HcIH6tNycrFV2XS9DvtS7t63hgSVdTlkYuGpGgadQR2gE_Z28MY" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="120" data-original-width="1502" src="https://blogger.googleusercontent.com/img/a/AVvXsEjzRHaQHizJwm_UFota8v0EGBVa652kWGih8cUlJOl4fCMWjveYXXJ40IfP6F3RUxCLv_GS1O7we9qYORmIDAVTMBFSumiCLGuhkXzplTwM4JxA1vGUU7hX4V5xp61H7OyAiqNfiDn_HcIH6tNycrFV2XS9DvtS7t63hgSVdTlkYuGpGgadQR2gE_Z28MY=s16000" /></a></div><p></p><p>By modifying the spec and introducing a bug, we violated this invariant. We have a 16-step counterexample for this violation. By copying the URL, or using the Copy Trace Link button, we can share this counterexample with our colleagues.</p><p>But there is a catch in our case. We made our modification in our local buffer, not at the spec URL, so this link will load the correct version and will only be able to follow the trace to step 8, where the faulty version diverges from the unmodified correct version. If you'd like the full 16-step counterexample, you can put the modified version at a publicly accessible link, load that version, and share the steps from that faulty version. </p><div><br /></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-29923795112627063562024-02-21T17:52:00.002-05:002024-02-26T08:23:35.714-05:00Adapting TPC-C Benchmark to Measure Performance of Multi-Document Transactions in MongoDB<p><a href="https://www.vldb.org/pvldb/vol12/p2254-kamsky.pdf">This paper appeared in VLDB 2019</a>.</p><p>Benchmarks are a necessary evil for database evaluation. 
Benchmarks often focus on narrow aspects and specific workloads, creating a misleading picture for broader real-world applications/workloads. However, for a quick comparative performance snapshot, they remain a crucial tool.</p><p>Popular benchmarks like YCSB, designed for simple key-value operations, fall short in capturing MongoDB's features, including secondary indexes, flexible queries, complex aggregations, and even multi-statement multi-document ACID transactions (since version 4.0).</p><p>Standard RDBMS benchmarks haven’t been a good fit for MongoDB either, since they require a normalized relational schema and SQL operations. Consider TPC-C, which simulates a commerce system with five types of transactions involving customers, orders, warehouses, districts, stock, and items represented with data in nine normalized tables. TPC-C requires a specific relational schema and prescribed SQL statements.</p><p>Adapting TPC-C to MongoDB demands a delicate balancing act. While mimicking the familiar TPC-C workload and adhering to its ACID requirements is essential for maintaining the benchmark's value for those accustomed to it, significant modifications are necessary to account for MongoDB's unique data structures and query capabilities. This paper provides such an approach, creating a performance test suite that incorporates MongoDB best practices while remaining consistent with TPC-C's core principles.</p><p>To build this benchmark, the paper leverages <a href="https://github.com/apavlo/py-tpcc/wiki">Andy Pavlo's 2011 PyTPCC repository</a>, a Python-based framework for running TPC-C benchmarks on NoSQL systems. While PyTPCC included an initial driver implementation for MongoDB, it lacked support for transactions since it was written for the NoSQL systems of 2011. This paper addresses this gap by adding transaction capability to PyTPCC. 
The modified benchmark, available at <a href="https://github.com/mongodb-labs/py-tpcc">https://github.com/mongodb-labs/py-tpcc</a>, enables a detailed evaluation of MongoDB multi-document transactions over a single replicaset deployment.</p><p><br /></p><h1 style="text-align: left;">Background</h1><p>Please note that this evaluation focuses only on transactions over a single replicaset deployment using MongoDB 4.0 in 2019. <a href="https://muratbuffalo.blogspot.com/2024/02/verifying-transactional-consistency-of.html">In a previous post, we reviewed the basics of transaction implementation across three different MongoDB deployments:</a> single-node WiredTiger, replicaset, and sharded cluster deployments. While MongoDB now supports general multi-document transactions across sharded deployments, that topic is not covered in this paper.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjxr7GcEGAjDAJ9UwMcwCUykCheG5Ali-Ivi5Rs-H5cZmy0qfHiKxt2JVIvsX9sFvvnPsmajiQiDBxUlQMqSqkuhyVCdl9bKJLQk86fdgpXt3ylNeZE3Mo1E0wMlFkFe0RvBRrUWSqw0F5TRHTM8oY1QiLNUAoW7c_C7DzBFAfEsZPzBwVkSJsbsRLlLKQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="218" data-original-width="594" height="117" src="https://blogger.googleusercontent.com/img/a/AVvXsEjxr7GcEGAjDAJ9UwMcwCUykCheG5Ali-Ivi5Rs-H5cZmy0qfHiKxt2JVIvsX9sFvvnPsmajiQiDBxUlQMqSqkuhyVCdl9bKJLQk86fdgpXt3ylNeZE3Mo1E0wMlFkFe0RvBRrUWSqw0F5TRHTM8oY1QiLNUAoW7c_C7DzBFAfEsZPzBwVkSJsbsRLlLKQ" width="320" /></a></div><p>The MongoDB query language (MQL) does not map directly to SQL, but it supports a similar set of CRUD operations, as shown in Table 1. MongoDB supports primary as well as secondary indexes and can speed up queries by looking up documents in indexes, including returning the full result from the index alone (covered index queries). 
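To illustrate the covered-query idea with a toy example (a conceptual sketch of my own, not MongoDB code): if an index already stores every field that a query filters on and returns, the query can be answered from the index alone, without fetching any documents.

```python
# Toy stock collection and a compound "index" on (warehouse, item) that
# also carries qty, standing in for a covering index.
documents = [
    {"_id": 1, "warehouse": "W1", "item": "bolt", "qty": 40},
    {"_id": 2, "warehouse": "W1", "item": "nut", "qty": 15},
    {"_id": 3, "warehouse": "W2", "item": "bolt", "qty": 7},
]

index = {(d["warehouse"], d["item"]): d["qty"] for d in documents}

def covered_stock_lookup(warehouse, item):
    # Served entirely from the index; the documents list is never touched.
    return index.get((warehouse, item))
```

Here `covered_stock_lookup("W1", "bolt")` returns 40 straight out of the index, which is the round-trip savings a covered index query gives you.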
Indexes can be created on regular fields as well as fields embedded inside arrays.</p><p>MongoDB transactions provide the ACID guarantees that TPC-C requires for correctness. Specifically, they provide a snapshot isolation guarantee. A single snapshot of the data is used for the duration of the transaction. A snapshot is a single point-in-time view of the data at a distinct cluster time, maintained via a cluster-wide logical clock. Once a transaction begins with a snapshot at a cluster time, no subsequent writes outside of that transaction's context occurring after that cluster time will be seen within the transaction. However, transactions will be able to view their own subsequent writes that occur after the snapshot’s cluster time, providing the "read your own writes" guarantee. Once a transaction starts, its snapshot view of the data is preserved until it either commits or aborts. When a transaction commits, all data changes made in the transaction are saved and made visible outside the transaction. When a transaction aborts, all data changes made in the transaction are discarded without ever becoming visible.</p><p>Within MongoDB transactions, readConcern is always set to "snapshot". Multi-document transactions in MongoDB are committed with "majority" writeConcern, which means two out of three nodes in the replicaset must commit all operations before acknowledgment.</p><p><a href="https://muratbuffalo.blogspot.com/2024/02/verifying-transactional-consistency-of.html">As we discussed in our previous post on MongoDB transactions</a>, while they are "OCC", thanks to the underlying WiredTiger holding the lock on first access, they are less prone to aborting than a pure OCC transaction. An in-progress transaction stops later writes (be they from other transactions or single writes) instead of getting aborted by them. In other words, transactions immediately obtain locks on documents being written, or abort if the lock cannot be obtained. 
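A toy model of this first-writer-locks behavior (my own sketch for intuition, not the WiredTiger implementation):

```python
class WriteConflict(Exception):
    pass

class DocumentLocks:
    """Toy model: a transaction takes a document lock on its first write and
    holds it until commit/abort; a second writer fails immediately."""

    def __init__(self):
        self.owner = {}  # doc_id -> id of the transaction holding the lock

    def write(self, txn, doc_id):
        holder = self.owner.setdefault(doc_id, txn)  # lock on first write
        if holder != txn:
            raise WriteConflict(f"{doc_id} is locked by {holder}")

    def finish(self, txn):
        # Commit or abort: release all locks this transaction holds.
        self.owner = {d: t for d, t in self.owner.items() if t != txn}
```

With this model, if t1 writes document "a", a later write by t2 to "a" raises WriteConflict right away; once t1 finishes, t2's retry succeeds. That is the sense in which the second transaction fails fast and can retry.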
This ensures that attempts by two transactions to write to the same document will immediately fail for the second transaction, at which point it can choose to retry as appropriate for the application.</p><p><br /></p><h1 style="text-align: left;">Evaluation</h1><p>The replicaset deployment used the default database configuration provided by the MongoDB Atlas cloud offering. Performance is reported for an M60 Atlas replica set with writeConcern "majority" for durability, along with readConcern "snapshot" for most transactions, and a committed-reads equivalent (readConcern "majority", causal consistency true) for the STOCK LEVEL transaction. Figure 1 shows transactions-per-minute-C (tpmC) values for varying numbers of warehouses and client thread counts. Using more warehouses results in reduced throughput, I think due to the need to coordinate transactions across more documents. And remember, we don't get to reap the benefits of sharding in the face of more warehouses, as this deployment is a single replicaset. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjYUvy2wR9aZzVYVABrDPWeCB8_xjrnXKJBw7-eQDMMGsJQgtL7Pgrb5cUFP_4MfhT1UzkityWxDo2T9nZKhCiMYQ3TEAC5zH_O_qN97XEbdMOUKg66s9sfANRsM1goRtKbTtjkWFzNdeSxkICCsGkVwIwuYB4cbQAWiVJVUL4M0HYOnDuQ2BVTIjjW_P8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="640" data-original-width="1096" src="https://blogger.googleusercontent.com/img/a/AVvXsEjYUvy2wR9aZzVYVABrDPWeCB8_xjrnXKJBw7-eQDMMGsJQgtL7Pgrb5cUFP_4MfhT1UzkityWxDo2T9nZKhCiMYQ3TEAC5zH_O_qN97XEbdMOUKg66s9sfANRsM1goRtKbTtjkWFzNdeSxkICCsGkVwIwuYB4cbQAWiVJVUL4M0HYOnDuQ2BVTIjjW_P8=s16000" /></a></div><p>The original PyTPCC benchmark provided a normalized option, which mirrored the RDBMS schema exactly, and a denormalized option, which embedded all customer information (orders, order lines, history) into the customer document. 
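Roughly, the two original schema options look like this (illustrative documents of my own, with made-up values, loosely following TPC-C naming):

```python
# Normalized option: mirrors the relational TPC-C tables, one small
# document per row, joined by id fields at query time.
customer   = {"_id": 42, "c_name": "Alice", "c_w_id": 1}
order      = {"_id": 7, "o_c_id": 42, "o_entry_d": "2019-01-01"}
order_line = {"ol_o_id": 7, "ol_number": 1, "ol_i_id": 555, "ol_qty": 3}

# Fully denormalized option: everything about a customer in one document,
# whose orders/history arrays keep growing over the customer's lifetime.
customer_embedded = {
    "_id": 42,
    "c_name": "Alice",
    "orders": [
        {"o_id": 7,
         "o_entry_d": "2019-01-01",
         "order_lines": [{"ol_number": 1, "ol_i_id": 555, "ol_qty": 3}]},
    ],
    "history": [{"h_date": "2019-01-01", "h_amount": 10.0}],
}
```

Note how in the second form the orders and history arrays have no natural bound: every new order grows the same document.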
This specific denormalization, however, is identified as an antipattern, as it leads to unbounded growth and performance degradation. Following recommended MongoDB schema practices, the evaluation adopted a modified denormalized schema, maintaining a normalized structure for most data and embedding only order lines within their respective order documents. This aligns with common document database best practices, since frequent access to order lines together with orders justifies the embedding. Order lines within an order are fixed in number, preventing unbounded growth. Interestingly, this optimized denormalized schema resulted in a smaller data footprint than the fully normalized one, because redundant information in order lines was eliminated. The results in Figure 2 further highlight the benefits of this modified denormalization. The performance win comes from reducing the number of round trips to the database: you can think of it as pre-joining order lines into the orders table.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgkHkTFVwhzOsNyY6hFiJo8AWyeX-6O1rBws2a82hvqPEWNZ8yGGndWcllvPz_DVa4x1chMYjjvS9FUKDAvmbChJk8kzNkIAlz50RvGq3-xvObL3LNCIOe7td07-T3rzpWWveziL2NTMech8Xv7Umr_IsSTAYauizQmcls1B5Kqj7TUEEf6_sSRG6P9_HU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="390" data-original-width="602" height="259" src="https://blogger.googleusercontent.com/img/a/AVvXsEgkHkTFVwhzOsNyY6hFiJo8AWyeX-6O1rBws2a82hvqPEWNZ8yGGndWcllvPz_DVa4x1chMYjjvS9FUKDAvmbChJk8kzNkIAlz50RvGq3-xvObL3LNCIOe7td07-T3rzpWWveziL2NTMech8Xv7Umr_IsSTAYauizQmcls1B5Kqj7TUEEf6_sSRG6P9_HU=w400-h259" width="400" /></a></div><p><br /></p><p>Several areas presented opportunities for further latency reduction. Streamlining queries and requesting only necessary fields help reduce data transfer and processing time. 
Inspecting the logs showed that transaction retries stemmed from performing extensive operations before encountering write conflicts. Re-ordering write operations to expose write conflicts as early in the transaction as possible, as well as moving such writes before reads where possible helped address these inefficiencies. Several transactions followed a pattern of selecting and updating the same record. Using MongoDB's findAndModify operation reduced those two database interactions to one, and significantly improved performance as shown in Figure 3.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgsvZiXqAL5yh_EUfCQQozLnhlOcn1H38PdL0cVXuUsaYC_FpER8xhYBT1BV_r4EgnKOJ9CgnPuSfg5Xdm37ReidoP6GxWKo5uwcisGWjNLpBK0VHENlihl4pWPtHSg-R4Dt2R1VMCGcTJby4XBK_ROpcJ3zIMC7mn9pCdEdeY-e2TTMz5Bqe9FvW_4EXk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="586" data-original-width="648" height="362" src="https://blogger.googleusercontent.com/img/a/AVvXsEgsvZiXqAL5yh_EUfCQQozLnhlOcn1H38PdL0cVXuUsaYC_FpER8xhYBT1BV_r4EgnKOJ9CgnPuSfg5Xdm37ReidoP6GxWKo5uwcisGWjNLpBK0VHENlihl4pWPtHSg-R4Dt2R1VMCGcTJby4XBK_ROpcJ3zIMC7mn9pCdEdeY-e2TTMz5Bqe9FvW_4EXk=w400-h362" width="400" /></a></div><br /><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh33V957S7cfWPHkV_j17UR-P2h3Ca0HQaZ3v_Qy_m1SF4HipTEEBU2QS4Q_4Tif024NPHDRPNiboXppsX7D9aq13lH5al41mMqAkd7_19uuY6CMsTisdTLDapfgXEg8YduE0ZMPDiNl75VqiRVn0lFBx7KTQ5P2lhVtFskHZeo1mZMF9cGuyFC8FCdy2o" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1754" data-original-width="1136" src="https://blogger.googleusercontent.com/img/a/AVvXsEh33V957S7cfWPHkV_j17UR-P2h3Ca0HQaZ3v_Qy_m1SF4HipTEEBU2QS4Q_4Tif024NPHDRPNiboXppsX7D9aq13lH5al41mMqAkd7_19uuY6CMsTisdTLDapfgXEg8YduE0ZMPDiNl75VqiRVn0lFBx7KTQ5P2lhVtFskHZeo1mZMF9cGuyFC8FCdy2o=s16000" /></a></div><br 
/><div><br /></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-49449959542937704942024-02-20T16:47:00.006-05:002024-02-20T17:10:17.809-05:00 Fault tolerance (Transaction processing book)<p>This is Chapter 3 from <a href="https://muratbuffalo.blogspot.com/2024/02/transaction-processing-book-grayreuter.html">the Transaction Processing Book Gray/Reuter 1992</a>.</p><p>Why does the fault-tolerance discussion come so early in the book? We haven't even started talking about transactional programming styles, concurrency theory, concurrency control. The reason is that the book uses dealing with failures as a motivation for adopting transaction primitives and a transactional programming style. I will highlight this argument now, and outline how the book builds to that crescendo in about 50 pages.</p><p>The chapter starts with an astounding observation. I'm continuously astounded by the clarity of thinking in this book: <i>"The presence of design faults is the ultimate limit to system availability; we have techniques that mask other kinds of faults."</i></p><p>In the coming sections, the book introduces the concepts of faults, failures, availability, reliability, and discusses hardware fault-tolerance through redundancy. It celebrates wins in hardware reliability through several examples, including experience from Tandem computer systems Gray worked on: <i>"This is just one example of how technology and design have improved the maintenance picture. Since 1985, the size of Tandem’s customer engineering staff has held almost constant and shifted its focus from maintenance to installation, even while the installed base tripled. This is an industry trend; other vendors report similar experience. 
Hardware maintenance is being simplified or eliminated."</i></p><p><a href="https://muratbuffalo.blogspot.com/2024/02/transaction-processing-book-grayreuter.html">As I mentioned in my previous post</a>, I am impressed by the quantitative approach the book takes. It cites several surveys and studies to back up its claims. One question that occurs to me after Section 3.3 is whether this trend has extrapolated to the new hardware introduced since then. It seems like we are doing a great job with hardware reliability through isolating/discarding malfunctioning parts and carrying out operations using redundant copies. Well, of course, there are <a href="https://muratbuffalo.blogspot.com/2019/09/gray-failure-achilles-heel-of-cloud.html">hardware-gray-failures</a> and <a href="https://muratbuffalo.blogspot.com/2021/06/cores-that-dont-count.html">fail</a>-<a href="https://muratbuffalo.blogspot.com/2021/06/silent-data-corruptions-at-scale.html">silent</a> hardware faults that are hard to detect, but we seem to be managing OK overall. </p><p>A more interesting question to ask is: <i>"Why are we unable to have the same kind of reliability/maintenance gains for software as easily?"</i> The book acknowledges this is a hard question, again referring to many surveys and case studies. It sums these up as follows: <i>"Perfect software of substantial complexity is impossible until someone breeds a species of super-programmers. Few people believe design bugs can be eliminated. Good specifications, good design methodology, good tools, good management, and good designers are all essential to quality software. These are the fault-prevention approaches, and they do have a big pay-off. 
However, after implementing all these improvements, there will still be a residue of problems."</i></p><p>Building on these, towards the end of the chapter, the book makes its case for transactions:</p><p><b>"In the limit, all faults are software faults --software is responsible for masking all the other faults. The best idea is to write perfect programs, but that seems infeasible. The next-best idea is to tolerate imperfect programs. The combination of failfast, transactions, and system pairs or process pairs seems to tolerate many transient software faults."</b></p><p>This is the technical argument.</p><p><i>"Transactions, and their ACID properties, have four nice features:</i></p><p></p><ul style="text-align: left;"><li><i>Isolation. Each program is isolated from the concurrent activity of others and, consequently, from the failure of others.</i></li><li><i>Granularity. The effects of individual transactions can be discarded by rolling back a transaction, providing a fine granularity of failure.</i></li><li><i>Consistency. Rollback restores all state invariants, cleaning up any inconsistent data structures.</i></li><li><i>Durability. No committed work is lost.</i></li></ul><p></p><p><i>These features mean that transactions allow the system to crash and restart gracefully; the only thing lost is the time required to crash and restart. Transactions also limit the scope of failure by perhaps only undoing one transaction rather than restarting the whole system. </i><b>But the core issue for distributed computing is that the whole system cannot be restarted; only pieces of it can be restarted, since a single part generally doesn’t control all the other parts of the network. 
A restart in a distributed system, then, needs an incremental technique (like transaction undo) to clean up any distributed state.</b><i> Even if a transaction contains a Bohrbug, the correct distributed system state will be reconstructed by the transaction undo, and only that transaction will fail."</i></p><p>First of all, kudos to Gray/Reuter for thinking big, and aiming to address distributed systems challenges that would only start to loom large in the 2000s and have become ever more prominent since then. This is a solid argument in the book, especially from a 1990s point of view.</p><p>With 30+ years of hindsight, we notice a couple of problems with this argument. </p><p></p><ul style="text-align: left;"><li>There are fundamental information horizon limits to distributed transactions (at minimum due to speed of light), and <a href="https://muratbuffalo.blogspot.com/2024/01/scalable-oltp-in-cloud-whats-big-deal.html">scalability limits</a> (due to contended keys).</li><li><a href="https://queue.acm.org/detail.cfm?id=3458812">Fail-fast Is Failing... Fast!</a></li></ul><p>What we came to learn with experience is that it is futile to "paper over the distinction between local and remote objects... such a masking will be impossible," as <a href="https://scholar.harvard.edu/waldo/publications/note-distributed-computing">Jim Waldo famously stated in A Note on Distributed Computing.</a></p><p>So rather than trying to hide these through transactions in the middleware, we need to design end-to-end systems-level and application-level fault-tolerance approaches that respect distributed systems limitations.</p><p><br /></p><h1 style="text-align: left;">Questions / Comments</h1><p><br /></p><p>1. I really liked how Jim Gray tied software fault-tolerance to the transactions concept, and presented the all-or-nothing guarantee of transactions as a remedy/enabler for software fault-tolerance (from the 1990 point-of-view). 
I think <a href="https://muratbuffalo.blogspot.com/2011/01/crash-only-software-hotos03.html">crash-only software</a> was also a very good idea. It wasn't extended to distributed systems, but it provides a good base for fault-tolerance at the node level. Maybe transactional thinking can be relaxed towards crash-only software thinking, and the ideas could be combined.</p><p><br /></p><p>2. The book over-indexes on process-pair approaches with a primary and secondary. <i>"The concept of process pair (covered in Subsection 3.7.3) specifies that one process should instantly (in milliseconds) take over for the other in case the primary process fails. In the current discussion, we take the more Olympian view of system pairs, that is two identical systems in two different places. The second system has all the data of the first and is receiving all the updates from the first. Figure 3.2 has an example of such a system pair. If one system fails, the other can take over almost instantly (within a second). If the primary crashes, a client who sent a request to the primary will get a response from the backup a second later. Customers who own such system pairs crash a node once a month just as a test to make sure that everything is working—and it usually is."</i></p><p>This is expected, because distributed consensus and <a href="https://muratbuffalo.blogspot.com/search?q=paxos">Paxos approaches</a> were not well known in 1990. These process-pair approaches are prone to split-brain scenarios, where the secondary thinks the primary has crashed and takes over, while the primary, oblivious to this, keeps serving requests. There needs to be either leader election built via Paxos (which would require three-node deployments at minimum), or the use of Paxos as the configuration-metadata box to adjudicate over who is the primary and secondary. 
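A minimal sketch of this fencing idea follows; the names are hypothetical, and a real deployment would keep the epoch in a Paxos/Raft-backed store or lease service rather than an in-process object.

```python
class EpochStore:
    """Stands in for the consensus-backed configuration-metadata box."""
    def __init__(self):
        self.epoch = 0

    def promote(self):
        self.epoch += 1  # each new primary gets a strictly larger epoch
        return self.epoch


class Storage:
    """Rejects writes carrying a stale epoch, fencing off a deposed primary."""
    def __init__(self):
        self.highest_seen = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_seen:
            return False  # an old primary is still writing: reject
        self.highest_seen = epoch
        self.data[key] = value
        return True


store, disk = EpochStore(), Storage()
old_primary = store.promote()   # epoch 1
new_primary = store.promote()   # backup takes over with epoch 2
ok_new = disk.write(new_primary, "x", 1)   # accepted
ok_old = disk.write(old_primary, "x", 2)   # stale primary is fenced off
```

Because epochs only move forward, the oblivious old primary cannot clobber the new primary's writes even if it never learns it was deposed.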
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgoA7O2F4yftDEGI1yeGoSsxZn3Uqn10jcCkbWCFLgVjyBVhuluo54pPuLz-A4vBMBHitQ1KDccXt4UIGsyWQeOU8x3sJzwLfsezutAMSKxQZfJOG_z64Rmwz2Sv9lB4hqdiVOj4kpspgjIaw-TLo2ZbcIvAHIE-aLKUlqKIiztUF2XFPXrzDoqafdyQC4" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1024" data-original-width="1006" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEgoA7O2F4yftDEGI1yeGoSsxZn3Uqn10jcCkbWCFLgVjyBVhuluo54pPuLz-A4vBMBHitQ1KDccXt4UIGsyWQeOU8x3sJzwLfsezutAMSKxQZfJOG_z64Rmwz2Sv9lB4hqdiVOj4kpspgjIaw-TLo2ZbcIvAHIE-aLKUlqKIiztUF2XFPXrzDoqafdyQC4=w393-h400" width="393" /></a></div><p><br /></p><p>3. <i>"To mask the unreliability of the ATMs, the bank puts two at each customer site. If one ATM fails, the client can step to the adjacent one to perform the task. This is a good example of analyzing the overall system availability and applying redundancy where it is most appropriate."</i> </p><p>What about today? Are there redundant computers in ATMs today? I think today this is mostly restart based fault-tolerance, no?</p><p><br /></p><p>4. The old-master, new-master technique in Section 3.1.2 reminded me of <a href="https://github.com/jonhoo/left-right/">the left-right primitive</a> in the <a href="https://muratbuffalo.blogspot.com/2022/12/noria-dynamic-partially-stateful.html">Noria paper</a>. Not the same thing, but I think it has similar ideas. And this kind of old-master new-master approach can even be used to provide some kind of tolerance to a poison-pill operation. </p><p><br /></p><p>5. <i>"Error recovery can take two forms. The first form of error recovery, backward error recovery, returns to a previous correct state. Checkpoint/restart is an example of backward error recovery. The second form, forward error recovery, constructs a new correct state. 
Redundancy in time, such as resending a damaged message or rereading a disk page, is an example of forward error recovery."</i></p><p>Today we don't hear the backward versus forward error recovery distinction frequently. It sounds like backward recovery is more suitable for larger-scale recovery/correction. And it seems to me that over time recovery got finer grained, and forward error recovery became the dominant model. There may have been some convergence and blurring of the lines between the two over time.</p><p><br /></p><p>6. <i>"As Figure 3.7 shows, software is a major source of outages. The software base (number of lines of code) grew by a factor of three during the study period, but the software MTTF held almost constant. This reflects a substantial improvement in software quality. But if these trends continue, the software will continue to grow at the same rate that the quality improves, and software MTTF will not improve."</i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgG1KDWbPaPT_j1UGDiXIsEvL_lQvSuOTytZpwRcWOhAWIH4tI2fvO8pTGxANFEaDT0h2yxCPYpYP-kR1QKlhhva1paeQdTq0DKMhfn-nRppPgPz2WVqD2SRi_pTpAq_D4NzdnP63o60TTtmXomKyShZYLvNLAwAEjVhdcQlQ859b3uQuLesM8I6rgzKnc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="796" data-original-width="1006" height="317" src="https://blogger.googleusercontent.com/img/a/AVvXsEgG1KDWbPaPT_j1UGDiXIsEvL_lQvSuOTytZpwRcWOhAWIH4tI2fvO8pTGxANFEaDT0h2yxCPYpYP-kR1QKlhhva1paeQdTq0DKMhfn-nRppPgPz2WVqD2SRi_pTpAq_D4NzdnP63o60TTtmXomKyShZYLvNLAwAEjVhdcQlQ859b3uQuLesM8I6rgzKnc=w400-h317" width="400" /></a></div><p>Do we have a quantitative answer to whether this trend shaped up? 
Jim Gray had published the paper: "<a href="https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf">Why Do Computers Stop and What Can Be Done About It?</a>" This <a href="https://muratbuffalo.blogspot.com/2016/11/why-does-cloud-stop-computing-lessons.html">2016 study seems to be a follow-up</a> that revisits some of these questions. I think Jim's prediction was correct. For today's systems, software is still the main limit on availability/reliability. It also seems like the outage rate continued to shrink significantly.</p><p><br /></p><p>7. <i>"Production software has ≈3 design faults per 1,000 lines of code. Most of these bugs are soft; they can be masked by retry or restart. The ratio of soft to hard faults varies, but 100:1 is usual."</i> Does anybody know of a recent study that evaluated this and updated these numbers?</p><p><br /></p><p>8. In Section 3.6.1, the book describes N-version programming: <i>"Write the program n times, test each program carefully, and then operate all n programs in parallel, taking a majority vote for each answer. The resulting design diversity should mask many failures."</i> N-version programming indeed flopped, as the book predicted. It was infeasible to have multiple teams develop diverse versions of the software. But with the rise of AI and LLMs, could this become feasible?</p><p><br /></p><p>9. Is this a seedling of the later RESTful design idea? <i>"There is a particularly simple form of process pair called a persistent process pair. Persistent process pairs have a property that is variously called context free, stateless, or connectionless. Persistent processes are almost always in their initial state. They perform server functions, then reply to the client and return to their initial state. The primary of the persistent process pair does not checkpoint or send I’m Alive messages, but just acts like an ordinary server process. 
If the primary process fails in any way, the backup takes over in the initial state."</i></p><p><br /></p><p>10. Is this a seedling of disaggregated architecture and service-oriented-architecture ideas? <i>"A persistent process server should maintain its state in some form of transactional storage: a transaction-protected database. When the primary process fails, the transaction mechanism should abort the primary’s transaction, and the backup should start with a consistent state."</i></p><p><br /></p><p>11. This discussion in Section 3.9 seems relevant for <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">metastable failures</a>. It may be possible to boil metastable failures down to a state-desynchronization problem among the subparts of a system: cache, database, queue, client, etc. "Being out of touch with reality!" This is what Gray calls system delusion.</p><p><i>"The point of these two stories is that a transaction system is part of a larger closed-loop system that includes people, procedures, training, organization, and physical inventory, as well as the computing system. Transaction processing systems have a stable region; so long as the discrepancy between the real world and the system is smaller than some threshold, discrepancies get corrected quickly enough to compensate for the occurrence of new errors. However, if anything happens (as in the case of the improperly trained clerk) to push the system out of its stable zone, the system does not restore itself to a stable state; instead, its delusion is further amplified, because no one trusts the system and, consequently, no one has an incentive to fix it. If this delusion process proceeds unchecked, the system will fail, even though the computerized part of it is up and operating."</i></p><p><br /></p><p>12. <i>"System delusion doesn’t happen often, but when it does there is no easy or automatic restart. 
Thus, to the customer, fault tolerance in the transaction system is part of a larger fault-tolerance issue: How can one design the entire system, including the parts outside the computer, so that the whole system is fault tolerant?" </i></p><p>I think <a href="https://muratbuffalo.blogspot.com/2017/08/cloud-fault-tolerance.html">self-stabilization theory</a> provides a good answer to this question. The theory needs to be extended with control theory, and maybe queueing theory, to take care of workload-related problems as well.</p><p><br /></p><p>13. "Most large software systems have data structure repair programs that traverse data structures, looking for inconsistencies. Called auditors by AT&T and salvagers by others, these programs heuristically repair any inconsistencies they find. The code repairs the state by forming a hypothesis about what data is good and what data is damaged beyond repair. In effect, these programs try to mask latent faults left behind by some Heisenbug. Yet, their techniques are reported to improve system mean times to failure by an order of magnitude (for example, see the discussion of functional recovery routines)."</p><p>This reminds me of <a href="https://pawan-bhadauria.medium.com/distributed-systems-part-3-managing-anti-entropy-using-merkle-trees-443ea3fc6213">the anti-entropy processes</a> used by distributed storage systems. </p><p><br /></p><p>14. <i>"The presence of design faults is the ultimate limit to system availability; we have techniques that mask other kinds of faults."</i> </p><p>I understand the logic behind this, but are we sure there are no fundamental impossibility laws that prohibit perfect availability even when we have perfect design? <a href="https://muratbuffalo.blogspot.com/2015/02/paper-summary-perspectives-on-cap.html">CAP, FLP, attacking generals</a> impossibility results come to mind. Even without partitions, we seem to have emergent failures and metastability laws. 
So there may be another impossibility result there as well, involving a tradeoff between scale/throughput and availability. It is possible to build highly available systems, but they work only within well-characterized workload and environment conditions, so they are not suitable for general computing applications.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-22538075200713335602024-02-16T22:01:00.004-05:002024-02-16T22:01:51.828-05:00Transaction Processing Book, Gray/Reuter 1992<p> I started reading this book as part of <a href="https://discord.com/channels/824628143205384202/1193544147072729198">Alex Petrov's book club</a>.</p><p>I am loving the book. We will do Chapter 3 soon, and I am late in reporting on this, but here is some material from the first chapters. Unfortunately, I don't have the time to write fuller reviews of the first chapters, so I will try to give you the gist. </p><h1 style="text-align: left;">Foreword: Why We Wrote this Book</h1><p><i>The purpose of this book is to give you an understanding of how large, distributed, heterogeneous computer systems can be made to work reliably.</i></p><p><i>An integrated (and integrating) perspective and methodology is needed to approach the distributed systems problem.</i><b> Our goal is to demonstrate that transactions provide this integrative conceptual framework, and that distributed transaction-oriented operating systems are the enabling technology. </b><i>In a nutshell: without transactions, distributed systems cannot be made to work for typical real-life applications.</i></p><p>I am very much impressed by the distributed systems insight Gray provides throughout the book. Jim Gray was a distributed systems person. Fight me! Or bite me, I don't care.</p><p><i>Transaction processing concepts were conceived to master the complexity in single-processor online applications. 
If anything, these concepts are even more critical now for the successful implementation of massively distributed systems that work and fail in much more complex ways. This book shows how transaction concepts apply to distributed systems and how they allow us to build high-performance, high-availability applications with finite budgets and risks.</i></p><p><i>There are many books on database systems, both conventional and distributed; on operating systems; on computer communications; on application development—you name it. Such presentations offer many options and alternatives, but rarely give a sense of which are the good ideas and which are the not-so-good ones, and why. More specifically, were you ever to design or build a real system, these algorithm overviews would rarely tell you how or where to start.</i></p><p>This book indeed takes a very practical and quantitative approach to its topics, and I am very much impressed by that. For fault-tolerance analysis, the book uses many surveys (some of which were Gray's own studies) to back up the arguments/claims it puts forward. 
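In that quantitative spirit, here is a back-of-the-envelope availability calculation; the numbers are illustrative, not taken from the book's surveys.

```python
def availability(mttf_hours, mttr_hours):
    # Fraction of time a module is up: MTTF / (MTTF + MTTR).
    return mttf_hours / (mttf_hours + mttr_hours)

# One module: fails about every 5,000 hours, takes 10 hours to repair.
single = availability(mttf_hours=5_000, mttr_hours=10)

# A fail-independent pair is down only when both modules are down, so its
# unavailability is (roughly) the square of the single-module unavailability.
pair = 1 - (1 - single) ** 2
```

This quadratic improvement in unavailability is the arithmetic behind the process-pair and system-pair designs discussed in the fault-tolerance chapter.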
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigr9utktU1m7Rd4n46UseTxcKiWI7WU4_EABu31dDGW5R9pu-pOxY4SRrYEdC8LT4zwjTnmU8iWi_qeUVR3DPH6puyKpBghEm1T1fA7IsCP-4nmF1sYOTEodJOtlM0cATru2aoiIiGpg5KQP74ESwJm98vevpG7W-XGFP0JHplzgAORIa11j9HvESrfMs/s912/Screenshot%202024-02-16%20at%209.52.39%E2%80%AFPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="912" data-original-width="862" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigr9utktU1m7Rd4n46UseTxcKiWI7WU4_EABu31dDGW5R9pu-pOxY4SRrYEdC8LT4zwjTnmU8iWi_qeUVR3DPH6puyKpBghEm1T1fA7IsCP-4nmF1sYOTEodJOtlM0cATru2aoiIiGpg5KQP74ESwJm98vevpG7W-XGFP0JHplzgAORIa11j9HvESrfMs/w378-h400/Screenshot%202024-02-16%20at%209.52.39%E2%80%AFPM.png" width="378" /></a></div><p>As I read the book, I am in constant awe of the clarity of thinking behind the book. Gray was a <a href="https://muratbuffalo.blogspot.com/2023/09/beyond-code-tla-and-art-of-abstraction.html">master of abstraction.</a></p><h1 style="text-align: left;">1. Introduction</h1><p><i>Six thousand years ago, the Sumerians invented writing for transaction processing. An abstract system state, represented as marks on clay tablets/ledgers, was maintained. Today, we would call this the </i><b>database</b><i>. Scribes recorded state changes with new records (clay tablets) in the database. Today, we would call these state changes </i><b>transactions</b><i>.</i></p><p><i>This book contains considerably more information about the ACID properties. For now, however, a transaction can be considered a collection of actions with the following properties:</i></p><p></p><ul style="text-align: left;"><li><i>Atomicity. A transaction’s changes to the state are atomic: either all happen or none happen. These changes include database changes, messages, and actions on transducers.</i></li><li><i>Consistency. 
A transaction is a correct transformation of the state. The actions taken as a group do not violate any of the integrity constraints associated with the state. This requires that the transaction be a correct program.</i></li><li><i>Isolation. Even though transactions execute concurrently, it appears to each transaction, T, that others executed either before T or after T, but not both.</i></li><li><i>Durability. Once a transaction completes successfully (commits), its changes to the state survive failures.</i></li></ul><p></p><h1 style="text-align: left;">2. Basic Computer System Terms</h1><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhze2nxDgIBjzRNOgPX2sPChG83dDlrJK7qSGwrg5rtEkUtvSUEKBqOKjN5p0ZxcR_k22La8tVlJj_Hu5umEBHur00A-3Mzy9OaO0rWho9mBO33JzFEdbn1uKbmBXK_2SshL8coM38fComXXBjXpJqmYxZCGXsrLfxIVRKvEJJ8y7uLlU3RkvMp67HBcZw/s862/Screenshot%202024-02-16%20at%209.56.10%E2%80%AFPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="572" data-original-width="862" height="265" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhze2nxDgIBjzRNOgPX2sPChG83dDlrJK7qSGwrg5rtEkUtvSUEKBqOKjN5p0ZxcR_k22La8tVlJj_Hu5umEBHur00A-3Mzy9OaO0rWho9mBO33JzFEdbn1uKbmBXK_2SshL8coM38fComXXBjXpJqmYxZCGXsrLfxIVRKvEJJ8y7uLlU3RkvMp67HBcZw/w400-h265/Screenshot%202024-02-16%20at%209.56.10%E2%80%AFPM.png" width="400" /></a></div><p><b>The Five-Minute Rule. </b><i>How shall we manage these huge memories? The answers so far have been clustering and sequential access. However, there is one more useful technique for managing caches, called the five-minute rule. Given that we know what the data access patterns are, when should data be kept in main memory and when should it be kept on disk? The simple way of answering this question is, Frequently accessed data should be in main memory, while it is cheaper to store infrequently accessed data on disk. 
Unfortunately, the statement is a little vague: What does frequently mean? The five-minute rule says frequently means five minutes, but the rule reflects a way of reasoning that also applies to any cache-secondary memory structure. In those cases, depending on relative storage and access costs, frequently may turn out to be milliseconds, or it may turn out to be days.</i></p><p>The basic principles of the five-minute rule held well over the years.</p><p></p><ul style="text-align: left;"><li>The Five-Minute Rule: <a href="https://queue.acm.org/detail.cfm?id=1413264">20 Years Later and How Flash Memory Changes the Rules </a></li><li>The Five-Minute Rule <a href="https://dl.acm.org/doi/pdf/10.1145/3318163">30 Years Later and Its Impact on the Storage Hierarchy </a></li></ul><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRTdxC5H6RSjhsp6NrtpTutHr3bag0nCN2n7D4-mkf8LfKqgiHSnjm4NL9DGQdG4dkpJENS1Ph5hYvHS7MUiHiaXPgMmjgO-wtTGzmXyj2bhdaCMlkbqpDlEFdCtnN_PnUIaXRcMvxEDEhqcFVHlp8rROSAJDuE1kwsKl99Oy0MSnxLF16k8uj5_FVycU/s862/Screenshot%202024-02-16%20at%209.58.40%E2%80%AFPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="752" data-original-width="862" height="349" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRTdxC5H6RSjhsp6NrtpTutHr3bag0nCN2n7D4-mkf8LfKqgiHSnjm4NL9DGQdG4dkpJENS1Ph5hYvHS7MUiHiaXPgMmjgO-wtTGzmXyj2bhdaCMlkbqpDlEFdCtnN_PnUIaXRcMvxEDEhqcFVHlp8rROSAJDuE1kwsKl99Oy0MSnxLF16k8uj5_FVycU/w400-h349/Screenshot%202024-02-16%20at%209.58.40%E2%80%AFPM.png" width="400" /></a></div><i><b>Shared nothing.</b> In a shared-nothing design, each memory is dedicated to a single processor. All accesses to that data must pass through that processor. 
Processors communicate by sending messages to each other via the communications network.</i><p></p><p><i><b>Shared global.</b> In a shared-global design, each processor has some private memory not accessible to other processors. There is, however, a pool of global memory shared by the collection of processors. This global memory is usually addressed in blocks (units of a few kilobytes or more) and is RAM disk or disk.</i></p><p><i><b>Shared memory.</b> In a shared-memory design, each processor has transparent access to all memory. If multiple processors access the data concurrently, the underlying hardware regulates the access to the shared data and provides each processor a current view of the data.</i></p><p><br /></p><p><b>Unfortunately, both for today and for many years to come, software dominates the cost of databases and communications. </b><i>It is possible, but not likely, that the newer, faster processors will make software less of a bottleneck.</i></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com1tag:blogger.com,1999:blog-8436330762136344379.post-21303490355154420792024-02-13T15:20:00.008-05:002024-02-13T15:23:17.643-05:00 Verifying Transactional Consistency of MongoDB<p><a href="https://arxiv.org/abs/2111.14946">This paper</a> presents pseudocode for the transaction protocols for the three possible MongoDB deployments: WiredTiger, ReplicaSet, and ShardedCluster, and shows that these satisfy different variants of snapshot isolation: namely StrongSI, RealtimeSI, and SessionSI, respectively.</p><h1 style="text-align: left;">Background</h1><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhRKUHzqn1zQVi3iX4HWslSosCz9pXgV3-eaSeVaT85OHKL10swTozrHOgl2j04-FqU_Ezj-eyHL3lmHThPupINUNjpDa96uaF64PYsbkYkYRRuhDWOzkEvNUKkPhbOr_bTVEuk1G6Fd7uZYokKJ6_L4xXsdzRunhrWlfxTT1SWI8NWw0r1zPh5h1Q8GcI" style="margin-left: 1em; margin-right: 1em;"><img alt="" 
data-original-height="918" data-original-width="1270" height="289" src="https://blogger.googleusercontent.com/img/a/AVvXsEhRKUHzqn1zQVi3iX4HWslSosCz9pXgV3-eaSeVaT85OHKL10swTozrHOgl2j04-FqU_Ezj-eyHL3lmHThPupINUNjpDa96uaF64PYsbkYkYRRuhDWOzkEvNUKkPhbOr_bTVEuk1G6Fd7uZYokKJ6_L4xXsdzRunhrWlfxTT1SWI8NWw0r1zPh5h1Q8GcI=w400-h289" width="400" /></a></div><p><a href="https://docs.mongodb.com/manual/core/transactions/#transactions-and-atomicity">MongoDB transactions</a> have evolved in three stages (Figure 1):</p><p></p><ul style="text-align: left;"><li>In version 3.2, MongoDB used the WiredTiger storage engine as the default storage engine. Utilizing the Multi-Version Concurrency Control (MVCC) architecture of the WiredTiger storage engine, MongoDB was able to support single-document transactions in the standalone deployment.</li><li>In version 4.0, MongoDB supported multi-document transactions in replica sets (which consist of a primary node and several secondary nodes).</li><li>In version 4.2, MongoDB further introduced distributed multi-document transactions in sharded clusters (each of which is a group of multiple replica sets among which data is sharded).</li></ul><p></p><p>I love that the paper managed to present the transactional protocols on these three deployment types in a layered/superimposed manner. We start with the bottom layer, the WiredTiger transactions in Algorithm 1. Then the replicaset algorithm, Algorithm 2, is presented, which uses primitives from Algorithm 1. Finally, the ShardedCluster transactions algorithm is presented, using primitives from Algorithm 2. Ignore the underlined and highlighted lines in Algorithms 1 and 2; they are needed for the higher layer algorithms, which are discussed later on.</p><p>If you need a primer on transaction isolation levels, you can check <a href="https://muratbuffalo.blogspot.com/2022/06/seeing-is-believing-client-centric.html">this</a> and <a href="https://jepsen.io/consistency">this</a>. 
<a href="https://muratbuffalo.blogspot.com/2023/09/a-snapshot-isolated-database-modeling.html">The TLA+ model I presented for snapshot isolation</a> is also useful to revisit to understand how snapshot isolation works in principle. </p><h1 style="text-align: left;">WiredTiger (WT) transactions</h1><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhKirhOFucrjpFq7agQEFt1KhB-1FbVCj0s6dSYrLeLgd8YfInAspoSZyPkOQcDwhh6NbK8bzZVANawOX7KmI_LNPFfdLbfrh4D9sBwtdUfyRjaW06e1ad7M4JlMcawPewrPHmofllYuaMl8jQY3Asfb5YNz27feU8OC7850JsUwSXDFLy-8NDCvP3i2eM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1536" data-original-width="1536" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEhKirhOFucrjpFq7agQEFt1KhB-1FbVCj0s6dSYrLeLgd8YfInAspoSZyPkOQcDwhh6NbK8bzZVANawOX7KmI_LNPFfdLbfrh4D9sBwtdUfyRjaW06e1ad7M4JlMcawPewrPHmofllYuaMl8jQY3Asfb5YNz27feU8OC7850JsUwSXDFLy-8NDCvP3i2eM" width="240" /></a></div><p>Clients interact with WiredTiger via sessions. Each client is bound to a single session with a unique session identifier wt_sid. At most one transaction is active on a session at any time. Intuitively, each transaction is only aware of the transactions that have already been committed before it starts. 
To this end, a transaction txn maintains txn.concur (the set of identifiers of currently active transactions that have obtained their identifiers) and txn.limit (the next transaction identifier, tid, when txn starts).</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg4kxAZ7dpoxzjEqkxuTetBGpYEJ8UVeDICc-fFHLWotLgLNJhuZU0FFNUidWPP2XCp_crgEIQH_ZE2UFymL0XDTqEA2Cr24YG38GRBuy1-7_3rcKbHK9zAYRPZruVGAxTnddyXQ9pSkbtxjnxGYgTA3qOcuzxqNcwS_S5AJyXu2SU6pqiaIbZd5V049bg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="314" data-original-width="2206" src="https://blogger.googleusercontent.com/img/a/AVvXsEg4kxAZ7dpoxzjEqkxuTetBGpYEJ8UVeDICc-fFHLWotLgLNJhuZU0FFNUidWPP2XCp_crgEIQH_ZE2UFymL0XDTqEA2Cr24YG38GRBuy1-7_3rcKbHK9zAYRPZruVGAxTnddyXQ9pSkbtxjnxGYgTA3qOcuzxqNcwS_S5AJyXu2SU6pqiaIbZd5V049bg=s16000" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgp_HzmQt9awt2agKII7M5tvA05j8yi-1GL_yuXK8LRQFeigGabK5uk3menTGYDvKUixyUoPUoauUgfIZvrGWGZfihstC37jNPL2Q1TRV3RM9nK32IEyiv57xzniyUzkG478uZ0xzwa3-XY3oCRF4JRwMU3Nwv-Hs9ZfMFfAmVMao-ovIr9KURa5HRyS_A" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="2520" data-original-width="2248" src="https://blogger.googleusercontent.com/img/a/AVvXsEgp_HzmQt9awt2agKII7M5tvA05j8yi-1GL_yuXK8LRQFeigGabK5uk3menTGYDvKUixyUoPUoauUgfIZvrGWGZfihstC37jNPL2Q1TRV3RM9nK32IEyiv57xzniyUzkG478uZ0xzwa3-XY3oCRF4JRwMU3Nwv-Hs9ZfMFfAmVMao-ovIr9KURa5HRyS_A=s16000" /></a></div><p>A client starts a transaction on a session wt_sid by calling wt_start, which creates and populates a transaction txn (lines 1:2–1:5). Particularly, it scans wt_global to collect the concurrently active transactions on other sessions into txn.concur. tid tracks the next monotonically increasing transaction identifier to be allocated. When a transaction txn starts, it initializes txn.tid to 0. 
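</p><p>To make the mechanics concrete, here is a rough Python sketch of Algorithm 1's visibility rule and its first-updater-wins conflict check on a key's update list. This is my own illustration under simplifying assumptions; Txn, Update, store, wt_read, wt_update, and next_tid are invented names, not the paper's pseudocode.</p>

```python
# Hypothetical sketch (invented names, not the paper's pseudocode) of
# snapshot visibility and the first-updater-wins conflict check on a
# key's update list, which is kept newest-first.

class Txn:
    def __init__(self, concur, limit):
        self.tid = 0              # assigned lazily on first update
        self.concur = concur      # tids active when this txn started
        self.limit = limit        # next tid to be allocated at start time

    def sees(self, tid):
        # Own updates are visible; otherwise only txns that had already
        # committed when this txn started (not concurrent, not later).
        return tid == self.tid or (tid not in self.concur and tid < self.limit)

class Update:
    def __init__(self, tid, val):
        self.tid, self.val, self.aborted = tid, val, False

def wt_read(txn, store, key):
    # Return the value of the first visible, non-aborted update (line 1:10).
    for u in store.get(key, []):
        if not u.aborted and txn.sees(u.tid):
            return u.val
    return None

def wt_update(txn, store, key, val, next_tid):
    # Roll back if an invisible, non-aborted txn updated key (lines 1:14-1:17).
    for u in store.get(key, []):
        if not u.aborted and not txn.sees(u.tid):
            return False
    if txn.tid == 0:
        txn.tid = next_tid()      # assign tid on first update (line 1:19)
    store.setdefault(key, []).insert(0, Update(txn.tid, val))
    return True

store, counter = {}, [0]
def next_tid():
    counter[0] += 1
    return counter[0]

t1 = Txn(concur=set(), limit=1)
assert wt_update(t1, store, "k", "v1", next_tid)      # t1 gets tid 1
t2 = Txn(concur={1}, limit=2)                         # started while t1 active
assert not wt_update(t2, store, "k", "v2", next_tid)  # write-write conflict
assert wt_read(t1, store, "k") == "v1"
```

<p>Note how t2, which started while t1 was still active, fails the conflict check even though t1 has not committed yet: the first writer claims the key.</p><p>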
The actual (non-zero) txn.tid is assigned when its first update operation is successfully executed. A transaction txn with txn.tid ≠ 0 may be aborted due to a conflict caused by a later update.</p><p>To read from a key, we iterate over the update list store[key] forward and return the value written by the first visible transaction (line 1:10). To update a key, we first check whether the transaction, denoted txn, should be aborted due to conflicts (lines 1:14–1:17). To this end, we iterate over the update list store[key]. If there are updates on key made by transactions that are invisible to txn and are not aborted, txn will be rolled back. If txn passes the conflict check, it is assigned a unique transaction identifier, i.e., tid, in case it has not yet been assigned one (line 1:19). Finally, the key-value pair ⟨key,val⟩ is added into the modification set txn.mods and is inserted at the front of the update list store[key].</p><p>To commit the transaction on session wt_sid, we simply reset wt_global[wt_sid] to ⊥tid, indicating that there is currently no active transaction on this session (line 1:32). To roll back a transaction txn, we additionally reset txn.tid in store to −1 (line 1:38). Note that read-only transactions (which are characterized by txn.tid=0) can always commit successfully.</p><h1 style="text-align: left;">Replica Set Transactions</h1><p>A replica set consists of a single primary node and several secondary nodes. All transactional operations, i.e., start, read, update, and commit, are first performed on the primary. 
Committed transactions are wholesale-replicated to the secondaries<a href="https://muratbuffalo.blogspot.com/2024/01/fault-tolerant-replication-with-pull.html"> via a leader-based consensus protocol similar to Raft.</a> In other words, before the completion of the transaction, the entire effect of the transaction is sent to the secondaries and majority-replicated with the assigned timestamp.</p><p>We don't go into this in the protocol description here, but there is a clever speculative snapshot isolation algorithm used by the primary for transaction execution. <a href="https://muratbuffalo.blogspot.com/2024/02/tunable-consistency-in-mongodb.html">I summarized that at the end of my review for "Tunable Consistency in MongoDB"</a>. Here is the relevant part: MongoDB uses an innovative strategy for implementing readConcern within transactions that greatly reduced aborts due to write conflicts in back-to-back transactions. When a user specifies readConcern level “majority” or “snapshot”, the returned data is guaranteed to be committed to a majority of replica set members. Outside of transactions, this is accomplished by reading at a timestamp at or earlier than the majority commit point in WiredTiger. However, this is problematic for transactions: It is useful for write operations to read the freshest version of a document, since the write will abort if there is a newer version of the document than the one it read. This motivated the implementation of “speculative” majority and snapshot isolation for transactions. Transactions read the latest data for both read and write operations, and at commit time, if the writeConcern is w:“majority”, they wait for all the data they read to become majority committed. This means that a transaction only satisfies its readConcern guarantees if the transaction commits with writeConcern w:“majority”. 
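</p><p>A tiny Python sketch of this "speculative majority" commit-time wait, under my own simplified model (a single scalar majority commit point, and invented names throughout):</p>

```python
# Hypothetical sketch: a transaction reads at the latest timestamps and
# only at commit time (with w:"majority") waits for the majority commit
# point to cover everything it read.

def speculative_commit(read_timestamps, majority_point, advance_majority):
    need = max(read_timestamps)        # newest data this txn observed
    while majority_point[0] < need:    # not yet majority committed
        advance_majority()             # stand-in for replication progress
    return True                        # readConcern guarantee now holds

mp = [3]                               # current majority commit point
def advance():
    mp[0] += 1                         # simulate replication advancing

assert speculative_commit({2, 5}, mp, advance)
assert mp[0] >= 5
```

<p>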
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgQcOwZLAHPdTZ1dHxnIv2BuRhzGVl8BRcpbw88IhYx16acsiphwNv7e83uft1orqr9w_B4kM6-NatFicaJvlWkKWHhbYGQ6fecHvKz9RaA-3uesYwV2wG9v9KdybLLYgktaWIektlvbhaJPdHsMVzMXTJXUAHX_t-1giSMlOu9DM87VvfQaEkR84Fp8iU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1372" data-original-width="2188" src="https://blogger.googleusercontent.com/img/a/AVvXsEgQcOwZLAHPdTZ1dHxnIv2BuRhzGVl8BRcpbw88IhYx16acsiphwNv7e83uft1orqr9w_B4kM6-NatFicaJvlWkKWHhbYGQ6fecHvKz9RaA-3uesYwV2wG9v9KdybLLYgktaWIektlvbhaJPdHsMVzMXTJXUAHX_t-1giSMlOu9DM87VvfQaEkR84Fp8iU=s16000" /></a></div><p>ReplicaSet uses <a href="https://muratbuffalo.blogspot.com/2014/07/hybrid-logical-clocks.html">hybrid logical clocks (HLC)</a> as the read and commit timestamps of transactions. When a transaction starts, it is assigned a read timestamp on the primary such that all transactions with smaller commit timestamps have been committed in WiredTiger. That is, the read timestamp is the maximum point at which the oplog of the primary has no gaps (1:42).</p><p>When the primary receives the first operation of a transaction (lines 2:4 and 2:11), it calls open_wt_session to open a new session wt_sid to WiredTiger, start a new WiredTiger transaction on wt_sid, and, more importantly, set the transaction’s read timestamp. The primary delegates the read/update operations to WiredTiger (lines 2:7 and 2:14). If an update succeeds, the ⟨key,val⟩ pair is recorded in txn_mods[rs_sid] (line 2:16). To commit a transaction, the primary first atomically increments its cluster time ct via tick, takes it as the transaction’s commit timestamp (line 2:23), uses it to update max_commit_ts, and records it in wt_global (lines 2:24 and 1:46).</p><p>If this is a read-only transaction, the primary appends a noop entry to its oplog (line 2:27; Section 4.1.2). 
Otherwise, it appends an entry containing the updates of the transaction. Each oplog entry is associated with the commit timestamp of the transaction. Then, the primary asks WiredTiger to locally commit this transaction in wt_commit (line 2:30), which associates the updated key-value pairs in store with the commit timestamp (line 1:31). Note that wt_commit need not be executed atomically with tick and wt_set_commit_ts.</p><p>Finally, the primary waits for all updates of the transaction to be majority committed (line 2:31). Specifically, it waits for last_majority_committed ≥ ct, where last_majority_committed is the timestamp of the last oplog entry that has been majority committed.</p><h1 style="text-align: left;">Sharded cluster transactions</h1><p>A client issues distributed transactions via a session connected to a mongos. The mongos, as a transaction router, uses its cluster time as the read timestamp of the transaction and forwards the transactional operations to the corresponding shards. The shard that receives the first read/update operation of a transaction is designated as the transaction coordinator.</p><p>If a transaction has not been aborted due to write conflicts in sc_update, the mongos can proceed to commit it. If this transaction is read-only, the mongos instructs each of the participants to directly commit locally via rs_commit; otherwise, the mongos instructs the transaction coordinator to perform a variant of two-phase commit (2PC) that always commits among all participants (line 4:9). So, the coordinator sends a prepare message to all participants. After receiving the prepare message, a participant computes a local prepare timestamp and returns it to the coordinator in a prepare_ack message. When the coordinator receives prepare_ack messages from all participants, it calculates the transaction’s commit timestamp by taking the maximum of all prepare timestamps (line 4:14), and sends a commit message to all participants. 
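</p><p>The prepare/commit exchange just described can be sketched in a few lines of Python. This is a toy model of my own (FakeParticipant and all other names are invented, and real participants additionally wait for majority replication at each step):</p>

```python
# Hypothetical sketch of the coordinator side of the always-commit 2PC
# variant: gather prepare timestamps, take their max as the commit
# timestamp (line 4:14), and commit everywhere at that timestamp.

def two_phase_commit(participants):
    prepare_ts = [p.prepare() for p in participants]   # phase 1
    commit_ts = max(prepare_ts)
    for p in participants:                             # phase 2
        p.commit(commit_ts)
    return commit_ts

class FakeParticipant:
    def __init__(self, clock):
        self.clock = clock          # participant's cluster time
        self.committed_at = None
    def prepare(self):
        self.clock += 1             # advance cluster time; use as prepare ts
        return self.clock
    def commit(self, ts):
        self.committed_at = ts

ps = [FakeParticipant(5), FakeParticipant(9), FakeParticipant(7)]
assert two_phase_commit(ps) == 10                  # max(6, 10, 8)
assert all(p.committed_at == 10 for p in ps)
```

<p>Taking the maximum of the prepare timestamps ensures the commit timestamp is no smaller than any participant's cluster time at prepare, so every participant can order the commit consistently.</p><p>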
After receiving dec_ack messages from all participants, the coordinator replies to the mongos (line 4:18).</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjC_e-w-WcFVGwXCl4CKJOAsdd8ahUUMJPRW5Gzm5sfekGls_hXx5lNVRKW_byqmY83hzHuMHY0znixYXqat3HqeA2f2R4m-U4xJVFxlQpSq8bONdsMW09MTa3j-It7-EDVe0dzHhCOH5vMTa3Y90Lzoh7ePi2Mwi7cOLShCqhuu2w6Dglrz_jtU4bmzp4" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1872" data-original-width="2188" src="https://blogger.googleusercontent.com/img/a/AVvXsEjC_e-w-WcFVGwXCl4CKJOAsdd8ahUUMJPRW5Gzm5sfekGls_hXx5lNVRKW_byqmY83hzHuMHY0znixYXqat3HqeA2f2R4m-U4xJVFxlQpSq8bONdsMW09MTa3j-It7-EDVe0dzHhCOH5vMTa3Y90Lzoh7ePi2Mwi7cOLShCqhuu2w6Dglrz_jtU4bmzp4=s16000" /></a></div><p>Consider a session sc_sid connected to a mongos. We use read_ts[sc_sid] to denote the read timestamp, assigned by the mongos, of the currently active transaction on the session. ShardedCluster uses HLCs, which are loosely synchronized, to assign read and commit timestamps to transactions. Due to clock skew or pending commit, a transaction may receive a read timestamp from a mongos, but the corresponding snapshot may not yet be fully available at transaction participants. This leads to delaying the read/update operations until the snapshot becomes available. These cases are referred to as Case-XXX in the description below.</p><p>If this is the first operation the primary receives, it calls sc_start to set the transaction’s read timestamp in WiredTiger (line 4:23). In sc_start, it also calls wait_for_read_concern to handle Case-Clock-Skew and Case-Holes (line 4:24). The primary then delegates the operation to ReplicaSet (lines 4:3 and 4:7). To handle Case-Pending-Commit-Read, rs_read has been modified to keep retrying the read from WiredTiger until it returns a value updated by a committed transaction (line 2:8). 
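</p><p>A minimal sketch of that retry loop, in Python with invented names (in the real system the reader blocks until the pending transaction's commit or abort arrives; here the progress is simulated):</p>

```python
# Hypothetical sketch of Case-Pending-Commit-Read: keep retrying until
# the value to be returned was written by a *committed* transaction.

def rs_read_retry(key, store, status, max_tries=100):
    """store: key -> (tid, value); status: tid -> 'prepared'|'committed'."""
    for _ in range(max_tries):
        tid, val = store[key]
        if status.get(tid) == "committed":
            return val
        advance(status, tid)   # in reality: wait; here: simulate progress
    raise TimeoutError("writer never committed")

def advance(status, tid):
    status[tid] = "committed"  # stand-in for the commit message arriving

store = {"k": (42, "v")}
status = {42: "prepared"}      # writer is prepared but not yet committed
assert rs_read_retry("k", store, status) == "v"
```

<p>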
To handle Case-Pending-Commit-Update, rs_update first performs an sc_read on the same key (line 2:12). Moreover, if the update fails due to write conflicts, the mongos will send an abort message to the primary nodes of all other participants, without entering 2PC.</p><p>In 2PC, the transaction coordinator behaves as we described above. On the participant side, after receiving a prepare message, the participant advances its cluster time and takes it as the prepare timestamp (lines 4:27, 4:28, 1:54, and 1:58). Note that the transaction’s tid in wt_global is reset to ⊥tid (line 1:59). Thus, according to the visibility rule, this transaction is visible to other transactions that start later in WiredTiger. Next, the participant creates an oplog entry containing the updates executed locally or a noop oplog entry for the “speculative majority” strategy. Then, it waits until the oplog entry has been majority committed (line 4:34). When a participant receives a commit message, it ticks its cluster time. After setting the transaction’s commit timestamp (line 4:39), it asks WiredTiger to commit the transaction locally (line 4:40). Note that the status of the transaction is changed to committed (line 1:70). Thus, this transaction is now visible to other waiting transactions (line 2:8). Then, the participant generates an oplog entry containing the commit timestamp and waits for it to be majority committed.</p><h1 style="text-align: left;">Discussion</h1><p>The paper provides a nice simplified/understandable overview of MongoDB transactions. It mentions some limitations of this simplified model. The paper assumed that each procedure executes atomically, but the implementation of MongoDB is highly concurrent with intricate locking mechanisms. 
The paper also did not consider failures or explore the fault tolerance and recovery of distributed transactions.</p><p>As an interesting future research direction, I double down on the cross-layer opportunities I had mentioned in my previous post. The layered/superimposed presentation in this paper strengthens my hunch that we can have more cross-layer optimization opportunities in MongoDB going forward.</p><p>So what did we learn about MongoDB transactions? They are general transactions, rather than <a href="https://muratbuffalo.blogspot.com/2023/08/distributed-transactions-at-scale-in.html">limited one-shot transactions</a>. They use snapshot isolation, reading from a consistent snapshot, and aborting only on a write-write conflict, similar to major RDBMS transactions.</p><p>They are "OCC", but thanks to the underlying WiredTiger holding the lock on first access, they are less prone to aborting than a pure OCC transaction. An in-progress transaction stops later writes (be it from other transactions or single writes) instead of getting aborted by them. In other words, the first writer claims the item (for some time).</p><p>That being said, this is not truly locking and holding. MDB transactions still favor progress, as they do not like <a href="https://www.cs.colostate.edu/~cs551/CourseNotes/Deadlock/WaitWoundDie.html">waiting</a>. 
They would just die instead of waiting, ain't nobody got time for that.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com1tag:blogger.com,1999:blog-8436330762136344379.post-77499742982095126432024-02-08T19:45:00.002-05:002024-02-08T20:58:37.464-05:00Tunable Consistency in MongoDB<p><a href="https://www.vldb.org/pvldb/vol12/p2071-schultz.pdf">This paper appeared in VLDB 2019.</a> It discusses the tunable consistency models in <a href="https://www.mongodb.com/">MongoDB</a> and how MongoDB's speculative execution model and data rollback protocol enable this spectrum of consistency levels efficiently.</p><h1 style="text-align: left;">Motivation</h1><p>Applications often tolerate short or infrequent periods of inconsistency, so it may not make sense for them to pay the high cost of ensuring strong consistency at all times. These types of trade-offs have been partially codified in the <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">PACELC theorem</a>. The <a href="https://muratbuffalo.blogspot.com/2019/01/paper-review-probabilistically-bounded.html">Probabilistically Bounded Staleness work</a> (and <a href="https://muratbuffalo.blogspot.com/2014/02/consistency-based-service-level.html">many</a> <a href="https://muratbuffalo.blogspot.com/2016/03/paper-summary-measuring-and.html">followup</a> <a href="https://muratbuffalo.blogspot.com/2023/05/keep-calm-and-crdt-on.html">work</a>) explored the trade-offs between operation latency and data consistency in distributed database replication and showcased their importance. </p><p>To provide users with a set of tunable consistency options, MongoDB exposes writeConcern and readConcern levels as parameters that can be set on each database operation. writeConcern specifies what durability guarantee a write must satisfy before being acknowledged to a client. 
Similarly, readConcern determines what durability or consistency guarantees data returned to a client must satisfy. For safety, it is preferable to use readConcern “majority” and writeConcern “majority”. However, when users find stronger consistency levels to be too slow, they switch to using weaker consistency levels.</p><p>In MongoDB, when reading from and writing to the primary, users usually read their own writes and the system behaves like a single node. This has durability implications under faults, but many applications tolerate them well. An example of this is a game site that matches active players. This site has a high volume of writes, since its popularity means there are many active players looking to begin games. Durability is not important in this use case, since if a write is lost, the player typically retries immediately and is matched into another game.
However, double writes are painful, since it is undesirable user behavior to have the same post twice. For this reason, reads use readConcern level “majority” with causal consistency so that a user can definitively see whether their post was successful.</p><p>To characterize the consistency levels used by MongoDB application developers, the paper collected operational data from 14,820 instances running 4.0.6 that are managed by MongoDB Atlas. These counts are from 2019, and they are also fairly low because around the data collection time all nodes had been restarted in order to upgrade them to 4.0.6. But they give an idea about the spectrum of read and write concerns used by customer applications. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEggu5QvC6Atfl7jTni2ZO_ar1j-MW5vgMjTjB2n9C8u8e6dWzCmsfVOzFesFTw2f4IF7VzM4jX0A_tBUJczoWKfcHGH7BvhoPIK7hBgsXEqrVOh257rvayH9RMhvD8-igtIpc0R4_coWAaUmQDw2eUC_IQEuGO6ehKWTnjszxEPMttryRUAI8h6515lr0s" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1030" data-original-width="1292" height="319" src="https://blogger.googleusercontent.com/img/a/AVvXsEggu5QvC6Atfl7jTni2ZO_ar1j-MW5vgMjTjB2n9C8u8e6dWzCmsfVOzFesFTw2f4IF7VzM4jX0A_tBUJczoWKfcHGH7BvhoPIK7hBgsXEqrVOh257rvayH9RMhvD8-igtIpc0R4_coWAaUmQDw2eUC_IQEuGO6ehKWTnjszxEPMttryRUAI8h6515lr0s=w400-h319" width="400" /></a></div><h1 style="text-align: left;">Background</h1><p>MongoDB is a NoSQL, document oriented database that stores data in JSON-like objects. All data in MongoDB is stored in a binary form of JSON called BSON. A MongoDB database consists of a set of collections, where a collection is a set of unique documents. 
MongoDB utilizes <a href="https://github.com/wiredtiger">the WiredTiger storage engine</a>, which is a transactional multi-version concurrency control (MVCC) key value data store that manages the interface to a local durable storage medium.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEisyN95YDzTvzAcKEvhUKwkJDReaN_fG_EKLNzNIVRah6CnSzQfg-Kgyyso0PPyIzCL_d80eFL6-Rsk4SLcjpNajoB5T51c7gjUf_5KWsaWfF9rPC-WNqxTH6FalSA1DvvKW_-UtfSJ3tUFDCwkAloW4VhX0pia46MlTlmaVLLDsLMXsRWnvJpxjqXMmSk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1158" data-original-width="1292" height="358" src="https://blogger.googleusercontent.com/img/a/AVvXsEisyN95YDzTvzAcKEvhUKwkJDReaN_fG_EKLNzNIVRah6CnSzQfg-Kgyyso0PPyIzCL_d80eFL6-Rsk4SLcjpNajoB5T51c7gjUf_5KWsaWfF9rPC-WNqxTH6FalSA1DvvKW_-UtfSJ3tUFDCwkAloW4VhX0pia46MlTlmaVLLDsLMXsRWnvJpxjqXMmSk=w400-h358" width="400" /></a></div><p>To provide high availability, MongoDB provides the ability to run a database as a replica set <a href="https://muratbuffalo.blogspot.com/2024/01/fault-tolerant-replication-with-pull.html">using a leader based consensus protocol based on Raft</a>. In a replica set there exists a single primary and a set of secondary nodes. The primary node accepts client writes and inserts them into a replication log known as the oplog, where each entry contains information about how to apply a single database operation. Each entry is assigned a timestamp; these timestamps are unique and totally ordered within a node’s log. 
Oplog entries do not contain enough information to undo operations.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhzo0KaQQ3aHf35GG49_wilbkdow_8akiso4K7AU0Oodosee5lOjlDbY5qXSOedPQEZUp6JQaLyGC0o5xV-WYclXetV_h9T2w62THn6S9J_Qf31mfiE3w5wqR9EfcZD8sf1FlnjvrC7isi7UrcUGl722WbRownYNtxtPk6gxnCsE2mxbQTBmD81CsOkqTQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="876" data-original-width="1292" height="271" src="https://blogger.googleusercontent.com/img/a/AVvXsEhzo0KaQQ3aHf35GG49_wilbkdow_8akiso4K7AU0Oodosee5lOjlDbY5qXSOedPQEZUp6JQaLyGC0o5xV-WYclXetV_h9T2w62THn6S9J_Qf31mfiE3w5wqR9EfcZD8sf1FlnjvrC7isi7UrcUGl722WbRownYNtxtPk6gxnCsE2mxbQTBmD81CsOkqTQ=w400-h271" width="400" /></a></div><p>The MongoDB replication system serializes every write that comes into the system into the oplog. When an operation is processed by a replica set primary, the effect of that operation must be written to the database, and the description of that operation must also be written into the oplog. All operations in MongoDB occur inside WiredTiger transactions. When an operation’s transaction commits, we call the operation locally committed. Once it has been written to the database and the oplog, it can be replicated to secondaries, and once it has propagated to enough nodes that meet the necessary conditions, the operation will become majority committed (marked as such in primary and later learned by secondaries) which means it is permanently durable in the replica set.</p><p>For horizontal scaling, MongoDB also offers sharding, which allows users to partition their data across multiple replica sets, but we won't discuss it in this paper.</p><h3 style="text-align: left;">writeConcern</h3><p>writeConcern can be specified either as a numeric value or as “majority”. 
Write operations done at w:N will be acknowledged to a client when at least N nodes of the replica set (including the primary) have received and locally committed the write. Clients that issue a w:“majority” write will not receive acknowledgement until it is guaranteed that the write operation is majority committed. This means that the write will be resilient to any temporary or permanent failure of any set of nodes in the replica set, assuming there is no data loss at the underlying OS or hardware layers. </p><h3 style="text-align: left;">readConcern</h3><p>For a read operation done at readConcern “local”, the data returned will reflect the local state of a replica set node at the time the query is executed. There are no guarantees that the data returned is majority committed in the replica set, but it will reflect the newest data known to a particular node. Reads with readConcern “majority” are guaranteed to only return data that is majority committed. For majority reads, there is no strict guarantee on the recency of the returned data: The data may be staler than the newest majority committed write operation. (We revisit this in the Consistency Spectrum section, and discuss how majority reads differs from that in Cassandra.)</p><p>MongoDB also provides “linearizable” readConcern, which, when combined with w:“majority” write operations provides the strongest consistency guarantees. <a href="https://muratbuffalo.blogspot.com/2021/10/linearizability.html">Reads with readConcern level “linearizable” are guaranteed to return the effect of the most recent majority write that completed before the read operation began.</a></p><p>Additionally, MongoDB provides “available” and “snapshot” read concern levels, and the ability for causally consistent reads. The “snapshot” read concern only applies to multi-document transactions, and guarantees that clients see a consistent snapshot of data i.e. snapshot isolation. 
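</p><p>The writeConcern acknowledgment rule described above can be condensed into a small predicate. This is my own simplification with invented names; a real replica set tracks per-node replication progress, journaling, and the majority commit point rather than a bare count:</p>

```python
# Hypothetical sketch: when can a write be acknowledged under a given
# writeConcern? acked_by counts the nodes (including the primary) that
# have received and locally committed the write.

def can_acknowledge(write_concern, acked_by, n_nodes):
    if write_concern == "majority":
        return acked_by >= n_nodes // 2 + 1   # majority committed
    return acked_by >= write_concern          # numeric w:N

assert can_acknowledge(1, 1, 5)               # w:1 acks on the primary alone
assert not can_acknowledge("majority", 2, 5)  # 2 of 5 is not a majority
assert can_acknowledge("majority", 3, 5)
```

<p>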
Causal consistency provides the ability for clients to get session guarantees, including read-your-writes behavior in a given session.</p><h3 style="text-align: left;">Speculative execution model</h3><p>In Raft, log entries are not applied to the state machine until they are known to be committed, which means that they will never be erased from the log. In contrast, in order to support the whole consistency spectrum under one roof, MongoDB replicas apply log entries to the database as soon as they are received. This means that a server may apply an operation in its log even if the operation is uncommitted. This allows MongoDB to provide the “local” read concern level. As soon as a write operation is applied on some server, a “local” read is able to see the effects of that write on that server, even before the write is majority committed in the replica set. Recall that in MongoDB, the database itself is the state machine, and entries in the oplog correspond to operations on this state machine. Without the log being applied to the database, the local read would not be possible.</p><h3 style="text-align: left;">Data Rollback</h3><p>MongoDB’s speculative execution model makes it necessary for the replication system to have a procedure for data rollback in case these log entries may need to be erased from the log due to a leader takeover. In a protocol like Raft, this rollback procedure consists of truncating the appropriate entries from a log. In MongoDB, in addition to log truncation, it must undo the effects of the operations it deletes from a log. This requires modifying the state of the database itself, and presents several engineering challenges. The process is initiated by the rollback node when it detects that its log has diverged from the log of the sync source node, i.e. its log is no longer a prefix of that node’s log. The rollback node will then determine the newest log entry that it has in common with the sync source. 
The timestamp of this log entry is referred to as t_common. The node then needs to truncate all oplog entries with a timestamp after t_common, and modify its database state in such a way that it can become consistent again.</p><h3 style="text-align: left;">Recover to Timestamp (RTT) Algorithm</h3><p>Since MongoDB version 4.0, the WiredTiger storage engine has provided the ability to revert all replicated database data to some previous point in time. The replication system periodically informs the storage engine of a stable timestamp (t_stable), which is the latest timestamp in the oplog that is known to be majority committed and also represents a consistent database state. The algorithm works as follows. First, the rollback node asks the storage engine to revert the database state to the newest stable timestamp, t_stable. Note that t_stable may be a timestamp earlier than the rollback common point, t_common. Then, the node applies oplog entries forward from t_stable up to and including t_common. From t_common onwards normal oplog replication commences.</p><h1 style="text-align: left;">Consistency spectrum</h1><p>To understand the impact of readConcern on the rest of the system, it is necessary to discuss reads in the underlying WiredTiger storage engine. All reads in WiredTiger are done as transactions with snapshot isolation. While a transaction is open, all later updates must be kept in memory. Once there are no active readers earlier than a point in time t, the state of the data files at time t can be persisted to disk, and individual updates earlier than t can be forgotten. 
Thus a long-running WiredTiger transaction will cause memory pressure, so MongoDB reads must avoid performing long-running WiredTiger transactions in order to limit their impact on the performance of the system.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg7dJjfjtadiey7udO4rfrFzU7HMKHdV-iWkuPtwcjJMG7XL9ZEDepZmUgyHLL4ZRHrkXvGkw85NSPzWW_cF3Vk8EXiX-uyxUZXpnLdj-ZuBCJduZMUmOr79UL9-54wIPz4YgnUXoz-kZVD3cHfjHDPCyD8r9D8B11QYjiQ6OBe3MXAsdX6qjJRpRwmZOQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="836" data-original-width="1292" height="259" src="https://blogger.googleusercontent.com/img/a/AVvXsEg7dJjfjtadiey7udO4rfrFzU7HMKHdV-iWkuPtwcjJMG7XL9ZEDepZmUgyHLL4ZRHrkXvGkw85NSPzWW_cF3Vk8EXiX-uyxUZXpnLdj-ZuBCJduZMUmOr79UL9-54wIPz4YgnUXoz-kZVD3cHfjHDPCyD8r9D8B11QYjiQ6OBe3MXAsdX6qjJRpRwmZOQ=w400-h259" width="400" /></a></div><h3 style="text-align: left;">Local Reads</h3><p>Reads with readConcern “local” read the latest data in WiredTiger. However, local reads in MongoDB can be arbitrarily long-running due to the reach of the query. In order to avoid keeping a single WiredTiger transaction open for too long, they perform “query yielding” (Algorithm 1): While a query is running, it will read in a WiredTiger transaction with snapshot isolation and hold database and collection locks, but at regular intervals, the read will “yield”, meaning it aborts its WiredTiger transaction and releases its locks. After yielding, it opens a new WiredTiger transaction from a later point in time and reacquires locks (the read will fail if the collection or index it was reading from was dropped). This process ensures that local reads do not perform long-running WiredTiger transactions, which avoids memory pressure. 
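</p><p>The query yielding loop can be sketched as follows. This is my own toy model, where each "snapshot" is just a list of rows standing in for a fresh WiredTiger transaction, and all names are invented:</p>

```python
# Hypothetical sketch of query yielding: a long scan is broken into
# batches, each read in its own short snapshot, so no single
# storage-engine transaction stays open for the whole query.

def yielding_scan(open_snapshot, batch_size=2):
    """open_snapshot() returns a fresh snapshot (list of rows) per call."""
    results, pos = [], 0
    while True:
        snap = open_snapshot()          # begin WT txn, reacquire locks
        batch = snap[pos:pos + batch_size]
        if not batch:
            break                       # end of data
        results.extend(batch)
        pos += len(batch)
        # "yield" here: txn aborted, locks released; the next snapshot
        # may include concurrent writes, so the cut can be inconsistent
    return results

# A concurrent write (row 5) lands after the first batch and is picked up:
snapshots = iter([[1, 2, 3, 4], [1, 2, 3, 4, 5],
                  [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])
assert yielding_scan(lambda: next(snapshots)) == [1, 2, 3, 4, 5]
```

<p>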
The consequence is that local reads do not see a consistent cut of data, but this is acceptable for this isolation level.</p><h3 style="text-align: left;">Majority Reads</h3><p>Reads with readConcern level “majority” also perform query yielding, but they read from the majority commit point of the replica set. Each time a majority read yields, if the majority commit point has advanced, then the read will be able to resume from a later point in time. Again, majority reads may not read a consistent cut of data. A majority read could return 5 documents, yield and open a WiredTiger transaction at a later point in time, then return 5 more documents. It is possible that a MongoDB transaction that touched all 10 documents would only be reflected in the last 5 documents returned, if it committed while the read was running. (It is worth recalling <a href="https://muratbuffalo.blogspot.com/2022/02/ramp-tao-layering-atomic-transactions.html">the fractured read problem in Facebook TAO</a> at this point.) This inconsistent cut is acceptable for this isolation level. Since the read is performed at the majority commit point, we guarantee that all of the data returned is majority committed.</p><p>It is instructional to contrast MongoDB's majority readConcern with Cassandra's majority reads here. Cassandra’s QUORUM reads do not guarantee that clients only see majority committed data, differing from MongoDB’s readConcern level “majority”. Instead Cassandra’s QUORUM reads reach out to a majority of nodes with the row and return the most recent update, regardless of whether that write is durable to the set.</p><p>Another point to note here is that the combination of write-majority and read-majority does not give us linearizability. This is to be expected in any Raft/Paxos state machine replication. 
To get linearizability, an additional client-side protocol is needed, <a href="https://muratbuffalo.blogspot.com/2019/09/linearizable-quorum-reads-in-paxos.html">as we discussed in our Paxos Quorum Reads paper.</a></p><h3 style="text-align: left;">Snapshot Reads</h3><p>Reads with readConcern level “snapshot” must read a consistent cut of data. This is achieved by performing the read in a single WiredTiger transaction, instead of doing query yielding. In order to avoid long-running WiredTiger transactions, MongoDB kills snapshot read queries that have been running longer than 1 minute.</p><h1 style="text-align: left;">Experiments</h1><p>The paper performed three experiments on 3-node replica sets using different geographical distributions of replica set members. Each experiment performed 100 single-document updates, and all operations specified that journaling was required in order to satisfy the given writeConcern.</p><h3 style="text-align: left;">Local Latency Comparison</h3><p>In this experiment, all replica set members and the client were in the same AWS Availability Zone (roughly the same datacenter) and Placement Group (roughly the same rack). All replica set members were running MongoDB 4.0.2 with SSL disabled. 
The cluster was deployed using sys-perf, the internal MongoDB performance testing framework.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjfwWNahdW9R87-yvT_ayoEhaKBAs2XZvEe_GqIn7JXZDYkkgUgWxW5CavIf8Tc1pu-w-_i9ivc-pwrO8H2ZTA0WMJ9tInmUAktTpHmR9UchbJOT9OhTfRh8epL355hsbHWFx0bY34ucS9TpxrTfSTQ9Uxbc1pZ0_nPE9FieqDU7EeJaLbEg38vIdmSUdo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1196" data-original-width="1292" height="371" src="https://blogger.googleusercontent.com/img/a/AVvXsEjfwWNahdW9R87-yvT_ayoEhaKBAs2XZvEe_GqIn7JXZDYkkgUgWxW5CavIf8Tc1pu-w-_i9ivc-pwrO8H2ZTA0WMJ9tInmUAktTpHmR9UchbJOT9OhTfRh8epL355hsbHWFx0bY34ucS9TpxrTfSTQ9Uxbc1pZ0_nPE9FieqDU7EeJaLbEg38vIdmSUdo=w400-h371" width="400" /></a></div><h3 style="text-align: left;">Cross-AZ Latency Comparison</h3><p>In this experiment, all replica set members were in the same AWS Region (the same geographic area), but they were in different Availability Zones. Client 1 was in the same Availability Zone as the primary, and Client 2 was in the same Availability Zone as a secondary. 
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjQWXQWY-izpOiUY4pB7dkX9pmkAG1891koXu_MQAU4e-dqxEu_g_uXrRorX5d1eeV_ri11hX9BxuKgW49oXRGRr3H07Sr0jdUqaa_dKSIKQ4MNyZzu5fYK2VwZvybNO8KEtfiCxjNP52LkcdW9289QA4TnS1TF8IdW3CND-i3xspkUq3qYvtIS141mIbk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1268" data-original-width="1292" height="392" src="https://blogger.googleusercontent.com/img/a/AVvXsEjQWXQWY-izpOiUY4pB7dkX9pmkAG1891koXu_MQAU4e-dqxEu_g_uXrRorX5d1eeV_ri11hX9BxuKgW49oXRGRr3H07Sr0jdUqaa_dKSIKQ4MNyZzu5fYK2VwZvybNO8KEtfiCxjNP52LkcdW9289QA4TnS1TF8IdW3CND-i3xspkUq3qYvtIS141mIbk=w400-h392" width="400" /></a></div><br /><h3 style="text-align: left;">Cross-Region Latency Comparison</h3><p>In this experiment, all replica set members were in different AWS Regions. The primary was in US-EAST-1, one secondary was in EU-WEST-1, and the other secondary was in US-WEST-2. Client 1 was in US-EAST-1, and Client 2 was in EU-WEST-1.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEisyYB2L9I5Xp-v4rRgsFUlNzqPaOioCg8N4y0GYHG8nOK0RTJmIRi6dMwln8-ErNBcKEaoM8P1phKvJzGjbCJGmHXufBQQM6RbqPGXsdPBJm3eROYYiUt-Dw8UvbDsamE5hha0ceBcrT0xp7hQkA4CiKzsPSiA8jEhp2iyFIF0ikl3YLu4oQeDEe_dInI" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1268" data-original-width="1292" height="392" src="https://blogger.googleusercontent.com/img/a/AVvXsEisyYB2L9I5Xp-v4rRgsFUlNzqPaOioCg8N4y0GYHG8nOK0RTJmIRi6dMwln8-ErNBcKEaoM8P1phKvJzGjbCJGmHXufBQQM6RbqPGXsdPBJm3eROYYiUt-Dw8UvbDsamE5hha0ceBcrT0xp7hQkA4CiKzsPSiA8jEhp2iyFIF0ikl3YLu4oQeDEe_dInI=w400-h392" width="400" /></a></div><h1 style="text-align: left;">Cross-layer optimization opportunities</h1><p>The paper also discusses multi-document transactions. 
I find it very interesting to think about how MongoDB interacts with the underlying WiredTiger storage system. (I read more about this in the <a href="https://arxiv.org/abs/2111.14946">"Verifying Transactional Consistency of MongoDB" paper</a>.) This paper only scratches the surface of this, but having a powerful storage engine like WiredTiger opens the way to powerful cross-layer optimization opportunities.</p><h3 style="text-align: left;">Speculative majority and snapshot isolation for multi-statement transactions</h3><p>MongoDB uses an innovative strategy for implementing readConcern within transactions that greatly reduces aborts due to write conflicts in back-to-back transactions. When a user specifies readConcern level “majority” or “snapshot”, the returned data is guaranteed to be committed to a majority of replica set members. Outside of transactions, this is accomplished by reading at a timestamp at or earlier than the majority commit point in WiredTiger. However, this is problematic for transactions: It is useful for write operations to read the freshest version of a document, since the write will abort if there is a newer version of the document than the one it read. This motivated the implementation of “speculative” majority and snapshot isolation for transactions. Transactions read the latest data for both read and write operations, and at commit time, if the writeConcern is w:“majority”, they wait for all the data they read to become majority committed. This means that a transaction only satisfies its readConcern guarantees if the transaction commits with writeConcern w:“majority”. </p><p>Waiting for the data read to become majority committed at commit time rarely adds latency to the transaction, since if the transaction did any writes, then to satisfy the writeConcern guarantees, we must wait for those writes to be majority committed, which will imply that the data read was also majority committed. 
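A toy sketch of this speculative scheme may help. ReplicaSet and Txn are my names, and timestamps stand in for the data read; this is an illustration of the commit-time wait, not MongoDB's implementation:

```python
# Hypothetical model of speculative majority reads in transactions.
class ReplicaSet:
    def __init__(self):
        self.latest_ts = 0          # newest write timestamp on the primary
        self.majority_point = 0     # newest majority-committed timestamp
    def write(self, ts):
        self.latest_ts = max(self.latest_ts, ts)
    def advance_majority(self, ts):
        self.majority_point = max(self.majority_point, ts)

class Txn:
    def __init__(self, rs):
        self.rs, self.max_read_ts, self.did_write = rs, 0, False
    def read(self):
        # speculative: read at the latest timestamp, not the majority point
        self.max_read_ts = max(self.max_read_ts, self.rs.latest_ts)
    def commit_w_majority(self):
        if self.did_write:
            # waiting for our own writes to become majority committed also
            # covers everything we read; model that replication wait here
            self.rs.advance_majority(self.rs.latest_ts)
        # a read-only transaction waits explicitly for the data it read
        return self.rs.majority_point >= self.max_read_ts

rs = ReplicaSet()
rs.write(5)                   # a write at ts=5, not yet majority committed
t = Txn(rs)
t.read()                      # speculatively reads the ts=5 data
print(t.commit_w_majority())  # False: a read-only txn must keep waiting
rs.advance_majority(5)        # replication catches up
print(t.commit_w_majority())  # True: the readConcern guarantee now holds
```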
Only read-only transactions require an explicit wait at commit time for the data read to become majority committed. Even for read-only transactions, this wait often completes immediately because by the time the transaction commits, the timestamp at which the transaction read is often already majority committed.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjAyzOmcTgJosxbUGOI2Pu1GPzmtKNBFI1nrE3ZvW7_d-Opwyk7WQn4GVpfSrno7WpkGRfUpOD0uiLCQacL-J-4gGw5u1q6yRykYVCbUsvRA06GOj7xsnYj3-g0xkiDTFc2UGez5TEbDUgsi7Gn2YuTslUCCR-LGEM9r6fqIhEWWDCcetXxvzm7pqBCVTY" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="886" data-original-width="1292" height="274" src="https://blogger.googleusercontent.com/img/a/AVvXsEjAyzOmcTgJosxbUGOI2Pu1GPzmtKNBFI1nrE3ZvW7_d-Opwyk7WQn4GVpfSrno7WpkGRfUpOD0uiLCQacL-J-4gGw5u1q6yRykYVCbUsvRA06GOj7xsnYj3-g0xkiDTFc2UGez5TEbDUgsi7Gn2YuTslUCCR-LGEM9r6fqIhEWWDCcetXxvzm7pqBCVTY=w400-h274" width="400" /></a></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-59380119956599770152024-02-03T22:47:00.010-05:002024-02-06T08:59:01.698-05:00Design and Analysis of a Logless Dynamic Reconfiguration Protocol<p><a href="https://arxiv.org/pdf/2102.11960.pdf">This paper appeared in OPODIS'21 and describes dynamic reconfiguration in MongoDB.</a></p><p>So, what is dynamic reconfiguration? The core Raft protocol implements state machine replication (SMR) using a static set of servers. (<a href="https://muratbuffalo.blogspot.com/2024/01/fault-tolerant-replication-with-pull.html">Please read this to learn about how MongoDB adopted Raft for a pull-based SMR</a>.) To ensure availability in the presence of faults, SMR systems must be able to dynamically (and safely) replace failed nodes with healthy ones. 
This is known as dynamic reconfiguration.</p><h2 style="text-align: left;">MongoDB logless reconfiguration</h2><p>Since its inception, the MongoDB replication system has provided a custom, ad hoc, legacy protocol for dynamic reconfiguration of replicas. This legacy protocol managed configurations in a logless fashion, i.e., each server only stored its latest configuration. It decoupled reconfiguration processing from the main database operation log. The legacy protocol, however, was known to be unsafe in certain cases. </p><p style="text-align: left;">Revising that legacy protocol, this paper presents a redesigned safe reconfiguration protocol, <b>MongoRaftReconfig</b>, with rigorous safety guarantees. A primary goal of the protocol was to keep design and implementation complexity low.</p><p>Why didn't MongoDB use Raft's reconfiguration protocol? </p><p>The Raft consensus protocol (2014) provided a dynamic reconfiguration algorithm (<a href="https://groups.google.com/g/raft-dev/c/t4xj6dJTP6E/m/d2D9LrWRza8J">a critical safety bug was found later, showing that reconfiguration protocols are tricky</a>). Raft uses the main operation log (oplog) for both normal operations and reconfiguration operations. This coupling imposes fundamental restrictions on the operation of the two logs.</p><p>MongoRaftReconfig avoids this by separating the oplog and "config state machine" (CSM), allowing reconfigurations to bypass the oplog SMR. We will revisit this in the evaluation section. Decoupling the CSM from the main operation log SMR also allows for a logless optimization: it is sufficient to store only the latest version of the config state. This allows the CSM to avoid complexities related to garbage collection of old log entries and simplifies the mechanism for state propagation between servers.</p><p>Below, we present the MongoRaftReconfig protocol and discuss its correctness. 
The paper includes <a href="https://zenodo.org/records/5715511">TLA+ model checking</a> and a manual proof, which are used for verifying MongoRaftReconfig’s key safety properties.</p><p><br /></p><h1 style="text-align: left;">MongoRaftReconfig protocol</h1><p>Raft reconfiguration consists of two alternate algorithms: single server membership change and joint consensus. This paper focuses exclusively on the single server membership change protocol. The single server change approach aims to simplify reconfiguration by allowing only reconfigurations that add or remove a single server.</p><p>As in Raft-reconfig, by restricting to a single server change at a time, MongoRaftReconfig ensures that all quorums of two adjacent configurations (C to C') overlap with each other. MongoRaftReconfig also imposes additional restrictions to ensure</p><p></p><ul style="text-align: left;"><li>deactivation of old configurations (to prevent them from executing disruptive operations --e.g. electing a primary or committing a write), and</li><li>state transfer from the old configuration to the new configuration before the new one becomes active.</li></ul><p></p><p>Let's formalize this a bit. </p><p>A configuration is defined as a tuple (m,v,t), where m is a member set, v is a numeric configuration version, and t is the numeric term of the configuration. The v and t together tie the reconfiguration protocol, CSM, to the replicated state machine protocol for the oplog as we discuss below in the algorithm description. This is achieved by totally ordering configurations by their (version, term) pair, where term is compared first, followed by version.</p><p>Reconfigurations can only be executed on primary servers, and they update the primary's current local configuration C to the specified configuration C'. As in RaftReconfig, in MongoRaftReconfig any reconfiguration that moves from C to C' is required to satisfy the quorum overlap condition i.e. QuorumsOverlap(C.m,C'.m). 
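The configuration order and the quorum-overlap condition can be sketched as executable checks. QuorumsOverlap is the paper's name; the rest of the names below are mine, and the brute-force quorum enumeration is purely illustrative:

```python
# Configurations are (member_set, version, term), totally ordered by
# (term, version) -- term compared first, then version.
from itertools import combinations

def config_lt(c1, c2):
    (_, v1, t1), (_, v2, t2) = c1, c2
    return (t1, v1) < (t2, v2)

def quorums(members):
    """All majority subsets of a member set (brute force, for illustration)."""
    n = len(members)
    return [set(q) for q in combinations(sorted(members), n // 2 + 1)]

def quorums_overlap(m1, m2):
    """QuorumsOverlap: every majority of m1 intersects every majority of m2."""
    return all(q1 & q2 for q1 in quorums(m1) for q2 in quorums(m2))

old = ({"a", "b", "c"}, 3, 1)
new = ({"a", "b", "c", "d"}, 4, 1)           # a single-server addition
assert config_lt(old, new)                   # same term, higher version
assert quorums_overlap(old[0], new[0])       # majorities of C and C' intersect
# removing several servers at once can break the overlap guarantee:
assert not quorums_overlap({"a", "b", "c", "d", "e"}, {"a", "b"})
```

Single-server changes keep adjacent configurations' quorums overlapping by construction; the Q1/Q2/P1 preconditions that follow do the rest of the safety work.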
The following conditions must also be satisfied before a primary server in term T can execute a reconfiguration out of its current configuration C.</p><p></p><ul style="text-align: left;"><li>Q1. Config Quorum Check: There must be a quorum of servers in C.m that are currently in configuration C.</li><li>Q2. Term Quorum Check: There must be a quorum of servers in C.m that are currently in term T.</li><li>P1. Oplog Commitment: All oplog entries committed in terms <= T must be committed on some quorum of servers in C.m.</li></ul><p></p><p>Q1, when coupled with the election restrictions (as we discuss below in the elections section), achieves deactivation by ensuring that configurations earlier than C can no longer elect a primary.</p><p>Q2 ensures that term information from older configurations is correctly propagated to newer configurations, while P1 ensures that previously committed oplog entries are properly transferred to the current configuration, ensuring that any primary in a current or later configuration will contain these entries.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjfsRLxxceahzCzHCHm4KZMoarhOoPGe4t_KWC6aBr4mddOpyyCOWTeFo_qgg5hwe-BrMVXeGDXEBEcLwMwbnp8VRvPX6C69c3J2nmcqJqcy2oQOk1w8ilcTK31ycPCzuM3IRYhgPOm0gzq5uVHxfxbgv1ur3ozbue-WAp6dI7C3qWPJ5OFPXPu3RKmXJU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="938" data-original-width="1292" src="https://blogger.googleusercontent.com/img/a/AVvXsEjfsRLxxceahzCzHCHm4KZMoarhOoPGe4t_KWC6aBr4mddOpyyCOWTeFo_qgg5hwe-BrMVXeGDXEBEcLwMwbnp8VRvPX6C69c3J2nmcqJqcy2oQOk1w8ilcTK31ycPCzuM3IRYhgPOm0gzq5uVHxfxbgv1ur3ozbue-WAp6dI7C3qWPJ5OFPXPu3RKmXJU=s16000" /></a></div><p>A big insight in the algorithm was to realize that the CSM and oplog protocols need to be combined/pinned together for safety, and this is done by using (v,t) as a pair, and requiring that the CSM commits the config on the most recent term, and the RSM 
follows the committed configs in sequential order. </p><p>After a reconfiguration has occurred on a primary, the updated configuration needs to be communicated to secondaries. In MongoRaftReconfig, config state propagation is implemented by the SendConfig action, which transfers configuration state from one server to another. Secondaries receive information about the configurations of other servers via periodic heartbeats. They determine whether one configuration is newer than another using the total lexicographical ordering on the (version, term) pair. A secondary can update its configuration to any that is newer than its current configuration.</p><p>When a node runs for election in MongoStaticRaft, it must ensure its log is appropriately up to date and that it can garner a quorum of votes in its term. In MongoRaftReconfig, there is an additional restriction on voting behavior that depends on configuration ordering. If a replica set server is a candidate for election in configuration Ci, then a prospective voter in configuration Cj may only cast a vote for the candidate if Cj is less than or equal to Ci .</p><p>Furthermore, when a node wins an election, it must update its current configuration with its new term before it is allowed to execute subsequent reconfigurations. That is, if a node with current configuration (m, v, t) wins election in term t', it will update its configuration to (m, v, t') before allowing any reconfigurations to be processed. This behavior is necessary to deactivate concurrent reconfigurations that may occur on primaries in a different term.</p><p><br /></p><h1 style="text-align: left;">Correctness</h1><p>LeaderCompleteness property states that if a log entry has been committed in term T, then it must be present in the logs of all primary servers in terms > T.</p><p>ElectionSafety is a key, auxiliary lemma that is required in order to show LeaderCompleteness. 
ElectionSafety states: For all s, t ∈ Server such that s ≠ t, it is not the case that both s and t are primary and have the same term.</p><p>In order not to violate the property that all quorums of any two configurations overlap (which MongoStaticRaft relies on for safety), MongoRaftReconfig must appropriately deactivate past configurations before creating new configurations. Deactivated configurations cannot elect a new leader or execute a reconfiguration. Otherwise, the old primary (which still thinks it is primary) can institute a reconfiguration, and pull the rug (one crucial node needed for majority intersection) from under the new primary, violating the quorum-overlap property.</p><p>In addition to deactivation of configurations, MongoRaftReconfig must also ensure that term information from one configuration is properly transferred to subsequent configurations, so that later configurations know about elections that occurred in earlier configurations. For example, if an election occurred in term T in configuration C, even if C is deactivated by the time C' is created, the protocol must also ensure that C' is aware of the fact that an election in T occurred in C.</p><p>Moreover, MongoRaftReconfig also ensures that newer configurations appropriately disable commitment of log entries in older terms. The CSM only moves ahead through committed configs sequentially: the CSM can choose the next config and commit it only if its current one is committed. The primary must write the current config again with its latest term and wait for it to be propagated to a majority.</p><p><a href="https://zenodo.org/records/5715511">The paper is accompanied by TLA+ models,</a> which seem really nice. I will start playing with them. I think the people who worked on the TLA+ models had a deep understanding of the protocol. 
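In the spirit of those TLA+ invariants, the two safety properties can be stated as executable predicates over a toy global state (all names below are mine, and the state model is deliberately minimal):

```python
# Toy state: server name -> {"role", "term", "log"}; not the paper's TLA+ spec.
def election_safety(states):
    """No two distinct servers are primary in the same term."""
    terms = [st["term"] for st in states.values() if st["role"] == "primary"]
    return len(terms) == len(set(terms))

def leader_completeness(states, committed):
    """Every entry committed in term T is in the log of every primary of a
    term > T. `committed` is a list of (entry, term-it-committed-in) pairs."""
    return all(entry in st["log"]
               for entry, t in committed
               for st in states.values()
               if st["role"] == "primary" and st["term"] > t)

states = {
    "A": {"role": "primary",   "term": 2, "log": ["x", "y"]},
    "B": {"role": "secondary", "term": 3, "log": ["x"]},
    "C": {"role": "primary",   "term": 3, "log": ["x", "y"]},
}
assert election_safety(states)                   # terms 2 and 3 differ: OK
assert leader_completeness(states, [("x", 1), ("y", 2)])
states["B"]["role"] = "primary"                  # a second primary in term 3
assert not election_safety(states)
```

A model checker explores reachable states and evaluates exactly such predicates in each one; these two are the invariants the paper's TLA+ models check for MongoRaftReconfig.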
This reminds me of this quote from <a href="https://www.youtube.com/watch?v=zoE3DqglcgM&t=3055s">Byron Cook's recent talk (recommended watch).</a></p><blockquote style="border: medium; margin: 0px 0px 0px 40px; padding: 0px;"><p style="text-align: left;">You (the formal methods person) become the only one who actually understands the system right. They don't understand it... There's fantasy, the documentation, the code, and the individuals. And no one agrees on what's going on for any complex system.</p></blockquote><p><br /></p><h1 style="text-align: left;">Evaluation</h1><p>As we mentioned in the introduction, in standard Raft, the main operation log is used for both normal operations and reconfiguration operations. This coupling imposes fundamental restrictions on the operation of the two logs.</p><p>Raft's behavior here is stronger than necessary for safety: it is not strictly necessary to commit the log entries that precede a reconfiguration entry Cj before executing the reconfiguration. The only fundamental requirements are that previously committed log entries are committed by the rules of the current configuration, and that the current configuration has satisfied the necessary safety preconditions. Raft achieves this goal implicitly, but more conservatively than necessary, by committing the entry Cj and all entries before it in the log. This ensures that all previously committed log entries, in addition to the uncommitted operations U, are now committed in Cj, but it is not strictly necessary to pipeline a reconfiguration behind commitment of U.</p><p>MongoRaftReconfig avoids this by separating the oplog and config state machine and their rules for commitment and reconfiguration, allowing reconfigurations to bypass the oplog if necessary. Note that Oplog Commitment (P1) is easier to satisfy (if not already satisfied) than Raft's insistence on committing all the entries that happened to fall before Cj in the oplog. </p><p>P1. 
Oplog Commitment: All oplog entries committed in terms <= T must be committed on some quorum of servers in C.m.</p><p><br /></p><p>The evaluation section simulates a degraded disk scenario to highlight the benefit of the decoupled CSM execution. It argues that decoupling CSM execution allows MongoRaftReconfig to successfully reconfigure the system in such a degraded state, restoring oplog write availability by removing the failed nodes and adding new, healthy nodes.</p><p>The paper examines the degraded disk scenario to mention a caveat, and to argue that even under that caveat MongoRaftReconfig provides an advantage. "Note that if a replica set server experiences a period of degradation (e.g. a slow disk), both the oplog and reconfiguration channels will be affected, which would seem to nullify the benefits of decoupling the reconfiguration and oplog replication channels. In practice, however, the operations handled by the oplog are likely orders of magnitude more resource intensive than reconfigurations, which typically involve writing a negligible amount of data. 
So, even on a degraded server, reconfigurations should be able to complete successfully when more intensive oplog operations become prohibitively slow, since the resource requirements of reconfigurations are extremely lightweight."</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjjqbyJxAfjGmtYRVbgx1BOhpxzOSbLXd6Y6jTUBPjNczeKgu7piF4C9pMzeAaK_58ABCmP4b8EHdEI0xbkpXNeyRwtEle2_FFAw87EwY2blCFtDSb3BOJyW6A50cK4tEKlFmI9_NuyuN_9MXyuamvB1Lm2tB7M98ranWMEjrFGfGmoR0gzSARCD3u4_nk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="846" data-original-width="1292" src="https://blogger.googleusercontent.com/img/a/AVvXsEjjqbyJxAfjGmtYRVbgx1BOhpxzOSbLXd6Y6jTUBPjNczeKgu7piF4C9pMzeAaK_58ABCmP4b8EHdEI0xbkpXNeyRwtEle2_FFAw87EwY2blCFtDSb3BOJyW6A50cK4tEKlFmI9_NuyuN_9MXyuamvB1Lm2tB7M98ranWMEjrFGfGmoR0gzSARCD3u4_nk=s16000" /></a></div><br /><div><h1 style="text-align: left;">Related work</h1><div>When presenting Paxos, for reconfiguration, Lamport proposed limiting the length of the command pipeline window to α > 0 and only activating the new config chosen at slot i after slot i + α. Depending on the value of α, this approach limits either the throughput or the latency of the system. </div><div><br /></div><div>In contrast, in MongoDB, the wait on command commit is only done on-demand when reconfiguration is happening.</div><div><br /></div><div><br /></div><div>I had written about reconfiguration in SMR a couple of times before. 
<a href="https://muratbuffalo.blogspot.com/2022/12/vertical-paxos-and-primary-backup.html">Vertical Paxos delves into the leader handover and reconfiguration, and shows a practical application of these in the context of primary-backup replication protocols.</a></div><div><br /></div><div><a href="https://muratbuffalo.blogspot.com/2020/05/matchmaker-paxos-reconfigurable.html">Matchmaker Paxos</a> is a realization/implementation of Vertical Paxos with a deployment more tightly integrated with the Paxos protocol. Vertical Paxos requires an external master, which is itself implemented using state machine replication. The matchmakers in Matchmaker Paxos are analogous to that external master/Paxos-box and show that such a reconfiguration does not require a nested invocation of state machine replication. Matchmaker Paxos uses and generalizes the approach in Vertical Paxos for reconfiguration and is OK with lazy/concurrent state transfer.</div><div><br /></div><div>I had also <a href="https://muratbuffalo.blogspot.com/2023/01/reconfiguring-replicated-atomic-storage.html">written about reconfiguration for atomic storage,</a> but that is an easier problem than reconfiguration on a state machine replication system.</div></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-18146837079147118122024-01-30T14:53:00.009-05:002024-01-30T14:55:28.602-05:00 Fault-Tolerant Replication with Pull-Based Consensus in MongoDB<p><a href="https://www.usenix.org/system/files/nsdi21-zhou.pdf">This paper, from NSDI 2021,</a> presents the design and implementation of strongly consistent replication in MongoDB using a consensus protocol derived from Raft.</p><p><a href="https://raft.github.io/">Raft</a> provides fault-tolerant state-machine-replication (SMR) over asynchronous networks. Raft (like most SMR protocols) uses push-based replication. 
But MongoDB uses a pull-based replication scheme, so integrating MongoDB's SMR with Raft posed challenges. The paper focuses on examining and solving these challenges, and explaining the resulting MongoSMR protocol (my term, not the paper's). </p><p>The paper restricts itself to the strongest consistency level, linearizability, but it also talks about how serving weaker consistency models shapes the decisions made in MongoDB's replication protocol. The paper talks about extensions/optimizations of the MongoDB SMR protocol, but I skip those for brevity. I also skip the evaluation section, and just focus on the core of the SMR protocol.</p><h1 style="text-align: left;">Design</h1><h2 style="text-align: left;">Background</h2><p>Unlike conventional primary-backup replication schemes where updates are usually pushed from the primary to the secondaries, in MongoDB a secondary pulls updates from other servers, and not necessarily from the primary.</p><p>The pull-based approach provides more control of how data is transmitted over the network. Depending on users' needs, the data transmission can be in a star topology, a chaining topology, or a hybrid one. This has big performance and monetary cost implications. For example, when deployed in clouds like Amazon EC2, data transmission inside a datacenter is free and fast, but is expensive and subject to limited bandwidth across datacenters. Using a linked topology, rather than a star topology, a secondary can sync from another secondary in the same datacenter, rather than use up another costly data-transmission link to the primary in the other datacenter.</p><p>In earlier releases, MongoDB assumed a semi-synchronous network: either there is manual control of failover, or all messages are bound to arrive within 30 seconds for failure detection. Starting from 2015, the MongoDB replication scheme was remodeled based on the Raft protocol. 
This new protocol (MongoSMR, which is the topic of this paper) guarantees safety in an asynchronous network (i.e., messages can be arbitrarily delayed or lost) and supports fully autonomous failure recovery with a smaller failover time. Same as before, MongoSMR is still pull-based. </p><h2 style="text-align: left;">Oplog</h2><p>An oplog is a sequence of log entries that feeds the SMR. Each log entry contains a database operation. Figure 1 shows an example of oplog entry. Notice that each entry is a JSON document. The oplog is stored in the oplog collection, which behaves in almost all regards as an ordinary collection of documents. The oplog collection automatically deletes its oldest documents when they are no longer needed and appends new entries at the other end.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgUtFCtgum2YaWITRWKUCuQtKoQY2gTjuIBHKoVw0V9pcM8DlzeRER8ZWZMyQ7RYrCxAvj-GBsvjaQ_aONSVTjBlMCMjBZMxa7Alzj_wuq6an954mjQNNq9KEue37rL6Tju9VdAj-XeN_rNruE8gWKAtbud58iqVm4NqOMhrqwm6Oawqzl-il8D5LS2Ky4" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="882" data-original-width="1068" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEgUtFCtgum2YaWITRWKUCuQtKoQY2gTjuIBHKoVw0V9pcM8DlzeRER8ZWZMyQ7RYrCxAvj-GBsvjaQ_aONSVTjBlMCMjBZMxa7Alzj_wuq6an954mjQNNq9KEue37rL6Tju9VdAj-XeN_rNruE8gWKAtbud58iqVm4NqOMhrqwm6Oawqzl-il8D5LS2Ky4" width="291" /></a></div><p>Each slot of the SMR (i.e., each oplog entry) is timestamp based, not sequence number based. Each oplog entry is assigned a timestamp and annotated with the term of the primary. The timestamp is a monotonically increasing logical clock that exists in the system before this work. 
A pair of term and timestamp, referred to as an OpTime, can identify an oplog entry uniquely in a replica set and give a total order of all oplog entries among all replicas.</p><h2 style="text-align: left;">Data replication</h2><p>In Raft the primary initiates AppendEntries RPCs to secondaries to replicate new log entries. In MongoSMR, the primary waits for the secondaries to pull the new entries that are to be replicated.</p><p>The principle is to decouple data synchronization via AppendEntries in Raft into two parts: replicas pulling new data from the peers, and replicas reporting their latest replication status so that a request can commit after it reaches a majority of replicas.</p><p>The primary processes two types of RPCs from secondaries: PullEntries and UpdatePosition. A secondary will use PullEntries to fetch new logs, and use UpdatePosition to report its status so that the primary can determine which oplog entries have been safely replicated to a majority of servers and commit them. Similar to Raft, once an entry is committed, all prior entries are committed indirectly.</p><h3 style="text-align: left;">PullEntries</h3><p>A secondary continuously sends PullEntries to the selected sync source (which may not be the primary) to retrieve new log entries. The PullEntries RPC includes the latest oplog timestamp (prevLogTimestamp) of the syncing server as an argument.</p><p>When receiving PullEntries, a server will reply with its oplog entries after and including that timestamp if it has a longer or the same log, or the server could reply with an empty array if its log sequence is shorter. 
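A toy sketch of the OpTime order and a sync source's PullEntries response follows. I assume term is compared first (as in Raft's log-comparison rule), entries are keyed by timestamp, and all names are mine; the real PullEntries is an RPC over the oplog, not a list filter:

```python
# Toy model: an oplog entry is (timestamp, term, op); an OpTime is (ts, term).
def optime_lt(a, b):
    """Total order on OpTimes: term compared first, then timestamp."""
    (ts1, t1), (ts2, t2) = a, b
    return (t1, ts1) < (t2, ts2)

def pull_entries(source_log, prev_log_timestamp):
    """Return the source's entries at and after the syncing server's latest
    timestamp, or an empty list if the source's log is shorter."""
    if not source_log or source_log[-1][0] < prev_log_timestamp:
        return []                      # source has less data than the syncer
    return [e for e in source_log if e[0] >= prev_log_timestamp]

log = [(10, 1, "w1"), (11, 1, "w2"), (15, 2, "w3")]
assert optime_lt((11, 1), (15, 2))     # the term-2 entry orders last
assert pull_entries(log, 11) == [(11, 1, "w2"), (15, 2, "w3")]
assert pull_entries(log, 20) == []     # the syncer is ahead of this source
```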
When the logs are the same, PullEntries waits for new data for a given timeout (5 seconds by default) before returning a response, to avoid busy looping.</p><h3 style="text-align: left;">UpdatePosition</h3><p>After retrieving new entries into its local oplog with PullEntries, the secondary sends UpdatePosition to its sync source to report on its latest log entry's OpTime.</p><p>When receiving the UpdatePosition, the server will forward the message to its sync source, and so forth, until the UpdatePosition reaches the primary.</p><p>The primary maintains a non-persistent map in memory that records the latest known log entry's OpTime on every replica, including its own, as their log positions. When receiving a new UpdatePosition, if the received one is newer, the primary replaces its local record with the received OpTime. Then, the primary will do a count on the log positions of all replicas: If a majority of replicas have the same term and the same or greater timestamp, the primary will update its lastCommitted to that OpTime and notify secondaries of the new lastCommitted by piggybacking onto other messages, such as heartbeats and the responses to PullEntries. lastCommitted is also referred to as the commit point.</p><h2 style="text-align: left;">Oplog replication</h2><p>Recall that each oplog entry is a document, and oplog is a collection. MongoSMR leverages this to implement oplog replication as a streaming query. Instead of initiating continuous RPCs on the syncing node, the PullEntries RPC is implemented as a query on the oplog collection with a "greater than or equal to" filter on the timestamp field. The query can be optimized easily since the oplog is naturally ordered by timestamp. 
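The primary's commit-point calculation from those position reports can be sketched as follows. This is a simplification that tracks only timestamps for lastCommitted and uses my own names:

```python
# Toy model of advancing the commit point from UpdatePosition reports.
def advance_commit_point(positions, primary_term, last_committed):
    """positions: replica -> (timestamp, term) of its newest oplog entry.
    Commit the newest timestamp that a majority of replicas have reached
    in the primary's term."""
    majority = len(positions) // 2 + 1
    candidates = sorted(ts for ts, term in positions.values()
                        if term == primary_term)
    for ts in reversed(candidates):        # try the newest timestamp first
        count = sum(1 for p_ts, p_term in positions.values()
                    if p_term == primary_term and p_ts >= ts)
        if count >= majority:
            return max(last_committed, ts)
    return last_committed                  # nothing new is majority replicated

positions = {"A": (15, 2), "B": (15, 2), "C": (11, 1)}
assert advance_commit_point(positions, 2, 10) == 15  # A and B are a majority
positions = {"A": (15, 2), "B": (11, 1), "C": (11, 1)}
assert advance_commit_point(positions, 2, 10) == 10  # ts=15 not on a majority
```

Note the term filter: only positions in the primary's own term count toward commitment, which foreshadows the stale-primary scenario discussed in the correctness section below.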
Using database cursors allows the syncing node to fetch oplog entries in batches and also allows the RPC to work in a streaming manner, so that a sync source can send new data without waiting for a new request, reducing the latency of replication.</p><h2 style="text-align: left;">Sync source selection</h2><p>MongoSMR introduced the Heartbeats RPC to decouple the heartbeat responsibility from Raft's AppendEntries RPC. Heartbeats are sent among *all* replicas, and are used for liveness monitoring, commit point propagation and sync source selection.</p><p>A server chooses its sync source only if the sync source has newer oplog entries than itself by comparing their log positions (learned via Heartbeat RPC). This total order on log positions guarantees that the replicas can never form a cycle of sync sources.</p><p><br /></p><h1 style="text-align: left;">Correctness </h1><h2 style="text-align: left;">A crucial difference between MongoSMR and Raft</h2><p>In Raft, if a server has voted for a higher term in an election, the server cannot take new log entries sent from an old primary with a lower term. In contrast, in MongoSMR, even if the sync source is a stale primary with a lower term number, the server would still fetch new log entries generated by the stale primary. This is because the PullEntries RPC does not check the term of the sync source (it only checks OpTimes).</p><p>Before we explore the correctness implications of this, let's talk about why MongoDB does not check the term of the sync source, and eagerly replicates entries. This has to do with achieving faster failovers and preserving uncommitted oplog entries.</p><p>In addition to strong consistency considered in this paper, MongoDB supports fast but weak consistency levels that acknowledge writes before they are replicated to a majority. Thus, a failover could cause a large loss of uncommitted writes. 
Though the clients are not promised durability with weak consistency levels, MongoDB still prefers to preserve these uncommitted writes as much as possible.</p><p>For this purpose, it introduced an extra phase for a newly elected primary: the primary catchup phase. The new primary will not accept new writes immediately after winning an election. Instead, it will keep retrieving oplog entries from its sync source until it does not see any newer entries, or a timeout occurs. This timeout is configurable in case users prefer faster failovers to preserving uncommitted oplog entries. This primary catchup design is only possible because in MongoSMR, a server (including the new primary) is allowed to keep syncing oplog entries generated by the old primary after voting for a higher term as long as it hasn't written any entry with its new term. This important difference between MongoDB and Raft allows MongoDB to preserve uncommitted data as much as possible during failovers.</p><h2 style="text-align: left;">Correctness</h2><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjdxDSR4dEDiXMAaJww3os4KG6UIQXbO1zREG9NsYPJZtgGCaUyhJLEA4opcxy592KQGMcyFPW0dFErZhbIaztdC_2VwqtMDFdNfzjlSr9gk0nogdfJxdVhyj-lMk8GRxV9Wx_Z_KDh7OpNcZtdneU5tqmtdGiBIeyHlA7ImdbZxcEXeoOeBUHyI5qjkMU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="918" data-original-width="1068" height="344" src="https://blogger.googleusercontent.com/img/a/AVvXsEjdxDSR4dEDiXMAaJww3os4KG6UIQXbO1zREG9NsYPJZtgGCaUyhJLEA4opcxy592KQGMcyFPW0dFErZhbIaztdC_2VwqtMDFdNfzjlSr9gk0nogdfJxdVhyj-lMk8GRxV9Wx_Z_KDh7OpNcZtdneU5tqmtdGiBIeyHlA7ImdbZxcEXeoOeBUHyI5qjkMU=w400-h344" width="400" /></a></div><p>Let's explore the correctness implications of this difference. Consider Fig 2.c. There it seems like value 2 (in blue) is anchored and decided, but it is not! 
It is just replicated widely, but some of those replicas have since voted for a newer term; the replication was done only to be able to recover more entries when using weaker consistency levels.</p><p>If we take Raft's rule that "a log entry is committed once the leader that created the entry has replicated it on a majority of the servers" without any qualifiers, indeed value 2 would be counted as committed, only later to be overturned. To prevent cases like this from happening, MongoSMR adds a new argument to the UpdatePosition RPC: the term of the syncing server. The recipient of UpdatePosition will update its local term if the received term is higher. If the recipient is a stale primary, seeing a higher term will make the primary step down before committing anything, thus avoiding any safety issue.</p><p>Therefore, in the above example, when server A receives UpdatePosition from servers C/D, it will see term 3 and step down immediately without updating its lastCommitted. Even though the entry with term 2 is in a majority of servers’ logs, it is not committed.</p><p>This revised UpdatePosition manages to maintain a key invariant of Raft --the Leader Completeness Property. This property states that "if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms."</p><p>To verify that the MongoSMR design and implementation are correct, the team has done extensive verification and testing on the protocol, including model checking using TLA+, unit testing, integration testing, fuzz testing, and fault-injection testing. <a href="https://github.com/mongodb/mongo/blob/master/src/mongo/tla_plus/RaftMongo/RaftMongo.tla">The TLA+ specification of the protocol is available here.</a> </p><p><br /></p><h1 style="text-align: left;">Discussion </h1><h2 style="text-align: left;">Chain replication</h2><p>Reading MongoSMR may lead people to think that the lines between Paxos/Raft SMR and chain replication are somewhat blurred.
If MongoRep uses a chained topology, what would be the differences from chain replication?</p><p>Well, there are big differences. In chain replication, you fall back to the next node in the chain as the new primary. This is a restrictive (inflexible in terms of options) albeit efficient way to have log monotonicity/retainment. In MongoSMR, any node can become the new primary. The log monotonicity/retainment comes from Raft leader election rather than from the topology.</p><p>A bigger difference is of course in the philosophy of the two approaches. Chain replication requires a separate consensus box to maintain the topology of the chain. Having an external consensus box hosted outside the replicaset causes logistics and fatesharing issues about whether what the consensus box agrees on has good fidelity to the field/replicaset. (Well, there are versions of chain replication which put consensus in the chain, and yeah, that blurs the lines a bit.) In Paxos/Raft SMR, the consensus is part of the SMR. So it comes with batteries included for fault-tolerant state machine replication. </p><h2 style="text-align: left;">PigPaxos</h2><p>We talked about the advantages of pull-based replication over push-based replication, and mentioned that it allows more flexible topologies than just the star topology where the primary is in the middle. It is in fact possible to be flexible with push-based replication and solve throughput/performance problems stemming from using the star topology. <a href="https://muratbuffalo.blogspot.com/2020/03/pigpaxos-devouring-communication_18.html">In our 2020 work, PigPaxos, we showed how that is possible using relay nodes.</a> At that time, we did not know of MongoDB's chaining topology/approach, and hence hadn't mentioned it.
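To tie the correctness discussion together, here is a minimal Python sketch of the revised UpdatePosition handling at the primary. This is my own toy model (the Primary class and its method names are invented for illustration, not MongoDB's actual code): the primary tracks replica positions in memory, steps down the moment it sees a higher term, and advances lastCommitted only when a majority of replicas match its own term at the same or greater timestamp.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class OpTime:
    term: int
    timestamp: int

class Primary:
    def __init__(self, name, term, members):
        self.name = name
        self.term = term
        self.stepped_down = False
        self.last_committed = OpTime(0, 0)
        # Non-persistent, in-memory map: replica -> latest known OpTime.
        self.positions = {m: OpTime(0, 0) for m in members}

    def on_update_position(self, replica, optime, sender_term):
        # The revised RPC carries the syncing server's term: a stale
        # primary that sees a higher term steps down before it can
        # commit anything, preserving Leader Completeness.
        if sender_term > self.term:
            self.stepped_down = True
            return
        if optime > self.positions[replica]:
            self.positions[replica] = optime
        self._advance_commit_point()

    def _advance_commit_point(self):
        mine = self.positions[self.name]
        # Count replicas at the primary's own term with >= timestamp.
        acks = sum(1 for o in self.positions.values()
                   if o.term == mine.term and o.timestamp >= mine.timestamp)
        if acks > len(self.positions) // 2:
            self.last_committed = mine

# Replaying the Fig 2.c scenario: A is a stale primary in term 2.
p = Primary("A", term=2, members=["A", "B", "C", "D", "E"])
p.positions["A"] = OpTime(2, 5)
p.on_update_position("B", OpTime(2, 5), sender_term=2)  # only 2 of 5 acks
p.on_update_position("C", OpTime(2, 5), sender_term=3)  # higher term seen
# A steps down without ever advancing last_committed.
```

In this replay, the term-2 entry can still end up in a majority of logs via PullEntries, yet it is never committed, because commitment is gated on the primary's term rather than on replication counts alone.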
</p><p><br /></p><h2 style="text-align: left;">Links</h2><p><a href="https://www.youtube.com/watch?v=04ZI8HpFnCA&ab_channel=USENIX">Here is the NSDI'21 presentation of the paper.</a></p><p>Aleksey has <a href="https://charap.co/reading-group-fault-tolerant-replication-with-pull-based-consensus-in-mongodb/">a review of the paper</a> accompanied by a <a href="https://youtu.be/nY_As3VooB8">presentation video</a>.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-28667058382140507442024-01-19T10:49:00.005-05:002024-01-19T13:20:07.532-05:00 Looking Back at Postgres <p><a href="https://arxiv.org/pdf/1901.01973.pdf">This is a 2019 article</a> by Joe Hellerstein, the Jim Gray Professor of Computer Science at UC Berkeley. Last year at Sigmod, <a href="https://muratbuffalo.blogspot.com/2023/07/sigmod23-industry-talks-and-ted-codd.html">Joe was awarded the Ted Codd innovation award where he gave an awesome overview of his research agenda.</a></p><p>This article, written to be included in Stonebraker’s Turing Award book, provides a retrospective on the Postgres project, which Stonebraker led from the mid-1980’s to the mid-1990’s. I love this article a lot, because as I have written before, <a href="https://muratbuffalo.blogspot.com/2021/12/learning-technical-subject.html">context is my crack</a>: "The more context I know, the better I become able to locate something almost spatially, and the more I can make sense of it. Even reading the history and motivation for the subject can give my understanding a big boost." The entire paper is context, and it even has a section titled "Context", how cool is that? The footnotes in the article are also excellent! Very interesting gems there as well.</p><p>Disclaimer: I use a lot of text from the article to summarize it. The features and the impact sections in this write-up are just text lifted from the article. (Don't taze me bro!)
</p><p><br /></p><h1 style="text-align: left;">Postgres origin story</h1><p>Riding on the success of the Ingres project at Berkeley, and the subsequent start-up Relational Technology, Inc. (RTI), Stonebraker began working on database support for data types beyond the traditional rows and columns of Codd's relational model in the early 1980s. A motivating example was to provide database support for Computer-Aided Design (CAD) tools for the microelectronics industry, including "new data types such as polygons, rectangles, text strings, etc.," "efficient spatial searching," "complex integrity constraints," and "design hierarchies and multiple representations" of the same physical constructions.</p><p>What the hey? I didn't expect this to be the origin story of Postgres!! This is almost exactly the motivation for the MongoDB document database in 2007, a good 25 years later. </p><p>Postgres was "Post-Ingres": a system designed to take what Ingres could do, and go beyond. The signature theme of Postgres was the introduction of what Stonebraker eventually called Object-Relational database features: support for object-oriented programming ideas within the data model and declarative query language of a database system. But Stonebraker also decided to pursue a number of other technical challenges in Postgres that were independent of object-oriented support, including active database rules, versioned data, tertiary storage, and parallelism.</p><p>So Postgres was Stonebraker's grand effort to build a one-size-fits-all database system. This is ironic, because later he (when he joined MIT) published the <a href="https://cs.brown.edu/~ugur/fits_all.pdf">"One size does not fit all" paper</a>. Joe also picks up on this, and he sides with the Berkeley/Stonebraker approach "that a broad majority of database problems can be solved well with a good general-purpose architecture."
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiQyKNDqWwDr5pttQ5syr2-Eamy21Ojf77xs79FR5p4DVvKEIaHtOR-wH6Q6KLeANhHJwdMMtnbtnAJAjfc0b5wJNIw12Qpws0vMUfJBz6smroWCrX2xjMsp8ovEptNtmnzKXXI7wgm3nz-nRBaAl6i7NJMqF4OyMDmtpFD6PRrNlB6tusNxrc0Ll43dyc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1502" data-original-width="1452" src="https://blogger.googleusercontent.com/img/a/AVvXsEiQyKNDqWwDr5pttQ5syr2-Eamy21Ojf77xs79FR5p4DVvKEIaHtOR-wH6Q6KLeANhHJwdMMtnbtnAJAjfc0b5wJNIw12Qpws0vMUfJBz6smroWCrX2xjMsp8ovEptNtmnzKXXI7wgm3nz-nRBaAl6i7NJMqF4OyMDmtpFD6PRrNlB6tusNxrc0Ll43dyc=s16000" /></a></div><p>The article reviews these features. What intrigued me was that only a couple of these were hits, and most were misses. Joe says: "Many of these topics were addressed in Postgres well before they were studied or reinvented by others; in many cases Postgres was too far ahead of its time and the ideas caught fire later, with a contemporary twist."</p><p>So, then, what made Postgres so successful? What was the big hit? It may be the flexible opensource code base and the dynamic group behind it that kept this going. Berkeley had been a hotbed of software development with <a href="https://en.wikipedia.org/wiki/Berkeley_Software_Distribution">BSD</a> and other projects coming out at the time.</p><p>Stonebraker acknowledges this very humbly, and Joe summarizes the lesson here as "do something important and set it free." </p><p></p><blockquote>[A] pick-up team of volunteers, none of whom have anything to do with me or Berkeley, have been shepherding that open source system ever since 1995. The system that you get off the web for Postgres comes from this pick-up team.
It is open source at its best and I want to just mention that I have nothing to do with that and that collection of folks we all owe a huge debt of gratitude to.</blockquote><p></p><p>In the lessons section at the end of the article, Joe also talks about the second system effect as follows. I agree with what he says. I think another reason this worked is the incremental delivery of the system, with piecewise student projects. </p><p>The highest-order lesson I draw comes from the fact that Postgres defied Fred Brooks’ “Second System Effect” (1975). Brooks argued that designers often follow up on a successful first system with a second system that fails due to being overburdened with features and ideas. Postgres was Stonebraker’s second system, and it was certainly chock full of features and ideas. Yet the system succeeded in prototyping many of the ideas, while delivering a software infrastructure that carried a number of the ideas to a successful conclusion. This was not an accident --at base, Postgres was designed for extensibility, and that design was sound. With extensibility as an architectural core, it is possible to be creative and stop worrying so much about discipline: you can try many extensions and let the strong succeed. Done well, the “second system” is not doomed; it benefits from the confidence, pet projects, and ambitions developed during the first system. This is an early architectural lesson from the more “server-oriented” database school of software engineering, which defies conventional wisdom from the “component-oriented” operating systems school of software engineering.</p><p><br /></p><h1 style="text-align: left;">Features</h1><h2 style="text-align: left;">Complex objects</h2><p>Relational modeling religion dictated that data should be restructured and stored in an unnested format, using multiple flat entity tables (orders, products) with flat relationship tables (product_in_order) connecting them.
But in some cases you want to store the nested representation, because it is natural for the application. </p><p>Postgres retained tables as its "outermost" data type, but allowed columns to have "complex" types including nested tuples or tables. One of its more esoteric implementations, first explored in the ADT-Ingres prototype, was to allow a table-typed column to be specified declaratively as a query definition: "Quel as a data type".</p><p>As Postgres has grown over the years (and shifted syntax from Postquel to versions of SQL that reflect many of these goals), it has incorporated support for nested data like XML and JSON into a general-purpose DBMS without requiring any significant rearchitecting. The battle swings back and forth, but the Postgres approach of extending the relational framework with extensions for nested data has shown time and again to be a natural end-state for all parties after the arguments subside.</p><h2 style="text-align: left;">User-defined abstract data types and functions</h2><p>Postgres pioneered the idea of having opaque, extensible Abstract Data Types (ADTs), which are stored in the database but not interpreted by the core database system. To enable queries that interpret and manipulate these objects, an application programmer needs to be able to register User-Defined Functions (UDFs) for these types with the system, and be able to invoke those UDFs in queries. User-Defined Aggregate (UDA) functions are also desirable to summarize collections of these objects in queries. Postgres was the pioneering database system supporting these features in a comprehensive way.</p><p>Why put this functionality into the DBMS, rather than the applications above? 
The classic answer was the significant performance benefit of “pushing code to data,” rather than “pulling data to code.” Postgres showed that this is quite natural within a relational framework: it involved modest changes to a relational metadata catalog, and mechanisms to invoke foreign code, but the query syntax, semantics, and system architecture all worked out simply and elegantly.</p><h2 style="text-align: left;">Extensible access methods for new datatypes</h2><p>This problem was au courant at the time of Postgres, and <a href="https://en.wikipedia.org/wiki/R-tree">the R-tree</a> developed by Antonin Guttman (1984) in Stonebraker’s group was one of the most successful new indexes developed to solve this problem in practice. Still, the invention of an index structure does not solve the end-to-end systems problem of DBMS support for multi-dimensional range queries. Many questions arise. Can you add an access method like R-trees to your DBMS easily? Can you teach your optimizer that said access method will be useful for certain queries? Can you get concurrency and recovery correct?</p><p>R-trees became a powerful driver and the main example of the elegant extensibility of Postgres’ access method layer and its integration into the query optimizer. Postgres demonstrated --in an opaque ADT style-- how to register an abstractly described access method (the R-tree, in this case), and how a query optimizer could recognize an abstract selection predicate (a range selection in this case) and match it to that abstractly described access method.</p><p>PostgreSQL today leverages both the original software architecture of extensible access methods (it has B-tree, GiST, SP-GiST, and Gin indexes) and the extensibility and high concurrency of the Generalized Search Tree (GiST) interface.
GiST indexes power the popular PostgreSQL-based PostGIS geographic information system; Gin indexes power PostgreSQL’s internal text indexing support.</p><h2 style="text-align: left;">Active Databases and Rule Systems</h2><p>Stonebraker’s work on database rules began with Eric Hanson’s Ph.D., which initially targeted Ingres but quickly transitioned to the new Postgres project. It expanded to the Ph.D. work of Spyros Potamianos on PRS2: Postgres Rules System 2. A theme in both implementations was the potential to implement rules in two different ways. One option was to treat rules as query rewrites, reminiscent of the work on rewriting views that Stonebraker pioneered in Ingres. In this scenario, a rule logic of "on condition then action" is recast as "on query then rewrite to a modified query and execute it instead." For example, a query like "append a new row to Mike’s list of awards" might be rewritten as "raise Mike’s salary by 10%." The other option was to implement a more physical "on condition then action," checking conditions at a row level by using locks inside the database. When such locks were encountered, the result was not to wait (as in traditional concurrency control), but to execute the associated action.</p><p>In the end, neither the query rewriting scheme nor the row-level locking scheme was declared a "winner" for implementing rules in Postgres—both were kept in the released system. Eventually all of the rules code was scrapped and rewritten in PostgreSQL, but the current source still retains both the notions of per-statement and per-row triggers.</p><h2 style="text-align: left;">Log-centric Storage and Recovery</h2><p>Stonebraker described his design for the Postgres storage system this way:</p><p>When considering the POSTGRES storage system, we were guided by a missionary zeal to do something different. 
All current commercial systems use a storage manager with <a href="https://muratbuffalo.blogspot.com/2023/04/aries-transaction-recovery-method.html">a write-ahead log (WAL)</a>, and we felt that this technology was well understood. Moreover, the original Ingres prototype from the 1970s used a similar storage manager, and we had no desire to do another implementation. </p><p>Over the years, Stonebraker repeatedly expressed distaste for the complex write-ahead logging schemes pioneered at IBM and Tandem for database recovery. One of his core objections was based on a software engineering intuition that nobody should rely upon something that complicated--especially for functionality that would only be exercised in rare, critical scenarios after a crash.</p><p>In the end, the Postgres storage system never excelled on performance; versioning and time-travel were removed from PostgreSQL over time and replaced by write-ahead logging. This is because, once the commercial vendors had write-ahead logs working well, they had innovated on follow-on ideas such as transactional replication based on log shipping, which would be difficult in the Postgres scheme.</p><h2 style="text-align: left;">Support for Multiprocessors: XPRS</h2><p>Stonebraker never architected a large parallel database system, but he led many of the motivating discussions in the field. 
His “Case for Shared Nothing” paper (1986) documented the coarse-grained architectural choices in the area; it popularized the terminology used by the industry, and threw support behind shared-nothing architectures like those of Gamma and Teradata, which were rediscovered by the Big Data crowd in the 2000s.</p><p>The basic idea of what Stonebraker called “The Wei Hong Optimizer” was to cut the problem in two: run a traditional single-node query optimizer in the style of System R, and then “parallelize” the resulting single-node query plan by scheduling the degree of parallelism and placement of each operator based on data layouts and system configuration. This approach is heuristic, but it makes parallelism an additive cost to traditional query optimization, rather than a multiplicative cost. Although “The Wei Hong Optimizer” was designed in the context of Postgres, it became the standard approach for many of the parallel query optimizers in industry.</p><h2 style="text-align: left;">Support for a Variety of Language Models</h2><p><b><u>One of Stonebraker’s recurring interests since the days of Ingres was the programmer API to a database system. </u></b>The OODB idea was to make programming language objects be optionally marked “persistent,” and handled automatically by an embedded DBMS. Postgres supported storing nested objects and ADTs, but its relational-style declarative query interface meant that each roundtrip to the database was unnatural for the programmer (requiring a shift to declarative queries) and expensive to execute (requiring query parsing and optimization). To compete with the OODB vendors, Postgres exposed a so-called “Fast Path” interface: basically a C/C++ API to the storage internals of the database. This enabled Postgres to be moderately performant in academic OODB benchmarks, but never really addressed the challenge of allowing programmers in multiple languages to avoid the impedance mismatch problem. 
Instead, Stonebraker branded the Postgres model as “Object-Relational,” and simply sidestepped the OODB workloads as a “zero-billion dollar” market. Today, essentially all commercial relational database systems are “Object-Relational” database systems.</p><p>This application-level approach is different than both OODBs and Stonebraker’s definition of Object-Relational DBs. In addition, lightweight persistent key-value stores have succeeded as well, in both non-transactional and transactional forms. These were pioneered by Stonebraker’s Ph.D. student Margo Seltzer, who wrote BerkeleyDB as part of her Ph.D. thesis at the same time as the Postgres group, which presaged the rise of distributed “NoSQL” key-value stores like Dynamo, MongoDB, and Cassandra.</p><p> </p><h1 style="text-align: left;">Impact</h1><h2 style="text-align: left;">Opensource Impact</h2><p>As the Postgres research project was winding down, two students in Stonebraker’s group—Andrew Yu and Jolly Chen—modified the system’s parser to accept an extensible variant of SQL rather than the original Postquel language. The first Postgres release supporting SQL was Postgres95; the next was dubbed PostgreSQL.</p><p>A set of open-source developers became interested in PostgreSQL and “adopted” it even as the rest of the Berkeley team was moving on to other interests. Over time the core developers for PostgreSQL have remained fairly stable, and the open-source project has matured enormously. 
Early efforts focused on code stability and user-facing features, but over time the open source community made significant modifications and improvements to the core of the system as well, from the optimizer to the access methods and the core transaction and storage system.</p><p>While many things have changed in 25 years, the basic architecture of PostgreSQL remains quite similar to the university releases of Postgres in the early 1990s, and developers familiar with the current PostgreSQL source code would have little trouble wandering through the Postgres 3.1 source code (c. 1991). Everything from source code directory structures to process structures to data structures remains remarkably similar. <b><u>The code from the Berkeley Postgres team had excellent bones.</u></b></p><p>PostgreSQL today is without question the most high-function open-source DBMS, supporting features that are often missing from commercial products. It is also (according to one influential rankings site) the most popular independent open-source database in the world, and its impact continues to grow: in both 2017 and 2018 it was the fastest-growing database system in the world in popularity. PostgreSQL is used across a wide variety of industries and applications, which is perhaps not surprising given its ambition of broad functionality.</p><p>Heroku is a cloud SaaS provider that is now part of Salesforce. Postgres was adopted by Heroku in 2010 as the default database for its platform. Heroku chose Postgres because of its operational reliability. With Heroku’s support, more major application frameworks such as Ruby on Rails and Python for Django began to recommend Postgres as their default database.</p><h2 style="text-align: left;">Commercial adaptations</h2><p>Many of the commercial efforts that built on PostgreSQL have addressed what is probably its key limitation: the ability to scale out to a parallel, shared-nothing architecture.
These include Illustra, Netezza, Greenplum, EnterpriseDB, AsterData, ParAccel (acquired by Amazon and forming a basis for AWS Redshift), and Citus.</p><p>Although the article doesn't mention it, AWS RDS and AWS Aurora also provide managed Postgres services and are big.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com1tag:blogger.com,1999:blog-8436330762136344379.post-70216990565982192212024-01-17T20:21:00.007-05:002024-01-17T20:21:56.332-05:00Scalable OLTP in the Cloud: What’s the BIG DEAL?<p><a href="https://www.cidrdb.org/cidr2024/papers/p63-helland.pdf">This paper</a> is from Pat Helland, the apostate philosopher of database systems, overall a superb person, and a good friend of mine. The paper appeared this week at CIDR'24. (Check out <a href="https://www.cidrdb.org/cidr2024/program.html">the program</a> for other interesting papers.) The motivating question behind this work is: "<b>What are the asymptotic limits to scale for cloud OLTP (OnLine Transaction Processing) systems?</b>" Pat says that the CIDR 2023 paper <a href="https://muratbuffalo.blogspot.com/2023/01/is-scalable-oltp-in-cloud-solved.html">"Is Scalable OLTP in the Cloud a Solved Problem?"</a> prompted this question. </p><p>The answer to the question? Pat says that the answer lies in the joint responsibility of the database and the application. If you know of Pat's work (I have summarized <a href="https://muratbuffalo.blogspot.com/search?q=Helland">several of his papers in this blog</a>), you would know that Pat has been advocating along these lines before. But this paper provides a very crisp, specific, concrete answer. Read on for my summary of the paper.</p><p>Disclaimer: This is a wisdom- and technical-detail-packed 13-page paper, so I will try my best to summarize the salient points. I will be using text from the paper to explain/summarize it. (Don't taze me bro!)
</p><p><br /></p><h1 style="text-align: left;">Snapshot Isolation (SI) is a BIG DEAL</h1><p>The database and the application have a BIG DEAL: their isolation semantics! In particular, snapshot isolation (SI) is the sweet spot. At this point, I got a nice database history lesson on how the isolation semantics evolved. I would have guessed the semantics had become more strict over time. No, on the contrary, they evolved to be more relaxed to meet performance and scalability expectations. And SI does hit a sweet spot in that it still provides the user good isolation guarantees without jeopardizing the scaling behavior of the database by requiring it to serialize everything. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEixuW5RhqakvtCihi8JM91NYzNa7Ud6z3G2SM7D2fS4z1B60hF8w8pZOtnL7TW6n5HJHu9Uap4RNJO9lBCCy5LlXR2FaLterr1EOyZo7cNy7G2hvI0Z82BNQFdPywPLidWFFa76aPyaiVGg5G7ZdpneCtiDASM2uDZZZcLYmMhOW-tlJJVi4FdTMs7tMBI" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="722" data-original-width="676" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEixuW5RhqakvtCihi8JM91NYzNa7Ud6z3G2SM7D2fS4z1B60hF8w8pZOtnL7TW6n5HJHu9Uap4RNJO9lBCCy5LlXR2FaLterr1EOyZo7cNy7G2hvI0Z82BNQFdPywPLidWFFa76aPyaiVGg5G7ZdpneCtiDASM2uDZZZcLYmMhOW-tlJJVi4FdTMs7tMBI=w375-h400" width="375" /></a></div><p>In the rest of the paper, keep in mind that an OLTP system is defined as a domain-specific application using an <b>RCSI (READ COMMITTED SNAPSHOT ISOLATION) SQL database</b> to provide transactions across many concurrent users.</p><p>The BIG DEAL splits the scaling responsibilities between the database and the application.</p><p></p><ul style="text-align: left;"><li><b><u>Scalable DBs don’t coordinate across disjoint TXs updating different keys.</u></b></li><li><b><u>Scalable apps don’t concurrently update the same key.</u></b></li></ul><p></p><p>The big deal provides guarantees from the
DB to the App. A scalable application can read all it wants. Updates to disjoint records don’t coordinate across TXs. Row-locks on disjoint records don’t coordinate across TXs.</p><p>Applications must tolerate these big deal disclaimers. Reads return snapshots: records have no "current" value. There is no NOW in a BIG DEAL database! Transactions may abort at any time, but not too often. SELECT with SKIP LOCKED may return only a subset of the qualifying records.</p><p>This means applications should change business behavior in order to scale. They can only provide a fuzzy/blurry view of the "current" state/changes. So, apps introduce ambiguity in biz-domain-specific ways: online retail makes ambiguous promises such as "Usually ships in 24 hours". And apps provide delayed truth: the finances of a large company may take days to summarize. Many OLTP apps aggregate values synchronously as they interact with humans. Public TPC benchmarks (e.g., TPC-A, TPC-B, and TPC-C) mandated synchronous aggregations. But as applications scale, they should rethink concentrating the aggregated values of business state in dedicated records.
By slowly and asynchronously aggregating this business state, the application can scale in a domain-specific manner.</p><p><br /></p><h1 style="text-align: left;">Today's OLTP databases don't scale</h1><p>Before suggesting a hypothetical scalable database that satisfies the database side of the big deal, Pat shows us why today’s databases don’t scale!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEha3LKPdGbzAgf0z8Gr8UwOfZIE5gviSpIYJl7gan52cd7k-tmgXKuzJbXt_je7-vnRpsi_TNc_61xbOZuP5hlRFPPyYdawO4ldh4tXQdCAZH_qH7ehfd-C9Y_t2WJbnrRgLIaNO11wTXITw-yvVn03mjkqR92P3qgcfQ56BRi-5yEAG9EvEs-8oSjyRKs" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="306" data-original-width="1528" src="https://blogger.googleusercontent.com/img/a/AVvXsEha3LKPdGbzAgf0z8Gr8UwOfZIE5gviSpIYJl7gan52cd7k-tmgXKuzJbXt_je7-vnRpsi_TNc_61xbOZuP5hlRFPPyYdawO4ldh4tXQdCAZH_qH7ehfd-C9Y_t2WJbnrRgLIaNO11wTXITw-yvVn03mjkqR92P3qgcfQ56BRi-5yEAG9EvEs-8oSjyRKs=s16000" /></a></div><p>In today's MVCC databases, reads & writes fight to access the "current" value of a record. The current version has a <b>home</b> location (a partition, server, or a B+ tree) holding the most recently committed version of the record or perhaps an uncommitted version. To update a record, exclusive access to the record's home is required. This causes infighting, contention, and coordination between the updating TX and any concurrent reading TXs.</p><p>Even reads contend with each other, since these implementations force MVCC readers to start out looking at the latest version of a key first. Coordination may also be needed to access neighboring records. Accessing key-ranges in B+Trees or similar data structures that may be changing needs cross-transaction coordination.</p><p>Readers coordinate with writers. Writers coordinate with readers.
Readers coordinate with other readers!</p><p>Having a home for a record also makes online repartitioning/sharding (which is required for scalability) very difficult. Moving record keys from one partition to another is complex and impacts application availability.</p><p>To address these challenges, Pat proposes a prototype design. The database is structured so that there is no pre-assigned home for a record's key. Unlike partitioned DBs, this design allows the database to seamlessly adapt to workload changes.</p><p>I liken this to the <a href="https://en.wikipedia.org/wiki/Everything_Is_Miscellaneous">miscellaneous manifesto</a>: instead of neatly allocating everything its place (which inevitably fails, requiring incessant re-orgs), embrace the messiness and use a search engine to get to information quickly.</p><p><br /></p><h1 style="text-align: left;">Rethinking OLTP databases</h1><p>The architecture is based on <a href="https://muratbuffalo.blogspot.com/2022/01/decoupled-transactions-low-tail-latency.html">a design Pat explored in a previous work</a>. 
That work is very technical, and I had missed its nuances and contributions because I didn't read through the appendix covering the details.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgBmeGMqsLQWCyBjr7ikas0mIkFdZ-6nVculH07Da5GPn2-TZ7B2XNXNvM_7Ux11JAAAWOR_nbOV98pZr72ILraXKqnhoIGV9n8Kr_rXX8F4Hge4Qj5Z5-GZowCyH6Fg25yoejZlQNgUTfQcDVGL-ELNaeE4eATUrT-3Go1hAPgzlnFsEBXkQE3ZBiRngo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="502" data-original-width="832" src="https://blogger.googleusercontent.com/img/a/AVvXsEgBmeGMqsLQWCyBjr7ikas0mIkFdZ-6nVculH07Da5GPn2-TZ7B2XNXNvM_7Ux11JAAAWOR_nbOV98pZr72ILraXKqnhoIGV9n8Kr_rXX8F4Hge4Qj5Z5-GZowCyH6Fg25yoejZlQNgUTfQcDVGL-ELNaeE4eATUrT-3Go1hAPgzlnFsEBXkQE3ZBiRngo=s16000" /></a></div><p><b>Owner servers</b> verify that concurrent transactions have not created any conflicting updates for each key row-locked or updated by the TX that optimistically hopes to commit. Owner servers are partitioned by both key-range and time-range. Repartitioning happens dynamically to accommodate scale. </p><p><b>Worker servers</b> are also horizontally scalable, and each has its own transaction log. As TX load increases, workers are added. Each TX happens at a single worker server. The worker servers accept connections from app servers, perform transactions & their queries, commit transactions to their per-worker log, and periodically flush committed new record-versions to the <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">LSM (log structured merge tree)</a>.</p><p><b>LSM servers</b> accept flushes from workers and incorporate them into the orderly past stored in the LSM. Record-versions are organized first by time, second by key. Each LSM layer contains record-versions for a band of time. 
With an LSM, the past scales without coordinating across disjoint transactions reading and updating!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhyhoxQSRPZsc1QWeYCex7Gm209SlW0kQMH53tGgHyt5YwnwXwOdDyPbOgYCk8dtiVP-pSxdK_xCI1tS4JCdDukqtg_-TvgwRCHoRFIcHWBs7mITLcdpkmqsx5fwMuOAU9xdweo7NHRS6V11VNBnLVlU_fH6iD1DBLVCSr6bk6oI9E23xoqwLe_FtNzAos" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="354" data-original-width="832" height="170" src="https://blogger.googleusercontent.com/img/a/AVvXsEhyhoxQSRPZsc1QWeYCex7Gm209SlW0kQMH53tGgHyt5YwnwXwOdDyPbOgYCk8dtiVP-pSxdK_xCI1tS4JCdDukqtg_-TvgwRCHoRFIcHWBs7mITLcdpkmqsx5fwMuOAU9xdweo7NHRS6V11VNBnLVlU_fH6iD1DBLVCSr6bk6oI9E23xoqwLe_FtNzAos=w400-h170" width="400" /></a></div><p><br /></p><h1 style="text-align: left;">Transaction execution and commit</h1><p>We now deep dive into workers and owners, as they are the most significant components in this OLTP architecture. The owners do the concurrency control (the adjudication of transactions with respect to other concurrent transactions), but the workers do the actual work of the transaction. The transaction is centralized in the worker's log. The workers' logs are ingested by LSM servers for later consumption and durability.</p><p>The worker will accept incoming connections from application servers, and plan/execute SQL statements: Reading with snapshots by key or key-range, acquiring row-locks using their unique record key, and updating records by their unique key. The worker will guess a future commit time, by which its updates and row-locks will hopefully have been verified not to conflict with concurrent TXs. 
The worker will then log the transaction’s updates & commit record in its local transaction log, which will then be fed into LSM servers.</p><p>Since the commit-time for a transaction is only a guess by the worker, the owner servers must verify that every update and row-lock sees no conflicting updates from snapshot to commit. As incoming proposed-updates and verify-locks arrive, they include a proposed-commit-time. Owner-servers align commit-time for records & workers. An incoming request from a worker hopefully arrives at the owner-server before its local clock has reached the proposed-commit-time. If it arrives after commit-time, the owner-server returns an error and the TX aborts. If it arrives before commit-time, the owner-server waits until its local clock reaches commit-time.</p><p>What are row-locks you ask?</p><p>Row-locks allow the application to ask the database for help with concurrency across transactions. Acting as traffic cops, they provide pessimistic concurrency control. They stall later transactions that try to acquire a row-lock held by an earlier transaction. This pessimistic ordering of transactions may be violated when failures happen. Competing transactions usually wait to allow the lock holder to go first, but that may be flawed. So correctness will be enforced by OCC prior to commit. Of course, row-locks are moot when scalable apps avoid concurrent updates to the same records. But if the app experiences concurrent updates to the same records, row-locks can help with the liveness of transactions when the DB uses them to function as a traffic cop.</p><p>Ok, let's wrap up the transaction execution discussion by talking about how owners can be horizontally scaled. Owners can close for new business and direct new proposed-updates elsewhere. An owner closed for new business only accepts worker requests for snapshot reads in its rectangle of key & time ranges, proposed updates, and notifications of transaction outcome. 
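To make the commit-time alignment concrete, here is a toy sketch of the owner-side check (my own simplification in Python; the `OwnerServer.propose_update` name and API are made up, not from the paper, and the actual conflict verification is elided):

```python
import time

class OwnerServer:
    """Toy model of an owner server aligning a worker's proposed
    commit-time with its local clock. (A hypothetical sketch: the real
    design also verifies conflicting updates from snapshot to commit.)"""

    def propose_update(self, key, new_version, proposed_commit_time):
        now = time.time()
        if now > proposed_commit_time:
            # The request arrived after the guessed commit-time already
            # passed on the owner's clock: return an error, the TX aborts.
            return "ABORT"
        # Otherwise wait until the local clock reaches the proposed
        # commit-time, then (after conflict checks, elided here) accept.
        time.sleep(proposed_commit_time - now)
        return "ACCEPT"
```

A worker that guesses its commit-time too aggressively gets aborted; a conservative guess merely makes the owner wait a bit longer on its local clock.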
In contrast, an owner open for new business also allows new proposed-updates and new verify-locks.</p><p><br /></p><h1 style="text-align: left;">Massive Scale: It’s About Time!</h1><p>As we have seen, the DB leverages time to provide snapshots, commits, and external consistency. External Consistency ensures new incoming requests see all previously exposed data, even by other database connections. That means snapshot reads from new incoming work must be after all committed work previously visible outside the database.</p><p>By using current time, T-now, as the snapshot time, this is easy. But this would get trickier and more complex as the geographic scope of a DB grows past a single datacenter.</p><p>Overall, this prototype database architecture is a big vindication for using time in systems. (Some of these ideas have been explored in Pat's earlier paper, under seniority and retirement.) Everything in the database is versioned by the record-version commit time. The database organizes data by its creation time to achieve scaling. Reads are old record-versions as of a past snapshot. Row-locks ensure locked records remain unchanged until commit time. And updates materialize as new record-versions for later snapshots.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com5tag:blogger.com,1999:blog-8436330762136344379.post-54537242224832711042024-01-11T21:31:00.001-05:002024-01-11T21:31:11.755-05:00 Oblivious Paxos: Privacy-Preserving Consensus Over Secret-Shares<p><a href="https://fadhil.id/papers/opaxos-techreport.pdf">This paper appeared in SOCC'23.</a> The paper presents a primary-backup secret-shared state machine (PBSSM) architecture and the associated consensus protocol, Oblivious Paxos (OPaxos). 
OPaxos enables privacy-preserving consensus by allowing acceptors to safely and consistently agree on a secret-shared value without untrusted acceptors knowing the value.</p><p><br /></p><h2 style="text-align: left;">OPaxos protocol overview</h2><p>OPaxos uses <b>(t, n)</b> threshold secret-sharing. This means generating <b>n</b> secret-shares from a single secret value such that it is possible to reconstruct the secret with just <b>t</b> shares.</p><p>In order to make (t, n) threshold secret-sharing play well with Paxos, the protocol requires that the cardinality of the intersection of any phase1 quorum and phase2 quorum is at least t. </p><p>This can be achieved by choosing the two quorum sizes to sum to n+t. More concretely, one quorum (say phase1) would have cardinality the ceiling of (n+t)/2, and the other quorum (phase2) the floor of (n+t)/2. The justification is quite simple. The cardinalities of the two quorums sum to n+t, but there are only n <a href="https://en.wikipedia.org/wiki/Pigeonhole_principle">pigeon-holes</a> (erm, total nodes), so we know that these two quorums intersect in at least t nodes.</p><p>In the setup below n=5 and t=2. That means p1=4 nodes, and p2=3 nodes. 
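The pigeonhole arithmetic is easy to sanity-check with a short sketch (my own illustration in Python; `opaxos_quorum_sizes` is a made-up name, not code from the paper):

```python
from math import ceil, floor

def opaxos_quorum_sizes(n: int, t: int):
    """Pick phase1/phase2 quorum sizes summing to n + t, so that any
    phase1 quorum and any phase2 quorum overlap in at least t of the
    n acceptors (pigeonhole: q1 + q2 - n >= t)."""
    q1 = ceil((n + t) / 2)   # phase1 quorum size
    q2 = floor((n + t) / 2)  # phase2 quorum size
    assert q1 + q2 - n >= t  # worst-case intersection holds >= t shares
    return q1, q2

print(opaxos_quorum_sizes(5, 2))  # the setup discussed here: prints (4, 3)
```

For n=5 and t=2 this yields quorum sizes 4 and 3, matching p1 and p2 above.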
Note that leaders are only allowed to be located in the trusted sites, and an acceptor in an untrusted site is not allowed to be a leader.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgvrxW-TiVctbRBIFCfTwloBXpy2siVA4OCxq1cBC0h2-KqVnjuF3ZPdnPeu-pcRH163neNGSeh2bTcmKU2aLGRHFmRKBdgqjyRRZ0T7P_IsY_lr-CStk0PsAbqzTeNWhWMkJFUw0wq6J1Danf4nCekV3D8vZ9qhEP9y_40fqicYg5sEWUxisZ3ckiglvQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="442" data-original-width="642" height="275" src="https://blogger.googleusercontent.com/img/a/AVvXsEgvrxW-TiVctbRBIFCfTwloBXpy2siVA4OCxq1cBC0h2-KqVnjuF3ZPdnPeu-pcRH163neNGSeh2bTcmKU2aLGRHFmRKBdgqjyRRZ0T7P_IsY_lr-CStk0PsAbqzTeNWhWMkJFUw0wq6J1Danf4nCekV3D8vZ9qhEP9y_40fqicYg5sEWUxisZ3ckiglvQ=w400-h275" width="400" /></a></div><br /><p></p><p></p><p>In the appendix, the paper shows how the (t,n) idea can be extended to Fast Paxos. Since the idea is simple and applies at the quorum intersection level, I think the idea also applies for other flavors, even for <a href="http://muratbuffalo.blogspot.com/2023/12/nezha-deployable-and-high-performance.html">Nezha, which we reviewed recently.</a></p><p><br /></p><h2 style="text-align: left;">What problem is the paper solving?</h2><p>Integrating the (t,n) threshold secret-sharing to Paxos is cute, but the motivation is not well justified. The paper gives hybrid cloud deployments as an application, where the cloud sites are designated as untrusted and leaders are run only at the trusted sites. 
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiN9mrWYdWFf4XJcet1fZFGvywZLHGi9LnLvCGJdz0nDJx1u_l-lLKP7l3hRWSrmeC9y8eMZhXvEdfIcKZmb6PIfv8umxtIECu9pOBlGgVn937i0LR5LNjagMjvPd7xUmBnN_th_Z5HEc16lDA2bq6-nB-W6C0JP6Rz1lhDGDzO27UlfCy7Wk99NHbuVjA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="442" data-original-width="642" height="275" src="https://blogger.googleusercontent.com/img/a/AVvXsEiN9mrWYdWFf4XJcet1fZFGvywZLHGi9LnLvCGJdz0nDJx1u_l-lLKP7l3hRWSrmeC9y8eMZhXvEdfIcKZmb6PIfv8umxtIECu9pOBlGgVn937i0LR5LNjagMjvPd7xUmBnN_th_Z5HEc16lDA2bq6-nB-W6C0JP6Rz1lhDGDzO27UlfCy7Wk99NHbuVjA=w400-h275" width="400" /></a></div><p>If the leader on the trusted site uses encryption for the content of the messages, and uses acceptors at the untrusted sites to accept and replicate encrypted content, we would achieve the same objective using vanilla Paxos. The untrusted sites would still be oblivious to the content of the messages they accept and logs they replicate, since they don't have the encryption keys.</p><p>The paper mentions that encryption does not offer information-theoretic privacy because a computationally rich adversary may break it eventually. This is not a practical concern. Breaking the encryption is in fact much more unlikely than the cloud providers used in (t,n) secret sharing colluding with each other to reveal the secret.</p><p>The paper does not justify the need for coupling secrecy with Paxos-consensus, when it is easy to achieve both benefits in a decoupled manner. Decoupled components are better for managing complexity and keeping flexible options for software/system evolution.</p><p>Note that, in OPaxos, none of those acceptors in the untrusted sites are allowed to become a leader anyway, because if they did they would have access to information from other nodes and could reconstruct the secrets, as they would obtain t or more shares. 
Actually, this opens an interesting attack vector for OPaxos. What stops them from becoming leaders?</p><p>So what is the benefit we gain from the untrusted sites, if availability of the system is limited to availability of at least one node in the trusted sites? The paper suggests that, if the nodes at the trusted sites become unavailable, the untrusted nodes can fall back to using <a href="https://en.wikipedia.org/wiki/Secure_multi-party_computation">secure multi-party computation (SMPC)</a> to preserve availability of the system while keeping secrecy. There are no details in the paper (except for a paragraph) as to how this can be achieved. SMPC is an involved topic and is currently not available for general computation required for state machine replication. So I don't think we can count this as an argument in favor of using OPaxos and (t,n) secret sharing. </p><p>But, as a rebound, the paper does mention an advantage to using OPaxos and (t,n) sharing if we consider the simple stateless problem of key-value store maintenance with just GET and PUT. Then thanks to (t,n) secret sharing, even when no trusted site is available, the client can reach out to the untrusted nodes, and get service in a secrecy preserving manner as follows: "To execute a request, a client dealer fetches the state partition it manages, e.g., a key-value pair, locally computes the result and re-distributes secret shares of any state modifications back to the untrusted servers (shown in Figure 8). 
It is straightforward to ensure request privacy by performing both read and write operations in a manner so as to appear identical, e.g., by always updating a nonce in the key-value pair even for reads to make them indistinguishable from writes."</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEivOgeaYx-NBxNFdxpqonrYgO5tY1VpwCG3j2ziU6QUK2Xbeu4xs5B_ZYAX40pXAAjQPRfsp99W1A-pQ644HXJ5FsY-Mu7ZgRmoWaX3KDiiEfb074UeJOehhdBHg1kYjwysXtRl1ChOPbePFWfu4xA0AmuQa2aSovrLR1dkohKO5x1SpWihqGRoMG20fJw" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="282" data-original-width="492" height="183" src="https://blogger.googleusercontent.com/img/a/AVvXsEivOgeaYx-NBxNFdxpqonrYgO5tY1VpwCG3j2ziU6QUK2Xbeu4xs5B_ZYAX40pXAAjQPRfsp99W1A-pQ644HXJ5FsY-Mu7ZgRmoWaX3KDiiEfb074UeJOehhdBHg1kYjwysXtRl1ChOPbePFWfu4xA0AmuQa2aSovrLR1dkohKO5x1SpWihqGRoMG20fJw" width="320" /></a></div><br /><p></p><p>The paper has Go implementation for <a href="https://github.com/opaxos/opaxos/blob/main/untrustedopaxos/untrustedopaxos.go">this untrusted execution mode of key-value store maintenance in its Github repo</a>. (And of course, no implementation for the SMPC mode.) 
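The client-dealer trick quoted above can be sketched as follows (my own illustration in Python: it uses naive n-out-of-n XOR sharing instead of the paper's (t, n) threshold scheme, and re-shares with fresh randomness in place of an explicit nonce, but it shows how a GET becomes indistinguishable from a PUT):

```python
import os
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(value: bytes, n: int):
    """Naive n-out-of-n XOR secret sharing (illustration only; OPaxos
    uses a (t, n) threshold scheme)."""
    rnd = [os.urandom(len(value)) for _ in range(n - 1)]
    return rnd + [reduce(_xor, rnd, value)]  # XOR of all shares == value

def combine(shares) -> bytes:
    return reduce(_xor, shares)

class ClientDealer:
    """Toy client dealer over untrusted share-holding servers (plain
    dicts here). Both GET and PUT end by distributing fresh shares, so
    the two operations look identical to each server."""

    def __init__(self, servers):
        self.servers = servers

    def put(self, key, value: bytes) -> None:
        for srv, share in zip(self.servers, split(value, len(self.servers))):
            srv[key] = share

    def get(self, key) -> bytes:
        value = combine([srv[key] for srv in self.servers])
        self.put(key, value)  # re-share: makes the read look like a write
        return value
```

Because every GET ends by writing fresh shares back, each untrusted server observes the same fetch-then-store pattern for both operations.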
For implementation of OPaxos and Fast-OPaxos, the paper builds on our (Ailidani Ailijiang, Aleksey Charapko, and Murat Demirbas) <a href="https://github.com/ailidani/paxi">Paxi framework implementation</a>.</p><p><br /></p><h2 style="text-align: left;">SMR maintenance in OPaxos</h2><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgCXMpeNnhsJ893OlLqACid8sBTWWi2Nyc6gidX88UyDs4Nck3wkSh8eN6tBtdIPNacWRbxcCfxsm9OJC9Xna4usulSf5Bly0k18aj4gliubDwi3-trGw6Z96_5bqhfzrrxBfNywvZWQKwbWVNsTogDqOh2nbJxAdxjzst36WULKyCeO96LpKwWyZhjVj0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="411" data-original-width="642" height="256" src="https://blogger.googleusercontent.com/img/a/AVvXsEgCXMpeNnhsJ893OlLqACid8sBTWWi2Nyc6gidX88UyDs4Nck3wkSh8eN6tBtdIPNacWRbxcCfxsm9OJC9Xna4usulSf5Bly0k18aj4gliubDwi3-trGw6Z96_5bqhfzrrxBfNywvZWQKwbWVNsTogDqOh2nbJxAdxjzst36WULKyCeO96LpKwWyZhjVj0=w400-h256" width="400" /></a></div><p>I want to mention another interesting tidbit about the paper. OPaxos shares the state-diff resulting from speculative operation execution at one of the leaders at the trusted sites. The paper argues that this avoids the need to have deterministic op-logs, but they miss the importance of <a href="https://en.wikipedia.org/wiki/Change_data_capture">change data capture (CDC).</a> Oplogs are essential for the CDC-based data/system integration in the cloud ecosystem via capturing and delivering changes made.</p><p>I think OPaxos would still be amenable to using oplogs, by having learners at trusted domains running as hot-swap-host catching up and executing from oplogs.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-22305465387142884232024-01-09T23:49:00.002-05:002024-01-09T23:50:33.627-05:00Dude, where's my Emacs?<p>It had been 3 years since I had to setup a new laptop. 
Past Murat had left me <a href="http://muratbuffalo.blogspot.com/2017/04/setting-up-new-mac-laptop.html">helpful</a> <a href="http://muratbuffalo.blogspot.com/2020/03/my-emacs-setup.html">instructions</a>, but I got a bad surprise while configuring my <a href="https://en.wikipedia.org/wiki/Emacs">Emacs</a> setup. My beloved starter-kits, which worked smoothly under Emacs 27, stopped working under Emacs 29. I tried to fix the errors, but this proved futile due to my limited elisp/emacs skills. </p><p>I explored alternative starter-kits to configure Emacs. Doom emerged as the dominant choice, but it appeared too large, bloated, and complex for my liking. I tried a couple of small starter kits, but I faced other problems and was unable to integrate my customizations and get to a reasonable setup. </p><p>I have been an Emacs user for 25 years, and I know a thing or two, but it seems everyone's now an Emacs pro. I don't get how people are able to accept and work with those complicated config files/directories. I started to speculate. Maybe the ordinary users left, leaving behind the proficient Emacs enthusiasts, who kept writing more and more intricate config files/directories. They got radicalized, man! That is why there are no user-friendly starter kits anymore. </p><p><br /></p><h2 style="text-align: left;">My Emacs needs</h2><p>I hate Emacs yak shaving, so I despised wasting several hours on this. Nevertheless, configuring Emacs to a decent functional state is essential for my productivity. I have two top priorities in my Emacs setup.</p><p>The first is org-mode for todo scheduling/tracking and for hierarchical note-taking. I rely on org-mode for taking meeting notes, writing long-form text like blog posts, and in general thinking through writing. I even use it for<a href="https://en.wikipedia.org/wiki/Literate_programming"> literate programming</a> for developing programs/protocols, and writing TLA+ models. 
Those org-mode files get pretty long, exceeding 10K lines, filled with many hierarchical tasks and todo lists. </p><p>I use global-hi-lock-mode in Emacs to introduce color-highlighting into my notes using special punctuation. For instance, when I use "??", it renders the line yellow, signaling outstanding questions. "@@" transforms the line into green, emphasizing noteworthy ideas or observations. Finally, "!!" marks the line as red, indicating a warning or an important point. This system significantly enhances my writing/thinking workflow.</p><p>My second top priority is the crafting of plain text presentations using org-mode, compiling them with LaTeX into beamer-pdf slides to achieve polished results. This proved invaluable in times when I needed to quickly put together a presentation or customize an existing one. I hate the constant fiddling required by graphical presentation software like Keynote and Powerpoint. Writing presentations in org-mode is significantly more convenient and faster. It provides a frictionless experience, allowing me to concentrate solely on content without being distracted by formatting, yet producing visually appealing presentations. Content is king.</p><p><br /></p><h2 style="text-align: left;">Back to the future</h2><p>After investing additional hours, I got my Emacs setup to function well. I reverted to my approach from eight years ago, using a single bare-knuckle init.el file instead of starter-kits. Surprisingly, Emacs defaults improved in recent years, and encouraged by this, I adopted this minimalist approach. I made minimal customizations in the init.el file, specifically for org-mode, and I also incorporated the "leuven" theme and the "Monaco" font.</p><p>Having installed Emacs via homebrew, I leveraged Elpa for package management. 
With mactex installed through homebrew, I used Elpa to install auctex, and this sufficed for successful LaTeX compilation and smooth beamer-pdf generation from org files.</p><p>I decided to adopt the default mainstream shortcuts in org-mode for task transitions and date manipulation, as they have improved significantly. Although it took time for my fingers to adjust, I am happy to avoid overly customized shortcuts to prevent potential issues in the future. My goal was to get a usable setup, prioritizing practicality over super-optimization. Premature optimization is the root of all evil.</p><p><br /></p><h2 style="text-align: left;">Reflections</h2><p>Losing my Emacs setup would have been a significant blow to my productivity. Without those capabilities, I'd struggle with note-taking, thinking through writing, blog post creation, presentation writing, and task/project management.</p><p>I remember <a href="https://muratbuffalo.blogspot.com/2019/03/book-review-draft-no-4-by-john-mcphee.html">reading about McPhee's experience</a> using an old text-based Unix editor. The discontinuation of the editor led him to continue on old computers and eventually enlist a developer to adapt it to his new computer.</p><p>Similarly, some colleagues heavily reliant on vcc for proofs faced challenges when it wasn't supported in recent OS versions. They had to use docker containers to recreate that environment.</p><p>It's unsettling how fragile systems become and how dependent we become on our tools. 
Losing them almost amounts to a loss of memory.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-40357678344836865452024-01-05T09:19:00.004-05:002024-01-05T09:19:26.093-05:00Recent reads<p>Here are brief reviews for the 6 books I read in the last couple of months.</p><p><br /></p><h2 style="text-align: left;">How Not to Be Wrong: The Power of Mathematical Thinking (Jordan Ellenberg, 2014)</h2><div><div><a href="https://en.wikipedia.org/wiki/How_Not_to_Be_Wrong">This book</a> discusses applying mathematical thinking in everyday situations for decision-making in order to avoid common pitfalls.</div><div><br /></div><div>The book shows how mathematical principles can be applied to real-world scenarios involving statistics, probability, and game theory. There are some fun real world examples discussed including lotteries, election polls, finance, tax schemes, and medicine. The book aims to show readers how embracing mathematical reasoning can guard them against being misled by faulty arguments or misinterpretations of data.</div><div><br /></div><div>The book is written in an engaging and accessible style, but it gets tedious as it is more than 450 pages long. Inevitably, it delves into repetitive explanations, creating monotony and disorganization. The book loses focus as it disperses its attention. Did we really need this long of a book? Couldn't it be half or quarter the size?</div><div><br /></div><div>Personally, I found more enjoyment in the <a href="https://muratbuffalo.blogspot.com/2020/11/mathematics-for-human-flourishing-book.html">"Mathematics for Human Flourishing" by Francis Su (2020).</a></div><div><br /></div><div><br /></div><h2 style="text-align: left;">On grand strategy (John Lewis Gaddis, 2018)</h2><div>This book explores the concept of <a href="https://en.wikipedia.org/wiki/Grand_strategy">grand strategy</a>: how leaders formulate and implement long-term plans. 
Being a historian, Gaddis draws on the writings of Sun Tzu, Clausewitz, and Machiavelli, and military campaigns going as far back as 480 BC to Xerxes's invasion of Greece, to make his points.</div><div><br /></div><div>He employs <a href="https://en.wikipedia.org/wiki/The_Hedgehog_and_the_Fox">the fox versus hedgehog framework</a> (popularized by Berlin in 1953) to discuss different philosophies leaders adopt for grand strategy. "A fox knows many things, but a hedgehog knows one big thing." --Archilochus</div><div><br /></div><div>Foxes are adaptable and versatile. They draw on a wide range of experiences and ideas, avoiding a fixed, singular perspective. Hedgehogs, in contrast, are characterized by a singular focus. They have a specific, central vision or principle that guides their actions. Gaddis argues that an effective grand strategy requires a deep understanding of the geopolitical landscape, clear objectives, yet flexibility in approach, combining strengths from both the fox and hedgehog frameworks. </div><div><br /></div><div>This quote from Gaddis is a stark reminder: "Commonsense is like oxygen: the higher you go, the thinner it gets." In complex situations, commonsense is insufficient and inapplicable. It becomes important to understand things deeply and think strategically. This reminded me of the "Are Right A Lot" and "Earn Trust" leadership principles from Amazon.</div><div><ul style="text-align: left;"><li>Are Right, A Lot: Leaders are right a lot. They have strong judgment and good instincts. <b>They seek diverse perspectives and work to disconfirm their beliefs</b>.</li><li>Earn Trust: Leaders listen attentively, speak candidly, and treat others respectfully. <b>They are vocally self-critical, even when doing so is awkward or embarrassing.</b> Leaders do not believe their or their team’s body odor smells of perfume. They benchmark themselves and their teams against the best. 
</li></ul></div><div>Despite promising brevity in the first chapter, the book extends to nearly 400 pages. This confused and frustrated me -- I guess I need to calibrate my expectations for the work of historians. I found the book to have a dry and academic tone. As the book lingered in its detailed and dry accounting of ancient military campaigns, I got bored and moved on to other things.</div><div><br /></div><h2 style="text-align: left;">American Nations: A History of the Eleven Rival Regional Cultures of North America (Colin Woodard, 2011)</h2><div><a href="https://upload.wikimedia.org/wikipedia/en/5/51/Cover_of_American_Nations.jpg"><img src="https://upload.wikimedia.org/wikipedia/en/5/51/Cover_of_American_Nations.jpg" /></a></div><div><br /></div><div><a href="https://en.wikipedia.org/wiki/American_Nations">This book</a> explores the cultural divisions within the United States through a historical frame of reference. Woodard argues that North America comprises 11 distinct regional cultures, each with its own historical roots, values, and political characteristics. Woodard traces the development of these 11 nations from the colonial period to the present day, discussing how their unique cultural and historical backgrounds continue to influence politics and society.</div><div><br /></div><div>The book offers a thought-provoking exploration of this regional diversity. Of course, the book simplifies and generalizes heavily, as the author briefly acknowledges. But this was an engaging book, and I enjoyed it. Coming to the States as an immigrant, this book felt like the manual I was missing for understanding American history and the current state of affairs.</div><div><br /></div><h2 style="text-align: left;">American Gods (Neil Gaiman, 2001)</h2><div><a href="https://en.wikipedia.org/wiki/American_Gods">This fantasy-fiction book</a> combines intricate world-building, and exploration of contemporary issues within the framework of old myth. 
The central premise is that gods and mythical beings exist because people believe in them, and their power diminishes as belief fades. In the book, the old gods start a campaign to resist being forgotten, and this starts a conflict with the modern gods. The book has superb storytelling, as is the hallmark of Neil Gaiman.</div><div><br /></div><h2 style="text-align: left;">Norse mythology (Neil Gaiman, 2017)</h2><div><a href="https://en.wikipedia.org/wiki/Norse_Mythology_(book)">This book</a> retells the ancient Norse myths with a modern (and superb) narration. Gaiman's short sentences and simple, clear prose weave a captivating narrative of the world of Norse folklore, including stories of Odin, Thor, Loki, Frey, and Freyja. It is funny. I imagined this book was Gaiman's writer's-block therapy book. It is just a retelling of Norse myths, an easy book that helped him exercise prose and prowess in writing. I mistakenly thought this predated American Gods. Anyway, I love reading Neil Gaiman, and I loved the simplicity and elegance of this book.
Creativity is a fundamental aspect of being human. It’s our birthright. And it’s for all of us.</li><li>Living life as an artist is a practice. You are either engaging in the practice or you’re not. It makes no sense to say you’re not good at it. It’s like saying, "I’m not good at being a monk." You are either living as a monk or you’re not. We tend to think of the artist’s work as the output. The real work of the artist is a way of being in the world.</li><li>To live as an artist is a way of being in the world. A way of perceiving. A practice of paying attention. Refining our sensitivity to tune in to the more subtle notes. Looking for what draws us in and what pushes us away. Noticing what feeling tones arise and where they lead.</li><li>All that matters is that you are making something you love, to the best of your ability, here and now.</li><li>To the best of my ability, I’ve followed my intuition to make career turns, and been recommended against doing so every time. It helps to realize that it’s better to follow the universe than those around you.</li><li>Art may only exist, and the artist may only evolve, by completing the work.</li><li>The call of the artist is to follow the excitement. Where there’s excitement, there’s energy. And where there is energy, there is light.</li></ul></div><div>The book is very disorganized, which makes it less engaging compared to previous work on the topic, such as "<a href="http://muratbuffalo.blogspot.com/2016/04/book-review-war-of-art-by-steven.html">The War of Art: Break Through the Blocks and Win Your Inner Creative Battles (Steven Pressfield, 2012)</a>". This is also reminiscent of "<a href="http://muratbuffalo.blogspot.com/2020/07/the-great-work-of-your-life-by-stephen.html">The great work of your life (Stephen Cope, 2012)</a>" and <a href="https://muratbuffalo.blogspot.com/2017/10/what-does-authentic-mean.html">work by Seth Godin</a>. 
</div></div>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-41077681961388275682024-01-03T12:22:00.001-05:002024-01-03T12:23:52.032-05:00 The checklist manifesto (Dr. Atul Gawande, 2009)<p style="text-align: justify;"><a href="https://en.wikipedia.org/wiki/The_Checklist_Manifesto">This book</a> advocates for integrating checklists as potent safety and fault-tolerance tools across diverse domains. Atul Gawande, a prominent surgeon, enriches the narrative with numerous surgery cases to emphasize the effectiveness of checklists in handling intricate tasks.</p><p>The surgery cases are graphic. The unemotional, matter-of-fact tone of the audiobook paradoxically intensified the emotional impact for me. Listening to those accounts gave me sweaty palms, and I instinctively clenched myself with pain. The graphic details effectively drive home the point. While listening, I couldn't help but lament that a simple checklist could have caught the mistake, averting all the blood and suffering.</p><p>The book also delves into how the aviation industry has successfully embraced checklists to reduce errors and improve communication. Gawande argues that disciplined checklist use in aviation significantly enhances safety and reliability. He details how the aviation industry rigorously operationalizes checklists, vetting them through simulations and real-world tests, ensuring they are brutally succinct, and continually improving them for practicality. This stands in stark contrast to other industries, including medicine and surgery.</p><p>Perhaps the early and eager adoption of checklists in aviation, compared to medicine and surgery, stems from pilots having <a href="https://en.wikipedia.org/wiki/Skin_in_the_Game_(book)">skin in the game</a>. Pilots face the same fate as the plane – assigning blame doesn't change the dire outcome of death due to a mishap. 
In contrast, surgeons don't share the same fate as patients and can shift blame to other factors (as if that matters).</p><p>I loved the focus on the operational aspects of making checklists effective. Gawande strongly resisted making checklists a top-down mandate. Mandating top-down adoption could have backfired; it needed to be a grassroots effort, allowing teams to adopt and customize checklists to make them their own.</p><p>I also loved this point: one of the first items on both flight and surgery checklists is the initial briefing and introduction of team members. Numerous studies highlight the significant positive impact this simple practice has in transforming individuals into a more effective and collaborative team. Don't skip the basics and the human touch. </p><p><br /></p><h2 style="text-align: left;">Construction and beyond</h2><p>The book dedicates a chapter to construction as it builds the case for the widespread benefits of checklists across various domains. Given the complex nature of construction projects (involving numerous tasks and collaborators), construction project management benefits from the use of checklists for enhancing communication and coordination.</p><p>I found the connection between checklists and construction not as direct as in aviation or medicine. The absence of specific, concrete examples of checklists for construction projects left me wanting. While the book provides detailed and concrete checklist examples for medicine, surgery, and aviation, there aren't specific checklist examples for construction.
Instead, the focus is on scheduling meetings between stakeholders, progress tracking, and finalizing decisions.</p><p>Could it be that construction is even more complicated than aviation and surgery due to the involvement of numerous stakeholders, a larger surface area, and an extended duration?</p><p>The discussions on construction project management reminded me strongly of software project management, where the multitude of stakeholders, unknowns, extensive interaction surface, and prolonged duration make it so complex that, in comparison, operating a flight or performing a surgery seems more manageable. </p><p>While we are on this topic, we can also draw parallels to the DevOps field, particularly through practices like runbooks employed by Site Reliability Engineers (SREs). SREs maintain detailed runbooks that serve as systematic checklists for handling routine maintenance tasks as well as for addressing critical incidents. These runbooks ensure that the on-calls or SREs adhere to a well-defined, step-by-step process when dealing with specific issues or tasks.</p><p>The runbooks formulate and capture the operational best practices. Similar to checklists, runbooks enhance communication and reduce error risks. These are often automated using tools/templates like AWS CloudFormation, Terraform, or Google Cloud Deployment Manager for consistent and repeatable infrastructure deployments.</p><p><br /></p><h2 style="text-align: left;">Discussion</h2><p>The book leaves me pondering: why did people overlook such a simple yet powerful tool for so long? And when checklists were introduced, why did their widespread adoption take so much time? Dr. Pronovost recognized their life-saving potential in 2001 and piloted them in hospitals. However, it wasn't until 2007 that Gawande, collaborating with WHO, pushed for broader adoption.
Gawande protests this too, and contrasts it with the swift adoption of new drugs and surgical tools that show far less effectiveness.</p><p>Effecting behavioral change in humans is evidently challenging. Establishing good habits is not easy. Moreover, adopting checklists demands emotional maturity to acknowledge fallibility and the courage to embrace humility. A machismo effect seems to be at play too, with many doctors and surgeons resisting checklists, feeling reduced to automatons. However, the reality is that brainpower and attention are finite resources. Why waste them on routine tasks when checklists can handle them? Instead, we should free up cognitive resources for more challenging aspects of projects! Checklists can make you more creative, because your bottom line is covered. Well-crafted checklists not only reduce errors and omissions but also enhance communication, leading to efficiency and performance improvements.</p><p>Why haven't checklists spread across more domains? Why can't we achieve wider adoption? We should try to mistake-proof routine parts of operations, so we can make progress in tackling complexity in the remaining parts. A colleague I admired once told me that the role of a professor/researcher is to simplify complex things and make them boring.</p><p>In summary, the book's message is clear: don't be a cowboy. Being a cowboy isn't heroic; it's foolish. Instead, be humble, be smart.</p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-81576917960343731962023-12-17T22:47:00.004-05:002023-12-17T22:47:45.433-05:00Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks<p><a href="https://www.vldb.org/pvldb/vol16/p629-geng.pdf">Nezha (VLDB'23)</a> is a consensus protocol that leverages synchronized clocks to decrease latency and increase throughput.
There is also <a href="https://github.com/Steamgjk/Nezha">a GitHub repo</a> for the implementation of Nezha and <a href="https://github.com/Steamgjk/Nezha/blob/main/docs/Nezha-tla.pdf">a TLA+ model associated with the protocol.<br /></a><br />Nezha's approach is to offload the traditional leader- or sequencer-based ordering to synchronized clocks, achieving decentralized coordination without the need to rely on network routers or sequencers. Here, time synchronization is leveraged on a best-effort basis, with no impact on correctness. You guessed it right: there is a fast path where the best-effort message ordering works, and the client waits for a super-majority quorum of replies ordered consistently. And then there is a slow path that covers the case where that fails.<br /><br />The evaluation suggests that Nezha outperforms previous protocols significantly, including an order-of-magnitude improvement in throughput. But the evaluations are performed under ideal conditions, and overlook <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">the metastability effects</a>. While time synchronization doesn't affect correctness, it does lead to some modality: one misordered entry ruins/invalidates the well-ordered ones following it. That means, when the slow path is hit once, it is possible for the system to be stuck in the slow path for the following requests, as it may not get enough slack to recover back to the fast-path mode.<br /><br /><br /></p><h1 style="text-align: left;">Nezha's contribution</h1><p style="text-align: left;">This paper follows the research trend of utilizing time synchronization to enhance consensus performance. This is a timely trend. We have <a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/">very precise time synchronization available in the datacenters</a> thanks to advances in atomic clocks and better time sync infrastructure.
It is no longer unreasonable to assume good time sync in the datacenters. We had talked about <a href="https://muratbuffalo.blogspot.com/2022/02/efficient-replication-via-timestamp.html">Tempo and Accord</a> and their use of time.<br />Nezha seems more practical and more immediately applicable.<br /><br />Nezha makes a couple of simplifications that reduce the big modality gap between the fast and slow paths. In contrast to traditional Fast Paxos protocols, in Nezha there is always a dedicated stable leader. This leader is required to be included/involved in both fast and slow quorums. I love the simplification this stable leader brings. Each replica follows the log of the leader, rather than trying to piece together logs across multiple leaderless nodes themselves. The speculative execution at the leader is a bonus. Watch out for these points in the rest of the summary. <br /><br />Keep in mind that the modality gap is still not entirely closed, as I complained above about the lack of evaluation of metastability. But this is an improvement, and I am happy to take it. <br /><br /><br /></p><h1 style="text-align: left;">The reordering problem and quantifying it</h1><p style="text-align: left;">Ensuring the same consistent order across all receivers is a significant challenge in distributed systems. While TCP maintains order consistency for a single receiver, this cannot be used to guarantee the same message order across different receivers. <br /><br />Reordering across receivers occurs due to the different paths/routers taken by the messages to the receivers.
To quantify the extent of the reordering problem, the paper introduces the reordering score.</p><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhvlkBJnsA9-tZDM7kBetpXGwD7ZLxvsqiIrzYZZ5ALjY1W9ZhHLgCstsoahwtnUiBET4Mq7QK08pCuQqJFI1EWP1uk5AdzgUcsmkR3ayp-8ldFUDm0SgwebLR1Fgi1_pOVbBU-rx--PT8thUm3z57Q04YJMksIFYVPyqhrQw5qmfBhlLEf_KtfFhnslzM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="152" data-original-width="1238" height="39" src="https://blogger.googleusercontent.com/img/a/AVvXsEhvlkBJnsA9-tZDM7kBetpXGwD7ZLxvsqiIrzYZZ5ALjY1W9ZhHLgCstsoahwtnUiBET4Mq7QK08pCuQqJFI1EWP1uk5AdzgUcsmkR3ayp-8ldFUDm0SgwebLR1Fgi1_pOVbBU-rx--PT8thUm3z57Q04YJMksIFYVPyqhrQw5qmfBhlLEf_KtfFhnslzM" width="320" /></a></div>For two receivers, R1 and R2, receiving multicast messages from various senders, we establish the sequence of messages received by R1 as the reference. Each message receives a sequence number corresponding to its order of arrival at R1. Using these sequence numbers, the reordering score is computed to measure the extent of reordering in R2. 
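As a toy illustration of such a metric, here is a minimal sketch (my own, not the paper's exact formula: I assume the score is 1 minus the fraction of messages in R2's arrival order that form a longest increasing subsequence of R1's sequence numbers):

```python
import bisect

def lis_length(seq):
    # Patience-sorting LIS: tails[i] holds the smallest possible tail
    # of an increasing subsequence of length i+1.
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def reordering_score(r2_seq):
    # r2_seq: R1-assigned sequence numbers, listed in R2's arrival order.
    # 0.0 means R2 saw the same order as R1; values near 1 mean heavy reordering.
    if not r2_seq:
        return 0.0
    return 1.0 - lis_length(r2_seq) / len(r2_seq)
```

Under this (assumed) normalization, an identically ordered R2 scores 0, while a fully reversed sequence of length n scores 1 - 1/n.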
This score leverages the length of the longest increasing subsequence (LIS) in R2's sequence.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhS6EPI5xxpEN15CNyuUb3oeLbHYZH1yjqlpgyuyrhD51Ny0awX0etmZVgutG0dQCxFHoE6xdeVHC-0AVRo-nv7kiZZxSua4JocIqXj9naItNmMo5rnlRyWArj2fe8XKPsBASO9xQjgdYPE1YxNIobzLeSOefpGg0AO1BcHmcP7Pad_FHxYB54kOl8tY9I" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="524" data-original-width="2492" height="135" src="https://blogger.googleusercontent.com/img/a/AVvXsEhS6EPI5xxpEN15CNyuUb3oeLbHYZH1yjqlpgyuyrhD51Ny0awX0etmZVgutG0dQCxFHoE6xdeVHC-0AVRo-nv7kiZZxSua4JocIqXj9naItNmMo5rnlRyWArj2fe8XKPsBASO9xQjgdYPE1YxNIobzLeSOefpGg0AO1BcHmcP7Pad_FHxYB54kOl8tY9I=w640-h135" width="640" /></a></div><p></p><h1 style="text-align: left;">Deadline-Ordered Multicast (DOM)</h1><p style="text-align: left;">Nezha uses the DOM primitive for best effort consistent message ordering across receivers. DOM assigns a deadline timestamp (with global time using synchronized clocks) to each request and only delivers requests after that deadline time is reached (and in the deadline timestamp order). The intuition here is that the deadline acts as a buffer. By holding a message m' until its deadline (instead of immediately delivering it), we get a chance to receive any earlier message m that m' may have over-taken at this receiver, and so we are able to deliver them in the right order: m followed by m'. <br /><br />DOM is a best-effort primitive: a sequence of messages is processed in order at a receiver if they all arrive before their deadlines, but DOM does not guarantee that messages arrive reliably at all receivers either before the deadline or at all. <br /><br />Figure 3 shows different percentiles (i.e., 50th , 75th , 90th , and 95th) for DOM to decide its deadlines. A higher percentile means lower reordering. 
However, a higher percentile also means longer holding delay for messages in DOM, which in turn decreases the latency savings of Nezha. The paper uses the 50th percentile in Nezha to strike a balance.<br /><br /><br /></p><h1 style="text-align: left;">Fast path</h1><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgTjz8DcdyYLHp3G-0u55-n4p35M0AnpM6gXaP8PnlUkLjTMqSQ6hn1R8sqZ61yxSHM7H6oQIw7syzrn5NQju56w34ec9oCYYWthP-0KlqxNalXtRA81OYJqS0lGwHP9VUz1OiV1sWY6tKNEf5S9fwTmZqjvwXxS2zpo40rQyUWfA7Z22m4FU6ZSOd2NY8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1060" data-original-width="1218" height="348" src="https://blogger.googleusercontent.com/img/a/AVvXsEgTjz8DcdyYLHp3G-0u55-n4p35M0AnpM6gXaP8PnlUkLjTMqSQ6hn1R8sqZ61yxSHM7H6oQIw7syzrn5NQju56w34ec9oCYYWthP-0KlqxNalXtRA81OYJqS0lGwHP9VUz1OiV1sWY6tKNEf5S9fwTmZqjvwXxS2zpo40rQyUWfA7Z22m4FU6ZSOd2NY8=w400-h348" width="400" /></a></div>Although the figure shows a single proxy, this is a multiple proxy system. We get scaling because each proxy has access to synchronized clocks, each can send their DOM requests using local clocks without the need to communicate with each other. The time synchronization ensures that the requests that Proxies send have an agreed upon order, and that the replicas will deliver them in that order. <br /><br />If the time synchronization and message delivery works well (that is, if messages are delivered before their deadlines in the leader and fast-quorum number of nodes), then we have 1 RTT consensus. <br /><br />The leader is not an inbound I/O bottleneck because it doesn't need to aggregate messages from the replicas. The leader and each replica does the same work: 1 receive, 1 send. The time synchronization takes care of ordering of messages without a sequencer at the replicas. 
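The DOM delivery rule described earlier (hold each message until its deadline, then release in deadline-timestamp order) can be sketched with a small priority queue. This is a toy model with made-up names, assuming access to a synchronized clock, and is not the paper's implementation:

```python
import heapq

class DomReceiver:
    """Toy DOM receiver: buffers incoming messages and releases them
    in deadline order, only once the synchronized clock has passed
    each message's deadline."""

    def __init__(self):
        self._held = []  # min-heap of (deadline, msg)

    def on_receive(self, deadline, msg):
        # Hold the message instead of delivering it immediately.
        heapq.heappush(self._held, (deadline, msg))

    def deliverable(self, now):
        # Release every held message whose deadline has passed,
        # in deadline-timestamp order.
        out = []
        while self._held and self._held[0][0] <= now:
            out.append(heapq.heappop(self._held)[1])
        return out
```

A message that arrives only after its own deadline has already passed would be released late, and possibly out of order; that is exactly the case the slow path has to cover.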
The proxy collects the replies from the replicas and checks whether the fast quorum is achieved. This requires that the leader's reply is received, alongside another f+f/2 replica responses (out of 2f+1 total nodes, at most f of which can fail), which indicate that they delivered the same message id as the leader did. This is checked through the use of hashing, and the paper also has a nice optimization in Section 7.1 for doing this incrementally (reducing the check to set equivalence rather than list equivalence, because delivery at each replica is done in timestamped order).<br /><br />Only the leader executes the request; the replicas just respond saying that they delivered the same message, and are ready to serve as fault-tolerance agents/replicas. The replicas do not execute the request. Well, at least not immediately. It is ok for replicas to execute requests later, after they confirm the leader's order.<br /><br />The execution at the leader is said to be speculative, because the leader doesn't know that the replicas also delivered this message in order.
But from one perspective, this is not very speculative: the leader knows that its ordering will take effect (unless of course it is dethroned before this ordering is log-replicated by a majority of replicas.)<br /><br /><br /><p></p><h1 style="text-align: left;">Slow path</h1><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhdjLiAfLNg2Y3wPnvVMU54YWrGqDKHM-dTQ4deMW-WzXtldIOdW-13X3yEtHnqYbFRmIoZusHBrpUynJwvKiD0tRbeiI_D_BQdet_4nA8lIJoQ2t4BwjOxiFhoUMzyD3vVQgzPsbroD5qvx9GdixROjMpDG02ItMvnAi9cuAgC_-bdXnUm3Z1WsqwBGT8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1060" data-original-width="1218" height="348" src="https://blogger.googleusercontent.com/img/a/AVvXsEhdjLiAfLNg2Y3wPnvVMU54YWrGqDKHM-dTQ4deMW-WzXtldIOdW-13X3yEtHnqYbFRmIoZusHBrpUynJwvKiD0tRbeiI_D_BQdet_4nA8lIJoQ2t4BwjOxiFhoUMzyD3vVQgzPsbroD5qvx9GdixROjMpDG02ItMvnAi9cuAgC_-bdXnUm3Z1WsqwBGT8=w400-h348" width="400" /></a></div><br />If the fast path condition fails (that is, if the super-majority quorum does not have the same value as the leader), the slow path picks up the slack. <br /><br />This is more asynchronous in nature than the fast path execution. The replicas stream the log from the leader. And they reply back to the proxy. For this, only f+1 replies (one of which must be from the leader) are sufficient.<br /><br /><br /><p></p><h1 style="text-align: left;">Discussion about the protocol </h1><p style="text-align: left;">Nezha increases latency somewhat by requiring that the leader's message is also included in the super-majority quorums of the fast path. But this seems to help bridge the gap between fast and slow paths. Having a dedicated leader, and insisting on following the order of that leader, reduces the pain of dropping to the slow path, and the pain of recovery of proposals.
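For concreteness, the quorum arithmetic as I read it from the summary above can be written down in a few lines. The ceiling in the fast-quorum size is my interpretation of the f+f/2 figure; treat it as an assumption rather than the paper's exact definition:

```python
import math

def quorum_sizes(f):
    """Quorum sizes for n = 2f + 1 replicas, as I read them:
    the fast quorum is the leader plus f + ceil(f/2) matching
    replica replies; the slow path needs f + 1 replies, one of
    which must be the leader's."""
    n = 2 * f + 1
    fast = 1 + f + math.ceil(f / 2)   # includes the leader
    slow = f + 1                      # includes the leader
    return n, fast, slow
```

For f=1 (3 replicas) this gives a fast quorum of all 3 nodes; for f=2 (5 replicas) it gives 4 of 5, matching the familiar 3/4 super-majority of fast-path protocols.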
Recovery of proposals becomes just the recovery of the leader, which is a better-known problem and a more exercised code path. I like this continuity from the fast path to the slow path, thanks to both relying on the same stable leader. This is not as big a modality jump as in leaderless protocols like Tempo and Accord. <br /><br />But, still, there is a modality drop. I don't like that one misordered entry ruins the delivery of the well-ordered ones following it. After going to the slow path once, it would take some time for the system to recover and get back to fast-path replies again. This is because one misordered entry via the slow path invalidates the order of fast-path delivered entries following that one, as we lost the order in the prefix of that log. Those would also likely need to follow the slow path, if the problem is not resolved before their deadlines expire.<br /><br />There is a <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">metastability risk</a> here, since recovery may not happen at all as the system keeps playing catchup and gets overwhelmed with the busy work of getting back to fast-path deliveries again. I think commutative operations may help cut Nezha some slack, but I don't think there are any guarantees there. I am optimistic here because the slow path does not create extra looped-in traffic to the system and does not overload the system further than normal. So catchup is likely if there is enough idle time in between requests. Unfortunately, the paper does not have experiments on this in the evaluation section. <br /><br />I think for a faster recovery, we can take a cue from <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">the metastability paper</a>. It may be best to shed load fast upon falling to the slow path, recover, and only then accept load again; otherwise it is possible for the system to be stuck grinding on the slow path.
Once the replicas detect that the slow path is used and the system is grinding, they can inform the proxies to initiate some load shedding.<br /><br />Nezha does not prescribe anything new for leader change and reconfiguration (node addition/removal to the node set). They default to existing techniques here, which I think is a good thing. <br /><br />Finally, there are some subtle differences in the leader role from vanilla MultiPaxos. Having the same log entry at a quorum of replicas does not guarantee anchoring that entry for committing. Nezha is very leader-centric. For anchoring the entry for commit, the leader must have also appended the corresponding entry to its log at the same position. Even if all replicas share the same entry at log position K, if the entry differs from the leader's entry at position K, the leader can revert those log entries at those replicas in the slow path. This may happen because the replicas may have delivered the message using DOM, whereas the message was delivered to the leader after its deadline expired, so the leader has it in a different order.<br /><br /><br /></p><h1 style="text-align: left;">Discussion about time synchronization and applications of DOM</h1><p style="text-align: left;">In 2018, the authors proposed a very nice protocol, Huygens, for tight clock synchronization without requiring dedicated networking support. I had also liked the practicality of the approach used in Huygens. For a description of Huygens, <a href="https://muratbuffalo.blogspot.com/2018/04/nsdi-18-first-day.html">see the second session heading in my NSDI'18 post.</a> (For a broader discussion on time synchronization, <a href="https://muratbuffalo.blogspot.com/2021/03/sundial-fault-tolerant-clock.html">see this post</a>.)<br /><br />Nezha builds on that line of work. The paper mentions that, with its use of proxies, Nezha can serve as a drop-in replacement for Raft/Multi-Paxos and metadata stores like etcd and ZooKeeper.
They also list a fair-access financial exchange system for the cloud as an application. <br /><br />Nezha showcased DOM in use with single-conflict-domain consensus. I am excited about the applications of DOM for other consensus deployments and Paxos variants. There are many opportunities here. <br /><br />This is an exciting time. We are <a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/">seeing microsecond-accuracy time sync available in the cloud</a>, and database products like <a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-aurora-limitless-database/">Aurora Limitless</a> making use of it to improve the performance of distributed transaction processing across shards. (Of course, we do not forget <a href="https://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html">Spanner's original introduction of TrueTime in distributed transaction processing</a>.) We are likely to see more adoption of tight clock synchronization for distributed systems in the coming years.<br /><br /><br /></p><h1 style="text-align: left;">Implementation and evaluation</h1><p style="text-align: left;">Nezha outperforms the 4 baselines (Multi-Paxos, Fast Paxos, NOPaxos, Raft) by 1.9–20.9x in throughput, and by 1.3–4.0x in latency. The GitHub repo for the code is available at <a href="https://github.com/Steamgjk/Nezha">https://github.com/Steamgjk/Nezha</a>.
<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg9Ar0dTPd52KrF88H7aeNib6NoZOqfL2PoPIcyuXJcH8PwMCSJjKFvZRjlz6fG7SXOALGIFSnTEUYOcKok6qH28x6gWhxKplroo22OUmKbrQLINZnXQkhRc0OQvLXzjuvKml0TFZkXmJE-bKowY1Mwx2XPNN4vfZvqOgOM9VOiPZBVVw9y9qIKgSEKinc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="650" data-original-width="2592" height="160" src="https://blogger.googleusercontent.com/img/a/AVvXsEg9Ar0dTPd52KrF88H7aeNib6NoZOqfL2PoPIcyuXJcH8PwMCSJjKFvZRjlz6fG7SXOALGIFSnTEUYOcKok6qH28x6gWhxKplroo22OUmKbrQLINZnXQkhRc0OQvLXzjuvKml0TFZkXmJE-bKowY1Mwx2XPNN4vfZvqOgOM9VOiPZBVVw9y9qIKgSEKinc=w640-h160" width="640" /></a></div><br /><br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiYxaBYefutZZZK-iBTgbEzO0F2EPZzpbIOpRS8O21w7fZ-Y_SglP3wd8Qp-NCKDCAGSKhpyr3cSj9p51zeIYO2hWVwZBHNUxLuq-ExrJh9nteg6QRtZaFk_42-FMEJ3J5A-Z8ErWzfZ2HtHx2620HG37qbJB0gBdlQT4xZfGC3qHhFA2qAYtdV6ceR5vc" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="512" data-original-width="1150" height="178" src="https://blogger.googleusercontent.com/img/a/AVvXsEiYxaBYefutZZZK-iBTgbEzO0F2EPZzpbIOpRS8O21w7fZ-Y_SglP3wd8Qp-NCKDCAGSKhpyr3cSj9p51zeIYO2hWVwZBHNUxLuq-ExrJh9nteg6QRtZaFk_42-FMEJ3J5A-Z8ErWzfZ2HtHx2620HG37qbJB0gBdlQT4xZfGC3qHhFA2qAYtdV6ceR5vc=w400-h178" width="400" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEirZI2pBPR_4ZcmyeIEDzGp8LsU9JKaAj5QKtqeKTwVajdxzykwhDSkv4hIVUGNh20R3nz_K1tO1VTh_KvlUzl0tGB4WOU0Ge-a46XhnE64AkHu4xe1jN1AjDJ4NnlnLfnOrFaIjPQ-zE2j1necpQtD8NX8T01xmv0vAQ7xOLrmDzOkxz0vkVCDwFOCsAE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="526" data-original-width="2482" height="136" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEirZI2pBPR_4ZcmyeIEDzGp8LsU9JKaAj5QKtqeKTwVajdxzykwhDSkv4hIVUGNh20R3nz_K1tO1VTh_KvlUzl0tGB4WOU0Ge-a46XhnE64AkHu4xe1jN1AjDJ4NnlnLfnOrFaIjPQ-zE2j1necpQtD8NX8T01xmv0vAQ7xOLrmDzOkxz0vkVCDwFOCsAE=w640-h136" width="640" /></a></div><br /><br /><br /><p></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-25452308040696361912023-12-05T13:54:00.006-05:002023-12-05T13:59:41.312-05:00Best of Metadata in 2023<p>It is that most wonderful time of the year again. Time to reflect back on the best posts at Metadata blog in 2023. <br /><br /></p><h1 style="text-align: left;">Distributed systems</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/10/hints-for-distributed-systems-design.html">Hints for Distributed Systems Design</a>:<i> </i>I have seen these hints successfully applied in distributed systems
design throughout my 25 years in the field, starting from the theory of
distributed systems (98-01), immersing myself in the practice of wireless sensor networks (01-11), and working on cloud computing systems in both academia and industry ever since. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">Metastable failures in the wild</a>: Metastable failure is defined as permanent overload with low throughput
even after the fault-trigger is removed. It is an emergent behavior of a
system, and it naturally arises from the optimizations for the common
case that lead to sustained work amplification. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/11/towards-modern-development-of-cloud.html">Towards Modern Development of Cloud Applications</a>: This is an easy-to-read paper, but it is not an easy-to-agree-with paper. The
message is controversial: Don't do microservices, write a monolith, and
our runtime will take care of deployment and distribution.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/03/characterizing-microservice-dependency.html">Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis</a>: The paper conducts a comprehensive study of large-scale microservices deployed in Alibaba clusters. They find that the microservice graphs are dynamic at runtime, most graphs are scattered to grow like a tree, and the size of call graphs follows a heavy-tail distribution.<i><br /><br /><br /></i></p><h1 style="text-align: left;">TLA+</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/09/beyond-code-tla-and-art-of-abstraction.html">Beyond the Code: TLA+ and the Art of Abstraction</a>: Abstraction is a powerful tool for avoiding distraction. The etymology of the word abstract comes from the Latin for cut and draw away. With
abstraction, you slice out the protocol from a complex system, omit
unnecessary details, and simplify a complex system into a useful model. In his 2019 talk, Leslie Lamport said: "Abstraction, abstraction, abstraction! That's how you win a Turing Award."<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/08/model-checking-guided-testing-for.html">Model Checking Guided Testing for Distributed Systems</a>: The paper shows how to generate test-cases from a TLA+ model of a distributed
protocol and apply it to the Java implementation to check for bugs in
the implementation. They applied the technique to Raft, XRaft, and Zab
protocols, and presented the bugs they found. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/08/going-beyond-incident-report-with-tla.html">Going Beyond an Incident Report with TLA+ </a>: This paper is about the use of TLA+ to explain the root cause of a Microsoft Azure incident. It looks like the incident went undetected/unreported for 26 days, because it was a partial outage. "A majority of requests did not fail -- rather, a specific type of request was disproportionately affected, such that global error rates did not reveal the outage despite a specific group of users being impacted."<i><br /><br /></i><br /><a href="https://muratbuffalo.blogspot.com/2023/09/a-snapshot-isolated-database-modeling.html">A snapshot isolated database modeling in TLA+</a>: This shows a modeling walkthrough (and model checking) of a snapshot isolated database, where each transaction makes a copy of the store, and OCC merges their copies back to the store upon commit. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/10/cabbage-goat-and-wolf-puzzle-in-tla.html">Cabbage, Goat, and Wolf Puzzle in TLA+ </a>: It is important to emphasize that abstraction is an art, not a science, and it
is best learned through studying examples and practicing hands-on with
modeling. TLA+ excels in providing rapid feedback on your modeling and
designs, which facilitates this learning process significantly. Modeling
the "cabbage, goat, and wolf" puzzle taught me that tackling
real/physical-world scenarios is a great way to practice abstraction and
design -- cutting out the clutter and focusing on the core challenge.<br /><br /><br /></p><h1 style="text-align: left;">Production systems</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/08/distributed-transactions-at-scale-in.html">Distributed Transactions at Scale in Amazon DynamoDB</a>: Aligned with the predictability tenet, when adding transactions to
DynamoDB, the first and primary constraint was to preserve the
predictable high performance of single-key reads/writes at any scale. The
second big constraint was to implement transactions using update
in-place operation without multi-version concurrency control. The reason
for this was that they didn't want to muck with the storage layer, which did
not support multi-versioning. Satisfying both of the above
constraints may seem like a fool's errand, as transactions are infamous
for not
being scalable and reducing performance for normal operations without
MVCC, but the team got creative around these constraints, and managed to
find a saving grace.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/10/kora-cloud-native-event-streaming.html">Kora: A Cloud-Native Event Streaming Platform For Kafka</a>: Kora combines best practices to deliver cloud features such as high
availability, durability, scalability, elasticity, cost efficiency,
performance, and multi-tenancy. For example, the Kora architecture decouples
its storage and compute tiers to facilitate elasticity, performance,
and cost efficiency. As another example, Kora defines a Logical Kafka
Cluster (LKC) abstraction to serve as the user-visible unit of
provisioning, so it can help customers distance themselves from the
underlying hardware and think in terms of application requirements.<i> </i><br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/07/spanner-becoming-sql-system.html">Spanner: Becoming a SQL system</a>: The original Spanner paper, published in 2012, had little
discussion/support for SQL. It was mostly a "transactional NoSQL core".
In the intervening years, though, Spanner has evolved into a relational
database system, and many of the SQL features in F1 got incorporated
directly in Spanner. Spanner got a strongly-typed schema system and a
SQL query processor, among other features. This paper describes
Spanner's evolution to a full-featured SQL system. It focuses mostly on
the distributed query execution (in the presence of resharding of the
underlying Spanner record space), query restarts upon transient
failures, range extraction (which drives query routing and index seeks),
and the improved blockwise-columnar storage format.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/11/polardb-scc-cloud-native-database.html">PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads</a>: PolarDB adopts the canonical primary-secondary architecture of
relational databases. The primary is a read-write (RW) node, and the
secondaries are read-only (RO) nodes. Having RO nodes helps with executing
queries and scaling out query performance. On top of this, they are interested in being able to <b>serve strong-consistency reads from RO nodes</b>.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/10/tidb-raft-based-htap-database.html">TiDB: A Raft-based HTAP Database</a>: TiDB is an open-source
Hybrid Transactional and Analytical Processing (HTAP) database,
developed by PingCAP. The TiDB server, written in Go, is the
query/transaction processing component; it is stateless, in the sense
that it does not store data and it is for computing only. The underlying key-value store, TiKV,
is written in Rust, and it uses RocksDB as the storage engine. They add
a columnar store called TiFlash, which gets most of the coverage in
this paper. <br /><br /><br /></p><h1 style="text-align: left;">Databases</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/04/the-end-of-myth-distributed.html">The end of a myth: Distributed transactions can scale</a>: The paper presents NAM-DB, a scalable distributed database system that
uses RDMA (mostly 1-way RDMA) and a novel timestamp oracle to support
snapshot isolation (SI) transactions. NAM stands for
network-attached-memory architecture, which leverages RDMA to enable
compute nodes to talk directly to a pool of memory nodes. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/04/aries-transaction-recovery-method.html">ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks</a>:<i> </i>This is a foundational paper in the databases area. ARIES achieves
long-running transaction recovery in a performant/nonblocking fashion.
It is more complicated than simple write-ahead-log (WAL) based
per-action recovery, as it needs to preserve the Atomicity and
Durability properties for ACID transactions. Any transactional database
worth its salt (including Postgres, Oracle, MySQL) implements recovery
techniques based on the ARIES principles. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/11/epoxy-acid-transactions-across-diverse.html">Epoxy: ACID Transactions Across Diverse Data Stores</a>: Epoxy leverages the Postgres transactional database as the
primary/coordinator and extends multiversion concurrency control (MVCC)
for cross-data store isolation. It provides isolation as well as
atomicity and durability through its optimistic concurrency control
(OCC) plus two-phase commit (2PC) protocol. Epoxy was implemented as a
bolt-on shim layer for five diverse data stores: Postgres, MySQL,
Elasticsearch, MongoDB, and Google Cloud Storage (GCS). (I guess the
authors had Google Cloud credits to use rather than AWS credits, and so
the experiments were run on Google Cloud.)<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/07/detock-high-performance-multi-region.html">Detock: High Performance Multi-region Transactions at Scale</a>: This is a followup to the deterministic database work that Daniel Abadi has
been doing for more than a decade. I like this type of continuous
research effort rather than people jumping from one branch to another
before exploring the approach in depth. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/02/polyjuice-high-performance-transactions.html">Polyjuice: High-Performance Transactions via Learned Concurrency Control</a>: This paper shows a practical application of simple machine learning to an important systems problem, concurrency control. Instead of choosing among a small number of known algorithms, Polyjuice
searches the "policy space" of fine-grained actions by using
evolutionary-based reinforcement learning and offline training to
maximize throughput. Under different configurations of TPC-C and TPC-E,
Polyjuice can achieve throughput numbers higher than the best of
existing algorithms by 15% to 56%.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/02/a-study-of-database-performance.html">A Study of Database Performance Sensitivity to Experiment Settings</a>: The paper investigates the following question: Many articles compare
to prior works under certain settings, but how much of their
conclusions hold under other settings? They find that the evaluations of the sampled work (and conclusions drawn from
them) are sensitive to experiment settings. They make some
recommendations as to how to proceed for evaluation of future systems
work.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/01/tpc-e-vs-tpc-c-characterizing-new-tpc-e.html">TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study</a>: This paper compares the two standard TPC benchmarks for OLTP, <a href="https://en.wikipedia.org/wiki/TPC-C">TPC-C </a>which came in 1992, and the TPC-E which dropped in 2007. TPC-E
is designed to be a more realistic and sophisticated OLTP benchmark
than TPC-C by incorporating realistic data skews and referential
integrity constraints. However, because of its complexity, TPC-E is more
difficult to implement, and it is harder for others to reproduce its
results. As a result, adoption has been slow and limited.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/01/is-scalable-oltp-in-cloud-solved.html">Is Scalable OLTP in the Cloud a Solved Problem?</a> The paper draws attention to the divide between conventional wisdom on
building scalable OLTP databases (shared-nothing architecture) and how
they are built and deployed on the cloud (shared storage architecture).
There are shared-nothing systems like CockroachDB, Yugabyte, and
Spanner, but the overwhelming trend/volume on cloud OLTP is shared
storage, often even with a single writer, as in AWS Aurora, Azure SQL Hyperscale, PolarDB, and AlloyDB. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/08/take-out-trache-maximizing.html">Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate</a>: This is the main message of this paper: <b>You have been doing caching wrong for your transactional workloads!<br /></b><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/02/designing-access-methods-rum-conjecture.html">Designing Access Methods: The RUM Conjecture</a>: Algorithms and data structures for organizing and accessing data are
called access methods. The database research/development community has been
playing catch-up, redesigning and tuning access methods to accommodate
changes to hardware and workload. As data generation and workload
diversification grow exponentially, and hardware advances introduce
increased complexity, the effort of redesigning and tuning has
been accumulating as well. The paper suggests it is important to solve
this problem once and for all by identifying tradeoffs access methods
face, and designing access methods that can adapt and autotune the
structures and techniques for the new environment.<br /><br /><br /></p><h1 style="text-align: left;">Miscellaneous</h1><p style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2023/07/sigmod-panel-future-of-database-system.html">SIGMOD panel: Future of Database System Architectures</a>: Swami said that when Raghu invited him to be on the panel, he didn't
know he would have to disagree with his PhD advisor Gustavo. He said
that disaggregation has already arrived. He took a customer-focused view
and said that the boundary between analytics, transactional, and ML is
irrelevant for customers, and these are artificial distinctions of the
research community that need to die. He built on the
hardware-software codesign theme Anastasia mentioned. He said that
humans are not good at high cardinality problems, this is where ML
helps, and there is not enough investment in how to use ML for building
DBs. Being on-call at 2am, debugging, makes you appreciate these things.
He said, being known as the NoSQL guy, he would controversially claim
that "SQL is going to die" because LLMs are going to reinvent spec, and
allow natural language based querying.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/06/sigmodpods-day-2.html">SIGMOD/PODS Day 2</a>: Don Chamberlin (IBM fellow retired) is the creator of the SQL language. Why does the title say 49 years and not 50 years of querying? This is because the SQL paper was published 49 years ago at SIGMOD conference. The paper was titled: <a href="https://dl.acm.org/doi/10.1145/800296.811515">"SEQUEL: A structured English query language".</a>
But believe it or not, this was not the main show at that conference,
and it may even have gone by largely unnoticed. The main show was two influential
people debating. <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/09/review-performance-modeling-and-design.html">Review: Performance Modeling and Design of Computer Systems: Queueing Theory in Action</a>: We are <a href="https://emptysqua.re">A. Jesse Jiryu Davis</a>, <a href="https://ahelwer.ca/">Andrew Helwer</a>, and <a href="https://muratbuffalo.blogspot.com/">Murat Demirbas</a>,
three enthusiasts of distributed systems and formal methods. We’re
looking for rigorous ways to model the performance of distributed
systems, and we had hoped that this book would point the way.<br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/05/keep-calm-and-crdt-on.html">Keep CALM and CRDT On</a>: This paper focuses on the read/querying problem of conflict-free replicated data
types (CRDTs). To solve this problem, it proposes extending CRDTs with a
SQL API query model, applying the CALM theorem to identify which
queries are safe to execute locally on any replica. The answer comes as no
surprise: monotonic queries can provide consistent observations without
coordination.<br /></p><p style="text-align: left;"><br /></p><h1 style="text-align: left;">Previous years in review</h1><div style="text-align: left;"><a href="https://muratbuffalo.blogspot.com/2022/12/best-of-metadata-in-2022.html">Best of metadata in 2022</a></div><div style="text-align: left;"><br /><br /></div><a href="https://muratbuffalo.blogspot.com/2021/12/best-of-metadata-in-2021.html">Best of metadata in 2021</a><br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2020/12/year-in-review-best-of-metadata-in-2020.html">Best of metadata in 2020</a><br /><br /><br /><a href="http://muratbuffalo.blogspot.com/2019/12/year-in-review-best-of-metadata.html">Best of metadata in 2019</a><br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2018/12/year-in-review.html">Best of metadata in 2018</a><br /><br /><br /><a href="http://muratbuffalo.blogspot.com/2020/06/research-writing-and-career-advice.html">Research, writing, and career advice</a><p style="text-align: left;"><br /><br /></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-20304263073721629512023-12-04T21:52:00.003-05:002023-12-04T21:53:21.907-05:00Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows<p><a href="https://www.usenix.org/conference/atc23/presentation/huye">This paper appeared in USENIX ATC'23</a>. It is about a survey of microservices in Meta (nee Facebook). <br /><br /><br /><a href="https://muratbuffalo.blogspot.com/2023/03/characterizing-microservice-dependency.html">We had previously reviewed a microservices survey paper from Alibaba.</a> Motivated maybe by the desire for differentiation, the Meta paper spends the first two sections justifying why we need yet another microservices survey paper. I didn't mind reading this paper at all, it is an easy read. 
The paper gives another design point/view from industry on microservices topologies, call graphs, and how they evolve over time. It argues that this information will help build more accurate microservices benchmarks and artificial microservice topology/workflow generators, and also help for future microservices research and development.<br /><br />I did learn some interesting information and statistics about microservices use in Meta from the paper. But I didn't find any immediately applicable insights/takeaways to improve the quality and reliability of the services we build in the cloud. <span><br /></span></p><p><span></span></p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/vfsdGdVAwag" width="320" youtube-src-id="vfsdGdVAwag"></iframe></div> <br />The conference presentation video does an excellent job of explaining the paper. I highly recommend watching it. For my brief summary, continue reading.<br /><br /><p></p><h2 style="text-align: left;">Topological characteristics</h2><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEihT_q8RUF6MLjvsIHqFHD5TWKf4gme1AzZbvUf-ThRGyVUPSvoF6spEeULdPdXMfCidf7erWaRaYUr3vuQIHsMhzilUUa5ql4jtPdXAJiHGnv-eZVBWeYTgjFW68CUAQRSoXdPbUEnjC7rTrJRZUUg7m1i3OqiaEv9Bz5AqDto2VtbCyqsJmbStseZf-I" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="708" data-original-width="972" height="233" src="https://blogger.googleusercontent.com/img/a/AVvXsEihT_q8RUF6MLjvsIHqFHD5TWKf4gme1AzZbvUf-ThRGyVUPSvoF6spEeULdPdXMfCidf7erWaRaYUr3vuQIHsMhzilUUa5ql4jtPdXAJiHGnv-eZVBWeYTgjFW68CUAQRSoXdPbUEnjC7rTrJRZUUg7m1i3OqiaEv9Bz5AqDto2VtbCyqsJmbStseZf-I" width="320" /></a></div><br /><br />Figure 1 illustrates Meta’s microservice architecture. 
It is similar to other large-scale microservice architectures, in that there are load balancers hitting frontend services, which in turn call other services. They say that, at Meta, <b>business use case</b> is a sufficient partitioning for defining [micro]services. Endpoints in the figure and the rest of the paper just mean API interfaces.<br /><br />The main findings related to topology are summarized below.<br /><br /><b><u>Finding F1:</u></b> Meta's microservice topology contains three types of software entities that communicate within and amongst one another: (1) those that represent a single, well-scoped business use case; (2) those that serve many different business cases, but which are deployed as a single service (often from a single binary); and (3) those that are ill-suited to the microservice architecture's expectation that business use case is a sufficient partitioning on which to base scheduling, scaling, and routing decisions and to provide observability. <br /><br />What are those ill-fitting services, you say? These have Service IDs of the form inference_platform/ model_type_{random_number}. "Meta’s engineers informed us that these Service IDs are generated by a general-purpose platform for hosting per-tenant machine-learning models (called the Inference Platform). The platform serves a single business use case--i.e., serving ML models-- but many per-tenant use cases. Platform engineers chose to deploy each tenant's model under a separate Service ID so that each can be deployed and scaled independently per the tenant's requirements by the scheduler."<br /><br />ML bros, screwing things up for systems guys since 2018 :-) On the other end of the spectrum, databases and other platforms appear as a single service and provide their own scheduling and observability mechanisms. And the paper points out that both of these extreme types of usage mask the true complexity of the services and skew service- and endpoint-based analyses of microservice topologies.
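As a toy illustration (this is not the paper's actual tooling, and the exact ID pattern is my guess at the inference_platform/model_type_{random_number} form quoted above), separating such platform-generated Service IDs from well-scoped ones might look like:

```python
import re

# Hypothetical pattern for the auto-generated Service IDs the paper calls
# out; the separator and trailing digits are assumptions for illustration.
ILL_FITTING = re.compile(r"^inference_platform/model_type_\d+$")

def partition_services(service_ids):
    """Split Service IDs into well-scoped services and ill-fitting
    platform-generated ones, so topology stats can be computed separately."""
    well_scoped, ill_fitting = [], []
    for sid in service_ids:
        (ill_fitting if ILL_FITTING.match(sid) else well_scoped).append(sid)
    return well_scoped, ill_fitting

services = ["ads_ranking", "inference_platform/model_type_42", "newsfeed"]
ws, ill = partition_services(services)
```

Separating the two populations this way is essentially what the evaluation section does when it reports statistics with and without the ill-fitting services.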
This is an important point, and the paper does a good job of separating these ill-fitting services from others throughout the evaluation section. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgm6I8ZCg5NUywbA-7YcnX4LMfsJtpP84A_AjhWqDah8hPos70Tn5SHu8twd_Tui8esDJq8ot3SZNAizQHDzMxnod6yyjdR_YPnxaKM5lJZuqsCwDitTZr1aaWtLEc2spy_LO3N2LJwTaKvCSz5qPxenNXezY88hK1zQduWvl-IJMVjhKFrKjAKdTaH7r8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="582" data-original-width="972" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEgm6I8ZCg5NUywbA-7YcnX4LMfsJtpP84A_AjhWqDah8hPos70Tn5SHu8twd_Tui8esDJq8ot3SZNAizQHDzMxnod6yyjdR_YPnxaKM5lJZuqsCwDitTZr1aaWtLEc2spy_LO3N2LJwTaKvCSz5qPxenNXezY88hK1zQduWvl-IJMVjhKFrKjAKdTaH7r8=w400-h240" width="400" /></a></div><br /><b><u>Finding F2:</u></b> The topology is very complex, containing over 12 million service instances and over 180,000 communication edges between services. Individual services are mostly simple, exposing just a few endpoints, but some are very complex, exposing 1000s of endpoints. The overall topology of connected services does not exhibit a power-law relationship typical of many large-scale networks.
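Whether a distribution like endpoints-per-service follows a power law is typically eyeballed from its complementary CDF, which looks roughly linear on log-log axes for a power law. A minimal sketch with made-up counts (not data from the paper):

```python
from collections import Counter

# Hypothetical endpoints-per-service counts with a heavy tail.
endpoint_counts = [1, 1, 1, 2, 2, 3, 5, 8, 40, 1200]

def ccdf(counts):
    """Complementary CDF: for each distinct count x, the fraction of
    services exposing at least x endpoints."""
    n = len(counts)
    dist = Counter(counts)
    vals = {}
    at_least = n
    for x in sorted(dist):
        vals[x] = at_least / n
        at_least -= dist[x]
    return vals

c = ccdf(endpoint_counts)  # plot log(x) vs log(c[x]) to check for a power law
```

A straight-line fit of log(x) against log(c[x]) would then give the power-law exponent, which is the kind of check behind Finding F2's claims.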
However, the number of endpoints services expose does show a power-law relationship.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh9wi1QtAcWsaCTIxeRY4pbDVE2FJkuNQ2pH8L29ozV4DrMQISlDewPUz4CGAn0tVCIcUzwZHxHKCKWlQOCZPm0PmnN3sZpcqzNjbkC54wx8e8gTh3UTZQArouYeSGw-ntOa-71_QcWL5Qv479JFoGK3SLjTQhbetINTCjOIDntsyLhzm7ddNrGJviBR_s" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="790" data-original-width="972" height="325" src="https://blogger.googleusercontent.com/img/a/AVvXsEh9wi1QtAcWsaCTIxeRY4pbDVE2FJkuNQ2pH8L29ozV4DrMQISlDewPUz4CGAn0tVCIcUzwZHxHKCKWlQOCZPm0PmnN3sZpcqzNjbkC54wx8e8gTh3UTZQArouYeSGw-ntOa-71_QcWL5Qv479JFoGK3SLjTQhbetINTCjOIDntsyLhzm7ddNrGJviBR_s=w400-h325" width="400" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiyymQ3mFXP45e9ujeQYj-v1RwWW--fbbSaoHszOkf4rQqAW3iYoydxX1usViWHJ--ARpRZNHAe3gWTmPU2Mgy5GTqVsaeQTrUVBRS6Wc5nm92rtFe1aIJQOW4NBbJDP5-LW_aYE6noLCiShSg4GOr-96w0CfeniC5YAW2nTUaglBgRbhoL7NEX3a31qR0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="490" data-original-width="972" height="201" src="https://blogger.googleusercontent.com/img/a/AVvXsEiyymQ3mFXP45e9ujeQYj-v1RwWW--fbbSaoHszOkf4rQqAW3iYoydxX1usViWHJ--ARpRZNHAe3gWTmPU2Mgy5GTqVsaeQTrUVBRS6Wc5nm92rtFe1aIJQOW4NBbJDP5-LW_aYE6noLCiShSg4GOr-96w0CfeniC5YAW2nTUaglBgRbhoL7NEX3a31qR0=w400-h201" width="400" /></a></div><br /><u><b>Finding F3:</b></u> The topology has scaled rapidly, doubling in number of instances over the past 22 months. The rate of increase is driven by an increase in number of services (i.e., new functionality) rather than increased replication of existing ones (i.e., additional instances). The topology sees daily fluctuations due to service creations and deprecations. 
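The claim that growth is driven by new services rather than increased replication can be made concrete with a toy decomposition of instance-count growth (the numbers and data shape are hypothetical, for illustration only):

```python
def decompose_growth(services_then, services_now):
    """Each argument maps service -> replica count at a point in time.
    Returns (total growth, growth from new services, growth from scaling
    existing services). Deprecated services show up only in the total."""
    inst_then = sum(services_then.values())
    inst_now = sum(services_now.values())
    from_new = sum(r for s, r in services_now.items() if s not in services_then)
    from_scaling = sum(services_now[s] - services_then[s]
                       for s in services_then if s in services_now)
    return inst_now - inst_then, from_new, from_scaling

then = {"a": 10, "b": 5}           # two services, 15 instances
now = {"a": 11, "b": 5, "c": 9}    # one new service, 25 instances
total, from_new, from_scaling = decompose_growth(then, now)
```

In Finding F3's terms, most of the growth attributes to the `from_new` bucket (new business use cases), not the `from_scaling` one.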
The rate of increase of instances is due to new business use cases (i.e., new microservices) rather than increased scale: check the blue line in Figure 7.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi4TEZIoP6HL9mILADmE3lXihidufMOvRQXj-mnv0rfU_LcvQ1HbD-KpoGW1VsnwSdCHBesNuO0N32RDZpP2mqbl66UtfR01W0dOg5U3n_wvMtBNZ_ixMybEyLgqgSrZx2rdD3HnnPo_LknSSe-ZPup3rBDsszGgbOpBTHPSnpiFZF_cugYjoZkqYwtnTI" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="946" data-original-width="972" height="389" src="https://blogger.googleusercontent.com/img/a/AVvXsEi4TEZIoP6HL9mILADmE3lXihidufMOvRQXj-mnv0rfU_LcvQ1HbD-KpoGW1VsnwSdCHBesNuO0N32RDZpP2mqbl66UtfR01W0dOg5U3n_wvMtBNZ_ixMybEyLgqgSrZx2rdD3HnnPo_LknSSe-ZPup3rBDsszGgbOpBTHPSnpiFZF_cugYjoZkqYwtnTI=w400-h389" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhD7Im8dEv8Y53afPBDhmma5lw2TxTTBt-l37cyr47lLMYvVwxtnV3n0_PQqmzTP4H9ut1CW6jzeWNwn4mnjv9YJqMYeQFSQb4T3OtvHJTIru9WGgJ8c7qOrv2Ky0RzvnPmFWoV6dUJZcO4HE72ip6pIJiAN6v89mq1B-wAK4NnbVSCRtzWlXa2dpdxAMo" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="486" data-original-width="972" height="200" src="https://blogger.googleusercontent.com/img/a/AVvXsEhD7Im8dEv8Y53afPBDhmma5lw2TxTTBt-l37cyr47lLMYvVwxtnV3n0_PQqmzTP4H9ut1CW6jzeWNwn4mnjv9YJqMYeQFSQb4T3OtvHJTIru9WGgJ8c7qOrv2Ky0RzvnPmFWoV6dUJZcO4HE72ip6pIJiAN6v89mq1B-wAK4NnbVSCRtzWlXa2dpdxAMo=w400-h200" width="400" /></a></div><p> </p><h2 style="text-align: left;">Request-workflow characteristics</h2><p style="text-align: left;">This section analyzes service-level properties of individual request workflows using traces collected by three different profiles. 
<br /></p><ul style="text-align: left;"><li><b>Ads:</b> This profile represents a traditional CRUD web application focusing on managing customers’ advertisements, such as getting all advertisements belonging to a customer or updating ad campaign parameters.</li><li><b>Fetch:</b> This profile represents deferred (asynchronous) work triggered by opening the notifications tab in Meta’s client applications. Examples of work include updating the total tab badge count or retrieving the set of notifications shown on the first page of the tab.</li><li><b>RaaS (Ranking-as-a-Service):</b> This profile represents ranking of items, such as posts in a user’s feed.</li></ul><p style="text-align: left;"></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh6sC4KDI1OaFu_d9VgNcv3wFvTIaV4IqNGTLr5j9zK65bk9g_E2h1nhoYjtCrcMO06XtgeHAbELJLDchqN9s6eEO3NjoW7yma6yHkDGHpf6quFUyRgrRuQ5FIYDrPcZnHm67y1wQvsso7laAPyZCJujrA0rDossGf_Y_3wHyuH-Hbh6MVL6KCsWB0Hhdk" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="220" data-original-width="220" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEh6sC4KDI1OaFu_d9VgNcv3wFvTIaV4IqNGTLr5j9zK65bk9g_E2h1nhoYjtCrcMO06XtgeHAbELJLDchqN9s6eEO3NjoW7yma6yHkDGHpf6quFUyRgrRuQ5FIYDrPcZnHm67y1wQvsso7laAPyZCJujrA0rDossGf_Y_3wHyuH-Hbh6MVL6KCsWB0Hhdk" width="240" /></a></div><a href=" https://meangirls.fandom.com/wiki/Fetch">Fetch</a> is actually interesting. The Meta datacenters backend is pretty tightly connected/coordinated with the mobile devices. Of course mobile requests are served by datacenter backend, but I had read in another place that when shedding load, Meta first informs/changes mobile client settings to degrade those services so they won't bog down the datacenters with requests. 
<br /><br />Ok, back to the findings.<br /> <br /><b><u>Finding F4:</u></b> Trace sizes vary depending on workflows' high-level behaviors, but most are small (containing only a few service blocks). Traces are generally wide (services call many other services), and shallow in depth (length of caller/callee branches).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiDOY8Ewtabmww9M2rDzJtfx8YjEv-fCE4Ml_qjBsL53rFJD5CtLGzWc_RKBkdugQHHyDe66A32AqoVBWH1DmwpnJ-9wdxMxtz-9lE6Qr9LvZZbBSNtpCuYnXWWmZwrQJaVsCAg1X50o0VYqW3BFHOAGP1oZPu52Ve8g3mwsqy4w7qYwvJR32HBg30KX3A" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="466" data-original-width="972" height="191" src="https://blogger.googleusercontent.com/img/a/AVvXsEiDOY8Ewtabmww9M2rDzJtfx8YjEv-fCE4Ml_qjBsL53rFJD5CtLGzWc_RKBkdugQHHyDe66A32AqoVBWH1DmwpnJ-9wdxMxtz-9lE6Qr9LvZZbBSNtpCuYnXWWmZwrQJaVsCAg1X50o0VYqW3BFHOAGP1oZPu52Ve8g3mwsqy4w7qYwvJR32HBg30KX3A=w400-h191" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgCUZyJSDJeyBmBeh5IWYG3zqnpbZUuERpAg2lVbA0lcUTCphfvEgRcDGb4gYXKTpr8wWZceG-pbiWlU1lxC5CQ3N32Ha_46-VDiOQdDjQqEPaHR8jOkJs6gYJwSX70EI25TURR1FxoUAQZ35Ik3lBbzmt47gNvGDbvn9APAjCsHV5oc_dpauWhiPGYXP8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="960" data-original-width="972" height="395" src="https://blogger.googleusercontent.com/img/a/AVvXsEgCUZyJSDJeyBmBeh5IWYG3zqnpbZUuERpAg2lVbA0lcUTCphfvEgRcDGb4gYXKTpr8wWZceG-pbiWlU1lxC5CQ3N32Ha_46-VDiOQdDjQqEPaHR8jOkJs6gYJwSX70EI25TURR1FxoUAQZ35Ik3lBbzmt47gNvGDbvn9APAjCsHV5oc_dpauWhiPGYXP8=w400-h395" width="400" /></a></div><br /><b><u>Finding F5:</u></b> Root Ingress IDs do not predict trace properties. 
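The "wide and shallow" shape in Finding F4 can be computed from span parent pointers; this is a minimal sketch under an assumed trace representation (child span -> parent span), not the paper's actual trace format:

```python
# Depth is the longest caller/callee chain; width is the largest fan-out.
def trace_shape(parent_of):
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    def depth(span):
        kids = children.get(span, [])
        return 1 + (max(map(depth, kids)) if kids else 0)

    # Roots are spans that appear as parents but never as children.
    roots = [s for s in set(parent_of.values()) if s not in parent_of]
    max_depth = max(depth(r) for r in roots)
    max_width = max(len(kids) for kids in children.values())
    return max_depth, max_width

# A wide, shallow trace: root "a" fans out to three services.
trace = {"b": "a", "c": "a", "d": "a", "e": "b"}
d, w = trace_shape(trace)
```

Here the trace has depth 3 (a calls b calls e) and width 3 (a's fan-out), matching the wide-and-shallow shape the finding describes.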
At the level of parent/child relationships, parents’ Ingress IDs are predictive of the set of children Ingress IDs the parent will call in at least 50% of executions. But, it is not very predictive of parents’ total number of RPC calls or concurrency among RPC calls. Adding children sets’ Ingress IDs to parent Ingress IDs more accurately predicts concurrency of RPC calls.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgrtfKuT78yVn4CZnaotdAllHG5qt0asXfGmVBDyDAP060pqyhoH4P8ReqOQNMfKnJJHJQN3lVL0CEu1M5btJqe5LdykOxfg1iNvwriy6ZeD4siplwMT3FZvewv-0SsBbBsWN2kqbazc5r-uDVJIOyuVbtdCAH6tDkdS1Y4z7490VKBch8hBX5EfHLgYdQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="368" data-original-width="972" height="151" src="https://blogger.googleusercontent.com/img/a/AVvXsEgrtfKuT78yVn4CZnaotdAllHG5qt0asXfGmVBDyDAP060pqyhoH4P8ReqOQNMfKnJJHJQN3lVL0CEu1M5btJqe5LdykOxfg1iNvwriy6ZeD4siplwMT3FZvewv-0SsBbBsWN2kqbazc5r-uDVJIOyuVbtdCAH6tDkdS1Y4z7490VKBch8hBX5EfHLgYdQ=w400-h151" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg0YVu6Q503Fc_1dTSRyiN1JLWEHBz40xifc1SiJ-qe33u8kn_imNMADTkN8QrcfYS3XgG9kCkqsxWVi4uHQY7GgDSTs2VBrrsXrOZpnR3tVurFEeZpuzNiyvZLMQ4-14d28tDXp1xGNIkWBJi9dYvMUoudUcw8C3pWHSobCKMUiTkEBM2WK8mmTkn4EAU" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1208" data-original-width="972" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEg0YVu6Q503Fc_1dTSRyiN1JLWEHBz40xifc1SiJ-qe33u8kn_imNMADTkN8QrcfYS3XgG9kCkqsxWVi4uHQY7GgDSTs2VBrrsXrOZpnR3tVurFEeZpuzNiyvZLMQ4-14d28tDXp1xGNIkWBJi9dYvMUoudUcw8C3pWHSobCKMUiTkEBM2WK8mmTkn4EAU=w322-h400" width="322" /></a></div><div class="separator" style="clear: both; text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/a/AVvXsEg2w4ocW_axgOIIeC0P2EwAAuDljePNry_dWrotDdd95J5_4XhmHW8x461Ug_WS0kjgaYPPNwFWrR3Xe6tQ7a_4jKESUhQj-6TcBJn2PI2980MjlNaUXcJtRR0Kp6DIDZf3CAtCXiD2oPZRDmpBpLiGvBkdvsfCG1KmyES71ShuMbHTL9dBCkmf_kMkBpE" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="1012" data-original-width="972" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEg2w4ocW_axgOIIeC0P2EwAAuDljePNry_dWrotDdd95J5_4XhmHW8x461Ug_WS0kjgaYPPNwFWrR3Xe6tQ7a_4jKESUhQj-6TcBJn2PI2980MjlNaUXcJtRR0Kp6DIDZf3CAtCXiD2oPZRDmpBpLiGvBkdvsfCG1KmyES71ShuMbHTL9dBCkmf_kMkBpE=w385-h400" width="385" /></a></div><br /><b><u>Finding F6:</u></b> Many call paths in the traces are prematurely terminated due to rate limiting, dropped records, or non-instrumented services like databases. Few of these call paths can be reconstructed (those known to terminate at databases) while the majority are unrecoverable. Deeper call paths are disproportionately terminated.<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiSZ6lipTHSiUmKlpXu4XePBik_sweXWEF1Wc0I3CT460ocnqeaT2GUSem8-0mMf-IhUg2MgUklqLp9252QbQLTtKbQkcmaW1LV8CkNB4rYcE66bItM-oSPshEWuJ20XGJhlNjhggSEL0JjvsLbpE3p-bLTtl6Ihrxh-bqTKgfA3LXUNKYPd6lJ5qBupk0" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="560" data-original-width="972" height="230" src="https://blogger.googleusercontent.com/img/a/AVvXsEiSZ6lipTHSiUmKlpXu4XePBik_sweXWEF1Wc0I3CT460ocnqeaT2GUSem8-0mMf-IhUg2MgUklqLp9252QbQLTtKbQkcmaW1LV8CkNB4rYcE66bItM-oSPshEWuJ20XGJhlNjhggSEL0JjvsLbpE3p-bLTtl6Ihrxh-bqTKgfA3LXUNKYPd6lJ5qBupk0=w400-h230" width="400" /></a></div><br />What intrigued me was the solitary mention of memcache in this paper, which is peculiar given its prominence at Meta. 
In contrast, <a href="https://muratbuffalo.blogspot.com/2023/03/characterizing-microservice-dependency.html">the Alibaba trace paper</a> highlighted memcache's significance, revealing that in call graphs with over 40 microservices, approximately 50% of the microservices were Memcacheds (MCs).<br /><p></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-18847456484418830652023-11-28T22:12:00.000-05:002023-11-28T22:12:09.516-05:00Our Florida vacation<p>No paper review this week. Instead, I stumbled upon notes buried in my blog.org entries. With Buffalo now cold and snowy, reminiscing about last June's hot Florida vacation seemed fitting.<br /><br /></p><p></p><p>True to our tradition, we drove there from Buffalo, relishing the two-day road trip. Road trips are our love -- in 2018-19 we had crossed the US East-to-West and then West-to-East. <a href="https://muratbuffalo.blogspot.com/2020/02/traveling-across-us.html">I had documented the East-to-West drive here</a>. This time, it was North-to-South, all the way to the southernmost point of the US—the <a href="https://en.wikipedia.org/wiki/Key_West">Key West Islands</a>. <br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiNopPiRzOW1qcuc1GtlG-UarsDZ9OEyn1V0EO4hCfc-IgPSN8_up7Lem-wbUi3rGAOYURfEibcBOmtk6pM7_XtlY7rUtdzfkswCe214puhGlZMwhJn8nYwDZ07Du-NOgQMe8jRt6u_kgGGZChkXQ-Y-ge-44NXVtlDPQH1E4hVJSS_dRnzRFz6YoRbmgQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="4032" data-original-width="3024" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEiNopPiRzOW1qcuc1GtlG-UarsDZ9OEyn1V0EO4hCfc-IgPSN8_up7Lem-wbUi3rGAOYURfEibcBOmtk6pM7_XtlY7rUtdzfkswCe214puhGlZMwhJn8nYwDZ07Du-NOgQMe8jRt6u_kgGGZChkXQ-Y-ge-44NXVtlDPQH1E4hVJSS_dRnzRFz6YoRbmgQ=w300-h400" width="300" /></a></div><br />Yes, the driving was a bit tiring. 
But with our new SUV, good audiobooks, and sightseeing on the way, it was enjoyable. We Hotwire'd the hotels for the drive in the afternoon of each driving day. That is how the Demirbas family rolls.<br /><br />Our AirBnB in Orlando was at a resort. It was a 5-bedroom rental. We were able to get it cheap, at $200 a day after taxes and everything. It was very comfortable, and we enjoyed the lazy life at the resort. <br /><p></p><p><br />Oh God, everything is big in Orlando! We hit the Walmart close to our rental for some amenities, and it was the biggest Walmart I have ever been to. And the most crowded. It looked like people were raiding the place: the shelves kept emptying, while the staff was perpetually busy restocking.<br /><br /><br />We visited Orlando for DisneyWorld --a quintessential dad-duty in the U.S. and a rite of passage for American kids. DisneyWorld is big (duh!) and complicated. You have to do your homework and know your stuff. It is like applying to college: you have to do a lot of reading. First, decide which park to go to. Then learn the map and the rides. Explore virtual queues; navigate through Genie, Genie+, and individual ride fastlanes. Assess if any are worth the purchase. 
Fortunately, things weren't as daunting after we entered the park.<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi1iycgy5DAzikW3ZjkdCaFr_k76XAWMXDOnuBuyxqfahiYyMXJOXQrXLIzzr7vYyOmcy6tE__b1mtgzT6HZGWDr9UqBCTbaH1BijEUnqOR-mYiBOup3Lby-nDq8VXq-QNNkvu37cR_-FLtWf4ywm4EFRq5vH6C1Fo1Fqxoz5yFJj7EJEJv6Gc8Z8U9iY8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/a/AVvXsEi1iycgy5DAzikW3ZjkdCaFr_k76XAWMXDOnuBuyxqfahiYyMXJOXQrXLIzzr7vYyOmcy6tE__b1mtgzT6HZGWDr9UqBCTbaH1BijEUnqOR-mYiBOup3Lby-nDq8VXq-QNNkvu37cR_-FLtWf4ywm4EFRq5vH6C1Fo1Fqxoz5yFJj7EJEJv6Gc8Z8U9iY8=w400-h300" width="400" /></a></div><br /><br />We went to <a href="https://disneyworld.disney.go.com/destinations/hollywood-studios/">the Hollywood Studios.</a> We had a swell time. <br /><br />But I had my gripes. Orlando is like a sauna. Sweating became routine during ride waits, even in the shade. When we were in the queue for the stupid Slinky Dog rollercoaster, it got closed due to a thunderstorm, and we waited for it for 2+ hours. Indoor waiting is another story; it is a good covid incubation ground. <br /><br />The Star Wars props were impressive and well-executed. It was an immersive atmosphere. The rides used screens, but they felt real enough. The in-place rides' jolting motion didn't sit well with me. 
<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiNEKDsCWt1b2By8S6_bydtjcCkuKo2XgC2OnYhumcl5bC7HL5SD7gEZxzLYTyRSSPNWQ3n1Qu3hVQWTKEX57QqUXCXH0IRM2pRtLAsRGeDZ_co2tbtHWp-DiAYy64X_W0o7KFtBi_6QTceTdx8aKcRN-ogzNddbfeLoDNrsyRa1TfLUEubGII629sCvCM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="4032" data-original-width="3024" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEiNEKDsCWt1b2By8S6_bydtjcCkuKo2XgC2OnYhumcl5bC7HL5SD7gEZxzLYTyRSSPNWQ3n1Qu3hVQWTKEX57QqUXCXH0IRM2pRtLAsRGeDZ_co2tbtHWp-DiAYy64X_W0o7KFtBi_6QTceTdx8aKcRN-ogzNddbfeLoDNrsyRa1TfLUEubGII629sCvCM=w300-h400" width="300" /></a></div><br />The rides hit me hard. I think I have a sensitive inner ear; even elevators have triggered lightheadedness for me ever since I was a boy. So the rides really took a toll on me. I felt like throwing up after all of them. The Slinky Dog ride's acceleration and sharp turns did me in, worsening my dizziness.<br /><br />Even on the "Runaway Railway" ride, described by the doorperson as a "slow train ride," I struggled. Well, it was far from an easy ride! At the beginning of the ride, the train breaks into wagons, navigating various rooms with jolting motion. In one room, the wagons even waltz under Daisy Duck's instruction. Yeah, right, a slow train ride--the doorperson surely messed with me.<br /><br />Motion sickness folks, steer clear of the "Tower of Terror." The elevator accelerates to the 5th floor, drops freely to the 2nd floor, and repeats this a couple of times. What was I thinking? 
While waiting in the queue, I had been hearing the screams of people on the ride and seeing the windows open at the 5th floor. Why didn't I skip this one? I guess I was embarrassed, because my 7-year-old daughter was very eager to do this. When the ride was over, she screamed "This is awesome, let's do it again!" And all I could think was "let's never do this again."<br /><br />There was a crazy roller coaster ride, called Rock and Roll or something, where the doorperson said the ride goes upside down 3 times. Based on my experience with the other doorperson, I skipped this one. My son and older daughter braved it, while the rest of the family went to the Muppets 3D show -- a delightful choice. Earlier in the day, we had gone to the Frozen show, and after that the Indiana Jones show. We had timed those shows really well, with no wait time in between them. Those shows were my favorite, as they were the only ones where I didn't get motion sickness.<br /><br />After a week in Orlando, we headed back up north towards Buffalo, but after 30 minutes of driving, on a whim, we did a U-turn and continued southwards to Miami, and visited Miami Beach. The next day we drove all the way to Key West and visited the islands. I guess we were feeling spontaneous and adventurous. We really enjoyed the travel, and wanted to do more of it immediately after we returned. <br /><br />Another thing worth mentioning: we ate at the <a href="https://www.alnatourrestaurant.com/">AlNatour restaurant</a> in Fort Lauderdale, and it was very good. <br /><p></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com0tag:blogger.com,1999:blog-8436330762136344379.post-64598895715081007922023-11-22T14:16:00.003-05:002023-11-22T15:16:25.046-05:00Towards Modern Development of Cloud Applications<p><a href="https://dl.acm.org/doi/10.1145/3593856.3595909">This paper is from HotOS'23.</a> At 6 pages, it is an easy-to-read paper, but it is not an easy-to-agree-with paper. 
The message is controversial: Don't do microservices, write a monolith, and our runtime will take care of deployment and distribution. This is a big claim, and we have been burned by ambitious attempts like this many times before. I realize big claims are part of the style of HotOS, where work-in-progress and sometimes provocative papers make a debut to kickstart a discussion. This paper sure does a good job of starting a discussion.<br /><br /></p><h2 style="text-align: left;">Good</h2><p style="text-align: left;"><a href="https://github.com/ServiceWeaver">There is code, and it is open source</a>, so this is not just a speculation paper. A Go framework does exist, which has been under development for some time inside Google. Given Google's expertise in infrastructure and Go, I think this framework will be a big boon to the Google Cloud Platform (GCP), if it gets into production.<br /><br />To evaluate the framework (let's call it ServiceWeaver, after its GitHub name, shall we?), they consider a popular web application: <a href="https://github.com/GoogleCloudPlatform/microservices-demo">Online Boutique.</a> They say that Online Boutique is "representative of the kinds of microservice applications developers write". It consists of about 10K lines of code, implemented as ...(wait for it)... 11 microservices! 
<br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh0F4icBfMvJgAQYJLrj7LswMJF6q3so_V7uSrTtq12bc-VOiYOfoYfCQBIj9q_FFLLzpp_2IiAXhTesNoiiCra2tVwGtqnLfUV6xo26UA3Hv015Ut264x05FSOzAtrbPQ61DqMfXbiNZOXY6IsjLYA_EkpYTiN84eSEk2HC2Hg4ETxBa-6-VDTtFzgUMQ" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="186" data-original-width="574" height="104" src="https://blogger.googleusercontent.com/img/a/AVvXsEh0F4icBfMvJgAQYJLrj7LswMJF6q3so_V7uSrTtq12bc-VOiYOfoYfCQBIj9q_FFLLzpp_2IiAXhTesNoiiCra2tVwGtqnLfUV6xo26UA3Hv015Ut264x05FSOzAtrbPQ61DqMfXbiNZOXY6IsjLYA_EkpYTiN84eSEk2HC2Hg4ETxBa-6-VDTtFzgUMQ" width="320" /></a></div>For evaluation, Table 2 shows the number of CPU cores used and the end-to-end latencies. For 10K queries per second, the 11-microservice Go implementation uses 78 cores, but the monolithic implementation (deployed with their ServiceWeaver runtime) uses 28 cores. These numbers are without colocation of any components. The savings become more impressive if you put all 11 components into a single OS process: the number of cores drops to 9, and the median latency drops to 0.38 ms, both an order of magnitude lower than the baseline. <br /><br />Ok, let's step back for a minute. Did a 10K LOC application need to be implemented as 11 microservices? Did it have to be distributed in the first place? If you start with a distributed-to-a-fault baseline, it is easy to show impressive improvements.<br /><br />Let's remember Frank McSherry's holy war against unnecessarily distributed analytics services. <a href="https://muratbuffalo.blogspot.com/2017/06/scalability-but-at-what-cost.html">I had reviewed the "Scalability, but at what COST?" paper here.</a> Frank had shown that "some single threaded implementations [on Frank's laptop] were found to be more than an order of magnitude faster than published results (at SOSP/OSDI!) for systems using 100s of cores"! 
If you start with "poor baselines and low expectations", it is easy to show impressive improvements. <br /><br />Let's get back to the evaluation section of the paper. They say that most of the performance benefits of the monolithic implementation come from getting rid of versioning and field numbers. Wow! How do you do atomic monolithic deployments? The answer: blue-green deployments! But such one-shot deployments would be particularly hard to coordinate across AZs, let alone across regions. And finally, how do you deal with versions in the database, and with schema changes, when doing these deployments?<br /><br /><br />To conclude the "good" parts, I want to mention the challenges discussion in the introduction. There are remedies to these (such as knowing what you are doing), but there is no denying that these are real challenges. <br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi6_8k_OAGxHT1CYmDm-xknrjce4uKl2WuM-CgG3OeAPu_JoNW15FlkqIQBUZqXY2iPAZPFHW_79AfDhEJv5Nzt7yThOMFhH3z0uXLryDWnFZQO-gBhr6rGcxoIwi5y7xHpR7WCU5Li7cA9XNc_Akr0FQC_E6vdy5FVbKRVZivwPC-6Wt0eyzgB3BSHFTA" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="679" data-original-width="574" height="400" src="https://blogger.googleusercontent.com/img/a/AVvXsEi6_8k_OAGxHT1CYmDm-xknrjce4uKl2WuM-CgG3OeAPu_JoNW15FlkqIQBUZqXY2iPAZPFHW_79AfDhEJv5Nzt7yThOMFhH3z0uXLryDWnFZQO-gBhr6rGcxoIwi5y7xHpR7WCU5Li7cA9XNc_Akr0FQC_E6vdy5FVbKRVZivwPC-6Wt0eyzgB3BSHFTA=w338-h400" width="338" /></a></div><br />I think the biggest challenge with microservices is the complexity of integration. When you start building with microservices, integration becomes challenging: the longer you delay integration, the bigger the pain. <br /><p></p><p style="text-align: left;"><br /></p><h2 style="text-align: left;">Bad</h2><p style="text-align: left;">The claims are not scoped well. 
I think this framework is good for many web/frontend applications. But the paper makes a general claim. For crying out loud, the paper starts with this sentence: "<b>When writing a distributed application,</b> conventional wisdom says to split your application into separate services that can be rolled out independently." <br /><br />After that sentence, I read the entire paper with distributed applications/systems in mind. The paper doubles down on this claim in the last sentence of the introduction: "Though these challenges and our proposal are discussed in the context of serving applications, we believe that our observations and solutions are broadly useful."<br /><br />But as I kept reading, I realized that this would not apply to general distributed services, and specifically not to backend systems. This is more applicable to a limited domain, like platform-as-a-service (PaaS) applications: say, a web application with limited freedoms. As I mentioned above, this would be a great boon for web services built on GCP, for example. And that looks like the end game here.<br /><br />At the end of the paper, in Section 8.3, there is a very short paragraph talking about distributed systems. After having said so many things about how the ServiceWeaver framework/runtime distributes things and takes care of distributed systems concerns, this paragraph comes across as confusing. 
Too little, too late?<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjGOYCazR2A0EsQ9dYRtbemvhuTYi3aZU2-dCPg2rRH5sY-Zbw_ce3UF8h5zDY57Pbe_2LLMPpou4LsN090qIjW3p5YdLiPmM1DFTqGUrfirlgN6K2S73fRwst9W2OZUIFIsmUVlOPEhcVGgSvay2hv3rJcKqS4GG4ZTAl4qvKKarwofvVaBlidsixzqPM" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="223" data-original-width="574" height="124" src="https://blogger.googleusercontent.com/img/a/AVvXsEjGOYCazR2A0EsQ9dYRtbemvhuTYi3aZU2-dCPg2rRH5sY-Zbw_ce3UF8h5zDY57Pbe_2LLMPpou4LsN090qIjW3p5YdLiPmM1DFTqGUrfirlgN6K2S73fRwst9W2OZUIFIsmUVlOPEhcVGgSvay2hv3rJcKqS4GG4ZTAl4qvKKarwofvVaBlidsixzqPM" width="320" /></a></div><p></p><h2 style="text-align: left;"> Ugly</h2><p style="text-align: left;"></p><p style="text-align: left;">By engaging in the microservices versus monolith architecture debate, the paper pokes the beehive without answering the real questions. What do I mean by real questions? Can this alternative approach address the problems that microservices turned into non-problems, problems we have since forgotten were ever problems? <br /></p><blockquote>"Tradition is a set of solutions for which we have forgotten the problems. Throw away the solution and you get the problem back. Sometimes the problem has mutated or disappeared. Often it is still there as strong as it ever was." -- Donald Kingsbury</blockquote>This harkens back to the <a href="https://fs.blog/chestertons-fence/">famous Chesterton's fence principle</a>, which cautions against dismissing established systems without comprehending their original purpose, and reminds us that second-order effects should be considered. <br /><blockquote>There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. 
The more modern type of reformer goes gaily up to it and says, “I don’t see the use of this; let us clear it away.” To which the more intelligent type of reformer will do well to answer: “If you don’t see the use of it, I certainly won’t let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.”</blockquote>One big benefit of microservices is that they serve as a technical patch to a social problem: organizing development across two-pizza teams. You assign each team of 5-10 a microservice, and this reduces the communication overhead. Microservices are not without their challenges. Integration is a challenge, but at least it gets you going. <br /><br />When developing the application as a monolith, where do you start coding, and how do you grow the code? Scaling needs to be thought out. Every 10X in size requires a different design. How would growing a monolith for scaling work? Does it start as an unscalable system first? But then what is the design path to make it scalable?<br /><br />Sure, you can do separation-of-work and reduction of coordination with the monolith approach if you know what you are doing. But if you know what you are doing, you would avoid many problems with microservices as well.<br /><br /><br />The paper didn't cite or address the classic <a href="https://scholar.harvard.edu/files/waldo/files/waldo-94.pdf">"A note on distributed computing" by Jim Waldo in 1994.</a> That paper has a section titled "Dejavu all over again". In 1994! That was before DCOM and CORBA. Looks like enough time has passed, and the pendulum swings yet again, one more time. 
<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEikDwTOqHxhAiDgW8mwjG_w3wmY4dHWtftqWdTF9iX0Ze9frxTUJbZG-HuYyRumOOocL8s8lxb5aT3WAkzPO5ADjsPvG_-snAciVHzrItlq8DnM-YN_7pJJDogrHpRXG5QP_uwpwmP-lyRg6ak_Lgr_stGz1ADULbJG4HOOuTs1Zcx6NRQngEEcQsMbBTg" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="409" data-original-width="575" height="228" src="https://blogger.googleusercontent.com/img/a/AVvXsEikDwTOqHxhAiDgW8mwjG_w3wmY4dHWtftqWdTF9iX0Ze9frxTUJbZG-HuYyRumOOocL8s8lxb5aT3WAkzPO5ADjsPvG_-snAciVHzrItlq8DnM-YN_7pJJDogrHpRXG5QP_uwpwmP-lyRg6ak_Lgr_stGz1ADULbJG4HOOuTs1Zcx6NRQngEEcQsMbBTg" width="320" /></a></div><br />Figure 1 sounds great on paper. ServiceWeaver can put components all together for efficiency. But what about blast radius, bursty/coordinated traffic, and <a href="http://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">metastable failures</a>?<br /><p></p><p><br />"Did Google invent AGI? These questions are <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI complete.</a>" If I were reading this paper 5 years ago, I would not have hesitated to write that as a counter-argument. With the recent advancements in ML, maybe it is time to reconsider this smart-middleware approach again. I don't know. Still, this is a tall order. How do you automate design, especially distributed systems design? 
<br /><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi4SngNRpDaitCxPUAC6MVcQl4PE9acrce-ppE52qa6unpH2qC3BqFEami0XW-j0ZBgxpe_ivXbhwX54SRZZG0a8NixzH0XSh8zXv0T7F_a_sQp3e_MwifMMA1m86G9PaD-E_ckvPSQJUuKi29ecuV6_4qxJ6O_pHqgcJG8hxZ9DgD13lgfO_7i4gNFux8" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="472" data-original-width="575" height="526" src="https://blogger.googleusercontent.com/img/a/AVvXsEi4SngNRpDaitCxPUAC6MVcQl4PE9acrce-ppE52qa6unpH2qC3BqFEami0XW-j0ZBgxpe_ivXbhwX54SRZZG0a8NixzH0XSh8zXv0T7F_a_sQp3e_MwifMMA1m86G9PaD-E_ckvPSQJUuKi29ecuV6_4qxJ6O_pHqgcJG8hxZ9DgD13lgfO_7i4gNFux8=w640-h526" width="640" /></a></div><br /><p></p>Murathttp://www.blogger.com/profile/07842046940394980130noreply@blogger.com1