Posts

Chapter 7: Distributed Recovery (Concurrency Control Book)

Image
Chapter 7 of the Concurrency Control and Recovery in Database Systems book by Bernstein and Hadzilacos (1987) tackles the distributed commit problem: ensuring atomic commit across a set of distributed sites that may fail independently. The chapter covers these concepts: The challenges of transaction processing in distributed database systems (which wasn't around in 1987) Failure models (site and communication) and timeout-based detection The definition and guarantees of Atomic Commitment Protocols (ACPs) The Two-Phase Commit (2PC) protocol (and its cooperative termination variant) The limitations of 2PC (especially blocking) Introduction and advantages of the Three-Phase Commit (3PC) protocol Despite its rigor and methodical development, the chapter feels like a suspense movie today. We, the readers, equipped with modern tools like FLP impossibility result and Paxos protocol watch as the authors try to navigate a minefield, unaware of the lurking impossibility results that were pu...

Analyzing Metastable Failures in Distributed Systems

Image
So it goes: your system is purring like a tiger, devouring requests, until, without warning, it slumps into existential dread. Not a crash. Not a bang. A quiet, self-sustaining collapse. The system doesn’t stop. It just refuses to get better. Metastable failure is what happens when the feedback loops in the system go feral. Retries pile up, queues overflow, recovery stalls. Everything runs but nothing improves. The system is busy and useless. In an earlier post, I reviewed the excellent OSDI ’22 paper on metastable failures , which dissected real-world incidents and laid the theoretical groundwork. If you haven’t read that one, start there. This HotOS ’25 paper picks up the thread. It introduces tooling and a simulation framework to help engineers identify potential metastable failure modes before disaster strikes. It’s early stage work. A short paper. But a promising start. Let’s walk through it. Introduction Like most great tragedies, metastable failure doesn't begin with villain...

Chapter 6: Centralized Recovery (Concurrency Control Book)

Image
With Chapter 6, the Concurrency Control and Recovery in Database Systems book shifts focus from concurrency control to the recovery! This chapter addresses how to preserve the atomicity and durability of transactions in the presence of failures, and how to restore the system to a consistent state afterward. The book offers a remarkably complete foundation for transactional recovery, covering undo/redo logging, checkpointing, and crash recovery. While it doesn't use the phrase "write-ahead logging", the basic concepts are there, including log-before-data and dual-pass recovery. When the book was written, the full WAL abstraction in ARIES was still to come in another five years at 1992 ( see my review here ). I revisit this discussion/comparison at the end of the post. System Model and Architecture In earlier chapters, we had reviewed the architecture: transactions pass through a transaction manager (TM), which sends operations to a scheduler and then to a data manager (DM...

Chapter 5: Multiversion Concurrency Control (Concurrency Control Book)

Image
Chapter 5 of Concurrency Control and Recovery in Database Systems (1987) introduces multiversion concurrency control (MVCC), a fundamental advance over single-version techniques. Instead of overwriting data, each write operation creates a new version of the data item. Readers can access older committed versions without blocking concurrent writes or being blocked by concurrent writes. MVCC removes read-write conflicts and increases concurrency significantly. Having multiple versions around gives the scheduler flexibility: if a read arrives "too late" to see the latest write, it can still proceed by accessing an older version. This avoids unnecessary aborts. Writes may still abort due to write-write conflicts, but reads are largely unimpeded. This is especially beneficial in read-heavy workloads. This chapter presents three broad classes of multiversion methods: Multiversion Timestamp Ordering (MVTO), Multiversion Two-Phase Locking (MV2PL), and Multiversion Mixed Methods. For ...

Chapter 4: Non-Locking Schedulers (Concurrency Control Book)

Image
Chapter 4 of the Concurrency Control and Recovery in Database Systems book (1987) opens with a sentence that doesn't quite pass the grammar test: "In this chapter we will examine two scheduling techniques that do not use locks, timestamp ordering (TO) and serialization graph testing (SGT)."  That comma is trying to do the job of a colon and failing at it. Precision matters, more so in technical writing. The writing is otherwise clear and careful. And as par the book, it is ahead of its time. The chapter presents a spectrum of non-locking schedulers, starting from Basic TO, expanding into certifiers (which basically stands for optimistic concurrency control), and ending with modular, composable scheduler designs that separate synchronization concerns cleanly between read-write and write-write synchronization.  Let's dig into the details. Timestamp Ordering (TO) Timestamp Ordering (TO) uses transaction start timestamps to impose a serial order on conflicting operations...

Chapter 3: Two Phase Locking (Concurrency Control Book)

Image
Chapter 3 presents two-phase locking (2PL). Remember I told you in Chapter 2: Serializability Theory that the discussion is very scheduler-centric? Well, this is a deeper dive into the scheduler, using 2PL as the concurrency control mechanism. The chapter examines the design trade-offs in scheduler behavior, proves the correctness of basic 2PL, dissects how deadlocks arise and are handled, and discusses many variations and implementation issues. Here are the section headings in Chapter 3. Aggressive and Conservative Schedulers Basic Two Phase Locking Correctness of Basic Two Phase Locking Deadlocks Variations of Two Phase Locking Implementation Issues The Phantom Problem Locking Additional Operators Multigranularity Locking Distributed Two Phase Locking Distributed Deadlocks Locking Performance Tree Locking Yep, this is a long chapter: 65 pages.  3.1 Aggressive and Conservative Schedulers The chapter opens by asking: how aggressive or conservative should a scheduler be? An aggress...

Chapter 2: Serializability Theory (Concurrency Control Book)

Image
Chapter 2 of Concurrency Control and Recovery in Database Systems (1987) by Bernstein, Hadzilacos, and Goodman is a foundational treatment of serializability theory. It is precise, formal, yet simple and elegant, a rare combination for foundational theory in a systems domain. Databases got lucky here: serializability theory is both powerful and clean. The chapter builds up the theory step by step, introducing: Histories Serializable histories The Serializability Theorem Recoverability and its variants Generalized operations beyond reads/writes View equivalence Each section motivates definitions clearly, presents tight formalism, and illustrates ideas with well-chosen examples. 2.1 Histories This section lays the groundwork. It starts slow, and doesn't do anything fancy. It first defines what it means for the operations within a transaction to form a well-founded partial order. This intra-transaction ordering extends naturally to inter-transaction operations, forming a superset rel...

Modular verification of MongoDB Transactions using TLA+

Image
Joint work with Will Schultz . A transaction groups multiple operations into an all-or-nothing logical-box to reduce the surface area exposed to concurrency control and fault recovery, simplifying the application programmer's job. Transactions support ACID guarantees: atomicity, consistency, isolation, durability. Popular isolation levels include, Read Committed (RC), Snapshot Isolation (SI), and Serializability (SER), which offer increasing protection against concurrency anomalies. MongoDB Transactions MongoDB’s transaction model has evolved incrementally.   v3.2 (2015): Introduced single-document transactions using MVCC in the WiredTiger storage engine.   v4.0 (2018): Extended support to multi-document transactions within a replica set (aka shard).   v4.2 (2019): Enabled fully distributed transactions across shards. Replica Set Transactions.  All transaction operations are first performed on the primary using the WiredTiger transaction workflow/algorithm. Before co...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Making database systems usable

Advice to the young

Looming Liability Machines (LLMs)

Learning about distributed systems: where to start?

Foundational distributed systems papers

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects