DDIA: Chp 9. Consistency and Consensus

Chapter 9 of the Designing Data Intensive Applications (DDIA) book has the following four sections (which contain a total of 10 subsections).

  • Consistency guarantees
  • Linearizability
  • Ordering guarantees
  • Distributed transactions and consensus

TMI (Too much info)

The chapter tries to do too much. Almost an entire semester of distributed systems content is force-jammed into this chapter. I don't think the Discord reading group will be able to complete this chapter and make sense of what is going on. For a more manageable load, much of this chapter should be moved to other chapters.

The chapter starts by explaining linearizability. It would have been better to explain this in Chapter 5, Replication. Linearizability sounds easy, but it takes time to internalize. I had discussed linearizability here, and later revisited it with this paper. And of course it was mentioned in many other posts.
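To make the definition concrete, here is a toy sketch (my own illustration, not from the book) of the simplest way linearizability gets violated on a single register: a read that starts after a write has completed must return that write's value (or a newer one), otherwise the history is not linearizable.

# Toy check for one special case of linearizability (illustration only,
# not a general checker): the read below starts after the latest write
# to x has completed, so it must observe that write's value.

w1 = {"value": 1, "end_time": 2}    # write(x=1) finished at t=2
r1 = {"value": 0, "start_time": 3}  # read(x) -> 0, began at t=3

def stale_read(write, read):
    # A read that begins after the write ended but returns an older value
    # is a real-time ordering (linearizability) violation.
    return read["start_time"] > write["end_time"] and read["value"] != write["value"]

print(stale_read(w1, r1))  # True: this history is not linearizable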

The chapter continues with the CAP tradeoff, rightly calling it "the unhelpful CAP theorem". But since this is a crowded chapter, there isn't enough space to talk about what was wrong with CAP's misinterpretations and misuses, and what we have in our arsenal to attack these problems. I talked about CAP here and here, and mentioned it in several other places as well.

Then the chapter goes on to "ordering and causality". That is another huge topic that should have been its own chapter. It takes time (pun intended) to understand the happened-before, causality, and concurrency concepts. In addition to Lamport clocks, this discussion should also include vector clocks. As the book says, linearizability is not the only option out there. In order to build such applications/algorithms, we should first build the abstractions and the tools to capture partial orderings.
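As one concrete way to capture partial orderings, here is a minimal vector clock sketch (my own illustration, not from the book): each process keeps one counter per process, a receive merges clocks elementwise, and two events are concurrent exactly when their clocks are incomparable.

# Minimal vector clock sketch: captures happened-before as a partial order,
# unlike a Lamport clock, which only gives a total order consistent with it.

def increment(vc, pid):
    vc = dict(vc)
    vc[pid] = vc.get(pid, 0) + 1
    return vc

def merge(local, received, pid):
    # On message receipt: elementwise max, then tick the local entry.
    merged = {p: max(local.get(p, 0), received.get(p, 0))
              for p in set(local) | set(received)}
    return increment(merged, pid)

def happened_before(a, b):
    # a -> b iff a <= b elementwise and a != b.
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

a = increment({}, "p1")   # event on p1: {p1: 1}
b = increment({}, "p2")   # event on p2: {p2: 1}
c = merge(b, a, "p2")     # p2 receives p1's message: {p1: 1, p2: 2}

print(happened_before(a, c))                          # True: a causally precedes c
print(happened_before(a, b), happened_before(b, a))   # False False: a and b are concurrent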

I really liked the Walter paper (SOSP 2011), as it managed to bridge the two worlds: 2PC-based reconciliation of updates with CRDT-based no-coordination updates. Walter uses multi-version concurrency control within each site, provides a global causal order for regular objects (either doing the update at the owner site or using 2PC to coordinate across sites), and is also able to avoid coordination for certain tasks by using CRDT objects exclusively. In sum, Walter (using Parallel Snapshot Isolation) ensures consistent snapshots and transaction ordering within a site, causal ordering across sites, and prevention of write-write conflicts globally.
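Walter's own cset type is not spelled out here; as a stand-in, the grow-only counter below (a toy example of my own, not Walter's data type) shows why CRDT updates need no coordination: every site updates only its own slot, merge is an elementwise max, and any merge order converges to the same value.

# Minimal grow-only counter (G-Counter) CRDT sketch.

def increment(counter, site):
    counter = dict(counter)
    counter[site] = counter.get(site, 0) + 1
    return counter

def merge(a, b):
    # Elementwise max: commutative, associative, idempotent.
    return {s: max(a.get(s, 0), b.get(s, 0)) for s in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

site_a = increment(increment({}, "A"), "A")   # A increments twice: {A: 2}
site_b = increment({}, "B")                   # B increments once:  {B: 1}

print(value(merge(site_a, site_b)))   # 3, regardless of merge order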


State Machine Replication (SMR) and consensus

To further the discussion of linearizable distributed systems, I don't think introducing total order (atomic) broadcast is helpful. I think state machine replication (SMR) is a much better abstraction for thinking about this concept. The Tango and Delos papers explain things in terms of SMR.
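The SMR abstraction fits in a few lines. Here is a sketch under my own simplifying assumptions (an in-memory key-value state machine, a log given as a plain list): as long as replicas apply the same totally ordered command log with a deterministic apply function, they end up in the same state, and consensus is only needed to agree on the log order.

# Minimal state machine replication sketch.

def apply_command(state, cmd):
    # A deterministic key-value state machine.
    op, key, *rest = cmd
    state = dict(state)
    if op == "put":
        state[key] = rest[0]
    elif op == "delete":
        state.pop(key, None)
    return state

def replay(log):
    state = {}
    for cmd in log:
        state = apply_command(state, cmd)
    return state

# In a real system this log would be maintained by Paxos/Raft.
log = [("put", "x", 1), ("put", "y", 2), ("delete", "x")]

replica_1 = replay(log)
replica_2 = replay(log)
print(replica_1 == replica_2)   # True: same log, same state on every replica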

Chapter 5 introduced replication for storage. I think it would be better to introduce Raft/Paxos there, and then also show how to move from the atomic storage problem to the SMR maintenance problem as a more general solution. The new version of the book should introduce how the Paxos/Raft protocols work, since most distributed systems have had to include or customize them as a component.

Consensus is easy to solve if you don't care about fault tolerance: just have a single leader be responsible. But leader failover when that leader crashes, or worse, becomes unavailable for some time to some parts of the system, is very challenging and leads to all kinds of corner cases, including split-brain scenarios. It is easy to underestimate how difficult a problem this is, and easy to mess things up as you cook up your own solution or even modify/customize an existing Paxos/Raft implementation. That is why it is important to cover Paxos/Raft.

As for this chapter, Chapter 9, to keep things concrete it would be nice to show how a Paxos-groups-based approach (like Spanner, MongoDB, or CockroachDB) provides SMR within each replicaset and cross-replicaset transactions, with 2PC running on top of the Paxos-maintained replicasets. This pattern is the most popular approach to building scalable distributed systems in production.
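Here is a highly simplified sketch of that pattern (my own toy model, not any system's actual API): each shard is modeled as a Paxos/Raft-replicated group that durably records its vote and never loses it, and 2PC then runs across the groups.

# "2PC over Paxos groups" pattern, Spanner/CockroachDB style, heavily simplified.

class PaxosGroup:
    """Stands in for a replicaset whose log is maintained via Paxos/Raft."""
    def __init__(self, name):
        self.name = name
        self.log = []          # in reality: a consensus-replicated log

    def prepare(self, txn_id, writes):
        self.log.append(("prepare", txn_id, writes))   # replicated durably
        return "yes"                                   # the group's vote

    def commit(self, txn_id):
        self.log.append(("commit", txn_id))

    def abort(self, txn_id):
        self.log.append(("abort", txn_id))

def two_phase_commit(txn_id, writes_by_group):
    groups = list(writes_by_group)
    votes = [g.prepare(txn_id, writes_by_group[g]) for g in groups]
    if all(v == "yes" for v in votes):        # unanimity required to commit
        for g in groups:
            g.commit(txn_id)
        return "committed"
    for g in groups:
        g.abort(txn_id)
    return "aborted"

g1, g2 = PaxosGroup("shard-1"), PaxosGroup("shard-2")
print(two_phase_commit("t1", {g1: {"x": 1}, g2: {"y": 2}}))   # committed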


Atomic commit

The chapter introduces the atomic commit problem, which is related to the consensus problem. In the consensus problem, multiple nodes must agree on a single value, any proposed value can be chosen, and progress is possible as long as a majority agrees. In the atomic commit problem, multiple nodes must unanimously agree on a commit or abort decision (often for a transaction), the only two possible values are commit and abort, and majority participation is not enough: since a single "no" vote forces an abort, we need to hear from everyone.
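The contrast boils down to the decision rule, as in this toy sketch (my own, not from the book): consensus can decide once a majority accepts a value, while atomic commit can only commit on a unanimous yes.

# Decision rules side by side.

def consensus_decides(acceptances, cluster_size):
    # e.g. Paxos/Raft: a value is chosen once a majority accepts it.
    return acceptances > cluster_size // 2

def atomic_commit_decides(votes):
    # 2PC: a single "no" (or a missing vote) forces abort.
    return "commit" if all(v == "yes" for v in votes) else "abort"

print(consensus_decides(3, 5))                      # True: 3 of 5 is enough
print(atomic_commit_decides(["yes", "yes", "no"]))  # abort: unanimity required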

The 2PC protocol (which I summarized here) provides a simple solution for atomic commit. Well, a simple but fault-intolerant solution... Since you need fault tolerance, this simple solution is not acceptable (analogous to how using a single leader is not acceptable for solving consensus). As I mentioned above, a common approach is to run 2PC over Paxos groups so that each participant is virtually infallible, at which point 2PC works. Or you go with a fault-tolerant SMR solution like Tango/Delos that leverages a single log for the entire system.

Other solutions are also possible. Since the two problems are related, there are interesting bridging opportunities. The U2PC paper shows how atomic commit and fault-tolerant consensus reconcile as special cases of each other. But again, understanding this paper requires a lot of background in consensus, Paxos, and even Fast Paxos to appreciate that bridging in the first place.


A nitpick

The book had already established the isolation properties in Chapter 7: Transactions, so I was confused why it repeats here (in this most crowded section) that linearizability is not equal to serializability. Linearizability is a consistency property, and serializability is a transaction isolation property.

I also have a beef with the book calling Serializable Snapshot Isolation (SSI) not linearizable. The book says:

However, serializable snapshot isolation (see “Serializable Snapshot Isolation (SSI)”) is not linearizable: by design, it makes reads from a consistent snapshot, to avoid lock contention between readers and writers. The whole point of a consistent snapshot is that it does not include writes that are more recent than the snapshot, and thus reads from the snapshot are not linearizable.

I think SSI (summarized here) does not preclude linearizability per key. That a transaction does snapshot-isolation reads for its read set from a transaction-start point does not make SSI violate linearizability. Reads made as part of a transaction fall under the jurisdiction of the transaction's isolation property, which here is SSI, meaning those transactions are serializable with respect to other transactions. It would make more sense to say that SSI does not guarantee strict serializability (SSER), which requires a real-time (clock) order, and maybe that is what the book means to say.
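To see why the snapshot reads are about freshness rather than serializability, here is a toy MVCC read (my own illustration, not from the book or any system): a transaction reading at its snapshot timestamp misses a write that committed after the snapshot was taken, which is stale in real time but still consistent.

# Toy multi-version snapshot read.

versions = {
    "x": [(5, "old"), (12, "new")],   # (commit_timestamp, value)
}

def snapshot_read(key, snapshot_ts):
    # Return the latest version committed at or before the snapshot timestamp.
    visible = [(ts, v) for ts, v in versions[key] if ts <= snapshot_ts]
    return max(visible)[1] if visible else None

print(snapshot_read("x", 10))   # "old": the write committed at ts=12 is invisible
print(snapshot_read("x", 20))   # "new"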

In fact, a better example to provide here would be CockroachDB. CockroachDB provides linearizable reads/writes and serializable transactions, but it still cannot provide SSER transactions, as I explained here.
