DDIA: Chp 5. Replication (Part 2)

Chapter 5 of the Designing Data Intensive Applications (DDIA) book discusses strategies and challenges in replicating data across distributed systems. I covered the first part last week; here is the second part, which focuses on leaderless replication.

Leaderless replication abandons the concept of a leader node, and allows any replica to directly accept writes from clients. This method, while used in some of the earliest replicated data systems, fell out of favor during the dominance of relational databases. However, it regained popularity after Amazon implemented it in their in-house Dynamo system (not to be confused with DynamoDB). This inspired several open-source datastores like Riak, Cassandra, and Voldemort, which are often referred to as Dynamo-style databases.

In leaderless replication systems, the concepts of quorums and quorum consistency are crucial. These systems use three key parameters:

  • n: the total number of replicas
  • w: the write quorum (number of nodes that must acknowledge a write)
  • r: the read quorum (number of nodes that must respond to a read)

The quorum condition, w + r > n, allows the system to tolerate unavailable nodes while still maintaining consistency. For example, with n = 3, w = 2, and r = 2, the system can tolerate one unavailable node. With n = 5, w = 3, and r = 3, it can tolerate two unavailable nodes.

In practice, operations are typically sent to all n replicas in parallel, but the system only waits for w or r successful responses before considering the operation complete. When w + r > n, you can generally expect every read to return the most recent value written for a key, as there must be an overlap between the nodes written to and the nodes read from.
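As a toy illustration, here is a minimal sketch of a quorum write and a quorum read. The in-memory replica layout and the timestamp-based versioning are my own assumptions for the sketch, not any particular database's API.

```python
# A toy sketch of quorum writes and reads over in-memory replicas.
N, W, R = 3, 2, 2                        # w + r > n, so write and read sets overlap

replicas = [{} for _ in range(N)]        # each replica maps key -> (timestamp, value)

def quorum_write(key, value, ts):
    """Send the write to all n replicas; declare success once w have acknowledged."""
    acks = 0
    for rep in replicas:                 # a real system issues these as parallel requests
        rep[key] = (ts, value)
        acks += 1
    return acks >= W                     # here every replica answers; in practice you stop waiting at w

def quorum_read(key):
    """Query the replicas, take r responses, and return the newest version seen."""
    responses = [rep[key] for rep in replicas[:R] if key in rep]
    return max(responses)[1] if responses else None

quorum_write("color", "blue", ts=1)
print(quorum_read("color"))              # -> "blue"
```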

However, even with this quorum condition, there are many corner cases where stale values might be returned. As we discuss below, sloppy quorums can break the overlap between the nodes written to and the nodes read from. Concurrent writes create ambiguity about which write happened first, and may leave only some replicas reflecting the new value. Moreover, node failures and restoring data from a replica carrying old values can silently violate the quorum condition.

The ABD paper (1995) gives a clean treatment of how to achieve and explain linearizability/atomicity in replicated storage. Note that read-repair is built into the read operation's set phase there.
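Building on the toy replica setup above, an ABD-style read would look roughly like the following. The function name and structure are my own paraphrase of the algorithm, not the paper's pseudocode: the read first gets the highest-timestamped value from a read quorum, then writes it back to a write quorum before returning, which is where the read-repair lives.

```python
def abd_read(key):
    # Get phase: collect versions from a read quorum and pick the one with the highest timestamp.
    responses = [rep[key] for rep in replicas[:R] if key in rep]
    if not responses:
        return None
    ts, value = max(responses)
    # Set (write-back) phase: propagate the chosen version to a write quorum before returning,
    # so that any subsequent read is guaranteed to see a value at least this recent.
    for rep in replicas[:W]:
        if key not in rep or rep[key][0] < ts:
            rep[key] = (ts, value)
    return value
```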


Sloppy quorum and hinted handoff

What does a sloppy quorum mean? In a large cluster, a network partition might cut a client off from the designated "home" nodes for a value, while other nodes remain reachable. In that case the system can either return errors for all requests that cannot reach a quorum, or accept writes on nodes that are reachable but are not among the n "home" nodes for the value. The latter approach is called a sloppy quorum. It still requires w and r successful responses, but these may include nodes outside the usual set. Once the network issue is resolved, the temporarily accepted writes are forwarded to the appropriate "home" nodes through a process called hinted handoff.

Sloppy quorums increase write availability but may impact read consistency, as the latest value might be temporarily stored on nodes outside the usual set. It's important to note that sloppy quorums aren't true quorums in the traditional sense; they only assure durability, not immediate consistency.
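A rough sketch of a sloppy-quorum write with hinted handoff might look like this. The Node class, the reachability flag, and the hint queue are hypothetical illustrations, not Dynamo's or Riak's actual mechanism.

```python
# Hypothetical sketch of sloppy-quorum writes and hinted handoff.
class Node:
    def __init__(self):
        self.reachable = True
        self.store = {}                  # key -> (timestamp, value)
        self.hints = []                  # (key, intended_home_node) writes held for someone else

def sloppy_write(key, value, ts, home_nodes, fallback_nodes, w):
    acks = 0
    unreachable = []
    for node in home_nodes:              # try the designated "home" nodes first
        if node.reachable:
            node.store[key] = (ts, value)
            acks += 1
        else:
            unreachable.append(node)
    # Not enough home nodes reachable: accept the write on fallback nodes, tagged with a hint.
    for home, fallback in zip(unreachable, fallback_nodes):
        fallback.store[key] = (ts, value)
        fallback.hints.append((key, home))
        acks += 1
    return acks >= w

def hinted_handoff(fallback):
    # When a home node becomes reachable again, forward the hinted writes to it.
    for key, home in list(fallback.hints):
        if home.reachable:
            home.store[key] = fallback.store[key]   # a real system would also compare versions
            fallback.hints.remove((key, home))
```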


Last write wins

When it comes to conflict resolution, one common approach is Last Write Wins (LWW). In this strategy, each replica only stores the most "recent" value, discarding older ones. As long as we have some way of unambiguously determining which write is more “recent,” and every write is eventually copied to every replica, the replicas will eventually converge to the same value.
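A minimal sketch of LWW conflict resolution, assuming every write carries a timestamp (the helper name is mine):

```python
# Minimal last-write-wins merge: given the versions held by different replicas for a key,
# keep only the one with the highest timestamp; everything else is discarded.
def lww_merge(versions):
    # versions is a list of (timestamp, value) pairs collected from replicas
    return max(versions)

# Two writes to the same key, both acknowledged to their clients:
print(lww_merge([(17, "blue"), (18, "green")]))   # -> (18, 'green'); the ts-17 write is silently dropped
```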

The book has harsh words for LWW, and declares it unsafe.

"LWW achieves the goal of eventual convergence, but at the cost of durability: if there are several concurrent writes to the same key, even if they were all reported as successful to the client (because they were written to w replicas), only one of the writes will survive and the others will be silently discarded. Moreover, LWW may even drop writes that are not concurrent, as we shall discuss in “Timestamps for ordering events”. There are some situations, such as caching, in which lost writes are perhaps acceptable. If losing data is not acceptable, LWW is a poor choice for conflict resolution."

I think the criticism of LWW is overly harsh. Under single-copy semantics, this is the right behavior. These are not "lost updates": they are values overwritten by other updates to the same keys. This is to be expected in a distributed system with multiple clients. Applications that require strict consistency for specific keys can use conditional updates.
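For instance, a compare-and-set style conditional update (a hypothetical sketch, not any particular database's API) rejects a write whose precondition no longer holds, instead of silently overwriting:

```python
# Hypothetical conditional update: the write succeeds only if the key still holds the
# version the client last read, so conflicting concurrent updates are surfaced, not dropped.
def conditional_update(store, key, expected_version, new_value):
    current_version, _ = store.get(key, (0, None))
    if current_version != expected_version:
        return False                      # someone else updated the key; the client must re-read and retry
    store[key] = (current_version + 1, new_value)
    return True

store = {}
assert conditional_update(store, "cart", expected_version=0, new_value=["milk"])
assert not conditional_update(store, "cart", expected_version=0, new_value=["eggs"])   # stale precondition
```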


The “happens-before” relationship and concurrency

To properly handle concurrent updates, it's crucial to understand Lamport's "happens-before" definition. For any two operations A and B, either A happened before B, B happened before A, or A and B are concurrent. Happened-before is a transitive causal relationship: A happens before B if A occurs earlier on the same process, or if information flows from A to B through communication. Detecting concurrency and resolving conflicts often involves techniques like version vectors or vector clocks built on top of the happened-before concept.

I think this discussion was rushed in the book; it would have been better to introduce happened-before properly, take the time to cover logical clocks, and talk about partial ordering in distributed systems.

Logical clocks capture happened-before only in one direction: A hb B implies lc.A < lc.B, but the implication the other way is not guaranteed. That is why we need vector clocks. After all this discussion about "time in distributed systems", it would have made more sense to talk about version vectors. See my short explanation of logical and vector clocks here. And then there are hybrid logical clocks.
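To make the distinction concrete, here is a small sketch of comparing vector clocks to detect concurrency. The dict-per-process representation is my own illustration.

```python
# Vector clocks capture happened-before exactly: A hb B iff vc(A) <= vc(B) componentwise (and not equal).
# With scalar Lamport clocks, lc.A < lc.B alone cannot tell causality from concurrency;
# with vector clocks, incomparable clocks mean the two events are concurrent.
def compare(vc_a, vc_b):
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "A and B are the same event"
    if a_le_b:
        return "A happened before B"
    if b_le_a:
        return "B happened before A"
    return "A and B are concurrent"

print(compare({"p1": 2, "p2": 1}, {"p1": 3, "p2": 1}))   # A happened before B
print(compare({"p1": 2, "p2": 1}, {"p1": 1, "p2": 2}))   # A and B are concurrent
```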
