DDIA: Chp 5. Replication (Part 1)

Chapter 5 of the Designing Data-Intensive Applications (DDIA) book discusses strategies and challenges in replicating data across distributed systems.

Replication is critical for ensuring data availability, fault tolerance, and scalability. One of the key challenges in replication is maintaining consistency across multiple replicas. Leader-based replication, the most common method, involves a single leader node accepting all writes, while follower nodes replicate the leader's changes. While single-leader replication does not readily guarantee consistency (due to leader failover corner cases), it gives us a fighting chance.

One primary advantage of leader-based replication is its simplicity: all write requests are serialized through the leader, ensuring a consistent order of operations. Followers receive a stream of changes (i.e., the replication log) from the leader and apply them in sequence, ensuring eventual consistency. However, replication lag, the delay between when a write is committed/acknowledged by the leader and when it is reflected in the followers, can lead to inconsistencies for reads. For example, a user might read outdated information from a follower if the follower hasn't caught up with the leader.
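To make the lag concrete, here is a minimal toy sketch (hypothetical, not from the book): the leader appends each write to its replication log, and a follower applies the log entries in order but with a delay, so a read served by the follower in that window returns stale data.

```python
# Toy single-leader replication: the follower applies the leader's log lazily,
# so reads from the follower can lag behind the leader's acknowledged writes.

class Leader:
    def __init__(self):
        self.data = {}
        self.log = []                       # ordered replication log of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))       # followers pick this up later


class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.data = {}
        self.applied = 0                    # index of the next log entry to apply

    def apply_next(self):
        """Apply one pending log entry, if any (simulates lagging replication)."""
        if self.applied < len(self.leader.log):
            key, value = self.leader.log[self.applied]
            self.data[key] = value
            self.applied += 1

    def read(self, key):
        return self.data.get(key)


leader = Leader()
follower = Follower(leader)
leader.write("x", 1)
print(follower.read("x"))   # None: replication lag, the follower hasn't applied the write yet
follower.apply_next()
print(follower.read("x"))   # 1: eventually consistent once the log entry is applied
```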

To mitigate this, several strategies are employed. For instance, to ensure read-after-write consistency (read-your-writes), it is recommended to serve reads from the leader if the data has been recently modified by the user. This prevents the user from seeing stale data after making updates. Additionally, the system can track the timestamp of the user's most recent write and ensure that the replica serving subsequent reads has applied all updates up to that timestamp. If a follower is not sufficiently up to date, the system can wait for it to catch up or route the query to a different replica.
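Below is a minimal, self-contained sketch (names and interfaces are hypothetical) of the timestamp-tracking strategy: the leader's log position serves as the write "timestamp", and a user's read is routed to a follower only if that follower has applied at least up to the user's last write, falling back to the leader otherwise.

```python
# Read-your-writes by tracking each user's last write position and routing reads
# only to replicas that have caught up past that position.

class Replica:
    def __init__(self):
        self.data = {}
        self.applied_pos = 0                # how far into the leader's log this replica has applied


class Leader(Replica):
    def __init__(self):
        super().__init__()
        self.log = []

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value
        self.applied_pos = len(self.log)
        return self.applied_pos             # the log position acts as the write's "timestamp"


last_write_pos = {}                         # user_id -> log position of that user's last write

def user_write(leader, user_id, key, value):
    last_write_pos[user_id] = leader.write(key, value)

def user_read(leader, followers, user_id, key):
    need = last_write_pos.get(user_id, 0)
    fresh = [f for f in followers if f.applied_pos >= need]
    replica = fresh[0] if fresh else leader # no follower is fresh enough: read from the leader
    return replica.data.get(key)


leader = Leader()
stale_follower = Replica()                  # a follower that has not applied anything yet
user_write(leader, "alice", "profile", "v2")
print(user_read(leader, [stale_follower], "alice", "profile"))   # "v2", served by the leader
```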

Synchronous replication alleviates the replication lag problem because it ensures that followers are always in sync with the leader before the leader acknowledges a write. While this approach provides stronger consistency guarantees, it can degrade performance, as the leader must wait for followers to confirm receipt of each change. In practice, a hybrid approach, often called semi-synchronous replication, is commonly used. In this setup, one follower is kept synchronous to ensure availability in case the leader fails, while other followers replicate asynchronously to optimize performance. The quorum approach, used in Paxos, ABD, and their variants, provides more flexibility by not requiring the synchronous replica to be predetermined, allowing it to change between updates. Note that synchronous replication is still prone to the replication lag problem and needs to employ the above strategies to ensure consistency when reading from the replicas. This is because of the inherent asymmetry of view/information in distributed systems: the leader commits/acks first, and the followers then learn of the commit/ack with a delay. Synchronous replication is still very useful for preventing replicas from diverging too far from the leader's state. With pipelined replication, it is still possible to get good throughput from synchronous replication.
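Here is a minimal sketch (hypothetical interfaces, with synchronous calls standing in for network messages) of the semi-synchronous write path: the leader acknowledges a write only after the one designated synchronous follower confirms it, while the remaining followers receive the change without blocking the client's acknowledgment.

```python
# Semi-synchronous replication: one synchronous follower must ack before the
# leader acks the client; the other followers are replicated to asynchronously.

class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []

    def replicate(self, entry):
        self.log.append(entry)
        return True                          # ack back to the leader


class Leader:
    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, entry):
        self.log.append(entry)
        # Block on the synchronous follower before acknowledging the client.
        if not self.sync_follower.replicate(entry):
            raise RuntimeError("synchronous follower unavailable; write not acknowledged")
        # In a real system the asynchronous followers are updated in the
        # background; here we simply send to them after the synchronous ack.
        for f in self.async_followers:
            f.replicate(entry)
        return "ack"


leader = Leader(Follower("sync"), [Follower("a1"), Follower("a2")])
print(leader.write("x = 1"))                 # "ack", issued only after the sync follower has the entry
```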

Asynchronous replication relaxes the tight synchronization requirements and alleviates the write-ack latency penalties associated with synchronous replication. It allows followers to operate a bit more independently of the leader, and it can be suitable for scenarios with heavy read workloads and light write activity, especially in WAN environments. However, asynchronous replication carries the risk of inconsistency, since there is no guarantee of how quickly followers will replicate changes from the leader.

Let's get back to the issue of replication inconsistency corner cases, which arise even when using single-leader synchronous replication. Failover, the process of promoting a follower to leader when the leader fails, introduces additional complexity for consistency. The new leader may not have all the latest writes from the failed leader, leading to potential data loss or inconsistency. Or even worse, in certain fault scenarios (involving lost messages and asynchrony), it could happen that two nodes both believe that they are the leader. This situation is called split brain, and it is dangerous: if both leaders accept writes, and there is no process for resolving conflicts, data is likely to be lost or corrupted. Properly handling these scenarios is hard and involves principled distributed consensus algorithm design. This is why Paxos and later protocols such as Raft are so popular in distributed systems deployments.


How: Implementation of Replication Logs

There are three main log replication methods.

Statement-based replication logs every write statement (e.g., SQL INSERT, UPDATE, DELETE) executed by the leader and sends them to the followers, who re-execute them. While compact, this method is error-prone due to nondeterminism (e.g., a function call such as NOW() or RAND() may evaluate differently at the leader versus at the replicas). Though still used in some cases, it has largely been replaced by more reliable techniques.
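A tiny illustration of that nondeterminism (hypothetical, and in Python rather than SQL): re-executing the "same" write on a follower produces a different row than it did on the leader, because the statement calls nondeterministic functions.

```python
# Statement-based replication pitfall: re-executing a nondeterministic statement
# on a replica yields a different result than on the leader.

import random
import time

def statement():
    # Stands in for something like: INSERT ... VALUES (RAND(), NOW())
    return {"id": random.randint(0, 10**9), "created_at": time.time()}

row_on_leader = statement()      # executed once on the leader
row_on_follower = statement()    # re-executed later on a follower
print(row_on_leader == row_on_follower)   # almost certainly False: the replicas have diverged
```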

Write-ahead log (WAL) shipping involves logging low-level changes to disk blocks (e.g., B-tree page modifications, as in Postgres). The leader sends its WAL to followers, enabling them to replicate the exact disk state changes. This method is tightly coupled to the storage engine, which limits flexibility in running different database versions and often requires downtime during upgrades.

Logical (row-based) log replication decouples replication from the storage engine and logs changes at the row level. Each log record details the inserted, updated, or deleted row data, and a transaction that touches several rows generates several such records. This approach allows leader and follower nodes to run different database versions, facilitating zero-downtime upgrades. It also enables external systems to consume the log for tasks like change data capture. Among many other systems, MySQL's binlog (when configured to use row-based replication) uses this approach.
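Here is a minimal sketch (a hypothetical record format, not MySQL's actual binlog encoding) of what a logical, row-based log might look like, and how a follower could apply it without re-executing the original statements.

```python
# Logical (row-based) replication log: each record describes a row-level change,
# which a follower applies directly to its own copy of the data.

logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 7, "name": "ada"}},
    {"op": "update", "table": "users", "key": {"id": 7}, "changes": {"name": "ada lovelace"}},
    {"op": "delete", "table": "users", "key": {"id": 7}},
]

def apply(tables, record):
    """Apply one logical log record to the follower's in-memory tables."""
    def matches(row, key):
        return all(row.get(k) == v for k, v in key.items())

    rows = tables.setdefault(record["table"], [])
    if record["op"] == "insert":
        rows.append(dict(record["row"]))
    elif record["op"] == "update":
        for row in rows:
            if matches(row, record["key"]):
                row.update(record["changes"])
    elif record["op"] == "delete":
        rows[:] = [r for r in rows if not matches(r, record["key"])]

follower_tables = {}
for rec in logical_log:
    apply(follower_tables, rec)
print(follower_tables)   # {'users': []}: the row was inserted, updated, then deleted
```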


Multi-leader replication


Finally, multi-leader replication (aka master-master or active-active replication) is an alternative to single-leader replication. It allows multiple nodes to accept writes simultaneously. Each node then forwards its changes to the other leaders, which also act as followers. This setup can be useful for geographically distributed systems, but it introduces challenges in resolving write conflicts when concurrent writes occur across multiple leaders. Handling these conflicts requires sophisticated techniques to ensure consistency across nodes.
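As a concrete (and deliberately simplistic) illustration of the conflict problem, here is a hypothetical sketch that resolves concurrent writes with last-writer-wins (LWW) on a timestamp. LWW makes the leaders converge, but it silently discards one of the concurrent writes, which is exactly why more careful conflict handling (or conflict avoidance) is usually needed.

```python
# Multi-leader write conflict resolved with last-writer-wins (LWW): each value
# carries a timestamp, and the merge keeps the value with the larger timestamp.

def merge_lww(local, remote):
    """Merge two replicas' key -> (timestamp, value) maps, keeping the newer value per key."""
    merged = dict(local)
    for key, (ts, value) in remote.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two leaders concurrently accept writes to the same key.
leader_a = {"title": (10, "DDIA Chapter 5")}
leader_b = {"title": (12, "Replication, Part 1")}

# After exchanging changes, both sides converge to the same value...
print(merge_lww(leader_a, leader_b))   # {'title': (12, 'Replication, Part 1')}
print(merge_lww(leader_b, leader_a))   # same result, regardless of merge order
# ...but leader A's write (timestamp 10) is silently lost.
```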


Some related paper summaries from my blog

These two papers provide a good overview of replicated data consistency models/definitions in distributed systems.

The many faces of consistency (2016)

Replicated Data Consistency Explained Through Baseball (2011)

There is also the causal consistency model, which is very useful for read-your-writes. An example is Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS.

I recently reviewed these consistency papers from MongoDB.

Checking Causal Consistency of MongoDB

Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB

Tunable Consistency in MongoDB

Clocks and hybrid logical clocks are becoming more popular in enforcing consistency.

On the use of Clocks to Enforce Consistency in the Cloud


On the asynchronous replication side of things, the "optimistic replication" survey from 2005 provides great coverage of the asynchronous replication literature.

The Probabilistically bounded staleness (PBS) paper deserves special mention, as it added a new perspective for studying/analyzing asynchronous replication.

You can think of this as a practical follow-up to PBS: Measuring and Understanding Consistency at Facebook

And of course CRDTs. 

Conflict-free Replicated Data Types

Keep CALM and CRDT on
