Chain replication for supporting high throughput and availability

Chain replication (2004) is a replication protocol to support large-scale storage services (mostly key-value stores) for achieving high throughput and availability while providing strong consistency guarantees. Chain replication is essentially a more efficient retake of primary-backup replication. The below figure explains it all. I really love the way the protocol is presented as a stepwise-refinement starting from the high-level specificaiton of a storage system. This exposition provides a simple way of justifying the correctness of the protocol (especially for fault-tolerance actions for handling failures of servers), and also is convenient for pointing out to other alternative implementations.

Chain replication protocol There are two types of requests, a query request (read) and an update request (write). The reply for every request is generated and sent by the tail. Each query request is directed to the tail of the chain and processed there atomically using the replica of objID stored at the tail. Each update request is directed to the head of the chain. The request is processed there atomically using replica of objID at the head, then state changes are forwarded along a reliable FIFO link to the next element of the chain (where it is handled and forwarded), and so on until the reply to the request is finally sent to the client by the tail.

Coping with server failuresThe fault model of the servers are fail-stop (crash). With an object replicated on t servers, as many as t−1 of the servers can fail without compromising the object's availability. When a server is detected as failed, the chain is reconfigured to eliminate the failed server. For this purpose a "master" process is used.

Master is actually a replicated process at m nodes; Paxos is used for coordinating the m replicas so they behave in aggregate like a single process that does not fail. (Thus, if we think holistically, chain-replication is not tolerant to t-1 failures as claimed; since chain-replication employs a master implemented in Paxos, the system is tolerant to only m/2 failures. I wish this was explicitly mentioned in the paper to prevent any confusion.)

When the master detects that the head is failed, the master removes head from the chain and makes the successor of the old head as the new head of the chain. Master also informs client about the new head. This action is safe, as it only causes some transient drops of client requests, but that was allowed in the high-level specification of the system. When the master detects that the tail is failed, it removes the tail and makes the predecessor of the old tail as the new tail fo the chain. This actually does not drop any requests at all.

When the master detects the failure of a middle server, it deletes the middle server "S" from the chain and links the predecessor "S-" and successor "S+" of the failed server to re-link the chain. Below figure provides an overview of the process. Message 1 informs S+ of its new role; message 2 acknowledges and informs the master with the sequence number sn of the last update request S+ has received; message 3 informs S- of its new role and of sn so S- can compute the suffix of Sent_S- to send to S+; and message 4 carries that suffix. (To this end, each server i maintains a list Sent_i of update requests that i has forwarded to some successor but that might not have been processed by the tail. The rules for adding and deleting elements on this list are straightforward: Whenever server i forwards an update request r to its successor, server i also appends r to Sent_i. The tail sends an acknowledgement ack(r) to its predecessor when it completes the processing of update request r. And upon receipt ack(r), a server i deletes r from Sent_i and forwards ack(r) to its predecessor.)

As failed servers are removed, the chain gets shorter, and can tolerate fewer failures. So it is important to add a new server to the chain when it gets short. The simplest way to do this is by adding the new server as the tail. (For the tail, the value of Sent_tail is always empty list, so it is easy to initialize the new server to be the tail.)

Comparison to primary-backup replicationWith chain replication, the primary's role in sequencing requests is shared by two replicas. The head sequences update requests; the tail extends that sequence by interleaving query requests. This sharing of responsibility not only partitions the sequencing task but also enables lower-latency and lower-overhead processing for query requests, because only a single server (the tail) is involved in processing a query and that processing is never delayed by activity elsewhere in the chain. Compare that to the primary backup approach, where the primary, before responding to a query, must await acknowledgements from backups for prior updates.

Chain replication is at a disadvantage for reply latency to update requests since it disseminates updates serially, compared to primary-backup which disseminates updates parallelly. This does not have much effect on throughput however. (It is also possible to modify chain replication for parallel wiring of middle-servers, but this will bring some complexity to the fault-handling actions.)

Concluding remarksThe paper also provides evaluation results from an implementation of the chain replication protocol. Chain replication is simple and is a more efficient retake of primary backup replication. Since this is a synchronous replication protocol it falls into CP edge of the CAP theorem triangle. Chain replication has been used in Fawn and also in some other key-value storage systems.

Additional links:
This is genius. With a very simple observation and fix, Terrace and Freedman were able to optimize chain replication further for read-mostly workloads. The below two figures from their paper in 2009 are enough to explain the modification.

Related links:
High-availability distributed logging with BookKeeper


Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The end of a myth: Distributed transactions can scale

Always Measure One Level Deeper

Dude, where's my Emacs?

There is plenty of room at the bottom

Know Yourself