This paper appeared in DISC 2003, and describes an application of the ABD replicated atomic storage algorithm to the replication of large objects. When the objects being replicated are much larger than the associated metadata (such as tags or pointers), it is worthwhile to perform extra, cheaper operations on the metadata in order to avoid expensive operations on the data itself.
The basic idea of the algorithm is to separately store copies of the data objects in replica servers, and information about where the most up-to-date copies are located in directory servers. This Layered Data Replication (LDR) approach adopts the ABD algorithm for atomic fault-tolerant replication of the metadata, and prescribes how the replication of the data objects in the replica servers can accompany replication of the metadata in directory servers in a concurrent and consistent fashion: In order to read the data, a client first reads the directories to find the set of up-to-date replicas, then reads the data from one of the replicas. To write, a client first writes its data to a set of replicas, then informs the directories that these replicas are now up-to-date.
The LDR algorithm replicates a single data object supporting read and write operations, and guarantees that the operations appear to happen atomically. While there exist multiple physical copies of the data, users only see one logical copy, and user operations appear to execute atomically on the logical copy. As such, LDR provides linearizability, a strong form of consistency that guarantees a read operation returns the most recent version of the data. LDR provides single-copy consistency and sits on the CP side of the CAP triangle: availability is sacrificed when a majority of replicas are unreachable.
Client Protocol

When client i does a read, it goes through four phases in order: rdr, rdw, rrr, and rok. The phase names describe what happens during each phase: read-directories-read, read-directories-write, read-replicas-read, and read-ok. During rdr, i reads (utd, tag) from a quorum of directories to find the most up-to-date replicas. i sets its own tag and utd to be the (tag, utd) it read with the highest tag, i.e., timestamp. During rdw, i writes (utd, tag) to a write quorum of directories, so that later reads will read i's tag or higher. During rrr, i reads the value of x from a replica in utd. Since each replica may store several values of x, i tells the replica it wants to read the value of x associated with tag. During rok, i returns the x-value it read in rrr.
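The read path above can be sketched as a minimal in-memory simulation. All class and variable names here are my own, not the paper's; the sketch assumes 2f+1 directories with quorums of size f+1 (so any two quorums intersect) and tags that are (counter, client_id) pairs ordered lexicographically.

```python
# Minimal in-memory sketch of the LDR read path (rdr, rdw, rrr, rok).
# Hypothetical data structures; the real protocol runs over messages.

class Directory:
    def __init__(self, utd, tag):
        self.utd = set(utd)   # ids of the up-to-date replicas
        self.tag = tag        # highest tag this directory has seen

class Replica:
    def __init__(self):
        self.values = {}      # tag -> value; several versions may coexist

def client_read(directories, replicas, quorum_size):
    # rdr: read (utd, tag) from a read quorum of directories and
    # keep the pair with the highest tag.
    quorum = directories[:quorum_size]
    utd, tag = max(((d.utd, d.tag) for d in quorum), key=lambda p: p[1])
    # rdw: write (utd, tag) back to a write quorum, so that later
    # reads see this tag or a higher one.
    for d in directories[:quorum_size]:
        if tag > d.tag:
            d.utd, d.tag = set(utd), tag
    # rrr: ask one up-to-date replica for the value associated with tag.
    value = replicas[next(iter(utd))].values[tag]
    # rok: return the value read in rrr.
    return value
```

Modeling tags as (counter, client_id) tuples lets Python's lexicographic tuple comparison serve as the timestamp order, with the client id breaking ties.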
When i writes a value v, it also goes through four phases in order: wdr, wrw, wdw, and wok. These phase names stand for write-directories-read, write-replicas-write, write-directories-write, and write-ok, respectively. During wdr, i reads (utd, tag) from a quorum of directories, then sets its tag to be higher than the largest tag it read. During wrw, i writes (v, tag) to a set acc of replicas, where |acc| ≥ f + 1. Note that the set acc is arbitrary; it does not have to be a quorum. During wdw, i writes (acc, tag) to a quorum of directories, to indicate that acc is the set of most up-to-date replicas and tag is the highest tag for x. Then i sends each replica a message to tell it that the write is finished, so that the replicas can garbage-collect older values of x. Then i finishes in phase wok.
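The write path can be sketched in the same in-memory style (again, names and data structures are my own; tags are (counter, client_id) pairs, and acc is any f+1 replicas, not necessarily a quorum):

```python
# Minimal in-memory sketch of the LDR write path (wdr, wrw, wdw, wok).
# Hypothetical data structures; the real protocol runs over messages.

class Directory:
    def __init__(self, utd, tag):
        self.utd = set(utd)   # ids of the up-to-date replicas
        self.tag = tag        # highest tag this directory has seen

class Replica:
    def __init__(self):
        self.values = {}      # tag -> value

def client_write(directories, replicas, quorum_size, f, client_id, value):
    # wdr: read tags from a quorum of directories, then pick a tag
    # higher than the largest tag read.
    quorum = directories[:quorum_size]
    counter = max(d.tag[0] for d in quorum)
    tag = (counter + 1, client_id)
    # wrw: write (value, tag) to an arbitrary set acc of f+1 replicas.
    acc = list(replicas)[:f + 1]
    for rid in acc:
        replicas[rid].values[tag] = value
    # wdw: tell a quorum of directories that acc now holds the
    # most up-to-date copies, with the highest tag for x.
    for d in directories[:quorum_size]:
        if tag > d.tag:
            d.utd, d.tag = set(acc), tag
    # wok: notify the replicas in acc that the write finished, so they
    # can garbage-collect older values (modeled crudely here).
    for rid in acc:
        replicas[rid].values = {tag: value}
    return tag
```

The wdw round is what makes the arbitrary choice of acc safe: a later reader learns from the directories which replicas actually hold the newest value, so acc never needs to intersect anything.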
If you have difficulty understanding the need for 2-round directory reads/writes in this protocol, reviewing how the ABD protocol works will help.
Replica and Directory node protocol

The replicas respond to client requests to read and write values of the data object x. Replicas also garbage-collect out-of-date values of x, and gossip among themselves the latest value of x. The latter is an optimization to help spread the latest value of x, so that clients can read from a nearby replica.
The directories' only job is to respond to client requests to read and write utd and tag.
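The replicas' gossip optimization can be sketched as a one-shot exchange round (a hypothetical simplification of mine; the paper's gossip is an ongoing background protocol):

```python
# Hypothetical sketch of one replica gossip round: replicas exchange
# their highest-tagged value of x, so that a nearby replica is likely
# to hold the latest copy when a client reads.

class Replica:
    def __init__(self):
        self.values = {}      # tag -> value

def gossip_round(replicas):
    tags = [t for r in replicas for t in r.values]
    if not tags:
        return                # nothing written yet
    latest = max(tags)        # highest tag wins, as in the client protocol
    value = next(r.values[latest] for r in replicas if latest in r.values)
    for r in replicas:
        r.values.setdefault(latest, value)
```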
Questions and discussion

Google File System (SOSP 2003) addressed efficient replication of large data objects for datacenter computing in practice. GFS also provides a metadata service layer and a data object replication layer. For the metadata directory service, GFS uses Chubby, a Paxos service which ZooKeeper later cloned as open source. Today, if you want to build a consistent large-object replication storage system from scratch, your architecture would most likely use ZooKeeper as the metadata directory coordination service, as GFS prescribed. ZooKeeper already provides atomic consistency, so it eliminates the 2 rounds needed for directory-reads and directory-writes in LDR.
LDR does not use a separate metadata service; instead it can scavenge raw dumb storage nodes for the directory service and achieve the same effect by using ABD replication to make the metadata directory atomic and fault-tolerant. In other words, LDR takes a fully-decentralized approach, and can support loosely-connected heterogeneous wimpy devices (maybe even smartphones?). I guess that means more freedom. On the other hand, LDR is bad for performance. It requires 2 rounds of directory-writes for each write operation and 2 rounds of directory-reads for each read operation. This is a major drawback of LDR. Considering that reads are generally 90% of the workload, supporting 1-round directory-reads would have alleviated the performance problem somewhat. Probably in the normal case (in the absence of failures), the first directory read (the rdr operation) will show that the up-to-date replica copy is present at a quorum of directory nodes, and the second round of directory access (the rdw operation) can be skipped.
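That skip-the-second-round idea might look like the following sketch (my own speculation, not from the paper): if the whole read quorum already agrees on (utd, tag), the rdw round cannot change anything and can be omitted.

```python
# Speculative sketch of a 1-round read fast path: skip rdw when a read
# quorum of directories already agrees on (utd, tag). Names are mine.

class Directory:
    def __init__(self, utd, tag):
        self.utd = frozenset(utd)   # ids of the up-to-date replicas
        self.tag = tag              # highest tag this directory has seen

class Replica:
    def __init__(self):
        self.values = {}            # tag -> value

def fast_read(directories, replicas, quorum_size):
    quorum = directories[:quorum_size]
    pairs = [(d.utd, d.tag) for d in quorum]
    utd, tag = max(pairs, key=lambda p: p[1])
    if all(p == (utd, tag) for p in pairs):
        # Fast path: the quorum agrees, so rdw would be a no-op.
        return replicas[next(iter(utd))].values[tag]
    # Slow path: fall back to the full 2-round rdr/rdw/rrr read (not shown).
    raise NotImplementedError("fall back to the 2-round read")
```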
Using ZooKeeper for the metadata directory helps a lot, but a downside is that ZooKeeper lives in a single centralized location, which means that for some clients, access to ZooKeeper will always incur a high WAN communication penalty. Using ZooKeeper's observers reduces this cost for read operations. And as I will blog about soon, our work on WAN-Keeper reduces this cost for write operations as well. The LDR paper suggests that LDR is suitable for WANs, but LDR still incurs WAN latencies while accessing a quorum of directory nodes (twice!) across the WAN.
Another way to efficiently replicate large data objects is, of course, key-value stores. In key-value stores you don't have a metadata directory, as "hashing" takes care of that. On the other hand, most key-value stores sacrifice strong consistency in favor of eventual consistency. Is it true that you can't get away with just using hashes, and that you need some sort of metadata service if you want to achieve consistency? The consistent key-value stores I can think of (and I can't think of too many) use either a Paxos commit on metadata or at least a chain replication approach, as in Hyperdex and Replex. The chain replication approach uses a Paxos box only for the replication configuration information of the nodes; does that still count as a minimal and 1-level-indirect metadata service?