Replicated/Fault-tolerant atomic storage


The cloud computing community has been showing a lot of love for replicated/fault-tolerant storage these days. Examples of replicated storage at the datacenter level are GFS, Dynamo, and Cassandra, and at the WAN level PNUTS, COPS, and Walter. I was searching for foundational distributed algorithms on this topic, and found this nice tutorial paper on replicated atomic storage: Reconfiguring Replicated Atomic Storage: A Tutorial, M. K. Aguilera, I. Keidar, D. Malkhi, J-P Martin, and A. Shraer, 2010.

Replication provides masking fault-tolerance to crash failures. However, this fault-tolerance is limited/transient unless you reconfigure your storage service to add a new replica to replace the crashed node. It turns out that this on-the-fly reconfiguration of a replicated storage service is subtle and complicated due to the concurrency and fault-tolerance issues involved. This team at MSR @ Silicon Valley has been working on reconfiguration issues in replicated storage for some time. But in this post I am not going to talk about their reconfiguration work; instead I will just focus on the replicated/fault-tolerant atomic storage part.

Majority replication algorithm
There is this elegant algorithm for replicated/fault-tolerant atomic storage that I think every distributed systems researcher/developer should know about. It is simple and powerful. And, it is fun; I promise your brain will feel better about itself after you learn this majority replication algorithm. The algorithm originally appeared in: Sharing memory robustly in message-passing systems, H. Attiya, A. Bar-Noy, and D. Dolev, 1995. Here I will summarize the algorithm based on the discussion provided in the tutorial paper.

The algorithm employs majority replication to provide atomic read/write operations in the presence of crash failures under an asynchronous execution model. (Of course, the FLP result states the impossibility of solving consensus under this model, but this is a weaker problem than solving consensus.) Here, atomic means that the system provides linearizability, a strong type of consistency that guarantees that a read returns the most recent version of data. This single-copy consistency is stronger than Amazon Dynamo's eventual consistency and even GFS's consistency. The algorithm is on the CP side of the CAP triangle; availability is sacrificed when a majority of replicas are unreachable.

Write operation
Each storage node keeps a local copy of what it believes to be the most recent value stored by a client, together with a timestamp indicating the freshness of the value. A vt-pair refers to a pair of (value, timestamp), which a storage node keeps. To execute a write(v) operation, the client proceeds in two phases: it executes a get phase followed by a set phase.

get phase:
vt-set = read vt-pairs from a majority of storage nodes
select unique t' such that t' > max(t in vt-set)

set phase:
send write_request(v, t') to storage nodes
storage nodes store (v, t') only if t' > their stored t
storage nodes send ack
when majority acks are received, return OK

(Uniqueness of t' can be ensured by adjoining the client-id to the timestamp, so that a timestamp consists of a pair with a number and a client-id, ordered lexicographically.)
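To make the two phases concrete, here is a minimal Python sketch of the write path. The Timestamp and Replica classes and the write() function are hypothetical names I am using for illustration, and calling the first majority of replicas synchronously is a simplification: a real client would send requests to all storage nodes asynchronously and proceed as soon as any majority responds.

from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class Timestamp:
    # Ordered lexicographically: counter first, client id as tie-breaker,
    # which makes timestamps chosen by different clients unique.
    counter: int
    client_id: int

class Replica:
    # Hypothetical storage node: keeps a single (value, timestamp) pair.
    def __init__(self):
        self.value = None
        self.ts = Timestamp(0, -1)

    def get_vt(self):
        return self.value, self.ts

    def set_vt(self, value, ts):
        # Adopt the incoming pair only if it is fresher than the stored one.
        if ts > self.ts:
            self.value, self.ts = value, ts
        return "ack"

def write(replicas, client_id, v):
    majority = len(replicas) // 2 + 1
    # get phase: read vt-pairs from a majority of storage nodes.
    vt_set = [r.get_vt() for r in replicas[:majority]]
    max_t = max(t for _, t in vt_set)
    # Choose a unique t' larger than every timestamp seen.
    t_new = Timestamp(max_t.counter + 1, client_id)
    # set phase: store (v, t') at a majority and collect their acks.
    acks = [r.set_vt(v, t_new) for r in replicas[:majority]]
    return "OK" if len(acks) >= majority else None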

Read operation
The read() operation is very similar to the write operation. The client also executes the get and set phases back to back. The only difference is that in the set phase, the client writes back the maximum timestamp vt pair it learns in the get phase.

get phase:
vt-set = read vt-pairs from a majority of storage nodes
select vt' = (v, t') such that t' = max(t in vt-set)

set phase:
send write_request(v, t') to storage nodes
storage nodes store (v, t') only if t' > their stored t
storage nodes send ack
when majority acks are received, return v

The set phase in read() is needed to prevent oscillating reads, in which successive reads oscillate between a new and an old value while a write is in progress (for example, because the writer is slow or some storage nodes have failed), which is a violation of atomicity. The set phase ensures that a subsequent read() will return a value at least as recent as the value returned by the current read(). The key intuition here is that any two majorities of storage nodes always have at least one storage node in common. Therefore if some client stores value v at a majority of storage nodes, then another client is guaranteed to see v (or a newer value) when it queries any majority.
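Continuing with the same hypothetical Replica and Timestamp stubs from the write sketch above, the read path is a get phase followed by a write-back set phase; the write-back is what rules out returning an older value in a later read.

def read(replicas):
    majority = len(replicas) // 2 + 1
    # get phase: collect vt-pairs from a majority and pick the freshest one.
    vt_set = [r.get_vt() for r in replicas[:majority]]
    v, t = max(vt_set, key=lambda vt: vt[1])
    # set phase (write-back): make sure a majority stores at least (v, t),
    # so any later read's majority intersects one of these nodes.
    acks = [r.set_vt(v, t) for r in replicas[:majority]]
    return v if len(acks) >= majority else None

For example, with five Replica objects, write(replicas, client_id=1, v="x") followed by read(replicas) returns "x": the read's majority of three nodes must share at least one node with the write's majority of three, so the read cannot miss the freshest vt-pair.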

Relation to Paxos
The majority replication algorithm seems closely related to the Paxos consensus algorithm. The t in the vt-pair corresponds to the ballot number in Paxos. The get and set phases correspond to phase 1 and phase 2 of Paxos. (Of course, since majority replication is not for consensus, there is nothing corresponding to phase 3, the commit phase, of Paxos.) Finally, the read operation corresponds to the learning operation in Paxos. Now the differences. In majority replication clients do not coordinate for the write operation, whereas in Paxos leaders are constrained to re-propose/rewrite the value with the highest t. Also, to avoid the dueling-leaders problem, Paxos relies on a leader election service so that the system eventually converges to one leader that can safely anchor/finalize a value as the decision value. (The leader election service in Paxos needs some partial-synchrony to make progress, so consensus is achieved only then.) In summary, majority replication is a building block for Paxos consensus.

This relation is explained in more detail in the "Perspectives on the CAP theorem" paper.

Concluding remarks
The nice thing about this elegant algorithm is that it can tolerate/mask the crash of a minority of storage nodes and an arbitrary number of client nodes, and it works in an "asynchronous" system. Since its correctness does not depend on synchrony assumptions, the algorithm is really robust for deployment in distributed systems, especially WAN systems.

Consensus/Paxos-based algorithms can make reconfiguration of the replication service possible. Examples are the RAMBO algorithm, and FAB: Building Distributed Enterprise Disk Arrays from Commodity Components, which provides an implementation of these ideas. But the reconfiguration tutorial paper explains that it is also possible to implement reconfiguration of replication under the asynchronous model (without consensus)!!

Comments

Anonymous said…
It looks like the value is also the key, since "storage nodes store (v, t') only if t' > their stored t". This means nodes do not differentiate between clients, and that the value is interpreted as the key at the same time. I didn't read the paper; am I missing something, or is this too weird to be true?
