Conflict-free Replicated Data Types


Below are my notes summarizing the paper "Conflict-free Replicated Data Types" by Marc Shapiro, Nuno Preguica, Carlos Baquero, and Marek Zawirski. The paper is available here.

Replicated state machines (RSMs) are a basic and important tool for distributed systems. The idea in RSMs is if the replicas start from the same initial state and perform the same updates with the same order, then their end states are the same. The "strong consistency" approach provides this guarantee by serializing updates to the replicas in a global total order. However, the down side to strong consistency approach is that it is a performance & scalability bottleneck, and it also conflicts with availability and partition-tolerance (due to the CAP theorem).

Replicating data under Eventual Consistency (EC) allows any replica to accept updates without remote synchronization. However, published EC approaches are ad-hoc and error-prone (they come without a formal proof of corrrectness). CRDT work tries to address this problem by proposing a simple formally-proven approach to EC, called Strong Eventual Consistency (SEC), which avoids the complexity of conflict-resolution and roll-back.

SEC defines a stronger convergence property than EC. SEC states that correct replicas that have delivered the same updates have equivalent state "immediately", whereas EC states this "eventually" ---since conflict recovery needs to be performed across replicas. (Actually Strong Eventual "Convergence" is a more appropriate and accurate term to describe SEC than Strong Eventual Consistency, because SEC does not really improve the consistency over EC.)

SEC obviates the need for the replicas to coordinate for conflict recovery because it provides conflict-freedom by leveraging on monotonicity in a semi-lattice or commutativity (a trivial example to this is a "set" data type). On the other hand, the downside to SEC and the CRDT approach is that the applicability is limited to simple locally verifiable invariants. In other words, conflict-freedom is traded off with meaningful & useful invariants that span across replicas: When a replica is at a particular state, it is hard to state any predicate about the state of the other replicas. (Note that, in strong consistency---achieved using Paxos for example--- you could state that the state of all replicas are equal. CRDT is PAEL and Paxos is PCEC.)

In the rest of the paper, two sufficient conditions are presented for SEC ---a state-based implementation of CRDT and an operation-based implementation of CRDT--- and a strong equivalence between the two conditions is proven.

State-based replication

Executing an update modifies the state of a single replica. Using gossip protocols, every replica occasionally sends its local state to some other replica, which merges this state into its own state. Every update eventually reaches every replica, either directly or indirectly. (For efficiency, the metadata of the object to be replicated may be gossiped first.)

A semilattice is a partial order equipped with a least upper bound (LUB) for all pairs. LUB is commutative, idempotent, and associative. Monotonic semi-lattice is where (i) The state set S forms a semilattice ordered by . (ii) Merging state s with remote state s' computes the LUB of the two states. (iii) State is monotonically non-decreasing across updates, i.e., s s + u.

Theorem 1 (Convergent Replicated Data Type (CvRDT)). Assuming eventual delivery and termination, any state-based object that satisfies the monotonic semilattice property is SEC.

At this point it is important to note that monotonic semilattice property is not a simple property, because providing a LUB for all state pairs take some pains. For instance, a simple diamond tiling lattice does not satisfy this property.

Operation-based replication

Alternative to the state-based style, a replicated object may be specified in the operation-based style. An op-based object has no merge method; instead an update is split into a pair (t, u), where t is a side-effect-free prepare-update method and u is an effect-update method. The effect-update method executes at all downstream replicas. A sufficient condition for convergence of an op-based object is that "all" its concurrent operations commute. An object satisfying this condition is called a Commutative Replicated Data Type (CmRDT).

Theorem 2 (Commutative Replicated Data Type (CmRDT)). Assuming causal and exactly-once delivery of updates and method termination, any op-based object that satisfies the commutativity property for all concurrent updates, and whose delivery precondition is satisfied by causal delivery, is SEC.

CvRDTs and CmRDTs are equivalent

The equivalence means, we can rewrite a CvRDT as a CmRDT, and vice a versa, with some effort. The proofs in the paper essentially show that this can be done, albeit in an inefficient manner. To give some intuition about this equivalence, let's compare a CmRDT counter implementation versus a CvRDT counter implementation below.

A CmRDT counter is very simple to implement, when you get a new increment operation delivered just increment your counter by 1. If we unequivocally identify every operation and the delivery pre-condition guarantees that all operations are delivered and executed exactly once through the causally-ordered broadcast middleware, replicas will converge regardless of whatever order operations are applied. We have SC.

A CvRDT monotonic counter is not that trivial to implement: Consider a counter that maintains the number of increments. To merge the value (i.e., state) of two replicas with value 1, we could sum the values, or select the maximum value of the merging replicas, but neither would return the correct value in every case: If the replicas had concurrent updates, then the value should be 2. However, if the two replicas are readily synchronized (through gossip), the final value should be 1. This make it impossible to use either max or sum as the merge procedure. Instead, the correct solution is inspired by vector clocks: Store the number of increments for each replica indexed by position in a vector. The query operation retrieves the sum of every vector position and the merge procedure selects the maximum value for each index in the vector.

So what happens is this. In order to avoid state replication, CmRDT assumes an underlying reliable causally-ordered broadcast communication middleware; one that delivers every message to every recipient exactly once and in an order consistent with happened-before. This is actually hiding state under the rug, because often (for most examples) the causally-ordered broadcast communication middleware requires the replicas to keep version vectors, and to buffer/wait operations before delivering (as long as it takes) to ensure that all the operations that causally-precede these operations are delivered first.

Concluding remarks

CRDT provides a limited remedy to CAP theorem. A CRDT replica is always available for both reads and writes, independently of network conditions. Any communicating subset of replicas of a SEC object eventually converges, even if partitioned from the rest of the network. This is not in any way at odds with the CAP theorem, because CRDT weakens consistency to provide more availability. The consistency is weakened so much that it is not possible to provide an invariant that spans acrosss replicas. To address the useful invariants problem, future work prescribes investigating stronger global invariants (using probabilistic or heuristic techniques), and supporting infrequent, non-critical synchronous operations (such as committing a state or performing a global reset).

Comments

Popular posts from this blog

The end of a myth: Distributed transactions can scale

Hints for Distributed Systems Design

Foundational distributed systems papers

Learning about distributed systems: where to start?

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The demise of coding is greatly exaggerated

Dude, where's my Emacs?

There is plenty of room at the bottom