Optimistic Replication

This 2005 paper by Saito and Shapiro provides a comprehensive survey of the optimistic replication area and is a must-read for distributed services designers. Below, I am providing some snippets from the paper to give a brief overview.

Data replication improves both availability and performance for distributed services. Availability is improved by allowing access to the data even when some of the replicas are unavailable. Performance improvements concern reduced latency (by enabling local access from replica instead of remote access) and increased throughput (by letting multiple computers serve the data).

Pessimistic techniques block access to a replica unless it is provably up to date. Such pessimistic techniques perform well in local-area networks, in which latencies are small and failures uncommon, but is unsuitable for wide-area networks, because the Internet remains slow and unreliable. Moreover, pessimistic algorithms scale poorly, because its throughput and availability suffer as the number of sites increases.

Compared to traditional "pessimistic" techniques, optimistic replication promises higher availability and performance, but lets replicas temporarily diverge and lets users see inconsistent data. In optimistic techniques,
updates are propagated in the background, and occasional conflicts are fixed after they happen. Optimistic algorithms should be able to scale to a large number of replicas, because they require little synchronization among sites.

These benefits, however, come at a cost. CAP theorem states that any distributed system faces a trade-off between availability and consistency. Where a pessimistic algorithm waits, an optimistic one speculates. Thus, optimistic replication faces the challenges of diverging replicas and conflicts between concurrent operations. So it is applicable only for applications that can tolerate occasional conflicts and inconsistent data. Fortunately, in many real-world systems, especially file systems, conflicts are known to be rather rare. Some example applications of optimistic replication are DNS, USENET, PDAs, CVS, Amazon Dynamo.

When describing algorithms, it is useful to distinguish sites that can update an object --called master sites -- from those that store read-only replicas.

To update an object, a user submits an operation at some site. An operation submitted by the user of a replica is tentatively applied to the local replica to let the user continue working based on that update. It is also logged, i.e., remembered in order to be propagated to other sites later. These systems often deploy epidemic propagation to let all sites receive operations, even when they cannot communicate with each other directly. Because of background propagation, operations are not always received in the same order at all sites. Each site must reconstruct an appropriate ordering that produces an equivalent result across sites. In addition, with no a priori site coordination, multiple users may update the same object at the same time, so there is a need for detecting operations that are in conflict and resolve them. Commitment refers to an algorithm to converge the state of replicas by letting sites agree on the set of operations and their final ordering and conflict-resolution results.

Optimistic Replication: Design Choices
We classify optimistic replication systems along the following axes: (The rest of the paper investigates these design choices in detail)
  • Number of writers: single-master vs. multi-master: Single-master systems designate one replica as the master. All updates originate at the master and then are propagated to other replicas, or slaves. They are simple but have limited availability, especially when the system is write-intensive. Multi-master systems let updates be submitted at multiple replicas independently and exchange them in the background. They are more available but significantly more complex. In particular, operation scheduling and conflict management are issues unique to these systems. According to Gray et al. [1996], a naive multi-master system would encounter concurrent updates at the rate of O(M^2), assuming that each master submits operations at a constant rate. (M is the number of masters.)
  • Definition of operations: state transfer vs. operation transfer. State-transfer systems limit an operation either to read or to overwrite an entire object. State transfer is simple, because maintaining consistency only involves sending the newest replica contents to other replicas. Operation-transfer systems must maintain either a history of operations and have replicas agree on the set of applied operations and their order. On the other hand, they can be more efficient, especially when objects are large and operations are high level. Operation transfer also allow for more flexible conflict resolution.
  • Scheduling: syntactic vs. semantic. The goal of scheduling is to order operations in a way expected by users, and to make replicas produce equivalent states. Scheduling policies can be classified into syntactic and semantic policies. Syntactic methods are simple but may cause unnecessary conflicts. The most popular example is ordering operations by timestamp. Semantic scheduling, on the other hand, exploits semantic properties such as commutativity of idempotency of operations. Semantic scheduling increases scheduling flexibility and reduces conflict, but at the cost of application dependence and complexity.
  • Handling conflicts. The best approach is to prevent conflicts from happening altogether. Pessimistic algorithms prevent conflicts by blocking or aborting operation as necessary. Single-master systems avoid conflicts by accepting updates only at one site (but allow reads to happen anywhere). These approaches, however, come at the cost of lower availability. Conflicts can also be reduced, for example, by quickening propagation or by dividing objects into smaller independent units.
  • Propagation strategies and topologies. Local operations must be transmitted and re-executed at remote sites. In general, the quicker the propagation, the less the degree of replica inconsistency and the rate of conflict, but more the complexity and overhead, especially when the application receives many updates relative to read requests.
  • Consistency guarantees. In any optimistic replication system, the states of replicas may diverge somewhat. A consistency guarantee specifies whether a client application may observe divergence, and how much.


Keep it simple. Traditional, pessimistic replication, with many off-the-shelf solutions, is perfectly adequate in small-scale, fully connected, reliable networking environments. Where pessimistic techniques are the cause of poor performance or lack of availability, or do not scale well, try single-master replication: it is simple, conflict-free, and scales well in practice. State transfer using Thomas's write rule (last writer wins) works well for many applications. Advanced techniques such as version vectors and operation transfer should be used only when you need flexibility and semantically rich conflict resolution.

Propagate operations quickly to avoid conflicts. While connected, propagate often and keep replicas in close synchronization. This will minimize divergence when disconnection does occur.

Exploit commutativity. Commutativity should be the default; design your system so that non-commutative operations are the uncommon case. For instance, whenever possible, partition data into small, independent objects. Within an object, use monotonic data structures such as an append-only log, a monotonically increasing counter, or a union-only set.


Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)