Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary

Appeared in OSDI '12, Cheng Li, Daniel Porto, Allen Clement, Johannes Gehrke, Nuno Preguica, and Rodrigo Rodrigues.

In order to reduce latencies to geographically distributed users, big webservices companies (Google, Yahoo, Facebook, Twitter) replicate data across geographical regions. But replication across datacenters, create cross-site consistency problems, which is further complicated with the huge WAN latency delays. If you want to have strong consistent updates across sites, you have to pay the price in terms of latency (basically you revert to doing synchronous replication via say Paxos as in MDCC). This is the ELC part in the PACELC.

To alleviate this latency versus consistency tension, this paper proposes RedBlue consistency, which enables blue operations to be fast/asynchronous (and eventually consistent) while the remaining red operations are strongly-consistent/synchronous (and slow). So a program is partitioned into red and blue operations, which run with different consistency levels. While red operations must be executed in the same order at all sites (which make them slow), the order of execution of blue operations can vary from site to site (allowing them to be executed without requiring coordination across sites). "In systems where every operation is labeled red, RedBlue consistency is equivalent to serializability; in systems where every operation is labeled blue, RedBlue consistency allows the same set of behaviors as eventual consistency."

To facilitate this red blue partioning, each program operation u is split into two components: a generator operation g_u with no side effects, which is executed only at the primary site against some system state S, and produces a shadow operation h_u(S), which is executed at every site (including the primary site).

This is simply separation of concerns principle. The generator operation decides which state transitions should be made while the shadow operation applies the transitions in a state-independent manner. This separation leads to a fine-grained classification of operations, with potentially more execution paths leading to blue operations. Also this leads to a simple logic for shadow operations that can be based on operations that are intrinsically commutative (increment/decrement, insertion/removal) or that commute via last-writer-wins strategy.

Red Blue rules

Now here are the rules for labelling the [shadow] operations as red or blue.
  1. For any pair of non-commutative shadow operations u and v, label both u and v red.
  2. For any shadow operation u that may result in an invariant being violated,  label u red.
  3. Label all non-red shadow operations blue.
This is unfortunately not an automated process. Developer has to manually partition the operations to generator and shadow operations, and after that mark them as red and blue manually, by following the above rules. As such, this is an error-prone process and is hard to scale.

The below is the description from the paper of how to keep track of the order and consistency of red/blue operations:
When a generator operation completes, the coordinator must determine if the operation reads a coherent system snapshot and obeys the ordering constraints of a causal serialization. To do this, the coordinator checks the timestamps of the data items read and written by the completing operation, and compares them to the timestamps associated with operations completing concurrently and the remote shadow operations that were being applied simultaneously at that site. Upon successful completion of a generator operation the coordinator assigns the corresponding shadow operation a timestamp that is component-wise equal to the latest operation that was incorporated at its site, and increments its blue and, if this shadow operations is red, the red component of the logical timestamp. This timestamp determines the position of the shadow operation in the RedBlue order, with the normal rules that determine that two operations are partially ordered if one is equal to or dominates the other in all components.
The paper reports an evaluation of this idea by modifying the TPC-W and RUBiS benchmarks on an online social network case study (Twitter is the Hello World of georeplication :-). The experimental results show that RedBlue consistency provides substantial performance gains without sacrificing consistency.


Although the paper does not mention it, a very relevant work to this is the CALM conjecture and the Bloom project from UC Berkeley. The CALM principle says that (1) logically monotonic distributed code is eventually consistent without any need for coordination protocols (distributed locks, two-phase commit, paxos, etc.) and (2) eventual consistency can be guaranteed in any program by protecting non-monotonic statements ("points of order") with coordination protocols. It is easy to see that the logically monotonic operations correspond to the blue operations, and non-monotonic operations correspond to red operations in the RedBlue work.

After the OSDI presentation of the paper, there were a couple of concerns raised about the approach. Mike Freedman (Princeton) asked the question: "blue cannot overtake the red, so you cannot vary their order, doesn't this degrade performance significantly?". Marcos Aguilera (HP Research) commented that similar approaches have been around; he referred to the generic broadcast work, and the later and more general Generalized Paxos work. The Generalized Paxos work seems to be very related indeed, and I am not sure what in the RedBlue work constitute the major differences. Maybe the RedBlue work provides a case study and more principled approach to identify commutable actions in Generalized Paxos. Another shortcoming of the RedBlue work is that it does not have any fault-tolerance build in. It may be hard to add fault-tolerance as an after thought, so maybe it is best to think of this work in the Generalized Paxos framework.


Unknown said…
A minor correction: Marcos Aguilera is from Microsoft Research Silicon Valley
Anonymous said…
Very nice summary of the work. Thank you.

The commonality between the RedBlue work and Generalized Paxos is that both approaches lead to partial orders of operations, and that partially ordered operations must commute with each other.

The key difference between the two proposals is that Generalized Paxos allows for an ex post decision on the final order, while RedBlue requires this decision to be made ex ante.

Put another way, under Generalized Paxos the replicas must coordinate on the set of operations before the order can be determined; under RedBlue the order of Blue operations is determined without any coordination at all.

Popular posts from this blog

Graviton2 and Graviton3

Foundational distributed systems papers

Learning a technical subject

SQLite: Past, Present, and Future

Learning about distributed systems: where to start?

Strict-serializability, but at what cost, for what purpose?

CockroachDB: The Resilient Geo-Distributed SQL Database

The Seattle Report on Database Research (2022)

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Warp: Lightweight Multi-Key Transactions for Key-Value Stores