MDCC: Multi-Data Center Consistency


In order to reduce latencies to geographically distributed users, Google, Yahoo, Facebook, Twitter, etc., replicate data across geographical regions. But replication across datacenters, over the WAN, is expensive. WAN delays are in the hundreds of milliseconds and vary significantly, so the common wisdom is that synchronous wide-area replication is infeasible, and that strong consistency should therefore be relaxed. (See the COPS paper for an example.)

In the MDCC work (Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, March 2012), the authors describe an "optimistic commit protocol, that does not require a master or partitioning, and is strongly consistent at a cost similar to eventually consistent protocols". Optimistic and strongly consistent make an unlikely couple. The way MDCC achieves this is by starting with Paxos, which is a strongly consistent protocol, and then layering optimizations on top of Paxos to obtain an optimistic commit protocol that does not require a master and that is strongly consistent at very low cost.

Paxos optimized

Classic Paxos is a 2-phase protocol. In Phase 1, a master M tries to establish mastership for an update to a specific record r. In Phase 2, the master M tries to get a value accepted: M requests the storage nodes to accept the update to r (using the ballot number M established in the first phase), and waits for ACKs from a majority of the nodes.
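To make the two phases concrete, here is a minimal Python sketch of a single classic Paxos round from the master's side, with a toy in-memory Acceptor standing in for a storage node. The class and method names are illustrative assumptions, not MDCC's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Promise:
    ok: bool
    accepted_ballot: int = -1
    accepted_value: object = None

class Acceptor:
    """In-memory stand-in for a storage node (illustrative only)."""
    def __init__(self):
        self.promised = -1
        self.accepted_ballot = -1
        self.accepted_value = None

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return Promise(True, self.accepted_ballot, self.accepted_value)
        return Promise(False)

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot, self.accepted_value = ballot, value
            return True
        return False

def classic_paxos_round(storage_nodes, ballot, value):
    majority = len(storage_nodes) // 2 + 1

    # Phase 1: establish mastership for this ballot.
    granted = [p for p in (n.prepare(ballot) for n in storage_nodes) if p.ok]
    if len(granted) < majority:
        return False  # a higher ballot is out there; retry with a larger one

    # If some node already accepted a value, we must re-propose that value.
    prior = max((p for p in granted if p.accepted_ballot >= 0),
                key=lambda p: p.accepted_ballot, default=None)
    if prior is not None:
        value = prior.accepted_value

    # Phase 2: ask the nodes to accept the value under this ballot.
    acks = sum(1 for n in storage_nodes if n.accept(ballot, value))
    return acks >= majority

nodes = [Acceptor() for _ in range(5)]
print(classic_paxos_round(nodes, ballot=1, value="r := 42"))  # True
```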

The Multi-Paxos optimization employs the lease idea to avoid the need for Phase 1 in fault-free executions (timely communication and a master that does not crash). The master takes a lease from the other nodes to remain master for many seconds, so there is no need to go through Phase 1 for each update. (This is still fault-tolerant: if the master fails, the other nodes need to wait until the lease expires, but then they are free to choose a new master as per the classic Paxos protocol.)

The Fast Paxos optimization is more complex. In Fast Paxos, a node that wants to commit a value first tries without going through the master (saving a message round, potentially over the WAN), communicating directly and optimistically with the other nodes. If a fast quorum of nodes (more than the majority needed in a classic quorum) reply, the fast round has succeeded. However, since updates in fast rounds are not issued by the master, collisions may occur. When a collision is detected, the node falls back to the master, which resolves the situation with a classic round.
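As a rough illustration of the fast-round tradeoff, here is a small sketch comparing classic and fast quorum sizes, assuming the common ceil(3n/4) fast-quorum choice from the Fast Paxos literature (the paper's exact quorum arithmetic may differ):

```python
import math

def classic_quorum(n):
    return n // 2 + 1            # simple majority

def fast_quorum(n):
    return math.ceil(3 * n / 4)  # the usual Fast Paxos fast-quorum size

for n in (3, 5, 7):
    print(n, classic_quorum(n), fast_quorum(n))

# With 5 data centers: classic quorum = 3, fast quorum = 4. A fast round
# needs one extra ACK, but it skips the message round through the master.
```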

Finally, building over these earlier optimizations, the Generalized Paxos optimization (a superset of Fast Paxos) makes the additional observation that some collisions (different orderings of operations) are not really conflicts, because the colliding operations commute, or their order is not important. So this optimization does not force a fallback to classic Paxos in those cases. MDCC uses Generalized Paxos, and as a result MDCC can commit transactions in a single round-trip across data centers in the normal operation case.
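A toy sketch of this idea, with made-up operation kinds: two updates conflict only if they touch the same record and at least one of them does not commute.

```python
# Operation kinds that commute with each other (illustrative, not MDCC's set).
COMMUTATIVE_OPS = {"increment", "decrement", "add_to_set"}

def conflicts(op1, op2):
    if op1["record"] != op2["record"]:
        return False  # operations on different records always commute
    return not (op1["kind"] in COMMUTATIVE_OPS and
                op2["kind"] in COMMUTATIVE_OPS)

a = {"record": "stock:item42", "kind": "decrement"}
b = {"record": "stock:item42", "kind": "decrement"}
c = {"record": "stock:item42", "kind": "overwrite"}

print(conflicts(a, b))  # False: two decrements commute; no classic round needed
print(conflicts(a, c))  # True: an overwrite does not commute with a decrement
```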

The good thing about using Paxos over the WAN is that you /almost/ get all three CAP properties: consistency, availability, and partition tolerance. As we discussed earlier (when Paxos was taught), Paxos is CP; that is, in the presence of a partition, Paxos chooses consistency over availability. But Paxos can still provide availability if there is a majority partition. Now, over a WAN, what are the chances of a partition that leaves no majority? The WAN has a lot of redundancy. While it is possible for a data center to be partitioned off the Internet due to a calamity, what are the chances of several being knocked off at the same time? So availability is also looking good for the MDCC protocol using Paxos over the WAN.

Reads 

MDCC provides read-committed consistency: reads can be done from any (local) storage node and are guaranteed to return only committed records. However, a read from a single node might be stale, since that node may have missed some commits. Reading the latest value requires reading a majority of storage nodes to determine the latest stable version, which makes such reads expensive. MDCC cites Megastore and adopts similar techniques to provide local reads to the nodes. The idea is basically: if my datacenter is up-to-date, do a local read; otherwise go through the master to do a local read. (We discussed these techniques in the Megastore post earlier.)
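Here is a minimal sketch of that read-path decision; is_up_to_date and catch_up are hypothetical helpers standing in for the Megastore-style bookkeeping, not actual MDCC calls.

```python
def read(key, local_node, master):
    if local_node.is_up_to_date(key):
        return local_node.get(key)  # common case: one cheap local read

    # The local replica missed some commits: go through the master to bring
    # it up to date, after which the read can still be served locally.
    master.catch_up(local_node, key)
    return local_node.get(key)
```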

MDCC details

Now, while the optimistic and strongly-consistent protocol is nice, one may argue that there is not much novel (at least academically) in that part of the paper; MDCC basically puts together known optimizations over Paxos to achieve its optimistic and strongly-consistent replication protocol. The claim to novelty in the MDCC paper comes from the proposed new programming model, which empowers the application developer to handle the longer and unpredictable latencies caused by inter-data-center communication. The model allows developers to specify callbacks that are executed during the different phases of a transaction. MDCC's transaction programming model provides a clean and simple way for developers to implement user-facing transactions with potentially wildly varying latencies, which occur when replicating across data centers. The paper illustrates this with an example transaction for checking out from a web store; a sketch in the same spirit follows below.
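The sketch below imitates that callback style in Python. The transaction handle and callback names (on_local_accept, on_commit, on_abort) are invented for illustration and are not MDCC's actual API; the point is that the app gives the user immediate feedback while the cross-datacenter commit completes asynchronously.

```python
def checkout(tx, cart, user, show):
    # Stage the updates: decrement stock for each item, record the order.
    for item in cart:
        tx.update("stock:" + item.id, lambda s: s - 1)
    tx.insert("order:" + user.id, cart)

    # Callbacks fire as the transaction moves through its phases.
    tx.on_local_accept(lambda: show("Processing your order..."))
    tx.on_commit(lambda: show("Order confirmed!"))
    tx.on_abort(lambda: show("Checkout failed, please try again."))

    tx.execute_async()  # returns immediately; the WAN commit runs in the background
```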

Evaluation of MDCC is done using the TPC-W benchmark with MDCC deployed across 5 geographically diverse data centers. The evaluation shows that "MDCC is able to achieve throughput and latency similar to eventually consistent quorum protocols and that MDCC is able to sustain a data center outage without a significant impact on response times while guaranteeing strong consistency."

Conclusion 

This is a nice and useful paper because it tells a simple, unified story about using Paxos (more accurately, Generalized Paxos) for executing transactions across datacenters. The paper also provides a very nice summary of Paxos and the optimizations to Paxos, as well as a case study where the usefulness of these optimizations is demonstrated. So even if you are not interested in the details and quantitative measurements of the actual multi-datacenter replication problem, the paper is still worth a read in this respect.
