MDCC: Multi-Data Center Consistency

In order to reduce latencies to geographically distributed users, Google, Yahoo, Facebook, Twitter, etc., replicate data across geographical regions. But replication across datacenters, over the WAN, is expensive. WAN delays are in the hundreds of milliseconds and vary significantly, so common wisdom is that synchronous wide-area replication is unfeasible, which means strong consistency should be diluted/relaxed. (See COPS paper for an example.)

In the MDCC work (Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, March 2012), the authors describe an "optimistic commit protocol, that does not require a master or partitioning, and is strongly consistent at a cost similar to eventually consistent protocols". Optimistic and strongly-consistent is an odd/unlikely couple. The way MDCC achieves this is by starting with Paxos, which is a strongly consistent protocol. MDCC then adds optimizations to Paxos to obtain an optimistic commit protocol that does not require a master and that is strongly consistent at a very low-cost.

Paxos optimized

Classic Paxos is a 2-phase protocol. In Phase 1, a master M tries to establish the mastership for an update for a specific record r. Phase 2 tries to accept a value: the master M requests the storage nodes to store r next (using the ballot number M established in the first phase), and waits for a majority number of ACKs back.

Multi-Paxos optimization employs the lease idea to avoid the need for phase 1 in the fault-free executions (timely communication and master does not crash). The master puts a lease on other nodes on being a master for many seconds, so there is no need to go through Phase 1. (This is still fault-tolerant. If the master fails, other nodes need to wait until the lease expires, but then they are free to chose a new master as per classic Paxos protocol.) Fast-Paxos optimization is more complex. In fast-paxos a node that wants to commit a value first try without going through the master (saving a message round, potentially over the WAN), directly communicating with the other nodes optimistically. If a fast-quorum number of nodes (more than majority number of nodes needed in classic quorum) reply, the fast round has succeeded. However, in such fast rounds, since updates are not issued by the master, collisions may occur. When a collision is detected, the node then goes through the master which resolves the situation with a classic round.

Finally building over these earlier optimizations, the Generalized-Paxos optimization (which is a superset of Fast-Paxos) makes the additional observation that some collisions (different orderings of operations) are really not conflicts, as the colliding operations commute, or the order is not important. So this optimization does not enforce reverting to classic Paxos for those cases. MDCC uses Generalized Paxos, and as a result, MDCC can commit transactions in a single round-trip across data centers in the normal operation case.

The good thing about using Paxos over the WAN is you /almost/ get the full CAP  (all three properties: consistency, availability, and partition-freedom). As we discussed earlier (Paxos taught), Paxos is CP, that is, in the presence of a partition, Paxos keeps consistency over availability. But, Paxos can still provide availability if there is a majority partition. Now, over a WAN, what are the chances of having a partition that does not leave a majority? WAN has a lot of redundancy. While it is possible to have a data center partitioned off the Internet due to a calamity, what are the chances of several knocked off at the same time. So, availability is also looking good for MDCC protocol using Paxos over WAN.


MDCC provides read-committed consistency: Reads can be done from any (local) storage node and are guaranteed to return only committed records. However, by just reading from a single node, the read might be stale. That node may have missed some commits. Reading the latest value requires reading a majority of storage nodes to determine the latest stable version, making reads an expensive operation. MDCC cites Megastore and adopts similar techniques to MegaStore in trying to provide local reads to the nodes. The idea is basically: If my datacenter is up-to-date do a local read, otherwise go through the master to do a local read. (We had discussed these techniques in Megastore earlier.)

MDCC details

Now, while the optimistic and strongly-consistent protocol is nice, one may argue that there has not been much novel (at least academically) in that part of the paper; MDCC basically puts together known optimizations over Paxos to achieve that optimistic and strongly-consistent replication protocol. The claim to novelty in the MDCC paper comes from their proposed new programming model which empowers the application developer to handle longer and unpredictable latencies caused by inter-data center communication. The model allows developers to specify certain callbacks that are executed depending on the different phases of a transaction. MDCC’s transaction programming model provides a clean and simple way for developers to implement user-facing transactions with potentially wildly varying latencies, which occur when replicating across data centers. Figure below shows an example of a transaction for checking out from a web store.

Evaluation of MDCC is done using the TPC-W benchmark with MDCC deployed across 5 geographically diverse data centers. The evaluation shows that "MDCC is able to achieve throughput and latency similar to eventually consistent quorum protocols and that MDCC is able to sustain a data center outage without a significant impact on response times while guaranteeing strong consistency."


This is a nice and useful paper because it tells a simpler united story about using Paxos (more accurately, Generalized Paxos) for executing transactions across datacenters. This paper also provides a very nice summary of the Paxos and optimizations to Paxos, as well as providing a case study where the usefulness of these optimizations are presented. So even if you are not interested with the details and quantitative measurements about the actual multi datacenter replication problem, the paper is still worth a read in this respect.


Unknown said…
Very interesting perspective on the optimization of Paxos. Curious to know if you have researched any IT asset management software products to determine which is the best overall for data center management?
tiger said…
A new Service-level-objective aware programming model which empowers the developer with more information about the transaction, Raised Floors in Raleigh
Cecilia said…
Thanks for sharing! We recently revamped our data center and got some colocation services so it would be more tailored to our needs. It worked out really well! Great information, thanks again!
Netrack said…
NetRack is worldwide manufactures of Data Centre Solutions like Racks, Data center Racks, Data Center Solutions, Data center management in Chennai, Kerala, Hyderabad, Mumbai, Ahmedabad, Delhi, Kolkata, Indore and Gowhathi.
Unknown said…
Great post. I have been looking up data center solutions lately trying to get a better understanding. This was very informational and helpful, thanks so much for sharing.
Unknown said…
This comment has been removed by the author.
Unknown said…
This is such a great post. I have been doing research on data center design for our company. We have been looking to update our systems. This post was very helpful. Thanks so much for sharing.
Netrack said…
Present day data centers require Data center solutions
John Michle said…
Its really good to know about that some facts and other points given here are quite considerable and to the point as well, would be better to look for more of these kind for efficient results.

Hvac Service Management Software

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)