Thursday, February 24, 2011

Volley: Automated Data Placement for Geo-Distributed Cloud Services

Datacenters today are distributed across the globe, yet they need to share data with other datacenters as well as their clients. This paper from Microsoft Research presents a heuristic strategy for data placement to these geo-distributed datacenters. While there has been previous work on data placement in LANs and WSNs, Volley is the first heuristic for data placement strategies for WANs.

A simple heuristic is to place each data to the datacenter closest to the client of that data. But things are not that simple, there are several additional constraints to be considered, including business constraints, WAN bandwidth costs, datacenter capacity limits, data interdependencies, user-perceived latency, etc. For example, it makes more sense to collocate data that are tightly-coupled/interdependent, such as two friends in Facebook that update each other walls. As another example, the frequency of the clients accessing the data needs to be taken in to account as well. As live mesh and live messenger traces show, there is significant data sharing across clients (Figure 5), there can be significant benefits to placing data closest to those who use it most heavily, rather than just placing it close to some particular client that accesses the data. Finally, the live mesh and messenger traces also show that a significant portion of the clients move and change locations (Figure 7), so the ideal placement of data needs to be changed adaptively as well.
Volley takes as input the request logs for data in the system, analyzes them, and outputs the results on where to best place the data. Volley is not concerned with the actual migration, that should be handled by other applications.

Volley is an iterative algorithm. In phase 1, Volley computes an initial placement based on client IPs. In phase 2, Volley iteratively moves data to reduce latency. Finally in phase 3, Volley iteratively map data items into datacenters taking into account the datacenter capacities. In order

The live mesh and live messenger traces are used to evaluate Volley via emulations. In these emulations, 12 DCs are assumed. Volley is compared with the commonIP protocol (which places data as close as possible to the IP address that most commonly accesses it), hashing protocol (which randomly places data to one the 12 DCs to optimize for load-balancing), and oneDataCenter (which places all the data in one datacenter).

Among these protocolsVolley is the one with lowest latency, and hash is unsurprisingly the one with the highest latency. Hash protocol leads to a lot of inter-datacenter traffic as it frequently places interdependent data in different datacenters. Volley has the least inter-datacenter traffic, of course excluding the oneDC protocol which obviously has no inter-datacenter traffic.

The evaluations show that Volley converged after a small number of iterations, and reduced skew by 2x, inter-datacenter traffic by 1.8x, and latency by 30%. The paper does not make any optimality claims for Volley, as it just uses heuristics. The real contribution of Volley is reported as automating the data placement process.


Anonymous said...

Can you please post some exercises (like in a textbook) along with each article review?

Murat said...

That's not a bad idea. I will give it a try in the next reviews. Thanks.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...