A distributed systems research agenda for cloud computing

- August 11, 2015

Distributed systems (a.k.a. distributed algorithms) is an old field of almost 40 years old. It gave us impossibility proofs on the theory side, and also algorithms like Paxos, logical/vector clocks, 2/3-phase commit, leader election, dining philosophers, graph coloring, spanning tree construction which are adopted in practice widely. Cloud computing is a relatively new field in contrast. It provides new opportunities as well as new challenges for the distributed systems/algorithms area. Below I briefly discuss some of these opportunities and challenges.

Opportunities

Cloud computing provides abundance. Nodes are replaceable, even hot swappable. You can dedicate several nodes for running customized support services, such as monitoring, logging, storage/recovery service. These opportunities are likely to have impact on how fault-tolerance is considered in distributed systems/algorithms work.

Programmatic interfaces and service-oriented architecture are also hallmarks of cloud computing domain. Similarly, virtualization/containerization facilitates many things, including moving the computation to the data. However, it is unclear how these technologies can be employed to provide substantial benefits for distributed algorithms which operate at a more abstract plane.

Challenges

Coordination introduces synchronization, which introduces potential for cascading shutdowns and halts. Especially, in a large-scale system of systems, any coordination introduced may lead to spooking/latent/self-organized synchronization. This point is explained nicely in "Towards A Cloud Computing Research Agenda". Thus the challenge is to avoid coordination as much as possible, and still build useful and "consistent" systems.

Another curse of the extreme scale in cloud computing is the large fan-in/fan-out of components. These challenges are explained nicely in the "Tail at Scale" paper. These can lead to broadcast storms and incast storms, and algorithms/systems should be designed to avoid these problems.

Finally, another major challenge is geographically distributed services. Due to speed of light communication over large distances are prone to latencies and especially pose consistency versus availability challenges due to the CAP theorem.

The current state of distributed systems research on cloud computing

This is not an exhaustive list. From what I can recollect now, here is a crude categorization of current research in distributed systems on the cloud computing domain.

Consistency in geo replicated systems, e.g., COPS, Red Blue Consistency, Use of clocks for consistency, HLC
Paxos variants, e.g., Egalitarian Paxos, Mencius, Raft, Tango, Chain replication
Efficient transaction primitives, e.g., Sinfonia, Salt, Calvin, Spanner, Partitioned replicated cloud databases
Data center networking, e.g., Onix, VL2, and many recent work.
CAP boundaries, e.g., Consistency, availability, and convergence
CRDTs, e.g., Conflict-free replicated data types, Eventual consistency language frameworks, Coordination Avoidance in Database Systems
New programming abstractions for big data and big graph processing, e.g., Naiad GraphX, Pregel

What is next?

If only I knew... I can only speculate, I have some ideas, and that will have to wait for another post.

In the meanwhile, I can point to these two articles which talked about what would be good/worthy research areas in cloud computing.
How can academics do research on cloud computing?
Academic cloud: pitfalls, opportunities

UPDATE: Part 2 of this post (i.e., new directions) is available here.