New directions for distributed systems research in cloud computing
This post is a continuation of my earlier post on "a distributed systems research agenda for cloud computing". Here are some directions I think are useful directions for distributed systems research in cloud computing.
This trend may indicate that the distributed algorithms should need to adopt to the data it operates on to improve performance. So, we may see the adoption of machine-learning as input/feedback to the algorithms, and the algorithms becoming data-driven and data-aware. (For example, this could be a good way to attack the tail-latency problem discussed here.)
Similarly, driven by the demand from the large-scale cloud computing services, we may see power-management, energy-efficiency, electricity-cost-efficiency as requirements for distributed algorithms. Big players already partition data as hot, warm, cold, and employ tricks to reduce power. We may see algorithms becoming more aware of this.
Heeding the advice in the first challenge in my previous post, this may suggest that we should look into implicit/diffusing/asynchronous/eventual coordination, such as coordination by writing to datastores and other processes reading off of it. Pat Helland's article suggested entity and activities abstractions which can be useful primitives to get started on implicit/diffusing coordination.
Another way to scale coordination is to relax consistency. It is easy to scale consistency, it is easy to scale availability, but not both! Eventual-agreement/convergent-consistency provides a way out of this. There are already a lot of exciting work in this area, and this area will receive getting more attention. Brewer, in his "CAP 12 years later" article, has given nice clues to pursue these kind of systems. We may see systems that also associates uncertainty with the consistency of current state in order to facilitate conflict recovery and eventual consistency.
Self-stabilization is a great approach to deal with unanticipated faults. I am guessing we will see a surge in research on self-stabilizing algorithms to achieve extreme resiliency to the face of unanticipated faults in cloud computing systems. Recovery oriented computing (ROC), resettable systems (crash-only software) is a special case of self-stabilization. And we may see extensions of that work for distributed systems. A critical question here will be "How can we make ROC compose nicely for distributed systems?"
To insure against correlated failures, we may see multiversion programming approaches to be revisited. This can also be helpful to avoid the spooky/self-organizing synchronization in cloud computing systems.
For scalable fault-tolerance, asynchronous algorithms like self-stabilizing Propogation of Information with Feedback (PIF) algorithms may be adopted in the cloud domain. Furthermore, pheromene/hormone based algorithms that run in the background in a slow mode can be made extremely scalable exploiting peer-to-peer random-gossip techniques.
Data-driven/data-aware algorithms
Please check the Facebook and Google software architecture diagrams in these two links: Facebook Software Stack, Google Software Stack. You will notice that the architecture is all about data: almost all components are about either data processing or data storage.This trend may indicate that the distributed algorithms should need to adopt to the data it operates on to improve performance. So, we may see the adoption of machine-learning as input/feedback to the algorithms, and the algorithms becoming data-driven and data-aware. (For example, this could be a good way to attack the tail-latency problem discussed here.)
Similarly, driven by the demand from the large-scale cloud computing services, we may see power-management, energy-efficiency, electricity-cost-efficiency as requirements for distributed algorithms. Big players already partition data as hot, warm, cold, and employ tricks to reduce power. We may see algorithms becoming more aware of this.
Scalable coordination
Again refer to the Facebook and Google service stacks linked above. Facebook stack does not have dedicated coordination services, only monitoring tools. (Of course the data stores employ Paxos to replicate the masters.) Google stack has some coordination and cluster management tools. These large scale systems already seem to embrace the principle of operating with as little coordination as possible.Heeding the advice in the first challenge in my previous post, this may suggest that we should look into implicit/diffusing/asynchronous/eventual coordination, such as coordination by writing to datastores and other processes reading off of it. Pat Helland's article suggested entity and activities abstractions which can be useful primitives to get started on implicit/diffusing coordination.
Another way to scale coordination is to relax consistency. It is easy to scale consistency, it is easy to scale availability, but not both! Eventual-agreement/convergent-consistency provides a way out of this. There are already a lot of exciting work in this area, and this area will receive getting more attention. Brewer, in his "CAP 12 years later" article, has given nice clues to pursue these kind of systems. We may see systems that also associates uncertainty with the consistency of current state in order to facilitate conflict recovery and eventual consistency.
Extremely resilient systems
Cloud computing transformed the fault-tolerance landscape. Node failures are not a big deal, thanks to the abundance and replication in cloud systems, the nodes are replaceable. Now complex failure modes and distributed state corruption based failures became more critical problems. Improving the availability of these cloud systems are very important to the face of these unanticipated failure modes; it makes the news if a large cloud service is unavailable for several minutes. In his advice for research on cloud computing, Matt Welsh mentions these two: 1) Building failure recovery mechanisms that are robust to massive correlated outages. 2) Handling both large-scale upgrades to computing capacity as well as large-scale outages seamlessly, without having to completely shut down your service and everything it depends on."Self-stabilization is a great approach to deal with unanticipated faults. I am guessing we will see a surge in research on self-stabilizing algorithms to achieve extreme resiliency to the face of unanticipated faults in cloud computing systems. Recovery oriented computing (ROC), resettable systems (crash-only software) is a special case of self-stabilization. And we may see extensions of that work for distributed systems. A critical question here will be "How can we make ROC compose nicely for distributed systems?"
To insure against correlated failures, we may see multiversion programming approaches to be revisited. This can also be helpful to avoid the spooky/self-organizing synchronization in cloud computing systems.
For scalable fault-tolerance, asynchronous algorithms like self-stabilizing Propogation of Information with Feedback (PIF) algorithms may be adopted in the cloud domain. Furthermore, pheromene/hormone based algorithms that run in the background in a slow mode can be made extremely scalable exploiting peer-to-peer random-gossip techniques.
Comments