Cloud fault-tolerance

- August 19, 2017

I had submitted this paper to SSS'17. But it got rejected. So I made it a technical report and am sharing it here. While the paper is titled "Does the cloud need stabilizing?", it is relevant to the more general cloud fault-tolerance topic.

I think we wrote the paper clearly. Maybe too clearly.

Here is the link to our paper: https://www.cse.buffalo.edu//tech-reports/2017-02.pdf
(Ideally I would have liked to expand on Section 4. I think that is what we will work on now.)

Below is an excerpt from the introduction of our paper, if you want to skim that before downloading the pdf.
-------------------------------------------------------------------------
The last decade has witnessed rapid proliferation of cloud computing. Internet-scale webservices have been developed providing search services over billions of webpages (such as Google and Bing), and providing social network applications to billions of users (such as Facebook and Twitter). While even the smallest distributed programs (with 3-5 actions) can produce many unanticipated error cases due to concurrency involved, it seems short of a miracle that these web-services are able to operate at those vast scales. These services have their share of occasional mishap and downtimes, but overall they hold up really well.

In this paper, we try to answer what factors contribute most to the high-availability of cloud computing services, what type of fault-tolerance and recovery mechanisms are employed by the cloud computing systems, and whether self-stabilization fits anywhere in that picture.
(Stabilization is a type of fault tolerance that advocates dealing with faults in a principled unified manner instead of on a case by case basis: Instead of trying to figure out how much faults can disrupt the system's operation, stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization then advocates designing recovery actions that takes the program back to invariant states starting from any arbitrary state.)

Self-stabilization had shown a lot of promise early on for being applicable in the cloud computing domain. The CAP theorem seemed to motivate the need for designing eventually-consistent systems for the cloud and self-stabilization has been pointed out by experts as a promising direction towards developing a cloud computing research agenda. On the other hand, there has not been many examples of stabilization in the clouds. For the last 6 years, the first author has been thinking about writing a "Stabilization in the Clouds" position paper, or even a survey when he thought there would surely be plenty of stabilizing design examples in the cloud. However, this proved to be a tricky undertaking. Discounting the design of eventually-consistent key-value stores and application of Conflict-free Replicated Data Types (CRDTs) for replication within key-value stores, the examples of self-stabilization in the cloud computing domain have been overwhelmingly trivial.

We ascribe the reason self-stabilization has not been prominent in the cloud to our observation that cloud computing systems use infrastructure support to keep things simple and reduce the need for sophisticated design of fault-tolerance mechanisms. In particular, we identify the following cloud design principles to be the most important factors contributing to the high-availability of cloud services.

Keep the services "stateless" to avoid state corruption. By leveraging on distributed stores for maintaining application data and on ZooKeeper for distributed coordination, the cloud computing systems keep the computing nodes almost stateless. Due to abundance of storage nodes, the key-value stores and databases replicate the data multiple times and achieves high-availability and fault-tolerance.
Design loosely coupled distributed services where nodes are dispensable/substitutable. The service-oriented architecture, and the RESTful APIs for composing microservices are very prevalent design patterns for cloud computing systems, and they help facilitate the design of loosely-coupled distributed services. This minimizes the footprint and complexity of the global invariants maintained across nodes in the cloud computing systems. Finally, the virtual computing abstractions, such as virtual machines, containers, and lambda computing servers help make computing nodes easily restartable and substitutable for each other.
Leverage on low level infrastructure and sharding when building applications. The low-level cloud computing infrastructure often contain more interesting/critical invariants and thus they are designed by experienced engineers, tested rigorously, and sometimes even formally verified. Higher-level applications leverage on the low-level infrastructure, and avoid complicated invariants as they resort to sharding at the object-level and user-level. Sharding reduces the atomicity of updates, but this level of atomicity has been adequate for most webservices, such as social networks.

A common theme among these principles is that they keep the services simple, and trivially "stabilizing", in the informal sense of the term. Does this mean that self-stabilization research is unwarranted for cloud computing systems? To answer this, we point to some silver lining in the clouds for stabilization research. We notice a trend that even at the application-level, the distributed systems software starts to get more complicated/convoluted as services with more ambitious coordination needs are being build.

In particular, we explore the opportunity of applying self-stabilization to tame the complications that arise when composing multiple microservices to provide higher-level services. This is getting more common with the increased consumer demand for higher-level and more sophisticated web services. The higher-level services are in effect implementing distributed transactions over the federated microservices from multiple geodistributed vendors/parties, and that makes them prone to state unsynchronization and corruption due to incomplete/failed requests at some microservices. At the data processing systems level, we also highlight a need for self-regulating and self-stabilizing design for realtime stream processing systems, as these systems get more ambitious and complicated as well.

Finally, we point out to a rift in the cloud computing fault model and recovery techniques, which motivates the need for more sophisticated recovery techniques. Traditionally the cloud computing model adopted the crash failure model, and managed to confine the faults within this model. In the cloud, it was feasible to use multiple nodes to redundantly store state, and easily substitute a stateless worker with another one as nodes are abundant and dispensable. However, recent surveys on the topic remark that more complex faults are starting to prevail in the clouds, and recovery techniques of restart, checkpoint-reset, and devops involved rollback and recovery are becoming inadequate.