Wednesday, November 30, 2016

How does one get motivated for teaching?

A friend recently complained that, even after years in academia, he never got quite adjusted to teaching. He said he doesn't see much incentive in teaching and found it boring, and asked about how one can get motivated for teaching.

This is actually a good question and it is important to get in to the habit of questioning and checking oneself. Here is how I answer this question.

  1. I don't need to get motivated for teaching. Teaching is a responsibility bestowed upon me as an academician. So I look at this from a responsibility perspective. I also have responsibilities as a parent, as a husband, or as a driver (if I am the designated driver), and for matters of responsibility, I don't need motivation. I know I should rise to the occasion.
  2. Teaching is my service to the students, who really need it. Today students have many alternatives. They can watch a lecture on YouTube, and can find a variety of study material on the Internet. But it is only I that stand in front of them in person and I get to serve as an example and inspiration to them. I take this role seriously and try to show the students how a curious researcher approaches problems and concepts. I try to show them, through diffusion, the love and fun of figuring out something new. At each class, I make sure that I take enough time to motivate my students about a question because this motivation determines the amount of attention my students will devote to learning that subject. (Yes there will inevitably be some students that look uninterested and doze off. With experience, I learned not to get demoralized by a couple such students, because that would be a disservice to the rest of the students who needs to see me motivated and engaging. I remind myself that I don't know the situation of the disengaged students---they may have other problems--, and I hope to be able to engage them at another time.)
  3. I regard teaching as a challenge and try to improve myself. I figured out early on that teaching is a good practice for giving better conference talks. When I was a fresh faculty, I felt very anxious and nervous before conference talks and before classes. But with practice gained via teaching this went away gradually, and I started to really enjoy teaching and giving conference talks. Nowadays I started to see teaching as a performance art. (Sure enough, I wasn't the first to think this, and there exists a book with this title.) When teaching, I look at the students' eyes and I try to connect. I try to see that they understand.  I try to focus on the message that I want to communicate, and detach my ego out of the way. I had written on this at a previous post titled "How to present your work". Ironically if you put your effort in impressing the audience you will fail at it, but if you focus on teaching to your audience, you will impress them.
  4. I see teaching as an opportunity to simplify the material. When teaching I try to keep the message simple. I try to figure out the gist of the material, and challenge myself to communicate that to the students. This reductionist approach should be familiar. This is also how we do research. We try to reduce complex material by throwing away the accidental complexity and focus on the intrinsic complexity. We then attack intrinsic complexity to reduce it to simpler principles. Approached this way, teaching also serves a good practice for research. This type of practice helps for research in general and sometimes leads to a new paper in particular.

Some disclaimers are in order to put my advice in context. My teaching load as a professor in a research university is pretty light. I teach one class a semester. So, maybe some of what I say may look impractical for more heavy teaching duties. I teach mostly graduate courses, so advice number 4 about teaching helping research may be less applicable when teaching undergraduate courses.

Finally, here are some practical tricks I picked up about teaching:

  • Prepare well. If I go underprepared to a class, I give a bland lecture, and I feel bad about it all day. Instead of wasting time feeling bad about bad teaching, I shift that time to preparing better for the class, so I enjoy teaching, and enjoy the rest of my day.
  • Concentrate on a simpler message, and communicate that really well in several ways to leave no doubt of transmission. When I have a bad class, it is often because I tried to cram too many things and didn't distill my message to its essence. So I learned to leave out the unessential, and communicate the essentials as clearly and rigorously as possible. As you may have noticed, Powerpoint makes it easy to cram many things to lecture notes, so you have to actively resist the temptation.
  • Slow down, pace yourself better. I learned to empathize with the students who are seeing the material for the first time, and need some time and immersion to wrap their brains around it.
  • Ask questions to the class frequently. To get the students engaged, I direct questions to the class frequently, and wait patiently until I start hearing some answers or guesses. I comment on these replies, suggest alternatives, and ask more specific questions to direct my students to think harder to provide better answers. In order to learn the subject matter, the students should be forced to do some thinking.
  • Use reenactments and use interactive material. To engage the students, I invite the students to the front of the class to reenact some algorithms by acting as the processes involved in the algorithm. I also don't shy away from showing YouTube videos that help communicate a point.

Sunday, November 27, 2016

My Distributed Systems Seminar's reading list for Spring 2017

Below is the first draft list of papers I plan to discuss in my distributed systems seminar in the Spring semester. If you have some suggestions on some good/recent papers to cover, please let me know in the comments.

Datacenter Operating System

Firmament: Fast, Centralized Cluster Scheduling at Scale (OSDI 16)
Large-scale cluster management at Google with Borg (Eurosys 15)
Apache Hadoop YARN: yet another resource negotiator (SOCC 13)
Slicer: Auto-Sharding for Datacenter Applications (OSDI 16)


Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems (SOSP 15)
Shasta: Interactive Reporting At Scale (SIGMOD 16)
Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases (SIGMOD 16)


The many faces of consistency (2016)
The SNOW Theorem and Latency-Optimal Read-Only Transactions (OSDI 16)
Incremental Consistency Guarantees for Replicated Objects  (OSDI 16)
Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering  (OSDI 16)
FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs  (OSDI 16)


The Honey Badger of BFT Protocols (2016)
The Bitcoin Backbone Protocol: Analysis and Applications (2015)
XFT: Practical Fault Tolerance beyond Crashes (OSDI 16)


2016 Seminar reading list
2015 Seminar reading list

Thursday, November 24, 2016

Modeling a Replicated Storage System in TLA+, Project 1

Why a TLA+ project?

The first project assignment in my distributed systems class this semester was modeling a replicated storage system in TLA+. Assigning a TLA+ project makes me a rarity among distributed systems professors. A common project would be a MapReduce programming assignment or a project to implement a simple distributed service (such as a key-value store) in Java.

I think that a MapReduce deployment project does not teach much about distributed systems, because MapReduce is a very crude abstraction and hides all things distributed under the hood. Using MapReduce for the distributed systems class project would be like handing people a mechanics certification upon successful completion of a driving test.

Implementing a simple distributed service, on the other hand, would teach students that indeed programming and debugging distributed systems is very hard. However, I would suspect that much of the hardship in that project would be due to accidental complexities of the implementation language rather than the intrinsic complexity of reasoning about distributed systems in the presence of concurrent execution and failures. In a distributed systems deployment, it would be hard for the students to test exhaustively for race conditions, concurrency bugs, and exercise many possible combinations of node failures and message losses. That is notoriously hard for even the professionals, as shown by this survey of 104 distributed concurrency bugs from Cassandra, HBase, Hadoop MapReduce, and ZooKeeper.

I am of course not against assigning an implementation project in a distributed systems class. I see the utility and necessity in giving students hands-on programming experience on distributed systems. I think an implementation project would be suitable for an advanced distributed systems class. The class I am teaching is the first distributed systems class for the students, so  an implementation project would be unnecessarily complicated and burdensome for these students, who are also taking 3 other classes.

I teach the distributed systems class with emphasis on reasoning about the correctness of distributed algorithms, so a TLA+ is a good fit and complement for my course. Integrating TLA+ to the class gave students a way to get a hands-on experience in algorithms design and dealing with the intrinsic complexities of distributed systems: concurrent execution, asymmetry of information, concurrency bugs, and a series of untimely failures.

TLA+ has a lot to offer for practice and implementation of distributed systems. At Amazon, the engineers used TLA+ for modeling S3, DynamoDB, and some other production systems that see a lot of updates and new features. TLA+ helped the engineers find several critical bugs introduced by updates/features in the design stage, which if not found would have resulted in large amount of engineering effort later on. These are detailed in a couple of articles from Amazon engineers. I have been hearing other TLA+ adoption cases in the industry, and hope to see increasingly more adoption in the coming years.

Modeling Voldemort replicated storage system with client-side routing in TLA+

I wanted to assign the TLA+ modeling project on a practical useful application. So I chose modeling of a replicated storage system. I assigned the students to model Voldemort with client-side routing as their first project.

Here is the protocol. The client reads the highest version number "hver" from the read quorum (ReadQ). The client then writes to the write quorum nodes (WriteQ) the store the updated record with "hver+1" version number. The storage nodes can crash or recover, provided that no more than FAILNUM number of nodes are crashed at any moment. Our WriteQ and ReadQ selection will consist of the lowest id storage nodes that are up (currently not failed).

I asked the students to model check with different combinations of ReadQ, WriteQ, and FAILNUM, and figure out the relation that needs to be satisfied among these configuration parameters in order to ensure that the protocol satisfies the single-copy consistency property. I wanted my students to see how consistency can be violated as a result of a series of unfortunate events (such as untimely node death and recoveries). The model checker is very good for producing counterexamples where consistency is violated.

Simplifying assumptions and template to get the students started

I tried to keep the first project easy. We simplified things by modeling the storing (and updating) of just a single data item, so we didn't have to model the hashing part. We also used shared memory. The client directly writes (say via an RPC) to the db of the storage nodes. (It is possible to model a message-box at each process, and I assigned that for the second project.)

I also gave the students the template for the model. This helps my TAs a lot in grading. Without any template the students can go in wildly different directions with their modeling. Below is the template. In case you want to give this a try and see how you do with it, I will wait a week or so before I share my solution for the project1.

Extending to the second project

I have assigned the second project again on the modeling of a replicated storage system. But this time instead of a quorum, "chain replication" is used for ensuring persistence and consistency of storage. In the second project, the replication is done by server-side routing, and the modeling includes message passing.


Using TLA+ for teaching distributed systems (Aug 2014)
My experience with using TLA+ in distributed systems class (Jan 2015)
Modeling the hygienic dining philosophers algorithm in TLA+

There is a vibrant Google Groups forum for TLA+ :!forum/tlaplus
Clicking on label "tla" at the end of the post you can reach all my posts about TLA+

Monday, November 21, 2016

Book Review: Einstein (by Walter Isaacson)

I read this book recently and liked it a lot. It was a surprisingly engaging read. I thought I knew a lot about Einstein, and the book would be redundant and bland. But the book proved me wrong. I learned a lot of new things about Einstein, especially about his personality and his research perspective and style. The book also did a great job on explaining Einstein's theories, scientific achievements in an accessible and interesting manner.

Einstein's personality

Let's get this out of the way. Einstein didn't fail math. He was always a very smart and hardworking student. In primary school, he was at the top of his class and "far above the school requirements" in math. And before he was 15 he had mastered differential and integral calculus.

What may have helped propagate the myth was Einstein's unruly and defiant personality. Einstein would do great at the things he likes, and not good at the things he doesn't like. He had excelled in his university classes that he liked, and did average in the ones he disliked (such as physics experiments classes). Einstein was certainly a black sheep and kept clashing with authority. He was an outspoken peace activist and social democrat. He was brave, and not afraid to speak his mind even when he was getting dead threats and the wars were brewing in Europe. This defiant/black-sheep personality also helped him formulate novel theories that revolutionized the prevalent physics views/understanding at the time.

True to the stereotype, Einstein was goofy and forgot stuff all the time. He had spent a lot of time in this brain, thinking about physics all the time. "I can already imagine the fun we will have," he wrote to Maric about a prospective vacation. "And then we'll start in on Helmholtz's electromagnetic theory of light." He was immersed with physics/ideas all the time, and his daily life almost came as interludes to his physics thinking. This has also been a prevalent pattern in  Richard Feynman's life.

Contrary to the nerd professor stereotype, Einstein was also very outdoorsy, sailed, hiked, and liked to live a good life. He was a very good violin player and enjoyed giving concerts to friends. Teaching was initially hard and awkward for Einstein, but he gradually got accustomed to it, and found his own unique style and voice for teaching.

Einstein enjoyed company and attention of his friends. He founded the Olympia Academy with group of friends in Bern, Switzerland, who met in his apartment in order to discuss philosophy and physics. Einstein used this group to bounce off his ideas, and study and learn new things very much  like Benjamin Franklin's Junto group. (I also recommend the Benjamin Franklin's biography written by Walter Isaacson.)

Einstein's research

Einstein was a very intuitive and creative person. He famously said "Imagination is more important than knowledge." Einstein was a big ideas person. He was a deep thinker, but not a brute force technical/math/hardwork physicist type. His papers are not tour-de-force papers, rather they were simple, and more like position papers. He was a theorist. He read a lot of papers, including many experimental papers, and thought a lot and suggested elegant theories to explain the phenomena reported in those papers. The way he developed those theories is mainly by running thought experiments (gedankenexperiment).

The chapter about Einstein's miracle year (1905) is a very good read. This was where Einstein caught his first breakthrough. Einstein, then 26, published 4 groundbreaking papers (short position papers), on the photoelectric effect, Brownian motion, special relativity, and the equivalence of mass and energy. Einstein kept this momentum and kept publishing interesting papers at a good rate, including his new theory of general relativity in 1911.

What factors led to his breakthrough miracle year? Einstein had been churning some ideas and gestating theories starting from his college years. He had also been bouncing ideas off his friends in Olympia Academy. And finally, he was employed at the patent office which stimulated Einstein's thinking. Working at the patent office was like listening to high-tech startup pitches everyday, except the topics were mostly on electric/magnetic waves and clock synchronization, which instigated his thinking on photoelectric effect and relativity.

Einstein's relentless focus

It was very interesting to read about the promising young Einstein trying to get a job, but failing to get even a small teaching job. His attempts at PhD dissertations were also turned down. Unappreciated genius indeed. In addition to those difficulties, Einstein also lived through very turbulent times in Europe. His marriage and personal life also went through very tough periods. Throughout all this, he was always focused on his research and found solace in his studies. He was a man with a mission. He was afraid he would die before he could build his relativity theory, and was OK with dying afterwards.

On page 162, the book talks about how deep Einstein could concentrate. When his students dropped by his apartment, they found that Einstein was finishing work on a complicated problem, while he had his small son in his lap crying. Not exactly an ideal work environment, but Einstein was so focused, he wasn't bothered by the son crying, and solved the problem. A couple page later, there was a heartwarming description of Einstein and Lorentz doing research together. Two passionate physics geeks discussing ideas and bouncing questions back and forth.

Thursday, November 17, 2016

How Complex Systems Fail

This is a 4 page report about the nature of failures in complex systems. It is a gloomy report. It says that complex systems are always ridden with faults, and will fail when some of these faults conspire and cluster. In other words, complex systems constantly dwell on the verge of failures/outages/accidents.

The writing of the report is peculiar. It is written as a list of 18 items (ooh, everyone loves lists). But the items are not independent. For example, it is hard to understand items 1 and 2, until you read item 3. Items 1 and 2 are in fact laying the foundations for item 3.

The report is written by an MD, and is primarily focused on healthcare related complex systems, but I think almost all of the points also apply for other complex systems, and in particular cloud computing systems. In two recent posts (Post1, Post2), I had covered papers that investigate failures in cloud computing systems, so I thought this report would be a nice complement to them.

1) Complex systems are intrinsically hazardous systems.
I think the right wording here should be "high-stakes" rather than "hazardous". For example, cloud computing is not "hazardous" but it is definitely "high-stakes".

2) Complex systems are heavily and successfully defended against failure.
Is there an undertone here which implies these defense mechanisms contribute to make these high-stakes systems even more complex?

3) Catastrophe requires multiple failures – single point failures are not enough.
This is because the anticipated failure modes are already well guarded.

4) Complex systems contain changing mixtures of failures latent within them.
"Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident. The failures change constantly because of changing technology, work organization, and efforts to eradicate failures." This is pretty much the lesson from the cloud outages study. Old services fail as much as new services, because the playground keeps changing.

5) Complex systems run in degraded mode.
"A corollary to the preceding point is that complex systems run as broken systems."

6) Catastrophe is always just around the corner.

7) Post-accident attribution accident to a ‘root cause’ is fundamentally wrong.
"Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessary insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident."

8) Hindsight biases post-accident assessments of human performance.
The "everything is obvious in hindsight" fallacy was covered well in this book.

9) Human operators have dual roles: as producers & as defenders against failure.
10) All practitioner actions are gambles.
11) Actions at the sharp end resolve all ambiguity.

12) Human practitioners are the adaptable element of complex systems.
What about software agents? They can also react adaptively to the developing situations.  And today with machine learning and deep learning, especially so.

13) Human expertise in complex systems is constantly changing.
14) Change introduces new forms of failure.
The cloud outages survey has showed that updates and configuration changes and human factors account for more than 1/3rd of outages.

15) Views of ‘cause’ limit the effectiveness of defenses against future events.
Case-by-case addition of fault-tolerance is not very effective. "Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system."

16) Safety is a characteristic of systems and not of their components
Safety is a system-level property, unit testing of components is not enough.

17) People continuously create safety.
18) Failure free operations require experience with failure.
What doesn't kill you makes you stronger. In order to grow, you need to push the limits, and stress the system. Nassim Taleb's book about antifragility makes similar points.
And this short video on resilience is simply excellent.

Thursday, November 10, 2016

Modeling Paxos and Flexible Paxos in Pluscal and TLA+

The first part of this post describes modeling Paxos in Pluscal. The second part shows how to modify that model to achieve a flexible quorum Paxos. The idea for flexible quorums was introduced in a paper in 2016 summer. This simple and surprising result says "majority agreement is not required in Paxos and the sets of acceptors required to participate in agreement (known as quorums) do not even need to intersect with each other".

Modeling Paxos in Pluscal

While there are many examples of Paxos modeling in TLA+ available, I haven't found any Pluscal modeling of Paxos, except this one from Lamport, which helped me come up with my Pluscal model below. The problem with TLA+ is that it is too low-level (i.e., too declarative and math-like) for writing--and reading-- distributed algorithms. The PlusCal language provides a higher-level pseudocode, which is easier to follow.

However, as you go through my Pluscal model below, you will find that it doesn't follow your expectations of an implementation in your favorite imperative language. This is OK, Pluscal is meant to just model the algorithm so that we can model check for correctness against concurrency bugs. I had written more about modeling at a higher abstraction level earlier in this post and this post.

Leader denotes the range used for the ids of the leader processes, and Acceptor denotes the range used for the ids of the acceptor processes. Slot is the range of slots, and Ballots is the range of ballots.

Acceptors are simple, they just react to leaders' Phase1, Phase2, Phase3 messages sent with various ballot numbers. To this end, the acceptor body calls macros, which are inlined while the Pluscal code is being translated to TLA+ for model checking.

An acceptor keeps a variable for remembering the maximum ballot number maxBal it promised. It also remembers all values it accepted at Phase2a using hVal, a set of <slot, ballot, value> tuples. Finally an acceptor stores the decided proposals at each slot as a set. Of course if the agreement property of Paxos holds, the decided set for a slot has cardinality <=1.

The leader loops through the 3 phases of a round for each slot. It tries to dominate in Phase1, so it can go to Phase2. After the leader is elected, Phase1 can be skipped in subsequent slots, if the leader is not preempted by another leader. An elected leader can get preempted any time and CollectP2 can fail, so the leader checks this before it can decide a value at SendP3.

The ballot number of the leader b is incremented modulo M (the number of leaders) so it remains unique across leaders. The variable pVal is a set to store the values accepted in earlier slots, so a suitable value can be re-proposed in Phase2a. (See CollectP1 and SendP2 macros.)

AccMsg denotes set of acceptor messages sent and LMsg denotes set of leader messages sent. Instead of actually sending messages in channels and to each acceptor, sending message is modeled as adding a message in a messageboard, where other processes can nondestructively read the message. (This idea resembles the Linda tuplespaces idea.)

The macros SentXX returns a set of messages in the messageboard that match a specific filter. SuitVal is a macro for identifying the proposal with the highest ballot id accepted for a given slot.

SendP1(b) lets a leader put a Phase1a message with ballot number b to the AccMsg messageboard. ReplyP1(b) lets acceptors react to a Phase1a message with ballot number b by writing a reply back to LMsg messageboard. CollectP1(b) lets a leader to proceed as elected from Phase1 if a majority of acceptors said OK. Or the leader may learn that it has been preempted and retries Phase1 with a higher ballot number. The await statement serves as a guard: if the await predicate is not satisfied, the rest of the macro is not executed.

SendP2(b,s) lets a leader put a Phase2a message with ballot number b, slot number s to the AccMsg messageboard. The message proposes self as value, or SuitVal as value if applicable. ReplyP2(b) lets acceptors react to a Phase2a message with ballot number b by writing a reply back to LMsg messageboard.

Using CollectP2(b,s) a leader can learn that its proposal was accepted by majority of acceptors, or else it can learn that it has been preempted by a higher ballot number. SendP3 and ReceiveP3 macros implement Phase3 of Paxos.

The model checking took 7 minutes with the parameters I mentioned in the comments. Not bad.

Flexible Paxos

The flexible quorums idea was introduced in a paper in 2016 summer.  It says that we can weaken the Paxos requirement that "all quorums intersect" to require that "only quorums from different phases intersect". That is, majority quorums are not necessary, provided that Phase1 quorums intersect with Phase2 quorums.

For example  in a system of 10 acceptors, we can safely allow any set of only 3 acceptors to participate in Phase2, provided that we require 8 acceptors to participate for Phase1. This decreasing of Phase2 quorums at the cost of increasing Phase1 quorums is called as simple quorums.

Or alternatively, we can use grid quorums, where every column forms a Phase1 quorum, and every row a Phase2 quorum. In grid quorums, the quorums within either phase do not intersect with each other.

The flexible Paxos paper gave a TLA+ model, but didn't give a Pluscal model. But as I show next, we can implement the flexible Paxos quorums with a simple modification of our Paxos Pluscal model.

Quorum1 denotes Phase1 quorums, and Quorum2 denotes Phase2 quorums. Quorum1 quorums must intersect with Quorum2 quorums, but neither needs to be a majority quorum. Also Quorum1 quorums need not intersect among each other, and Quorum2 quorums need not intersect among each other. In the comments, I provide example parameters for model-checking the flexible Paxos extension.

Before the leader can get elected in Phase1, it checks to see if acceptors from a Quorum1 quorum said OK. SatQ1(b) and SatQ2(b) are macros for checking whether acceptors from a Quorum1 and Quorum2 quorum responded back for a given ballot number b.

Related links

Paxos and FPaxos Pluscal programs available on Github.

Using TLA+ for teaching distributed systems

My experience with using TLA+ in distributed systems class

There is a vibrant Google Groups forum for TLA+:!forum/tlaplus

By clicking on label "tla" at the end of the post you can reach all my posts about TLA+

Wednesday, November 9, 2016

[Paper review] TaxDC: A Taxonomy of nondeterministic concurrency bugs in datacenter distributed systems

This paper appeared in ASPLOS 2016 and the authors are Leesatapornwongsa, Lukman, Lu, and Gunawi.

The paper provides a comprehensive study on real-world distributed concurrency bugs and is a good complement to the paper I reviewed before: "Why does cloud stop computing". While the previous paper looked at all possible bugs that lead to service outages, this paper focuses only on distributed concurrency (DC) bugs. This type of bug happens to be my favorite kind of bug. I am fascinated with "failures", and with DC bugs doubly so. DC bugs are execution timing/ordering/scheduling related, they occur nondeterministically, and are extremely difficult to detect, diagnose, and fix in production systems. I love seeing those long traces of improbable DC bugs surface when I model check distributed algorithms in TLA+. The experience keeps me humble.

The paper presents analysis of 104 DC bugs from Cassandra, HBase, Hadoop MapReduce, and ZooKeeper. These bugs are categorized and studied according to the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies, as shown in Table 1. The bug taxonomy database is available here.

Trigger warning

More than 60% of DC bugs are triggered by a single untimely message delivery that commits order violation or atomicity violation, with regard to other messages or computation. Figure 1 shows possible triggering patterns.

This sounds like a very striking and surprising finding. How is it that simple? How do we reduce most DC bugs to untimely message delivery? If you think about it, it is actually not very surprising. What is a DC bug? It is a state inconsistency across processes. What makes the state inconsistency into a bug? A communication/message-exchange between the two inconsistent processes.

Of course the root cause leading to the state inconsistency can be pretty complex. Figure 3 shows that many DC bugs need complex input preconditions, such as faults (63% in Figure 3.b), multiple protocols (80% in Figure 3.f), and background protocols (81% in Figure 3g). As an example, consider the juicy bug described in Figure 4.

Can we fix it? Yes, we can!

The paper analyzes bug patches to understand developers' fix strategies for DC bugs. They find that DC bugs can be fixed by either disabling the triggering timing by adding extra synchronization, or by changing the system's handling logic by ignoring/neutralizing the untimely message.

Yes, this sounds very simple. But what this perspective hides is that you first need to figure out the bugs before you can fix them. And identifying the DC bugs is the hard question as they occur nondeterministically and are extremely difficulty to detect and diagnose. For example, Figure 3 shows 47% of DC bugs lead to silent failures and hare hard to detect and debug in production and reproduce offline.

This is also a losing game, as you always need to play catch up with the new corner cases arising and haunting your system. Instead of doing a case-by-case fixing of the systems, it is better to fix our misconceptions and approach to designing these distributed protocols.

I like the section on Root Causes, which talks about developers misconceptions about distributed systems that leads to these bugs. These misconceptions are simple to correct, and doing so can eradicate many DC bugs.

  • One hop is faster than two hops.
  • No hop is faster than one hop.
  • Atomic blocks cannot be broken.
  • Interactions between multiple protocols seem to be safe.
  • Enough states are maintained to detect/handle problems. (Upon observing that some fixes add new in-memory/on-disk state variables to handle untimely message and fault timings.)

Another thing to improve would be to use better testing tools. Recall that 63% of DC bugs surface in the presence of hardware faults such as machine crashes (and reboots), network delay and partition (timeouts), and disk errors. The paper doesn't mention the Jepsen testing tool, but it would be a good fit to detect many DC bugs mentioned in this study.

And finally ounce of prevention is worth a pound of cure. Design your distributed protocols right in the first place. You can use TLA+ to specify and model-check your distributed protocols/algorithms before you start implementing them.

In the lessons learned section the paper discusses about what can be done to detect/prevent DC bugs, and says that "No matter how sophisticated the tools are, they are ineffective without accurate specifications. This motivates the creation or inference of local specifications that can show early errors or symptoms of DC bugs." The implication is that since the protocols do not come with good formal specifications, the tool support for testing, checking, detection, etc. becomes ineffective. This point further motivates that it is important to get the distributed protocol right first, and the specification should accompany the protocol so that the implementation can be made reliable. TLA+ provides support for writing the specifications. You can see my previous posts on TLA+ below.

Using TLA+ for teaching distributed systems

My experience with using TLA+ in distributed systems class

Modeling the hygienic dining philosophers algorithm in TLA+

There is a vibrant Google Groups forum for TLA+:!forum/tlaplus

By clicking on label "tla" at the end of the post you can reach all my posts about TLA+

Sunday, November 6, 2016

Paper Review: Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages

This paper conducts a cloud outage study of 32 popular Internet services, and analyzes outage duration, root causes, impacts, and fix procedures. The paper appeared in SOCC 2016, and the authors are Gunawi, Hao, Suminto Laksono, Satria, Adityatama, and Eliazar.

Availability is clearly very important for cloud services. Downtimes cause financial and reputation damages. As our reliance to cloud services increase, loss of availability creates even more significant problems. Yet, several outages occur in cloud services every year. The paper tries to answer why outages still take place even with pervasive redundancies.

To answer that big question, here are the more focused questions the paper answers first.
  1. How many services do not reach 99% (or 99.9%) availability?
  2. Do outages happen more in mature or young services?
  3. What are the common root causes that plague a wide range of service deployments?
  4. What are the common lessons that can be gained from various outages?
What would be your answers for these questions? Please note them down before reading further in to the paper summary. These were my answers to those questions.
  1. 99% availability is easily achievable, 99.9% availability is still hard for many services.
  2. Young services would have more outages than mature older services.
  3. Most common root causes would be configuration and update related problems.
  4. "KISS: Keep it simple stupid" would be a common lesson. 

Methodology of the paper

The paper surveys 597 outages from 32 popular cloud services. Wow, that is impressive! One would think these authors must be very well connected to teams in the industry to perform such an extensive survey.

It turns out they just used Google search. They identified 32 popular cloud services (see Table 1), and then googled "service_name outage month year" for every month between January 2009 and December 2006. Then they went through the first 30 search hits and gathered 1247 unique links that describe 597 outages. They then systematically went through those post-mortem reports. Clever!

The paper says that this survey was possible "thanks to the era of providers' transparency". But this also constitutes the caveat for there approach as well. The results are only as good as the providers' transparency allowed. First, the dataset is not complete. Not all outages are reported publicly. The paper defines "service outage" as an unavailability of full or partial features of the service that impacts all or a significant number of users in such a way that the outage is reported publicly. Second, there is a skew in the dataset. The more popular a service is, the more attention its outages will gather. Third, outage classifications are incomplete due to lack of information. For example, only 40% outage descriptions reveal root causes and only 24% reveal fix procedures. (These ratios are disappointingly low.) And finally root causes are sometimes described vaguely in the postmortem reports. "Due to a configuration problem" can imply software bugs corrupting the configuration or operators setting a wrong configuration. But in this case, the paper chooses tags based on the information reported and use CONFIG tag, and not the BUGS or HUMAN tags.

In order not to discredit any service, the paper anonymizes the service names as category type followed by a number. (It is left as a fun exercise to the reader to de-anonymize the service names. :-)


If we consider only the worst year from each service, 10 services (31%) do not reach 99% uptime and 27 services (84%) do not reach 99.9% uptime. In other words, five-nine uptime (five minutes of annual downtime) is still far from reach.

Regarding the question "does service maturity help?", I got this wrong. I had guessed that young services would have more outages than mature services. But turns out, the outage numbers from young services are relatively small. Overall, the survey shows that outages can happen in any service regardless of its maturity. This is because the services do not remain the same as they mature. They evolve and grow with each passing year. They handle more users and complexity increases with the added features. In fact, as discussed in the root causes section, every root cause can occur in large popular services almost in every year. As services evolve and grow, similar problems in the past might reappear in new forms.

Root causes

The “Cnt” column in Table 3 shows that 355 outages (out of the total 597) have UNKNOWN root causes. Among the outages with reported root causes, UPGRADE, NETWORK, and BUGS are three most popular root causes, followed by CONFIG and LOAD. I had predicted the most common root causes to be configuration and update related and I was right about that.

BUGS label is used for tag reports that explicitly mention “bugs” or “software errors”. UPGRADE label implies hardware upgrades or software updates. LOAD denotes unexpected traffic overload that lead to outages. CROSS labels outages caused by disruptions from other services. POWER denotes failures due to power outages which account for 6% of outages in the study. Last but not least, external and natural disasters (NATDIS) such as lightning strikes, vehicle crashing into utility poles, government construction work cutting two optical cables, and similarly under-water fibre cable cut, cover 3% of service outages in the study.

The paper mentions that UPGRADE failures require more research attention. I think the Facebook "Configerator: Holistic Configuration Management" paper is a very relevant effort to address UPGRADE and CONFIG failures.

Single point of failure (SPOF)?

While component failures such as NETWORK, STORAGE, SERVER, HARDWARE, and POWER failures are anticipated and thus guarded with extra redundancies, how come their failures still lead to outages?  Is there another "hidden" single point of failure?

The paper answers this paradox as follows: "We find that the No-SPOF principle is not merely about redundancies, but also about the perfection of failure recovery chain: complete failure detection, flawless failover code, and working backup components. Although this recovery chain sounds straightforward, we observe numerous outages caused by an imperfection in one of the steps. We find cases of missing or incorrect failure detection that do not activate failover mechanisms, buggy failover code that cannot transfer control to backup systems, and cascading bugs and coincidental multiple failures that cause backup systems to also fail."

While the paper misses to mention them, I believe the following work are very related for addressing the No-SPOF problem. The first one is the crash-only software idea, which I had reviewed before: "Crash-only software refers to computer programs that handle failures by simply restarting, without attempting any sophisticated recovery. Since failure-handling and normal startup use the same methods, this can increase the chance that bugs in failure-handling code will be noticed." The second line of work is on recovery blocks and n-version software. While these are old ideas, they should still be applicable for modern cloud services. Especially with the current trend of deploying microservices, micro-reboots (advocated by crash-only software) and n-version redundancy can see more applications.

Figure 5 breaks down the root-cause impacts into 6 categories: full outages (59%), failures of essential operations (22%), performance glitches (14%), data loss (2%), data staleness/inconsistencies (1%), and security attacks/breaches (1%). Figure 5a shows the number of outages categorized by root causes and implications.

Only 24% of outage descriptions reveal the fix procedures. Figure 5.b breaks down reported fix procedures into 8 categories: add additional resources (10%), fix hardware (22%), fix software (22%), fix misconfiguration (7%), restart affected components (4%), restore data (14%), rollback software (8%), and "nothing" due to cross-dependencies (12%).


The take home message from the paper is that outages happen because software is a SPOF. This is not a new message, but the paper's contribution is to validate and restate this for cloud services.

On a personal note, I am fascinated with "failures".  Failures are the spice of distributed systems. Distributed systems would not be as interesting and challenging without them. For example, without crashed nodes, without loss of synchrony, and without lost messages, the consensus problem is trivial. On the other hand, with any of those failures, it becomes impossible to solve the consensus problem (i.e., to satisfy both safety and liveness specifications), as the attacking generals and FLP impossibility results prove.

Related links

Jim Hamilton (VP and Distinguished Engineer at Amazon Web Services) is also fascinated with failures. In his excellent blog Perspectives, he provided detailed analysis of the tragic end of the Italian cruise ship Costa Concordia (Post1, Post2) and another analysis about the Fukushima disaster. His paper titled "On designing and deploying internet scale services" is also a must read. Finally here is a video of his talk "failures at scale and how to ignore them". 

Here is an earlier post from me about failures, resilience, and beyond: "Antifragility from an engineering perspective".

Here is Dan Luu's summary of Notes on Google's Site Reliability Engineering book.

Finally this is a paper worth reviewing as a future blog post: How Complex Systems Fail.

Wednesday, November 2, 2016

Modeling the hygienic dining philosophers algorithm in TLA+

This post on hygienic dining philosophers algorithm is the continuation of the post on dining philosophers. If you haven't seen that one, it is better to read that first before you continue with this one.

The hygienic dining philosophers algorithm (due to Chandy & Misra 1984) generalizes the ring topology of the dining philosophers algorithm to an arbitrary undirected graph. The most important contribution of the hygienic philosophers algorithm is that it introduces a priority concept that ensures "No Starvation" even under weak fairness. (NoStarvation == \A p \in Procs: state[p]="Hungry" ~> state[p]="Eating") The scheme deprioritizes a process that has eaten so the process cannot greedily beat its neighbors to eating again. This is achieved by flagging the forks of a process that has eaten as "dirty". Dirty forks denote that the process has just eaten, therefore that process needs to provide the fork to any of its neighbors that asked for the dirty fork.

PlusCal modeling of the algorithm

Here is my PlusCal algorithm modeling of the hygienic dining philosophers problem.

Initialization and Modeling

I model the Topology as a set. Since we are looking at a undirected graph, the edges are bidirectional. Our definition of "Edge" predicate, and consequentially that of "GetNbrs", reflect the bidirectionality of the edges.

I modeled "forks" and "request tokens" as sets as well. The modeling of forks and requests are directional, and the definitions of "Forkat" and "Reqat" reflects this point. I also modeled "clean forks" as a set: if a fork is clean, the fork is listed in the set "Clean". (I might have gone wild with using sets for modeling all the data structures of the algorithm. But in my defense, sets are one of the simplest data structures, and are flexible and useful.)

Now let's talk about the initialization of the forks, request-tokens, and clean forks set. The initial placement of forks and their marking as clean or dirty determines the initial priority relation among the processes. So, when we initialize, we must be careful not to introduce a cycle, since that would result in a deadlock. Therefore, a structure should be imposed on the graph such that it is a directed acyclic graph (DAG).

The recommended initialization is that initially all the forks are dirty (i.e., Clean={}) and each process is "Thinking". If a process becomes hungry, it can then get the forks it requests (since they are already dirty) without waiting on other process (which may not have the inclination) to eat first. For simplicity, I decided to assign the dirty forks to the lower id process of an edge, so I initialized the fork set to be equal to the Topology set. (Of course in the fork set the edges are directional: the process that comes earlier in the edge tuple denotes that fork resides at that side of the edge.) Since I had to make sure that the request token is not on the side of the dirty fork (because that would be misinterpreted as the other node asking for the fork in the initial state) I initialized the request-token set to be equal to the reverse of the Topology set.

Process actions

Each process is initialized to start in Thinking state. The process then loops through checking appropriate actions for its current state, and perform state transitions when conditions are met.

But first the process checks if it needs to respond to any requests send to it for forks and does so if its state is not "Eating". The requests for the dirty forks are answered inside a while loop iteration. That a fork is dirty indicates the process has a lower priority and must defer. It does so by sending the fork to the other side, and "cleaning" the fork while it is being passed to the neighbor.

Else if the process is hungry and has some forks to request, the process starts requesting its missing forks. For this, the process needs to iterate through its neighbors via a while loop. It sets Q to be the set of neighbors for which a fork needs to be requested. Then, in each iteration of the loop, a neighbor is picked out from this set, and the request is sent to that neighbor's side. The while loop is completed when all neighbors are covered ---set Q becomes an empty set.

Otherwise, if the process is "Hungry", and satisfies all the conditions to eat, it transitions to the "Eating" state. The conditions are: the process has all the forks in adjacent edges on its side, and each fork is either clean, or the dirty forks have not received requests yet. In other words, a process can eat with some of the forks being dirty, provided that the corresponding neighbors for those dirty forks are not hungry and have not send requests for those forks. Since a process eats only when it has all the forks on its side, we are guaranteed that when a process is eating none of its neighbors can be eating. (Otherwise a fork would be in both sides of the edge, which would violate TypeOkFork predicate.)

If a process is Eating, it can transition to Thinking. For this, the process marks all its forks as dirty, iterating over a while loop.

Finally if a process is Thinking, it can transition to Hungry anytime.

A note about the use of labels

As I alluded to in the previous posts, the labels are not there for cosmetic reasons. They determine the granularity of steps. The TLA+ conversion of the PlusCal code uses the labels. So if you forget to put a label where it is needed, TLA+ will warn you about it. For example, TLA+ dictates that a while loop has a label. This means that each iteration of the while loop can be atomic, but the entire while loop cannot be executed as one atomic step. TLA+ does not allow for updating the same variable twice in one step. So TLA+ complained and required me to add "Hii:" and "Aii" labels.

I found that this corresponds reasonably well to the atomicity distributed shared memory model, where a process can write to one neighbor in one step, but cannot write to multiple neighbors at once.


I found three bugs in my earlier versions of the code through model checking. The first was on the guard of the "Hungry" to "Eating" transition. I had wrote "Edge(self,k)" as a conjunction. After TLA toolkit trace showed me my mistake, I corrected that to "Edge(self,k)=>".

The second bug was much more subtle. When a hungry process was requesting tokens, I had written the request token subtraction outside the if clause. (There was an if clause checking neighbor should be sent the request. And I had the request addition (req:= req \union {<<q,self>>};) inside the if, request subtraction outside the if (Hii: req:= req \ {<<self,q>>};). So the model checker was maliciously timing the other process to send the fork, and then send a request for that fork back. Then when the line "Hii: req:= req \ {<<self,q>>}" is executed the removal of the request token would be removing the genuine request send back from that other process, not the ghost of the request token that the process had sent in the first place. At that point I hadn't written the TypeOK predicates in my model, so I didn't catch this quickly. Instead, I had to figure this subtle bug by inspecting a trace of 61 steps (the smallest number of steps needed to create this bug). The lesson I learned is not to procrastinate about checking for the TypeOK predicates for ensuring sound updating of the data structures.

Finally, the third bug was also very subtle. I had a "doneReq" flag to avoid unnecessary requesting of forks, if the forks were requested before. Again the model checker was being a jerk and finding a rare sequencing of events to surface a rare bug. The trace was 55 steps, repeating after step 43. The process 3 kept eating, while 0, 1, and 2 starved. Processes 0, 1, and 2 are hungry, and since their doneReq flag was true, they didn't send additional requests. They didn't actually send request to process 3 because they already had the dirty token on their side. But then process 3 send them the request and got those tokens. And process 0,1,2 are not sending the request again, because they think they are done with requests. I fixed the problem by removing the doneReq flag but instead checking for the need for requesting. This also sped up model checking, as less state is maintained.

Improving the model checking complexity

This was a small model with 4 processes, but the model checking took about 50 minutes and used about 2GB of storage to exhaustively search the entire plausible execution state space. To get that space back, make sure you manually delete the model checking folder, folder ending with ".toolbox". You don't need to worry about space too much though: if you start a new model check it writes over the previous one.

My initial attempts had took about 4GB of space and progressed slowly, which led me stopping them. To speed up the model checking, I improved the answering of requests for forks by making the GetReqNbrs function more selective by adding the second line to that check. Seeing that it helped speed up model checking, I also wrote a more selective GetForkNbrs function. These reduced the state space, and the execution time. But I still can't try checking with more than 4 processes as that takes a long time.

Concluding remarks

So there we go. Hygienic philosophers algorithm extend the topology of dining philosophers solution to an arbitrary graph, and more importantly, provides "NoStarvation" guarantee even under a weakly-fair scheduler. The algorithm ensures that only one node executes critical section in a neighborhood, and this is useful for coordinating critical action executions in a distributed system. If all nodes execute without coordination, we have maximum parallel semantics which cannot provide serializability (sequential consistency) guarantees. Using dining philosophers for coordination provides serializability guarantees. So, hygienic dining philosophers algorithm find applications in transforming programs written at a shared memory model with serializability assumptions to execute correctly with the serializability guarantees at the message passing model.


The PlusCal code for the programs are accessible through the below links. When you want to model check the program, start a new model, and enter 4 for N, so the model checker is ready to start. In the model checker tab, click on Invariant pane and enter ME then TypeOkFork then TypeOkReq, so that these three predicates are checked against any violation, and in the Properties pane enter NoStarvation so that the progress property is checked for satisfaction. If you make a modification to the PlusCal code, make sure you click on File, Translate Algorithm, so the PlusCal code gets translated to the TLA+ code, which is then used for the model-checking. See the TLA+ posts below about how this works.

Here is the final correct version of the model

Here is the version that has bug 3

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...