Monday, November 18, 2019

Book Review. Digital minimalism: Choosing a Focused Life in a Noisy World

"Digital Minimalism: Choosing a Focused Life in a Noisy World" is Cal Newport's new book. The topic of the book is clear from the title. You should be quitting your Facebook checking, Twitter chatting, Imgur browsing, phone fiddling habits. No more monkey business. It is time to get to work. Deep work.

In Chapter 4 of the book, Calvin forms the term *solitude deprivation*. Solitude deprivation is on the other end of the spectrum to solitary confinement, but it can also be bad for you as well over a long duration. The book argues that today we all experience solitude deprivation. The smartphones, laptops, and screens do not give us time to be alone with our thoughts and process things in our speeds. I had heard a nice story, where the Amazon natives recruited for an expedition in to the jungle would take long breaks after doing some walking. They would say they are waiting for their soul to catch up to their bodies. Today we don't give time for our souls to catch up to us, and process both emotionally and rationally the flood of events/news we are bombarded with every day.

So I really liked Chapter 4 that talked about solitude deprivation. This chapter made me really worried that online connectivity, and never getting bored, could be doing more harm to me than I thought.  This chapter made a much more convincing case for the need for quitting social media than the first couple chapters in my view. But maybe it is because I am more of an abstract ideas guy.

Calvin's previous book "Deep Work" had a lot of impact. I think "Digital Minimalism" may not have that much impact. (Well, Digital Minimalism already has become a New York Times, Wall Street Journal, Publishers Weekly, and USA Today bestseller... I guess I mean more impact than that ;-) Deep Work had a positive message, "embrace deeper concentration", whereas Digital Minimalism has a negative message, "prevent digital clutter". I know, I know... For each book, you could simply switch the statements from positive to negative and vice versa. I am just referring to the tone/mood of the books. Digital Minimalism is more of a self-help/how-to book. It prescribes lists of things to do and not to do in a somewhat patronizing voice. The Deep Work book was more conceptual and thought-provoking, and less of a how-to self-help book. I have listened to Deep Work at least three times. I don't see that happening with the "Digital Minimalism" book. I would have liked to read a book titled "Deep Solitude" from Calvin, which I am sure I would be re-reading several times.
If you want to build a ship, don't drum up people to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea.
--Antoine de Saint-Exupery

In any case, I think this is a book you should definitely check out. I wish Calvin best of luck with getting these ideas adopted. They are very timely and important.  In 2005, I was labmates with Calvin at Nancy Lynch's theory of distributed systems group. Calvin is like a real life Captain America. Always responsible, kind, tidy, and disciplined. He would arrange his working hours carefully and would optimize everything. He is super smart and productive. His publication record is impressive. He is a theory and abstract thinking person at heart. He thinks clearly and in a sound manner with much deliberation. He is both a successful professor and a successful writer. He is walking the walk as well. So we should pay attention when he is talking.

MAD questions

1. Is social media more dangerous than TV was? 
It may be so because it is algorithmically manipulated to be addictive. Technology companies completed the loop and this is a quick feedback loop fed by millions of people participating in the experiment.  On the other hand, I have heard of a hypothesis that, since the kids are raising up with this technology, they will develop ways to be immune to it. But I don't know if I am convinced by that argument. Parents should definitely regulate/restrict for these technologies. And I think even the governments should be regulating for these technologies. The Digital Minimalism book cites that mental health is at crisis level for millennials raised with this technology available to them. I see students on their phones walking in the halls and fiddling with their phones even in class. They are always busy catching up to what is on their screens, but they are missing up on things happening around them, and most importantly happening within them.

2. Is it possible to have a good social media tool?
Things have a way of going south quickly for social media and common collaboration tools. Quora, which was once where insightful articles reside, is a cesspool now. I guess we chalk it up to human nature.

I like my Twitter-verse. It is nicely curated to give me a chance to observe interesting people chat and think. It is like watching passersby at a coffee. It is not deep conversation, but it is still useful to keep me inspired and informed on these people's interests. I wish we could write paragraphs on Twitter, but then, maybe people wouldn't write and interact that much.

Saturday, November 16, 2019

SOSP19. I4: Incremental Inference of Inductive Invariants for Verification of Distributed Protocols

This paper is by Haojun Ma (University of Michigan), Aman Goel (University of Michigan), Jean-Baptiste Jeannin (University of Michigan), Manos Kapritsos (University of Michigan), Baris Kasikci (University of Michigan), Karem A. Sakallah (University of Michigan).

This paper is about formal verification of distributed systems. Writing proofs manually is cumbersome. Existing tools for formal verification all require the human to find the inductive invariant.

I4 combines power of Ivy (a tool for interactive verification of infinite-state systems) and model checking in order to find inductive invariant without relying on human intuition. Ivy takes as input a protocol description and a safety property, and guides the user interactively to discover an inductive invariant. The goal for finding an inductive invariant is to prove that the safety property always holds. An inductive proof has a base case, which proves initial state is safe, and an inductive step, which proves if state k is safe, prove state k+1 is safe. Once that inductive invariant is found, Ivy automatically verifies that it is indeed inductive.

The insight in I4 is that the safety/correctness behavior of a distributed system does not fundamentally change as the size increases. I witness this regularly in my use of TLA+ for model checking protocols. TLA+ is able to identify any problem (sometimes requiring upto 40 steps) by finding a counterexample involving three nodes. Three nodes is often what it takes. One node initializes a coordination operation, and the other two nodes see a different perspective of the ongoing computation, maybe due to exchanging messages with each other (i.e. doing stale reads) at inopportune times, and arrive to conflicting decisions that violate the goal of the coordination operation.

I4 uses inductive invariants from small instances and apply/generalize to large instances, and automates this with model-checking. More specifically, I4 first creates a finite instance of the protocol; use a model checking tool to automatically derive the inductive invariant for this finite instance; and generalize this invariant to an inductive invariant for the infinite protocol. This amends the I4 approach in that it automates the inductive invariant discovery process. This amends the model checking approach as well. While model checking is fully automated, it doesn’t scale to distributed systems. I4 applies model checking to small, finite instances and then generalizes the result to all instances.

The figure above shows an overview of the I4 flow for the invariant generation on a finite instance.
Given a protocol description--written in Ivy--and an initial size, I4 first generates a finite instance of that protocol with a given initial size. For example, ... I4 will generate a finite instance of the protocol with one server and two clients. It then uses the Averroes model checker to either generate an inductive invariant that proves the correctness of the protocol for that particular instance, or produce a counterexample demonstrating how the protocol can be violated and which can be used to debug the protocol. If the protocol is too complex, the model checker may fail to produce an answer within a reasonable amount of time or it may run out of memory. If this occurs, the finite encoding is simplified—using a concretization technique—to further constrain it and make it easier for the model checker to run to completion. This step is currently done manually but is easily automatable. Once an inductive invariant has been identified, I4 generalizes it to apply not only to the finite instance that produced it, but also to all instances of the protocol.
It is important to note that if the safety invariant does not hold, Averroes produces a counterexample and the human should work on the protocol to come up with a safety invariant that holds for the protocol. I4 is automatic, in that if the protocol safety invariant holds, then the inductive invariant is generated automatically by the Averroes tool.  But, wait, what is the difference between safety invariant and inductive invariant? Isn't safety invariant already inductive?

Safety property P may be an invariant but not an inductive one. "The verification proof requires the derivation of additional invariants that are used to constrain P until it becomes inductive. These additional invariants are viewed as strengthening assertions that remove those parts of P that are not closed under the system's transition relation." In other words, while the safety property holds for reachable states, it may not be closed under program actions outside the reachable states. This makes safety invariant unsuitable for verification since proving properties is not constrained to the reachable states (as it is hard to enumerate/identify reachable states in a proof). So, the inductive invariant is a version of the safety property that is closed under the program actions. The figure below illustrates this relationship. I think this concept is explored further in the Ivy paper.

If the safety property holds, then Averroes generates an inductive invariant for the finite instance; minimizes the invariant by removing redundant clauses; and then passes it on to the next step to be generalized. However, occasionally the finite instance may still be too large for the Averroes model checker, and it may run out of memory. This is where human involvement is needed again. The human helps concretize the small finite version of the protocol further to avoid state space explosion. Symmetry plays a big role here. FIRST is the keyword that denotes the node that sends the first message. The model checker can try instances where all the three nodes in the finite instances might be the one that sends the message. The human can notice a symmetry and set "FIRST = Node1" to help reduce the state space. (The team is working on automating this step as well.)

Then I4 uses Ivy for the proof generation as shown below, and the verification is complete.

I4 is available as opensource at They applied I4 to several examples as shown in the table.

I4 improves on manual verification via using Coq and interactive verification using Ivy.

A restriction in I4 is that it applies to verification of safety properties, and not to liveness properties.

I am happy to find so many verification papers at SOSP. This paper appeared in the distributed systems in the afternoon of Day 2. In the morning of Day 2, there was a session on verification which included four papers. I had reviewed two of these papers earlier: "Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval" and "Verifying Concurrent, Crash-safe Systems with Perennial". It looks like the verification community at SOSP is quick to take results from more general and theoretic verification conferences, and integrate those tools and improve upon them to put them in use for verification of practical systems.

Thursday, November 14, 2019

SOSP19 Lineage Stash: Fault Tolerance Off the Critical Path

This paper is by Stephanie Wang (UC Berkeley), John Liagouris (ETH Zurich), Robert Nishihara (UC Berkeley), Philipp Moritz (UC Berkeley), Ujval Misra (UC Berkeley), Alexey Tumanov (UC Berkeley), Ion Stoica (UC Berkeley).

I really liked this paper. It has a simple idea, which has a good chance of getting adopted by real world systems. The presentation was very well done and was very informative. You can watch the presentation video here.

Low-latency processing is very important for data processing, stream processing, graph processing, and control systems. Recovering after failures is also important for them, because for systems composed of 100s of nodes, node failures are part of daily operation.

It seems like there is a tradeoff between low latency and recovery time. The existing recovery methods either have low runtime overhead or low recovery overhead, but not both.
  • Global checkpoint approach to recovery achieves a low runtime overhead, because a checkpoint/snapshot can be taken asynchronously and off the critical path of the execution. On the other hand, the checkpoint approach has high recovery overhead because the entire system needs to be rolled back to the checkpoint and then start from there again.
  • Logging approach to recovery has high runtime overhead, because it synchronously records/logs every data about any nondeterministic execution after the last checkpoint. On the flip side of the coin, it can achieve low overhead to recovery because only the failed processes need to be rolled back a little and resume from there. 

Can we have a recovery approach that achieves both low runtime overhead and low recovery overhead? The paper proposes the "lineage stash" idea to achieve that. The idea behind lineage stash is simple.

The first part of the idea is to reduce the amount of data logged by only logging the lineage. Lineage stash logs the pointers to the data messages instead of the data, and also logs task descriptions in case that data needs to be recreated by the previous operation. Lineage stash also logs the order of execution.

The second part of the idea is to do this lineage logging asynchronously, off the critical path of execution. The operators/processes now include a local volatile cache for lineage, which is asynchronously flushed to the underlying remote global lineage storage. The global lineage store is a sharded and replicated key-value datastore.

But then the question becomes, is this still fault tolerant? If we are doing the logging to the global lineage store asynchronously, what if the process crashes before sending the message, and we lose the log information?

The final part of the idea is to use a causal logging approach, and piggybacking the uncommitted lineage information to the other processes/operations for them to store in their stashes as well. So this kind of resembles a tiny decentralized blockchain stored in the stashes of interacting processes/operations.

In the figure, the filter process had executed some tasks and then passed messages to the counter process. Since the logging is off the critical path, the lineage for these tasks was not yet replicated to the global lineage stash. But as part of the rule, the lineage was piggybacked to the messages sent to the counter, so the counter has also a copy of the lineage in its stash, when the filter process crashed. Then in the recovery, the counter process helps by flushing this uncommitted lineage to the global lineage storage for persistence. The recovering filter process can then retrieve and replay this lineage to achieve a correct and quick (on the order of milliseconds) recovery.

Lineage stash idea was implemented and evaluated in Apache Flink for a stream processing word count application over 32 nodes. It was compared against the default global checkpoint recovery, and the default augmented with synchronous logging.

As the figure above shows, by using asynchronous logging approach, linear stash is able to avoid the runtime latency overhead of synchronized logging and matches that of the asynchronous checkpointing approach. Moreover, as the figure below shows, the recovery latency of checkpointing is very high. The lineage stash approach reaches similar recovery latency as the syncronized logging approach.

The lineage stash looks very promising for providing lightweight (off the critical path) fault-tolerance for fine-grain data processing systems. I really like the simplicity of the idea. I feel like I have seen a related idea somewhere else as well. But I can't quite remember it.

Monday, November 11, 2019

SOSP19 Verifying Concurrent, Crash-safe Systems with Perennial

This paper is by Tej Chajed (MIT CSAIL), Joseph Tassarotti (MIT CSAIL), Frans Kaashoek (MIT CSAIL), Nickolai Zeldovich (MIT CSAIL).

Replicated disk systems, such as file systems, databases, and key-value stores, need both concurrency (to provide high performance) and crash safety  (to keep your data safety). The replicated disk library is subtle, but the paper shows how to systematically reason about all possible executions using verification. (This work considers verification of a single computer storage system with multiple disk --not a distributed storage system.)

Existing verification frameworks support either concurrency (CertiKOS [OSDI ’16], CSPEC [OSDI ’18], AtomFS [SOSP ’19]) or crash safety (FSCQ [SOSP ’15], Yggdrasil [OSDI ’16], DFSCQ [SOSP ’17]).

Combining verified crash safety and concurrency is challenging because:
  • Crash and recovery can interrupt a critical section,
  • Crash can wipe in-memory state, and
  • Recovery logically completes crashed threads' operations. 

Perennial introduces 3 techniques to address these three challenges:
  • leases to address crash and recovery interrupting a critical section,
  • memory versioning to address crash wiping in-memory state, and
  • recovery helping to address problems due to interference from recovery actions.

The presentation deferred to the paper for the first two techniques and explained the recovery helping technique.

To show that the implementation satisfies the high-level specification a forward simulation is shown under an abstraction relation. The abstraction relation maps the concrete/implementation state to the high-level abstract specification state. Perennial adopted the abstraction relation as: "if not locked (due to an operation in progress), then the abstract state matches the concrete state in both disks".

The problem is "crashing" breaks the abstraction relation. To fix this problem, Perennial separates crash invariant (which refers to interrupted spec operations) from the abstraction invariant. The recovery proof relies on the crash invariant to restore the abstraction invariant.

Crash invariant says "if disks disagree, some thread was writing the value on the first disk". Then the recovery helping technique helps recovery commit writes from before the crash. The recovery proof shows the code restores the abstraction relation by completing all interrupted writes. As a result users get correct behavior and atomicity.

The Perennial proof framework was written in 9K lines of coq which provides crash reasoning: leases, memory versioning, and recovery helping. Perennial is built on top of Iris concurrency framework (for concurrency reasoning), which is built on top of Coq. (Iris:  R. Krebbers, R. Jung, A. Bizjak, J.-H. Jourdan, D. Dreyer, and L. Birkedal. The essence of higher-order concurrent separation logic. In Proceedings of the 26th European Symposium on Programming Languages and Systems, pages 696–723, Uppsala, Sweden, Apr. 2017.)

The authors have developed Goose for reasoning about Go implementations, but they also defer this to the paper. The developer writes Go code, and the Goose translator (written in 2K lines of Go code) translates this to Perennial proof, where it is machine checked with Coq.

As evaluation of Perennial framework, they verified a mail server written in Go. They argue that compared to a verification in CSCSPEC [OSDI ’18] (their earlier verification framework), the verification in Perennial takes less effort and is done in less number of lines of proof.

The software is available at

MAD questions

1. Is this an instance of a convergence refinement relation? 
In 2001, I was thinking on fault-tolerance preserving refinements as a graduate student working on graybox design of self-stabilization. The question was that: If we design fault-tolerance at the abstract, what guarantee do we have that after the abstract code is compiled and implemented in concrete, the fault-tolerance still holds/works?

It is easy to see that fault-tolerance would be preserved by an "everywhere refinement" that preserves the abstraction relation (between concrete and abstract) at any state, including the states outside the invariant states that are not reachable in the absence of faults. But the problem is that outside the invariant, the abstraction relation may not hold due to recovery actions being different than normal actions. That is pretty much the dilemma the Perennial work faced in verifying the recovery of replicated disks above.

OK, I said, let's relax the everywhere refinement to an "everywhere eventual refinement" and that would work for preserving fault-tolerance. Yes, it works, but it is not easy to prove that the concrete is an everywhere eventual refinement of the abstract because there is a lot of freedom in this type of refinement, and not much of a structure to leverage. The proof becomes as hard as proving fault-tolerance of the concrete from scratch. So, what I ended up proposing was a "convergent refinement", where the actions of the concrete provides a compacted version of the actions of the abstract outside the invariant. In other words, the forward simulation outside the invariant would be skipping states in the concrete. Perennial faced with the same dilemma chose to use a different abstraction relation. Whereas the convergence refinement idea is to keep the same abstraction relation but allow it to contract/skip steps in the computations outside the invariant states. I wonder if this could be applicable in the Perennial problem.

My reasoning with going compacting steps in refinement outside invariant was because it is safer than expanding the computation: if you show recovery in states in the abstract, by skipping steps (and not adding new ones) the concrete is also guaranteed to preserve that recovery.

Here is the abstract of my 2002 paper on convergence refinement. I just checked and this paper only got 19 citations in 19 years. It did not age well after getting a best paper award at ICDCS'02. In comparison, some of the papers we wrote quickly and published as short paper or as a workshop paper got more than 150-900 citations in less than 10 years. Citations is funny business.
Refinement tools such as compilers do not necessarily preserve fault-tolerance. That is, given a fault-tolerant program in a high-level language as input, the output of a compiler in a lower-level language will not necessarily be fault-tolerant. In this paper, we identify a type of refinement, namely "convergence refinement", that preserves the fault-tolerance property of stabilization. We illustrate the use of convergence refinement by presenting the first formal design of Dijkstra’s little-understood 3-state stabilizing token-ring system. Our designs begin with simple, abstract token-ring systems that are not stabilizing, and then add an abstract "wrapper" to the systems so as to achieve stabilization. The system and the wrapper are then refined to obtain a concrete token-ring system, while preserving stabilization. In fact, the two are refined independently, which demonstrates that convergence refinement is amenable for "graybox" design of stabilizing implementations, i.e., design of system stabilization based solely on system specification and without knowledge of system implementation details.

Saturday, November 9, 2019

SOSP19 Day 2, Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval

Verification session was the first session for Day 2. I like formal methods, and I did enjoy these papers. In this post I will only talk about the first paper in the session, the Serval paper. (You can read about SOSP19 Day 1 here.)

This paper is by Luke Nelson (University of Washington), James Bornholt (University of Washington), Ronghui Gu (Columbia University), Andrew Baumann (Microsoft Research), Emina Torlak (University of Washington), Xi Wang (University of Washington).

This paper received a best paper award at SOSP19, and the software is publicly available at

SOSP has a tradition of publishing systems verification papers, such as seL4 (SOSP’09), Ironclad Apps (OSDI’14), FSCQ (SOSP’15), CertiKOS (PLDI’16), Komodo (SOSP’17). A downside of systems verification is it is very effort-intensive. The Certikos manual proof consisted of more than 200K lines.

To help address this problem,  this paper introduces Serval, a framework for  developing automated verifiers for systems software. Serval accomplishes this by lifting interpreters written by developers into automated verifiers. It also provides a systematic approach to identifying and repairing verification performance bottlenecks using symbolic profiling and optimizations.

Wait, wait... What is an interpreter? And what is lifting?

In prior work on automatic verification (such as Hyperkernel SOSP17), a verifier implements symbolic evaluation for specific systems, and the verifier is not reusable/generalized. To make the verifier reusable and general, in Serval, the developers write an interpreter for an instruction set using Rosette, an extension of the Racket language for symbolic reasoning. Serval leverages Rosette to "lift" an interpreter into a verifier; which means to "transform a regular program to work on symbolic values". The developers also give the system specifications to be verified.

In the Serval framework the verifier consists of the lifted interpreter and the symbolic optimization. The steps are: write a verifier as interpreter, then Serval performs symbolic profiling to find bottleneck, and apply optimizations until verification becomes feasible.

Serval uses symbolic execution to avoid the state space explosion problem. But the program counter (PC) becoming symbolic is bad as it unnecessarily opens up search space. Serval prevents this with symbolic optimizations:

  • peephole optimization
  • fine-tune symbolic evaluation
  • use domain language to reduce the concrete values PC can take, and avoid path explosion problem.

Unfortunately I didn't understand much about the first two optimizations from listening to the presentation.

Using Serval, the authors build automated verifiers for the RISC-V, x86-32, LLVM, and BPF instruction sets. Targeting low level end of compiling stack can be an advantage for verification, because we don't need to trust higher level language toolkits. Future work will consider how the low-level-guarantees identified and verified by Serval could be connected to high level data structures for proof verification.

To show that existing systems can be retrofitted for Serval, they Retrofitted CertiKOS and Komodo for Serval. They mention this takes around 4 weeks for a new system. They also found 15 new bugs in Linux BPF JIT.

I will read the paper carefully to understand Serval better. It seems promising for scaling verification to practical systems. Of course the process still requires expertise and several weeks worth of effort, but Serval improves on the state-of-the-art with many months of effort.

Thursday, November 7, 2019

SOSP19 Day 1 wrap up

It was only 3 sessions into day 1, and my brain was fried.
Conferences are tiring because you are exposed to so many new ideas in a short time. It was clear I would not be able to pay attention to the papers in the last session, so I skipped that session (the privacy session which included the following three papers) to go for a walk at the golf park behind the conference center.

After the privacy session, there was a poster session and reception from 5-7:30pm. The poster session was nice for asking authors questions about the papers and having more in-depth conversation.

A student had told me he doesn't know how to start conversations with other conference attendees. I told him "That's easy.. Just ask them about what they are working on these days." A better way to start deeper conversations is to listen to the paper presentations, and have genuine questions about future work, or some extension and connection, and go discuss with them at coffee breaks, lunch, or poster session.

In the free-roaming poster session and reception, I had a chance to meet many colleagues and catch up on what they are working these days. When they returned the question, I had to talk for 3-5 minutes about what I am working on these days. I found that my "elevator pitch" got better and better as I had to answer this question many times.

I am a shy person, but at conferences my curiosity works in my favor, and I approach people to learn about their current work, and what they think of this paper versus that paper. I really enjoy talking to fellow researchers, each of whom is an expert in a small part of a big field. We may have different opinions on things, they may not like the papers/ideas I like, but I get to learn about their perspectives and file them in my brain without having to agree or disagree with them for now.

General impressions about SOSP 

SOSP is single track, so 500+ people were in the same big conference room for the sessions. The first half of the room had tables, and the second half just chairs. If you sat at a table row, you can rest your laptop on the table and type comfortably. I sat at the very front row and took notes. Interestingly, there is little contention for the front rows. Another advantage of sitting at the front row is that I am not distracted by seeing other audience members checking Facebook, Twitter, and mail on their laptops.

(Rant: This is my pet peeve. This drives me nuts. What is the point of flying over a long distance, driving at least two more hours from the airport to come to the conference, and check mail and social media all day long? You disrupted your whole week to travel to this conference and now you are not "at the conference". This I will never understand. Be here now!)

To comment on the papers and presentations on the first day, I found all the sessions very interesting. I don't have a favorite, all the three sessions I attended were very good.

Most of the presentations were given by graduate students. The quality of most of the presentations were very good. It is obvious a lot effort went into the rehearsals of those presentation. Almost all presenters had written (and memorized) extensive speaker notes and while the presentation view was displayed on the curtain, they had the presenter notes open on their laptops. Some of the presenters just read from their presentation notes. Those presentations were not very engaging. But at least the slides were very well organized, and the messages were distilled down to important points and were easy to follow.

Each presentation was about 20 minutes including a 2-3 minutes question answering slot at the end. (I think the SOSP conferences I attended before had 5 minutes reserved Q&A slot, but for this one the Q&A was not as rigidly reserved and enforced.)

Most of the presenters were able to cover around 40 slides in 20 minutes. This is an astonishingly large number. Normally the rule of thumb is to have 10 slides to present in 20 minutes. But being well prepared for a smooth flowing presentation, the presenters were somehow able to pull this off. I guess this takes its toll on the listeners though. I felt overwhelmed and exhausted after three sessions being bombarded by too many ideas, concepts, and acronyms.

I had written two posts about how to present in case you are looking for advice in that department.

Tuesday, November 5, 2019

SOSP19 Day 1, Debugging session

This session was the first session after lunch and had four papers on debugging in large scale systems.

CrashTuner: Detecting Crash Recovery Bugs in Cloud Systems via Meta-info Analysis

This paper is by Jie Lu (The Institute of Computing Technology of the Chinese Academy of Sciences), Chen Liu (The Institute of Computing Technology of the Chinese Academy of Sciences), Lian Li (The Institute of Computing Technology of the Chinese Academy of Sciences), Xiaobing Feng (The Institute of Computing Technology of the Chinese Academy of Sciences), Feng Tan (Alibaba Group), Jun Yang (Alibaba Group), Liang You (Alibaba Group).

Crash recovery code can be buggy and often result in catastrophic failure. Random fault injection is ineffective for detecting them as they are rarely exercised. Model checking at the code level is not feasible due to state space explosion problem. As a result, crash-recovery bugs are still widely prevalent. Note that the paper does not talk about "crush" bugs, but "crash recovery" bugs, where the recovery code interferes with normal code and causes the error.

Crashtuner introduces new approaches to automatically detect crash recovery bugs in distributed systems. The paper observes that crash-recovery bugs involve "meta-info" variables. Meta-info variables include variables denoting nodes, jobs, tasks, applications, containers, attempt, session, etc. I guess these are critical metadata. The paper might include more description for them.

The insight in the paper is that crash-recovery bugs can be easily triggered when nodes crash before reading meta-info variables and/or crash after writing meta-info variables.

Using this insight, Crashtuner inserts crash points at read/write of meta-info variables. This results in a 99.91% reduction on crash points with previous testing techniques.

They evaluated Crashtuner on Yarn, HDFS, HBase, and ZooKeeper and found 116 crash recovery bugs. 21 of these were new crash-recovery bugs (including 10 critical bugs). 60 of these bugs were already fixed.

The presentation concluded by saying that  meta-info is a well-suited abstraction for distributed systems. After the presentation, I still have questions about identifying meta-info variables and what would be false-positive and false-negative rates for finding meta-info variables via heuristic definition as above.

The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure

This paper is by Yongle Zhang (University of Toronto), Kirk Rodrigues (University of Toronto), Yu Luo (University of Toronto), Michael Stumm (University of Toronto), Ding Yuan (University of Toronto).

This paper is about debugging for finding the root cause of a failure. The basic approach is to collect large number of traces (via failure reproduction and failure execution), and try to find the strongest statistical correlation with the fault, and identify this as the root cause. The paper asks the question: what is the fundamental property of root cause that allows us to build a tool to automatically identify the root cause?

The paper offers as the definition for root cause as "the most basic reason for failure, if corrected will prevent the fault from occurring." This has two parts: "if changed would result in correct execution" and "the most basic cause".

Based on this, the paper defines inflection point as the first point in the failure execution that differs from the instruction in nonfailure execution. And develops Kairux: a tool for automated root cause localization. The idea in Kairux is to construct the nonfailure execution that has the longest common prefix. To this end it uses unit tests, stitches unit test to construct nonfailure execution, and modifies existing unit test for longer common prefix. Then it uses dynamic slicing to obtain partial order.

The presentation gave a real world example from HDFS 10453, delete blockthread. It took the developer one month to figure out the root cause of the bug. Kairux does this automatically.

Kairux was evaluated on 10 cases from JVM distributed systems, including  HDFS, HBase, ZooKeeper. It successfully found the root cause for 7 out the 10 cases. For the 3 unsuccessful cases, the paper claims this was because the root cause location could not be reached by modifying unit tests

This paper was similar to the previous paper in the session in that it had a heuristic insight which had applicability in a reasonably focused narrow domain. I think the tool support would be welcome by developers. Unfortunately I didn't see that the code and tool is available as opensource anywhere.

Finding Semantic Bugs in File Systems with an Extensible Fuzzing Framework

This paper is by Seulbae Kim (Georgia Institute of Technology), Meng Xu (Georgia Institute of Technology), Sanidhya Kashyap (Georgia Institute of Technology), Jungyeon Yoon (Georgia Institute of Technology), Wen Xu (Georgia Institute of Technology), Taesoo Kim (Georgia Institute of Technology).

The presentation offers by asking "Can file systems be bug free?" and answers this in the negative, citing that the codebase for filesystems is massive (40K-100K) and are constantly evolving. This paper proposes to use fuzzing as an umbrella solution that unifies existing bug checkers for finding semantic bugs in filesystems.

The idea in fuzzing is to give crashes are feedback to the fuzzers. However, the challenge for finding semantic bugs using fuzzers is that semantic bugs are silent, and won't be detected. So we need a checker to go through the test cases and check for the validity of return values, and give this feedback to fuzzer.

To realize this insight, they built Hydra. Hydra is available as opensource at

Hydra uses checker defined signals, automates input space exploration, test execution, and incorporation of test cases. Hydra is extensible via pluggable checkers for spec violation posix checker (sybilfs), for logic bugs, for memory safety bugs, and for crash consistency bug (symC3).

So far, Hydra has discovered 91 new bugs in Linux file systems, including several crash consistency bugs. Hydra also found a bug in a verified file system (FSCQ), (because it had used an unverified function in implementation).

The presenter said that Hydra generates better test cases, and the minimizer can reduce the steps in crashes from 70s to 20s. The presentation also live demoed Hydra in action with symC3.

Efficient and Scalable Thread-Safety Violation Detection --- Finding thousands of concurrency bugs during testing

This paper is by Guangpu Li (University of Chicago), Shan Lu (University of Chicago), Madanlal Musuvathi (Microsoft Research), Suman Nath (Microsoft Research), Rohan Padhye (Berkeley).

This paper received a best paper award at SOSP19. It provides an easy push button for finding concurrency bugs.

The paper deals with thread safety violations (TSV). A thread safety violation occurs if two threads concurrently invoke two conflicting methods upon the same object. For example, the C# list datastructure has a contract that says two adds cannot be concurrent. Unfortunately thread safety violations still exist, and are hard to find via testing as they don't show up in most test runs. The presentation mentioned a major bug that lead to Bitcoin loss.

Thread safety violations are very similar to data race conditions, and it is possible to use data-race detection tools in a manually intensive process to find some of these bugs in small scale. To reduce the manual effort, it is possible to adopt dynamic data race analysis while running the program under test inputs, but these require a lot of false-positive pruning.

In a large scale, these don't work. The CloudBuild at Microsoft involves  100million tests from 4K team and upto 10K machines. At this scale, there are three challenges: integration, overhead, and false positives.

The paper presents TSVD, a scalable dynamic analysis tool. It is push button. You provide TSVD only the thread safety contract, and it finds the results with zero false positives. TSVD was  deployed in Azure, and it has found more than 1000 bugs in a short time. The tool is available as opensource at

To achieve zero false positive, TSVD uses a very interesting trick. A potential violation (i.e., a call site that was identified by code analysis as one that may potentially violate the thread safety contract) is retried in many test executions by injecting delays to trigger a real violation. If a real violation is found, this is a true bug. Else, it was a false-positive and is ignored.

But how do we do the analysis to identify these potentially unsafe calls to insert delays? TSVD uses another interesting trick to identify them. It looks for conflicting calls with close-by physical timestamps. It flags likely racing calls, where two conflicting calls from different threads to the same object occur within a short physical time window. This way of doing things is more efficient and scalable than trying to do a happened-before analysis and finding calls with concurrent logical timestamps. Just identify likely race calls.

OK, what if there is actual synchronization between the two potentially conflicting calls within closeby physical timestamps? Why waste energy to keep testing it to break this? Due to synchronization, this won't lead to a real bug. To avoid this they use synchronization inference (another neat trick!): If m1 synchronized before m2, a delay added to m1 leads to the same delay to m2. If this close correlation is observed in the delays, TSVD infers synchronization. This way it also infers if a program is running sequentially or not, which calls are more likely to lead to problems, etc.

They deployed TSVD at Microsoft for several months. It was given thread safety contracts of 14 system classes in C#, including list, dictionary, etc. It was tested on 1600 projects, and was run 1 or 2 times, and found 1134 thread safety violations. During the validation procedure, they found that 96% TSVs are previously unknown to developers and 47% will cause severe customer facing issues eventually.

TSVD beats other approaches, including random, data collider, happened-before (hb) tracking. 96% of all violations were captured by running TSVD 50 times. And 50% violations were captured by running TSVD once! This beats other tools with little overhead.

One drawback to TSVD approach is that it may cause a false negative by adding the random delay. But when you run the tool multiple times, those missed false negatives are captured due to different random delays tried.

Yep, this paper definitely deserved a best paper award. It used three very interesting insights/heuristics to make the problem feasible/manageable, and then built a tool using these insights, and showed exhaustive evaluations of this tool. 

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...