Saturday, November 30, 2019

Book Review. The Dark Forest (2008)

This book is the second book in the trilogy titled "Remembrance of Earth's Past" by Liu Cixin. I had reviewed the first book in the series, "The Three Body Problem", earlier. 

I had listened to this book as audiobook. The reader of the audiobook, the voice-actor, was very competent, and also remarkably good in speaking in accents. This made the already engaging story more captivating.

As in the first book, this book also introduces new concepts. The Dark Forest theory is one. The theory posits that the universe is like a dark forest, where everybody is out there to hunt anybody. Due to chain-of-suspicion, a civilization can never be certain of an alien civilization's true intentions.
The extreme distances between stars creates an insurmountable "chain of suspicion", where any two civilizations cannot communicate well enough to dissipate mistrust, making conflict inevitable. Leaving a primitive civilization alone is not an option due to the exponential progress of technological change, and a civilization you have detected might easily surpass your own technological level in a few centuries and become a threat. If you have detected a civilization, then you have also confirmed that said civilization will eventually be able to detect you. Therefore, it is in every civilization's best interest to preemptively strike and destroy any developing civilization before it can become a threat, but without revealing their own location to the wider universe, thus explaining the Fermi paradox.
This is very much the coordinating attack problem in distributed systems. Two generals will attack a city. Only if they attack together, they will have victory. If only one party attacks, and the other party does not join, the attacker will be defeated/destroyed. The two generals try to coordinate this attack by talking to each other via messengers, which may be captured by the city. When the channel is arbitrarily lossy, two parties cannot solve the consensus problem in a deterministic manner for all cases. Chain-of-suspicion arises.
  1. When you get my message, you know my vote, but I don't know you know. So you send me a message back to coordinate the attack, because you suspect that if I don't know you know, I won't join you in the attack.  
  2. When I get your message, I know that you know my vote. But I cannot commit on this, because you don't know that "I know that you know" (because the message may have been lost). So I suspect that because of this lack of information, you won't join me in the attack. So I send you a message back. 
  3. When you get my message, you know that I know that you know my vote. But you realize that I don't know that "you know that I know that you know my vote". Then you suspect that under this condition, I may not join you in the attack and you will get destroyed by the enemy attacking alone .So you send me a message back.
  4. When I get your message, I know that you know that I know that you know my vote. But I realize that you don't know this: "I know that you know that I know that you know my vote", and I suspect you won't join me in the attack due to this lack of information. So I need to send you a message. 
(And this is when the generals were not Byzantine. With Byzantine participants, things get even more hairy.) This chain of suspicion makes consensus/cooperation impossible. However, it should be noted that these are theoretical impossibilities and has restrictions: such as using deterministic protocol. The reality is not this bleak. In practice we solve the coordinated attack problem every time we establish a TCP connection. I had written something on this before.

This book picks up from where the first book has left. The setup is that Trisolarans can observe, through sophons, any visual and audio activity in the world, and even anything written to any electronic storage. The only thing they cannot observe is what goes on in human's brains. (What an interesting setup... I suspect the computer security researchers would appreciate this setup even more.) This led the earth to select wallfacers, the keepers of plans known only to themselves, who are granted full access to the resources of the UN. Another interesting twist was that the Trisolaran society does not have any distinction between communication and thinking, as their thinking is opaque. So they were initially unfamiliar with lying and deception. I thought the book could have developed on this more but this didn't develop as much I thought it could be.

Other interesting concepts include the droplet from Trisolarans, the hibernation to move in time toward doomsday battle, and the mental seal machine.

It was a long book but very interesting and very engaging. I really like Liu Cixin's writing. Of course the protagonists and most of the main characters are Chinese, because Liu is Chinese and is writing for a Chinese audience primarily. After all we have seen many many many AngloSaxon/Western heroes in the literature and media, and of course Chinese protagonists have every right to save the world themselves. I welcome that. But unfortunately I started seeing Chinese government party-lines seeping into the book's narrative. The first book had stroke a balanced tone with respect to this (especially with its discussion/coverage of the cultural revolution), but in this book, especially in the second part, there are undertones of propaganda. The book discusses why it is important to be cold-blooded pragmatists and how totalitarianism is very efficient to deal with crisis situations, etc. I just started the third book in the series, and I find that there is even more propaganda in that book. But, of course, that won't stop me from reading and enjoying the book.

I like these book series because it sets the perspective to a wide perspective: the infinite space versus the marble sized earth. It is a shame we are not thinking more about space, and establishing space-fairing civilizations. When I was growing up in Turkey, I had a very exaggerated view of Turkey. For me it was the entire world. I thought its institutions and its politicians were important and powerful, and the power struggles there were consequential. When I traveled to US for graduate studies, I very quickly realized how limited that worldview that was. Similarly, earth-centric thinking is very limited, but we are very slow to realize this. 

In the book, all the projects are all planned ahead for the four centuries to come. But you can appreciate that although our time is just but a little sliver of time in that timeline, the actions we take are important enablers and first steps for the centuries down the line. So yes, I really enjoyed the big perspective thinking this leads to, and I think we should try to expose ourselves to this type of thinking to break free of our very limited perspective in terms of time and space. 

Thursday, November 28, 2019

Paper review. Threshold Logical Clocks for Asynchronous Distributed Coordination and Consensus

This is a recent arxiv paper by Bryan Ford, EPFL. The figures I use are from Bryan's presentation.

The paper introduces a threshold logical clock (TLC) abstraction and uses it to implement decentralized asynchronous consensus on top. In contrast to Ben-Or which implements decentralized asynchronous binary consensus, TLC based Que-Sera-Consensus (QSC) achieves consensus for arbitrary values proposed.

After I summarize the paper, I will compare/contrast QSC with Paxos, Texel/Avalanche, and Ben-Or.

Threshold Logical Clocks

TLC ensures that a number of nodes progress through logical time in a lock-step fashion. On reaching logical time-step s, each node waits for a threshold tm of broadcasts received from s before it can proceed to step s+1.

Different nodes may see different subsets of the tm messages. The adversarial network schedule ultimately determines this, but can we at least measure a-posteriori the success/failure of a given message's propagation to other nodes? For this purpose, TLC supports witnessing.

Each time-step occurs in two logical phases. Each node broadcasts unwitnessed message and collects responses from at least t witnesses. Each node re-broadcasts witnessed message and collects at least t witnessed messages

A particular protocol instance TLC(tm, tw, n) is parameterized by message threshold tm, witness threshold tw, and number of nodes n. To get from step s to s+1, each node must collect not just tm messages but tm threshold witnessed messages from step s. Each threshold message must have been witnessed by at least tw participants.

A problem still exists. Different nodes may see different subsets, and different messages may have been witnessed by different subsets of nodes. TLC observes that causal ordered message delivery adds some convergence properties to the protocol and simplifies reasoning about it. One way to ensure causal message propagation is to piggyback and gossip every message send so far. Each node i simply includes in every message it sends a record of i's entire causal history. This makes time advancement events propagate virally.

When configured with majority thresholds, TLC offers a useful property. Every witnessed message m broadcast in step s is seen by all n nodes by step s+2:
  • By majority nodes by s+1 by definition of witnessing 
  • Each node collects majority step s+1 msgs by s+2 
  • Since any two majorities intersect, there is at least a 1-node overlap at s+1

Que Sera Consensus (QSC)

QSC provides an asynchronous consensus built on top of TLC. Each round takes three TLC logical time-steps, in which
  • Node broadcasts proposal w/ random priority p 
  • Waits and observes for three TLC time-steps 
  • Decides if any proposal is undisputable winner
This description is of course too high-level. Since there is a lot of subtlety involved due to network asynchrony, the QSC protocol is reached by way of explaining several strawman protocols leading to it.

Here is the baseline and the first strawman (genetic fitness lottery) towards deriving QSC.

Strawman 3 makes the nodes pick celebrity proposals, which a majority of nodes have heard of by the next time-step. This rule still leaves uncertainty, however, since different participants might have seen different subsets of confirmed proposals from step s, and not all of them might have seen the eligible proposal with the globally winning ticket.

In order to seek a universal celebrity, Strawman 5 says we should watch the paparazzi.

The paparazzi condition guarantees that, everyone knows the proposal's existence by s+2, and everyone knows of its celebrity by s+3. But not everyone might have seen paparazzi message!

This brings us to Que Sera, whatever happens happens, Consensus (QSC).  Nodes prefer highest-priority celebrity proposal p and build on it in proposals for future rounds, but don't assume everyone agrees on p! This says that only some consensus rounds may succeed, and that only some nodes may even realize that a round succeeded. And it also guarantees that when this holds, all nodes will build on that value and eventually decide.

Nodes conservatively determine agreement when they are unaware of existence of higher-priority proposal and aware of at least one paparazzi node for p.

Each node i will observe successful consensus with a probability of at least 1/2 in each round, independently of other rounds. Thus, the probability i has not yet finalized a unique proposal for round r by a later round r+k is at most $1/2^k$. What I really like is that by tying the consensus with a blockchain structure/formation, the delay in committing is made somewhat compensated/pipelined. When a commit is finally achieved, any round that i sees as successful will permanently commit both proposal p and any prior uncommitted blocks that p built on in the blockchain structure.

QSC implementations are available in Promela/Spin & Go.

How does this compare with other consensus algorithms?

I am a distributed systems guy. I will be looking at this from a distributed algorithms perspective. It looks like Bryan is coming from security background, and he may be trying to emphasize other properties (like network adversary tolerance) of the protocol. So I may be missing some security related properties of the protocol.

Before I go on to compare and at some points criticize the algorithm, I want to mention that overall this is a nice algorithm. I want to model this in TLA+ when I find some time so that I will be able to get a better understanding of what is going on in TLC and QSC.

The paper claims that QSC is simple, and even simpler than Paxos, but I disagree with that strongly. It took 9 subsections with many strawman algorithms to describe the QSC algorithm in Section 4.10 after TLC is described in Section 2. That said, the protocol is nice and interesting, and it has some advantages not present in other decentralized consensus algorithms like Texel and Ben-Or.

QSC is not binary consensus. This makes it more powerful than Texel and Ben-Or. However, if the domain is blockchain, a binary consensus works with no problems because you are deciding whether to include this proposed transaction in the chain or not. If only one value (one transaction) is proposed for a given UTXO, a binary consensus algorithm can guarantee termination for it. Well-formed clients propose only a single transaction for a given UTXO because otherwise it would be a double-spend attempt, and then the binary consensus does not need to guarantee termination for this instance.

TLC enforces a single frame of reference for the rounds as it aligns rounds across a threshold of participants. In this sense, TLC is reminiscent of the Ben-Or algorithm which also aligns rounds across threshold number of participants. In Ben-Or, N-F participants proceed in lock-step for the two phases of each round. In Ben-Or the termination is again probabilistic with probability 1, as the non termination chance after k rounds reduce to  $1/2^k$ probability. If one node decides at round k, it is guaranteed that all the other nodes will decide in the next round. As we mentioned above, while Ben-Or is restricted to binary consensus, QSC allows arbitrary proposals as input.

It may be possible to argue that the single frame of reference for the rounds, simplifies reasoning. But as a distributed algorithms person, I think this is just unnecessary extra synchronization and not a good idea. In Paxos, the rounds not aligned across participants. Paxos uses concurrent/separate frames of reference for rounds with no problem at all. This is achieved by Paxos's use of totally ordered ballot numbers to make rounds client-restricted. This allows multiple frame of reference rounds to exist concurrently without violating the safety/agreement property.

This brings me to do a more in depth comparison with Paxos, and a speculation.

QSC vs Paxos (and SDPaxos?)

QSC is still randomized consensus. So is Paxos in an asynchronous environment. In an asynchronous environment, dueling leaders problem may be probabilistically avoided (using probabilistically backing off). However, if there is some little partial synchrony which allows the failure detectors can stabilize to $\diamond S$, stable leaders can emerge, and progress will be guaranteed. In comparison, progress in QSC will always be probabilistic.

By using a leader based solution Paxos gets other benefits as well. Paxos uses much less communication to get the consensus done. Paxos communication is 1-to-all-to-1  (with 1 being the leader). In contrast communication in QSC is all-to-all-to-all across the three steps in the rounds.

Here is the speculation part. In effect a leader also emerges in QSC. It is the higher-priority celebrity proposal. So QSC is closer to leader-based Paxos protocols rather than to the leaderless Ben-Or and Texel protocols. What is more the random priority assignment and the use of paparazzi seem to implement a dynamic version of the sequencer node in the SDPaxos protocol. Let me explain.

Let's recall the decision procedure in QSC. A node decides as consensus on a proposal p, when it is (1) unaware of existence of higher-priority proposal than p, and (2) aware of at least one paparazzi node for p.

SDPaxos separates the leader into two roles of ordering/value-choosing done by the sequencer and the replication-checking done by any node. And it seems like QSC divides the sequencer role in to two, via the celebrity and paparazzi roles. QSC confines everything in single-frame of reference rounds by its use of TLC. On the other hand, SDPaxos allows multiple rounds occurring concurrently, via ballotnumber use for sequencer.

Extensions for scalability

QSC uses a lot of communication (all-to-all steps) and wouldn't scale. Finding alternate ways for scaling QSC would be beneficial.

Avalanche has extended Texel via sampling to large scale decentralized consensus.

Is it possible to come up with a sampling based extension to QSC? Texel had two advantages going for it. It is pull-based solution, and it only considers binary consensus. In contrast QSC is push-based, considers arbitrary consensus, and uses a single-reference frame for rounds. These make it hard to apply sampling to QSC.

It would be interesting to think about scalability extensions for QSC. Of course committee-selection based approaches should be applicable at least.

Tuesday, November 26, 2019

SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

This paper is by  Abutalib Aghayev (Carnegie Mellon University), Sage Weil (Red Hat Inc.), Michael Kuchnik (Carnegie Mellon University), Mark Nelson (Red Hat Inc.), Gregory R. Ganger (Carnegie Mellon University), George Amvrosiadis (Carnegie Mellon University)

Ceph started as research project in 2004 at UCSC. At the core of Ceph is a distributed object store called RADOS. The storage backend was implemented over an already mature filesystem. The filesystem helps with block allocation, metadata management, and crash recovery. Ceph team built their storage backend on an existing filesystem, because they didn't want to write a storage layer from scratch. A complete filesystem takes a lot of time (10 years) to develop, stabilize, optimize, and mature.

However, having a filesystem in the path to the storage adds a lot of overhead. It creates problems for implementing efficient transactions. It  introduces bottlenecks for metadata operations. A filesystem directory with millions of small files will be a metadata bottleneck forexample. Paging etc also creates problems. To circumvent these problems, Ceph team tried hooking into FS internals by implementing WAL in userspace, and use the NewStore database to perform transactions. But it was hard to wrestle with the filesystem. They had been patching problems for seven years since 2010. Abutalib likens this as the stages of grief: denial, anger, bargaining, ..., and acceptance!

Finally the Ceph team deserted the filesystem approach and started writing their own storage system BlueStore which doesn't use a filesystem. They were able to finish and mature the storage level in just two years! This is because a small, custom backend matures faster than a POSIX filesystem.

The new storage layer, BlueStore, achieves a very high-performance compared to earlier versions. By avoiding data journaling, BlueStore is able to achieve higher throughput than FileStore/XFS.

When using a filesystem the write-back of dirty meta/data interferes with WAL writes, and causes high tail latency. In contrast, by controlling writes, and using write-through policy, BlueStore ensures that no background writes to interfere with foreground writes. This way BlueStore avoids tail latency for writes.

Finally, having full control of I/O stack accelerates new hardware adoption. For example, while filesystems have hard time adopting to the shingled magnetic recording storage, the authors were able to add metadata storage support to BlueStore for them, with data storage being in the works.

To sum up, the lesson learned was for distributed storage it was easier and better to implement a custom backend rather than trying to shoehorn a filesystem for this purpose.

Here is the architecture diagram of BlueStore, storage backend. All metadata is maintained in RocksDB, which layers on top of BlueFS, a minimal userspace filesystem.

Abutalib, the first author on the paper, did an excellent job presenting the paper. He is a final year PhD with a lot of experience and expertise on storage systems. He is on the job market.

Friday, November 22, 2019

Book diet Oct-Nov 2019

Here are some books I listened to in the last couple months. These were all audiobooks that the Libby app enabled me to borrow from my public library online.

It is convenient to listen to books rather than reading them. On the other hand, I think I don't retain as much information when I listen to books. It feels like I learn better visually than by listening. Feelings aside, as one concrete difference, I can't take notes when I listen. When I read a physical book, I use a blank letter page as page separator and on it I note down important concepts/ideas I encounter. When I read an ebook on Kindle, I can more conveniently highlight paragraphs, and have them available for me as highlights.

The Kingdom of Speech (Tom Wolfe)

Tom Wolfe wrote this book in 2016, and died in 2018 at age 88. He was a master storyteller and journalist associated with the New Journalism style. (New Journalism is a literary style reminiscent of long-form non-fiction and emphasizing "truth" over "facts", and intensive reportage in which reporters immersed themselves in the stories as they reported and wrote them.)

This book provocatively claims that the theories of scientists from Darwin to Chomsky on the evolution of speech is wrong. However, for his arguments, Wolfe has only his chutzpah to back him up as he has no scientific background to understand let alone theorize on this topic.

It seems like Wolfe had a chip on his shoulder against Darwin and Chomsky. His arguments are horrendous. Even for outsiders to the field, it is easy to tell he is wrong and unjust. For the first half of the book, he mocks and tries to discredit Darwin on his theorizing evolution, and argues that the credit should go to Wallace. Only at the last moment, he admits with a half-mouth that actually Darwin had been working on this theory for the preceding two decades and had more to back his theory. Similarly for Noam Chomsky. He has a bone to pick with Chomsky, and it looks like this is due to Chomsky's leftist views.

I have a very low tolerance for injustice. So I am outraged at his chutzpah and the nonsense he spews. But I couldn't stop listening to the book. His writing is so good. I just wished, he had written something that made sense, and I would enjoy it better. But noo... He had to spew these unjust statements. Unjust but beautifully composed, cleverly written sentences. The book is a train-wreck, but with beautiful cinematography... I couldn't stop watching, and enjoying it.

If this is what New Journalism style is like, I like the Old Journalism, thank you.

Anathem (Neal Stephenson)

This book was long and boring... Some other authors also write long books and include numerous details as well, but Neal Stephenson is just testing the limits of our patience in this book. Maybe he is trying to filter his audience: "Only the worthy among you should be reading my books."

I had really liked the Snow Crash, Diamond Age, and the Seveneves books. They are among my favorite science-fiction books. I didn't like Cryptonomicon much, and I didn't enjoy this book at all. I gave up on this book after a couple chapters in... Too dull, with not enough payback potential.

The Big Short: Inside the doomsday machine (Michael Lewis)  

Oh brother... I have zero interest for the stock market. But, the book was superb and very engaging. It was about the creation of the creation of the credit default swap market, the US housing bubble, and the resulting 2008 market crash.

The book introduced a foreign world of hedge fund managers, financial advisors, and stock traders. It introduced and made alive eccentric people like Mike Burry, Steve Eisman, Greg Lippman, and Eugene Xu. It was a very colorful book about a very technical subject. I see that this book is made into a movie, it should be worth watching now that I listened to this book.

This book showed me how people in power positions may not know that much and can be participants (knowingly or unknowingly) of a big scam banking on the ignorance of others. Just because things didn't break down yet, doesn't mean it won't ever break.

The book thought me that if you take the time, question other peoples basic assumption, do the grunt work, then you could be the real pioneer leader in an area. It takes some time but if you know something to be true whereas almost all other people are on the other side, be steadfast. One bit of truth can destroy vast amount of lies!

This also reminds me of Peter Thiel's favorite question: "What important truth do very few people agree with you on?" If you have a good/undervalued answer to this, this becomes your competitive advantage!

Next: The Future Just Happened (Michael Lewis)

This is another superb book by Michael Lewis, written in 2001, about how the internet got adopted and become mainstream approaching 2000s.

The book is now almost 20 years old. But it aged really well. It contains all the insights about the big role Internet would play in our daily lives. I lived through the 2000s and watch Internet get increasingly more adopted. I know this stuff, but Lewis is a great story teller, he still kept the content interesting to me after 20 years.

Lewis argues that Internet makes information readily available to everyone including fringes, and this disrupts the status quo. The internet is the great equalizer and democratizer of the information.

Here are descriptions of some chapters from the book: 
  • At the age of twelve, Jonathan Lebed began trading stocks.  By the age of fourteen, he learned that by posting messages in chat rooms, he could affect stock prices. 
  • The legal profession suffers a blow to its pride at the hands of fifteen-year-old Marcus Arnold, who became a top-ranked respondent to legal questions on
  • The British rock band Marillion could not get funds for a promotional tour from their record company so they turned to the internet.
  • The TiVo and Replay transmit information back to the advertisers on consumer preferences. Consumers gain convenience at the cost of privacy.
The book finishes strong when Michael Lewis writes a response to Bill Joy's "The future doesn't need us" article. There he also replies to the 10K year clock guy, Hillis. (Here is an excerpt from that chapter).
When I put down Joy's essay, I thought: It was exactly how some old peer of the realm might have behaved had he found himself troubled by some new development. Call the old boys in the network. Talk it over. Build consensus. Reach a conclusion that satisfies the old boys, and then call directly on political authority to take care of the problem. Assume everyone else will bow to the old boys' wisdom. 
When highly self-conscious, highly intelligent, perfectly nice men chuck the principles on which they have built their careers and reinvent themselves as qualified enemies of their own idea of progress, it is as disconcerting in its way as gray goo on the kitchen floor. You see it and you know something is up. Hillis and Joy are trying to tell us something, but they don't know how to say it.

Wednesday, November 20, 2019

Seventh grade openhouse

Recently, I went to my son's 7th grade openhouse. It was a nice setup. They made me follow my son's daily routine, visiting the same classrooms in the same order, but with 10 minute class time where the teachers gave information about what they would be doing this year.

Overall I am impressed by what I saw and heard. My main takeaways was that the school has embraced a hands-on learning curriculum, and is making very good/wise use of technology.

My impression from Math, Science, ELA classes was that, the middle school is using the flipped-classroom model to a great extent. There is little lecturing and a lot of group work.

The science class presentation was exciting. Instead of memorizing, the class emphasizes hypothesizing, testing, and reasoning.

Art class also sounded exciting. They added new media modules. This is not surprising given that half the 12 years old population list being a YouTuber as their career choice. The art teacher said that their goal is to get the students excited and comfortable with making/producing things.

The school offers many interesting clubs in addition to the curriculum. I wish I was a student, and had the opportunity to learn in this environment.

Technology use

The school makes very good use of technology. Each student gets a ChromeBook for the year. They use it at home, charge overnight, and bring it to school to use in the classes all day. These ChromeBooks have very good battery life, and they are very durable.

Everything is on the web, and accessible via the ChromeBooks. The students have access to grading information via the schools webportal for them. Many classes use apps that enable the students to create flash cards for studying. The students get to take practice tests. And almost every class  make use of videos, including YouTube videos.

In each class, the students are responsible for organizing their agendas and keep track of the homework deadlines. The students reach all their class material and view and submit homework online using their ChromeBooks and Google office and other Google tools.

It looks like Google has the monopoly on the mindshare of the new generation. And Microsoft is missing the boat on this one big time. By the time these students graduate college and start jobs, they may not be very willing to adopt into the Windows/PC ecosystem.

Each classroom has a smartboard. These smartboards are actually computer screens projected over to a regular whiteboard. The smartboard functionality comes from the software that tracks hand movements and pen touches to the whiteboard via a camera. The teacher can display any website on the smartboard, and can scroll, click, zoom, with ease.

I am left with a lurking and jealousy guilt after the openhouse. Compared to the middleschool's technology use, our universities and our CS department's technology use is very lame... Why is that?

In my university, we keep on traditional lecturing. In our classes, there is a nice monitor and projecting technology, but there are no smartboards. I think there are clickers available in some classrooms, but that is ancient tech and unnatural way to interact. The university makes us use the God-awful (and even more ancient) Blackboard software for interacting/managing the class and grades.

Monday, November 18, 2019

Book Review. Digital minimalism: Choosing a Focused Life in a Noisy World

"Digital Minimalism: Choosing a Focused Life in a Noisy World" is Cal Newport's new book. The topic of the book is clear from the title. You should be quitting your Facebook checking, Twitter chatting, Imgur browsing, phone fiddling habits. No more monkey business. It is time to get to work. Deep work.

In Chapter 4 of the book, Calvin forms the term *solitude deprivation*. Solitude deprivation is on the other end of the spectrum to solitary confinement, but it can also be bad for you as well over a long duration. The book argues that today we all experience solitude deprivation. The smartphones, laptops, and screens do not give us time to be alone with our thoughts and process things in our speeds. I had heard a nice story, where the Amazon natives recruited for an expedition in to the jungle would take long breaks after doing some walking. They would say they are waiting for their soul to catch up to their bodies. Today we don't give time for our souls to catch up to us, and process both emotionally and rationally the flood of events/news we are bombarded with every day.

So I really liked Chapter 4 that talked about solitude deprivation. This chapter made me really worried that online connectivity, and never getting bored, could be doing more harm to me than I thought.  This chapter made a much more convincing case for the need for quitting social media than the first couple chapters in my view. But maybe it is because I am more of an abstract ideas guy.

Calvin's previous book "Deep Work" had a lot of impact. I think "Digital Minimalism" may not have that much impact. (Well, Digital Minimalism already has become a New York Times, Wall Street Journal, Publishers Weekly, and USA Today bestseller... I guess I mean more impact than that ;-) Deep Work had a positive message, "embrace deeper concentration", whereas Digital Minimalism has a negative message, "prevent digital clutter". I know, I know... For each book, you could simply switch the statements from positive to negative and vice versa. I am just referring to the tone/mood of the books. Digital Minimalism is more of a self-help/how-to book. It prescribes lists of things to do and not to do in a somewhat patronizing voice. The Deep Work book was more conceptual and thought-provoking, and less of a how-to self-help book. I have listened to Deep Work at least three times. I don't see that happening with the "Digital Minimalism" book. I would have liked to read a book titled "Deep Solitude" from Calvin, which I am sure I would be re-reading several times.
If you want to build a ship, don't drum up people to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea.
--Antoine de Saint-Exupery

In any case, I think this is a book you should definitely check out. I wish Calvin best of luck with getting these ideas adopted. They are very timely and important.  In 2005, I was labmates with Calvin at Nancy Lynch's theory of distributed systems group. Calvin is like a real life Captain America. Always responsible, kind, tidy, and disciplined. He would arrange his working hours carefully and would optimize everything. He is super smart and productive. His publication record is impressive. He is a theory and abstract thinking person at heart. He thinks clearly and in a sound manner with much deliberation. He is both a successful professor and a successful writer. He is walking the walk as well. So we should pay attention when he is talking.

MAD questions

1. Is social media more dangerous than TV was? 
It may be so because it is algorithmically manipulated to be addictive. Technology companies completed the loop and this is a quick feedback loop fed by millions of people participating in the experiment.  On the other hand, I have heard of a hypothesis that, since the kids are raising up with this technology, they will develop ways to be immune to it. But I don't know if I am convinced by that argument. Parents should definitely regulate/restrict for these technologies. And I think even the governments should be regulating for these technologies. The Digital Minimalism book cites that mental health is at crisis level for millennials raised with this technology available to them. I see students on their phones walking in the halls and fiddling with their phones even in class. They are always busy catching up to what is on their screens, but they are missing up on things happening around them, and most importantly happening within them.

2. Is it possible to have a good social media tool?
Things have a way of going south quickly for social media and common collaboration tools. Quora, which was once where insightful articles reside, is a cesspool now. I guess we chalk it up to human nature.

I like my Twitter-verse. It is nicely curated to give me a chance to observe interesting people chat and think. It is like watching passersby at a coffee. It is not deep conversation, but it is still useful to keep me inspired and informed on these people's interests. I wish we could write paragraphs on Twitter, but then, maybe people wouldn't write and interact that much.

Saturday, November 16, 2019

SOSP19. I4: Incremental Inference of Inductive Invariants for Verification of Distributed Protocols

This paper is by Haojun Ma (University of Michigan), Aman Goel (University of Michigan), Jean-Baptiste Jeannin (University of Michigan), Manos Kapritsos (University of Michigan), Baris Kasikci (University of Michigan), Karem A. Sakallah (University of Michigan).

This paper is about formal verification of distributed systems. Writing proofs manually is cumbersome. Existing tools for formal verification all require the human to find the inductive invariant.

I4 combines power of Ivy (a tool for interactive verification of infinite-state systems) and model checking in order to find inductive invariant without relying on human intuition. Ivy takes as input a protocol description and a safety property, and guides the user interactively to discover an inductive invariant. The goal for finding an inductive invariant is to prove that the safety property always holds. An inductive proof has a base case, which proves initial state is safe, and an inductive step, which proves if state k is safe, prove state k+1 is safe. Once that inductive invariant is found, Ivy automatically verifies that it is indeed inductive.

The insight in I4 is that the safety/correctness behavior of a distributed system does not fundamentally change as the size increases. I witness this regularly in my use of TLA+ for model checking protocols. TLA+ is able to identify any problem (sometimes requiring upto 40 steps) by finding a counterexample involving three nodes. Three nodes is often what it takes. One node initializes a coordination operation, and the other two nodes see a different perspective of the ongoing computation, maybe due to exchanging messages with each other (i.e. doing stale reads) at inopportune times, and arrive to conflicting decisions that violate the goal of the coordination operation.

I4 uses inductive invariants from small instances and apply/generalize to large instances, and automates this with model-checking. More specifically, I4 first creates a finite instance of the protocol; use a model checking tool to automatically derive the inductive invariant for this finite instance; and generalize this invariant to an inductive invariant for the infinite protocol. This amends the I4 approach in that it automates the inductive invariant discovery process. This amends the model checking approach as well. While model checking is fully automated, it doesn’t scale to distributed systems. I4 applies model checking to small, finite instances and then generalizes the result to all instances.

The figure above shows an overview of the I4 flow for the invariant generation on a finite instance.
Given a protocol description--written in Ivy--and an initial size, I4 first generates a finite instance of that protocol with a given initial size. For example, ... I4 will generate a finite instance of the protocol with one server and two clients. It then uses the Averroes model checker to either generate an inductive invariant that proves the correctness of the protocol for that particular instance, or produce a counterexample demonstrating how the protocol can be violated and which can be used to debug the protocol. If the protocol is too complex, the model checker may fail to produce an answer within a reasonable amount of time or it may run out of memory. If this occurs, the finite encoding is simplified—using a concretization technique—to further constrain it and make it easier for the model checker to run to completion. This step is currently done manually but is easily automatable. Once an inductive invariant has been identified, I4 generalizes it to apply not only to the finite instance that produced it, but also to all instances of the protocol.
It is important to note that if the safety invariant does not hold, Averroes produces a counterexample and the human should work on the protocol to come up with a safety invariant that holds for the protocol. I4 is automatic, in that if the protocol safety invariant holds, then the inductive invariant is generated automatically by the Averroes tool.  But, wait, what is the difference between safety invariant and inductive invariant? Isn't safety invariant already inductive?

Safety property P may be an invariant but not an inductive one. "The verification proof requires the derivation of additional invariants that are used to constrain P until it becomes inductive. These additional invariants are viewed as strengthening assertions that remove those parts of P that are not closed under the system's transition relation." In other words, while the safety property holds for reachable states, it may not be closed under program actions outside the reachable states. This makes safety invariant unsuitable for verification since proving properties is not constrained to the reachable states (as it is hard to enumerate/identify reachable states in a proof). So, the inductive invariant is a version of the safety property that is closed under the program actions. The figure below illustrates this relationship. I think this concept is explored further in the Ivy paper.

If the safety property holds, then Averroes generates an inductive invariant for the finite instance; minimizes the invariant by removing redundant clauses; and then passes it on to the next step to be generalized. However, occasionally the finite instance may still be too large for the Averroes model checker, and it may run out of memory. This is where human involvement is needed again. The human helps concretize the small finite version of the protocol further to avoid state space explosion. Symmetry plays a big role here. FIRST is the keyword that denotes the node that sends the first message. The model checker can try instances where all the three nodes in the finite instances might be the one that sends the message. The human can notice a symmetry and set "FIRST = Node1" to help reduce the state space. (The team is working on automating this step as well.)

Then I4 uses Ivy for the proof generation as shown below, and the verification is complete.

I4 is available as opensource at They applied I4 to several examples as shown in the table.

I4 improves on manual verification via using Coq and interactive verification using Ivy.

A restriction in I4 is that it applies to verification of safety properties, and not to liveness properties.

I am happy to find so many verification papers at SOSP. This paper appeared in the distributed systems in the afternoon of Day 2. In the morning of Day 2, there was a session on verification which included four papers. I had reviewed two of these papers earlier: "Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval" and "Verifying Concurrent, Crash-safe Systems with Perennial". It looks like the verification community at SOSP is quick to take results from more general and theoretic verification conferences, and integrate those tools and improve upon them to put them in use for verification of practical systems.

Thursday, November 14, 2019

SOSP19 Lineage Stash: Fault Tolerance Off the Critical Path

This paper is by Stephanie Wang (UC Berkeley), John Liagouris (ETH Zurich), Robert Nishihara (UC Berkeley), Philipp Moritz (UC Berkeley), Ujval Misra (UC Berkeley), Alexey Tumanov (UC Berkeley), Ion Stoica (UC Berkeley).

I really liked this paper. It has a simple idea, which has a good chance of getting adopted by real world systems. The presentation was very well done and was very informative. You can watch the presentation video here.

Low-latency processing is very important for data processing, stream processing, graph processing, and control systems. Recovering after failures is also important for them, because for systems composed of 100s of nodes, node failures are part of daily operation.

It seems like there is a tradeoff between low latency and recovery time. The existing recovery methods either have low runtime overhead or low recovery overhead, but not both.
  • Global checkpoint approach to recovery achieves a low runtime overhead, because a checkpoint/snapshot can be taken asynchronously and off the critical path of the execution. On the other hand, the checkpoint approach has high recovery overhead because the entire system needs to be rolled back to the checkpoint and then start from there again.
  • Logging approach to recovery has high runtime overhead, because it synchronously records/logs every data about any nondeterministic execution after the last checkpoint. On the flip side of the coin, it can achieve low overhead to recovery because only the failed processes need to be rolled back a little and resume from there. 

Can we have a recovery approach that achieves both low runtime overhead and low recovery overhead? The paper proposes the "lineage stash" idea to achieve that. The idea behind lineage stash is simple.

The first part of the idea is to reduce the amount of data logged by only logging the lineage. Lineage stash logs the pointers to the data messages instead of the data, and also logs task descriptions in case that data needs to be recreated by the previous operation. Lineage stash also logs the order of execution.

The second part of the idea is to do this lineage logging asynchronously, off the critical path of execution. The operators/processes now include a local volatile cache for lineage, which is asynchronously flushed to the underlying remote global lineage storage. The global lineage store is a sharded and replicated key-value datastore.

But then the question becomes, is this still fault tolerant? If we are doing the logging to the global lineage store asynchronously, what if the process crashes before sending the message, and we lose the log information?

The final part of the idea is to use a causal logging approach, and piggybacking the uncommitted lineage information to the other processes/operations for them to store in their stashes as well. So this kind of resembles a tiny decentralized blockchain stored in the stashes of interacting processes/operations.

In the figure, the filter process had executed some tasks and then passed messages to the counter process. Since the logging is off the critical path, the lineage for these tasks was not yet replicated to the global lineage stash. But as part of the rule, the lineage was piggybacked to the messages sent to the counter, so the counter has also a copy of the lineage in its stash, when the filter process crashed. Then in the recovery, the counter process helps by flushing this uncommitted lineage to the global lineage storage for persistence. The recovering filter process can then retrieve and replay this lineage to achieve a correct and quick (on the order of milliseconds) recovery.

Lineage stash idea was implemented and evaluated in Apache Flink for a stream processing word count application over 32 nodes. It was compared against the default global checkpoint recovery, and the default augmented with synchronous logging.

As the figure above shows, by using asynchronous logging approach, linear stash is able to avoid the runtime latency overhead of synchronized logging and matches that of the asynchronous checkpointing approach. Moreover, as the figure below shows, the recovery latency of checkpointing is very high. The lineage stash approach reaches similar recovery latency as the syncronized logging approach.

The lineage stash looks very promising for providing lightweight (off the critical path) fault-tolerance for fine-grain data processing systems. I really like the simplicity of the idea. I feel like I have seen a related idea somewhere else as well. But I can't quite remember it.

Monday, November 11, 2019

SOSP19 Verifying Concurrent, Crash-safe Systems with Perennial

This paper is by Tej Chajed (MIT CSAIL), Joseph Tassarotti (MIT CSAIL), Frans Kaashoek (MIT CSAIL), Nickolai Zeldovich (MIT CSAIL).

Replicated disk systems, such as file systems, databases, and key-value stores, need both concurrency (to provide high performance) and crash safety  (to keep your data safety). The replicated disk library is subtle, but the paper shows how to systematically reason about all possible executions using verification. (This work considers verification of a single computer storage system with multiple disk --not a distributed storage system.)

Existing verification frameworks support either concurrency (CertiKOS [OSDI ’16], CSPEC [OSDI ’18], AtomFS [SOSP ’19]) or crash safety (FSCQ [SOSP ’15], Yggdrasil [OSDI ’16], DFSCQ [SOSP ’17]).

Combining verified crash safety and concurrency is challenging because:
  • Crash and recovery can interrupt a critical section,
  • Crash can wipe in-memory state, and
  • Recovery logically completes crashed threads' operations. 

Perennial introduces 3 techniques to address these three challenges:
  • leases to address crash and recovery interrupting a critical section,
  • memory versioning to address crash wiping in-memory state, and
  • recovery helping to address problems due to interference from recovery actions.

The presentation deferred to the paper for the first two techniques and explained the recovery helping technique.

To show that the implementation satisfies the high-level specification a forward simulation is shown under an abstraction relation. The abstraction relation maps the concrete/implementation state to the high-level abstract specification state. Perennial adopted the abstraction relation as: "if not locked (due to an operation in progress), then the abstract state matches the concrete state in both disks".

The problem is "crashing" breaks the abstraction relation. To fix this problem, Perennial separates crash invariant (which refers to interrupted spec operations) from the abstraction invariant. The recovery proof relies on the crash invariant to restore the abstraction invariant.

Crash invariant says "if disks disagree, some thread was writing the value on the first disk". Then the recovery helping technique helps recovery commit writes from before the crash. The recovery proof shows the code restores the abstraction relation by completing all interrupted writes. As a result users get correct behavior and atomicity.

The Perennial proof framework was written in 9K lines of coq which provides crash reasoning: leases, memory versioning, and recovery helping. Perennial is built on top of Iris concurrency framework (for concurrency reasoning), which is built on top of Coq. (Iris:  R. Krebbers, R. Jung, A. Bizjak, J.-H. Jourdan, D. Dreyer, and L. Birkedal. The essence of higher-order concurrent separation logic. In Proceedings of the 26th European Symposium on Programming Languages and Systems, pages 696–723, Uppsala, Sweden, Apr. 2017.)

The authors have developed Goose for reasoning about Go implementations, but they also defer this to the paper. The developer writes Go code, and the Goose translator (written in 2K lines of Go code) translates this to Perennial proof, where it is machine checked with Coq.

As evaluation of Perennial framework, they verified a mail server written in Go. They argue that compared to a verification in CSCSPEC [OSDI ’18] (their earlier verification framework), the verification in Perennial takes less effort and is done in less number of lines of proof.

The software is available at

MAD questions

1. Is this an instance of a convergence refinement relation? 
In 2001, I was thinking on fault-tolerance preserving refinements as a graduate student working on graybox design of self-stabilization. The question was that: If we design fault-tolerance at the abstract, what guarantee do we have that after the abstract code is compiled and implemented in concrete, the fault-tolerance still holds/works?

It is easy to see that fault-tolerance would be preserved by an "everywhere refinement" that preserves the abstraction relation (between concrete and abstract) at any state, including the states outside the invariant states that are not reachable in the absence of faults. But the problem is that outside the invariant, the abstraction relation may not hold due to recovery actions being different than normal actions. That is pretty much the dilemma the Perennial work faced in verifying the recovery of replicated disks above.

OK, I said, let's relax the everywhere refinement to an "everywhere eventual refinement" and that would work for preserving fault-tolerance. Yes, it works, but it is not easy to prove that the concrete is an everywhere eventual refinement of the abstract because there is a lot of freedom in this type of refinement, and not much of a structure to leverage. The proof becomes as hard as proving fault-tolerance of the concrete from scratch. So, what I ended up proposing was a "convergent refinement", where the actions of the concrete provides a compacted version of the actions of the abstract outside the invariant. In other words, the forward simulation outside the invariant would be skipping states in the concrete. Perennial faced with the same dilemma chose to use a different abstraction relation. Whereas the convergence refinement idea is to keep the same abstraction relation but allow it to contract/skip steps in the computations outside the invariant states. I wonder if this could be applicable in the Perennial problem.

My reasoning with going compacting steps in refinement outside invariant was because it is safer than expanding the computation: if you show recovery in states in the abstract, by skipping steps (and not adding new ones) the concrete is also guaranteed to preserve that recovery.

Here is the abstract of my 2002 paper on convergence refinement. I just checked and this paper only got 19 citations in 19 years. It did not age well after getting a best paper award at ICDCS'02. In comparison, some of the papers we wrote quickly and published as short paper or as a workshop paper got more than 150-900 citations in less than 10 years. Citations is funny business.
Refinement tools such as compilers do not necessarily preserve fault-tolerance. That is, given a fault-tolerant program in a high-level language as input, the output of a compiler in a lower-level language will not necessarily be fault-tolerant. In this paper, we identify a type of refinement, namely "convergence refinement", that preserves the fault-tolerance property of stabilization. We illustrate the use of convergence refinement by presenting the first formal design of Dijkstra’s little-understood 3-state stabilizing token-ring system. Our designs begin with simple, abstract token-ring systems that are not stabilizing, and then add an abstract "wrapper" to the systems so as to achieve stabilization. The system and the wrapper are then refined to obtain a concrete token-ring system, while preserving stabilization. In fact, the two are refined independently, which demonstrates that convergence refinement is amenable for "graybox" design of stabilizing implementations, i.e., design of system stabilization based solely on system specification and without knowledge of system implementation details.

Saturday, November 9, 2019

SOSP19 Day 2, Scaling Symbolic Evaluation for Automated Verification of Systems Code with Serval

Verification session was the first session for Day 2. I like formal methods, and I did enjoy these papers. In this post I will only talk about the first paper in the session, the Serval paper. (You can read about SOSP19 Day 1 here.)

This paper is by Luke Nelson (University of Washington), James Bornholt (University of Washington), Ronghui Gu (Columbia University), Andrew Baumann (Microsoft Research), Emina Torlak (University of Washington), Xi Wang (University of Washington).

This paper received a best paper award at SOSP19, and the software is publicly available at

SOSP has a tradition of publishing systems verification papers, such as seL4 (SOSP’09), Ironclad Apps (OSDI’14), FSCQ (SOSP’15), CertiKOS (PLDI’16), Komodo (SOSP’17). A downside of systems verification is it is very effort-intensive. The Certikos manual proof consisted of more than 200K lines.

To help address this problem,  this paper introduces Serval, a framework for  developing automated verifiers for systems software. Serval accomplishes this by lifting interpreters written by developers into automated verifiers. It also provides a systematic approach to identifying and repairing verification performance bottlenecks using symbolic profiling and optimizations.

Wait, wait... What is an interpreter? And what is lifting?

In prior work on automatic verification (such as Hyperkernel SOSP17), a verifier implements symbolic evaluation for specific systems, and the verifier is not reusable/generalized. To make the verifier reusable and general, in Serval, the developers write an interpreter for an instruction set using Rosette, an extension of the Racket language for symbolic reasoning. Serval leverages Rosette to "lift" an interpreter into a verifier; which means to "transform a regular program to work on symbolic values". The developers also give the system specifications to be verified.

In the Serval framework the verifier consists of the lifted interpreter and the symbolic optimization. The steps are: write a verifier as interpreter, then Serval performs symbolic profiling to find bottleneck, and apply optimizations until verification becomes feasible.

Serval uses symbolic execution to avoid the state space explosion problem. But the program counter (PC) becoming symbolic is bad as it unnecessarily opens up search space. Serval prevents this with symbolic optimizations:

  • peephole optimization
  • fine-tune symbolic evaluation
  • use domain language to reduce the concrete values PC can take, and avoid path explosion problem.

Unfortunately I didn't understand much about the first two optimizations from listening to the presentation.

Using Serval, the authors build automated verifiers for the RISC-V, x86-32, LLVM, and BPF instruction sets. Targeting low level end of compiling stack can be an advantage for verification, because we don't need to trust higher level language toolkits. Future work will consider how the low-level-guarantees identified and verified by Serval could be connected to high level data structures for proof verification.

To show that existing systems can be retrofitted for Serval, they Retrofitted CertiKOS and Komodo for Serval. They mention this takes around 4 weeks for a new system. They also found 15 new bugs in Linux BPF JIT.

I will read the paper carefully to understand Serval better. It seems promising for scaling verification to practical systems. Of course the process still requires expertise and several weeks worth of effort, but Serval improves on the state-of-the-art with many months of effort.

Thursday, November 7, 2019

SOSP19 Day 1 wrap up

It was only 3 sessions into day 1, and my brain was fried.
Conferences are tiring because you are exposed to so many new ideas in a short time. It was clear I would not be able to pay attention to the papers in the last session, so I skipped that session (the privacy session which included the following three papers) to go for a walk at the golf park behind the conference center.

After the privacy session, there was a poster session and reception from 5-7:30pm. The poster session was nice for asking authors questions about the papers and having more in-depth conversation.

A student had told me he doesn't know how to start conversations with other conference attendees. I told him "That's easy.. Just ask them about what they are working on these days." A better way to start deeper conversations is to listen to the paper presentations, and have genuine questions about future work, or some extension and connection, and go discuss with them at coffee breaks, lunch, or poster session.

In the free-roaming poster session and reception, I had a chance to meet many colleagues and catch up on what they are working these days. When they returned the question, I had to talk for 3-5 minutes about what I am working on these days. I found that my "elevator pitch" got better and better as I had to answer this question many times.

I am a shy person, but at conferences my curiosity works in my favor, and I approach people to learn about their current work, and what they think of this paper versus that paper. I really enjoy talking to fellow researchers, each of whom is an expert in a small part of a big field. We may have different opinions on things, they may not like the papers/ideas I like, but I get to learn about their perspectives and file them in my brain without having to agree or disagree with them for now.

General impressions about SOSP 

SOSP is single track, so 500+ people were in the same big conference room for the sessions. The first half of the room had tables, and the second half just chairs. If you sat at a table row, you can rest your laptop on the table and type comfortably. I sat at the very front row and took notes. Interestingly, there is little contention for the front rows. Another advantage of sitting at the front row is that I am not distracted by seeing other audience members checking Facebook, Twitter, and mail on their laptops.

(Rant: This is my pet peeve. This drives me nuts. What is the point of flying over a long distance, driving at least two more hours from the airport to come to the conference, and check mail and social media all day long? You disrupted your whole week to travel to this conference and now you are not "at the conference". This I will never understand. Be here now!)

To comment on the papers and presentations on the first day, I found all the sessions very interesting. I don't have a favorite, all the three sessions I attended were very good.

Most of the presentations were given by graduate students. The quality of most of the presentations were very good. It is obvious a lot effort went into the rehearsals of those presentation. Almost all presenters had written (and memorized) extensive speaker notes and while the presentation view was displayed on the curtain, they had the presenter notes open on their laptops. Some of the presenters just read from their presentation notes. Those presentations were not very engaging. But at least the slides were very well organized, and the messages were distilled down to important points and were easy to follow.

Each presentation was about 20 minutes including a 2-3 minutes question answering slot at the end. (I think the SOSP conferences I attended before had 5 minutes reserved Q&A slot, but for this one the Q&A was not as rigidly reserved and enforced.)

Most of the presenters were able to cover around 40 slides in 20 minutes. This is an astonishingly large number. Normally the rule of thumb is to have 10 slides to present in 20 minutes. But being well prepared for a smooth flowing presentation, the presenters were somehow able to pull this off. I guess this takes its toll on the listeners though. I felt overwhelmed and exhausted after three sessions being bombarded by too many ideas, concepts, and acronyms.

I had written two posts about how to present in case you are looking for advice in that department.

Tuesday, November 5, 2019

SOSP19 Day 1, Debugging session

This session was the first session after lunch and had four papers on debugging in large scale systems.

CrashTuner: Detecting Crash Recovery Bugs in Cloud Systems via Meta-info Analysis

This paper is by Jie Lu (The Institute of Computing Technology of the Chinese Academy of Sciences), Chen Liu (The Institute of Computing Technology of the Chinese Academy of Sciences), Lian Li (The Institute of Computing Technology of the Chinese Academy of Sciences), Xiaobing Feng (The Institute of Computing Technology of the Chinese Academy of Sciences), Feng Tan (Alibaba Group), Jun Yang (Alibaba Group), Liang You (Alibaba Group).

Crash recovery code can be buggy and often result in catastrophic failure. Random fault injection is ineffective for detecting them as they are rarely exercised. Model checking at the code level is not feasible due to state space explosion problem. As a result, crash-recovery bugs are still widely prevalent. Note that the paper does not talk about "crush" bugs, but "crash recovery" bugs, where the recovery code interferes with normal code and causes the error.

Crashtuner introduces new approaches to automatically detect crash recovery bugs in distributed systems. The paper observes that crash-recovery bugs involve "meta-info" variables. Meta-info variables include variables denoting nodes, jobs, tasks, applications, containers, attempt, session, etc. I guess these are critical metadata. The paper might include more description for them.

The insight in the paper is that crash-recovery bugs can be easily triggered when nodes crash before reading meta-info variables and/or crash after writing meta-info variables.

Using this insight, Crashtuner inserts crash points at read/write of meta-info variables. This results in a 99.91% reduction on crash points with previous testing techniques.

They evaluated Crashtuner on Yarn, HDFS, HBase, and ZooKeeper and found 116 crash recovery bugs. 21 of these were new crash-recovery bugs (including 10 critical bugs). 60 of these bugs were already fixed.

The presentation concluded by saying that  meta-info is a well-suited abstraction for distributed systems. After the presentation, I still have questions about identifying meta-info variables and what would be false-positive and false-negative rates for finding meta-info variables via heuristic definition as above.

The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure

This paper is by Yongle Zhang (University of Toronto), Kirk Rodrigues (University of Toronto), Yu Luo (University of Toronto), Michael Stumm (University of Toronto), Ding Yuan (University of Toronto).

This paper is about debugging for finding the root cause of a failure. The basic approach is to collect large number of traces (via failure reproduction and failure execution), and try to find the strongest statistical correlation with the fault, and identify this as the root cause. The paper asks the question: what is the fundamental property of root cause that allows us to build a tool to automatically identify the root cause?

The paper offers as the definition for root cause as "the most basic reason for failure, if corrected will prevent the fault from occurring." This has two parts: "if changed would result in correct execution" and "the most basic cause".

Based on this, the paper defines inflection point as the first point in the failure execution that differs from the instruction in nonfailure execution. And develops Kairux: a tool for automated root cause localization. The idea in Kairux is to construct the nonfailure execution that has the longest common prefix. To this end it uses unit tests, stitches unit test to construct nonfailure execution, and modifies existing unit test for longer common prefix. Then it uses dynamic slicing to obtain partial order.

The presentation gave a real world example from HDFS 10453, delete blockthread. It took the developer one month to figure out the root cause of the bug. Kairux does this automatically.

Kairux was evaluated on 10 cases from JVM distributed systems, including  HDFS, HBase, ZooKeeper. It successfully found the root cause for 7 out the 10 cases. For the 3 unsuccessful cases, the paper claims this was because the root cause location could not be reached by modifying unit tests

This paper was similar to the previous paper in the session in that it had a heuristic insight which had applicability in a reasonably focused narrow domain. I think the tool support would be welcome by developers. Unfortunately I didn't see that the code and tool is available as opensource anywhere.

Finding Semantic Bugs in File Systems with an Extensible Fuzzing Framework

This paper is by Seulbae Kim (Georgia Institute of Technology), Meng Xu (Georgia Institute of Technology), Sanidhya Kashyap (Georgia Institute of Technology), Jungyeon Yoon (Georgia Institute of Technology), Wen Xu (Georgia Institute of Technology), Taesoo Kim (Georgia Institute of Technology).

The presentation offers by asking "Can file systems be bug free?" and answers this in the negative, citing that the codebase for filesystems is massive (40K-100K) and are constantly evolving. This paper proposes to use fuzzing as an umbrella solution that unifies existing bug checkers for finding semantic bugs in filesystems.

The idea in fuzzing is to give crashes are feedback to the fuzzers. However, the challenge for finding semantic bugs using fuzzers is that semantic bugs are silent, and won't be detected. So we need a checker to go through the test cases and check for the validity of return values, and give this feedback to fuzzer.

To realize this insight, they built Hydra. Hydra is available as opensource at

Hydra uses checker defined signals, automates input space exploration, test execution, and incorporation of test cases. Hydra is extensible via pluggable checkers for spec violation posix checker (sybilfs), for logic bugs, for memory safety bugs, and for crash consistency bug (symC3).

So far, Hydra has discovered 91 new bugs in Linux file systems, including several crash consistency bugs. Hydra also found a bug in a verified file system (FSCQ), (because it had used an unverified function in implementation).

The presenter said that Hydra generates better test cases, and the minimizer can reduce the steps in crashes from 70s to 20s. The presentation also live demoed Hydra in action with symC3.

Efficient and Scalable Thread-Safety Violation Detection --- Finding thousands of concurrency bugs during testing

This paper is by Guangpu Li (University of Chicago), Shan Lu (University of Chicago), Madanlal Musuvathi (Microsoft Research), Suman Nath (Microsoft Research), Rohan Padhye (Berkeley).

This paper received a best paper award at SOSP19. It provides an easy push button for finding concurrency bugs.

The paper deals with thread safety violations (TSV). A thread safety violation occurs if two threads concurrently invoke two conflicting methods upon the same object. For example, the C# list datastructure has a contract that says two adds cannot be concurrent. Unfortunately thread safety violations still exist, and are hard to find via testing as they don't show up in most test runs. The presentation mentioned a major bug that lead to Bitcoin loss.

Thread safety violations are very similar to data race conditions, and it is possible to use data-race detection tools in a manually intensive process to find some of these bugs in small scale. To reduce the manual effort, it is possible to adopt dynamic data race analysis while running the program under test inputs, but these require a lot of false-positive pruning.

In a large scale, these don't work. The CloudBuild at Microsoft involves  100million tests from 4K team and upto 10K machines. At this scale, there are three challenges: integration, overhead, and false positives.

The paper presents TSVD, a scalable dynamic analysis tool. It is push button. You provide TSVD only the thread safety contract, and it finds the results with zero false positives. TSVD was  deployed in Azure, and it has found more than 1000 bugs in a short time. The tool is available as opensource at

To achieve zero false positive, TSVD uses a very interesting trick. A potential violation (i.e., a call site that was identified by code analysis as one that may potentially violate the thread safety contract) is retried in many test executions by injecting delays to trigger a real violation. If a real violation is found, this is a true bug. Else, it was a false-positive and is ignored.

But how do we do the analysis to identify these potentially unsafe calls to insert delays? TSVD uses another interesting trick to identify them. It looks for conflicting calls with close-by physical timestamps. It flags likely racing calls, where two conflicting calls from different threads to the same object occur within a short physical time window. This way of doing things is more efficient and scalable than trying to do a happened-before analysis and finding calls with concurrent logical timestamps. Just identify likely race calls.

OK, what if there is actual synchronization between the two potentially conflicting calls within closeby physical timestamps? Why waste energy to keep testing it to break this? Due to synchronization, this won't lead to a real bug. To avoid this they use synchronization inference (another neat trick!): If m1 synchronized before m2, a delay added to m1 leads to the same delay to m2. If this close correlation is observed in the delays, TSVD infers synchronization. This way it also infers if a program is running sequentially or not, which calls are more likely to lead to problems, etc.

They deployed TSVD at Microsoft for several months. It was given thread safety contracts of 14 system classes in C#, including list, dictionary, etc. It was tested on 1600 projects, and was run 1 or 2 times, and found 1134 thread safety violations. During the validation procedure, they found that 96% TSVs are previously unknown to developers and 47% will cause severe customer facing issues eventually.

TSVD beats other approaches, including random, data collider, happened-before (hb) tracking. 96% of all violations were captured by running TSVD 50 times. And 50% violations were captured by running TSVD once! This beats other tools with little overhead.

One drawback to TSVD approach is that it may cause a false negative by adding the random delay. But when you run the tool multiple times, those missed false negatives are captured due to different random delays tried.

Yep, this paper definitely deserved a best paper award. It used three very interesting insights/heuristics to make the problem feasible/manageable, and then built a tool using these insights, and showed exhaustive evaluations of this tool. 

Sunday, November 3, 2019

SOSP19 Day 1, Blockchain session

The second session on Day 1 was on blockchains. There were three papers on this session.

Teechain: A Secure Payment Network with Asynchronous Blockchain Access

This paper is by Joshua Lind (Imperial College London), Oded Naor (Technion), Ittay Eyal (Technion), Florian Kelbert (Imperial College London), Emin Gun Sirer (Cornell University), Peter Pietzuch (Imperial College London).

Bitcoin's throughput is a measly 4 transactions per second as limited by Nakamoto consensus. This leads to work on off-chain scaling via payment networks, e.g. lightning network.  The payment network consists of nodes that establish pairwise connections. To establish a multihop connection these pairwise connections are utilized as intermediaries. There are three phases to a payment network: setup a multihop connection, process payments for some time, perform a settlement where the final balance is written back to the blockchain.

To guard against a roll-back attack and ensure that the correct balance is written, the blockchain is used as the root of trust. If a node misbehaves,  writes an incorrect amount, the other reacts within delta time, corrects by providing proof of lightning transactions, resolves the situation, and  gets incentive for it.

The problem with this is that the reaction time delta requires synchronous access. Spam/congestion attacks can make some transactions take more than 7 days to be written to the chain. So what should be the appropriate value of delta, the reaction time? Security of the payment channel should not rely on read/write latencies.

To address this issue, the paper proposes Teechain, the first asynchronous blockchain access payment network. Teechain removes the blockchain as root-of trust by introducing another root of trust, a treasury. Treasury controls funds, balances, and payments. But how do we realize treasuries for blockchains, avoiding centralization and trust?

Teechain uses committees for realizing treasuries for blockchains. A treasury committee consists of n parties in the network, and requires m out of n parties to agree before accessing funds. But how large should m and n be? Large committees are problematic for scaling. To reduce the size of committees, the paper proposes trusted executions, such as the Intel SGX enclave. The enclaves guards against software hardware attacks, and uses trusted execution environments (tees) to secure committee members. The paper does not take tees as the entire solution, because some attacks, like foreshadow could still be possible with tees. Instead, the paper uses tees to increase the attack costs and combines tees with treasury committees to solve the problem securely.

OK, how does the treasury committee maintain agreement? Chain replication is used for propagating the transactions to the committee. Even though the committee members use tees, they do not rule out that some nodes can still be byzantine, so there is an m-out-of-n agreement at the end. If chain replication is applied naively, some attacks are possible. So Teechain uses a variant called force-freeze chain replication, where if the chain configuration is changed or if there is a fault, the decision is an abort decision, and the state is dumped.

The paper includes evaluations to show that Teechain scales out. Evaluations were done using a complete graph and a hub-spoke graph. For committee sizes n=3 or 4 are used, and it was shown that 1 million tx/sec are possible using 30 machines.

Teechain is available as opensource from

Fast and Secure Global Payments with Stellar

This paper is by Marta Lokhava (Stellar), Giuliano Losa (Galois), David Mazières (Stanford), Graydon Hoare (Stellar), Nicolas Barry (Stellar), Eliezer Gafni (UCLA), Jonathan Jove (Stellar), Rafał Malinowski (Stellar), Jed McCaleb (Stellar).

In 2018, I had written a review of the stellar arxiv paper I read on Stellar. This paper extends that with formal verification, evaluation, and lessons learned from several years in deployment.

However, the presentation was not technical. It was a very high level presentation that avoided the description of the protocol and formal verification. I think the presenter, David, must have done this presentation hundred times before to VCs because the presentation was pitch perfect. Every criticism possible about the protocol was proactively flipped and explained as an advantage. Here is how the presentation went.

Things we take for granted, such as a bank account in a stable currency, access to well-regulated investments, cheap international money transfers, globally accepted fee-free credit cards, are not available in many places. Stellar provides more equitable access to assets. It is the first solution to provide:
  1. open membership
  2. issuer-enforced finality: still need secure servers, but issuer owns or designates them
  3. cross-issuer atomicity
The Stellar transaction model is based on replicated state machines (RSMs). Each RSM executes transactions to keep ledger state. Transactions guarantee atomicity. But now that the RSM is distributed, how does Stellar guarantee ledger integrity? Stellar uses a shared RSM. The idea is to follow the graph transitively until it converges. The hypothesis is that any two nodes transitively follow a common node. This holds true for Internet with its hierarchical domains.

The Byzantine agreement in Stellar follows from this hypothesis. The key idea is that the broadcast protocol steps are conditioned on other nodes' steps: take the step if all nodes are mutually satisfied. For availability must generalize follows to a sets of peers, called quorum slices. A quorum is a set of nodes that contains one slice from nonfaulty server.

Stellar has top tier, middle tier, leaf tier servers. As in Internet, no central authority appoints the top tier. The production network has been running for 4 years.

In the Q&A period, one question was: "How are you dealing with dynamic reconfiguration of quorums?" David said "we get it for free, and can unilaterally change quorum slices at any time". But this answer is not clear to me because reconfiguration could reduce your safety. As far as I remember from the Stellar arxiv report, the onus is on the user to figure out that her quorum slices are set up correctly. So the reconfiguration should be more involved than that.

Another question was: "Can you do smartcontracts where state lies in different slices?" David claimed it is possible, but again without reading the paper I don't understand this answer. The state could be partitioned (and not fully replicated) across all nodes, and ordering across all slices may be involved.

Notary: A Device for Secure Transaction Approval

This paper is by Anish Athalye (MIT CSAIL), Adam Belay (MIT CSAIL), Frans Kaashoek (MIT CSAIL), Robert Morris (MIT CSAIL), Nickolai Zeldovich (MIT CSAIL).

Smartphones suffer from bugs and attacks. So hardware wallets (or cryptocurrency wallets) are adopted for transaction approval for critical financial transactions. The ledger app store contains 50+ third party apps. Unfortunately, existing hardware wallets also have OS bugs and potential hardware bugs as well.

The contribution in this paper is to introduce "notary", a device for secure transaction approval. Notary uses an agent separation architecture. And the authors have developed a physical hardware wallet prototype for notary.

The separation architecture provides isolation: there is a kernel system-on-chip (soc) and a separate agent soc. The kernel soc does not run any third party code. It is the agent soc that runs third-party code. The agent soc does not have access to OS. The agent also does not have full access to hardware, as it lacks access to the persistent storage.

There are only 2 wires across the kernel soc and agent soc: UART and RST. Using RST, the kernel can reset the agent soc. More specifically, the kernel clears the state in agent soc after every transaction. This way notary uses clean-state deterministic start to ensure noninterference across transactions, and avoid any bugs/attacks.

Verilog is used for verifying that notary clears registers under any path. SMT-compatible format is used for symbolic circuit simulation. They developed a RISC-V based prototype, and evaluated with two agents: Bitcoin and web-app approval.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...