Friday, January 31, 2020

Future of work

A week or so ago, I told you I wanted to brainstorm about the future of work. Well, here it is for what it is worth.

Work wants to be free/liberated 

As liberties progress and more welfare is created, our expectations move higher up Maslow's hierarchy. We are not satisfied with achieving bare survival by means of our work; we want to thrive, flourish, and express ourselves via our work. We long for more autonomy and freedom of choice and expression in our work, and we like to pursue our own enterprises as part of it.

We already see things moving away from the traditional industrial model, with its company loyalty, toward this type of model. In the software field, which is well off compared to others, people switch jobs frequently, try to build their own brands/niches, and demand fulfillment from their work.

As these trends shape the future of work, we will see a growing gig economy that is served/supported by algorithmic platforms. Some primitive examples of this are Uber, DoorDash, TaskRabbit, and the Amazon and eBay reseller markets. Recently, we started to see better examples via Airbnb, and in particular Airbnb Experiences. Many people use Airbnb Experiences to host cooking or art classes in the Airbnb locations they rent.

These gig economy platforms task a single person, rather than a team, with a straightforward task (e.g., driving from point A to point B). In the future, we will see platforms that perform team formation algorithmically to address complicated/composite projects.

Infrastructure for future of work

In this future of work, teams will form in an ad hoc manner and dissolve when the project is finished. To perform team formation algorithmically, it becomes important to investigate the search problem in temporal graphs. The algorithms should be able to sift through large-scale temporal graphs to find the right person, not just for the job, but also for the team.
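To make the team-formation idea concrete, here is a toy sketch (all names and the scoring function are hypothetical, not any actual platform's algorithm) that ranks candidate teams by skill coverage, plus a bonus for pairs with a prior-collaboration edge in the temporal graph:

```python
from itertools import combinations

def team_score(team, required_skills, worked_together):
    """Score a candidate team: skill coverage plus a prior-collaboration bonus."""
    covered = set().union(*(skills for _name, skills in team))
    coverage = len(covered & required_skills) / len(required_skills)
    # Small bonus for pairs with a prior collaboration edge in the temporal graph.
    names = [name for name, _skills in team]
    bonus = sum(0.1 for a, b in combinations(names, 2)
                if frozenset((a, b)) in worked_together)
    return coverage + bonus

def form_team(candidates, required_skills, worked_together, size=2):
    """Brute-force search for the best-scoring team of the given size."""
    return max(combinations(candidates.items(), size),
               key=lambda team: team_score(team, required_skills, worked_together))

candidates = {"ann": {"design", "ml"}, "bob": {"backend"}, "eve": {"ml", "backend"}}
history = {frozenset(("ann", "bob"))}   # ann and bob have worked together before
best = form_team(candidates, {"design", "backend"}, history)
```

A real platform would of course replace the brute-force search with scalable search/match algorithms over the temporal graph, but the objective has the same shape.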

The systems support for this platform will come from blockchains and smartcontracts, which will maintain the attestation/certification of work/contributions and the payment/remittance/settlement of deals and commerce.

The future of law will steer closer to mathematical/logical specification of algorithmic contracts. With systems support for digital oracles that electronically check/verify the real world, smartcontracts will gain more jurisdiction. Some digital oracles may also be implemented with algorithmically tasked third-party inspectors. When the digital oracles find that the smartcontract conditions are satisfied, payment/remittance/settlement of deals and commerce will execute automatically.
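As a rough illustration of this idea, here is a minimal sketch (all names hypothetical; it ignores real blockchain machinery) of an escrow contract that settles automatically once its oracles attest that the conditions hold:

```python
class EscrowContract:
    """Toy escrow: funds move only when every oracle attests the condition."""

    def __init__(self, payer, payee, amount, oracles):
        self.payer, self.payee, self.amount = payer, payee, amount
        self.oracles = oracles      # callables standing in for digital oracles
        self.settled = False

    def try_settle(self, ledger):
        # Payment/settlement executes automatically once all oracles agree.
        if not self.settled and all(oracle() for oracle in self.oracles):
            ledger[self.payer] -= self.amount
            ledger[self.payee] += self.amount
            self.settled = True
        return self.settled

ledger = {"buyer": 100, "seller": 0}
delivered = {"status": False}           # real-world state the oracle inspects
contract = EscrowContract("buyer", "seller", 40,
                          oracles=[lambda: delivered["status"]])
contract.try_settle(ledger)             # condition not met; nothing happens
delivered["status"] = True              # the oracle now attests delivery
contract.try_settle(ledger)             # payment executes automatically
```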

Predictions for future of work, workers, and technology

Future of work
  • gig work: ad-hoc teams of experts recruited by the algorithm
  • centered around: problem solving, empowering, mentoring, caregiving 
Future of workers
  • based on: apprenticeship with continuous training and certification
  • suited to: the worker's schedule, tastes, and skills
Future of technology
  • Software augments and shapes the physical: e-commerce got accepted, and it will next dominate and shape the physicality of commerce. With advances in 3D printing, software will get embedded in manufacturing and in managing physical devices
  • Advanced big-data processing and search/match algorithms will run the show for the future of work economy
  • Blockchains and smartcontracts will keep the tabs 

Things are a-changing against the grain of many decades of industrial capitalism, where big companies rule, company loyalty and brand recognition are expected, and big money attracts more money as money/capital dominates over work/labor/value. We will see companies replaced by team or personal brands. We will see jobs replaced by projects, tasks, and problems. Career paths will explode, and we will see many unique careers. Change is not easy, and it is not easy to accept and welcome change.

It may be that kids understand this better. I asked my 8 year old daughter, what she wants to be when she grows up, and she gave this a moment's thought and said: "I want to be a person that makes other people happy".

That quick but deep answer from my daughter was enlightening for me. It pretty much sums up the future of work. You won't need labels for jobs. Rather, you will find ways to fulfill your vision (such as making people happy) in a way you are skilled for. You will be doing meaningful work, because you will be defining the meaning and the work.

Wednesday, January 29, 2020

CSE 610 Distributed Consensus, Decentralized Ledgers, and Blockchains

This semester I am teaching a small research course on Distributed Consensus, Decentralized Ledgers, and Blockchains.

Blockchains! You?

One should be able to call bullshit when one sees bullshit. In this blog, I did that several times for certain claims about blockchains technology. 

But, just because of some stupid/hyped-up claims and some first-generation inefficient implementations, it is unacceptable to make fun of and ridicule the entire blockchains field. (Yes, it is definitely a field when a Turing Award winner and a strong candidate for the Turing Award publish papers on the topic.) A researcher should not paint with a broad brush. Truth is multidimensional.

Even when parts of a field are impractical right now, that doesn't justify putting the entire field down. P2P was impractical; it lost to the centralized model. But the best ideas from that work found their way into traditional datacenter computing. The distributed hash table idea got adopted for distributed key-value stores. BitTorrent saw some use in datacenter application-layer networking.
In this two-minute video, Bertram Gilfoyle makes this point very well. Here is the entire set of slides he used in the presentation; they are hilarious. I will show these to my class tomorrow.

Course content

My blockchains course starts with classical distributed consensus problem definition and algorithms, and then considers the decentralized consensus problem in permissioned/permissionless setting with Byzantine participants.

The course covers the fundamental concepts for the design and analysis of such distributed consensus algorithms including: Paxos, PBFT, Tendermint, HotStuff, Nakamoto consensus algorithms. It also discusses formal verification of smartcontracts using TLA+ specification language and model checker. Finally, it reviews state-of-the-art blockchain systems and applications.

1. Distributed consensus: consensus problem, synchronous and asynchronous setups, Paxos, Ben-Or, Texel, PBFT

2. Blockchain protocols: blockchain basics, Nakamoto, Tendermint, HotStuff, Avalanche

3. Blockchain systems: Bitcoin, Ethereum, Libra, Hyperledger, DFinity, ConcordBFT

4. Smartcontracts: Smartcontract basics, creating safe smartcontracts, formal specification of smartcontracts, crosschain deals and adversarial commerce

5. Applications: Internet of value, blockchaindb, multiparty collaboration

The class will have one midterm, several assignments, and one project. The project qualifies as an MS project. Two students can form a team and choose a project that involves (1) TLA+ modeling and Paxi implementation of a blockchain protocol, or (2) TLA+ modeling and deployment of a smartcontract on a blockchain platform.

There is no required textbook for the course; however, ample reading material (i.e., research papers) will be assigned. A suitable reference book that covers the blockchain basics is available as a free preprint.

Sunday, January 19, 2020

Future of Work (MSFT Faculty Summit 2019)

Future of work is a very interesting topic, something I would like to think about and brainstorm. But since the topic is very broad, attempts to talk about the future of work can get hazy and vague. NSF has released this call for proposals on Future of Work. It invites researchers to start addressing the needs/problems/challenges of future work with a "convergent perspective". I am accustomed to seeing vague and extravagant language in NSF calls, but I think this one surpassed the usual. The word convergent/convergence appears 22 times in the call. (I think convergent is a new word for interdisciplinary in this context.)

I guess a similar thing happened with the Microsoft Faculty Summit 2019. Last July, I was at Microsoft for my sabbatical and had a chance to attend the Microsoft Faculty Summit, whose subject was "Future of Work". It was a two-day workshop, and I was very motivated to learn as much as I could, because the topic is very interesting. After attending the first day, I was disappointed by the incoherent and vague sessions/presentations and couldn't get myself to attend the second day.

The entire thing looked unorganized. Everyone presented what they were working on (for example, machine learning, AR/VR, visualization) with minimal effort to link it to the future of work concept. As a result, the talks in a session, or sometimes the talks on the same panel, appeared disconnected and unrelated. At the end of one such panel, an audience member actually asked the panel, "What do your talks have in common?" The panel didn't have a good answer, bounced the question back, and asked, "Why did you think the talks were unrelated?", upon which he replied, "Because everyone talked about unrelated topics." The panel then said, "No, we were all talking about different aspects of the same thing." I suppose that, given a sufficiently high level of abstraction, it is true that we are all talking about the same thing.

I think I was expecting to see talks that take on the future of work question head-on, and I was frustrated to find that everyone was thinking about this either in an incremental way (microproductivity in the car, workflow creation on the smartphone, improving machine learning datasets) or, in a couple of cases, in a very vague, abstract way ("changing the social narrative"; not sure about the question, but AI is the answer). I didn't attend the second day, and maybe there were some more coherent panels and talks then. The agenda and talk videos are at this link, if you'd like to check.

However, at the end of the first day, Bill Gates's interview with Eric Horvitz was very interesting. It started with an overview of the Bill and Melinda Gates Foundation's work to combat global poverty, work that has saved millions of lives. The foundation has also supported curriculum and education work.
Bill had some smart things to say about future of work in the interview.

"When you write an email, it is not a random thing, you have a purpose. How does that email relate to other things you are involved with and dealing with? Graph representation of heterogeneous things can help for modeling this. Can machine learning infer the high-level intention and taxonomy? Can it help with how this email or the response to it is prioritized?"

"Long time ago, 1995, I gave a speech called information on your fingertips, i.e., query languages. We are extremely close for the machine assistant to help in many things. I have no idea how long it will take for deep understanding of text, but for other things (for which there have been demos for decades) we should be close. Such as information agents."

"I want to raise the bar for remote assistance on complex tasks."

"VR type avatar for a remote meeting would be helpful. Right now it is not even close to the real experience. HoloLens could be a piece of the puzzle. We are optimistic though, this should be within a 5 year time frame." (I am not optimistic that this could be done within 5 years.)

(About gig and freelance economy) "A key element is education math and science free stuff is extremely good, Khan Academy, etc.,  but this has little effect on actual skill levels. ... How do we quantify on a good math teacher? Average math teacher is probably the same with that of 20 years ago. ...  I am enthused about that: whether you can still learn in mature life, lifelong learning"

"Someday our sense of purpose will not be work. Someday we will collaborate on goals other than the human basics for a higher sense of purpose. ... If you can supercharge education, you buy many decades where the labor market is well matched to what those demands would look like."

(About new forms of automation, and inflection point in AI) "The society and government policies work should get ready for AI. The way the education system works, the way the tax system works--- which currently discourages labor--- should be changed. Things that are more encouraging of labor such as income tax credit will dominate, and things like social security taxes that cost you on labor, those would go away."

The camera doesn't show it too much, but Bill Gates has very fidgety legs, he kept shaking his legs the entire time he was talking. He is also very animated with his hands, and can get worked up while answering some questions. Also he likes saying "phenomenal".

Wednesday, January 15, 2020

Book review. Every tool is a hammer: Life is what you make it (by Adam Savage)

This book, published in 2019, is from Adam Savage of MythBusters fame.

The book is a love letter to making. It defines making not as the physical act of building but as the audacity of creating: "Making is more than the physical act of building. It's dancing, it's sewing. It's cooking. It's writing songs. It's silk-screening. It's breaking new trails both literally and figuratively."

To me, this book is another manifestation of the "War of Art" book by Steven Pressfield in a different format/domain. It talks about resilience, the labor of creativity, and acting in the face of a fearful/challenging project, which is also the topic of the War of Art book. The "how to be a pro" concept discussed in the War of Art book is also explored in depth here. Respect for the craft and the tools, and the requirement to ship something by the deadline, define the professional. "Professional is patient, prepared, acts in face of fear, dedicated to mastering technique."

The book has the following chapters:
1. Dig through the bottom of the rabbit hole
2. Lists
3. Checkboxes
4. Use more cooling fluid
5. Deadlines
6. Drawing
7. Increase your loose tolerance
8. Screw>glue
9. Share
10. See everything, reach everything
11. Cardboard
12. Hammers, blades, scissors
13. Sweep up every day

The book distills hard-earned wisdom from the maker space, and it turns out these are universal lessons that apply to most creative and crafting endeavors, including writing (as in the War of Art book), research, and computer science and engineering (as I try to persuade you via analogies below).

The first chapter is about the resistance. Getting started is always hard; there are so many unknowns, and this can be paralyzing. Adam says he is very lucky due to his temperament, and this is not too much of a problem for him. His curiosity and obsession make starting easy for him, so the book doesn't spend much time covering this. If you love what you do, you don't need to work a single day.

Chapters 2 and 3 were a lot of fun. It may be surprising to see chapters on lists and checkboxes in a book about making stuff, but if you think about it, these are the most universal tools for dealing with complexity, and for staying organized and planning ahead. Surgeons and pilots have checklists. Software engineers use kanban boards, todo lists, and task-tracking tools. For my own research and planning, I use Emacs Org-Mode, which is also lists and checkboxes. These are essential for getting things done and keeping your head clear.

Chapter 4, use more cooling fluid, is about the importance of going slow. "Professional is patient, prepared, acts in face of fear, dedicated to mastering technique." If you want to go far, go slow. Do not incur any technical debt. Technical debt bites you back big time.

Chapter 5 is about deadlines, because you have got to ship it, dammit. Deadlines work wonders for getting papers written, features added, and projects completed. Setting unrealistic deadlines doesn't help, but a deadline you set realistically is important for focusing the project, eliminating decision fatigue (due to too many possible options), and getting things wrapped up.

Chapter 6, on drawing, is all about the importance of modeling. As I have been saying for many years, use modeling and TLA+, people! The chapter makes almost the same points for the benefits of using drawing. Closely related to this, Chapter 11, Cardboard, is about the importance of rapid prototyping.

Chapter 7, increase your loose tolerance, is a chapter on fault tolerance. Chapter 8, screw>glue, is on the importance of modularity. These provide nice analogies from the maker space for why building in some slack, avoiding tight coupling, and maintaining modularity are important, as they are for computer systems and distributed systems.

Chapter 9 is about sharing your work with the community, so that everyone, including you, benefits from the act of sharing. This is very applicable to research. Practicing open research and disseminating results, as well as the thought process behind them, helps not only the community but the researcher as well: the researcher gets feedback, makes connections, and gains recognition in the community.

Chapter 10 is about mise en place, and Chapter 13, sweep up every day, covers concepts familiar to anyone who is a professional. Professionals develop a respectful relationship with their craft and tools.

Chapter 12, hammers, blades, scissors, is about... actual hammers, blades, and scissors. Ha, I couldn't find an analogy or deep message here. In this chapter, he reviews some common types of these tools and gives makers/builders tips for choosing them.

I listened to the audiobook, read by Adam Savage himself, and it was quite the treat. It looks like Adam had some theater background before his many years as a show host, so his voice is very well trained, and his passion for making comes through the reading crystal clear.

Bits or Atoms?

I am more of a bits person, but I can't deny that making/creating physical things has a different level of attraction/joy for me. I felt very happy the first time I baked bread with the bread machine. When I fix things around the house or build something, I feel proud. Writing papers is fun, but even at the end of it, I like to print the paper and hold it physically to inspect it.

We have an undeniable affinity for the physical, and this is why it can be a really good idea to use props in teaching.

Sunday, January 12, 2020

Distributed systems analogies

I start my distributed systems class each semester with a video of a murmuration of starlings. I tell the students that this qualifies as a distributed system, because the definition fits: a collection of autonomous nodes communicating to address a problem collectively, with no shared memory and no common physical clock.

Just from the simple local actions of each starling (i.e., readjusting position with respect to its neighbors), a global behavior of the murmuration emerges in poetic beauty. I like this analogy pedagogically because it is very visual. What I don't like about it is that the specification of the global behavior is lax. The starlings don't crash into each other, fine, but there is no other constraint on the global behavior of the murmuration: it can go in any direction, split, rejoin, etc.

What could be a better analogy for a distributed system?

To answer this question, let's start by considering an auxiliary question: What is the "action word" for distributed systems? For databases, the characteristic action word is storing/retrieving. For machine learning, it is perception. How about distributed systems? What do you think?

I think the action word for distributed systems is coordination. The fundamental challenge in distributed systems is to coordinate the behavior of nodes which execute concurrently with limited information about each other.

This brings me to my distributed systems analogy: a soccer team. The players try to execute certain plays/tactics, but no player can read the other player's mind. There is no shared memory/state among players, and timing of the players can also be off. Each player has only a limited, incomplete view of the system (e.g., an evolving attack). Yet, the global objective for the team is well-defined: defend your goal post, and eventually score a goal.

I like this analogy because it is very visual, and the global specification is nontrivial. It is easy to see "a collection of autonomous nodes communicating to address a problem collectively, with no shared memory and no common physical clock." One may even call FC Barcelona's tiki-taka a finely tuned microservices deployment.

It may be possible to stretch the analogy by relating team tactics to distributed algorithms, injuries to crash faults, etc. One part of the analogy I dislike is that the team is pitted against an opposing team. We don't deploy distributed systems in opposition to each other.

Well, there is the orchestra analogy, which also involves coordination. But I don't think an orchestra is a fitting analogy for a distributed system. The musicians play notes collectively, but the coordination is not with respect to each other or to dynamic inputs to the system; it is with respect to the musical composition, which is heavily prescripted. Maybe the orchestra analogy is more suitable for parallel computing. Jazz players jamming/improvising make a more suitable distributed systems analogy.

A team of construction workers also makes a good analogy for distributed systems. Something I noticed in the video is that these workers don't get in each other's way; unlike many distributed systems, they don't seem to require strict coordination/synchronization. Maybe this is a suitable analogy for a big data processing system, with map-reduce workers.

If you are not worried about the visual demonstration component of the analogy, coordination within a company/corporation may also serve as a distributed systems analogy. If you consider a company as a distributed system, you can think of several organization tactics: centralized, decentralized, federated, tiered, or hierarchical. I think there should be good synergy between management science and distributed systems, but I don't know if this has been explored much. (Please let me know if you have good leads on this.)

What are closely related or synergistic disciplines to distributed systems?

The other day I was wondering whether it would make sense to couple the adjective "distributed" with other disciplines. For example, distributed biology, distributed chemistry, distributed math, distributed economics, distributed medicine, distributed education, distributed civil engineering, distributed geology, etc.

Most of these don't make sense. But some do. When there is agency in the nodes, "distributed" makes sense and may become applicable. For example, distributed economics/finance may make sense because of the different markets and players affecting the economy. Maybe even distributed medicine makes sense, if you think of fighting an epidemic as a coordination problem, starting vaccinations/treatments from many different locations.

Distributed civil engineering may make sense because of the transportation problem, but maybe centralized algorithms are sufficient to address the static parts of transportation logistics. Distributed algorithms may be more appropriate for real-time traffic engineering solutions, and definitely for the sub-second geographical coordination required in some distribution management systems, such as electric grids.

Wednesday, January 8, 2020

Practical Byzantine Fault Tolerance

This paper is authored by Miguel Castro and Barbara Liskov, and it appeared in OSDI 1999. The conference version has around 2500 citations, and the journal version has close to 1500. This paper is the grandfather of most Byzantine fault tolerant (BFT) protocols, including work that appeared through the 2000s-2010s at OSDI/SOSP, and more recent blockchain BFT work such as LibraBFT.

The serpentine history of Byzantine fault-tolerance

The paper starts with this declaration:
"We believe that Byzantine fault-tolerant algorithms will be increasingly important in the future because malicious attacks and software errors are increasingly common and can cause faulty nodes to exhibit arbitrary behavior."
How prescient of them. When Byzantine fault tolerance (BFT) was first introduced by Lamport in 1982, it was mostly of theoretical interest, and it remained so for a long time. Although the prediction in this 1999 paper eventually came true, I don't think it happened for the reasons the paper expected. The paper gives the following boilerplate justification of why BFT may become practical and relevant:
"Since malicious attacks and software errors can cause faulty nodes to exhibit Byzantine (i.e., arbitrary) behavior, Byzantine-fault-tolerant algorithms are increasingly important."
This justification has been used by many papers, but I am not very fond of the argument. First of all, it desperately asks for a citation, right? (The journal version of the paper, which came a couple of years later, also does not provide one.) How do we know that malicious attacks and software failure modes map nicely and practically to the Byzantine fault model? It is a big assumption that a software bug, an operator error, or a cracker that gets root on some machines can only affect up to one third of the nodes. In practice that is often not the case, as these faults are all correlated. A software bug in one replica will be present in the other replicas as well, and is likely to manifest itself with the same state and input (the state-machine-replication approach is especially prone to this correlated failure). A cracker that broke into one replica is likely to use the same techniques or passwords to break into the other replicas.

Use of techniques like N-version programming may help alleviate the issue. The idea is to implement the same API/specification in multiple different ways, so that a software error does not constitute a problem for more than one third of the replicas. But this is not easy, and it is not always effective either: many developers are prone to making similar mistakes/bugs in similar tricky places.

While there was a decade of followup work to PBFT, since the above problems remained unaddressed, there was also growing criticism of BFT work. I swear I have seen a couple of talks around 2010 that declared BFT impractical. They had titles like "Do we need BFT?" and "BFT is impractical". Unfortunately, I cannot find those talks now.

In the last five years or so, with the prominence of blockchains and adversarial commerce applications, BFT assumption/model became relevant again. Therefore, yes, the prediction in the opening of the paper turned out to be appropriate after all.

PBFT Protocol overview

The PBFT protocol introduced in this paper improved significantly on previous BFT protocols, which assumed a synchronous system and/or were too costly in terms of communication. PBFT was the first to work in a partially synchronous environment (and to preserve safety in an asynchronous environment), and it offered significant reductions in communication costs. The former is achieved through the use of views and view changes, and the latter through the use of cryptography to authenticate messages and prevent spoofing and replays.

PBFT is based on the replicated state machine (RSM) approach. This is very similar to how Paxos uses RSM to achieve consensus in the face of crash faults; here, due to Byzantine fault tolerance, the protocol gets a bit more involved. Also, instead of N=2F+1 as in Paxos, BFT requires N=3F+1, where F is the upper bound on the number of faulty replicas and N is the total number of replicas. The reason is as follows. We cannot wait for replies from more than N-F replicas, because the F faulty replicas may not reply. But what if, among the N-F replicas that respond, F are faulty? There must still be enough responses that those from non-faulty replicas outnumber those from faulty ones, i.e., N-2F>F.
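The N=3F+1 arithmetic can be checked directly:

```python
def min_replicas(f):
    """Smallest N that tolerates f Byzantine replicas: need N - 2f > f."""
    n = 3 * f + 1
    responses = n - f             # we can only wait for this many replies
    worst_honest = responses - f  # f of the responders might be faulty
    assert worst_honest > f       # honest replies still outnumber faulty ones
    return n

assert min_replicas(1) == 4   # the classic 4-replica PBFT setup
assert min_replicas(2) == 7
```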

The RSM maintenance works pretty similarly to Paxos, or rather, to the viewstamped replication flavor of consensus. The replicas move through a succession of configurations called views (this corresponds to the ballot number concept in Paxos). In a view, one replica is the primary and the others are followers. Views are numbered consecutively. The primary of view v is replica p such that p = v mod N. View changes are carried out when it appears that the primary has failed.

The state of each replica includes the state of the service, a message log containing messages the replica has accepted, and an integer v denoting the replica's current view.

PBFT normal operation goes as follows:
1. A client sends a request to invoke a service operation to the primary
2. The primary multicasts the request to the backups
3. Replicas execute the request and send a reply to the client
4. The client waits for F+1 replies from different replicas with the same result; this is the result of the operation
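The client-side rule in step 4 can be sketched as follows (a toy illustration, not the paper's code): any value reported by F+1 distinct replicas is vouched for by at least one non-faulty replica.

```python
from collections import Counter

def client_result(replies, f):
    """Return the operation result once f+1 replicas report the same value.

    With at most f Byzantine replicas, a value reported by f+1 distinct
    replicas must come from at least one non-faulty replica.
    """
    counts = Counter(value for _replica, value in replies)
    for value, count in counts.items():
        if count >= f + 1:
            return value
    return None  # not enough matching replies yet; keep waiting

replies = [("r0", "ok"), ("r1", "ok"), ("r2", "bad")]
assert client_result(replies, f=1) == "ok"
```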

Here the primary is replica 0. There are three phases to commit the request. The pre-prepare and prepare phases are used to totally order requests sent in the same view even when the primary, which proposes the ordering of requests, is faulty. The prepare and commit phases are used to ensure that requests that commit are totally ordered across views.

Pre-prepare phase

The primary assigns a sequence number to the request, adds its view v, multicasts a pre-prepare message to all the backups, and appends the message to its log.

A backup accepts a pre-prepare message provided:
  • the signatures in the request and the pre-prepare message are correct
  • it is in view v
  • it has not accepted a pre-prepare message for the same view and sequence number containing a different digest
  • the sequence number in the pre-prepare message is between a low water mark, h, and a high water mark, H

Prepare phase

Upon accepting the pre-prepare message, a follower enters the prepare phase by multicasting a Prepare message to all other replicas and adding both messages to its log.

A replica (including the primary) accepts prepare messages and adds them to its log provided their signatures are correct, their view number equals the replica’s current view, and their sequence number is between h and H.

The predicate Prepared(m,v,n,i) holds iff replica i has inserted in its log:
  • the request m
  • a pre-prepare for m in view v with sequence number n, and 
  • 2F prepares from different backups that match the pre-prepare
The pre-prepare and prepare phases of the algorithm guarantee that non-faulty replicas agree on a total order for the requests within a view. That is, if Prepared(m,v,n,i) holds, then Prepared(m',v,n,j) is false for any m' different from m. This is because the former implies that at least F+1 non-faulty replicas have sent a pre-prepare or prepare for m in view v with sequence number n, and by quorum intersection the latter would then require a non-faulty replica to have sent two conflicting messages for the same v and n, which is a contradiction.
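As a toy illustration of the Prepared predicate (the log representation here is my own, not the paper's):

```python
def prepared(log, m, v, n, f):
    """Prepared(m, v, n, i): request m, a matching pre-prepare, and 2f
    matching prepares from different backups are all in replica i's log."""
    has_request = ("request", m) in log
    has_preprepare = ("pre-prepare", v, n, m) in log
    # Count distinct senders of prepares matching this (view, seqnum, request).
    prepare_senders = {entry[4] for entry in log
                       if entry[0] == "prepare" and entry[1:4] == (v, n, m)}
    return has_request and has_preprepare and len(prepare_senders) >= 2 * f

log = {("request", "op1"),
       ("pre-prepare", 0, 5, "op1"),
       ("prepare", 0, 5, "op1", "r1"),
       ("prepare", 0, 5, "op1", "r2")}
assert prepared(log, "op1", v=0, n=5, f=1)
```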

Commit phase

Replica i multicasts a Commit message to the other replicas when Prepared becomes true. Replicas accept commit messages and insert them in their log provided they are properly signed, the view number in the message equals the replica's current view, and the sequence number is between h and H.

The predicate Committed(m,v,n) holds iff Prepared(m,v,n,i) is true for all i in some set of F+1 replicas. And the predicate Committed-local(m,v,n,i) holds iff Prepared(m,v,n,i) is true and i has accepted 2F+1 commits (possibly including its own) matching m.

As an invariant of PBFT, we have that if Committed-local (m,v,n,i) holds, then Committed (m,v,n) holds. This ensures that non-faulty replicas agree on the sequence numbers of requests that commit locally even if they commit in different views at each replica. Furthermore, with the view change protocol this ensures that any request that commits locally at a non-faulty replica will commit at F+1 or more non-faulty replicas eventually.
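These two predicates reduce to simple counting checks. A toy sketch, with inputs modeled as booleans and counts (my framing, not the paper's):

```python
def committed(prepared_flags, F):
    """Committed(m,v,n): Prepared(m,v,n,i) holds at each of F+1 replicas.
    `prepared_flags` lists whether Prepared holds at each non-faulty replica."""
    return sum(prepared_flags) >= F + 1

def committed_local(prepared_here, n_matching_commits, F):
    """Committed-local(m,v,n,i): Prepared holds at replica i, and i has
    accepted 2F+1 matching commits (possibly including its own)."""
    return prepared_here and n_matching_commits >= 2 * F + 1
```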

Garbage collection and checkpointing

For the safety condition to hold, messages must be kept in a replica's log until it knows that the requests they concern have been executed by at least F+1 non-faulty replicas and it can prove this to others in view changes.

If some replica misses messages that were discarded by all non-faulty replicas, it will need to be brought up to date by transferring all or a portion of the service state. Therefore, replicas need some proof that the state is correct. These proofs are generated periodically using the checkpointing protocol, when a request with a sequence number divisible by some constant (e.g., 100) is executed. The checkpoint protocol is also used to advance the low and high water marks (which limit what messages will be accepted).
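A toy sketch of checkpoint stability and water-mark advancement. The checkpoint interval of 100 is the paper's example; the log-size constant is my assumption:

```python
def checkpoint_stable(checkpoint_msgs, seq, digest, F):
    """A checkpoint becomes stable once 2F+1 replicas (possibly including
    ourselves) have sent matching CHECKPOINT messages for (seq, digest).
    `checkpoint_msgs` is a list of (seq, digest) pairs, one per sender."""
    matching = sum(1 for (s, d) in checkpoint_msgs if (s, d) == (seq, digest))
    return matching >= 2 * F + 1

def advance_water_marks(stable_seq, log_size=200):
    """Once a checkpoint is stable, the low water mark h moves to its
    sequence number, and H = h + log_size (log_size is an assumed constant,
    e.g., twice the checkpoint interval of 100)."""
    return stable_seq, stable_seq + log_size
```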


That was normal operation with a stable primary. The view-change protocol provides liveness by allowing the system to make progress when the primary fails. View changes are triggered by timeouts that prevent followers from waiting indefinitely for requests to execute.


Using PBFT, the paper shows how to implement a Byzantine-fault-tolerant NFS service. The paper performs evaluations on this service and shows that it is only 3% slower than a standard unreplicated NFS.

Of course, PBFT still has a high communication cost due to the all-to-all communication in the prepare and commit phases. Recent work on BFT protocols employs threshold signatures to avoid these all-to-all broadcasts and replaces them with leader-to-quorum and quorum-to-leader communication. This is a big improvement: it brings BFT protocols to the same order of magnitude of communication complexity as primary-backup replication and Paxos protocols. With pipelining and rotation of leaders, these modern BFT protocols achieve much better performance.

Tuesday, January 7, 2020

Death's End (by Liu Cixin) and the Swordholder problem

This is the last book in Liu Cixin's Remembrance of Earth's Past trilogy. I had written reviews of The Three-Body Problem and The Dark Forest before, so for completeness' sake I write my impressions of this book as well. This time I try not to talk much about the plot, and to keep this spoiler-free.

This book is epic in the space distances and timelines it covers. It is awe-inspiring in that vastness-of-space-and-time sense, and makes you think at the universal scale. It feels much more hardcore sci-fi than the previous two books, and less believable. It also feels rushed in character and plot development (definitely not in length) compared to the previous two books. But all in all, it is a very captivating piece of science fiction, and I couldn't stop thinking about it even a week after finishing it.

The book still has a very cynical and very pessimistic view of human nature (as well as alien nature). The universe is a dark forest, where everyone tries to finish off each other. Everything is bleak, life is meaningless, pointless, and worthless.

On the other hand, the book made me very optimistic about potential advances in science. The description of four-dimensional space is very rich in visualizations. The brainstorming the book induces about what becomes possible if you can manipulate three-dimensional space from four-dimensional space is inspirational. The potential behind the curvature drive technology (manipulating/changing physical laws) is also amazing to dream about.

One of my main annoyances with the book is its weak character development and its naivete about gender dynamics. Cheng Xin, the main character in the story, is portrayed as a weakling. She is very smart (an aerospace engineer), but she is weak and fallible. As with the original sin, she is blamed for the fall of man, not once but several times. I was aghast at how absurd this portrayal was, and I couldn't stop myself from screaming back at the audiobook in several places.

Unfortunately, Liu Cixin's female characters have all been one-dimensional. Luo Ji's wife was also that way. In The Dark Forest, Luo Ji creates a perfect woman in his imagination: a very submissive, very gentle, soft-spoken, geisha-like woman. The worst part is that Da Shi then searches China to find the woman who fits this description. It was cringe-worthy to read how one-dimensional this character was.

That being said, I want to emphasize again that Liu Cixin is a very talented storyteller. (I am very much impressed by the creativity of the stories told through the mouth of Yun Tianming and the technological clues hidden in them.) I strongly recommend the series to all science-fiction readers. I also want to recognize the hard work and talent of Ken Liu for doing a superb job of translating the books into English. It looks like he took a lot of initiative and made creative improvements in the translation rather than doing a straightforward one.

The swordholder problem

The first part of the book describes the swordholder position, which is a sort of mutual-destruction button-holder duty. Initially Luo Ji is tasked as the swordholder. But the problem with the swordholder position is that it puts too much trust in one person. One person becomes responsible for the life-or-death decisions of two worlds, Earth and Trisolaris.

Liu Cixin has a programming background. He should have realized that this trust problem is better handled by a Byzantine fault-tolerant consensus protocol. Instead of one person, choose 7 people to act as swordholders, and the system will tolerate 2 Byzantine swordholders (because N > 3F).

If we assume that the swordholders don't counsel with each other and each makes a decision based on a definite and correct input provided to them, it may be possible to make the system work with N=2F+1 nodes, tolerating more faulty nodes. In that case, just a multisignature solution (with F+1 signatures) could work. (The input is whether the Trisolarans have started an attack on Earth. This assumes the input is definite/decisive and binary, and that it is visible to the nodes not through other nodes but through the environment, media, news reports, and sensors. It also assumes that for the non-faulty nodes the input is not faked, as in The Truman Show.)

Yes, this is too many assumptions, so a counsel-based solution among the nodes, and hence a Byzantine fault-tolerant protocol with N=3F+1, seems more appropriate.
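For illustration, here is a toy sketch of the multisignature variant under the strong assumptions above. All names here are hypothetical; this is just the F+1-of-2F+1 counting rule:

```python
def multisig_strike(signatures, F):
    """With N = 2F+1 swordholders who each observe the attack directly,
    F+1 matching signatures suffice to trigger the strike: at least one
    signature must come from a non-faulty holder, so F Byzantine holders
    alone can neither trigger nor forge the decision."""
    return len(set(signatures)) >= F + 1
```

For example, with N = 5 swordholders tolerating F = 2 Byzantine ones, the two faulty holders cannot trigger a strike by themselves, but two faulty holders plus one honest observer of a real attack can.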

Of course, Liu Cixin used the swordholder position for dramatic effect in the novel, but I am a technical person, and I can't get over the fact that the people of three centuries later would be so ignorant of more suitable solutions to the problem.

This brings me back to our time... Is the nuclear button holder position in our time still a single trusted person position? (Related concept: Dead hand) Do we trust a president this much to give him this responsibility? Isn't it time to institute a Byzantine fault-tolerant solution to solve this problem?

Saturday, January 4, 2020

State Machine Replication in Facebook's Libra Blockchain

This paper presents LibraBFT, the state machine replication system designed for Facebook's Libra Blockchain. LibraBFT is based on HotStuff, which I had summarized recently. LibraBFT adds modules, such as a pacemaker, to adopt and operationalize the HotStuff protocol within a real-world blockchain system.

The plan with the Libra project is that, initially, the participating validators will be permitted into the consensus network by "Founding Members". Later on, membership eligibility will gradually become open/permissionless while preserving decentralization with careful governance. (To facilitate dynamic membership, LibraBFT can reconfigure itself by embedding configuration-change commands in the sequence.)

In my summary below I mostly use prose lifted from the LibraBFT whitepaper. It is a 41 page paper, so I focus only on the important parts to provide the summary. After the summary, I add a brief discussion about how LibraBFT compares with other blockchain consensus protocols and future directions for research in the Libra project.


  • Safety: LibraBFT maintains consistency among honest validators even if up to one-third of the validators are corrupt/byzantine.
  • Asynchrony: Consistency is guaranteed even in cases of network asynchrony.
  • Finality: LibraBFT supports a notion of finality (an explicit commit certificate) for transactions. 
  • Linearity and Responsiveness: Linearity guarantees that driving transaction commits incurs only linear communication (this is optimal) even when leaders rotate; responsiveness means that the leader has no built-in delay steps and advances as soon as it collects responses from validators.
  • Sustainability: In contrast to the computationally expensive Proof-of-Work (PoW), LibraBFT is designed as a Proof-of-Stake (PoS) system, where participation privileges are granted to known members based on their financial involvement.

Leaders, Votes, Quorum Certificates

LibraBFT belongs to the family of leader-based consensus protocols. In leader-based protocols, validators make progress in rounds, where each round has a designated leader.

Leaders are responsible for proposing new blocks and obtaining signed votes from the validators on their proposals. During a round, the leader proposes a block that extends the longest chain it knows. If the proposal is valid and timely, each honest node will sign it and send a vote back to the leader.

After the leader has received enough votes to reach a quorum, it aggregates the votes into a Quorum Certificate (QC) that extends the chain. The QC is broadcast to every node. If the leader fails to assemble a QC, participants will timeout and move to the next round.

Eventually, enough blocks and QCs will extend the chain in a timely manner, and a block will match the commit rule of the protocol. When this happens, the chain of uncommitted blocks up to the matching block becomes committed.
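As a toy sketch (the vote and QC structures are hypothetical, not Libra's actual types), QC formation at the leader might look like this:

```python
def form_qc(votes, block_id, n_validators):
    """The leader aggregates votes into a Quorum Certificate once it has
    2F+1 votes out of N = 3F+1 validators for its proposal.
    `votes` is a set of (block_id, validator_id) pairs."""
    F = (n_validators - 1) // 3
    quorum = 2 * F + 1
    signers = {v for (b, v) in votes if b == block_id}
    if len(signers) >= quorum:
        return {"block": block_id, "signers": frozenset(signers)}
    return None  # not enough votes yet; the round may time out
```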

Rounds and Phases

Each round has three phases. The first and second phases of a round are similar to PBFT, but the result of the second phase is a certified certificate, or a QC-of-QC, rather than a commit decision. A commit decision is reached upon getting a quorum of votes on the QC-of-QC (i.e., a QC-of-QC-of-QC). An honest leader can prove the safety of a proposal by referencing a single QC (from the highest round).


LibraBFT uses a chaining approach, where the three phases for commitment are spread across rounds. More specifically, every phase is carried out in a round and contains a new proposal. The leader of round k drives only a single phase of certification of its proposal. In the next round, k+1, a leader again drives a single phase of certification: it sends its own k+1 proposal, but it also piggybacks the QC for the k proposal. In this way, certifying at round k+1 generates a QC for k+1 and a QC-of-QC for k. In the third round, k+2, the k proposal can become committed, the k+1 proposal can obtain a QC-of-QC, and the k+2 proposal can obtain a QC. (See my summary of HotStuff for a more detailed explanation.)


Rounds must be strictly increasing along a chain: $round(B_i) < round(B_{i+1})$. When rounds increase exactly by one, that is, $round(B_i) + 1 = round(B_{i+1})$, we say that the chain has contiguous rounds. The commit logic is simple: it requires a 3-chain with contiguous round numbers whose last descendant has been certified.
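The commit rule can be sketched as a simple scan over the chain. This is a toy check on an assumed (round, certified) representation, not Libra's implementation:

```python
def commit_rule(chain):
    """Block B_i commits when B_i, B_{i+1}, B_{i+2} form a 3-chain with
    contiguous rounds and the last of the three is certified.
    `chain` is a list of (round, certified) pairs, oldest block first.
    Returns the rounds of blocks that commit."""
    committed_rounds = []
    for i in range(len(chain) - 2):
        (r0, _), (r1, _), (r2, cert2) = chain[i], chain[i + 1], chain[i + 2]
        if r1 == r0 + 1 and r2 == r1 + 1 and cert2:
            committed_rounds.append(r0)
    return committed_rounds
```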

(Safety) New commits always extend a chain containing all the previous commits.
(Liveness) If the network is synchronous for a sufficiently long time, eventually a new commit is produced.

Occasionally, LibraBFT may contain chains that have gaps in round numbers. For example, a dishonest leader may propose an invalid block, or a leader may fail to gather a quorum of votes in a timely manner because of network issues. LibraBFT guarantees that only one fork becomes committed through a simple voting rule that consists of two ingredients. First, validators vote in strictly increasing rounds. Second, whenever validators receive a block, they maintain a preferred round, defined as the highest known grandparent round. The rule is that validators vote for a block if its parent round is at least the preferred round. In Figure 2, validators that contributed to the formation of a QC for round k+2 remember k as their preferred round.
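The two ingredients of the voting rule combine into a small predicate. Here is a hedged sketch (parameter names are mine, not the whitepaper's):

```python
def should_vote(block_round, parent_round, last_voted_round, preferred_round):
    """A validator votes for a block iff (1) its round is strictly higher
    than any round it has voted in, and (2) the block's parent round is at
    least the preferred round (the highest known grandparent round)."""
    return block_round > last_voted_round and parent_round >= preferred_round
```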

Round synchronization

LibraBFT assumes a partial synchrony model: after some unknown global stabilization time (GST), the network delivers all messages between honest nodes under some (unknown) time delay $\delta_M > 0$.

The advancement of rounds is governed by a module called the Pacemaker, which keeps track of votes and of time. In the happy path, the Pacemaker module at each validator observes progress (a leader proposal for the current round) and advances to the next round. In the recovery path, the Pacemaker observes a lack of progress in a round. Upon a local round timeout, it broadcasts a TimeoutMsg notification.
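A toy pacemaker capturing just these two paths. The timeout value, message shape, and round-advance policy are my assumptions for illustration:

```python
import time

class Pacemaker:
    """Minimal sketch: advance the round on observed progress, and broadcast
    a TimeoutMsg when the local round timer expires."""
    def __init__(self, timeout=4.0):
        self.round = 0
        self.timeout = timeout
        self.deadline = time.monotonic() + timeout

    def on_proposal_for_current_round(self):
        # happy path: progress observed, move to the next round
        self.round += 1
        self.deadline = time.monotonic() + self.timeout

    def tick(self, broadcast):
        # recovery path: local round timeout, notify the other validators
        if time.monotonic() >= self.deadline:
            broadcast({"type": "TimeoutMsg", "round": self.round})
            self.deadline = time.monotonic() + self.timeout
```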

Comparison with other protocols

LibraBFT is based on PBFT, so it is in the same family of leader-based protocols as Tendermint and Casper. I discussed this a little in the previous post on HotStuff.

Nakamoto consensus is also a leader-based protocol, but the leader election is silent and determined by proof-of-work. Also, in Nakamoto consensus there is no explicit commit/finality. Here is a comparison of the Nakamoto and Paxos protocols from an earlier post.

Threshold Logical Clocks and QSC provide a leaderless consensus protocol for blockchains. Finality/commit is not explicit in that protocol either.

Avalanche (a descendant of Texel consensus) also provides a leaderless consensus protocol for blockchains. It is based on polling/sampling and metastability. Again, finality/commit is not explicit.

Future work: Leader election

The paper does not provide much detail about how leaders are designated per round in LibraBFT. It looks like the leader election strategy in the Libra source code is round-robin. However, the paper notes that this strategy makes round leaders predictable and hence facilitates denial-of-service attacks. To address this problem, the paper proposes to work on a verifiable random function (VRF) to randomize leaders using some source of randomness that cannot be predicted in advance.

The paper also mentions that, going forward, the Libra team will investigate new proposer election strategies.
There are several ways to enhance the proposer election mechanism in order to avoid performance hiccups when a bad leader is elected, or worse, when a succession of bad leaders is elected.
One alternative strategy elects an ordered pair of leaders per round. The lower leader delays for a certain duration in order to yield to the higher leader. This approach has the benefit of quickly unlocking rounds with a crashed leader, but it risks creating contention among the leaders. Its success largely hinges on the ability to stagger leaders effectively.
A different approach optimizes for a stable leader, and provides fairness and load balancing via rotating input generators. In each round, there are two distinct roles: a leader and input generators. The leader is kept stable so long as progress and good performance are observed. The input generators are designated among the validators based on past performance and availability. In each round, the stable leader has the authority to promote or dismiss an input generator, but if it dismisses generators too frequently, the leader itself will eventually get demoted.
In any case, it looks like this will be an issue that will require more research in LibraBFT.

Future work: Scaling to a large number of nodes

There are no evaluation results in the paper about scalability. In a previous paper, the HotStuff protocol was evaluated on up to 128 nodes. The incast storm at the leader node constitutes a bottleneck for the system, and it would not be possible to scale the system to a large number of nodes if every node reports back to a leader.

Despite the many bad things associated with Nakamoto consensus, it avoids the incast storm problem and is able to scale to a very large number of nodes. Moreover, sampling-based/metastability approaches like Avalanche also avoid the incast storm problem and have a chance to scale better.

But I am still hopeful about the leader-based consensus approach. It is simple, provides strong assurances, and gives a quick proof of finality. There could be ways to make the leader-based approach scale. For example, one way to resolve the incast storm problem in LibraBFT could be to use incast-aggregator nodes in front of the leader. Another way to alleviate the problem could be to use a federation-based consensus model as in Stellar.

Friday, January 3, 2020

How to speak (by Patrick Winston)

On New Year's morning, I watched the above lecture, "How To Speak", by Patrick Winston, the late MIT professor. (Patrick Winston passed away in July 2019. May he rest in peace.) I had previously covered his Winston star idea; this talk provides a more general and comprehensive treatment of how to speak and present.

Patrick's talk contained several new gems for me, and I thought I already knew a lot about presenting.

Start with an empowerment promise. Nobody cares about your talk, they care about what useful things they can learn from your talk. So make this about the audience, look at this from their perspective, and start by telling them what useful thing they will get out of this talk.

Differentiate between presentations that aim to expose (a job talk or a conference talk) and those that aim to inform (a lecture).

For an exposition talk, you need to get across your message in the first five minutes. You should explain the problem you solve and why that is an important problem, and then you should explain why your approach is new and effective. Then you give a preview of your contributions.

Be minimalistic in your slides. Cut almost all the words on your slides, and use a small number of slides. Using a lot of text on slides means that people will be distracted trying to read it and won't be able to follow what you say. (Humans have a single language-processing center.) This suggests that it may be useful to prepare two versions of your slides: one for the live presentation and another to serve as a handout to post on the conference website or on your homepage.

Avoid laser pointers. Laser pointers are too distracting, as you need to face the screen to point at things. Instead, use arrows on the slides and explain/highlight them in your talk. I recently bought a Logitech Spotlight pointer, which solves the awkward pointing problem of laser pointers. It is accelerometer-based and highlights on the laptop screen in software, so you don't have to turn your back to the audience to point at the screen.

As for presentations that aim to inform, Patrick argues that using chalk and a blackboard is more appropriate. I agree that the blackboard is very useful when explaining a protocol, but slides are still more convenient the rest of the time. Again, the rules about well-prepared slides apply. It is also important to keep the pace slow and keep things interactive, with several strategic questions embedded in the slides.

Try to minimize distraction in the audience. Patrick argues that laptops should be disallowed in lectures. Again, since humans have one language-processing center, people will miss the presentation if they read emails or social media posts on their laptops. Obviously, requesting that laptops be closed is not practical for conference talks. (I use my laptop to take notes, which helps me follow the talks better.) In fact, conference talks have the most disadvantageous setup: the audience is tired from previous talks, they have laptops in front of them, and the room is not well lit. The only defense for conference talks is to make the content and presentation very interesting.

Try to use props. A prop is an actual object that you can touch and show, and it makes your presentation much more interesting because it creates a visceral reaction in the audience. I will try to work on this item. It may require some creativity for people working on distributed systems and cloud computing, but it is not totally inconceivable.

End with the contributions slide. Remind the audience what you argued, how you argued it, and why it matters. Patrick insists that you not finish your talk by saying thank you, as that is a weak move :-)

Patrick's talk was more about the mechanics of giving a talk. I had previously written a couple of posts about presenting your work, which focused more on how to frame your presentation and how to tell a story. They are a nice complement to Patrick's presentation.
