Friday, January 3, 2020

How to speak (by Patrick Winston)


On New Year's morning, I watched the above lecture by Patrick Winston, the late MIT professor, on "How To Speak". (Patrick Winston passed away in July 2019. May he rest in peace.) I had previously covered Winston's star idea; this talk provides a more general and comprehensive treatment of how to speak and present.

Patrick's talk contained several new gems for me. And here I thought I already knew a lot about presenting.

Start with an empowerment promise. Nobody cares about your talk, they care about what useful things they can learn from your talk. So make this about the audience, look at this from their perspective, and start by telling them what useful thing they will get out of this talk.

Differentiate between presentations that aim to expose (a job talk or a conference talk) and those that aim to inform (a lecture).

For an exposition talk, you need to get across your message in the first five minutes. You should explain the problem you solve and why that is an important problem, and then you should explain why your approach is new and effective. Then you give a preview of your contributions.

Be minimalistic in your slides. Cut almost all words on the slides, and use a small number of slides. Using a lot of text on slides means that people will be distracted trying to read them and won't be able to follow what you say. (Humans have a single language processing center.) This suggests that it may be useful to prepare two versions of your slides, one for live presentation and another to serve as handouts to post at the conference website or on your homepage.

Avoid laser pointers. Laser pointers are too distracting because you need to face the screen to point to things. Instead use arrows on slides, and explain/highlight them in your talk. I recently bought a Logitech Spotlight pointer, which solves the awkward pointing problem with laser pointers. It is accelerometer-based and it highlights on the laptop screen in software, so you don't have to turn your back to the audience to point to the screen.

As for presentations that aim to inform, Patrick argues that using chalk and blackboard is more appropriate. I agree that the blackboard is very useful when explaining a protocol. But the slides are still more convenient for the rest of the time. Again the rules about well prepared slides apply. It is also important to keep the pace slow, and keep things interactive with several strategic questions embedded in the slides.

Try to minimize distraction in the audience. Patrick argues that laptops should be disallowed in lectures. Again, since humans have one language processing center, they will miss the presentation if they read emails or social media posts on their laptops. Obviously requesting that laptops be closed is not practical for conference talks. (I use my laptop to take notes, which helps me follow the talks better.) In fact conference talks have the most disadvantageous setup. The audience is tired from previous talks in the conference, they have laptops in front of them, and the room is not well-lit. The only defense for conference talks is to make the content and presentation very interesting.

Try to use props. A prop is an actual object that you can touch and show, and it makes your presentation much more interesting because it creates a visceral reaction in the audience. I will try to work on this item. This may require some creativity for people working on distributed systems and cloud computing, but is not totally inconceivable.

End with the contributions slide. Remind the audience what you argued, how you argued it, and why it matters. Patrick insists that you do not finish your talk with saying thank you, as that is a weak move :-)

Patrick's talk was more about the mechanics of giving a talk. I had previously written a couple posts about presenting your work, which focused more on how to frame your presentation and how to tell a story. They are a nice complement to Patrick's presentation.

Wednesday, December 25, 2019

HotStuff: BFT Consensus in the Lens of Blockchain

This paper appeared in PODC 2019, and is by Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abraham. The paper presents HotStuff, a leader-based Byzantine fault-tolerant consensus protocol. HotStuff forms the basis of LibraBFT which is used in Facebook's Libra project.

HotStuff uses the partially synchronous model for liveness, but it is safe even under an asynchronous model. (This is also how Paxos operates, as a crash tolerant consensus protocol.) Once network communication becomes synchronous, HotStuff enables a correct leader to drive the protocol to consensus at the pace of actual (vs. maximum) round duration --a property called responsiveness. Another innovation in HotStuff is that it provides communication complexity that is linear (rather than quadratic) in the number of replicas. In other words, all-to-all broadcasts are replaced with only participant-to-leader and leader-to-participant communication, with rotating leaders. HotStuff is the first partially synchronous BFT replication protocol exhibiting these two properties combined.

HotStuff overview

HotStuff builds and improves upon PBFT. In PBFT, a stable leader can drive a consensus decision in two rounds of message exchanges. The first phase guarantees proposal uniqueness through the formation of a quorum certificate (QC) consisting of (n − f) votes. The second phase guarantees that the next leader can convince replicas to vote for a safe proposal. This requires the leader to relay information from (n−f) replicas, each reporting its own highest QC or vote. Unfortunately, the view-change algorithm that enables a new leader to collect information and propose it to replicas is complex, bug-prone, and incurs a significant communication penalty for even moderate system sizes.

This is where HotStuff's core technical contribution comes in. HotStuff solves a liveness problem with the view-change protocol, the hidden lock problem, by adding a lock-precursor phase. The resulting three-phase algorithm streamlines the protocol and achieves linear view-change costs, in contrast to generations of two-phase protocols that suffer from a quadratic communication bottleneck on leader replacement.

Ok, let's unpack this. The hidden lock problem occurs if a leader doesn't wait for the $\Delta$ expiration time of a round. Simply receiving N-F replies from participants (up to F of which may be Byzantine) is not sufficient to ensure that the leader gets to see the highest lock. This is a race condition: the highest lock value may be hidden in the other F honest nodes the leader didn't wait to hear from. (The hidden lock is not a problem in Paxos view-change because Paxos does not deal with Byzantine nodes, only with crash-prone nodes.) Such an impatient leader may propose a lower lock value than what is accepted, and this may lead to a liveness violation. In order not to wait for the maximum $\Delta$ expiration time of a round, HotStuff introduces another round, a precursor-lock round, before the actual lock round. This additional round solves the hidden lock problem: if 2F+1 participants accept the precursor lock, the leader will surely hear from at least one of them and learn the highest lock value proposed (not necessarily accepted), even when it only talks to N-F nodes and does not wait for $\Delta$ time. The below lecture by Ittai Abraham is very helpful for understanding this problem and the algorithm.
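To make the race concrete, here is a toy Python illustration (my own numbers, not from the paper): with n = 3f+1 = 4 replicas and f = 1, a leader that stops after the first n-f = 3 replies can miss the single honest replica holding the highest lock.

```python
# Toy illustration of the hidden-lock race (illustrative values and
# names, not the paper's notation). Replica r4 holds the highest lock,
# but the impatient leader proceeds after the first n - f replies.
n, f = 4, 1                                     # n = 3f + 1
locks = {"r1": 5, "r2": 5, "r3": 5, "r4": 9}    # r4 has the highest lock
heard = ["r1", "r2", "r3"]                      # the first n - f replies

highest_seen = max(locks[r] for r in heard)     # what the leader learns
highest_actual = max(locks.values())            # the true highest lock
# highest_seen < highest_actual: the highest lock stays hidden in r4
```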

This way, HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of. This simplifies the leader replacement protocol such that the cost for a new leader to drive the protocol to consensus is no greater than that for the current leader. As such, HotStuff enables pipelining of the phases, and supports frequent succession of leaders, which is very beneficial in the blockchain context. The idea is to change the view on every prepare phase, so each proposal has its own view.

Thanks to the pipelining/chaining of the phases and rotating of leaders, HotStuff achieves high throughput despite adding a third phase to the protocol. More concretely, the pipelining/chaining works as follows. The votes over a prepare phase are collected in a view by the leader into a genericQC. Then the genericQC is relayed to the leader of the next view, essentially delegating responsibility for the next phase, which would have been pre-commit, to the next leader. However, the next leader does not actually carry a pre-commit phase, but instead initiates a new prepare phase and adds its own proposal. This prepare phase for view v+1 simultaneously serves as the pre-commit phase for view v. The prepare phase for view v+2 simultaneously serves as the pre-commit phase for view v+1 and as the commit phase for view v. This is possible because all the phases have identical structure.
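The chained commit rule above can be sketched in a few lines of Python (the names and structure are mine, a simplification of the paper's pseudocode): a block commits once it is the start of a three-chain of blocks with consecutive views, where each block is certified by the next block's QC.

```python
# Hedged sketch of chained HotStuff's commit rule (my simplification,
# not the paper's actual pseudocode). Each block carries a QC for its
# parent, so forming a QC on b_new certifies the chain behind it.

class Block:
    def __init__(self, view, parent, payload):
        self.view = view        # view in which the block was proposed
        self.parent = parent    # the block certified by this block's QC
        self.payload = payload

def committed_block(b_new):
    """Return the block that becomes committed when b_new forms a QC,
    or None. b_new's QC certifies b2, whose QC certifies b1, whose QC
    certifies b0: a one-, two-, and three-chain respectively."""
    b2 = b_new.parent
    b1 = b2.parent if b2 else None
    b0 = b1.parent if b1 else None
    if b0 is None:
        return None
    # The three-chain must span consecutive views for the commit to fire.
    if b1.view == b0.view + 1 and b2.view == b0.view + 2:
        return b0               # b0 (and all its ancestors) commit
    return None
```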

Comparison with other BFT protocols

In sum, HotStuff achieves the following two properties with these improvements:
  • Linear View Change: After the global stabilization time (GST) of the partial synchrony model, any correct leader, once designated, sends only O(n) authenticators to drive a consensus decision. This includes the case where a leader is replaced. Consequently, the communication cost to reach consensus after GST is O($n^2$) authenticators in the worst case of cascading leader failures.
  • Optimistic Responsiveness: After GST, any correct leader, once designated, needs to wait just for the first n−f responses to guarantee that it can create a proposal that will make progress. This includes the case where a leader is replaced.

If you work on blockchain protocols, you may know that Tendermint and Casper also follow a simple leader regime. However, those systems are built around a synchronous core, wherein proposals are made in pre-determined intervals $\Delta$ that must accommodate the worst-case time it takes to propagate messages over a wide-area peer-to-peer gossip network, so they violate the optimistic responsiveness property. In contrast, in HotStuff the leader exchange incurs only the actual network delays, which are typically far smaller than $\Delta$ in practice.


As far as chaining and pipelining are concerned, the above figure provides an overview of the commit rules of some popular BFT consensus protocols. The commit rule in DLS is One-Chain, allowing a node to be committed only by its own leader. The commit rules in PBFT, Tendermint, and Casper are almost identical, and consist of Two-Chains. They differ in the mechanisms they introduce for liveness: PBFT has leader proofs of quadratic size (no Linearity), while Tendermint and Casper introduce a mandatory $\Delta$ delay before each leader proposal (no Optimistic Responsiveness). HotStuff uses a Three-Chain rule, and has a linear leader protocol without delay.

Tuesday, December 24, 2019

Cross-chain Deals and Adversarial Commerce

This paper appeared in VLDB'19 and is authored by Maurice Herlihy, Barbara Liskov, and Liuba Shrira.

How can autonomous, mutually-distrusting parties cooperate safely and effectively?

This question is very important for enabling commerce. The trading world has answered this question so far by relying on a trusted third party, and in the worst case on the government/rule-of-law to litigate against parties deviating from their contracts. With the prevalence of e-commerce and decentralization, this question has recently been considered in *trustless* settings by modern distributed data management systems.

Solving trustless multi-party cooperation when all the parties use the same blockchain is achievable via smart contracts, but solving the problem where the parties use different blockchains brings many additional challenges. Some of these challenges are familiar to us from classical distributed systems research on distributed transactions, such as how to combine multiple steps into a single atomic action, how to recover from failures, and how to synchronize concurrent access to data. But considering cross-chain multi-party collaboration in a trustless setting requires reconsidering/revising that work, because here the participants are autonomous and potentially adversarial.

Cross-chain deal 

The paper proposes the cross-chain deal as a new computational abstraction for structuring complex distributed exchanges in an adversarial setting. The paper starts by clarifying the distinctions between cross-chain deals, atomic transactions, and cross-chain swaps.

Cross-chain deals differ from atomic transactions, because
  • transactions perform complex distributed state changes, while deals simply exchange assets among parties
  • transactions use “all-or-nothing” semantics to preserve global invariants, however each autonomous party in a deal can decide independently whether it finds an outcome satisfactory
  • transactions assume crash failures, whereas deals assume byzantine faults
Cross-chain deals also differ from cross-chain swaps. In a cross-chain swap, each party sets up an unconditional transfer, so swaps are incapable of expressing standard financial transactions such as arbitrage and auctions.
  • Consider an arbitrage example where Alice pays Bob with coins she receives from Carol, and sells Carol a ticket she receives from Bob. Alice cannot commit to either transfer at the start of a swap protocol because she does not own those assets. 
  • Auctions also cannot be expressed as swaps because the auction’s outcome (reserve price exceeded, identity of winner) cannot be determined until all bids have been submitted.
In contrast to transactions and swaps, deals address the demands of adversarial commerce better:
  • In adversarial commerce each party decides for itself whether to participate in a particular interaction. 
  • Parties agree to follow a common protocol, an agreement that can be monitored, but not enforced. 
  • Correctness is local and selfish: all parties that follow the protocol end up “no worse off” than when they started, even in the presence of faulty or malicious behavior by an arbitrary number of other parties. 

The paper uses the following running example.  Bob decides to sell two coveted tickets to a hit play for 100 coins. Alice knows that Carol would be willing to pay 101 coins for those tickets, so Alice moves to broker a deal between Bob and Carol. Alice devises a cross-chain deal, to be executed by Alice, Bob, and Carol, and communicating through contracts running on various blockchains. If all goes as planned, all transfers take place, and if anything goes wrong (someone crashes or tries to cheat), no honest party should end up worse off.


The paper requires a cross-chain deal protocol to satisfy three properties:

Property 1: For every protocol execution, every compliant party ends up with an acceptable payoff.

Property 2: No asset belonging to a compliant party is escrowed forever.

Property 3: If all parties are compliant and willing to accept their proposed payoffs, then all transfers happen (all parties’ payoffs are All).

The paper describes two alternative protocols for implementing cross-chain deals
  • Timelock protocol, based on synchronous communication, is fully decentralized
  • CBC protocol, based on semi-synchronous communication, requires a globally shared ledger
Before discussing the two protocols, we first discuss the general framework for cross-chain deals.

How Cross-Chain Deals Work

An escrow is used for ensuring that a single asset cannot be transferred to different parties at the same time. Here is what happens when P places asset a in escrow during deal D:
Pre: Owns(P, a)
Post: Owns(D, a) and Owns_C(P, a) and Owns_A(P, a)

The precondition states that P can escrow a only if P owns a. The postcondition states that ownership of a is transferred from P to D (via the escrow contract), but P remains the owner of a in both the Commit and Abort outcomes.
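For intuition, here is a minimal Python sketch of the escrow contract's bookkeeping (the interface names are mine, not the paper's): the deal temporarily owns the asset, tentative transfers only update a pending owner, and resolution hands the asset to the tentative owner on commit, or back to the original owner on abort.

```python
# Hedged sketch of escrow bookkeeping (illustrative, not the paper's
# contract interface). One Escrow instance stands in for the per-chain
# escrow contracts.

class Escrow:
    def __init__(self):
        self.owner = {}      # asset -> current owner (party or deal id)
        self.escrowed = {}   # asset -> (deal, original owner, tentative owner)

    def escrow(self, party, deal, asset):
        assert self.owner[asset] == party            # Pre: Owns(P, a)
        self.owner[asset] = deal                     # Post: Owns(D, a)
        self.escrowed[asset] = (deal, party, party)

    def transfer(self, deal, asset, new_party):
        d, orig, _ = self.escrowed[asset]
        assert d == deal
        self.escrowed[asset] = (d, orig, new_party)  # tentative transfer only

    def resolve(self, deal, asset, commit):
        d, orig, tentative = self.escrowed.pop(asset)
        assert d == deal
        # Owns_C goes to the tentative owner; Owns_A reverts to the original.
        self.owner[asset] = tentative if commit else orig
```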

These are the phases of a cross-chain deal.
  • Clearing Phase: A market-clearing service discovers and broadcasts the participants, the proposed transfers, and possibly other deal-specific information.
  • Escrow Phase: Parties escrow their outgoing assets. E.g., Bob escrows his tickets and Carol her coins.
  • Transfer Phase: The parties perform the sequence of tentative ownership transfers according to the deal. E.g., Bob tentatively transfers the tickets to Alice, who subsequently transfers them to Carol.
  • Validation phase: Once the tentative transfers are complete, each party checks that the deal is the same as proposed by the (untrusted) market-clearing service, and that its incoming assets are properly escrowed (so they cannot be double-spent). E.g., Carol checks that the tickets to be transferred are escrowed, that the seats are (at least as good as) the ones agreed upon, and that she is not about to somehow overpay.
  • Commit phase: The parties vote on whether to make the tentative transfers permanent. If all parties vote to commit, the escrowed assets are transferred to their new owners, otherwise they are refunded to their original owners.
A tentative transfer commits if it becomes permanent, and it aborts if it is discarded. A deal commits if all its tentative transfers commit, and it aborts if all its tentative transfers abort.

(Remark: I think this formulation does not capture what happens if neither commit nor abort is satisfied for a deal, i.e., where some tentative transfers commit and some tentative transfers abort. For example, the tentative transfers may commit for Carol, but the tentative transfers for Bob may get aborted. Earlier, the paper said that "all or nothing" semantics is impossible to satisfy with adversarial participants: "it may be possible that some transfers commit and other abort provided that no compliant party is worse-off." But I can't see how the formulation above captures this.)

Timelock Protocol

The timelock protocol is the first protocol the paper discusses for implementing cross-chain deals. It is fully decentralized, but it assumes a synchronous network model where blockchain propagation time is known and bounded.

In the timelock protocol, escrowed assets are released if all parties vote to commit. Parties do not explicitly vote to abort. Timeouts are used for ensuring that escrowed assets are not locked up forever. Each escrow contract’s timeout for a party’s commit vote depends on the length of the path along which that vote was forwarded.

Phases of the deal:
  • Clearing: The market-clearing service broadcasts to all parties in the deal the deal identifier D, the list of parties plist, a commit-phase starting time t0 used to compute timeouts, and the timeout delay $\Delta$. Because t0 and $\Delta$ are used only to compute timeouts, their values do not affect normal execution times. If deals take minutes (or hours), then $\Delta$ could be measured in hours (or days).
  • Escrow: Each party places its outgoing assets in escrow through an escrow contract escrow(D,$D_{info}$,a) on that asset’s blockchain. 
  • Transfer: Party P transfers an asset (or assets) a tentatively owned by P to party Q by sending transfer(D,a,Q) to the escrow contract on the asset’s blockchain.
  • Validation: Each party examines its escrowed incoming assets to see if they represent an acceptable payoff and the deal information provided by the market-clearing service is correct. If so, the party votes to commit.
  • Commit: Each compliant party sends a commit vote to the escrow contract for each incoming asset. A party uses commit(D,v,p) to vote directly and to forward votes to the deal’s escrow contracts, where v is the voter and p is the path signature for v’s vote. For example, if Alice is forwarding Bob’s vote then v is Bob, and p contains first Bob’s signature, and then Alice’s signature. 
A contract releases the escrowed asset to the new owner(s) when it accepts a commit vote from every party. If the contract has not accepted a vote from every party by time $t0+N*\Delta$, where N is the number of parties, it will never accept the missing votes, so the contract times out and refunds its escrowed assets to the original owners.
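The timeout rule can be sketched as follows (my own naming and a simplification: real contracts would verify the signature path, which I elide). A vote forwarded along a path of length k must arrive by t0 + k*$\Delta$, and the escrow refunds once t0 + N*$\Delta$ passes without accepted votes from all parties.

```python
# Hedged sketch of the timelock timeout rule (illustrative, not the
# paper's contract code). A vote's deadline grows with the length of
# the signature path along which it was forwarded.

def vote_deadline(t0, delta, path_len):
    return t0 + path_len * delta

def accept_vote(now, t0, delta, path):
    """path: list of signers, first the voter, then each forwarder."""
    return now <= vote_deadline(t0, delta, len(path))

def escrow_outcome(votes_accepted, parties, now, t0, delta):
    if all(p in votes_accepted for p in parties):
        return "release"                  # commit votes from every party
    if now > t0 + len(parties) * delta:
        return "refund"                   # missing votes can never be accepted
    return "pending"
```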

This may violate "all-or-nothing" semantics, as the paper explains. Suppose that Bob acquires Alice's and Carol's votes on time and forwards them to claim the coins, but Alice and Carol are driven offline before they can forward Bob's vote to the ticket blockchain, so Bob ends up with both the coins and the tickets. In this case, technically, Alice and Carol have deviated from the protocol by not claiming their assets in time. (To alleviate this problem, $\Delta$ should be chosen large enough to make sustained denial-of-service attacks prohibitively expensive.)

Assuming synchrony assumptions hold, the paper proves the following for the timelock protocol.

Theorem 1. The timelock protocol satisfies safety.

Theorem 2. The timelock protocol satisfies weak liveness: no compliant party’s outgoing assets are locked up forever.

Theorem 3. The timelock protocol satisfies strong liveness.

CBC Protocol

The second protocol the paper proposes for implementing cross-chain deals is the CBC protocol. CBC does not assume fully synchronous communication as timelock does, so it is not susceptible to timing-related attacks. But in return it requires a globally shared certified blockchain (CBC) as a kind of shared log among the parties. The paper argues that this loss of decentralization is inevitable: no protocol that tolerates periods of asynchrony can be decentralized (following the FLP result).
  • If all parties are compliant, then if the deal commits (resp. aborts) at any asset’s blockchain, it must commit (resp. abort) at all of them.
  • Initially, the deal’s state is bivalent: both commit and abort are possible outcomes. But the deal’s state cannot remain bivalent forever, so it must be possible to reach a (bivalent) critical state where each party is about to take a decisive step.
  • A potentially decisive step forcing an eventual commit cannot take place at a different blockchain than a potentially decisive step forcing an eventual abort, because then it would be impossible to determine which happened first, hence which one was truly decisive. 
(Remark: Hmm... The requirement in the last bullet could be solved by using quorum intersection, as in Paxos. Paxos is also subject to the FLP result, but it still preserves safety in a fully asynchronous setup thanks to its use of quorums and ballot numbers for leaders. So I suppose instead of a shared CBC, it is possible (but cumbersome) to use a Byzantine Paxos protocol to coordinate the commit/abort decision across multiple chains. The question then becomes, is Paxos decentralized? Paxos uses a leader, but the leaders are exchangeable: when one leader is not making progress, another leader can safely emerge to conclude the consensus.)

OK, back to the CBC protocol. Instead of voting on individual assets, each party votes on the CBC on whether to commit or abort the entire deal. In the absence of full synchrony, since the parties cannot use timed escrow, they vote to abort if validation fails or if too much time has passed.

A party can extract a proof from the CBC that particular votes were recorded in a particular order. A party claiming an asset (or a refund) presents a proof of commit (or abort) to the contract managing that asset. The contract checks the proof’s validity and carries out the requested transfers if the proof is valid. A proof of commit proves that every party voted to commit the deal before any party voted to abort. A proof of abort proves that some party voted to abort before every party voted to commit.
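Here is a toy Python version of the proof check (my simplification: I model the CBC as an ordered list of vote records; real proofs would be certified ledger entries, which I elide). A deal commits only if every party's commit vote is recorded before any abort vote.

```python
# Hedged sketch of the CBC vote-ordering rule (illustrative, not the
# paper's proof format). The log's order is the CBC's recording order.

def deal_outcome(log, parties):
    """log: list of (party, vote) records in CBC order.
    Returns "commit", "abort", or "pending"."""
    committed = set()
    for party, vote in log:
        if vote == "abort":
            return "abort"        # an abort recorded before all commits
        committed.add(party)
        if committed >= set(parties):
            return "commit"       # every party voted commit first
    return "pending"
```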

Phases of the protocol
  • Clearing: The market-clearing service broadcasts a unique identifier D and a list of participating parties plist. If more than one startDeal for D is recorded on the CBC, the earliest is considered definitive.
  • Escrow: Each party places its outgoing assets in escrow via escrow(D,plist,h,a,...)
  • Transfer: Party P transfers an asset (or assets) a tentatively owned by P to party Q by sending transfer(D,a,Q) to the escrow contract on the asset’s blockchain.
  • Validation: Each party checks that its proposed payoff is acceptable and that assets are properly escrowed with the correct plist and h.
  • Commit: Each party X publishes either a commit or abort vote for D on the CBC via commit(D, h, X) or abort(D, h, X).
Safety is satisfied because compliant parties agree on whether a deal commits or aborts. Weak liveness is satisfied because any compliant party whose assets are locked up for too long will eventually vote to abort. Strong liveness is satisfied in periods when the network is synchronous, because every party votes to commit before any party votes to abort.

Discussion

I really enjoyed reading the paper. The problem the authors take on is very ambitious and important. I also found both protocols intriguing. The CBC protocol is simpler than the timelock protocol, as it uses a shared blockchain for coordination. In contrast to the timelock protocol, the CBC protocol tolerates asynchronous periods, and its cost and latency are better. Doing things in a completely decentralized way increases the costs, and some centralization can help cut costs quickly.

Both protocols do use an escrow, and the escrow is a bit of a crutch, and another entity/contract to pay as well. The paper says that "Escrow plays the role of classical concurrency control, ensuring that a single asset cannot be transferred to different parties at the same time." This is a bit of a blanket statement, and I wonder if there is indeed no other way to provide concurrency control across chains to ensure that an asset cannot be transferred to different parties at the same time.

Saturday, December 21, 2019

Decade in Review

We are entering 2020, and this is a good time to retrospect on and review the past decade through the lens of this blog, which started in late 2010.

I started posting regularly in September 2010. I wanted to get into the cloud computing domain, so I needed to accumulate background on cloud computing work. I decided that as I read papers on cloud computing, I would post a summary to this blog. I thought if I could explain what I learned from the papers in my own words, I would internalize those lessons better. And if others read those summaries and benefit, that is an extra plus.

Initially I reviewed and posted paper summaries on big data processing systems and NoSQL distributed databases to catch up on these areas. In the early 2010s, misrepresentation of the CAP theorem by NoSQL proponents was a problematic issue. I covered some papers that tried to clarify this issue. Throughout the years, I covered many papers that discussed the different consistency guarantees offered by distributed databases. Transactions in distributed databases also became a favorite topic in the blog. After reading papers on this topic, I dipped my toes into the distributed databases research area. Distributed databases are a very important and practical topic in the industry. A good geo-distributed database solves a lot of problems for big companies. Last year I was at Microsoft Azure Cosmos DB for my sabbatical and got to learn more about this domain.

Spanner's use of synchronized clocks in distributed databases was a big milestone. This work led us to think about hybrid logical clocks, and later hybrid vector clocks. These in turn led us to develop Retroscope for querying consistent cuts in distributed systems. Nowadays we are thinking about timely protocols, continuing this line of work.

Cloud computing has always been a big part of this blog. I discussed some datacenter networking papers, but I did not really get into that field. On the cloud computing topic, there was a lot of excitement about containers and microservices. Based on the problems discussed in that domain, I wrote an exploratory technical paper called "stabilization in the cloud". I think that is still an open and interesting problem. Recently serverless (function as a service) has become all the rage in cloud environments, and there have been several interesting papers on the topic.

I really got into Paxos in this last decade. I didn't expect to fall for Paxos protocols this hard. I had first encountered Paxos around 1999 in the distributed systems reading seminars I attended as a PhD student. I guess I liked it back then, but it didn't make a big impression on me. From 2007 to 2015, Paxos gradually became more and more popular and important in cloud computing. On this blog we had more than 20 Paxos posts in the last two years alone.

My students and I have a love-hate relationship with Paxos. We understand it very well and are among the top experts on the topic. But, unfortunately, academia lost some of its excitement about Paxos and a so-called "Paxos fatigue" developed. Because of this cold shoulder, we tell ourselves that perhaps we should be working on other things. On the other hand, Paxos is still transforming and ruling over large-scale distributed systems deployments in the industry (one recent example is Facebook's use of Paxos to build a scalable control plane). Paxos is one of the most impactful ideas in cloud computing stacks. And despite the reduced interest from academia, there is still a vast unexplored algorithm design space, and more work is needed to tailor Paxos for specific distributed systems deployments, topologies, and workloads. Striking evidence of this arrived in 2017 with the flexible quorums breakthrough, which came unanticipated almost 30 years after the Paxos protocol was first proposed. This further opened up the design space for customizing Paxos to different environments and workloads, and that space is yet to be fully explored. We hope we will be able to convince more researchers to care about and work on these problems. In any case, we can't seem to pull ourselves away from working on Paxos variants. They are a lot of fun. I am working on Paxos Unpacked now.
Ok, enough about Paxos...

In 2016, the machine learning field exploded and went mainstream. Distributed systems support for machine learning became a hot topic. I learned some machine learning by following online courses and reading papers. I really appreciated the neat mathematics and differential computation employed there. As far as developing distributed systems support for machine learning, we performed some surveys and thought about the topic for a while. But I gave up after some time, thinking that it would be hard to do principled algorithmic work here because the ML field works very close to the application, and the solutions are pretty application-specific. It turns out that I was too quick to judge, as several nice algorithmic and distributed systems results started coming out recently.

Blockchains got a lot of hype recently. It took me a long time to get into blockchains, even though the topic is very closely related to distributed consensus. When I finally offered a seminar on blockchains in Spring 2018, I started appreciating some of the good work done in this domain, and certain parts of the vision for decentralized computing. I love the premise of ICOs for democratizing the stock market and of smart contracts for enabling decentralized e-trade without any middleman (please hurry up, we need a decentralized search engine and big data analytics for this to work). Unfortunately the blockchain field has been perpetually overhyped, and this damages progress in the field. I think we have finally started to see the hype dying down and more solid work in the field reconvening.

MAD questions

What trends are brewing that we are missing? 
Bitcoin was released in 2009 and most of us missed it as we entered 2010. What are some trends that are brewing silently that we are missing now?

IoT and 3D printing have been seeing increasingly more interest. But there has not been any revolutionary breakthrough yet as far as I can see.

Differentiable programming seems interesting to me.

Quantum computing is also seeing some nice progress, but I don't know much about the area.

Cloud/datacenter computing came a long way in this last decade. I think the serverless model, dataflow architectures for analytics/transactions, and the use of RDMA are some promising trends. I am also happy to see increased adoption of formal methods and verification for distributed systems. But, maybe because I am embedded too deeply in the field, I am unable to clearly identify a big hit for the coming decade.

Wednesday, December 18, 2019

My Distributed Systems Seminar's reading list for Spring 2020

Below is the first draft list of papers I plan to discuss in my distributed systems seminar in the Spring semester. If you have some suggestions on some good/recent papers to cover, please let me know.

Paxos

Replication

Transactions/consistency

Formal methods


In order to run our reading/discussion seminars effectively, we follow the format described in this post.

Here are the links to our previous semester reading lists.

Tuesday, December 17, 2019

The Ben-Or decentralized consensus algorithm

In PODC 1983, Michael Ben-Or published a randomized distributed asynchronous consensus algorithm in a paper titled "Another advantage of free choice (Extended Abstract): Completely asynchronous agreement protocols".


After 15 years, a paper in 1998 provided a correctness proof of Ben-Or's algorithm. Above is the pseudocode for Ben-Or from that paper. From the pseudocode, it looks like Ben-Or is a very simple algorithm, but in reality its behavior is not easy to understand and appreciate. But fret not: TLA+ modeling and model checking help a lot in understanding the behavior of the Ben-Or algorithm. It is fulfilling when you finally understand how the algorithm works, how safety is always preserved, and how progress is achieved eventually, probabilistically.

I had assigned modeling of Ben-Or as the TLA+ project for my distributed systems class. Here I share my PlusCal modeling of Ben-Or and discuss how this model helps us understand the algorithm better. Here is the link to the full model on GitHub, if you'd like to play with it.

Outline of the algorithm 


This is the top level model of each process.

Ben-Or is a decentralized consensus algorithm; it is not a leader-based algorithm. In the absence of a leader for tie-breaking, Ben-Or solves only binary-input consensus, rather than arbitrary-input-value consensus. Even then, it needs to employ randomization for achieving progress. (I had discussed the pros and cons of leader-based versus leaderless consensus algorithms earlier.)

Ben-Or has rounds but, unlike Paxos, the rounds are not client-restricted and do not preempt each other. The rounds are common to all nodes, and they are traversed in a lock-step fashion by N-F nodes in an asynchronous manner, where N is the total number of nodes and F is the upper bound on the number of faulty nodes. Recently, threshold clock synchronization has tried to make such a round concept more systematic.

Each round has two phases. In the first phase of a round, each node tries to identify a value supported by a majority, in order to propose that value for the second phase. In the second phase (the decision/ratification phase), a node finalizes its decision if it finds the same value proposed by at least F+1 nodes. Upon failure to reach a decision in the current round, Ben-Or makes some nodes change their votes before the next round. This helps jolt the system toward a decision, so that the system (eventually, probabilistically) converges to a consensus value in one of the upcoming rounds.
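To make these two phases concrete, here is a simplified Python simulation of the round structure (my own sketch, not the PlusCal model: it assumes no crashes, so every node sees all N messages, but it keeps the majority and F+1 thresholds):

```python
import random

def ben_or_round(values, f, rng):
    """One lock-step Ben-Or round, simplified so every node sees all n
    messages (no crashes). Returns (next_values, decisions), where
    decisions[i] is 0/1 if node i decided this round, else None."""
    n = len(values)
    # Phase 1: propose the strict-majority value if one exists, else -1.
    if values.count(1) > n // 2:
        p2 = [1] * n
    elif values.count(0) > n // 2:
        p2 = [0] * n
    else:
        p2 = [-1] * n
    next_values, decisions = [], []
    for _ in range(n):
        # Phase 2, case 1: decide a value proposed by at least f+1 nodes.
        decided = next((v for v in (0, 1) if p2.count(v) >= f + 1), None)
        if decided is not None:
            next_values.append(decided)
        # Case 2: adopt any proposed 0/1 value as next round's p1v.
        elif any(v in (0, 1) for v in p2):
            next_values.append(next(v for v in p2 if v in (0, 1)))
        # Case 3: last resort, flip a coin for next round's p1v.
        else:
            next_values.append(rng.choice([0, 1]))
        decisions.append(decided)
    return next_values, decisions

rng = random.Random(7)
values, decisions, rounds = [0, 1, 1, 1], [None] * 4, 0
while None in decisions and rounds < 20:
    values, new = ben_or_round(values, f=1, rng=rng)
    decisions = [d if d is not None else nd for d, nd in zip(decisions, new)]
    rounds += 1
assert decisions == [1, 1, 1, 1] and rounds == 1
```

With input <<0,1,1,1>> and every node seeing all four values, 1 has a majority, so all nodes propose and decide 1 in the first round; the interesting behavior appears once nodes see only N-F values, as in the model below.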

The model uses the following process-local variables. p1v denotes the value the node broadcasts in phase 1 of a round, and p2v denotes the value it broadcasts in phase 2 of a round. The decided variable at each node stores the final/terminal decision the node makes for consensus. Initially decided=-1, and only when the process decides is decided set to 0 or 1. These variables are initialized as p1v=INPUT[self], p2v=-1, decided=-1. In the very first round, the p1v values come from the INPUT value of each node; in subsequent rounds, p1v is assigned as a function of the p2v values in the EvalP2 function.

We use MAXROUND to bound the number of rounds a node can try, to keep model checking feasible; e.g., MAXROUND=3 or 4 suffices to get the properties tested. We use INPUT to give initial values to nodes. This is binary consensus, so we use 0 or 1 as possible initial values. When we run our model, in the model overview pane, we can give values to these constants as follows: F <- 1, N <- 4, INPUT <- <<0,1,1,1>>, MAXROUND <- 3.

Phase1 and Phase2 actions


We model message passing among nodes using a shared message board. This is simply a set defined as a global variable. We use p1Msg as the phase-1 message board, and p2Msg as the phase-2 message board. Each node broadcasts its p1v or p2v value by adding it to the corresponding message board via a simple set union. For example, p1Msg := p1Msg \union {<<self, r, p1v>>} adds p1v to the phase-1 message board. This amounts to broadcasting the p1v value of the node, and any other node can, on its own schedule, read from the p1Msg set which messages have been sent for a given round r. This reading is done via set filtering. SentP1Msgs(r) == {m \in p1Msg: m[2]=r} returns all messages broadcast for round r to the p1Msg board, and SentP1MsgsV(r,a) == {m \in p1Msg: m[2]=r /\ m[3]=a} returns all messages broadcast with round number r and value a.
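As a hypothetical Python analog of this message-board idiom (illustration only; the function names mirror the TLA+ operators, and Python tuples are 0-indexed where TLA+ sequences are 1-indexed):

```python
# Shared phase-1 message board, modeled as a global set of tuples.
p1Msg = set()

def broadcast_p1(sender, rnd, value):
    # p1Msg := p1Msg \union {<<self, r, p1v>>}
    p1Msg.add((sender, rnd, value))

def sent_p1_msgs(rnd):
    # SentP1Msgs(r) == {m \in p1Msg : m[2] = r}
    return {m for m in p1Msg if m[1] == rnd}

def sent_p1_msgs_v(rnd, a):
    # SentP1MsgsV(r, a) == {m \in p1Msg : m[2] = r /\ m[3] = a}
    return {m for m in p1Msg if m[1] == rnd and m[2] == a}

broadcast_p1(1, 0, 1)
broadcast_p1(2, 0, 0)
broadcast_p1(3, 0, 1)
assert len(sent_p1_msgs(0)) == 3        # three messages sent in round 0
assert len(sent_p1_msgs_v(0, 1)) == 2   # two of them carry value 1
```

Because broadcasting is a set union and reading is set filtering, a slow node can read the board long after the messages were added, which captures the asynchrony of the network.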

In phase 1, each node broadcasts its p1v to p1Msg, and then waits until there are at least N-F phase-1 values shared in the channel for round r. This ensures that the nodes traverse the rounds in a lock-step fashion. Among the p1v values retrieved from the channel for round r, if the node sees a majority value $a \in \{0,1\}$, it adopts that value as its p2v value to propose in phase 2 of this round. If the node does not see a majority value (which does not imply that no majority value is present, because each node sees a different subset of the values), it adopts the default $p2v=-1$ value for proposing in phase 2, meaning that it saw neither 0 nor 1 supported by a majority in the phase-1 exchange.

Note that, since 0 and 1 cannot both be majority values, the set of p2v values in a round can be {-1}, {-1,0}, {-1,1}, {0}, or {1}. It is not possible to have {-1,0,1} or {0,1} as the p2v value set, because if either 0 or 1 is seen in the majority, the other value cannot also be in the majority.
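We can sanity-check this claim outside the model with a brute-force Python enumeration (my own sketch; `possible_p2v_sets` is a hypothetical helper, not part of the PlusCal model) over every binary input vector and every (N-F)-subset a node might receive:

```python
from itertools import combinations, product

def possible_p2v_sets(n, f):
    """Enumerate which sets of p2v values can coexist in one round."""
    results = set()
    for values in product([0, 1], repeat=n):
        p2vs = set()
        for sample in combinations(values, n - f):
            if sample.count(1) > n // 2:
                p2vs.add(1)      # this node saw 1 in the majority
            elif sample.count(0) > n // 2:
                p2vs.add(0)      # this node saw 0 in the majority
            else:
                p2vs.add(-1)     # no majority seen: the skip value
        results.add(frozenset(p2vs))
    return results

sets = possible_p2v_sets(4, 1)
assert frozenset({0, 1}) not in sets       # 0 and 1 never coexist
assert frozenset({-1, 0, 1}) not in sets
assert frozenset({-1, 1}) in sets          # e.g., INPUT = <<0,1,1,1>>
```

This is essentially a miniature model check of the single-round claim, done by exhaustive enumeration instead of TLC.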

In phase 2, each node broadcasts its p2v to p2Msg, and then waits until there are at least N-F phase-2 values shared in the channel for round r. Among the p2v values retrieved from the channel for round r, if the node sees a value $a \in \{0,1\}$ sent by at least F+1 nodes, it adopts that value as its decision value (case 1). Else, if the node sees a 0 or 1 value sent by another node, it adopts this value as its p1v value for the upcoming round (case 2). This is an optimistic attempt at accumulating traction for a value that at least one node witnessed as a majority among the p1v values in this round. Otherwise, as the last resort, the node adopts a random 0 or 1 value as its p1v value for the next round (case 3). This serves as the random jolt to tilt the system toward a value that can be witnessed as a majority by some nodes, so that case 1 or case 2 applies in an upcoming round.

Why does the algorithm require seeing at least F+1 nodes with the same p2v value before that value is chosen for decision? Suppose the threshold were lower: if F or fewer nodes see a majority p1v value, say 1, and prematurely decide on this value, it is possible that they all crash after this decision. The other nodes would be following case 2 and case 3 of phase 2, and it is quite possible that in the next round 0 is seen as the majority value by some node, and the system then converges to decide on 0, violating the agreement property. While progress is only probabilistic in Ben-Or, safety should be, and is, always satisfied.

Once a node sees at least F+1 nodes with the same p2v value, say 1, it decides. And it is guaranteed that in this phase 2, the other available nodes also see at least one node with p2v=1, even if F of the nodes that proposed p2v=1 crash. (They wait for $N-F$ p2v values, remember.) This means that, in the worst case, in the next round all available processes will see a majority of p1v=1 among the N-F phase-1 messages they receive, and decide in phase 2 of that round.
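The counting argument behind this guarantee can be checked directly: a node waiting for N-F of the N phase-2 values misses at most F of them, so F+1 copies of the same value cannot all be missed. A tiny illustrative Python check (not part of the model):

```python
from itertools import combinations

# With n=4, f=1, suppose f+1 of the n phase-2 values are 1. Every
# (n-f)-subset a node can end up waiting for misses at most f values,
# so the f+1 ones cannot all be among the missed values.
n, f = 4, 1
p2_values = [1] * (f + 1) + [-1] * (n - f - 1)   # e.g., [1, 1, -1, -1]
assert all(1 in sample for sample in combinations(p2_values, n - f))
```

This is the pigeonhole step that ties the decision threshold (F+1) to the waiting threshold (N-F).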

Okay, this brings us to the properties we check on this model.

Safety and Progress properties


Agreement says that for any two nodes j and k that have decided, their decision values must be the same. Agreement should always be satisfied, even when F is more than N/2. Of course, if $F>N/2$, nodes will never be able to decide, because in phase 1 a majority is required for proposing 0 or 1 as the p2v value.

For testing progress for F<N/2, we use the Progress property, which says that eventually all processes decide something. If we start with the same p1v value at all nodes, progress is guaranteed for N>2F. In that case, all nodes will see that value as the majority value and propose it as the p2v value in round 1, and they all decide in phase 2 of that round.

But what if the initial p1v values are a mixture of 0s and 1s? The nodes may not decide, because it may not be possible for any available node to observe a majority of 0 or 1 from its sample of N-F values, and the default/skip value $p2v=-1$ will be proposed for phase 2. When such ambiguity is possible, the decision can be postponed forever by choosing the worst possible "random" assignments to the nodes for the next round. This is of course no longer a uniformly random assignment of 0s and 1s to p1v, but a pathological/malicious value assignment by the model checker to show how Progress can be violated.

For example, for N=4 and F=1, INPUT=<<0,1,1,1>> will violate the Progress property. The model checker will hide one node's p1v=1 value from the others, and since the other nodes would not see the value 1 in the majority (which is 3 for N=4), they will all propose p2v=-1. The model checker will also find the "random" assignments that extend this indecision by not giving a majority to any value. (On the other hand, for F=0 in this setup, the majority value 1 will be seen by all nodes, and the system will decide in one round.)
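That adversarial round can be traced by hand; the following Python snippet (just a worked trace of the counterexample, my reconstruction rather than model-checker output) checks the arithmetic:

```python
# Progress violation trace for N=4, F=1, INPUT = <<0,1,1,1>>.
n, f = 4, 1
# Phase 1: the adversary hides one node's p1v=1, so every node receives
# the sample (0, 1, 1). Neither value reaches the majority threshold of 3:
sample = (0, 1, 1)
assert sample.count(1) <= n // 2 and sample.count(0) <= n // 2
# So every node proposes the skip value p2v = -1 ...
p2 = [-1] * n
# ... and no 0/1 value is proposed by f+1 = 2 nodes, so nobody decides:
assert all(p2.count(v) < f + 1 for v in (0, 1))
```

The adversarial "coin flips" can then restore a mixed p1v vector, and the same indecision repeats in every round.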

The pathological runs break the Progress property, but how do we check that, under a mixed input set, some runs do terminate? To this end, we write a safety/invariant property called BaitProgress, which claims that it is not possible for all processes to decide, and watch the model checker come up with a counterexample showing how this property is violated, which implies that, yes, consensus/progress is achieved in some runs.

If N=4 and INPUT=<<0,1,1,1>>, is it possible for 0 to be decided in a run? To test this, we write a safety property called MinorityReport, which claims that it is not possible for all the nodes to decide on 0 as the consensus value. The model checker will try to prove us wrong by producing a counterexample. If you check this for F=1, you will find that, yes, MinorityReport is violated, meaning there exists an execution where all nodes decide on 0 as the consensus value, even though it is clearly in the minority in the initial input set.
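One such execution can be traced by hand. The following snippet is a worked trace under the assumption that the adversary hides a "1" from every node in round 1 and then makes every coin flip come up 0 (my reconstruction, not TLC output):

```python
# MinorityReport violation trace for N=4, F=1, INPUT = <<0,1,1,1>>.
n, f = 4, 1
# Round 1: each node sees (0,1,1), finds no majority, proposes p2v=-1,
# nobody decides, and every node falls through to the coin flip.
coin_flips = [0, 0, 0, 0]          # the adversarial "random" choices
# Round 2, phase 1: now every node sees 0 in the majority:
assert coin_flips.count(0) > n // 2
# Round 2, phase 2: all nodes propose p2v=0, which clears the f+1
# decision threshold, so every node decides 0.
p2 = [0] * n
assert p2.count(0) >= f + 1
```

So the minority input value wins, which is perfectly legal: consensus only requires that the decided value was some node's input.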

MAD questions

How does this compare with Texel, another decentralized consensus protocol?

Texel is also a leaderless binary asynchronous consensus protocol. But Texel does not use rounds; instead it uses consistent cuts to evaluate whether there is a consensus value that is safe to decide on. Texel assumes N>3F+1, so that a value supported by F+1 nodes (which is a majority of 2F+1) is visible to any process even after F nodes crash (or otherwise become unavailable). The problem is that there may be two different supported values, and this violates progress. Moreover, there is a risk that progress is not guaranteed even when all input values are the same. When the consistent-cut reads of nodes conflict with each other (this reading is called the experimentation phase, which is somewhat similar to phase 1 in Ben-Or), the read is aborted and restarted from scratch. So it is possible that the decision is postponed forever by conflicting experimentation phases, even when all input values are the same.

As I discussed earlier, binary consensus is still useful for blockchain transactions. A well-formed client should avoid double spending and would not propose two different transactions with the same UTXO, and therefore its transactions are guaranteed liveness. However, liveness is not guaranteed (though safety still holds) for conflicting double-spending transactions submitted by Byzantine clients. In this case, the liveness violation is a feature, not a bug.

Avalanche extended Texel's leaderless consensus approach in terms of scalability and quick finality (by using sampling and metastability), and applied the resulting decentralized algorithm in the blockchain domain. I assume a similar transformation is possible for extending Ben-Or to a sampling-based solution. Even though Ben-Or is round-based, unlike Texel, I think via sampling and momentum we can avoid the requirement of rounds, similar to how Avalanche avoided the requirement of consistent reads in Texel.

Tuesday, December 10, 2019

How to ask better questions

This is a followup to my "Master Your Questioning Skills" post. There I gave some suggestions about how to ask better questions, but didn't engage with the topic directly.
Coming up with questions is not difficult, once you get out of your default mode and give yourself the license to indulge in the questioning frame of mind.
By calling these questions MAD questions, I gave myself the license/protection to go wild, uninhibited by traditional assumptions/constraints. This lowered the bar and let me approach each topic with a beginner's mind, as an outsider. This made it easier to ask more original questions. By detaching from and discarding the baggage of tradition, you can start working from first principles.
To get to the good questions, get the bad questions over with. Bad questions lead to good questions. So roll with them and exhaust them to get to the good questions.
A good question gives you a perspective change that is productive. It opens a new area to explore. It creates new relations/connections.

Then I asked: "Is it possible to teach how to ask great questions?"
Answers are teachable, so we teach answers. But we don't teach how to question. In kindergarten and primary school, kids are already pretty good at questioning, but by middle school most kids stop asking questions. I don't know... Maybe students who ask questions come across as hard to manage and contrarian, and get discouraged by teachers and adults. We (academics) try to teach questioning during the PhD by way of example, via apprenticeship. I am not aware of any organized/formal way that asking better questions is taught. There is no course or good book on it, as far as I can see.

Here I try again to provide one more method/approach that can enable asking better questions.

Of course, the best I can offer on this is my subjective experience. The premise here is that "I ask good questions". I think I do. And others tell me this as well. There is a chance I have an inflated view of my ability to ask questions, but I think I am above average. (Dunning-Kruger effect, I know.)

Anyways, here is my working theory about how I got better at asking questions. It is because... I always have working theories about everything. Sometimes these are analogies about how one thing looks somewhat like another thing. Often they are just educated guesses based on the models of the world I have in mind.

I have working theories about everything because I constantly try to make sense of things in advance, with the limited information available to me. Some of these working theories contradict each other, yet I am OK with keeping them clustered together in my mind and living with them for a long time. I came to be OK with carrying conflicting and incomplete theories about everything. I never fully believe any of my theories. Sometimes I find myself defending both sides of a debate, to the surprise of my students. One of my PhD students once told me: "But you once said this about this topic", and I told him, "I say a lot of things, don't take me too seriously." I think the cardinal sin is being too certain about anything.

As I cultivate these theories, I keep refining them and cutting out the incorrect models. I naturally want to test which ones are closer to the truth. So I come up with questions to test them. These questions turn out to be deep questions that cut to the heart of the matter.

These theories help immensely because, as the book "Range" explains:
Struggling to generate an answer on your own, even a wrong one, enhances subsequent learning. Socrates was apparently on to something when he forced pupils to generate answers rather than bestowing them. It requires the learner to intentionally sacrifice current performance for future benefit.

Metcalfe and colleagues have repeatedly demonstrated a “hypercorrection effect.” The more confident a learner is of their wrong answer, the better the information sticks when they subsequently learn the right answer. Tolerating big mistakes can create the best learning opportunities.

I think my mom was influential in my cultivating this working-theory state of mind. I wrote about this earlier. I think most children keep working models about things they don't understand. Probably this is why small children keep asking so many questions. And parents laugh hard when the children's models of things are far removed from reality. But this is no laughing matter. We should be encouraging them to have more working models about everything, and to keep testing and refining them.
