Wednesday, December 25, 2019

HotStuff: BFT Consensus in the Lens of Blockchain

This paper appeared in PODC 2019, and is by Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abraham. The paper presents HotStuff, a leader-based Byzantine fault-tolerant consensus protocol. HotStuff forms the basis of LibraBFT which is used in Facebook's Libra project.

HotStuff uses the partially synchronous model for liveness, but it is safe even under an asynchronous model. (This is also how Paxos operates, as a crash tolerant consensus protocol.) Once network communication becomes synchronous, HotStuff enables a correct leader to drive the protocol to consensus at the pace of actual (vs. maximum) round duration --a property called responsiveness. Another innovation in HotStuff is that it provides communication complexity that is linear (rather than quadratic) in the number of replicas. In other words, all-to-all broadcasts are replaced with only participant-to-leader and leader-to-participant communication, with rotating leaders. HotStuff is the first partially synchronous BFT replication protocol exhibiting these two properties combined.

HotStuff overview

HotStuff builds and improves upon PBFT. In PBFT, a stable leader can drive a consensus decision in two rounds of message exchanges. The first phase guarantees proposal uniqueness through the formation of a quorum certificate (QC) consisting of (n − f) votes. The second phase guarantees that the next leader can convince replicas to vote for a safe proposal. This requires the leader to relay information from (n−f) replicas, each reporting its own highest QC or vote. Unfortunately, the view-change algorithm that enables a new leader to collect information and propose it to replicas is complex, bug-prone, and incurs a significant communication penalty for even moderate system sizes.

This is where HotStuff's core technical contribution comes. HotStuff solves a liveness problem, the hidden lock problem, with the view change protocol by adding a lock-precursor phase. The resulting three phase algorithm helps to streamline the protocol and achieve linear view-change costs in contrast to generations of two-phase protocols which suffer from a quadratic communication bottleneck on leader replacement.

Ok, let's unpack this. The hidden lock problem occurs, if a leader doesn't wait for the $\Delta$ expiration time of a round. Simply receiving N-F replies from participants (up to F of which may be Byzantine) is not sufficient to ensure that the leader gets to see the highest lock. This is a race condition problem, the highest lock value may be hidden in the other F honest nodes which the leader didn't wait to hear from. (The hidden lock is not a problem in Paxos view-change because Paxos does not deal with Byzantine nodes, but only with crash-prone nodes. See the remark after this paragraph.) Such an impatient leader may propose a lower lock value than what is accepted and this may lead to a liveness violation. In order not to wait the maximum $\Delta$ expiration time of a round, HotStuff introduces another round, a precursor-lock round, before the actual lock round. This additional precursor-lock round solves the hidden lock problem, because if 2F+1 participants accept the pre-cursor lock, the leader will surely hear from them and learn the highest lock value proposed (not necessarily accepted), when it only talks to N-F nodes without needing to wait for $\Delta$ time. The below lecture by Ittai Abraham is very helpful for understanding this problem and the algorithm.

(Remark. Let me elaborate on why the hidden lock problem is an issue in BFT but not in Paxos. In Paxos, if a node says I previously accepted this value for this round, we trust that node and use that value. In a BFT protocol, we cannot trust a validator node. So we look for the threshold-signed QC (from N-F) nodes that says that this value has indeed been witnessed. Unfortunately, that witnessed QC may not be available until after another round: the leader needs to collect the thresholded signature for the QC, and rebroadcast it back. It may be that one node that receives that rebroadcast may be outside the N-F contacted by a new leader. And this is why we need the precursor-lock round.)

As explained above, HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of. This simplifies the leader replacement  protocol such that the costs for a new leader to drive the protocol to consensus is no greater than that for the current leader. As such, HotStuff enables pipelining of the phases, and supports frequent succession of leaders, which is very beneficial in the blockchain context. The idea is to change the view on every prepare phase, so each proposal has its own view.

Thanks to the pipelining/chaining of the phases and rotating of leaders, HotStuff achieves high throughput despite adding a third phase to the protocol. More concretely, the pipelining/chaining works as follows. The votes over a prepare phase are collected in a view by the leader into a genericQC. Then the genericQC is relayed to the leader of the next view, essentially delegating responsibility for the next phase, which would have been pre-commit, to the next leader. However, the next leader does not actually carry a pre-commit phase, but instead initiates a new prepare phase and adds its own proposal. This prepare phase for view v+1 simultaneously serves as the pre-commit phase for view v. The prepare phase for view v+2 simultaneously serves as the pre-commit phase for view v+1 and as the commit phase for view v. This is possible because all the phases have identical structure.

Comparison with other BFT protocols

In sum, HotStuff achieves the following two properties with these improvements:
  • Linear View Change: After global stabilization time (GST) of the partial synchrony model, any correct leader, once designated, sends only O(n) authenticators to drive a consensus decision. This includes the case where a leader is replaced. Consequently, communication costs to reach consensus after GST is O($n^2$) authenticators in the worst case of cascading leader failures.
  • Optimistic Responsiveness: After GST, any correct leader, once designated, needs to wait just for the first n−f responses to guarantee that it can create a proposal that will make progress. This includes the case where a leader is replaced.

If you work on blockchain protocols, you may know that Tendermint and Casper also follow a simple leader regime. However, those systems are built around a synchronous core, wherein proposals are made in pre-determined intervals $\Delta$ that must accommodate the worst-case time it takes to propagate messages over a wide-area peer-to-peer gossip network, so they violate the optimistic responsiveness property. In contrast, in HotStuff the leader exchange incurs only the actual network delays, which are typically far smaller than $\Delta$ in practice.

As far as the chaining and pipelining is concerned, the above figure provides an overview of the commit rules of some popular BFT consensus protocols. The commit rule in DLS is One-Chain, allowing a node to be committed only by its own leader. The commit rules in PBFT, Tendermint and Casper are almost identical, and consist of Two-Chains. They differ in the mechanisms they introduce for liveness, PBFT has leader proofs of quadratic size (no Linearity), Tendermint and Casper introduce a mandatory $\Delta$ delay before each leader proposal (no Optimistic Responsiveness). HotStuff uses a Three-Chain rule, and has a linear leader protocol without delay.

Tuesday, December 24, 2019

Cross-chain Deals and Adversarial Commerce

This paper appeared in VLDB'19 and is authored by Maurice Herlihy, Barbara Liskov, and Liuba Shrira.

How can autonomous, mutually-distrusting parties cooperate safely and effectively?

This question is very important for enabling commerce. The trading world answered this question so far by relying on a trusted third party, and in the worst case, on the government/rule-of-law to litigate parties deviating from their contracts. With prevalence of e-commerce and decentralization, this question is recently  considered in *trustless* settings by modern distributed data management systems.

Solving the trustless multi-party cooperation when all the parties use the same blockchain is achievable via smartcontracts, but solving the problem where the parties use different blockchains bring many additional challenges. Some of these challenges are familiar to us from the classical distributed systems research on distributed transactions, such as how to combine multiple steps into a single atomic action, how to recover from failures, and how to synchronize concurrent access to data. But considering the cross-chain multi-party collaboration in a trustless setting requires reconsidering/revising that work, because here the participants are autonomous and potentially adversarial.

Cross-chain deal 

The paper proposes the cross-chain deal as a new computational abstraction for structuring complex distributed exchanges in an adversarial setting. The paper starts by clarifying the distinctions between cross-chain deals with atomic transactions, and cross-chain swaps.

Cross-chain deals differ from atomic transactions, because
  • transactions perform complex distributed state changes, while deals simply exchange assets among parties
  • transactions use “all-or-nothing” semantics to preserve global invariants, however each autonomous party in a deal can decide independently whether it finds an outcome satisfactory
  • transactions assume crash failures, whereas deals assume byzantine faults
Cross-chain deals also differ from cross-chain swaps. In cross-chain swaps, each party sets up an unconditional transfer, thus they are incapable of expressing standard financial transactions such as arbitrage and auctions.
  • Consider an arbitrage example where Alice pays Bob with coins she receives from Carol, and sells Carol a ticket she receives from Bob. Alice cannot commit to either transfer at the start of a swap protocol because she does not own those assets. 
  • Auctions also cannot be expressed as swaps because the auction’s outcome (reserve price exceeded, identity of winner) cannot be determined until all bids have been submitted.
In contrast to transactions and swaps, deals address the demands of adversarial commerce better:
  • In adversarial commerce each party decides for itself whether to participate in a particular interaction. 
  • Parties agree to follow a common protocol, an agreement that can be monitored, but not enforced. 
  • Correctness is local and selfish: all parties that follow the protocol end up “no worse off” than when they started, even in the presence of faulty or malicious behavior by an arbitrary number of other parties. 

The paper uses the following running example.  Bob decides to sell two coveted tickets to a hit play for 100 coins. Alice knows that Carol would be willing to pay 101 coins for those tickets, so Alice moves to broker a deal between Bob and Carol. Alice devises a cross-chain deal, to be executed by Alice, Bob, and Carol, and communicating through contracts running on various blockchains. If all goes as planned, all transfers take place, and if anything goes wrong (someone crashes or tries to cheat), no honest party should end up worse off.

Property 1: For every protocol execution, every compliant party ends up with an acceptable payoff.

Property 2: No asset belonging to a compliant party is escrowed forever.

Property 3: If all parties are compliant and willing to accept their proposed payoffs, then all transfers happen (all parties’ payoffs are All).

The paper describes two alternative protocols for implementing cross-chain deals
  • Timelock protocol, based on synchronous communication, is fully decentralized
  • CBC protocol, based on semi-synchronous communication, requires a globally shared ledger
Before discussing the two protocols, we first discuss the general framework for cross-chain deals.

How Cross-Chain Deals Work

An escrow is used for ensuring that a single asset cannot be transferred to different parties at the same time. Here is what happens when P places a in escrow during deal D:
Pre: Owns(P, a)
Post: Owns(D, a) and Owns_C (P, a) and Owns_A(P, a)

The precondition states that P can escrow A only if P owns a. The postcondition states that ownership of a is transferred from P to D (via the escrow contract), but P remains the owner of a in both Commit and Abort.

These are the phases of a cross-chain deal.
  • Clearing Phase: A market-clearing service discovers and broadcasts the participants, the proposed transfers, and pos- sibly other deal-specific information.
  • Escrow Phase: Parties escrow their outgoing assets. E.g., Bob escrows his tickets and Carol her coins.
  • Transfer Phase: The parties perform the sequence of tentative ownership transfers according to the deal. E.g., Bob tentatively transfers the tickets to Alice, who subsequently transfers them to Carol.
  • Validation phase: Once the tentative transfers are complete, each party checks that the deal is the same as proposed by the (untrusted) market-clearing service, and that its incoming assets are properly escrowed (so they cannot be double-spent). E.g., Carol checks that the tickets to be transferred are escrowed, that the seats are (at least as good as) the ones agreed upon, and that she is not about to somehow overpay.
  • Commit phase: The parties vote on whether to make the tentative transfers permanent. If all parties vote to commit, the escrowed assets are transferred to their new owners, otherwise they are refunded to their original owners.
A tentative transfer commits if it becomes permanent, and it aborts if it is discarded. A deal commits if all its tentative transfers commit, and it aborts if all its tentative transfers abort.

(Remark: I think this formulation does not capture what happens if neither commit or abort gets satisfied for a deal, i.e, where some tentative transfers commit and some tentative transfers abort. For example, tentative transfers may commit for Carol, but the tentative transfers for Bob may get aborted. 
In the beginning the paper said that "all or nothing semantics" is impossible to satisfy with adversarial participants: "it may be possible that some transfers commit and other abort provided that no compliant party is worse-off." But I can't see how the formulation above captures this.)

Timelock Protocol

Timelock protocol is the first protocol paper discusses for implementing cross-chain deals. It is fully decentralized but it assumes a synchronous network model where blockchain propagation time is known and bounded.

In the timelock protocol, escrowed assets are released if all parties vote to commit. Parties do not explicitly vote to abort. Timeouts are used for ensuring that escrowed assets are not locked up forever. Each escrow contract’s timeout for a party’s commit vote depends on the length of the path along which that vote was forwarded.

Phases of the deal:
  • Clearing: The market-clearing service broadcasts to all parties in the deal, the deal identifier D, the list of parties plist, a commit phase starting time t0 used to compute timeouts, and the timeout delay $\Delta$. Because t0 and $\Delta$ are used only to compute timeouts, their values do not affect normal execution times. If deals take minutes (or hours), then $\Delta$ could be measured in hours (or days).
  • Escrow: Each party places its outgoing assets in escrow through an escrow contract escrow(D,$D_{info}$,a) on that asset’s blockchain. 
  • Transfer: Party P transfers an asset (or assets) a tentatively owned by P to party Q by sending transfer(D,a,Q)$to the escrow contract on the asset’s blockchain.
  • Validation: Each party examines its escrowed incoming assets to see if they represent an acceptable payoff and the deal information provided by the market-clearing service is correct. If so, the party votes to commit.
  • Commit: Each compliant party sends a commit vote to the escrow contract for each incoming asset. A party uses commit(D,v,p) to vote directly and to forward votes to the deal’s escrow contracts, where v is the voter and p is the path signature for v’s vote. For example, if Alice is forwarding Bob’s vote then v is Bob, and p contains first Bob’s signature, and then Alice’s signature. 
A contract releases the escrowed asset to the new owner(s) when it accepts a commit vote from every party. If the contract has not accepted a vote from every party by time $t0+N*\Delta$, where N is the number of parties, it will never accept the missing votes, so the contract times out and refunds its escrowed assets to the original owners.

This may violate "all-or-nothing" semantics, as the paper explains. Suppose that Bob acquires Alice and Carol’s votes on time, and forwards them to claim the coins, but Alice and Carol are driven offline before they can forward Bob’s vote to the ticket blockchain, so Bob ends up with both the coins and the tickets. In this case, technically, Alice and Carol have deviated from the protocol by not claiming their assets in time. (To alleviate this problem, $\Delta$ should be chosen large enough to make sustained denial-of-service attacks prohibitively expensive.)

Assuming synchrony assumptions hold, the paper proves the following for the timelock protocol.

Theorem 1. The timelock protocol satisfies safety.

Theorem 2. The timelock protocol satisfies weak liveness: no compliant party’s outgoing assets are locked up forever.

Theorem 3. The timelock protocol satisfies strong liveness.

CBC Protocol

The second protocol the paper proposes for implementing cross-chain deals is the CBC protocol. CBC does not assume fully-synchronous communication as in timelock, so it is not susceptible to timing related attacks. But in return it requires a globally shared certified blockchain (CBC) as a kind of shared log among the parties. The paper argues that this loss of decentralization is inevitable: no protocol that tolerates periods of asynchrony can be decentralized (following the FLP result).
  • If all parties are compliant, then if the deal commits (resp. aborts) at any asset’s blockchain, it must commit (resp. abort) at all of them.
  • Initially, the deal’s state is bivalent: both commit and abort are possible outcomes. But the deal’s state cannot remain bivalent forever, so it must be possible to reach a (bivalent) critical state where each party is about to take a decisive step.
  • A potentially decisive step forcing an eventual commit cannot take place at a different blockchain than a potentially decisive step forcing an eventual abort, because then it would be impossible to determine which happened first, hence which one was truly decisive. 
(Remark: Hmm... The requirement in the last bullet could be solved by using quorum intersection, as in Paxos. Paxos is also prone to FLP result, but it still preserves safety in a fully asynchronous setup thanks to its use of quorums and ballotnumbers for leaders. So I suppose instead of a shared CBC, it is possible---but cumbersome--- to use a byzantine Paxos protocol to coordinate the commit/abort decision across multiple chains. The question then becomes, is Paxos decentralized? Paxos uses a leader, but the leaders are exchangeable, when one leader is not making progress, another leader can safely emerge to conclude the consensus.)

OK, back to the CBC protocol. Instead of voting on individual assets, each party votes on the CBC whether to commit or abort the entire deal. In the absence of full synchrony, since the parties cannot use timed escrow,  they vote to abort if validation fails, or if too much time has passed.

A party can extract a proof from the CBC that particular votes were recorded in a particular order. A party claiming an asset (or a refund) presents a proof of commit (or abort) to the contract managing that asset. The contract checks the proof’s validity and carries out the requested transfers if the proof is valid. A proof of commit proves that every party voted to commit the deal before any party voted to abort. A proof of abort proves that some party voted to abort before every party voted to commit.

Phases of the protocol
  • Clearing: The market-clearing service broadcasts a unique identifier D and a list of participating parties plist. If more than one startDeal for D is recorded on the CBC, the earliest is considered definitive.
  • Escrow: Each party places its outgoing assets in escrow escrow(D,plist,h,a,...)
  • Transfer: Party P transfers an asset (or assets) a tentatively owned by P to party Q by sending transfer(D,a,Q) to the escrow contract on the asset’s blockchain.
  • Validation: Each party checks that its proposed payoff is acceptable and that assets are properly escrowed with the correct plist and h.
  • Commit: Each party X publishes either a commit or abort vote for D on the CBC via commit(D, h, X) or abort(D, h, X).
Safety is satisfied because complaint parties agree on whether a deal commits or aborts. Weak liveness is satisfied because any compliant party whose assets are locked up for too long will eventually vote to abort. Strong liveness is satisfied in periods when the network is synchronous because every party votes to commit before any party votes to abort.


I really enjoyed reading the paper. The problem the authors take on is very ambitious and important. I also found both protocols intriguing. The CBC protocol is simpler than the timelock protocol, as it uses a shared blockchain for coordination. In contrast to the timelock protocol, the CBC protocol tolerates asynchronous periods, and its cost and latency is better. Doing things completely decentralized increases the costs, and some centralization can help cut costs quickly.

Both protocols do use an escrow, and the escrow is a bit of a crotch, and another entity/contract to pay to as well. The paper says that "Escrow plays the role of classical concurrency control, ensuring that a single asset cannot be transferred to different parties at the same time." This is a bit of a blanket statement, and I wonder if there is indeed no other way to ensure concurrency control across chains to ensure that an asset cannot be transferred to different parties at the same time.

Saturday, December 21, 2019

Decade in Review

We are entering 2020, and this is a good time to retrospect and review the past decade from the lens of this blog, which started late 2010.
I started posting regularly on September 2010. I wanted to get into the cloud computing domain, so I needed to accumulate background on cloud computing work. I decided that as I read papers on cloud computing, I will post a summary to this blog. I thought if I could explain what I learned from the papers in my own words, I would internalize those lessons better. And if others read those summaries and benefit, that is an extra plus.
Initially I reviewed and posted paper summaries on big data processing systems and NoSQL distributed databases to catch up on these areas. Around 2010s, misrepresentation of the CAP theorem by NoSQL proponents was a problematic issue. I covered some papers that try to clarify this issue. Throughout the years, I covered many papers that discussed the different consistency guarantees offered by distributed databases. Transactions in distributed databases also became a favorite topic in the blog. After reading papers on this topic, I dipped my toes into the distributed databases research area. Distributed databases are a very important and practical topic in the industry. A good geo-distributed database solves a lot of problems for big companies. Last year I was at Microsoft Azure Cosmos DB for my sabbatical and got to learn more about this domain.

Spanner's use of synchronized clocks in distributed databases was a big milestone. This work led us to think about hybrid vector clocks, and later hybrid logical clocks. This later led us to develop Retroscope for querying consistent cuts in distributed systems. Nowadays we are thinking about timely protocols, in continuation of this line.

Cloud computing has always been a big part of this blog. I discussed some datacenter networking papers, but I did not really get in to that field. On the cloud computing topic, there was a lot of excitement about  containers and microservices. Based on the problems discussed in that domain, I have written an exploratory technical paper called stabilization in the cloud. I think that is still an open and interesting problem. Recently serverless (function as a service) is all the rage in the cloud environments and there have been several interesting papers on the topic.

I really got into Paxos in this last decade. I didn't expect to fall for Paxos protocols this hard. I had first encountered Paxos around 1999 in the distributed systems reading seminars I attended as a PhD student. I guess I had liked it back then, but it didn't make a big impression on me. From 2007-2015 Paxos gradually became more and more popular and important in cloud computing. In this blog we had more than 20 Paxos posts only in the last two years.

My students and I have a love-hate relationship with Paxos. We understand it very well and are among the top experts on the topic. But, unfortunately,  academia lost some of its excitement about Paxos and a so called "Paxos fatigue" developed. Because of this cold shoulder, we tell ourselves that perhaps we should be working on other things. On the other hand, Paxos is still transforming and ruling over large scale distributed systems deployments in the industry (one recent example is Facebook's use of Paxos to build a scalable control plane). Paxos is one of the most impactful ideas in cloud computing stacks. And despite the reduced interest from the academia, there is still a vast unexplored algorithm design space, and more work is needed to tailor Paxos  for specific distributed systems deployments, topologies, and workloads.  A striking evidence of this arrived in 2017 with the flexible quorum breakthrough which came unanticipated almost 30 years after the Paxos protocol was first proposed. This further opened up the design space for customizing Paxos to different environments and workloads, which is yet to be fully realized. We hope we will be able to convince more researchers to care about and work on these problems. In any case, we can't seem to pull ourselves away from working on Paxos variants. They are a lot of fun. I am working on Paxos Unpacked now.
Ok, enough about Paxos...

In 2016, the machine learning field exploded and went mainstream. Distributed systems support for machine learning became a hot topic. I learned some machine learning by following online courses and reading papers. I really appreciated the neat mathematics, differential computation employed here. As far as developing distributed systems support for machine learning, we performed some surveys, and thought about the topic for a while. But I gave up after sometime, thinking that it would be hard to do principled algorithmic work here because the ML field works very close to the application, and the solutions are pretty application-specific. It turns out that I was judgmental, as several nice algorithmic and distributed systems started coming out recently.

Blockchains got a lot of hype recently. It took me a long time to get in to blockchains, even though it is a very closely related topic to distributed consensus. When I finally offered a seminar on blockchains in Spring 2018, I started appreciating some of the good work done in this domain, and certain parts of the vision for decentralized computing. I love the premise of ICOs for democratizing the stock market and of smartcontracts for enabling decentralized e-trade without any middleman (please hurry up, we need a decentralized search engine and big data analytics for this to work). Unfortunately the blockhain field has been perpetually overhyped and this damages the progress in the field. I think we finally started to see the hype dying and more solid work in the field reconvening.

MAD questions

What trends are brewing that we are missing? 
Bitcoin was released in 2009 and most of us missed it as we enter 2010. What are some trends that are brewing silently that we are missing now?

IOT and 3D printing areas have been seeing increasingly more interest. But there has not been any revolutionary breakthrough yet as far as I can see.

Differentiable programming seems interesting to me.

Quantum computing is also seeing some nice progress, but I don't know much about the area.

The cloud/datacenter computing came a long way in this last decade. I think the serverless model, dataflow architectures for analytics/transactions, and the use of RDMAs are some promising trends. I am also happy to see increased adoption of formal methods and verification for distributed systems. But, maybe since I am embedded too much in the field, I am unable to identify a big hit clearly for the next coming decade.

Wednesday, December 18, 2019

My Distributed Systems Seminar's reading list for Spring 2020

Below is the first draft list of papers I plan to discuss in my distributed systems seminar in the Spring semester. If you have some suggestions on some good/recent papers to cover, please let me know.




Formal methods

In order to run our reading/discussion seminars effectively, we follow the format described in this post.

Here are the links to our previous semester reading lists.

Tuesday, December 17, 2019

The Ben-Or decentralized consensus algorithm

In PODC 1983, Michael Ben-Or published a randomized distributed asynchronous consensus algorithm in a paper titled "Another advantage of free choice (Extended Abstract): Completely asynchronous agreement protocols".

After 15 years, a paper in 1998 provided a correctness proof of Ben-Or's algorithm. Above is the pseudocode for Ben-Or from that paper. From the pseudocode it looks like Ben-Or is a very simple algorithm, but in reality its behavior is not easy to understand and appreciate. But fret not, TLA+ modeling and model checking helps a lot for understanding the behavior of the Ben-Or algorithm. It is fulfilling when you finally understand how the algorithm works, and how safety is always preserved and progress is achieved eventually probabilistically.

I had assigned modeling of Ben-Or as the TLA+ project for my distributed systems class. Here I share my PlusCal modeling of Ben-Or, and discuss how this model helps us to understand the algorithm better. Here is the link to the full model on Github, if you like to play with it.

Outline of the algorithm 

This is the top level model of each process.

Ben-Or is a decentralized consensus algorithm, it is not a leader-based algorithm. In the absence of a leader for tie-breaking, Ben-Or solves only binary input consensus, rather than arbitrary input value consensus. Even then, it employs randomization for achieving progress. (I had discussed the pros and cons of leader-based versus leaderless consensus algorithms earlier.)

Ben-Or has rounds but, unlike Paxos, the rounds are not client-restricted and do not preempt each other. The rounds are common to all nodes, and they are traversed in a lock-step fashion by N-F nodes in an asynchronous manner, where N is the total number of nodes, and F the upper bound on the number of faulty nodes. Recently we have seen that threshold clock synchronization tries to make such a round concept more systematic.

Each round has two phases. In the first phase of a round, each node tries to identify a value supported by a majority in order to propose that for the second phase. In the second phase (the decision/ratification phase), a node finalizes its decision if it finds the same value proposed by at least F+1 nodes. Upon failure to get a decision in the current round, Ben-Or makes some nodes to change their votes before the next round. This helps jolt/tilt the system toward a decision, so that the system (eventually probabilistically) converges to a consensus value in one of the upcoming rounds.

The model uses the following process local variables. p1v denotes the value the node broadcasts in phase 1 of a round, and p2v denotes the value it broadcasts in phase 2 of a round. The decided variable at each node to store the final/terminal decision a node makes for consensus. Initially decided=-1, and only when the process decides, decided is set to 0 or 1. These variables are initialized as p1v=INPUT[self], p2v=-1, decided=-1. In the very first round, the p1v values are come from the INPUT value for each node, in the consequent rounds, the p1v is assigned as a function of the p2v values in EvalP2 function.

We use MAXROUND to bound the number of rounds a node can try to keep model checking feasible; e.g., MAXROUND=3 or 4 suffices to get the properties tested. We use INPUT to give initial values to nodes. This is binary consensus, so we use 0 or 1 as possible initial values. When we run our model, in the model overview pane, we can give values to these constants as follows: F <- 1, N <- 4, INPUT <- «0,1,1,1», MAXROUND <-3.

Phase1 and Phase2 actions

We model message passing among nodes by using a shared message board. This is simply a set defined as a global variable. We use p1Msg as the phase1 message board, and p2Msg as the phase2 message board. Each node broadcasts its p1v or p2v value by adding it to the corresponding message board via a simple set union. For example p1Msg:=p1Msg \union {<<self, r, p1v>>} adds p1v to the corresponding phase1 message board. This amounts to broadcasting the p1v value of the node, and any other node can on its own schedule read from p1Msg set what messages have been sent for a given round, r. This reading is done via set filtering. SentP1Msgs(r)=={m \in p1Msg: m[2]=r} returns all messages in broadcast for round r to the p1Msg board, and SentP1MsgsV(r,a)=={m \in p1Msg: m[2]=r /\ m[3]=a} returns all messages broadcast with round number r and value a.

In phase1, each node broadcasts its p1v to p1Msg, and then waits until there are at least N-F phase1 values shared in the channel for round r. This ensures that the nodes traverse the rounds in a lock-step fashion. Among the p1v values retrieved from the channel for round r, if the node sees a majority value $a \in {0,1}$, it adopts that value for its p2v value to propose in phase2 of this round. If the node does not see a majority value (which does not imply that there is no majority value present, because each node sees a different perspective of values), it adopts the default $p2v=-1$ value for proposing in phase2, meaning that it did not see either 0 or 1 as supported from phase 1 communication exchange.

Note that, since both 0 or 1 cannot be in the majority, the p2v value set for a round can be {-1}, {-1,0}, {-1,1}, {0} or {1}. But it is not possible have {-1,0,1} or {0,1} as the p2v value set because if either 0 or 1 is seen in the majority, it is not possible for the other value to be also in the majority.

In phase2, each node  broadcasts its p2v to p2Msg, and then waits until there are at least N-F phase2 values shared in the channel for this round r. Among the p2v values retrieved from the channel for round r, if the node sees a value $a \in {0,1}$ sent by at least F nodes, it adopts that value for its decision value. Else if the node sees a 0 or 1 value sent by another node, it adopts this value for its p1v value for the upcoming round. This is an optimistic attempt for accumulating traction for a value that at least one node witnessed as majority among p1v values in this round. Otherwise, as the last resort, the node adopts a random 0 or 1 value as its p1v value for the next round. This serves as the random jolt to get the system tilt towards a value that can be witnessed as majority by some nodes so the case1 or case2 above applies for the round.

Why does the algorithm require seeing at least F nodes with the same p2v value before it is chosen for decision? If less than F nodes have seen a majority p1v value, say 1, and they prematurely decide on this value, it is possible that they may all crash after this decision. The other nodes would be following case2 and case3 of phase2, and it is quite possible that in the next round 0 is seen as majority value by some node, and the system then converges to decide on 0, violating the agreement property. While progress is probabilistic in Ben-Or, safety should and is always satisfied in Ben-Or.

Once a node sees at least F+1 nodes with the same p2v value, say 1, it decides. And it is guaranteed that in this phase 2, other available nodes also see at least one node with p2v=1, even if the F nodes that has proposed p2v=1 crashed. (They await for $N-F$ p2v values, remember.) This means that, in the worst case, in the next round all available processes will see majority p1v=1 among the N-F phase1 messages they receive, and decide in the phase2 of that round. 

Okay, this brings us to the properties we check on this model.

Safety and Progress properties

Agreement says that for any two nodes j and k that decided, their decision values should be the same. Agreement should always be satisfied, even when F is more than N/2. Of course, if $F>N/2$, nodes will never be able to decide, because in phase1 a majority is required for proposing a 0 or 1 p2v value.

For testing progress for F<N/2, we use the Progress property that says eventually all processes decide something. If we start with the same p1v values at all nodes, progress is guaranteed for N>2F. In that case all nodes will see that value as the majority value and propose it as the p2v value in round 1, and they all decide in phase2 of that round.

But what about if the initial p1v values have a mixture of 0s and 1s? They may not decide because it may not be possible for any available node to observe a majority 0 or 1 from their sample of N-F values, and the default/skip value $p2v=-1$ will be proposed for the phase 2. When such an ambiguity is possible, the decision can then be postponed forever by choosing the worst possible "random" assignments to the nodes for the next round. This is of course not any more a uniformly random assignment of 0s or 1s to p1v, but a pathological/malicious value assignment by the model checker to show how the Progress can be violated.

For example for N=4 and F=1, INPUT=<<0,1,1,1>> will violate the Progress property. The model checker will hide the "p1v=1" from node, and since the other nodes would not see the value "1" in the majority (which is 3 for N=4), they will all propose p2v=-1. The model checker also will find the "random" assignments that will extend this indecision by not giving majority to any value. (On the other hand, for F=0, in this setup, the majority value of "1" will be seen by all nodes, and the system will decide in one round.)

The pathological runs will break the Progress property, but how do we check that under a mixed input set, there are some runs that terminate? To this end, we write a safety/invariant property called BaitProgress which claims that it is not possible for all processes to decide, and watch the model checker come up with a counterexample for how this property is violated, which implies that yest consensus/progress is solved for some runs.

If N=4 and INPUT=«0,1,1,1», is it possible to have 0 decided for a run? To test this we write a safety property called MinorityReport which claims that it is not possible for all the nodes to decide "0" as the consensus value. The model checker will try to prove us wrong by producing a counterexample. If you check this for F=1, we will find that yes, MinorityReport will be violated meaning that there exists an execution where all nodes will decide on 0 as the consensus value, even though it is clearly in the minority in the initial input set.

MAD questions

How does this compare with Texel, another decentralized consensus protocol?

Texel is also a leaderless binary asynchronous consensus protocol. But Texel does not use rounds, and instead use consistent cuts to evaluate whether there is a consensus value safe to decide on. Texel assumes N>3F+1, so that a value supported by F+1 nodes (which is a majority of 2F+1) is visible to any process even after F nodes crash (or become unavailable otherwise). The problem is there may be two different supporting values, and this violates progress. Moreover, there is a risk that progress is still not guaranteed when all values are the same. When the consistent cut read of nodes (this is called experimentation phase, which is somewhat similar to a phase1 in Ben-Or) conflict with each other, the read is aborted and restarted from scratch. So it is possible that the decision is postponed forever by conflicting experiment phases, even when all input values are the same.

As I discussed earlier, a binary consensus is still useful for blockchain transaction. A well-formed client should avoid double spending, and would not  propose two different transactions with the same UTXO, and therefore  they are guaranteed for liveness. However liveness is not guaranteed (but safety still holds) for double-spending transactions submitted by Byzantine clients, which conflict with one another. In this case the liveness violation is a feature not a bug.

Avalanche extended Texel's leaderless consensus approach in terms of scalability and quick finality (by using sampling and metastability), and applied the resultant decentralized algorithm in the blockchain domain. I assume a similar transformation is possible for extending Ben-Or to a sampling based solution. Even though Ben-Or is round based unlike Texel, I think via sampling and momentum we can avoid the requirement of the rounds, similar to how Avalanche avoided the requirement of consistent reads in Texel.

Tuesday, December 10, 2019

How to ask better questions

This is a followup to my "Master Your Questioning Skills" post. There I gave some suggestions about how to ask better questions, but didn't engage with the topic directly.
Coming up with the questions is not difficult, once you get out of your default mode and give yourself the license to indulge in the questioning frame of mind. 
By calling these questions MAD questions, I gave myself the license/protection to go wild, uninhibited by traditional assumptions/constraints. This gave rise to reducing the bar, and approaching the topic with a beginner mind, as an outsider. This made it easier to ask more original questions. By detaching and discarding the baggage of tradition, you can start working from the first principles. 
To get to the good questions, get over with the bad questions. Bad questions lead to good questions. So roll with them and exhaust them to get to the good questions. 
A good question gives you a perspective change that is productive. It opens a new area to explore. It creates new relations/connections.

Then I asked: "Is it possible to teach how to ask great questions?"
Answers are teachable, so we teach answers. But we don't teach how to question. In kindergarten and primary school, the kids are already pretty good at questioning, but come middle school most kids stop asking questions. I don't know... Maybe students that ask questions come across hard to manage and contrarian, and get discouraged by teachers and adults. We (academicians) try to teach questioning during PhD by way of example via apprenticeships. I am not aware of any organized/formal way that asking better questions is being taught. There is no course or good book on it, as far as I can see.

Here I try again to provide one more method/approach that can enable asking better questions.

Of course, the best I can offer on this is my subjective experience. The premise here is that "I ask good questions". I think I do. And others also tell this about me. There is a chance I have an inflated view about my ability to ask questions, but I think I am above the average. (Dunning Kruger effect, I know.)

Anyways, here is my working theory about how I got better at asking questions. This is because... I always have working theories about everything. Sometimes these are analogies about "how this one thing looks somewhat like this other thing". Often these are just educated guesses based on the models of world I have in mind.

I have working theories about everything because I constantly try to make sense of things in advance, with the limited information available to me. Some of these working theories are contradictory with each other, yet I am OK with keeping them clustered together in my mind, and living with them for a long time. I came to be OK with carrying conflicting and incomplete theories about everything. I never fully believe any of my theories. Sometimes I find myself defending both sides of a debate to the surprise of my students. One of my PhD students once told me: "But you once said this about this topic", and I told him, "I say a lot of things, don't take me too seriously." I think the cardinal sin is being too certain about anything.

As I cultivate these theories, I tend to keep refining them and cut out the incorrect models. I naturally want to test which ones are closer to the truth. So I come up with questions to test them. These questions turn out to be deeper questions that cut to the heart of the matter.

These theories help immensely because, as the book "Range" explains:
Struggling to generate an answer on your own, even a wrong one, enhances subsequent learning. Socrates was apparently on to something when he forced pupils to generate answers rather than bestowing them. It requires the learner to intentionally sacrifice current performance for future benefit.

Metcalfe and colleagues have repeatedly demonstrated a “hypercorrection effect.” The more confident a learner is of their wrong answer, the better the information sticks when they subsequently learn the right answer. Tolerating big mistakes can create the best learning opportunities.

I think my mom was influential for me to cultivate this working theory state of mind. I wrote about this earlier. I think most children keep working models about things they don't understand. Probably this is why small children keep asking many questions. And parents laugh hard when the children's models of things are far removed from reality. But this is no laughing material. We should be encouraging them to have more working models about everything, and keep testing and refining them.

Sunday, December 8, 2019

Moneyball, but for academia

"Moneyball: The Art of Winning an Unfair Game" is a book by Michael Lewis, published in 2003, about the Oakland Athletics baseball team and its general manager Billy Beane. Its focus is the team's analytical, evidence-based, sabermetric approach to assembling a competitive baseball team despite Oakland's small budget. 
I had zero knowledge and interest about baseball, but the book was very engaging, and I could stop reading. Michael Lewis is one of my favorite writers. I had read Flash Boys, Big Short, and Next by him, and all of them were very good.

Fundamentally, Moneyball is about making radical but rational choices to the face of flawed ways of the tradition. Where there is a tradition-ridden unoptimized market, there is potential for disruption: if you are brave enough to do things differently, you can benefit a lot from doing so.  Initially only a few people are daring enough to see this and ignore the tradition and status-quo to start doing things differently. The gravitational pull of tradition and status-quo is very strong especially at the institutions-level. It seems that at the individual level, it is easier to escape from this pull, but at the institutions level this pull is almost inescapable because it is safer to do things "the way it is always done", and it is very hard to shift the entire practices of an institution as institutions are very risk-averse.

Since the faculty recruitment season is in motion, I started thinking about how these lessons would apply to faculty search.

Disclaimer. These are my views, not necessarily that of my institution/department. It should be obvious that I speak only for myself, and I speak subjectively and with generalizations which do not always hold.

Status quo

Academia is a nonlinear game. The quality of the research produced by faculty has nonlinear rewards/returns. Having 100 research faculty does not make a department automatically 4 times better than a department with 25 faculty. It is the quality and not the quantity of research/publications that matter.

In other words, horizontal scalability does not work, and recruiting high-performing research faculty is important. If you have unlimited resources/appeal, you would want to get the stars. In the absence of that, you need to plan carefully about how you can  discover stars before they become stars and/or when they are undervalued?

Hiring assistant professors fresh out of graduate school is one way of finding undervalued talent. These freshly minted PhDs did not have an opportunity to develop their research agendas fully and prove themselves. They will also be incentivized to do their best work to secure tenure at your department. I suppose this is why hiring fresh PhDs as assistant professors is the most common way of faculty recruiting in the academia.

Among the senior hires, a candidate with great publications and funding but coming from a lower ranked department is, by definition, undervalued. Most of the senior hires is of that nature. Ah... The R word... rankings! This brings us to the biggest inefficiency/flaw in the current faculty recruitment traditions.

Fallacies of current hiring practices

Ranking fallacy. 
There is a big premium put on whether the candidates come from a top ranked school. The rule of thumb many departments use is that they only seriously consider candidates that come from schools at least 20 places higher than their own rankings.

But this is a crazy obsession. Humans have similar brains (we are all dumb). It is unwise to think that candidates coming from top schools are much smarter. It is also not necessarily the case that they receive a much better education. This is the age of Internet, information/education is accessible to those seeking for it.

However, the candidates coming from top schools have more institutional support. This is a nontrivial effect. It takes a village to raise a child. I ran an informal poll on Twitter with 50 votes.
The students at high rank departments can easily find more faculty and students to collaborate with. I also think the role of regression to the mean is a big factor. Students in top ranked departments push harder for achieving more.

In any case, the objective criteria is simple. When the search committee evaluates the faculty applications, they should consider the quality and quantity of the publications and the coherency of the research statement. If a candidate from a lower ranked department has comparable publication record to another one from a top ranked department, isn't it better to hire the former as that is a more impressive accomplishment? We should check for skills and potential rather than lineage and schools.

Hype fallacy.
Another common flaw with the current hiring practices is that it is very much coupled to the hype cycles. There is a big premium put on whether the applicant work on a currently hot domain and not enough emphasis given to the toolkit of the applicant and the potential application areas of this toolkit. The domains are ephemeral, but the toolkits are long lived. If a candidate has a good toolset, it is easy for her to switch domains, and work on many domains.

For example, instead of trying to hire machine learning people who are in high demand by all departments, it would be a better strategy to hire people who are undervalued. Maybe hire theory people, programming languages people, or people who have formal methods experience, but with practical twists. After the big players take their pick on machine learning people, and when there is an excess of high quality machine learning candidates on the market, hire from machine learning.

Charisma fallacy.
I think the departments care too much about how well the candidate's talk went. That is not a very relevant skill for academic success. Organization of the presentation, yes. But the delivery of the talk, and looking good doing it, does not have much to do with the quality of the researcher. Many high caliber researchers are not very good on thinking on their feet, and nor are they required to be. Yet, unfortunately some candidates get penalized as they do not come across as high energy or assertive enough, and their presentation is not lively. It is better not to make this mistake, and recruit these undervalued researchers.

This point is also relevant to increasing diversity. Increasing the diversity of the departments are very important and genuinely beneficial to the departments. I am happy to see that this is now strongly encouraged with new systems put in place in many universities. The faculty should keep in check the hidden biases they may have during the candidate visit and evaluation.

What are optimal strategies for recruitment?

It is somewhat easy to account for the above fallacies. But devising optimal strategies for recruitment is very hard, and is beyond my pay grade. From what I can see, there aren't much work on this out there either. Are there any data-oriented study on faculty recruiting with controls? Are there any departments that play this moneyball game better?

I don't have answers, I only have questions.

Is it better to recruit a candidate that is from a weak area for the department, or a candidate that aligns with a strong area for the department? Both has some advantages. It is not very clear, which would be better when.

Being a successful faculty require different skillset from being a successful PhD student, such as being a good mentor, team builder, novel thinker, proposal writer, even organizational skills. What are ways to evaluate these? For some of these we rely on the recommendation letters, but this is not objective?

What are best ways to objectively measure the creativity and vision of the candidate?

How do we measure how good a supervisor and collaborator the candidate could be?

Friday, December 6, 2019

I am looking for PhD students

I am trying to recruit PhD students and since this is the application season for graduate school, I thought I should also advertise here.

If you are interested, please contact me via email.

Below are the two projects I plan to get new students started on.

Yes, it snows a lot in Buffalo. But the summer is beautiful, and Niagara Falls region is 20 minute drive from the university.

Paxos Unpacked

Due to their excellent fault-tolerance and consistency benefits, Paxos protocols are employed at the core of many distributed systems infrastructures at Google, Facebook, Amazon, and Microsoft. When properly tailored and optimized, Paxos family of protocols can deliver efficiency, performance, and scalability at par with weakly consistent protocols, while providing a stable and strong foundation to build services and applications on top.

In this project, we will explore performant, scalable, practical and usable variants of Paxos. We already implemented the Paxi framework to facilitate developing and benchmarking Paxos variants, and provided implementations of a dozen Paxos variants.

In this project, we will develop new Paxos variants in Go using Paxi, and run experiments and benchmarks on these implementations. These include implementing:

  • reconfiguration for Paxos
  • in-protocol sharding for Paxos
  • linearizable quorum reads for Paxos variants
  • durability of Paxos log on disk in an efficient manner
  • BigPaxos for vertically scaling Paxos to hundreds of nodes
  • Byzantine tolerant Paxos variants for blockchains/blockDAGs

Timely Protocols

This project proposes timely protocols that use tightly-synchronized clocks to convey acquisition and affirmation of information with the passage of time and reduce the communication costs of distributed coordination.

Consider a timely commit protocol. The transaction manager (TM) sends a message to the resource managers (RMs) asking them to commit the transaction at time T. If an RM needs to reject, it sends back a message; upon receiving this, the TM sends an Abort message to all the RMs. Otherwise the TM’s first message to commit at T is the only message transmitted. While the classical two-phase commit protocol uses two broadcast, and receives N responses, the timely commit protocol uses silent-consent to reduce that to only one broadcast for the most common happy path. By eschewing explicit positive acknowledgments, timely protocols have the potential to avoid incast storm problems and improve the throughput of the system.

In contrast to leases and timeouts, which use passage of time to convey expiration and removal of information, timely protocols propose a more active and general use of synchronized clocks. While this approach is attractive as it can reduce the communication costs of distributed protocols, the reason this has not been adopted in practice is the multiplitude of challenges involved. Faults, such as crash of nodes and loss of messages, invalidate the information to be conveyed via passage of time.

We formulate a systematic two-pronged approach to circumvent these challenges: 1) fault detection and correction to reduce threats to safety violation of timely protocols, and 2) hindsight reconciliation to recover any potential safety violation that persists nevertheless.

In this project we will implement proof-of-concept prototypes of timely protocols, timely detectors, timely correctors, and hindsight reconciliation. High-level timely detectors and correctors can build on our previous work on Retroscope. 

Thursday, December 5, 2019

On the advisor-mentee relationship

I had previously given advice about how to find your advisor. But I realize I haven't talked about the advisor-mentee relationship directly before.

The advisor is the window through which the student gets to perceive research and academia from. Advisors imprint the students with their approach to research, taste in research problems, perspective on solutions. Advisors also act as role models to the students in a more general sense. I see that the advisor's qualities often live on the student for a long time --at least through the assistant professorship period. For example, if the advisor is kind and generous to students, the mentee is nice to students, etc. This is a big responsibility for the advisors, and they should be aware of their roles in their interactions with their mentees.

The advisor's main job is to ask questions and guide strategically where and when it matters. Acting as the leader, the advisor should provide purpose, give direction, and occasionally motivation to the student. However, spoonfeeding (providing the answers) and hand-holding (guiding through all steps) should be avoided from the beginning. Through osmosis the advisor should instill some research taste on the mentee, teach the art of asking questions and thinking critically/analytically.

Building the advisor-mentee relationship takes time. This is a professional relationship, personality matches and getting along should not be basis for accepting/rejecting mentees. Both sides should keep this relationship professional in a respecting and understanding way. The advisors should not be controlling, and the students in turn should realize that there are consequences to shirking off their responsibilities. The most important thing in the advisor-mentee relation is open communication. The worse thing that could happen is a fail-silent fault: masking problems/issues and pretending everything is fine, and then failing the other party in the project/effort with little heads up.

For the advisor-mentee relationship to flourish, the advisor should cultivate a peer/collegial relation. I once met a faculty candidate who referred to his advisor as "boss" in a non-sarcastic serious way. That was a red flag. This should be a collegial relationship. The students shouldn't be the yes men. They should be able to defend their positions and spar on ideas. Personally I feel happy (and proud) when my students are able to point out when I am wrong. Better decisions emerge from arguing different approaches/positions. The egos shouldn't get in the way of the search for truth.

Initially it is common to talk past each other as you are unaccustomed to each other's thinking and communication styles. It may take many weeks before you can start communicating efficiently. But magic happens one day, after many months of working together. You start completing each others' train of thoughts. You challenge one another and figure out things together. That feels like jamming sessions of musicians, and that is a very gratifying thing.

After writing this much, I will be amiss if I don't thank my PhD and postdoc advisors. Thank you Anish Arora and Nancy Lynch for being wonderful advisors and role models. From you, I learned to be passionate about research and the importance and power of thinking clearly and in a disciplined manner.

Monday, December 2, 2019

Saturday, November 30, 2019

Book Review. The Dark Forest (2008)

This book is the second book in the trilogy titled "Remembrance of Earth's Past" by Liu Cixin. I had reviewed the first book in the series, "The Three Body Problem", earlier. 

I had listened to this book as audiobook. The reader of the audiobook, the voice-actor, was very competent, and also remarkably good in speaking in accents. This made the already engaging story more captivating.

As in the first book, this book also introduces new concepts. The Dark Forest theory is one. The theory posits that the universe is like a dark forest, where everybody is out there to hunt anybody. Due to chain-of-suspicion, a civilization can never be certain of an alien civilization's true intentions.
The extreme distances between stars creates an insurmountable "chain of suspicion", where any two civilizations cannot communicate well enough to dissipate mistrust, making conflict inevitable. Leaving a primitive civilization alone is not an option due to the exponential progress of technological change, and a civilization you have detected might easily surpass your own technological level in a few centuries and become a threat. If you have detected a civilization, then you have also confirmed that said civilization will eventually be able to detect you. Therefore, it is in every civilization's best interest to preemptively strike and destroy any developing civilization before it can become a threat, but without revealing their own location to the wider universe, thus explaining the Fermi paradox.
This is very much the coordinating attack problem in distributed systems. Two generals will attack a city. Only if they attack together, they will have victory. If only one party attacks, and the other party does not join, the attacker will be defeated/destroyed. The two generals try to coordinate this attack by talking to each other via messengers, which may be captured by the city. When the channel is arbitrarily lossy, two parties cannot solve the consensus problem in a deterministic manner for all cases. Chain-of-suspicion arises.
  1. When you get my message, you know my vote, but I don't know you know. So you send me a message back to coordinate the attack, because you suspect that if I don't know you know, I won't join you in the attack.  
  2. When I get your message, I know that you know my vote. But I cannot commit on this, because you don't know that "I know that you know" (because the message may have been lost). So I suspect that because of this lack of information, you won't join me in the attack. So I send you a message back. 
  3. When you get my message, you know that I know that you know my vote. But you realize that I don't know that "you know that I know that you know my vote". Then you suspect that under this condition, I may not join you in the attack and you will get destroyed by the enemy attacking alone .So you send me a message back.
  4. When I get your message, I know that you know that I know that you know my vote. But I realize that you don't know this: "I know that you know that I know that you know my vote", and I suspect you won't join me in the attack due to this lack of information. So I need to send you a message. 
(And this is when the generals were not Byzantine. With Byzantine participants, things get even more hairy.) This chain of suspicion makes consensus/cooperation impossible. However, it should be noted that these are theoretical impossibilities and has restrictions: such as using deterministic protocol. The reality is not this bleak. In practice we solve the coordinated attack problem every time we establish a TCP connection. I had written something on this before.

This book picks up from where the first book has left. The setup is that Trisolarans can observe, through sophons, any visual and audio activity in the world, and even anything written to any electronic storage. The only thing they cannot observe is what goes on in human's brains. (What an interesting setup... I suspect the computer security researchers would appreciate this setup even more.) This led the earth to select wallfacers, the keepers of plans known only to themselves, who are granted full access to the resources of the UN. Another interesting twist was that the Trisolaran society does not have any distinction between communication and thinking, as their thinking is opaque. So they were initially unfamiliar with lying and deception. I thought the book could have developed on this more but this didn't develop as much I thought it could be.

Other interesting concepts include the droplet from Trisolarans, the hibernation to move in time toward doomsday battle, and the mental seal machine.

It was a long book but very interesting and very engaging. I really like Liu Cixin's writing. Of course the protagonists and most of the main characters are Chinese, because Liu is Chinese and is writing for a Chinese audience primarily. After all we have seen many many many AngloSaxon/Western heroes in the literature and media, and of course Chinese protagonists have every right to save the world themselves. I welcome that. But unfortunately I started seeing Chinese government party-lines seeping into the book's narrative. The first book had stroke a balanced tone with respect to this (especially with its discussion/coverage of the cultural revolution), but in this book, especially in the second part, there are undertones of propaganda. The book discusses why it is important to be cold-blooded pragmatists and how totalitarianism is very efficient to deal with crisis situations, etc. I just started the third book in the series, and I find that there is even more propaganda in that book. But, of course, that won't stop me from reading and enjoying the book.

I like these book series because it sets the perspective to a wide perspective: the infinite space versus the marble sized earth. It is a shame we are not thinking more about space, and establishing space-fairing civilizations. When I was growing up in Turkey, I had a very exaggerated view of Turkey. For me it was the entire world. I thought its institutions and its politicians were important and powerful, and the power struggles there were consequential. When I traveled to US for graduate studies, I very quickly realized how limited that worldview that was. Similarly, earth-centric thinking is very limited, but we are very slow to realize this. 

In the book, all the projects are all planned ahead for the four centuries to come. But you can appreciate that although our time is just but a little sliver of time in that timeline, the actions we take are important enablers and first steps for the centuries down the line. So yes, I really enjoyed the big perspective thinking this leads to, and I think we should try to expose ourselves to this type of thinking to break free of our very limited perspective in terms of time and space. 

Thursday, November 28, 2019

Paper review. Threshold Logical Clocks for Asynchronous Distributed Coordination and Consensus

This is a recent arxiv paper by Bryan Ford, EPFL. The figures I use are from Bryan's presentation.

The paper introduces a threshold logical clock (TLC) abstraction and uses it to implement decentralized asynchronous consensus on top. In contrast to Ben-Or which implements decentralized asynchronous binary consensus, TLC based Que-Sera-Consensus (QSC) achieves consensus for arbitrary values proposed.

After I summarize the paper, I will compare/contrast QSC with Paxos, Texel/Avalanche, and Ben-Or.

Threshold Logical Clocks

TLC ensures that a number of nodes progress through logical time in a lock-step fashion. On reaching logical time-step s, each node waits for a threshold tm of broadcasts received from s before it can proceed to step s+1.

Different nodes may see different subsets of the tm messages. The adversarial network schedule ultimately determines this, but can we at least measure a-posteriori the success/failure of a given message's propagation to other nodes? For this purpose, TLC supports witnessing.

Each time-step occurs in two logical phases. Each node broadcasts unwitnessed message and collects responses from at least t witnesses. Each node re-broadcasts witnessed message and collects at least t witnessed messages

A particular protocol instance TLC(tm, tw, n) is parameterized by message threshold tm, witness threshold tw, and number of nodes n. To get from step s to s+1, each node must collect not just tm messages but tm threshold witnessed messages from step s. Each threshold message must have been witnessed by at least tw participants.

A problem still exists. Different nodes may see different subsets, and different messages may have been witnessed by different subsets of nodes. TLC observes that causal ordered message delivery adds some convergence properties to the protocol and simplifies reasoning about it. One way to ensure causal message propagation is to piggyback and gossip every message send so far. Each node i simply includes in every message it sends a record of i's entire causal history. This makes time advancement events propagate virally.

When configured with majority thresholds, TLC offers a useful property. Every witnessed message m broadcast in step s is seen by all n nodes by step s+2:
  • By majority nodes by s+1 by definition of witnessing 
  • Each node collects majority step s+1 msgs by s+2 
  • Since any two majorities intersect, there is at least a 1-node overlap at s+1

Que Sera Consensus (QSC)

QSC provides an asynchronous consensus built on top of TLC. Each round takes three TLC logical time-steps, in which
  • Node broadcasts proposal w/ random priority p 
  • Waits and observes for three TLC time-steps 
  • Decides if any proposal is undisputable winner
This description is of course too high-level. Since there is a lot of subtlety involved due to network asynchrony, the QSC protocol is reached by way of explaining several strawman protocols leading to it.

Here is the baseline and the first strawman (genetic fitness lottery) towards deriving QSC.

Strawman 3 makes the nodes pick celebrity proposals, which a majority of nodes have heard of by the next time-step. This rule still leaves uncertainty, however, since different participants might have seen different subsets of confirmed proposals from step s, and not all of them might have seen the eligible proposal with the globally winning ticket.

In order to seek a universal celebrity, Strawman 5 says we should watch the paparazzi.

The paparazzi condition guarantees that, everyone knows the proposal's existence by s+2, and everyone knows of its celebrity by s+3. But not everyone might have seen paparazzi message!

This brings us to Que Sera, whatever happens happens, Consensus (QSC).  Nodes prefer highest-priority celebrity proposal p and build on it in proposals for future rounds, but don't assume everyone agrees on p! This says that only some consensus rounds may succeed, and that only some nodes may even realize that a round succeeded. And it also guarantees that when this holds, all nodes will build on that value and eventually decide.

Nodes conservatively determine agreement when they are unaware of existence of higher-priority proposal and aware of at least one paparazzi node for p.

Each node i will observe successful consensus with a probability of at least 1/2 in each round, independently of other rounds. Thus, the probability i has not yet finalized a unique proposal for round r by a later round r+k is at most $1/2^k$. What I really like is that by tying the consensus with a blockchain structure/formation, the delay in committing is made somewhat compensated/pipelined. When a commit is finally achieved, any round that i sees as successful will permanently commit both proposal p and any prior uncommitted blocks that p built on in the blockchain structure.

QSC implementations are available in Promela/Spin & Go.

How does this compare with other consensus algorithms?

I am a distributed systems guy. I will be looking at this from a distributed algorithms perspective. It looks like Bryan is coming from security background, and he may be trying to emphasize other properties (like network adversary tolerance) of the protocol. So I may be missing some security related properties of the protocol.

Before I go on to compare and at some points criticize the algorithm, I want to mention that overall this is a nice algorithm. I want to model this in TLA+ when I find some time so that I will be able to get a better understanding of what is going on in TLC and QSC.

The paper claims that QSC is simple, and even simpler than Paxos, but I disagree with that strongly. It took 9 subsections with many strawman algorithms to describe the QSC algorithm in Section 4.10 after TLC is described in Section 2. That said, the protocol is nice and interesting, and it has some advantages not present in other decentralized consensus algorithms like Texel and Ben-Or.

QSC is not binary consensus. This makes it more powerful than Texel and Ben-Or. However, if the domain is blockchain, a binary consensus works with no problems because you are deciding whether to include this proposed transaction in the chain or not. If only one value (one transaction) is proposed for a given UTXO, a binary consensus algorithm can guarantee termination for it. Well-formed clients propose only a single transaction for a given UTXO because otherwise it would be a double-spend attempt, and then the binary consensus does not need to guarantee termination for this instance.

TLC enforces a single frame of reference for the rounds as it aligns rounds across a threshold of participants. In this sense, TLC is reminiscent of the Ben-Or algorithm which also aligns rounds across threshold number of participants. In Ben-Or, N-F participants proceed in lock-step for the two phases of each round. In Ben-Or the termination is again probabilistic with probability 1, as the non termination chance after k rounds reduce to  $1/2^k$ probability. If one node decides at round k, it is guaranteed that all the other nodes will decide in the next round. As we mentioned above, while Ben-Or is restricted to binary consensus, QSC allows arbitrary proposals as input.

It may be possible to argue that the single frame of reference for the rounds, simplifies reasoning. But as a distributed algorithms person, I think this is just unnecessary extra synchronization and not a good idea. In Paxos, the rounds not aligned across participants. Paxos uses concurrent/separate frames of reference for rounds with no problem at all. This is achieved by Paxos's use of totally ordered ballot numbers to make rounds client-restricted. This allows multiple frame of reference rounds to exist concurrently without violating the safety/agreement property.

This brings me to do a more in depth comparison with Paxos, and a speculation.

QSC vs Paxos (and SDPaxos?)

QSC is still randomized consensus. So is Paxos in an asynchronous environment. In an asynchronous environment, dueling leaders problem may be probabilistically avoided (using probabilistically backing off). However, if there is some little partial synchrony which allows the failure detectors can stabilize to $\diamond S$, stable leaders can emerge, and progress will be guaranteed. In comparison, progress in QSC will always be probabilistic.

By using a leader based solution Paxos gets other benefits as well. Paxos uses much less communication to get the consensus done. Paxos communication is 1-to-all-to-1  (with 1 being the leader). In contrast communication in QSC is all-to-all-to-all across the three steps in the rounds.

Here is the speculation part. In effect a leader also emerges in QSC. It is the higher-priority celebrity proposal. So QSC is closer to leader-based Paxos protocols rather than to the leaderless Ben-Or and Texel protocols. What is more the random priority assignment and the use of paparazzi seem to implement a dynamic version of the sequencer node in the SDPaxos protocol. Let me explain.

Let's recall the decision procedure in QSC. A node decides as consensus on a proposal p, when it is (1) unaware of existence of higher-priority proposal than p, and (2) aware of at least one paparazzi node for p.

SDPaxos separates the leader into two roles of ordering/value-choosing done by the sequencer and the replication-checking done by any node. And it seems like QSC divides the sequencer role in to two, via the celebrity and paparazzi roles. QSC confines everything in single-frame of reference rounds by its use of TLC. On the other hand, SDPaxos allows multiple rounds occurring concurrently, via ballotnumber use for sequencer.

Extensions for scalability

QSC uses a lot of communication (all-to-all steps) and wouldn't scale. Finding alternate ways for scaling QSC would be beneficial.

Avalanche has extended Texel via sampling to large scale decentralized consensus.

Is it possible to come up with a sampling based extension to QSC? Texel had two advantages going for it. It is pull-based solution, and it only considers binary consensus. In contrast QSC is push-based, considers arbitrary consensus, and uses a single-reference frame for rounds. These make it hard to apply sampling to QSC.

It would be interesting to think about scalability extensions for QSC. Of course committee-selection based approaches should be applicable at least.

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...