HotStuff: BFT Consensus in the Lens of Blockchain

This paper appeared in PODC 2019, and is by Maofan Yin, Dahlia Malkhi, Michael K. Reiter, Guy Golan Gueta, and Ittai Abraham. The paper presents HotStuff, a leader-based Byzantine fault-tolerant consensus protocol. HotStuff forms the basis of LibraBFT which is used in Facebook's Libra project.

HotStuff uses the partially synchronous model for liveness, but it is safe even under an asynchronous model. (This is also how Paxos operates, as a crash tolerant consensus protocol.) Once network communication becomes synchronous, HotStuff enables a correct leader to drive the protocol to consensus at the pace of actual (vs. maximum) round duration --a property called responsiveness. Another innovation in HotStuff is that it provides communication complexity that is linear (rather than quadratic) in the number of replicas. In other words, all-to-all broadcasts are replaced with only participant-to-leader and leader-to-participant communication, with rotating leaders. HotStuff is the first partially synchronous BFT replication protocol exhibiting these two properties combined.

HotStuff overview

HotStuff builds and improves upon PBFT. In PBFT, a stable leader can drive a consensus decision in two rounds of message exchanges. The first phase guarantees proposal uniqueness through the formation of a quorum certificate (QC) consisting of (n − f) votes. The second phase guarantees that the next leader can convince replicas to vote for a safe proposal. This requires the leader to relay information from (n−f) replicas, each reporting its own highest QC or vote. Unfortunately, the view-change algorithm that enables a new leader to collect information and propose it to replicas is complex, bug-prone, and incurs a significant communication penalty for even moderate system sizes.

This is where HotStuff's core technical contribution comes. HotStuff solves a liveness problem, the hidden lock problem, with the view change protocol by adding a lock-precursor phase. The resulting three phase algorithm helps to streamline the protocol and achieve linear view-change costs in contrast to generations of two-phase protocols which suffer from a quadratic communication bottleneck on leader replacement.

Ok, let's unpack this. The hidden lock problem occurs, if a leader doesn't wait for the $\Delta$ expiration time of a round. Simply receiving N-F replies from participants (up to F of which may be Byzantine) is not sufficient to ensure that the leader gets to see the highest lock. This is a race condition problem, the highest lock value may be hidden in the other F honest nodes which the leader didn't wait to hear from. (The hidden lock is not a problem in Paxos view-change because Paxos does not deal with Byzantine nodes, but only with crash-prone nodes. See the remark after this paragraph.) Such an impatient leader may propose a lower lock value than what is accepted and this may lead to a liveness violation. In order not to wait the maximum $\Delta$ expiration time of a round, HotStuff introduces another round, a precursor-lock round, before the actual lock round. This additional precursor-lock round solves the hidden lock problem, because if 2F+1 participants accept the pre-cursor lock, the leader will surely hear from them and learn the highest lock value proposed (not necessarily accepted), when it only talks to N-F nodes without needing to wait for $\Delta$ time. The below lecture by Ittai Abraham is very helpful for understanding this problem and the algorithm.

(Remark. Let me elaborate on why the hidden lock problem is an issue in BFT but not in Paxos. In Paxos, if a node says I previously accepted this value for this round, we trust that node and use that value. In a BFT protocol, we cannot trust a validator node. So we look for the threshold-signed QC (from N-F) nodes that says that this value has indeed been witnessed. Unfortunately, that witnessed QC may not be available until after another round: the leader needs to collect the thresholded signature for the QC, and rebroadcast it back. It may be that one node that receives that rebroadcast may be outside the N-F contacted by a new leader. And this is why we need the precursor-lock round.)

As explained above, HotStuff revolves around a three-phase core, allowing a new leader to simply pick the highest QC it knows of. This simplifies the leader replacement  protocol such that the costs for a new leader to drive the protocol to consensus is no greater than that for the current leader. As such, HotStuff enables pipelining of the phases, and supports frequent succession of leaders, which is very beneficial in the blockchain context. The idea is to change the view on every prepare phase, so each proposal has its own view.

Thanks to the pipelining/chaining of the phases and rotating of leaders, HotStuff achieves high throughput despite adding a third phase to the protocol. More concretely, the pipelining/chaining works as follows. The votes over a prepare phase are collected in a view by the leader into a genericQC. Then the genericQC is relayed to the leader of the next view, essentially delegating responsibility for the next phase, which would have been pre-commit, to the next leader. However, the next leader does not actually carry a pre-commit phase, but instead initiates a new prepare phase and adds its own proposal. This prepare phase for view v+1 simultaneously serves as the pre-commit phase for view v. The prepare phase for view v+2 simultaneously serves as the pre-commit phase for view v+1 and as the commit phase for view v. This is possible because all the phases have identical structure.

Comparison with other BFT protocols

In sum, HotStuff achieves the following two properties with these improvements:
  • Linear View Change: After global stabilization time (GST) of the partial synchrony model, any correct leader, once designated, sends only O(n) authenticators to drive a consensus decision. This includes the case where a leader is replaced. Consequently, communication costs to reach consensus after GST is O($n^2$) authenticators in the worst case of cascading leader failures.
  • Optimistic Responsiveness: After GST, any correct leader, once designated, needs to wait just for the first n−f responses to guarantee that it can create a proposal that will make progress. This includes the case where a leader is replaced.

If you work on blockchain protocols, you may know that Tendermint and Casper also follow a simple leader regime. However, those systems are built around a synchronous core, wherein proposals are made in pre-determined intervals $\Delta$ that must accommodate the worst-case time it takes to propagate messages over a wide-area peer-to-peer gossip network, so they violate the optimistic responsiveness property. In contrast, in HotStuff the leader exchange incurs only the actual network delays, which are typically far smaller than $\Delta$ in practice.

As far as the chaining and pipelining is concerned, the above figure provides an overview of the commit rules of some popular BFT consensus protocols. The commit rule in DLS is One-Chain, allowing a node to be committed only by its own leader. The commit rules in PBFT, Tendermint and Casper are almost identical, and consist of Two-Chains. They differ in the mechanisms they introduce for liveness, PBFT has leader proofs of quadratic size (no Linearity), Tendermint and Casper introduce a mandatory $\Delta$ delay before each leader proposal (no Optimistic Responsiveness). HotStuff uses a Three-Chain rule, and has a linear leader protocol without delay.


Popular posts from this blog

Graviton2 and Graviton3

Foundational distributed systems papers

Learning a technical subject

Learning about distributed systems: where to start?

Strict-serializability, but at what cost, for what purpose?

CockroachDB: The Resilient Geo-Distributed SQL Database

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Warp: Lightweight Multi-Key Transactions for Key-Value Stores

Anna: A Key-Value Store For Any Scale

Your attitude determines your success