Snowflake to Avalanche: A Novel Metastable Consensus Protocol Family for Cryptocurrencies

This paper is by Team-Rocket ---the authors are pseudonymous, I presume a guy, a gal, and a cat is involved. I learned about the paper when Emin Gun Sirer announced it on Twitter.

Here is the synopsis from the paper. "This paper introduces a brand new family of consensus protocols suitable for cryptocurrencies, based on randomized sampling and metastable decision. The protocols provide a strong probabilistic safety guarantee, and a guarantee of liveness for correct clients."

Below I first summarize the protocols and at the end I will provide my comments/evaluations. The analysis of the paper is very strong and span 8 pages, but I skip that section in my review.

Introduction

Traditional consensus protocols incur high communication overhead and accurate knowledge of membership. So while they work great for less than 20 participants, they do not work for large numbers of participants and open/permissionless environments. Nakamoto consensus family protocols address that problem, but they are costly and wasteful. Bitcoin currently consumes around 63.49 TWh/year, about twice as all of Denmark.

This paper introduces a new family of consensus protocols, inspired by gossip algorithms: The system operates by repeatedly sampling the participants at random, and steering the correct nodes towards the same consensus outcome. The protocols do not use proof-of-work (PoW) yet achieves safety through an efficient metastable mechanism. So this family avoids the worst parts of traditional and Nakamoto consensus protocols.

Similar to Nakamoto consensus, the protocols provide a probabilistic safety guarantee, using tunable security parameters to make the possibility of a consensus failure arbitrarily small.

The protocols guarantee liveness only for virtuous transactions. Liveness for conflicting transactions issued by Byzantine clients is not guaranteed. This point is important, here is the explanation:
"In a cryptocurrency setting, cryptographic signatures enforce that only a key owner is able to create a transaction that spends a particular coin. Since correct clients follow the protocol as prescribed, they are guaranteed both safety and liveness. In contrast, the protocols do not guarantee liveness for rogue transactions, submitted by Byzantine clients, which conflict with one another. Such decisions may stall in the network, but have no safety impact on virtuous transactions. This is a very sensible tradeoff for building cryptocurrency systems." (While this is OK for cryptocurrency systems, it would be a problem for general consensus where conflicting requests from clients will be present.)

OK, then, does this well-formed non-conflicting transactions make consensus trivial in cryptocurrency systems? Would nonconflicting transactions reduce consensus to plain gossip? Then, what does the paper contribute? With plain gossip, the Byzantine client can first introduce one transaction, and then introduce another transaction. With plain gossip the last write wins and double-spending would ensue. So plain gossip won't do; the protocol needs to sample and establish/maintain some sort of communal memory of transactions such that an established transaction should be impossible to change. Nakamoto uses PoW-based chain-layering/amberification to achieve that. This paper shows how this amberification can be achieved via sampling-based gossip and DAG-layering without PoW! Avalanche is a nice name to denote this irreversible process.

The model

The paper adopts Bitcoin's unspent transaction output (UTXO) model: The clients issue cryptographically signed transactions that fully consume an existing UTXO and issue new UTXOs. Two transactions conflict if they consume the same UTXO and yield different outputs. Correct clients never issue conflicting transactions. It is also impossible for Byzantine clients to forge conflicts with transactions issued by correct clients.

On the other hand, Byzantine clients can issue multiple transactions that conflict with one another, and the correct clients should only consume at most one of those transactions. The goal of the Avalanche family of consensus protocols is to accept a set of non-conflicting transactions in the presence of Byzantine behavior.

The Avalanche family of protocols provide the following guarantees with high probability (whp):

  • Safety. No two correct nodes will accept conflicting transactions.
  • Liveness. Any transaction issued by a correct client (aka virtuous transaction) will eventually be accepted by every correct node.


Slush: Introducing Metastability

The paper starts with a non-Byzantine protocol, Slush, and then builds up Snowflake, Snowball, and Avalanche, with better Byzantine fault-tolerance (BFT) and irreversibility properties.


Slush is presented using a decision between two conflicting colors, red and blue. A node starts out initially in an uncolored state. Upon receiving a transaction from a client, an uncolored node updates its own color to the one carried in the transaction and initiates a query.

To perform a query, a node picks a small, constant-sized ($k$) sample of the network uniformly at random, and sends a query message. Upon receiving a query, an uncolored node adopts the color in the query, responds with that color, and initiates its own query, whereas a colored node simply responds with its current color.

Once the querying node collects k responses, it checks if a fraction $\alpha*k$ are for the same color, where $\alpha > 0.5$ is a protocol parameter. If the $\alpha*k$ threshold is met and the sampled color differs from the node's own color, the node flips to that color. It then goes back to the query step, and initiates a subsequent round of query, for a total of $m$ rounds. Finally, the node decides the color it ended up with at time $m$. The paper shows in the analysis that m grows logarithmically with $n$.

This simple protocol illustrates the basic idea but it has many shortcomings. It assumes synchronized rounds available to all. In Line 15, the "accept color" comes at the end of m rounds; there are no early accepts. Finally, Slush does not provide a strong safety guarantee in the presence of Byzantine nodes, because the nodes lack state: Byzantine nodes can try to flip the memoryless nodes to opposite colors.

Snowflake: BFT

Snowflake augments Slush with a single counter that captures the strength of a node's conviction in its current color. In the Snowflake protocol in Figure 2:

  1. Each node maintains a counter $cnt$;
  2. Upon every color change, the node resets $cnt$ to 0;
  3. Upon every successful query that yields $\geq \alpha*k$ responses for the same color as the node, the node increments $cnt$.



Here the nodes can accept colors in an asynchronous manner, not all at the end of $m$ rounds. Each can accept when its own counter exceeds $\beta$. When the protocol is correctly parameterized for a given threshold of Byzantine nodes and a desired $\epsilon$ guarantee, it can ensure both safety and liveness.

Things already got interesting here. The analysis shows that there exists a phase-shift point after which correct nodes are more likely to tend towards a decision than a bivalent state. Further, there exists a point-of-no-return after which a decision is inevitable. The Byzantine nodes lose control past the phase shift, and the correct nodes begin to commit past the point-of- no-return, to adopt the same color, whp.

Snowball: Adding confidence

Snowflake's notion of state is ephemeral: the counter gets reset with every color flip. That is too much history to forget based on one sampling result. Snowball augments Snowflake with momentum by adding confidence counters that capture the number of queries that have yielded a threshold result for their corresponding color (Figure 3):

  1. Upon every successful query, the node increments its confidence counter for that color.
  2. A node switches colors when the confidence in its current color becomes lower than the confidence value of the new color.



Avalanche: Adding a DAG

Adding a DAG improves efficiency, because a single vote on a DAG vertex implicitly votes for all transactions on the path to the genesis vertex. Secondly, it also improves security, because the DAG intertwines the fate of transactions, similar to the Bitcoin blockchain. This makes past decisions (which are buried under an avalanche) much harder to undo.

When a client creates a transaction, it names one or more parents in the DAG. This may not correspond to application-specific dependencies: e.g., a child transaction need not spend or have any relationship with the funds received in the parent transaction. The paper includes a detailed discussion of parent selection in the implementation section.

In the cryptocurrency application, transactions that spend the same funds (double-spends) conflict, and form a conflict set, out of which only a single one can be accepted. As we mentioned above, a conflict set is disparate from how the DAG is constructed, yet the protocol needs to maintain and check for conflict sets for the safety of consensus.

Avalanche embodies a Snowball instance for each conflict set. While Snowball used repeated queries and multiple counters to capture the amount of confidence built in conflicting transactions (colors), Avalanche takes advantage of the DAG structure and uses a transaction's progeny/descendents. When a transaction T is queried, all transactions reachable from T by following the DAG edges are implicitly part of the query. A node will only respond positively to the query if T and its entire ancestry are currently the preferred option in their respective conflict sets. If more than a threshold of responders vote positively, the transaction is said to collect a chit, cT=1, otherwise, cT=0. Nodes then compute their confidence as the sum of chit values in the progeny of that transaction. Nodes query a transaction just once and rely on new vertices and chits, added to the progeny, to build up their confidence.



As Figure 5 shows, when node u discovers a transaction T through a query, it starts a one-time query process by sampling $k$ random peers. A query starts by adding $T$ to Transaction set, initializing $cT=0$, and then sending a message to the selected peers. Each correct node u keeps track of all transactions it has learned about in set $T_u$, partitioned into mutually exclusive conflict sets $PT$, $T \in T_u$. Since conflicts are transitive, if $T_i$ and $T_j$ are conflicting, then $PT_i = PT_j$.



Figure 6 shows what happens when a node receives a query for transaction T from peer j. The node determines if T is currently strongly preferred and returns a positive response to peer j. The transaction T is strongly preferred, if every single ancestor of T is  preferred among its competing transactions (listed in its corresponding conflict set).

Note that the conflict set of a virtuous transaction is always a singleton.  Figure 7 illustrates a sample DAG built by Avalanche, where the shaded regions indicate conflict sets. Sampling in Avalanche will create a positive feedback for the preference of a single transaction in its conflict set. For example, because T2 has larger confidence than T3, its descendants are more likely collect chits in the future compared to T3. So T9 would have an advantage over T6 and T7 in its conflict set.


Figure 4 illustrates the Avalanche protocol main loop, executed by each node. In each iteration, the node attempts to select a transaction T that has not yet been queried. If no such transaction exists, the loop will stall until a new transaction is added. It then selects k peers and queries those peers. If more than $\alpha*k$ of those peers return a positive response, the chit value is set to 1. After that, it updates the preferred transaction of each conflict set of the transactions in its ancestry. Next, T is added to the set Q so it will never be queried again by the node.


Similar to Bitcoin, Avalanche leaves determining the acceptance point of a transaction to the application. Committing a transaction can be performed through a safe early commitment. For virtuous transactions, T is accepted when it is the only transaction in its conflict set and has a confidence greater than threshold $\beta_1$. Alternatively, T can also be accepted after a $\beta_2$ number of consecutive successful queries. If a virtuous transaction fails to get accepted due to a liveness problem with parents, it could be accepted if reissued with different parents.

Implementation

The Team Rocket implemented a bare-bones payment system by porting Bitcoin transactions to Avalanche. They say: "Deploying a full cryptocurrency involves bootstrapping, minting, staking, unstaking, and inflation control. While we have solutions for these issues, their full discussion is beyond the scope of this paper."

As I mentioned before, this section talks in depth about parent selection:
The goal of the parent selection algorithm is to yield a well-structured DAG that maximizes the likelihood that virtuous transactions will be quickly accepted by the network. While this algorithm does not affect the safety of the protocol, it affects liveness and plays a crucial role in determining the shape of the DAG. A good parent selection algorithm grows the DAG in depth with a roughly steady "width". The DAG should not diverge like a tree or converge to a chain, but instead should provide concurrency so nodes can work on multiple fronts. 
Perhaps the simplest idea is to mint a fresh transaction with parents picked uniformly at random among those transactions that are currently strongly preferred. But this strategy will yield large sets of eligible parents, consisting mostly of historical, old transactions. When a node samples the transactions uniformly that way, the resulting DAG will have large, ever-increasing fan-out. Because new transactions will have scarce progenies, the voting process will take a long time to build the required confidence on any given new transaction.
Not only are the ancestors important, progeny is also important for low-latency transaction acceptance. The best transactions to choose lie somewhere near the frontier, but not too far deep in history. The adaptive parent selection algorithm chooses parents by starting at the DAG frontier and retreating towards the genesis vertex until finding an eligible parent.

Evaluation

The basic payment system is implemented in 5K lines of C++ code. Experiments are conducted on Amazon EC2 by running from hundreds to thousands of virtual machine instances using c5.large instances.



For throughput, maintaining a partial order (DAG) that just captures the spending relations allows for more concurrency in processing than a classic BFT log replication system where all transactions have to be linearized. Also, the lack of a leader in Avalanche helps prevent bottlenecks.



Figure 21 shows that all transactions are confirmed within approximately 1 second. Figure 22 shows transaction latencies for different numbers of nodes and that the median latency is more-or-less independent of network size.


Avalanche's latency is only slightly affected by misbehaving clients, as shown in Figure 23.


For emulated georeplication, measurements show an average throughput of 1312 tps, with a standard deviation of 5 tps, and the median transaction latency is 4.2 seconds, with a maximum latency of 5.8 seconds.

The paper includes a comparison paragraph to Algorand and Bitcoin:
Algorand uses a verifiable random function to elect committees, and maintains a totally-ordered log while Avalanche establishes only a partial order. Algorand is leader-based and performs consensus by committee, while Avalanche is leaderless. Both evaluations use a decision network of size 2000 on EC2. Our evaluation uses c5.large with 2 vCPU, 2 Gbps network per VM, while Algorand uses m4.2xlarge with 8 vCPU, 1 Gbps network per VM. The CPUs are approximately the same speed, and our system is not bottlenecked by the network, making comparison possible. The security parameters chosen in our experiments guarantee a safety violation probability below 10−9 in the presence of 20% Byzantine nodes, while Algorand's evaluation guarantees a violation probability below 5 × 10−9 with 20% Byzantine nodes. 
The throughput is 3-7 tps for Bitcoin, 364 tps for Algorand (with 10 Mbyte blocks), and 159 tps (with 2 Mbyte blocks). In contrast, Avalanche achieves over 1300 tps consistently on up to 2000 nodes. As for latency, finality is 10–60 minutes for Bitcoin, around 50 seconds for Algorand with 10 Mbyte blocks and 22 seconds with 2 Mbyte blocks, and 4.2 seconds for Avalanche.

MAD questions

1. Is this a new protocol family?

Yes. Nakamato consensus used PoW to choose leaders. Other protocols uses PoX (e.g., proof-of-lottery, proof-of-stake, PoW) to choose committees which then run PBFT.  Traditional consensus protocols require known membership.

In contrast, Avalanche is a leaderless protocol family that works in open/permissionless setting. It doesn't use any PoX scheme, but uses randomized sampling and metastability to ascertain and persist transactions.

The analysis of the protocols are very strong, and discuss phase-shift point and point-of-no-return for these protocols. This is a very interesting approach to think about consensus. This is also a very fresh approach to thinking about self-stabilization as well. I have a good understanding of self-stabilization literature but I haven't seen this approach in that domain either. I would say the approach would also see interest from the broad self-organizing systems area.

The DAG analysis in the implementation section is also interesting. I don't know much about the hashgraph-based solutions so I don't know how this DAG construction relates to those.

2. What is the incentive to participate?

The paper already discussed a cryptocurrency implementation using Avalanche. But minting, staking, credit distribution, etc, was left for future work. The incentive to participate would come from the cryptocurrency minting and staking. The credit assignment would be interesting and probably would involve several new research problems as well.

3. Where does the Sybil attack tolerance of Avalanche come from?

Avalanche tolerates Byzantine nodes using a tunable parameter to increase/decrease tolerance factor. The paper also reports results with 20% Byzantine nodes.

However, if participation is very inexpensive, Sybil attack is possible, where large number of Byzantine sock-puppets can be introduced to the system violating the BFT ratios. I guess a proof-of-stake based approach can be used in Avalanche to prevent the introduction of an enormous number of Byzantine nodes to the network.

Making Sybil nodes a bit costly help, and that can be complemented with keeping the number of correct nodes high. If the protocol can be really resource light, people wouldn't mind having this in the background in their laptops the same way they don't mind background Dropbox synchronization open. With some incentive it is possible to have many many many participants which also increase tolerance against Sybil attack.

4. What is the resource (computation and storage) requirements at the participants?

On the topic of resource-lightness of the protocol, the paper mentions that transaction validation is the performance bottleneck: "To test the performance gain of batching, we performed an experiment where batching is disabled. Surprisingly, the batched throughput is only 2x as large as the unbatched case, and increasing the batch size further does not increase throughput. The reason for this is that the implementation is bottlenecked by transaction verification. Our current implementation uses an event-driven model to handle a large number of concurrent messages from the network. After commenting out the verify() function in our code, the throughput rises to 8K tps, showing that either contract interpretation or cryptographic operations involved in the verification pose the main bottleneck to the system."

In Avalanche, each participant node needs to maintain the DAG. But since the DAG is a pretty flexible data structure in Avalanche, I think it shouldn't be hard to shard the DAG across groups of participants.

I also wonder about the cost of "conflict set" maintenance as I didn't get a good understanding of how the conflict sets are maintained. The paper mentions an optimization for conflict set maintenance in the implementation section: "A conflict set could be very large in practice, because a rogue client can generate a large volume of conflicting transactions. Instead of keeping a container data structure for each conflict set, we create a mapping from each UTXO to the preferred transaction that stands as the representative for the entire conflict set. This enables a node to quickly determine future conflicts, and the appropriate response to queries."

5. Can some of the assumptions be used for constructing an attack?

The paper says: "The analysis assumes a synchronous network, while the deployment and evaluation is performed in a partially synchronous setting. We conjecture that the results hold in partially synchronous networks, but the proof is left to future work."

I think I buy this. The protocols coming after Slush weakened the synchronicity assumption. The epidemic random sampling mechanism help propagate transactions to the network. So, with enough number of correct nodes, and some weak guarantees about processing speed, I think this can work. Well, we should see the proof.

The epidemic random sampling mechanism requires a decentralized service so that the node to connect with sufficiently many correct nodes to acquire a statistically unbiased view of the network. I guess peer-to-peer finger tables would be sufficient to achieve that. This service should also be guarded against Byzantine nodes as this can be used to affect consensus results by routing nodes to Byzantine pools for sampling. I am wondering if asynchronicity can also be used to introduce/reinforce network partitioning.

Comments

synace said…
Nice review of the paper. Would love to discuss in-person at Blockchain Buffalo meetups. http://www.blockchainbuffalo.com/
Duane Johnson said…
Would a heterogenous compute environment (e.g. IoT devices vs. m5.large EC2 instances) provide any challenges? Would a light-weight compute instance have a decreased chance of getting its transaction accepted, for instance?
Zawy said…
The problem with avalanche is that it assumes a high level of node participation (the "membership" must not change too much, section 3.7). In other words, there's no protection against network splits. The authors said they would address this in a later paper, but that's the hard part. It solves C and A but assumes there is no P. Ava and BCH said partitions are not part of the real world, but if they are not a big issue, POW did not need inventing. POW's magic is proving the chain history had the least sum of partitions with only 1 member per election needing to communicate that he won and everyone immediately agreeing without communication. No membership list is needed because it does not prove there was no partition, it only proves it had the route of least partitions, assuming a 51% attack has not occurred.

So if there is a network split that coincides with conflicting spends on each side, the network is permanently forked. There's no mechanism to tell nodes which fork should is correct unless it defaults back to POW, but if it defaults back to POW, there is no protection against a normal double spend (I have BCH in mind). BCH miners can opt to include unseen txns or exclude Avalanche-approved txns, but this does not resolve the issue.

I have a scheme to enable Avalanche (and can be modified for all POS & POW schemes) to measure participation in order to prove there was not a network split. Stakers must register their time-locked stake with a txn. For each txn that comes up for a vote, they hash the txn plus their registration txn a small number of times that is proportional to their stake. They transmit their lowest target result if it is one of the 30-lowest targets seen. By averaging the 30 lowest targets seen, every node knows independently (even if they get a different set of 30) the degree to which the stakers were participating, giving the node the go-ahead to approve the txn vote and requiring miners not to select a different txn. An attack would require a 51% majority on stake and POW. But if only 70% of registered stakers are shown to be participating, a 35% staker could attack. There are a lot of complexities in trying to get this to work.

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book