Wednesday, June 13, 2018

About self-stabilization and blockchain consensus

This is a half-baked, exploratory piece about how some self-stabilization ideas may relate to blockchain consensus in open/permissionless environments. I will write about some similarities and some major differences between work on self-stabilizing distributed algorithms and blockchain consensus deployments.

Self-stabilization

Let's start with a brief review of self-stabilization.

Stabilization is a type of fault-tolerance that handles faults in a principled, unified manner instead of on a case-by-case basis. Instead of trying to figure out how badly faults can disrupt the system's operation, stabilization assumes arbitrary state corruption, which covers all possible worst-case collusions of faults and program actions. Stabilization then advocates designing recovery actions that take the program back to invariant states starting from any arbitrary state. This makes stabilization suitable for dealing with unanticipated faults.

The most distinctive property of self-stabilization is its emphasis on liveness and on providing eventual safety. When the invariant is violated, liveness still holds, and the system makes progress from faulty states, moving closer to invariant states, until safety is eventually reestablished. In that sense stabilization is related to eventual-consistency approaches as well.

If you'd like to see an example, here is Dijkstra's self-stabilizing token ring program. In this problem, the processes are arranged in a ring and there is a unique token circulating in the ring. (You can think of the token as providing mutual exclusion to the processes; whichever process holds the token can access the critical section/shared resource.)
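To make this concrete, here is a minimal Python sketch of Dijkstra's K-state token ring (my own toy rendering with a random scheduler, not Dijkstra's notation). Process 0 holds a token when its value equals its predecessor's; every other process holds a token when its value differs from its predecessor's. From an arbitrary, possibly corrupted starting state, the ring converges to a single circulating token.

```python
import random

N = 5          # number of processes in the ring
K = N + 1      # K > N guarantees stabilization

def has_token(x, i):
    # process 0 compares with the last process (its ring predecessor)
    return x[i] == x[i - 1] if i == 0 else x[i] != x[i - 1]

def step(x, i):
    """Fire process i's action if it currently holds a token."""
    if not has_token(x, i):
        return
    if i == 0:
        x[0] = (x[0] + 1) % K    # process 0 advances to a fresh value
    else:
        x[i] = x[i - 1]          # other processes copy their predecessor

x = [random.randrange(K) for _ in range(N)]   # arbitrary (corrupted) initial state
for _ in range(1000):                         # an arbitrary fair schedule
    step(x, random.randrange(N))

# After stabilization, exactly one process holds the token.
print(x, sum(has_token(x, i) for i in range(N)))
```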

The similarity between self-stabilization and blockchain consensus is that they are both OK with finite-duration divergence of state/consistency. In (most) blockchain consensus protocols, you may have a fork, which is resolved eventually. In Bitcoin the fork is resolved over time with the addition of new blocks and the longest-chain rule. Avalanche uses a DAG, not a chain, so the "fork" is resolved by confidence values that increase with time. (In other words, the unpopular branches in the DAG atrophy over time.)

In the stabilization literature, there are approaches distantly related to blockchain consensus. I would say two important/early ones are Error-detecting codes and fault-containing self-stabilization (2000) and Probabilistic self-stabilization (1990).

Also, undecided-state dynamics, the population protocols by Aspnes et al., and self-* properties through gossiping are somewhat related work for the Avalanche family of consensus protocols.

State-management perspective

There are big differences between the self-stabilization and blockchain attitudes toward state.

Stabilization likes to use soft-state and minimal state. That helps with efficient/lightweight eventual consistency.

Blockchain achieves eventual consistency with hard state, and lots of it. This is achieved through full replication at each node and the use of error-checking codes. While the stabilization approach does not like keeping history (because it can be corrupted), the blockchain approach embraces history. However, blockchain maintains history in such a way that corruption is evident and can be weeded out! The error-checking codes (or confidences in Avalanche) are chained to provide increasing toughness/tolerance to tampering for older entries.
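As a toy illustration of why chained history makes corruption evident, here is a small Python sketch (my own, not any particular chain's format): each entry commits to the hash of its predecessor, so editing an old entry breaks every later link. Real systems add PoW, signatures, or confidences on top of this.

```python
import hashlib

def link(prev_hash, payload):
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()

def build_chain(payloads):
    chain, h = [], "genesis"
    for p in payloads:
        h = link(h, p)
        chain.append((p, h))
    return chain

def verify(chain):
    h = "genesis"
    for payload, stored in chain:
        h = link(h, payload)
        if h != stored:
            return False
    return True

chain = build_chain(["tx1", "tx2", "tx3"])
print(verify(chain))                       # True
chain[0] = ("tx1-tampered", chain[0][1])   # corrupt an old entry
print(verify(chain))                       # False: the corruption is evident
```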

Another difference is in terms of partial/local state versus global state: in stabilization, nodes often have partial/local state, and the global state is the composition of these local states. In decentralized consensus, each node has the full state, the entire chain or the entire DAG. So in some sense stabilization acts distributedly across nodes, and blockchain consensus acts in a decentralized manner per node.

Dynamic networks and churn perspective

I would also like to take some time to discuss how some self-stabilization work and blockchain consensus deal with open environments.

Self-stabilizing graph algorithms provide some tolerance/cushion for dynamic (time-varying) networks. The edge costs may change (rewiring the graph) and the stabilizing algorithm reacts/adapts to the new graph. Some examples are stabilizing clustering and shortest path tree algorithms. While self-stabilization tolerates some churn, if the churn is high, the self-stabilizing algorithm may not be able to catch up.

In contrast, dynamic networks are not a big problem for blockchain consensus, especially since communication is often done via an epidemic broadcast/gossip or pull-based gossip as in Avalanche. The blockchain consensus protocols just need a way to propagate the transactions and chain state for replication at the nodes.

There is also the issue of Sybil resistance. If participation is very inexpensive, a Sybil attack is possible in an open/permissionless environment. The attack works by introducing a large number of Byzantine sock-puppets into the system, violating the assumed ratio of Byzantine nodes. Sybil resistance is orthogonal to what I discussed and can be achieved for both approaches by incorporating PoW or PoS techniques.

MAD questions

1. Is it possible to use blockchain full-state replication as a building block for larger-scale stabilizing systems?
Maybe blockchain full-state replication could emulate a Virtual Node (VN) in an open/trustless region/zone, and you could then have a stabilizing solution built/deployed over these infallible/uncrashable VNs.

2. Is the open/permissionless environment overrated? If we had federation-like systems, hierarchically arranged, with quick reconfiguration/stabilization to choose suitable quorums for consensus, would that be enough?

Sunday, June 10, 2018

Book Review. Endurance: A year in space, a lifetime of discovery

I didn't know what to expect when I picked this book from the library. A book by an astronaut could turn out to be boring and mundane for me. The thing is, I am interested in space, but I wouldn't say I am passionate about it. I never wished I could be an astronaut as a child (or as an adult). I guess I wanted to be the engineer that designed those systems, rather than the astronaut that piloted them.

Long story short, I enjoyed reading this book a lot. I never got bored; on the contrary, I was very engaged. The book interleaved Scott Kelly's growing up and his one-year stay at the International Space Station (ISS) in alternating chapters. By the end of the book, Scott's life timeline has caught up to the beginning of his year-long ISS stay, and the ISS timeline concludes with the Soyuz return capsule entering the atmosphere.

I learned a lot about the space program. It still weirds me out that while we are very much earth-bound with our lives, technology, perspectives, problems, and dreams/aspirations, there are a dozen or so people who get to inhabit space and look onward to Mars. The future is not uniformly distributed.

The book also included a lot of life lessons and reflections on human relationships. I guess being in space and looking down on Earth gives a great vantage point for such reflection.

I had also read Seveneves by Neal Stephenson and The Martian by Andy Weir and loved those as well. I guess I should be looking for more books about space. Please recommend some.

In several places in the book, Scott Kelly credits "The Right Stuff" by (the recently deceased) Tom Wolfe with turning his life around and getting him on track to becoming an astronaut despite all the odds lined up against him. So I should also plan on reading some of Tom Wolfe's work.

From the book

Here are some random interesting passages from the book. Bring your own context. I will start skipping passages when I am tired of typing.

Page 30:
After a while the bus slows, then comes to a stop well before the launchpad. We nod at one another, step off, and take up our positions. We've all undone the rubber-band seals [on our Sokol suits] that had been so carefully and publicly leak-checked just an hour before. I center myself in front of the right rear tire and reach into my Sokol suit. I don't really have to pee, but it's a tradition: When Yuri Gagarin was on his way to the launchpad for his historic first spaceflight, he asked to pull over---right about here--- and peed on the right rear tire of the bus. Then he went to space and came back alive. So now we all must do the same. The tradition is so well respected that women space travelers bring a bottle of urine or water to splash on the tire rather than getting entirely out of their suits.

[ I guess the danger and unpredictability of the situation is enough to make scientist/engineer-minded people very superstitious. ]

Page 40-41:
In my freshman year, I started out with great hope that I could turn things around and be a good student, as I had every previous school year. This determination always lasted just a few days, until I realized once again that it was impossible for me to concentrate in class or to study on my own. Soon I was waking up each morning and struggling to think of a reason to go to class, knowing I wouldn't absorb any of the professor's lecture. Often, I didn't go. How was I going to graduate, let alone do well enough to be accepted by any medical school?

Everything changed that afternoon when I picked up The Right Stuff. I'd never read anything like it before. I'd heard the word "voice" used to describe literature, but this was something I could actually hear in my head. Even out in the middle of the swamp, Wolfe wrote in this rot-bog of pine trunks, scum slicks, dead dodder vines, and mosquito eggs, even out in this great overripe sump, the smell of "burned beyond recognition" obliterated everything else. I felt the power of those words washing over me, even if some of the words I had to look up in the dictionary. Perilous, neophyte, virulent. I felt like I had found my calling. I wanted to be like the guys in this book, guys who could land a jet on an aircraft carrier at night and then walk away with a swagger. I wanted to be a naval aviator. I was still a directionless, undereducated eighteen-year-old with terrible grades who knew nothing about airplanes. But The Right Stuff had given me the outline of a life plan.

Page 50:
Unlike the early days of spaceflight, when piloting skill was what mattered, twenty-first-century astronauts are chosen for our ability to perform a lot of different jobs and to get along well with others, especially in stressful and cramped circumstances for long periods of time. Each of my crewmates is not only a close coworker in an array of different high-intensity jobs but also a roommate and a surrogate for all humanity.

Page 140:
The increased fluid pressure may squish our eyeballs out of shape and cause swelling in the blood vessels of our eyes and optic nerves. ... It's possible, too, that high CO2 is causing or contributing to changes in our vision ... High sodium in our diets could also be a factor ... Only male astronauts have suffered damage to their eyes while in space, so looking at the slight differences in the head and neck veins of male and female astronauts might also help scientists start to nail down the causes. If we can't we just might have to send an all-women crew to Mars.

Page 150:
Launch time comes and goes. Shortly after, my laptop's internet connection starts working again. I look up the video for the SpaceX launch, but the connection isn't strong enough to stream the video. I get a jerky, frozen image. Then my eye stops on a headline: "SpaceX Rocket Explodes During Cargo Launch to Space Station."
You've got to be fucking kidding me.
The flight director gets on a private space-to-ground channel and tells us the rocket has been lost.

Page 159: [ Upon getting a second chance after being disqualified from flying F14s ]
"You know, you can fly the airplane okay, but you're not flying it all the time," he told me. "You're on altitude and airspeed, but you're not on top of it." I had been trained to keep my altitude within a 200 foot range, so I didn't worry if I was 10 feet off the precise altitude, or 20, or 50. But Scrote pointed out that this imprecision in the end would lead me far from where I needed to wind up, and fixing it would take a lot of my attention. I had to always be making small, constant corrections if I wanted to make the situation better. He was right. My flying got better, and I've been able to apply what I learned from him to a lot of other areas of life as well.

Page 161:
Being in an F-14 squadron in the 1990s was like a cross between playing a professional sport and being in a rock-and-roll band. The movie Top Gun didn't quite capture the arrogance and bravado of it all. The level of drunkenness and debauchery was unbelievable (and is, thankfully, no longer the standard).

Page 164: [ A Marine saying about failures and mistakes ]
There are those who have, and those who will.

Page 290:
This wasn't my first time training with the Russians, of course... By now, I was intimately familiar with the way the Russian space agency handles training similarly to NASA, such as an emphasis on simulator training, and the way they don't, like their emphasis on the theoretical versus the practical---to an extreme. If NASA were to train an astronaut how to mail a package, they would take a box, put an object in the box, show you the route to the post office, and send you on your way with postage. The Russians would start in the forest with a discussion on the species of tree used to create the pulp that will make up the box, then go into excruciating detail on the history of box making. Eventually you would get to the relevant information about how the package is actually mailed, if you didn't fall asleep first. It seems to me this is part of their system of culpability---everyone involved in training needs to certify that the crew was taught everything they could possibly need to know. If anything should go wrong, it must then be the crew's fault.

Page 294: [ Real artists ship! The Russians ship! ]
Once Sasha was back in his seat and it seemed clear we weren't going to catch fire, we talked about our predicament. I decided not to voice concern about the flammability risk.
"It's too bad we won't launch today," I said.
"Da," Sasha agreed. "We will be the first crew to scrub after strapping in since 1969." This is an incredible statistics, considering how often the space shuttle used to scrub, right up to the seconds before launch, even after the main engines had lit.
A voice from the control center interrupted us. "Guys, start your Sokol suit leak checks."
What? Sasha and I looked at each other with identical What-the-fuck? expressions. We were now inside five minutes to launch. Sasha raced to get strapped back into his seat properly.

Page 300-301: [January 8, 2011 During Scott's second ISS mission]
Mission control told me that the chief of the Astronaut Office, Peggy Whitson, needed to talk to me and would be calling on a private line in five minutes. I had no idea why, but I knew the reason wouldn't be anything good. Five minutes is a long time to think about what emergency might have occurred on the ground. Maybe my grandmother had died. Maybe one of my daughters had been hurt.
...
Peggy came on the line. "I don't know how to tell you this," she said, "so I'm just going to tell you. Your sister-in-law, Gabby, was shot."
[ His sister-in-law is Gabrielle Giffords, the former congresswoman from Arizona. ]

Page 311: [ Upon being disqualified from the year-long ISS mission ]
When I got home that night, I told Amiko about being medically disqualified. Rather than looking disappointed, as I expected, she looked puzzled.
"So they are going to send someone who has been on two long flights and has not suffered vision damage?" she asked.
"Right," I said.
"But if the point of this mission is to learn more about what happens to your body on a long mission," she asked, "why would they send someone who is known to be immune to one of the things they intend to study?"
This was a good point.

... I decided to present my case to management. They listened, and to my surprise they reversed their decision.

When I was preparing for the press conference to announce Misha and me as the one-year crew members, I asked what I thought was an innocent question about genetic research. I mentioned something we hadn't previously discussed: Mark would be a perfect control study throughout the year. [ Scott's identical twin brother Mark Kelly is also an astronaut. ]
It turns out my mentioning this had enormous ramifications. Because NASA was my employer, it would be illegal for them to ask me for my genetic information. But once I had suggested it, the possibilities of studying the genetic effects of spaceflight transformed the research. The Twins Study became an important aspect of the research being done on station. A lot of people assumed that I was chosen for this mission because I have an identical twin, but that was just serendipitous.

Page 350:
I've been thinking about the whole arc of my life that had brought me here, and I always think about what it meant to me to read The Right Stuff as a young man. I feel certain that I wouldn't have done any of the things I have if I hadn't read that book---if Tom Wolfe hadn't written it. On a quiet Saturday afternoon, I call Tom Wolfe to thank him. He sounds truly amazed to hear from me. I tell him we're passing over the Indian Ocean, how fast we're going, how our communication system works. We talk about books and about New York and about what I plan to do first when I get back (jump into my swimming pool). We agree to have lunch when I'm back on Earth, and that's now one of the things I'm looking forward to most.

Page 360:
As much as I worked on scientific experiments, I think I learned at least as much about practical issues of how to conduct a long-range exploration mission. This is what crew members on ISS are always doing---we are not just solving problems and trying to make things better for our own spaceflights, but also studying how to make things better for the future. ... And the larger struggles of my mission---most notably, CO2 management and upkeep of the Seedra--- will have a larger impact on future missions on the space station and future space vehicles. NASA has agreed to manage CO2 at a much lower target level, and better versions of CO2 scrubbers are being developed that will one day replace the Seedra...

MAD questions

1. What type of computer/software systems should we have in space?
The RAM is exposed to radiation, so memory can be corrupted. A self-stabilizing program that tolerates memory corruption could be useful in some situations. I have also heard that triple modular redundancy and replication are preferred for some computer/sensor systems. How about Byzantine-tolerant systems? Would those be useful for computers in space?
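As a minimal illustration of the triple modular redundancy (TMR) idea mentioned above, here is the majority-voting step in Python; the values are hypothetical and this is only the voter, not flight software.

```python
def tmr_vote(a, b, c):
    """Return the majority value of three redundant replicas."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one replica corrupted")

# A single radiation-induced bit flip in one replica is masked by the vote.
print(tmr_vote(42, 42, 7))   # 42
```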

The reliability and assurance needed for computers in space stations should be in a league of their own. I wonder if there is a nice description of modern programming techniques/styles for computers deployed in space.

2. Could there be a book out there for you that could change your life? Give your life a whole new meaning/purpose?

3. Or could something you write/contribute help change someone's life?

Sunday, June 3, 2018

Snowflake to Avalanche: A Novel Metastable Consensus Protocol Family for Cryptocurrencies

This paper is by Team-Rocket ---the authors are pseudonymous; I presume a guy, a gal, and a cat are involved. I learned about the paper when Emin Gun Sirer announced it on Twitter. I started speculating about the authors last week, and this is my latest guess.

OK, back to the facts. Here is the synopsis from the paper. "This paper introduces a brand new family of consensus protocols suitable for cryptocurrencies, based on randomized sampling and metastable decision. The protocols provide a strong probabilistic safety guarantee, and a guarantee of liveness for correct clients."

Below I first summarize the protocols, and at the end I provide my comments/evaluation. The analysis in the paper is very strong and spans 8 pages, but I skip that section in my review.

Introduction

Traditional consensus protocols incur high communication overhead and require accurate knowledge of membership. So while they work great for fewer than 20 participants, they do not work for large numbers of participants and open/permissionless environments. The Nakamoto consensus family of protocols addresses that problem, but its members are costly and wasteful. Bitcoin currently consumes around 63.49 TWh/year, about twice as much as all of Denmark.

This paper introduces a new family of consensus protocols, inspired by gossip algorithms: the system operates by repeatedly sampling the participants at random and steering the correct nodes towards the same consensus outcome. The protocols do not use proof-of-work (PoW), yet achieve safety through an efficient metastable mechanism. So this family avoids the worst parts of both traditional and Nakamoto consensus protocols.

Similar to Nakamoto consensus, the protocols provide a probabilistic safety guarantee, using tunable security parameters to make the possibility of a consensus failure arbitrarily small.

The protocols guarantee liveness only for virtuous transactions. Liveness for conflicting transactions issued by Byzantine clients is not guaranteed. This point is important; here is the explanation:
"In a cryptocurrency setting, cryptographic signatures enforce that only a key owner is able to create a transaction that spends a particular coin. Since correct clients follow the protocol as prescribed, they are guaranteed both safety and liveness. In contrast, the protocols do not guarantee liveness for rogue transactions, submitted by Byzantine clients, which conflict with one another. Such decisions may stall in the network, but have no safety impact on virtuous transactions. This is a very sensible tradeoff for building cryptocurrency systems." (While this is OK for cryptocurrency systems, it would be a problem for general consensus where conflicting requests from clients will be present.)

OK then, do these well-formed, non-conflicting transactions make consensus trivial in cryptocurrency systems? Would non-conflicting transactions reduce consensus to plain gossip? Then what does the paper contribute? With plain gossip, a Byzantine client can first introduce one transaction, and then later introduce a conflicting one. With plain gossip the last write wins and double-spending would ensue. So plain gossip won't do; the protocol needs to sample and establish/maintain some sort of communal memory of transactions, such that an established transaction is impossible to change. Nakamoto consensus uses PoW-based chain-layering/amberification to achieve that. This paper shows how this amberification can be achieved via sampling-based gossip and DAG-layering without PoW! Avalanche is a nice name to denote this irreversible process.

The model

The paper adopts Bitcoin's unspent transaction output (UTXO) model: The clients issue cryptographically signed transactions that fully consume an existing UTXO and issue new UTXOs. Two transactions conflict if they consume the same UTXO and yield different outputs. Correct clients never issue conflicting transactions. It is also impossible for Byzantine clients to forge conflicts with transactions issued by correct clients.

On the other hand, Byzantine clients can issue multiple transactions that conflict with one another, and the correct clients should only consume at most one of those transactions. The goal of the Avalanche family of consensus protocols is to accept a set of non-conflicting transactions in the presence of Byzantine behavior.

The Avalanche family of protocols provide the following guarantees with high probability (whp):

  • Safety. No two correct nodes will accept conflicting transactions.
  • Liveness. Any transaction issued by a correct client (aka virtuous transaction) will eventually be accepted by every correct node.


Slush: Introducing Metastability

The paper starts with a non-Byzantine protocol, Slush, and then builds up Snowflake, Snowball, and Avalanche, with better Byzantine fault-tolerance (BFT) and irreversibility properties.


Slush is presented using a decision between two conflicting colors, red and blue. A node starts out initially in an uncolored state. Upon receiving a transaction from a client, an uncolored node updates its own color to the one carried in the transaction and initiates a query.

To perform a query, a node picks a small, constant-sized ($k$) sample of the network uniformly at random, and sends a query message. Upon receiving a query, an uncolored node adopts the color in the query, responds with that color, and initiates its own query, whereas a colored node simply responds with its current color.

Once the querying node collects the k responses, it checks whether at least $\alpha*k$ of them are for the same color, where $\alpha > 0.5$ is a protocol parameter. If the $\alpha*k$ threshold is met and the sampled color differs from the node's own color, the node flips to that color. It then goes back to the query step and initiates a subsequent round of queries, for a total of $m$ rounds. Finally, the node decides on the color it ended up with at round $m$. The paper shows in the analysis that $m$ grows logarithmically with $n$.
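Here is a loose, single-machine Python sketch of Slush as described above (my own simplification: no real network, no Byzantine nodes, and a node may sample itself). It shows the core mechanism of repeatedly sampling k peers and flipping to any color that reaches the $\alpha*k$ threshold.

```python
import random
from collections import Counter

N, K, ALPHA, M = 100, 10, 0.8, 15     # nodes, sample size, threshold, rounds

colors = [random.choice(["red", "blue", None]) for _ in range(N)]

def respond(j, query_color):
    if colors[j] is None:             # an uncolored node adopts the queried color
        colors[j] = query_color
    return colors[j]

for _ in range(M):                    # synchronized rounds, as Slush assumes
    for i in range(N):
        if colors[i] is None:
            continue
        sample = random.sample(range(N), K)
        counts = Counter(respond(j, colors[i]) for j in sample)
        color, votes = counts.most_common(1)[0]
        if votes >= ALPHA * K and color != colors[i]:
            colors[i] = color         # flip to the threshold-winning color

print(Counter(colors))                # typically every node ends on one color
```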

This simple protocol illustrates the basic idea but it has many shortcomings. It assumes synchronized rounds available to all. In Line 15, the "accept color" comes at the end of m rounds; there are no early accepts. Finally, Slush does not provide a strong safety guarantee in the presence of Byzantine nodes, because the nodes lack state: Byzantine nodes can try to flip the memoryless nodes to opposite colors.

Snowflake: BFT

Snowflake augments Slush with a single counter that captures the strength of a node's conviction in its current color. In the Snowflake protocol in Figure 2:

  1. Each node maintains a counter $cnt$;
  2. Upon every color change, the node resets $cnt$ to 0;
  3. Upon every successful query that yields $\geq \alpha*k$ responses for the same color as the node, the node increments $cnt$.



Here the nodes can accept colors asynchronously, not all at the end of $m$ rounds. Each node can accept as soon as its own counter exceeds $\beta$. When the protocol is correctly parameterized for a given threshold of Byzantine nodes and a desired $\epsilon$ guarantee, it can ensure both safety and liveness.
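Here is a minimal Python sketch of the per-node Snowflake update described by the three rules above; the class and variable names are mine, and resetting the counter on an unsuccessful query is an assumption of this sketch.

```python
BETA = 15   # conviction threshold before accepting

class SnowflakeNode:
    def __init__(self, color=None):
        self.color = color
        self.cnt = 0
        self.accepted = None

    def on_query_result(self, winning_color):
        """winning_color: the color with >= alpha*k responses, or None if no threshold was met."""
        if winning_color is None:
            self.cnt = 0                      # assumption: an unsuccessful query resets conviction
            return
        if winning_color != self.color:
            self.color = winning_color        # color change: reset the counter
            self.cnt = 0
        else:
            self.cnt += 1                     # successful query for the node's own color
        if self.accepted is None and self.cnt > BETA:
            self.accepted = self.color        # accept asynchronously, no fixed round count

node = SnowflakeNode("red")
for result in ["red"] * 20:
    node.on_query_result(result)
print(node.color, node.cnt, node.accepted)    # red 20 red
```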

Things already got interesting here. The analysis shows that there exists a phase-shift point after which correct nodes are more likely to tend towards a decision than a bivalent state. Further, there exists a point-of-no-return after which a decision is inevitable. The Byzantine nodes lose control past the phase shift, and past the point-of-no-return the correct nodes begin to commit, adopting the same color, whp.

Snowball: Adding confidence

Snowflake's notion of state is ephemeral: the counter gets reset with every color flip. That is too much history to forget based on one sampling result. Snowball augments Snowflake with momentum by adding confidence counters that capture the number of queries that have yielded a threshold result for their corresponding color (Figure 3):

  1. Upon every successful query, the node increments its confidence counter for that color.
  2. A node switches colors when the confidence in its current color becomes lower than the confidence value of the new color.
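Here is a minimal Python sketch of the Snowball update described above (names are mine): the per-color confidence counters provide the momentum, and the node only switches when another color's confidence overtakes its current color's. The Snowflake-style decision counter is omitted for brevity.

```python
from collections import defaultdict

class SnowballNode:
    def __init__(self, color=None):
        self.color = color
        self.confidence = defaultdict(int)    # per-color confidence counters

    def on_successful_query(self, winning_color):
        """winning_color got >= alpha*k responses in the latest sample."""
        self.confidence[winning_color] += 1
        if self.color is None or \
           self.confidence[winning_color] > self.confidence[self.color]:
            self.color = winning_color        # switch only when overtaken

node = SnowballNode("red")
for c in ["blue", "red", "blue", "blue"]:
    node.on_successful_query(c)
print(node.color, dict(node.confidence))      # blue {'blue': 3, 'red': 1}
```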



Avalanche: Adding a DAG

Adding a DAG improves efficiency, because a single vote on a DAG vertex implicitly votes for all transactions on the path to the genesis vertex. Secondly, it also improves security, because the DAG intertwines the fate of transactions, similar to the Bitcoin blockchain. This makes past decisions (which are buried under an avalanche) much harder to undo.

When a client creates a transaction, it names one or more parents in the DAG. This may not correspond to application-specific dependencies: e.g., a child transaction need not spend or have any relationship with the funds received in the parent transaction. The paper includes a detailed discussion of parent selection in the implementation section.

In the cryptocurrency application, transactions that spend the same funds (double-spends) conflict and form a conflict set, out of which only a single one can be accepted. As we mentioned above, a conflict set is independent of how the DAG is constructed, yet the protocol needs to maintain and check conflict sets for the safety of consensus.

Avalanche embodies a Snowball instance for each conflict set. While Snowball used repeated queries and multiple counters to capture the amount of confidence built in conflicting transactions (colors), Avalanche takes advantage of the DAG structure and uses a transaction's progeny/descendants. When a transaction T is queried, all transactions reachable from T by following the DAG edges are implicitly part of the query. A node will only respond positively to the query if T and its entire ancestry are currently the preferred options in their respective conflict sets. If more than a threshold of responders vote positively, the transaction is said to collect a chit, $c_T = 1$; otherwise $c_T = 0$. Nodes then compute a transaction's confidence as the sum of chit values in its progeny. Nodes query a transaction just once and rely on new vertices and chits, added to the progeny, to build up its confidence.
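As a toy illustration of how confidence accumulates from the chits of a transaction's progeny, here is a small Python sketch (my own names and example DAG; edges point from child to parents, toward the genesis vertex).

```python
parents = {                  # child -> list of parents
    "genesis": [],
    "T1": ["genesis"],
    "T2": ["T1"],
    "T3": ["T1"],
    "T4": ["T2", "T3"],
}
chit = {"genesis": 1, "T1": 1, "T2": 1, "T3": 0, "T4": 1}

def ancestry(tx):
    """All transactions reachable from tx by following parent edges (including tx)."""
    seen, stack = set(), [tx]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen

def confidence(tx):
    """Sum of chit values over tx's progeny: every transaction whose ancestry contains tx."""
    return sum(c for t, c in chit.items() if tx in ancestry(t))

print(confidence("T1"))   # 3: the chits of T1, T2, and T4 (T3's chit is 0)
```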



As Figure 5 shows, when node $u$ discovers a transaction $T$ through a query, it starts a one-time query process by sampling $k$ random peers. A query starts by adding $T$ to the node's transaction set, initializing $c_T = 0$, and then sending a query message to the selected peers. Each correct node $u$ keeps track of all transactions it has learned about in a set $T_u$, partitioned into mutually exclusive conflict sets $P_T$, $T \in T_u$. Since conflicts are transitive, if $T_i$ and $T_j$ are conflicting, then $P_{T_i} = P_{T_j}$.



Figure 6 shows what happens when a node receives a query for transaction T from peer j. The node determines whether T is currently strongly preferred and, if so, returns a positive response to peer j. The transaction T is strongly preferred if every ancestor of T is preferred among its competing transactions (those listed in its corresponding conflict set).

Note that the conflict set of a virtuous transaction is always a singleton. Figure 7 illustrates a sample DAG built by Avalanche, where the shaded regions indicate conflict sets. Sampling in Avalanche creates positive feedback for the preference of a single transaction in its conflict set. For example, because T2 has larger confidence than T3, its descendants are more likely to collect chits in the future than T3's. So T9 would have an advantage over T6 and T7 in its conflict set.


Figure 4 illustrates the Avalanche protocol's main loop, executed by each node. In each iteration, the node attempts to select a transaction T that has not yet been queried. If no such transaction exists, the loop stalls until a new transaction is added. The node then selects k peers and queries them. If more than $\alpha*k$ of those peers return a positive response, the chit value is set to 1. After that, the node updates the preferred transaction of each conflict set of the transactions in T's ancestry. Finally, T is added to the set Q so it will never be queried again by the node.
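Here is a loose, single-node Python paraphrase of this main loop; the network is faked by a `query` stub that answers positively with a fixed probability, and the conflict-set preference update is only noted in a comment. All names are my own, not the paper's code.

```python
import random

K, ALPHA = 10, 0.8

def query(peer, tx):
    """Stand-in for asking a peer whether tx and its whole ancestry are strongly preferred."""
    return random.random() < 0.9

def run_node(pending, peers, rounds=100):
    chit, queried = {}, set()
    for _ in range(rounds):
        unqueried = [t for t in pending if t not in queried]
        if not unqueried:
            break                                    # a real node would stall for new transactions
        t = unqueried[0]
        sample = random.sample(peers, K)
        positive = sum(query(p, t) for p in sample)
        chit[t] = 1 if positive > ALPHA * K else 0   # one-time chit for t
        # a real node now updates the preferred transaction of each conflict set
        # in t's ancestry (omitted in this sketch)
        queried.add(t)                               # t is never queried again
    return chit

print(run_node(["T1", "T2", "T3"], peers=list(range(50))))
```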


Similar to Bitcoin, Avalanche leaves determining the acceptance point of a transaction to the application. Committing a transaction can be performed through safe early commitment. For virtuous transactions, T is accepted when it is the only transaction in its conflict set and has a confidence greater than the threshold $\beta_1$. Alternatively, T can also be accepted after $\beta_2$ consecutive successful queries. If a virtuous transaction fails to get accepted due to a liveness problem with its parents, it could be accepted if reissued with different parents.
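A minimal Python sketch of this safe early commitment rule; the $\beta_1$ and $\beta_2$ values are arbitrary placeholders, not the paper's parameters.

```python
BETA_1, BETA_2 = 11, 150   # placeholder thresholds

def is_accepted(conflict_set_size, confidence, consecutive_successes):
    # safe early commitment: a lone (virtuous) transaction with enough confidence
    if conflict_set_size == 1 and confidence > BETA_1:
        return True
    # fallback: enough consecutive successful queries
    return consecutive_successes > BETA_2

print(is_accepted(1, 12, 0))     # True: only member of its conflict set, confident enough
print(is_accepted(2, 40, 151))   # True: accepted via consecutive successful queries
print(is_accepted(2, 40, 10))    # False
```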

Implementation

Team Rocket implemented a bare-bones payment system by porting Bitcoin transactions to Avalanche. They say: "Deploying a full cryptocurrency involves bootstrapping, minting, staking, unstaking, and inflation control. While we have solutions for these issues, their full discussion is beyond the scope of this paper."

As I mentioned before, this section talks in depth about parent selection:
The goal of the parent selection algorithm is to yield a well-structured DAG that maximizes the likelihood that virtuous transactions will be quickly accepted by the network. While this algorithm does not affect the safety of the protocol, it affects liveness and plays a crucial role in determining the shape of the DAG. A good parent selection algorithm grows the DAG in depth with a roughly steady "width". The DAG should not diverge like a tree or converge to a chain, but instead should provide concurrency so nodes can work on multiple fronts. 
Perhaps the simplest idea is to mint a fresh transaction with parents picked uniformly at random among those transactions that are currently strongly preferred. But this strategy will yield large sets of eligible parents, consisting mostly of historical, old transactions. When a node samples the transactions uniformly that way, the resulting DAG will have large, ever-increasing fan-out. Because new transactions will have scarce progenies, the voting process will take a long time to build the required confidence on any given new transaction.
Not only are the ancestors important, the progeny is also important for low-latency transaction acceptance. The best transactions to choose lie somewhere near the frontier, but not too deep in history. The adaptive parent selection algorithm chooses parents by starting at the DAG frontier and retreating towards the genesis vertex until finding an eligible parent.
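Here is a rough Python sketch of that retreating search (my own rendering; the eligibility test is a stand-in for "strongly preferred and reasonably fresh").

```python
def select_parents(frontier, parents_of, is_eligible, max_parents=2):
    """Walk back from the DAG frontier toward genesis until eligible parents are found."""
    chosen, layer, seen = [], list(frontier), set()
    while layer and len(chosen) < max_parents:
        next_layer = []
        for tx in layer:
            if tx in seen:
                continue
            seen.add(tx)
            if is_eligible(tx):
                chosen.append(tx)
                if len(chosen) == max_parents:
                    break
            else:
                next_layer.extend(parents_of.get(tx, []))   # retreat toward genesis
        layer = next_layer
    return chosen

parents_of = {"T4": ["T2", "T3"], "T3": ["T1"], "T2": ["T1"], "T1": ["genesis"]}
print(select_parents(["T4"], parents_of, is_eligible=lambda t: t != "T4"))   # ['T2', 'T3']
```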

Evaluation

The basic payment system is implemented in 5K lines of C++ code. Experiments are conducted on Amazon EC2, running hundreds to thousands of c5.large virtual machine instances.



For throughput, maintaining a partial order (DAG) that just captures the spending relations allows for more concurrency in processing than a classic BFT log replication system where all transactions have to be linearized. Also, the lack of a leader in Avalanche helps prevent bottlenecks.



Figure 21 shows that all transactions are confirmed within approximately 1 second. Figure 22 shows transaction latencies for different numbers of nodes and that the median latency is more-or-less independent of network size.


Avalanche's latency is only slightly affected by misbehaving clients, as shown in Figure 23.


For emulated georeplication, measurements show an average throughput of 1312 tps, with a standard deviation of 5 tps, and the median transaction latency is 4.2 seconds, with a maximum latency of 5.8 seconds.

The paper includes a comparison paragraph to Algorand and Bitcoin:
Algorand uses a verifiable random function to elect committees, and maintains a totally-ordered log while Avalanche establishes only a partial order. Algorand is leader-based and performs consensus by committee, while Avalanche is leaderless. Both evaluations use a decision network of size 2000 on EC2. Our evaluation uses c5.large with 2 vCPU, 2 Gbps network per VM, while Algorand uses m4.2xlarge with 8 vCPU, 1 Gbps network per VM. The CPUs are approximately the same speed, and our system is not bottlenecked by the network, making comparison possible. The security parameters chosen in our experiments guarantee a safety violation probability below $10^{-9}$ in the presence of 20% Byzantine nodes, while Algorand's evaluation guarantees a violation probability below $5 \times 10^{-9}$ with 20% Byzantine nodes.
The throughput is 3-7 tps for Bitcoin, 364 tps for Algorand (with 10 Mbyte blocks), and 159 tps (with 2 Mbyte blocks). In contrast, Avalanche achieves over 1300 tps consistently on up to 2000 nodes. As for latency, finality is 10–60 minutes for Bitcoin, around 50 seconds for Algorand with 10 Mbyte blocks and 22 seconds with 2 Mbyte blocks, and 4.2 seconds for Avalanche.

MAD questions

1. Is this a new protocol family?

Yes. Nakamoto consensus uses PoW to choose leaders. Other protocols use PoX (e.g., proof-of-lottery, proof-of-stake, PoW) to choose committees, which then run PBFT. Traditional consensus protocols require known membership.

In contrast, Avalanche is a leaderless protocol family that works in open/permissionless setting. It doesn't use any PoX scheme, but uses randomized sampling and metastability to ascertain and persist transactions.

The analysis of the protocols is very strong, and discusses the phase-shift point and point-of-no-return for these protocols. This is a very interesting approach to thinking about consensus. It is also a fresh approach to thinking about self-stabilization. I have a good understanding of the self-stabilization literature, and I haven't seen this approach in that domain either. I would say the approach would also see interest from the broader self-organizing systems area.

The DAG analysis in the implementation section is also interesting. I don't know much about the hashgraph-based solutions so I don't know how this DAG construction relates to those.

2. What is the incentive to participate?

The paper already discusses a cryptocurrency implementation using Avalanche. But minting, staking, credit distribution, etc., were left for future work. The incentive to participate would come from the cryptocurrency minting and staking. The credit assignment would be interesting and would probably involve several new research problems as well.

3. Where does the Sybil attack tolerance of Avalanche come from?

Avalanche tolerates Byzantine nodes using a tunable parameter to increase/decrease the tolerance factor. The paper also reports results with 20% Byzantine nodes.

However, if participation is very inexpensive, a Sybil attack is possible, where a large number of Byzantine sock-puppets are introduced into the system, violating the BFT ratios. I guess a proof-of-stake based approach can be used in Avalanche to prevent the introduction of an enormous number of Byzantine nodes to the network.

Making Sybil nodes a bit costly helps, and that can be complemented by keeping the number of correct nodes high. If the protocol can be made really resource-light, people wouldn't mind having it run in the background on their laptops, the same way they don't mind background Dropbox synchronization. With some incentive it is possible to have very many participants, which also increases tolerance against Sybil attacks.

4. What is the resource (computation and storage) requirements at the participants?

On the topic of resource-lightness of the protocol, the paper mentions that transaction validation is the performance bottleneck: "To test the performance gain of batching, we performed an experiment where batching is disabled. Surprisingly, the batched throughput is only 2x as large as the unbatched case, and increasing the batch size further does not increase throughput. The reason for this is that the implementation is bottlenecked by transaction verification. Our current implementation uses an event-driven model to handle a large number of concurrent messages from the network. After commenting out the verify() function in our code, the throughput rises to 8K tps, showing that either contract interpretation or cryptographic operations involved in the verification pose the main bottleneck to the system."

In Avalanche, each participant node needs to maintain the DAG. But since the DAG is a pretty flexible data structure in Avalanche, I think it shouldn't be hard to shard the DAG across groups of participants.

I also wonder about the cost of "conflict set" maintenance as I didn't get a good understanding of how the conflict sets are maintained. The paper mentions an optimization for conflict set maintenance in the implementation section: "A conflict set could be very large in practice, because a rogue client can generate a large volume of conflicting transactions. Instead of keeping a container data structure for each conflict set, we create a mapping from each UTXO to the preferred transaction that stands as the representative for the entire conflict set. This enables a node to quickly determine future conflicts, and the appropriate response to queries."
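As a toy sketch of that optimization, the snippet below keeps a map from each consumed UTXO to the currently preferred spending transaction, so conflicts are detected without materializing the full conflict set. The structure and names are my assumptions.

```python
preferred_spender = {}   # UTXO id -> representative (preferred) transaction for its conflict set

def register(tx_id, consumed_utxos, is_now_preferred):
    """Record tx_id; return the UTXOs on which it conflicts with an existing preference."""
    conflicts = [u for u in consumed_utxos
                 if u in preferred_spender and preferred_spender[u] != tx_id]
    for u in consumed_utxos:
        if u not in preferred_spender or is_now_preferred:
            preferred_spender[u] = tx_id
    return conflicts

print(register("T1", ["utxo-a"], is_now_preferred=False))   # []: first spender of utxo-a
print(register("T2", ["utxo-a"], is_now_preferred=False))   # ['utxo-a']: a double-spend is detected
```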

5. Can some of the assumptions be used for constructing an attack?

The paper says: "The analysis assumes a synchronous network, while the deployment and evaluation is performed in a partially synchronous setting. We conjecture that the results hold in partially synchronous networks, but the proof is left to future work."

I think I buy this. The protocols coming after Slush weakened the synchrony assumption. The epidemic random sampling mechanism helps propagate transactions through the network. So, with a sufficient number of correct nodes and some weak guarantees about processing speed, I think this can work. Well, we should see the proof.

The epidemic random sampling mechanism requires a decentralized service so that a node can connect with sufficiently many correct nodes to acquire a statistically unbiased view of the network. I guess peer-to-peer finger tables would be sufficient to achieve that. This service should also be guarded against Byzantine nodes, as it could otherwise be used to affect consensus results by routing nodes to Byzantine pools for sampling. I am wondering if asynchrony could also be used to introduce/reinforce network partitioning.

Tuesday, May 29, 2018

Some blockchain applications and Reviewcoin

Conjecture 34: Forall x, we have "x on blockchain".
I don't have a proof but I have seen dental on blockchain, kiwis on blockchain, shoes on blockchain, etc.

Apart from the many silly x-on-blockchain attempts, I have heard of some serious and promising applications. The Brave browser's basic attention token seems to be a serious effort and can help redefine the ad economy in browsers. I also recently heard about the Abra digital currency bank teller, which is another solid company and application.

To continue, Jackson Palmer said these are his favorite decentralized technology projects: WebTorrent, Dat/Beaker, Mastodon, and Scuttlebutt. Finally, there are well-circulated requests for apps, which I think could create some uptick in the blockchain applications game.

My blockchain app suggestions

I didn't want to be left out of the action. Ideas are free. Here are some things that occurred to me.

Medical school students can do an ICO and fund their schooling

This will be a utility token, where the holder gets 30 minutes of visitation/consulting per token. Of course, as the student becomes a more experienced and skilled doctor, the token becomes more valuable, because 30 minutes of the doctor's time is now worth more. And with the token comes a token economy, where you can buy/sell tokens for doctors.

The same idea can apply to some other graduate schools as well, including electrical engineering, computer science, etc. Maybe some labs can raise funding for students/projects using this model.

Being a philanthropist never was so profitable/greedy. And being greedy was never more philanthropic.

But come to think of it, I don't like this. Would this lead to maintaining a stock ticker for people in your life? (Hmm, his grades this semester are not that good, time to short his tokens.) People who are into stock trading all day may be OK with this, but it is definitely not for me.

Review coin

This idea occurred to me recently: we should build a blockchain solution for a conference reviewer crediting system. If you don't help with reviewing, you don't get to have your papers reviewed.

The biggest advantage of doing this on a blockchain rather than in a traditional centralized database would be that it would belong to everyone in the academic community rather than to a specific organization. IEEE versus ACM is a thing. There are many other organizations. Even some universities may not want to submit to a system if it is maintained by a "rival" institution. (Though I must say arxiv.org, owned by Cornell, gained adoption from all/most institutions as a repository of electronic preprints, so it looks like it can be done.)

With blockchain technology it may be possible to get the incentive model right as well, since similar problems have been studied in blockchains for cryptocurrencies.

There are many challenges here though. We like to keep the reviewers anonymous. We also want to keep the power of the anonymous reviewers in check and avoid bad/empty/trivial reviews and enforce a certain quality-level in the reviews.

To keep the review quality in check, we can require multisig on a review. The extra endorsements may come from PC chairs or some assigned reviewer-reviewers, to keep the system more decentralized.

The participants (authors, reviewers, reviewer-reviewers) can be anonymous. The reviews do not need to go onto the blockchain/DAG; only hashes of the reviews can go in there for verification while maintaining some privacy. The participants will have private and public keys, so it is not possible to initiate a transaction on behalf of another participant, as in Bitcoin. On the other hand, we still need to worry about double-spending and attacks on the blockchain.

To fend off attacks, the Reviewcoin blockchain could be powered by PoW, PoS, or a leaderless gossip-based approach (as in the Avalanche paper). This is a crazy thought, but would it at all be possible to make this blockchain reviewer-work powered? There are some works that look at human-work puzzles and proof-of-useful-work for blockchains. I speculate on how a review-work-powered method could be adapted for this problem in the MAD questions below.

Reviewcoin can enable a better compensated yet more open model for reviewing papers. The model can also be applied to reviewing whitepapers, component designs, or even code. The party that needs to get something reviewed should hodl some Reviewcoin, which could be earned either by doing meaningful reviews, by helping keep the Reviewcoin blockchain infrastructure running strong, or by purchasing from Reviewcoin owners.

By offering more Reviewcoins it may be possible to recruit more skilled/capable/experienced reviewers for the work. Offering more coins should not get you more favorable reviews, though; the reviewer-reviewers, appointed by the system, should enforce fairness.

MAD questions

1. Is linking reviews with compensation opening a can of worms?
Probably... But the existing system is not great either. Academics volunteer for technical program committees of prestigious conferences (1) to help, (2) to get some name recognition as their names get listed on the conference website, and (3) to keep current with new research in the field. This is a goodwill-based model; these reviewers would consult outside starting at $200 an hour, yet they spend tens of hours for free reviewing for a conference. Inevitably this goodwill-based approach is starting to face many problems. Freeloading is a problem, as we are seeing some conferences getting hundreds and even thousands of paper submissions. Incentives are becoming a problem: for journals and weak conferences, it is hard to find reviewers, because the name-recognition part is not that strong. Finally, experts are overstretched with volunteering.

So maybe we should consider opening a can of worms, if it can help us build a more sustainable system going forward.

2. Human proof-of-work for Reviewcoin
To certify that a review is valid and not random junk submitted just to earn Reviewcoins, the review can be checked and signed by program committee chairs if it is for a conference. We can have authenticated academics serving as PC chairs, so their signature certifies the review is sound and the reviewer gets some Reviewcoin for the service. To extend this to non-conference, unorganized reviews, the review certification can be done through a multisig from a randomly selected set of participants. If the Byzantine nodes (Sybil/bot participants) are in the minority, they will not be able to issue false certifications of reviews. It may be possible to use a proof-of-stake of participants to limit the number of Sybil nodes. Also, the Reviewcoin stake/credibility of participants can be used in a weighted manner for reliability.
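As a toy sketch of this certification idea, the snippet below checks that a review's hash has been endorsed by a threshold of the randomly selected certifiers; the "signatures" are just names here, whereas a real system would use public-key signatures.

```python
import hashlib

def review_hash(review_text):
    """Only this hash would go on the chain, keeping the review text itself private."""
    return hashlib.sha256(review_text.encode()).hexdigest()

def is_certified(signers, selected_certifiers, threshold):
    """Multisig-style check: enough of the selected certifiers endorsed the review."""
    return len(set(signers) & set(selected_certifiers)) >= threshold

h = review_hash("The evaluation section needs a stronger baseline...")
print(h[:16], is_certified({"alice", "bob"}, {"alice", "bob", "carol"}, threshold=2))   # True
```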

3. Where did May go?
Time is an illusion. End-of-semester doubly so. (With apologies to Douglas Adams.)

Wednesday, May 23, 2018

TUX2: Distributed Graph Computation for Machine Learning

The TUX2 paper appeared in NSDI 17 and was authored by Wencong Xiao, Beihang University and Microsoft Research; Jilong Xue, Peking University and Microsoft Research; Youshan Miao, Microsoft Research; Zhen Li, Beihang University and Microsoft Research; Cheng Chen and Ming Wu, Microsoft Research; Wei Li, Beihang University; Lidong Zhou, Microsoft Research.

TUX2 introduces some new concepts to graph processing engines to adapt them better for machine learning (ML) training jobs. Before I can talk about the contributions of TUX2, you need to bear with me as I explain how current graph processing frameworks fall short for ML training.

Background and motivation

Graph processing engines often take a "think like a vertex" approach. A dominant computing model in the "think like a vertex" approach is the Gather-Apply-Scatter (GAS) model. You can brush up on graph processing engines by reading my reviews of Pregel and Facebook graph processing.

Modeling ML problems as bipartite graphs

Many ML problems can be modeled with graphs and attacked via iterative computation on the graph vertices. The Matrix Factorization (MF) algorithm, used in recommendation systems, can be modeled as a computation on a bipartite user-item graph where each vertex corresponds to a user or an item and each edge corresponds to a user's rating of an item.

A topic-modeling algorithm like Latent Dirichlet Allocation (LDA) can be modeled as a document-word graph. If a document contains a word, there is an edge between them; the data on that edge are the topics of the word in the document. Even logistic regression can be modeled as a sample-feature graph.


Gaps in addressing ML via graph processing

1. The graphs that model ML problems often have a bipartite nature and heterogeneous vertices with distinct roles (e.g., user vertices and item vertices). However, the standard graph model used in graph processing frameworks assumes a homogeneous set of vertices.

2. For ML computation, an iteration of a graph computation might involve multiple rounds of propagation between different types of vertices, rather than a simple series of GAS phases. The standard GAS model is unable to express such computation patterns efficiently.

3. Machine learning frameworks have been shown to benefit from the Stale Synchronous Parallel (SSP) model, a relaxed consistency model with bounded staleness to improve parallelism. The graph processing engines use Bulk Synchronous Parallel (BSP) model by default.

TUX2 Design

To address the gaps identified above, TUX2
1. supports heterogeneity in the data model,
2. advocates a new graph model, MEGA (Mini-batch, Exchange, GlobalSync, and Apply), that allows flexible composition of stages, and
3. supports SSP in execution scheduling.

Next we discuss the basic design elements in TUX2, and how the above 3 capabilities are built on them.

The vertex-cut approach 

TUX2 uses the vertex-cut approach (introduced in PowerGraph), where the edge set of a high-degree vertex can be split into multiple partitions, each maintaining a replica of the vertex. One of these replicas is designated the master; it maintains the master version of the vertex's data. All the remaining replicas are called mirrors, and each maintains a local cached copy.

Vertex-cut is very useful for implementing the parameter-server model: the master versions of all vertices' data can be treated as the distributed global state stored in a parameter server. The mirrors are distributed to workers, which also hold the second type of vertices and use the mirror vertices when iterating over that second vertex type.

Wait, the second type of vertices? Yes, here we harken back to the bipartite graph model. Recall that we had a bipartite graph with heterogeneous vertices, some of which have higher degrees. Those higher-degree vertices are the master vertices held at the server, and the low-degree vertices serve as data/training vertices for those masters; they cache the master vertices as mirror vertices and train on them. And, in some sense, the partitions of the low-degree vertex type in the bipartite graph correspond to mini-batches.

The paper has the following to say on this. In a bipartite graph, TUX2 can enumerate all edges by scanning only vertices of one type. The choice of which type to enumerate sometimes has significant performance implications. Scanning the vertices with mirrors in a mini-batch tends to lead to a more efficient synchronization step, because these vertices are placed contiguously in an array. In contrast, if TUX2 scans vertices without mirrors in a mini-batch, the mirrors that get updated for the other vertex type during the scan will be scattered and thus more expensive to locate. TUX2 therefore allows users to specify which set of vertices to enumerate during the computation.

Each partition is managed by a process that logically plays both

  • a worker role, to enumerate vertices in the partition and propagate vertex data along edges, and 
  • a server role, to synchronize states between mirror vertices and their corresponding masters. 
Inside a process, TUX2 uses multiple threads for parallelization and assigns both the server and worker roles of a partition to the same thread. Each thread is then responsible for enumerating a subset of mirror vertices for local computation and maintaining the states of a subset of master vertices in the partition owned by the process.


Figure 3 illustrates how TUX2 organizes vertex data for a bipartite graph, using MF on a user-item graph as an example. Because user vertices have much smaller degree in general, only item vertices are split by vertex-cut partitioning. Therefore, a master vertex array in the server role contains only item vertices, and the worker role only manages user vertices. This way, there are no mirror replicas of user vertices and no distributed synchronization is needed. In the worker role, the mirrors of item and user vertices are stored in two separate arrays.

In each partition, TUX2 maintains vertices and edges in separate arrays. Edges in the edge array are grouped by source vertex. Each vertex has an index giving the offset of its edge-set in the edge array. Each edge contains information such as the id of the partition containing the destination vertex and the index of that vertex in the corresponding vertex array. This graph data structure is optimized for traversal and outperforms vertex indexing using a lookup table. Figure 2 shows how data are partitioned, stored, and assigned to execution roles in TUX2.
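Here is a small Python sketch of that array-based layout (my own illustration of the idea, not TUX2's actual structures): edges are grouped by source vertex, and each vertex stores the offset of its edge set, in the style of a CSR index.

```python
vertices = ["u0", "u1", "u2"]      # e.g., the user vertices of one partition
edge_offsets = [0, 2, 3, 5]        # vertex i owns edges[edge_offsets[i]:edge_offsets[i+1]]
edges = [                          # (destination partition id, index in that partition's vertex array)
    (1, 0), (1, 3),                # u0's edges
    (2, 1),                        # u1's edge
    (1, 2), (2, 0),                # u2's edges
]

def edges_of(i):
    return edges[edge_offsets[i]:edge_offsets[i + 1]]

for i, v in enumerate(vertices):
    print(v, edges_of(i))          # contiguous traversal, no hash lookups
```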


Scheduling minibatches with SSP

TUX2 executes each iteration on a minibatch with a specified size. Each worker first chooses a set of vertices or edges as the current minibatch to execute on. After the execution on the mini-batch finishes, TUX2 acquires another set of vertices or edges for the next minibatch, often by continuing to enumerate contiguous segments of vertex or edge arrays.

TUX2 supports SSP in the mini-batch granularity. It tracks the progress of each mini-batch iteration to enable computation of clocks. A worker considers clock t completed if the corresponding mini-batch is completed on all workers (including synchronizations between masters and mirrors) and if the resulting update has been applied to and reflected in the state. A worker can execute a task at clock t only if it knows that all clocks up to t−s−1 have completed, where s is the allowed slack.
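A minimal Python sketch of that SSP rule (the names are mine): a worker may start the mini-batch at clock t only if every clock up to t-s-1 has completed on all workers.

```python
def can_execute(t, completed_clock_per_worker, slack):
    """completed_clock_per_worker: the highest completed clock reported by each worker."""
    globally_completed = min(completed_clock_per_worker)
    return globally_completed >= t - slack - 1

print(can_execute(t=10, completed_clock_per_worker=[9, 8, 7], slack=2))   # True: min is 7 >= 7
print(can_execute(t=10, completed_clock_per_worker=[9, 8, 6], slack=2))   # False: a worker lags too far
```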

The MEGA model 

TUX2 introduces a new stage-based MEGA model, where each stage is a computation on a set of vertices and their edges in a graph. Each stage has user-defined functions (UDF) to be applied on the vertices or edges accessed during it. MEGA defines four types of stage: Mini-batch, Exchange, GlobalSync, and Apply.

MEGA allows users to construct an arbitrary sequence of stages. Unlike GAS, which needs to be repeated in order (i.e., GAS-GAS-GAS-GAS), in MEGA you can flexibly mix and match (e.g., E-A-E-A-G). For example, in algorithms such as MF and LDA, processing an edge involves updating both vertices. This requires two GAS phases, but can be accomplished in one Exchange phase in MEGA. For LR, the vertex data propagations in both directions should be followed by an Apply phase, but no Scatter phases are necessary; this can be avoided in the MEGA model because MEGA allows an arbitrary sequence of stages.
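As a toy sketch of composing stages into an arbitrary sequence (e.g., the E-A-E-A-G example above), here is a Python rendering with placeholder stage bodies of my own; the Mini-batch stage is implicit in the surrounding loop.

```python
def exchange(state):
    print("Exchange: propagate data and exchange updates along edges")
    return state

def apply_stage(state):
    print("Apply: apply accumulated updates to the vertices")
    return state

def global_sync(state):
    print("GlobalSync: synchronize mirror vertices with their masters")
    return state

EXAMPLE_STAGES = [exchange, apply_stage, exchange, apply_stage, global_sync]   # E-A-E-A-G

def run_minibatch(stages, state=None):
    for stage in stages:
        state = stage(state)
    return state

run_minibatch(EXAMPLE_STAGES)
```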

Below are examples of Matrix factorization (MF) and Latent Dirichlet Allocation (LDA) programmed with the MEGA model. (LDA's stage sequence is the same as MF's.)
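
As a stand-in for those examples, here is a hypothetical sketch (my own pseudocode, not the paper's exact UDFs) of how MF might be expressed as a MEGA stage sequence; the point is that a single Exchange stage updates both endpoints of each edge, followed by an Apply stage to synchronize mirrors with masters:

    import numpy as np

    class Vertex:
        def __init__(self, rank=16):
            self.factors = np.random.rand(rank)   # latent factor vector

    def sgd_on_edge(user, item, rating, lr=0.01, reg=0.1):
        # Exchange UDF: processing one edge updates both endpoint vertices in a single phase.
        err = rating - user.factors.dot(item.factors)
        user_grad = err * item.factors - reg * user.factors
        item_grad = err * user.factors - reg * item.factors
        user.factors += lr * user_grad
        item.factors += lr * item_grad

    mf_program = [
        ("Mini-batch", "pick the next chunk of edges"),
        ("Exchange",   sgd_on_edge),
        ("Apply",      "synchronize item mirrors with their masters"),
    ]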


Implementation and Evaluation

TUX2 is implemented in ~12,000 lines of C++ code. TUX2 takes graph data in a collection of text files as input. Each process picks a separate subset of those files and performs bipartite-graph-aware algorithms to partition the graph in a distributed way. Each partition is assigned to, and stored locally with, a process. Unfortunately the evaluations with TUX2 do not take into account graph partitioning time, which can be very high. 

The evaluations show that data layout matters greatly in the performance of ML algorithms. Figure 8 compares the performance of BlockPG, MF, and LDA with two different layouts: one an array-based graph data layout in TUX2 and the other a hash-table-based layout often used in parameter-server-based systems (but implemented in TUX2 for comparison). The y-axis is the average running time of one iteration for BlockPG, and of 10 iterations for MF and LDA, to show the numbers on a similar scale. These results show that the graph layout improves performance by up to 2.4× over the hash-table-based layout.


The paper also includes a comparison with Petuum, but the evaluations have several caveats. They do not compare convergence time, and execution time per iteration does not always determine convergence time. They do not take into account the graph partitioning time for TUX2. And finally, some comparisons used an early, unpatched version of the Petuum MF algorithm whose data placement issues were later resolved.

MAD questions

1. What is the net gain here?
I like this paper; it made me ask and ponder on many questions, which is good.

I don't think TUX2 pushes the state of the art in ML. ML processing frameworks are already very efficient and general with the iterative parameter-server computing model, and they are getting better and more fine-grained.

On the other hand, I think TUX2 is valuable because it showed how the high-level graph computing frameworks can be adapted to implement the low-level parameter-server approach and address ML training problems more efficiently. This may provide some advantages for problems that are (or need to be) represented as graphs, such as performing ML training on SPARQL data stores.

Moreover, by using higher-level primitives, TUX2 provides some ease of programmability. I guess this may be leveraged further to achieve some plug-and-play programming of ML for certain classes of programs.

So I find this to be a conceptually very satisfying paper as it bridges the graph processing model to parameter-server model. I am less certain about the practicality part.


2. How do graph processing frameworks compare with dataflow frameworks?
There are big differences between dataflow frameworks and graph processing frameworks. In the dataflow model, there is a symbolic computation graph: the nodes represent mathematical operations, while the edges represent the data that flow between the operations. That is a very different model than the graph processing model here.

In MEGA, there are only 4 stages, where the Apply stage can take in user-defined functions. This is higher-level (and arguably more programmer-friendly) than a dataflow framework such as TensorFlow, which has many hundreds of predefined operators as vertices.


3. How does TUX2 apply to Deep Learning (DL)?
The paper does not talk about whether or how TUX2 can apply to DL.

It may be possible to make DL fit the TUX2 model with some stretching. Deep neural network (DNN) layers (horizontally or vertically partitioned) could be the high-rank vertices held in the servers, and the images could be the low-rank vertices held in the partitions.

But this would require treating the DNN partitions as a meta-vertex and scheduling executions for each sub-vertex in the meta-vertex in one cycle. I have no clue how to make backpropagation work here though.

Moreover, each image may need to link to the entire NN, so the bipartite graph may collapse into a trivial one with only trivial data-parallelism. It may be possible to distribute the convolutional layers. It may even be possible to insert early exits and train that way.

So, it may be possible but it is certainly not straightforward. I am not even touching the subject of the performance of such a system.

Monday, May 21, 2018

SoK Cryptocurrencies and the Bitcoin Lightning Network

We wrapped up the distributed systems seminar with two more papers discussed last month.

The "SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies" paper appeared in 2015. Although the cryptocurrency scene has seen a lot of action recently, this survey paper did not age and it is still a very good introduction to learning about the technical aspects and challenges  of cryptocurrencies. The paper starts with a technical overview of the cryptocurrency concept. Then it delves more into incentives and stability issues. It observes that it is unclear "how stability will be affected either in the end state of no mining rewards or in intermediate states as transaction fees become a non-negligible source of revenue". It talks about possible attacks, including Goldfinger attack and feather-forking, and also about stability of mining pools and the peer-to-peer layer. Finally it also covers some security and privacy issues.

"The Bitcoin Lightning Network: Scalable Off-Chain Instant Payments" whitepaper is from 2016. Bitcoin has a very high latency, poor scalability, and high transaction fees, and the lightning network provides an overlay solution to ease off these pains. To this end, the paper describes a secure off-chain solution for instant payments and microtransactions with low transaction fee.  This video by Jackson Palmer describes the lightning network ideas very nicely. (I strongly prefer reading to watching/listening, but recently I am finding a lot of useful content on YouTube.)

The protocol builds on the basic concept of a pairwise payment channel. Two parties put aside an initial amount of Bitcoin in a multisignature transaction. Subsequently, many updates to the allocation of the current balance can be made off-chain with just the cooperation of the two parties, using a new timelocked transaction, without broadcasting anything to the chain. This is analogous to buying a Starbucks card and transacting with Starbucks pairwise with that card. A broadcast to the chain can be done to redeem funds on the chain and close the channel. If either party tries to cheat by broadcasting an old transaction state from the pairwise payment channel, the counterparty may take all the funds in the channel as a penalty by providing the latest mutually signed state to the chain (within the on-chain dispute mediation window).

The lightning overlay network is then formed by multihop routing over these pairwise payment channels. Even when A and Z do not have a direct pairwise payment channel, it may be possible to construct a multihop route from A to Z through intermediaries, traversing several pairwise payment channels. While pairwise channels could be fee-free, the multihop intermediaries need a small transaction fee to have an incentive to participate.

The pairwise payment channels need to be extended with hashlocks to achieve multihop transactions. Here is how this is done, as explained in the Bitcoin wiki (a toy sketch of the hashlock condition follows the list):

  1. Alice opens a payment channel to Bob, and Bob opens a payment channel to Charlie.
  2. Alice wants to buy something from Charlie for 1000 satoshis.
  3. Charlie generates a random number and generates its SHA256 hash. Charlie gives that hash to Alice.
  4. Alice uses her payment channel to Bob to pay him 1,000 satoshis, but she adds the hash Charlie gave her to the payment along with an extra condition: in order for Bob to claim the payment, he has to provide the data which was used to produce that hash.
  5. Bob uses his payment channel to Charlie to pay Charlie 1,000 satoshis, and Bob adds a copy of the same condition that Alice put on the payment she gave Bob.
  6. Charlie has the original data that was used to produce the hash (called a pre-image), so Charlie can use it to finalize his payment and fully receive the payment from Bob. By doing so, Charlie necessarily makes the pre-image available to Bob.
  7. Bob uses the pre-image to finalize his payment from Alice.
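
Here is a toy sketch of just the hashlock logic from the steps above (real HTLCs wrap this condition in Bitcoin script together with timelocks, which are omitted here):

    import hashlib
    import os

    # Step 3: Charlie generates a random pre-image and its SHA256 hash.
    preimage = os.urandom(32)
    payment_hash = hashlib.sha256(preimage).hexdigest()

    # Steps 4-5: Alice and Bob attach the hash as a condition on their payments.
    def can_claim(candidate_preimage, payment_hash):
        """The payment can be claimed only by revealing data that hashes to payment_hash."""
        return hashlib.sha256(candidate_preimage).hexdigest() == payment_hash

    # Steps 6-7: Charlie reveals the pre-image to claim from Bob, which in turn lets Bob claim from Alice.
    assert can_claim(preimage, payment_hash)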

In order for the multihop routes to work, there should be enough cooperative participants online, each with enough balance to pass the bucket from hand to hand. The rest is a path-finding exercise. With viable paths present, similar to a source-side routing decision, you can start the lightning transaction. There has been a lot of work on MANETs under the names of intermittently connected networks and delay-tolerant routing. I wonder if some of that work can find applications here (even though it was mostly geometrical and this is a non-geometrical network). In practice though, the multihop network often converges to a hub-and-spokes model, with well-connected fat-wallet intermediaries.

The lightning network is also useful for transacting securely across different blockchains, as long as all edges in the route support the same hash function for the hashlock and can create timelocks.

MAD questions

1. How can we improve the way we run the seminar?
The students liked how we ran the seminars. They said they were more actively engaged and learned a great deal more due to this format. But I think we can improve. It would be nice to get experts to video-conference in and answer some questions. I think many experts would be generous enough to spare 20 minutes to answer questions from smart, well-prepared students.

The students also mentioned that it would have been nice to have some hands-on projects to accompany the seminars. We started a blockchain channel so the interested students can figure out some small projects they can work on and collaborate.

2. What did I learn?
I didn't think much of blockchains but I am fascinated to learn that there are many challenging questions here and many good ideas/techniques. This is an area very suitable for doing distributed algorithms/protocols work, which I love. There is a need for developing more principled approaches and well-reasoned and verified algorithms/protocols.

I still think the best ideas from this work will get borrowed and used in more centralized (could be hierarchical or federated) systems for the sake of efficiency/economy and scalability. That may be how these systems go mainstream to millions and even billions.

Saturday, May 19, 2018

Book Review -- Accidental Genius: Using Writing to Generate Your Best Ideas, Insight, and Content

I had read this short book a long time ago. It is a very helpful book to learn about how to use writing for thinking --freewriting.

Motivation for freewriting

The mind is lazy. Thinking is hard, and our brains don't like hard. It recycles tired thoughts, and avoids unfamiliar and uncomfortable territory.

Freewriting prevents that from happening. Freewriting is a form of forced creativity.
Writing is nature's way of letting you know how sloppy your thinking is.  --Guindon 
If you think without writing, you only think you're thinking. --Lamport 
Freewriting helps to unclog the mind, reduce resistance to thinking and writing, bring clarity, provide perspective, improve creativity by causing a chain reaction of ideas, and articulate better about ideas.

The premise of freewriting is simple: getting 100 ideas is easier than getting 1. "When you need an idea, don’t try for just one. When searching for one great idea, we demand perfection from it, depress ourselves, become desperate, and block. Go for lots of ideas. Keep your threshold low. One idea leads to the next."

The first half of the book gives the below six tactics for freewriting. (The second half elaborates more on these.)

1. Try Easy

A relaxed 90 percent is more efficient than a vein-bulging 100 percent effort. When you begin freewriting about a thorny subject, remind yourself to try easy.

2. Write Fast and Continuously 

Your best thought comes embedded in chunks of your worst thought. Write a lot. Think quantity. If you temporarily run out of things to say, babble onto the page. Write as quickly as you can type, and continue to generate words without stopping: Your mind will eventually give you its grade A unadulterated thoughts.

3. Work against a Time Limit 

The limit energizes your writing effort by giving you constraints. The Pomodoro method helps here.

4. Write the Way You Think 

Freewriting is a means of watching yourself think. Write the way you talk to yourself. Since you're writing for yourself, you don’t need to polish your raw thoughts to please others. All that matters is that you yourself understand your logic and references.

5. Go with the Thought 

That's the first rule of improv theater. Assume that a particular thought is true, and take a series of logical steps based on the thought. In other words, ask what-if questions.

6. Redirect Your Attention 

Explore random paths with focus changing questions: Why? Why not? How can I change that? How can I prove/disprove that? What am I missing here? What does this remind me of? What’s the best/worst-case scenario? Which strengths of mine can I apply? How would I describe this to my uncle? If I wanted to make a big mistake here, what would I do? What do I need that I don’t yet have?

MAD questions

1. What are other uses for freewriting?
Freewriting makes for good meditation. Many people use morning pages (freewriting in the morning) to clear logjams in their minds. I couldn't make this a habit, but when I did try it, I got value out of it.

I occasionally used freewriting for thinking about a research problem. I write down some observations and then start speculating about hypotheses, all in freewriting form. It works. Again, I wish I could make this a habit and do it more often. Well, I guess the blogging and the MAD questions help somewhat with this.

Freewriting is also good for planning. I should do more planning.

I guess it is time to schedule some freewriting in my week. 

2. Is it possible to use freewriting for writing?
Yes, you can use freewriting to write memos, articles, and stories. The trick is to edit ruthlessly after freewriting. That may be wasteful though. Freewriting helps you discover what you want to write. After that doing an outline and writing from that could save time. So it may be best to combine a little bit of bottom up freewriting with a little bit of top down outline writing.

3. Did you try any freewriting marathons?
The book suggests that instead of 20-minute freewriting sessions, it is helpful occasionally to go for 4-5 hour freewriting marathons. I never tried that. It sounds torturous, but maybe it is worth a try out of curiosity: what would my depleted, unrestrained psyche spew out after an hour of typing its train of thought? Couldn't the mind get lazy in freewriting mode as well and start to recycle the same thoughts? But I guess the idea of the marathon is to force your mind to move past that.

4. How was your experience with freewriting?
Let me know in the comments or via email or on Twitter.

Wednesday, May 16, 2018

Paper summary. Decoupling the Control Plane from Program Control Flow for Flexibility and Performance in Cloud Computing

This paper appeared in Eurosys 2018 and is authored by Hang Qu, Omid Mashayekhi, Chinmayee Shah, and Philip Levis from Stanford University.

I liked the paper a lot, it is well written and presented. And I am getting lazy, so I use a lot of text from the paper in my summary below.

Problem motivation 

In data processing frameworks, improved parallelism is the holy grail because it can get more data processed in less time.

However, parallelism has a nemesis called the control plane. While "control plane" can have a wide array of meanings, in this paper it is defined as the systems and protocols for scheduling computations, load balancing, and recovering from failures.

A centralized control plane becomes a bottleneck after a point. The paper cites other work and states that a typical cloud framework control plane with a fully centralized design can dispatch fewer than 10,000 tasks per second. Actually, that is not bad! However, machine learning (ML) applications are starting to push past that limit: we need to deploy, on thousands of machines, large jobs that consist of very many short tasks (around 10 milliseconds each) as part of iterations over data mini-batches. (For example, a thousand workers each finishing a 10 ms task would generate on the order of 100,000 task completions per second, well past a 10,000 tasks/second controller.)

To improve the control plane scalability, you can distribute the control plane across worker nodes (as in Nimbus and Drizzle). But that also runs into scaling problems due to the synchronization needed between workers and the controller. Existing control planes are tightly integrated with the control flow of the programs, and this requires workers to block on communication with the controller at certain points in the program, such as when spawning new tasks or resolving data dependencies. When synchronization is involved, a distributed solution is not necessarily more scalable than a centralized one. But it is certainly more complicated: in one sense, that is the computing analog of the mythical man-month problem.

Another approach to improve scalability is to remove the control plane entirely. Some frameworks, such as TensorFlow, Naiad, and MPI frameworks, are scheduled once as a big job and manage their own execution after that. Of course, the problem doesn't go away; it just plays out inside the framework: scalability is then limited by the application logic written in these frameworks and the frameworks' support for concurrency control. Furthermore, these frameworks don't play nice with the datacenter/cloud computing environment either. Rebalancing load or migrating tasks requires killing and restarting a computation by generating a new execution plan and installing it on every node.

This paper proposes a control plane design that breaks the existing tradeoff between scalability and flexibility. It allows jobs to run extremely short tasks (<1 ms) on thousands of cores and reschedule computations in milliseconds. And that has applications in particular for ML.

Making the Control Plane Asynchronous

To prevent synchronous operations, the proposed control plane cleanly divides responsibilities between controller and workers: a controller decides where to execute tasks and workers decide when to execute them.

The control plane's traffic is completely decoupled from the control flow of the program, so running a program faster does not increase load at the controller. When a job is stably running on a  fixed set of workers, the asynchronous control plane exchanges only occasional heartbeat messages to monitor worker status.

The architecture

Datasets are divided into many partitions, which determines the available degree of parallelism. Datasets are mutable and can be updated in place. This corresponds nicely to parameters in ML applications.

The controller uses an abstraction called a partition map to control where tasks execute. The partition map describes which worker each data object should reside on. _Because task recipes trigger tasks based on what data objects are locally present, controlling the placement of data objects allows the controller to implicitly decide where tasks execute._ The partition map is updated asynchronously to the workers, and when a worker receives an update to the map it asynchronously applies any necessary changes by transferring data.

On the worker side, an abstraction called task recipes describes triggers for when to run a task by specifying a pattern matched against the task's input data. Using recipes, every worker spawns and executes tasks by examining the state of its local data objects, obviating the need to interact with the controller.

Task recipes

A task recipe specifies (1) a function to run, (2) which datasets the function reads and/or writes, and (3) preconditions that must be met for the function to run.


There are three types of preconditions to trigger a recipe:

  1. Last input writer: For each partition it reads or writes, the recipe specifies which recipe should have last written it. This enforces local write-read dependencies, so that a recipe always sees the correct version of its inputs.
  2. Output readers: For each partition it writes, the recipe specifies which recipes should have read it since the last write. This ensures that a partition is not overwritten until tasks have finished reading the old data.
  3. Read messages: The recipe specifies how many messages a recipe should read before it is ready to run. Unlike the other two preconditions, which specify local dependencies between tasks that run on the same worker, messages specify remote dependencies between tasks that can run on different workers.
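
As a minimal sketch of these triggers (a hypothetical structure, not Canary's actual API), a worker could check a recipe's readiness against its local partition metadata as follows:

    from dataclasses import dataclass

    @dataclass
    class PartitionState:
        last_writer_stage: int = -1    # stage number of the recipe that last wrote this partition
        readers_since_write: int = 0   # how many recipes have read it since that write

    @dataclass
    class TaskRecipe:
        function: callable
        reads: dict            # partition id -> expected last writer's stage (precondition 1, simplified)
        writes: dict           # partition id -> required number of readers before overwrite (precondition 2)
        messages_needed: int   # remote dependencies delivered as messages (precondition 3)

        def ready(self, local_partitions, messages_received):
            writers_ok = all(local_partitions[p].last_writer_stage == stage
                             for p, stage in self.reads.items())
            readers_ok = all(local_partitions[p].readers_since_write >= n
                             for p, n in self.writes.items())
            return writers_ok and readers_ok and messages_received >= self.messages_needed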

Since incorrect preconditions can lead to extremely hard-to-debug computational errors, they are generated automatically from a sequential user program. A single recipe describes potentially many iterations of the same data-parallel computation.

Writers and readers are specified by their stage number, a global counter that every worker maintains. The counter counts the stages in program order and increments after the application determines which branch to take or whether to continue another loop. (Using these counters, I think it would be possible to implement the SSP method easily as well.) All workers follow an identical control flow, and so have a consistent mapping of stage numbers to recipes.

Exactly-once Execution and Asynchrony

Ensuring atomic migration requires a careful design of how preconditions are encoded as well as how data objects move between workers. No node in an asynchronous control plane has a global view of the execution state of a job, so workers manage atomic migration among themselves. To ensure that the task from a given stage executes exactly once and messages are delivered correctly, when workers transfer a data partition they include the access history metadata relevant to preconditions: the last writer and how many recipes have read it since.
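
A small sketch of this idea (hypothetical message format, reusing the PartitionState fields from the sketch above): the access history travels with the partition, so the receiving worker can evaluate the same preconditions and run exactly the next pending stage:

    def migration_message(partition_id, data, state):
        """Package a partition together with the access history metadata relevant to preconditions."""
        return {
            "partition_id": partition_id,
            "data": data,
            "last_writer_stage": state.last_writer_stage,
            "readers_since_write": state.readers_since_write,
        }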

Partition Map

A partition map is a table that specifies, for each partition, which worker stores that partition in memory. A partition map indirectly describes how a job should be distributed across workers, and is used as the mechanism for the controller to signal workers how to reschedule job execution.
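
A sketch of how a controller might represent and update the map (hypothetical names; the real rebalancing policy uses the collected execution statistics):

    # The partition map is simply: partition id -> worker id.
    partition_map = {0: "worker-A", 1: "worker-A", 2: "worker-B", 3: "worker-B"}

    def rebalance(partition_map, load):
        """Move one partition from the most loaded worker to the least loaded one.
        The update is then pushed to workers asynchronously; no worker blocks on it."""
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        for pid, worker in partition_map.items():
            if worker == busiest:
                partition_map[pid] = idlest   # the workers transfer the data among themselves
                break
        return partition_map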


The controller does five things:
(1) Starts a job by installing the job's driver program and an initial partition map on workers.
(2) Periodically exchanges heartbeat messages with workers and collects workers' execution statistics, e.g. CPU utilization and CPU cycles spent computing on each partition.
(3) Uses the collected statistics to compute partition map updates during job execution.
(4) Pushes partition map updates to all workers.
(5) Periodically checkpoints jobs for failure recovery.

Maximizing data locality

To maximize data locality, the controller updates the partition map under the constraint that the input partitions to each possible task in a job are assigned to the same worker. The execution model of task recipes is intentionally designed to make this constraint explicit and achievable: if a stage reads or writes multiple datasets, a task in the stage only reads or writes the datasets' partitions that have the same index, so those partitions are constrained to be assigned to the same worker.
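
A sketch of that co-location constraint (hypothetical names): partitions of different datasets that share the same index are pinned to the same worker, so a task's inputs and outputs are always local:

    def assign_partitions(num_partitions, workers, datasets):
        """Partitions with the same index across all datasets go to the same worker."""
        assignment = {}
        for i in range(num_partitions):
            worker = workers[i % len(workers)]
            for ds in datasets:
                assignment[(ds, i)] = worker
        return assignment

    # e.g., assign_partitions(4, ["w0", "w1"], ["weights", "gradients"])
    # places ("weights", 2) and ("gradients", 2) both on w0.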

Implementation

The group designed Canary, an asynchronous control plane, which can execute over 100,000 tasks/second on each core, and can scale linearly with the number of cores.

The driver constructs the task recipes. A driver program specifies a sequential program order, but the runtime may reorder tasks as long as the observed result is the same as the program order (just as how processors reorder instructions).


Canary periodically checkpoints all the partitions of a job. The controller monitors whether workers fail using periodic heartbeat messages. If any worker running a job is down, the controller cleans up the job's execution on all workers, and reruns the job from the last checkpoint.

Checkpoint-based failure recovery rewinds the execution on every worker back to the last checkpoint when a failure happens, while lineage-based failure recovery as in Spark only needs to recompute lost partitions. But the cost of lineage-based failure recovery in CPU-intensive jobs outweighs the benefit, because it requires every partition to be copied before modifying it.

Evaluation results

Current synchronous control planes such as Spark execute 8,000 tasks per second; distributed ones such as Nimbus and Drizzle can execute 250,000 tasks/second. Canary, a framework with an asynchronous control plane, can execute over 100,000 tasks/second on each core, and this scales linearly with the number of cores. Experimental results on 1,152 cores show it schedules 120 million tasks per second. Jobs using an asynchronous control plane can run up to an order of magnitude faster than on prior systems. At the same time, the ability to split computations into huge numbers of tiny tasks without introducing substantial overhead allows an asynchronous control plane to efficiently balance load at runtime, achieving a 2-3× speedup over highly optimized MPI codes.

Evaluations are done with applications performing logistic regression, K-means clustering, and PageRank.

MAD Questions

1. Is it a good idea to make tasks/recipes dependent on/linked to individual data objects? How do we know the data objects in advance? Why does the code need to refer to the objects? I think that model can work well if the data objects are parameters to be tuned in ML applications. We live in the age of the `big model'. I guess graph processing applications can also fit well to this programming model. I think this can also fit well with any(?) dataflow framework application. Is it possible to make all analytics applications fit this model?

2. The Litz paper had similar ideas for doing finer-grain scheduling at the workers and obviating the need for synchronizing with the scheduler. Litz is a resource-elastic framework supporting high-performance execution of distributed ML optimizations. Litz remains general enough to accommodate most ML applications, but also provides an expressive programming model allowing the applications (1) to support stateful workers that can store the model parameters which are co-located with a partition of input data, and (2) to define custom task scheduling strategies satisfying fine-grain task dependency constraints and allowances. At runtime, Litz executes these strategies within the specified consistency requirements, while gracefully persisting and migrating application state.

3. Since it is desirable to have one tool for batch and serving, would it be possible to adopt Canary for serving? Could it be nimble enough?

4. Is it possible to apply techniques from the Blazes paper to improve how the driver constructs the task recipes?