Wednesday, February 21, 2018

Paper summary: Bitcoin-NG -- A scalable Blockchain Protocol

This week in our seminar, we discussed the Bitcoin-NG paper. The paper appeared in NSDI 16, and is authored by Ittay Eyal, Adem Efe Gencer, Emin Gün Sirer, and Robbert van Renesse at the Cornell University.

The model section of this paper is very well formalized and written. This is like a Rosetta Stone find for classical distributed systems researchers that want to enter blockchain research. So in this summary, I will start by covering as much of that as I can.

The Nakamoto Consensus Problem 

The blockchain system is comprised of a set of nodes N connected by a reliable peer-to-peer network. Nodes can generate public-private key-pairs for themselves. The system employs a cryptopuzzle system, defined by a cryptographic hash function H. The solution to a puzzle defined by the string y is a string x such that H(y|x) --the hash of the concatenation of the two-- is smaller than some target (i.e., the hash has k number of leading zeros). Each node i has a limited amount of compute power, called mining power, measured by the number of potential puzzle solutions it can try per second. A solution to a puzzle constitutes a "proof of work", as it statistically indicates the amount of work a node had to perform in order to find it.

At any time t, a subset of nodes B(t) are Byzantine and can behave arbitrarily, controlled by a single adversary. The mining power of the Byzantine nodes is less than 1/4 of the total compute power at any given time.

Comparing the Nakamoto consensus problem with the classic Byzantine consensus problem is very instructional. Nakamoto consensus has probabilistic guarantees for Termination, Agreement, and Validity, whereas the classic Byzantine Consensus has deterministic guarantees for them.

BitCoin's Blockchain protocol

Bitcoin provides a Byzantine fault tolerant blockchain protocol for implementing a decentralized cryptocurrency. Each user commands addresses, and sends Bitcoins by forming a transaction from her address to another's address and sending it to the Bitcoin mining nodes.  Transactions are protected with cryptographic techniques that ensure only the rightful owner of a Bitcoin address can transfer funds from it. A client owns x Bitcoins at time t if the aggregate of unspent outputs to its address is x. Miners accept transactions only if their sources have not been spent, preventing users from double-spending their funds. The miners commit the transactions into a global append-only log called the blockchain. If multiple miners create blocks with the same preceding block, the chain is forked into branches, forming a tree. All miners add blocks to the heaviest chain of which they know, with random tie-breaking.

If you want to understand the blockchain data structure, this youtube video and demo is fantastic for it.

Unfortunately Bitcoin is haunted with scalability woes. The maximum rate at which Bitcoin can process transactions is capped by the block size and block interval.

Increasing the block size (which is currently set at 1MB) improves throughput, but the resulting bigger blocks take longer to propagate in the network. Of course increasing that to 10 MB will not cause a significant problem in propagation, after all bandwidth is not that limited. However, taken at extreme the tradeoff will be there. Moreover, every second of headstart counts for mining the next block: New blocks received late means miners are wasting resources by building on an old block that is no longer the most recent.

Reducing the block interval reduces latency, but leads to instability due to frequent forks. Bitcoin currently targets a conservative 10 minutes between blocks, yielding 10-minute expected latencies for transactions to be encoded in the blockchain.

With this setup, Bitcoin yields a wimpy 1 to 3.5 transactions per second. Transaction finalization is also problematic. If you wait for 6 blocks to append to declare a transaction to be finalized, that will take an expected 60 minutes time.


Bitcoin-NG is a protocol that improves on Bitcoin in terms of transaction throughput and latency of propagation. Bitcoin-NG’s latency is limited only by the propagation delay of the network, and its bandwidth is limited only by the processing capacity of the individual nodes. Bitcoin-NG achieves this performance improvement by decoupling Bitcoin’s blockchain operation into two planes: leader election and transaction serialization. In Bitcoin-NG, consensus is pushed back only to identify the leader, and serialization is performed by the leader. This seems to me to be a classic application of chain replication (while the paper does not cite chain replication).

Bitcoin-NG divides time into epochs, where each epoch has a single leader. As in Bitcoin, leader election is performed randomly and infrequently via proof-of-work. Once a leader is chosen, it is entitled to serialize transactions via microblocks unilaterally until a new leader is chosen, marking the end of the former’s epoch.

Thus the protocol introduces two types of blocks: key blocks for leader election and microblocks that contain the ledger entries. Like a Bitcoin block, a key block contains the reference to the previous block (either a key block or a microblock, usually the latter), the current Unix time, a coinbase transaction to pay out the reward, a target value, and a nonce field containing arbitrary bits. Unlike Bitcoin, a key block in Bitcoin-NG contains a public key that will be used in subsequent microblocks.

Once a node generates a key block it becomes the leader. As a leader, the node is allowed to generate microblocks at a set rate smaller than a predefined maximum. (Specifically, if the timestamp of a microblock is in the future, or if its difference with its predecessor's timestamp is smaller than the minimum, then the microblock is invalid. This prohibits a greedy leader from swamping the system with microblocks.)  A microblock contains ledger entries and a header. The header contains the reference to the previous block, the current Unix time, a cryptographic hash of its ledger entries, and a cryptographic signature of the header. The signature uses the private key that matches the public key in the latest key block in the chain.

Microblock fork prevention is essential for this system, since the leader can spit out many microblocks quickly to make more money from transaction fees.  To report on a leader that violates the microblock production rate, a poison transaction is employed. The poison transaction contains the header of the first block in the pruned branch as a proof of fraud, and it has to be placed on the blockchain within the maturity window of the misbehaving leader’s key block, and before the revenue is spent by the malicious leader.

The new leader cannot fake an offending microblock to accuse the old leader, because the poison transaction should contain the private signature of the previous leader. Moreover the 40/60 partitioning of credit for microblock transactions between the old leader and the new leader incentivizes the new leader not to cut the microblock set short, because it would like to get more revenue from them. So the new leader that mines the last key block has all the incentive to behave according to the protocol. The 40/60 partitioning of the credit is shown to satisfy these constraints via mathematical modeling in the paper.

It looks like the microblocks need not be chained in a sequence. Rather the microblocks form a set that follow the keyblock of the corresponding leader and satisfying the rate limitations prescribed in the protocol. So what is the authoritative microblock set taken then, given that slightly different set of microblocks may arrive at different nodes of the network. It looks like the new leader (that mines the next key block) is the one to authoritatively decide on that.

MAD questions

1) In the Bitcoin protocol, if we increase the block size what could go wrong? That would definitely help with the throughput problems Bitcoin is facing. It would mean some slight increase in delivery latency depending on the bandwidth available in the path. But even increasing it to 10MB is unlikely to break anything, I think. (But hey I am a newbie in this space.) So why are there endless debates about this in the Bitcoin community? Could it be that the small blockers are concerned more about actually the increase in the transaction volume, which would mean the history grows quickly, which would make being a full Bitcoin node (rather than just a client) become more of a burden? But isn't that just delaying the inevitable? Is the idea here to delay the inevitable until more storage space and computing become available? That sounds so wrong to me as an engineer.

2) It looks like a hard fork will be needed for switching to the Bitcoin-NG protocol, because the current Bitcoin clients are unaware of the concept of microblocks. Oh, well, good luck with getting on a "consensus" on a hard fork in Bitcoin protocol. In his latest blog post "Why Decentralization Matters", Chris Dixon cited as the biggest problem with the centralized services that they can change the rules on the users: "The good news is that billions of people got access to amazing technologies, many of which were free to use. The bad news is that it became much harder for startups, creators, and other groups to grow their internet presence without worrying about centralized platforms changing the rules on them, taking away their audiences and profits. This in turn stifled innovation, making the internet less interesting and dynamic."

On the other hand, the "community-governed" decentralized protocols seem to be haunted by the reverse problem: It is very hard to change the rules even to fix problems with the protocol. Isn't that as big, if not a bigger, problem as the former for stifling innovation?

Tuesday, February 13, 2018

Paper review. IPFS: Content addressed, versioned, P2P file system

This week we discussed the IPFS whitepaper by Juan Benet in my Distributed Systems Seminar.

Remember peer-to-peer systems? IPFS is "peer-to-peer systems reloaded" with improved features. IPFS is a content-addressed distributed file system that combines Kademlia + BitTorrent + Git ideas. IPFS also offers better privacy/security features: it provides cryptographic hash content addressing, file integrity and versioning, and filesystem-level encryption and signing support.

The question is will it stick? I think it won't stick, but this work will still be very useful because we will transfer the best bits of IPFS to our datacenter computing as we did with other peer-to-peer systems technology. The reason I think it won't stick has nothing to do with the IPFS development/technology, but has everything to do with the advantages of centralized coordination and the problems surrounding decentralization. I rant more about this later in this post. Read on for the more detailed review on IPFS components, killer app for IPFS, and MAD questions.

IPFS components


Nodes are identified by a NodeId, the cryptographic hash3 of a public-key, created with S/Kademlia’s static crypto puzzle. Nodes store their public and private keys (encrypted with a passphrase).


Transport: IPFS can use any transport protocol, and is best suited for WebRTC DataChannels(for browser connectivity) or uTP.
Reliability: IPFS can provide reliability if underlying networks do not provide it, using uTP or SCTP.
Connectivity: IPFS also uses the ICE NAT traversal techniques.
Integrity: IPFS optionally checks integrity of messages using a hash checksum.
Authenticity: IPFS optionally checks authenticity of messages using HMAC with sender’s public key.


To find other peers and objects, IPFS uses a DSHT based on S/Kademlia and Coral. Coral DSHT improves over by Kademlia based on the three rules of real-estate: location, location, location. Coral stores addresses to peers who can provide the data blocks taking advantage of data locality. Coral can distribute only subsets of the values to the nearest nodes avoiding hot-spots. Coral organizes a hierarchy of separate DSHTs called clusters depending on region and size. This enables nodes to query peers in their region first, "finding nearby data without querying distant nodes" and greatly reducing the latency of lookups.


In IPFS, data distribution happens by exchanging blocks with peers using a BitTorrent inspired protocol: BitSwap. Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent. The blocks can come from completely unrelated files in the filesystem. In a sense, nodes come together to barter in the marketplace. BitSwap incentivizes nodes to seed/serve blocks even when they do not need anything in particular. To avoid leeches (freeloading nodes that never share), peers track their balance (in bytes verified) with other nodes, and peers send blocks to debtor peers according to a function that falls as debt increases. For bartering, potentially, a virtual currency like FileCoin (again by Juan Benet) can be used.


IPFS builds a Merkle DAG, a directed acyclic graph where links between objects are cryptographic hashes of the targets embedded in the sources. (This video explains Merkle Trees superbly.) Merkle DAGs provide IPFS many useful properties:
1. Content addressing: All content is uniquely identified by its multihash checksum.
2. Tamper resistance: all content is verified with its checksum.
3. Deduplication: all objects that hold the exact same content are equal, and only stored once.


IPFS also defines a set of objects for modeling a versioned filesystem on top of the Merkle DAG. This object model is similar to Git’s:
1. block: a variable-size block of data.
2. list: an ordered collection of blocks or other lists.
3. tree: a collection of blocks, lists, or other trees.
4. commit: a snapshot in the version history of a tree.


IPNS is the DNS for IPFS. We have seen that NodeId is obtained by hash(node.PubKey). Then IPNS assigns every user a mutable namespace at: /ipns/<NodeId>. A user can publish an Object to this /ipns/<NodeId> path signed by her private key. When other users retrieve the object, they can check the signature matches the public key and NodeId. This verifies the authenticity of the Object published by the user, achieving mutable state retrieval.

Unfortunately since <NodeId> is a hash, it is not human friendly to pronounce and recall. For this DNS TXT IPNS Records are employed. If /ipns/<domain> is a valid domain name, IPFS looks up key ipns in its DNS TXT records: TXT "ipfs=XLF2ipQ4jD3U ..." 
# the above DNS TXT record behaves as symlink
ln -s /ipns/XLF2ipQ4jD3U /ipns/

There is even the Beaker browser to help you surf IPFS. But its usability is not great.  If IPFS wants to manage the web, it should further improve its IPNS and content discovery game. Where is the search engine for IPFS content? Do we need to rely on links from friends like the 1993's Web?

What is the killer app for IPFS?

The introduction of the paper discusses HTTP and Web, and then says:
"Industry has gotten away with using HTTP this long because moving small files around is relatively cheap, even for small organizations with lots of traffic. But we are entering a new era of data distribution with new challenges: (a) hosting and distributing petabyte datasets, (b) computing on large data across organizations, (c) high-volume high-definition on-demand or real-time media streams, (d) versioning and linking of massive datasets, (e) preventing accidental disappearance of important files, and more. Many of these can be boiled down to "lots of data, accessible everywhere." Pressed by critical features and bandwidth concerns, we have already given up HTTP for different data distribution protocols. The next step is making them part of the Web itself. 
What remains to be explored is how [Merkle DAG] data structure can influence the design of high-throughput oriented file systems, and how it might upgrade the Web itself. This paper introduces IPFS, a novel peer-to-peer version-controlled filesystem seeking to reconcile these issues."
How common are petabyte or even gigabyte files on the Internet? There is definitely an increase in size trend due to the popularity of the multimedia files. But when will this become a pressing issue? It is not a pressing issue right now because CDNs help a lot for reducing traffic for the Internet. Also bandwidth is relatively easy to add compared to latency improvements. Going for a decentralized model globally comes with several issues/headaches, and I don't know how bad the bandwidth problems would need to get before starting to consider that option. And it is not even clear that the peer-to-peer model would provide more bandwidth savings than CDNs at the edge model.

I am not convinced that the Web is the killer application for IPFS, although at the end, the paper gets ambitious:
"IPFS is an ambitious vision of new decentralized Internet infrastructure, upon which many different kinds of applications can be built. At the bare minimum, it can be used as a global, mounted, versioned filesystem and namespace, or as the next generation file sharing system. At its best, it could push the web to new horizons, where publishing valuable information does not impose hosting it on the publisher but upon those interested, where users can trust the content they receive without trusting the peers they receive it from, and where old but important files do not go missing. IPFS looks forward to bringing us toward the Permanent Web."
Decentralization opens a Pandora's box of issues. Centralized is efficient and effective. Coordination wants to be centralized. A common and overhyped misconception is not centralized is not scalable and centralized is a single point of failure. After close to two decades of work in cluster computing and cloud computing, we have good techniques in place for achieving scalability and fault-tolerance for centralized (or logically centralized, if you like) systems. For scalability, shard it, georeplicate it, and provide CDNs for reading. For fault-tolerance, slap Paxos on it, or use chain replication systems (where Paxos guards the chain configuration), or use the globe-spanning distributed datastores available today. Case in point, Dropbox is logically-centralized but is very highly available and fault-tolerant, while serving to millions of users. Facebook is able to serve billions of users.

If you want to make the natural disaster tolerance argument to motivate the use of IPFS, good luck trying to use IPFS over landlines when power and ISPs are down, and good luck trying to form a multihop wireless ad hoc network over laptops using IPFS. Our only hope in a big natural disaster is cell towers and satellite communication. Disaster tolerance is serious work and I hope governments around the world are funding sufficient research into operational, planning, and communications aspects of that.

In Section 3.8, the whitepaper talks about the use cases for IPFS:
1. As a mounted global filesystem, under /ipfs and /ipns.
2. As a mounted personal sync folder that automatically versions, publishes, and backs up any writes.
3. As an encrypted file or data sharing system.
4. As a versioned package manager for all software.
5. As the root filesystem of a Virtual Machine.
6. As the boot filesystem of a VM (under a hypervisor).
7. As a database: applications can write directly to the Merkle DAG data model and get all the versioning, caching, and distribution IPFS provides.
8. As a linked (and encrypted) communications platform.
9. As an integrity checked CDN for large files (without SSL).
10. As an encrypted CDN.
11. On webpages, as a web CDN.
12. As a new Permanent Web where links do not die.
I don't think any of these warrant going full peer-to-peer. There are centralized solutions for them, or centralized solutions are possible for them.

An important use case for IPFS is to circumvent government censorship. But isn't it easier to use VPNs then to use IPFS for this purpose. (Opera browser comes with VPN build-in, and many easy to use VPN apps are available.) If the argument is that the governments can ban VPNs or prosecute people using VPN software, those issues also apply to IPFS unfortunately. Technology is not always the solution especially when dealing with big social issues.

IPFS may be a way of sticking it to the man. But the invisible hand of the free market forces also help here; when one big corporation starts playing foul and upsets the users, new companies and startups quickly move in to disrupt the space and fill in the void.

Again, I don't want to come across wrong. I think IPFS is great work, and Juan Benet and IPFS contributors accomplished a gigantic task, with a lot of impact on future systems (I believe the good parts of IPFS will be "adopted" to improve Web and datacenter/cloud computing). I just don't believe dialing the crank to 11 on decentralization is the right strategy for wide adoption. I don't see the killer application that makes it worthwhile to move away from the convenience of the more centralized model to open the Pandora's box with a fully-decentralized model.

MAD questions 

1) Today's networking ecosystem evolved for the client-server model, what kind of problems could this create for switching to peer-to-peer model? As a basic example, the uplink at residential (or even commercial) spaces is an order of magnitude less than downlink assuming they are consumers of traffic not originators of traffic. Secondly, ISPs (for good or bad) evolved to take on traffic shaping/engineering responsibilities peering with other ISPs. It is a complex system. How does popular IPFS use interact with that ecosystem.

2) As a related point, smartphones gained primary citizenship status in today's Internet. How well can peer-to-peer and IPFS get along with smartphones? Smartphones are very suitable to be thin clients in the cloud computing model, but they are not suitable to act as peers in a peer-to-peer system (both for battery and connection bandwidth reasons). To use a technical term, the smartphones will be leeches in a peer-to-peer model. (Well unless there is good token/credit system in place, but it is unrealistic to expect that soon.)

3) On the academic side of things, designing a decentralized search engine for IPFS sounds like a great research problem. Google had it easy in the datacenter but can you design a decentralized keyword/content based search engine (or one day old indexes) maintained in a P2P manner over IPFS nodes? Popularity of a file in the system (how many copies it has in the system) can play a role in its relevance ranking for the keyword. Also could a bloom filter like data structure be useful in a p2p search?

4) Here are some more pesky problems with decentralization. I am not clear if satisfactory answers exist on these. Does IPFS mean I may be storing some illegal content originated by other users?
How does IPFS deal with the volatility? Just closing laptops at night may cause unavailability under an unfortunate sequence of events. What is the appropriate number of replicas for a data to avoid this fate? Would we have to over-replicate to be conservative and provide availability?
If IPFS is commonly deployed, how do we charge big content providers that benefit from their content going viral over the network? Every peer chips in distributing that content, but the content generator benefits let's say by way of sales. Would there need to be a token economy that is all seeing and all fair to solve this issue?

5) Is it possible to use Reed-Solomon erasure coding with IPFS? Reed-Solomon codes are very popular in the datacenters as they provide great savings for replication.

6) IPFS does not tolerate Byzantine behavior, right? The crypto puzzle needed for Node Id can help reduce the false spammers, as it makes them do some work. But after joining, there is no guarantee that the peers will play it fair: they can be Byzantine to wreak havoc on the system. But how much problems can they cause? Using cryptos and signatures prevent many problems. But can the Byzantine nodes somehow collude to cause data loss in the system, making the originator think the data is replicated, but then deleting this data? What other things can go wrong?

Saturday, February 10, 2018

Paper summary. SnailTrail: Generalizing critical paths for online analysis of distributed dataflows

Monitoring is very important for distributed systems, and I wish it would receive more attention in research conferences. There has been work on monitoring for predicate detection purposes and for performance problem detection purposes. As machine learning and big data processing frameworks are seeing more action, we have been seeing more work on the latter category. For example in ML there have been work on how to figure out what is the best configuration to run. And in the context of general big data processing framework there has been work on identifying performance bottlenecks.

Collecting information and creating statistics about a framework to identify the bottleneck activities seems like an easy affair. However, the "making sense of performance" paper (2015) showed that this is not as simple as it seems, and sophisticated techniques such as blocked time analysis are needed to get a more accurate picture of performance bottlenecks.

This paper (by ETH Zurich and due to appear in NSDI 18) extends the performance bottleneck analysis to more general frameworks and to be supported by an online monitoring tool called SnailTrail. SnailTrail is written in Rust, over the Timely Dataflow framework. It supports monitoring of applications written in Flink, Spark, Timely Dataflow, Tensorflow, and Heron.

SnailTrail overview

The SnailTrail tool operates in 5 stages:
  • it ingests the streaming logs from the monitored distributed application,
  • slices those streams into windows, 
  • constructs a program activity graph (PAG) for the windows, 
  • computes the critical path analysis of the windows, and 
  • outputs the summaries. 

Program activity graph (PAG) is a directed acyclic graph. The vertices denote the start and end of activities, such as: data processing, scheduling, buffer management, serialization, waiting, application data exchange, control messages, or unknown activities. The edges has a type and a weight for the activities. The edges capture the happened-before relationships between the vertices. Figure 2.a shows an example. The figure also shows how to project this PAG to an interval, which is pretty straightforward and as you expect.

As output, SnailTrail provides summaries for operation, worker, and communication bottlenecks. These summaries help identify which machine is overloaded, and which operators should be re-scaled. The figure below shows examples.

Critical path analysis

Critical path analysis (CPA) is a technique originally introduced in the context of project planning. CPA produces a metric which captures the importance of each activity in the transient critical paths of the computation.
The CPA score calculation in SnailTrail centers on the Transient Path Centrality concept which corresponds to the number of paths this activity appears on. In the formula e[w] corresponds to the weight of the edge e, which is taken as the start/end time duration of the activity.

In Figure 3, the activity (u,v) has the highest CPA score because it is involved in all of the 9 paths that appear in this window.

A noteworthy thing in the PAG in Figure 3 is the wait period (denoted as dashed lines) after b in worker 1. Because worker 1 is blocked with a wait after b, that path terminates at b, and does not feed into another path. The paper explains this as follows: "Waiting in our model is always a consequence of other, concurrent, activities, and so is a key element of critical path analysis: a worker does not produce anything useful while waiting, and so waiting activities can never be on the critical path." Because the two wait states terminates in deadend some of the paths, in this interval depicted in Figure 3, there are only 9 possible paths.

The formula above enables us to calculate the CPA scores without enumerating and materializing all the transient critical paths of the computation in each window. But how do you calculate N (which, for Figure 3 is calculated as 9) without enumerating the critical paths? It is simple really. The PAG is a directed acyclic graph (DAG). Everytime there is a split in the path, you copy path_count to both edges. Everytime two edges join, you set the new path_count to be the sum of the path_counts in the two incoming edges. At the end of the interval, look at how many paths are exiting, that is N. Give this a try for Figure 3 yourself.

MAD questions

1) What could be some drawbacks/shortcomings of CPA? One reason CPA may not be representative is because one execution of the application may not be typical of all executions. For example in Fig 3, w3 may take long time in the next execution in c-d activity and that could become the bottleneck in another execution of the application. But it is possible to argue that since SnailTrail produces summaries, this may not be a big issue. Another reason CPA may not be representative is because the execution may be data dependent. And again it is possible to argue that this won't be a big issue if the application uses several data in processing, and things get averaged.

Another criticism of CPA could be that it does not compose. Try combining two adjacent windows; that would lead to changing the CPA scores for the activities. Some previously low scored  activities will jump up to be high, and some highly scored activities will go down. Because of this non-composition, it becomes important to decide what is the best window/interval size to calculate these CPA metrics for summarization purposes. On the other hand, it may be reasonable to expect these metrics to be non-composable, since these metrics (blocked time analysis and critical time analysis) are designed to be complex to capture inherent/deep critical activities in the computations.

Could there be another approach than the CPA scoring presented here to capture the importance of an execution activity in the paths of the computation?  Maybe something that uses the semantics of activity types and their relation to each other?

2) The SnailTrail method uses snapshots to determine the start and end of the windows that the CPA scoring algorithm works on. Does time synchronization need to be perfect for snailtrail snapshot and analysis to work? What are the requirements on the time synchronization for this to work? 

It turns out this currently requires perfect clock synchronization. The evaluation experiments are run in the same machine with 32 cores. Without perfect clock synchronization, the snapshots may not be consistent and that could break the path calculation and CPA scoring. Hybrid Logical Clocks can help deal with the uncertainty periods in NTP clocks and can make the method work. The paper addresses this issue in Appendix F: "In SnailTrail, we assume that the trend toward strong clock synchronization in datacenters means that clock skew is not, in practice, a significant problem for our analysis. If it were to become an issue, we would have to consider adding Lamport clocks and other mechanisms for detecting and correcting for clock skew."

3) The paper doesn't have an example of improvement/optimization based on summary analytics. Even when the summary shows the high scored CPA activities to improve upon, it will be tricky for a developer to go in and change the code to improve things, because there is no indication of how easy it would be to improve the performance on this high CPA score activities. 

One way to address this could be to extend the CPA idea as follows. After ranking the activities based on CPA scores, construct another ranking of activities based on ease of optimizing/improving (I don't know how to do this, but some heuristics and domain knowledge can help). Then start the improvements/optimizations from the easiest to optimize activities that have highest CPAs.
Another way to make the summary analytics more useful is to process it further to provide more easily actionable suggestions, such as add more RAM, more CPU, more network/IO bandwidth. Or the suggestions could also say that you can do with less of resources of one kind, which can helpf for saving money in cloud deployments. If you can run with similar performance but on cheaper configurations, you would like to take that option. Would this extension require adopting SnailTrail monitoring and CPA scoring more towards capturing  resource usage related metrics? 

4) Since CPA first appeared in the project/resource management domain, could there be other techniques there that can apply to performance monitoring of distributed execution?

5) I think SnailTrail breaks new ground in the "context" or "philosophy" of monitoring: it is not trace based, it is not snapshot based, but it is window/interval-based. In our work on Retroscope, we argued it is important to aggregate the paths for both spatially (across the distributed computation) and temporally (as an evolving picture of the system's performance). SnailTrail extends the snapshot view to window views.


Frank McSherry recently joined ETH Zurich's systems group and will be helping with the strymon project and the SnailTrail work, so I'll be looking forward to more monitoring work from there.  

Wednesday, February 7, 2018

Think before you code

I have been reading Flash boys by Michael Lewis. It is a fascinating book about Wall Street and high-frequency traders. I will write a review about the book later but I can't refrain from an overview comment. Apart from the greediness, malice, and sneakiness of Wall Street people, another thing that stood out about them was the pervasive cluelessness and ineptness. It turns out nobody knew what they were doing, including the people that are supposed to be regulating things, even sometimes the people that are gaming the system. This brought to my mind Steve Jobs' quote from his 1994 interview: "Everything in this world... was created by people no smarter than you."

Anyways, today I will talk about something completely different in the book that caught my eye. It is about thinking before coding.

On Page 132 of the book:
Russians had a reputation for being the best programmers on Wall Street, and Serge thought he knew why: They had been forced to learn to program computers without the luxury of endless computer time. Many years later, when he had plenty of computer time, Serge still wrote out new programs on paper before typing them into the machine. "In Russia, time on the computer was measured in minutes," he said. "When you write a program, you are given a tiny time slot to make it work. Consequently we learned to write the code in ways that minimized the amount of debugging. And so you had to think about it a lot before you committed it to paper.... The ready availability of computer time creates this mode of working where you just have an idea and type it and maybe erase it ten times. Good Russian programmers, they tend to have had that one experience at some time in the past--the experience of limited access to computer time."

Of course, Dijkstra was a big proponent of thinking before programming. Here in EWD 1284, he compares European CS vs American CS. He didn't have nice things to say about the keyboard-happy American CS.
The major differences between European and American CS are that American CS is more machine-oriented, less mathematical, more closely linked to application areas, more quantitative and more willing to absorb industrial products in its curriculum. For most of these differences there are perfect historical explanations, many of which reflect the general cultural differences between the two continents, but for CS we have also to take into account the special circumstance that due to the post-war situation, American CS emerged a decade earlier, for instance at the time when design, production, maintenance and reliability of the hardware were still causes for major concern. The names of the early professional societies are in this respect revealing: the “Association for Computing Machinery” and the “British Computer Society”. And so are the names of the scientific discipline and the academic departments: in the US, CS is short for Computer Science, in Europe it is short for Computing Science.
The other circumstance responsible for a transatlantic difference in how CS evolved I consider a true accident of history, viz. that for some reason IBM was very slow in getting interested in Europe as a potential market for its computers, and by the time it targeted Europe, this was no longer virgin territory. Consequently, IBM became in Europe never as predominant as it has been in Northern America. 

Here in EWD1165, Dijkstra shares an anecdote about thinking before coding and uses that as an opportunity to take a shot at "software engineering". Man, Dijkstra had the best rants.
A recent CS graduate got her first job, started in earnest on a Monday morning and was given her first programming assignment. She took pencil and paper and started to analyse the problem, thereby horrifying her manager 1.5 hours later because “she was not programming yet”. She told him she had been taught to think first. Grudgingly the manager gave her thinking permission for two days, warning her that on Wednesday she would have to work at her keyboard “like the others”! I am not making this up. And also the programming manager has found the euphemism with which to lend an air of respectability to what he does: “software engineering”.

In fact, Dijkstra takes things further, and advocates even forgoing the pen and the paper when thinking:
What is the shortest way to travel from Rotterdam to Groningen, in general: from given city to given city. It is the algorithm for the shortest path, which I designed in about twenty minutes. One morning I was shopping in Amsterdam with my young fiancée, and tired, we sat down on the café terrace to drink a cup of coffee and I was just thinking about whether I could do this, and I then designed the algorithm for the shortest path. As I said, it was a twenty-minute invention. In fact, it was published in ’59, three years late. The publication is still readable, it is, in fact, quite nice. One of the reasons that it is so nice was that I designed it without pencil and paper. I learned later that one of the advantages of designing without pencil and paper is that you are almost forced to avoid all avoidable complexities. Eventually that algorithm became, to my great amazement, one of the cornerstones of my fame.
Dijkstra (2001), in an interview with Philip L. Frana. (OH 330; Communications of the ACM 53(8):41–47)"

In several of his EWDs, Dijkstra mentioned how he favored solving problems without pen and paper and just by thinking hard, and how he has the entire article sketched in his mind before he sits to write it down in one go. Here is an example where he mentions the Mozart versus Beethoven approach to composing. Obviously Dijkstra was on the Mozart camp.
There are very different programming styles. I tend to see them as Mozart versus Beethoven. When Mozart started to write, the composition was finished. He wrote the manuscript and it was 'aus einem Guss' (from one cast). In beautiful handwriting, too. Beethoven was a doubter and a struggler who started writing before he finished the composition and then glued corrections onto the page. In one place he did this nine times. When they peeled them, the last version proved identical to the first one.
Dijkstra (2001) Source: Denken als discipline, a program from Dutch public TV broadcaster VPRO from April 10th, 2001 about Dijkstra"

For the record, and not that you care, I am on team Beethoven. Being perfectionist and taking an "I will get it right in one shot" approach to thinking/designing makes me freeze. Separating concerns by dividing my thinking between a creative mode and criticizing mode works much better for me. (I talked about that in here, here, and here.)

Maybe my thinking is sloppy and I need crutches. But it is possible to argue that writing/prototyping is a tool--not a crutch-- for thinking, and that tools are invaluable capability multipliers. And on the power of writing as a tool, I will end with these quotes:
Writing is nature's way of letting you know how sloppy your thinking is. -- Guindon. 
If you think without writing, you only think you're thinking. -- Leslie Lamport 
Without writing, you are reduced to a finite automaton.
With writing you have the extraordinary power of a Turing machine. -- Manuel Blum

MAD questions

1) Hmm, a handicap that becomes an advantage. I think that was the entire theme of the "David and Goliath" book by Gladwell. Of course, that book got a largely negative critical response. But that doesn't mean it cannot be a useful model sometimes.

Are there scientifically proven studies of a seeming handicap becoming an advantage? What are some examples?

2) Yeah, about that whole Mozart versus Beethoven mode thing... Am I mistaken? Should I be more open-minded about the Mozart mode? What is doable by the Mozart mode that would be impossible to do with the Beethoven mode?

3) Well, this is not a question. This is self reflection. Way to go Murat! You started with Flash Boys, jumped to Steve Jobs's 1994 interview, and finished with comparing Mozart and Beethoven mode. A great display of "discipline in thought" indeed.

Thursday, February 1, 2018

Paper review. Blockchains from a distributed computing perspective

Our distributed systems seminar met the first time this Monday. I went through the syllabus, explained how we will run the seminar, and introduced the papers we will discuss. In the second hour, to give students an overview of the blockchains technology, I went through this paper: "Blockchains from a distributed computing perspective".

I liked this paper a lot. It is from a respected distributed systems expert, Maurice Herlihy. It is written to be accessible and expository. The paper gives several examples (with increasing sophistication) to explain the blockchain concepts concretely. Finally, the paper is a perfect fit with our seminar's theme of looking at blockchains from a distributed systems perspective. Herlihy states that the paper is "colored by the perspective that much of the blockchain world is a disguised, sometimes distorted, mirror-image of the distributed computing world."

For a data processing perspective on blockchains, see this other paper.

Simple ledger example

The paper first introduces a simple ledger system using Alice's online news service as an example. To record articles as they arrive, Alice created a ledger that is implemented as a simple linked list. When an article arrives, it is placed in a shared pool, and a set of dedicated threads, called miners, collectively run a repeated protocol, called consensus, to select which item to append to the ledger. And to query for an article, a thread scans the linked-list ledger.

Is this a too simple, too constrained way of implementing the system? The paper says that the log-based system architecture has two compelling advantages:

  1. it is universal; it can implement any type of data structure, no matter how complex
  2. all questions of concurrency and fault-tolerance are compartmentalized in the consensus protocol 

Indeed this log/ledger based architecture is very popular for modern cloud-native distributed systems. This writeup, by Jay Kreps, tells you how important is the log/ledger abstraction for building distributed systems.

Kafka and BookKeeper are very popular platforms that enables compartmentalizing the concurrency and fault-tolerance in the framework's consensus protocol---realized by ZooKeeper. (As a side note, the private blockchain, Hyperledger v1.0.0-rc1, adopts a no-Byzantine consensus protocol based on Kafka.)

Finally, the Tango paper (SOSP'13) showed a good example of universality of logs: it showed how to build replicated in-memory data structures, and reliable and available distributed transactions and applications, on top of a shared log architecture. I believe the ideas explored in Tango paper can be used for efficiently maintaining multiple streams in the same log so that reading the whole chain is not needed for materializing the updates at the nodes.


Going from a centralized log implementation to a fully-decentralized public blockchain implementation needs some motivation. After her online news business, Alice wants to go to restaurant business. Like any trendy social influencer, she does an initial certificate sale (redeemable for meals) for her restaurant for raising capital.

This is somewhat like using Kickstarter for the restaurant. (Centralized is not de facto bad. Why would you not trust KickStarter? The market and the laws can straighten Kickstarter up, if it plays crooked.) However, the initial coin offering (ICO) adds more functionality on top of Kickstarter: you can sell/trade fractions of your token. In other words, the ICO creates a whole market around kickstarting.

The paper introduces the Proof-of-Work (PoW) idea using this example. Most miners are probably honest, content to collect their fees, but there is still a threat that even a small number of dishonest miners might collude with one another to cheat Alice’s investors. Alice’s first idea is to have miners, identified by their IP addresses, vote via a Byzantine fault-tolerant consensus algorithm. Alice quickly realizes this is a bad idea. Alice has a nemesis, Sybil, who is skilled in the art of manufacturing fake IP addresses. So Alice employs PoW for the blockchain. Sybil’s talent for impersonation is useless to her if each of her sock puppet miners must buy an expensive, long-shot lottery ticket.

PoW is a form of costly signaling: it is expensive in terms of time wasted and electricity bills. As a famous example, Bitcoin uses PoW for consensus: only a miner which has successfully solved a computationally hard puzzle (finding the right nonce for the block header) can append to the blockchain.

This is a good point for a concrete example. Luckily, this demonstration of PoW-based blockchain is a perfect way of doing that. If you haven't watched/tried this, take 15 minutes now to do it. Someone should give an award to Anders.

PoW has several shortcomings, of course. It is computationally very expensive and extremely wasteful for resources around the globe. Its throughput is miserable and its transactions are slow. It is nondeterministic: "It is still possible that two blocks are appended at the same time, creating a fork in the blockchain. Which block should subsequent miners build on? The usual answer is to build on the block whose chain is longest." So Bitcoin consolidates this by only considering a block as confirmed after it is followed by a number of blocks (typically six blocks).


The paper introduces smartcontracts using cross-chain swap as an example. Suppose Alice wants to trade some of her coupons to Bob in return for some bitcoins. Alice’s coupons live on one blockchain, and Bob’s bitcoin live on another, so they need to devise an atomic cross-chain swap protocol to consummate their deal. Naturally, neither one trusts the other.

I am acutely aware that technology is not the cure for everything. But, for the usecases where you can go all-digital, I think smartcontracts will be a great tool. It is an executable contract, so you can avoid notaries, lawyers, courts, and cops, because the failure clauses will be executed automatically. I think the smartcontracts idea is here to stay even when the fully-decentralized PoW-based blockchain approach dies.

On the other hand, the smartcontracts are not devoid of challenges either. The paper talks about the DAO attack, and makes a great point about the importance of concurrency control for smartcontracts: "We have seen that the notion that smart contracts do not need a concurrency model because execution is single-threaded is a dangerous illusion."  In a post last week, I had presented a TLA+/PlusCal modeling of the DAO attack.

MAD questions

Blockchain PoW Consensus vulnerabilities

The consensus problem, which is also at the heart of the blockchain, has been studied for more than 40 years in distributed systems. The PoW based blockchain takes a radical approach to implement it in an approximated manner.

Now, I am not saying distributed systems have consensus solved at scale. There is no solution that scales to the big number of participants that may be desired as in public blockchains. I wrote about it this earlier: "I love that Blockchains brings new constraints and requirements to the consensus problem. In blockchains, the participants can now be Byzantine, motivated by financial gains. And it is not sufficient to limit the consensus participants to be 3 nodes or 5 nodes, which was enough for tolerating crashes and ensuring persistency of the data. In blockchains, for reasons of attestability and tolerating colluding groups of Byzantine participants, it is preferred to keep the participants at 100s. Thus the problem becomes: How do you design a byzantine tolerant consensus algorithm that scales to 100s or 1000s of participants?"

On the other hand, the research on consensus in distributed systems literature has established a rigorous perimeter around the safety and liveness/progress guarantees that can be provided. The FLP impossibility result shows that it is impossible guarantee progress under a full asynchronous model with a crash failure---even with reliable channels. Paxos approach to solving consensus compartmentalizes the safety and progress properties nicely. Even under a fully asynchronous model (where all reasonable timing expectations break), Paxos preserves safety thanks to its balloting and anchoring system. Paxos also provides progress as soon as the partial synchrony kicks in and weak-complete & eventually-weak-accurate failure detectors are implementable (i.e., when we are out of the realm of the FLP result).

Because of this compartmentalization, we are guaranteed that Paxos's safety guarantees is not vulnerable to a timing attack where malicious parties break all reasonable timing assumptions about the system. This is a hard thing to get right, and several protocols have been shown to be vulnerable to timing attacks because they depended on some perfectly reasonable time synchronization assumptions.

It seems like the Bitcoin protocol makes several reasonable timing assumptions, but when an attacker manages to violate those timing assumptions, the safety of the protocol can be violated as well. 

How will the Bitcoin bubble end?

I had done a Twitter poll on this a while ago. What other options do you think are plausible? (Of course I am not claiming that this poll means anything.)

I still have a lot to learn about blockchains and cryptocurrencies, and in particular Bitcoin. Something I don't yet understand is whether it is possible for the ecosystem to erode in a slippery slope, tragedy of the commons fashion, to a majority Byzantine behavior. Say if a slightly different version of the protocol starts cutting some corners, and since that turns out to be advantageous financially more miners start adopting it, and soon that becomes the majority behavior and give rise to a hack (via a trojan) or slaying of the goose that lays the golden eggs.

While I don't know much about the dirty implementation details of Bitcoin, this incident doesn't give me much confidence that Bitcoin is decentralized, immutable, unforgeable, and unhackable.

What is the best medium for smartcontracts?

Is a scripting language the best medium? It is easy to screw that up as the DAO attack and ERC20 standard examples in the paper show. Maybe using a more declarative approach, and employing formal specifications and invariant-based reasoning to writing contracts could prove to be more resistant to errors.

What about applying model checking to smartcontracts? Could that help reduce the risks?

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...