Wednesday, May 23, 2018

TUX2: Distributed Graph Computation for Machine Learning

The TUX2 paper appeared in NSDI 17 and was authored by Wencong Xiao, Beihang University and Microsoft Research; Jilong Xue, Peking University and Microsoft Research; Youshan Miao, Microsoft Research; Zhen Li, Beihang University and Microsoft Research; Cheng Chen and Ming Wu, Microsoft Research; Wei Li, Beihang University; Lidong Zhou, Microsoft Research.

TUX2 introduces some new concepts to graph processing engines to adapt them better for machine learning (ML) training jobs. Before I can talk about the contributions of TUX2, bear with me as I explain how current graph processing frameworks fall short for ML training.

Background and motivation

Graph processing engines often take a "think like a vertex" approach. A dominant computing model in this approach is the Gather-Apply-Scatter (GAS) model. You can brush up on graph processing engines by reading my reviews of Pregel and Facebook graph processing.

Modeling ML problems as bipartite graphs

Many ML problems can be modeled with graphs and attacked via iterative computation on the graph vertices. The Matrix Factorization (MF) algorithm, used in recommendation systems, can be modeled as a computation on a bipartite user-item graph where each vertex corresponds to a user or an item and each edge corresponds to a user's rating of an item.

A topic-modeling algorithm like Latent Dirichlet Allocation (LDA) can be modeled as a document-word graph. If a document contains a word, there is an edge between them; the data on that edge are the topics of the word in the document. Even logistic regression can be modeled as a sample-feature graph.
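
To make this concrete, here is a minimal sketch (my own toy code, not from the paper) of MF on a bipartite user-item graph: vertices hold latent-factor vectors, edges hold ratings, and an SGD step on an edge updates both endpoint vertices. The names, data, and factor dimension K are made up.

```python
from collections import defaultdict
import random

# Toy ratings: the edges of the bipartite user-item graph.
ratings = [("u1", "i1", 4.0), ("u1", "i2", 2.0), ("u2", "i1", 5.0)]

K = 8  # latent-factor dimension (a hypothetical choice)
user_factors = defaultdict(lambda: [random.random() for _ in range(K)])
item_factors = defaultdict(lambda: [random.random() for _ in range(K)])

# One SGD sweep over the edges: processing an edge updates both endpoints.
lr, reg = 0.01, 0.05
for user, item, rating in ratings:
    u, v = user_factors[user], item_factors[item]
    err = rating - sum(a * b for a, b in zip(u, v))
    for k in range(K):
        u[k], v[k] = (u[k] + lr * (err * v[k] - reg * u[k]),
                      v[k] + lr * (err * u[k] - reg * v[k]))
```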


Gaps in addressing ML via graph processing

1. The graphs that model ML problems are often bipartite, with heterogeneous vertices playing distinct roles (e.g., user vertices and item vertices). However, the standard graph model used in graph processing frameworks assumes a homogeneous set of vertices.

2. For ML computation, an iteration of a graph computation might involve multiple rounds of propagation between different types of vertices, rather than a simple series of GAS phases. The standard GAS model is unable to express such computation patterns efficiently.

3. Machine learning frameworks have been shown to benefit from the Stale Synchronous Parallel (SSP) model, a relaxed consistency model with bounded staleness that improves parallelism. Graph processing engines use the Bulk Synchronous Parallel (BSP) model by default.

TUX2 Design

To address the gaps identified above, TUX2
1. supports heterogeneity in the data model,
2. advocates a new graph model, MEGA (Mini-batch, Exchange, GlobalSync, and Apply), that allows flexible composition of stages, and
3. supports SSP in execution scheduling.

Next we discuss the basic design elements in TUX2, and how the above 3 capabilities are built on them.

The vertex-cut approach 

TUX2 uses the vertex-cut approach (introduced in PowerGraph), where the edge set of a high-degree vertex can be split into multiple partitions, each maintaining a replica of the vertex. One of these replicas is designated the master; it maintains the master version of the vertex's data. All the remaining replicas are called mirrors, and each maintains a local cached copy.
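
Here is a toy sketch of vertex-cut partitioning (my own illustration, not TUX2 code): edges are assigned to partitions, each vertex gets a replica on every partition its edges land on, and one replica per vertex is designated the master.

```python
from collections import defaultdict

edges = [("item1", "u1"), ("item1", "u2"), ("item1", "u3"), ("item2", "u1")]
NUM_PARTITIONS = 2

replicas = defaultdict(set)  # vertex -> partitions holding a replica of it
for edge in edges:
    p = hash(edge) % NUM_PARTITIONS  # place the edge in some partition
    for vertex in edge:
        replicas[vertex].add(p)      # each endpoint gets a replica there

# Designate one replica per vertex as the master; the rest are mirrors.
masters = {v: min(parts) for v, parts in replicas.items()}
mirrors = {v: parts - {masters[v]} for v, parts in replicas.items()}
```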

Vertex-cut is very useful for implementing the parameter-server model: the master versions of all vertices' data can be treated as the distributed global state stored in a parameter server. The mirrors are distributed to the workers, which also hold the second type of vertices and use the cached mirrors while iterating over those vertices.

Wait, the second type of vertices? Yes, here we harken back to the bipartite graph model. Recall that we had a bipartite graph with heterogeneous vertices, some of which have high degrees. The high-degree vertices become the master vertices held at the servers; the low-degree vertices carry the data/training for those master vertices, cache them as mirror vertices, and train on them. And, in some sense, a partition of the low-degree vertex type in the bipartite graph corresponds to a mini-batch.

The paper has the following to say on this. In a bipartite graph, TUX2 can enumerate all edges by scanning only vertices of one type. The choice of which type to enumerate sometimes has significant performance implications. Scanning the vertices with mirrors in a mini-batch tends to lead to a more efficient synchronization step, because these vertices are placed contiguously in an array. In contrast, if TUX2 scans vertices without mirrors in a mini-batch, the mirrors that get updated for the other vertex type during the scan will be scattered and thus more expensive to locate. TUX2 therefore allows users to specify which set of vertices to enumerate during the computation.

Each partition is managed by a process that logically plays both

  • a worker role, to enumerate vertices in the partition and propagate vertex data along edges, and 
  • a server role, to synchronize states between mirror vertices and their corresponding masters. 
Inside a process, TUX2 uses multiple threads for parallelization and assigns both the server and worker roles of a partition to the same thread. Each thread is then responsible for enumerating a subset of mirror vertices for local computation and maintaining the states of a subset of master vertices in the partition owned by the process.


Figure 3 illustrates how TUX2 organizes vertex data for a bipartite graph, using MF on a user-item graph as an example. Because user vertices generally have much smaller degrees, only item vertices are split by vertex-cut partitioning. Therefore, a master vertex array in the server role contains only item vertices, and the worker role manages only user vertices. This way, there are no mirror replicas of user vertices and no distributed synchronization is needed for them. In the worker role, the item-vertex mirrors and the user vertices are stored in two separate arrays.

In each partition, TUX2 maintains vertices and edges in separate arrays. Edges in the edge array are grouped by source vertex. Each vertex has an index giving the offset of its edge-set in the edge array. Each edge contains information such as the id of the partition containing the destination vertex and the index of that vertex in the corresponding vertex array. This graph data structure is optimized for traversal and outperforms vertex indexing using a lookup table. Figure 2 shows how data are partitioned, stored, and assigned to execution roles in TUX2.
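
The layout described here is essentially a compressed-sparse-row structure. Below is a tiny sketch of the idea (my own reconstruction, not TUX2's actual data structures):

```python
# Toy adjacency: src vertex -> [(dst_partition, dst_index), ...]
adjacency = {0: [(1, 5), (0, 2)],
             1: [(1, 7)],
             2: []}

vertex_offsets = []  # offset of each vertex's edge-set in the edge array
edge_array = []      # edges grouped by source vertex
for src in sorted(adjacency):
    vertex_offsets.append(len(edge_array))
    edge_array.extend(adjacency[src])
vertex_offsets.append(len(edge_array))  # sentinel for the last vertex

def edges_of(src):
    # Traversal is a contiguous slice -- no hash-table lookup needed.
    return edge_array[vertex_offsets[src]:vertex_offsets[src + 1]]

print(edges_of(0))  # [(1, 5), (0, 2)]
```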


Scheduling minibatches with SSP

TUX2 executes each iteration on a mini-batch of a specified size. Each worker first chooses a set of vertices or edges as the current mini-batch to execute on. After the execution on the mini-batch finishes, TUX2 acquires another set of vertices or edges for the next mini-batch, often by continuing to enumerate contiguous segments of vertex or edge arrays.

TUX2 supports SSP at mini-batch granularity. It tracks the progress of each mini-batch iteration to enable the computation of clocks. A worker considers clock t completed if the corresponding mini-batch is completed on all workers (including synchronizations between masters and mirrors) and if the resulting update has been applied to and reflected in the state. A worker can execute a task at clock t only if it knows that all clocks up to t−s−1 have completed, where s is the allowed slack.
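
In code, the slack condition might look like the following minimal sketch (my formulation of the rule above, not from the paper):

```python
def can_execute(t, slack, completed_clock_per_worker):
    # A worker may start clock t only if every worker has completed
    # all clocks up to t - slack - 1.
    min_completed = min(completed_clock_per_worker.values())
    return min_completed >= t - slack - 1

completed = {"w0": 4, "w1": 3, "w2": 3}
print(can_execute(5, slack=2, completed_clock_per_worker=completed))  # True
print(can_execute(7, slack=2, completed_clock_per_worker=completed))  # False
```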

The MEGA model 

TUX2 introduces a new stage-based MEGA model, where each stage is a computation on a set of vertices and their edges in a graph. Each stage has user-defined functions (UDF) to be applied on the vertices or edges accessed during it. MEGA defines four types of stage: Mini-batch, Exchange, GlobalSync, and Apply.

MEGA allows users to construct an arbitrary sequence of stages. Unlike GAS, which must be repeated in order (i.e., GAS-GAS-GAS), MEGA lets you flexibly mix and match stages (e.g., E-A-E-A-G). For example, in algorithms such as MF and LDA, processing an edge involves updating both vertices. This requires two GAS phases, but can be accomplished in one Exchange phase in MEGA. For LR, the vertex data propagations in both directions should be followed by an Apply phase, but no Scatter phases are necessary; the MEGA model can express this directly because it allows an arbitrary sequence of stages.

Below are examples of Matrix Factorization (MF) and Latent Dirichlet Allocation (LDA) programmed with the MEGA model. (LDA's stage sequence is the same as MF's.)
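
The paper's figure with these programs is not reproduced here, so the following is only a rough sketch of what a MEGA stage sequence for MF might look like. The stage names follow the paper, but the API shape, UDF signatures, and parameters are my invention.

```python
def exchange_udf(edge, user_vertex, item_vertex):
    # Process one rating edge and accumulate deltas on *both* endpoints --
    # something that would take two GAS phases in a Gather-Apply-Scatter engine.
    ...

def apply_udf(vertex):
    # Fold the accumulated deltas into the vertex data.
    ...

# MF as a freely composed sequence of MEGA stages.
mf_program = [
    ("Mini-batch", dict(size=1024, enumerate="item")),  # choose edges to work on
    ("Exchange",   exchange_udf),                       # E: update both endpoints
    ("Apply",      apply_udf),                          # A: apply the updates
    ("GlobalSync", None),                               # G: sync mirrors/masters
]
```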


Implementation and Evaluation

TUX2 is implemented in ~12,000 lines of C++ code. TUX2 takes graph data in a collection of text files as input. Each process picks a separate subset of those files and performs bipartite-graph-aware algorithms to partition the graph in a distributed way. Each partition is assigned to, and stored locally with, a process. Unfortunately the evaluations with TUX2 do not take into account graph partitioning time, which can be very high. 

The evaluations show that data layout matters greatly in the performance of ML algorithms. Figure 8 compares the performance of BlockPG, MF, and LDA with two different layouts: one an array-based graph data layout in TUX2 and the other a hash-table-based layout often used in parameter-server-based systems (but implemented in TUX2 for comparison). The y-axis is the average running time of one iteration for BlockPG, and of 10 iterations for MF and LDA, to show the numbers on a similar scale. These results show that the graph layout improves performance by up to 2.4× over the hash-table-based layout.


The paper also includes a comparison with Petuum, but the evaluations have several caveats. They do not compare convergence/execution time, and execution time per iteration does not always determine the convergence time. They do not take into account the graph partitioning time for TUX2. And finally, some comparisons used an early, unpatched version of Petuum's MF algorithm whose data placement issues were later resolved.

MAD questions

1. What is the net gain here?
I like this paper; it made me ask and ponder on many questions, which is good.

I don't think TUX2 pushes the state of the art in ML. ML processing frameworks are already very efficient and general with the iterative parameter-server computing model, and they are getting better and more fine-grained.

On the other hand, I think TUX2 is valuable because it shows how high-level graph computing frameworks can be adapted to implement the low-level parameter-server approach and address ML training problems more efficiently. This may provide some advantages for problems that are (or need to be) represented as graphs, such as performing ML training on SPARQL data stores.

Moreover, by using higher-level primitives, TUX2 provides some ease of programmability. I guess this may be leveraged further to achieve some plug-and-play programming of ML for certain classes of programs.

So I find this to be a conceptually very satisfying paper, as it bridges the graph processing model to the parameter-server model. I am less certain about the practicality part.


2. How do graph processing frameworks compare with dataflow frameworks?
There are big differences between dataflow frameworks and graph processing frameworks. In the dataflow model, there is a symbolic computation graph: the graph nodes represent mathematical operations, while the graph edges represent the data that flow between the operations. That is a very different model than the graph processing model here.

In MEGA, there are only four stages, where the Apply stage can take in user-defined functions. This is higher-level (and arguably more programming-friendly) than a dataflow framework such as TensorFlow, which has many hundreds of predefined operators as vertices.


3. How does TUX2 apply to Deep Learning (DL)?
The paper does not discuss whether and how TUX2 can apply to DL.

It may be possible to make DL fit the TUX2 model with some stretching. Deep neural network (DNN) layers (horizontally or vertically partitioned) could be the high-degree vertices held at the servers, and the images the low-degree vertices held in partitions.

But this would require treating the DNN partitions as a meta-vertex and scheduling executions for each sub-vertex in the meta-vertex in one cycle. I have no clue about how to make backpropagation work here though.

Moreover, each image may need to link to the entire NN, so the bipartite graph may collapse into a trivial one with trivial data-parallelism. It may be possible to distribute the convolutional layers. It may even be possible to insert early exits and train that way.

So, it may be possible but it is certainly not straightforward. I am not even touching the subject of the performance of such a system.

Monday, May 21, 2018

SoK Cryptocurrencies and the Bitcoin Lightning Network

We wrapped up the distributed systems seminar with two more papers discussed last month.

The "SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies" paper appeared in 2015. Although the cryptocurrency scene has seen a lot of action recently, this survey paper did not age and it is still a very good introduction to learning about the technical aspects and challenges  of cryptocurrencies. The paper starts with a technical overview of the cryptocurrency concept. Then it delves more into incentives and stability issues. It observes that it is unclear "how stability will be affected either in the end state of no mining rewards or in intermediate states as transaction fees become a non-negligible source of revenue". It talks about possible attacks, including Goldfinger attack and feather-forking, and also about stability of mining pools and the peer-to-peer layer. Finally it also covers some security and privacy issues.

"The Bitcoin Lightning Network: Scalable Off-Chain Instant Payments" whitepaper is from 2016. Bitcoin has a very high latency, poor scalability, and high transaction fees, and the lightning network provides an overlay solution to ease off these pains. To this end, the paper describes a secure off-chain solution for instant payments and microtransactions with low transaction fee.  This video by Jackson Palmer describes the lightning network ideas very nicely. (I strongly prefer reading to watching/listening, but recently I am finding a lot of useful content on YouTube.)

The protocol builds on the basic concept of a pairwise payment channel. Two parties put aside an initial amount of Bitcoin into a multisignature transaction. Subsequently, many updates to the allocation of the current balance can be made off-chain with just the cooperation of both parties, using a new timelocked transaction, without broadcasting it to the chain. This is analogous to buying a Starbucks card and transacting with Starbucks pairwise with that card. A broadcast to the chain can be done to redeem funds on the chain and to close the channel. If either party tries to cheat by broadcasting an old transaction state from the pairwise payment channel, the counterparty may take all the funds in the channel as a penalty, after providing the latest multisig-agreed state to the chain (within the on-chain dispute mediation window).
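
Here is a toy model of a pairwise channel (my own, heavily simplified sketch that ignores signatures, timelocks, and the dispute mechanism): each off-chain payment produces a new numbered state, and only the latest co-signed state should ever be broadcast.

```python
class PaymentChannel:
    def __init__(self, balance_a, balance_b):
        self.state_num = 0  # broadcasting an older state risks the penalty
        self.balances = {"A": balance_a, "B": balance_b}

    def pay(self, frm, to, amount):
        assert self.balances[frm] >= amount, "insufficient channel balance"
        self.balances[frm] -= amount
        self.balances[to] += amount
        self.state_num += 1  # both parties co-sign this new state off-chain

channel = PaymentChannel(balance_a=50_000, balance_b=50_000)
channel.pay("A", "B", 10_000)  # instant, off-chain
channel.pay("B", "A", 2_500)
# Closing the channel broadcasts the latest co-signed state to the chain.
```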

The lightning overlay network is then formed by multihop routing over these pairwise payment channels. Even when A and Z do not have a direct pairwise payment channel, it may be possible to construct a multihop route from A to Z using intermediaries, traversing through several pairwise payment channels. While pairwise channels could be without a fee, the multihop intermediaries need a small transaction fee to have an incentive to participate.

The pairwise payment channels need to be extended with hashlocks to achieve multihop transactions. Here is how this is done, as explained in the Bitcoin wiki (a small hash-preimage sketch follows the list):

  1. Alice opens a payment channel to Bob, and Bob opens a payment channel to Charlie.
  2. Alice wants to buy something from Charlie for 1,000 satoshis.
  3. Charlie generates a random number and generates its SHA256 hash. Charlie gives that hash to Alice.
  4. Alice uses her payment channel to Bob to pay him 1,000 satoshis, but she adds the hash Charlie gave her to the payment along with an extra condition: in order for Bob to claim the payment, he has to provide the data which was used to produce that hash.
  5. Bob uses his payment channel to Charlie to pay Charlie 1,000 satoshis, and Bob adds a copy of the same condition that Alice put on the payment she gave Bob.
  6. Charlie has the original data that was used to produce the hash (called a pre-image), so Charlie can use it to finalize his payment and fully receive the payment from Bob. By doing so, Charlie necessarily makes the pre-image available to Bob.
  7. Bob uses the pre-image to finalize his payment from Alice.
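
The hashlock in steps 3-6 is easy to demo (a minimal sketch of the idea, not actual Lightning code):

```python
import hashlib
import os

preimage = os.urandom(32)                        # Charlie's random number
hashlock = hashlib.sha256(preimage).hexdigest()  # the hash Charlie gives Alice

def claim(payment_hashlock, revealed_preimage):
    # Bob (and then Alice) verify a claim by hashing the revealed pre-image.
    return hashlib.sha256(revealed_preimage).hexdigest() == payment_hashlock

print(claim(hashlock, preimage))        # True: Charlie reveals, payment settles
print(claim(hashlock, os.urandom(32)))  # False: a wrong pre-image cannot claim
```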

In order for the multihop routes to work, there should be enough cooperative participants online, each with enough balance to pass the bucket from hand to hand. The rest is a path-finding exercise. With viable paths present, similar to a source-side routing decision, you can start the lightning transaction. There has been a lot of work on MANETs under the names of intermittently connected networks and delay-tolerant routing. I wonder if some of it can find application here (even though that work was mostly geometrical and this is a non-geometrical network). In practice though, the multihop network often converges to a hub-and-spokes model, with a well-connected fat-wallet intermediary.

The lightning network is also useful for transacting securely across different blockchains, as long as all edges in the route support the same hash function for the hashlock and can create timed locks.

MAD questions

1. How can we improve the way we run the seminar?
The students liked how we ran the seminars. They said they were more actively engaged and learned a great deal due to this format. But I think we can improve. It would be nice to get experts to video-conference and answer some questions. I think many experts would be generous enough to spare 20 minutes to answer some questions from smart, well-prepared students.

The students also mentioned that it would have been nice to have some hands-on projects to accompany the seminars. We started a blockchain channel so the interested students can figure out some small projects they can work on and collaborate.

2. What did I learn?
I didn't think much of blockchains but I am fascinated to learn that there are many challenging questions here and many good ideas/techniques. This is an area very suitable for doing distributed algorithms/protocols work, which I love. There is a need for developing more principled approaches and well-reasoned and verified algorithms/protocols.

I still think the best ideas from these works will get borrowed and used in more centralized (possibly hierarchical or federated) systems for the sake of efficiency/economy and scalability. That may be how these systems go mainstream to millions and even billions.

Saturday, May 19, 2018

Book Review -- Accidental Genius: Using Writing to Generate Your Best Ideas, Insight, and Content

I had read this short book a long time ago. It is a very helpful book for learning how to use writing for thinking --freewriting.

Motivation for freewriting

The mind is lazy. Thinking is hard, and our brains don't like hard. It recycles tired thoughts, and avoids unfamiliar and uncomfortable territory.

Freewriting prevents that from happening. Freewriting is a form of forced creativity.
Writing is nature's way of letting you know how sloppy your thinking is.  --Guindon 
If you think without writing, you only think you're thinking. --Lamport 
Freewriting helps to unclog the mind, reduce resistance to thinking and writing, bring clarity, provide perspective, improve creativity by causing a chain reaction of ideas, and articulate ideas better.

The premise of freewriting is simple: getting 100 ideas is easier than getting 1. "When you need an idea, don’t try for just one. When searching for one great idea, we demand perfection from it, depress ourselves, become desperate, and block. Go for lots of ideas. Keep your threshold low. One idea leads to the next."

The first half of the book gives the below six tactics for freewriting. (The second half elaborates more on these.)

1. Try Easy

A relaxed 90 percent is more efficient than a vein-bulging 100 percent effort. When you begin freewriting about a thorny subject, remind yourself to try easy.

2. Write Fast and Continuously 

Your best thought comes embedded in chunks of your worst thought. Write a lot. Think quantity. If you temporarily run out of things to say, babble onto the page. Write as quickly as you can type, and continue to generate words without stopping: Your mind will eventually give you its grade A unadulterated thoughts.

3. Work against a Time Limit 

The limit energizes your writing effort by giving you constraints. The Pomodoro method helps here.

4. Write the Way You Think 

Freewriting is a means of watching yourself think. Write the way you talk to yourself. Since you're writing for yourself, you don’t need to polish your raw thoughts to please others. All that matters is that you yourself understand your logic and references.

5. Go with the Thought 

That's the first rule of improv theater. Assume that a particular thought is true, and take a series of logical steps based on the thought. In other words, ask what-if questions.

6. Redirect Your Attention 

Explore random paths with focus changing questions: Why? Why not? How can I change that? How can I prove/disprove that? What am I missing here? What does this remind me of? What’s the best/worst-case scenario? Which strengths of mine can I apply? How would I describe this to my uncle? If I wanted to make a big mistake here, what would I do? What do I need that I don’t yet have?

MAD questions

1. What are other uses for freewriting?
Freewriting makes for good meditation. Many people use morning pages (freewriting in the morning) to clear logjams in their minds. I couldn't make this a habit, but when I did try this I got value out of it.

I occasionally used freewriting for thinking about a research problem. I write down some observations and then start speculating about hypotheses, all in freewriting form. It works. Again, I wish I could make this a habit and do it more often. Well, I guess blogging and the MAD questions help somewhat with this.

Freewriting is also good for planning. I should do more planning.

I guess it is time to schedule some freewriting in my week. 

2. Is it possible to use freewriting for writing?
Yes, you can use freewriting to write memos, articles, and stories. The trick is to edit ruthlessly after freewriting. That may be wasteful though. Freewriting helps you discover what you want to write. After that, doing an outline and writing from it could save time. So it may be best to combine a little bottom-up freewriting with a little top-down outline writing.

3. Did you try any freewriting marathons?
The book suggests that instead of 20-minute freewriting sessions, it is occasionally helpful to go for 4-5 hour freewriting marathons. I never tried that. It sounds torturous, but maybe it is worth a try out of curiosity: what would my depleted, unrestrained psyche spew out after an hour of typing its train of thoughts? Couldn't the mind get lazy in freewriting mode as well and start to recycle the same thoughts? But I guess the idea in the marathon is to force your mind to move past that.

4. How was your experience with freewriting?
Let me know in the comments or via email or on Twitter.

Wednesday, May 16, 2018

Paper summary. Decoupling the Control Plane from Program Control Flow for Flexibility and Performance in Cloud Computing

This paper appeared in Eurosys 2018 and is authored by Hang Qu, Omid Mashayekhi, Chinmayee Shah, and Philip Levis from Stanford University.

I liked the paper a lot; it is well written and presented. And I am getting lazy, so I use a lot of text from the paper in my summary below.

Problem motivation 

In data processing frameworks, improved parallelism is the holy grail because it can get more data processed in less time.

However, parallelism has a nemesis called the control plane. While "control plane" can have a wide array of meanings, in this paper it is defined as the systems and protocols for scheduling computations, load balancing, and recovering from failures.

A centralized control plane becomes a bottleneck after a point. The paper cites other papers and states that a typical cloud framework control plane that uses a fully centralized design can dispatch fewer than 10,000 tasks per second. Actually, that is not bad! However, with machine learning (ML) applications we are starting to push past that limit: we need to deploy, on thousands of machines, large jobs that consist of very many short tasks (~10 milliseconds) as part of iterations over data mini-batches.

To improve control plane scalability, you can distribute the control plane across worker nodes (as in Nimbus and Drizzle). But that also runs into scaling problems due to the synchronization needed between workers and the controller. Existing control planes are tightly integrated with the control flow of the programs, and this requires workers to block on communication with the controller at certain points in the program, such as when spawning new tasks or resolving data dependencies. When synchronization is involved, a distributed solution is not necessarily more scalable than a centralized one. But it is more complicated for sure: in one sense, that is the computing analog of the mythical man-month problem.

Another approach to improve scalability is to remove the control plane entirely. Some frameworks, such as TensorFlow, Naiad, and MPI frameworks, are scheduled once as a big job and manage their own execution after that. Well, of course the problem doesn't go away, but moves inside the framework: the scalability is then limited by the application logic written in these frameworks and the frameworks' support for concurrency control. Furthermore, these frameworks don't play nice with the datacenter/cloud computing environment either. Rebalancing load or migrating tasks requires killing and restarting a computation by generating a new execution plan and installing it on every node.

This paper proposes a control plane design that breaks the existing tradeoff between scalability and flexibility. It allows jobs to run extremely short tasks (<1 ms) on thousands of cores and to reschedule computations in milliseconds. And that has applications in particular for ML.

Making the Control Plane Asynchronous

To prevent synchronous operations, the proposed control plane cleanly divides responsibilities between controller and workers: a controller decides where to execute tasks and workers decide when to execute them.

The control plane's traffic is completely decoupled from the control flow of the program, so running a program faster does not increase load at the controller. When a job is stably running on a fixed set of workers, the asynchronous control plane exchanges only occasional heartbeat messages to monitor worker status.

The architecture

Datasets are divided into many partitions, which determines the available degree of parallelism. Datasets are mutable and can be updated in place. This corresponds nicely to parameters in ML applications.

The controller uses an abstraction called a partition map to control where tasks execute. The partition map describes which worker each data object should reside on. Because task recipes trigger tasks based on what data objects are locally present, controlling the placement of data objects allows the controller to implicitly decide where tasks execute. The partition map is pushed to the workers asynchronously, and when a worker receives an update to the map, it asynchronously applies any necessary changes by transferring data.

On the worker side, an abstraction called task recipes describes triggers for when to run a task by specifying a pattern matched against the task's input data. Using recipes, every worker spawns and executes tasks by examining the state of its local data objects, obviating the need to interact with the controller.

Task recipes

A task recipe specifies (1) a function to run, (2) which datasets the function reads and/or writes, and (3) preconditions that must be met for the function to run.


There are three types of preconditions to trigger a recipe:

  1. Last input writer: For each partition it reads or writes, the recipe specifies which recipe should have last written it. This enforces local write-read dependencies, so that a recipe always sees the correct version of its inputs.
  2. Output readers: For each partition it writes, the recipe specifies which recipes should have read it since the last write. This ensures that a partition is not overwritten until tasks have finished reading the old data.
  3. Read messages: The recipe specifies how many messages a recipe should read before it is ready to run. Unlike the other two preconditions, which specify local dependencies between tasks that run on the same worker, messages specify remote dependencies between tasks that can run on different workers.

Since incorrect preconditions can lead to extremely hard-to-debug computational errors, they are generated automatically from a sequential user program. A single recipe describes potentially many iterations of the same data-parallel computation.

Writers and readers are specified by their stage number, a global counter that every worker maintains. The counter counts the stages in their program order, and increments after the application determines which branch to take or whether to continue another loop. (Using the counters, I think it would be possible to implement the SSP method easily as well.) All workers follow an identical control flow, and so have a consistent mapping of stage numbers to recipes.
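
Putting the pieces together, a recipe and its readiness check might look like the following sketch (the field names and structure are my guesses based on the paper's description, not Canary's actual API):

```python
from dataclasses import dataclass

@dataclass
class TaskRecipe:
    function: callable    # the function to run
    last_writer: dict     # partition -> stage number of the required last writer
    output_readers: dict  # partition -> stage numbers that must have read it
    num_messages: int = 0 # remote messages needed before the recipe can run

def ready(recipe, last_writer_of, readers_of, messages_received):
    # All three precondition types described above must hold.
    writers_ok = all(last_writer_of[p] == s
                     for p, s in recipe.last_writer.items())
    readers_ok = all(set(stages) <= readers_of[p]
                     for p, stages in recipe.output_readers.items())
    return writers_ok and readers_ok and messages_received >= recipe.num_messages
```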

Exactly-once Execution and Asynchrony

Ensuring atomic migration requires a careful design of how preconditions are encoded as well as how data objects move between workers. No node in an asynchronous control plane has a global view of the execution state of a job, so workers manage atomic migration among themselves. To ensure that the task from a given stage executes exactly once and messages are delivered correctly, when workers transfer a data partition they include the access-history metadata relevant to preconditions: the last writer and how many recipes have read it.

Partition Map

A partition map is a table that specifies, for each partition, which worker stores that partition in memory. A partition map indirectly describes how a job should be distributed across workers, and is used as the mechanism for the controller to signal workers how to reschedule job execution.
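
As a sketch (my own illustration, not Canary code), the partition map is just a table, and a rebalance is an asynchronous update to it:

```python
# partition id -> worker that should store the partition in memory
partition_map = {0: "worker-a", 1: "worker-a", 2: "worker-b", 3: "worker-b"}

def apply_map_update(partition_map, updates, me, local_partitions):
    # Each worker applies updates asynchronously; if a partition it holds is
    # reassigned elsewhere, it transfers the data (plus the access-history
    # metadata needed for exactly-once execution) to the new owner.
    for partition, new_worker in updates.items():
        if partition_map[partition] == me and new_worker != me:
            local_partitions.pop(partition)  # ship this partition away
        partition_map[partition] = new_worker

apply_map_update(partition_map, {1: "worker-b"}, "worker-a", {0: ..., 1: ...})
```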


The controller does five things:
(1) Starts a job by installing the job's driver program and an initial partition map on workers.
(2) Periodically exchanges heartbeat messages with workers and collects workers' execution statistics, e.g. CPU utilization and CPU cycles spent computing on each partition.
(3) Uses the collected statistics to compute partition map updates during job execution.
(4) Pushes partition map updates to all workers.
(5) Periodically checkpoints jobs for failure recovery.

Maximizing data locality

To maximize data locality, the controller updates the partition map under the constraints that the input partitions to each possible task in a job are assigned to the same worker. The execution model of task recipes is intentionally designed to make the constraints explicit and achievable: if a stage reads or writes multiple datasets, a task in the stage only reads or writes the datasets' partitions that have the same index, so those partitions are constrained to be assigned to the same worker.

Implementation

The group designed Canary, an asynchronous control plane, which can execute over 100,000 tasks/second on each core, and can scale linearly with the number of cores.

The driver constructs the task recipes. A driver program specifies a sequential program order, but the runtime may reorder tasks as long as the observed result is the same as the program order (just as how processors reorder instructions).


Canary periodically checkpoints all the partitions of a job. The controller monitors whether workers fail using periodic heartbeat messages. If any worker running a job is down, the controller cleans up the job's execution on all workers, and reruns the job from the last checkpoint.

Checkpoint-based failure recovery rewinds the execution on every worker back to the last checkpoint when a failure happens, while lineage-based failure recovery as in Spark only needs to recompute lost partitions. But the cost of lineage-based failure recovery in CPU-intensive jobs outweighs the benefit, because it requires every partition to be copied before modifying it.

Evaluation results

Current synchronous control planes such as Spark execute 8,000 tasks per second; distributed ones such as Nimbus and Drizzle can execute 250,000 tasks/second. Canary, a framework with an asynchronous control plane, can execute over 100,000 tasks/second on each core, and this scales linearly with the number of cores. Experimental results on 1,152 cores show it schedules 120 million tasks per second. Jobs using an asynchronous control plane can run up to an order of magnitude faster than on prior systems. At the same time, the ability to split computations into huge numbers of tiny tasks without introducing substantial overhead allows an asynchronous control plane to efficiently balance load at runtime, achieving a 2-3× speedup over highly optimized MPI codes.

Evaluations are done with applications performing logistic regression, K-means clustering, and PageRank.

MAD Questions

1. Is it a good idea to make tasks/recipes dependent on (linked to) individual data objects? How do we know the data objects in advance? Why does the code need to refer to the objects? I think that model can work well if the data objects are parameters to be tuned in ML applications. We live in the age of the "big model". I guess graph processing applications can also fit well with this programming model. I think this can also fit well with any(?) dataflow framework application. Is it possible to make all analytics applications fit this model?

2. The Litz paper had similar ideas for doing finer-grain scheduling at the workers and obviating the need for synchronizing with the scheduler. Litz is a resource-elastic framework supporting high-performance execution of distributed ML optimizations. Litz remains general enough to accommodate most ML applications, but also provides an expressive programming model allowing the applications (1) to support stateful workers that can store the model parameters which are co-located with a partition of input data, and (2) to define custom task scheduling strategies satisfying fine-grain task dependency constraints and allowances. At runtime, Litz executes these strategies within the specified consistency requirements, while gracefully persisting and migrating application state.

3. Since it is desirable to have one tool for batch and serving, would it be possible to adopt Canary for serving? Could it be nimble enough?

4. Is it possible to apply techniques from the Blazes paper to improve how the driver constructs the task recipes?

Thursday, May 10, 2018

Misc rambling

Here, have some random stuff.

Confident idiots 

A couple weeks ago, I was in the YMCA sauna. There were 6 people, including 3 high school kids. One of the high school kids---who looked like a young Jonah Hill---is excited that he will be getting a car from his father and is telling his friends about it.

He asks his friends how often he needs to get the oil changed on the car. One of his friends authoritatively says: "Oh, they will tell you when you get the car. It is every 50 miles." The boy says this so confidently that I am trying hard not to chuckle.

So, Jonah is OK with that. They keep talking about the car. But then this occurs to Jonah. He says: "Once my mom drove us to Wisconsin and back and it is definitely more than 50 miles. It was like 500 miles. So the oil change distance has got to be much longer than that."

The boy who authoritatively suggested 50 miles for oil change says with the same confidence: "Yeah, that is it, you change the oil every 50 thousand miles. That is what it is!"

Why are the ignorant so confident? I know about the Lake Wobegon effect and such, but really, why...

Weird dream 

(This was January 12th, 2018.)

At night at 2am, I woke up buzzed from a weird dream. It was a combination of Ted Chiang-like predeterminism molded into the Bart Curlish narrative---the assassin-woman in Dirk Gently's Holistic Detective Agency---kind of deal.

So, I had a mission, someone else had a mission to kill me, and yet another person had a mission to protect me. That last person didn't know his mission, but predeterminism forced him to always jump between me and the bullets, and weirdly he was also unaffected by the bullets. He was absolutely terrified, shouting hysterically and trying to escape, but every time he attempted to run away, he ended up jumping between me and the assassin and involuntarily protecting me.

Slowly this guy realized that this was his mission, and even if he didn't want to do it, he would be doing it in a predetermined way. When he had this realization, he was very relieved, and went on shielding me from the bullets, this time peacefully. Around the same time, I also realized that I had my own mission, and that I will be indestructible despite all the attempts of the assassin. I woke up energized and buzzed.

Predetermination and a sense of purpose is such a powerful concept. If you truly genuinely believe you are predestined for a purpose, that might make you invincible.

Incidentally, my first post in this blog in 2007 was about a dream as well.

MAD questions

1. How should we deal with confident idiots?
The ignorance and overconfidence of that high school student was harmless, funny, and cute. Hopefully he will grow out of it.

But what if he doesn't? What if that overconfident-ignorance is then fueled by acceptance or even admiration? Confident idiots who wield power are not so fun and harmless anymore.

What can we do to intervene with this?

My mom told me as a child that it is impossible to change anyone's mind via arguing. I had taken this to heart, and refrained from arguing with people over their opinions. But lately I changed my position. I still know I can't change anyone's mind via arguing, but I am now more vocal about stating disagreement and pointing out to people what is wrong with their opinions and positions. I put my voice out there, hoping that it may stop people from getting more carried away with their opinions and echo chambers.

I recently listened to this podcast episode from Hidden Brain, and I think it elaborates these issues better.

2. If you knew you could not fail, what would you attempt? Do you think you have a true calling?

I know they say "follow your passion" is bad career advice, but shouldn't you try to align the two? Even better, figure out your true calling and excel at it. If you believe strongly that you are predetermined to achieve it, maybe you will. As Steve Jobs said: "Some things need to be believed to be seen."

3. Let's hope that the confident idiot does not believe he is predetermined to achieve a mission.

What if you are the confident idiot, rather than the determined visionary?
To strike a healthy balance, always doubt yourself and never doubt yourself.

4. Oh, and, blockchains and cryptocurrency. Discuss.

Tuesday, May 8, 2018

Transmitting your message over a lossy channel

Late at night, I was tweeting random stuff and then the below tweet came.
Since I worked on wireless sensor networks in a previous life, I had things to say here.

Here are the ways to deal with a lossy channel, and what those mean for your writing.

A straightforward and easy way to deal with a lossy channel is to add redundancy. Repeat the message a number of times, and one transmission will make it. In writing, this corresponds to: "Tell them what you are going to tell them, tell them, then tell them what you told them."

Another way to deal with a lossy channel is to increase the signal-to-noise ratio. This can be done via increasing transmission power and/or somehow reducing the noise. This translates to writing more clearly. Use definite, specific, concrete language. And be succinct and to the point. As you review your draft, ruthlessly cut out frivolous parts and digressions. Read the text as if you are an outsider, empathize with the reader, and try to reduce any misunderstandings this reader could have.
Kill your darlings, kill your darlings, even when it breaks your egocentric little scribbler’s heart, kill your darlings.
-- Stephen King
Another way to deal with a lossy channel is to send a preamble. In computer networks, a syncword, sync character, sync sequence, or preamble is used to synchronize a data transmission by indicating the start of the transmission. Sending preambles also achieves frame synchronization, permitting the data bits within the frame to be extracted for decoding or retransmission. In writing, this corresponds to giving advance warning and building up to where you are leading the reader.

Ok, this is pushing it a bit too much, but here it is. You can also use TDMA to deal with a lossy channel. The transmitter establishes a rhythm/cadence, and the receivers are able to receive the clear transmission at those predetermined times. In writing, this corresponds to using a parallel structure, so that the reader can follow your points more easily.

That's all folks.

MAD questions

1. Did you know that the receiver can also amplify its power to reduce transmission loss?

I wrote about this earlier.
Most people have an incorrect model of how radios and wireless radio communication work. The layman thinks radios are simple. The transmitter does all the heavy lifting, and puts the signal on the air. Before the signal fades away completely, if the receiver device is in communication range, it will pick up the signal by listening. This naive understanding suggests that the receiver is a passive device, as listening is perceived mostly as a passive activity. But, nothing can be further from the truth. The receiver spends considerable energy to power its radio and amplify the signal in order to receive it.

So, this thing is very symmetrical. If a text is hard to understand, you can power up your reading game to decipher it better. Similar techniques help here for dealing with the lossy transmission channel as well.

But this has a limitation. "When the receiver powers its radio to amplify signals, it is also amplifying noise. The signal-to-noise ratio (SNR) determines successful reception: if the signal has not faded, it will get amplified more, stand out, and be decoded correctly. This gives rise to an interesting question: When there is no signal in the channel, what does the receiver radio hear?" To learn the answer you should read that post.

2. Is that all you got? 5 lousy writing analogies from radio transmission?
No analogies to MIMO, RFID, or software-defined radio? My analogy game must be getting weak.

3. What are your analogies to writing?
Another analogy I often use for writing is that of showcasing a garden. You are a gardener, you have cultivated a great garden, and you are exhibiting it to visitors. How do you showcase your garden and lead them to the highlights? Initially, you will need to guide the visitors on which paths to take; without guidance they will take arbitrary paths and get lost. So, take care to prune frivolous paths lest they get distracted, and gently guide them toward the centerpieces and highlights of your garden. Give them a tour. Don't be too pushy either.

Related links

How I write
How to write your research paper
How I read

Sunday, May 6, 2018

Truth is Multidimensional

Nasreddin Hoca is a 13th century Turkish wise-man / populist-philosopher, known for his funny stories and anecdotes.

In one story, he was judging between two men over a conflict. He listened to the first man and told him he was right about the issue. Then he listened to the second man, also found him reasonable, and told him he was right about the issue. Hoca's wife was listening and intervened: How is that possible? How can both sides be right about the issue?

Hoca thought some more about this, and said: "Dear, you are right, too."

Murat Hoca

I often find myself in the same position. For example, most recently, I read Sinofsky's tweets pro blockchain, and Jackson Palmer's tweets rebuking those, and I find both sides right.


How can both sides in an argument be right?

This is possible when the issue is multidimensional.

Being a researcher

I had written this earlier:
"I am OK with being confused, and I am capable of holding ambivalent thoughts in my brain. These are minimal (necessary but not sufficient) requirements for being a researcher. I would rather have unanswered questions than unquestioned answers. (Aha, of course, that was a Feynman quote. "I would rather have questions that can't be answered than answers that can't be questioned." --Richard Feynman)"
When I am working on a new problem/area at the cutting edge, I find that I have to make several hypotheses/guesses and take positions. As a result I have to keep several ambiguous/conflicting positions in my mind for a long time. I let them simmer in my brain until the situation is resolved. Often, I eliminate some conflicting positions and settle on one that is more accurate. And sometimes this is resolved with the frame shifting (or paradigm shifting) and we go above the plane and find a position that accommodates these two conflicting positions.

Do these sound like Zen koans for the academics?

Ok, here is another one: I like being wrong and I also don't like being wrong.

I like being wrong because it gives me a learning opportunity.

I also don't like being wrong, so as soon as I learn I am wrong, I change my position so I am not wrong any more.

MAD questions

1. If you are mindful about it, you find that you are faced with a lot of false dichotomies every day. What are your favorite examples?

2. What have you changed your mind about recently?

Friday, May 4, 2018

Research and climbing


I first liked this. But then I grew uneasy and even disturbed by it... I have been in this business (the research business---not the climbing one) for 20 years, and I find that unfortunately reality is not that simple, idealistic, and egalitarian.

Climbing a mountain is hard and resource intensive

To climb a mountain, you need more than a vision of a feasible path.

Last year, I read a book titled "After the Summit" by Lei Wang. It gives life lessons learned from Lei's arduous journey climbing the highest peak on each of the seven continents, including Mount Everest, and skiing to the North and South Poles.

Here is what you need to climb a tough mountain. I think this is still an incomplete list but it is much more than "time and patience" and a "vision of a path to the peak".
  1. Equipment... very expensive high-quality equipment... good tools are important
  2. Funding (Just Google "What do you need to climb Everest?" The first link says an Everest climb starts from $35K.)
  3. Lots of training to get very fit
  4. Lots of lessons to learn climbing technique
  5. Training with personal coaches
  6. Guidance from experts
  7. A very capable team
  8. A base camp, where you spend weeks to acclimate
  9. Experienced sherpas
  10. Experience of having climbed similar mountains
  11. Dedication, passion, perseverance
  12. Managing your mental state

Here are the not so glorious parts about climbing a mountain. You sacrifice a lot. You get your ass kicked by nature; you get your knuckles frozen. You need to learn to walk with spiked boots, and to build ladder bridges to walk over wide and deep crevasses. You learn to sleep out in the cold and eat shitty food. You learn discipline and safety processes. You make discomfort your companion. You learn to tolerate misery.

None of this sounds enticing. But probably the worst part is how boring 99% of the climb is. You are at the summit for only 15 minutes. You have to find a way to love the journey.


Just in case this is still unclear, what I wrote for climbing above maps to research one to one, and the mapping is left as an exercise to the reader.

One thing that is not in the climbing task-list is the need for documentation. Maybe documentation is also important for climbers---I don't know, but documentation/writing/publishing is a vital part of research. You need to show the path to others, so that they can see the result, follow-along, and hopefully even improve on it.

MAD questions

1. Man, why am I always so critical?

I think the climbing analogy is actually a good one for research, but I grew unhappy about the emphasis on a very small part of the climbing: "looking for a path that nobody has noticed before". Being the cynic I am, I just had to write this long piece about it. I think I have severe professional deformation: I can't stop finding problems with things.

2. Ok, what about Math theorems?

For climbing steep cliffs in theoretical math, is time, patience, and a vision of a path to the peak enough?

The theoretical math domain is less resource intensive. But to make an important contribution you still need years of training and years of thinking about the problem in your head. Maybe take items 1 and 2 off my list above and compensate for them with increased effort on the remaining items. It seems that mathematicians need to develop and master a lot of mental tools indeed:
Here is the Wikipedia article on Wiles' solution of Fermat's last theorem.
Here is a New Yorker article on Yitang Zhang's work on bounded gaps between primes.
Here is the Wikipedia article on Grigori Perelman's proof of the soul conjecture.

3. What are some other analogies you know for the research process?

I like analogies. I wrote this about analogies earlier:
It is not hard to make analogies, because all analogies are to some extent inaccurate. When Paul Graham made the analogy about hackers and painters, many people criticized and said that there are many other hackers and vocation analogies. But analogies, while inaccurate, can also be useful. Analogies can give you a new perspective, and can help you translate things from one domain to the other to check if it is possible to learn something from that.
I had given a research-and-sudoku analogy before. I still think that is a good analogy. I will leave it to someone else to analyze/criticize that one.

If you have more analogies, please share them with me in the comments.

4. What are some books/articles that got this right?

I recently wrote about the Crypto book. I think it reflected the ups and downs and valleys of death for the discovery process well. I also liked Einstein's biography by Walter Isaacson; it provided a good account of Einstein's work habits and struggles. Although not a book on research, the "War of Art" book by Steven Pressfield did a good job describing the struggles and challenges of creative work. I guess the Dip book by Seth Godin also has relevance.

But I want a book that delves more into the most boring parts of this process. I had heard about (HT @tedherman) biologists spending years perfecting experiments on rats to make sure they are properly designed and free from uncontrolled factors (do check page 35, wow!). I want to read a long book about the boring, laborious parts of research.

5. Should we be more cynical?

I said you need a lot of resources to climb a new mountain, but I left it at that. I didn't elaborate on the implications of this for the way research is done.

A couple years ago I read the book "Big Science: Ernest Lawrence and the Invention that Launched the Military-Industrial Complex". Since the 1930s, the scale of scientific endeavor has grown exponentially. Increasingly, we need big teams, big equipment, and big funding for inventions. As the problems get trickier and more specialized, it is inevitable that research gets more resource intensive.

Is it a losing battle to try to keep things more egalitarian, to keep research more democratized? Yes, research is hard, but if one is very dedicated, works boring hours, does boring work to train, and learns the techniques, is it at all possible for her to climb the really tall mountains?

I think the answer is yes.

When we find that we need increasingly expensive/specialized resources to enter an area, this is an indication that the area has incurred a lot of technical debt. Fortunately, there is a catch-up process, where the equipment becomes more affordable (e.g., cloud computing, CRISPR, 3D printers), and the esoteric techniques are simplified/explained and tooling is provided (e.g., machine learning).

So I am hopeful that it is possible to keep research more democratized. It is also a duty for us to strive for this; otherwise, it is an indication that we are incurring technical debt in our research field. Researchers are not fond of doing this kind of janitorial work and cleaning up, but that is the only way to move forward in a sustainable manner.

Another great equalizer is the emergence of new problems and domains. Innovation begets innovation. As we discover things, we find new terrains opening up. And a new terrain is a good opportunity to make an impact without needing immense resources. This is what I had written earlier:
We are all equally stupid. Our brains consist of 1.3 kg (3pounds) of gooey material. It is hard to hold 10 variables in our brains simultaneously. Our brains are very poor at reasoning about even very short concurrent programs (I know from first-hand experience :-). Our brains are also very susceptible to biases and fallacies.
The people you see as experts/geniuses are that way because they have been working/thinking/internalizing these topics for more than a decade. Those experts become baffled if you ask them a question a little bit outside of the frame they are accustomed to.
So here is the trick, "you can level the playing field by working on new things/technologies".

Saturday, April 28, 2018

Book review. How to write a lot: A practical guide to productive academic writing

This book is by Paul J. Silvia, a psychology professor. It provides a great prescription for enabling academics to write more. And let's be frank: we all should be writing more.

I give this little book to my graduate students so they get a healthy perspective on writing, an integral part of a researcher's job description, before they develop bad habits and PTSD from bad experiences with writing.

The book prescribes simple, straightforward solutions to writing woes. It has a no-BS, let's-get-to-business attitude. It gives some practical writing style advice as well, but if you need more help developing a good writing/composition style, you should also look into other books, such as "The Elements of Style" and "On Writing Well".

Here are some selected pieces from the book. I recommend the book to all struggling academic writers.

On finding time to write (page 12)

You have a teaching schedule, and you never miss it. If you think that writing time is lurking somewhere, hidden deep within your weekly schedule, you will never write a lot. If you think that you won't be able to write until a big block of time arrives, such as spring break or the summer months, then you'll never write a lot. Finding time is a destructive way of thinking about writing. Never say this again. Instead of finding time to write, allot time to write. Prolific writers make a schedule and stick to it. It's that simple.

Scheduling is the way (page 17)

Perhaps you're surprised by the notion of scheduling. "Is that really the trick?" You ask. "Isn't there another way to write a lot?" Nope---making a schedule and sticking to it is the only way. There is no other way to write a lot.

The best kind of self control (page 22)

My wife has fast Internet access in her home office, but I don't have anything. Writing time is for writing, not for checking email, reading the news, or browsing the latest issues of journals. ... The best kind of self-control is to avoid situations that require self-control.

Writing breeds good ideas for writing (page 23)

If you believe that you should write only when you feel like writing, ask yourself some simple questions: How has this strategy worked so far? Are you happy with how much you write? Do you feel stressed about finding time to write or about completing half-finished projects? Do you sacrifice your evenings and weekends for writing?

It's easy to demolish this specious barrier: Research has shown that waiting for inspiration doesn't work. Boice (1990) conducted a study with profound implications for every binge writer who waits for inspiration. He gathered a sample of college professors who struggled with writing, and he randomly assigned them to use different writing strategies. People in an abstinence condition were forbidden from all nonemergency writing; people in a spontaneous condition scheduled 50 writing sessions but wrote only when they felt inspired; and people in a contingency management condition scheduled 50 writing sessions and were forced to write during each session. The dependent variables were the number of pages written per day and the number of creative ideas per day.

This is what Boice found. First, people in the contingency management condition wrote a lot: They wrote 3.5 times as many pages as people in the spontaneous condition and 16 times as much as those in the abstinence condition. People who wrote "when they felt like it" were barely more productive than people told not to write at all---inspiration is overrated. Second, forcing people to write enhanced their creative ideas for writing. The typical number of days between creative ideas was merely 1 day for people who were forced to write; it was 2 days for people in the spontaneous condition and 5 days for people in the abstinence condition. Writing breeds good ideas for writing.

Make a list of writing projects (page 47)

After you've committed to a writing schedule, you need to make a list of your project goals and write them down. When you sit down to write, spend a minute thinking about what you want to do that day. Setting priorities among your project goals will take the stress out of managing several projects at once. And monitoring your writing will keep you focused on your goals, motivate you not to miss a day, inform you about how well you're doing, and give you hard facts that you can show to your binge-writing colleagues who are doubters and unbelievers. Anyone who combines the tips in this chapter with a regular schedule will write a lot.

Write first, revise later (page 75)

Generating text and revising text are distinct parts of writing---don't do both at once. The goal of text generation is to throw confused, wide-eyed words on a page; the goal of text revision is to scrub the words clean so that they sound nice and make sense. Some writers---invariably struggling writers---try to write a pristine first draft, one free of flaws and infelicities. The quest for the perfect first draft is misguided. Writing this way is just too stressful: These writers compose a sentence; worry about it for 5 minutes; delete it; write it again; change a few words; and then, exasperated, move on to the next sentence. Perfectionism is paralyzing.

Furthermore, writing sentence by sentence makes your text sound disjointed. The paragraph, not the sentence, is the basic unit of writing.

Tuesday, April 24, 2018

Paper summary. Step by Step Towards Creating a Safe Smart Contract: Lessons and Insights from a Cryptocurrency Lab

This paper, on our seminar reading list, was our first real intro to smart contracts, so we liked that the paper was written to be very accessible.


A smart contract is a program that is executed by all miners, with its outputs incorporated into the blockchain. A contract consists of program code, a storage file, and an account balance. The program code of a contract is fixed when the contract is created and cannot be changed.

The contract's code is executed whenever it receives a message from a user or from another contract. While executing its code, the contract may read from or write to its storage file. The contract's storage file is stored on the public blockchain. A contract can also receive money into its account balance, and send money to other contracts or users. A contract's entire state is visible to the public.

Ethereum uses the concept of "gas" (bought with currency) to discourage over-consumption of resources. During the execution of a transaction, every program instruction consumes some amount of gas. If the gas runs out before the transaction reaches an ordinary stopping point, it is treated as an exception: the state is reverted as though the transaction had no effect, but the Ether used to purchase the gas is not refunded!
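To make the gas semantics concrete, here is a toy Python sketch of metered execution (my own illustration, not Ethereum's actual VM; the instruction representation is made up): state changes are reverted on out-of-gas, but the fee for the purchased gas is kept.

class OutOfGas(Exception):
    pass

def execute_transaction(state, instructions, gas_limit, gas_price):
    # Toy gas-metered execution. `instructions` is a list of
    # (apply_fn, gas_cost) pairs; each apply_fn mutates `state` in place.
    snapshot = dict(state)          # checkpoint for a possible revert
    gas_used = 0
    try:
        for apply_fn, gas_cost in instructions:
            gas_used += gas_cost    # every instruction consumes some gas
            if gas_used > gas_limit:
                raise OutOfGas
            apply_fn(state)
    except OutOfGas:
        state.clear()
        state.update(snapshot)      # as though the transaction never ran...
    return min(gas_used, gas_limit) * gas_price  # ...but the fee is kept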


This program, in the Ethereum Serpent language (Ethereum now uses Solidity as its programming language), implements a simple financial "swap" instrument. The contract allows two parties, Alice and Bob, to take opposing bets about the price of a stock at some future time. Both parties initially deposit equal amounts of money (as units of the Ether currency). After a deadline has passed, the current price of the stock is queried by interacting with a designated stock price authority (implemented as a smart contract, StockPriceAuthority). Depending on the price at that time, the entire combined deposit is awarded to either Alice or Bob.

On lines 1 and 2, the contract's storage allocates space for 1) the public keys of Alice and Bob, and 2) the deadline and threshold of the swap contract. The contract also defines a function, determine_outcome, which any party may invoke.
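Since I am not reproducing the Serpent listing here, the following Python sketch models the swap logic as described above. All the names (deposit, determine_outcome, price_oracle) and the direction of the bet are my own placeholder assumptions, not the paper's code.

import time

class SwapContract:
    # Toy model of the swap: Alice and Bob deposit equal amounts; after
    # the deadline, the stock price decides who gets the combined deposit.
    def __init__(self, alice, bob, deadline, threshold, price_oracle):
        self.alice, self.bob = alice, bob
        self.deadline = deadline
        self.threshold = threshold
        self.price_oracle = price_oracle  # stands in for StockPriceAuthority
        self.deposits = {}

    def deposit(self, party, amount):
        assert party in (self.alice, self.bob)
        self.deposits[party] = self.deposits.get(party, 0) + amount

    def determine_outcome(self):          # any party may invoke this
        assert time.time() >= self.deadline, "deadline has not passed yet"
        assert self.deposits.get(self.alice) == self.deposits.get(self.bob)
        price = self.price_oracle()       # query the designated authority
        # Assumption: at/above the threshold pays Alice, below pays Bob.
        winner = self.alice if price >= self.threshold else self.bob
        payout, self.deposits = sum(self.deposits.values()), {}
        return winner, payout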

Pitfalls of Smart Contract Programming

Section 4 describes prominent errors made when coding smart contracts. It uses a "rock, paper, scissors" playing smart contract and shows a surprising number of things that can go wrong in a problem as simple as that.

In Section 4.1, concurrency-caused errors take the lead. As the paper "Blockchains from a distributed computing perspective" argued, smart contracts have a surprising amount of concurrency in them, even in so-called chain-serialized transactions. That paper put it this way: "We have seen that the notion that smart contracts do not need a concurrency model because execution is single-threaded is a dangerous illusion."

In the "rock, papers, scissors" contract, what happens to a third player that attempts to join and sends money to the contract? That money becomes inaccessible to anyone.

Another thing that was surprising to me is that when you are writing a smart contract, you need to take incentives into account as another constraint in the problem. In the "rock, paper, scissors" contract:
For example, one party can wait for the other to open its commitment. Upon seeing that he will lose, that party may elect to abort (after all, revealing its committed input costs gas). But this denies payment to the other player as well. The solution is to use an additional security deposit and add a deadline after which not revealing implies losing.
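Here is a rough Python sketch of that fix (again my own illustration, not the paper's contract code): a hash commitment, a security deposit, and a deadline after which not revealing forfeits the deposit.

import hashlib, time

class CommitReveal:
    # Toy commit-reveal round: each player posts hash(move|nonce) plus a
    # security deposit; failing to reveal by the deadline forfeits it.
    def __init__(self, deadline, deposit_amount):
        self.deadline = deadline
        self.deposit_amount = deposit_amount
        self.commitments = {}  # player -> commitment digest
        self.reveals = {}      # player -> (move, nonce)

    def commit(self, player, digest, deposit):
        assert deposit == self.deposit_amount, "must post a security deposit"
        self.commitments[player] = digest

    def reveal(self, player, move, nonce):
        digest = hashlib.sha256(f"{move}|{nonce}".encode()).hexdigest()
        assert digest == self.commitments[player], "reveal does not match"
        self.reveals[player] = (move, nonce)

    def settle(self):
        assert time.time() >= self.deadline
        # Non-revealers lose by default; their deposits go to the revealers.
        forfeited = set(self.commitments) - set(self.reveals)
        return forfeited, len(forfeited) * self.deposit_amount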

Step 1: You have a problem
Step 2: You introduce a blockchain to solve the problem
Step 3: You now have an incentives problem, a trust distribution problem, a community management problem, a regulation problem, a speculation problem, an upgrade path problem.

Of course there are additional problems due to cryptography, such as how to keep the committed inputs secret. Finally, Section 4.4 in the paper discusses Ethereum-specific bugs. I am not sure how many of these still apply after 3 years, in 2018.

MAD questions

1. Every miner needing to store all contracts' storage files may limit scalability, right? Is there a way to provide a sharding service for smart contracts as well? Is it possible to apply the techniques introduced in Aspen for this?

2. Every miner needing to execute the smart contract for validation is also very wasteful. How can that be relaxed? Is it possible to devise a way where the output can be checked quickly without having to execute the entire code?


PS: Relevant to this paper, here is the DAO smart contract attack modeling I wrote about earlier.

Friday, April 20, 2018

Book review. Crypto: How the code rebels beat the government---saving privacy in the digital age.

In graduate school, I had read "Hackers: Heroes of the Computer Revolution" by Steven Levy and enjoyed it a lot. (I still keep the dog-eared paper copy with affection.) So I should have read Steven Levy's Crypto book a long time ago. But for some reason, I didn't... even though I was aware of the book. I guess that was due to a stupid quirk of mine; I had some aversion to security/cryptography research. I don't know why. Maybe it was because I had sat through a couple of bad security/cryptography talks (a similar aversion happened to me after a bad networking course). Another reason, I regret to admit, may be that I had some distributed systems snobbery going on at that time. I was so into the distributed systems/algorithms area that I was quick to label AI, security, and this and that as uninteresting or useless *to me*. I wish I had been more open-minded. I am sure reading this book then would have changed my outlook toward security and cryptography.

(Side Remark: The lesson, kids, is to always keep an open mind. Being a snob is stupid. I have been seeing snobbery against blockchain work among some systems researchers, and that is wrong. I have criticized some parts of blockchain and fully-decentralized approaches many times on this blog, but I know better than being a snob. My brain is open, I am reading about it, and when I find something interesting and suitable for my skillset, I will be happy to work more on it and contribute.)

Coming back to the book, I recommend this book very highly. The book skillfully combines very personal stories of the researchers involved in crypto work with simple explanations of the important technical materials. Levy sure knows how to tell stories.

Now to be hypercritical and to nitpick, I thought the writing felt somewhat rushed in some places. I thought the writing in the Hackers book was more skilled and better refined. My guess is Levy had to rush this to publication. There were occasional ambiguous sentences, which a careful editor and proofreader would have caught.

Some selected parts from the book

I am not going to do a proper book review. Instead I find it more fun to include tidbits from the book without providing any context.

Pages 21&24: Whit Diffie is a careful reader

By now Diffie had finally gotten around to reading David Kahn's The Codebreakers. Since Diffie was a very slow, methodical reader, tackling a book of a thousand densely packed pages was a major undertaking for him. "He traveled everywhere with that book in hand," says his friend Harriet Fell. "If you invited him to dinner, he'd come with The Codebreakers." But Diffie found the hundreds of hours he spent on the book to be well worth the trouble.
...
"I read it more carefully than anyone had ever read it. Kahn's book to me is like the Vedas," he explains, citing the centuries-old Indian text. "There's an expression I learned: 'If a man loses his cow, he looks for it in the Vedas.' "
...
Why had Diffie's once-intermittent interest become such a consuming passion? Behind every great cryptographer, it seems, there is a driving pathology. ... "I had been looking all my life for some great mystery. ... I think somewhere deep in my mind is the notion that if I could learn just the right thing, I would be saved."

Pages 31-33. Diffie meets Hellman

"It was a meeting of the minds," says Hellman.
...
The half-hour meeting went on for an hour, two hours, longer. Hellman simply didn't want it to end, and Diffie, too, seemed eager to continue as long as possible. Hellman had promised his wife he'd be home by late afternoon to watch their two small children while she went off, so finally he asked Diffie back to his house. No problem! Diffie called Mary and she came over to have dinner with Whit and all the Hellmans, and it wasn't until 11:00 or so that night that the dialogue broke up.

Page 67. Diffie's existential crisis

Mary Fischer recalls the lowest point. One day she walked into the McCarthys' bedroom and found Diffie with his head in his hands, weeping. "I asked him what was wrong," she says, "and he told me he was never going to amount to anything, that I should find someone else, that he was --and I remember this exact term-- a broken-down old researcher."

Page 69. Diffie's breakthrough

That spring, Diffie had settled into a routine at the McCarthy house. Every morning he would make breakfast for Mary and Sarah, McCarthy's fourteen-year-old daughter. Then Mary would go off to work, Sarah would go off to school, and Diffie would stay home. One day in May 1975, he spent the morning hours thinking. After a lunch break, he returned to his mental work. For the umpteenth time, he had been thinking about the problem of establishing a secure log-in password on a computer network. Again, there was that old problem of having to trust the administrator with the secret password. How could you shut that third party out of the scheme entirely? Sometime in the afternoon, things suddenly became clear to Diffie: devise a system that could not only provide everything in Diffie's recently envisioned one-way authentication scheme but could also deliver encryption and decryption in a novel manner. It would solve the untrustworthy administrator problem, and much, much more.
He would split the key.

Pages 77-78. Merkle the Berkeley undergraduate student

Instead, for reasons that remain unclear but are probably related to Merkle's unconventional mind, he fixated on what struck him as a weird, somewhat challenging aspect of a more basic dilemma. The essential cryptographic scenario assumed that the channel of communication was vulnerable. ... But what measures could you exploit if you wanted to communicate with someone who wasn't already in possession of a pre-arranged, secure symmetrical key?
...
Merkle, unpolluted with knowledge about theory or history of crypto, was unaware of the apparent impossibility of his mission. He simply tried to solve the problem.

Pages 90-100. Enter Ron Rivest and the RSA algorithm

Even so, "New Directions in Cryptography" [Diffie-Hellman 1976 paper] turned out to be more than interesting to Rivest: it thrilled him. Ultimately, it changed his life.
The paper appealed to Rivest's heart as well as his head. Rivest was a theoretician, but one for whom simple abstractions were not enough. The ideal for him was actually putting the ethereal mechanics of math to work, of making a tangible difference in the world of flesh and dirt.

On April 3, 1977, a graduate student named Anni Bruce held a Passover seder at her home. Rivest was there, and Shamir, and Adleman. For several hours, ideas of mathematical formulas and factoring were put aside for a recapitulation of the escape of the Jewish people from Egypt. As is customary with seders, people downed a lot of wine. It was nearly midnight when Rivest and his wife returned home. While Gail Rivest got ready for bed, Ron stretched out on the couch and began thinking about the problem that had consumed him and his colleagues for months. He would often do that---lie flat on the sofa with his eyes closed, as if he were deep in sleep. Sometimes he'd sit up and flip through the pages of a book, not really looking, but reworking the numbers. He had a computer terminal at home, but that night he left it off. "I was just thinking," he says.

That was when it came to him---the cognitive lightning bolt known as the Eureka Moment. He had a scheme! It was similar to some of their more recent attempts in that it used number theory and factoring. But this was simpler, more elegant. Warning himself not to get overexcited---Shamir and Adleman, after all, had broken many of his previous proposals---he jotted down some notes. He did allow himself the luxury of saying to his wife that he'd come up with an idea that just might work. He doesn't remember phoning the guys that night. Adleman, though, insists that he received a call sometime after midnight.

...
Rivest insisted that it was a joint project, that Shamir's and Adleman's contributions were crucial, that the scheme was the final point in an evolutionary process. To Rivest, it was as if the three of them had been in a boat together, all taking turns rowing and navigating in search of a new land. Rivest might have stepped out of the boat first, but they all deserved credit for the discovery.

Where is the sequel?

OK, I am tired of typing. The book is 300 pages, and I only wrote up a few of the most interesting parts from the first 100 pages. Please go read the book; I think you will like it.

I hope Steven Levy writes a followup book on Bitcoin and blockchain. The book stops around 2000 after discussing David Chaum's Digicash. So cryptocurrencies and blockchain would be a natural sequel to this one. 

MAD questions


Both Diffie's and Rivest's breakthroughs came after many months of intense work and thinking. But after the breakthrough insight, the schemes become easy to derive and explain. That is sort of like a one-way trapdoor function, isn't it? A trapdoor function is a function that is easy to compute in one direction, yet difficult to compute in the opposite direction (finding its inverse) without special information, called the "trapdoor". Trapdoor functions are widely used in cryptography.
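Here is a toy RSA-style illustration in Python, with deliberately tiny and completely insecure numbers: the forward direction is easy for anyone, but inverting it is easy only with the trapdoor, namely the factorization of n.

# Toy trapdoor function (requires Python 3.8+ for pow(e, -1, phi)).
p, q = 61, 53            # the secret primes are the trapdoor
n = p * q                # 3233, the public modulus
e = 17                   # public exponent
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)      # private exponent, derivable only via p and q

m = 1234                 # a message
c = pow(m, e, n)         # forward direction: easy for anyone
assert pow(c, d, n) == m # inverting: easy only with the trapdoor d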

I don't think chance was involved in these discoveries. It seemed like those ideas were up in the air, vaguely hovering, and they had to go through a laborious condensation period before they materialized.

But here is an interesting page on the role of chance/serendipity in scientific discoveries.  Psychologist Kevin Dunbar and colleagues estimate that between 30% and 50% of all scientific discoveries are accidental in some sense.

Here is another interesting page on multiple discoveries / simultaneous invention. Merton believed that it is multiple discoveries, rather than unique ones, that represent the common pattern in science.

Thursday, April 12, 2018

Notes from USENIX NSDI Days 2 & 3

Before talking about Days 2 & 3, here are my notes from NSDI day 1 if you are interested.

Again, all of the papers mentioned (and omitted) below are publicly accessible at the NSDI web page. Enjoy!

NSDI day 2 

Day 2 had the following sessions:

  • web and video
  • performance isolation and scaling
  • congestion control
  • cloud

Here are the papers I took detailed notes on. I ordered them by how interesting I found them.

Towards Battery-Free HD Video Streaming

This paper is by Saman Naderiparizi, Mehrdad Hessar, Vamsi Talla, Shyamnath Gollakota, and Joshua R Smith, University of Washington.

This work was mind-blowing for me. This is a strong group that developed the WISP motes before, but even so, the battery-free video-streaming sensors presented were very impressive, and the talk included a live demo of them.

So here is the deal. The goal of this work is to design sticker-form-factor, battery-free cameras. But that is crazy, right? Video streaming is power hungry; how can you go battery-free?

The paper looks closely at the costs of the components of a video-streaming camera. It turns out the image sensor, at 85 microwatts, is very low power. The radio, on the other hand, at 100 milliwatts, is very high power.

If we could only offload the task of communication away from the sensor, we could pull this off!

The inspiration for the idea comes from the Russian Great Seal bug. Inside the Great Seal was a very thin membrane that vibrated when there was talking nearby. Russian spies then used a directional remote radio to receive those analog vibrations and reconstruct the sound. The following is from the Wikipedia page on this seal bug, called "the Thing":
The "Thing" consisted of a tiny capacitive membrane connected to a small quarter-wavelength antenna; it had no power supply or active electronic components. The device, a passive cavity resonator, became active only when a radio signal of the correct frequency was sent to the device from an external transmitter. Sound waves (from voices inside the ambassador's office) passed through the thin wood case, striking the membrane and causing it to vibrate. The movement of the membrane varied the capacitance "seen" by the antenna, which in turn modulated the radio waves that struck and were re-transmitted by the Thing. A receiver demodulated the signal so that sound picked up by the microphone could be heard, just as an ordinary radio receiver demodulates radio signals and outputs sound.
The group applies the same principle to cameras to make them battery-free. They showed on stage the first demo of analog video backscatter, which sends pixels directly to the antenna. The range of the prototype system was 27 feet, and it streamed 112x112-resolution real-time video. A software-defined radio was used to recreate the backscattered analog video.

For the analog hardware, the speaker said they drew inspiration from human brain signals and created pulse-width-modulated pixels. More engineering went into intra-frame compression, leveraging the fact that adjacent pixels are fairly similar.

Vesper: Measuring Time-to-Interactivity for Web Pages

This paper is by Ravi Netravali and Vikram Nathan, MIT CSAIL; James Mickens, Harvard University; Hari Balakrishnan, MIT CSAIL.

This work asks the question of "what does it mean for a page to load quickly?" In other words, how should we define the load time?

Page loads work like this: the URL typed into the browser prompts the server to send the HTML. The browser runs the HTML + JavaScript (requesting other embedded items as it encounters them in the HTML, but that is the topic of Ravi and James's other paper in the session). In the browser there is a JavaScript engine and a rendering engine, which constructs the DOM tree. The JS engine and the DOM tree interact via the DOM API.

Often the metric used is page load time. Of course, this is very conservative, because only some of the page content is immediately visible; the invisible part doesn't matter, so why care about the load time of the invisible parts? Making this observation, Google came up with the speed index: time to render above the fold, i.e., in the visible part of the browser.

But the speed index is also deficient, because it says nothing about JavaScript running time. The JS running time affects the user experience, say via autocomplete and the like. An interactive page is not usable without its JS, and today a large fraction of pages are interactive. The talk gave some statistics: a median of 182 handlers and a 95th percentile of 1252 handlers in the pages surveyed.

To improve on the speed index, this work comes up with the ready index, which is defined as page time-to-interactivity in terms of visibility and functionality.

But the challenge is that nobody knows a good way to automatically identify the interactive state: is the JavaScript working yet?

The system, Vesper, uses a two-phase approach:

  • identify the visible elements, their event handlers, and the state those handlers access when fired
  • track the loading progress of the interactive state identified in phase 1

As a side note, the speaker, Ravi, spoke in a relaxed and clear way. The talk didn't feel rushed, but covered a lot of stuff. I really loved the delivery.

Performance Analysis of Cloud Applications

This paper is by Dan Ardelean, Amer Diwan, and Chandra Erdman, Google.

This work from Google considers the question: how do we evaluate a change before we deploy it in production? Yes, the accepted approach is to use A/B testing over a small fraction of the users, but is that enough?

The abstract has this to say:
"Many popular cloud applications are large-scale distributed systems with each request involving tens to thousands of RPCs and large code bases. Because of their scale, performance optimizations without actionable supporting data are likely to be ineffective: they will add complexity to an already complex system often without chance of a benefit. This paper describes the challenges in collecting actionable data for Gmail, a service with more than 1 billion active accounts.
Using production data from Gmail we show that both the load and the nature of the load changes continuously. This makes Gmail performance difficult to model with a synthetic test and difficult to analyze in production. We describe two techniques for collecting actionable data from a production system. First, coordinated bursty tracing allows us to capture bursts of events across all layers of our stack simultaneously. Second, vertical context injection enables us to combine high-level events with low-level events in a holistic trace without requiring us to explicitly propagate this information across our software stack."
Vertical context injection roughly means collecting the trace at the kernel level, using ftrace, where the layers above the kernel inject information into the kernel via stylized syscalls carrying a payload.
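My understanding of that trick, sketched in Python (an illustration of the general idea, not Google's actual mechanism; the magic constant is made up): issue an intentionally failing, distinctive syscall whose arguments encode the high-level event, so its entry shows up in the ftrace stream alongside the low-level kernel events.

import os

MAGIC_FD = -42  # invalid fd: the syscall fails fast and harmlessly, but
                # its entry (with arguments) still appears in the kernel
                # trace when syscall tracing is enabled

def inject_event(event_id: int, payload: int) -> None:
    # Stamp a high-level event into the kernel trace via a stylized syscall.
    try:
        os.lseek(MAGIC_FD, (event_id << 32) | payload, os.SEEK_SET)
    except OSError:
        pass  # expected failure; the trace entry was the whole point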

The paper concludes with these observations. For meaningful performance experiments:

  • do experiments in production
  • use controlled A/B tests with 10 million users (fewer is not very meaningful)
  • use long-lived tests to capture the changing mix of requests
  • use creative approaches (e.g., vertical context injection) for collecting rich data cheaply.

LHD: Improving Cache Hit Rate by Maximizing Hit Density

This paper is by Nathan Beckmann, Carnegie Mellon University; Haoxian Chen, University of Pennsylvania; Asaf Cidon, Stanford University and Barracuda Networks.

Who knew... Cache eviction policies still require work and you can achieve big improvements there.

To motivate the importance of cache hit rate research, Asaf mentioned the following: the key-value cache is 100x faster than the database. For Facebook, improving its 98% cache hit rate by just an additional 1% would improve performance by 35%.

Here is the abstract:
Cloud application performance is heavily reliant on the hit rate of datacenter key-value caches. Key-value caches typically use least recently used (LRU) as their eviction policy, but LRU’s hit rate is far from optimal under real workloads. Prior research has proposed many eviction policies that improve on LRU, but these policies make restrictive assumptions that hurt their hit rate, and they can be difficult to implement efficiently.
We introduce least hit density (LHD), a novel eviction policy for key-value caches. LHD predicts each object’s expected hits-per-space-consumed (hit density), filtering objects that contribute little to the cache’s hit rate. Unlike prior eviction policies, LHD does not rely on heuristics, but rather rigorously models objects’ behavior using conditional probability to adapt its behavior in real time.
To make LHD practical, we design and implement RankCache, an efficient key-value cache based on memcached. We evaluate RankCache and LHD on commercial memcached and enterprise storage traces, where LHD consistently achieves better hit rates than prior policies. LHD requires much less space than prior policies to match their hit rate, on average 8x less than LRU and 2–3x less than recently proposed policies. Moreover, RankCache requires no synchronization in the common case, improving request throughput at 16 threads by 8x over LRU and by 2x over CLOCK.
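To unpack hit density a bit: rank each object by the hits it is expected to contribute per unit of cache space it occupies over its lifetime, and evict the lowest-ranked object. Here is my own simplified Python sketch of that ranking (the real system estimates these quantities online from object age distributions; the field names are mine):

def hit_density(hit_probability, size_bytes, expected_lifetime):
    # Expected hits contributed per unit of space-time the object
    # will occupy in the cache.
    return hit_probability / (size_bytes * expected_lifetime)

def choose_victim(cache):
    # Evict the object contributing least to the overall hit rate.
    # `cache` maps key -> (hit_probability, size_bytes, expected_lifetime).
    return min(cache, key=lambda k: hit_density(*cache[k]))

# Example: a big, rarely re-hit object loses to a small hot one.
cache = {"hot_small": (0.9, 100, 10), "cold_big": (0.2, 10_000, 50)}
assert choose_victim(cache) == "cold_big"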

Poster session

There was a poster session at the end of day 2. I wish there had been more preliminary-but-bold work in the poster session, because most of the posters simply accompanied papers presented in the main track.

I liked these two posters the most. They are both very interesting works in progress.
  • Distributed Test Case Generation using Model Inference. Stewart Grant and Ivan Beschastnikh, University of British Columbia
  • High Performance and Usable RPCs for Datacenter Fabrics. Anuj Kalia, Carnegie Mellon University; Michael Kaminsky, Intel Labs; David G. Andersen, Carnegie Mellon University 

NSDI Day 3

Day 3 had these sessions:

  • Network monitoring and diagnosis
  • Fault-Tolerance
  • Physical Layer
  • Configuration Management

Since the sessions were networking-specific, and since I was tired of the firehose of information spewed at me in the first 2 days, I didn't take many notes on day 3. So I will just include my notes on the Plover paper from the fault-tolerance session.

PLOVER: Fast, Multi-core Scalable Virtual Machine Fault-tolerance

This paper is by Cheng Wang, Xusheng Chen, Weiwei Jia, Boxuan Li, Haoran Qiu, Shixiong Zhao, and Heming Cui, The University of Hong Kong.

This work builds on the Remus paper, which appeared in NSDI'08 and received a test-of-time award at NSDI this year.

The two limitations of the Remus primary/backup VM replication approach were that:

  • too many memory pages need to be copied and transferred, and
  • a split brain is possible due to partitioning.

This work, Plover, uses Paxos to address these problems, and Paxos helps with both. By using 3 nodes, Plover doesn't suffer from the split brain that the primary-backup approach suffers from. And by totally ordering the requests seen by the replicas, it can avoid copying memory pages: replicas executing the same sequence of inputs should have the same state --- assuming deterministic execution, that is.

The drawback with Paxos is that it cannot handle non-determinism in request execution. To fix this, Plover invokes VM page synchronization periodically, before releasing replies.

The good news is that using Paxos to totally order requests makes most memory pages the same: the paper reports that 97% of the pages are identical. So the VM synchronization is lightweight, because it only needs to take care of the remaining 3% of the pages.
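Here is a back-of-the-envelope Python sketch of what such lightweight page synchronization could look like (my own illustration, not Plover's actual mechanism): hash the memory pages on each side and ship only the ones that diverged.

import hashlib

PAGE_SIZE = 4096

def page_hashes(memory: bytes):
    # Hash each fixed-size page of a VM memory image.
    return [hashlib.sha256(memory[i:i + PAGE_SIZE]).digest()
            for i in range(0, len(memory), PAGE_SIZE)]

def pages_to_ship(primary: bytes, replica_hashes):
    # Return only the pages whose contents diverged on the replica.
    # If ~97% of pages match, as the paper reports, this ships ~3%.
    diffs = []
    for idx, h in enumerate(page_hashes(primary)):
        if idx >= len(replica_hashes) or h != replica_hashes[idx]:
            start = idx * PAGE_SIZE
            diffs.append((idx, primary[start:start + PAGE_SIZE]))
    return diffs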

Plover is available on GitHub.

I wonder: since Plover already does VM synchronization, does it really need requests to be 100% totally ordered and delivered to the replicas via Paxos? Would it be possible to use a relaxed but faster solution? The Tapir project explored relaxed ordering of operations for storage systems, and some of its lessons may be applicable here.

MAD questions

OK, the MAD questions today pick up the thread from last time. How do you improve the conference experience? Are conferences cramming too many technical sessions into a day? What can be done differently to improve the interactivity and networking of the conference participants?

A major reason I go to conferences is to meet people doing interesting work and converse with them, to better understand their perspectives and thought processes.

During the 3 days at NSDI, I talked with 14 people. That is not a bad number for me. I am not from the networking and NSDI community, so I don't know most people there. I get more chances to interact with people if I go to a conference where I know more people. Unfortunately, since I kept switching research areas (theoretical distributed systems 98-00, wireless sensor networks 00-10, smartphones/crowdsourcing 10-13, cloud computing 13-18), I don't have a home conference where I know most people.

Of these 14 people, I knew only 3 before. Of the rest, I knew a couple from interacting on Twitter, but the majority were cold-hello, first-time interactions.

The cold-hello interactions are hard, and as an introvert and a shy person (except when I am curious), I had to force myself to have these first-time interactions. I assume the people I approach are also interested in talking to people (that is what conferences are supposed to be about), and that we can have nice, interesting conversations, since we have a shared interest in distributed systems, or at least in research. I would say 75% of the conversations I had were interesting and not superficial. But sometimes it bombs, and that gets awkward. And instead of being happy about the nice interactions you had, it is easy to focus on the awkward ones and feel bad about them.

Although I am happy with meeting 14 interesting people, this is much lower than the number of people I meet and talk with at HPTS. If you look at my posts about HPTS, you can see that I made it a point to emphasize how much I enjoyed the interactivity of HPTS.

I think a major way HPTS makes this happen is that it sets the intention clearly and states it explicitly on the first day. Pat Helland takes the stage and says that "the point of HPTS is to meet other people and interact, and the sessions are just a break from meeting/networking with other people." Since HPTS makes the cold hello the norm, it does not feel weird anymore. I never had an awkward conversation at HPTS.

I am sure there are many ways to build interactivity and networking into conferences. Why don't we make the poster session a long session in the afternoon, rather than after 6pm? Are there ice-breaker activities that conferences can adopt? I remember that at an activity with 20 people, the moderator asked everyone to say something uniquely quirky about themselves. That broke the ice pretty quickly; I still remember some of the quirks people mentioned. To scale to larger groups, it may be possible to have open-mike crazy-ideas, dangerous-ideas, and hot-takes sessions. Maybe we need to get some professionals to help, say improv comedians or capable event ice-breakers. (I assume big companies like Google have skilled HR people who help tech people interact better, right?)

OK, this could get long, and I am not really knowledgeable in this area, so I will stop here. But here is my ask: next time you see me at a conference, please approach me and say hello. I am sure we will have a nice conversation and things to learn from each other.

Maybe next time I should make a badge to make this explicit: "Please say hi to me; I would love to meet and talk with you."