Tuesday, May 29, 2018

Some blockchain applications and Reviewcoin

Conjecture 34: For all x, we have "x on blockchain".
I don't have a proof but I have seen dental on blockchain, kiwis on blockchain, shoes on blockchain, etc.

Apart from the many silly x-on-blockchain attempts, I have heard some serious and promising applications. The Brave browser's Basic Attention Token seems to be a serious effort and can help redefine the ad economy in browsers. I also recently heard about Abra, a digital currency bank teller, which is another solid company and application.

To continue, Jackson Palmer said these are his favorite decentralized technology projects: WebTorrent, Dat/Beaker, Mastodon, and Scuttlebutt. Finally, there are well-circulated requests for apps, which I think can create some uptick in the blockchain applications game.

My blockchain app suggestions

I didn't want to be left out of the action. Ideas are free. Here are some things that occurred to me.

Medical school students can do an ICO and fund their schooling

This will be a utility token, where the owner can redeem each token for 30 minutes' worth of visitation/consulting. As the student becomes a more experienced and skilled doctor, the token becomes more valuable, because 30 minutes of the doctor's time is now worth more. Of course, with the token comes a token economy, where you can buy/sell tokens for doctors.

The same idea can apply to some other graduate schools as well, including electrical engineering, computer science, etc. Maybe some labs can raise funding for students/projects using this model.

Being a philanthropist was never so profitable. And being greedy was never so philanthropic.

But come to think of it, I don't like this. Would this lead to maintaining a stock ticker for people in your life? (Hmm, his grades this semester are not that good, time to short his tokens.) People into stock trading all day may be OK with this, but this is definitely not for me.

Review coin

This idea occurred to me recently: we should build a blockchain solution for a conference reviewer crediting system. If you don't help with reviewing, you don't get to have your papers reviewed.

The biggest advantage of doing this on a blockchain rather than in a traditional centralized database would be that it would belong to everyone in the academic community rather than to a specific organization. IEEE versus ACM is a thing. There are many other organizations. Even some universities may not want to submit to a system maintained by a "rival" institution. (Though I must say arxiv.org, maintained by Cornell, gained adoption from all/most institutions as a repository of electronic preprints, so it looks like it can be done.)

With blockchain technology it may be possible to get the incentive model right as well, since similar problems have been studied in blockchains for cryptocurrencies.

There are many challenges here though. We like to keep the reviewers anonymous. We also want to keep the power of the anonymous reviewers in check, avoid bad/empty/trivial reviews, and enforce a certain quality level in the reviews.

To keep the review quality in check, we can require a multisig on each review. The extra endorsements may come from PC chairs or from assigned reviewer-reviewers to keep the system more decentralized.

The participants (authors, reviewers, reviewer-reviewers) can be anonymous. The reviews do not need to go into the blockchain/DAG; only hashes of the reviews need to go in there, for verification while maintaining some privacy. The participants will have private and public keys, so, as in Bitcoin, it is not possible to initiate a transaction on behalf of another. On the other hand, we are still worried about double-spending and attacks on the blockchain.
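
Here is a minimal toy sketch of this idea in Python. The hashing of the review body and the threshold check follow the description above; the "signatures" are a keyed-hash stand-in for a real signature scheme (e.g., ECDSA), and all names, keys, and the threshold value are made up for illustration.

```python
import hashlib

def review_digest(review_text):
    """Only this digest goes on the chain; the review body stays off-chain."""
    return hashlib.sha256(review_text.encode()).hexdigest()

def toy_sign(digest, key):
    """Stand-in for a real signature scheme, to keep the sketch self-contained."""
    return hashlib.sha256((key + digest).encode()).hexdigest()

def toy_verify(digest, signature, key):
    return toy_sign(digest, key) == signature

def endorsed(digest, endorsements, threshold=2):
    """endorsements maps an endorser's key to its signature over the digest;
    the review earns credit only if enough endorsers (e.g., a PC chair plus a
    reviewer-reviewer) have co-signed."""
    valid = sum(1 for key, sig in endorsements.items() if toy_verify(digest, sig, key))
    return valid >= threshold

digest = review_digest("The proof of Lemma 3 has a gap; see detailed comments.")
endorsements = {k: toy_sign(digest, k) for k in ("reviewer-reviewer-key", "pc-chair-key")}
print(endorsed(digest, endorsements))  # True
```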

To fend off attacks, the Reviewcoin blockchain could be powered by PoW, PoS, or a leaderless gossip-based protocol (as in the Avalanche paper). This is a crazy thought, but would it at all be possible to make this blockchain reviewer-work powered? There are some works that look at human-work puzzles and proof-of-useful-work for blockchains. I speculate on how a review-work-powered method could be adapted for this problem in the MAD questions below.

Reviewcoin can enable a better compensated yet more open model for reviewing papers. The model can also be applied to reviewing whitepapers, component designs, or even code. The party that needs to get something reviewed should hodl some Reviewcoin, which could be earned by doing meaningful reviews, by helping keep the Reviewcoin blockchain infrastructure running strong, or by purchasing from Reviewcoin owners.

By offering more Reviewcoins, it may be possible to recruit more skilled/capable/experienced reviewers for the work. Offering more coins should not get you more favorable reviews though; the reviewer-reviewers, appointed by the system, should enforce fairness.

MAD questions

1. Is linking reviews with compensation opening a can of worms?
Probably... But the existing system is not great either. Academics volunteer for technical program committees of prestigious conferences (1) to help, (2) to get some name recognition as their names get listed on the conference website, and (3) to keep current with new research in the field. This is a good-will based model; these reviewers would consult outside starting from $200 an hour, yet they spend tens of hours for free to review for a conference. Inevitably this good-will based approach is starting to face many problems. Freeloading is a problem, as we are seeing some conferences getting 100s and even 1000s of paper submissions. Incentives are becoming a problem: for journals and weak conferences, it is hard to find reviewers, because the name recognition part is not that strong. Finally, experts are overstretched with volunteering.

So maybe we should consider opening a can of worms, if it can help us build a more sustainable system going forward.

2. Human proof-of-work for Reviewcoin
To certify that a review is valid and not random junk produced to collect some Reviewcoins, the review can be checked and signed by program committee chairs if this is for a conference. We can have authenticated academics serving as PC chairs, so their signature certifies the review is correct and the reviewer gets some Reviewcoin for the service. To extend this to non-conference, unorganized reviews, the review certification can be done through a multisig from a randomly selected set of participants. If the Byzantine nodes (Sybil/bot participants) are in the minority, they will not be able to issue false certifications of reviews. It may be possible to use a proof-of-stake of participants to limit the number of Sybil nodes. Also, the Reviewcoin stake/credibility of participants can be used in a weighted manner for reliability.
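
To make the last two sentences a bit more concrete, here is a toy sketch of stake-weighted selection of certifiers followed by a majority vote. The participant names, stake values, and sample size are invented for illustration; this is only a sketch of the idea, not a worked-out protocol.

```python
import random

def select_certifiers(stake, k=3, seed=None):
    """Sample k certifiers weighted by Reviewcoin stake, so that cheaply created
    Sybil identities are rarely chosen (sampling with replacement, for simplicity)."""
    rng = random.Random(seed)
    participants = list(stake)
    return rng.choices(participants, weights=[stake[p] for p in participants], k=k)

def certified(votes):
    """A review is certified only if a majority of the sampled certifiers endorse it."""
    return sum(votes.values()) > len(votes) / 2

stake = {"alice": 40, "bob": 35, "carol": 20, "sybil1": 1, "sybil2": 1}
certifiers = select_certifiers(stake, k=3, seed=7)
votes = {c: True for c in certifiers}   # assume the sampled certifiers endorse the review
print(certifiers, certified(votes))
```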

3. Where did May go?
Time is an illusion. End-of-semester doubly so. (With apologies to Douglas Adams.)

Wednesday, May 23, 2018

TUX2: Distributed Graph Computation for Machine Learning

The TUX2 paper appeared in NSDI 17 and was authored by Wencong Xiao, Beihang University and Microsoft Research; Jilong Xue, Peking University and Microsoft Research; Youshan Miao, Microsoft Research; Zhen Li, Beihang University and Microsoft Research; Cheng Chen and Ming Wu, Microsoft Research; Wei Li, Beihang University; Lidong Zhou, Microsoft Research.

TUX2 introduces some new concepts to graph process engines to adapt them better for machine learning (ML) training jobs. Before I can talk about the contributions of TUX2, you need to bear with me as I explain how current graph processing frameworks fall short for ML training.

Background and motivation

Graph processing engines often take a "think like a vertex" approach. A dominant computing model in the "think like a vertex" approach is the Gather-Apply-Scatter (GAS) model. You can brush up on graph processing engines by reading my reviews of Pregel and Facebook graph processing.

Modeling ML problems as bipartite graphs

Many ML problems can be modeled with graphs and attacked via iterative computation on the graph vertices. The Matrix Factorization (MF) algorithm, used in recommendation systems, can be modeled as a computation on a bipartite user-item graph where each vertex corresponds to a user or an item and each edge corresponds to a user's rating of an item.

A topic-modeling algorithm like Latent Dirichlet Allocation (LDA) can be modeled as a document-word graph. If a document contains a word, there is an edge between them; the data on that edge are the topics of the word in the document. Even logistic regression can be modeled as a sample-feature graph.
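
To make the MF example concrete, here is a toy Python sketch (not TUX2 code) of a user-item bipartite graph where each edge carries a rating, each vertex carries a latent factor vector, and each edge update touches both endpoint vertices. The ratings, dimensions, and learning-rate values are made up for illustration.

```python
import random

# Toy user-item bipartite graph for MF: edges carry ratings, vertices carry
# latent factor vectors that the iterative computation updates.
ratings = {("u1", "i1"): 5.0, ("u1", "i2"): 3.0, ("u2", "i2"): 4.0}   # edges
dim, lr, reg = 4, 0.01, 0.05
factors = {v: [random.uniform(-0.1, 0.1) for _ in range(dim)]
           for edge in ratings for v in edge}                         # vertex data

def edge_update(user, item, rating):
    """One SGD step on a single edge updates BOTH endpoint vertices, which is
    why MF maps naturally onto edge-centric computation on a bipartite graph."""
    pu, qi = factors[user], factors[item]
    err = rating - sum(a * b for a, b in zip(pu, qi))
    for d in range(dim):
        pu[d], qi[d] = (pu[d] + lr * (err * qi[d] - reg * pu[d]),
                        qi[d] + lr * (err * pu[d] - reg * qi[d]))

for _ in range(100):                    # each sweep over the edges is one iteration
    for (u, i), r in ratings.items():
        edge_update(u, i, r)
print(factors["u1"])
```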


Gaps in addressing ML via graph processing

1. The graphs that model ML problems are often bipartite, with heterogeneous vertices that have distinct roles (e.g., user vertices and item vertices). However, the standard graph model used in graph processing frameworks assumes a homogeneous set of vertices.

2. For ML computation, an iteration of a graph computation might involve multiple rounds of propagation between different types of vertices, rather than a simple series of GAS phases. The standard GAS model is unable to express such computation patterns efficiently.

3. Machine learning frameworks have been shown to benefit from the Stale Synchronous Parallel (SSP) model, a relaxed consistency model with bounded staleness to improve parallelism. Graph processing engines use the Bulk Synchronous Parallel (BSP) model by default.

TUX2 Design

To address the gaps identified above, TUX2
1. supports heterogeneity in the data model,
2. advocates a new graph model, MEGA (Mini-batch, Exchange, GlobalSync, and Apply), that allows flexible composition of stages, and
3. supports SSP in execution scheduling.

Next we discuss the basic design elements in TUX2, and how the above 3 capabilities are built on them.

The vertex-cut approach 

TUX2 uses the vertex-cut approach (introduced in PowerGraph), where the edge set of a high-degree vertex can be split into multiple partitions, each maintaining a replica of the vertex. One of these replicas is designated the master; it maintains the master version of the vertex's data. All the remaining replicas are called mirrors, and each maintains a local cached copy.

Vertex-cut is very useful for implementing the parameter-server model: the master versions of all vertices' data can be treated as the distributed global state stored in a parameter server. The mirrors are distributed to workers, which also hold the second type of vertices and use the mirror vertices to iterate on them.

Wait, the second type of vertices? Yes, here we harken back to the bipartite graph model. Recall that we had a bipartite graph with heterogeneous vertices, with some vertices having higher degrees. Those higher-degree vertices become master vertices held at the servers, while the low-degree vertices hold the data/training examples; the workers cache the master vertices as mirrors and train against them. And, in some sense, the partitions of the low-degree vertex type in the bipartite graph correspond to mini-batches.

The paper has the following to say on this. In a bipartite graph, TUX2 can enumerate all edges by scanning only vertices of one type. The choice of which type to enumerate sometimes has significant performance implications. Scanning the vertices with mirrors in a mini-batch tends to lead to a more efficient synchronization step, because these vertices are placed contiguously in an array. In contrast, if TUX2 scans vertices without mirrors in a mini-batch, the mirrors that get updated for the other vertex type during the scan will be scattered and thus more expensive to locate. TUX2 therefore allows users to specify which set of vertices to enumerate during the computation.
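
A toy sketch of the master/mirror synchronization described above: one high-degree item vertex is replicated as mirrors on two workers, each worker computes a delta against its cached mirror, and a sync step aggregates the deltas into the master copy and broadcasts the refreshed value back. This mimics the parameter-server analogy only at a very high level; the worker names and values are invented.

```python
# Master copy at the server role; mirrors are cached copies at two workers.
master = {"i1": 1.0}
mirrors = {"worker0": dict(master), "worker1": dict(master)}

def local_delta(worker, delta):
    """Stand-in for training over the worker's local (user) vertices."""
    return {"i1": delta}

def sync(deltas):
    for d in deltas:                                   # mirrors -> master (reduce)
        for v, x in d.items():
            master[v] += x
    for w in mirrors:                                  # master -> mirrors (broadcast)
        mirrors[w].update(master)

sync([local_delta("worker0", 0.3), local_delta("worker1", -0.1)])
print(master, mirrors)   # {'i1': 1.2} on the master and on both mirrors
```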

Each partition is managed by a process that logically plays both

  • a worker role, to enumerate vertices in the partition and propagate vertex data along edges, and 
  • a server role, to synchronize states between mirror vertices and their corresponding masters. 
Inside a process, TUX2 uses multiple threads for parallelization and assigns both the server and worker roles of a partition to the same thread. Each thread is then responsible for enumerating a subset of mirror vertices for local computation and maintaining the states of a subset of master vertices in the partition owned by the process.


Figure 3 illustrates how TUX2 organizes vertex data for a bipartite graph, using MF on a user-item graph as an example. Because user vertices have a much smaller degree in general, only item vertices are split by vertex-cut partitioning. Therefore, a master vertex array in the server role contains only item vertices, and the worker role only manages user vertices. This way, there are no mirror replicas of user vertices and no distributed synchronization is needed for them. In the worker role, the mirrors of item vertices and the user vertices are stored in two separate arrays.

In each partition, TUX2 maintains vertices and edges in separate arrays. Edges in the edge array are grouped by source vertex. Each vertex has an index giving the offset of its edge-set in the edge array. Each edge contains information such as the id of the partition containing the destination vertex and the index of that vertex in the corresponding vertex array. This graph data structure is optimized for traversal and outperforms vertex indexing using a lookup table. Figure 2 shows how data are partitioned, stored, and assigned to execution roles in TUX2.
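
A minimal sketch of this per-partition layout, assuming made-up values: vertices and edges live in separate arrays, edges are grouped by source vertex, and each vertex records the offset of its edge set (a CSR-like structure optimized for sequential traversal). Each edge here holds a destination partition id, the destination's index in that partition's vertex array, and the edge data.

```python
vertices     = ["u1", "u2", "u3"]
edge_offsets = [0, 2, 3, 5]            # vertex k owns edges[edge_offsets[k]:edge_offsets[k+1]]
edges        = [(1, 0, 5.0), (1, 1, 3.0),   # u1 -> (partition 1, index 0), (partition 1, index 1)
                (1, 1, 4.0),                # u2's edges
                (1, 0, 2.0), (1, 2, 1.0)]   # u3's edges

def edges_of(k):
    """Traversal is a contiguous slice, no lookup table needed."""
    return edges[edge_offsets[k]:edge_offsets[k + 1]]

for k, name in enumerate(vertices):
    print(name, edges_of(k))
```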


Scheduling minibatches with SSP

TUX2 executes each iteration on a minibatch with a specified size. Each worker first chooses a set of vertices or edges as the current minibatch to execute on. After the execution on the mini-batch finishes, TUX2 acquires another set of vertices or edges for the next minibatch, often by continuing to enumerate contiguous segments of vertex or edge arrays.

TUX2 supports SSP at mini-batch granularity. It tracks the progress of each mini-batch iteration to enable computation of clocks. A worker considers clock t completed if the corresponding mini-batch is completed on all workers (including synchronizations between masters and mirrors) and if the resulting update has been applied to and reflected in the state. A worker can execute a task at clock t only if it knows that all clocks up to t−s−1 have completed, where s is the allowed slack.
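
The SSP condition above reduces to a one-line check; here is a minimal sketch of it, with worker names and clock values made up for illustration.

```python
def can_execute(t, completed_clocks, slack):
    """completed_clocks[w] is the highest clock worker w has fully completed.
    A worker may start clock t only if every clock up to t - slack - 1 has
    completed on all workers."""
    globally_completed = min(completed_clocks.values())
    return globally_completed >= t - slack - 1

completed = {"w0": 9, "w1": 7, "w2": 8}
print(can_execute(10, completed, slack=2))  # True:  10 - 2 - 1 = 7, and clock 7 is done everywhere
print(can_execute(11, completed, slack=2))  # False: clock 8 is not yet done on w1
```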

The MEGA model 

TUX2 introduces a new stage-based MEGA model, where each stage is a computation on a set of vertices and their edges in a graph. Each stage has user-defined functions (UDF) to be applied on the vertices or edges accessed during it. MEGA defines four types of stage: Mini-batch, Exchange, GlobalSync, and Apply.

MEGA allows users to construct an arbitrary sequence of stages. Unlike GAS, which needs to be repeated in order (i.e., GAS-GAS-GAS-GAS), in MEGA you can flexibly mix and match stages (e.g., E-A-E-A-G). For example, in algorithms such as MF and LDA, processing an edge involves updating both vertices. This requires two GAS phases, but can be accomplished in one Exchange stage in MEGA. For LR, the vertex data propagations in both directions should be followed by an Apply phase, but no Scatter phases are necessary; this can be avoided in the MEGA model because MEGA allows an arbitrary sequence of stages.

Below are examples of Matrix Factorization (MF) and Latent Dirichlet Allocation (LDA) programmed with the MEGA model. (LDA's stage sequence is the same as MF's.)
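
To give a feel for the stage composition, here is a hedged Python sketch of an MF-style stage sequence. The stage names follow the paper's Exchange/Apply/GlobalSync vocabulary, but the function bodies and the run() driver are illustrative placeholders, not TUX2's actual API.

```python
def exchange(minibatch):
    # Process each edge in the mini-batch, updating BOTH endpoint vertices
    # (one Exchange stage instead of the two GAS rounds this would need).
    pass

def apply_stage(minibatch):
    # Apply per-vertex updates (e.g., accumulated gradients) to enumerated vertices.
    pass

def global_sync(minibatch):
    # Synchronize mirror vertices with their masters across workers.
    pass

mf_stages = [exchange, apply_stage, global_sync]   # an E-A-G sequence per mini-batch

def run(stages, minibatches):
    """User-composed stage sequence, executed once per mini-batch."""
    for mb in minibatches:
        for stage in stages:
            stage(mb)

run(mf_stages, minibatches=[["edge-batch-0"], ["edge-batch-1"]])
```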


Implementation and Evaluation

TUX2 is implemented in ~12,000 lines of C++ code. TUX2 takes graph data in a collection of text files as input. Each process picks a separate subset of those files and performs bipartite-graph-aware algorithms to partition the graph in a distributed way. Each partition is assigned to, and stored locally with, a process. Unfortunately the evaluations with TUX2 do not take into account graph partitioning time, which can be very high. 

The evaluations show that data layout matters greatly for the performance of ML algorithms. Figure 8 compares the performance of BlockPG, MF, and LDA with two different layouts: one an array-based graph data layout in TUX2 and the other a hash-table-based layout often used in parameter-server-based systems (but implemented in TUX2 for comparison). The y-axis is the average running time of one iteration for BlockPG, and of 10 iterations for MF and LDA, to show the numbers on a similar scale. These results show that the graph layout improves performance by up to 2.4× over the hash-table-based layout.


The paper also includes a comparison with Petuum, but the evaluations have several caveats. The evaluations do not include a comparison of convergence/execution time; execution time per iteration does not always determine the convergence time. The evaluations do not take into account the graph partitioning time for TUX2. And finally, some comparisons used an early unpatched version of the Petuum MF algorithm whose data placement issues were resolved later.

MAD questions

1. What is the net gain here?
I like this paper; it made me ask and ponder on many questions, which is good.

I don't think TUX2 pushes the state of the art in ML. ML processing frameworks are already very efficient and general with the iterative parameter-server computing model, and they are getting better and more fine-grained.

On the other hand, I think TUX2 is valuable because it showed how high-level graph computing frameworks can be adapted to implement the low-level parameter-server approach and address ML training problems more efficiently. This may provide some advantages for problems that are (or need to be) represented as graphs, such as performing ML training over SPARQL data stores.

Moreover, by using higher-level primitives, TUX2 provides some ease of programmability. I guess this may be leveraged further to achieve some plug-and-play programming of ML for certain classes of programs.

So I find this to be a conceptually very satisfying paper as it bridges the graph processing model to parameter-server model. I am less certain about the practicality part.


2. How do graph processing frameworks compare with dataflow frameworks?
There are big differences between dataflow frameworks and graph processing frameworks. In the dataflow model, there is a symbolic computation graph: the graph nodes represent mathematical operations, while the graph edges represent the data that flow between the operations. That is a very different model than the graph processing model here.

In MEGA, there are only 4 stage types, and each stage takes in user-defined functions. This is higher-level (and arguably more programming friendly) than a dataflow framework such as TensorFlow, which has many hundreds of predefined operators as vertices.


3. How does TUX2 apply to Deep Learning (DL)?
The paper does not talk about whether TUX2 can apply to DL and how it can apply.

It may be possible to make DL fit the TUX2 model with some stretching. Deep neural network (DNN) layers (horizontally or vertically partitioned) could be the high-rank vertices held at the servers, and the images would be the low-rank vertices held in the partitions.

But this will require treating the DNN partitions as a meta-vertex and scheduling executions for each sub-vertex in the meta-vertex in one cycle. I have no clue about how to make backpropagation work here though.

Moreover, each image may need to link to the entire NN, so the bipartite graph may collapse into a trivial one and we end up with trivial data-parallelism. It may be possible to distribute the convolutional layers. It may even be possible to insert early exits and train that way.

So, it may be possible but it is certainly not straightforward. I am not even touching the subject of the performance of such a system.

Monday, May 21, 2018

SoK Cryptocurrencies and the Bitcoin Lightning Network

We wrapped up the distributed systems seminar with two more papers discussed last month.

The "SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies" paper appeared in 2015. Although the cryptocurrency scene has seen a lot of action recently, this survey paper has not become dated and is still a very good introduction to the technical aspects and challenges of cryptocurrencies. The paper starts with a technical overview of the cryptocurrency concept. Then it delves more into incentives and stability issues. It observes that it is unclear "how stability will be affected either in the end state of no mining rewards or in intermediate states as transaction fees become a non-negligible source of revenue". It talks about possible attacks, including the Goldfinger attack and feather-forking, and also about the stability of mining pools and the peer-to-peer layer. Finally, it also covers some security and privacy issues.

"The Bitcoin Lightning Network: Scalable Off-Chain Instant Payments" whitepaper is from 2016. Bitcoin has a very high latency, poor scalability, and high transaction fees, and the lightning network provides an overlay solution to ease off these pains. To this end, the paper describes a secure off-chain solution for instant payments and microtransactions with low transaction fee.  This video by Jackson Palmer describes the lightning network ideas very nicely. (I strongly prefer reading to watching/listening, but recently I am finding a lot of useful content on YouTube.)

The protocol builds on the basic concept of a pairwise payment channel. Two parties put aside an initial amount of Bitcoin into a multisignature transaction. Subsequently, many updates to the allocation of the current balance can be made off-chain with just the cooperation of both parties, using a new timelocked transaction, without broadcasting it to the chain. This is analogous to buying a Starbucks card and transacting with Starbucks pairwise with that card. A broadcast to the chain can be done to redeem funds on the chain and to close the channel. If either party tries to cheat by broadcasting an old transaction state from the pairwise payment channel, the counterparty may take all the funds in the channel as a penalty after it provides the latest multisig-agreed state to the chain (within the on-chain dispute mediation window).

The lightning overlay network is then formed by multihop routing over these pairwise payment channels. Even when A and Z do not have a direct pairwise payment channel, it may be possible to construct a multihop route from A to Z using intermediaries, traversing several pairwise payment channels. While pairwise channels could be without a fee, the multihop intermediaries need a small transaction fee to have an incentive to participate.

The pairwise payment channels need to be extended with hashlocks to achieve multihop transactions. Here is how this is done, as explained in the Bitcoin wiki (a minimal sketch of the hash-lock condition follows the steps):

  1. Alice opens a payment channel to Bob, and Bob opens a payment channel to Charlie.
  2. Alice wants to buy something from Charlie for 1000 satoshis.
  3. Charlie generates a random number and generates its SHA256 hash. Charlie gives that hash to Alice.
  4. Alice uses her payment channel to Bob to pay him 1,000 satoshis, but she adds the hash Charlie gave her to the payment along with an extra condition: in order for Bob to claim the payment, he has to provide the data which was used to produce that hash.
  5. Bob uses his payment channel to Charlie to pay Charlie 1,000 satoshis, and Bob adds a copy of the same condition that Alice put on the payment she gave Bob.
  6. Charlie has the original data that was used to produce the hash (called a pre-image), so Charlie can use it to finalize his payment and fully receive the payment from Bob. By doing so, Charlie necessarily makes the pre-image available to Bob.
  7. Bob uses the pre-image to finalize his payment from Alice.
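
Here is a minimal Python sketch of the hash-locked condition used in steps 3-7 above. It only illustrates the pre-image check that gates each hop's claim; it says nothing about the actual Bitcoin script or timelocks, and the values are generated on the spot.

```python
import hashlib, os

preimage = os.urandom(32)                         # step 3: Charlie's random secret
hashlock = hashlib.sha256(preimage).hexdigest()   # the hash Alice and Bob build into their payments

def claim(payment_hashlock, revealed_preimage):
    """A hop can claim its conditional payment only by revealing data that hashes
    to the agreed lock; revealing it lets the previous hop claim as well."""
    return hashlib.sha256(revealed_preimage).hexdigest() == payment_hashlock

print(claim(hashlock, preimage))        # Charlie claims from Bob: True
print(claim(hashlock, os.urandom(32)))  # a wrong pre-image cannot claim: False
```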

In order for the multihop routes to work, there should be enough cooperative participants online, each with enough balance to pass the bucket from hand to hand. The rest is a path-finding exercise. With viable paths present, you can start the lightning transaction, similar to making a source-side routing decision. There has been a lot of work on MANETs under the names of intermittently connected networks and delay-tolerant routing. I wonder if some of it can find applications here (even though that work was mostly geometric and this is a non-geometric network). In practice though, the multihop network often converges to a hub-and-spokes model, with a well-connected, fat-wallet intermediary.

The lightning network is also useful for transacting securely across different blockchains, as long as all edges in the route support the same hash function for the hashlock and can create time locks.

MAD questions

1. How can we improve the way we run the seminar?
The students liked how we ran the seminars. They said they were more actively engaged and learned a great deal due to this format. But I think we can improve. It would be nice to get experts to video-conference in and answer some questions. I think many experts would be generous enough to spare 20 minutes to answer some questions from smart, well-prepared students.

The students also mentioned that it would have been nice to have some hands-on projects to accompany the seminars. We started a blockchain channel so interested students can figure out some small projects to work on and collaborate.

2. What did I learn?
I didn't think much of blockchains but I am fascinated to learn that there are many challenging questions here and many good ideas/techniques. This is an area very suitable for doing distributed algorithms/protocols work, which I love. There is a need for developing more principled approaches and well-reasoned and verified algorithms/protocols.

I still think the best ideas from these works will get borrowed and used in more centralized (possibly hierarchical or federated) systems for the sake of efficiency/economy and scalability. That may be how these systems go mainstream to millions and even billions of users.

Saturday, May 19, 2018

Book Review -- Accidental Genius: Using Writing to Generate Your Best Ideas, Insight, and Content

I had read this short book a long time ago. It is a very helpful book for learning how to use writing for thinking --freewriting.

Motivation for freewriting

The mind is lazy. Thinking is hard, and our brains don't like hard. It recycles tired thoughts, and avoids unfamiliar and uncomfortable territory.

Freewriting prevents that from happening. Freewriting is a form of forced creativity.
Writing is nature's way of letting you know how sloppy your thinking is.  --Guindon 
If you think without writing, you only think you're thinking. --Lamport 
Freewriting helps to unclog the mind, reduce resistance to thinking and writing, bring clarity, provide perspective, improve creativity by causing a chain reaction of ideas, and articulate ideas better.

The premise of freewriting is simple: getting 100 ideas is easier than getting 1. "When you need an idea, don't try for just one. When searching for one great idea, we demand perfection from it, depress ourselves, become desperate, and block. Go for lots of ideas. Keep your threshold low. One idea leads to the next."

The first half of the book gives the below six tactics for freewriting. (The second half elaborates more on these.)

1. Try Easy

A relaxed 90 percent is more efficient than a vein-bulging 100 percent effort. When you begin freewriting about a thorny subject, remind yourself to try easy.

2. Write Fast and Continuously 

Your best thought comes embedded in chunks of your worst thought. Write a lot. Think quantity. If you temporarily run out of things to say, babble onto the page. Write as quickly as you can type, and continue to generate words without stopping: Your mind will eventually give you its grade A unadulterated thoughts.

3. Work against a Time Limit 

The limit energizes your writing effort by giving you constraints. The Pomodoro method helps here.

4. Write the Way You Think 

Freewriting is a means of watching yourself think. Write the way you talk to yourself. Since you're writing for yourself, you don’t need to polish your raw thoughts to please others. All that matters is that you yourself understand your logic and references.

5. Go with the Thought 

That's the first rule of improv theater. Assume that a particular thought is true, and take a series of logical steps based on that thought. In other words, ask what-if questions.

6. Redirect Your Attention 

Explore random paths with focus changing questions: Why? Why not? How can I change that? How can I prove/disprove that? What am I missing here? What does this remind me of? What’s the best/worst-case scenario? Which strengths of mine can I apply? How would I describe this to my uncle? If I wanted to make a big mistake here, what would I do? What do I need that I don’t yet have?

MAD questions

1. What are other uses for freewriting?
Freewriting makes for good meditation. Many people use morning pages (freewriting in the morning) to clear logjams in their minds. I couldn't make this a habit, but when I did try it, I got value out of it.

I occasionally use freewriting for thinking about a research problem. I write down some observations and then I start speculating about hypotheses, all in freewriting form. It works. Again, I wish I could make this a habit and do it more often. Well, I guess blogging and the MAD questions help somewhat with this.

Freewriting is also good for planning. I should do more planning.

I guess it is time to schedule some freewriting in my week. 

2. Is it possible to use freewriting for writing?
Yes, you can use freewriting to write memos, articles, and stories. The trick is to edit ruthlessly after freewriting. That may be wasteful though. Freewriting helps you discover what you want to write. After that, doing an outline and writing from it could save time. So it may be best to combine a little bit of bottom-up freewriting with a little bit of top-down outline writing.

3. Did you try any freewriting marathons?
The book suggests that instead of 20-minute freewriting sessions, it is helpful to occasionally go for 4-5 hour freewriting marathons. I never tried that. It sounds torturous, but maybe it is worth a try, out of curiosity about what my depleted, unrestrained psyche would spew out after an hour of typing its train of thought. Couldn't the mind get lazy in freewriting mode as well and start to recycle the same thoughts? But I guess the idea in the marathon is to force your mind to move past that.

4. How was your experience with freewriting?
Let me know in the comments or via email or on Twitter.

Wednesday, May 16, 2018

Paper summary. Decoupling the Control Plane from Program Control Flow for Flexibility and Performance in Cloud Computing

This paper appeared in Eurosys 2018 and is authored by Hang Qu, Omid Mashayekhi, Chinmayee Shah, and Philip Levis from Stanford University.

I liked the paper a lot; it is well written and presented. And I am getting lazy, so I use a lot of text from the paper in my summary below.

Problem motivation 

In data processing frameworks, improved parallelism is the holy grail because it can get more data processed in less time.

However, parallelism has a nemesis called the control plane. While "control plane" can have a wide array of meanings, in this paper it is defined as the systems and protocols for scheduling computations, load balancing, and recovering from failures.

A centralized control plane becomes a bottleneck after a point. The paper cites other papers and states that a typical cloud framework control plane that uses a fully centralized design can dispatch fewer than 10,000 tasks per second. Actually, that is not bad! However, with machine learning (ML) applications we are starting to push past that limit: we need to deploy, on 1000s of machines, large jobs that consist of very many short tasks (10 milliseconds) as part of iterations over data mini-batches.

To improve the control plane scalability, you can distribute the control plane across worker nodes (as in Nimbus and Drizzle). But that also runs into scaling problems due to the synchronization needed between the workers and the controller. Existing control planes are tightly integrated with the control flow of the programs, and this requires workers to block on communication with the controller at certain points in the program, such as spawning new tasks or resolving data dependencies. When synchronization is involved, a distributed solution is not necessarily more scalable than a centralized one. But it is more complicated for sure: in one sense, that is the computer analog of the mythical man-month problem.

Another approach to improve scalability is to remove the control plane entirely. Some frameworks, such as TensorFlow, Naiad, and MPI frameworks, are scheduled once as a big job and manage their own execution after that. Well, of course the problem doesn't go away, but plays out inside the framework: the scalability is limited by the application logic written in these frameworks and the frameworks' support for concurrency control. Furthermore, these frameworks don't play nice with the datacenter/cloud computing environment either. Rebalancing load or migrating tasks requires killing and restarting a computation by generating a new execution plan and installing it on every node.

This paper proposes a control plane design that breaks the existing tradeoff between scalability and flexibility. It allows jobs to run extremely short tasks (<1ms) on thousands of cores and to reschedule computations in milliseconds. And that has applications in particular for ML.

Making the Control Plane Asynchronous

To prevent synchronous operations, the proposed control plane cleanly divides responsibilities between controller and workers: a controller decides where to execute tasks and workers decide when to execute them.

The control plane's traffic is completely decoupled from the control flow of the program, so running a program faster does not increase load at the controller. When a job is stably running on a  fixed set of workers, the asynchronous control plane exchanges only occasional heartbeat messages to monitor worker status.

The architecture

Datasets are divided into many partitions, which determines the available degree of parallelism. Datasets are mutable and can be updated in place. This corresponds nicely to parameters in ML applications.

The controller uses an abstraction called a partition map to control where tasks execute. The partition map describes which worker each data object should reside on. _Because task recipes trigger tasks based on what data objects are locally present, controlling the placement of data objects allows the controller to implicitly decide where tasks execute._ The partition map is updated asynchronously to the workers, and when a worker receives an update to the map it asynchronously applies any necessary changes by transferring data.

On the worker side, an abstraction called task recipes describes triggers for when to run a task by specifying a pattern matched against the task's input data. Using recipes, every worker spawns and executes tasks by examining the state of its local data objects, obviating the need to interact with the controller.

Task recipes

A task recipe specifies (1) a function to run, (2) which datasets the function reads and/or writes, and (3) preconditions that must be met for the function to run.


There are three types of preconditions to trigger a recipe:

  1. Last input writer: For each partition it reads or writes, the recipe specifies which recipe should have last written it. This enforces local write-read dependencies, so that a recipe always sees the correct version of its inputs.
  2. Output readers: For each partition it writes, the recipe specifies which recipes should have read it since the last write. This ensures that a partition is not overwritten until tasks have finished reading the old data.
  3. Read messages: The recipe specifies how many messages a recipe should read before it is ready to run. Unlike the other two preconditions, which specify local dependencies between tasks that run on the same worker, messages specify remote dependencies between tasks that can run on different workers.

Since incorrect preconditions can lead to extremely hard-to-debug computational errors, they are generated automatically from a sequential user program. A single recipe describes potentially many iterations of the same data-parallel computation.

Writers and readers are specified by their stage number, a global counter that every worker maintains. The counter counts the stages in their program order, and increments after the application determines which branch to take or whether to continue another loop. (Using these counters, I think it would also be possible to implement the SSP method easily.) All workers follow an identical control flow, and so have a consistent mapping of stage numbers to recipes.
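
Here is a hedged Python sketch of a recipe and its local trigger check, following the three precondition types listed above. The field names and example values are made up for illustration and are not Canary's actual API.

```python
from dataclasses import dataclass

@dataclass
class Partition:
    last_writer_stage: int = -1          # which stage last wrote this partition
    readers_since_write: int = 0         # how many recipes have read it since then

@dataclass
class Recipe:
    stage: int
    reads: dict                          # partition -> required last-writer stage
    write_after_readers: dict            # partition -> required reader count before overwrite
    messages_needed: int = 0             # remote dependencies

def ready(recipe, partitions, messages_received):
    """A worker spawns the task purely from local state, with no controller round-trip."""
    writers_ok = all(partitions[p].last_writer_stage == s
                     for p, s in recipe.reads.items())
    readers_ok = all(partitions[p].readers_since_write >= n
                     for p, n in recipe.write_after_readers.items())
    return writers_ok and readers_ok and messages_received >= recipe.messages_needed

parts = {"A": Partition(last_writer_stage=3, readers_since_write=2),
         "B": Partition(last_writer_stage=4)}
r = Recipe(stage=5, reads={"A": 3, "B": 4}, write_after_readers={"A": 2})
print(ready(r, parts, messages_received=0))   # True: the worker can run this task locally
```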

Exactly-once Execution and Asynchrony

Ensuring atomic migration requires a careful design of how preconditions are encoded as well as how data objects move between workers. No node in an asynchronous control plane has a global view of the execution state of a job, so workers manage atomic migration among themselves. To ensure that the task from a given stage executes exactly once and messages are delivered correctly, when workers transfer a data partition they include the access history metadata relevant to preconditions, the last writer and how many recipes have read it.

Partition Map

A partition map is a table that specifies, for each partition, which worker stores that partition in memory. A partition map indirectly describes how a job should be distributed across workers, and is used as the mechanism for the controller to signal workers how to reschedule job execution.


The controller does five things:
(1) Starts a job by installing the job's driver program and an initial partition map on workers.
(2) Periodically exchanges heartbeat messages with workers and collects workers' execution statistics, e.g. CPU utilization and CPU cycles spent computing on each partition.
(3) Uses the collected statistics to compute partition map updates during job execution.
(4) Pushes partition map updates to all workers.
(5) Periodically checkpoints jobs for failure recovery.

Maximizing data locality

To maximize data locality, the controller updates the partition map under the constraints that the input partitions to each possible task in a job are assigned to the same worker. The execution model of task recipes is intentionally designed to make the constraints explicit and achievable: if a stage reads or writes multiple datasets, a task in the stage only reads or writes the datasets' partitions that have the same index, so those partitions are constrained to be assigned to the same worker.
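
A minimal sketch of that locality constraint, under the assumption of a simple round-robin assignment (the dataset names, worker names, and partition counts are invented): partitions of different datasets that share the same index are mapped to the same worker, so a task that reads both runs without remote fetches.

```python
def build_partition_map(datasets, num_partitions, workers):
    """Assign partition index i of every dataset to the same worker."""
    pmap = {}
    for i in range(num_partitions):
        w = workers[i % len(workers)]
        for ds in datasets:
            pmap[(ds, i)] = w
    return pmap

pmap = build_partition_map(["features", "labels", "weights"], num_partitions=4,
                           workers=["worker0", "worker1"])
# The controller pushes updates to this table asynchronously; workers react by
# transferring the affected partitions, which implicitly moves the tasks too.
print(pmap[("features", 2)], pmap[("labels", 2)])   # same worker
```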

Implementation

The group designed Canary, an asynchronous control plane, which can execute over 100,000 tasks/second on each core, and can scale linearly with the number of cores.

The driver constructs the task recipes. A driver program specifies a sequential program order, but the runtime may reorder tasks as long as the observed result is the same as the program order (just as how processors reorder instructions).


Canary periodically checkpoints all the partitions of a job. The controller monitors whether workers fail using periodic heartbeat messages. If any worker running a job is down, the controller cleans up the job's execution on all workers, and reruns the job from the last checkpoint.

Checkpoint-based failure recovery rewinds the execution on every worker back to the last checkpoint when a failure happens, while lineage-based failure recovery as in Spark only needs to recompute lost partitions. But the cost of lineage-based failure recovery in CPU-intensive jobs outweighs the benefit, because it requires every partition to be copied before modifying it.

Evaluation results

Current synchronous control planes such as Spark execute 8,000 tasks per second; distributed ones such as Nimbus and Drizzle can execute 250,000 tasks/second. Canary, a framework with an asynchronous control plane, can execute over 100,000 tasks/second on each core, and this scales linearly with the number of cores. Experimental results on 1,152 cores show it schedules 120 million tasks per second. Jobs using an asynchronous control plane can run up to an order of magnitude faster than on prior systems. At the same time, the ability to split computations into huge numbers of tiny tasks without introducing substantial overhead allows an asynchronous control plane to efficiently balance load at runtime, achieving a 2-3× speedup over highly optimized MPI codes.

Evaluations are done with applications performing logistic regression, K-means clustering, and PageRank.

MAD Questions

1. Is it a good idea to make tasks/recipes dependent on/linked to individual data objects? How do we know the data objects in advance? Why does the code need to refer to the objects? I think that model can work well if the data objects are parameters to be tuned in ML applications. We live in the age of the `big model'. I guess graph processing applications can also fit well with this programming model. I think this can also fit well with any(?) dataflow framework application. Is it possible to make all analytics applications fit this model?

2. The Litz paper had similar ideas for doing finer-grain scheduling at the workers and obviating the need for synchronizing with the scheduler. Litz is a resource-elastic framework supporting high-performance execution of distributed ML optimizations. Litz remains general enough to accommodate most ML applications, but also provides an expressive programming model allowing the applications (1) to support stateful workers that can store the model parameters which are co-located with a partition of input data, and (2) to define custom task scheduling strategies satisfying fine-grain task dependency constraints and allowances. At runtime, Litz executes these strategies within the specified consistency requirements, while gracefully persisting and migrating application state.

3. Since it is desirable to have one tool for batch and serving, would it be possible to adopt Canary for serving? Could it be nimble enough?

4. Is it possible to apply techniques from the Blazes paper to improve how the driver constructs the task recipes?

Thursday, May 10, 2018

Misc rambling

Here, have some random stuff.

Confident idiots 

A couple weeks ago, I was in the YMCA sauna. There were 6 people, including 3 high school kids. One of the high school kids---who looked like a young Jonah Hill---is excited he will be getting a car from his father and is telling his friends about it.

He asks his friends how often he needs to get the oil changed on the car. One of his friends authoritatively says: "Oh, they will tell you when you get the car. It is every 50 miles." The boy says this so confidently that I am trying hard not to chuckle.

So, Jonah is OK with that. They keep talking about the car. But then this occurs to Jonah. He says, "Once my mom drove us to Wisconsin and back, and it is definitely more than 50 miles. It was like 500 miles. So the oil change distance has got to be much longer than that."

The boy who authoritatively suggested 50 miles for oil change says with the same confidence: "Yeah, that is it, you change the oil every 50 thousand miles. That is what it is!"

Why are the ignorant so confident? I know about the Lake Wobegon effect and such, but really, why...

Weird dream 

(This was January 12th, 2018.)

At night at 2am, I woke up buzzed from a weird dream. It was a combination of Ted Chiang-like predeterminism molded into a Bart Curlish narrative---the assassin woman in Dirk Gently's Holistic Detective Agency---kind of deal.

So, I had a mission, someone else had a mission to kill me, and yet another person had a mission to protect me. That last person didn't know his mission, but predeterminism forced him to always jump between me and the bullets, and weirdly he was also unaffected by them. But he was absolutely terrified, shouting hysterically and trying to escape; every time he attempted to run away, he ended up jumping between me and the assassin and involuntarily protecting me.

Slowly this guy realized that this was his mission, and that even if he didn't want to do it, he would be doing it in a predetermined way. When he had this realization, he was very relieved and went on shielding me from the bullets, this time peacefully. Around the same time, I also realized that I had my own mission, and that I would be indestructible despite all the attempts of the assassin. I woke up energized and buzzed.

Predetermination and a sense of purpose are such powerful concepts. If you truly, genuinely believe you are predestined for a purpose, that might make you invincible.

Incidentally, my first post in this blog in 2007 was about a dream as well.

MAD questions

1. How should we deal with confident idiots?
The ignorance and overconfidence of that high school student was harmless, funny, and cute. Hopefully he will grow out of it.

But what if he doesn't? What if that overconfident-ignorance is then fueled by acceptance or even admiration? Confident idiots who wield power are not so fun and harmless anymore.

What can we do to intervene with this?

My mom told me as a child that it is impossible to change anyone's mind by arguing. I had taken this to heart, and refrained from arguing with people about their opinions. But lately I have changed my position. I still know I can't change anyone's mind by arguing, but I am now more vocal about stating disagreement and pointing out to people what is wrong with their opinions and positions. I put my voice out there, hoping that it may stop people from getting more carried away with their opinions and echo chambers.

I recently listened to this podcast episode from Hidden Brain, and I think it elaborates these issues better.

2. If you knew you could not fail, what would you attempt? Do you think you have a true calling?

I know they say "follow your passion" is bad career advice, but shouldn't you try to align the two? Even better, figure out your true calling and excel at it. If you believe strongly that you are predetermined to achieve it, maybe you will. As Steve Jobs said: some things need to be believed to be seen.

3. Let's hope that the confident idiot does not believe he is predetermined to achieve a mission.

What if you are the confident idiot, rather than the determined visionary?
To strike a healthy balance, always doubt yourself and never doubt yourself.

4. Oh, and, blockchains and cryptocurrency. Discuss.

Tuesday, May 8, 2018

Transmitting your message over a lossy channel

Late at night, I was tweeting random stuff and then the below tweet came.
Since I worked on wireless sensor networks in a previous life, I had things to say here.

Here are the ways to deal with a lossy channel, and what those mean for your writing.

A straightforward and easy way to deal with a lossy channel is to add redundancy. Repeat the message a number of times, and one transmission will make it. In writing, this corresponds to: "Tell them what you are going to tell them, tell them, then tell them what you told them."

Another way to deal with a lossy channel is to increase the signal-to-noise ratio. This can be done by increasing transmission power and/or somehow reducing the noise. This translates to writing more clearly. Use definite, specific, concrete language. Be succinct and to the point. As you review your draft, ruthlessly cut out frivolous parts and digressions. Read the text as if you are an outsider, empathize with the reader, and try to reduce any misunderstandings this reader could have.
Kill your darlings, kill your darlings, even when it breaks your egocentric little scribbler’s heart, kill your darlings.
-- Stephen King
Another way to deal with a lossy channel is to send a preamble. In computer networks, a syncword, sync character, sync sequence, or preamble is used to synchronize a data transmission by indicating the start of the transmission. Sending preambles also achieves frame synchronization, permitting the data bits within the frame to be extracted for decoding or retransmission. In writing, this corresponds to giving advance warning and building up to where you are leading the reader.

Ok, this is pushing it a bit too much, but here it is. You can also use TDMA to deal with a lossy channel. The transmitter establishes a rhythm/cadence, and the receivers are able to receive the clear transmission at those predetermined times. In writing, this corresponds to using a parallel structure, so that the reader can follow your points more easily.

That's all folks.

MAD questions

1. Did you know that the receiver can also amplify its power to reduce transmission loss?

I wrote about this earlier.
Most people have an incorrect model of how radios and wireless radio communication work. The layman thinks radios are simple. The transmitter does all the heavy lifting, and puts the signal on the air. Before the signal fades away completely, if the receiver device is in communication range, it will pick up the signal by listening. This naive understanding suggests that the receiver is a passive device, as listening is perceived mostly as a passive activity. But, nothing can be further from the truth. The receiver spends considerable energy to power its radio and amplify the signal in order to receive it.

So, this thing is very symmetrical. If a text is hard to understand, you can power up your reading game to decipher it better. Similar techniques for dealing with a lossy transmission channel help here as well.

But this has a limitation. "When the receiver powers its radio to amplify signals, it is also amplifying noise. The signal-to-noise ratio (SNR) determines successful reception: if the signal has not faded, it will get amplified more, stand out, and be decoded correctly. This gives rise to an interesting question: When there is no signal in the channel, what does the receiver radio hear?" To learn the answer you should read that post.

2. Is that all you got? 5 lousy writing analogies from radio transmission?
No analogies to MIMO, RFID, or software-defined radio? My analogy game must be getting weak.

3. What are your analogies to writing?
Another analogy I often use for writing is that of showcasing a garden. You are a gardener, you have cultivated a great garden, and you are exhibiting it to visitors. How do you showcase your garden and lead the visitors to its highlights? Initially, you will need to guide the visitors on which paths to take; without guidance they will take arbitrary paths and get lost. So, take care to prune frivolous paths lest they get distracted, and gently guide them toward the centerpieces and highlights of your garden. Give them a tour. Don't be too pushy either.

Related links

How I write
How to write your research paper
How I read

Sunday, May 6, 2018

Truth is Multidimensional

Nasreddin Hoca is a 13th century Turkish wise-man / populist-philosopher, known for his funny stories and anecdotes.

In one story, he was judging between two men over a conflict. He listened to the first man and told him he was right about the issue. Then he listened to the second man, also found him reasonable, and told him he was right about the issue. Hoca's wife was listening and intervened: How is that possible? How can both sides be right about the issue?

Hoca thought some more about this, and said: "Dear, you are right, too."

Murat Hoca

I often find myself in the same position. For example, most recently, I read Sinofski's tweets in favor of blockchain, and Jackson Palmer's tweets rebuking them, and I find both sides right.


How can both sides in an argument be right?

This is possible when the issue is multidimensional.

Being a researcher

I had written this earlier:
"I am OK with being confused, and I am capable of holding ambivalent thoughts in my brain. These are minimal (necessary but not sufficient) requirements for being a researcher. I would rather have unanswered questions than unquestioned answers. (Aha, of course, that was a Feynman quote. "I would rather have questions that can't be answered than answers that can't be questioned." --Richard Feynman)"
When I am working on a new problem/area at the cutting edge, I find that I have to make several hypotheses/guesses and take positions. As a result I have to keep several ambiguous/conflicting positions in my mind for a long time. I let them simmer in my brain until the situation is resolved. Often, I eliminate some conflicting positions and settle on one that is more accurate. And sometimes this is resolved with the frame shifting (or paradigm shifting) and we go above the plane and find a position that accommodates these two conflicting positions.

Do these sound like Zen koans for the academics?

Ok, here is another one: I like being wrong and I also don't like being wrong.

I like being wrong because it gives me a learning opportunity.

I also don't like being wrong, so as soon as I learn I am wrong, I change my position so I am not wrong any more.

MAD questions

1. If you are mindful about it, you find that you are faced with a lot of false dichotomies every day. What are your favorite examples?

2. What have you changed your mind about recently?

Friday, May 4, 2018

Research and climbing


I first liked this. But then I grew uneasy and even disturbed by it... I have been in this business (the research business---not the climbing business) for 20 years, and I find that unfortunately reality is not that simple, idealistic, and egalitarian.

Climbing a mountain is hard and resource intensive

To climb a mountain, you need more than a vision of a feasible path.

Last year, I had read this book titled "After the summit" by Lei Wang. It gives life lessons learned from Lei's arduous journey to climb the highest peak on all seven continents, including Mount Everest, and skiing to the North and South poles.

Here is what you need to climb a tough mountain. I think this is still an incomplete list but it is much more than "time and patience" and a "vision of a path to the peak".
  1. Equipment... very expensive high-quality equipment... good tools are important
  2. Funding (Just Google for "What do you need to climb everest?". The first link says an Everest climb starts from $35K.)
  3. Lots of training to get very fit
  4. Lots of lessons to learn climbing technique
  5. Training with personal coaches
  6. Guidance from experts
  7. A very capable team
  8. A base camp, where you spend weeks to acclimate
  9. Experienced sherpas
  10. Experience of having climbed similar mountains
  11. Dedication, passion, perseverance
  12. Managing your mental state

Here are the not-so-glorious parts about climbing a mountain. You sacrifice a lot. You get your ass kicked by nature, and you get your knuckles frozen. You need to learn to walk with spiked boots, and you learn to build ladder bridges to walk over wide and deep crevasses. You learn to sleep out in the cold and eat shitty food. You learn discipline and safety processes. You make discomfort your companion. You learn to tolerate misery.

None of this sounds enticing. But probably the worst part is how boring 99% of the climb is. You are at the summit for only 15 minutes. You have to find a way to love the journey.


Just in case this is still unclear, what I wrote for climbing above maps to research one to one, and the mapping is left as an exercise to the reader.

One thing that is not in the climbing task-list is the need for documentation. Maybe documentation is also important for climbers---I don't know, but documentation/writing/publishing is a vital part of research. You need to show the path to others, so that they can see the result, follow-along, and hopefully even improve on it.

MAD questions

1. Man, why am I always so critical?

I think the climbing analogy is actually a good one for research, but I grew unhappy about the emphasis on a very small part of the climbing: "looking for a path that nobody has noticed before". Being the cynic I am, I just had to write this long piece about it. I think I have a severe case of professional deformation: I can't stop finding problems with things.

2. Ok, what about Math theorems?

For climbing steep cliffs in theoretical math, is time, patience, and a vision of a path to the peak enough?

The theoretical math domain is less resource intensive. But to make an important contribution you still need years of training and years of thinking about the problem in your head. Maybe take items 1 and 2 off my list above and compensate for them with some increased effort in the remaining items. It seems like mathematicians need to develop and master a lot of mental tools indeed:
Here is the Wikipedia article on Wiles' solution of Fermat's last theorem.
Here is a New Yorker article on Yitang Zhang's work on bounded gaps between primes.
Here is the Wikipedia article on Grigori Perelman's work on the soul conjecture.

3. What are some other analogies you know for the research process?

I like analogies. I wrote this about analogies earlier:
It is not hard to make analogies, because all analogies are to some extent inaccurate. When Paul Graham made the analogy between hackers and painters, many people criticized it and said that there are many other hacker-and-vocation analogies. But analogies, while inaccurate, can also be useful. Analogies can give you a new perspective, and can help you translate things from one domain to the other to check if it is possible to learn something from that.
I had given research and sudoku analogy before. I still think that is a good analogy. I will leave it to someone else to analyze/criticize that one.

If you have more analogies, please share them with me in the comments.

4. What are some books/articles that got this right?

I recently wrote about the Crypto book. I think it reflected the ups and downs and valleys of death for the discovery process well. I also liked Einstein's biography by Walter Isaacson; it provided a good account of Einstein's work habits and struggles. Although not a book on research, the "War of Art" book by Steven Pressfield did a good job describing the struggles and challenges of creative work. I guess the Dip book by Seth Godin also has relevance.

But I want a book that delves more into the most boring parts of this process. I had heard about (HT @tedherman) biologists spending years perfecting experiments on rats to make sure they are properly designed and free from uncontrolled factors (do check page 35, wow!). I want to read a long book about the boring, laborious parts of research.

5. Should we be more cynical?

I said you need a lot of resources to climb a new mountain, but I left it at that. I didn't elaborate on the implications of this for the way research is done.

A couple years ago I read the book "Big Science: Ernest Lawrence and the Invention that Launched the Military-Industrial Complex". Since the 1930s, the scale of scientific endeavor has grown exponentially. Increasingly, we have come to need big teams, big equipment, and big funding for inventions. As the problems get trickier and more specialized, it is inevitable that research gets more resource intensive.

Is it a losing battle to try to keep things more egalitarian, to keep research more democratized? Yes, research is hard, but if one is very dedicated, works boring hours, does boring work to train, and learns the techniques, is it at all possible for her to climb the really tall mountains?

I think the answer is yes.

When we find that we need increasingly expensive/specialized resources to enter an area, this is an indication that the area has incurred a lot of technical debt. Fortunately, there is a catch-up process, where the equipment becomes more affordable (e.g., cloud computing, CRISPR, 3D printers), and the esoteric techniques are simplified/explained and tooling is provided (e.g., machine learning).

So I am hopeful that it is possible to keep research more democratized. It is also our duty to strive for this; otherwise it is an indication that we are incurring technical debt in our research field. Researchers are not fond of doing this kind of janitorial cleanup work, but that is the only way to move forward in a sustainable manner.

Another great equalizer is the emergence of new problems and domains. Innovation begets innovation. As we discover things, we find new terrains opening up. And a new terrain is a good opportunity to make an impact without needing immense resources. This is what I had written earlier:
We are all equally stupid. Our brains consist of 1.3 kg (3pounds) of gooey material. It is hard to hold 10 variables in our brains simultaneously. Our brains are very poor at reasoning about even very short concurrent programs (I know from first-hand experience :-). Our brains are also very susceptible to biases and fallacies.
The people you see as experts/geniuses are that way because they have been working/thinking/internalizing these topics for more than a decade. Those experts become baffled if you ask them a question a little bit outside of the frame they are accustomed to.
So here is the trick, "you can level the playing field by working on new things/technologies".