## Thursday, April 27, 2017

### DARPA: Delivering Disruption

What do DARPA and Facebook have in common?
Move fast and break things!

A couple of weeks ago Stefanie Tompkins, DARPA's acting deputy director, visited the University at Buffalo to give a talk and meet researchers. She talked about DARPA and its funding opportunities. I had a mental image of DARPA as a slow-moving, hard-to-maneuver aircraft carrier, but her talk painted a very different picture and partially changed my perception of DARPA.

DARPA, then ARPA (Advanced Research Projects Agency), was created in 1958 as a response to Sputnik 1. DARPA's mission is to prevent strategic surprise, that is, to avoid another Sputnik moment. And as you already know, the best way to predict the future is to invent it. So DARPA perpetually pushes the envelope to invent the next strategic surprise and disruptive technology itself.

DARPA has a large number of successes under its belt. The Internet has its roots in ARPANET. GPS grew out of basic research on atomic clocks funded by DARPA. DARPA is also credited with several inventions in stealth, radar arrays, UAVs, IR night vision, microelectronics, and materials science. DARPA funded several high-impact grand challenges on autonomous vehicles starting back in 2004, which helped kick-start today's self-driving car technologies. DARPA also funded other grand challenges, including the Network Challenge (2009) and the Cyber Grand Challenge (2016).

To drive disruption and off-scale impact, DARPA recruits program managers from universities (20% of PMs) and industry. The program managers rotate quickly, serving 3-5 years. Within that time, they use the budget allocated to them to create and fund DARPA programs on high-impact research and development. DARPA programs may range from $10 million to $400 million over 2 to 5 years.

DARPA has 6 technical offices.

The Heilmeier questions have a significant place in DARPA culture. Even the program managers need to answer these questions when creating new programs.

Another part of DARPA culture is the infamous quadchart. "Quadcharts which cram in figures, assertions, and required logos/branding/acronyms are a proud and sacred tradition" (RT @tedherman).

While Stefanie's presentation managed to convince me that DARPA is an agile institution, the quadcharts make me a skeptic :-)

## Monday, April 24, 2017

### Book Review: The Story of Your Life (by Ted Chiang)

"Story of Your Life" is the science-fiction story on which the film Arrival was based. It was a very good read. Back in November, before the movie was released, I read it in less than 2 hours, in one breath.

The first couple of pages of the book have some weird, mixed usage of past and future tenses. Right from that first page, you get the idea that something is not right and this is going to be an interesting read. And as you keep reading, the story drops more clues on you, and you feel both smart and awed when you first piece together that Gary was indeed the linguist Louise's first husband, and that the daughter has not yet died in the climbing accident.

Ted Chiang gives a lot of credit to the readers' intelligence. I liked that a lot. I also liked that I had to google and learn several things while reading the story. I googled to learn about "Fermat's principle", "teleology", "ideograms", and some related linguistic terms.

I was pretty shaken after finishing the story. It contributes to the free will and fate debate in philosophy and theology from a science-fiction perspective. (Time is an illusion, and with an altered perception you can experience time along an orthogonal axis, and free will becomes irrelevant/pointless.) At the same time the story is very emotional, because the parenthood thread is very well woven into it.

Ted Chiang writes a very tight story. No, actually, he calculates, computes, and weaves the story. Impressed by this first story, I read a couple of other stories by him. I also liked "The Alchemist's Gate", which was written like a Sufi story, again exploring the concepts of fate and free will. Ted seems to be obsessed with these concepts and must have been thinking very deeply about them. This story also made subtle and thoughtful "what if" cases about these concepts, and it too had a strong technical side interwoven with a moving emotional side.

Wikipedia says Ted Chiang has been working as a technical writer in the software industry. I would be interested in reading the manuals he writes.

- Hacker News discussion
- How I wrote Arrival
- Wolfram's blog on the symbolic language created for the movie
- Timequake by Kurt Vonnegut, also an excellent book playing with the free will and fate ideas

## Friday, April 21, 2017

### Paper summary: Federated Learning

On Thursday, April 6, Google announced Federated Learning. The announcement didn't make quite a splash, but I think this is potentially transformative. Instead of uploading all the data from the smartphones to the cloud to train the model in the datacenter, federated learning enables in-situ training on the smartphones themselves. The datacenter is still involved, but only to aggregate the smartphone-updated local models in order to construct the new/improved global model.

This is a win-win-win situation. As a smartphone user, your privacy is preserved since your data remains on your device, but you still get the benefits of machine learning on your smartphone. Google gets what it needs, twice over. First, it perpetually learns from cumulative user experience and improves its software/applications: Google collects insights without collecting data (and some of these insights may still be transferable to advertising income). Second, Google outsources the training to the users' smartphones: it makes the smartphones work on updating the model, rather than using servers in its datacenters. This may seem like penny-pinching, but if you consider the sheer number of Android smartphones out there (more than 1.5 billion according to numbers from the end of 2015), the computation savings are huge. Notice that Google wins twice, while you win only once. :-)

After skimming the Google announcement, I was excited, because I had predicted this in January 2016. When I read the TensorFlow whitepaper, I was struck by its peculiar emphasis on TensorFlow's heterogeneous device support, especially on smartphones. I predicted that Google was up to something more than just inference/serving-layer support on smartphones. I got the mechanism wrong, though, since I didn't have machine learning experience at the time.

The Google announcement links to some papers, and I read the most relevant paper carefully. Below is my summary of the paper: Communication-Efficient Learning of Deep Networks from Decentralized Data.

## Applications that are an ideal fit for federated learning

The paper pitches smartphone applications as the ideal domain for federated learning.
For smartphones, the upload transfer is a bottleneck. Instead of uploading the data to the cloud for training, let the smartphone train and update the model and transmit it back. This makes sense when the model update is smaller than the data. Another Google paper provides tricks for further optimizing/compressing the size of the model learned at the smartphone for transmission back to the cloud. Since the upload rate is about 1/3 of the download rate, such techniques are beneficial.
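To make the tradeoff concrete, here is a back-of-envelope calculation. All the sizes and rates below are illustrative assumptions of mine, not figures from the paper:

```python
# Back-of-envelope: uploading raw data vs. uploading a model update.
# All numbers are illustrative assumptions, not figures from the paper.

def upload_seconds(size_mb, uplink_mbps):
    """Time to upload size_mb megabytes over an uplink of uplink_mbps megabits/s."""
    return size_mb * 8 / uplink_mbps

raw_data_mb = 500      # hypothetical local corpus on the phone
model_update_mb = 10   # hypothetical compressed model delta
uplink_mbps = 5        # uplink, often ~1/3 of the downlink rate

print(upload_seconds(raw_data_mb, uplink_mbps))      # → 800.0 seconds
print(upload_seconds(model_update_mb, uplink_mbps))  # → 16.0 seconds
```

Under these assumptions, shipping the model update instead of the raw data cuts the upload by 50x, and compression tricks widen the gap further.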

A second big benefit for the smartphone applications domain is preserving the private data of the smartphone user. Finally, in the smartphone applications domain the labels on the data are available or inferable from user interaction (albeit in an application-dependent way).

Concrete examples of these applications are "image classification: predicting which images are most likely to be viewed and shared", and "language modeling: improving voice recognition and text entry on the smartphone keyboard." Google is already using federated learning in the Gboard keyboard on Android to improve text entry. On-device training uses a miniature version of TensorFlow.

## Related work

There has been work on iteratively averaging locally trained models, but inside the datacenter, not at the edge. See: Sixin Zhang, Anna E. Choromanska, and Yann LeCun, "Deep learning with elastic averaging SGD," NIPS 2015; and Ryan McDonald, Keith Hall, and Gideon Mann, "Distributed training strategies for the structured perceptron," NAACL HLT 2010.

There has been work motivated from the privacy perspective, but with limited empirical results. And there has been work in the convex setting on the distributed consensus algorithm (the machine learning version of the problem, not the distributed systems version).

## The contributions of the paper

The paper introduces the FedAvg algorithm for federated learning. The algorithm is not an intellectually challenging innovation, as it prescribes a small variation to the traditional SGD training. So the paper is careful not to claim too much credit for the algorithm. Instead the paper distributes its contributions to items 1 and 3 below, and downplays 2 in comparison: "1) the identification of the problem of training on decentralized data from mobile devices as an important research direction; 2) the selection of a straightforward and practical algorithm that can be applied to this setting; and 3) an extensive empirical evaluation of the proposed approach."

## The algorithm

The algorithm uses a synchronous update scheme that proceeds in rounds of communication. There is a fixed set of K clients, each with a fixed local dataset. At the beginning of each round, a random fraction C of workers is selected, and the server sends the current global algorithm state to each of these workers (e.g., the current model parameters). Only a fraction of workers are selected for efficiency, as the experiments show diminishing returns for adding more workers beyond a certain point. Each selected worker then performs local computation based on the global state and its local dataset, and sends an update to the server. The server then applies these updates to its global state, and the process repeats.

So here are the high-level steps in the algorithm:

1. Workers are sent the model by the server
2. Workers compute an updated model based on their local data
3. The updated models are sent from the workers to the server
4. The server aggregates these models (by averaging) to construct the new global model

These steps are almost the same as in traditional ML/DL learning with a parameter server and workers, but there are some minor differences. Difference in step 1: not all workers, but a subset of workers, are chosen. Differences in step 2: workers are also producers of data; they work on the data they produce, and they may do multiple iterations of the local model update. Difference in step 3: the model, not the gradients, is transmitted back.

For step 3, I don't know why the model is sent rather than the gradients. The paper presents a derivation to argue that the two are equivalent, but then does not provide any justification for transmitting the model instead of the gradients. There are no explanations or experiments comparing the two approaches.
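The equivalence for a single local step is easy to see: since the server already knows the global model and the learning rate, receiving the updated model carries exactly the same information as receiving the gradient. A minimal sketch (all names are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w_t = rng.normal(size=5)   # global model the server sent to the worker
g = rng.normal(size=5)     # gradient the worker computed on its local data
eta = 0.1                  # learning rate known to both sides

# Option A: worker sends the gradient g; server applies the SGD step.
w_from_gradient = w_t - eta * g

# Option B: worker applies the step locally and sends the model back.
w_from_model = w_t - eta * g

assert np.allclose(w_from_gradient, w_from_model)

# The server can even recover the gradient from the received model:
g_recovered = (w_t - w_from_model) / eta
assert np.allclose(g_recovered, g)
```

The equivalence only holds for one step; once a worker takes multiple local steps with intermediate gradients the server never sees, the final model is the natural thing to transmit, which may be the unstated reason.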

Here is the algorithm (the paper refers to workers as clients):

The amount of computation is controlled by three key parameters: C, the fraction of workers that perform computation in each round; E, the number of training passes each worker makes over its local dataset in each round; and B, the local minibatch size used for the worker updates. B = ∞ indicates that the full local dataset is treated as a single minibatch.
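As an illustration, here is a minimal FedAvg-style sketch in NumPy. The function and variable names are mine, and the local update (minibatch SGD on a squared loss for linear regression) is a stand-in for whatever model the workers actually train:

```python
import numpy as np

def fed_avg(workers, global_w, rounds, C, E, B, eta):
    """Sketch of a FedAvg-style loop (helper names are mine, not the paper's).
    workers: list of (X, y) local datasets; global_w: initial parameter vector."""
    K = len(workers)
    m = max(1, int(C * K))                   # workers selected per round
    for _ in range(rounds):
        selected = np.random.choice(K, m, replace=False)
        updates, sizes = [], []
        for k in selected:
            X, y = workers[k]
            w = global_w.copy()              # start from the current global model
            for _ in range(E):               # E local passes over the local data
                for i in range(0, len(X), B):    # minibatches of size B
                    Xb, yb = X[i:i+B], y[i:i+B]
                    grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # squared-loss gradient
                    w -= eta * grad
            updates.append(w)                # send the model, not the gradients
            sizes.append(len(X))
        # server aggregates: average weighted by local dataset size
        global_w = np.average(updates, axis=0, weights=sizes)
    return global_w
```

Weighting the average by local dataset size matches the aggregation described in the paper; setting E=1 and B to the full local dataset degenerates to one full-batch SGD step per round.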

And this is the biggest difference in the algorithm from traditional synchronous SGD:
in each synchronous round, the workers can perform multiple rounds of local model updates before uploading the model back to the cloud. Since the local data is small, computing/iterating over it is fast. And since communication is already high-latency, the workers may as well do many local computation rounds to make the round worthwhile.

The paper makes a big deal of the unbalanced data distribution and especially the non-IID (non-independent and identically distributed) data at the workers. However, there is nothing special in the FedAvg algorithm to address the non-IID data challenge. It works because SGD tolerates noise, and most likely because having many workers participate in each round helps a lot. Even with C=0.1, the number of smartphones used for training can be amazingly high. In the experimental results, 100 workers are used for the MNIST image classification application, and 1146 workers are used for the Shakespeare-dataset language modeling application. This is way more workers than are used in traditional ML, certainly for the MNIST and Shakespeare datasets. So the sheer number of workers helps compensate for the non-IID challenge.
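The paper constructs its pathological non-IID partitions by sorting the examples by label, cutting them into shards, and handing each client only a few shards, so that most clients see only a couple of classes. A sketch of that partitioning (helper names are mine):

```python
import numpy as np

def pathological_non_iid_split(labels, num_clients, shards_per_client=2):
    """Sort examples by label, cut into shards, give each client a few shards,
    so each client sees only a small number of classes. (Names are mine.)"""
    order = np.argsort(labels, kind="stable")       # example indices sorted by label
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)      # contiguous label-sorted shards
    rng = np.random.default_rng(0)
    shard_ids = rng.permutation(num_shards)         # deal shards out randomly
    return [np.concatenate([shards[s] for s in
                            shard_ids[c * shards_per_client:
                                      (c + 1) * shards_per_client]])
            for c in range(num_clients)]
```

For example, splitting 1000 examples of 10 balanced classes across 100 clients with 2 shards each gives every client 10 examples drawn from at most 2 classes, which is exactly the kind of skew FedAvg is shown to tolerate.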

## Questions

Can federated learning bootstrap from a zero-trained model?
Before reading the paper, I thought "maybe the smartphones are not initialized from random parameters, but with partially or mostly trained parameters". The experiments suggest otherwise: the smartphones start from an untrained model. However, Figure 1 in the paper shows that the smartphones (i.e., workers) should start with the same common initialization, because independent initialization does not converge.

How long is a round?
The experiments investigate how many rounds are needed to converge, but the paper says nothing about the length of a round. My estimate is that a round takes around 5 minutes or so. In practical use, Google employs as workers the smartphones that are plugged into a wall outlet and connected to Wi-Fi. (Android phones provide this information to Google, and a subset among them is randomly selected as workers. In our PhoneLab project, we also uploaded data from phones only when they were plugged in.) Since this is synchronous SGD, the round progresses at the rate of the slowest worker, so they must be using backup workers (maybe 5% or 10%) to combat the straggler problem.

FedAvg uses synchronous SGD, but would it also work with asynchronous rounds?
Unfortunately there are no experiments that investigate this question.

## Experimental results

The experiments use the MNIST dataset for image classification, the CIFAR-10 dataset for image classification, and the Shakespeare dataset for language modeling, predicting the next character (alphabet character, that is).

The paper says this: "Our goal is to use additional computation in order to decrease the number of rounds of communication needed to train a model. The speedups we achieve are due primarily to adding more computation on each client, once a minimum level of parallelism over clients is used." They achieve this by increasing E (or decreasing B), once C is saturated (which tends to happen at C=0.1 in their setup).

Even a pathologically biased distribution of local datasets works well, because the number of clients per round is high and provides smoothing. "Both standard SGD and FedAvg with only one client per round (C = 0), demonstrate significant oscillations in accuracy, whereas averaging over more clients smooths this out."

## Tuesday, February 21, 2017

### 1 million pageviews

My blog recently reached 1 million pageviews. This warrants a short retrospective.

I started posting regularly in September 2010. I wanted to get into the cloud computing domain, so I needed to accumulate background on cloud computing work. I decided that, as I read papers on cloud computing, I would post summaries to this blog. I thought that if I could explain what I learned from the papers in my own words, I would internalize those lessons better. And if others read those summaries and benefit, that is an extra plus.

"Writing is nature's way of telling you how sloppy your thinking is." In fact, I learned a lot writing those paper reviews. Writing the reviews gave me a deeper understanding of the work, beyond what I could achieve by passively reading. Putting them on the web was also a nice choice, because I could refer my students to some of these summaries when needed. And it turned out that I referred to those summaries myself very frequently to jog my memory. Since I encoded my understanding into the writing, reading through a summary refreshes me on the important lessons from that work. Since my summaries are on the web, all I need to do is a Google search for muratbuffalo and the paper name.

(A side remark about my research journey: my research area in the first couple of years of my PhD was distributed algorithms and self-stabilization. Then, starting in 2002, wireless sensor networks became my research area; I applied stabilizing distributed algorithms to in-network querying and tracking in wireless sensor networks. Around 2009 I started transitioning to crowdsourced sensing and collaboration using smartphones. And starting in 2010, I transitioned to large-scale cloud computing systems. Distributed systems has been the common theme throughout. Here is a link to my research statement as of the end of 2016.)

Over time I included posts about my conference trips, book reviews, rants, and research advice for students. Posting research advice (on reading, writing, presenting) is also beneficial because I can refer my students to it. And occasionally I receive emails from remote corners of the world about how some of these posts helped or inspired someone, and that makes me very happy for an entire day.

## Some of the big hits

The bursty traffic all came from Hacker News. The regular traffic came from many sources: Google searches, blog posts, twitter links.

## Multiclass classification problems

We can adapt the single-class logistic regression idea to solve a multiclass classification problem using the one-vs-all approach: to classify among k classes, split the training set into k separate binary classification problems, one per class.
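The one-vs-all idea can be sketched as follows; this is a minimal NumPy illustration (the function names are mine), training one binary logistic regressor per class and predicting with the highest-scoring one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, eta=0.5, epochs=300):
    """One-vs-all: train k independent binary logistic regressors;
    classifier c learns to separate class c from all other classes."""
    n, d = X.shape
    W = np.zeros((num_classes, d))
    for c in range(num_classes):
        yc = (y == c).astype(float)               # binary labels: class c vs. rest
        for _ in range(epochs):
            p = sigmoid(X @ W[c])                 # predicted probabilities
            W[c] -= eta * X.T @ (p - yc) / n      # full-batch logistic-loss gradient
    return W

def predict(W, X):
    # pick the class whose binary classifier gives the highest score
    return np.argmax(X @ W.T, axis=1)
```

For k classes this trains k classifiers on the full dataset, relabeled k different ways; at prediction time the argmax over the k scores resolves the ties among the binary decisions.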