Paper review: TensorFlow, Large-Scale Machine Learning on Heterogeneous Distributed Systems

The paper is available here.

TensorFlow is Google's new framework for implementing machine learning algorithms using dataflow graphs. Nodes (vertices) in the graph represent operations (e.g., mathematical operations, machine learning functions), and the edges represent the tensors (i.e., multidimensional data arrays, vectors/matrices) communicated between the nodes. Special edges, called control dependencies, can also exist in the graph to denote that the source node must finish executing before the destination node starts executing. Nodes are assigned to computational devices and execute asynchronously and in parallel once all the tensors on their incoming edges become available.
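To make the execution model concrete, here is a minimal conceptual sketch in plain Python (not the TensorFlow API): a toy dataflow graph where each node fires as soon as the tensors on all of its incoming edges become available. The `Node`/`run` names are invented for illustration.

```python
# Conceptual sketch of dataflow execution, not TensorFlow code:
# a node is "ready" once every incoming edge carries a value.

class Node:
    def __init__(self, name, op, inputs=()):
        self.name, self.op, self.inputs = name, op, list(inputs)

def run(nodes, feeds):
    values = dict(feeds)                 # edge name -> computed value
    pending = [n for n in nodes if n.name not in values]
    while pending:
        # fire every node whose inputs are all available
        ready = [n for n in pending if all(i in values for i in n.inputs)]
        for n in ready:
            values[n.name] = n.op(*(values[i] for i in n.inputs))
        pending = [n for n in pending if n not in ready]
    return values

# y = (a + b) * c, expressed as a graph rather than imperative code
graph = [
    Node("sum", lambda a, b: a + b, inputs=["a", "b"]),
    Node("y",   lambda s, c: s * c, inputs=["sum", "c"]),
]
print(run(graph, {"a": 2, "b": 3, "c": 4})["y"])  # -> 20
```

In a real TensorFlow runtime the same readiness rule drives asynchronous, parallel execution across devices; here the loop fires ready nodes in rounds just to show the dependency-driven ordering.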

It seems like the dataflow model has been getting a lot of attention recently and is emerging as a useful abstraction for large-scale distributed systems programming. I had reviewed the Naiad dataflow framework earlier. Adopting the dataflow model gives TensorFlow flexibility, and as a result the framework can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models.

TensorFlow's heterogeneous device support

The paper makes a big deal about TensorFlow's heterogeneous device support; it is even right there in the paper title. The paper says: "A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards."

Wait, what? Why does TensorFlow need to run on wimpy phones?!

The paper says the point is just portability: "Having a single system that can span such a broad range of platforms significantly simplifies the real-world use of machine learning system, as we have found that having separate systems for large-scale training and small-scale deployment leads to significant maintenance burdens and leaky abstractions. TensorFlow computations are expressed as stateful dataflow graphs."

I understand this; yes, portability is important for development. But I don't buy this as the explanation. Again, why does TensorFlow, such a powerhouse framework, need to be shoehorned to run on a single wimpy phone?!

I think Google has designed and developed TensorFlow as a Maui-style integrated code-offloading framework for machine learning. What is Maui you ask? (Damn, I don't have a Maui summary in my blog?)

Maui is a system for offloading smartphone code execution onto backend servers at method granularity. The system relies on the ability of a managed code environment (.NET CLR) to run on different platforms. By introducing this automatic offloading framework, Maui enables applications that exceed memory/computation limits to run on smartphones in a battery- and bandwidth-efficient manner.
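The flavor of Maui-style offloading can be sketched as a per-call dispatch decision: run a method locally or ship it to a backend, depending on a cost model. This is a hypothetical Python illustration; the `remote_call` stub and the cost numbers are invented, and real Maui solves a global optimization over the application's call graph rather than a per-method threshold.

```python
# Hypothetical sketch of method-granularity offloading (Maui-style).

def remote_call(fn, *args):
    # stand-in for an RPC to a backend server; here it just runs locally
    return fn(*args)

def offloadable(local_cost, transfer_cost):
    """Decorator: offload a call when shipping it is cheaper than computing it."""
    def wrap(fn):
        def dispatch(*args):
            if transfer_cost < local_cost:
                return remote_call(fn, *args)   # cheaper to offload
            return fn(*args)                    # cheaper to run on the device
        return dispatch
    return wrap

@offloadable(local_cost=100, transfer_cost=5)
def heavy_inference(x):
    return x * x  # placeholder for an expensive model evaluation

print(heavy_inference(7))  # -> 49
```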

TensorFlow brings cloud backend support to the private, device-level machine learning going on in your smartphone. It doesn't make sense for an entire power-hungry TensorFlow program to run on your wimpy smartphone. Your smartphone will be running only certain TensorFlow nodes and modules; the rest of the TensorFlow graph will be running on the Google cloud backend. Such a setup is also great for preserving the privacy of your phone while still enabling machine-learned insights on your Android.
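A dataflow graph makes this split natural: assign each node a device, cut the graph along the assignment, and turn every cross-device edge into a communication channel. The sketch below uses invented names (it is not TensorFlow's placement API) and a made-up speech pipeline to show the idea.

```python
# Illustrative sketch (not TensorFlow code): partition a dataflow graph
# by device assignment; cross-device edges become send/recv channels.

def partition(edges, placement):
    """edges: list of (src, dst); placement: node name -> device name."""
    subgraphs = {"phone": set(), "cloud": set()}
    cross = []
    for src, dst in edges:
        subgraphs[placement[src]].add(src)
        subgraphs[placement[dst]].add(dst)
        if placement[src] != placement[dst]:
            cross.append((src, dst))        # needs a send/recv pair
    return subgraphs, cross

# hypothetical speech-recognition pipeline split between phone and cloud
edges = [("mic", "features"), ("features", "acoustic_model"),
         ("acoustic_model", "decoder"), ("decoder", "screen")]
placement = {"mic": "phone", "features": "phone", "screen": "phone",
             "acoustic_model": "cloud", "decoder": "cloud"}

subs, cross = partition(edges, placement)
print(cross)  # -> [('features', 'acoustic_model'), ('decoder', 'screen')]
```

Only the raw-audio feature extraction and display stay on the phone; the heavy model evaluation runs in the cloud, and only the two cut edges cross the network.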

I will talk about applications of this, but first let me mention this other development about TensorFlow that supports my guess.

Google Opensourced the TensorFlow API in Nov 2015

Google opensourced the TensorFlow API and a limited reference implementation (the implementation runs on a single device only) under the Apache 2.0 license in November 2015. This implementation is available at www.tensorflow.org, and has attracted a lot of attention.

Why did Google opensource this project relatively early rather than keeping it proprietary for longer? This is their answer: "We believe that machine learning is a key ingredient to the innovative products and technologies of the future. Research in this area is global and growing fast, but lacks standard tools. By sharing what we believe to be one of the best machine learning toolboxes in the world, we hope to create an open standard for exchanging research ideas and putting machine learning in products."

This supports my guess. TensorFlow's emphasis on heterogeneity is not just for portability. Google is thinking of TensorFlow as an ecosystem. They want developers to adopt TensorFlow so that it is used for developing machine learning modules in Android phones and tablets. And then, Google will support/enrich (and find ways to benefit from) these modules by providing backends that run TensorFlow. This is a nice strategy for Google, a machine learning company, to percolate into machine learning in the Internet of Things domain in general, and the mobile apps market in particular. Google can become the monopoly Deep Learning As A Service (DAAS) provider by leveraging the TensorFlow platform.

How can Google benefit from such integration? Take a look at this applications list: "TensorFlow has been used in Google for deploying many machine learning systems into production: including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery."

With a mobile-integrated TensorFlow machine-learning system, Google can provide a better personal assistant on your smartphone. Watch out Siri: better speech recognition, calendar/activity integration, face recognition, and computer vision are coming. Robotics applications can enable Google to penetrate the self-driving car OS and drone OS markets. And after that can come more transformative, globe-spanning physical-world sensing & collaboration applications.

With this I rest my case. (I got carried away, I just intended to do a paper review.)

Related links

In the next decade we will see advances in machine learning coupled with advances in the Internet of Things.

In this talk, Jeff Dean gives a very nice motivation and introduction for Tensorflow.

Here is an annotated summary of the TensorFlow paper.

This post explains why TensorFlow framework is good news for deep learning.
