Showing posts with the label tensorflow

SOSP19 Day 1, machine learning session

In this post, I cover the papers that appeared in the first session of the conference: machine learning. (Here is my account of Day 0 and the opening remarks, if you are interested.) I haven't read any of these papers yet, so I go by my understanding of them from their presentations, and I might have misunderstood some parts. The good news is that all of these papers are available as open access, and I include links to them in my notes; please check the papers when in doubt about my notes. The machine learning session contained four papers, and I found all of them very interesting. They applied principled systems design techniques to machine learning and provided results with broader applicability than a single application. I wanted to enter the machine learning research area three years ago, but I was unsuccessful and concluded that the area is not very amenable to principled systems work. It looks like I admitted defeat prematurely. After seeing these pape...

Paper summary. Decoupling the Control Plane from Program Control Flow for Flexibility and Performance in Cloud Computing

This paper appeared in EuroSys 2018 and is authored by Hang Qu, Omid Mashayekhi, Chinmayee Shah, and Philip Levis from Stanford University. I liked the paper a lot: it is well written and presented. And I am getting lazy, so I use a lot of text from the paper in my summary below. Problem motivation  In data processing frameworks, improved parallelism is the holy grail, because it means more data processed in less time. However, parallelism has a nemesis called the control plane. While "control plane" can have a wide array of meanings, in this paper it is defined as the systems and protocols for scheduling computations, load balancing, and recovering from failures. A centralized control plane becomes a bottleneck after a point. The paper cites other work and states that a typical cloud framework control plane using a fully centralized design can dispatch fewer than 10,000 tasks per second. Actually, that is not bad! However, with machine learning (ML) app...

Paper summary. SnailTrail: Generalizing critical paths for online analysis of distributed dataflows

Monitoring is very important for distributed systems, and I wish it received more attention in research conferences. There has been work on monitoring for predicate detection purposes and for performance problem detection purposes. As machine learning and big data processing frameworks see more action, we have been seeing more work in the latter category. For example, in ML there has been work on figuring out the best configuration to run, and in the context of general big data processing frameworks there has been work on identifying performance bottlenecks. Collecting information and statistics about a framework to identify the bottleneck activities seems like an easy affair. However, the "making sense of performance" paper (2015) showed that this is not as simple as it seems, and that sophisticated techniques such as blocked time analysis are needed to get a more accurate picture of performance bottlenecks. This paper (by ETH Zurich and due ...
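The gist of critical path analysis is finding the longest weighted path through a DAG of activities, since only speeding up activities on that path shortens the overall run. Here is a minimal sketch of that computation on a hypothetical toy graph (not SnailTrail's actual algorithm, which works online over program activity graphs):

```python
# Sketch: critical-path analysis over a DAG of activities (toy example).
# Each edge (u, v, w) means activity u must finish before v, and v takes w time.
from collections import defaultdict

def critical_path(edges):
    """Return (total_duration, path) of the longest weighted path in a DAG."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm for a topological order
    order, queue = [], [n for n in nodes if indegree[n] == 0]
    while queue:
        n = queue.pop()
        order.append(n)
        for v, _ in graph[n]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # longest-path relaxation in topological order
    dist = {n: 0 for n in nodes}
    pred = {}
    for u in order:
        for v, w in graph[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
    end = max(dist, key=dist.get)
    path = [end]
    while path[-1] in pred:
        path.append(pred[path[-1]])
    return dist[end], list(reversed(path))

edges = [("read", "map", 3), ("read", "shuffle", 1),
         ("map", "reduce", 2), ("shuffle", "reduce", 5)]
duration, path = critical_path(edges)
print(duration, path)  # → 6 ['read', 'shuffle', 'reduce']
```

Note that the slow shuffle edge, not the heavier map work, determines the critical path here, which is exactly the kind of non-obvious bottleneck such analyses surface.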

The Lambda and the Kappa Architectures

This article, by Jimmy Lin, looks at the Lambda and Kappa architectures, and through them considers a larger question: can one size fit all? The answer, it concludes, depends on what year you ask! The pendulum swings between the apex of one tool to rule them all and the other apex of multiple tools for maximum efficiency. Each apex has its drawbacks: one tool leaves efficiency on the table, while multiple tools spawn integration problems. In the RDBMS world, we already saw this play out. One-size RDBMS fit all, until it couldn't anymore. Stonebraker declared "one size does not fit all", and we saw a split into dedicated OLTP and OLAP databases connected by extract-transform-load (ETL) pipelines. But in the last couple of years we have been seeing a lot of one-size-fits-all "Hybrid Transactional/Analytical Processing (HTAP)" solutions being introduced again. Lambda and Kappa OK, back to telling the story from the Lambda and Kappa architectures' perspecti...
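To make the Lambda architecture concrete: a query merges a precomputed batch view with a speed-layer view of the events that arrived since the last batch run. A toy sketch with hypothetical page-view counters:

```python
# Toy sketch of the Lambda architecture's query-time merge (made-up counters):
# the batch view holds counts up to the last batch run; the speed layer holds
# deltas for events seen since then; a query merges the two.
batch_view = {"page_a": 100, "page_b": 40}   # precomputed by the batch layer
speed_view = {"page_a": 3, "page_c": 7}      # incremental, from the stream

def query(key):
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_a"))  # → 103
print(query("page_c"))  # → 7
```

The Kappa architecture drops the batch layer entirely: there is only the stream, and "reprocessing" means replaying the log through a new version of the streaming job.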

Paper summary. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform

This paper from Google appeared in the KDD 2017 Applied Data Science track. The paper discusses Google's quality assurance extensions to their machine learning (ML) platforms, called TensorFlow Extended (TFX). (Google is not very creative with names; they should take a cue from Facebook.) TFX supports continuous training and serving pipelines and integrates best practices to achieve production-level reliability and scalability. You can argue that the paper does not have a deep research component or a novel insight/idea. But you can argue the same thing about The Checklist Manifesto by Atul Gawande, which does not detract from its effectiveness, usefulness, and impact. On the other hand, the paper could definitely have been written more succinctly. In fact, I found this blog post by Martin Zinkevich, the last author of the paper, much easier to follow than the paper. (Are we pushed to make papers artificially obfuscated to be publication-worthy?)  This blog post on serv...
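The checklist flavor of TFX is easy to illustrate: validate incoming data against a schema before any training or model push is allowed to happen. This is not the TFX API, just a toy sketch of that "validate before train" discipline, with hypothetical stage names and checks:

```python
# Not the TFX API: a toy sketch of the "validate before train/push" discipline,
# with made-up stage names and a trivial schema check.
def validate_data(batch, expected_features):
    """Reject batches with missing/extra features or empty values."""
    for row in batch:
        if set(row) != expected_features:
            return False
        if any(v is None for v in row.values()):
            return False
    return True

def pipeline(batch, train_fn, expected_features):
    if not validate_data(batch, expected_features):
        raise ValueError("data anomaly: refusing to train on this batch")
    return train_fn(batch)

good = [{"x": 1.0, "y": 0}, {"x": 2.0, "y": 1}]
model = pipeline(good, train_fn=lambda b: "model-v1",
                 expected_features={"x", "y"})
print(model)  # → model-v1
```

The point of the gate is that in a continuously training pipeline, a silently corrupted input feed would otherwise be baked into the next model version.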

TensorFlow-Serving: Flexible, High-Performance ML Serving

This paper by Google appeared at NIPS 2017. The paper presents a system/framework to serve machine learning (ML) models. It gives a nice motivation for why there is a need to productize model serving with a reusable, flexible, and extensible framework. ML serving infrastructures were mostly ad hoc, non-reusable solutions, e.g., "just put the models in a BigTable, and write a simple server that loads from there and handles RPC requests to the models." However, those solutions quickly get complicated and intractable as they add support for:
+ model versioning (for model updates with a rollback option),
+ multiple models (for experimentation via A/B testing),
+ ways to prevent latency spikes for other models or versions serving concurrently, and
+ asynchronous batch scheduling with cross-model interleaving (for using GPUs and TPUs).
This work reminded me of the Facebook Configerator. It solves the configuration management/deployment problem, but for ML models. ...
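The first item on that list, version management with rollback, is easy to sketch. This is not TensorFlow-Serving's actual API, just a minimal toy illustrating the concern:

```python
# Not TensorFlow-Serving's API: a minimal sketch of version-aware model
# management with rollback, the core concern the paper motivates.
class ModelManager:
    def __init__(self):
        self.versions = {}      # version number -> model object
        self.serving = None     # currently live version

    def load(self, version, model):
        self.versions[version] = model
        self.serving = version  # newest loaded version goes live

    def rollback(self, version):
        if version not in self.versions:
            raise KeyError(f"version {version} not loaded")
        self.serving = version

    def predict(self, x):
        return self.versions[self.serving](x)

mgr = ModelManager()
mgr.load(1, lambda x: x + 1)   # plain callables stand in for models
mgr.load(2, lambda x: x * 2)
print(mgr.predict(10))  # → 20 (v2 is live)
mgr.rollback(1)
print(mgr.predict(10))  # → 11 (rolled back to v1)
```

The real system additionally keeps old versions warm in memory so rollback does not incur a loading latency spike, which is where the concurrency concerns on the list come in.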

Paper Summary. The Case for Learned Index Structures

This paper was put on arXiv yesterday and is authored by Tim Kraska, Alex Beutel, Ed Chi, Jeff Dean, and Neoklis Polyzotis. The paper aims to demonstrate that "machine learned models have the potential to provide significant benefits over state-of-the-art database indexes". If this research bears more fruit, we may look back and say the indexes were first to fall, and gradually other database components (sorting algorithms, query optimization, joins) were replaced with neural networks (NNs). In any case, this is a promising direction for research, and the paper is really thought provoking. Motivation Databases started as general, one-size-fits-all blackboxes. Over time, this view got refined to "standardized sizes": OLAP databases and OLTP databases. Databases use indexes to access data quickly. B-Trees and hash maps are common techniques to implement indexes. But along with the blackbox view, databases treat the data as opaque, and apply these ind...
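The core idea is simple to sketch on sorted data: fit a model that maps a key to its position, record the model's maximum error over the data, and at lookup time search only the window [prediction - err, prediction + err]. A least-squares line stands in here for the paper's learned models; the data is a made-up toy array:

```python
# Sketch of the learned-index idea on sorted data: a fitted model predicts a
# key's position, and a recorded error bound limits the local search window.
# (A least-squares line stands in for the paper's staged neural models.)
import bisect

def fit_learned_index(keys):
    """Fit position = slope*key + intercept; return (predict, max_error)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    intercept = mean_p - slope * mean_k
    predict = lambda k: slope * k + intercept
    err = max(abs(predict(k) - i) for i, k in enumerate(keys))
    return predict, int(err) + 1

def lookup(keys, predict, err, key):
    """Binary-search only within the model's error window."""
    pos = int(predict(key))
    lo, hi = max(0, pos - err), min(len(keys), pos + err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None

keys = sorted(range(0, 2000, 2))        # perfectly linear toy data
predict, err = fit_learned_index(keys)
print(lookup(keys, predict, err, 500))  # → 250
print(lookup(keys, predict, err, 501))  # → None
```

On data with learnable structure the error window can be far smaller than the log(n) probes a B-Tree needs, which is where the claimed speedups come from.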

Paper summary. Blazes: Coordination analysis and placement for distributed programs

This paper is by Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and David Maier. It appears in the October 2017 issue of ACM Transactions on Database Systems; a preliminary conference version appeared in ICDE 2014.  This paper builds a theory of dataflow/stream-processing programs, which covers the Spark, Storm, Heron, TensorFlow, Naiad, and Timely Dataflow work. The paper introduces compositional operators for labels, and shows how to infer the coordination points in dataflow programs. When reading below, pay attention to the labeling section, the labels "CR, CW, OR, OW", and the section on reduction rules on labels. To figure out these coordination points, the Blazes framework relies on annotations of the dataflow programs supplied as input. This is demonstrated with a Twitter Storm application and a Bloom application. The paper has many gems.  It says in the conclusion section, in passing, that when designing systems we should pay attention to coordinat...
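To give a flavor of the label analysis, here is a toy sketch assuming a simple severity ordering CR < CW < OR < OW over the labels: a path through the dataflow takes the most severe label of its components, and order-sensitive labels (OR, OW) mark where coordination is needed. The paper's actual reduction rules are richer than this; the sketch only illustrates the style of the analysis:

```python
# Toy sketch (not the paper's full rules): compose component labels along a
# dataflow path, assuming the severity ordering CR < CW < OR < OW, and flag
# paths containing order-sensitive labels as coordination points.
SEVERITY = {"CR": 0, "CW": 1, "OR": 2, "OW": 3}

def path_label(labels):
    """Most severe label along a dataflow path."""
    return max(labels, key=SEVERITY.get)

def needs_coordination(labels):
    """Order-sensitive components taint the whole path."""
    return SEVERITY[path_label(labels)] >= SEVERITY["OR"]

print(path_label(["CR", "CW", "CR"]))          # → CW
print(needs_coordination(["CR", "CW"]))        # → False
print(needs_coordination(["CR", "OR", "CW"]))  # → True
```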

Paper summary: A Computational Model for TensorFlow

This paper appeared in MAPL 17. It is written by Martin Abadi, Michael Isard, and Derek G. Murray at Google Brain. It is a 7-page paper, and the meat of the paper is in Section 3. I am interested in the paper because it uses TLA+ modeling of TensorFlow graphs to create an operational semantics for TensorFlow programs. In other words, the paper provides a conceptual framework for understanding the behavior of TensorFlow models during training and inference. As you recall, TensorFlow relies on dataflow graphs with mutable state. This paper describes a simple and elementary semantics for these dataflow graphs using TLA+. The semantic model does not aim to account for implementation choices: it defines what outputs may be produced, without saying exactly how. A framework of this kind does not just have theoretical/academic value; it can be useful to assess the correctness of TensorFlow's dataflow graph (symbolic computation graph) rewriting optimizations...
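The operational idea, stripped of TLA+, is a graph whose nodes fire whenever all their inputs are available, with some nodes reading mutable state at firing time, so different firing orders are possible but the set of permitted outputs is well defined. A small Python sketch (not the paper's spec, and with a made-up graph encoding):

```python
# Not the paper's TLA+ spec: a small sketch of dataflow-graph evaluation with
# mutable state. graph maps node -> (input_nodes, fn); nodes fire when all
# their inputs have values.
def run(graph, initial):
    values = dict(initial)
    fired = set(initial)
    progress = True
    while progress:
        progress = False
        for node, (inputs, fn) in graph.items():
            if node not in fired and all(i in values for i in inputs):
                values[node] = fn(*(values[i] for i in inputs))
                fired.add(node)
                progress = True
    return values

# compute mul = (x + 1) * state, where state is mutable and read at firing time
state = {"v": 10}
graph = {
    "inc":  (["x"], lambda x: x + 1),
    "read": ([],    lambda: state["v"]),          # reads mutable state
    "mul":  (["inc", "read"], lambda a, b: a * b),
}
print(run(graph, {"x": 4})["mul"])  # → 50
```

An operational semantics in this style pins down which values "read" may observe under concurrent updates to the state, which is exactly what a graph-rewriting optimizer must preserve.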

Paper summary: Federated Learning

On Thursday, April 6, Google announced Federated Learning. The announcement didn't make quite a splash, but I think this is potentially transformative. Instead of uploading all the data from smartphones to the cloud for training the model in the datacenter, federated learning enables in-situ training on the smartphones themselves. The datacenter is still involved, but only to aggregate the smartphone-updated local models in order to construct the new/improved global model. This is a win-win-win situation. As a smartphone user, your privacy is preserved since your data remains on your device, but you still get the benefits of machine learning on your smartphone. Google gets what it needs: it perpetually learns from cumulative user experience and improves its software/applications. Google collects insights without collecting data (and some of these insights may still be transferable to advertising income). Secondly, Google also outsources the training to the ...
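The aggregation step on the datacenter side can be sketched in a few lines: each phone reports a locally trained weight vector plus its example count, and the server averages the vectors weighted by those counts (the federated averaging idea), never seeing the raw data. A toy sketch with made-up numbers:

```python
# Sketch of federated averaging: clients send (weights, n_examples); the
# server computes the example-weighted mean, never seeing raw data.
# (Toy numbers; the real protocol adds sampling, compression, etc.)
def federated_average(updates):
    """updates: list of (weight_vector, n_examples) -> weighted mean vector."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

# three phones train locally and report (weight_vector, num_local_examples)
client_updates = [
    ([1.0, 0.0], 10),
    ([0.0, 1.0], 30),
    ([0.5, 0.5], 60),
]
new_global = federated_average(client_updates)
print(new_global)  # → [0.4, 0.6]
```

The weighting by example count is what lets a phone with lots of local data pull the global model harder than one that barely trained.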
