Paper summary. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
This paper from Google appeared in the KDD 2017 Applied Data Science track. It discusses Google's quality-assurance extensions to its machine learning (ML) platform, called TensorFlow Extended (TFX). (Google is not very creative with names; they should take a cue from Facebook.)
TFX supports continuous training and serving pipelines and integrates best practices to achieve production-level reliability and scalability. You can argue that the paper does not have a deep research component or a novel insight/idea. But you can argue the same thing about The Checklist Manifesto by Atul Gawande, and that does not detract from its effectiveness, usefulness, and impact.
On the other hand, the paper could definitely have been written much more succinctly. In fact, I found this blog post by Martin Zinkevich, the last author of the paper, much easier to follow than the paper. (Are we pushed to make papers artificially obfuscated to be publication-worthy?) This blog post on serving skew, a major topic discussed in the TFX paper, was both very succinct and accessible.
While we are on the topic of related work, the NIPS 2016 paper "What's your ML test score? A rubric for ML production systems", by a subset of the authors of the TFX paper, is also relevant. A big motivation for this paper is an earlier Google paper titled "Hidden technical debt in machine learning systems".
The paper focuses its presentation on the following components of TFX.
Data analysis, transformation, and validation
Data analysis: This component gathers statistics over feature values: for continuous features, the statistics include quantiles, equi-width histograms, the mean, and the standard deviation. For discrete features, they include the top-K values by frequency.

Data Transformation: This component implements a suite of data transformations to allow "feature wrangling" for model training and serving. The paper says: "Representing features in ID space often saves memory and computation time as well. Since there can be a large number (∼1–100B) of unique values per sparse feature, it is a common practice to assign unique IDs only to the most “relevant” values. The less relevant values are either dropped (i.e., no IDs assigned) or are assigned IDs from a fixed set of IDs."
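To make this concrete, here is a minimal Python sketch (my illustration, not TFX's implementation) that computes a few of the mentioned statistics for a continuous feature and assigns IDs only to the top-K most frequent values of a sparse feature, mapping everything else to a fixed set of fallback IDs:

```python
from collections import Counter
import statistics

def analyze_continuous(values):
    """Toy stand-in for the data analysis step: a few basic statistics."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
        "quartiles": statistics.quantiles(values, n=4),
    }

def build_vocab(values, top_k=100_000, num_oov_buckets=1):
    """Assign IDs 0..top_k-1 to the top_k most frequent values of a sparse
    feature; all other values share a small, fixed set of fallback IDs."""
    counts = Counter(values)
    vocab = {v: i for i, (v, _) in enumerate(counts.most_common(top_k))}
    def to_id(value):
        if value in vocab:
            return vocab[value]
        return top_k + (hash(value) % num_oov_buckets)  # "less relevant" values
    return to_id

# Example: with top_k=2, "maps" -> 0, "chess" -> 1, rare values -> fallback ID 2.
to_id = build_vocab(["maps", "maps", "chess", "rare-app"], top_k=2)
```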
Data validation: To perform validation, the component relies on a schema that provides a versioned, succinct description of the expected properties of the data.
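As an illustration of what schema-driven validation amounts to, here is a minimal sketch (the schema format and field names are hypothetical, not TFX's actual schema):

```python
# Hypothetical schema: for each feature, its expected type, whether it must be
# present, and an optional domain of allowed values. Incoming batches are
# checked against it and anomalies are reported rather than silently ingested.
SCHEMA = {
    "age":     {"type": float, "required": True},
    "country": {"type": str,   "required": True, "domain": {"US", "DE", "TR"}},
}

def validate_batch(rows, schema=SCHEMA):
    anomalies = []
    for i, row in enumerate(rows):
        for name, spec in schema.items():
            if name not in row:
                if spec["required"]:
                    anomalies.append(f"row {i}: missing required feature '{name}'")
                continue
            value = row[name]
            if not isinstance(value, spec["type"]):
                anomalies.append(f"row {i}: '{name}' has unexpected type {type(value).__name__}")
            elif "domain" in spec and value not in spec["domain"]:
                anomalies.append(f"row {i}: '{name}' value {value!r} is outside the schema domain")
    return anomalies
```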
The other day, I wrote about modeling use cases, which included data modeling. That kind of TLA+/PlusCal modeling may have applications here to design and enforce a rich/sophisticated schema, with high-level specifications of some of the main operations on the data.
Model training
This section talks about warm-starting, which is inspired by transfer learning. The idea is to first train a base network on some base dataset, then use the 'general' parameters from the base network to initialize the target network, and finally train the target network on the target dataset. This cuts down the training time significantly. When applying this to continuous training, TFX helps you identify a few general features of the network being trained (e.g., embeddings of sparse features). When training a new version of the network, TFX initializes (or warm-starts) the parameters corresponding to these features from the previously trained version of the network and fine-tunes them along with the rest of the network.

I first wondered whether it would be beneficial to check in advance when warm-starting would be applicable/beneficial. But then I realized, why bother? ML is empirical and practical; try it and see if warm-starting helps, and if not, don't use it. On the other hand, if the design space becomes very large, this kind of applicability check can help save time and guide the development process.
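Here is a minimal Keras sketch of the warm-starting idea (an illustration under my own assumptions, not TFX's mechanism; the layer names and shapes are made up): the embedding of a sparse feature is copied over from the previously trained model, while the rest of the network starts fresh.

```python
import tensorflow as tf

def build_model(vocab_size=10_000, embed_dim=32):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim, name="app_embedding"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

def warm_start(new_model, old_model, warm_layers=("app_embedding",)):
    """Copy the 'general' parameters (here: one embedding) from the previously
    trained model; all other layers keep their fresh initialization."""
    for name in warm_layers:
        new_model.get_layer(name).set_weights(old_model.get_layer(name).get_weights())
    return new_model

# Usage: old_model stands in for the previously trained version of the network.
old_model = build_model(); old_model.build(input_shape=(None, 20))
new_model = build_model(); new_model.build(input_shape=(None, 20))
warm_start(new_model, old_model)   # new_model now trains with warm-started embeddings
```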
This section also talks about FeatureColumns, which help users focus on which features to use in their machine learning model. These provide a declarative way of defining the input layer of a model.
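For example, with the tf.feature_column API (used in TF 1.x and early 2.x, since deprecated in favor of Keras preprocessing layers; the feature names below are made up), the input layer is declared rather than hand-coded:

```python
import tensorflow as tf

# Declarative description of the model's input features (hypothetical names).
age     = tf.feature_column.numeric_column("age")
app_id  = tf.feature_column.categorical_column_with_hash_bucket("app_id", hash_bucket_size=100_000)
app_emb = tf.feature_column.embedding_column(app_id, dimension=32)

feature_columns = [age, app_emb]
# The model's input layer is then derived from this declaration, e.g.:
input_layer = tf.keras.layers.DenseFeatures(feature_columns)
```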
Model evaluation and validation
A good model meets a desired prediction quality, and is safe to serve.

It turns out the "safe to serve" part is not trivial at all: "The model should not crash or cause errors in the serving system when being loaded, or when sent bad or unexpected inputs, and the model shouldn’t use too many resources (such as CPU or RAM). One specific problem we have encountered is when the model is trained using a newer version of a machine learning library than is used at serving time, resulting in a model representation that cannot be used by the serving system."
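A minimal sketch of a "safe to serve" smoke test (my illustration, not TFX's model validator): load the exported model with the serving-side library version and probe it with deliberately awkward inputs before promoting it to production.

```python
import tensorflow as tf

def safe_to_serve(export_dir, probe_batches):
    """Return True only if the exported model loads with the serving-time
    library and survives a set of edge-case inputs (empty, malformed, ...)."""
    try:
        model = tf.saved_model.load(export_dir)      # fails here if the export format is
        infer = model.signatures["serving_default"]  # newer than the serving library
    except Exception:
        return False
    for batch in probe_batches:
        try:
            infer(**batch)
        except tf.errors.InvalidArgumentError:
            return False                             # model errors out on bad input
    return True
```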
Model serving
This component aims to scale serving to varied traffic patterns. They identified interference between the request-processing and model-loading flows of the system, which caused latency peaks during the interval when the system was loading a new model or a new version of an existing model. To solve this, they provide a separate, dedicated threadpool for model-loading operations, which reduces the peak latencies by an order of magnitude.

This section first says it is important to use a common data format for standardization, but then backtracks on that: "Non neural network (e.g., linear) models are often more data intensive than CPU intensive. For such models, data input, output, and preprocessing tend to be the bottleneck. Using a generic protocol buffer parser proved to be inefficient. To resolve this, a specialized protocol buffer parser was built based on profiles of various real data distributions in multiple parsing configurations. Lazy parsing was employed, including skipping complete parts of the input protocol buffer that the configuration specified as unnecessary. The application of the specialized protocol buffer parser resulted in a speedup of 2-5 times on benchmarked datasets."
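The threadpool isolation can be sketched as follows (a simplified Python illustration of the design, not TensorFlow Serving's actual C++ implementation; run_inference and load_model are hypothetical helpers):

```python
from concurrent.futures import ThreadPoolExecutor

# Requests and model loads get their own pools, so a heavyweight model load can
# never starve request handling of threads and spike the tail latency.
request_pool = ThreadPoolExecutor(max_workers=32, thread_name_prefix="serve")
load_pool    = ThreadPoolExecutor(max_workers=2,  thread_name_prefix="load")

def handle_request(request):
    return request_pool.submit(run_inference, request)    # run_inference: hypothetical

def deploy_new_version(model_path):
    return load_pool.submit(load_model, model_path)        # load_model: hypothetical
```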
In NIPS 2017, Google had a more detailed paper on the TensorFlow serving layer.
Case Study: Google Play
One of the first deployments of TFX is the recommender system for the Google Play mobile app store, whose goal is to recommend relevant Android apps to Play users. Wow, talk about scale: Google Play has over one billion active users and over one million apps.

This part was very interesting and is a testament to the usefulness of TFX:
"The data validation and analysis component helped in discovering a harmful training-serving feature skew. By comparing the statistics of serving logs and training data on the same day, Google Play discovered a few features that were always missing from the logs, but always present in training. The results of an online A/B experiment showed that removing this skew improved the app install rate on the main landing page of the app store by 2%."
MAD questions
1) The paper provides best practices for validating the sanity of ML pipelines, in order to avoid the Garbage In Garbage Out (GIGO) syndrome. How many of these best practices are likely to change over the years? I can already see a paper coming in the next couple of years, titled "One size does not fit all for machine learning".

In fact, this thought sent me down a rabbit hole, where I read about Apache Beam, Google Dataflow, and then the Lambda versus Kappa architecture. Very interesting work, which I will summarize soon.
2) Why do research papers not have a MAD questions section?
(I am not picking on this paper.) I guess research papers have to claim authority and provide a sense that everything is under control. Pointing out unclosed loops and open-ended questions may give a bad impression of the paper. The future work sections often come as one paragraph at the end of the paper and play it safe. I don't think it should be that way, though. More relaxed venues, such as HOT-X conferences and workshops, can provide an outlet for papers that raise questions.