Saturday, November 25, 2017

On dataflow systems, Naiad and Tensorflow

The below definition for dataflow programming is from Wikipedia (with some small edits):
"Traditionally, a program is modeled as a series of operations happening in a specific order; this may be referred to as sequential, procedural, or imperative programming. The program focuses on commands for programming, where data is normally /at rest/.

In contrast, dataflow programming emphasizes the movement of data; it models a program as a series of connections with explicitly defined inputs and outputs, and connect operations. An operation runs as soon as all of its inputs become available. Thus, dataflow languages are inherently parallel and can work well in large, decentralized systems."

Some examples of dataflow frameworks are map-reduce, Spark, Storm & Heron, GraphX, GraphLab, Naiad (now implemented in Rust as Timely-Dataflow), and Tensorflow.

Map-Reduce uses a very simple directed-acyclic-graph (DAG) with only two operations: map and reduce. Spark extends map-reduce to a couple dozen operations in addition to map and reduce, and implements data storage/processing to be in-memory. More specifically, in Spark, a computation is modeled as a directed acyclic graph (DAG), where each vertex denotes a Resilient Distributed Dataset (RDD) and each edge denotes an operation on RDD. RDDs are collection of objects divided in logical partitions that are stored and processed as in-memory, with shuffle/overflow to disk.

Twitter's Storm and Heron are dataflow stream-processing frameworks. In Storm (and its subsequent modular implementation Heron), bolts and spouts corresponds to dataflow operations and streams.

I had written summaries about GraphX and GraphLab approaches to dataflow as well.

In this post, I will focus more on Naiad (and timely-dataflow), and try to compare/contrast that with TensorFlow.

Naiad, timely-dataflow, and differential-dataflow

Naiad introduced a dataflow programming framework that allows cycles in the dataflow graph. Naiad's dataflow model supports structured loops allowing feedback in the dataflow, stateful data flow vertices capable of consuming and producing records without global coordination, and notifications for vertices once they have received all records for a given round of input or loop iteration. Here is my summary of Naiad from 2014.

Frank McSherry has been singlehandedly (I don't know, maybe he uses both his hands) implementing Naiad in Rust for the last couple years, and calls the project timely-dataflow. He also has another project that provides implementation of differential dataflow in Naiad using timely-dataflow on Rust.

Differential dataflow (i.e., incremental dataflow) means streaming iterative computation and doing incremental computation only in response to changing data. "If you change one input to your computation, you would prefer not to re-evaluate the entire computation. Rather, changes to the input produces some changes in the output of your first dataflow operator, which are then changes to the inputs of subsequent operators. Eventually these changes either evaporate or spill out the end of your dataflow as changed outputs. As changes flow around the dataflow, they happen at various logical times. These times reflect the epoch of streaming input data, and iteration counters for any loops they may be contained in. Rather than eagerly aggregate these changes, we leave them disaggregated, allowing future computation to use arbitrary subsets of the changes. Although we leave ourselves more work to do for each change, we end up with an orders-of-magnitude reduction in the numbers of changes."

Theres is a technical distinction between differential and incremental. Incremental works with "sequence of arbitrary updates" whereas differential works with  "partial order of arbitrary updates". The former has been around a while, has lots of prior art, etc. Differential is pretty new and while more general is pretty similar in most cases (e.g. streaming join is the same for both of them; streaming join in a loop is only possible in differential).

Naiad versus TensorFlow

TensorFlow is a special instantiation/application of Naiad's timely-dataflow model of cyclic dataflow processing for the domain of machine learning and specifically deep-learning. It is well integrated with tools, Python, GPUs, etc. Here is my summary of TensorFlow if you need a refresher.

I think TensorFlow traded off generality of Naiad and gained much more in return. This is what some people call "addition by subtraction".

Naiad aimed to satisfy real-time querying. TensorFlow is OK with batch training. TensorFlow can still output a result at the end of each minibatch, not a big deal. The real-time querying requirement in Naiad may be an unrealistic and unnecessarily/impractical requirement to shoot for in practical systems. Who knows. Are there practical applications that require tight real-time querying with the most recent updates reflected in the view? (More on this, at the end, on the applications discussion.)

Naiad aimed at precisely correct output. If something changes slightly, Naiad will recompute effected things (incrementally, yay) and give you the correct output. On the other hand, if something changes, TensorFlow will consider the changed things as a new minibatch to train with, as a result of which not much may change. Machine Learning and Deep Learning tolerate uncertainty anyhow. Minibatches are how TensorFlow handles differential/incremental dataflow implicitly. No complicated mechanism is required. If things change, that is another minibatch to train for, and most of the previously computed model parameters may remain unaffected.

(Naiad's differential/incremental processing imposes extra memory requirements. It is unclear to me how much state needs to be held at the stateful-operations about the input-graph so that incremental processing is still possible.)

Finally, Naiad has a join operation a very strong/capable yet a costly operation. TensorFlow does not have a join operation, but that makes TensorFlow operations easier to shard/partition across machines. Maybe TensorFlow doesn't have join operations because its inputs are assumed to be independent. Naiad assumes inputs can be dependent: like parts of a graph, or update to an earlier graph, streaming into Naiad. So Naiad has a join operator but pays the cost for that powerful operator.

Naiad's generality should open it up for more potential applications. Differential dataflow is ideal for graph algorithms that update the graph or operates on a dynamic graph. This should be useful for search engines; when the internet graph and links update, you shouldn't recompute the entire result in batch. I don't know what Google uses most recently, but Google has been using an incremental transactional system leveraging BigTable, called Percolator, to solve this problem, without needing the differential dataflow computing power of Naiad.

Other application areas for Naiad's computing power could be social network applications and recommendation systems. But it seems like social network applications do not need to be very precise, approximate answers are OK with them. And recommendations systems do not need very tight real-time constraints yet.

No comments: