Paper review: Prioritizing attention in fast data

- May 09, 2017

This paper appeared in CIDR17 and is authored by Peter Bailis, Edward Gan, Kexin Rong, and Sahaana Suri at Stanford InfoLab.

Human attention is scarce, data is abundant. The paper argues, this is how we fight back:

prioritize output: return fewer results
prioritize iteration: perform feedback driven development and give useful details and allow user to tune the analysis pipeline
prioritize computation: aggressively filter and sample, tradeoff accuracy/completeness with performance where it has low impact, and use incremental data structures

The slogan for the system is: MacroBase is a search engine for fast data. MacroBase employs a customizable combination of high-performance streaming analytics operators for feature extraction, classification, and explanation.

MacroBase has a dataflow architecture (Storm, Spark Streaming, Heron). The paper argues it is better to focus on what dataflow operators to provide than to try to design from-scratch a new system (which won't be much faster/efficient than existing dataflow systems anyhow).

The architecture of MacroBase is simple:
ingestion&ETL -> feature transform -> classification -> data explanation

MacroBase focuses attention on dataflow operators to prioritize computation. This is done by applying classic systems techniques: predicate pushdown, incremental memorization, partial materialization, cardinality estimation, approximate query processing (top K sketch).

Users are engaged at three different interface levels with MacroBase.
1) Basic: web based point and click UI
2) Intermediate: custom pipeline configuring using Java
3) Advanced: custom dataflow operator design using Java/C++

Users highlight key performance metrics (e.g., power drain, latency) and metadata attributes (e.g., hostname, device ID), and MacroBase reports explanations of abnormal behavior. For example, MacroBase may report that queries running on host 5 are 10 times more likely to experience high latency than the rest of the cluster.

As a broader theme, the paper argues there is opportunity in marrying systems-oriented performance optimization and the machine learning literature. Another big message from the paper is the importance of building combined and optimized end-to-end systems.

MacroBase is currently doing mostly anomaly/outlier detection, and it is not doing any deeper machine learning training. There are plans to make the system distributed. Given that it is based on a dataflow system, there are many plausible ways to achieve distribution of MacroBase.