Metadata

Posts

Showing posts with the label stream-processing

DDIA: Chapter 11 - Stream Processing

- December 10, 2024

Daily batch processes introduce significant latency, since input changes reflected in the output only after a day. For fast paced business, this is too slow. To reduce delays, stream processing occurs more frequently (e.g., every second) or continuously, where events are handled as they happen. In stream processing, a record is typically called an event—a small, immutable object containing details of an occurrence, often with a timestamp. Polling for new events becomes costly when striving for low-latency continuous processing. Frequent polling increases overhead as most requests return no new data. Instead, systems should notify consumers when new events are available. Messaging systems handle this by pushing events from producers to consumers. Direct messaging systems require application code to handle message loss and assume producers and consumers are always online, limiting fault tolerance. Message brokers (or message queues) improve reliability by acting as intermediaries. P...

DBSP: Automatic Incremental View Maintenance for Rich Query Languages

- November 19, 2024

Incremental computation represents a transformative (!) approach to data processing. Instead of recomputing everything when your input changes slightly, incremental computation aims to reuse the original output and efficiently update the results. Efficiently means performing work proportional only to input and output changes. This paper introduces DBSP, a programming language inspired by signal processing (hence the name DB-SP). DBSP is simple, yet it offers extensive computational capabilities. With just four operators, it covers complex database queries, including entire relational algebra, set and multiset computations, nested relations, aggregations, recursive queries, and streaming computations. Basic DBSP operators The language is designed to capture computation on streams. Streams are represented as infinite vectors indexed by consecutive time. Each stream can represent values of any type and incorporates basic mathematical operations like addition, subtraction, and ...

Paper summary. Decoupling the Control Plane from Program Control Flow for Flexibility and Performance in Cloud Computing

- May 16, 2018

This paper appeared in Eurosys 2018 and is authored by Hang Qu, Omid Mashayekhi, Chinmayee Shah, and Philip Levis from Stanford University. I liked the paper a lot, it is well written and presented. And I am getting lazy, so I use a lot of text from the paper in my summary below. Problem motivation In data processing frameworks, improved parallelism is the holy grail because it can get more data processed in less time. However, parallelism has a nemesis called the control plane . While, control plane can have a wide array of meaning, in this paper control plane is defined as the systems and protocols for scheduling computations, load balancing, and recovering from failures. A centralized control frame becomes a bottleneck after a point. The paper cites other papers and states that a typical cloud framework control plane that uses a fully centralized design can dispatch fewer than 10,000 tasks per second. Actually, that is not bad! However, with machine learning (ML) app...

The Lambda and the Kappa Architectures

- January 14, 2018

This article, by Jimmy Lin, looks at the Lambda and Kappa architectures, and through them considers a larger question: Can one size fit all? The answer, it concludes, is it depends on what year you ask! The pendulum swings between the apex of one tool to rule them all , and the other apex of multiple tools for maximum efficiency . Each apex has its drawbacks: One tool leaves efficiency on the table, multiple tools spawns integration problems. In the RDBMS world, we already saw this play out. One size RDBMS fitted all, until it couldn't anymore. Stonebraker declared "one size does not fit all", and we have seen a split to dedicated OLTP and OLAP databases connected by extract-transform-load (ETL) pipelines. But these last couple years we are seeing a lot of one size fits all "Hybrid Transactional/Analytical Processing (HTAP)" solutions being introduced again. Lambda and Kappa OK, back to telling the story from the Lambda and Kappa architectures perspecti...

Paper summary. Real-Time Machine Learning: The Missing Pieces

- December 28, 2017

This paper, dated March 11, 2017 on arxiv, is from UB Berkeley. Here is Prof. Michael Jordan's Strata Hadoop conference talk on this. The paper first motivates the need for real-time machine learning. For this it mentions in-situ reinforcement learning (RL) that closes the loop by taking actions that affect the sensed environment. (The second paragraph mentions that such RL can be trained more feasibly by using simulated/virtual environments: by first trying multiple actions [potentially in parallel] to see their affect in simulation before interacting with the real world. Again this requires real-time performance as the simulation should be performed faster than real-time interaction.) Based on this application scenario, here are their desired requirement from the ML platform. R1: low latency R2: high throughput R3: dynamic task creation (RL primitives such as Monte Carlo tree search may generate new tasks during execution) R4: heterogeneous tasks (tasks would have wide...

Retroscope: Retrospective cut-monitoring of distributed systems (part 3)

- December 22, 2017

This post continues the discussion on monitoring distributed systems with Retroscope . Here we focus on cut monitoring approach Retroscope uses. (This post is jointly written with Aleksey Charapko and Ailidani Ailijiang.) Retroscope is a monitoring system for exploring global/nonlocal state history of a distributed system. It differs from other monitoring tools due to the way it inspects the system state. While request tracers inspect the system by following the trace of a request (i.e. request r in the figure), Retroscope performs cut monitoring and examines the system at consistent global cuts, observing the state across many machines and requests. It moves along the system history and scans a progression of states one cut at a time, checking cut Ts1 and then Ts2 and so on. Retroscope’s cut monitoring approach is complementary to the request tracing solutions, and brings a number of advantages. First, by exposing the nonlocal state, Retroscope enables users to examine...

Paper summary. Blazes: Coordination analysis and placement for distributed programs

- November 27, 2017

This paper is by Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and David Maier. It appears in October 2017 issue of ACM Transactions on Database Systems. A preliminary conference version appeared in ICDE 2014. This paper builds a theory of dataflow/stream-processing programs, which cover the Spark , Storm, Heron , Tensorflow , Naiad, TimelyDataflow work. The paper introduces compositional operators of labels, and shows how to infer the coordination points in dataflow programs. When reading below, pay attention to the labeling section, the labels "CR, CW, OR, OW", and the section on reduction rules on labels. To figure out these coordination points, the Blazes framework relies on annotations of the dataflow programs to be supplied as an input. This is demonstrated in terms of a Twitter Storm application and a Bloom application. The paper has many gems. It says at the conclusion section in passing that when designing systems, we should pay attention to coordinat...

On dataflow systems, Naiad and Tensorflow

- November 25, 2017

The below definition for dataflow programming is from Wikipedia (with some small edits): "Traditionally, a program is modeled as a series of operations happening in a specific order; this may be referred to as sequential, procedural, or imperative programming. The program focuses on commands for programming, where data is normally /at rest/. In contrast, dataflow programming emphasizes the movement of data; it models a program as a series of connections with explicitly defined inputs and outputs, and connect operations. An operation runs as soon as all of its inputs become available. Thus, dataflow languages are inherently parallel and can work well in large, decentralized systems." Some examples of dataflow frameworks are map-reduce, Spark, Storm & Heron, GraphX, GraphLab, Naiad (now implemented in Rust as Timely-Dataflow), and Tensorflow. Map-Reduce uses a very simple directed-acyclic-graph (DAG) with only two operations: map and reduce. Spark extends map-redu...

Retroscope: Retrospective Monitoring of Distributed Systems (part 2)

- September 12, 2017

This post, part 2, focuses on monitoring distributed systems using Retroscope. This is a joint post with Aleksey Charapko . If you are unfamiliar with hybrid logical clocks and distributed snapshots, give part 1 a read first. Monitoring and debugging distributed applications is a grueling task. When you need to debug a distributed application, you will often be required to carefully examine the logs from different components of the application and try to figure out how these components interact with each other. Our monitoring solution, Retroscope, can help you with aligning/sorting these logs and searching/focusing on the interesting parts. In particular, Retroscope captures a progression of globally consistent distributed states of a system and allows you to examine these states and search for global predicates. Let’s say you are working on debugging ZooKeeper, a popular coordination service. Using Retroscope you can easily add instrumentation to the ZooKeeper nodes to log and ...