DBSP: Automatic Incremental View Maintenance for Rich Query Languages

Incremental computation represents a transformative (!) approach to data processing. Instead of recomputing everything when your input changes slightly, incremental computation aims to reuse the original output and efficiently update the results. Efficiently means performing work proportional only to input and output changes.

This paper introduces DBSP, a programming language inspired by signal processing (hence the name DB-SP). DBSP is  simple, yet it offers extensive computational capabilities. With just four operators, it covers complex database queries, including entire relational algebra, set and multiset computations, nested relations, aggregations, recursive queries, and streaming computations.


Basic DBSP operators 

The language is designed to capture computation on streams. Streams are represented as infinite vectors indexed by consecutive time. Each stream can represent values of any type and incorporates basic mathematical operations like addition, subtraction, and a zero element. DBSP introduces incremental computing operators such as lifting ($\uparrow$) that converts scalar functions to stream functions, delay ($z^{-1}$) that shifts stream values, differentiation (${D}$) that compute stream changes, and integration (${I}$) that reconstruct original streams from change streams. Integration and differentiation are inverses of each other, and this shows in their stream circuit representation. 

   



DBSP is a simplified version of differential dataflow. I talked about differential dataflow in 2017 here. Differential dataflow represented time as a lattice, DBSP represents it as a single array. In other words, in DBSP time is consecutive and each state requires a unique predecessor. In differential dataflow, time is defined as a partial order to capture causality and concurrency. That is a better model for distributed systems, but that introduces a lot of complexity. DBSP makes a big simplification when it assumes linear synchronous time, and probably opens up issues related to managing out-of-order inputs, incorporating data from multiple streams, and tracking progress. 

But the simplification buys us powerful properties. The chain rule below emerges as a particularly powerful composition technique for incremental computation. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This is big. This allows independent optimization of query components, enabling efficient and modular query processing. 



Application to database incremental view maintenance problem

The paper argues that traditional databases can be reframed as streaming databases by representing linearizable transactions as streams and database states as evolving snapshots. With this framing, views become dynamically/incrementally computed entities over this stream. And DBSP gives us simple and principled Incremental View Maintenance (IVM) over a database.

An IVM algorithm needs to compute the changes to the view given the changes to the database. Given a query Q, we can define its incremental version $ Q^\Delta$ using stream integration and differentiation: $Q^\Delta = {D} \circ {\uparrow \!Q} \circ {I}$. The incremental version of a query is a stateful streaming operator which computes directly on changes and produces changes. This is depicted graphically as:

Instead of maintaining the output view entirely, DBSP proposes generating deltas as the output of the computation. The idea that both inputs and outputs to an IVM system are streams of changes is key to the symmetry of the solution and the fundamental reason that the chain rule exists.

But we have one more kink to iron out. DBSP operations assume that the competitions are done over commutative groups but databases are not commutative groups: they compute over sets. Fortunately, there is a known trick to solve this problem: Represent each table and as a Z-set. I talked about Z-sets (counting sets) briefly when describing Walter recently. In a Z-set each row has an additional weight and the weight shows how many times the row belongs to the table. In Z-sets the weights can be both positive and negative. This is great, because we can leverage this to represent delta changes to tables for example a positive weight indicator says a row is added/updated and a negative weight indicates a row is removed.

Moreover, we can represent database tables with the mapping where each weight is one if positive, and deleted if negative. This is also where the DISTINCT operation comes handy. It "removes" duplicates from multisets, and it also eliminates elements with negative weights. The paper also talks about a way to implement/lift DISTINCT operator as efficiently incrementalizable. 

We can use Z-sets to implement all of SQL, essentially entire relational algebra. Moreover in this implementation, each primitive operator has an efficient incremental version which does only work proportional to the size of the change.



Incremental View Maintenance (IVM) algorithm

Now, let's put that chain rule compositionality to good use. Here is the algorithm.

Consider a relational query Q defining a view V. To create a circuit that maintains V incrementally, we apply the following mechanical steps:

  1. Translate Q into a circuit using the rules in Table 1.
  2. Apply distinct elimination rules until convergence.
  3. Lift the whole circuit converting it to a circuit operating on streams.
  4. Incrementalize the whole circuit "surrounding" it with I and D (as we described above).
  5. Apply the chain rule recursively on the query structure to obtain an incremental implementation.

This algorithm is deterministic and its running time is proportional to the number of operators in the query.


Wrapping up

So what do we have so far? As long as we can express a new operator as a DBSP circuit, we can (1) define its incremental version, (2) apply the incrementalization algorithm to obtain an efficient incremental implementation, and (3) be confident that it composes with any other operators.

I skip the incremental recursive queries discussion from the paper. Here is a brief summary from the paper:  "We compile a recursive query into a circuit that employs incremental computation internally (using semi-naïve evaluation), to compute the fixed point. Here we combine these results to construct a circuit that evaluates a recursive query incrementally. The circuit receives a stream of updates to input relations, and for every update recomputes the fixed point. To do this incrementally, it preserves the stream of changes to recursive relations produced by the iterative fixed point computation, and adjusts this stream to account for the modified inputs. Thus, every element of the input stream yields a stream of adjustments to the fixed point computation, using nested streams."


The paper is a tour-de-force.  The researchers validate their approach through Lean theorem proving, developing an open-source Rust implementation with an MIT-licensed runtime. Their SQL-to-incremental-circuit compiler provides practical proof of the concept's viability.


The DBSP (now Feldera) Rust library provides APIs for basic algebraic data types: such as groups, finite maps, Z-set, indexed Z-set. A separate circuit construction API allows users to create DBSP circuits by placing operator nodes (corresponding to boxes in our diagrams) and connecting them with streams, which correspond to the arrows in our diagrams. They also built a SQL to DBSP compiler, which translates standard SQL queries into DBSP circuits. The compiler implements the Algorithm above to generate a streaming version of any SQL query. (Since I linked to the DBSP/Feldera repo, for completeness here is a link to the Differential Dataflow repo.)

This 12 minute VLDB conference presentation video does a great job of explaining the basic concepts of DBSP, so I recommend watching this. The paper received best paper in VLDB'23.


This is cool work! As I mentioned in the differential dataflow comparison, DBSP makes simplification to consecutive/synchronous time, but as a result it gets a very clean yet powerful modular system, which seems to get good practical mileage in real world. In 2001, I was a summer intern at IBM Research T.J. Watson, Westchester NYC. I was interning with Rob Strom to add an incremental view maintenance (IVM) research on IBM's publish-subscribe system, Gryphon. We had some progress, but what I got out of that project was that this is a very hard problem. Glad to see science is doing its thing, incrementally, but surely. 


Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Foundational distributed systems papers

Advice to the young

Linearizability: A Correctness Condition for Concurrent Objects

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book