Are Database System Researchers Making Correct Assumptions about Transaction Workloads?
In this blog, we have reviewed quite a number of deterministic database papers, including Calvin, SLOG, and Detock, which aim to achieve higher throughput and lower latency. The downside of these systems is that they sacrifice transaction expressivity. They rely on two critical assumptions: first, that transactions are "non-interactive", meaning they are sent as a single request (one-shot) rather than engaging in a multi-round-trip conversation with the application; and second, that the database can know a transaction's read/write set before execution begins (so that it can lock data deterministically).
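To make the two assumptions concrete, here is a minimal sketch of what a deterministic-database client looks like from the application's side. The Client class and submit_txn name are hypothetical placeholders, not any real system's API (Calvin exposes this through stored procedures and a sequencer layer).

```python
# Illustration only: the two deterministic-database assumptions from the
# client's point of view. Client/submit_txn are hypothetical names.

class Client:
    def submit_txn(self, txn):
        # In a real deterministic system this would hand the transaction to
        # the sequencer, which orders it globally before execution.
        print("submitted:", txn["op"], txn["write_set"])

txn = {
    # Assumption 2: the read/write set is declared before execution begins.
    "read_set":  ["account:42", "account:99"],
    "write_set": ["account:42", "account:99"],
    # Assumption 1: the transaction is one shot -- a single request, with no
    # further round trips between the database and the application.
    "op": "transfer",
    "args": {"from": "account:42", "to": "account:99", "amount": 100},
}

Client().submit_txn(txn)
```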
So when these same deterministic database researchers write a paper to validate how well these assumptions hold in the real world, we should read with some skepticism. Don't get me wrong: this is a great and valuable paper. But we still need to read it critically.
Summary
The study employed a semi-automated annotation tool to analyze 111 popular open-source web applications and over 30,000 transactions. The authors target applications built with Django (Python) and TypeORM (TypeScript). These Object-Relational Mappers (ORMs) abstract away the SQL layer, allowing developers to interact with a wide variety of underlying relational databases (such as PostgreSQL, MySQL, and SQLite) using standard application code. The authors categorized transactions based on two primary dimensions: Transaction Interactivity and Read/Write Set Inferability.
Transaction Interactivity
The study found that interactive transactions, where the application holds a transaction open while performing logic or reads, are surprisingly rare: 39% of applications contain no interactive transactions at all, and for the remaining 61%, only roughly 9.6% of the workload is interactive.
The authors then distinguish between "Strictly Interactive" transactions, which inherently require external input or side effects mid-flight (e.g., waiting for an API response or user prompt), and "Non-Strictly Interactive" ones, where the back-and-forth is merely deterministic application logic processing reads between queries without outside dependencies. They find strictly interactive transactions account for only 0.5% of the total workload. They argue that the vast majority of remaining interactive transactions are convertible to one-shot transactions with minimal code changes.
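Here is a minimal Django ORM sketch of the distinction, under stated assumptions: the Order model, its fields, and the charge_card call are hypothetical, not from the paper's corpus. The first function is non-strictly interactive (only deterministic application logic between queries, so it is convertible to a one-shot transaction); the second is strictly interactive (it blocks mid-transaction on an external side effect, so it is not).

```python
from django.db import transaction

def apply_discount(order_id):
    # Non-strictly interactive: the transaction stays open across several
    # queries, but the logic between them is deterministic and depends only
    # on data already read -- convertible to a one-shot transaction.
    with transaction.atomic():
        order = Order.objects.select_for_update().get(pk=order_id)
        if order.total > 100:                 # pure application logic
            order.total = order.total * 0.9
        order.save()

def checkout(order_id):
    # Strictly interactive: the transaction waits on an external API call
    # (a payment gateway) before deciding what to write, so it cannot be
    # expressed as a single up-front request.
    with transaction.atomic():
        order = Order.objects.select_for_update().get(pk=order_id)
        result = charge_card(order.customer_id, order.total)  # external API
        order.status = "paid" if result.ok else "failed"
        order.save()
```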
Read/Write Set Inferability
The authors tested whether a database could statically predict which rows a transaction would touch. They found that for 90% of applications, at least 58% of transactions have read/write sets that are fully inferable in advance.
The primary obstacle to perfect knowledge is the CDA Mismatch (Conflict Detection Attribute Mismatch), which occurs in roughly 27% of transactions. This happens when a transaction queries data using a secondary attribute (e.g., 'email') rather than the primary locking key (e.g., 'ID'). See Figure 1 above. When this happens, the database cannot infer the lock without first inspecting the data. However, the study remains upbeat about this: since CDA mismatches typically occur in simple single-statement transactions, the authors argue that the "fallback" mechanism (running a lightweight reconnaissance query to find the ID) is inexpensive.
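To illustrate, here is a minimal Django ORM sketch (the User model and its fields are hypothetical). The first function exhibits a CDA mismatch: the row is looked up by email, so the primary-key lock target is unknown until the read happens. The second is roughly the fallback the authors describe: a cheap reconnaissance read resolves the email to an ID first, so the transaction itself touches only a known key.

```python
from django.db import transaction

def deactivate_user(email):
    # CDA mismatch: the lookup uses a secondary attribute (email), so a
    # deterministic scheduler cannot tell which primary key it must lock
    # until the row has actually been read.
    with transaction.atomic():
        user = User.objects.select_for_update().get(email=email)
        user.is_active = False
        user.save()

def deactivate_user_with_recon(email):
    # Fallback: a lightweight reconnaissance query outside the transaction
    # maps email -> primary key, so the transaction's write set is known up
    # front. The email predicate is re-checked inside the transaction in
    # case the row changed after the recon read.
    user_id = (User.objects.filter(email=email)
               .values_list("pk", flat=True).first())
    if user_id is None:
        return
    with transaction.atomic():
        User.objects.filter(pk=user_id, email=email).update(is_active=False)
```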
Critical Review
This paper provides very valuable empirical data on transaction workloads by surveying a large number of applications. It echoes the scale of the SIGMOD 2015 "Feral Concurrency Control" study, which examined the integrity risks of the mismatch between databases and modern ORM frameworks like Rails. However, we must examine this study critically to identify potential limitations in scope and bias. My two big criticisms are the following.
Selection Bias. The choice of the corpus, open-source web applications using ORMs, heavily skews the results and excludes almost all enterprise systems. The study also ignores ad-hoc transactions executed by human operators (DBAs, analysts) via terminals. These transactions are inherently interactive ("user think time") and often performance-critical.
Convertibility Optimism. The paper argues that "Non-Strictly Interactive" transactions are easily convertible to deterministic models without resorting to brittle Stored Procedures. They advocate for a reconnaissance/scout query pattern, where the application performs a dry-run read phase to optimistically build a write set before sending the transaction as a single atomic batch. While this pattern is promising, the claim of minimal changes seems very optimistic. However, I think this conversion challenge is a great opportunity for research and for collecting empirical evidence of the actual engineering costs.
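For a sense of what the conversion involves, here is a sketch of the scout-query pattern under stated assumptions: hypothetical Inventory and Order models with a version column, not the paper's tooling. The dry-run read happens outside any transaction, and the one-shot write phase validates the version it saw, retrying from the top if validation fails.

```python
from django.db import transaction
from django.db.models import F

def place_order(item_id, qty, max_retries=3):
    for _ in range(max_retries):
        # Scout/reconnaissance phase: read outside any transaction what the
        # interactive version would have read, remembering the version seen.
        item = Inventory.objects.get(pk=item_id)
        if item.stock < qty:
            raise ValueError("out of stock")

        # One-shot phase: a single atomic batch whose write set (this item
        # plus the new order row) is known before execution starts. The
        # version predicate detects whether the scouted read went stale.
        with transaction.atomic():
            updated = (Inventory.objects
                       .filter(pk=item_id, version=item.version)
                       .update(stock=F("stock") - qty,
                               version=F("version") + 1))
            if updated:
                return Order.objects.create(item_id=item_id, quantity=qty)
        # Validation failed: the item changed underneath us; retry the dry run.
    raise RuntimeError("could not place order after retries")
```

Note that even in this small sketch the developer now owns a retry loop and a version column, which is exactly the kind of engineering cost that "minimal code changes" glosses over.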
Ok, one more thing. The title "Are Database System Researchers Making Correct Assumptions?" is a bit of a bait-and-switch. It sounds like a general audit of the entire field, but it really only targets the deterministic-database research niche (systems like Calvin). Traditional database researchers who work on standard locking or MVCC (like PostgreSQL or Spanner) never relied on these strict assumptions, so they are largely off the hook for this paper. It would actually be great to have another paper on that exact, more ambitious topic. If you know of one, please let me know.