Towards Optimal Transaction Scheduling

This paper (VLDB'2024) looks at boosting transaction throughput through better scheduling. The idea is to explore the schedule-space more systematically and pick execution orders that reduce conflicts.

The paper's two main contributions are a scheduling policy called Shortest Makespan First (SMF) and an MVTSO concurrency control variant called MVSchedO.

SMF uses a greedy heuristic to pick low-conflict schedules that minimize the increase in total execution time (makespan) at each step. MVSchedO enforces the chosen schedule at a fine-grained level by adapting multi-version timestamp ordering (MVTSO). The authors implement SMF and MVSchedO in RocksDB (R-SMF) and show up to 3.9x higher throughput and 3.2x lower tail latency on benchmarks as well as Meta's TAO workload.

This is stellar systems work, and it makes a convincing case that search-based scheduling is a promising direction for extracting higher throughput from database systems.

Motivation

Reducing contention is a big deal. Throughput drops when transactions contend for the same data. But finding the throughput-optimal schedule is NP-complete (Papadimitriou, 1979), so database systems mostly use FIFO arrival order among pending transactions (i.e., the transactions made available to the scheduler at a given time). How much can we reduce conflicts by reordering, and is it worth the trouble?

This paper explores the full schedule space more systematically, even when access patterns are only partially known. It tries to find good schedules quickly enough to be practical, and then execute transactions exactly in that schedule order.

They use SMF for the first challenge and MVSchedO for the second, and implement both on RocksDB to show substantial gains.


Searching for Fast Schedules

To measure the effect of transaction ordering, the paper proposes makespan: the total time for a batch to complete under idealized execution (equal-length operations, no failures, no bottlenecks). Minimizing makespan maximizes throughput for a finite batch. A toy example with four transactions under MVTSO shows that the FIFO schedule has a makespan of 8, while an alternative that groups compatible transactions reduces it to 6, a 25% improvement in throughput.
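To make the makespan idea concrete, here is a toy sketch (my own simplification, not the paper's model or code): each transaction is a list of key accesses, every operation costs one time unit, transactions run concurrently, and an operation on a key must wait for the latest operation on that key from an earlier-scheduled transaction.

```python
def makespan(schedule):
    """Idealized makespan of a schedule (list of transactions, each a
    list of accessed keys). Unit-cost ops, transactions run concurrently,
    every access to the same key is treated as a conflict."""
    last_finish = {}       # key -> finish time of the latest op on it
    finish_times = [0]
    for txn in schedule:
        t = 0              # transaction-local clock
        for key in txn:
            # wait for the latest earlier op on this key, then run 1 unit
            t = max(t, last_finish.get(key, 0)) + 1
            last_finish[key] = t
        finish_times.append(t)
    return max(finish_times)
```

With this model, reordering alone changes the makespan: scheduling the two-hop transaction ["x", "y"] before the hot transaction ["x", "x"] yields makespan 3 instead of 4.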


Shortest Makespan First (SMF) is a greedy scheduling algorithm for reducing makespan. You start with one transaction, then repeatedly add, from a random sample of k unscheduled transactions, the one that causes the smallest increase in makespan. This gives O(nk) complexity. A small k (like 5) works well in real workloads because there is a good chance of sampling at least one low-conflict transaction.
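The greedy loop can be sketched as follows, using the same unit-cost toy model as above. This is a naive version: recomputing makespan from scratch at each step is for clarity only (the paper maintains per-key state incrementally to get O(nk)), and all names are my own.

```python
import random

def makespan(schedule):
    # unit-cost ops; an op on a key waits for the latest earlier op on it
    last, finishes = {}, [0]
    for txn in schedule:
        t = 0
        for key in txn:
            t = max(t, last.get(key, 0)) + 1
            last[key] = t
        finishes.append(t)
    return max(finishes)

def smf(txns, k=5, seed=0):
    """Shortest Makespan First (toy sketch): greedily append, from a
    random sample of k pending transactions, the one whose addition
    grows the makespan least."""
    rng = random.Random(seed)
    pending, schedule = list(txns), []
    while pending:
        sample = rng.sample(pending, min(k, len(pending)))
        best = min(sample, key=lambda t: makespan(schedule + [t]))
        schedule.append(best)
        pending.remove(best)
    return schedule
```

For example, with transactions [["x"], ["x","y","y"], ["y"]] the FIFO order has makespan 5, while the greedy order reaches 4 by not stacking the hot-key transaction behind its conflicts.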

The key insight is that most contention comes from a few hot keys. By separating transactions that hit those keys, SMF gets most of the benefit without needing full access-set knowledge. The paper illustrates this with TPC-C: it shows that knowing only Warehouse/District keys gets within 6% of the performance of having complete read/write knowledge.


An online scheduling-first database

To bring SMF into an online database, R-SMF splits the work into three steps:

  1. predict the small set of hot-key accesses using a classifier,
  2. schedule transactions using SMF based on those predictions, and
  3. enforce the order during execution using MVSchedO, which modifies MVTSO.



The classifier is simple and efficient. To produce canonical hot-key sets, it employs kNN to cluster transaction metadata, such as the transaction type (new order, transfer funds) and explicit key parameters (warehouse ID, account number). At runtime, a transaction's hints (derived from metadata) are mapped to a predicted hot-key set. This sacrifices some precision but keeps overhead low, and works well in OLTP workloads where most contention comes from a few obvious keys. The model is periodically retrained to adapt to workload drift.
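As an illustration of the hints-to-hot-keys mapping, a toy nearest-neighbor predictor could look like the following. All names, features, and the distance function are hypothetical; the paper's classifier details differ.

```python
from collections import Counter

class HotKeyPredictor:
    """Toy nearest-neighbor hot-key predictor (hypothetical sketch).
    Training pairs a transaction's metadata feature vector (e.g. encoded
    type + explicit key parameters) with the hot keys it touched;
    prediction votes among the k nearest known examples."""
    def __init__(self, k=3):
        self.k = k
        self.examples = []   # list of (feature_vector, frozenset(hot_keys))

    def fit(self, features, hot_keys):
        self.examples.append((tuple(features), frozenset(hot_keys)))

    def predict(self, features):
        if not self.examples:
            return frozenset()
        # squared Euclidean distance over the metadata vector
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
        nearest = sorted(self.examples,
                         key=lambda ex: dist(ex[0], features))[:self.k]
        # return the most common hot-key set among the k nearest examples
        return Counter(ks for _, ks in nearest).most_common(1)[0][0]
```

Periodic retraining then amounts to rebuilding the example set from recent transactions so that the mapping tracks workload drift.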

The online SMF tracks only in-flight transactions and ignores non-hot keys when computing makespan. For each hot key, it tracks just the latest conflicting operation. This cuts memory and compute cost while retaining most of the scheduling benefit. The scheduler runs in O(nk) per batch, where n is the number of in-flight transactions and k is a small constant (e.g., 5).

MVSchedO adapts MVTSO to enforce SMF's chosen ordering. MVTSO already assigns timestamps and allows maximal concurrency, but it typically assigns timestamps by arrival order and does not prevent later-timestamp operations from running before earlier ones (causing aborts). MVSchedO enforces SMF's order by feeding SMF's timestamps to MVTSO and delaying hot-key operations until prior scheduled operations finish. The rest of MVTSO's read/write/commit logic remains unchanged. (In Algorithm 1, only the asterisk-marked lines differ from the original MVTSO algorithm.) Non-hot keys run as usual. If predictions are wrong, blocked transactions resume when the awaited operations finish or abort. This keeps enforcement overhead low and ensures the maximum concurrency possible under the chosen schedule.
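The delay-until-scheduled-predecessors-finish idea can be sketched for a single hot key. This is a hypothetical single-threaded simplification of the enforcement rule, not the paper's Algorithm 1:

```python
class HotKeyGate:
    """Sketch of MVSchedO-style order enforcement for one hot key:
    operations carry SMF-assigned timestamps and must run in that
    order, so an op may only run when every earlier scheduled op on
    the key has finished (committed or aborted)."""
    def __init__(self, scheduled_ts):
        # SMF's planned order of timestamps for this key, oldest first
        self.pending = list(scheduled_ts)

    def can_run(self, ts):
        # runnable only when it is the oldest unfinished scheduled op
        return bool(self.pending) and self.pending[0] == ts

    def finished(self, ts):
        # on commit or abort, drop the op and unblock its successor
        self.pending.remove(ts)
```

Under this rule, a mispredicted or aborted predecessor still unblocks its successors via `finished`, which matches the paper's point that wrong predictions cause waiting rather than incorrect execution.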

MVSchedO preserves serializability because any execution it allows is a valid MVTSO execution under some arrival order. SMF only changes timestamp assignment. The paper argues MVSchedO achieves optimal concurrency for the given schedule with early write visibility.


Evaluation

The evaluation tries to address these questions: (1) how much does scheduling help, (2) what are the overheads, and (3) how does SMF compare to other search methods. They implement R-SMF in RocksDB, plus bolt-on versions (SMF-OCC, SMF-Lock) and a TAO prototype. Workloads include Epinions, SmallBank, TPC-C, TAOBench, and YCSB. Experiments use separate client/server VMs, closed-loop clients, and averaged results.

R-SMF shows big gains on skewed and mixed workloads: up to 3.9x throughput over RocksDB OCC and 3.2x lower tail latency on benchmarks and TAO. SMF reduces wasted work and aborts by picking low-conflict orders; MVSchedO prevents aborts from out-of-order execution after timestamp assignment. The bolt-on SMF variants also help, though less. Abort rates drop sharply across workloads.

Overhead is small when contention is low (<5% throughput drop) since most transactions skip the hot-key logic. At medium and high contention, R-SMF improves throughput significantly. I like this "pay as you need" approach. Classifier accuracy seems to be the key: when hints are good, performance improves; with poor hints (50% wrong), performance can drop. The paper argues that the system can fall back to FIFO if schedule quality degrades.



Comparison to Morty

Morty (which we had reviewed recently) and R-SMF both integrate ordering into concurrency control and build on MVTSO ideas, but they differ in approach. R-SMF focuses on avoiding conflicts up front: it predicts hot keys, uses SMF to pick a low-conflict order, and enforces it for those keys via MVSchedO. Execution sticks to the schedule from the start, so there's no speculative work. Mispredictions just cause some idle waiting. This works best when hot-key prediction is accurate and contention is skewed, and it can fall back to FIFO if needed.

Morty instead uses a speculate-and-repair model for geo-replicated, interactive workloads. It assigns a speculative total order at arrival but allows reads from uncommitted data to boost concurrency. At commit-time validation, if a read was stale, Morty rewinds to that point and re-executes only the affected part. This handles mispredictions naturally and works well in high-latency WAN settings, where speculative overlap hides delays.

I think a natural extension would be to combine the two: use R-SMF's SMF+MVSchedO to avoid most conflicts through proactive hot-key scheduling, but when unexpected conflicts slip through (due to mispredictions or non-hot-key contention) apply Morty-style selective re-execution instead of full aborts. This hybrid could cut abort rates further and preserve more work per transaction when hot-key prediction accuracy degrades.
