Transaction Healing: Scaling Optimistic Concurrency Control on Multicores

This paper from SIGMOD 2016 proposes a transaction healing approach to improve the scalability of Optimistic Concurrency Control (OCC) in main-memory OLTP systems running on multicore architectures. Instead of discarding the entire execution when validation fails, the system repairs only the inconsistent operations to improve throughput in high-contention scenarios.

If this sounds familiar, it's because we recently reviewed the Morty paper from EuroSys 2023, which applied healing ideas to interactive transactions using continuations to support re-execution. This 2016 Transaction Healing paper is scoped to static stored procedures, and focuses more on integrating healing into OCC for stored procedures. 


Key Ideas

OCC works well under low contention because it separates reads from writes and  keeps critical sections short (only for validation). But under high contention, especially in workloads with skewed access patterns (like Zipfian distributions), transactions are frequently invalidated by concurrent updates. The naive OCC response of abort and restart leads to wasting CPU cycles and degrading cache locality.

Transaction healing aims to address this problem by observing/betting that most validation failures affect only a subset of a transaction's operations. If only the affected operations can be detected and recovered, the system can avoid redoing the entire transaction. They implement this by leveraging two components.

First, a static analysis phase extracts operation dependencies from the stored procedure a priori. The dependency analysis distinguishes between two types of relations: key-dependencies, where the result of one operation determines the lookup key for another; and value-dependencies, where the value produced by one operation is used in a subsequent one. With this graph in hand, transaction healing can surgically repair any non-serializable operation at runtime.

Second, a runtime access cache, maintained per thread, tracks the behavior of each executed operation (its inputs, outputs, effects, and the memory addresses it accessed) and identifies conflicted parts of a transaction at runtime. The access cache supports this by recording memory addresses (avoiding repeated index lookups) and allowing efficient reuse of unaffected results.


Transaction healing

The healing process is triggered during the validation phase, when an inconsistency is detected in the read/write set. Rather than aborting immediately, the system identifies the earliest affected operation (using its dependency graph), and restores it. If the operation is value-dependent, healing updates its effects based on cached inputs and outputs. If it's key-dependent, a re-execution is necessary since the accessed record may change. The healing propagates forward through the dependency graph, recursively restoring all operations affected by the initial inconsistency.

The healing mechanism is built to preserve serializability. Validation acquires locks in a globally consistent order (e.g., sorted by memory address) to avoid deadlocks. If during healing a lock must be acquired out of order (e.g., due to new dependencies introduced by re-executed operations), the transaction is aborted in order not to risk a deadlock. The paper says this situation is rare due to validation-order optimizations. Despite occasional aborts, transaction healing guarantees forward progress and eventual termination: each transaction's read/write set is finite and every element is validated at most once, which ensures that healing either succeeds or fails definitively.


Evaluation Highlights

They implemented a C++ in-memory database engine, THEDB, to test these ideas. THEDB employs LLVM to perform static dependency analysis on stored procedures and includes support for standard database features like inserts, deletes, and range queries (the latter protected against phantoms via B+-tree versioning, as in Silo). The authors evaluate THEDB on a 48-core AMD machine using two common benchmarks: TPC-C and Smallbank. THEDB is compared against five systems: variants of OCC (including Silo-style), 2PL, a hybrid OCC-2PL approach, and a deterministic partitioned system.

The results show that, under high contention, THEDB significantly outperforms the alternatives, achieving up to 6.2x higher throughput than Silo and approaching the performance of an idealized OCC system with validation disabled. This shows that transaction healing adds minimal overhead and successfully eliminates the restart costs that dominate OCC's performance under load. Moreover, THEDB maintains stable throughput as contention increases (e.g., under more skewed Zipfian distributions), while traditional OCC and Silo degrade rapidly. Scalability is also great up to 48 cores.


Discussion

**** What are the limitations of static analysis used?

Transaction healing proposed here is limited to stored procedures because it relies on static dependency extraction. Unlike Morty, which handles interactive transactions using runtime continuations, this work cannot deal with dynamic control flow or unknown transaction logic at runtime. As a result, ad-hoc queries revert to standard OCC, where any healing benefit is lost.

On the other hand, there is some subtlety here. Transaction healing does not require read/write sets to be declared in advance as the deterministic systems like Calvin do. Deterministic systems must know the exact records a transaction will access before it begins execution, so they can assign transactions to partitions and establish a global execution order. Transaction healing avoids this rigidity. It doesn't need to know which specific records a transaction will access ahead of time. Instead, it relies on static analysis to extract the structure of the transaction logic, namely which operations depend on which others. These dependencies, such as key or value dependencies between operations, are known statically because the transaction logic is written as a stored procedure. But the actual keys and values involved are discovered dynamically as the transaction executes. The system uses an access cache to record which memory locations were read or written, and validation happens afterward. This flexibility allows transaction healing to support dynamic, cross-partition access patterns without prior declaration.


**** How does this compare with Morty?

Transaction Healing is designed for in-memory OLTP systems running with OCC on multicore machines, where the workload consists of static stored procedures. Morty, in contrast, is built for a distributed geo-replicated system and handles interactive transactions with dynamic control flow. It uses MVTSO, with speculative execution and a priori ordering. Unlike THEDB, Morty allows transactions to read from uncommitted versions, exposing concurrency that traditional systems suppress. It tracks execution through continuation-passing style (CPS) in order to make control dependencies explicit and enable partial re-execution of logic branches. While transaction healing employed LLVM to automatically perform static dependency analysis on stored procedures, Morty did not automate translation of transaction program to CPS program. Finally, since it is distributed and deployed over WAN, Morty integrates concurrency control with replication to reduce latency and uses quorum voting to maintain fault-tolerant correctness without centralized logging.


Comments

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Advice to the young

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems