OLTP Through the Looking Glass, and What We Found There

This paper appeared in Sigmod 2008. The goal of the paper is to rethink the Online Transaction Processing (OLTP) database architecture (which remained largely unchanged from late 1970s) and to test out a streamlined/stripped-out in-memory database architecture. To this end, the paper dissects where time is spent inside of a single node OLTP database system, and carefully measures the performance of various stripped down variants of that system.


Motivation

OLTP databases were optimized for the computer technology of the 1970s-80s.  Multi-threaded execution was adopted to allow other transactions to make progress while one transaction waits for data to be fetched from the slow, rotating disk drive. Locking-based concurrency control was adopted since conflicts were inevitable in these disk-bound transactions, and since the small dataset sizes and the workloads of the time resulted in extremely "hot" keys.

The computing world now is quite a different environment. Fast SSDs and ample main-memory space have made transaction execution extremely fast, and reduced the likelyhood of conflicts. Moreover datasets got bigger, and keys got colder, and this made optimistic concurrency control (OCC) a good choice. Furthermore, as long running (analytical) queries are now serviced by specialized warehouses, we find that OLTP queries have gotten shorter and faster. 

Motivated by these shifts, the paper investigates alternative database architectures that may be better suited to modern hardware and workload characteristics, and attempts to estimate the performance of these new approaches.


Streamlined OLTP

To explore where time is spent inside of a single node OLTP database system, the authors consider the open source database system, Shore (http://www.cs.wisc.edu/shore/), and benchmark it using a subset of the TPC-C. They start modifying the original implementation, which ran about 640 transactions per second (TPS), by removing different features from the engine one at a time, until it is left with a tiny kernel (single-threaded, lock-free, main memory database system without recovery), which processes 12700 TPS.

That is 20 folds improvement in throughput, but with substantially reduced functionality. They end up removing:

  • Logging. Assembling log records and tracking down all changes in database structures slows performance. Logging may not be necessary if recoverability is not a requirement or if recoverability is provided through other means (e.g., other sites on the network).
  • Locking. Traditional two-phase locking poses a sizeable overhead since all accesses to database structures are governed by a separate entity, the Lock Manager.
  • Latching. In a multi-threaded database, many data structures have to be latched before they can be accessed. Removing this feature and going to a single-threaded approach has a noticeable performance impact.
  • Buffer management. A main memory database system does not need to access pages through a buffer pool, eliminating a level of indirection on every record access.

The white box below buffer manager represents the remaining tiny kernel, which still runs the transactions, but uses about 1/15th of the instructions of the original system. At this point, transactions are not concurrent at all. We get serializable isolation because each transaction is in fact executed to completion one at a time and does not overlap with another transaction. What they call hand-coded optimizations is accelerating the B-tree code by hand-coding node searches to optimize for the common case where keys are uncompressed integers (labeled "B-tree keys" in the other figures).

They find that stripping out any one of the components of the system has a relatively small benefit on the overall performance. For example, main memory optimizations improved the performance of Shore by about 30%, which is significant but unlikely to motivate the major database vendors to re-engineer their systems. Similar gains would be obtained by eliminating just latching or switching to a single-threaded, one-transaction-at-a-time approach.

The learning from these experiments is that, unless one strips out all of these components, the performance of a main memory-optimized database (or a database without transactions, or one without logging) is unlikely to be much better than a conventional database where most of the data fit into RAM. On the other hand, a totally stripped out kernel (after a lot of engineering and invasive surgery) would forgo a lot of the generality of OLTP transactions and that would also pose other challenges. This somewhat tapers off the initial wild enthusiasm of the paper for fully in-memory database architectures.


Discussion

Although this paper took a bullish stance for fully in-memory database architectures, that vision did not come to full fruition. Despite significant progress main memory capacities and price reduction in the 16 years since the publication of the paper, the economics ultimately did not favor fully in-memory databases. Datasets continued to grow larger and become "colder" (less frequently accessed), and it never became economically viable to keep all data in memory, except maybe for some special usecases. The proposed approach of using a stripped-down minimal kernel to maximize performance benefits could not be fully realized, as not all data could be kept in the faster, but more expensive, main memory.

Instead SSDs emerged as a much cheaper and favorable storage option for databases, and the database research has shifted its focus towards memory-optimized database designs, which pragmatically tried to balance the use of fast in-memory structures and cost-effective secondary storage. In this blog we covered a couple of these

The Chardonnay paper, which we reviewed couple days ago, also explored a memory-optimized execution strategy using the concept of a "dry-run" mode. It made some radical/unconventional choices with the epoch idea, but on the plus side it required less invasive surgery for implementation/deployment of OLTP databases, and best of all it provided a distributed database, in contrast to the single node architectures explored in this Sigmod paper, and the Hekaton and Hyper/Umbra work. 

Another point raised in the paper, but not fully explored, is the limitations of the B-tree data structure. With the benefit of over a decade and a half of hindsight, we can now see how log-structured merge (LSM) trees may have been a more suitable data organization approach. The LSM-tree approach gained significant prominence in recent years (e.g., RocksDB), as it offers significant advantages over B-trees by leveraging modern hardware characteristics, and in particular arranging better collaboration between main-memory and SSDs.

The paper also failed to discuss the importance of multiversion concurrency control (MVCC) approach. As storage and memory have become increasingly affordable, MVCC has emerged as a crucial component of modern database systems. MVCC allows read and write operations to proceed without blocking each other, and this way improves significantly over traditional locking-based concurrency control.

A compelling follow-up to the architectural thinking explored in this paper is Pat Helland's recent work "Scalable OLTP in the Cloud: What's the BIG DEAL?". Pat builds upon the trends of LSM-based storage and MVCC, and combines them into a cloud-optimized distributed architecture that effectively leverages modern hardware to deliver high-performance transaction processing.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book