Scalable OLTP in the Cloud: What’s the BIG DEAL?

This paper is from Pat Helland, the apostate philosopher of database systems, overall a superb person, and a good friend of mine. The paper appeared this week at CIDR'24. (Check out the program for other interesting papers). The motivating question behind this work is: "What are the asymptotic limits to scale for cloud OLTP (OnLine Transaction Processing) systems?" Pat says that the CIDR 2023 paper "Is Scalable OLTP in the Cloud a Solved Problem?" prompted this question. 

The answer to the question? Pat says that the answer lies in the joint responsibility of database and the application. If you know of Pat's work, which I have summarized several in this blog, you would know that Pat has been advocating along these lines before. But this paper provides a very crisp, specific, concrete answer. Read on for my summary of the paper.

Disclaimer: This is a wisdom and technical information/detail packed 13-page paper, so I will try my best to summarize the salient points. I will be using text from the paper to explain/summarize it. (Don't taze me bro!) 

Snapshot Isolation (SI) is a BIG DEAL

The database and the application have a BIG DEAL: their isolation semantics! In particular, snapshot isolation (SI) is the sweet spot. At this point, I got a nice database history lesson on how the isolation semantics evolved. I would have guessed the semantics had become more strict over time. No, on the contrary, they evolved to be more relaxed to meet performance and scalability expectations. And SI does hit a sweet point in that it still provides the user good isolation guarantees without jeopardizing the scaling behavior of the database by requiring it to serialize everything. 

In the rest of the paper, keep in mind that, an OLTP system is defined as a domain-specific application using a RCSI (READ COMMITTED SNAPSHOT ISOLATION) SQL database to provide transactions across many concurrent users.

The BIG DEAL splits the scaling responsibilities between the database and the application.

  • Scalable DBs don’t coordinate across disjoint TXs updating different keys.
  • Scalable apps don’t concurrently update the same key.

The big deal provides guarantees from the DB to the App. A scalable application can read all it wants. Updates to disjoint records don’t coordinate across TXs. Row-locks on disjoint records don’t coordinate across TXs.

Applications must tolerate these big deal disclaimers. Reads return snapshots: Records have no "current" value. There is no NOW in a BIG DEAL database! Transactions may abort any time but not too often. SELECT with SKIP LOCKED may subset the set of qualifying records as it returns results.

This means applications should change business behavior in order to scale. They can only provide a fuzzy/blurry view of the "current" state/changes. So, apps introduce ambiguity in biz domain specific ways: online retail makes ambiguous promises such as "Usually ships in 24 hours". And apps provide delayed truth: finances of a large company may take days to summarize. Many OLTP apps aggregate values synchronously as they interact with humans. Public TPC benchmarks (e.g., TPC-A, TPC-B, and TPC-C) mandated synchronous aggregations. But, as applications scale they should rethink concentrating the aggregated values of business state in dedicated records. By slowly  and asynchronously aggregating these business state, the application can scale in a domain-specific manner.

Today's OLTP databases don't scale

Before suggesting a hypothetical scalable database that satisfies the database side of the big deal, Pat shows us why today’s databases don’t scale!

In today's MVCC databases, reads & writes fight to access the "current" value of a record. The current version has a home location (a partition, server, or a B+ tree) holding the most recently committed version of the record or perhaps an uncommitted version. To update a record, exclusive access to the record's home is required. This causes infighting, contention, and coordination between the updating TX and any concurrent reading TXs.

Even reads contend with each other, since these implementations force MVCC readers to start out looking at the latest version of a key first. Coordination may also be needed to access neighboring records. Accessing key-ranges in B+Trees or similar data structures that may be changing needs cross-transaction coordination.

Readers coordinate with writers. Writers coordinate with readers. Readers coordinate with other readers!

Having a home for a record also makes online repartitioning/sharding (which is required for scalability) very difficult. Moving record keys from one partition to another is complex and impacts application availability.

To address these challenges, Pat proposes a prototype design. The database is structured so that there is no pre-assigned home for a record per key. Unlike partitioned DBs, this allows the database to seamlessly adapt to workload changes.

I liken this to the miscellaneous manifesto or how instead of neatly organizing/allocating everything a place (which inevitably fails, requiring incessant re-orgs), embracing the messiness and using a search engine to get to information quickly.

Rethinking OLTP databases

The architecture is based on a design Pat explored in a previous work. The work is very technical, and I missed the nuance and contributions of it because I didn't read through the appendix about details.

Owner servers verify that concurrent transactions have not created any conflicting updates for each key row-locked or updated by the TX that optimistically hopes to commit. Owner servers partitioned by both key-range and time-range. Repartitioning happens dynamically to accommodate scale. 

Worker servers are also horizontally scalable, and each have their own transaction log. As TX load increases, workers are added. Each TX happens at a single worker server. The worker servers accept connections from app servers, perform transactions & their queries, commit transactions to their per-worker log, and periodically flush committed new record-versions to the LSM (log structured merge tree).

LSM servers accept flushes from workers and incorporate them into the orderly past stored in the LSM. Record-versions are organized first by time, second by key. Each LSM layer contains record-versions for a band of time. With an LSM, the past scales without coordinating across disjoint transactions reading and updating!

Transaction execution and commit

We now deep dive to workers and owners as they are the most significant components in this OLTP architecture. The owners do the concurrency control (the adjudication of transactions with respect to other concurrent transactions), but the workers do the actual work of the transaction. The transaction is centralized in the worker's log. The workers' logs are ingested by LSM servers for later consumption and durability.

The worker will accept incoming connections from application servers, and plan/execute SQL statements: Reading with snapshots by key or key-range, acquiring row-locks using their unique record key, and updating records by their unique key. The worker will guess a future time to hopefully commit after being verified that updates and row-locks don’t conflict with concurrent TXs. The worker will then log the transaction’s updates & commit record in its local transaction log, which will then be fed into LSM servers.

As a commit-time for a transaction is guessed by the worker, every update and row-lock must be verified for no conflicting updates from snapshot to commit by the owner servers. As incoming proposed-updates and verify-locks arrive, they include a proposed-commit-time. Owner-servers align commit-time for records & workers. An incoming request from a worker hopefully arrives at the owner-server before its local clock has reached the proposed-commit-time. If it arrives after commit-time, owner-server returns an error and the TX aborts. If it arrives before commit-time, the owner-server waits until its local clock reaches commit-time.

What are row-locks you ask?

Row-locks allow the application to ask the database for help with concurrency across transactions. Traffic cops provide pessimistic concurrency control. They will stall later transactions if they acquire a row-lock held by an earlier transaction. This pessimistic ordering of transactions may be violated when failures happen. Competing transactions usually wait to allow the lock holder to go first but that may be flawed. So correctness will be enforced by OCC prior to commit. Of course, row-locks are moot when scalable apps avoid concurrent updates to the same records. But if the app experience concurrent updates to the same records, row-locks can help with the liveness of transactions when the DB uses them to function as a traffic cop.

Ok, let's wrap up the transaction execution discussion by talking about how owners can be horizontally scaled. Owners can close for new business and direct new proposed-updates elsewhere. An owner closed for new business only accept worker requests for snapshot reads in their rectangle of key & time ranges, proposed updates, and notifications of transaction outcome. In contrast an owner open for new business also allow new proposed-updates, and new verify-locks.

Massive Scale: It’s About Time!

As we have seen, the DB leverages time to provide snapshots, commits, and external consistency. External Consistency ensures new incoming requests see all previously exposed data, even by other database connections. That means, snapshot reads from new incoming work must be after all committed work previously visible outside the database.

By using current time, T-now, as the snapshot time, this is easy. But this would get trickier and more complex as the geographic scope of a DB grows past a single datacenter.

Overall, this prototype database architecture is a big vindication for using time in systems. (Some of these ideas have been explored in Pat's earlier paper, under seniority and retirement.) Everything in the database is versioned by the record-version commit time. The database organizes data by its creation time to achieve scaling. Reads are old record-versions as of a past snapshot. Row-locks ensure locked records unchanged until commit time. And updates materialize as new record-versions for later snapshots.


Franck Pachot said…
This article is based on the claim that MVCC reads have to compete with concurrent writes, but that is incorrect. In fact, it's the opposite. Non-MVCC databases had to acquire a lock to read because they only have access to the current version.

In contrast, MVCC databases can read from a past state. The past doesn't change because nobody can modify the past version of anything (except for Terminator). That's why in MVCC you can read the past versions without acquiring any lock on them. MVCC databases can even read from the read-only standby replica to scale out the reads, ignoring concurrent reads on the primary.
Murat said…
Yes, MVCC helps. But, as Pat shows in the paper, and outlines in Table 1, in the existing MVCC implementations, since the current record is in the path of reaching the snapshot reads, there is infighting and contention.
Franck Pachot said…
The problem is that the author doesn't show it but claims, maybe from SQL Server implementation (which I don't know in detail but all terms used are SQL Server ones to it, and he ignores that Oracle and PostgreSQL store rows in Heap Tables, not in the primary key).

He claims that the current row must have the location of the previous one. It is the opposite in PostgreSQL: When the current version is updated, it adds a pointer in the previous one. This is additional work for the writers, but readers do not have to go to the current version except if their read time is higher than the past version.

In Oracle, the current row points to the transaction log to build a previous image but only needs a short shared latch. Then, the buffer clone can be used by many other consistent reads. The buffer cache holds 5 versions by default, which covers most of the queries read times without rereading the current buffer. I've seen contention on rollback segments but never on the current buffer (XCUR in V$BH). Good explanation here: said…
I believe I understand what you are saying, and I entirely agree the subtly of implementation matters here since the author's point is itself subtle. My charitable interpretation would be the following.

If we consider a record to include both an index entry & heap tuple, it is correct to state new index entries are replaced in Oracle & placed next to older versions in Postgres. These would be the mechanisms of contention in Table 1.

Not all varieties of reads will contend via this mechanism.

That being said, I can understand why this point was omitted from the paper since any read that doesn't use an index will result in a larger scope of access and more necessary coordination.
sb said…
Getting cooperation from the OLTP app side is interestingly nothing mentioned (atleast in your summary), now that becomes relevant when you think of the key, is it stock itself or the trade/order IDs? A basket of NiFTY 500 order processing will swarm the workers to grow rapidly, while owners adjusting its key ranges to the workers? I suppose splitting the whole transaction work into multiple dedicated servers is ignoring the network chatter required (imagine just clock sync overhead for a nanosecond future commit timestamp in your summary) to scale. MVCC and realtime read/write with trade guarantees is hard, really hard, I mean will you place another million dollar without knowing the previous one really went through or not? No overlapping key updates (single trades) days are ancient, TPCC is just primary exam in terms of today's basket trading and search for sub-nanosecond latency.
Having developed such apps/ similar databases and now reinventing myself for low latency programming, I've few ideas that be worth something that sees through both sides of the aforementioned "coordination required to scale".

Popular posts from this blog

The end of a myth: Distributed transactions can scale

Foundational distributed systems papers

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Metastable failures in the wild

SIGMOD panel: Future of Database System Architectures

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Dude, where's my Emacs?

There is plenty of room at the bottom