An Evaluation of Distributed Concurrency Control

This VLDB'17 paper investigates the effects of distributed concurrency control protocols on performance of distributed transactional databases. They evaluate six protocols: two-phase locking with NO_WAIT and WAIT_DIE flavors, timestamp ordering (TIMESTAMP), multi-version concurrency control (MVCC), optimistic concurrency control (OCC), and a deterministic protocol Calvin. For the evaluations, they use an in-memory distributed database evaluation framework called Deneva (, providing an apples-to-apples comparison between each protocol.

This is a delightful paper. After we covered several distributed databases in recent posts, this paper helps us consider the performance implications of concurrency control schemes they use and how they would fare under different workloads. The paper is very well written and easy to follow. In my summary I lifted a lot of the text from the paper with little editing.

The six transaction protocols considered

The paper investigates serializable execution of transactions. Under this model, transactions behave as if they were executed one-at-a-time against a single copy of database state.

Two-phase locking (2PL) has two-phases (duh!). In the first phase, known as the growing phase, a transaction acquires a lock for any record that it needs to access. Locks are acquired in either shared/read or exclusive/write mode. The transaction enters the second phase of 2PL, called the shrinking phase, by releasing one of its locks. Once the transaction enters this phase, it is not allowed to acquire new locks, but it can still perform read or write operations on any object for which it still holds the lock. In their implementation, the DBMS holds locks until the transaction commits or aborts (i.e., strict 2PL) and they use record-level locks to maximize concurrency.

2PL implementations differ on how to handle deadlocks. In NO_WAIT, if a transaction tries to access a locked record and the lock mode is not compatible with the requested mode, then the DBMS aborts the transaction that is requesting the lock. The WAIT_DIE protocol is similar except it attempts to avoid aborts by ordering transactions based on the timestamps that the DBMS assigned them when they started. The DBMS queues a conflicting transaction as long as its timestamp is smaller (older) than any of the transactions that currently own the lock.

Another family of concurrency control protocols relies on timestamps. In the most basic algorithm of this class (TIMESTAMP), transaction's operations are ordered by their assigned timestamp. The transaction's timestamp dictates access to a record. This protocol avoids deadlock by aborting transactions with a smaller (older) timestamp than the transaction that currently has an exclusive hold on a record.

In contrast, multi-version concurrency control (MVCC) maintains several timestamped copies of each record. This enables reads and writes to proceed with minimal conflict, since reads can access older copies of the data if a write is not committed.

Optimistic concurrency control (OCC) executes transactions concurrently and determines whether the result of transaction execution was in fact serializable at the time of commit. That is, before committing, the DBMS validates the transaction against all the transactions that committed since the transaction started or are currently in the validation phase. If a transaction can commit, the DBMS copies the local writes to the database and results are returned to the client. Otherwise, the transaction aborts and destroys any local copies of the data.

Deterministic scheduling (e.g., CALVIN) is a recent proposal as an alternative to traditional concurrency control protocols. In CALVIN, centralized coordinators decide on a deterministic order of the transactions and eliminate the need for coordination between servers required in other concurrency control protocols. Thus, CALVIN does not need to use an atomic commitment protocol to determine the fate of a transaction.

All of the protocols (except for CALVIN) employ the two-phase commit (2PC) protocol to ensure that either all servers commit or none do (atomic commit). The system only requires 2PC for transactions that perform updates on multiple partitions. Read-only transactions and single-partition transactions skip this step and send responses immediately back to the client. OCC is an exception to the read-only transaction rule, since it must validate its reads as well.

System overview

Figure 1 shows the high-level architecture of Deneva, the paper's evaluation framework. Deneva is a shared-nothing system where each server is responsible for one or more partitions of the data, and no partition is managed by more than one server. Deneva supports distributed transactions across partitions but it does not provide replication or fault tolerance; thus, this investigation is limited to failure-free scenarios. It arranges clients and servers in a fully connected topology over a set of deployed cloud computing instances. It uses nanomsg, a scalable and thread-safe socket library to communicate between instances using TCP/IP.

All transactions in Deneva execute as stored procedures that run on the servers. Each procedure contains program logic intermixed with queries that read or update records in the database. When a server receives a new transaction request, it invokes the stored procedure, which will then generate queries that access data either on its local partition or a remote partition managed by another server. If an active transaction aborts due to protocol behavior, the coordinator sends a message to the other participating servers to abort. They will each roll back any changes that the transaction made to its local partition. The coordinator then puts the request back into its work queue an exponential back-off penalty (starting at 10 ms).


The OLTP transactions considered are (1) short-lived, (2) access a small number of records at a time, and (3) are repeatedly executed with different input parameters. The paper uses three benchmarks for the evaluation.

Most of the experiments are performed with the Yahoo! Cloud Serving Benchmark (YCSB). YCSB has a single table with a primary key and 10 additional columns with 100 B of random characters. For the experiments, they use a YCSB table of ~16 million records per partition, which represents a database size of ~16 GB per node. The table is partitioned by the primary key using hash partitioning. Each transaction in YCSB accesses 10 records (unless otherwise stated) that are a combination of independent read and update operations that occur in random order. Data access follows a Zipfian distribution, where the frequency of access to sets of hot records is tuned using a skew parameter (theta). When theta is 0, data is accessed with uniform frequency, and when it is 0.9 it is extremely skewed.

TPC-C benchmark models a warehouse order processing application and is the industry standard for evaluating OLTP databases. It contains nine tables that are each partitioned by warehouse ID, except for the item table that is read-only and replicated at every server. They support the two transactions in TPC-C that comprise 88% of the default workload mix: Payment and NewOrder. The other transactions require functionality, such as scans, that are currently unsupported in Deneva, so they are omitted. The Payment transaction accesses at most two partitions. The first step in the transaction is to update payment amounts for the associated warehouse and district. Every Payment transaction requests exclusive access to its home warehouse. The customer's information is then updated in the second part of the transaction. The customer belongs to remote warehouse with a 15% probability. The first part of the NewOrder transaction reads its home warehouse and district records, then it updates the district record. In the second part, the transaction updates 5–15 items in the stock table. Overall, 99% of all items updated in a transaction are local to its home partition, while 1% are at a remote partition. This means that ∼10% of all NewOrder transactions are multi-partition.

The Product-Parts-Supplier workload (PPS) is another OLTP benchmark that contains transactions that execute foreign key lookups. It contains five tables: one each for products, parts, and suppliers, partitioned by their respective primary key IDs, a table that maps products to parts that they use and a table that maps suppliers to parts they supply. The benchmark assigns parts to products and suppliers randomly with a uniform distribution. The benchmark's workload is comprised of a mix of single- partition and multi-partition transactions. The multi-partition Order- Product transaction first retrieves the parts for a product and then decrements their stock quantities. The LookupProduct transaction is similar in that it retrieves the parts for a product and their stock quantities, but without updating any record. Both transaction types execute one or more foreign key look-ups that each may span multiple partitions.


All experiments, until the last subsection labeled TPC-C and PSP, uses the YCSB microbenchmarks. The graphs use the same consistent legend/key throughout.

Contention. Figure 2 shows that the protocols' throughput is relatively unaffected by skew for theta values up to ~0.5. After this point, most of them decline in performance but at different rates. Once theta reaches ~0.8, all but one of them converge to the same low performance. CALVIN is the only protocol to maintain good performance despite high skew. Since all of the transactions’ data accesses are independent, CALVIN does not need to send messages between the read and execution phases. Thus, unlike the other protocols, it does not hold locks while waiting for messages from remote servers.

OCC performs worse than the other non-deterministic protocols under low contention due to the overheads of validation and copying during transaction execution. At higher levels of contention, however, the benefit of tolerating more conflicts and thus avoiding unnecessary aborts outweighs these overheads. MVCC and TIMESTAMP have a steep performance degradation when theta reaches ~0.5 because they block newer transactions that conflict until the older ones commit. Although some transactions avoid this in MVCC by reading older versions, this requires that the reading transaction is older than all transactions with non-committed writes to the data, which is often not the case in this workload.

Update rate.
For this experiment, they vary the percentage of YCSB transactions that invoke updates while keeping the server count constant at 16 nodes. Each transaction accesses 10 different records. They use a medium skew setting (theta=0.6). The results in Figure 3 show that the performance of most of the protocols declines as the percentage of update transactions increases.

Multi-Partition Transactions.
Figure 4 shows the performance of the protocols against varying the number of partitions each transaction accesses. With the exception of CALVIN, the protocols' throughput plummets when transactions touch more than one partition. From two to four partitions, performance drops by 12–40%. This degradation is due to two reasons: (1) the overhead of sending remote requests and resuming transactions during execution, (2) the overhead of 2PC and the impact of holding locks for multiple round trip times between transaction execution and 2PC. CALVIN's performance drop from single- to multi-partition transactions is not as extreme as the other protocols. CALVIN's servers synchronize at every epoch. Even no multi-partition transactions arrive from remote sequencers, a scheduler must wait until it receives an acknowledgement from each sequencer before proceeding to ensure deterministic transaction ordering. If some sequencers lag behind other node's, this can cause slowdown in the system. Thus CALVIN does not scale as well as other protocols to 16 nodes when Deneva is only executing single-partition transactions.

Scalability. The results in Figure 5a show that all protocols allow reads to proceed without blocking, so these results are close to the throughput that is achieved without any concurrency control. The throughputs of the protocols are nearly identical except for OCC and CALVIN. OCC's throughput is limited by the overhead of copying items for reference during the validation phase and the cost of the validation phase itself. CALVIN has the lowest throughput due to its bottleneck at the scheduler.

In Figure 5b, we see that with a mixed read/write workload, the landscape begins to change due to contention. Though all protocols improved throughput at 64 nodes over a single server, the gains are limited. At 64 servers, the protocols all improve their single-server performance by 1.7–3.8×. Although CALVIN shows the best improvement over its single-node performance, NO_WAIT has the best overall performance for all numbers of servers.

In comparing the breakdown in Figure 6b with the read-only results in Figure 6a, we see that the IDLE time increases for both MVCC and TIMESTAMP. This is because these protocols buffer more transactions while waiting for older transactions to complete. It is shocking how little "useful work" is done! These concurrency control protocols, like knowledge workers, have very little useful work to show for the amount of time spent due to the huge preparation costs involved.

Data dependent aborts. In the previous experiments, CALVIN had an advantage because it does not need to use 2PC or other message passing beyond communicating with the sequencer. Since YCSB reads and writes are independent, and transactions do not conditionally abort based on values read during the transaction, CALVIN did not need to exchange any messages between servers between its read and write phase. To measure the effect of data-dependent aborts, the team adds a conditional statement to YCSB to model the execution logic associated with making an abort decision. Though most protocols' throughput were only affected by 2–10%, CALVIN experienced a 36% decrease in throughput when the workload was run on 16 servers with medium contention (theta=0.6, 50% update transactions) compared to the original YCSB workload. They also found that as contention increases from theta=0.8 to theta=0.9, CALVIN's throughput declines from 73k to 19k transactions per second.

TPC-C and PSP benchmarks
. Finally some experiments were also run with the more realistic workloads in TPC-C and PSP benchmarks. Here we see that Calvin loses its edge compared to the other protocols.


Under low contention and low update workloads, all of the protocols perform well. But for workloads with high update rates, two-phase locking with NO_WAIT outperforms other non-deterministic protocols by up to 54%, and for workloads with high contention, it outperforms them by up to 78%. CALVIN outperforms all other protocols under the highest contention levels and update rates for simple transactions, but when workloads include realistic OLTP workloads or transactions with foreign key lookups, CALVIN is the only one whose performance does not scale as the cluster size increases. Calvin, like Calvin Newport, does batching and avoid disrupting context switches, but that strategy fails for realistic OLTP workloads.

The results show that the scalability of all of the protocols is limited. The commit protocol is one of the primary factors that affects throughput. Most of the protocols require two round trips before committing. CALVIN was designed specifically to try to mitigate the affects of 2PC, but if there is the possibility that a transaction will abort, a round of messages must be broadcast and collected before transactions can commit at their local node and release their locks. In all cases, locks held during 2PC or while waiting for messages prevent forward progress in the application.

Another major bottleneck is data contention. When there is no contention (i.e., the workload includes small number of updates) using a protocol with minimal overhead, such as 2PL or a TIMESTAMP protocol, made the most sense. These protocols quickly degraded as the number of writes increased and past a certain threshold using OCC or CALVIN was a better choice.

This has been likely the most comprehensive evaluation of these concurrency control protocols together, but it is also important to keep in mind the limitations of the evaluation as well.

  • The evaluation framework does not factor in replication and fault-tolerance mechanisms required to implement these concurrency protocols.
  • Only serializable isolation is considered, and considering snapshot isolation would change the results significantly.
  • Client facing effects of aborted transactions are not considered. An aborted transaction is retried after a backoff, but in a real world scenario this would result in stalls due to upcoming transactions having dependence on the aborted transactions.
  • The OCC copying overhead cited in the paper seems like an implementation problem rather than fundamental protocol issue. There could be more efficient implementations of OCC, especially in the context of multiversion concurrency control.

Potential solutions

The paper concludes that to achieve truly scalable operation, distributed concurrency control solutions must seek a tighter coupling with either novel network hardware (in the local area) or applications (via data modeling and semantically-aware execution), or both. I am not too sold on the improve the network suggestion, so I include the two other suggestions here.

Another point I want to make before stopping is, why stop at these couple suggestions for improving the performance of distributed OLTP databases. Are there more solutions we can consider? Would a disaggregated/decoupled architecture help in scaling the distributed database better? What about dynamically changing concurrency control strategies per-key based on workload? What are your suggestions for improving distributed OLTP databases?

Adapt the Data Model: While distributed transactions are expensive, transactions that execute within a single node are relatively inexpensive. As a result, applications that express their transactions in a manner that is amenable to single-node execution are not subject to the penalties investigated here. A primary strategy for achieving single-node operation is to perform partitioning within the data model. For example, Helland’s entity group approach over data stored within a single, possibly hierarchical, set of data that can fit on a single server.

Seek Alternative Programming Models: Given the cost of serializability, a reasonable alternative is to seek alternatives. Often, discussions regarding non-serializable behavior present a false dichotomy between serializability and application consistency versus difficult to understand and error-prone alternatives. It is helpful to remember that serializability is a means towards achieving application- level consistency, but it is not strictly necessary; despite the fact that many existing formulations of non-serializable isolation (e.g., Read Committed isolatio and eventual consistency) are unintuitive, this does not necessarily mean that all non-serializable programmer interfaces need be unintuitive. In effect, this alternative prompts a re-investigation of so-called semantics-based concurrency control methods. Given a lack of information about the semantics of applications, serializability is, in a sense, the "optimal" strategy for guaranteeing application consistency. But given the scalability challenges we have observed, we believe it is worth investigating techniques for pushing down additional semantics into the database engine. In fact, one study observed that application developers for Web frameworks (e.g., Ruby on Rails) already use non-transactional interfaces for expressing their application consistency criteria.


Popular posts from this blog

Graviton2 and Graviton3

Foundational distributed systems papers

Learning a technical subject

Your attitude determines your success

Learning about distributed systems: where to start?

Progress beats perfect

CockroachDB: The Resilient Geo-Distributed SQL Database

Warp: Lightweight Multi-Key Transactions for Key-Value Stores

Amazon Aurora: Design Considerations + On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes

Anna: A Key-Value Store For Any Scale