Distributed Transactions at Scale in Amazon DynamoDB

This paper appeared in July at USENIX ATC 2023.

If you haven't read about the architecture and operation of DynamoDB, please first read my summary of the DynamoDB ATC 2022 paper. The big omission in that paper was a discussion of transactions. This paper remedies that. It is great to see that DynamoDB, and AWS in general, are publishing/sharing more widely than before.

Overview

A killer feature of DynamoDB is predictability at any scale. Do read Marc Brooker's post to fully appreciate this feature.

Aligned with this predictability tenet, when adding transactions to DynamoDB, the first and primary constraint was to preserve the predictable high performance of single-key reads/writes at any scale.

The second big constraint was to implement transactions using update-in-place operations, without multi-version concurrency control (MVCC). The reason was that they didn't want to muck with the storage layer, which did not support multi-versioning.

Satisfying both of the above constraints may seem like a fool's errand, as transactions are infamous for not being scalable and, without MVCC, for reducing the performance of normal operations. But the team got creative around these constraints and managed to find a saving grace. After going through customer needs and case studies, they figured out that most use-cases could be expressed by employing one-shot transactions alone. (I'll revisit this statement after explaining the DynamoDB transactional API in the next section.)

One-shot transactions are submitted as a single request, rather than through a period of back-and-forth interactive operations starting with BEGIN_TXN and ultimately committed (or aborted) with COMMIT_TXN.

The Sinfonia paper presented the systems-side benefits of one-shot transactions. The DynamoDB paper went for a different approach in order not to incur the drawbacks of the two-phase locking used in Sinfonia.

Locking restricts concurrency and can lead to deadlocks. Moreover, it requires a recovery mechanism to release locks when an application fails after acquiring locks as part of a transaction. These would violate the tenet of not interfering with single-key read/write predictability.


Instead of a locking-based approach, DynamoDB uses an optimistic concurrency control (OCC) scheme. And a classic one at that! The protocol goes all the way back to VLDB 1980! We had reviewed that paper here. Remember the timestamp ordering (TSO) algorithm? Mmm, TSO algorithm so good, yumm, yumm, yumm!

In the TSO algorithm, each transaction is assigned a timestamp on start, which defines its position in the serial order. Every object x is tagged with the timestamp of the last transaction that successfully read/wrote it. As long as transactions appear to execute at their assigned times, serializability is achieved. To realize this, if/when a transaction tries to access (read/write) an object from its future, it is aborted. And that's all, folks.
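
To make the rule concrete, here is a minimal Python sketch of classic TSO admission checks, as described above. This is my own illustration, not code from the paper:

```python
class TsoObject:
    """An object tagged with the timestamps of its last accepted read/write."""

    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0

    def try_read(self, txn_ts):
        # A read "from the past" would miss a write already serialized
        # ahead of it, so the reading transaction must abort.
        if txn_ts < self.write_ts:
            return False
        self.read_ts = max(self.read_ts, txn_ts)
        return True

    def try_write(self, txn_ts):
        # A write "from the past" would invalidate an already-accepted
        # later read or write, so the writing transaction must abort.
        if txn_ts < self.read_ts or txn_ts < self.write_ts:
            return False
        self.write_ts = txn_ts
        return True
```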

In order to relieve your anxiety before you read further: no, correctness does not depend on timestamps. Clock synchronization only helps to improve performance, as more accurate clocks result in more successful transactions and a serialization order that complies with real time.

Using one-shot transactions and adopting the OCC TSO algorithm checks all the boxes, and provides a transaction protocol true to the DynamoDB spirit of predictable response times. After all, isn't systems design all about choosing the right tradeoffs?

“Find out who you are and do it on purpose” --Dolly Parton


Ok, now let's regroup. Here is what we will cover in the rest of this post. We will review the API and then learn how the transactions are implemented. Finally, we will discuss some experimental results.

The paper has a section on what additional optimizations become available when constrained to one-shot key-value transactions. Although this is interesting, we will skip it to keep the review at a reasonable length. I will just mention that many of these optimizations follow from applying the Thomas Write Rule. Among them, the one about reducing read-only transactions to one phase (not currently implemented) is noteworthy.


Architecture and API


Above is the system architecture for transactions. Transactions rely on a transaction coordinator that performs the two-phase coordination of the one-shot OCC TSO protocol. Non-transactional operations (i.e., single-key reads/writes) bypass the transaction coordinator (the purple box) and go directly to the storage node to complete the operation in one touch. (Again, please read the ATC 2022 summary to cache-in how DynamoDB single-key operations work.)



Ok, let's talk API now. Think of TransactGetItems as a one-shot "read-only" transaction. You can't mix TransactGetItems with TransactWriteItems. But you can mix check operations with writes inside a TransactWriteItems request, and that gives you enough leverage to code interesting transactions, as Listing 1 shows. The insight here is that the operation goes to the same Paxos leader for a storage partition, and thanks to the MultiPaxos read-lease optimization, the leader can serve local reads on the items it hosts and realize the checks in such a check/write transaction locally. (In Cassandra this is not possible.)
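
As a concrete illustration, here is roughly what such a check-plus-write transaction looks like through the public AWS Python SDK (boto3), where the check operation is called ConditionCheck. The table names, keys, and attributes below are made up for the example; this is a sketch, not code from the paper.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# One-shot write transaction: check one item's state, write another.
# Both operations succeed or fail atomically.
dynamodb.transact_write_items(
    TransactItems=[
        {
            # Check: succeed only if the customer exists and is ACTIVE.
            "ConditionCheck": {
                "TableName": "Customers",
                "Key": {"CustomerId": {"S": "alice"}},
                "ConditionExpression": "attribute_exists(CustomerId) AND #s = :active",
                "ExpressionAttributeNames": {"#s": "Status"},
                "ExpressionAttributeValues": {":active": {"S": "ACTIVE"}},
            }
        },
        {
            # Put: create the order, but only if it does not already exist.
            "Put": {
                "TableName": "Orders",
                "Item": {
                    "OrderId": {"S": "order-123"},
                    "CustomerId": {"S": "alice"},
                },
                "ConditionExpression": "attribute_not_exists(OrderId)",
            }
        },
    ]
)
```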



Now that we have seen the API, let's revisit the question of how expressive these one-shot transactions can be. The team found that most business transactions of the begin/interactive/commit form can be modeled as read transactions followed by a write transaction with a check condition.

The interleaving of reads and writes within an interactive transaction gives more flexibility for exploration, but by using the above formula with care, one-shot transactions offer good expressivity and still provide serializability (thanks to the check in the write transaction at the end).
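
For example, a read-modify-write that would be interactive in other systems can be emulated by a TransactGetItems followed by a TransactWriteItems whose condition asserts that the value read has not changed in the meantime (retrying on rejection). A hedged boto3 sketch with hypothetical table and attribute names:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Step 1: a read-only transaction fetches the current state.
resp = dynamodb.transact_get_items(
    TransactItems=[
        {"Get": {"TableName": "Accounts", "Key": {"AccountId": {"S": "alice"}}}},
    ]
)
balance = int(resp["Responses"][0]["Item"]["Balance"]["N"])

# Step 2: the write transaction succeeds only if the balance is still
# what we read; otherwise the application retries from step 1.
dynamodb.transact_write_items(
    TransactItems=[
        {
            "Update": {
                "TableName": "Accounts",
                "Key": {"AccountId": {"S": "alice"}},
                "UpdateExpression": "SET Balance = :new",
                "ConditionExpression": "Balance = :old",
                "ExpressionAttributeValues": {
                    ":new": {"N": str(balance - 10)},
                    ":old": {"N": str(balance)},
                },
            }
        },
    ]
)
```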


Write transaction execution


In the prepare phase of the protocol, the transaction coordinator sends a message to the primary storage nodes for the items being written. The storage primary accepts the transaction if:

  • All preconditions on the item are met
  • The transaction's timestamp is greater than the item’s timestamp indicating when it was last written
  • The set of previously accepted transactions that are attempting to write the same item is empty

If the transaction is accepted by all the participating storage nodes, then the transaction coordinator commits the transaction with a second-phase message. Each participant storage node performs the desired writes on its local items and records the timestamp of the transaction as the items' last write timestamp. Items for which a precondition was checked but that are not being written also have their timestamps updated.
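
Here is a minimal Python sketch of the storage node's side of this prepare/commit protocol. This is my own simplification of the paper's description, not DynamoDB's actual code:

```python
class Item:
    def __init__(self, value=None):
        self.value = value
        self.timestamp = 0       # timestamp of the last committed write/check
        self.pending_txn = None  # accepted-but-uncommitted transaction, if any

def prepare(item, txn_id, txn_ts, precondition):
    """Phase 1: accept the transaction only if all three conditions hold."""
    if not precondition(item.value):  # (1) preconditions on the item are met
        return False
    if txn_ts <= item.timestamp:      # (2) txn timestamp is newer than last write
        return False
    if item.pending_txn is not None:  # (3) no other accepted txn on this item
        return False
    item.pending_txn = (txn_id, txn_ts)
    return True

def commit(item, txn_ts, new_value=None):
    """Phase 2: apply the write and advance the item's timestamp."""
    if item.timestamp >= txn_ts:
        return  # duplicate commit attempt (e.g., from recovery); ignore it
    if new_value is not None:
        item.value = new_value
    # Checked-but-not-written items also get their timestamp advanced.
    item.timestamp = txn_ts
    item.pending_txn = None
```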

I love the simplicity of recovery. No locks to muck with. The problem with locks is you may get locked out, and it is no fun trying to find an open locksmith at midnight.

For recovery, we don't need to be concerned with storage node group primary failure, thanks to Paxos. Transaction coordinator failures are the concern, and there is an easy solution: leverage the ledger and employ recovery managers.

Coordinators maintain a persistent record of each transaction and its outcome in a ledger (a DynamoDB table with transaction identifiers as the key). A recovery manager periodically scans the ledger looking for old transactions that have not yet been completed. Such stalled transactions are assigned to a new transaction coordinator who resumes executing the transaction protocol. It is okay for multiple coordinators to be finishing the same transaction at the same time since duplicate attempts to write an item are ignored by its storage node.
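
A sketch of that recovery loop, with a hypothetical ledger interface and a made-up staleness threshold:

```python
import time

STALL_THRESHOLD_SECS = 60  # hypothetical threshold for declaring a txn stalled

def recovery_scan(ledger, coordinators):
    """Periodically reassign stalled transactions to a fresh coordinator."""
    while True:
        for txn in ledger.scan():  # ledger: a DynamoDB table keyed by txn id
            if not txn.completed and time.time() - txn.start_time > STALL_THRESHOLD_SECS:
                # Safe even if the original coordinator is still alive:
                # storage nodes ignore duplicate writes for the same txn.
                coordinators.pick_one().resume(txn)
        time.sleep(STALL_THRESHOLD_SECS)
```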


Now that we understand the transactional protocol, it is time (pun intended!) to talk about the effects of timestamping in TSO on correctness and performance again. Correctness does not depend on timestamps, because for the correctness reasoning we can treat timestamps as monotonically increasing logical timestamps (the protocol rejects smaller timestamps, and timestamps only increase). Of course, better synchronized clocks result in more successful transactions, because the real-time order will align better with the serialization order.

An issue diligent distributed systems people will catch is the risk of timestamping too far into the future, which leads to availability issues. The DynamoDB team has put safeguards in place to ignore clocks that are ahead by many seconds, and to isolate transaction coordinators with faulty clocks. Moreover, the clocks in the coordinator fleet are sourced from the AWS time-sync service, keeping them in sync to within a few microseconds.


Read-only transaction execution

Read-only transactions painstakingly avoid having to maintain a read timestamp on each item, as that would have turned every read into a more costly write operation (to update the read timestamp) on persistent, replicated data. To avoid this latency and cost, they devised a two-phase writeless protocol for read-only transactions, again using a classic idea from optimistic concurrency control.

In the first phase of the protocol, the transaction coordinator reads all the items that are in the transaction's read-set. If any of these items are currently being written by another transaction, then the read transaction is rejected; otherwise, the read transaction moves to the second phase. In its response to the transaction coordinator, the storage node not only returns the item's value but also its current committed log sequence number (LSN). The current committed LSN of the item is the sequence number of the last write that the storage node performed and acknowledged to the client. The LSN increases monotonically.

In the second phase, the items are read again. If there have been no changes to the items between the two phases, namely the LSNs have not changed, then the read transaction returns successfully with all of the item values that were fetched. In the case where an item has been updated between the two rounds of the protocol, the read transaction is rejected.
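
Here is the two-phase read in a Python sketch (my simplification; read_with_lsn stands in for the storage node interface and is not a real API):

```python
from collections import namedtuple

# Illustrative shape of a storage node's reply: the item's value, its
# current committed LSN, and whether a transactional write is pending.
ReadResult = namedtuple("ReadResult", ["value", "lsn", "being_written"])

class TransactionRejected(Exception):
    pass

def read_only_transaction(read_with_lsn, keys):
    # Phase 1: read each item's value and its current committed LSN.
    first = {k: read_with_lsn(k) for k in keys}
    if any(r.being_written for r in first.values()):
        raise TransactionRejected("an item has a pending transactional write")

    # Phase 2: re-read. Unchanged LSNs mean the phase-1 values form a
    # consistent snapshot, so the transaction returns them.
    second = {k: read_with_lsn(k) for k in keys}
    if any(first[k].lsn != second[k].lsn for k in keys):
        raise TransactionRejected("an item changed between the two phases")
    return {k: first[k].value for k in keys}
```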


Experiments

A uniform key distribution and an item size of 900 bytes were used in these tests. Workloads were scaled from 100 thousand to 1 million operations per second. The figures omit y-axis values so as not to divulge too much information. (I am not happy about that, but business is business.)


What we see is nice predictability. With the increase in throughput, both TransactGetItems and TransactWriteItems exhibit negligible variance at P50. The latency increases slightly at P99 as the request rate increases; this is due to increased Java garbage collection on the transaction coordinators when the load is heavier.

In Figures 7 and 8, we can see how close transaction latencies are to those of single-key operations.


Although reads and writes to items within a transaction are processed in parallel, the latency of the transaction request is determined by its slowest operation. Transactions that involve a greater number of operations are more likely to experience a slow read or write.




Comparisons of cancellation rates for workloads with varying contention are shown in Figures 9 and 10.




Conclusion

DynamoDB is ticking along (pun!) as always, with predictable, high performance at any scale. Recent Prime Day stats show that DynamoDB served a peak of 126 million requests per second with single-digit millisecond latency and high availability.

I didn't get a chance to mention this so far in the review, but DynamoDB transactions were also modeled and checked in TLA+, especially to get things right in the presence of failures.

The USENIX ATC presentation video for this paper has not been released yet. But in his FAST '19 keynote, Doug Terry covered the thinking behind DynamoDB transactions in detail. It is a great watch (oh, Murat, stop it with the time-related puns!).
