SLOG: serializable, low-latency, geo-replicated transactions

This paper is by Kun Ren, Dennis Li, and Daniel Abadi, and it appeared at VLDB 2019.

This paper is about providing strict serializability in geo-replicated databases. Strict serializability implies that all reads within a transaction must see the value of any writes that committed before the transaction began, no matter where that write was performed world-wide. Furthermore, if a transaction, A, begins after (in real time) transaction B completes, no client can see the effect of A without the effect of B.

Since a strictly serializable system behaves as if it were a single machine processing transactions sequentially, strict serializability reduces application code complexity and bugs. However, it comes with a cost: current state-of-the-art geo-replicated systems cannot provide strict serializability alongside low-latency writes and high transactional throughput.

To achieve all three (strict-serializability, low-latency writes and high transactional throughput), SLOG uses locality in access patterns to assign a home region to each data granule. Reads and writes to nearby data occur rapidly, without cross-region communication. However, reads and writes to remote data, and transactions that access data from multiple regions (i.e., multi-home transactions), must pay cross-region communication costs. Nonetheless, SLOG uses a deterministic architecture to move most of this communication outside of conflict boundaries, enabling these transactions to be processed at high throughput. More specifically, SLOG relies on deterministic processing to avoid two phase commit. Once all parties agree to the plan, processing occurs (mostly) independently on each node, with the system relying on the plan's determinism in order to avoid replica divergence.

Unfortunately, in order to create a deterministic plan of execution, more knowledge about the transaction is needed prior to processing it than in traditional nondeterministic systems. Most importantly, the entire transaction (in particular, information about which data the transaction will access) must be present during this planning process.

SLOG borrows the deterministic architecture from Calvin, and is implemented by leveraging the open source Calvin codebase. However, SLOG improves on Calvin in the following ways. In Calvin, every transaction, no matter where it originates, is sequenced by a global Paxos process. This gives Calvin complete knowledge of the input to the system while planning how a batch of transactions will be executed. Of course, this comes at the cost of requiring every transaction to pay the cross-region latency of running Paxos across regions. SLOG removes the global Paxos process in order to reduce latency, but as a result a region is unaware, while planning transaction processing, of transactions submitted to replicas in other regions. We will see how SLOG coordinates transactions, classifying them as single-home and multi-home, with as little communication as possible.

In my summary below I use many sentences lifted from the paper. The paper is well written, and I wouldn't be able to improve on many of these explanations.

SLOG overview

SLOG uses a master-oriented architecture, where every data item is mastered at a single "home" replica. Writes and linearizable reads of a data item must be directed to its home replica. Each SLOG region contains a number of servers over which data stored at that region is partitioned (and replicated). Some of this data is mastered by that region (it is the home region for that data) and the rest is a replica of data from a different home region.
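To make this layout concrete, here is a minimal sketch (in Python, with hypothetical names) of the per-granule metadata that travels with each data item: the current home region, plus a counter tracking how many times the granule has been remastered. The counter's role becomes clear in the dynamic remastering section below.

```python
from dataclasses import dataclass

@dataclass
class GranuleMetadata:
    home_region: str   # region that currently masters this granule
    counter: int       # number of times the granule has been remastered

@dataclass
class Granule:
    key: str
    value: bytes
    meta: GranuleMetadata   # home/counter metadata is stored with the data itself

# Each region also keeps a (possibly stale) Lookup Master cache mapping keys
# to their last known home region and counter.
lookup_master_cache: dict[str, GranuleMetadata] = {}
```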

Each region maintains a local input log which is implemented via Paxos across its servers. This local log only contains transactions that are expected to modify data mastered at that region. This input log is sent to all other regions in batches. Each batch is labeled with a sequence number, and the other regions use this sequence number to ensure that they have not missed any batches from that region. Regions use deterministic transaction processing to replay the local log from other regions. By continuously replaying local logs from other regions, the system is able to support local snapshot reads of data mastered at other regions at any desired point in the versioned history.
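Here is a minimal sketch of the batch-exchange idea, under simplified assumptions and with hypothetical names: each remote region's batches carry per-region sequence numbers, and a receiving region replays a batch only once all earlier batches from that region have arrived.

```python
from collections import defaultdict

class LocalLogReceiver:
    """Tracks local-log batches received from one remote region and
    replays them only in sequence-number order."""
    def __init__(self):
        self.next_seq = 0    # next batch sequence number expected from that region
        self.pending = {}    # out-of-order batches, keyed by sequence number

    def receive(self, seq_num, batch):
        self.pending[seq_num] = batch
        ready = []
        # A gap in sequence numbers means a batch from that region is missing;
        # hold back later batches until it arrives.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready   # batches that can now be replayed deterministically

# One receiver per remote region.
receivers = defaultdict(LocalLogReceiver)
```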

Single-home transactions

  1. When a region receives a transaction to process, it sends a message to its Lookup Master to obtain its cached value for the home of each granule accessed by the transaction. The returned locations are stored inside the transaction code. If every granule is currently mapped to the same home region, the transaction is initially assumed to be a single-home transaction and is shipped to that region.
  2. Once the (assumed) single-home transaction reaches its home region, it is appended into an in-memory batch of transactional input on the server at which it arrives, and this server appends this batch to that region's input log via Paxos.
  3. A separate Paxos process interleaves batches from the local log with batches from the local logs that are received from other regions in order to create that region's view of the global log. 

In the figure's example, all three batches appear in the global logs of all three regions. However, the order in which these batches appear in the three respective global logs differs. The only guarantee is that batch 1-2 will appear after batch 1-1, since they originated from the same region. If all transactions are single-home, then each region's local log is guaranteed to access a disjoint set of database granules (i.e., transactions across local logs do not conflict with each other). Therefore, the limited degree to which the global logs are ordered differently across regions does not cause replica divergence.

Determinism ensures that all data progresses through the same sequence of updates at every replica, without runtime coordination. Since home metadata is part of the data granule, the metadata is updated at the same point within the sequence of updates of that granule at every region. Therefore, any assumed single-home transaction that is not actually single-home will be exposed as such independently at each region: every region will, without coordination, observe that the transaction is not single-home, and all regions will agree to abort it without any cross-region communication. The transaction is then restarted as a multi-home transaction. Similarly, if an assumed single-home transaction is indeed single-home but the assumed home is incorrect, all regions will see the incorrect home and counter annotation and independently agree to abort the transaction.
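A sketch of that independent, coordination-free check (hypothetical helper names; the real system also compares counters, discussed under dynamic remastering below). Every region runs the same comparison at the same position in its global log, so all regions reach the same verdict without talking to each other.

```python
def check_home_annotations(txn_annotations, current_home):
    """Run independently at every region when an assumed single-home
    transaction reaches its position in the region's global log.

    txn_annotations: {key: home_region} as cached by the Lookup Master of the
                     region where the transaction was submitted.
    current_home:    {key: home_region} as stored at this point in the update
                     sequence (identical at every region, thanks to determinism).
    """
    for key, annotated_home in txn_annotations.items():
        if current_home[key] != annotated_home:
            # Either the transaction is not actually single-home, or its assumed
            # home was wrong; every region reaches the same verdict and aborts
            # without cross-region communication.
            return "abort-and-restart"
    return "proceed"
```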

Multi-home transactions

  1. All multi-home transactions, no matter where they originate, must be ordered with respect to each other. For this, SLOG employs a global log that orders multi-home transactions with respect to one another and then sends them to the regions, where they are ordered with respect to single-home transactions.
  2. A multi-home transaction appears in several different places in a region's global log. There will be one entry containing the code for that transaction, and then, separately, several LockOnlyTxn entries, one from each region that houses data expected to be accessed by the transaction. The code can start to be processed when it arrives, but it will block whenever it tries to access data for which the corresponding LockOnlyTxn has yet to complete.
  3. LockOnlyTxns exist to specify how the multi-home transaction should be ordered relative to single-home transactions at each region. Depending on where a LockOnlyTxn is placed in the local log of its region, the multi-home transaction will observe the writes of all transactions earlier than it in that local log, and none of the writes from transactions after it. Once a region holds all of the transaction's locks, it executes the code.

The example in the figure assumes that the code for the multi-home transaction has already been disseminated to the regions, and it only shows the dissemination of the LockOnlyTxns. At Region 0, InsertIntoLocalLog(T2) is called after single-home T1 has been placed into its local log, so the LockOnlyTxn it generates for T2 is placed after T1. At Region 1, InsertIntoLocalLog(T2) is called between placing single-home transactions T3 and T4 into its local log, so the LockOnlyTxn it generates for T2 is placed there. That T2's LockOnlyTxns are ordered differently at each region is not problematic, since LockOnlyTxns always access disjoint data and thus commute.
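A sketch of what InsertIntoLocalLog might look like (hypothetical names, assuming the Lookup Master cache maps keys to home regions): each region generates a LockOnlyTxn covering only the keys it currently masters and appends it to its local log at whatever point the call occurs, which fixes the multi-home transaction's order relative to that region's single-home transactions.

```python
def insert_into_local_log(txn, my_region, local_log, lookup_master_cache):
    """Called at each region that homes data accessed by a multi-home transaction."""
    # Keep only the keys this region currently masters; the LockOnlyTxns
    # generated at different regions therefore touch disjoint data and commute.
    local_keys = [k for k in txn["write_set"]
                  if lookup_master_cache[k] == my_region]
    lock_only = {"txn_id": txn["id"], "type": "LockOnlyTxn", "keys": local_keys}
    # Where this append lands relative to the single-home transactions already
    # in the local log determines which of their writes the multi-home
    # transaction observes at this region.
    local_log.append(lock_only)
```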

SLOG's deterministic locking scheme acquires all locks prior to processing a transaction and releases all locks after commit. Thus it is a form of two-phase locking (2PL). In any schedule produced by a 2PL implementation, if locks are held until after the transaction commits, and all reads read the value of the most recent write to that data item, then the resulting schedule is strictly serializable.
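A minimal sketch of a deterministic lock manager in this spirit (not SLOG's actual implementation): because lock requests are issued in global-log order and held until after commit, every replica grants locks in the same order and deadlocks cannot arise.

```python
class DeterministicLockManager:
    """Lock requests arrive in global-log order, so every replica grants
    locks in the same order and no runtime coordination is needed."""
    def __init__(self):
        self.queues = {}   # key -> FIFO of txn ids; the head holds the lock

    def acquire_all(self, txn_id, keys):
        """Growing phase: request every lock before the transaction executes."""
        for key in keys:
            self.queues.setdefault(key, []).append(txn_id)
        # The transaction may run only once it is at the head of every queue.
        return all(self.queues[key][0] == txn_id for key in keys)

    def release_all(self, txn_id, keys):
        """Shrinking phase: release locks only after the transaction commits."""
        for key in keys:
            if self.queues.get(key) and self.queues[key][0] == txn_id:
                self.queues[key].pop(0)
```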

The use of a global log for multi-home transaction ordering was an interesting point of discussion in our Zoom reading group. We discussed whether it would be possible to order the multi-home transactions by other means, say using version vectors included in their metadata. The global log introduces a central point of coordination, which increases latency, as we will see in the latency graphs in the evaluation.

Dynamic remastering

SLOG does not require any remastering of data to process multi-home transactions, but does perform dynamic data remastering as access patterns change over time. A remaster request updates metadata of a single data granule and is thus a single-home transaction. The request is sent to the home region for that granule, which will insert the request in its local log, which will eventually cause the request to appear in the global logs of all regions. When each region processes this request in its global log, it updates the granule metadata. The Lookup Master of that region is also asynchronously updated to reflect the new mastership information.
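A sketch of what replaying a remaster request might do at each region (hypothetical names): since the request appears at the same position in every region's global log, all copies of the granule's metadata change in lockstep.

```python
def apply_remaster(metadata_store, key, new_home):
    """Replayed when the remaster request reaches this position in the
    region's global log; every region applies it at the same point."""
    meta = metadata_store[key]            # e.g. {"home": "region-0", "counter": 3}
    meta["home"] = new_home
    meta["counter"] += 1                  # lets later transactions detect stale annotations
    # The region's Lookup Master cache is refreshed asynchronously.
```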

However, caution is required, as the example demonstrates: Region 2 places the local log batch from Region 0 that contains T3new before the local log batch from Region 1 that contains T2. Thus, it sees a different serial order: T1, T3, T2. This could lead to replica divergence. The counter part of the metadata is used to circumvent this danger. Prior to requesting a lock on a granule, SLOG compares the counter that was written into the transaction metadata by the Lookup Master at the region the transaction was submitted to with the current counter in storage at that region. If the counter is too low, it can be certain that the Lookup Master that annotated the transaction had out-of-date mastership information, and the transaction is immediately aborted (every region will independently come to this same conclusion). If the counter is too high, the transaction blocks until the region has processed a number of remaster requests for that data granule equal to the difference between the counters.
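The counter check before lock acquisition can be sketched as follows (hypothetical names): the annotated counter comes from the Lookup Master at the region where the transaction was submitted, and the stored counter is the one at the current position in this region's global log.

```python
def check_counter(annotated_counter, stored_counter):
    """Decide how to handle a lock request for a granule, given its remaster counter."""
    if annotated_counter < stored_counter:
        # The Lookup Master annotation was stale: the granule has since been
        # remastered. Every region independently reaches the same conclusion.
        return "abort"
    if annotated_counter > stored_counter:
        # This region has not yet replayed all the remaster requests that the
        # submitting region had already seen; block until it catches up.
        return "block"
    return "grant"
```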

Evaluation

Since SLOG relies on deterministic processing, it was implemented over the open source Calvin codebase [2], by adding the two Paxos logs per region and the Lookup Master index, and processing transactions as described above.


We can see that SLOG does not achieve as high throughput as Calvin, because Calvin uses a single point of serialization and never aborts transactions. On the other hand, for exactly the same reason, SLOG improves on Calvin's latency in the WAN. Notice in the latency graph that with 100% multi-home transactions, SLOG's latency degrades to that of Calvin.

Here SLOG is compared with Spanner. One point to keep in mind is that Spanner supports a SQL API and more general transactions, whereas SLOG is restricted by its use of a deterministic architecture.

Here is the paper presentation video from our Zoom reading group.
