Paper review: Implementing Linearizability at Large Scale and Low Latency (SOSP'15)

Motivation for the paper

Remember the 2-phase commit problem? 2-phase commit blocks when the initiator crashes. Alternatively, after a server crash the client/initiator may not be able to determine whether the transaction completed, and the transaction blocks: if we retry, we may return an inconsistent result or redo the transaction, which violates linearizability.

You need 3-phase commit to fix this: a new leader/recovery agent steps in and tries to re-commit things. Unfortunately, 3-phase commit solutions are complicated and have many corner cases. Lamport and Gray recommended that a Paxos consensus box be used to remember the initiator's abort/commit decision to achieve consistency; more precisely, they recommended that the Paxos box remember each participant's decision, for the sake of shaving a message off the critical latency path.

What this paper recommends is an alternative approach. Don't use a Paxos box to remember the decision; instead, use a durable log at the participant to remember it. In this log the participant stores a completion record, which includes any results returned to the client. So if the initiator is confused and retries, or the client retries, or a recovery agent from one of the participating servers retries, the querying party will not get an answer/decision inconsistent with what was committed/returned earlier by the transaction.

How is the log at the participant durable against the crash of the participant? In other words, how do we ensure that the completion record is preserved? This is where the assumption about fast log-based recovery and RAMCloud-specific features comes into play: RAMCloud maintains log-structured replication and quick recovery, which ensure the completion record is not lost.

The paper presents this durable log-based transaction serializability idea with a single participant, i.e., a single-object transaction, and then shows that it can be extended to multi-participant transactions.

That was my plot line for motivating the approach in the paper. The paper uses what I think is a more indirect way to motivate the problem, by first pointing out a prerequisite of linearizability: exactly-once semantics. The figure illustrates that "at least once + idempotence != exactly-once". The paper then presents the completion-record idea to achieve exactly-once semantics, builds linearizability on top of that, and in turn builds transactions on top of linearizability.
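A toy sketch (mine, not the paper's figure) of why at-least-once delivery plus idempotence still falls short of exactly-once:

```python
# A toy key-value store where a write is idempotent: applying the same
# write twice in a row is harmless.
store = {}

def write(key, value):
    store[key] = value

# Client A writes x=1, but the response is lost, so the RPC layer will retry.
write("x", 1)

# Meanwhile client B writes x=2 and sees its write acknowledged.
write("x", 2)
assert store["x"] == 2

# The at-least-once layer now redelivers A's write. The operation is
# idempotent, yet the retry makes A's write take effect *after* B's:
write("x", 1)
print(store["x"])  # 1 -- B's acknowledged write has been silently lost

# With exactly-once semantics (a completion record for A's RPC), the retry
# would be recognized as a duplicate and suppressed, leaving x == 2.
```

The retry is harmless in isolation but reorders history, which is exactly the linearizability violation the completion record prevents.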

Another idea in the paper is "implementing transactions on top of a lightweight linearizability layer". The paper argues that with a lightweight linearizability layer in place, transactions in fact become easier to implement. We will revisit this idea at the end of the review to see how it holds up.

RIFL architecture

The lightweight linearizability layer the paper suggests is named RIFL (Reusable Infrastructure for Linearizability).
In order to implement exactly-once semantics, RIFL must solve 4 problems: RPC identification, completion record durability, retry rendezvous, and garbage collection.
1) In order to detect redundant RPCs, each RPC must have a unique identifier, which is present in all invocations of that RPC.
2) RIFL assumes that the underlying system provides durable storage for completion records keyed with the RPC identifier.
3) If an RPC completes and is then retried at a later time, the retries must find the completion record to avoid re-executing the operation. To achieve this, each operation is associated with a particular object in the underlying system, and the completion record is stored wherever that object is stored.
4) For garbage collection, a completion record should not be reclaimed until it is certain that the corresponding request will never be retried. This can happen in two ways. First, once the client has received a response, it will never retry the request; clients provide acknowledgments to the servers about which requests have completed successfully, and RIFL uses those acknowledgments to delete completion records. Second, if a client crashes before acknowledging, its lease will eventually expire; once the lease has expired the client can never retry, so its completion records can be reclaimed.
RIFL appears to the rest of the system as three modules. The first, RequestTracker, runs on client machines to manage sequence numbers for outstanding RPCs (Figure 3). The second module, LeaseManager, runs on both clients and servers to manage client leases (Figure 4). On clients, LeaseManager creates and renews the client’s lease, which also yields a unique identifier for the client. On servers, LeaseManager detects the expiration of client leases. The third module, ResultTracker, runs only on servers: it keeps track of currently executing RPCs and manages the completion records for RPCs that have finished (Figure 5).
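To make the server-side mechanics concrete, here is a minimal sketch in the spirit of ResultTracker; the class name matches the paper's module, but the methods and signatures are my own illustrative choices, not RIFL's API, and durable logging is elided:

```python
# Hypothetical sketch of server-side exactly-once handling, RIFL-style.
class ResultTracker:
    def __init__(self):
        self.completed = {}   # rpc_id -> completion record (durable in RIFL)

    def handle(self, rpc_id, operation):
        # Retry of a finished RPC: return the saved result, never re-execute.
        if rpc_id in self.completed:
            return self.completed[rpc_id]
        result = operation()
        # In RAMCloud this record is written to the replicated log next to
        # the object it belongs to, so it survives server crashes.
        self.completed[rpc_id] = result
        return result

    def ack(self, rpc_id):
        # Garbage collection: the client acknowledged the response, so this
        # RPC will never be retried and its record can be reclaimed.
        self.completed.pop(rpc_id, None)

# A non-idempotent operation makes duplicate suppression visible.
counter = {"n": 0}
def increment():
    counter["n"] += 1
    return counter["n"]

tracker = ResultTracker()
rpc_id = ("client-7", 42)   # (client lease id, per-client sequence number)
assert tracker.handle(rpc_id, increment) == 1
assert tracker.handle(rpc_id, increment) == 1   # duplicate suppressed
assert counter["n"] == 1
```

The `(lease id, sequence number)` pair plays the role of the unique RPC identifier from problem 1, and `ack` implements the acknowledgment-driven garbage collection from problem 4.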

Implementing transactions over RIFL

The paper shows how Sinfonia minitransactions can be implemented over the RIFL layer. You can read a summary of Sinfonia minitransactions here. Implementing Sinfonia transactions over RIFL requires a long description, so rather than summarizing it myself, I will quote a couple of paragraphs verbatim from the paper to give you an idea.
"No side-effects" is the key idea when implementing transactions. The Transaction object defers all updates to the key-value store until commit is invoked. Commit must atomically verify that each object has the required version number, then apply all of the write and delete operations. If any of the version checks fail, the commit aborts and no updates occur.
Commit is implemented using a two-phase protocol where the client serves as coordinator. In the first phase, the client issues one prepare RPC for each object involved in the transaction (see Figure 6). The server storing the object (called a participant) locks the object and checks its version number. If it doesn't match the desired version then the participant unlocks the object and returns ABORT; it also returns ABORT if the object was already locked by another transaction. Otherwise the participant stores information about the lock in a transaction lock table and creates a durable record of the lock in its log. It then returns PREPARED to the client. The client issues all of the prepare RPCs concurrently and it batches requests to the same participant. If all of the prepare RPCs return PREPARED, then the commit will succeed; if any of the prepare RPCs return ABORT, then the transaction will abort. In either case, the client then enters the second phase, where it issues a decision RPC for each object. The participant for each object checks whether the RPC indicates “commit” or “abort”. If the decision was to commit, it applies the mutation for that object, if any. Then, whether committing or aborting, it removes the lock table entry and adds a tombstone record to the RAMCloud log to nullify the lock record. The transaction is effectively committed once a durable lock record has been written for each of the objects. At this point, regardless of the client's actions, the mutations will eventually be applied in the crash recovery.

If the client is suspected to have crashed, the participant for the transaction's first object acts as recovery coordinator. The recovery coordinator executes a two-phase protocol similar to that of the client, except that its goal is to abort the transaction unless it has already committed (in general, there may not be enough information to complete an incomplete transaction, since the client may have crashed before issuing all of the prepare RPCs). In the first phase the recovery coordinator issues a requestAbort RPC for each object, whose participant will agree to abort unless it has already accepted a prepare for the object. If requestAbort returns PREPARED for every object in the transaction, then the transaction has already committed. Otherwise, the recovery coordinator will abort the transaction. In either case, the recovery coordinator then sends decision RPCs for each object. In order to provide enough information for crash recovery, the client includes identifiers for all objects in the transaction as part of each prepare. This information serves two purposes. First, it allows each participant to identify the recovery coordinator (the server for the first object in the list). Second, the object list is needed by the recovery coordinator to identify participants for its requestAbort and decision RPCs. The recovery coordinator may not have received a prepare if the client crashed, so when a participant invokes startRecovery it includes the list of objects that it received in its prepare.
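The quoted commit protocol can be compressed into a single-process sketch. Class and variable names below are mine; RPCs, the lock-table details, and durable log records are elided:

```python
# Toy sketch of the client-coordinated two-phase commit described above.
PREPARED, ABORT = "PREPARED", "ABORT"

class Participant:
    def __init__(self):
        self.objects = {}   # key -> (version, value)
        self.locks = {}     # key -> pending new value

    def prepare(self, key, expected_version, new_value):
        version, _ = self.objects.get(key, (0, None))
        # Abort if the object is locked by another transaction or the
        # version check fails; otherwise lock it. Real RIFL also writes a
        # durable lock record to the log before answering PREPARED.
        if key in self.locks or version != expected_version:
            return ABORT
        self.locks[key] = new_value
        return PREPARED

    def decide(self, key, commit):
        if commit and key in self.locks:
            version, _ = self.objects.get(key, (0, None))
            self.objects[key] = (version + 1, self.locks[key])
        # Whether committing or aborting, drop the lock (plus, in RAMCloud,
        # a tombstone record nullifying the durable lock record).
        self.locks.pop(key, None)

def commit_transaction(writes):
    # writes: list of (participant, key, expected_version, new_value).
    # Phase 1: prepare everywhere; phase 2: send the decision everywhere.
    votes = [p.prepare(k, ver, val) for p, k, ver, val in writes]
    commit = all(vote == PREPARED for vote in votes)
    for p, k, _, _ in writes:
        p.decide(k, commit)
    return commit

p = Participant()
p.objects["a"] = (1, "old")
assert commit_transaction([(p, "a", 1, "new")])        # versions match: commit
assert not commit_transaction([(p, "a", 1, "newer")])  # stale version: abort
```

The sketch keeps the essential invariant: nothing mutates until every prepare has voted PREPARED, so an abort leaves all objects untouched.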

Transactions over Linearizability vs. Linearizability over Transactions

Implementing transactions over linearizability certainly provided a lot of benefit in terms of simplifying the complex recovery protocol of the original Sinfonia system: by implementing Sinfonia over RIFL, the paper did not need to implement Sinfonia's recovery protocol. On the other hand, we have no results to tell whether implementing Sinfonia over RIFL helped performance; the paper does not include any performance comparison with base Sinfonia transactions. The paper could have at least measured the throughput of base RAMCloud operations against RAMCloud-with-RIFL operations to quantify the cost of the layer. That comparison is also sorely missing from the paper.

As far as general transactions and general linearizability are concerned: the RIFL paper doesn't compare with Calvin, a system that made the case for transactions over linearizability. This is a big omission, as the Calvin system set out to do just that, implementing transactions over a linearized log: "By first writing transaction requests to a durable, replicated log, and then using a concurrency control mechanism that emulates a deterministic serial execution of the log's transaction requests, Calvin supports strongly consistent replication and fully ACID distributed transactions, and manages to incur lower inter-partition transaction coordination costs than traditional distributed database systems."

There is also the question of throughput vs. latency for transactions. The Calvin paper suggests that transactions over linearizability improve throughput while latency suffers. I think the reason is as follows: the linearizability-first approach avoids coordination headaches/contention. So even though it may add to latency, it should improve overall throughput, because coordination contention is avoided; everything is pre-decided by the linearizability layer, a.k.a. the log.

Evaluation section

The evaluation section shows the practical implications of the lightweight shim linearizability layer and demonstrates that it provides low-overhead transactions even as the number of participants increases.
The RIFL+RAMCloud stack is compared/contrasted with H-Store main-memory database system using the TPC-C benchmark. The RAMCloud implementation of RIFL exhibits high performance: it adds less than 4% to the 13.5 μs base cost for writes, and simple distributed transactions execute in about 20 μs. RAMCloud transactions outperform H-Store on the TPC-C benchmark, providing at least 10x lower latency and 1.35x–7x as much throughput.
To contextualize these comparison results, keep in mind that H-Store is a totally different beast than RAMCloud. H-Store is a row-store-based main-memory relational database management system for OLTP applications, so it is definitely more heavyweight, as it must support general transactions and relational schemas.

Figure 12 graphs the throughput of RIFL-based transactions. For transactions involving a single participant, a cluster with 10 servers can support about 440k transactions per second. With more participants per transaction, throughput drops roughly in proportion to the number of participants: with five participants in each transaction, the cluster throughput is about 70k transactions per second. Figure 12 also shows that the single-server optimization improves throughput by about 40%.

Related links

The paper:
Conference video:

The case for RAMClouds:
RAMCloud Reloaded:

