Enabling lightweight transactions with precision time (ASPLOS 2017)

This paper is by Pulkit A. Misra, Jeffrey S. Chase, Johannes Gehrke, Alvin R. Lebeck, and it appeared at ASPLOS'17.

The paper describes *Semel*, a durable multi-version read-optimized key-value store, and *Milana*, a distributed OCC transaction system built on top of Semel.

The main ideas in the paper are to exploit precision time (PTP) and efficient persistent memory based on flash NVM/SSD, and to show how they helped for building OCC based distributed transactional system. In a way, this paper revisited and revised Thor and showed the benefits achievable from using modern clocks and storage technologies inside a datacenter. Thor (Sigmod 1995), was from Barbara Liskov's group, and introduced loosely synchronized clocks for OCC, and performed validation on the storage servers.

Before we summarize Semel and Milana, let's recap the main contributions of this paper:
  1. Move ordering off the critical path: Each version of a key's value is timestamped using PTP. These timestamps enable using a lightweight primary-backup replication protocol that moves update ordering off the critical path. (In this way the paper is similar to TAPIR from OSDI-15, which leverages NTP based ordering for building geo-replicated transactional storage over unordered replication.)
  2. Use of *optimistic concurrency control* (OCC) in Milana to support serializable ACID transactions over Semel. Each transaction executes on a single client (i.e., an application server). The client issues read/write requests to Semel storage servers, assigns PTP timestamps for start, end of transactions, and acts as the coordinator to commit or abort the transaction.
  3. Modifying the erase-before-write (remap-on-write) behavior of SSDs Flash Translation Layer (FTL) to enable cheap multi-version storage and to integrate version management with FTL garbage collection
  4. Leveraging the primary/backup replication in Semel to reduce OCC validation costs, such that write validation occurs only on the primary for each affected shard, and read-only transactions (served as consistent snapshots) validate locally at the client

Here is the presentation on this from our Zoom Distributed Systems Reading Group.

Semel

Semel uses primary/backup replication with a designated primary for each shard. It exploits PTP to relax the ordering requirement and commit each update as soon as a majority of replicas ACK it.
  • Since the replicated Semel operations are timestamped writes to independent versions of independent data items, there is no need to maintain ordering
  • The ordering is explicit in the version timestamps, which are recovered along with the data
  • A server executes reads on the named version and rejects writes with timestamps older than the current version, guaranteeing at-most-once semantics
A key-value store implemented using traditional SSDs requires two mapping steps: mapping Key to Logical Block Address (LBA) and then mapping LBA to <PBN, Page>, the physical block number. However, it is possible to modify the FTL to collapse this two-step translation into a single translation, so that it maps a key directly to a physical address with a single map table access.


Keeping versions around longer than necessary on flash-based systems may cause wasteful remapping (moving) during garbage collection. Semel tries to balance flash remapping cost with the desire to provide historical versions within a certain window size, e.g., keep all versions that are less than 5 seconds old. Semel utilizes watermarking to establish a lower bound on the client clocks. Each client periodically broadcasts the timestamp of its last acknowledged operation to all storage servers. The minimum of all these timestamps is the watermark in Semel. This means that the clients are not any random client, but a set of application servers that Semel and Milana keep track of to update the watermark.

SEMEL's approach to linearizable RPC is similar in spirit to RIFL, which also timestamps requests at the client and persists a completion record containing each request's timestamp with the object. The key difference is that SEMEL's request timestamps are global and synchronized across the clients. Precise clocks enable Semel to simplify the ordering protocol.

Milana

Milana leverages Semel's precision timestamps and builds OCC based transactional system adapted to a client/server setting. In Milana, for an update transaction T, the client first performs the reads from the primaries responsible, stages the write locally, and then validates T before committing via a 2 phase commit which involves the primaries involved.


Each primary uses Algorithm 1 to validate T's keys. As Algorithm 1 shows, this is done by comparing T's timestamped read-write accesses to those of other transactions to identify any access conflicts that violate a serializable ordering. T fails validation if it has conflicts that violate serializability. If T passes validation at the primary, the primary then propagates the validation decision (SUCCESS/ABORT) along with the write set (on successful validation) and shard list to the backup replicas, waits for f (out of 2f) backups to respond, and then reports the decision as its vote to the client/coordinator. If a primary votes to commit T then T is prepared at that primary.


The client accumulates the votes from all primaries and determines the outcome: T commits iff all primaries vote to commit, else T aborts. The client reports the outcome to the application and then asynchronously notifies all primaries of the outcome. Conflicting transactions are aborted and then restarted at the client.

Apart from update transactions, Milan provides efficient snapshot reads. Milana satisfies T's reads for a key K by returning a version that is current as of T's $ts_{begin}$, even if a writer has written a new version of K with a later timestamp. The paper shows that this reduces false conflicts and improves concurrency and throughput.

Moreover, the Milana clients perform local validation for read-only transactions, because they can skip the 2 phase commit required for update transactions. Local validation allows a read-only transaction T to commit iff the values in T's read set are from a consistent snapshot:
  • each value for a key K in T's read set is the youngest committed version of K with timestamp $\leq ts_{begin}$, and 
  • no key K in the read set has a prepared version with timestamp $\leq ts_{begin}$
Local validation ensures a serializable transaction ordering for read-only transactions, but it does not necessarily provide external consistency. Milana provides both serializability and external consistency for read-write transactions, which validate on the servers.

If a primary of a participant shard fails then it would block all transactions involving that shard. A new primary must be elected (failover) in order to unblock any running transactions and resume service. The paper mentions that as long as a majority of replicas (f + 1) of all shards are available, it is possible to resume service and complete any outstanding transaction.

Evaluation

The evaluation shows that
  • Semel achieves 20%-50% higher IOPs than a traditional separate version and flash management approach.
  • Compared to NTP, PTP enables use of OCC with up to 43% lower transaction abort rates.
  • Under the Retwis benchmark, client-local validation of read-only transactions yields a 35% reduction in latency and 55% increase in transaction throughput for read-heavy workloads.
The evaluation is done inside a datacenter, not including cross-datacenter experiments. They mention that NTP shows an average skew of 1.51 milliseconds among clients, while software timestamped PTP has average skew of 53.2 microseconds. The evaluation did not consider fault-tolerance. Also the FTL remapping was done in emulation using the Open-Channel SSD framework, and not on real hardware.


Comments

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

The demise of coding is greatly exaggerated

The end of a myth: Distributed transactions can scale

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

Why I blog

There is plenty of room at the bottom