Ocean Vista: Gossip-Based Visibility Control for Speedy Geo-Distributed Transactions

This paper occurred in VLDB'19 and is authored by Hua Fan and Wojciech Golab. 

The paper is about providing strict serializability in geo-replicated databases. The technique it uses can be summarized as "geo-replicate-ahead transactions". First the transaction T is replicated to all the parties across the datacenters. The parties then check to see that the watermark rises above T to ensure that all transactions including and preceding T has been replicated successfully to all the parties. Then, the execution of T can be done asynchronously at each party.

This should remind you of the SLOG paper

SLOG uses a deterministic architecture to move most of this communication outside of conflict boundaries, enabling these transactions to be processed at high throughput. More specifically, SLOG relies on deterministic processing to avoid two phase commit. Once all parties agree to the plan, processing occurs (mostly) independently on each node, with the system relying on the plan's determinism in order to avoid replica divergence.

All multi-home transactions, no matter where they originate, must be ordered with respect to each other. For this SLOG employs a global log for ordering the multihome transactions with respect to other multi-home transactions and sends them to the regions where they are ordered with respect to single-home transactions.

Ocean Vista also orders transactions with respect to each other first and executes them only after this ordering is stable. Ocean Vista explains this idea in terms of watermarks, but it is a very similar idea. Both papers appeared in VLDB'19. I guess this is another case of a concurrent discovery of an idea whose time has come.  


Gossip-based visibility control 

The presentation slides here are nice, so I will borrow some figures from them to explain the Ocean Vista algorithm. 

Ocean Vista (OV), uses multi-versioning to combine transaction control, concurrency control, and replication functions into a single protocol, that gossips watermarks. 

The usual synchronous transaction processing proceeds as: (1) Read all keys, (2) Compute, (3) Write all keys. In contrast OV reverses this: (1) write/replicate transaction with functors as data version placeholders, (2) Read & Compute transaction once watermark is cleared, (3) Perform asynchronous write to replace the functors with the final values.

Here is the algorithm.

Transactions are totally ordered by global versions generated based on synchronized clocks. The visibility watermark Vwatermark is a version number below which all transactions must have completed their write-only operations (S-phase). All transactions with versions below Vwatermark can be made visible safely, and the transaction order is fixed because no transaction with a lower version may be created. 

The replica watermark Rwatermark is a version number, below which all versions of transactions must have been fully replicated on all corresponding replicas. OV can provide consistent reads using RO (i.e., read from any replica) in the common case for transaction below the Rwatermark. The read-only operation Read(key, ts) (ts < Vwatermark) retrieves the latest version no greater than ts for key. When ts < Rwatermark, the Read can call Get(key, ts) directly on any replica. Reading versions that have been made visible but are not fully replicated (i.e., Rwatermark<ts<Vwatermark) is the only case that requires reading from a quorum.

As far as the write quorums go, writes/replications in the S-phase can succeed in one round trip in the fast path regardless of conflicts, and require two round trips in the slow path when too many failed nodes are present. This has been a point of discussion in our Zoom Reading Group. Aleksey made this observation, and I think he is right. It may be possible to avoid using fast Paxos for implementing OV replication in S-phase.

Why does OV protocol use the FastPaxos-like algorithm for replication? FastPaxos requires the use of larger "super quorums", however, OV replication is not client-driven, and a server specifically picks a unique timestamp for the replication. In FastPaxos, multiple commands may be tried on the same instance (i.e. timestamp), requiring a larger quorum for recovery in phase-1 of Paxos with a smaller majority quorum. However, in OV we do not see such conflicts: the timestamps for transactions are *unique* and are assigned by one node that coordinates replication. The only possible conflict that we saw happening is when a transaction first tries to replicate the command and then issues an abort on the same timestamp. This arguably creates a write-write conflict on the same instance (timestamp), but we think it can be resolved by establishing fixed precedence to make aborts always win over the regular writes. With precedence order established, writing abort to a majority of nodes should be sufficient to make abort persistent and recoverable. We have not reached a definite conclusion on why OV uses larger fast quorums, so we may still be missing something in our understanding of the protocol.


Discussion

Comparing SLOG and Ocean Vista for differences is instructive. SLOG seems to be  more centralized, having both a dedicated master per partition and a dedicated ordering layer for multi-partition transactions. OV employs aggregation based per region gossipers to help keep traffic manageable, but that is not centralization per se. On the other hand, OV may have some vulnerability areas due to NTP clock timestamping. (The paper dedicates a subsection explaining this is not a safety problem but could cause some performance issues.) SLOG avoids that problem by being more centralized.

SLOG, Ocean vista, and also TAPIR impose some restrictions to the transactions they allow. As SLOG puts it: "Unfortunately, in order to create a deterministic plan of execution, more knowledge about the transaction is needed prior to processing it relative to traditional nondeterministic systems. Most importantly, the entire transaction (information regarding which data will be accessed by the transaction) must be present during this planning process." Ocean Vista calls this functors implemented as stored procedures.


Here is a video of the Ocean Vista presentation from our Zoom Reading group.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book