Use of Time in Distributed Databases (part 2): Use of logical clocks in databases

This is part 2 of our "Use of Time in Distributed Databases" series. We talk about the use of logical clocks in databases in this post. We consider three different approaches:

  • vector clocks
  • dependency graph maintenance
  • epoch service 

In the upcoming posts we will allow in physical clocks for timestamping, so there is no (almost no) physical clocks involved in the systems in part 2.   



1. Vector clocks

Dynamo: Amazon's highly available key-value store (SOSP'07)

Dynamo employs sloppy quorums and hinted hand-off and uses version vector (a special case of vector clocks) to track causal dependencies within the replication group of each key. A version vector contains one entry for each replica (thus the size of clocks grows linearly with the number of replicas). The purpose of this metadata is to detect conflicting updates and to be used in the conflict reconciliation function. Dynamo provides eventual consistency thanks to this reconciliation function and conflict detection by version vectors.

Cassandra, which provided an opensource implementation of Dynamo, decided to forgo vectors clocks in favor of using physical time supplied by the client and Last-Writer-Wins rule for updating replicas. 

So, yeah, somehow use of vector clocks in datastores fizzled out over time. Maybe the size of vector clocks to be included in messages was the headache. Or maybe use of synchronized physical clocks offered more advantages in addition to a single scalar timestamp. Nevertheless, vector clocks may still have applications in version control systems and event logging in distributed systems. And, below we talk about two more systems that uses some form of vector clocks.


ORBE (SOCC'13): Causal consistency

ORBE uses vector clocks, organized as a matrix, to represent dependencies. The vector clock has an entry per partition and data center. Physical clocks are used for generating read snapshot times, and ORBE can complete read-only transactions in one round by relying on these loosely synchronized physical clocks. A drawback with ORBE is the large size of timestamps, which followup work on Gentle Rain aimed to address.


The end of a myth: Distributed transactions can scale (VLDB'17)

NAM-DB aims to addresses scalability challenges in distributed transactions through innovative use of RDMA (Remote Direct Memory Access) adopting a timestamp oracle design. The timestamp oracle uses a partitionable vector clock approach to manage commit timestamps without contention. The timestamp oracle protocol implements a software-based solution where each transaction execution thread maintains its own counter in a timestamp vector, allowing for distributed timestamp management without contention. Transactions obtain read timestamps by reading the vector and commit timestamps by incrementing their specific vector entry through efficient RDMA operations.

Let's dive into how the commit protocol achieves consistency. When committing, transactions create new timestamps by incrementing their counter, verify and lock their write-sets using RDMA operations, and update the timestamp vector upon success. This design offers several advantages: transaction threads operate independently without synchronization overhead, long-running transactions don't block others, and the system maintains monotonic timestamp progression when stored on a single memory server (though this property may not hold with partitioned storage).



2. Dependency graph maintenance

Scalable Causal Consistency for Wide-Area Storage with COPS (SOSP'11)

COPS introduced a dependency tracking approach for achieving causal consistency in geo-replicated datastores. The system assigns scalar version numbers to objects and maintains causality by having clients track the versions of all objects read in their causal past. When updates are propagated between data centers, they carry their dependencies, and receiving data centers only make updates visible once all dependencies are satisfied. A key feature of COPS is its support for causally consistent read-only transactions, which provide a consistent snapshot of the system. These transactions are implemented through a two-round protocol.

COPS chose to perform explicit dependency tracking over using vector clocks. They justified their choice against vector clocks by citing scalability concerns, particularly the O(N) size growth with the number of nodes. They argued that in a datacenter with thousands of nodes, the metadata overhead would become prohibitive. I think they overindexed on the N number of nodes. N doesn't grow to very large numbers in deployments, and especially not for replication. As another reason, they noted that vector clocks only provide happens-before relationships and there would still be a need for additional mechanisms like serialization points or explicit dependency checking to enforce causal consistency across the datacenter. I don't get this argument, either. I think they wanted to take a stance for explicit dependency checking rather than the implicit/wholesale causality we get from logical/vector clocks. 

This explicit dependency tracking approach influenced later systems, including the EPaxos family of consensus protocols. The principle is the same: Each operation maintains dependencies for operations, and replication dependencies are checked at each node, and when they are satisfied the value is updated there. Unfortunately, the dependency graphs can grow significantly in pathological cases, and these systems can experience significant slowdowns when dependency lists grow large. Subsequent systems like Occult and Accord/Cassandra (as we will cover in upcoming posts in this series) have shown that combining dependency tracking approach with loosely synchronized physical clocks can help manage the complexity. 


Kronos: The design and implementation of an event ordering service (Eurosys'14)

Kronos introduces a centralized event ordering service for distributed systems that tracks happens-before relationships through a dedicated API. Rather than having individual nodes maintain and propagate dependency information as in COPS, here the applications explicitly register events and define causal relationships with the Kronos service. This approach allows for cross-system dependency management and fine-grained concurrency detection, with the system binding events to a time order as late as possible. While this provides more flexibility in capturing application-specific causality compared to Logical/Vector Clocks (which automatically assume causal dependence between consecutive events on the same node), it comes with the overhead of communicating with the service and searching dependency graphs.


 

3. Epoch server

Chardonnay (OSDI'23): use of a centralized epoch service

Chardonnay is an in-memory distributed database that employs a logically-centralized (3 MultiPaxos nodes under the trenchcoat) epoch service, whose sole job  is to maintain a monotonic epoch counter. The magic of the epoch counter enters the picture for read-only transactions, but let's first cover the read-write transactions.

For read-write transactions, Chardonnay uses a two-phase approach: first running transactions in "dry run" mode to discover and pin read/write sets in memory, then executing them definitively using 2PC+2PL in-memory for speed. This approach leverages modern datacenter networking being significantly faster than disk I/O, allowing Chardonnay to achieve strictly serializable transactions efficiently by keeping relevant data in memory and avoiding deadlocks through ordered lock acquisition. In that sense, this architecture builds on ideas from deterministic databases like Calvin.

For read-only transactions, Chardonnay implements snapshot isolation within epochs (10ms intervals), enabling contention-free queries. A transaction can get a consistent snapshot as of the beginning of the current epoch ec by ensuring it observes the effects of all committed transactions that have a lower epoch. That is realized by waiting for all the transactions with an epoch e < ec to release their write locks. Hence, the snapshot read algorithm would simply work by reading the epoch ec, then reading the appropriate key versions. It is a neat trick, no?

This algorithm does not guarantee strict serializability, because a transaction T would not observe the effects of transactions in epoch ec that committed before T started. If desired, ensuring linearizability is easy at the cost of some latency; after T starts, wait for the epoch to advance once and then use the new epoch for reads. Another neat trick. Tradeoff latency with efficiency/throughput.

The system has been extended to multi-datacenter deployments through Chablis (CIDR '24), which introduces global epochs for cross-datacenter consistency while maintaining local epoch efficiency.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book