Posts

Use of Time in Distributed Databases (part 4): Synchronized clocks in production databases

Image
This is part 4 of our "Use of Time in Distributed Databases" series . In this post, we explore how synchronized physical clocks enhance production database systems. Spanner Google's Spanner (OSDI'12) implemented a novel approach to handling time in distributed database systems through its TrueTime API. TrueTime API provides time as an interval that is guaranteed to contain the actual time, maintained within about 6ms (this is 2012 published number which improved significantly since then) of uncertainty using GPS receivers and atomic clocks. This explicit handling of time uncertainty allows Spanner to provide strong consistency guarantees while operating at a global scale. Spanner uses multi-version concurrency control (MVCC) and achieves external consistency (linearizability) for current transactions through techniques like "commit wait," where transactions wait out the uncertainty in their commit timestamps before making their writes visible. Spanner uses ...

I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No Slowdown Cascades

Image
I recently came across the Occult paper (NSDI'17) during my series on "The Use of Time in Distributed Databases." I had high expectations, but my in-depth reading surfaced significant concerns about its contributions and claims. Let me share my analysis, as there are still many valuable lessons to learn from Occult about causality maintenance and distributed systems design. The Core Value Proposition Occult (Observable Causal Consistency Using Lossy Timestamps) positions itself as a breakthrough in handling causal consistency at scale. The paper's key claim is that it's "the first scalable, geo-replicated data store that provides causal consistency without slowdown cascades." The problem they address is illustrated in Figure 1, where a slow/failed shard A (with delayed replication from master to secondary) can create cascading delays across other shards (B and C) due to dependency-waiting during write replication. This is what the paper means by "...

Use of Time in Distributed Databases (part 3): Synchronized clocks in databases

Image
This is part 3 of our "Use of Time in Distributed Databases" series . In this post, we explore how synchronized physical clocks enhance database systems, focusing on research and prototype databases. Discussion of time's role in production databases will follow in our next post. To begin, let's revisit the utility of synchronized clocks in distributed systems. As highlighted in Part 1 , synchronized clocks provide a shared time reference across distributed nodes and partitions. For simple, single-key replication tasks, such precision is often unnecessary and leader-based approaches such as MultiPaxos or Raft is much more appropriate. Even WPaxos might be considered if you need a WAN deployment. Of course, if you want to go very fancy by using a leaderless designs, such as those in the EPaxos family/Tempo/Accord,  then dependency graphs and time synchronization re-enter the picture. The true value of synchronized clocks becomes apparent in distributed multi-key operat...

Use of Time in Distributed Databases (part 2): Use of logical clocks in databases

Image
This is part 2 of our "Use of Time in Distributed Databases" series . We talk about the use of logical clocks in databases in this post. We consider three different approaches: vector clocks dependency graph maintenance epoch service  In the upcoming posts we will allow in physical clocks for timestamping, so there is no (almost no) physical clocks involved in the systems in part 2.    1. Vector clocks Dynamo: Amazon's highly available key-value store (SOSP'07) Dynamo employs sloppy quorums and hinted hand-off and uses version vector (a special case of vector clocks) to track causal dependencies within the replication group of each key. A version vector contains one entry for each replica (thus the size of clocks grows linearly with the number of replicas). The purpose of this metadata is to detect conflicting updates and to be used in the conflict reconciliation function. Dynamo provides eventual consistency thanks to this reconciliation function and conflict detect...

Use of Time in Distributed Databases (part 1)

Image
Distributed systems are characterized by nodes executing concurrently with no shared state and no common clock. Coordination between nodes are needed to satisfy some correctness properties, but since coordination requires message communication there is a performance tradeoff preventing nodes from frequently communicating/coordinating. Timestamping and ordering Why do the nodes, in particular database nodes, need coordinating anyway? The goal of coordination among nodes is to perform event ordering and align independent timelines of nodes with respect to each other. This is why timestamping events and ordering them is very important. Nodes run concurrently without knowing what the other nodes are doing at the moment. They learn about each other's states/events only by sending and receiving messages and this information by definition come from the past state of the nodes. Each node needs to compose a coherent view of the system from these messages and all the while the system is movi...

Utilizing highly synchronized clocks in distributed databases

Image
This master's thesis at Lund University Sweden  explores how CockroachDB 's transactional performance can be improved by using tightly synchronized clocks. The paper addresses two questions: how to integrate high-precision clock synchronization into CockroachDB and the resulting impact on performance. Given the publicly available clock synchronization technologies like Amazon Time Sync Service and ClockBound, the researchers (Jacob and Fabian) argue that the traditional skepticism around the reliability of clocks in distributed systems is outdated.  CockroachDB vs Spanner approaches CockroachDB uses loosely-synchronized NTP clocks, and achieves linearizability by using Hybrid Logical Clocks (HLC) and relying on a static maximum offset (max_offset=500milliseconds) to account for clock skew. However, this approach has limitations, particularly in handling transactional conflicts within a predefined uncertainty interval. When a transaction reads a value with a timestamp falling ...

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book

Use of Time in Distributed Databases (part 2): Use of logical clocks in databases