Use of Time in Distributed Databases (part 1)
Distributed systems are characterized by nodes executing concurrently with no shared state and no common clock. Coordination between nodes are needed to satisfy some correctness properties, but since coordination requires message communication there is a performance tradeoff preventing nodes from frequently communicating/coordinating.
Timestamping and ordering
Why do the nodes, in particular database nodes, need coordinating anyway? The goal of coordination among nodes is to perform event ordering and align independent timelines of nodes with respect to each other. This is why timestamping events and ordering them is very important. Nodes run concurrently without knowing what the other nodes are doing at the moment. They learn about each other's states/events only by sending and receiving messages and this information by definition come from the past state of the nodes. Each node needs to compose a coherent view of the system from these messages and all the while the system is moving along advancing and changing state. This is like trying to compose a panaromic snapshot of a parade moving down the street only by using small 4x6 photos shot by the participants at arbitrary times.
In 1978, Leslie Lamport captured the ordering relationship between events in a distributed system by introducing a neat abstraction called logical clocks. Logical clocks timestamp events based on causality (happened-before) which comes from either precedence within the same thread of execution at the node or via message communication received from another node.
Unfortunately logical clocks are disconnected from real time (wall clocks). Logical clocks would not be able to establish an ordering to event2 at node2 that is occurring a full 24 hours later than event1 at node1, if there had not been a chain of communication linking event1 to event2 with a causal precedence relationship in that period. Logical clocks ignore all back-channel, and require strict enforcement of all communication to be logical-clock timestamped. Moreover, they don't allow you to query physical time an event took place.
Their cousin vector clocks can get you consistent snapshots across nodes (well, with O(n) vector clock overhead), but vector clocks are also vulnerable to the same drawbacks above. To address these drawbacks, we need physical clocks and real-time affinity. To drive this point home, and learn about even more benefits synchronized clocks can give us, here is another round of motivation for the importance of time in distributed systems and databases.
What is synchronized clocks good for?
Asymmetry of information is the curse on distributed system. The coordinated attack problem provides a great demonstration of this. The two generals in that problem are stuck in a Zeno's paradox of "You may not know that I know that you know that I know" kind of information chaining. A classic paper to checkout on this topic is "Knowledge and common knowledge in a distributed environment".
Synchronized clocks create a consistent global time reference, seeding the build up of common knowledge. Having a globally consistent time reference (i.e., synchronized clocks/time) is a big boon for common knowledge. Without a common reference time, even by using same rate local clock at each node you lose considerable information about the other party.
Consider this example where node2 is trying to learn for how long more node1's lease would be valid. Without synchronized time, unilateral communication from node1 to node2 does not suffice to convey this information, because the message from node1 to node2 could have been arbitrarily delayed in the channel. This makes the timer on node2 obsolete for coordinating between the two nodes since by the time the message arrives the lease might have long expired. With unilateral communication node2 would only know when for sure node1's lease is invalidated, but cannot know when node1 is still holding a lease. It can only reason about the absence of lease by node1, but not about its presence.
Without synchronization, node2 requires two-way communication to estimate lease validity. In this scheme, node2 initializes the communication, and starts its timer when it sends the message. When it receives a response back from node1 which contains the remaining time on the lease for node1, it needs to subtract the delta time passed for the round-trip-time (RTT) to get a lowerbound on the validity of node1's lease. It cannot subtract delta/2 from the lease timer, because communication delay may be asymmetrical.
But, if we have synchronized clocks, this information can be conveyed just with a unilateral communication from node1 to node2. Node1's message would say "I have the lease till T_end", and when node2 receives the message, this T_end information is instantly usable by Node2 without sufffering the RTT delay rounding above, because of the synchronized clocks. Synchronized clocks simplifies coordination and reduces uncertainty in distributed systems.
Loosely synchronized clocks
Ok, let's discuss physical clocks, and clock synchronization at last! Each node has a local clock, which is just a crystal/quartz oscillator. A circuit counts the number of times a crystal oscillates and declares that a millisecond has passed because it counted say 1000 oscillations. This is not precise of course. Atomic clocks use rubidium, which has an oscillation/clock error of one microsecond a day. But quartz is not like that. It is dirt cheap and small, but also inaccurate so it drifts too much. No two quartz crystals are identical. They have big variations. Another thing about quartz crystal is they are very temperature sensitive: when the temperature gets colder they oscillate more when it gets hotter they oscillate less. Therefore frequent time synchronization is needed when using crystal oscillator based clocks. Otherwise each node's clock would drift apart from each other.
In 1985, Network Time Protocol (NTP) entered the game. NTP is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP is by far the most popular time sync distribution protocol on the Internet. However, NTP clock sync errors can be amount to tens of milliseconds. The biggest source of problem for NTP non-deterministic errors come in for synchronization? It comes from the network. The switches and routers are the source of nondeterministic errors for NTP. Asymmetry in the links is a big error source. Consider 100 mbps link feeding into 1 Gbps link. One way there is no delay, but coming back the other way there is queuing delay.
Unfortunately, using loosely synchronized clocks via NTP, with several tens of millisecond uncertainty, violates causality and monotonicity under errors. This is problematic for ordering events and our consistent snapshots across nodes. Hybrid Logical Clocks (HLC) provides a remedy, by integrating Lamport's logical time with loosely synchronized physical clocks. This makes them resilient to synchronization errors and enables progress in degraded conditions.
Of course using tightly synchronized clocks is even better, and gets you better performance with more certainty about other node's states/events/orderings. Luckily we had great progress in this area in the last decade.
The advent of tightly synchronized clocks
In 2018, I wrote this short survey on the advent of tightly synchronized clocks.
One big development has been the availability of cheaper (but almost as precise clocks) than atomic clocks. Atomic clocks use Rubidium and has 1 µs/day. This is so small that relativistic effect considerations kick in here. OCXO ovenized oscillators provide the next best way to have precise clocks in a box in a much cheaper manner. They achieve a drift of ~25 µs/day. Pair these bad boys with a GPS antenna and they get less than 100 nanosecond of true time.
The rest is ensuring high quality distribution of time information and avoiding the sources of nondeterminism creeping in. Precision Time Protocol (PTP) with hardware timestamping helps a lot here and gets you to have ~50 µs clock uncertainty as the AWS Timesync provides for public access.
I would be amiss if I don't talk about the clockbound API here. Google TrueTime (TT) introduced these bounded uncertainty intervals. A call to TT.now() API returns earliest and latest bounds that the true clock could in the worst case fall into. This is awesome because if you ensure non-overlapping intervals on events, you achieve definite event ordering. This allows you to wait out the latest bound on time to strictly order/serialize your commit event. This also provides another level of safety for correctness of timestamp based ordering. For ordering to go awry, both clock synchronization, and clock-bound around it need to go bad concurrently. If your clock synchronization goes bad, the bounds around it would grow, and you would know something is going wrong.
Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems (VLDB'80)
Ok, this introduction already got long. Let's finish this Part 1 with a banger! This paper is from VLDB 1980, and predates even NTP. It is written with a typewriter for God's sake. But Bernstein & Goldman's vision is ahead of its time for many decades. They assume synchronized clocks and propose Timestamp Ordering (TSO) algorithms for concurrency control in distributed database systems. The banger is that after more than three decades, Amazon DynamoDB (ATC'23) uses TSO based algorithm for its one-shot transactions.
Here is how the TSO-based optimistic concurrency control algorithm works. Transactions are serialized by the start-time T_si. Transactions don't hold any locks, and resort to optimistic validation. Instead, every object x is tagged with the timestamp of the last transaction[s] that successfully read/write x. So, object x has two metadata associated with it: read_ts(x) and write_ts(x). To guarantee the serializability of transactions, it is crucial that these timestamps should be monotonically increasing (hence the need for synchronized clocks!). The update protocol checks and ensures this for each access. If a transaction tries to access (read/write) an object from its future it is aborted. For example, if ti finds that it is reading x, which has write_ts(x)>ts_ti, then this would be a future read, and x aborts itself. It may later be retried with a new/higher ts, and hopefully it will succeed then. That is it. That is the whole protocol.
After 1980, the next splash on the use of synchronized clocks arrive at PODC 1991, with Barbara Liskov's "Practical Uses of Synchronized Clocks in Distributed Systems" paper. Interestingly this paper does not cite the Bernstein-Goldman's banger groundbreaking paper. It proposed several distributed algorithms that use synchronized clocks:
- Leases for cache consistency and leader-local reads. Technically, these are realized only via same-rate local timers and don't need synchronized clocks.
- SCMP protocol for at-most-once message delivery
- Kerberos authentication tickets
A noteworthy principle the paper advocated was to use clocks for performance, and not correctness. Notice that in the Bernstein-Goldman paper a decade earlier the correctness of serialization of OCC transactions did not rely on the time synchronization, but the performance would be improved with better time synchronization.
Then comes two decades of silence. The next big update on the practical use of synchronized clocks came 21 years later in 2012 with Google Spanner.
Outline of Use of Time in Distributed Databases
So, this is what I am thinking of covering in the next posts on this series. I already had summaries of these work in my blog over the last several years. It will be nice to organize this information and draw some trends/conclusions from these work. One thing is clear: We see an accelerating trend of adoption of synchronized clocks in distributed databases.
Part 2: Use of logical clocks in database design: Dynamo (SOSP'07), COPS (SOSP'11), ORBE (SOCC'13)
Part 3: Use of synchronized physical clocks in database design: Granola (ATC'12), Clock-SI (SRDS'13), GentleRain (SOCC'14), Occult (NSDI'17), Nezha (VLDB'23)
Part 4: Use of clocks in production databases: Spanner (OSDI'12), MongoDB (SIGMOD'19), CockroachDB (SIGMOD'20), DynamoDB (ATC'23), Cassandra/Accord (2023), Aurora Limitless (2023), Aurora DSQL (2024)
Part 5: Discussion on the correctness of synchronized clock-based algorithms in distributed databases, and the tradeoffs between using different types of clocks/timestamping mechanisms.
Comments
That seems backward but I'd love to know how that is possible.