Practical uses of synchronized clocks in distributed systems
This paper is from 1991, by Barbara Liskov. To set the context, the viewstamped replication paper was published in 1988, and the Paxos tech report (which failed to get published) came in 1989. This paper also came only 3 years after the NTP RFC was written. Although this is an old paper, it is a timely (!) one. The paper discusses how clocks can be used for improving the performance of distributed systems.
This is a visionary paper, ahead of its time. It is possible to criticize the paper for being too optimistic in assuming time synchronization had arrived (it took another 2 decades or so for time synchronization to actually make it). But going with that optimistic premise, the paper made many remarkable contributions. The next noteworthy update on the practical use of synchronized clocks came 21 years later, in 2012, with Spanner. Sanjay Ghemawat, an author on the Spanner paper, was Liskov's student and worked on the Harp project.
This brings me to the next impressive thing about this paper. In most of the sections, Liskov is reporting on her own systems work and how she used clocks in a novel way. One of the projects is the Harp filesystem, built as a highly available distributed system via viewstamped replication. Another is Thor, an object-oriented database. Yet another is SCMP, an at-most-once message delivery protocol. While Lamport, Lynch, etc. were building the theory of distributed systems, Liskov was building actual concurrent and distributed systems. Both lines of effort are equally important and admirable, but my mind is blown seeing these things being built as early as 1990.
Outline of the paper
The paper provides examples from several distributed algorithms that use synchronized clocks:
- at-most-once messages (SCMP protocol)
- authentication tickets in Kerberos
- cache consistency (this is leases actually, by Gray/Cheriton 1989)
- atomicity (how transactions and leases are used together in the Thor object-oriented database)
- commit windows in the Harp filesystem (the leader-lease based consistent reads idea is explained here)
The title of the paper says synchronized clocks, but almost all of these algorithms (like leases) are actually rate-synchronized clock algorithms. The hosts don't need to sync their clocks with each other or with a universal clock. Only in the Kerberos example is some loose synchronization to real clocks used, and even that is a best-effort mechanism that reduces the risk of a security attack rather than eliminating it.
I think the writing in the paper could be better. In several places, I felt that the writing was roundabout, with unnecessary elaboration. Maybe some of this is because many of these concepts were new and unfamiliar at the time. Compared to the sharp, precise writing from the theory side of distributed systems at the time, the writing in this paper comes across as imprecise. The organization did not flow well when reading linearly, but the discussion at the end managed to bring things together and made sense of the interim sections. These are my subjective observations, and they don't take away anything from the greatness and contributions of this seminal paper.
I will give short summaries of each section and then provide a discussion of the developments on this front in the 30 years since the publication of this paper.
At-most-once messages
Implementing at-most-once semantics is typically done by having each message receiver maintain a table containing information about active senders that have communicated with the receiver recently. When a message arrives, if there is information about the sender in the table, it is used to determine whether or not the message is a duplicate.
The SCMP protocol (by Liskov 1991) uses synchronized clocks to guarantee at-most-once delivery of messages, building on this concept. The idea is that the receiver remembers all "recent" communications. If a message from a particular sender is "recent" (i.e., its timestamp is above the watermark, upper, that the system maintains), the receiver will be able to compare it with the stored information and decide accurately whether the message is a duplicate. If the message from the sender is old, it will be tagged as a duplicate even though it may not be. Thus the system will never accept a duplicate, but it may occasionally reject a non-duplicate.
The paper states that this is always safe, even when clock synchronization is bad. The reason is simple. If the message timestamp falls below the maintained watermark, upper, the message is rejected as a potential duplicate. Else, if it is above upper, the table is checked: either the message is in the table and gets rejected, or it is not there, so it is accepted and recorded in the table. The entry about the message is kept until upper exceeds the timestamp of the message.
A downside of the protocol is that a node with a clock too far in the future will force the other nodes to expend a lot of memory maintaining many entries in their tables.
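To make the accept/reject logic and the table garbage collection concrete, here is a minimal Python sketch along these lines. The names (LIFETIME, upper, table) and the simplification of keeping only the latest timestamp per sender are my assumptions, not the actual SCMP implementation; keeping just one timestamp may reject reordered non-duplicates, which at-most-once semantics permits.

```python
import time

class AtMostOnceReceiver:
    """Minimal sketch of watermark-based duplicate rejection (illustrative, not SCMP itself)."""

    LIFETIME = 30.0  # assumed message lifetime interval, in seconds

    def __init__(self):
        self.table = {}  # sender -> latest accepted timestamp
        self.upper = time.time() - self.LIFETIME  # watermark

    def receive(self, sender, ts):
        # advance the watermark as local time progresses
        self.upper = max(self.upper, time.time() - self.LIFETIME)
        if ts <= self.upper:
            return False  # too old: reject as a *potential* duplicate
        if self.table.get(sender, float("-inf")) >= ts:
            return False  # timestamp already seen (or superseded): duplicate
        self.table[sender] = ts  # accept and remember
        return True

    def gc(self):
        # an entry can be dropped once upper has passed its timestamp
        self.table = {s: t for s, t in self.table.items() if t > self.upper}
```

Note how the memory downside shows up here: a sender whose clock is far in the future puts a timestamp into the table that upper will not pass for a long time, so gc() cannot reclaim it.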
Authentication tickets in Kerberos
Kerberos uses session tokens for authentication between clients and servers, with the help of the Ticket Granting Service, which issues these session tokens. These session tokens are used for exchanging encrypted information. To guard against tokens that are just left over at a shared workstation, each token also contains an expiry time. If current_time at the server is later than the expiry time, such tokens are rejected.
Kerberos also uses clocks to help servers detect replayed messages. An authenticator is created by encrypting the client's current time with a key that is shared with the server. The server can verify the authenticator by decrypting it. The server can store information about all recent authenticators and, similar to at-most-once messages, use this to detect replays of the authenticator.
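Here is an illustrative sketch of the two clock uses described above, namely ticket expiry and authenticator replay detection. The constant name and skew window are my assumptions; this is not the actual Kerberos implementation.

```python
import time

CLOCK_SKEW = 5 * 60          # assumed acceptable clock skew window, in seconds
seen_authenticators = set()  # (client, timestamp) pairs seen recently

def accept_request(ticket_expiry, client, auth_timestamp):
    now = time.time()
    if now > ticket_expiry:
        return False  # the ticket has expired
    if abs(now - auth_timestamp) > CLOCK_SKEW:
        return False  # authenticator too old (or from the future)
    if (client, auth_timestamp) in seen_authenticators:
        return False  # exact authenticator seen before: replay
    # entries older than the skew window can be garbage collected later
    seen_authenticators.add((client, auth_timestamp))
    return True
```

Notice that this check relies on loose synchronization to real time (for the expiry and skew window), and it is best-effort: a sufficiently wrong clock can reject legitimate requests or accept stale ones.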
Cache consistency
This section explains the lease idea, which is very familiar to all of us now. The lease idea was put forth in the 1989 paper by Gray and Cheriton. This section explains the idea in more detail, operationalizes it, and points out that only synchronized clock rates are needed for leases to work. Leases are so commonly used today that we hardly give them any thought anymore.
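A minimal sketch of why only clock rates matter: only a duration crosses the wire, and each side measures that duration against its own clock. All names below are illustrative.

```python
import time

LEASE_DURATION = 10.0  # seconds; illustrative choice

class CacheServer:
    def grant_lease(self):
        # remember when the lease expires on *the server's* clock;
        # send only the duration, never an absolute time
        self.lease_expiry = time.monotonic() + LEASE_DURATION
        return LEASE_DURATION

    def can_modify(self):
        # before modifying the cached object, wait until the lease has expired
        return time.monotonic() >= self.lease_expiry

class CacheClient:
    def on_lease_granted(self, duration):
        # measure the same duration against *the client's* clock
        self.lease_expiry = time.monotonic() + duration

    def can_read_from_cache(self):
        return time.monotonic() < self.lease_expiry
```

Since both sides measure the same duration locally, correctness depends only on the two clocks advancing at roughly the same rate, not on them agreeing on absolute time.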
Atomicity
This section talks about the use of leases in the Thor object-oriented database. It explains how, by layering transactions as another mechanism on top, the side effect of a slow clock causing a client to keep using a lease that has already expired can be mitigated. The invariant in this system is: each time a client uses an object, it holds a valid lease for that object. If clocks get out of sync, the invariant might not be preserved. However, in Thor the worst that will happen is that some transaction may have to abort. No damage will have been done to the consistency of the persistent objects. The higher-level mechanism (transactions) provides the safety that was missing in the lower-level mechanism (leases).
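Here is a hedged sketch of the layering argument (this is not Thor's actual protocol, and the validation scheme and names are my assumptions): even if a slow clock lets a client keep using an object after its lease expired, commit-time validation notices that the client acted on stale data and aborts the transaction, so the persistent objects stay consistent.

```python
class Transaction:
    def __init__(self):
        self.read_set = {}   # obj_id -> version the client observed
        self.write_set = {}  # obj_id -> new value

class ObjectServer:
    def __init__(self):
        self.versions = {}  # obj_id -> current version number
        self.values = {}    # obj_id -> current value

    def commit(self, txn):
        # If the client read under a stale lease while another client
        # updated the object, validation fails and the transaction aborts.
        for obj_id, seen_version in txn.read_set.items():
            if self.versions.get(obj_id, 0) != seen_version:
                return False  # abort: no damage to the persistent objects
        for obj_id, value in txn.write_set.items():
            self.values[obj_id] = value
            self.versions[obj_id] = self.versions.get(obj_id, 0) + 1
        return True  # commit
```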
Commit Windows
This section talks about the use of primary copy replication in Harp via the viewstamped replication approach, which was developed just 3 years earlier. The section presents, I believe for the first time, strongly-consistent reads via leader-local reads using leases. This is the vanilla optimization in Multi-Paxos systems today, including Raft. The leader takes a lease on the followers, and the clients can read the most recent value from the leader, instead of the leader doing a Paxos round with the followers. As long as the leader holds the lease on a majority of the followers, it can serve the read locally and be certain that another leader did not emerge to update the value unbeknownst to it. Why? Because a candidate leader would have to wait until the leases on the followers expire in order to get acks from a majority quorum of replicas. Because of this, there won't be a new leader until the old leader's leases on the followers expire.
What if, due to misalignment in clock rates, the old leader still thinks it has the lease and serves a read locally? This means we may see some stale reads during this lease-missynchronization period, but the write-path correctness is not affected, because it is preserved by viewstamped-replication-based consensus. The paper says: "The invariant in this system is: whenever a primary performs a read it holds valid leases from a sub majority of backups. This invariant will not be preserved if clocks get out of sync. However, the impact of violating the invariant is small. At worst there will be a violation of external consistency, but in fact this is unlikely." The paper then goes on to lay out how this can play out. Leases, when not configured well, still constitute a risk for consistent reads.
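Here is a rough sketch of the leader-side check for serving a read locally. The structure and names are assumptions for illustration (this is not the Harp code): the leader counts followers whose leases, measured on the leader's own clock, have not yet expired, and falls back to a quorum round otherwise.

```python
import time

class Leader:
    LEASE_DURATION = 2.0  # seconds; illustrative

    def __init__(self, followers):
        self.followers = followers
        # per-follower lease expiry, measured on the leader's clock
        self.lease_expiry = {f: 0.0 for f in followers}

    def on_lease_ack(self, follower, sent_at):
        # count the lease from when the request was *sent*, to stay conservative
        self.lease_expiry[follower] = sent_at + self.LEASE_DURATION

    def holds_majority_lease(self):
        now = time.monotonic()
        live = sum(1 for f in self.followers if self.lease_expiry[f] > now)
        # the leader plus live lease holders must form a majority of all replicas
        return live + 1 > (len(self.followers) + 1) // 2

    def read(self, key):
        if self.holds_majority_lease():
            return self.local_read(key)  # serve locally, skip the quorum round
        return self.quorum_read(key)     # fall back to a full round

    def local_read(self, key):
        ...  # read from the leader's local copy

    def quorum_read(self, key):
        ...  # contact a majority of replicas
```

The safety of the write path does not depend on this check at all; it only decides whether a read can skip the quorum round.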
Discussion section of the paper
This section makes observations about how clock sync was used in the interim sections and draws conclusions. I really liked the last paragraph of the section, which provides a practical recipe for where to look and what to do in order to use clock synchronization in distributed systems.
"To convert a distributed algorithm to one that uses synchronized clocks, there are two places to look. By examining the messages that are being exchanged, it may be possible to identify some that could be avoided by using timestamps. Or, in the case where message exchange is already reduced by maintaining state it may be possible to find a way to save storage by using timestamps as a garbage collection technique. After finding a place to use timestamps, the next step is to analyze the consequences of using synchronized clocks, both on normal behavior (when clocks are in sync) and during clock failures. During this analysis the time interval that will be used is selected (all algorithms have such an interval, e.g., the message lifetime interval p in the at-most-once protocol). An algorithm based on timestamps is a good idea if ultimately the worst case behavior is sufficiently unlikely or sufficiently benign so as to represent a good tradeoff for improved performance."
Developments on use of clock sync in distributed systems after the paper
The paper's expectation that time synchronization would quickly become widely available was too optimistic. The paper said: "Synchronized clocks are quickly becoming a reality in distributed systems". The paper also said: "The focus on algorithms that depend only on rates is likely to diminish now that synchronized clocks exist." These turned out to be too optimistic. In the following decades, we found that it was really hard to provide bounds on the error intervals of time synchronization. After a lot of investment in clock and network infrastructure, Spanner was able to provide 7ms bounds on the errors in 2012. (Things have improved since then.) I wrote about the challenges of time synchronization in an earlier post last year on the Sundial clock synchronization in datacenters paper, with an informative detour on time synchronization in wireless systems. I won't rehash that discussion here.
The abstract of the paper mentions: "Since they have only recently become a reality in distributed systems, their use in distributed algorithms has received relatively little attention". I would say this is still the case after 30 years... This topic still receives relatively little attention. I had written a high-level whitepaper about the use of time synchronization in distributed systems in 2018. In 2019, I wrote an NSF proposal to do more research on the use of synchronized clocks in distributed systems, but it didn't get funded. I wrote a short pitch on that here.
Another update I want to note is the use of clock synchronization to establish a timestamp-stability watermark as a proxy for dependency tracking. This was used in the Tempo paper and the Accord protocol in Cassandra.
In some places, the paper had a cavalier attitude about correctness in the presence of clock synchronization errors. We have since learned that time synchronization is a wellspring of distributed systems bugs. At HPTS, whenever a talk mentions using timestamps for something, Kyle Kingsbury's ears perk up and he pays attention. Kyle has demonstrated many bugs due to timestamp problems using Jepsen testing.
The underestimation of the severity of clock synchronization bugs could be due to several reasons. This was a new area, the paper is visionary and takes an optimistic view, and Liskov was wearing her systems-builder hat and was excited about the performance improvements this brings. On the other hand, the paper also balances the optimism by stating: "Since clock synchronization can fail occasionally, it is most desirable for algorithms to depend on synchronization for performance but not for correctness."
All I can say to that is, Amen!