Do tightly synchronized clocks help consensus?

Let's start with the impossibility results, a good place to start distributed systems discussion. The coordinated attack result says that if the communication channels can drop messages arbitrarily and undetectably you cannot solve distributed consensus using a deterministic protocol in finite rounds. This result is all about *the curse of asymmetry of information*. Reliable channels are needed for two parties to agree on something. Otherwise the two parties are stuck in a Zeno's paradox of "You may not know that I know that you know that I know" kind of information chaining. A classic paper to checkout on the topic is "Knowledge and common knowledge in a distributed environment".

Assuming mostly reliable channels and demanding majority acks, it is possible to solve consensus in a partially synchronous system. Paxos shows us how to do this. Paxos uses quorum of acks to get around the problem of "the agreement having to occur in one shot" as in the coordinated attack problem. In Paxos, the quorum acks mark the point where the system agrees, and the stragglers can learn or re-learn this information because agreement was stored in a quorum of nodes. As a result, the agreed value does not change.

What does this have to do with time synchronization? Specifically, where is time synchronization in Paxos? Nowhere! Paxos preserves safety in an asynchronous system, and achieves progress when the failure detectors do not suspect the eligible and alive leader. For this, it is sufficient to use diamond S failure detectors (Chandra-Toueg 96), which do not need time synchronization. These detectors are implementable by using per process clocks measuring local time passing under the assumption that time does not advance at unboundedly different rates across processes.

Now the question is this: If we had perfect time synchronization, would we be able to solve consensus easier, better, more robust, or more efficiently?  No! The curse of asymmetry of information is the core challenge in consensus. The leader needs to hear acks from a quorum in order to anchor consensus. (The quorum should be more than F and the phase1 and phase2 quorums should intersect with each other so the consensus state persists.

Time synchronization with clock uncertainty bound, epsilon, less than microsecond are not useful for consensus because consensus is bound by communication, and communication between computers  even in the same cluster takes many microseconds. 

The guarantee we would require from time synchronization is that causality is not violated by the physical clock timestamps. That means epsilon should be less than the communication latency. If epsilon is larger than the communication latency we may order the events incorrectly. Consider this example. Event A and event B have the same clock timestamp but they are farther apart in time because of the epsilon uncertainty. If we were to use A and B as a snapshot of the distributed system, this would be an inconsistent snapshot. Another event happens after A which then sends a message that affects B, and taints the snapshot. Causality sneaks in between A and B and the integrity of the snapshot is violated. 

As long as epsilon is less than communication latency, this problem does not happen and we won't have problems building linearizability or consistent snapshots using the synchronized clocks' timestamps as the basis. 

Don't get me wrong. There could be uses for nanosecond-precise time synchronization for telemetry applications, like measuring network delays, performance hickups, etc. But those would not be useful for consensus, linearizability, or even taking consistent snapshots. Clock synchronization with epsilon on the order of microseconds would be plenty sufficient for them. 

A more robust epsilon is more important than smaller epsilon. A time synchronization protocol that can provide guaranteed epsilon at the order of 10s of microseconds would be preferable to one that can provide epsilon less than microsecond under ideal conditions but may have milliseconds clock uncertainty under some corner cases.

Some related posts


Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

My Distributed Systems Seminar's reading list for Fall 2020

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters