Sunday, July 27, 2014

Hybrid Logical Clocks

Here I will write about our recent work on Hybrid Logical Clocks, which provides a feasible alternative to Google's TrueTime.

A brief history of time (in distributed systems)

Logical Clocks (LC) was proposed in 1978 by Lamport for ordering events in an asynchronous distributed system. LC has several drawbacks for modern systems. Firstly, LC is divorced from physical time (PT), as a result we cannot query events in relation to real-time. Secondly, to capture happened-before relations, LC assumes that there are no backchannels and all communication occurs within the system.

Physical Time (PT) leverages on physical clocks at nodes that are synchronized using the Network Time Protocol (NTP). PT also has several drawbacks. Firstly, in a geographically distributed system obtaining precise clock synchronization is very hard; there will unavoidably be uncertainty intervals. Secondly,  PT has several kinks such as leap seconds and non-monotonic updates. And, when the uncertainty intervals are overlapping, PT cannot order events and you end up with inconsistent snapshots as the one shown below.

TrueTime (TT) was introduced by Spanner, Google's globally-distributed multiversion database, to timestamp transactions at global scale. TT leverages on tightly-synchronized physical clocks, but TT also has drawbacks. As in PT, when the uncertainty intervals are overlapping TT cannot order events, and it has to explicitly wait-out these ε intervals. To alleviate the problems of large ε, TT employs GPS/atomic clocks to achieve tight-synchronization (ε=6ms), however the cost of adding the required support infrastructure can be prohibitive and ε=6ms is still a non-negligible time.

Hybrid Logical Clocks

In our recent work (in collaboration with Sandeep Kulkarni at Michigan State University), we introduce Hybrid Logical Clocks (HLC). HLC captures the causality relationship like LC, and enables easy identification of consistent snapshots in distributed systems. Dually, HLC can be used in lieu of PT clocks since it maintains its logical clock to be always close to the PT clock.

Formally, the HLC is problem is to assign each event e a timestamp, l.e, such that
1) e hb f =>  l.e l.f,
2) Space requirement for l.e is O(1) integers,
3) l.e is close to pt.e, that is, l.e - pt.e is bounded.

Next I will show you a naive algorithm for HLC which is unbounded. Then I will present our HLC algorithm and show that it is bounded.

Naive algorithm 

The naive algorithm is very simple and very similar to Lamport's LC algorithm. However, it cannot keep l-pt bounded: l can move ahead of pt in an unbounded manner for certain cases.
In this example run, we can see that l-pt diverges unboundedly if we continue the messaging loop among processes 1, 2, and 3.

HLC algorithm

Here is the improved algorithm. This algorithm bounds l-pt and c for any case, including the example run presented above. Below is that run annotated with HLC values.
Notice that in the HLC algorithm, l-pt is trivially bounded by ε. And, more importantly c gets reset regularly. This is because, either l is incremented via receiving a larger l from another node and c gets reset, or l remains the same and pt catches up to l to increase it and c gets reset.

Under the most basic/minimal constraint that physical clock of a node is incremented by at least one between any two events on that node, we prove c < N*(ε +1). Recall that the naive algorithm cannot be bounded with this minimal assumption as the counterexample showed.

Under a more lenient environment, if we can assume that the time for message transmission is long enough for the physical clock of every node to be incremented by at least d, we prove  c < ε/d+1. Note that under that assumption the naive algorithm also becomes bound-able.

Now let's see how we can get a consistent cut using HLC. The consistency of the cut is implied by ¬(∃ p,q :: l.snap.p < l.snap.q ), which is equivalent to (∀ p,q :: l.snap.p = l.snap.q). That is, to get a consistent cut, all we need to do is to take events with the same l and c value at all nodes. In the figure, we show a consistent cut for l=10 and c=0.

Fault tolerance

Stabilization of HLC rests on the superposition property of HLC on NTP clocks. Once the NTP/physical clock stabilizes, HLC can be corrected based on the maximum permitted value of l-pt and the maximum value of c. If bounds are violated, we take the physical clock as the authority, and reset l and c values to pt and 0 respectively.

In order to contain the spread of corruptions due to bad HLC values, we have a rule to ignore out of bounds messages. In order to make HLC resilient to common NTP synchronization errors, we assign sufficiently large space to l-pt drift so that most NTP kinks can be masked smoothly.

Concluding remarks

We can have a compact representation of HLC timestamps using l and c. NTP uses 64-bit timestamps which consist of a 32-bit part for seconds and a 32-bit part for fractional second. We restrict l to track only the most significant 48 bits of pt. Rounding up pt values to 48 bits l values still gives us microsecond granularity tracking of pt. 16 bits remain for c and allows it room to grow up to 65536. (In our experiments c mostly stayed in single digits.)

HLC provides the following benefits: HLC is substitutable for PT (NTP clocks) in any application. HLC is resilient and monotonic and can tolerate NTP kinks. HLC can be used to return a consistent snapshot at any given T. HLC is useful as a timestamping mechanism in multiversion distributed databases, such as Spanner. In fact, HLC is being used in CockroachDBan opensource clone of Spanner, and is implemented in this module in particular.

Read our paper for the details.

Friday, July 4, 2014

Distributed is not necessarily more scalable than centralized

Centralized is not necessarily unscalable! 

Many people automatically associate centralized with unscalable, and distributed with scalable. And, this is getting ridiculous.

In the Spring semester, in my seminar class, a PhD student was pitching me a project for distributed storage: syncing from phone to work/home computers and other phones. The pitch started with the sentence "Dropbox is unscalable, because it is centralized". I was flabbergasted, and I asked a couple of times "Really? Do you actually claim that Dropbox is unscalable?". The student persisted and kept repeating that "Dropbox has a bottleneck because it is a centralized storage solution, and the distributed solution doesn't have that bottleneck". I couldn't believe my ears.

Dropbox already proved it is scalable: It serves files for more than 200 million users, who store 1 billion files every 24 hours. That it has a centralized architecture hosted in the cloud doesn't make it unscalable. As far as I can see there is no bottleneck caused by Dropbox having a more centralized architecture.

(For those who want to nitpick, I know Dropbox is not fully centralized; it uses AWS S3 for storage and Dropbox-company servers for metadata management. Also, it employs data parallelism in the backend for scalability, but, on the spectrum, it is closer to a centralized architecture than a fully decentralized one.)

Distributed is not necessarily scalable!

Some people when faced with a problem think, I know, I'll use distributed computing. Now they have N^2 problems. -- @jamesiry
Here is the second part. Distributing a system does not necessarily make it scalable. In fact, a fully decentralized architecture can sometimes be a disadvantage for scaling.

Consider Lamport's mutual exclusion (ME) algorithm presented in his seminal "Time, Clocks, and the Ordering of Events in a Distributed System". This ME algorithm is fully decentralized, and requires O(N) messages to be exchanged in response to one ME request. The Lamport ME algorithm employs broadcasts to keep all the nodes informed of all updates and get them on the same (more or less) state.

Now consider a centralized algorithm for ME: there is a centralized coordinator; the nodes send their request to the coordinator, and the coordinator assigns ME accordingly. (For the literalist: You can still have causal ordering in the centralized algorithm. Just use VC when nodes communicate and include VC in the request messages.) The centralized ME algorithm is more scalable: only 1 message is exchanged in response to one ME request. It has less drama and it is easier to maintain and build over.

Single point of failure?

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport
A common reflex argument about centralized solutions is that it constitutes a single point of failure (SPOF). But if a distributed solution is not designed carefully, it will have multiple points of failures (MPOF). Which one would you rather have?

Let's reconsider the Lamport ME and the centralized ME algorithms. The distributed algorithm does not offer any fault-tolerance advantages. Both algorithms are prone to getting stuck with one crash failure.

In fact, we can argue that it is easier to design fault-tolerance to a centralized solution: You can employ Paxos to replicate the centralized server. In contrast, it is often much harder to design and add fault-tolerance to a distributed system. Since a distributed system is complex, it is more prone to introduce corner cases that jeopardize fault-tolerance.


Distributed is not necessarily more scalable than centralized;
And centralized is not necessarily a scalability bottleneck.

As a distributed systems professor, I wouldn't imagine myself defending centralized solutions. But there it is.

To avoid potential misunderstandings, I am not saying fully distributed/decentralized solutions are bad and to be avoided. There are advantages to decentralization, like latency reduction. And some conditions necessitate decentralization, like geographic/political/corporate isolation. We know in the real world it is a mix of centralized, up to where that is manageable and has reasonable cost, and some distributed architecture. This also depends very much on the application/task.

PS: Maybe we should do an XtraNormal animation movie about this "centralized unscalable and distributed scalable" mania. Any takers?

PS2: I thank @tedherman for improvements to the 1st draft.

PS3: Optimistic replication is a great survey of more decentralized replication protocols, their advantages, and challenges.

Bonus Section: Paxos is a relatively centralized approach to distributed consensus

Consensus is usually not an all-hands-on process. That can be hard to scale. Consider our democratic system: It is pretty centralized; we only elect leaders to rule for us.

In the same sense, you can think of Paxos as the more centralized approach to distributed consensus and state machine replication. In Paxos, the participants do not interact with all other participants to decide the order of requests to be accepted, instead the leader dictates the order of requests and the participants just accept them. (A fully decentralized consensus algorithm would be like the synchronous rounds consensus algorithm where in every round each participant communicates with all other participants so that they can converge on the same state.)

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...