Wednesday, August 30, 2017

Retrospective Lightweight Distributed Snapshots Using Loosely Synchronized Clocks

This is a summary of our recent work that appeared at ICDCS 2017. The tool we developed, Retroscope (available on GitHub), enables unplanned retrospective consistent global snapshots for distributed systems.

Many distributed systems would benefit from the ability to take unplanned snapshots of the past. Let's say Alice notices alarms going off for her distributed system deployment at 4:05pm. If she could roll-back to the state of the distributed system at 4:00pm, and roll forward step by step to figure out what caused the problems, she may be able to remedy the problem.

The ability to take retrospective snapshots requires each node to maintain a log of state changes and then to collate/align these logs to construct a consistent cut at a given time. However, clock uncertainty/skew among nodes is dangerous and can lead to taking an inconsistent snapshot. For example, the cut at 4:00pm in this figure using NTP is inconsistent, because event F is included in the cut, but causally preceding event E is not included.

To avoid this problem with physical clock synchronization, our snapshotting service Retroscope employs Hybrid Logical Clocks (HLC), that combine the benefits of physical time along with causality tracking of logical clocks (LC). By leveraging HLC, our system can take non-blocking unplanned/retrospective consistent snapshots of distributed systems with affinity to physical time.

Hybrid Logical Clocks

Each HLC timestamp is a two-part tuple: the first part is a shadow-copy of NTP time, and the second part is an overflow buffer used when multiple events have identical first parts in their HLC timestamp. The first part of HLC at a node is maintained as the highest NTP time that the node is aware of. (This comes from the node's own physical clock, or comes from a remote node from which a message was received recently.) The second part acts as the logical clock for events with identical first parts, and it is being reset every time the first part is updated.

The figure shows HLC in operation, with HLC timestamps in green, and the machine’s physical time in red. Note how message exchange between P1 and P2 bumps P2’s first HLC component to be higher than its own physical clock.

HLC not only keeps track of physical time, but also provides the same guarantees as the LC. Namely, HLC satisfies the LC condition: if e hb f then HLC.e < HLC.f. These HLC properties allows us to identify consistent cuts the same way we identify them with LC: a cut is consistent when all events have the same timestamp. In situations where we have no event with desired timestamp, we can create a phantom, non-mutating event at the desired timestamp and use it to identify the consistent cut.


Taking a Snapshot

An initiating agent can start the snapshot procedure by sending a snapshot request to every node. Each node, upon receiving the request performs a local snapshot of its current state, and uses the window-logs of state changes to undo the changes until the state reaches the requested HLC time. Each node performs snapshot independently, and there is no need for inter-node coordination on taking the snapshot. Once all machines have arrived to local snapshots at the same HLC time, we have obtained a global distributed snapshot (by virtue of the HLC feature stated above).

Our tool Retroscope also supports incremental snapshots to take an inexpensive snapshot in the vicinity of another snapshot by undoing (or redoing) events to reach the new one. These incremental snapshots are very useful for monitoring tasks, in which we need to explore many past system states in a step-through manner while searching for some invariant violation or fault causes.

We have implemented Retroscope as a set of libraries to be added to existing Java projects to enable retrospective snapshots capabilities. Our Retroscope library provides tools to log and keep track of state changes and HLC time of such changes independently at each node. The current implementation uses in-memory sliding-window log for state history, and have configurable capacity. The Retroscope server library provides the tools to aggregate multiple such event logs and identify consistent cuts across those logs with affinity to physical time. The API also allows querying for consistent cuts that satisfy certain predicates using a SQL-like querying language.

We will talk about Retroscope's querying and monitoring features in a later blog post. 

Evaluation

We have added Retroscope capabilities to several Java applications, such as Voldemort key-value database, Hazelcast in memory data-grid, and ZooKeeper.

Retroscoping Voldemort took less than 1000 lines of code for adding HLC to the network protocol, recording changes in the Retroscope window-log, and performing snapshot on the Voldemort's storage. To quantify the overhead caused by Retroscope, we compared the performance of our modified Voldemort with unmodified version on a 10-node cluster. The figure below illustrates the throughput degradation resulted from adding Retroscope to the system. We observed that in the worst case, the degradation was around 10%, however in some cases we also observed no degradation at all. Latency overheads were similar to the throughput.


Taking a snapshot brings more stress to Voldemort, a disk-based system, as each node now needs to get a copy of it local state and undo changes from that copy. Retroscope allows for the entire snapshot routine to run in non-blocking manner, but with some performance degradation. The figure below illustrate the throughput and latency of the client requests on the 10-node cluster while performing a snapshot on a database of 1,000,000 items. Average throughput degradation of the system while taking a snapshot was 18%, although the performance can be improved by using a separate disk to make a database copy.


We also ran a similar experiment on Hazelcast in-memory datagrid, and observed very little performance degradation associated with the snapshot, since Hazelcast is in-memory system.


(This was a joint post with Aleksey Charapko.)

No comments: