Retrospective Lightweight Distributed Snapshots Using Loosely Synchronized Clocks
This is a summary of our recent work that appeared at ICDCS 2017. The tool we developed, Retroscope (available on GitHub), enables unplanned retrospective consistent global snapshots for distributed systems.

Many distributed systems would benefit from the ability to take unplanned snapshots of the past. Let's say Alice notices alarms going off for her distributed system deployment at 4:05pm. If she could roll back to the state of the distributed system at 4:00pm and then roll forward step by step to figure out what caused the problems, she might be able to remedy them.

The ability to take retrospective snapshots requires each node to maintain a log of state changes and then to collate/align these logs to construct a consistent cut at a given time. However, clock uncertainty/skew among nodes is dangerous and can lead to taking an inconsistent snapshot. For example, the cut at 4:00pm in this figure using NTP is inconsistent, because event F is included in the cut, but causally