Consistent snapshot analogies

Last week I taught distributed snapshot in my CSE 586: Distributed Systems class. While I teach snapshot, I invariably find myself longing for analogies to provide some intuition about this concept. The global state captured by a distributed snapshot (say using Lamport/Chandy marker algorithm) does not correspond to the "state of the system at initiation of the snapshot". Furthermore, it also may not correspond to a "state of the system from initiation to current state during this computation". This is because while the snapshot taking is progressing in the system, the underlying system computation is also proceeding and changing the state of the system progressively. (Distributed snapshot is not allowed to stop/freeze underlying system computation as that reduces availability.)

For those curious about the question, "what good is a snapshot then?": The snapshot captures a reachable state from initiation state, and from the snapshot state the current state of the computation is also reachable. In other words, snapshot is a likely state of the computation, even though it may not have occurred in this particular computation. So, for stable predicate detection and distributed system debugging the snapshot is still valuable.

Going back to my predicament, the analogy I resort to is that of 1000 ants trying to take/construct a picture of the elephant as the elephant is moving. (I had heard this example from Paul Sivilotti while I was a graduate student at Ohio State.) Here the ants correspond to the marker algorithm, and the elephant the underlying computation that we want to take a snapshot of. Of course the pictures the ants will construct will be vaguely elephant-like, it will be a picture of the elephant's outer surface as it progresses in the spacetime continuum. (Achievement Unlocked: Today I used spacetime continuum in serious writing.)

Last week I was using this analogy in class, when a better (at least more modern) analogy occurred to me. Panoramic photographs! When you use your smartphone to take a panorama picture, you are in fact taking a distributed snapshot of your surrounding. Your snapshot is not instantaneous, it needs time to complete: you need to rotate 180 to 360 degrees and probably that takes a good 5-10 seconds. If in the meanwhile something moves, that object will not be reflected in its original form/place/state in your panorama picture.

We may attempt to take the analogy further to categorizing the panorama pictures as consistent snapshots and inconsistent snapshots. In an inconsistent snapshot, although the send of a message is not recorded as part of the snapshot, the receive of the message is recorded as part of the snapshot. (You received a message from the future.) So we can say that, your panorama picture is inconsistent if the object moves in the opposite direction of the panorama/snapshot. These are examples of inconsistent snapshots.

And, these are examples of consistent snapshot. (Maybe the last two are debatable as they duplicate some state.)

Finally, this seemingly-consistent inconsistent snapshot (the bearded guy on the leftmost is teleported to reappear as the rightmost person) points to the dangers of ignoring backchannels when taking a snapshot.

Probably it is not worth trying to strain the analogy further, so I will stop.  Here are some more funny iphone panoramic pictures. 


Unknown said…
When Mani taught this, he would use the analogy of counting migrating birds in the sky by use of a camera that could not cover the entire sky. I don't recall all the details, but it was something along those lines. Actually I came to prefer Friedmann's style more.
Murat said…
Now I am curious, what is Friedmann's style?

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)