Thursday, April 25, 2013

Attacking Generals and Buridan's Ass, or How did the chicken cross the road?

Our department moved into a brand-new beautiful building last year. My new office at this building has a nice side-view of the lake, and also a bird-view of the pedestrian crosswalk on the road in front of the building.

I spend a good fraction of my time in the office on my makeshift standing desk in front of the window. So I get to watch (from my peripheral vision) the pedestrians crossing the crosswalk. This makes for a surprisingly suspenseful and distracting pastime. It turns out that crossing the crosswalk is not a trivial task, but rather a complex process. Below I talk about the crosswalk scenarios I observed, and how these relate to some fundamental impossibility results in distributed systems.

The crosswalk

This crosswalk does not have any traffic lights as it is on a low-traffic/in-campus road. According to the crosswalk protocol, the pedestrian has the right-of-way on the crosswalk: the car needs to stop if the pedestrian shows "intent to cross" (which is unfortunately not well-defined). In the common case, there is no car approaching and the pedestrian crosses easily. In the ideal execution of the crosswalk protocol, the car sees the pedestrian approaching the crosswalk and slows down in advance, and this (the car's intention to give right-of-way to the pedestrian) becomes clear to the pedestrian upon which she starts crossing.

Now let's look at the non-ideal executions of the crosswalk protocol. Sometimes it is not clear to the pedestrian whether the car is slowing down or not, yet with some leap of faith the pedestrian starts on the crosswalk. (Probably this is the correct thing to do as a pedestrian: After observing that the car is in comfortable stopping distance from the crosswalk, just take that leap of faith and start walking.) Then, the car slows down giving indication that the driver saw the pedestrian. But, sometimes the car may not slow down or may not appear to slow down from the pedestrian's point of view. In this case, the pedestrian may run to complete the crossing, or stand still in the middle of crosswalk looking confused, waiting for the car to make the next move (either cross or slow down).

Another faulty case occurs when the pedestrian hesitates before entering the crosswalk. The car slows down, and the pedestrian motions to the car with his hand to pass. The car hesitates for a short moment (or takes some time to accelerate again), and then both the car and the pedestrian start moving at the same time. Then they both stop. Then the pedestrian insists on vigorously hand-waving the car to continue, and the car driver motions the pedestrian to continue. This is the equivalent of the awkward corridor dance that ensues when two people cannot negotiate which side they should pass on the corridor. However, this is more embarrassing and dangerous. This case occurs surprisingly frequent, and in one case involved an Engineering professor as the pedestrian :-)

So, as a pedestrian, never ever hand-wave the car to pass. You have the right-of-way as a pedestrian, and by signaling the car to pass you are just messing up the crosswalk protocol. In the distributed systems lingo, you are a faulty process---or even a Byzantine process--- as you deviate from the protocol. And if this happens to you as a driver, when a pedestrian motions you to continue, ignore it, follow the protocol: Just stop and let the pedestrian pass. Better violate the progress condition than violating the safety condition.

This is supposed to be a simple protocol; the pedestrian has the right-of-way in the crosswalk. But as I covered a bit above, a lot of things can go awry in this simple protocol. There are a lot of corner cases because this is a distributed system with pedestrian and driver agents and is subject to faults and non-ideal environmental conditions.

The distributed system

As a distributed systems person, I can't help but relate these difficulties to how hard consensus is in a distributed system. A very basic and comprehensive impossibility result in distributed systems states that "Using a communication link that may experience intermittent failures, there is no deterministic algorithm for reaching agreement!" (This is known as the attacking generals problem, and it holds for both the asynchronous and the synchronous system models.)

The interaction of the pedestrian and the car driver is a lossy channel indeed, due to the different understanding/interpretations of intents. So, according to the attacking generals impossibility result, you cannot have a protocol that satisfies both the safety and progress conditions of the crosswalk protocol for all possible executions. We can only hope that safety is not violated (thankfully, that is the case on this crosswalk so far), and the only thing that suffers is the progress condition (and that only temporarily).

There are further bad news for the crosswalk protocol unfortunately. Even with the ideal conditions at the higher level (lossless channels and crash-immune agents), another impossibility result lurks at the lower level. This is known as the arbiter problem, or in its classical formulation as Buridan's ass: "an ass that starves to death because it is placed equidistant between two bales of hay and has no reason to prefer one to the other." This impossibility result states that it is impossible to build an arbiter (a device for making a binary decision based on inputs that may be changing) that is guaranteed to reach a decision in a bounded length of time in all possible executions. ("The basic proof that an arbiter cannot have a bounded response time uses continuity to demonstrate that, if there are two inputs that can drive a flip-flop into two different states, then there must exist an input that makes the flip-flop hang." Another proof of impossibility without appeal to continuity was given here.) Please read Leslie Lamport's fascinating account of the problem and his problems with getting the paper he wrote on this published.

The lessons

There are two lessons to draw here. The first one is that implementing even simple protocols are messy (especially so for distributed systems) due to the many corner cases and low level problems. Abstractions start to leak at the low level. So cut enough slack for safety; don't over-optimize and make your system fragile. Keep backup/recovery strategies for corner cases.

The second lesson is that, practice always finds a way. Despite the attacking generals impossibility result, there are millions of crosswalk crossings and billions of TCP connections every day. The impossibily results say you cannot have both safety and progress in all executions. Practical systems have probabilistic guarantees and backup plans for corner cases: Progress may get a temporarily hit, or safety gets violated but retry/recovery kicks in. And this works for us.
Theory is when you know everything, but nothing works.
Practice is when everything works, but no one knows why.

No comments: