Paper review: Comprehensive and efficient runtime checking in system software through watchdogs

This paper by Chang Lou, Peng Huang, and Scott Smith appeared in HotOS 2019. The paper argues that system software needs intrinsic detectors that  monitor internally for subtle issues specific to a process. In particular, the paper advocates employing intrinsic watchdogs as detectors. Watchdogs (also known as grenade timers) have been widely used in embedded devices. Watchdogs use a decrementing timeout counter which resets the processor when it reaches zero. To prevent a reset, the software must keep restarting the watchdog counter after performing/clearing some sanity checks.

Table 1 summarizes the comparison of crash failure detectors, intrinsic watchdogs, and error handlers. Failure detectors are too general, they just make "up-or-down" decisions. They are only good for achieving liveness, as they are too unreliable for making safety decisions.

The disadvantage with error handlers, the paper argues, is  that liveness-related failures often do not have explicit error signals that can trigger a handler: there is no signal for e.g., write being blocked indefinitely or some thread deadlocking or infinitely looping.

The mimicking approach for writing watchdog timers

The paper prescribes the mimicking approach, where the checker selects important operations from the main program, and mimics them for detecting any errors. Since the mimic checker exercises similar code logic in a production environment, it has the potential to catch and locate bugs in the program as well as faults in the environment.

The challenge with this approach is to systematically select important/representative operations from the main program. To solve this problem, the paper proposes a method using static analysis to automatically generate mimic-type watchdogs, which it calls program logic reduction.

We instead propose to derive from P a reduced but representative version W, which nonetheless retains enough code to expose gray failures. Our hypothesis that such reduction is viable stems from two insights. First, most code in P need not be checked at runtime because its correctness is logically deterministic --such code is better suited for unit testing before production and thus should be excluded from W. Second, W’s goal is to catch errors rather than to recreate all the details of P’s business logic. Therefore, W does not need to mimic the full execution of P. For example, if P invoked write() many times in a loop, for checking purposes, W may only need to invoke write() once to detect a fault.

The authors have built a prototype, called AutoWatchdog, and applied it to
ZooKeeper, Cassandra and HDFS, generating tens of checkers for each. Figure 2 shows an example from ZooKeeper.

MAD questions

1. If we detect the problems by mimicking the program, why don't we do the detection as part of the program rather than using a separate watchdog?

The watchdog detectors have a countdown timer for detecting getting stuck at any point due to some operation. This, of course, could have been added to the program as well, but maybe that makes the program look complicated. I guess having this in the watchdog detectors provide modularity, as in aspect-oriented programming.

I guess another benefit of having a separate watchdog detector is that we have a flexibility to locate it more centrally, rather than with the execution in one process. The gray failures paper made a case of asymmetry of information, and that there is a need for more end-to-end detection. Having a separate watchdog detector, we can maybe put parts of it or copies of it in different processes for being able to detect/check faults from different perspectives.

2. How do we make watchdogs compose?
One challenge that will surface when AutoWatchdog creates multiple watchdog detectors for a program is interference among the watchdogs. A reset triggered by a watchdog detector may lead to another reset triggered on another detector. And this may even get continued as the two watchdog detectors trigger resets for each other. Of course, this is more about correction that detection, so this is outside the scope of the paper.

However, even for just detection, care must be taken that the mimicking in the watchdog detectors do not have side effects and mess up the correctness of the program. The paper cautions about this problem: "Executing watchdog checkers should not incur unintended side-effects or add significant cost to the normal execution. For example, in monitoring the indexer of kvs, the checkers may try to retrieve or insert some keys, which should not overwrite data produced from the normal execution or significantly delay normal request handling." I am not sure how it is possible to avoid this problem completely with automatically generated watchdogs. If sandboxes are used, the watchdogs would not be testing/monitoring the real production environment.


Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)