Dhalion: self-regulating stream processing in Heron (VLDB'17)

This paper appeared in VLDB'17, and is by Avrilia Floratou (Microsoft), Ashvin Agrawal (Microsoft), Bill Graham (Twitter), Sriram Rao (Microsoft), and Karthik Ramasamy (Streamlio).

Dhalion aims to further reduce the complexity of configuring and managing Heron streaming applications. It provides a self-tuning/self-correcting extension to Heron to monitor, diagnose, and correct problems with the deployment. (Dhalion is named after a mythical bird with interesting healing capabilities.)

It turns out there are only 4-5 category of things that can (is likely to) go wrong in a Heron deployment. (I had summarized Heron earlier here.) The detectors and diagnosers in Dhalion check for them. The resolver as a result tries one of these corresponding 4-5 correction actions. And the effectiveness of the correctors are then rated.

Yes, you heard it right, instead of guaranteed-to-work correctors, Dhalion rates the correctors' effectiveness and decides what to try next. "After every action is performed, Dhalion evaluates whether the action was able to resolve the problem or brought the system to a healthier state. If an action does not produce the expected outcome then it is blacklisted and it is not repeated again."

This is not a fool-proof method: what if the reason corrector is ineffective because another fault/condition started developing? Another problem I can see is interference among correctors causing problems. But I think Dhalion tries/evaluates correctors one by one, otherwise the interference issues obfuscates the evaluation of correctors.

So Dhalion takes a probabilistic approach to self-tuning and recovery. But maybe that should not be a cause for alarm.  The system is already faulty, why should the correctors be guaranteed to work? If the corrector framework can try/evaluate them quickly, and converge the system back, that is all that matters. (I am not sure if Dhalion can manage to do this fast enough for most scenarios.)


Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)