Showing posts from August, 2017

Retrospective Lightweight Distributed Snapshots Using Loosely Synchronized Clocks

This is a summary of our recent work that appeared at ICDCS 2017. The tool we developed, Retroscope (available on GitHub), enables unplanned retrospective consistent global snapshots for distributed systems.

Many distributed systems would benefit from the ability to take unplanned snapshots of the past. Let's say Alice notices alarms going off for her distributed system deployment at 4:05pm. If she could roll-back to the state of the distributed system at 4:00pm, and roll forward step by step to figure out what caused the problems, she may be able to remedy the problem.

The ability to take retrospective snapshots requires each node to maintain a log of state changes and then to collate/align these logs to construct a consistent cut at a given time. However, clock uncertainty/skew among nodes is dangerous and can lead to taking an inconsistent snapshot. For example, the cut at 4:00pm in this figure using NTP is inconsistent, because event F is included in the cut, but causally prec…

Paper summary: Performance clarity as a first-class design principle (HOTOS'17)

This paper appeared in HOTOS'17 and is by Kay Ousterhout, Christopher Canel, Max Wolffe, Sylvia Ratnasamy, and Scott Shenker. (The team is at UC Berkeley Christopher is now at CMU.)

The paper argues that performance clarity is as important a design goal as performance or scalability. Performance clarity means ease of understanding where bottlenecks lie in a deployment and the performance implications of various system changes.

The motivation for giving priority to performance clarity arises from the performance opaqaness of distributed systems we have, and how hard it is to configure/tune them to optimize performance. In current distributed data processing systems, a small change in hardware or software configurations may cause disproportional impact on performance. An earlier paper, Ernest, showed that selecting an appropriate cloud instance type could improve performance by 1.9× without increasing cost. And when things run slowly in a deployment, it is hard to determine where th…

Dhalion: self-regulating stream processing in Heron (VLDB'17)

This paper appeared in VLDB'17, and is by Avrilia Floratou (Microsoft), Ashvin Agrawal (Microsoft), Bill Graham (Twitter), Sriram Rao (Microsoft), and Karthik Ramasamy (Streamlio).

Dhalion aims to further reduce the complexity of configuring and managing Heron streaming applications. It provides a self-tuning/self-correcting extension to Heron to monitor, diagnose, and correct problems with the deployment. (Dhalion is named after a mythical bird with interesting healing capabilities.)

It turns out there are only 4-5 category of things that can (is likely to) go wrong in a Heron deployment. (I had summarized Heron earlier here.) The detectors and diagnosers in Dhalion check for them. The resolver as a result tries one of these corresponding 4-5 correction actions. And the effectiveness of the correctors are then rated.

Yes, you heard it right, instead of guaranteed-to-work correctors, Dhalion rates the correctors' effectiveness and decides what to try next. "After every a…

Cloud fault-tolerance

I had submitted this paper to SSS'17. But it got rejected. So I made it a technical report and am sharing it here. While the paper is titled "Does the cloud need stabilizing?", it is relevant to the more general cloud fault-tolerance topic.

I think we wrote the paper clearly. Maybe too clearly.

Here is the link to our paper
(Ideally I would have liked to expand on Section 4. I think that is what we will work on now.)

Below is an excerpt from the introduction of our paper, if you want to skim that before downloading the pdf.
The last decade has witnessed rapid proliferation of cloud computing. Internet-scale webservices have been developed providing search services over billions of webpages (such as Google and Bing), and providing social network applications to billions of users (such as Facebook and Twitter). While even the smallest distributed …

Paper Summary: Twitter Heron

This summary combines material from "Twitter Heron: Stream Processing at Scale" which appeared at Sigmod 15 and "Twitter Heron: Towards extensible streaming engines" paper which appeared in ICDE'17.

Heron is Twitter's stream processing engine. It replaced Apache Storm use at Twitter, and all production topologies inside Twitter now run on Heron. Heron is API-compatible with Storm, which made it easy for Storm users to migrate to Heron. Reading the two papers, I got the sense that the reason Heron was developed is to improve on the debugability, scalability, and manageability of Storm. While a lot of importance is attributed to performance when comparing systems, these features (debugability, scalability, and manageability) are often more important in real-world use.

The gripes with StormHard to debug. In Storm, each worker can run disparate tasks. Since logs from multiple tasks are written into a single file, it is hard to identify any errors or exceptions …

On presenting well

I had written about "how to present your work" earlier. If you haven't read that, that is a good place to start.

I started thinking about this topic recently as my two PhD students Ailidani Ailijiang and Aleksey Charapko had to make their first presentations in ICDCS 2017. I worked with them to prepare their presentations. I listened to 3 drafts of their presentations and helped them revise the presentations. Since the topic of presenting your work was rekindled, I thought I should do another post on this.

Most talks suck I told my students that most of the conference presentations are dreadful. The good news is that the bar is not high. If you prepare sufficiently and practice, you will be already ahead of the curve.

But why do most talks suck? It is mainly because the presenters got it wrong about the purpose of the talk and try to cover everything in their paper. As a result their slides are not well thought, and the presentation follows the paper outline and tries to…

Paper summary " Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads"

This paper was written earlier than the "PyWren: Occupy the Cloud" paper, and appeared in NSDI'17. In fact this paper has influenced PyWren. The related work says "After the submission of this paper, we sent a preprint to a colleague who then developed PyWren, a framework that executes thousands of Python threads on AWS Lambda. ExCamera’s mu framework differs from PyWren in its focus on heavyweight computation with C++-implemented Linux threads and inter-thread communication."

I had written about AWS Lambda earlier. AWS Lambda is a serverless computing framework (aka a cloud-function framework) designed to execute user-supplied Lambda functions in response to asynchronous events, e.g., message arrivals, file uploads, or API calls made via HTTP requests.

While AWS Lambda is designed for executing asynchronous lightweight tasks for web applications, this paper introduced the "mu" framework to run general-purpose massively parallel heavyweight computation…

Paper summary: Occupy the Cloud: Distributed Computing for the 99%

"We are the 99%!" (Occupy Wall Street Movement, 2011)

The 99% that the title of this paper refers to is the non-cloud-native and non-CS-native programmers. Most scientific and analytics software is written by domain experts like biologists and physicists rather than computer scientists. Writing and deploying at the cloud is hard for these folks. Heck it is even hard for the computer science folk. The paper reports that an informal survey at UC Berkeley found that the majority of machine learning graduate students have never written a cluster computing job due to complexity of setting up cloud platforms.

Yes, cloud computing virtualized a lot of things, and VMs, and recently containers, reduced the friction of deploying at the clouds. However, there are still too many choices to make and things to configure before you can get your code to deploy & run at the cloud. We still don't have a "cloud button", where you can push to get your single machine code deplo…

ICCCN'17 trip notes, days 2 and 3

Keynote 2 (Day 2)Bruce Maggs gave the keynote on Day 2 of ICCCN. He is a professor at Duke university and Vice President of Research at Akamai Technologies. He talked about cyber attacks they have seen at Akamai, content delivery network (CDN) and cloud services provider.

Akamai has 230K servers, 1300+ networks, 3300 physical locations, 750 cities, and 120 countries. It slipped out of him that Akamai is so big, it can bring down internet, if it went evil, but it would never go evil :-) Hmm, should we say "too big to go evil?". This, of course, came up as a question at the end of the talk: how prepared is the Internet for one of the biggest players, such as Google, Microsoft, Netflix, Yahoo, Akamai, going rouge? Bruce said, the Internet is not prepared at all. If one of these companies turned bad, they can melt internet. I followed up that question with rouge employee and insider threat question. He said that, the Akamai system is so big that it doesn't/can't work wi…

ICCCN'17 trip notes

I flew with United airlines, with a connection at Chicago at noon to arrive Vancouver at 3:00pm PST. The second flight took 4.5 hours and was not easy. On the plus side, the flights were on time. I don't want to jinx it but I have not had any flight trouble the last 2-3 years. Chicago O'Hare airport still scares me. For at least 4-5 instances, I spent extra nights at O'Hare airport due to canceled or missed flights.

The entry to Vancouver was nice. We used screens to check ourselves in at the border. After answering some questions, and taking a selfie, the screen gave me a printout. The Canadian border agents, only took 5 seconds checking my passport and printout before allowing me in the country. Finally a border check that doesn't suck.

It was also pleasant to travel from the airport to the conference hotel, at the Vancouver Waterfront. I bought a ticket for $9, and jumped on the Canada Line train at the airport. The two-car train did not have any operator, and oper…

Popular posts from this blog

I have seen things

SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

PigPaxos: Devouring the communication bottlenecks in distributed consensus

Frugal computing

Fine-Grained Replicated State Machines for a Cluster Storage System

Learning about distributed systems: where to start?

My Distributed Systems Seminar's reading list for Spring 2020

Cross-chain Deals and Adversarial Commerce

Book review. Tiny Habits (2020)

Zoom Distributed Systems Reading Group