Saturday, November 9, 2013

My notes from SOSP13 welcome and awards

The ACM Symposium on Operating Systems Principles (SOSP) is arguably the top conference in the computer systems area. The conference opened with a welcome talk from the General Chair, Michael Kaminsky (Intel Labs), and the PC Chair, Mike Dahlin (Google and UT Austin).

The chairs started by thanking the many sponsors (platinum, gold, silver, and bronze level) of the conference.

This year SOSP had 628 registrations, which made it the biggest SOSP yet, a 16% increase over SOSP 2011 (which had been the biggest until then). Attendance broke down to 76% from North America, 15% from Europe, and 11% from Asia. Among the attendees, 42% were faculty, 42% students, and 15% from industry. There were 7 workshops on the weekend preceding SOSP (LADIS was one of them), and 40% of the attendees also attended workshops.

This year, for the first time, SOSP had fully open-access conference proceedings (the cost, $1100 per paper, was paid by SIGOPS), and this announcement drew huge applause from the audience.

Of the 150+ papers submitted, 30 were accepted to SOSP. There was a three-round review process with a total of 740 reviews, about 5 reviews per paper on average. The 70 papers that fell in the middle were discussed in depth at the PC meeting.

Three papers received best paper awards:

  1. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert Morris (MIT CSAIL), Eddie Kohler (Harvard)
  2. Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior, Xi Wang, Nickolai Zeldovich, M. Frans Kaashoek, Armando Solar-Lezama (MIT CSAIL)
  3. Naiad: A Timely Dataflow System, Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, Martin Abadi (Microsoft Research)

The 30 accepted papers were presented over 2.5 days in a single track. Each paper got a 30-minute slot: 22-25 minutes for the presentation and 5-8 minutes for questions and answers. This year the Google Moderator tool was also employed for taking questions from the audience, in addition to walk-up-to-the-microphone questions. The tool lets the audience vote on submitted questions, so that the questions with the most votes rise to the top.

A majority of the presentations were very good (which is not the case at a typical conference, where most presentations are dull). It was clear that the presenters had practiced extensively beforehand, which helped them deliver polished 22-25 minute talks. The question-and-answer sessions were lively, and several insightful questions were asked.

The papers and presentation slides (and soon the talk videos) are available from the conference website.

Some of my top picks from the conference (reflecting my prejudices and research interests) are as follows:

  • The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors
  • Dandelion: A Compiler and Runtime for Heterogeneous Systems
  • Sparrow: Distributed, Low Latency Scheduling
  • From L3 to seL4: What Have We Learnt in 20 Years of L4 Microkernels?
  • An Analysis of Facebook Photo Caching
  • Towards Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior
  • Transaction Chains: Achieving Serializability with Low Latency in Geo-Distributed Storage Systems
  • Consistency-Based Service Level Agreements for Cloud Storage
  • Tango: Distributed Data Structures over a Shared Log
  • There Is More Consensus in Egalitarian Parliaments
  • Naiad: A Timely Dataflow System

I will try to edit and organize my notes about some of these talks and share them soon. The last five papers on this list especially appeal to me, since they are closely related to my research interest, large-scale distributed systems.

Tuesday night there was a banquet and award ceremony. Stefan Savage received the 2013 Mark Weiser Award. He gave a humble yet powerful talk, sharing personal stories about how his career benefited greatly from interacting with the late Mark Weiser. Two recent PhD theses received the Dennis Ritchie Award.

Finally, the following five papers were added to the SIGOPS Hall of Fame:

  1. Daniel G. Bobrow, Jerry D. Burchfiel, Daniel L. Murphy, and Raymond S. Tomlinson, “TENEX, a Paged Time Sharing System for the PDP-10”, Communications of the ACM 15(3), March 1972.
  2. Joel Bartlett, “A NonStop Kernel”, in Proceedings of the Eighth ACM Symposium on Operating Systems Principles (SOSP’81), Pacific Grove, California, December 1981.
  3. K. Mani Chandy and Leslie Lamport, “Distributed Snapshots: Determining Global States of a Distributed System”, ACM Transactions on Computer Systems 3(1), February 1985.
  4. Kenneth P. Birman and Thomas A. Joseph, “Exploiting Virtual Synchrony in Distributed Systems”, in Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (SOSP’87), Austin, Texas, November 1987.
  5. Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek, “The Click Modular Router”, ACM Transactions on Computer Systems 18(3), August 2000.

Friday, November 8, 2013

LADIS 2013 keynotes

I attended and presented a paper at LADIS 2013, which was co-located with SOSP13. I will talk about my paper in a later post. Here I just want to share brief summaries of the LADIS keynotes.

1. LADIS keynote: Cloud-scale Operational Excellence, by Peter Vosshall (Distinguished Engineer, Amazon)

What is operational excellence? It is anticipating and addressing problems.

For Amazon, operational excellence arises from a combination of culture, tools, and processes.

1.1 Culture

Amazon leadership principles are:
  1. Customer obsession
  2. Ownership (Amazon has a strong ownership culture, known as devops!)
  3. Insisting on the highest standards

1.2 Tools

Amazon has tools for software deployment, monitoring, visualization, ticketing, and risk auditing.

In 1995, Amazon ran a single web server and used a website-push Perl script for deployments. This was managed by a small centralized team named Houston.

The team invested in a tool called Apollo to automate deployments. As a result, deployments became easy. Some 2011 numbers: mean time between deployments, 11.6 seconds; maximum number of deployments in an hour, 1,079; mean number of hosts simultaneously receiving deployments, 10,000.

Another tool enabling continuous deployment is pipelines, which automate the path code takes from check-in to production: packages -> version set -> beta -> 1box -> production.
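The stages above can be sketched as a simple gated promotion loop. This is a hypothetical illustration only: the stage names come from the talk, but the gating logic and all function names are my assumptions, not Amazon's actual pipeline implementation.

```python
# Hypothetical sketch of a deployment pipeline: a build advances through
# the stages from the talk, but only while each stage's health check passes.
# Stage names are from the talk; everything else here is assumed.
STAGES = ["packages", "version set", "beta", "1box", "production"]

def promote(build, health_check):
    """Walk a build through the pipeline, halting at the first failed check."""
    for stage in STAGES:
        if not health_check(build, stage):
            return f"{build} halted at {stage}"  # stop the rollout here
    return f"{build} deployed to production"

# Example: a health check that passes every stage.
print(promote("website-1.4.2", lambda build, stage: True))
# -> website-1.4.2 deployed to production
```

The point of the 1box stage in such a pipeline is to expose a new build to real traffic on a single host before it reaches the whole fleet.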

1.3 Processes

When you ask for good intentions, you are not asking for a change: "Good intentions don't work, mechanisms work!"

Similar to the "Andon cord" at Toyota, which stops the assembly line so issues can be addressed, Amazon has an Andon cord that can be pulled by the customer service department. When the cord is pulled, the category owner needs to respond immediately: the product is removed from the Amazon website until the category owner addresses the problem.

Correction of errors (COE) is another process, a mechanism Amazon employs to learn from mistakes. COE started as emails frankly documenting errors and what was learned. The anatomy of a COE today: what happened, what the impact was, the 5 whys, what lessons were learned, and what the corrective actions are.

2. LADIS keynote: Big data and infrastructure at Baidu

I didn't take many notes from this talk, but here is an interesting tidbit to share.

It is known that 90% of hardware failures are caused by hard disk drives, so in some sense memory is more reliable than disk. Three-way memory replication is enough for most applications, and that is what Baidu uses. At the end of the day, fast recovery of a failed replica is what matters more.

3. LADIS keynote: Lessons from an internet-scale notification system, by Atul Adya (Google)

Thialfi is a notification service, first presented at SOSP'11. Since then, Thialfi has scaled by several orders of magnitude. The team learned unexpected lessons along the way, and Atul's talk covered these lessons.

Thialfi overview: an app registers for an object X, and this registration is recorded at the data center; when X is updated, the app gets a notification. (This is much better than busy polling by the app.) The Thialfi abstraction: each object has a unique id and a monotonically increasing 64-bit version number. Thialfi is built around soft state; it recovers registration state from the clients if needed.

Some of the lessons learned from operating Thialfi follow.
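As a rough illustration of the abstraction described above, here is a minimal sketch of a Thialfi-style service: registrations and latest versions are soft state, and registrations can be rebuilt from the clients after a server restart. All class, method, and variable names are my own guesses for illustration, not Thialfi's actual API.

```python
# Minimal sketch of a Thialfi-style notification service. The server keeps
# only soft state: who is registered for each object id, and the latest
# monotonically increasing version per object. Illustrative names only.
from collections import defaultdict

class NotificationService:
    def __init__(self):
        self.registrations = defaultdict(set)  # object_id -> client ids (soft state)
        self.versions = {}                     # object_id -> latest version seen

    def register(self, client_id, object_id):
        """Record a client's interest in an object."""
        self.registrations[object_id].add(client_id)

    def publish(self, object_id, version):
        """Record a new version and return notifications for registered clients."""
        assert version > self.versions.get(object_id, -1), "versions must increase"
        self.versions[object_id] = version
        return [(c, object_id, version) for c in self.registrations[object_id]]

    def recover_from_client(self, client_id, client_registrations):
        """Soft state: after a restart, rebuild registrations from the clients."""
        for object_id in client_registrations:
            self.register(client_id, object_id)

svc = NotificationService()
svc.register("chrome-A", "bookmarks")
print(svc.publish("bookmarks", 7))  # -> [('chrome-A', 'bookmarks', 7)]
```

Note that the notification carries only an id and a version, not the updated data itself; the app fetches the data through its normal path, which keeps the notification service lightweight.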

3.1 Lesson1: Is this thing on? Working for everyone?

You can never know! You need continuous testing in production; for example, looking at server graphs to infer end-to-end latency. Chrome Sync was the first real customer for Thialfi, and for a big customer like Chrome it is even possible to monitor Twitter for complaints.

3.2 Lesson2: And you thought you could debug?

In such a large-scale system, you have to log selectively. When a specific user has a problem, debugging it can be like searching for a needle in a haystack. The team had to write custom production code for some customers.

3.3 Lesson3: Clients considered harmful

If you rely on client-side computation, enticed by the promise of lightweight, scalable servers, you will have problems with old versions of client apps. You cannot force-update client code, so don't put logic on clients.

3.4 Lesson4: Getting your code in the door is important

Build a feature only if customers care about it. A corollary is that you may need unclean features (with the weakest semantics) to win customers. For example, when the team found that version numbers were not feasible for many systems, they modified Thialfi to allow timestamps instead of version numbers.

3.5 Lesson 5: You are building your castle on sand

Use non-optimal but stable consistent hashing (not geo-aware), rather than optimal load balancing that flaps and dithers.
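The trade-off above can be illustrated with a plain (non-geo-aware) consistent hash ring: when a server is added or removed, only the keys in one slice of the ring move, so assignments stay stable. This is a generic sketch of the technique, not Thialfi's actual code.

```python
# Sketch of plain consistent hashing: servers and keys hash onto a ring,
# and each key is assigned to the first server clockwise from its hash.
# Adding or removing one server remaps only a small fraction of keys;
# that stability is what is traded for optimal (geo-aware) placement.
import bisect
import hashlib

def _h(s):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        self.ring = sorted((_h(s), s) for s in servers)

    def lookup(self, key):
        """Return the server responsible for this key."""
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["server-1", "server-2", "server-3"])
print(ring.lookup("user:42"))  # deterministic, but depends on the hash
```

A production version would typically place several virtual nodes per server on the ring to even out the load, but the stability property is the same.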

3.6 Lesson 6: The customer is not always right

The example given here was with respect to strict latency and SLAs.

3.7 Lesson 7: You cannot anticipate the hard parts

Hard parts of Thialfi actually turned out to be:
  1. Registrations: getting client and data center to agree on registration state is hard.
  2. Wide-area routing.
  3. Client library and its protocol.
  4. Handling overload.

3.8 Question answer section

Q: Should one design a service properly at the start, or let it grow organically?
A: Atul said that he was a fan of designing properly in the first place, but this failed for Thialfi; they revisited the design three times. His new rule: if you are building on top of other systems (as was the case with Thialfi), don't spend months on the design.
The third rewrite of Thialfi is ongoing. In this revision, they will use Google Spanner for synchronous replication of the registration state!

Q: The December 2012 Chrome crashes, did those have anything to do with Thialfi?
A: Nothing to do with Thialfi; Google Sync was blamed for it. Thialfi has not been implicated in a PR-level failure yet.
