LADIS 2013 keynotes

- November 08, 2013

I attended and presented a paper at LADIS 2013, which was co-located with SOSP 2013. I will talk about my paper in a later post. Here I just want to share brief summaries of the LADIS keynotes.

1. LADIS keynote: Cloud-scale Operational Excellence, by Peter Vosshall, distinguished engineer at Amazon

What is operational excellence? It is anticipating and addressing problems.

For us, operational excellence arises from a combination of culture + tools + processes.

1.1 Culture

Amazon leadership principles are:

Customer obsession
Ownership (Amazon has a strong ownership culture, known as devops!)
Insisting on the highest standards

1.2 Tools

Amazon has tools for software deployment, monitoring, visualization, ticketing, and risk auditing.

In 1995, Amazon ran a single web server and deployed with a website-push Perl script. This was managed by a small centralized team, named Houston.

The team invested in a tool called Apollo for automating deployments, which made deployments easy. Some 2011 numbers: mean time between deployments, 11.6 seconds; maximum number of deployments in an hour, 1079; mean number of hosts simultaneously receiving a deployment, 10000.
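
To put these numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes the 11.6-second mean and the hourly maximum describe the same period, which the talk summary does not state explicitly. Under that assumption, the mean works out to roughly 310 deployments per hour, and the peak hour is about 3.5 times the mean.

```python
# Back-of-the-envelope check of the 2011 deployment numbers quoted above.
# Assumption: the mean gap and the hourly peak cover the same period.
SECONDS_PER_HOUR = 3600

mean_gap_seconds = 11.6   # mean time between deployments
peak_per_hour = 1079      # max number of deployments in an hour

# A deployment every 11.6 seconds on average gives this many per hour.
mean_per_hour = SECONDS_PER_HOUR / mean_gap_seconds

# How much busier the peak hour was than an average hour.
peak_to_mean = peak_per_hour / mean_per_hour

print(f"mean deployments per hour: {mean_per_hour:.1f}")
print(f"peak hour relative to mean: {peak_to_mean:.2f}x")
```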

Another tool for enabling continuous deployment is pipelines, which automate the path the code takes from check-in to production: packages -> version set -> beta -> 1box -> production.
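
The stage-by-stage path above can be sketched in a few lines of Python. This is only an illustration, not Amazon's tooling: the stage names come from the talk, while `run_pipeline` and the per-stage check functions are hypothetical names made up for this sketch.

```python
from typing import Callable, Dict, List

# Stage names as listed in the talk; everything else is hypothetical.
STAGES: List[str] = ["packages", "version set", "beta", "1box", "production"]


def run_pipeline(change: str, checks: Dict[str, Callable[[str], bool]]) -> List[str]:
    """Promote a change one stage at a time and stop at the first failing check.

    Returns the list of stages the change passed, in order.
    """
    passed: List[str] = []
    for stage in STAGES:
        # A stage with no registered check passes automatically (an assumption).
        check = checks.get(stage, lambda _change: True)
        if not check(change):
            break  # the change does not move past a failing stage
        passed.append(stage)
    return passed


if __name__ == "__main__":
    # A change that fails on the single production box never reaches production.
    print(run_pipeline("change-42", {"1box": lambda change: False}))
```

The point of the sketch is that promotion is automatic and ordered: a change reaches production only after passing every earlier stage.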

1.3 Processes

When you ask for good intentions, you are not asking for a change. "Good intentions don't work, mechanisms work!"

Similar to the "Andon cord" at Toyota, which stops the production line so that issues can be addressed, Amazon has an Andon cord that the customer service department can pull. When the cord is pulled for a product, the category owner must act immediately: the product is removed from the Amazon website until the category owner addresses the problem.

Correction of errors (COE) is another process, a mechanism Amazon employs to learn from mistakes. COE started as emails that frankly documented errors and what was learned from them. Anatomy of a COE today: what happened, what was the impact, the 5 whys, what were the lessons learned, and what are the corrective actions.
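
The anatomy of a COE maps naturally onto a small record type. The sketch below is an illustration only: the section names follow the list above, but the class name `CorrectionOfErrors` and the `missing_sections` helper are hypothetical and do not describe Amazon's actual template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CorrectionOfErrors:
    """One COE document, with the five sections named in the talk."""
    what_happened: str
    impact: str
    five_whys: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    coe = CorrectionOfErrors(what_happened="Checkout was down", impact="Orders failed")
    print(coe.missing_sections())
```

A helper like `missing_sections` turns the process into a mechanism rather than a good intention: an incomplete COE can be detected automatically.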

2. LADIS keynote: Big data and infrastructure, Baidu

I didn't take many notes from this talk, but here is an interesting tidbit to share.

According to the talk, 90% of hardware failures are caused by hard disk drives, so in some sense memory is more reliable than disk. 3-way memory replication is enough for most applications, and that is what Baidu uses. What matters more at the end of the day is fast recovery of a replica.

3. LADIS keynote: Lessons from an internet-scale notification system, by Atul Adya, Google

Thialfi is a notification service. It was first presented at SOSP 2011, and since then it has scaled by several orders of magnitude. The team has learned unexpected lessons along the way, and Atul talked about them.

Thialfi overview: An app registers for an object X, and the registration is recorded at the data center. When X is updated, the app gets a notification. (This is much better than busy polling by the app.) Thialfi abstraction: each object has a unique id and a monotonically increasing 64-bit version number. Thialfi is built around soft state: it recovers registration state from clients if needed.
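
The abstraction can be illustrated with a toy, single-process sketch. This is not Thialfi's API: the class `NotificationService` and its methods are hypothetical names, and the real system is distributed. The sketch only shows the three ideas above: registrations per object, monotonically increasing versions, and soft registration state that can be rebuilt from clients.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

MAX_VERSION = 2**64 - 1  # versions are 64-bit numbers


class NotificationService:
    """Toy model: clients register for objects and hear about new versions."""

    def __init__(self) -> None:
        # Soft state: object id -> ids of clients registered for it.
        self.registrations: Dict[str, Set[str]] = defaultdict(set)
        # Highest version seen so far for each object.
        self.latest_version: Dict[str, int] = {}

    def register(self, client_id: str, object_id: str) -> None:
        self.registrations[object_id].add(client_id)

    def update(self, object_id: str, version: int) -> List[Tuple[str, str, int]]:
        """Record a new version; return (client, object, version) notifications.

        Stale or duplicate versions produce no notifications.
        """
        if not 0 <= version <= MAX_VERSION:
            raise ValueError("version must fit in 64 bits")
        if version <= self.latest_version.get(object_id, -1):
            return []
        self.latest_version[object_id] = version
        clients = sorted(self.registrations.get(object_id, ()))
        return [(client, object_id, version) for client in clients]

    def recover_from_clients(self, reported: Dict[str, Set[str]]) -> None:
        """Rebuild registration state from what clients say they registered for."""
        self.registrations = defaultdict(set)
        for client_id, object_ids in reported.items():
            for object_id in object_ids:
                self.registrations[object_id].add(client_id)


if __name__ == "__main__":
    service = NotificationService()
    service.register("alice", "doc1")
    print(service.update("doc1", 1))  # alice is notified
    print(service.update("doc1", 1))  # duplicate version, nobody is notified
```

Because a notification carries only the object id and its latest version, a client that misses intermediate updates still converges once it sees the newest version.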

Here are some lessons learned from operating Thialfi.

3.1 Lesson 1: Is this thing on? Working for everyone?

You can never know! You need continuous testing in production. For example, look at server graphs to infer end-to-end latency. Chrome sync was the first real customer for Thialfi. For a big customer like Chrome, it is even possible to monitor Twitter for complaints.

3.2 Lesson 2: And you thought you could debug?

In such a large-scale system, you have to log selectively. When a specific user has a problem, debugging can feel like searching for a needle in a haystack. The team had to write custom production code for some customers.

3.3 Lesson 3: Clients considered harmful

If you rely on client-side computation, enticed by the promise of lightweight and scalable servers, you will have problems with old versions of client apps. You cannot update the client code, so don't put code on clients.

3.4 Lesson 4: Getting your code in the door is important

Build a feature only if customers care about it. A corollary is that you may need unclean features (weakest semantics) to get customers. For example, when the team found that version numbers were not feasible for many systems, they modified Thialfi to allow time instead of version numbers.

3.5 Lesson 5: You are building your castle on sand

Use non-optimal consistent hashing (not geo-aware), rather than an optimal balancing scheme that flaps or dithers.
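
To make the trade-off concrete, here is a minimal sketch of a consistent hash ring. It is a generic textbook version, not Thialfi's implementation: the class `HashRing`, the choice of SHA-256, and the 64 virtual nodes per server are all assumptions made for illustration. The property that matters is stability: when a node leaves, only the keys it owned move, so assignments do not flap.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    """Map a string to a 64-bit point on the ring (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes; not geo-aware."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        if not nodes or vnodes < 1:
            raise ValueError("need at least one node and one virtual node")
        # Each node owns several points on the ring to smooth out the load.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points: List[int] = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    full = HashRing(["a", "b", "c"])
    smaller = HashRing(["a", "b"])  # node "c" has been removed
    keys = [f"user-{i}" for i in range(1000)]
    moved = sum(1 for key in keys if full.node_for(key) != smaller.node_for(key))
    print(f"{moved} of {len(keys)} keys moved after removing one node")
```

The placement is not optimal for load or locality, but it depends only on the set of nodes, so it stays put unless membership changes.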

3.6 Lesson 6: The customer is not always right

The example given here was with respect to strict latency and SLAs.

3.7 Lesson 7: You cannot anticipate the hard parts

Hard parts of Thialfi actually turned out to be:

Registrations: getting client and data center to agree on registration state is hard.
Wide-area routing.
Client library and its protocol.
Handling overload.

3.8 Question and answer section

Q: Should one design a service properly at the start or make it grow organically?
A: Atul said that he was a fan of designing properly in the first place, but this failed for Thialfi. They revisited the design three times. His new rule is: if you are building on top of other systems (as was the case with Thialfi), don't spend months on design.
The third rewrite of Thialfi is ongoing. In this revision, they will use Google Spanner for synchronous replication of the registration state!

Q: What about the Dec 2012 Chrome crashes? Did they have anything to do with Thialfi?
A: They had nothing to do with Thialfi; Google Sync was blamed for them. Thialfi has not yet been implicated in a PR-level failure.