On designing and deploying Internet scale services

- May 03, 2011

This 2008 paper presents hard-earned lessons from James Hamilton's 20 years of experience with high-scale data-centric software systems and internet-scale services. I liken this paper to "Elements of Style" for the domain of Internet-scale services. Like "Elements of Style", this paper is not to be consumed in one sitting; it is to be revisited every so often.

There are three main overarching principles: expect failures, keep things simple, automate everything. We will see reflections of these three principles in several subareas pertaining to Internet-scale services below.

Overall Application Design

Low-cost administration correlates highly with how closely the development, test, and operations teams work together. Some of the operations-friendly basics that have the biggest impact on overall service design are as follows.

/Design for failure/ Armando Fox has argued that the best way to test the failure path is never to shut the service down normally; just hard-fail it. This sounds counter-intuitive, but if the failure paths aren't frequently used, they won't work when needed. The acid test for fault tolerance is the following: is the operations team willing and able to bring down any server in the service at any time without draining the workload first? (Chaos Monkey, anyone?)

/Use commodity hardware slice/ This is less expensive, scales better for performance and power-efficiency, and provides better failure granularity. For example, storage-light servers will be dual socket, 2- to 4-core systems in the $1,000 to $2,500 range with a boot disk.

Automatic Management and Provisioning

/Provide automatic provisioning and installation./

/Deliver configuration and code as a unit./

/Recover at the service level/ "Handle failures and correct errors at the service level where the full execution context is available rather than in lower software levels. For example, build redundancy into the service rather than depending upon recovery at the lower software layer."

I would amend the above paragraph by saying "at the lowest possible service level where the execution context is available". Building fault tolerance from the bottom up is cheaper and more reusable. Doing it only at the service level is more expensive and not reusable. Building fault tolerance only at the service level also conflicts with another principle the paper cites: "Do not build the same functionality in multiple components".

Dependency Management

As a general rule, dependence on small components or services doesn't save enough to justify the complexity of managing them and should be avoided. Depend on a single, shared instance of a system only when multi-instancing to avoid the dependency isn't an option. When a dependency is inevitable, manage it as follows:

/Expect latency/ Don't let delays in one component or service cause delays in completely unrelated areas. Ensure all interactions have appropriate timeouts to avoid tying up resources for protracted periods.
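As a minimal sketch of bounding every interaction with a timeout, the helper below runs a call in a worker thread and returns a fallback value if the call takes too long. The names `call_with_timeout` and `fallback` are hypothetical, not from the paper, and the approach has a known limit: the caller is released, but the worker thread keeps running until the slow call returns.

```python
import concurrent.futures


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() in a worker thread; give up waiting after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The caller moves on with a fallback instead of blocking.
        # Note: the worker thread itself is not interrupted.
        return fallback
    finally:
        # Do not wait for a slow call to finish before returning.
        pool.shutdown(wait=False)
```

For network calls, a timeout supported by the client library itself is usually preferable, since it also frees the underlying connection.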

/Isolate failures/ The architecture of the site must prevent cascading failures. Always "fail fast". When dependent services fail, mark them as down and stop using them to prevent threads from being tied up waiting on failed components.
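One common way to "mark them as down and stop using them" is a circuit breaker. The sketch below is an illustration under assumptions, not the paper's design: the class name, the `DependencyDown` exception, the failure threshold, and the cool-down period are all hypothetical choices.

```python
import time


class DependencyDown(Exception):
    """Raised immediately while a dependency is marked as down."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the dependency is considered up

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast: do not tie up a thread on a known-bad dependency.
                raise DependencyDown("dependency marked down")
            # Cool-down elapsed: allow one trial call; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # mark the dependency as down
            raise
        self.failures = 0  # a success clears the failure count
        return result
```

Callers wrap each use of the dependency in `breaker.call(...)` and treat `DependencyDown` as a signal to degrade or use a fallback.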

/Implement inter-service monitoring and alerting/

Release Cycle and Testing

Take a new service release through standard unit, functional, and production test lab testing, and then move it into limited production as the final test phase. Rather than deploying as quickly as possible, it is better to put one system in production for a few days in a single data center, then expand to two data centers, and eventually deploy globally. Big-bang deployments are very dangerous.
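The staged approach can be sketched as a loop that widens the deployment one stage at a time and backs out if a stage looks unhealthy. Everything here is hypothetical: `deploy`, `healthy`, and `rollback` stand in for whatever tooling a service actually has, and the multi-day soak period would live inside `healthy`.

```python
def staged_rollout(stages, deploy, healthy, rollback):
    """Deploy stage by stage; roll back everything if any stage is unhealthy.

    Returns True if all stages were deployed, False if the rollout was undone.
    """
    done = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if not healthy(stage):
            # Undo in reverse order so the widest stage is reverted first.
            for finished in reversed(done):
                rollback(finished)
            return False
    return True
```

A call might look like `staged_rollout(["one-server", "one-datacenter", "two-datacenters", "global"], deploy, healthy, rollback)`, where the stage names are placeholders.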

/Ship often and in small increments/

/Use production data to find problems/

/Support version roll-back/

Operations and Capacity Planning

Automate the procedure to move state off damaged systems. Relying on operations to update SQL tables by hand or to move data using ad hoc techniques is courting disaster, because mistakes get made in the heat of battle. Test the automated procedure in production ahead of time: if testing it in production is too risky, the script isn't ready or safe for use in an emergency.

/Make the development team responsible./ You built it, you manage it.

/Soft delete only./ Never delete anything. Just mark it deleted.
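A minimal sketch of soft delete, using an in-memory SQLite database. The `users` table and the `deleted_at` column are assumptions made for illustration; the point is only that a "delete" becomes an update that can be undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)"
)
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def soft_delete(conn, user_id):
    """Mark a row as deleted instead of removing it."""
    conn.execute(
        "UPDATE users SET deleted_at = datetime('now') "
        "WHERE id = ? AND deleted_at IS NULL",
        (user_id,),
    )


def restore(conn, user_id):
    """Undo a soft delete, e.g. after an operator or software mistake."""
    conn.execute("UPDATE users SET deleted_at = NULL WHERE id = ?", (user_id,))


def live_users(conn):
    """Normal reads filter out rows that are marked deleted."""
    rows = conn.execute(
        "SELECT name FROM users WHERE deleted_at IS NULL ORDER BY id"
    )
    return [row[0] for row in rows]
```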

/Track resource allocation./

/Make one change at a time./

/Make everything configurable./ Even if there is no good reason why a value will need to change in production, make it changeable as long as it is easy to do.

Auditing, Monitoring, and Alerting

/Instrument everything/

/Data is the most valuable asset/

/Expose health information for monitoring/

/Track all fault tolerance mechanisms/ Fault tolerance mechanisms hide failures. Track every time a retry happens, a piece of data is copied from one place to another, a machine is rebooted, or a service is restarted. Know when fault tolerance is hiding little failures so they can be tracked down before they become big failures. The paper recounts a 2000-machine service that fell slowly to only 400 available machines over a few days without anyone noticing at first.
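As a small illustration of tracking what fault tolerance hides, the retry helper below counts every failure and every retry it absorbs. The helper name, the counter keys, and the use of an in-process `Counter` are assumptions; a real service would export these counts to its monitoring system.

```python
from collections import Counter

# Hypothetical in-process metrics store for fault-tolerance events.
fault_events = Counter()


def with_retries(fn, attempts=3, name="op"):
    """Call fn(), retrying on exceptions, and record every hidden failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            fault_events[f"{name}.failure"] += 1
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            fault_events[f"{name}.retry"] += 1
```

A rising retry count with no visible errors is exactly the kind of signal that would otherwise stay hidden.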

Graceful Degradation and Admission Control

/Support a "big red switch."/ The concept of a big red switch is to keep the vital processing progressing while shedding or delaying some noncritical workload in an emergency.

/Control admission./ If the current load cannot be processed on the system, bringing more workload into the system just ensures that a larger cross section of the user base is going to get a bad experience.
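Both ideas in this section can be sketched with one small gate in front of the service: a cap on in-flight requests for admission control, and a flag that sheds noncritical work when set. The class, its fields, and the `critical` parameter are hypothetical, and the fixed cap is a simplification of whatever load signal a real service would use.

```python
import threading


class AdmissionController:
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.big_red_switch = False  # when True, shed noncritical work
        self._lock = threading.Lock()

    def try_admit(self, critical=True):
        """Return True if the request may proceed, False if it is rejected."""
        with self._lock:
            if self.big_red_switch and not critical:
                return False  # emergency mode: only vital processing continues
            if self.in_flight >= self.max_in_flight:
                return False  # at capacity: reject rather than degrade everyone
            self.in_flight += 1
            return True

    def release(self):
        """Call once when an admitted request finishes."""
        with self._lock:
            self.in_flight -= 1
```

Rejected requests get a quick, explicit refusal, which keeps the experience acceptable for the users already admitted.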