Building fault-tolerant applications on AWS

Below is my summary of the Amazon Web Services (AWS) whitepaper released on Oct 2011 by Jeff Barr, Attila Narin, and Jinesh Varia.

AWS aims to simplify the task of building and maintaining fault-tolerant distributed systems/services for its customers. For this, AWS prescribes the customers to embrace the following philosophy when they are building their applications on AWS.

1. Make computing nodes disposable/easily replaceable

AWS (as with any cloud provider actually) employs a level of indirection over the physical computer, called virtual machine (VM), to make computing nodes easily replaceable. You then need a template to define your service instance over a VM, and this is called Amazon Machine Image (AMI). The first step towards building fault-tolerant applications on AWS is to create your own AMIs. Starting your application then is simply a matter of launching VM instances on Amazon Elastic Compute Cloud (EC2) using your AMI. Once you have created an AMI, replacing a failing instance is very simple; you can just launch a replacement instance that uses the same AMI as its template. This can be done programmatically through an API invocation.

In short, AWS wants you to embrace an instance as the smallest unit of failure and make it easily replaceable. AWS helps you to automate and make this process more transparent by providing elastic IP addresses and elastic load balancing. To minimize downtime, you may keep a spare instance running, and easily fail over to this hot instance by quickly remapping your elastic IP address to this new instance. Elastic load balancing further facilitates this process by detecting unhealthy instances within its pool of Amazon EC2 instances and automatically rerouting traffic to healthy instances, until the unhealthy instances have been restored.

2. Make state/storage persistent

When you launch replacement service instances on EC2 as above, you also need to provide persistent state/data that these instances have access to. Amazon Elastic Block Store (EBS) provides block level storage volumes for use with Amazon EC2 instances. Amazon EBS volumes are off-instance storage that persists independently from the life of an instance. Any data that needs to persist should be stored on Amazon EBS volumes, not on the local hard-disk associated with an Amazon EC2 instance because that disappears when the instance die. If the Amazon EC2 instance fails and needs to be replaced, the Amazon EBS volume can simply be attached to the new Amazon EC2 instance. EBS volumes store data redundantly, making them more durable than a typical hard drive. To further mitigate the possibility of a failure, backups of these volumes can be created using a feature called snapshots.

Of course this begs the question of "what is the sweet-point in storing to EBS" as it comes with a significant penalty over EC2 RAM, and a slight penalty over EC2 disk. I guess this depends on how you can stretch the definition of "the data that needs to persist" in your application.

3. Rejuvenate your system by replacing instances

If you follow the first two principles, you can (and should) rejuvenate your system by periodically replacing old instances with new server instances transparently. This ensures that any potential degradation (software memory leaks, resource leaks, hardware degradation, filesystem fragmentation, etc.) does not adversely affect your system as a whole.

4. Use georeplication to achieve disaster tolerance

Amazon Web Services are available in 8 geographic "regions". Regions consist of one or more Availability Zones (AZ), are geographically dispersed, and are in separate geographic areas or countries. The Amazon EC2 service level agreement commitment is 99.95% availability for each Amazon EC2 Region. But in order to achieve the same availability in your application, you should deploy your application over multiple availability zones, for example by maintaining a fail-over site in another AZ as in the figure.

5. Leverage other Amazon Web Services as fault-tolerant building blocks

Amazon Web Services offers a number of other services (Amazon Simple Queue Service, Amazon Simple Storage Service, Amazon SimpleDB, and Amazon Relational Database Service.) that can be incorporated into your application development. These services are fault-tolerant, so by using them, you will be increasing the fault tolerance of your own applications.

Let's take the Amazon Simple Queue Service (SQS) example. SQS is a highly reliable distributed messaging system that can serve as the backbone of your fault-tolerant application. Once a message has been pulled by an instance from an SQS queue, it becomes invisible to other instances (consumers) for a configurable time interval known as a visibility timeout. After the consumer has processed the message, it must delete the message from the queue. If the time interval specified by the visibility timeout has passed, but the message isn't deleted, it is once again visible in the queue and another consumer will be able to pull and process it. This two-phase model ensures that no queue items are lost if the consuming application fails while it is processing a message. Even in an extreme case where all of the worker processes have failed, Amazon SQS will simply store the messages for up to four days.


Netflix (one of the largest and most prominent customers of AWS) embraced all of the above points. In fact Netflix has developed a chaos monkey tool to constantly and unpredictably kill some of its service instances in an effort to enforce that their services are built in a fault-tolerant and resilient manner and expose and resolve hidden problems with their services.

OK, after reading this, you can say that at some level the cloud computing fault-tolerance is boring: the prescribed fault correction action is to just replace the failed instance with a new instance. And if you say this, this will make the AWS folks happy, because this is the goal that they try to attain. They want to make faults uninteresting, and automatically dealt with. Unfortunately, not all faults can be isolated at the instance level, the real world isn't that simple. There are many different types of faults, such as misconfigurations, unanticipated faults, application-level heisenbugs, and bohrbugs, that won't fit into this mold. I intend to investigate these remaining nontrivial types of faults, and how to deal with them. I am particularly interested in exploring what role can self-stabilization play here. Another point, that didn't get coverage in this whitepaper is about how to detect faults and unhealthy instances. I would be interested to learn what techniques are employed in practice by AWS applications for this.


This comment has been removed by a blog administrator.

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)