Dynamo: Amazon's highly available key-value store

This paper, which appeared in SOSP'07, describes Dynamo, the underlying storage technology for several core services in Amazon's e-commerce platform. Dynamo is a NoSQL system and provides a simple key-value store interface. The commonly accepted (yet still disputed) wisdom is that RDBMSs are overkill for simple key-value workloads and are unsuitable for large-scale multi-datacenter systems.

The goal of Dynamo is to provide reliability at large scale. As the paper says in the introduction, "The reliability and scalability of a system is dependent on how its application state is managed". This was a key lesson from the Ousterhout'90 paper, "The role of distributed state".

Dynamo is an optimistic replication system. In the taxonomy of the optimistic replication survey, Dynamo is a multi-master system (multiple coordinators can update the data) that employs state transfer and asynchronous propagation of updates. Hence, Dynamo allows conflicting updates into the system. Dynamo uses vector clocks to detect conflicts and employs application-level/semantic conflict resolution. There is no bound on replica divergence. Dynamo adopts a best-effort eventual consistency approach and tries to resolve divergent replicas when possible.
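To make the conflict-detection idea concrete, here is a minimal vector clock sketch (illustrative only; the function and node names are mine, not from the paper). A version supersedes another if its clock is greater than or equal on every counter; if neither dominates, the versions are concurrent and Dynamo hands both back to the application to reconcile.

```python
# Minimal vector clock sketch. A clock maps coordinator node -> update counter.

def dominates(vc_a, vc_b):
    """True if vc_a has seen everything vc_b has seen."""
    return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

def compare(vc_a, vc_b):
    a_dom, b_dom = dominates(vc_a, vc_b), dominates(vc_b, vc_a)
    if a_dom and b_dom:
        return "equal"
    if a_dom:
        return "a supersedes b"
    if b_dom:
        return "b supersedes a"
    return "concurrent: application must reconcile"

# Two clients update the same key through different coordinators Sx and Sy:
v1 = {"Sx": 2}            # updated twice through Sx
v2 = {"Sx": 1, "Sy": 1}   # forked from {"Sx": 1}, then updated through Sy
print(compare(v1, v2))    # -> concurrent: application must reconcile
```

In the shopping-cart example from the paper, such concurrent versions are reconciled semantically by merging the cart contents.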

CAP theorem
The Dynamo system emphasizes availability to the extent of sacrificing consistency. The abstract reads "Dynamo sacrifices consistency under certain failure scenarios". Actually, it later becomes clear that Dynamo sacrifices consistency even in the absence of failures: Dynamo may become inconsistent under multiple concurrent write requests, since the replicas may diverge due to multiple coordinators. Using Abadi's model, Dynamo is PAEL: when there are failures (Partitioning), Dynamo chooses Availability over consistency, and Else (even in the absence of failures) Dynamo chooses Low-latency over consistency.

Dynamo is inclined to sacrificing consistency because the application semantics can tolerate inconsistency. Dynamo employs post hoc conflict resolution to fix the inconsistencies.

Design principles
The Dynamo paper mentions the Google File System (GFS) SOSP'03 work in just two sentences, but I think they are related works and they warrant comparison. Dynamo makes very different design choices than GFS. While GFS used a centralized-master approach (which was maintained in a highly available fashion using Chubby/Paxos), Dynamo uses a pure peer-to-peer (P2P) approach for serving requests.

The Dynamo system fully-embraces the symmetry principle in its design. Quoting from the paper: "Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities. In our experience, symmetry simplifies the process of system provisioning and maintenance."

I think this over-emphasis on symmetry is unwarranted. Why should the system force every node to do everything itself rather than employing specialization? Amazon has a lot of expertise in services and service level agreements, so I am surprised that they did not simplify the design and implementation of Dynamo by employing specialized nodes to provide a distributed directory lookup service. The VL2 paper shows that a distributed directory service can be implemented in a very efficient and highly available manner in data centers.

The symmetry principle, and the resulting P2P fully-decentralized storage system, looks contrived in the paper. I wish the paper included some justification for this design choice; I cannot find very good reasons for it. Obviously this choice leads to a highly available system, but high availability is also achievable in practice with a master coordinator approach using a Chubby/Paxos solution. And using a master coordinator would have significantly simplified several design problems in the paper, as I discuss in the next section.

The interesting thing is that the paper lists heterogeneity as another key design principle right after the symmetry principle: "Heterogeneity: The system needs to be able to exploit heterogeneity in the infrastructure it runs on. e.g. the work distribution must be proportional to the capabilities of the individual servers. This is essential in adding new nodes with higher capacity without having to upgrade all hosts at once." Doesn't this hint that specialization can also be warranted to exploit heterogeneity? There is no need to strictly adhere to the symmetry principle.

Details of Dynamo
The paper gives details on the partitioning, replication, versioning, membership, and failure handling components of Dynamo.

The partitioning algorithm employs a typical P2P key distribution scheme (consistent hashing, as in Chord). Virtual nodes (tokens) are used to make the key distribution's load balancing more fine-grained and more uniform. Dynamo takes pains to ensure that every storage node is utilized almost equally to every other storage node, but enforcing such precise load balancing can lead to wasted traffic on node joins and leaves.
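The idea can be sketched in a few lines (a toy illustration under my own naming, not Dynamo's code): each physical node is hashed onto the ring multiple times as virtual nodes, and a key is assigned to the first token clockwise from its hash position.

```python
import bisect
import hashlib

def h(s):
    """Hash a string to a point on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes (tokens), as in Chord/Dynamo."""
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns `vnodes` tokens scattered around the ring.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.tokens = [t for t, _ in self.ring]

    def coordinator(self, key):
        # First token clockwise from the key's position (wrapping around).
        i = bisect.bisect(self.tokens, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["A", "B", "C"])
print(ring.coordinator("cart:42"))  # some node in {A, B, C}, stable across calls
```

With more virtual nodes per host, the per-node load spread tightens, which is exactly the fine-grained balancing the paper is after.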

Replication is performed so that each data item is replicated at N hosts. The coordinator contacts N nodes, but for the sake of availability the operation succeeds once a quorum responds. With N < R+W, read and write quorums intersect. Tuning R and W yields a read-optimized or write-optimized system. In Dynamo, a popular configuration is N=3, R=2, W=2. (Note that we could still employ this type of quorum with a master approach; in GFS, N=3, R=1, W=3. Dynamo can provide only per-service tuning of R and W, but with a master-based solution you can have per-key tuning of R and W.)
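The quorum-intersection condition can be verified exhaustively for small N (a brute-force illustration, not Dynamo code): whenever R + W > N, any read quorum must share at least one replica with any write quorum, by pigeonhole.

```python
from itertools import combinations

def quorums_intersect(n, r, w):
    """True if every size-r read quorum overlaps every size-w write quorum
    drawn from the same n replicas."""
    replicas = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

print(quorums_intersect(3, 2, 2))  # True: Dynamo's popular N=3, R=2, W=2
print(quorums_intersect(3, 1, 1))  # False: a read can miss the latest write
```

This is why the R=2, W=2 configuration guarantees a read sees at least one replica that took the latest write, while R=1, W=1 does not.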

Data versioning is performed by using vector clocks. Dynamo treats the result of each modification as a new and immutable version of the data.

The biggest difference between Dynamo and traditional P2P systems is that Dynamo employs a one-shot (zero-hop) lookup: all nodes maintain information about all key-coordinator mappings. Dynamo does not employ a directory service lookup, but insists --due to the symmetry principle-- that all nodes maintain all the directory information. A request is first routed to a random node (for load balancing purposes), and since that node knows which node is primarily responsible (a.k.a. the coordinator) for the key (typically the first among the top N nodes in the key's preference list), it routes the request to the corresponding coordinator. If by chance the node that received the request in the first place is also in the list, it can coordinate the request itself. Read and write operations involve the first N healthy nodes in the preference list, skipping over those that are down or inaccessible.
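Building a key's preference list can be sketched as follows (my own toy code, assuming the consistent-hashing ring described above): walk the ring clockwise from the key's position and collect the first N distinct physical nodes, skipping additional virtual nodes of hosts already chosen.

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=8):
    """Sorted (token, node) pairs: each node appears at several ring positions."""
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def preference_list(ring, key, n):
    """First n distinct physical nodes clockwise from the key's position.
    The node at index 0 is the key's coordinator."""
    tokens = [t for t, _ in ring]
    i = bisect.bisect(tokens, h(key))
    seen, prefs = set(), []
    while len(prefs) < n:
        node = ring[i % len(ring)][1]
        if node not in seen:        # skip extra tokens of an already-chosen host
            seen.add(node)
            prefs.append(node)
        i += 1
    return prefs

ring = build_ring(["A", "B", "C", "D"])
print(preference_list(ring, "cart:42", 3))  # 3 distinct nodes; first is coordinator
```

Since every node can compute this list locally from the gossiped membership state, any node can route (or directly coordinate) a request in one hop.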

The P2P fully decentralized design introduces a lot of complexity
The P2P fully-decentralized system introduces a lot of complexity, especially in the failure handling cases. For handling transient failures, hinted handoff is complicated. For handling permanent failures, replica synchronization is more complicated. Membership and failure detection also get very tricky in a purely decentralized setup: the gossip protocol, seed nodes for external discovery, failure detection, and adding/removing storage nodes are presented in a hand-wavy manner in the paper. Using a master coordinator would have made all these operations very straightforward.
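For the permanent-failure case, the paper's replica synchronization is an anti-entropy protocol using Merkle trees: replicas compare root hashes per key range and recurse only into subtrees that differ. A toy sketch over a sorted list of (key, value-hash) leaves (my own simplification; the real protocol compares trees level by level rather than just roots):

```python
import hashlib

def hsh(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash of a Merkle tree over sorted (key, value_hash) leaves."""
    level = [hsh(f"{k}:{v}".encode()) for k, v in leaves]
    if not level:
        return hsh(b"")
    while len(level) > 1:
        if len(level) % 2:               # duplicate last node on odd levels
            level.append(level[-1])
        level = [hsh(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [("k1", "h1"), ("k2", "h2")]
replica_b = [("k1", "h1"), ("k2", "h2-stale")]
print(merkle_root(replica_a) == merkle_root(replica_b))  # False -> must sync range
```

Matching roots mean the key range needs no data transfer at all, which is what makes anti-entropy cheap in the common case; the complexity the paper glosses over is in maintaining these trees as key ranges move between nodes.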

Using a single master approach would also obviate the need to deal with conflicts. Since updates would be serialized by the single-master, conflicts would be prevented trivially. However, due to faults, an old state may have been exposed, and in the presence of partitions we still have to choose between consistency and availability.

Distributed directory service for lookup
An easier and less radical change to Dynamo would be to adopt a distributed directory service layer for lookup, as in the VL2 paper. I think that would make for a cleaner design and address the self-confessed scalability limitation of the current design. To quote from the paper: "Finally, Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. To do this, each node actively gossips the full routing table with other nodes in the system. This model works well for a system that contains couple of hundreds of nodes. However, scaling such a design to run with tens of thousands of nodes is not trivial because the overhead in maintaining the routing table increases with the system size. This limitation might be overcome by introducing hierarchical extensions to Dynamo. Also, note that this problem is actively addressed by O(1) DHT systems (e.g., [14])."

The thing that has most impressed me about Dynamo is that the replication is so loosely coupled that the replicas can be chosen at different data centers for disaster tolerance. This is made possible by the NoSQL approach and the PAEL design choice in Dynamo. The paper also gives a lot of evaluation results: writes complete in 10 ms on average, and 99.9% of writes complete within 300 ms.

As I mentioned above, I think Dynamo would have been much simpler and efficient by making different design choices. But, what do I know?


PetrolHead said…
"Using a single master approach would also obviate the need to deal with conflicts. Since updates would be serialized by the single-master, conflicts would be prevented trivially."

A single master, however, will become the bottleneck in the system (exactly as happened with GFS). Also, a single master has a location within the network which, if it is not "central", introduces consistently skewed latency for some clients.
