Scalable distributed data structures for internet service construction

I think this 2000 paper (by Gribble, Brewer, Hellerstein, and Culler) may as well be the original NoSQL paper. The paper starts off by identifying the problems with RDBMS that prohibit scalability. (This is how you would motivate a NoSQL key-value store system even today :-)
  1. RDBMSs have not been designed with Internet service workloads, service properties, and cluster environments in mind, and as a result they fail to provide the right scaling, consistency, or availability guarantees. 
  2. RDBMSs permit users to decouple the logical structure of their data from its physical layout, which is a good thing, but excessive data independence (isolation of application from modifying the layout of data definition and organization) can make parallelization and therefore scaling hard. 
  3. RDBMSs always choose consistency over availability.
The paper then advocates a design sweet-point for achieving scalable and consistent data management for web services. RDBMS is out because it provides a too-high-a-level abstraction with ACID and SQL. Filesystems, on the other hand, expose a too low-level interface with little data independence and less strictly defined consistency guarantees where filesystem elements (files and directories) are directly exposed to the clients and the clients are responsible for logically structuring their application data to using these elements. The paper aims to choose a level of abstraction that provides a well-defined and simple consistency model somewhere in between that of an RDBMS and a filesystem. As a solution, the paper proposes a distributed data structure (DDS) ---in this case a distributed hash table--- and argues that DDS interfaces, while not as general as SQL, are rich enough to successfully build sophisticated services. DDS is touted to achieve the desired web service properties: scalability, fault-tolerance, availability, consistency, durability, concurrency.

DDS and key-value stores 

\epsfbox[37 323 709 584]{arch_overview.eps} }
Figure: High-level view of a DDS: a DDS is a self-managing, cluster-based data repository. All service instances (S) in the cluster see the same consistent image of the DDS; as a result, any WAN client (C) can communicate with any service instance.

DDS is basically a key-value store as we understand it today. As the paper puts it, DDS provides a persistent data management layer designed to simplify cluster-based Internet service construction. A distributed hash-table underlies this data management layer, and it simplifies Internet service construction by decoupling service-specific logic from the complexities of persistent, consistent state management. This allows services to inherit the necessary service properties from the DDS rather than having to implement them themselves. DDS presents a conventional single-host data structure interface to service authors, but in fact it partitions and replicates the data across a cluster. DDS is a cluster-level data structure and is not designed for a WAN.

The novel aspects of a DDS are the level of abstraction it presents to service authors (by providing a data structure at the programming language level), the consistency model it supports, the access behavior (concurrency and throughput demands) that it presupposes, and its design and implementation choices that are made based on its expected runtime environment and the types of failures that it should withstand. SEDA is employed for implementing DDS to achieve high throughput and high concurrency.

DDS architecture

\epsfbox[19 148 519 587]{dds_arch.eps} }
Figure: Distributed hash table architecture: each box in the diagram represents a software process. 

Services using DDS may keep soft-state but they rely on the hash table to manage all persistent state. DDS library contains only soft-state, including metadata about the cluster's current configuration and the partitioning of data in the distributed hash tables across the bricks. The DDS library acts as the 2-phase commit coordinator for update operations on the distributed hash tables. (Dynamo forgoes this consistency step, and avoids the complications discussed next.) The paper explains recovery mechanisms for what happens when coordinator fails during this 2-phase commit. However, this unavoidably leads to many corner cases and complicated to manage and may lead to recovery-induced inconsistencies. The 2-phase commit would also slow down write operations and limit scalability.
\epsfbox[47 290 594 580]{metadata.eps} }

Figure: Distributed hash table metadata maps: The key is used to traverse the DP map trie and retrieve the name of the key's replica group. The replica group name is then used looked up in the RG map to find the group's current membership.

The DDS key-lookup uses a trie-based mapping that can deal nicely with overloaded and hot keys. (For this Dynamo employs a ring-based consistent hashing.) To find the partition that manages a particular hash table key, and to determine the list of replicas in partitions' replica groups, the DDS libraries consults two metadata maps that are replicated on each node of the cluster. First is DP map maintained as trie. And the second map is replica group membership table. These two maps are soft-state and self-cleaning. Instead of enforcing consistency synchronously, the libraries can drift out of date, but lazily updated when they are used to perform operations on the bricks.


I think this paper is very nice introduction to the NoSQL key-value store area, in that you can see the original issues and original design decisions that led to the NoSQL key-value store approach in this paper. The DDS approach of providing a simple data-structure abstraction to the service authors and enabling them to inherit scalability, consistency, fault-tolerance, availability properties from the underlying careful distributed implementation of the data structure ultimately gave us BigTable, MegaStore, and similar distributed data structures.


manuzhang said…
Why data independence makes parallelization hard in RDBMS? What parallelization means here? Would you please explain a little more? Thanks!
Murat said…
@Manuzhang, by parallelization think of sharding

Since databases provide too much isolation, you cannot have control over sharding.
Unknown said…
Hey, nice article - you wrote a great reader friendly summary of the paper. It's cool to read about some of the origins of data management structure!

- Gary
AlphaPoint Data Management
John Michle said…
Its really informative, some facts and other points given here are quite considerable and to the point as well, would be better to look for more of these kind for efficient results.

Service Management Software

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)