DDIA: Chp 6. Partitioning

Chapter 6 of the Designing Data Intensive Applications (DDIA) book discusses partitioning, a key technique for scaling large datasets and high query throughput in distributed databases. By breaking data into smaller partitions, it can be distributed across multiple nodes in a shared-nothing cluster. This allows the storage and processing load to be spread across many disks and processors. Fun fact: Partitioned databases were pioneered in the 1980s by products such as Teradata and Tandem NonStop SQL, and in 2000s rediscovered by NoSQL databases and Hadoop-based data warehouses.

Partitioning is often combined with replication, where each partition is stored on multiple nodes for fault tolerance. In a typical setup, each partition has a leader node that handles writes, with follower nodes replicating the data.


Partitioning of Key-Value Data

There are two main approaches to partitioning key-value data.

Partitioning by key range: Assign a continuous range of keys to each partition, like volumes in an encyclopedia. This enables efficient range scans, but can lead to hot spots if the data is unevenly distributed. This partitioning strategy is used by Bigtable, and its open source equivalent HBase. 

Partitioning by hash of key: Use a hash function to deterministically map each key to a partition. This spreads the data more evenly, but makes range queries more difficult. Many distributed databases use hashing to avoid hot spots. For partitioning purposes, the hash function need not be cryptographically strong: for example, MongoDB uses MD5, and Cassandra uses Murmur3.

Handling skewed workloads, where certain keys are disproportionately accessed, can be challenging. One technique is to add a random suffix to hot keys to distribute the load across multiple partitions. But now any reads have to do additional work, as they have to read the data from all partitions and combine it.

With machine learning, especially with followup work to learned index work, data systems may be able to automatically detect and compensate for skewed workloads.


Partitioning and Secondary Indexes

Partitioning becomes more complex when secondary indexes are involved. There are two main approaches.

Document-partitioned indexes: Each partition maintains its own secondary index. For example, whenever a red car is added to the database, the database partition automatically adds it to the list of document IDs for the index entry color:red, as in Fig 6.4. This necessitates "scatter/gather" queries that hit multiple partitions, which can be inefficient.

Term-partitioned global indexes: Build a single global index that is partitioned separately from the primary data. For example, a term would be color:red as in Fig 6:5. This enables more efficient queries (especially range scans), but complicates updates as writes may affect multiple secondary index partitions. In practice, updates to global secondary indexes are often asynchronous (that is, if you read the index shortly after a write, the change you just made may not yet be reflected in the index).


Rebalancing Partitions

Rebalancing may become necessary when query/CPU load increases, growing dataset grows needs more storage, or when some machines/nodes fail. Good rebalancing should distribute load fairly across nodes, be nonblocking for operations, and minimize data movement to reduce network and I/O load.

Two primary strategies have emerged for rebalancing partitions. The first approach, known as fixed number of partitions, involves creating many more partitions than there are nodes in the initial cluster setup. These partitions are then distributed across the available nodes. Now, if a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed once again. This method is relatively simple to implement and manage, but it requires careful consideration when choosing the initial number of partitions, as this number effectively sets an upper limit on cluster growth.

The second strategy, dynamic partitioning, is more flexible but more complex to implement. In this approach, partitions are created and split as needed, adapting to the changing distribution of data. This method is particularly useful for databases that use key-range partitioning, where fixed partition boundaries could lead to severe data imbalances. Dynamic partitioning can also be applied to hash-partitioned data, offering adaptability to various database designs. MongoDB since version 2.4 supports both key-range and hash partitioning, and it splits partitions dynamically in either case.


Request routing

Clients need a way to route requests to the appropriate partition, but as partitions are rebalanced, the assignment of partitions to nodes changes.


There are three common approaches to this problem as shown in Figure 6-7.

One approach is to allow clients to contact any node in the cluster, typically through a round-robin load balancer. If the contacted node doesn't own the required data partition, it forwards the request to the appropriate node, receives the response, and relays it back to the client. This approach simplifies client-side logic but may introduce additional network hops.  Cassandra and Riak use this approach and employ a gossip protocol for nodes to share information about cluster state changes.

Another approach employs a dedicated routing tier that acts as a partition-aware load balancer. This tier receives all client requests, determines the correct node to handle each request based on its knowledge of the partition layout, and forwards the request accordingly. While this centralizes the routing logic, it also introduces an additional component that needs to be managed and scaled.

A third approach requires clients to be aware of the partitioning scheme and node assignments. This allows clients to connect directly to the appropriate node without intermediaries, potentially reducing latency. However, it increases complexity on the client side and requires clients to be updated when the cluster topology changes.

To help the routing tier, many distributed systems use a separate coordination service, such as ZooKeeper, to manage cluster metadata. In this setup, each node registers with ZooKeeper, which maintains an authoritative mapping of partitions to nodes. Other components, like the routing tier or partition-aware clients, can subscribe to this information. ZooKeeper notifies these subscribers of any changes in partition ownership or cluster membership, ensuring routing information stays current. MongoDB has a similar architecture, but it relies on its own config server implementation and mongos daemons as the routing tier.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book