DDIA: Chp 3. Storage and Retrieval (Part 1)

This is Chapter 3, part 1 for the Designing Data Intensive Applications (DDIA) book. This part focuses on storage and retrieval for OLTP databases.

Even if you won't be implementing a storage engine from scratch, it is still important to understand how databases handle storage and retrieval internally. This knowledge allows you to select and tune a storage engine that best fits your application's needs. The performance of your application often hinges on the efficiency of the storage engine, especially when dealing with different types of workloads, such as transactional (involving many small, frequent read and write operations) or analytical (involving block reads for executing complex queries over large datasets).

There are two main families of storage engines: log-structured storage engines (e.g., LevelDB, RocksDB) and page-oriented storage engines like B-trees (e.g., PostgreSQL, InnoDB/MySQL, WiredTiger/MongoDB).


Log-Structured Storage Engines

These engines use an append-only log format, which is efficient for write-heavy workloads. Writing data to a log sequentially is faster than doing random writes, especially on traditional hard drives. Periodically, the log is compacted and merged to reclaim space and optimize read performance.

LSM-trees (Log-Structured Merge-Trees), introduced in 1996, are a common implementation for log-structured storage engines. Here data is first stored in an in-memory structure (e.g., a red-black tree), often called the memtable. Once it reaches a certain size (typically a few megabytes), it is written to disk as a sorted string table (SSTable). From time to time, the system runs a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values. This reduces fragmentation and improves read efficiency.
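The memtable-to-SSTable flow can be sketched in a few lines. This is a toy illustration, not how any real engine is written: a plain dict stands in for the balanced tree, the threshold is a key count rather than megabytes, and the `Memtable` class and segment file names are made up for this example.

```python
# Toy sketch of an LSM-style memtable that flushes to a sorted segment
# file (an SSTable) once it grows past a threshold. Illustrative only.

class Memtable:
    def __init__(self, flush_threshold=1024):
        self.data = {}                     # stand-in for a red-black tree
        self.flush_threshold = flush_threshold
        self.segment_id = 0

    def put(self, key, value):
        self.data[key] = value
        if len(self.data) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Writing the keys in sorted order is what makes the file an
        # SSTable: later reads and merges can rely on that ordering.
        path = f"segment-{self.segment_id:04d}.sst"
        with open(path, "w") as f:
            for key in sorted(self.data):
                f.write(f"{key}\t{self.data[key]}\n")
        self.segment_id += 1
        self.data.clear()                  # start a fresh memtable
```

Because each segment is sorted, the background compaction step is essentially a mergesort over segment files, keeping only the most recent value for each key.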

The downside of LSM-trees is that reads can be slower because the system may need to check multiple SSTables to find a key. A read operation first tries to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc. This long traversal can be mitigated by using Bloom filters to avoid unnecessary disk reads.
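This read path can be sketched as follows. The two-hash Bloom filter construction and all names here are illustrative; real engines use carefully tuned filters and binary on-disk segment formats, but the shape of the lookup (memtable first, then segments newest to oldest, skipping segments the filter rules out) is the same.

```python
# Toy sketch of an LSM read path with per-segment Bloom filters.
import hashlib

class BloomFilter:
    def __init__(self, size=1024):
        self.size = size
        self.bits = [False] * size

    def _positions(self, key):
        # Derive two bit positions from one SHA-256 digest (illustrative).
        h = hashlib.sha256(key.encode()).digest()
        yield int.from_bytes(h[:8], "big") % self.size
        yield int.from_bytes(h[8:16], "big") % self.size

    def add(self, key):
        for i in self._positions(key):
            self.bits[i] = True

    def might_contain(self, key):
        # False means the key is definitely absent; True means "maybe".
        return all(self.bits[i] for i in self._positions(key))

def lsm_get(key, memtable, segments):
    """segments: list of (bloom_filter, segment_dict), newest first."""
    if key in memtable:
        return memtable[key]
    for bloom, segment in segments:
        if bloom.might_contain(key):       # skip a disk read on False
            if key in segment:
                return segment[key]
    return None                            # key not found anywhere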

What about crashes? If the database crashes, the most recent writes (which are in the memtable but not yet written out to disk) would be lost. In order to avoid that problem, the system keeps a separate log on disk to which every write is immediately appended. That log is not in sorted order, but that doesn’t matter, because its only purpose is to restore the memtable after a crash. Every time the memtable is written out to an SSTable, the corresponding log can be discarded.
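The append-then-replay idea fits in a few lines. This is a minimal sketch under simplifying assumptions (text format, no fsync, no log truncation on flush); the file name and function names are made up for illustration.

```python
# Toy sketch of crash recovery via a write-ahead log: every put is
# appended to an unsorted log before updating the memtable, and after
# a crash the memtable is rebuilt by replaying the log in order.
import os

WAL_PATH = "memtable.wal"   # illustrative file name

def put(memtable, key, value):
    with open(WAL_PATH, "a") as log:
        log.write(f"{key}\t{value}\n")   # append-only, not sorted
    memtable[key] = value

def recover():
    memtable = {}
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH) as log:
            for line in log:
                key, value = line.rstrip("\n").split("\t", 1)
                memtable[key] = value    # later writes win, as in replay
    return memtable
```

Replaying in append order means a later write to the same key correctly overwrites an earlier one, which is why the log does not need to be sorted.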


Page-Oriented Storage Engines

B-trees (1970) organize data in fixed-size blocks/pages (traditionally 4 KB) to align with the block size of the underlying storage medium. Each B-tree page can hold multiple keys, and pages are linked hierarchically. This tree structure allows for efficient key lookups, inserts, and deletions.

A read operation going down the tree eventually gets to a page containing individual keys (a leaf page), which either contains the value for each key inline or contains references to the pages where the values can be found.

When a page becomes full, it splits, and the tree structure is updated to maintain balance. In Figure 3-6, the branching factor is six, but in reality it is in the hundreds. This keeps the tree only 3-4 levels deep, which minimizes the number of page reads needed for a lookup.
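A quick back-of-the-envelope calculation shows why a few levels suffice (the branching factor of 500 here is an assumed, representative figure):

```python
# With 4 KB pages and a branching factor of 500, a four-level B-tree
# can address an enormous amount of data.
branching_factor = 500          # child pointers per page (assumed)
levels = 4
page_size = 4 * 1024            # 4 KB pages

leaf_pages = branching_factor ** levels        # 62,500,000,000 pages
total_bytes = leaf_pages * page_size
print(total_bytes)              # 256,000,000,000,000 bytes, i.e. ~256 TB
```

So even for very large databases, a lookup touches only a handful of pages on its way from the root to a leaf.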

Write operations update pages in place (in contrast to LSM-trees). Some operations require several different pages to be overwritten. For example, as in Figure 3-7, if you split a page because an insertion caused it to be overfull, you need to write the two pages that were split, and also overwrite their parent page to update the references to the two child pages. A crash during this process may leave the data structure in an inconsistent state. For recovery, B-trees use a write-ahead log (WAL) to ensure that every modification is durably captured before it is applied to the B-tree structure. When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state.

Locks are employed to manage multiple threads accessing the B-tree at the same time.


Comparing LSM-Trees and B-Trees

LSM-trees are generally faster for writes because they minimize the need for random writes by appending data sequentially and deferring compaction. However, B-trees often provide faster reads, since each key exists in exactly one place in the tree, whereas an LSM-tree read may have to check the memtable and several SSTables.

One challenge with LSM-trees is write amplification, where the same data is rewritten multiple times over its lifetime during compaction (though there are ways to alleviate this). On the other hand, LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees. B-tree storage engines leave some disk space unused due to fragmentation: when a page is split, or when a row does not fit into an existing page, part of the page goes unused.


Indexing Strategies

So far we have only discussed key-value indexes, aka primary indexes. It is also very common to have secondary indexes. The main difference from a primary index is that in a secondary index, the indexed values are not necessarily unique; that is, there might be many rows (documents, vertices) under the same index entry. This can be solved in two ways: either by making each value in the index a list of matching row identifiers (like a postings list in a full-text index) or by making each entry unique by appending a row identifier to it. Either way, both B-trees and log-structured indexes can be used as secondary indexes.
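The two approaches to non-unique index entries can be contrasted in a small sketch (the `rows` table and all names here are made up for illustration):

```python
# Two ways to index a non-unique column ("color") over a toy row table.
rows = {1: {"color": "red"}, 2: {"color": "blue"}, 3: {"color": "red"}}

# (a) Postings list: each index entry maps to a list of matching row ids.
postings = {}
for row_id, row in rows.items():
    postings.setdefault(row["color"], []).append(row_id)
# postings == {"red": [1, 3], "blue": [2]}

# (b) Unique entries: append the row id to the key, so each (value,
# row_id) pair is distinct; a range scan over entries starting with
# "red" then finds all matching rows.
compound = sorted((row["color"], row_id) for row_id, row in rows.items())
red_rows = [rid for val, rid in compound if val == "red"]
# red_rows == [1, 3]
```

Both layouts support the same query; the difference is whether the multiplicity lives in the index value (a) or in the index key (b).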

Indexes are critical for improving read performance by providing quick access to data. However, there is a catch: while they speed up reads, they can slow down writes, as each write may need to update multiple indexes.

