DDIA: Chp 3. Storage and Retrieval (Part 2)
This is part 2 of Chapter 3 of the Designing Data Intensive Applications (DDIA) book. It focuses on storage and retrieval for OLAP databases.
Analytics, data warehousing, star and snowflake schemas
A data warehouse is a dedicated database designed for analytical queries. It houses a read-only copy of data from the transactional systems across the organization. This separation ensures that analytics (OLAP) queries do not interfere with the OLTP operations that are critical to running the business.
Data is extracted from OLTP databases, either through periodic snapshots or a continuous stream of updates. It's then transformed into a format optimized for analysis, cleaned to remove inconsistencies, and loaded into the data warehouse. This process is known as Extract-Transform-Load (ETL).
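A minimal sketch of the ETL flow in Python (the function and field names here are hypothetical, chosen just to illustrate the extract/transform/load stages, not taken from the book):

```python
# A minimal ETL sketch (hypothetical functions and field names).
# Extract rows from an OLTP source, transform/clean them, load into the warehouse.

def extract(oltp_rows):
    # In practice this would be a periodic snapshot or a change-data-capture stream.
    for row in oltp_rows:
        yield row

def transform(rows):
    for row in rows:
        if row.get("amount") is None:                       # drop inconsistent records
            continue
        row["amount_usd"] = round(float(row["amount"]), 2)  # normalize units/types
        yield row

def load(rows, warehouse):
    warehouse.extend(rows)                                  # append into the warehouse

warehouse = []
oltp_rows = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": None}]
load(transform(extract(oltp_rows)), warehouse)
print(warehouse)  # [{'order_id': 1, 'amount': '19.99', 'amount_usd': 19.99}]
```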
Large enterprises often have massive data warehouses containing petabytes of transaction history stored in a "fact table". The fact table represents individual events or records, while the surrounding "dimension tables" capture the who, what, where, when, how, and why of those events.

The structure of these data warehouses typically follows a star or snowflake schema. In a star schema, the fact table sits at the center, surrounded by its dimension tables. Snowflake schemas take this a step further, breaking the dimension tables into additional sub-dimension tables, resulting in a more normalized but also more complex data model.
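Here is a toy illustration of a star schema in Python (table and column names are invented for this post, not taken from the book): each fact-table row records one event and points into the dimension tables by key.

```python
# Illustrative star schema as plain Python dicts (made-up table and column names).
# The fact table stores one row per event, with foreign keys into the dimension tables.

dim_product = {31: {"name": "Bananas", "category": "Fresh fruit"}}
dim_store   = {3:  {"city": "Buffalo", "state": "NY"}}
dim_date    = {20240105: {"year": 2024, "month": 1, "weekday": "Fri"}}

fact_sales = [
    # (date_key, product_key, store_key, quantity, net_price)
    (20240105, 31, 3, 5, 1.99),
]

# A typical analytical query joins the fact table with its dimensions,
# e.g. "revenue for Fresh fruit by month":
for date_key, product_key, store_key, qty, price in fact_sales:
    if dim_product[product_key]["category"] == "Fresh fruit":
        month = dim_date[date_key]["month"]
        print(month, qty * price)
```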
Column-Oriented Storage
Data warehouses tend to have very wide tables: a fact table often contains hundreds of columns. This creates a problem for traditional row-oriented storage, in which all the values for a single row are stored contiguously: fetching even a small subset of columns requires the database to load each entire wide row from disk into memory, parse it, and then filter out the irrelevant data.
Column-oriented storage addresses this inefficiency by storing all the values for a single column together, rather than grouping them by row. This allows the database to selectively load only the columns required for a given query, significantly reducing the amount of data that needs to be processed.
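As a rough sketch (made-up rows and column names), here is the same tiny dataset laid out row-wise and column-wise in Python; a query that touches only one column reads a single list rather than every row:

```python
# Sketch of row-oriented vs column-oriented layout for the same data (made-up example).

rows = [
    {"date": 20240105, "product_id": 31, "quantity": 5},
    {"date": 20240105, "product_id": 42, "quantity": 1},
    {"date": 20240106, "product_id": 31, "quantity": 3},
]

# Column-oriented: each column's values are stored contiguously.
columns = {
    "date":       [20240105, 20240105, 20240106],
    "product_id": [31, 42, 31],
    "quantity":   [5, 1, 3],
}

# A query that only needs "quantity" touches a single column file
# instead of parsing every wide row:
total_quantity = sum(columns["quantity"])
print(total_quantity)  # 9
```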
Column storage lends itself very well to compression techniques that can reduce the disk footprint. One such compression method takes advantage of the fact that many columns have a relatively small number of distinct values compared to the total number of rows. Then, by creating separate bitmaps for each distinct value, where each bit represents the presence or absence of that value in a row, the column can be encoded very compactly, especially when run-length encoding is applied to the sparse bitmaps.
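To make this concrete, here is a minimal Python sketch (with made-up values, not code from the book) that builds one bitmap per distinct value and run-length encodes a sparse bitmap:

```python
# Sketch of bitmap encoding plus run-length encoding for one column (made-up data).

product_ids = [31, 31, 42, 42, 42, 31, 69]

# One bitmap per distinct value: bit i is 1 if row i has that value.
bitmaps = {}
for value in set(product_ids):
    bitmaps[value] = [1 if v == value else 0 for v in product_ids]
# bitmaps[31] == [1, 1, 0, 0, 0, 1, 0]

def run_length_encode(bits):
    """Encode a sparse bitmap as (bit, run_length) pairs."""
    runs, prev, count = [], bits[0], 0
    for b in bits:
        if b == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = b, 1
    runs.append((prev, count))
    return runs

print(run_length_encode(bitmaps[42]))  # [(0, 2), (1, 3), (0, 2)]
```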
The compressed, columnar data also fits more efficiently into the CPU's cache, allowing the database to process it in tight, cache-friendly loops without the overhead of function calls and conditional checks for each row. This "vectorized processing" approach, combined with the judicious use of SIMD (single instruction multiple data) instructions, enables analytical databases to make very efficient use of modern hardware.
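The book does not prescribe a particular library; as a loose analogy for the column-at-a-time style, NumPy operates on whole arrays with compiled kernels (which may use SIMD under the hood) instead of looping over rows in interpreted code:

```python
# Vectorized processing sketch using NumPy (an analogy, not a database engine).
import numpy as np

quantity = np.array([5, 1, 3, 7, 2], dtype=np.int32)
price    = np.array([1.99, 4.50, 1.99, 0.99, 2.25])

# One "column at a time" operation instead of a Python loop over rows:
revenue = (quantity * price).sum()
print(revenue)
```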
Another key optimization in column stores is keeping the rows in a sorted order, even though the data is physically stored column by column. The database administrator can choose the columns to sort by, based on their understanding of common query patterns. For example, if many queries target date ranges, sorting primarily by the date column allows the database to quickly skip over irrelevant rows.
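As a small illustration (made-up data), keeping the rows sorted by the date column lets a date-range query binary-search its way to the relevant slice instead of scanning every row:

```python
# Sketch: with rows kept sorted by date, a date-range query can binary-search
# to the relevant slice instead of scanning every row (made-up data).
from bisect import bisect_left, bisect_right

dates    = [20240101, 20240102, 20240102, 20240105, 20240107, 20240109]  # sort key
quantity = [4,        1,        2,        5,        3,        6       ]  # same row order

lo = bisect_left(dates, 20240102)   # first row on/after 2024-01-02
hi = bisect_right(dates, 20240105)  # first row after 2024-01-05
print(sum(quantity[lo:hi]))         # 8: only rows 1..3 are touched
```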
Sorting enables additional compression gains, as it creates long runs of repeated values that can be efficiently encoded using techniques like run-length encoding. Some column-oriented databases, like Vertica, take this a step further by storing the same data sorted in multiple different ways, allowing the query optimizer to choose the most appropriate sort order for each query.
Column-oriented storage is the star (pun intended) of read-heavy analytical workloads, but it poses challenges for data updates. The compressed, sorted nature of the column files makes in-place updates difficult: inserting even a single row would require rewriting all the column files to keep them consistent. Instead, column stores typically take an approach similar to log-structured merge (LSM) trees, where new data is first written to an in-memory store and then periodically merged with the on-disk column files in bulk.
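A minimal in-memory sketch of that write path (simplified and illustrative, not any particular system's design): new rows land in a small write buffer, and a bulk merge folds them into the sorted column data.

```python
# Sketch of the LSM-style write path for a column store (simplified, in memory only):
# new rows go to a write buffer; a background "merge" folds them into the sorted
# column data in bulk, so the compressed columns are never updated in place.

sorted_columns = {"date": [20240101, 20240103], "quantity": [4, 7]}  # "on-disk" segment
write_buffer = []                                                    # recent inserts

def insert(row):
    write_buffer.append(row)

def merge():
    global write_buffer
    rows = list(zip(sorted_columns["date"], sorted_columns["quantity"]))
    rows += [(r["date"], r["quantity"]) for r in write_buffer]
    rows.sort()                                   # re-establish the sort order
    sorted_columns["date"]     = [d for d, _ in rows]
    sorted_columns["quantity"] = [q for _, q in rows]
    write_buffer = []

insert({"date": 20240102, "quantity": 5})
merge()
print(sorted_columns["date"])  # [20240101, 20240102, 20240103]
```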
Aggregation: Data Cubes and Materialized Views
In addition to these storage optimizations, data warehouses also leverage materialized views and data cubes to improve query performance further. Materialized views are pre-computed result sets of common queries, stored on disk. Unlike virtual views, which simply serve as shortcuts for the underlying query, materialized views contain the actual denormalized data, making reads much faster but at the cost of increased write overhead to maintain the materialized view when the underlying data changes.
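A tiny sketch of the idea (illustrative names, not a real system): the aggregate is maintained on every write, so reads become lookups instead of scans.

```python
# Sketch of a materialized view: the aggregate is computed once and updated on
# every write, so reads are cheap (names are illustrative, not a real system).
from collections import defaultdict

fact_sales = []
revenue_by_product = defaultdict(float)   # the "materialized view"

def insert_sale(product_id, qty, price):
    fact_sales.append((product_id, qty, price))
    revenue_by_product[product_id] += qty * price   # extra work on the write path

insert_sale(31, 5, 1.99)
insert_sale(31, 2, 1.99)
insert_sale(42, 1, 4.50)

# Reading the view is a lookup, not a scan over fact_sales:
print(round(revenue_by_product[31], 2))   # 13.93
```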
Data cubes (aka OLAP cubes) take the materialized view idea a step further by organizing the data into a multi-dimensional grid of pre-aggregated values. This allows analysts to quickly slice and dice the data along different dimensions (e.g., product, store, time) without having to re-compute the aggregations from scratch.
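For example, a toy two-dimensional cube (made-up numbers) might pre-aggregate units sold by (date, product) and then roll up along either dimension without touching the raw fact table:

```python
# Sketch of a tiny two-dimensional data cube: units sold pre-aggregated by
# (date, product), with roll-ups along either dimension (made-up numbers).

dates    = [20240105, 20240106]
products = [31, 42]

cube = {
    (20240105, 31): 5,  (20240105, 42): 1,
    (20240106, 31): 3,  (20240106, 42): 0,
}

# Roll up along one dimension by summing over the other:
total_by_date    = {d: sum(cube[(d, p)] for p in products) for d in dates}
total_by_product = {p: sum(cube[(d, p)] for d in dates)    for p in products}

print(total_by_date[20240105])   # 6
print(total_by_product[31])      # 8
```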
The trade-off for this performance boost is a loss of flexibility. Data cubes and other materialized views are optimized for a specific set of queries and dimensions, making it difficult to perform more ad-hoc analysis outside of those predefined constraints. Most data warehouses aim to maintain a balance, keeping the majority of the data in its raw, flexible form while using materialized views and data cubes to accelerate the most common analytical workloads.
Discussion
This discussion of data warehouses versus OLTP systems reminded me of the Lambda versus Kappa architecture debate in web services.
The OLAP/OLTP separation has an analogue in the Lambda architecture's separation of the batch-processing layer and the serving layer.
The materialized views approach is analogous to the Kappa architecture, which takes a "one tool fits all" approach: the Kafka log-streaming platform treats everything as a stream. Batch processing is simply streaming through historical data. A table is merely a cache of the latest value for each key in the log, and the log is a record of each update to the table.
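A small sketch of that log/table duality (plain Python, not Kafka's API): replaying the log and keeping the latest value per key rebuilds the table.

```python
# Sketch of the log/table duality (illustrative, not Kafka's API):
# the table is the latest value per key in the log; replaying the log rebuilds it.

log = [
    ("user:1", "alice@old.example"),
    ("user:2", "bob@example.com"),
    ("user:1", "alice@new.example"),   # later update to the same key wins
]

table = {}
for key, value in log:          # "batch processing" = streaming through history
    table[key] = value

print(table["user:1"])          # alice@new.example
```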
The materialized views approach is also closely related to long-running queries and incremental view maintenance.