Amazon Redshift Re-invented

This paper (SIGMOD'22) discusses the evolution of Amazon Redshift since 2015 when it launched. Redshift is a cloud data warehouse. Data warehouse basically means a place where analysis/querying/reporting is done for shitload of data coming from multiple sources. Tens of thousands of customers use Redshift to process Exabytes of data daily. Redshift is fully managed to make it simple and cost-effective to efficiently analyze BIG data.

The concept art in this blog post are creations of Stable Diffusion.


Since its launch in 2015, the use cases for Redshift have evolved, and the teams focused on meeting the following customer needs

  • High-performance execution of complex analytical queries using innovative query execution via C++ code generation
  • Enabling fast scalability in response to changing workloads by disaggregating storage and compute layers
  • Ease of use via incorporating machine learning based autonomics
  • Seamless integration with the AWS ecosystem and other AWS purpose built services

Redshift Architecture


Being in the data warehouse business, Redshift is a column-oriented massively parallel processing system. It has a very simple architecture composing of a compute layer and storage layer. A Redshift cluster consists of a single coordinator (leader) node which acts as the entrance to the system, and multiple worker (compute) nodes. Data is stored on Redshift Managed Storage (RMS, hmm, does this abbreviation make it GNU-RMS-Redshift?). RMS is backed by Amazon S3, and cached in compute nodes on locally-attached SSDs in a compressed column-oriented format.

Tables are either replicated on every compute node or partitioned into multiple buckets that are distributed among all compute nodes. The partitioning can be automatically derived by Redshift based on the workload patterns and data characteristics, or, users can explicitly specify the partitioning style as round-robin or hash, based on the table's distribution key.

AQUA is a query acceleration layer that leverages FPGAs to improve performance.

Compilation-As-A-Service (CaaS) is a caching microservice for optimized generated code for the various query fragments executed in the Redshift fleet. 




Code Generation

Code generation is at the core of the system. SQL statements are translated and compiled into efficient C++ code which are then sent to the workers (compute) nodes for execution.

When a user query is received by the leader node, it is parsed, rewritten, and optimized. Redshift's cost-based optimizer accounts for the cluster's topology and the cost of data movement between compute nodes in its cost model. Planning considers the distribution keys of participating tables to avoid unnecessary data movement.

Redshift generates C++ code specific to the query plan and the schema being executed. The generated code is then compiled and the binary is shipped to the compute nodes for execution. Each compiled file, called a segment, consists of a pipeline of operators, called steps. Each segment (and each step within it) is part of the physical query plan. Each generated binary is scheduled on each compute node to be executed by a fixed number of query processes. Each query process executes the same code on a different subset of data.

To reduce the number of blocks that need to be scanned, Redshift evaluates query predicates over zone maps i.e., small hash tables that contain the min/max values per block and leverages late materialization. The data that needs to be scanned after zone-map filtering is chopped into shared work units to allow for balanced parallel execution. Scans leverage vectorization and Single Instruction, Multiple Data (SIMD) processing for fast decompression of light-weight compression formats and for applying predicates efficiently. Bloom filters, created when building hash tables, are applied in the scan to further reduce the data volume. Prefetching is leveraged to utilize hash tables more efficiently. Redshift's execution model is optimized for the underlying Amazon EC2 Nitro hardware, resulting in industry leading price/performance.

What about the cost of compiling to C++? Is that a point of concern for providing quick responses to the customers? Redshift serves some customer workloads where the 90th percentile response time requirement is less than 1 second. So this cost should be optimized as well. To deal with this, they leverage caching. The leader node caching can already give 99.60% hit rate, but they go beyond that by using Compilation-as-a-Service (CaaS) component. CaaS uses compute and memory resources beyond the Redshift cluster to accelerate query compilation. It caches the compiled objects off-cluster in an external code cache to serve multiple compute clusters that may need the same query segment. CaaS  increased cache hits from 99.60% to 99.95%. That means,  87% of the times that an object file was not present in a cluster's local code cache, it was found in the external code cache.



Redshift Managed Storage

Redshift disaggregates storage from compute nodes using the Redshift managed storage layer (RMS). RMS is designed for a durability of 99.999999999% and 99.99% availability over a given year, across multiple availability zones (AZs). RMS builds on top of the AWS Nitro System, which features high bandwidth networking and performance indistinguishable from bare metal. Compute nodes use large, high performance SSDs as local caches. Redshift leverages workload patterns and techniques such as automatic fine-grained data eviction and intelligent data prefetching, to deliver the performance of local SSD while scaling storage automatically to Amazon S3.

The above figure shows the key components of RMS extending from in-memory caches to committed data on Amazon S3. Snapshots of data on Amazon S3 act as logical restore points for the customer. RMS accelerates data accesses from S3 by using a prefetching scheme that pulls data blocks into memory and caches them to local SSDs. The storage layer builds on Amazon S3 and persists all data to Amazon S3 with every commit. Building on Amazon S3 allows Redshift to decouple data from the compute cluster that operates on the data. Using Amazon S3 as the base gives virtually unlimited scale. RMS takes advantage of optimizations such as data block temperature, data block age, and workload patterns to optimize performance and manage data placement across tiers of storage automatically.

RMS makes in-place cluster resizing a pure metadata operation since compute nodes are practically stateless and always have access to the data blocks in RMS. RMS is metadata bound and easy to scale since data can be ingested directly into Amazon S3. The tiered nature of RMS where SSDs act as cache makes swapping out of hardware convenient. Decoupling metadata from data enables Elastic Resize and Cross-Instance Restore. Elastic resizing can enable  computer cluster elasticity of upto 4x larger or 4x smaller within a few seconds. RMS uses the tiered-storage cache to drive rehydration (i.e., what data to cache on local SSDs) after a cluster reconfiguration (e.g., Elastic Resize, cluster restore, hardware failures). In all these scenarios, the compute nodes rehydrate their local disks with the data blocks that have highest possibility to be accessed by customer queries. With this optimization, customer's queries achieve more than 80% local disk hit rate at 20% rehydration completion.

Transactions are synchronously committed to Amazon S3 by RMS. This enables multiple clusters to access live and transactionally consistent data. Writing through to Amazon S3 across different AZs is achieved by batching data writes and hiding latencies under synchronization barriers. State is owned and managed by one cluster, while concurrent readers and writers provide compute scaling on top of RMS. Redshift implements Multi-version Concurrency Control (MVCC) where readers neither block nor are blocked and writers may only be blocked by other concurrent writers. Each transaction sees a consistent snapshot of the database established by all committed transactions prior to its start. Redshift enforces serializable isolation and avoids data anomalies such as lost updates and read-write skews. The legacy implementation used a graph-based mechanism to track dependencies between transactions to avoid cycles and enforce serializability. This required tracking individual state of each transaction well after they were committed, until all other concurrent transactions committed as well. They recently adopted a new design based on a Serial Safety Net (SSN) as a certifier on top of Snapshot Isolation. This heuristic-based design allows Redshift to guarantee strict serializability in a more memory-efficient manner, using only summary information from prior committed transactions.

Scaling Compute


 

Every week Redshift processes billions of queries that serve a diverse set of workloads that have varying performance requirements. ETL workloads have strict latency SLAs that downstream reporting depends on. Interactive dashboards on the other hand have high concurrency requirements and are extremely sensitive to response times. Redshift provides customers several compute scaling options to get the best price/performance to serve their needs.

As we mentioned earlier, Elastic Resize allows customers to quickly add or remove compute nodes from their cluster depending on their current compute needs. In order to provide consistent query performance after Elastic Resize, Redshift decouples compute parallelism from data partitions. When there is more compute parallelism than data partitions, multiple compute processes are able to share work from an individual data partition. When there are more data partitions than compute parallelism, individual compute processes are able to work on multiple data partitions.

Concurrency Scaling allows Redshift to dynamically scale-out when users need more concurrency than what a single Redshift cluster can offer. With Concurrency Scaling customers maintain a single cluster endpoint to which they submit their queries. When the assigned compute resources are fully utilized and new queries start queuing, Redshift automatically attaches additional Concurrency Scaling compute clusters and routes the queued queries to them. Concurrency Scaling clusters re-hydrate data from RMS. Figure 6 shows the concurrency improvements Redshift achieved over a period of one year, in terms of query throughput versus number of concurrent clients. The workload used is 3TB TPC-DS. Redshift achieves linear query throughput for hundreds of concurrent clients.


Automated Tuning and Operations

For ease of use and eliminating the need for hiring database administrators to maintain database services, Redshift invested heavily in maintenance automation and machine learning based autonomics. Redshift runs common maintenance tasks like vacuum, analyze or the refresh of materialized views in the background without any performance impact to customer workloads.

Automatic workload management (AutoWLM) dynamically chooses query concurrency and memory assignment based on workload characteristics. Redshift also monitors and analyzes customer workloads and identifies opportunities for performance improvement, e.g., by automatically applying distribution and sort key recommendations. In addition, Redshift employs state of the art forecasting techniques to make additional nodes available as soon as possible for node failures, cluster resumption and concurrency scaling. Finally, Amazon Redshift Serverless (pay-per-use model) is the culmination of autonomics effort, which allows customers to run and scale analytics without the need to set up and manage data warehouse infrastructure. Section 5 in the paper discusses all of these topics in detail, but I will skip them for this summary.


Integration with the AWS ecosystems

Redshift extends beyond traditional data warehousing workloads, by integrating with the broad AWS ecosystem. These features include
  • querying the data lake (shitload of data that includes unstructured data) with Spectrum which supports Parquet, Text, ORC and AVRO formats,
  • semistructured data ingestion and querying with PartiQL (a SQL compatible languge which can operate on both schemaful and schemaless sources),
  • streaming ingestion from Kinesis and MSK,
  • federated queries to Aurora and RDS operational databases, and federated materialized views, and
  • providing Redshift ML which leverages Amazon Sagemaker to enable data analysts to train machine learning models and perform prediction (inference) using SQL.



Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book