Understanding the Performance Implications of Storage-Disaggregated Databases
Storage-compute disaggregation in databases has emerged as a pivotal architecture in cloud environments, as evidenced by Amazon (Aurora), Microsoft (Socrates), Google (AlloyDB), Alibaba (PolarDB), and Huawei (Taurus).
This approach decouples compute from storage, allowing independent and elastic scaling of compute and storage resources. It provides fault tolerance at the storage level. The shared storage can then serve other purposes: adding read-only replicas to the database, simplifying sharding, and even exporting a changelog asynchronously to feed peripheral cloud services such as analytics.
Disaggregated architecture was the topic of a SIGMOD 2023 panel. I think this quote summarizes the industry's thinking on the topic: "Disaggregated architecture is here, and is not going anywhere. In a disaggregated architecture, storage is fungible, and computing scales independently. Customer value is here, and the technical problems will be solved in time."
This SIGMOD 2024 paper conducts a comprehensive study of the performance implications of storage-disaggregated databases. The work addresses several critical performance questions that had been obscured by the closed-source nature of these systems. The authors released an open-source version of their storage-disaggregated PostgreSQL/RocksDB implementation at https://github.com/purduedb/OpenAurora/
Key Design Principles of disaggregated databases
Figure 2 shows the architecture of PostgreSQL (v13.0), which represents the traditional monolithic database running on a single node. Figure 3 shows the first level of disaggregation, Software-level Disaggregation (P1), defined as decoupling the storage engine from the compute engine. This is, of course, naive: it applies no optimizations and simply turns the disk into a remote disk to allow remote/shared storage. Without buffering, the performance will suck due to remote access.
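To make P1 concrete, here is a minimal sketch (my own illustration, not the paper's code) of the compute engine's page accesses being redirected to a remote storage service. The names (RemoteStorage, ComputeEngine, read_page, write_page) are hypothetical stand-ins for whatever RPC layer a real system would use.

```python
# Minimal sketch of software-level disaggregation (P1): the compute engine's
# page reads/writes go to a remote storage service instead of a local file.
# All names here are illustrative, not the paper's actual API.

PAGE_SIZE = 8192  # PostgreSQL's default page size

class RemoteStorage:
    """Stand-in for a storage node reachable over the network."""
    def __init__(self):
        self.pages = {}  # page_id -> page bytes

    def read_page(self, page_id: int) -> bytes:
        # In a real system this is an RPC; every call pays a network round trip.
        return self.pages.get(page_id, b"\x00" * PAGE_SIZE)

    def write_page(self, page_id: int, data: bytes) -> None:
        self.pages[page_id] = data

class ComputeEngine:
    """Query execution happens here; without a buffer pool (P1 without
    buffering), every page access turns into a remote call."""
    def __init__(self, storage: RemoteStorage):
        self.storage = storage

    def get_page(self, page_id: int) -> bytes:
        return self.storage.read_page(page_id)    # remote round trip per access

    def put_page(self, page_id: int, data: bytes) -> None:
        self.storage.write_page(page_id, data)    # full 8KB page over the network
```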
Figure 4 shows the most basic and powerful optimization for disaggregation, Log-as-the-Database (P2). This approach sends only the write-ahead logs (called xlogs in PostgreSQL) to the storage side upon transaction commit, instead of sending the actual data pages as in traditional databases. This reduces data movement over the network. The actual data pages are generated by replaying the logs at the storage node asynchronously (see steps a and b in Figure 4). The compute node first checks its local buffer. If there is a cache miss, it fetches the missed pages from the storage node. If the requested pages have not been replayed yet, the storage node replays the required logs on the fly (step 1 in Figure 4) and returns the requested pages upon completion (step 2 in Figure 4).
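Here is a minimal sketch of that flow, again as my own illustration under assumed names (StorageNode, ComputeNode, apply_log are not the paper's code): commits ship only log records, the storage node materializes pages lazily, and the compute node's buffer absorbs most reads.

```python
# Minimal sketch of the log-as-the-database design (P2). Names and the
# redo logic are illustrative placeholders.

def apply_log(page, record):
    # Placeholder for the redo logic that turns (old page, xlog record) into a new page.
    return record["after_image"]

class StorageNode:
    def __init__(self):
        self.pages = {}          # page_id -> materialized page bytes
        self.pending_logs = []   # xlog records not yet replayed

    def append_xlog(self, record):
        # Commit path ships only the log record, not the 8KB page (step a).
        self.pending_logs.append(record)

    def replay_pending(self, page_id):
        # Replay any logs touching page_id, in the background or on demand (steps b / 1).
        remaining = []
        for rec in self.pending_logs:
            if rec["page_id"] == page_id:
                self.pages[page_id] = apply_log(self.pages.get(page_id), rec)
            else:
                remaining.append(rec)
        self.pending_logs = remaining

    def get_page(self, page_id):
        # Make sure the page is up to date, then return it (step 2).
        self.replay_pending(page_id)
        return self.pages.get(page_id)

class ComputeNode:
    def __init__(self, storage):
        self.storage = storage
        self.buffer = {}         # local buffer pool: page_id -> page bytes

    def commit(self, xlog_records):
        for rec in xlog_records:
            self.storage.append_xlog(rec)            # only logs cross the network

    def read_page(self, page_id):
        if page_id in self.buffer:                   # buffer hit: no network traffic
            return self.buffer[page_id]
        page = self.storage.get_page(page_id)        # miss: fetch the (possibly just-replayed) page
        self.buffer[page_id] = page
        return page
```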
Finally, Figure 5 shows the Shared-Storage Design (P3), which truly allows multiple compute nodes to share the same storage. Only the primary compute node supports read and write transactions, whereas secondary nodes can only handle read-only transactions. When new compute nodes are added, there is no need to copy or move data, as all the compute nodes share the same data. But there are now additional challenges for the storage engine. Due to replication lag between the primary and secondary compute nodes, the storage engine must support multi-version pages, because secondary compute nodes require access to older versions of a page. This operation is known as GetPage@LSN.
After the storage node receives an xlog, it asynchronously checks which page modifications are included in this xlog. It then appends a mapping record for this xlog's LSN and the pages it modifies to the "Version Map", a hashmap using the page id as its key and the list of version LSNs as its value (step a in Figure 5). This "Version Map" is used to serve GetPage@LSN requests. Next, the storage node's replay process divides the xlog into several mini-xlogs, each containing modifications for only one page, and replays these mini-xlogs to obtain the updated pages (step b in Figure 5). Finally, the storage node inserts the newly generated pages into RocksDB, using the page id combined with the LSN as its key and the page content as its value (step c in Figure 5). The good news is that multi-versioning also addresses the "torn page write" problem (because it only appends new pages to the database instead of updating pages in place), which in turn improves I/O performance.
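The sketch below is my attempt to condense that storage-side pipeline into code. It uses a plain dict where the paper uses RocksDB, and the names (MultiVersionStore, ingest_xlog, get_page_at_lsn, split_into_mini_xlogs) are descriptive stand-ins for the Version Map and GetPage@LSN machinery, not the actual implementation.

```python
# Illustrative sketch of multi-version page storage for the shared-storage design (P3).
from bisect import bisect_right
from collections import defaultdict

class MultiVersionStore:
    def __init__(self):
        self.version_map = defaultdict(list)  # page_id -> version LSNs (step a); WAL order keeps them sorted
        self.kv = {}                          # (page_id, lsn) -> page bytes; plays the role of RocksDB

    def ingest_xlog(self, xlog):
        for mini in split_into_mini_xlogs(xlog):        # step (b): one mini-xlog per modified page
            page_id, lsn = mini["page_id"], mini["lsn"]
            base = self.get_page_at_lsn(page_id, lsn)   # latest version before this change
            new_page = replay(base, mini)               # replay the mini-xlog
            self.version_map[page_id].append(lsn)       # step (a): record the new version
            self.kv[(page_id, lsn)] = new_page          # step (c): append-only insert, no in-place update

    def get_page_at_lsn(self, page_id, lsn):
        # GetPage@LSN: return the newest version of page_id that is <= lsn,
        # so a lagging read replica sees a consistent snapshot.
        versions = self.version_map.get(page_id, [])
        idx = bisect_right(versions, lsn)
        if idx == 0:
            return None
        return self.kv[(page_id, versions[idx - 1])]

def split_into_mini_xlogs(xlog):
    # Placeholder: a real xlog record may touch several pages; emit one entry per page.
    return xlog["page_changes"]

def replay(base_page, mini_xlog):
    # Placeholder redo: produce the new page image from the old one and the change.
    return mini_xlog["after_image"]
```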
Replaying logs may take considerable time, especially in the multi-version storage setting. To improve the performance of replaying logs, various optimizations have been developed in existing storage-disaggregated databases. The paper categorizes and studies these optimizations as "Filtered Replay" (FR) and "Smart Replay" (SR).
Evaluation
In their implementation, the storage node continuously replays the xlogs asynchronously and stores multiple versions for each page in RocksDB. They chose a 3-node storage setup because Socrates, AlloyDB, PolarDB, and Neon use 3-way replication. However, they also conducted a few experiments on a 6-node storage cluster, considering that Aurora uses 6-way replication. They use SysBench and TPC-C for driving the experiments.
They do an ablation study to investigate the effect of each optimization/design piece on the performance.
Experimental results
Q1: Performance overhead of storage disaggregation (design P1)
Storage disaggregation, *if applied without buffering*, significantly reduces performance for both reads (16.4X) and writes (17.9X) when using SSDs. This is due to the slower access to remote storage over the network compared to local storage.
Q2: Impact of buffering on performance
For reads: An 8GB buffer (80% hit ratio) reduces the performance gap between disaggregated and non-disaggregated databases to 1.8X. For writes: Buffering doesn't significantly improve performance, even with large buffer sizes, due to the need to send logs to storage upon transaction commit.
Q3: Effectiveness of log-as-the-database design (design P2)
For writes in light workloads, there is no significant performance improvement, but for heavy workloads there is a 2.58X performance improvement with an 8GB buffer. For read-only workloads there is minimal impact. But for read-after-write scenarios, this incurs an 18.9% performance overhead (with an 8GB buffer) due to log replay costs.
Q4: Impact of multi-version pages in shared-storage design (design P3)
The introduction of multi-version pages provides a 37% improvement in write performance (8GB buffer) by addressing the "torn page write" issue.
Q5: Effect of multi-version storage on checkpointing
Multi-version storage also eliminates the high I/O issue in conventional database checkpointing. Checkpointing becomes a "free" operation without additional overhead.
Q6: Effectiveness of log-replay methods
Filtered Replay (FR) provides a 1.6X improvement in write performance over the straightforward approach. Smart Replay (SR) provides an additional 1.6X improvement over FR, and up to 7X for specific workloads (bulk insertion followed by index creation).
Discussion
First, I will start with some gripes. Why do the figures lack units for throughput? Not once through 10 pages of evaluation do the figures or the writing slip and mention the units of throughput. I am guessing the unit is bytes/sec, but this is just a guess.
Ok, with that off my chest, let's ask more fundamental questions.
Even with all the optimizations (log-as-the-database + multi-version + FR + SR), why does write throughput still remain at 50% of single-node performance? Yes, there is remote communication cost, but we are not talking about latency. We are talking about throughput, which, through pipelining of writes, could in theory be pushed almost all the way back to single-node write performance. Of course you lose some performance to friction: for MultiPaxos SMR, for example, we can get to 70-80% of single-node performance. But the experiments here show that only 50% of single-node throughput is recovered with disaggregated storage.
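To make the pipelining argument concrete, here is a back-of-the-envelope sketch. The numbers are made up purely for illustration and are not from the paper.

```python
# Why throughput (unlike latency) can be largely recovered by pipelining writes.
# All numbers below are hypothetical, for illustration only.

local_commit_time = 0.1e-3   # 0.1 ms to process and durably log a commit locally
network_rtt      = 1.0e-3    # 1 ms round trip to the storage nodes

# Synchronous, one-at-a-time log shipping: every commit waits for the RTT.
sync_throughput = 1 / (local_commit_time + network_rtt)

# Pipelined/batched log shipping: many commits are in flight at once, so the
# RTT is overlapped and throughput is bounded by processing time, not latency.
pipelined_throughput = 1 / local_commit_time

print(f"synchronous: {sync_throughput:,.0f} commits/sec")     # ~909 commits/sec
print(f"pipelined:   {pipelined_throughput:,.0f} commits/sec")  # ~10,000 commits/sec
```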
Is this because of closed-system benchmarking? A closed loop may cause throughput to be bottlenecked by the 16 threads/clients, which block on remote communication delays and transaction waiting and therefore cannot push the throughput higher. So it is possible that the pipeline is not fully utilized.
Or maybe it is because of inefficiencies in the storage replication implementation. If the implementation is not well optimized, that could account for much of the 50% loss even with clients pushing throughput. The paper mentions high tail latency due to replication to all nodes, and suspiciously it never mentions quorum replication (2 out of 3, or 4 out of 6 storage nodes). So, if the storage layer is actually waiting on replication to all nodes, that would introduce inefficiency that could have been optimized away.
The evaluation leaves these questions unanswered. But note again that even with a 50% performance drop, the architecture is still worthwhile because of all the benefits I mentioned in the introduction.
Finally, the authors still deserve credit for developing a quick implementation of disaggregated storage and making it available as open source.