Posts

Showing posts with the label disaggregation

Disaggregation: A New Architecture for Cloud Databases

This short VLDB'25 paper surveys disaggregation for cloud databases. It has several insightful points, and I found it worth summarizing.

The key advantage of the cloud over on-prem is elastic scalability: users can scale resources up and down and pay only for what they use. Traditional database architectures, like shared-nothing, do not fully exploit this. Thus, cloud-native databases increasingly adopt disaggregated designs.

Disaggregation is primarily motivated by the asymmetry between compute and storage. Compute is far more expensive than storage in the cloud. Compute demand fluctuates quickly; storage grows slowly. Compute can be stateless and easier to scale, while storage is inherently stateful. Decoupling them lets compute scale elastically while storage remains relatively stable and cheap.

Review of Disaggregation in the Clouds

Early cloud-native systems like Snowflake and Amazon Aurora separate compute and storage into independent clusters. Modern systems push dis...

Can a Client–Server Cache Tango Accelerate Disaggregated Storage?

This paper from HotStorage'25 presents OrcaCache, a design proposal for a coordinated caching framework tailored to disaggregated storage systems. In a disaggregated architecture, compute and storage resources are physically separated and connected via high-speed networks. Such architectures have become increasingly common in modern data centers because they enable flexible resource scaling and improved fault isolation. (Follow the money, as they say!) But accessing remote storage introduces serious latency and efficiency challenges. The paper positions OrcaCache as a way to mitigate these challenges by orchestrating caching logic across clients and servers. Important note: in the paper's terminology, the server means the storage node, and the client means the compute node.

As we did last week for another paper, Aleksey and I live-recorded our reading/discussion of this paper. We do this to teach the thought process and mechanics of how experts read papers in real time. Check our discussion vi...

Taming Consensus in the Wild (with the Shared Log Abstraction)

This paper recently appeared in ACM SIGOPS Operating Systems Review. It provides an overview of the shared log abstraction in distributed systems, focusing on its application in State Machine Replication (SMR) and consensus protocols. The paper argues that this abstraction can simplify the design and implementation of distributed systems, and can make them more reliable and easier to maintain.

What is the shared log abstraction?

The shared log abstraction proposes to separate the system into two layers: the database layer (proposers/learners) and the log layer (acceptors). The shared log provides a simple API for appending entries, checking the tail, reading entries, and trimming the log. This separation allows the SMR layer to focus on higher-level concerns without dealing with the complexities of consensus protocols.

This is a wisdom-packed paper. It approaches the problem more from software engineering and systems/operations perspectives. (Previous work, the Delos OSD...
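
To make the four-call API in the excerpt above concrete, here is a minimal in-memory sketch. It is not the paper's implementation: the names `SharedLog`, `check_tail`, and `KVReplica` are hypothetical, and the log is a single-process stand-in with no replication or consensus behind it. It only illustrates how a database-layer learner can stay consistent by replaying the log in order.

```python
from typing import Dict


class SharedLog:
    """Hypothetical single-process stand-in for the log layer (no consensus)."""

    def __init__(self) -> None:
        self._entries: Dict[int, bytes] = {}
        self._head = 0  # lowest position that has not been trimmed
        self._tail = 0  # next position to be assigned

    def append(self, entry: bytes) -> int:
        """Append an entry and return the position it was assigned."""
        pos = self._tail
        self._entries[pos] = entry
        self._tail += 1
        return pos

    def check_tail(self) -> int:
        """Return the position the next append would receive."""
        return self._tail

    def read(self, pos: int) -> bytes:
        """Return the entry at pos; fail if it was trimmed or never written."""
        if pos < self._head or pos >= self._tail:
            raise KeyError(f"position {pos} is trimmed or not yet written")
        return self._entries[pos]

    def trim(self, prefix_end: int) -> None:
        """Discard all entries at positions below prefix_end."""
        prefix_end = min(prefix_end, self._tail)
        for pos in range(self._head, prefix_end):
            del self._entries[pos]
        self._head = max(self._head, prefix_end)


class KVReplica:
    """Hypothetical database-layer learner: applies log entries in order."""

    def __init__(self, log: SharedLog) -> None:
        self.log = log
        self.state: Dict[str, str] = {}
        self.applied = 0  # next log position to apply

    def put(self, key: str, value: str) -> None:
        # Writes go to the log only; local state changes on sync().
        self.log.append(f"{key}={value}".encode())

    def sync(self) -> None:
        # Replay everything between our applied position and the tail.
        tail = self.log.check_tail()
        while self.applied < tail:
            key, value = self.log.read(self.applied).decode().split("=", 1)
            self.state[key] = value
            self.applied += 1
```

Two replicas that share one log and call `sync()` end up with the same state regardless of which replica issued each write, because both apply the same total order. In a real system the ordering guarantee would come from the consensus protocol hidden behind `append`.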

Socrates: The New SQL Server in the Cloud (Sigmod 2019)

This paper (SIGMOD 2019) presents Socrates, the database-as-a-service (DBaaS) architecture of Azure SQL DB Hyperscale. Deploying a DBaaS in the cloud requires an architecture that is cost-effective yet performant. An idea that works well is to disaggregate the functionality of a database into compute services (e.g., transaction processing) and storage services (e.g., checkpointing and recovery). The first commercial system to adopt this idea was Amazon Aurora.

The Socrates design adopts the separation of compute from storage, as it has proven useful. In addition, Socrates separates the database log from storage and treats the log as a first-class citizen. Separating the log and storage tiers disentangles durability (implemented by the log) and availability (implemented by the storage tier). This separation yields significant benefits: in contrast to availability, durability does not require copies in fast storage; in contrast to durability, availability do...
