Posts

Showing posts with the label distSQL

Morty: Scaling Concurrency Control with Re-Execution

Image
This EuroSys '23 paper reads like an SOSP best paper. Maybe it helped that EuroSys 2023 was in Rome. Academic conferences are more enjoyable when the venue doubles as a vacation. The Problem Morty tackles a fundamental question: how can we improve concurrency under serializable isolation (SER), especially without giving up on interactive transactions? Unlike deterministic databases (e.g., Calvin ) that require transactions to declare read and write sets upfront, Morty supports transactions that issue dynamic reads and writes based on earlier results. Transactional systems, particularly in geo-replicated settings, struggle under contention. High WAN latency stretches transaction durations, increasing the window for conflicts. The traditional answer is blind exponential backoff, but that leads to low CPU utilization. TAPIR and Spanner replicas often idle below 17% under contention as Morty's evaluation experiments show. Morty's approach to tackle the problem is to start from...

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Image
This paper is from Pat Helland, the apostate philosopher of database systems, overall a superb person, and a good friend of mine. The paper appeared this week at CIDR'24. (Check out the program for other interesting papers). The motivating question behind this work is: " What are the asymptotic limits to scale for cloud OLTP (OnLine Transaction Processing) systems? " Pat says that the CIDR 2023 paper "Is Scalable OLTP in the Cloud a Solved Problem?" prompted this question.  The answer to the question? Pat says that the answer lies in the joint responsibility of database and the application. If you know of Pat's work, which I have summarized several in this blog , you would know that Pat has been advocating along these lines before. But this paper provides a very crisp, specific, concrete answer. Read on for my summary of the paper. Disclaimer: This is a wisdom and technical information/detail packed 13-page paper, so I will try my best to summarize the sa...

TiDB: A Raft-based HTAP Database

Image
This paper is from VLDB 2020. TiDB is an opensource Hybrid Transactional and Analytical Processing (HTAP) database, developed by PingCap. The TiDB server, written in Go, is the query/transaction processing component; it is stateless, in the sense that  it does not store data and it is for computing only. The underlying key-value store, TiKV, is written in Rust, and it uses RocksDB as the storage engine. They add a columnar store called TiFlash, which gets most of the coverage in this paper. In this figure PD stands for Placement Driver (PD), which is responsible for managing Raft ranges, and automatically moving ranges to balance workloads. PD also hosts the timestamp oracle (TSO), which provides strictly increasing and globally unique timestamps to serve as transaction IDs. Each timestamp includes the physical time and logical time. The physical time refers to the current time with millisecond accuracy, and the logical time takes 18 bits. If you know about CockroachDB/CRDB ( he...

Spanner: Becoming a SQL system

Image
This is a VDLB 2017 paper. Last week we reviewed the F1 paper from 2012. It seems like F1 was an experiment and sort of a preview towards adding serious SQL support in Spanner. The original Spanner paper was published in 2012 had little discussion/support for SQL. It was mostly a "transactional NoSQL core". In the intervening years, though, Spanner has evolved into a relational database system, and many of the SQL features in F1 got incorporated directly in Spanner. Spanner got a strongly-typed schema system and a SQL query processor, among other features. This paper describes Spanner's evolution to a full featured SQL system. It focuses mostly on the distributed query execution (in the presence of resharding of the underlying Spanner record space), query restarts upon transient failures, range extraction (which drives query routing and index seeks), and the improved blockwise-columnar storage format. I wish there was discussion on the evolution of data manipulation/modi...

Is Scalable OLTP in the Cloud a Solved Problem? (CIDR 2023)

Image
This paper appeared in CIDR 2023 recently. It is a thought provoking big picture paper, coming from a great  team. The paper draws attention to the divide between conventional wisdom on building scalable OLTP databases (shared-nothing architecture) and how they are built and deployed on the cloud (shared storage architecture). There are shared nothing systems like CockroachDB, Yugabyte, and Spanner, but the overwhelming trend/volume on cloud OLTP is shared storage, and even with single writer, like AWS Aurora , Azure SQL HyperScale, PolarDB, and AlloyDB. The paper doesn't analyze this divide in detail, and jumps off to discuss shared storage with multiple writer as a more scalable way to build cloud OLTP databases. It would have been nice to go deep on this divide. I think the big benefit from shared storage is the disaggregation we get between compute and storage, which allows us to scale/optimize compute and storage resources independently based on demand. Shared storage also al...

The Seattle Report on Database Research (2022)

Every 5 years, researchers from academia and industry gather to write a state-of-the-union (SOTU) report on database research. This one was released recently . It is a very readable report, and my summary consists of important paragraphs clipped from the report. Emphasis mine in bolded sentences. I use square brackets for when I paraphrase a long text with a more direct statement. TL;DR: The SOTU is strong (the relational database market alone has revenue upwards of $50B) and growing stronger thanks to boom in cloud computing and machine learning (ML). For the next 5 years, research should scale up to address emerging challenges in data science, data governance, cloud services, and database engines.   What has changed in the last 5 years Over the last decade, our research community pioneered the use of columnar storage, which is used in all commercial data analytic platforms. Database systems offered as cloud services have witnessed explosive growth. Hybrid transactional/analytical...

Strict-serializability, but at what cost, for what purpose?

Image
Strict-serializability guarantees that transactions appear to occur in an order consistent with the "real-time" ordering of those transactions: If transaction T1 commits before transaction T2 is invoked, then the commit timestamp of T1 precedes the commit timestamp of T2. This is, in fact, the real-time constraint from linearizability , but applied across transactions not just per-key. A strict-serializability system satisfies both serializability (transactions appear to occur as if they are executed one at a time in isolation) and linearizability per key (after all single-key reads/writes are transactions over one item). Below figure is from https://jepsen.io/consistency . However, this is a one-way implication, the other direction does not hold. You can satisfy both serializability per transactions and linearizability per key, but fail to satisfy strict-serializability. (Below I give an example accompanied with a TLA+ specification to check it.) This is because, in strict-...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems