Showing posts with the label snapshot isolation

Analysing Snapshot Isolation

This paper (PODC 2016) presents a clean and declarative treatment of Snapshot Isolation (SI) using dependency graphs. It builds on the foundation laid by prior work, including the SSI paper we reviewed recently, which had already identified that SI permits cycles with two adjacent anti-dependency (RW) edges, the so-called inConflict and outConflict edges. While the SSI work focused on algorithmic results and implementation, this paper focuses more on the theory (this is PODC after all) of defining a declarative dependency-graph-based model for SI. It strips away implementation details such as commit timestamps and lock management, and provides a purely symbolic framework. It also proves a soundness result (Theorem 10), and leverages the model for two practical static analyses: transaction chopping and robustness under isolation-level weakening. Soundness result and dependency graph model: Let's begin with Theorem 10, which establishes both the soundness and completeness of the...
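To make the dangerous structure concrete, here is a minimal sketch (not the paper's algorithm; the edge encoding is my own assumption) of scanning a dependency graph for transactions that have both an incoming and an outgoing anti-dependency (RW) edge, i.e., both an inConflict and an outConflict:

```python
# Minimal sketch: find "pivot" transactions with two adjacent RW (anti-dependency) edges.
# The (src, dst, kind) edge encoding is illustrative, not taken from the paper.

from collections import defaultdict

def find_pivots(edges):
    """edges: iterable of (src, dst, kind) with kind in {'ww', 'wr', 'rw'}."""
    rw_in, rw_out = defaultdict(bool), defaultdict(bool)
    for src, dst, kind in edges:
        if kind == 'rw':
            rw_out[src] = True   # src has an outConflict
            rw_in[dst] = True    # dst has an inConflict
    # Any transaction with both an inConflict and an outConflict is a candidate pivot.
    return [t for t in set(rw_in) | set(rw_out) if rw_in[t] and rw_out[t]]

# Example: the classic write-skew cycle T1 --rw--> T2 --rw--> T1
print(find_pivots([('T1', 'T2', 'rw'), ('T2', 'T1', 'rw')]))  # both T1 and T2 (order may vary)
```

Since every SI cycle contains two adjacent RW edges, looking for this local pivot pattern is enough to catch all candidates for non-serializable behavior, which is what makes the runtime check in SSI affordable.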

Serializable Isolation for Snapshot Databases

This paper (SIGMOD '08) proposes a lightweight runtime technique to make Snapshot Isolation (SI) serializable without falling back to locking. The key idea behind Serializable SI (SSI) is to detect potentially dangerous (write-skew) executions at runtime and abort one of the transactions to guarantee serializability (SER). The goal is to offer the strong guarantees of SER without sacrificing SI's high performance and non-blocking reads. But would it make sense to implement SER by layering on MVCC SI instead of implementing it directly? Do you think an SI-based implementation would be more performant than native 2PL-based SER implementations? What about compared to OCC-based SER? The evaluation section gives some answers. The problem and insight: Let's back up. Write-skew is the canonical anomaly under SI. And the canonical example for write-skew is the "doctors on-call" scenario. Consider two doctors removing themselves from a duty roster. Each transaction checks ...
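Here is a minimal sketch of that write-skew, assuming a toy two-doctor roster and simulated snapshot reads (the key names and the check threshold are illustrative, not taken from the paper):

```python
# Minimal sketch of the doctors-on-call write-skew under plain SI.
# 'alice_on_call' / 'bob_on_call' and the >= 2 check are illustrative.

def run_under_si():
    db = {'alice_on_call': True, 'bob_on_call': True}

    # Both transactions take their snapshot at the same start time.
    snap_t1 = dict(db)
    snap_t2 = dict(db)

    # T1: Alice checks that at least one other doctor stays on call, then signs off.
    if sum(snap_t1.values()) >= 2:
        db['alice_on_call'] = False   # write set: {alice_on_call}

    # T2: Bob runs the same check against his own (stale) snapshot.
    if sum(snap_t2.values()) >= 2:
        db['bob_on_call'] = False     # write set: {bob_on_call}

    return db

print(run_under_si())  # {'alice_on_call': False, 'bob_on_call': False}
```

The two write sets are disjoint, so SI's first-committer-wins check never fires, yet the invariant "at least one doctor stays on call" is violated; SSI instead notices the two adjacent RW edges between the transactions and aborts one of them.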

DDIA: Chp 7. Transactions (Part 1)

Chapter 7 of the Designing Data-Intensive Applications (DDIA) book discusses transactions, yay! Transactions in database systems group multiple operations into a logical unit, a box of operations if you will. They simplify error handling and help manage concurrency issues. See the Gray-Reuter book's introduction and fault-tolerance sections for the motivation on this. The ACID properties (Atomicity, Consistency, Isolation, Durability) are often used to describe transaction characteristics, but their precise meaning has been left fuzzy by implementations. Atomicity ensures all operations in a transaction either complete fully or not at all. It's not about concurrency, but about handling faults during transaction processing. Consistency is an application-level concern, ensuring that transactions maintain application-defined data integrity according to defined rules. This consistency is not the same consistency as in CAP, and it is best to forget about the C in ACID, and understand...

Transactional storage for geo-replicated systems

This paper from SOSP 2011 describes a distributed storage system called Walter. Walter targets web applications that operate across multiple geographic sites, and it aims to balance consistency and latency in such systems by providing strong consistency within a site, and weaker consistency across sites. Parallel Snapshot Isolation (PSI): The paper introduces a new isolation property called Parallel Snapshot Isolation (PSI). PSI ensures consistent snapshots and transaction ordering within a site, causal ordering across sites, and prevention of write-write conflicts. PSI extends snapshot isolation (see Fig 3 for an illustration of snapshot isolation, and here for an example of snapshot-isolated database modeling) by allowing different sites to have different commit orderings. For example, suppose site A executes transactions T1, T2 and site B executes transactions T3, T4. PSI allows site A to first incorporate just T1, T2 and later T3, T4, while site B first incorporates T3, T4 and later ...
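A minimal sketch of that example, assuming the four transactions write disjoint keys (the keys and values are made up for illustration): each site folds in the transactions in its own order, and because PSI forbids write-write conflicts the two sites converge to the same state.

```python
# Minimal sketch of PSI's "different commit orderings per site" example.
# Keys x, y, z, w and the values are illustrative.

def apply(state, txn):
    """Apply a transaction, given as a dict of key -> new value."""
    state = dict(state)
    state.update(txn)
    return state

T1, T2 = {'x': 1}, {'y': 2}   # committed at site A
T3, T4 = {'z': 3}, {'w': 4}   # committed at site B

site_a = {}
for t in (T1, T2, T3, T4):    # site A incorporates its own transactions first
    site_a = apply(site_a, t)

site_b = {}
for t in (T3, T4, T1, T2):    # site B incorporates its own transactions first
    site_b = apply(site_b, t)

# With write-write conflicts prevented, both orders converge to the same state.
print(site_a == site_b)  # True
```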

A critique of ANSI SQL Isolation levels (Transaction Processing Book followup)

This paper is from SIGMOD 1995. As the title says, it critiques the ANSI SQL-92 standard (book) with respect to problems in its isolation level definitions. This is also a good follow-up to the Transaction Book Chapter 7, where we reviewed the isolation levels. In fact, here Gray (with co-authors) gets a chance to amend the content in the Transaction Book with recent developments and a better understanding/mapping of isolation levels to the degrees 0-3. Snapshot isolation (SI) finally makes its debut here in Gray's writing. Why so late? As we discussed earlier, in the 80s database workloads had hot keys, and people rightly ignored optimistic concurrency control as it was not suitable for those workloads. Moreover, memory was a very scarce commodity, and multi-versioning (as in snapshot storage) was also off the table. With its increased adoption (with MySQL's use of SI around the corner) we see coverage of snapshot isolation in Gray's writing. The paper is very well written, and is...

TiDB: A Raft-based HTAP Database

This paper is from VLDB 2020. TiDB is an open-source Hybrid Transactional and Analytical Processing (HTAP) database, developed by PingCAP. The TiDB server, written in Go, is the query/transaction processing component; it is stateless in the sense that it does not store data and is for computing only. The underlying key-value store, TiKV, is written in Rust, and it uses RocksDB as the storage engine. They add a columnar store called TiFlash, which gets most of the coverage in this paper. In this figure, PD stands for Placement Driver, which is responsible for managing Raft ranges, and automatically moving ranges to balance workloads. PD also hosts the timestamp oracle (TSO), which provides strictly increasing and globally unique timestamps to serve as transaction IDs. Each timestamp includes the physical time and logical time. The physical time refers to the current time with millisecond accuracy, and the logical time takes 18 bits. If you know about CockroachDB/CRDB (he...
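A minimal sketch of how such a TSO timestamp could be composed, assuming the physical millisecond clock goes in the high bits and the 18-bit logical counter in the low bits (the packing and overflow handling here are my assumptions, not TiDB's code):

```python
# Minimal sketch of a TSO-style timestamp: physical milliseconds + 18-bit logical counter.
# The bit layout and exhaustion handling are assumptions for illustration.

import time

LOGICAL_BITS = 18
MAX_LOGICAL = (1 << LOGICAL_BITS) - 1

class Tso:
    def __init__(self):
        self.last_physical = 0
        self.logical = 0

    def next_ts(self):
        physical = int(time.time() * 1000)       # current time with millisecond accuracy
        if physical <= self.last_physical:
            physical = self.last_physical        # clock did not advance; bump the logical part
            self.logical += 1
            if self.logical > MAX_LOGICAL:
                raise RuntimeError("logical counter exhausted within one millisecond")
        else:
            self.last_physical = physical
            self.logical = 0
        return (physical << LOGICAL_BITS) | self.logical  # strictly increasing

tso = Tso()
print(tso.next_ts() < tso.next_ts())  # True
```

Because the counter is per-millisecond, at most 2^18 timestamps can be handed out before the physical clock has to advance, which bounds how fast a single allocator can issue transaction IDs.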

Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems

This paper from SIGMOD 2015 describes the addition of Multi-Version Concurrency Control (MVCC) to the Hyper database, and discusses at length how they implement serializability efficiently with little overhead compared to snapshot isolation. Coming only two years later, this paper seems like a response/one-upmanship to the Hekaton paper. In the paper, you can see in many places statements like "in contrast to / unlike / as in Hekaton". In a way, that is the biggest compliment a paper can pay to another paper. Hyper is also an in-memory database, like Hekaton. The paper says that Hyper is like Hekaton in that, in order to preserve scan performance, new transactions are not allocated a separate snapshot copy: the most up-to-date version of the database is kept all in the same place. But you can reach older versions of a record (that are still in use by active transactions) by working backwards from the record using the delta versions maintained. The biggest improvement in Hyp...
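A minimal sketch of that newest-version-in-place idea, assuming before-image deltas ordered newest first (the field names, timestamps, and delta format are illustrative, not the paper's data structures):

```python
# Minimal sketch: the table holds the latest value in place, and older versions
# (still visible to active transactions) are reconstructed by walking a chain of
# before-image deltas backwards from the in-place record.

table = {'balance': 150}  # the in-place, most up-to-date version

# Each delta records the value a field had *before* an update, plus the updater's timestamp.
version_chain = [
    {'ts': 30, 'field': 'balance', 'before': 120},  # newest delta first
    {'ts': 10, 'field': 'balance', 'before': 100},
]

def read_as_of(snapshot_ts, field):
    """Reconstruct the value of `field` as seen by a transaction that started at snapshot_ts."""
    value = table[field]
    for delta in version_chain:                      # walk backwards from the in-place record
        if delta['field'] == field and delta['ts'] > snapshot_ts:
            value = delta['before']                  # undo updates the snapshot must not see
    return value

print(read_as_of(35, 'balance'))  # 150 -- sees the in-place version
print(read_as_of(20, 'balance'))  # 120 -- undoes the ts=30 update
print(read_as_of(5, 'balance'))   # 100 -- undoes both updates
```

The design choice is that scans of current data touch only the in-place records, while the cost of walking deltas is paid only by transactions that actually need an older snapshot.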
