ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks

This paper is from IBM, 1992. This is a foundational paper in databases area. ARIES achieves long-running transaction recovery in a performant/nonblocking fashion. It is more complicated than simple (write-ahead-log) WAL-based per-action-recovery, as it needs to preserve the Atomicity and Durability properties for ACID transactions. Any transactional database worth its salt (including PostGres, Oracle, MySQL) implements recovery techniques based on the ARIES principles.



"I have this condition... It's my memory." --From the movie Memento

Background

There is memory and there is disk (these days it is SSD, back in the old days it was a rotating hard disk). Memory is fast, but not persistent. Disk is durable, but slow.

We want both fast and durable.

We might execute and commit a transaction in-memory to achieve fast execution, but a committed transaction should also be durable. Flushing each transaction to the disk would add long I/O stalls before each commit. So it looks like we are caught between a rock and a hard place.

What is the solution? WAL: Write Ahead Log

WAL is somewhat similar to the sticky notes and photos in the movie Memento (2000); it helps recover the current state after working memory is wiped.

We may force each WAL update (record of which operation we execute) to the disk, for durability. This is smaller than forcing the entire state or physical block to the disk. But flushing every WAL update (or at least WAL updates for a transaction before commit) adds stall. Can we do better? What is the most performant (avoiding I/O stalls) way that we can still achieve durability?

ARIES gets us there. It doesn't leave anything on the table.

Here is some additional setup before we can appreciate how this is done.

There is the WAL keeping logical record of operations and there is the buffer pool keeping pages/blocks (physical effects).

(from Andy Pavlo's lecture on ARIES)


The most performant we can get is with Steal and No-Force policy.

  • Steal says: A transaction can steal pages from the buffer to do work. This allows Transaction C's dirty pages to be written to disk before it completes so that there's room for Transaction B to do work.
  • No-Force says: Dirty pages for finished transactions can be flushed to disk whenever convenient. They don't have to flush to disk before we can commit the transaction.


We want our WAL and recovery protocols to support a Steal and No-Force policy for the buffer pool, in order not to throttle the performance while we ensure durability. Given the Steal and No-Force constraints, ARIES goes the extra mile: It tries to get the WAL flushing to be almost async, to avoid any stalls from the WAL flushing as well. Of course for correctness, there needs to be  tiny bit of dependency/synchronization in flushing of the WAL. We will discuss this next under Principle 1.

The principles

Principle 1. Write-ahead logging: Any change to an object is first recorded in the log, and the log must be written to the stable storage *before changes to the object are written to disk.*

Note that, this doesn't mean that WAL records need to be flushed immediately. This actually gives us a lot of flexibility, and allow us to flush WAL records in batches and asynchronously most of the time. The only exception is when the buffer pool is writing a dirty page back to disk whose WAL is not flushed already. That is the only case where we need to do the WAL flushing before we can write the page back to the disk.

This was the most important principle. I will not go through to the full protocol. I will mention the other two basic principles, and refer you to two good resources on this.

Principle 2. Repeating history during Redo: On restart after a crash, ARIES retraces the actions of a database before the crash and brings the system back to the exact state that it was in before the crash. Then it undoes the transactions that were still active at crash time, because these transactions did not get to commit and need to be rollback.  

Principle 3. Logging changes during Undo: Changes made to the database while undoing transactions are logged to ensure such an action isn't repeated in the event of repeated restarts.

Andy Pavlo's lecture explains ARIES really well. Watching this is time very well spent. 

This wikipedia page provides a succinct summary of ARIES.

This blog post from Shreya explains things in a very accessible way.

My dear friend Pat Helland mentions that at Tandem Franco Putzolu implemented  ARIES style WAL-based performant transactional recovery in the 1980s (outlined in the crash recovery section of this VLDB 1984 paper), but they had not published about it. The thread is a good read.

MySQL's implementation of ARIES principles is discussed in this thread.

Formal modeling of ARIES

After we understand the principles/invariants, ARIES is not a complicated protocol. But, database recovery is a tricky business, and unavoidably the ARIES protocol has a lot of details. It is easy to get things wrong in the implementation, and it is hard to test that our implementation does recovery correctly. There are several reasons for why this is the case. The recovery code doesn't get exercised frequently (hopefully your database doesn't crash all the time). The state space for recovery is huge, so it is hard to model check things. Verification approaches need to consider timing of the crash, and ensure recovery from any ill-timed crash. This is a hard problem and only recently we started seeing verification work address this.

There are also bugs at the seams between different layers. A FAST'18 paper shows that many bugs still existed in recovery for consensus-based storage. What about lower layer implementation? Are you sure fsync is flushing correctly?


Anyways. I am getting ahead of myself. I would just be happy to have with a TLA+ model of ARIES recovery protocol. A TLA+ model would not only be useful to ensure we get the protocol right, but it would also help us to grok the protocol. In the model, we can add more labels to reduce the atomicity and see when the correctness breaks, so we can reduce the protocol to its smallest atomic primitives. This model would also be a good place to start when we want to customize recovery protocol for our system implementation.

I searched but couldn't find a TLA model of ARIES. There are WAL models for some systems, but as far as I could see they were not doing transactional recovery  like ARIES. Alan Fekete sent me this I/O automata modeling of ARIES by one of his students. It would be nice if someone can create a TLA+ modeling of the ARIES protocol starting from that model.

On a related note, recent post from Andrew Helwer about how he used TLA+ to validate the design of a snapshot coordination system for Microsoft Azure Ring Master is a good read.

NewSQL and ARIES

ARIES is for single node transactional log recovery. For a multiple node system like NewSQL systems using Paxos groups per partitions, other considerations apply. In those systems, log-based recovery can leverage help from other nodes. (The NewSQL paper from Andy Pavlo has a short section on crash recovery considerations.) When a primary of a Paxos group crashes in such a database, another node in the group  will take over as the primary, to reduce any unavailability period until the crashed node recovers and potentially rejoins the group. The new primary is still tasked with trying to forward recover a transaction or to abort it. These edges of the system are complicated, and I think prone to be buggy. I think some papers mention the work done here, but it would be good to see a more detailed exposition of recovery.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book