Isolation Concepts (Chp. 7, part 2, Transaction processing book)

Continuing with our Transaction Processing book reading, the second part of Chapter 7 covers degrees of isolation (and them co-existing), phantoms and predicate locks, how to implement predicate locking efficiently (via intent locks, and more exotic techniques), scheduling of locks, and dealing with deadlocks.

I really liked these sections, as they contain many nice algorithmic ideas. The DB guys back then seemeed to have fun designing and then implementing these locking algebra/techniques. I miss the days where we could make up stuff/theory from first principles (because that is the "right" elegant thing to do)  and implement them with no fuss. Nowadays everything is too complicated, and we deal pieces in systems of systems where each things touches so many other that we are often zugzwanged out of innovation. 

Below I borrow several paragraphs from the book to give a brief summary of the second part of Chapter 7. 

Degrees of isolation

Locking helps to provide the isolation illusion. By acquiring fewer locks, or by holding them for shorter duration, the transaction can reduce its impact on others and increase its chances of parallel execution.

A typical default is to guard against isolation anomalies caused by WRITE->WRITE and WRITE->READ dependencies, but to ignore READ->WRITE dependencies. This goes under the name cursor stability in most SQL systems.

The concepts and definitions of degree 0, 1, and 2 dependencies easily follow from the idea that 0° ignores all dependencies, 1° is sensitive to WRITE->WRITE dependencies, and 2° is sensitive to WRITE->WRITE and WRITE->READ dependencies.

The common names for these properties are anarchy (0°), browse (1°), cursor stability (2°), and isolated, serializable, or repeatable reads (3°).

  • Degree 0. A 0° isolated transaction does not overwrite another transaction’s dirty data if the other transaction is 1° or greater.
  • Degree 1. A 1° isolated transaction has no lost updates.
  • Degree 2. A 2° isolated transaction has no lost updates and no dirty reads.
  • Degree 3. A 3° isolated transaction has no lost updates and has repeatable reads (which also implies no dirty reads). This is "true" isolation.

If a transaction observes the 0°, 1°, 2°, or 3° lock protocol, then any legal history will give that transaction 1°, 2°, or 3° isolation, as long as other transactions are at least 1°.

That means the degrees of isolation coexist peacefully. Having degree 3 transactions still provide their serializability guarantee in the presence of a degree 1 transactions is cool. It is nice to see them formulated and studied like this. Today we often don't talk about mixed isolation transactions. But when one notices a performance problem for a frequently run transaction, it is possible to manually reducing the isolation levels of that transaction after checking that it won't mess up other transactions and application integrity.

There is one flaw in mixing isolation levels. A 1° transaction's inputs are 1° isolated, which means the transaction may get dirty reads. Such data may not satisfy the system consistency constraints. If the transaction uses dirty reads to update the database and then commits these updates, other transactions will see inconsistent values and may malfunction. Given this, most systems reserve 1° isolation (browse mode) for read-only transactions. They abort browsing transactions that attempt to update public data.

The lower degrees of isolation reduce the time data is locked and reduce the amount of data that is locked, increasing the chances for concurrent execution of different transactions (at the risk of sacrificing consistent inputs and outputs). They do this by releasing locks early (before COMMIT). This gives rise to the notion of lock duration. Locks that are released quickly are called short duration locks. Locks that are held to the end of the transaction are called long duration locks. Degree 0 goes all the way and even gets short write locks. Protocols higher than 0° get long duration write (exclusive) locks but short duration read locks, or no read locks at all.


Consider this transaction that might scan the whole EMP table looking for  records with these attributes and never finding one. Using 3° isolation, it will hold thousands of record locks or one table lock when the scan completes. SQL system implementors worry that such programs will give them a bad name; unsophisticated users can easily write such programs and lock the whole database. Consequently, some SQL systems automatically release share-mode locks by default after the record or table is read. That is, 2° isolation is the default.

Phantoms and predicate locks


Consider the same query/transaction. Either the EMP table is an object and so the SELECT locks the whole EMP table, or records within the EMP table are the lockable objects and the SELECT statement locks all the individual records of EMP that satisfy the query. Suppose the individual record-locking scheme is used. What is to prevent someone else from inserting a new, blue-eyed, red-haired EMP after the read completes? There is no lock on nonexistent records.

Such new or deleted records are called phantoms, records that either appear or disappear from sets. There is no pure record-locking solution for phantoms.

There is an elegant (and abstract) solution to the phantoms problem, called predicate locks. Predicate locks allows each transaction to specify the predicate it wants. This is the purist relational algebra vision. But predicate locks (predicate comparisons) are expensive to execute. What is needed is an efficient realization of the predicate locking scheme. The granular lock protocol comes close to satisfying this need.

Granular locks, tree-locking, and intent locks

The idea is to pick a fixed set of predicates—in essence, to precompute the predicate locks. These predicates form a directed acyclic graph with TRUE at the top. Initially, think of the graph as a tree.

This all sounds just like predicate locks, and that is the point. Each predicate can be locked in shared or exclusive mode. The trick is that the system has chosen a fixed set of predicates, each predicate is represented by a name, and the predicates are organized into a tree.

FIGURE 7.10 A lock hierarchy. Each node of the tree represents a lock covering all its descendants. For example, locking the site Ripa implicitly locks all files and records at that site. To lock the Giovanna record and prevent locks of parent nodes (Phone, Ripa, and In Database), the transaction must first set intention mode locks on the parent nodes.

By following the hierarchical lock protocol, locking a node of the tree in some mode implicitly locks all the descendants of that node in the same mode. An intent mode lock on a node (predicate) should prevent other transactions from setting coarse granularity (i.e., shared and exclusive) on that node.

If a lock is requested in one mode and granted to another transaction in a second mode, the lock request can be granted immediately if the two modes are compatible (+). If the modes are not compatible (–), then the new lock request must wait for the first transaction to release its locks.

When an SQL cursor is declared with the clause FOR UPDATE, record locks acquired by the cursor are acquired in UPDATE mode rather than in SHARED mode. If the cursor updates the data, the update mode lock is converted to an exclusive lock, which is held to transaction commit. If the the cursor moves without updating the data, UPDATE mode lock is downgraded to SHARED mode for degree 3 isolated transactions (repeatable read transactions), and released by degree 2 isolated transactions (cursor stability transactions). This downgrading is done to avoid update mode locks delaying other transactions that are scanning the database.

Key range locking

Granular locks are actually predicate locks. They solve the phantom problem by locking sets of records rather than individual records. A set of records can be partitioned into subsets based on some value stored in the records. Typically, this value is a prefix of the record’s primary key. If all transactions accessing the records first acquire locks on the record's primary key prefix, then phantoms can be locked by locking their primary key prefixes. This idea is generically called key-range locking.

Suppose the file has a sorted list of records with key values W, Y, Z. Inserting X or deleting Y raise two issues: (1) the existence of records, X and Y in this case, and (2) the ordering of records, W,X,Y,Z in this case.

If transaction T performs a read unique of record X, and it is not found, T must prevent others from inserting phantom record X until T commits. If T is at record W and does a read next to get record Y, then W and Y cannot change; in addition, no one may insert a new record (phantom record X) between W and Y, until T commits. If T deletes record Y, no other transaction should insert a phantom Y, until T commits. In addition, no other transaction should notice that the original Y is missing and that Z is now immediately after X, until T commits.

The key-range lock protocol acquires a lock on the key range before it accesses records in the key range. Because key ranges are granules within files, intent mode locks on the file—and on any coarser-granularity objects—must first be acquired in accordance with the granular lock protocol.

If these three key ranges are thought of as predicates, they are simply predicate locks covering the records. Locking the key starting the range is a surrogate for locking the entire range. By using the granularity of locks protocol, and by locking root to leaf, it is possible to get shared, exclusive, or intent mode locks on these sets.

Locks and scheduling

When locks are unavailable, some transactions must wait. Waiting creates an interaction between lock scheduling and global scheduling of system resources. Deadlock is the ultimate form of waiting, and resolving deadlocks by rolling back a set of transactions is a major scheduling decision. A simple analytic model indicates that waiting and deadlocks rise with the square of the degree of multiprogramming. Here, again, is an interaction between locking and resource scheduling.

The lock manager is a scheduler. If many processes are waiting for a lock, a scheduling decision is made when the lock becomes available; this decision determines which transactions should continue to wait and which should be granted access to the resource. Traditionally, simple first-come, first-served scheduling has been used.

Convoy effect

The perils of FIFO (first in, first out) scheduling often appear in new systems as convoys: high-traffic locks that become the dispatcher list. In a convoy situation, the execution of processes in the system is controlled by the processes’ passage through these locks or semaphores. The log lock is a frequent source of convoys. Each update generates a log record to be inserted at the end of the log.  Once formed, convoys last forever. Each process waiting to get to the front, doing a small amount of work, and going back to the end of the convoy again. The problem is exacerbated by the observation that lock-unlock is a few hundred instructions while lock-wait-dispatch-unlock is a few thousand instructions. Consequently, when a convoy forms, the application processor load can increase by a factor of from 2 to 10! At this point, the system is in a situation of lock thrashing. Most of the CPU is dedicated to task switching. This is a stable phenomenon: new processes are sucked into the convoy and if a process leaves the convoy (for I/O wait or lock wait), when it returns the convoy will probably still exist.  Read this paper summary for details (Pat Helland makes a cameo appearance here). 

Once recognized, the problem of convoys is easily solved. Do not schedule hotspot locks on a FIFO basis; rather, wake up everybody who is waiting and let the scheduler-dispatcher decide who should be dispatched and granted the lock. Secondly, in a multiprocessor, spin (busy wait) on a lock for a few hundred instructions, rather than wait. Spinning is much less expensive than a process wait plus dispatch. If the lock is, in fact, 20% busy and is held for only a few hundred instructions, then it probably will be free before the processor can run through the wait logic.


Convoy phenomenon was just a slowdown.  In a deadlock situation, each member of the deadlock set is waiting for another member of the set. In other words, no one in the set is making any progress, and no one will until someone in the set completes.

An easy solution to deadlock is never to wait, but instead to cancel any request that might wait, do a (partial) rollback, and then restart the program. That definitely avoids deadlock, but it may create a livelock situation in which each member of the livelock set may soon want to wait for another member of the set, resulting in another rollback and restart. Livelock is actually worse than deadlock, because it is harder to detect and because it wastes resources.

It is possible to avoid (prevent) deadlock. The standard deadlock avoidance technique is to linearly order the resources and acquire them only in that order. A deadlock requires a waits-for cycle, and the linear order (or any partial order) avoids such cycles. We will discuss this in Section 8.

Given that deadlocks are allowed to occur, how can the system detect and deal with them? One solution is timeout: whenever a transaction waits more than a certain time, declare that something is wrong and rollback that transaction.

Timeout is a very pessimistic deadlock detector. Any long wait will be declared a deadlock. It seems more elegant to detect deadlocks and distinguish between deadlocks and long waits. Perhaps a compromise position is to run a deadlock detector after a transaction times out, just to see if there is a deadlock that can be resolved. If one is found, then a victim can be picked, and the others can continue waiting until they are granted.


Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

The demise of coding is greatly exaggerated

The end of a myth: Distributed transactions can scale

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

Why I blog

There is plenty of room at the bottom