The role of distributed state

This is a 1990 technical report by John Ousterhout, a pioneering and very influential figure in systems research. I wrote a summary of his log-structured file system (LFS) paper earlier on this blog. His ideas on LFS and RAMCloud seem to be becoming important in cloud computing these days. Ousterhout is also the creator of TCL/TK, so he also has a lot to say about programming languages for the web applications domain. The miscellaneous articles on his non-academic homepage contain a lot of wisdom from his experience in the software development industry, and are also worth reading carefully.

This paper, titled "The Role of Distributed State", describes the advantages and disadvantages of distributed state and argues that building a distributed system consists of making tradeoffs among various alternatives for managing that state. Illustrating this point, the two case studies in the paper, the NFS and Sprite file systems, end up with complementary sets of good and bad properties: the strengths and weaknesses of the two systems are almost exact opposites.

The advantages of distributed state are performance (through local caching of the state), coherency (through using the state for collaboration and coordination), and reliability (through replication of the state). Unfortunately, these advantages are only potential advantages; in practice they are difficult to achieve due to the four problems introduced by distributed state: consistency, crash sensitivity, time and space overheads, and complexity.

The first problem is consistency: if the same piece of information is stored in several places and one of the copies changes, what happens to the other copies? There are three approaches to coping with this: detect stale data on use (lazy), prevent stale data (pessimistic, which is costly), and tolerate inconsistency (optimistic, which can introduce complexity). The second problem is crash sensitivity: unless the state is fully replicated, the failure of one machine renders the rest of the distributed state useless as well, since the lost piece has no replica. The third problem is time overhead, incurred mainly in maintaining consistency to prevent stale data. (Space overheads are a lesser concern, since storage is plentiful.) The last problem is complexity, as getting a distributed protocol right is hard.
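The lazy "detect stale data on use" approach can be sketched in a few lines. This is a toy illustration with hypothetical names (`Master`, `CachedCopy`), not anything from the paper: each cached copy remembers the version number it was read at and validates against the master copy only when it is actually used.

```python
# Lazy consistency sketch (hypothetical names): validate cached data on use.

class Master:
    def __init__(self, value):
        self.value = value
        self.version = 1

    def write(self, value):
        self.value = value
        self.version += 1      # every update bumps the version number

class CachedCopy:
    def __init__(self, master):
        self.master = master
        self.value, self.version = master.value, master.version

    def read(self):
        # Lazy check: refresh only if the cached version is stale.
        if self.version != self.master.version:
            self.value, self.version = self.master.value, self.master.version
        return self.value

master = Master("v1")
cache = CachedCopy(master)
master.write("v2")            # the cached copy is now stale
assert cache.read() == "v2"   # staleness is detected and repaired on use
```

The pessimistic alternative would invalidate or lock the cached copies before the write proceeds; the optimistic one would let the stale read happen and reconcile later.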

Case Study 1: NFS file system
The NFS design was optimized for simplicity and robustness, with performance a secondary goal. Simplicity and robustness were achieved by using a stateless protocol with idempotent operations. In NFS, servers do not store any information about their clients; they do not keep track of which files are currently in use or which clients are using which files. Distributed state is kept almost exclusively on the clients.

The advantage of this design choice is that NFS handles faults easily. Since NFS is "stateless" (at the servers, that is) to begin with, a crash corrupts no server state, so no special code is needed for crash recovery.
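Idempotency is what makes this retry-based fault handling safe. A toy contrast (hypothetical helper names, not NFS code): an NFS-style "write this block at this offset" can be blindly retried after a timeout, whereas an "append" cannot.

```python
# Why idempotent operations simplify recovery: retries are harmless.

def write_at(store, offset, data):
    # Idempotent: executing this twice leaves the same result.
    store[offset] = data

def append(log, data):
    # Not idempotent: a retry after a lost reply duplicates the data.
    log.append(data)

blocks = {}
write_at(blocks, 0, b"hello")
write_at(blocks, 0, b"hello")      # client retry after timeout: harmless
assert blocks == {0: b"hello"}

log = []
append(log, b"hello")
append(log, b"hello")              # client retry after timeout: duplicate!
assert log == [b"hello", b"hello"]
```

A client that never hears the server's reply can simply resend an idempotent request, so lost messages and server reboots look the same as a slow server.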

The disadvantage of this design is that performance suffers greatly. Whenever a client issues a write request, the server must guarantee that all modified data is safely on disk before the write returns, which results in low write throughput. Making things worse for performance, NFS uses a "write-through-on-close" policy to reduce windows of inconsistency: whenever a file is closed on a client machine, the client immediately transmits the file's modified data back to the server. The statelessness of NFS requires that new data be returned to the server immediately to reduce consistency problems, and the only way to return data to the server is with the write operation, which forces the data to disk. Even so, consistency problems can still arise because different clients may cache copies of the same file, so clients poll the server to decide whether to invalidate their caches (which further reduces performance). Even then, a window of inconsistency remains in NFS, due to the timing of events on the clients.
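The stateless pattern above can be caricatured as follows. This is a hedged sketch with hypothetical names, not the NFS protocol: the server forces every write to "disk" before replying and keeps no table of clients; each client read polls the file's attributes (here a change counter standing in for the modification time) to decide whether its cached copy is still valid.

```python
# Stateless-server sketch (hypothetical names): durable writes, polling clients.

class StatelessServer:
    def __init__(self):
        self.disk = {}         # path -> (data, change_id); no per-client state
        self._change = 0

    def write(self, path, data):
        self._change += 1
        self.disk[path] = (data, self._change)  # "on disk" before the reply

    def getattr(self, path):
        return self.disk[path][1]               # attribute used for cache checks

    def read(self, path):
        return self.disk[path][0]

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}        # path -> (data, change_id seen at fetch time)

    def read(self, path):
        change = self.server.getattr(path)      # the consistency poll
        if path not in self.cache or self.cache[path][1] != change:
            self.cache[path] = (self.server.read(path), change)
        return self.cache[path][0]

server = StatelessServer()
a, b = Client(server), Client(server)
server.write("/f", b"v1")
assert a.read("/f") == b"v1"
server.write("/f", b"v2")          # e.g. b's write-through on close
assert a.read("/f") == b"v2"       # a's poll notices the new change_id
```

Note that every `read` costs a round trip for the poll even when the cache is valid, which is exactly the performance tax the text describes; and between polls a client can still read stale data, which is the remaining window of inconsistency.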

Case Study 2: Sprite file system
Sprite's design is optimized for performance. Sprite chooses to be "stateful":
  1. Servers keep information in their main memories about which workstations are reading or writing which files. This requires clients to notify servers whenever files are opened or closed, but allows the servers to enforce consistency as described below.
  2. Servers retain modified file blocks in their main memories, and do not write that information back to disk until it has aged for thirty seconds.
  3. Clients also retain modified file blocks in their main memories; they do not pass new information back to servers until it has aged for thirty seconds or until the information is needed by some other client. If a client has dirty blocks for a file, the server's state information reflects this.

The advantage of this design choice is performance: Since the servers keep track of which clients are using which files, clients need not return modified file data to servers immediately. Another major advantage of statefulness is consistency: If a file is ever open simultaneously on several clients and at least one of them is writing the file, then the server notifies each of the clients and insists that they not cache the file; all read and write operations must be passed through to the server, where they are applied to a single copy of the file in the server's cache. Thus Sprite provides perfect file consistency: each read operation is guaranteed to return the most recently written data for that file, regardless of where and when the file is read and written.
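The server-driven consistency mechanism can be sketched as follows. This is a toy model with hypothetical names, not Sprite's actual code: the server records which clients have each file open and in what mode, and as soon as a file is write-shared it calls back to every client to disable caching for that file.

```python
# Stateful-server sketch (hypothetical names): server-initiated cache control.

class StatefulServer:
    def __init__(self):
        self.data = {}
        self.opens = {}                     # path -> list of (client, mode)

    def open(self, client, path, mode):
        users = self.opens.setdefault(path, [])
        users.append((client, mode))
        writers = [c for c, m in users if m == "w"]
        if writers and len(users) > 1:
            # Write-sharing detected: tell every client to stop caching,
            # so all reads and writes hit the single server copy.
            for c, _ in users:
                c.disable_cache(path)

    def close(self, client, path):
        self.opens[path] = [(c, m) for c, m in self.opens[path] if c is not client]

class Client:
    def __init__(self, name):
        self.name = name
        self.uncached = set()

    def disable_cache(self, path):
        self.uncached.add(path)            # server-initiated callback

server = StatefulServer()
a, b = Client("a"), Client("b")
server.open(a, "/f", "r")                  # read-only: caching still allowed
server.open(b, "/f", "w")                  # now write-shared
assert "/f" in a.uncached and "/f" in b.uncached
```

The key contrast with the stateless design is the direction of communication: here the server pushes invalidations to clients it knows about, instead of clients polling a server that knows nothing about them.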

The disadvantage of the stateful design choice is that the system is more complex; implementing Sprite was difficult due to subtle race conditions. This complexity also creates harder crash recovery problems: a server crash leaves the clients inconsistent with the server and unable to make progress.

To handle the recovery problems, the designers of the Sprite system discovered that distributed state was not just the cause of the recovery problem, but also the solution. By adding slightly to the state kept by clients about their files, they made it possible for servers to recreate all their usage information after rebooting. A new operation was added to the Sprite protocol: reopen. When a server reboots, each client reopens all of its files by passing the server a copy of its state information (including which files are open for reading or writing, which are locked, etc.). This allows the server to reconstruct its state, so that it can continue to guarantee consistent access to files and so that locks are not broken when servers reboot. This reopen fix took its toll on performance, though: when a server reboots, all clients try to reopen their files at once, leading to a recovery storm that can paralyze the server.
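The reopen idea can be sketched in a few lines. Again a hedged toy with hypothetical names, not the Sprite protocol: clients keep their own record of which files they have open, so after a server reboot each client replays that record and the server rebuilds its usage table from scratch.

```python
# Reopen-style recovery sketch (hypothetical names): clients hold the
# redundant copy of the server's usage state and replay it after a reboot.

class Server:
    def __init__(self):
        self.opens = {}                    # path -> set of client names

    def open(self, client, path):
        self.opens.setdefault(path, set()).add(client)

    def reboot(self):
        self.opens = {}                    # all in-memory usage state is lost

    def reopen(self, client, paths):
        for p in paths:                    # client resubmits its open files
            self.open(client, p)

class Client:
    def __init__(self, name, server):
        self.name, self.server = name, server
        self.open_files = set()            # the client's own copy of its state

    def open(self, path):
        self.open_files.add(path)
        self.server.open(self.name, path)

    def recover(self):
        # The "reopen" call: push local state back to the rebooted server.
        self.server.reopen(self.name, self.open_files)

server = Server()
c = Client("c1", server)
c.open("/f")
server.reboot()
assert server.opens == {}                  # usage information wiped out
c.recover()
assert server.opens == {"/f": {"c1"}}      # reconstructed from client state
```

The recovery-storm problem is visible in this sketch too: with thousands of clients, all of these `recover` calls arrive at the freshly rebooted server at roughly the same time.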

Ousterhout takes a strong stance in the conclusion. I think he provides critical and very good advice, so I quote it verbatim here.
Based on my experience with the NFS and Sprite file systems, I do not believe that the stateless model can meet the needs of high-performance workstations of the future. A stateless approach will limit the performance of the system to the performance of disks; unfortunately, disk performance is not improving at anywhere near the rate of processor performance. Non-volatile memory offers some hope for performance improvement, but I think the best solution is a change to more stateful protocols.

On the other hand, distributed state almost always introduces complexity and fragility, so system designers should attempt to reduce distributed state as much as possible. The less state, the better. In Sprite, I suspect that we may have been a little too eager to embrace state, and that a careful redesign of the system could reduce the amount of state we have to maintain.

Finally, the best approach to dealing with failures is to merge recovery with normal operation so that there is nothing special to do during recovery. NFS achieves this quite nicely through its combination of statelessness and idempotency. Recovery happens so infrequently that it is very difficult to debug special-case recovery code: it is hard to invoke the code under test conditions, and the code is hardly ever executed under real-life conditions. This means that there is a good chance that the code will not work when it is needed. On the other hand, if the recovery code and regular-case code are the same, the recovery code will be exercised constantly during everyday operation of the system so it is likely to work correctly when needed.

By the way, this approach of "dealing with failures by merging recovery with normal operation" is exactly what self-stabilization prescribes. So this last paragraph is a major motivation for self-stabilizing systems. I intend to write about self-stabilization in a future post.


