Saturday, September 25, 2010

The Design and Implementation of a Log-Structured File System

by Rosenblum, Mendel and Ousterhout, John K. SOSP 1991. PDF link for the paper

The motivation for the log-structured filesystem (LFS) are threefold.
1) Traditional file systems (say UNIX FAST filesystem) perform poorly for write. FAST requires 6 writes to create a new one block file.
2) Main memory caching capacity is increasing, which takes care of reads, so the workloads to disks are dominated by writes. And this is bad, see above.
3) There is a large gap between random and sequential i/o performance. Random is dominated by seeking time (which is very costly), whereas you avoid the seek time with sequential i/o.

The LFS idea is simple: focus on write performance and utilize sequential bandwidth of disks. Here is how it works. LFS buffers all writes into a segment in main memory. (Even though the writes may be for different files, they are appended into the segment.) When the segment is full, LFS writes the segment to unused place on disk sequentially again.

Since LFS writes sequentially by appending inodes, new versions can be later in the disk. Then we have the question of how to find the inodes. The answer is inode map data structure is used for finding inodes. But then how do we find the inode map? inode map is also written in the segment in a log structure manner.

The answer is checkpoint region (CR), which has a fixed place on disk. CR points to the inodes on the disks, and it is primarily kept in the main memory. The disk copy of CR is updated periodically every minute (as a defense to power outage, and for use during bootup). The main memory always has the most up to date version of CR. To read a file from disk, LFS consults the CR (at memory) to learn the disk address from the inode at CR.

Since new versions are always appended, old versions of things are left in the disk. These old dead versions should be garbage collected. But the problem is in a segment, there can be live blocks next to some dead blocks. So, LFS copies/packs the live blocks in to a new segment, and then reclaim the entire segment as available, free.

The way LFS detect live and dead blocks in a segment is by scanning the blocks. LFS looks at the blocks in a segment, learns its inode#, address in disk, checks this with the inode map in CR (in main memory). If these match, this block is new, otherwise the block is old as the CR points to a new version.

LFS performs crash recovery as follows. CR (on disk) has a pointer to the last segment it knows of, but another segment may have been added after CR has been updated. Luckily, each segment store a pointer to the next segment. So the CR asks the last segment it knows of to see if it has a next segment pointer. If so, CR follows these pointer[s] and updates itself.

Since LFS reduces the 6 write per block in FAST to almost 1 write (paper gives this number as 1.7 empirically), LFS performs very well. In fact for the experiment with several 1K sized file writes, LFS shows 10 times improvement over FAST. For larger files, LFS is still better, but the improvement is not that drastic, because for large files seek time is not the most dominant factor anymore. LFS performs worse that FAST if a file is re-read sequentially.

LFS has spanned of a lot of file systems including:
jffs: fs for flash, log structured
sun's zfs: copy on write
netapp's wafl: zero copy snapshot

In particular LFS becomes important for SSDs, due to a technical property of writing segments in SSD. Yes, SSDs can do random write very fast, and LFS is not needed. But the way SSDs update something on a segment is to read the entire segment and update and write the entire updated segment back. And that is inefficient. LFS can reduce that inefficiency. Here is a great coverage of the subject.

We discussed an interesting question after the presentation. "Is LFS a good filesystem for the mapreduce workload?" The answer is not straightforward. I would argue no, because mapreduce already deals with large sequential data blocks. But, in the shuffle, the system may need to a lot of random access writes, which may make a case for LFS.

1 comment:

J Chris said...

If you are interested in this sort of thing you might find CouchDB's pure tail-append storage format interesting.

Some basic info about it is here:http://horicky.blogspot.com/2008/10/couchdb-implementation.html