Sunday, February 6, 2011

GPFS: A shared disk file system for large computing clusters

GPFS(FAST'02 paper) is a general parallel filesystem developed by IBM. GPFS accesses disks parallelly to improve disk I/O throughput for both file data and file metadata. GPFS is entirely distributed, with no central manager, so it is fault-tolerant, and there is no performance bottleneck. GPFS was used in 6 of top 10 supercomputers when it first came out. It got less popular as opensource free filesystems such as linux/lustre gained popularity.

GPFS has a shared disk architecture. Figure shows the GPFS architecture with three basic components: clients, switching fabric, disks.

Managing parallelism and consistency
There are two approaches. The first is distributed locking: consult all other nodes, and the second is centralized management: consult a dedicated node. GPFS solution is a hybrid. There are local lock managers at every node, and there is a global lock manager to manage them via handling out tokens/locks.

GPFS uses byte-range locking/tokens to synchronize reads and writes to file data. This allows parallel applications to write concurrently to different parts of the same file. (For efficiency reasons, if the client is the only writer, it can get a lock for the entire file. only if other clients are also interested, then we start to use finer granularity locks.)

Synchronizing access to metadata
There is no central metadata server. Each node stores metadata for files it is responsible, and is called the metadata coordinator (metanode) for those files. Metanode is elected dynamically for a file (the paper does not give any details on this), and only the metanode reads and writes the inode for the file from/to disk. The metanode synchronizes access to the file metadata by providing shared-write lock to other nodes.

When a node fails, GPFS restores metadata being updated by failed node, releases any tokens held by failed node, and appoints replacements for this node for metanode allocation manager roles. Since GPFS stores logs on shared disks, any node can perform log recovery on behalf of the failed node.

On link failures, GPFS fences nodes that are no longer members of the minority groups from accessing the shared disks. GPFS also recovers from disk failures by supporting replication.

The paper does not discuss how the metanode (metadata coordinator for a file) is discovered by nodes that want to access that file. How is this done? Wouldn't it be easier to make this centralized? There is a centralized token manager, so why not a centralized metadata manager? It makes things easer, and you can always replicate it (via Paxos) to fight faults.

Here is a link to two reviews of the paper from CSE726 students.

No comments:

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...