Panache: a parallel filesystem cache for global file access
This paper (FAST'10) is from IBM, and builds on their previous work on the GPFS filesystem. The problem they consider is how to let a site access a remote site's servers transparently without suffering the effects of WAN latencies. The answer is conceptually simple: put a caching filesystem/cluster at the accessing site. But there are a lot of issues to resolve for this to work seamlessly.
Panache is a parallel filesystem cache that provides the reliability, consistency, and performance of a cluster filesystem despite the physical distance. Panache supports asynchronous writes for both data and metadata updates. Reads, in contrast, are synchronous: if a read misses in the Panache cache cluster, or the validity timer on the cached data has expired, the operation waits until the data is read from the remote cluster filesystem.
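To make the read semantics concrete, here is a minimal Go sketch of that logic. Everything in it (the validityWindow constant, the fetchFromRemote stub, the in-memory map) is my own illustration, not the paper's implementation:

```go
package main

import (
	"fmt"
	"time"
)

// cacheEntry is a hypothetical stand-in for Panache's cached state.
type cacheEntry struct {
	data        []byte
	lastChecked time.Time
}

const validityWindow = 30 * time.Second // illustrative revalidation interval

var cache = map[string]*cacheEntry{}

// fetchFromRemote stands in for a synchronous read from the remote
// cluster over pNFS; here it just fabricates bytes.
func fetchFromRemote(path string) []byte {
	return []byte("remote contents of " + path)
}

// read returns data from the cache if present and still valid;
// otherwise it blocks until the data is (re)fetched from the remote
// cluster, mirroring Panache's synchronous read semantics.
func read(path string) []byte {
	e, ok := cache[path]
	if ok && time.Since(e.lastChecked) < validityWindow {
		return e.data // cache hit, validity timer not yet expired
	}
	// Miss or expired: the operation waits for the remote read.
	data := fetchFromRemote(path)
	cache[path] = &cacheEntry{data: data, lastChecked: time.Now()}
	return data
}

func main() {
	fmt.Printf("%s\n", read("/projects/a.txt")) // miss: fetched remotely
	fmt.Printf("%s\n", read("/projects/a.txt")) // hit: served from cache
}
```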
Panache architecture
Panache is implemented as a multi-node caching layer, integrated within GPFS, that can persistently and consistently cache data and metadata from a remote cluster. Every node in the Panache cache cluster has direct access to cached data and metadata, so once data is cached, applications running on the Panache cluster achieve the same performance as if they were running directly on the remote cluster. If the data is not in the cache, Panache acts as a caching proxy and fetches the data in parallel, using multiple cache cluster nodes to drive the ingest and reading from multiple remote cluster nodes over pNFS.
Panache allows updates to be made to the cache cluster at local-cluster performance by asynchronously pushing all data and metadata updates to the remote cluster. Every update to the cache causes an application node to queue a command message on one or more gateway nodes. At a later time, the gateway node(s) read the data in parallel from the storage system and push it to the remote cluster over pNFS.
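Here is a rough Go sketch of that asynchronous write path. The updateMsg fields, the channel-based queue, and the pushOverPNFS stub are my illustrative stand-ins; the paper's actual queueing also handles ordering, dependencies, and recovery:

```go
package main

import (
	"fmt"
	"sync"
)

// updateMsg is a hypothetical command message: only metadata travels
// in the message; the dirty data itself stays in the local storage
// system until the gateway pushes it.
type updateMsg struct {
	path   string
	offset int64
	length int
}

// gateway drains its queue in the background and pushes each update
// to the remote cluster (stubbed here by pushOverPNFS).
func gateway(id int, queue <-chan updateMsg, wg *sync.WaitGroup) {
	defer wg.Done()
	for m := range queue {
		// In Panache the gateway first reads the dirty bytes back
		// from the local parallel filesystem; we just log the push.
		pushOverPNFS(id, m)
	}
}

func pushOverPNFS(id int, m updateMsg) {
	fmt.Printf("gateway %d: pushed %s [%d,+%d) to remote cluster\n",
		id, m.path, m.offset, m.length)
}

func main() {
	queue := make(chan updateMsg, 64)
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two gateway nodes draining one queue
		wg.Add(1)
		go gateway(i, queue, &wg)
	}
	// Application nodes return to the caller as soon as the message
	// is queued; the WAN push happens later, off the critical path.
	queue <- updateMsg{"/projects/a.txt", 0, 4096}
	queue <- updateMsg{"/projects/b.txt", 8192, 4096}
	close(queue)
	wg.Wait()
}
```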
All filesystem data and metadata "read" operations, e.g., lookup, open, read, readdir, getattr, are synchronous. Panache scales I/O performance by using multiple gateway nodes to read chunks of a single file in parallel from multiple remote nodes over NFS/pNFS. One of the gateway nodes (selected by a hash function) becomes the coordinator for a file; it divides the requests among the other gateway nodes, which proceed to read the data in parallel. Once a node finishes its chunk, it asks the coordinator for more chunks to read. When all the requested chunks have been read, the gateway node notifies the application node that the requested blocks of the object are now in cache.
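This chunk distribution is essentially a work-pull pattern, which the following Go sketch illustrates. The chunk size, gateway count, and readChunk stub are assumptions for illustration; in the paper the coordinator is picked by hashing, while here a shared channel plays that role:

```go
package main

import (
	"fmt"
	"sync"
)

const (
	chunkSize   = 1 << 20 // 1 MiB chunks, an illustrative choice
	numGateways = 4
)

// readChunk stands in for an NFS/pNFS read of one chunk from the
// remote cluster into the local cache.
func readChunk(gw, chunk int) {
	fmt.Printf("gateway %d ingested chunk %d\n", gw, chunk)
}

func main() {
	fileSize := 10 * chunkSize
	chunks := make(chan int)

	// The coordinator enumerates the chunks of the file; gateways
	// pull the next chunk as soon as they finish the previous one,
	// which naturally balances load across faster and slower nodes.
	go func() {
		for c := 0; c < fileSize/chunkSize; c++ {
			chunks <- c
		}
		close(chunks)
	}()

	var wg sync.WaitGroup
	for gw := 0; gw < numGateways; gw++ {
		wg.Add(1)
		go func(gw int) {
			defer wg.Done()
			for c := range chunks {
				readChunk(gw, c)
			}
		}(gw)
	}
	wg.Wait()
	fmt.Println("all chunks in cache; application read can complete")
}
```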
Discussion
The disadvantages of this solution are that it lives in kernel space and that it depends on proprietary technologies. I wonder whether it would be possible to implement a Panache-like system in user space with reasonable efficiency.
Some other questions I have are as follows. Could xFS have been used for the Panache cache cluster/filesystem? xFS provides distributed collaborative caching and, I think, would be suitable for this task.
Panache uses a complicated distributed locking mechanism at the gateway nodes to sort out the dependencies among update operations originating at multiple gateways. Wouldn't it be easier to push these updates on a best-effort basis to the remote cluster/filesystem and let it sort out the dependencies there? Why was this not considered as an option?