Thursday, March 10, 2011

Flexible, Wide-Area Storage for Distributed Systems with WheelFS

One of my students, Serafettin Tasci, wrote a good review of this paper, so I will save time by using his review below, instead of writing a review myself.

In this paper the authors propose a storage system for wide-area distributed systems called WheelFS. The main contribution of WheelFS is its ability of adaptation to different types of applications with different consistency, replica placement or failure handling requirements. This ability is obtained via semantic cues that can be easily expressed in path names. For example to force the primary site of a folder john to be X, we can specify the cue “home/users/.Site=X/john”. This representation enables preserving of POSIX semantics and minor change in application software to use the cues.

In WheelFS, there are 4 groups of semantic cues. Placement cues are used to arrange the location of primaries and replicas of a file or folder. Durability cues specify the number and usage of replicas. Consistency cues maintain a tradeoff between consistency and availability via timeout limits and eventual consistency. And finally, large read cues are useful in reading large files faster via entire file prefetching and usage of neighbor client caches.

WheelFS consists of clients and servers. Clients run applications that use WheelFS and uses FUSE to present the distributed file system to these applications. In addition, all clients have local caches. Servers keep file and directory objects in storage devices. They group objects into structures called slices.

A third component of WheelFS is the configuration service which keeps slice tables that contain object-server assignments. Each entry in the slice table contains a replication policy and replicas for a slice. Configuration service is replicated on a small set of servers and uses Paxos for master election. It also provides a locking interface to servers by which the usage of slices and slice table by servers is coordinated.

When a new file is created, a replication policy is used via the cues and then it contacts the configuration service to see if a slice in the table matches the policy. If no slice matches the policy, the request is forwarded to a random server matching that policy. In addition, WheelFS uses write-local policy in which the primary of a newly created file is the local server by default. This policy enables faster writes.

For replication, WheelFS uses primary/backup replication. For each slice there is a primary server and some backup servers. However, this scheme causes two problems: Firstly, since all operations pass through the primary and updates require the primary to wait ACKs from all backups; this replication scheme may cause significant delays in wide-area systems. Secondly, if the .SyncLevel cue is used the replicas may get some of the updates too late. So, if the primary dies, a backup which replaces the primary may miss some updates and needs to learn the missing updates from other backups.

By default, WheelFS uses close-to-open consistency. But in case of a primary failure, all operations will have to delay waiting for the new primary to start. To avoid this delay, WheelFS provides .EventualConsistency cue that can be used whenever consistency requirements are not strict. In addition, WheelFS uses a write-through cache that improves consistency by writing the copies of each updates in cache to disk with the cost of increased latency.

When clients need to use the data in their cache, they need to get an object lease from the primary to preserve consistency. But this also brings additional latency cost since the primary needs to wait for all leases to complete to make an update on the object.

In the experiments, they present a number of applications that can be built on top of WheelFS such as distributed web cache, email service and file distribution service. Distributed web cache application shows that it provides a comparable performance to popular systems such as CoralCDN. In addition, in case of failures, if eventual consistency is used, it provides consistently high throughput. In file distribution experiment, they revealed that with the help of locality provided via large read cues, it achieves faster file distribution than BitTorrent. Finally, comparison of WheelFS to NFSv4 shows that it is more scalable thanks to the distributed caching mechanism.

No comments:

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...