Thursday, February 24, 2011

Volley: Automated Data Placement for Geo-Distributed Cloud Services

Datacenters today are distributed across the globe, yet they need to share data with other datacenters as well as their clients. This paper from Microsoft Research presents a heuristic strategy for data placement to these geo-distributed datacenters. While there has been previous work on data placement in LANs and WSNs, Volley is the first heuristic for data placement strategies for WANs.

A simple heuristic is to place each data to the datacenter closest to the client of that data. But things are not that simple, there are several additional constraints to be considered, including business constraints, WAN bandwidth costs, datacenter capacity limits, data interdependencies, user-perceived latency, etc. For example, it makes more sense to collocate data that are tightly-coupled/interdependent, such as two friends in Facebook that update each other walls. As another example, the frequency of the clients accessing the data needs to be taken in to account as well. As live mesh and live messenger traces show, there is significant data sharing across clients (Figure 5), there can be significant benefits to placing data closest to those who use it most heavily, rather than just placing it close to some particular client that accesses the data. Finally, the live mesh and messenger traces also show that a significant portion of the clients move and change locations (Figure 7), so the ideal placement of data needs to be changed adaptively as well.
Volley takes as input the request logs for data in the system, analyzes them, and outputs the results on where to best place the data. Volley is not concerned with the actual migration, that should be handled by other applications.

Volley is an iterative algorithm. In phase 1, Volley computes an initial placement based on client IPs. In phase 2, Volley iteratively moves data to reduce latency. Finally in phase 3, Volley iteratively map data items into datacenters taking into account the datacenter capacities. In order

The live mesh and live messenger traces are used to evaluate Volley via emulations. In these emulations, 12 DCs are assumed. Volley is compared with the commonIP protocol (which places data as close as possible to the IP address that most commonly accesses it), hashing protocol (which randomly places data to one the 12 DCs to optimize for load-balancing), and oneDataCenter (which places all the data in one datacenter).

Among these protocolsVolley is the one with lowest latency, and hash is unsurprisingly the one with the highest latency. Hash protocol leads to a lot of inter-datacenter traffic as it frequently places interdependent data in different datacenters. Volley has the least inter-datacenter traffic, of course excluding the oneDC protocol which obviously has no inter-datacenter traffic.

The evaluations show that Volley converged after a small number of iterations, and reduced skew by 2x, inter-datacenter traffic by 1.8x, and latency by 30%. The paper does not make any optimality claims for Volley, as it just uses heuristics. The real contribution of Volley is reported as automating the data placement process.

Wednesday, February 23, 2011

PNUTS: Yahoo!'s hosted data serving platform

Given that web users are scattered across the globe, it is critical to have data replicas on multiple continents for low-latency access and high-availability. Supporting general transactions is typically unnecessary, since web applications tend to manipulate only one record at a time. Moreover, supporting general transactions (with serializability) over a globally-replicated distributed system is expensive and impractical. On the other hand, it is still useful to have a stronger consistency guarantee than the very weak "eventual consistency". It is often acceptable to read slightly stale data but having replicas use the same order of updates is important because out of order updates may expose undesirable (outside the invariant, unsafe) state at some replicas.

The most important contribution of the PNUTS paper is an asynchronous replication architecture that achieves serializability of writes to each record. This feat is achieved by using record-level masters.

Record-level masters
One way to serialize writes is to forward all the writes to a master replica, and the master replica serializes the writes according to the order it receives them, and the other replicas adopt this ordering determined by the master. The problems with this approach are that 1) writes are subject to WAN latency till the master, and 2) if the master goes down, writes become unavailable until another master takes over.

The truly decentralized approach, which provides low-latency and high-availability for writes, is to let any replica process any write request locally, but that of course means concurrent conflicting updates.

PNUTS proposes a middle ground. Instead of a single master to serialize all the write requests, PNUTS designates per-record masters to serialize the write requests for that record. So, at any time, each replica is serving as a master for a portion of the records in the system, and for each record there is a designated master. The designated master can be adaptively changed to suit the workload, the replica receiving the majority of recent write requests for a particular record becomes the master for that record. So, PNUTS achieve low-latency writes. PNUTS also achieve high-availability because in the event of a master failure, the records which are assigned to other masters are unaffected by this failure.

More clarifications about PNUTS architecture are as follows. PNUTS uses full replication; all the records are fully replicated at all replicas. Read requests can be satisfied by a local replica, while write requests must be satisfied by the master record. Banking on write locality, if a record is written by the same site three consecutive times, the new master for that record is designated to be the replica at that site. The consistency (serializability and timeline-consistent updates) guarantees provided by PNUTS is at the level of single record transactions. PNUTS makes no guarantees as to consistency for multi-record transactions.

Yahoo! Message Broker (YMB)
YMB is a publish subscribe system that is very crucial for the operation of PNUTS. (PNUTS + YMB is known as the Sherpa data services platform.) YMB takes care of asynchronous replication of updates in PNUTS. For asynchronous replication, YMB guarantees "timeline consistency" where replicas may not be consistent with each other but updates are guaranteed to be applied in the same order at all replicas. (Some replicas may be slow and lag behind other replicas in updates.)

Surprisingly, YMB also handles master handovers and failure/recovery in PNUTS. So, the presentation of PNUTS is very intertwined with the YMB, and unfortunately details of YMB (a propriatery technology of Yahoo!) operation are missing in the paper. YMB is described as this infallible service, but the question is what happens when YMB fails. Neither YMB and PNUTS are open-source as of 2011.

I was worried that it would be hard to handle failure/recovery of a replica. Then I realized that replicas are per record-level. Records are always modified in an atomic manner; they are atomic objects. So we can use the last write wins rule (by copying the latest version of the record), and recovery is done. We don't have multiple objects tainted, we don't have to worry about that kind of a messy inconsistency.

The presentation of the paper is unfortunately not the best organization of the material possible. I would have liked it better if the paper covered the core contribution, record-level masters concept, in more detail, including detailed explanation of master handovers and replica failure/recovery operations. Instead in the paper there was a separate thread (and a lot of coverage) on table-scans, scatter-gather operations, bulk loading. I guess some users need such operations, but their motivations were not clear to me. And in any case these are provided just as best-effort operations.

Additional links

Wednesday, February 16, 2011

Chain replication for supporting high throughput and availability

Chain replication (2004) is a replication protocol to support large-scale storage services (mostly key-value stores) for achieving high throughput and availability while providing strong consistency guarantees. Chain replication is essentially a more efficient retake of primary-backup replication. The below figure explains it all. I really love the way the protocol is presented as a stepwise-refinement starting from the high-level specificaiton of a storage system. This exposition provides a simple way of justifying the correctness of the protocol (especially for fault-tolerance actions for handling failures of servers), and also is convenient for pointing out to other alternative implementations.

Chain replication protocol There are two types of requests, a query request (read) and an update request (write). The reply for every request is generated and sent by the tail. Each query request is directed to the tail of the chain and processed there atomically using the replica of objID stored at the tail. Each update request is directed to the head of the chain. The request is processed there atomically using replica of objID at the head, then state changes are forwarded along a reliable FIFO link to the next element of the chain (where it is handled and forwarded), and so on until the reply to the request is finally sent to the client by the tail.

Coping with server failuresThe fault model of the servers are fail-stop (crash). With an object replicated on t servers, as many as t−1 of the servers can fail without compromising the object's availability. When a server is detected as failed, the chain is reconfigured to eliminate the failed server. For this purpose a "master" process is used.

Master is actually a replicated process at m nodes; Paxos is used for coordinating the m replicas so they behave in aggregate like a single process that does not fail. (Thus, if we think holistically, chain-replication is not tolerant to t-1 failures as claimed; since chain-replication employs a master implemented in Paxos, the system is tolerant to only m/2 failures. I wish this was explicitly mentioned in the paper to prevent any confusion.)

When the master detects that the head is failed, the master removes head from the chain and makes the successor of the old head as the new head of the chain. Master also informs client about the new head. This action is safe, as it only causes some transient drops of client requests, but that was allowed in the high-level specification of the system. When the master detects that the tail is failed, it removes the tail and makes the predecessor of the old tail as the new tail fo the chain. This actually does not drop any requests at all.

When the master detects the failure of a middle server, it deletes the middle server "S" from the chain and links the predecessor "S-" and successor "S+" of the failed server to re-link the chain. Below figure provides an overview of the process. Message 1 informs S+ of its new role; message 2 acknowledges and informs the master with the sequence number sn of the last update request S+ has received; message 3 informs S- of its new role and of sn so S- can compute the suffix of Sent_S- to send to S+; and message 4 carries that suffix. (To this end, each server i maintains a list Sent_i of update requests that i has forwarded to some successor but that might not have been processed by the tail. The rules for adding and deleting elements on this list are straightforward: Whenever server i forwards an update request r to its successor, server i also appends r to Sent_i. The tail sends an acknowledgement ack(r) to its predecessor when it completes the processing of update request r. And upon receipt ack(r), a server i deletes r from Sent_i and forwards ack(r) to its predecessor.)

As failed servers are removed, the chain gets shorter, and can tolerate fewer failures. So it is important to add a new server to the chain when it gets short. The simplest way to do this is by adding the new server as the tail. (For the tail, the value of Sent_tail is always empty list, so it is easy to initialize the new server to be the tail.)

Comparison to primary-backup replicationWith chain replication, the primary's role in sequencing requests is shared by two replicas. The head sequences update requests; the tail extends that sequence by interleaving query requests. This sharing of responsibility not only partitions the sequencing task but also enables lower-latency and lower-overhead processing for query requests, because only a single server (the tail) is involved in processing a query and that processing is never delayed by activity elsewhere in the chain. Compare that to the primary backup approach, where the primary, before responding to a query, must await acknowledgements from backups for prior updates.

Chain replication is at a disadvantage for reply latency to update requests since it disseminates updates serially, compared to primary-backup which disseminates updates parallelly. This does not have much effect on throughput however. (It is also possible to modify chain replication for parallel wiring of middle-servers, but this will bring some complexity to the fault-handling actions.)

Concluding remarksThe paper also provides evaluation results from an implementation of the chain replication protocol. Chain replication is simple and is a more efficient retake of primary backup replication. Since this is a synchronous replication protocol it falls into CP edge of the CAP theorem triangle. Chain replication has been used in Fawn and also in some other key-value storage systems.

Additional links:
This is genius. With a very simple observation and fix, Terrace and Freedman were able to optimize chain replication further for read-mostly workloads. The below two figures from their paper in 2009 are enough to explain the modification.

Related links:
High-availability distributed logging with BookKeeper

Pond: The Oceanstore prototype

This paper details the implementation of Pond, a distributed p2p filesystem for object-storage. Pond is a prototype implementation of Oceanstore, which was described here. Oceanstore is especially notable by its elegant solution to localization of objects in a P2P environment (employing Tapestry and using attenuated Bloom filters as hints for efficient search).

The system architecture is two tiered. The first tier consists of well connected hosts which serialize updates (running Byzantine agreement) and archive results. The second tier provides extra storage/cache resources to the system.

Replication and caching
Pond uses "erasure codes" idea for efficient and robust storage. Given storage is virtually unlimited today, they could as well have used full-replication and saved a lot of headache to themselves. The erasure coding leads to problems when reading. Several machines need to be contacted to reconstruct the block. What Pond does is, after a block is constructed it is kept at cache in the second tier replicas. So, the cost of consequent reads becomes low, and they compensate the initially high cost read.

Storing updates
Updates follow the path in Figure 2. Updates are both archived and cached simultaneously. The primary replica runs a Byzantine agreement, and the paper includes several implementation details related for that.

Implementation and evaluation
Pond is implemented on SEDA. This is a bit curious as Pond does not seem to have a long workflow. Was SEDA necessary? Could it be avoided without loss of much performance? How critical was SEDA for the implementation.

The paper presents comprehensive evaluation results. While Pond does OK for reads (4 times faster than NFS) due to caching, its writes were slow (7 times slower than NFS) understandably due to erasure coding, threshold signed Byzantine agreement, and versioning costs.

Thursday, February 10, 2011

SEDA: An architecture for well-conditioned scalable internet services

A service is well-conditioned if it behaves like a simple pipeline: as the offered load increases, the delivered throughput increases proportionally until the pipeline is full and the throughput saturates; additional load should not degrade throughput. Moreover, after saturation, the response time should not increase faster than linearly with the number of clients. Hence, the key property of a well-conditioned service is graceful degradation: as offered load exceeds capacity, the service maintains high throughput with a linear response-time penalty. Unfortunately, that is not the typical web experience; as load increases beyond saturation point throughput decreases and response time increases dramatically, creating the impression that the service has ground to a halt.

SEDA enables services to be well-conditioned to load, preventing resources from being overcommitted when demand exceeds service capacity. In other words, SEDA makes it easy to do load-shedding and to sacrifice the Q in the DQ principle when saturation is reached.

SEDA draws from (1) thread-based concurrency approach for ease of programming and (2) event-based approach for extensive concurrency. We briefly discuss the two before presenting the SEDA architecture.

Thread-based approach
Threads are too blackbox. Transparent resource virtualization prevents applications from making informed decisions, which are vital to managing excessive load. Using the thread-based approach, the application is unable to inspect the internal request stram to implement a load-shedding policy that identifies and drops expensive requests. As a result, throughput increases until the number of threads grows large, after which throughput degrades substantially. Response time becomes unbounded as task queue lenghts increase. Thread approach does not have a mechanism to drop Q easily and beyond saturation the entire system comes to a crashing halt.
Event-driven approach
Event-driven systems are robust to load, with little degradation in throughput as offered load increases beyond saturation. Event-driven approach performs inherent admission control. Excess tasks are absorbed in the server's event queue, the throughput remains constant across a huge range in load, with the latency of each task increasing linearly.
The event-driven approach implements the processing of each task as a finite state machine, where transitions between states in the FSM are triggered by events. This way, the server maintains its own continuation state for each task rather than relying upon a thread context. The key problem is the complexity of the event scheduler, which must control the execution of each finite state machine. The choice of an event scheduling algorithm is often tailored to the specific application, and introduction of new functionality may require the algorithm to be redesigned. This also makes modularity difficult to achieve.

SEDA architecture
In SEDA, applications/services are constructed as a network of stages, each with an associated incoming event queue. Each stage consists of a thread pool and an application-supplied event handler. The stage's operation is managed by the controller, which adjusts resource allocations and scheduling dynamically. Each stage can be individually conditioned to load by thresholding or filtering its event queue.
In addition, making event queues explicit allows applications to make informed scheduling and resource-management decisions, such as reordering, filtering, or aggregation of requests. SEDA makes use of dynamic resource throttling at the stage level (including thread pool sizing and dynamic event batching/scheduling) to control the resource allocation and scheduling of application components, allowing the system to adapt to overload conditions.

The introduction of a queue between stages decouples their execution by introducing an explicit control boundary. Introducing a queue between two modules provides isolation, modularity, and independent load management (the tradeoff is increased latency compared to a function call). The decomposition of application into stages and explicit event delivery mechanism facilitates inspection, debugging, and performance analysis of applications.

SEDA is evaluated through two applications: a high-performance HTTP server and a packet router for the Gnutella peer-to-peer file-sharing network. The performance and scalability results for these applications show that SEDA achieves flexibility of control and robustness over huge variations in load, and outperforms other service designs.Discussion
We have seen that SEDA makes it easy to do load-shedding and to sacrifice the Q in the DQ principle when saturation is reached. Can SEDA also be made to reduce D as well? I guess this is doable. But, in contrast to dropping Q by load-shedding, the dropping of D is inherently a whitebox operation and involves dealing with the application logic.

Does it make sense to partition/distribute stages across more than one server? SEDA makes this easy to do so by decoupling stages with event queues. I guess this could be useful for horizantally scaling-out a large sophisticated service. (What is an example of that?) The first stage of admitting requests can run on one machine and pass its output to the next stages running on other machines through the event queues.

With very large number of servers at our disposal in a large-scale datacenter do we still have a need/use for SEDA? Large number of servers is useful only to the face of embarassingly parallel applications, like map-reduce framework forges applications into. SEDA can still be useful for horizontally scaling-out large sophisticated applications as we mentioned above.

Additional links

A comparison of filesystem workloads

Due to the increasing gap between processor speed and disk latency, filesystem performance is largely determined by its disk behavior. Filesystems can provide good performance by optimizing for common usage patterns. In order to learn and optimize for the common usage patterns for filesystems, this 2000 paper describes the collection and analysis of filesystem traces from 4 different environments. The first 3 environments run HP-UX (a UNIX variant) and are INS: Instructrional, RES: Research, WEB: Webserver. The last group, NT, is a set of personal computers running Windows NT.

Here are the results from their investigation.

Filesystem calls
Notable in all workloads is the high number of requests to read file attributes. In particular, calls to stat (including fstat) comprise 42% of all file-system- related calls in INS, 71% for RES, 10% for WEB, and 26% for NT. There is also a lot of locality to filesystem calls. The percentage of stat calls that follow another stat system call to a file from the same directory to be 98% for INS and RES, 67% for WEB, and 97% for NT. The percentage of stat calls that are followed within five minutes by an open to the same file is 23% for INS, 3% for RES, 38% for WEB, and only 0.7% for NT.

Block lifetime
Block lifetime is the time between a block's creation and its deletion. Knowing the average block lifetime for a workload is important in determining appropriate write delay times and in deciding how long to wait before reorganizing data on disk.Unlike the other workloads, NT shows a bimodal distribution pattern: nearly all blocks either die within a second or live longer than a day. Although only 30% of NT block writes die within a day, 86% of newly created files die within that timespan, so many of the long-lived blocks belong to large files.

Lifetime locality
Most blocks die due to overwrites. For INS, 51% of blocks that are created and killed within the trace die due to overwriting; for RES, 91% are overwritten; for WEB, 97% are overwritten; for NT, 86% are overwritten.

A closer examination of the data shows a high degree of locality in overwritten files. In general, a relatively small set of files (e.g., 2%) are repeatedly overwritten, causing many of the new writes and deletions.

Effect of write delayThe efficacy of increasing write delay depends on the average block lifetime of the workload. For nearly all workloads, a small write buffer is sufficient even for write delays of up to a day. User calls to flush data to disk have little effect on any workload.

Cache efficacy
Even relatively small caches absorb most read traffic, but there are diminishing returns to using larger caches.
Read and write traffic
File systems lay out data on disk to optimize for reads or writes depending on which type of traffic is likely to dominate. The results from the 4 environments do not support the widely-repeated claim that disk traffic is dominated by writes when large caches are employed. Instead, whether reads or writes dominate disk traffic varies significantly across workloads and environments. Based on these results, any general file system design must take into consideration the performance impact of both disk reads and disk writes.

File size
The study found that applications are accessing larger files than previously, and the maximum file size has increased in recent years. It might seem that increased accesses to large file sizes would lead to greater efficacy for simple read-ahead prefetching; however, the study found that larger files are more likely to be accessed randomly than they used to be, rendering such straightforward prefetching less useful.

File access patterns
The study found that for all workloads, file access patterns are bimodal in that most files tend to be mostly-read or mostly-written. Many files tend to be read mostly. A small number of files are write-mostly. This is shown by the slight rise in the graphs at the 0% read-only point. This tendency is especially strong for the files that are accessed most frequently.
Concluding remarks
The message of the paper is clear. When it comes to filesystem performance, 3 things count: locality, locality, locality. Filesystem call locality, access locality, lifetime locality, read-write bimodality.

Naturally, after reading this paper you wonder if there is a more recent study on filesystem workloads to see which trends continued after this study. This 2008 paper "Measurement and Analysis of Large-Scale Network File System Workloads" provides such a study for a networked filesystem. The summary of its main findings are as follows.
Compared to Previous Studies
1. Both of the newer workloads are more write-oriented. Read to write byte ratios have significantly decreased. 2. Read-write access patterns have increased much more compared to read-only and write-only access patterns. 3. Most bytes are transferred in longer sequential runs. These runs are an order of magnitude larger. 4. Most bytes transferred are from larger files. File sizes are up to an order of magnitude larger. 5. Files live an order of magnitude longer. Fewer than 50% are deleted within a day of creation.
New Observations
6. Files are rarely re-opened. Over 66% are re-opened once and 95% fewer than five times. 7. Files re-opens are temporally related. Over 60% of re-opens occur within a minute of the first. 8. A small fraction of clients account for a large fraction of file activity. Fewer than 1% of clients account for 50% of file requests. 9. Files are infrequently shared by more than one client. Over 76% of files are never opened by more than one client. 10. File sharing is rarely concurrent and sharing is usually read-only. Only 5% of files opened by multiple clients are concurrent and 90% of sharing is read-only. 11. Most file types do not have a common access pattern.

It would be nice exploit the read-write bimodality of files while designing a WAN filesystem. Read-only files are very amenable to caching as they don't change. Write-only files are also good to cache and asynchronously write back to the remote home-server.

Sunday, February 6, 2011

Panache: a parallel filesystem cache for global file access

This paper (FAST'10) is from IBM, and builds on their previous work on the GPFS filesystem. The problem they consider here is how to enable a site to access a remote site's servers transparently without suffering from the effects of WAN latencies. The answer is easy: use a cache filesystem/cluster at the remote site. But there are a lot of issues to resolve for this to work seamlessly.

Panache is a parallel filesystem cache to provide reliability consistency and performance of a cluster filesystem despite the physical distance. Panache supports asynchronous operations for writes for both data and metadata updates. Panache uses synchronous operations for reads: if the read misses the panache cache cluster or the validity timer of the data in the cache had expired, the operation waits till the data is read from the remote cluster filesystem.

Panache architecture
Panache is implemented as a multi-node caching layer, integrated within the GPFS, that can persistently and consistently cache data and metadata from a remote cluster. Every node in the Panache cache cluster has direct access to cached data and metadata. Thus, once data is cached, applications running on the Panache cluster achieve the same performance as if they were running directly on the remote cluster. If the data is not in the cache, Panache acts as a caching proxy to fetch the data in parallel both by using a parallel read across multiple cache cluster nodes to drive the ingest, and from multiple remote cluster nodes using pNFS.

Panache allows updates to be made to the cache cluster at local cluster performance by asynchronously pushing all updates of data and metadata to the remote cluster. All updates to the cache cause an application node to send and queue a command message on one or more gateway nodes. At a later time, the gateway node(s) will read the data in parallel from the storage system and push it to the remote cluster over pNFS.

All file system data and metadata "read" operations, e.g., lookup, open, read, readdir, getattr, are synchronous. Panache scales I/O performance by using multiple gateway nodes to read chunks of a single file in parallel from the multiple nodes over NFS/pNFS. One of the gateway nodes (based on the hash function) becomes the coordinator for a file. It, in turn, divides the requests among the other gateway nodes which can proceed to read the data in parallel. Once a node is finished with its chunk it requests the coordinator for more chunks to read. When all the requested chunks have been read the gateway node responds to the application node that the requested blocks of the object are now in cache.

The disadvantages of this solution is that it is in the kernel space and it depends on proprietary technologies. I wonder if it would be possible to implement a Panache like system in the user-space with reasonable efficiency?

Some other questions I have are as follows. For the Panache cache cluster/filesystem could we have used xFS? xFS provides distributed collaborative caching and would be suitable for this task, I think.

Panache uses a complicated distributed lock mechanism at the gateways of the remote site to sort the dependability of update operations originating at multiple gateways. Wouldn't it be easier to instead push these updates on a best-effort basis to the remote cluster/filesystem and let it sort out these dependabilities? Why was this not considered as an option?

GPFS: A shared disk file system for large computing clusters

GPFS(FAST'02 paper) is a general parallel filesystem developed by IBM. GPFS accesses disks parallelly to improve disk I/O throughput for both file data and file metadata. GPFS is entirely distributed, with no central manager, so it is fault-tolerant, and there is no performance bottleneck. GPFS was used in 6 of top 10 supercomputers when it first came out. It got less popular as opensource free filesystems such as linux/lustre gained popularity.

GPFS has a shared disk architecture. Figure shows the GPFS architecture with three basic components: clients, switching fabric, disks.

Managing parallelism and consistency
There are two approaches. The first is distributed locking: consult all other nodes, and the second is centralized management: consult a dedicated node. GPFS solution is a hybrid. There are local lock managers at every node, and there is a global lock manager to manage them via handling out tokens/locks.

GPFS uses byte-range locking/tokens to synchronize reads and writes to file data. This allows parallel applications to write concurrently to different parts of the same file. (For efficiency reasons, if the client is the only writer, it can get a lock for the entire file. only if other clients are also interested, then we start to use finer granularity locks.)

Synchronizing access to metadata
There is no central metadata server. Each node stores metadata for files it is responsible, and is called the metadata coordinator (metanode) for those files. Metanode is elected dynamically for a file (the paper does not give any details on this), and only the metanode reads and writes the inode for the file from/to disk. The metanode synchronizes access to the file metadata by providing shared-write lock to other nodes.

When a node fails, GPFS restores metadata being updated by failed node, releases any tokens held by failed node, and appoints replacements for this node for metanode allocation manager roles. Since GPFS stores logs on shared disks, any node can perform log recovery on behalf of the failed node.

On link failures, GPFS fences nodes that are no longer members of the minority groups from accessing the shared disks. GPFS also recovers from disk failures by supporting replication.

The paper does not discuss how the metanode (metadata coordinator for a file) is discovered by nodes that want to access that file. How is this done? Wouldn't it be easier to make this centralized? There is a centralized token manager, so why not a centralized metadata manager? It makes things easer, and you can always replicate it (via Paxos) to fight faults.

Here is a link to two reviews of the paper from CSE726 students.

Wednesday, February 2, 2011

Serverless Network Filesystems (xFS)

This is a 1996 paper presenting a serverless network filesystem, xFS. xFS is not to be confused with the XFS journaling file system created by Silicon Graphics.

While traditional network filesystems rely on a central server machine, a serverless system utilizes computers cooperating as peers to provide all filesystem services. The major motivation for a serverless p2p filesystem is the opportunity provided by fast switched Ethernet/LANs to use LAN as an I/O backplane, harnessing physically distributed processors, memory, and disks into a single remote striped disk system.

Basically, xFS synthesizes previous innovations on scalable cache consistency (DASH), cooperative caching, disk striping (RAID, Zebra) and combines them into a serverless filesystem package. xFS dynamically distributes control processing across the system on a per-file granularity by utilizing a serverless management scheme. xFS distributes its data storage across storage server disks by implementing a software RAID using log-based network striping similar to Zebra's.

Prior technologies
RAID partitions a stripe of data into N-1 data blocks and a parity block (exclusive or of the corresponding bits of the data blocks). It can reconstruct the contents of a failed disk by taking the exclusive-OR of the remaining data blocks and the parity block.

Zebra combines LFS and RAID so that both work well in a distributed environment. Zebra has a single file manager and xFS improves that to multiple p2p file/metadata managers. Improving on Zebra, xFS also dynamically clusters disks into stripe groups to allow the system to scale to large numbers of storage servers.
xFS architecture
All data, metadata, and control can be located anywhere in the system and can be dynamically migrated during operation. xFS splits management among several metadata managers. xFS also replaces the server cache with cooperative caching that forwards data among client caches under the control of the managers.

In xFS there are four types of entities: clients, storage servers, managers, cleaners. The other entities are straightforward, so let's just explain cleaners. In a log-structured filesystem, since new blocks are always appended, some of these new blocks invalidate blocks in old segments as they provide a newer version. These invalidated blocks need to be garbage collected as they waste space. Cleaners take care of this process. xFS prescribes a distributed cleaner to keep up with the system writes.
The key challenge for xFS is locating data and metadata. Four key maps are used for this purpose: the manager map, the imap, file directories, and the stripe group map. The manager map allows clients to determine which manager to contact for a file. Manager map is small and is globally replicated to all of the managers and the clients in the system to improve performance. The imap allows each manager locate where its files are stored in the on-disk log. File directories provide a mapping from a human readable name to a metadata locator called an index number. The stripe group map provides mapings from segment identifiers embedded in disk log addresses to the set of physical machines storing the segments.

The authors have built a prototype of the xFS and presented performance results on a 32 node cluster of SPARCStation 10's and 20's. The evaluations highlight the scalability of xFS to increasing number of clients.