Showing posts from February, 2011

Volley: Automated Data Placement for Geo-Distributed Cloud Services

Datacenters today are distributed across the globe, yet they need to share data with other datacenters as well as their clients. This paper from Microsoft Research presents a heuristic strategy for data placement to these geo-distributed datacenters. While there has been previous work on data placement in LANs and WSNs, Volley is the first heuristic for data placement strategies for WANs. A simple heuristic is to place each data to the datacenter closest to the client of that data. But things are not that simple, there are several additional constraints to be considered, including business constraints, WAN bandwidth costs, datacenter capacity limits, data interdependencies, user-perceived latency, etc. For example, it makes more sense to collocate data that are tightly-coupled/interdependent, such as two friends in Facebook that update each other walls. As another example, the frequency of the clients accessing the data needs to be taken in to account as well. As live mesh and live m

PNUTS: Yahoo!'s hosted data serving platform

Given that web users are scattered across the globe, it is critical to have data replicas on multiple continents for low-latency access and high-availability. Supporting general transactions is typically unnecessary, since web applications tend to manipulate only one record at a time. Moreover, supporting general transactions (with serializability) over a globally-replicated distributed system is expensive and impractical. On the other hand, it is still useful to have a stronger consistency guarantee than the very weak "eventual consistency" . It is often acceptable to read slightly stale data but having replicas use the same order of updates is important because out of order updates may expose undesirable (outside the invariant, unsafe) state at some replicas. The most important contribution of the PNUTS paper is an asynchronous replication architecture that achieves serializability of writes to each record. This feat is achieved by using record-level masters. Record-leve

Chain replication for supporting high throughput and availability

Chain replication (2004) is a replication protocol to support large-scale storage services (mostly key-value stores) for achieving high throughput and availability while providing strong consistency guarantees. Chain replication is essentially a more efficient retake of primary-backup replication. The below figure explains it all. I really love the way the protocol is presented as a stepwise-refinement starting from the high-level specificaiton of a storage system. This exposition provides a simple way of justifying the correctness of the protocol (especially for fault-tolerance actions for handling failures of servers), and also is convenient for pointing out to other alternative implementations. Chain replication protocol There are two types of requests, a query request (read) and an update request (write). The reply for every request is generated and sent by the tail. Each query request is directed to the tail of the chain and processed there atomically using the replica of obj

Pond: The Oceanstore prototype

This paper details the implementation of Pond, a distributed p2p filesystem for object-storage. Pond is a prototype implementation of Oceanstore, which was described here. Oceanstore is especially notable by its elegant solution to localization of objects in a P2P environment (employing Tapestry and using attenuated Bloom filters as hints for efficient search). Architecture The system architecture is two tiered. The first tier consists of well connected hosts which serialize updates (running Byzantine agreement) and archive results. The second tier provides extra storage/cache resources to the system. Replication and caching Pond uses "erasure codes" idea for efficient and robust storage. Given storage is virtually unlimited today, they could as well have used full-replication and saved a lot of headache to themselves. The erasure coding leads to problems when reading. Several machines need to be contacted to reconstruct the block. What Pond does is, after a block is const

SEDA: An architecture for well-conditioned scalable internet services

A service is well-conditioned if it behaves like a simple pipeline: as the offered load increases, the delivered throughput increases proportionally until the pipeline is full and the throughput saturates; additional load should not degrade throughput. Moreover, after saturation, the response time should not increase faster than linearly with the number of clients. Hence, the key property of a well-conditioned service is graceful degradation: as offered load exceeds capacity, the service maintains high throughput with a linear response-time penalty. Unfortunately, that is not the typical web experience; as load increases beyond saturation point throughput decreases and response time increases dramatically, creating the impression that the service has ground to a halt. SEDA enables services to be well-conditioned to load, preventing resources from being overcommitted when demand exceeds service capacity. In other words, SEDA makes it easy to do load-shedding and to sacrifice the Q in t

A comparison of filesystem workloads

Due to the increasing gap between processor speed and disk latency, filesystem performance is largely determined by its disk behavior. Filesystems can provide good performance by optimizing for common usage patterns. In order to learn and optimize for the common usage patterns for filesystems, this 2000 paper describes the collection and analysis of filesystem traces from 4 different environments. The first 3 environments run HP-UX (a UNIX variant) and are INS: Instructrional, RES: Research, WEB: Webserver. The last group, NT, is a set of personal computers running Windows NT. Here are the results from their investigation. Filesystem calls Notable in all workloads is the high number of requests to read file attributes. In particular, calls to stat (including fstat) comprise 42% of all file-system- related calls in INS, 71% for RES, 10% for WEB, and 26% for NT. There is also a lot of locality to filesystem calls. The percentage of stat calls that follow another stat system call to a f

Panache: a parallel filesystem cache for global file access

This paper (FAST'10) is from IBM, and builds on their previous work on the GPFS filesystem . The problem they consider here is how to enable a site to access a remote site's servers transparently without suffering from the effects of WAN latencies. The answer is easy: use a cache filesystem/cluster at the remote site. But there are a lot of issues to resolve for this to work seamlessly. Panache is a parallel filesystem cache to provide reliability consistency and performance of a cluster filesystem despite the physical distance. Panache supports asynchronous operations for writes for both data and metadata updates. Panache uses synchronous operations for reads: if the read misses the panache cache cluster or the validity timer of the data in the cache had expired, the operation waits till the data is read from the remote cluster filesystem. Panache architecture Panache is implemented as a multi-node caching layer, integrated within the GPFS, that can persistently and consiste

GPFS: A shared disk file system for large computing clusters

GPFS (FAST'02 paper) is a general parallel filesystem developed by IBM. GPFS accesses disks parallelly to improve disk I/O throughput for both file data and file metadata. GPFS is entirely distributed, with no central manager, so it is fault-tolerant, and there is no performance bottleneck. GPFS was used in 6 of top 10 supercomputers when it first came out. It got less popular as opensource free filesystems such as linux/lustre gained popularity. GPFS has a shared disk architecture. Figure shows the GPFS architecture with three basic components: clients, switching fabric, disks. Managing parallelism and consistency There are two approaches. The first is distributed locking: consult all other nodes, and the second is centralized management: consult a dedicated node. GPFS solution is a hybrid. There are local lock managers at every node, and there is a global lock manager to manage them via handling out tokens/locks. GPFS uses byte-range locking/tokens to synchronize reads and writes

Serverless Network Filesystems (xFS)

This is a 1996 paper presenting a serverless network filesystem, xFS. xFS is not to be confused with the XFS journaling file system created by Silicon Graphics. While traditional network filesystems rely on a central server machine, a serverless system utilizes computers cooperating as peers to provide all filesystem services. The major motivation for a serverless p2p filesystem is the opportunity provided by fast switched Ethernet/LANs to use LAN as an I/O backplane, harnessing physically distributed processors, memory, and disks into a single remote striped disk system. Basically, xFS synthesizes previous innovations on scalable cache consistency (DASH), cooperative caching, disk striping (RAID, Zebra) and combines them into a serverless filesystem package. xFS dynamically distributes control processing across the system on a per-file granularity by utilizing a serverless management scheme. xFS distributes its data storage across storage server disks by implementing a software RAID

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The end of a myth: Distributed transactions can scale

There is plenty of room at the bottom

Distributed Transactions at Scale in Amazon DynamoDB

Dude, where's my Emacs?