Showing posts from March, 2011

Megastore: Providing scalable, highly available storage for interactive services

Google's Megastore is the structured data store supporting the Google Application Engine . Megastore handles more than 3 billion write and 20 billion read transactions daily and stores a petabyte of primary data across many global datacenters. Megastore tries to provide the convenience of using traditional RDBMS with the scalability of NOSQL: It is a scalable transactional indexed record manager (built on top of BigTable), providing full ACID semantics within partitions but lower consistency guarantees across partitions (aka, entity groups in Figure 1). To achieve these strict consistency requirements, Megastore employs a Paxos-based algorithm for synchronous replication across geographically distributed datacenters. I have some problems with Megastore, but I save them to the end of the review to explain Megastore first. Paxos Megastore uses Paxos , a proven, optimal, fault-tolerant consensus algorithm with no requirement for a distinguished master. (Paxos is hard to cover i

Flexible, Wide-Area Storage for Distributed Systems with WheelFS

One of my students, Serafettin Tasci , wrote a good review of this paper, so I will save time by using his review below, instead of writing a review myself. In this paper the authors propose a storage system for wide-area distributed systems called WheelFS. The main contribution of WheelFS is its ability of adaptation to different types of applications with different consistency, replica placement or failure handling requirements. This ability is obtained via semantic cues that can be easily expressed in path names. For example to force the primary site of a folder john to be X, we can specify the cue “home/users/.Site=X/john”. This representation enables preserving of POSIX semantics and minor change in application software to use the cues. In WheelFS, there are 4 groups of semantic cues. Placement cues are used to arrange the location of primaries and replicas of a file or folder. Durability cues specify the number and usage of replicas. Consistency cues maintain a tradeoff between

Ceph: A Scalable, High-Performance Distributed File System

Traditional client/server filesystems (NFS, AFS) have suffered from scalability problems due to their inherent centralization. In order to improve performance, modern filesystems have taken more decentralized approaches. These systems replaced dumb disks with intelligent object storage devices (OSDs)--which include CPU, NIC and cache-- and delegated low-level block allocation decisions to these OSDs. In these systems, clients typically interact with a metadata server (MDS) to perform metadata operations (open, rename), while communicating directly with OSDs to perform file I/O (reads and writes). This separation of roles improve overall scalability significantly. Yet, these systems still face scalability limitations due to little or no distribution of the metadata workload. Ceph is a distributed file system designed to address this issue. Ceph decouples data and metadata operations completely by eliminating file allocation tables and replacing them with generating functions (called CR

Sinfonia: A New Paradigm for Building Scalable Distributed Systems

Sinfonia is an in-memory scalable service/infrastructure that aims to simplify the task of building scalable distributed systems. Sinfonia provides a lightweight "minitransaction" primitive that enables applications to atomically access and conditionally modify data at its multiple memory nodes. As the data model, Sinfonia provides a raw linear address space which is accessed directly by client libraries. Minitransactions In traditional transactions, a coordinator executes a transaction by asking participants to perform one or more participant-actions (such as retrieving or modifying data items), and at the end of the transaction, the coordinator decides and executes a two-phase commit. In the first phase, the coordinator asks all participants if they are ready to commit. If they all vote yes, in the second phase the coordinator tells them to commit; otherwise the coordinator tells them to abort. Sinfonia introduces the concept of minitransactions, by making the observation

PetaShare: A reliable, efficient and transparent distributed storage management system

This paper by my colleague Tevfik Kosar (to appear soon) presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which manages 450 Terabytes of disk storage and spans 7 campuses across the state of Louisiana. There are two main components in a distributed data management architecture: a data server which coordinates physical access (i.e. writing/reading data sets to/from disks) to the storage resources, and a metadata server which provides the global name space and metadata of the files. Metadata management is a challenging problem in widely distributed large-scale storage systems, and is the focus of this paper. Petashare architecture The back-end of PetaShare is based on iRODS . All system I/O calls made by an application are mapped to the relevant iRODS I/O calls. iRODS stores all the system information, as well as user-defined rules in centralized database, which is called iCAT. iCAT contains the information of the distributed

Popular posts from this blog

Learning about distributed systems: where to start?

Hints for Distributed Systems Design

Foundational distributed systems papers

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

SIGMOD panel: Future of Database System Architectures

The end of a myth: Distributed transactions can scale

There is plenty of room at the bottom

Distributed Transactions at Scale in Amazon DynamoDB

Dude, where's my Emacs?