Ceph: A Scalable, High-Performance Distributed File System

Traditional client/server filesystems (NFS, AFS) have suffered from scalability problems due to their inherent centralization. In order to improve performance, modern filesystems have taken more decentralized approaches. These systems replaced dumb disks with intelligent object storage devices (OSDs)--which include CPU, NIC and cache-- and delegated low-level block allocation decisions to these OSDs. In these systems, clients typically interact with a metadata server (MDS) to perform metadata operations (open, rename), while communicating directly with OSDs to perform file I/O (reads and writes). This separation of roles improve overall scalability significantly. Yet, these systems still face scalability limitations due to little or no distribution of the metadata workload.

Ceph is a distributed file system designed to address this issue. Ceph decouples data and metadata operations completely by eliminating file allocation tables and replacing them with generating functions (called CRUSH). This allows Ceph to leverage the intelligence present in OSDs and delegate to OSDs not only data access but also update serialization, replication, failure detection, and recovery tasks as well. The second contribution of Ceph is to employ an adaptive distributed metadata cluster architecture to vastly improve the scalability of metadata access.
Ceph has three main components: 1) the client, each instance of which exposes a near-POSIX file system interface to a host or process; 2) a cluster of OSDs, which collectively stores all data and metadata; and 3) a metadata server cluster, which manages the namespace (file names and directories), consistency, and coherence. In the rest of this review, we describe these three components in more detail.

The client code runs entirely in user space and can be accessed either by linking to it directly or as a mounted file system via FUSE (a user-space file system interface).

File I/O
An MDS traverses the filesystem hierarchy to translate the file name into the file inode. A file consists of several objects, and Ceph generalizes a range of striping strategies to map file data onto a sequence of objects (see my xFS review for an explanation of striping). To avoid any need for file allocation metadata, object names simply combine the file inode number and the stripe number. Object replicas are then assigned to OSDs using CRUSH, a globally known mapping function (we will discuss this in the next section on OSD clusters).

Client synchronization
POSIX semantics require that reads reflect any data previously written, and that writes are atomic. When a file is opened by multiple clients with either multiple writers or a mix of readers and writers, the MDS will revoke any previously issued read caching and write buffering capabilities, forcing client I/O for that file to be synchronous. That is, each application read or write operation will block until it is acknowledged by the OSD, effectively placing the burden of update serialization and synchronization with the OSD storing each object. Since synchronous I/O is a performance killer, Ceph provides a more relaxed option that sacrifices consistency guarantees. With O_LAZY flag, performance-conscious applications which manage their own consistency (e.g., by writing to different parts of the same file, a common pattern in HPC workloads) are then allowed to buffer writes or cache reads.

Ceph delegates the responsibility for data migration, replication, failure detection, and failure recovery to the cluster of OSDs that store the data, while at a high level, OSDs collectively provide a single logical object store to clients and metadata servers. To this end, Ceph introduces the Reliable Autonomic Distributed Object Store (RADOS) system, which achieves linear scaling to tens or hundreds of thousands of OSDs. Each Ceph OSD, in this system, manages its local object storage with EBOFS, an Extent and B-tree based Object File System. We describe the features of RADOS next.

Data distribution with CRUSH
In Ceph, file data is striped onto predictably named objects, while a special-purpose data distribution function called CRUSH assigns objects to storage devices. This allows any party to calculate (rather than look up) the name and location of objects comprising a file's contents, eliminating the need to maintain and distribute object lists. CRUSH works as follows. Ceph first maps objects into placement groups (PGs). Placement groups are then assigned to OSDs using CRUSH (Controlled Replication Under Scalable Hashing), a pseudo-random data distribution function that efficiently maps each PG to an ordered list of OSDs upon which to store object replicas. To locate any object, CRUSH requires only the placement group and an OSD cluster map: a compact, hierarchical description of the devices comprising the storage cluster. As such any party (client, OSD, or MDS) can independently calculate the location of any object without any exchange of distribution-related metadata.

CRUSH resembles consistent hashing. While consistent hashing would use a flat server list to hash onto, CRUSH utilizes a server node hierarchy (shelves, racks, rows) instead and enables the user to specify policies such as "Put replicas onto different shelves than the primary".

Data is replicated in terms of PGs, each of which is mapped to an ordered list of n OSDs (for n-way replication). Clients send all writes to the first non-failed OSD in an object's PG (the primary), which assigns a new version number for the object and PG and forwards the write to any additional replica OSDs. After each replica has applied the update and responded to the primary, the primary applies the update locally and the write is acknowledged to the client. Reads are directed at the primary. This approach also resembles the replication strategy of GFS.

The MDS cluster is diskless and MDSs just serve as an index to the OSD cluster for facilitating read and write. All metadata, as well as data, are stored at the OSD cluster. When there is an update to an MDS, such as a new file creation, MDS stores this update to the metadata at the OSD cluster. File and directory metadata in Ceph is very small, consisting almost entirely of directory entries (file names) and inodes (80 bytes). Unlike conventional file systems, no file allocation metadata is necessary--object names are constructed using the inode number, and distributed to OSDs using CRUSH. This simplifies the metadata workload and allows MDS to efficiently manage a very large working set of files, independent of file sizes.

Typically there would be around 5 MDSs in a 400 node OSD deployment. This looks like an overkill for just providing an indexing service to the OSD cluster, but actually is required for achieving very high-scalability. Effective metadata management is critical to overall system performance because file system metadata operations make up as much as half of typical file system workloads. Ceph also utilizes a novel adaptive metadata cluster architecture based on Dynamic Subtree Partitioning that adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among the available MDSs in the MDS cluster, as illustrated in Figure 2. Every MDS response provides the client with updated information about the authority and any replication of the relevant inode and its ancestors, allowing clients to learn the metadata partition for the parts of the file system with which they interact.

Additional links
Ceph is licensed under the LGPL and is available at http://ceph.newdream.net/.
Checking out the competition

Exercise questions
1) How do you compare Ceph with GFS? XFS? GPFS?
2) It seems like the fault-tolerance discussion in Ceph assumes that OSDs are not network-partitioned. What can go wrong if this assumption is not satisfied?


Abbas Zaidi said…
Thank you Murat for a comprehensive and well-explained post.
Abbas Zaidi said…
I wanted to ask you, if you don't mind the question:
Can you see any shortcoming in this OSD file-system that would be a show-stopper for large-scale enterprise adoption? Conversely, would you recommend any avenue(s) for the team to look into, in order to address upcoming needs that they aren't already? Thanks again Murat, this was a very helpful post. ~ AZ

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters


Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)