SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

This paper is by Abutalib Aghayev (Carnegie Mellon University), Sage Weil (Red Hat Inc.), Michael Kuchnik (Carnegie Mellon University), Mark Nelson (Red Hat Inc.), Gregory R. Ganger (Carnegie Mellon University), and George Amvrosiadis (Carnegie Mellon University).

Ceph started as a research project in 2004 at UCSC. At the core of Ceph is a distributed object store called RADOS. Its storage backend was originally implemented on top of an already mature filesystem, which took care of block allocation, metadata management, and crash recovery. The Ceph team made this choice because they didn't want to write a storage layer from scratch: a complete filesystem takes a lot of time (10 years) to develop, stabilize, optimize, and mature.

However, having a filesystem in the path to storage adds a lot of overhead. It makes efficient transactions hard to implement, and it introduces bottlenecks for metadata operations: a filesystem directory with millions of small files becomes a metadata bottleneck, for example. Paging etc. also creates problems. To circumvent these problems, the Ceph team tried hooking into FS internals, implementing a WAL in userspace, and using a key-value database in the NewStore backend to perform transactions. But it was hard to wrestle with the filesystem. They had been patching problems for seven years since 2010. Abutalib likens this to the stages of grief: denial, anger, bargaining, ..., and acceptance!
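To make the userspace WAL idea concrete, here is a minimal sketch in Python. It is not Ceph's code: the class name `TinyWAL`, the JSON record format, and the `crash_before_apply` flag are all assumptions made up for illustration. It shows the basic pattern (log the transaction, fsync, then apply it to ordinary files) and why this amounts to data journaling, since every payload byte is written twice.

```python
import json
import os


class TinyWAL:
    """Toy userspace write-ahead log layered over ordinary files (illustrative only)."""

    def __init__(self, directory):
        self.directory = directory
        self.log_path = os.path.join(directory, "wal.log")
        self.bytes_written = 0  # journal bytes plus in-place bytes

    def commit(self, txn, crash_before_apply=False):
        """txn maps a file name to its new contents (str)."""
        record = (json.dumps(txn) + "\n").encode()
        with open(self.log_path, "ab") as log:
            log.write(record)
            log.flush()
            os.fsync(log.fileno())  # the transaction is durable once this returns
        self.bytes_written += len(record)
        if not crash_before_apply:  # hypothetical flag that simulates a crash here
            self._apply(txn)

    def _apply(self, txn):
        # Second write of the same payload, this time to its final location.
        for name, text in txn.items():
            data = text.encode()
            with open(os.path.join(self.directory, name), "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            self.bytes_written += len(data)

    def recover(self):
        """Replay every logged transaction; whole-file overwrites are idempotent."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "rb") as log:
            for line in log:
                try:
                    txn = json.loads(line)
                except ValueError:
                    break  # torn final record: that commit never completed
                self._apply(txn)
```

After a simulated crash between logging and applying, calling `recover()` on a fresh `TinyWAL` over the same directory brings the files up to date. The `bytes_written` counter ends up at more than twice the payload size, which is the double-write cost the post refers to as data journaling.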

Finally, the Ceph team abandoned the filesystem approach and started writing their own storage backend, BlueStore, which doesn't sit on top of a general-purpose filesystem. They were able to finish and mature the storage layer in just two years! This is because a small, custom backend matures faster than a POSIX filesystem.

The new storage layer, BlueStore, achieves much higher performance than earlier versions. By avoiding data journaling, BlueStore achieves higher throughput than FileStore/XFS.

When using a filesystem, the write-back of dirty data and metadata interferes with WAL writes and causes high tail latency. In contrast, by controlling writes and using a write-through policy, BlueStore ensures that no background writes interfere with foreground writes. This way BlueStore avoids high tail latency for writes.

Finally, having full control of the I/O stack accelerates new hardware adoption. For example, while filesystems have a hard time adapting to shingled magnetic recording (SMR) drives, the authors were able to add metadata storage support for them to BlueStore, with data storage support in the works.

To sum up, the lesson learned is that for distributed storage it was easier and better to implement a custom backend than to shoehorn a filesystem into this role.

Here is the architecture diagram of the BlueStore storage backend. All metadata is maintained in RocksDB, which layers on top of BlueFS, a minimal userspace filesystem.
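The split between data and metadata can be sketched in a few lines of Python. This is a toy model, not BlueStore's implementation: `ToyObjectStore`, the `bytearray` standing in for a raw device, the plain `dict` standing in for the key-value metadata store, and the first-fit free list are all assumptions for illustration. The point is that object data is written once to freshly allocated space, and a single metadata update then makes it visible, so no data journal is needed.

```python
class ToyObjectStore:
    """Toy model: data on a 'raw device', metadata in a key-value map."""

    def __init__(self, device_size, block_size=4096):
        self.device = bytearray(device_size)  # stands in for a raw block device
        self.block_size = block_size
        self.free_blocks = list(range(device_size // block_size))
        self.kv = {}  # stands in for the key-value metadata store

    def write(self, name, data):
        bs = self.block_size
        nblocks = max(1, -(-len(data) // bs))  # ceiling division
        if nblocks > len(self.free_blocks):
            raise OSError("no space left on toy device")
        blocks = [self.free_blocks.pop(0) for _ in range(nblocks)]

        # Step 1: write the data once, into newly allocated blocks.
        for i, blk in enumerate(blocks):
            chunk = data[i * bs:(i + 1) * bs]
            off = blk * bs
            self.device[off:off + len(chunk)] = chunk

        # Step 2: one metadata update switches the object to the new blocks.
        old = self.kv.get(name)
        self.kv[name] = {"blocks": blocks, "size": len(data)}

        # Step 3: only now can the previous version's blocks be reused.
        if old is not None:
            self.free_blocks.extend(old["blocks"])

    def read(self, name):
        bs = self.block_size
        meta = self.kv[name]
        out = bytearray()
        for blk in meta["blocks"]:
            out += self.device[blk * bs:(blk + 1) * bs]
        return bytes(out[:meta["size"]])
```

Because an overwrite never touches the old blocks until the metadata update has happened, a crash before step 2 leaves the previous version intact. In a real system the metadata update would be a durable transaction in the key-value store rather than a dictionary assignment.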

Abutalib, the first author on the paper, did an excellent job presenting the paper. He is a final-year PhD student with a lot of experience and expertise in storage systems. He is on the job market.


Anonymous said…
Sorry to tell you guys, this article is very wrong in terms of origin. A company called DreamHost created Ceph, later sold to Red Hat.

Anonymous said…
Got any proof at all to that?
Anonymous said…
Don't Listen to idiots. The original ceph paper from 2006
Anonymous said…
CEPH started as a (graduate ?) project by the founder of DreamHost, who apparently also started Dreamhost in his undergraduate days at college.
Anonymous said…
> Ceph is software created by DreamHost founder Sage Weil and has been under development inside DreamHost for several years. Ceph has been open source since its inception, and in early 2012, a new company called Inktank was spun out of DreamHost to support and continue development of the technology. Inktank was then acquired by Red Hat in April 2014.
Anonymous said…
Redhat bought Inktank did they not? Ceph was from DreamHost founder Sage Weil; they then spun out that project to Inktank.
Anonymous said…
my views on this work:
a) Why is discussing the design of BlueStore a research contribution? This is an open-source project.
b) There is a mention that adopting new hardware becomes faster because of BlueStore; yeah sure, because it is in user space and, more importantly, it uses a copy-on-write approach. Hence it works on zoned devices. Also, BlueStore is simple, hence adding support is easier. It's basically a filesystem for improving the performance of transactions and distributed actions.
c) f2fs already supports smaller SMR drives, since f2fs is a log-structured filesystem. However, f2fs cannot support filesystems bigger than 16TB, whereas SMR drives can be bigger than that.
d) The performance of journaling has been measured before. Transactions in filesystems are expensive, and pretending to have them through a WAL is expensive too. This is not new information.
