SOSP19 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution
This paper is by Abutalib Aghayev (Carnegie Mellon University), Sage Weil (Red Hat Inc.), Michael Kuchnik (Carnegie Mellon University), Mark Nelson (Red Hat Inc.), Gregory R. Ganger (Carnegie Mellon University), George Amvrosiadis (Carnegie Mellon University)
Ceph started as a research project in 2004 at UCSC. At the core of Ceph is a distributed object store called RADOS. The storage backend was implemented over an already mature filesystem, which helps with block allocation, metadata management, and crash recovery. The Ceph team built their storage backend on an existing filesystem because they didn't want to write a storage layer from scratch: a complete filesystem takes a long time (on the order of 10 years) to develop, stabilize, optimize, and mature.
However, having a filesystem in the storage path adds a lot of overhead. It creates problems for implementing efficient transactions, and it introduces bottlenecks for metadata operations; a filesystem directory with millions of small files becomes a metadata bottleneck, for example. Paging and the kernel's write-back behavior also create problems. To circumvent these issues, the Ceph team tried hooking into the filesystem internals by implementing a write-ahead log (WAL) in userspace and using the NewStore database to perform transactions. But it was hard to wrestle with the filesystem; they had been patching problems for seven years, since 2010. Abutalib likens this to the stages of grief: denial, anger, bargaining, ..., and acceptance!
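To make the transaction problem concrete, here is a minimal sketch (in Python, not Ceph's actual FileStore code) of a userspace WAL layered over a POSIX filesystem; the file names and record format are made up for illustration. Note that the payload hits the disk twice, and each step needs its own fsync because POSIX offers no multi-file atomicity.

```python
# Minimal sketch of a userspace write-ahead log over a POSIX filesystem.
# Not Ceph code: WAL_PATH, DATA_DIR, and the record format are made up.
import os, json, base64

WAL_PATH = "wal.log"          # hypothetical log file
DATA_DIR = "objects"          # hypothetical object directory

def txn_write(name: str, payload: bytes) -> None:
    os.makedirs(DATA_DIR, exist_ok=True)

    # 1) Append the intent (including the payload) to the log, then fsync it.
    record = json.dumps({"op": "write", "name": name,
                         "data": base64.b64encode(payload).decode()}) + "\n"
    wal = os.open(WAL_PATH, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(wal, record.encode())
        os.fsync(wal)                      # first durable copy of the data
    finally:
        os.close(wal)

    # 2) Apply the write in place; fsync the file and its directory,
    #    because POSIX gives no atomicity across files.
    path = os.path.join(DATA_DIR, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)                       # second durable copy of the data
    finally:
        os.close(fd)
    dirfd = os.open(DATA_DIR, os.O_RDONLY)
    try:
        os.fsync(dirfd)                    # persist the new directory entry
    finally:
        os.close(dirfd)

txn_write("obj1", b"hello")
```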
Finally, the Ceph team abandoned the filesystem approach and started writing their own storage backend, BlueStore, which doesn't use a filesystem. They were able to finish and mature the storage layer in just two years! This is because a small, custom backend matures faster than a general-purpose POSIX filesystem.
The new storage layer, BlueStore, achieves much higher performance than the earlier backends. By avoiding data journaling, BlueStore achieves higher throughput than FileStore/XFS.
When using a filesystem, the write-back of dirty metadata and data interferes with WAL writes and causes high tail latency. In contrast, by controlling its own writes and using a write-through policy, BlueStore ensures that no background writes interfere with foreground writes. This way BlueStore avoids high tail latency for writes.
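Here is a minimal sketch of that write-through idea, not BlueStore's actual I/O path: a preallocated file stands in for the raw block device, and O_DSYNC (assuming Linux) makes each write durable immediately. The data is written once, to its final location, and there are no dirty pages for the kernel to flush behind the application's back.

```python
# Minimal sketch of write-through I/O to a raw device, not BlueStore code.
# A preallocated file stands in for the block device; offsets are made up.
import os

DEV = "fake_block_device.img"             # hypothetical stand-in for a raw device
BLOCK = 4096

# Preallocate the "device" once.
with open(DEV, "wb") as f:
    f.truncate(64 * 1024 * 1024)          # 64 MiB

fd = os.open(DEV, os.O_WRONLY | os.O_DSYNC)   # write-through: no later write-back
try:
    payload = b"x" * BLOCK
    offset = 10 * BLOCK                       # allocator-chosen offset (made up)
    os.pwrite(fd, payload, offset)            # durable when this call returns
finally:
    os.close(fd)
```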
Finally, having full control of the I/O stack accelerates the adoption of new hardware. For example, while filesystems have a hard time adapting to shingled magnetic recording (SMR) drives, the authors were able to add metadata storage support for them to BlueStore, with data storage support in the works.
To sum up, the lesson learned is that for distributed storage it was easier and better to implement a custom backend than to try to shoehorn a filesystem into this role.
Here is the architecture diagram of the BlueStore storage backend. All metadata is maintained in RocksDB, which is layered on top of BlueFS, a minimal userspace filesystem.
Abutalib, the first author on the paper, did an excellent job presenting it. He is a final-year PhD student with a lot of experience and expertise in storage systems. He is on the job market.
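To illustrate the layering, here is a rough sketch of the shape of a BlueStore-style transaction. A plain dict stands in for RocksDB and a flat file stands in for the shared raw device, so this is only a toy model of the idea: data goes straight to its allocated extent on the device, and the object's metadata is committed as a key-value record.

```python
# Toy model of the layering described above, not BlueStore's real code.
# A dict stands in for RocksDB (whose WAL would live on BlueFS), and a
# flat file stands in for the raw device shared by data and metadata.
import os, json

DEV = "fake_device.img"
kv_store = {}                              # stand-in for RocksDB

with open(DEV, "wb") as f:
    f.truncate(64 * 1024 * 1024)           # preallocate the "device"

def bluestore_like_write(name: str, payload: bytes, offset: int) -> None:
    # 1) Write the object data directly to its allocated extent on the device.
    fd = os.open(DEV, os.O_WRONLY)
    try:
        os.pwrite(fd, payload, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
    # 2) Commit the metadata (object name -> extent) to the key-value store.
    kv_store[name] = {"offset": offset, "length": len(payload)}

bluestore_like_write("obj1", b"hello", 4096)
print(json.dumps(kv_store))
```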
Comments
Thanks!
https://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf
https://help.dreamhost.com/hc/en-us/articles/214840947-What-is-DreamCompute-
a) Why is discussing the design of BlueStore a research contribution? It is an open-source project.
b) There is a mention that adopting new hardware becomes faster because of BlueStore; sure, because it runs in userspace and, more importantly, it uses a copy-on-write approach, hence it works on zoned devices. Also, BlueStore is simple, so adding support is easier. It's basically a filesystem for improving the performance of transactions and distributed operations.
c) f2fs already supports smaller SMR drives, since f2fs is a log-structured filesystem. However, f2fs cannot support filesystems bigger than 16TB, whereas SMR drives can be bigger than that.
d) The performance of journaling has been measured before. Transactions in filesystems are expensive, and pretending to have them through a WAL is expensive too. This is not new information.