Facebook's software architecture

- October 18, 2014

I had summarized/discussed a couple papers (Haystack, Memcache caching) about Facebook's architecture before.

Facebook uses simple architecture that gets things done. Papers from Facebook are refreshingly simple, and I like reading these papers.

Two more Facebook papers appeared recently, and I briefly summarize them below.

TAO: Facebook's distributed data store for the social graph (ATC'13)

A single Facebook page may aggregate and filter 100s of items from the social graph. Since Facebook presents each user with customized content (which needs to be filtered with privacy checks) an efficient, highly available, and scalable graph data store is needed to serve this dynamic read-heavy workload.

Before Tao, Facebook's web servers directly accessed MySql to read or write the social graph, aggressively using memcache as a look aside cache (as it was explained in this paper).

The Tao data store implements a graph abstraction directly. This allows Tao to avoid some of the fundamental shortcomings of a look-aside cache architecture. Tao implements an objects and associations model and continues to use MySql for persistent storage, but mediates access to the database and uses its own graph-aware cache.

To handle multi-region scalability, Tao uses replication using the per-record master idea. (This multi-region scalability idea was again presented earlier in the Facebook memcache scaling paper.)

F4: Facebook's warm BLOB storage system (OSDI'14)

Facebook uses Haystack to store all media data, which we discussed earlier here.

Facebook's new architecture splits the media into two categories:
1) hot/recently-added media, which is still stored in Haystack, and
2) warm media (still not cold), which is now stored in F4 storage and not in Haystack.

This paper discusses the motivation for this split and how this works.

Facebook has big data! (This is one of those rare cases where you can say big data and mean it.) Facebook stores over 400 billion photos.

Facebook found that there is a strong correlation between the age of a BLOB (Binary Large OBject) and its temperature. Newly created BLOBs are requested at a far higher rate than older BLOBs; they are hot! For instance, the request rate for week-old BLOBs is an order of magnitude lower than for less-than-a-day old content for eight of nine examined types. Content less than one day old receives more than 100 times the request rate of one-year old content. The request rate drops by an order of magnitude in less then a week, and for most content types, the request rate drops by 100x in less than 60 days. Similarly, there is a strong correlation between age and the deletion rate: older BLOBs see an order of magnitude less deletion rate than the new BLOBs. These older content is called warm, not seeing frequent access like hot content, but they are not completely frozen either.

They also find that warm content is a large percentage of all objects. They separate the last 9 months Facebook data under 3 intervals: 9-6 mo, 6-3 mo, 3-0 months. In the oldest interval, they find that for the data generated in that interval more than 80% of objects are warm for all types. For objects created in the most recent interval more than 89% of objects are warm for all types. That is the warm content is large and it is growing increasingly.

In light of these analysis, Facebook goes with a split design for BLOB storage. They introduce F4 as a warm BLOB storage system because the request rate for its content is lower than that for content in Haystack and thus is not as hot. Warm is also in contrast with cold storage systems that reliably store data but may take days or hours to retrieve it, which is unacceptably long for user-facing requests. The lower request rate of warm BLOBs enables them to provision a lower maximum throughput for F4 than Haystack, and the low delete rate for warm BLOBs enables them to simplify F4 by not needing to physically reclaim space quickly after deletes.

F4 provides a simple, efficient, and fault tolerant warm storage solution that reduces the effective-replication-factor from 3.6 to 2.8 and then to 2.1. F4 uses erasure coding with parity blocks and striping. Instead of maintaining 2 other replicas, it uses erasure coding to reduce this significantly.

The data and index files are the same as Haystack, the journal file is new. The journal file is a write-ahead journal with tombstones appended for tracking BLOBs that have been deleted. F4 keeps dedicated spare backoff nodes to help with BLOB online reconstruction. This is similar to the use of dedicated gutter nodes for tolerating memcached node failures in the Facebook memcache paper.

F4 has been running in production at Facebook for over 19 months. F4 currently stores over 65PB of logical data and saves over 53PB of storage.

Discussion

1) Why go with a design that has a big binary divide between hot and warm storage? Would it be possible to use a system that handles hot and warm as gradual degrees in the spectrum? I guess the reason for this design is its simplicity. Maybe it is possible to optimize things by treating BLOBs differentially, but this design is simple and gets things done.

2) What are the major differences in F4 from the Haystack architecture? F4 uses erasure coding for replication: Instead of maintaining 2 other replicas, erasure coding reduces replication overhead significantly. F4 uses write-ahead logging and is aggressively optimized for read-only workload. F4 has less throughput needs. (How is this reflected in its architecture?)

Caching is an orthogonal issue handled at another layer using memcache nodes. I wonder if the caching policies treat content cached from Haystack versus F4 differently.

3) Why is energy-efficiency of F4 not described at all? Can we use grouping tricks to get cold machines/clusters in F4 and improve energy-efficiency further as we discussed here?

4) BLOBs have large variation in size. Can this be utilized in F4 to improve access efficiency? (Maybe treat/store very small BLOBs differently, store them together, don't use erasure coding for them. How about very large BLOBs?)

UPDATES:
Facebook monitoring tools (The Facebook Mystery machine)
The Facebook Stack (by Malte Schwarzkopf)