F1: A Distributed SQL Database That Scales

This is a VLDB 2013 paper (it seems an earlier version appeared at Sigmod'12) from Google about paying tech-debt. F1 replaces the hacky sharded MySQL implementation behind AdWords with a principled, well-engineered infrastructure that builds a distributed SQL layer on top of Spanner.

My reaction to this paper probably up until 5 years ago would be "ugh, schemas, database stuff... No distributed algorithms, pass!". But I like to think that I improve with age. After having learned the importance of databases as real-world killer applications of distributed systems, I now look at papers like this as a potential Rosetta stone between the two fields. I am looking for papers that can shave off weeks and months from my journey to learn about distributed databases.

I think this paper has been very useful for understanding several issues regarding distributed databases, as it covers many facets of deploying a large scale distributed SQL database. But the paper falls short of being a good guide due to several shortcomings. The paper is written more like a list of topics than with a coherent narrative. (This may be why most blog posts/summaries about the paper are in bullet-point format.) I don't think it would be hard to give the paper a good narrative. Even the MySQL-replacement story would do, but that thread appears early on and is picked up again only at the end.

Moreover, the paper doesn't present its contributions as teachable moments, and doesn't give explicit lessons or mental models for building a distributed SQL database. This makes it challenging to prioritize among the laundry list of contributions/features the paper mentions. Where did the biggest benefits come from? The paper does not even have an evaluation section to put things in perspective.

That being said, I am still a fan of this paper and am puzzled by how little love it has received from the community. This paper has received only 323 citations, which is dwarfed by the 2311 citations Spanner received.

Here are the things I love about this work.

  • It is hard to find a discussion of a truly large scale distributed SQL application layer such as the one F1 provides. AdWords encompasses 100s of applications and 1000s of users, all sharing the same database. This database is over 100 TB, serves up to hundreds of thousands of requests per second, and runs SQL queries that scan tens of trillions of data rows per day. (2013 numbers!)
  • It is a tech-debt-paying work; replacing the sharded MySQL deployment makes for an interesting before/after comparison opportunity.
  • It talks about disaggregated distributed databases (F1 is built on Spanner, which has a remote storage model) as early as 2012. F1 must have been deployed on spinning disks rather than SSDs. Yet, they were bold enough to insist that this was the way to go, and to work on batching/pipelining to make disaggregation perform acceptably.
  • It covers many facets of distributed SQL work that often go undiscussed, including data model, schema changes, change history management, ORM design, and distributed systems issues around query execution. Transactions take up only one of the paper's 11 sections.


F1 architecture

An F1 hybrid (also known as filial 1 hybrid) is the first filial generation of offspring of distinctly different parental types. The paper says that the F1 database is indeed such a hybrid, combining the best aspects of traditional relational databases and scalable NoSQL systems like Bigtable.

F1 is built on top of Spanner. Both systems were developed at the same time and in close collaboration. Spanner handles lower-level storage issues like persistence, caching, replication, fault tolerance, data sharding and movement, location lookups, and transactions. With F1, we get a database application layer built on top of the Spanner infrastructure. F1 adds:

  • Distributed SQL queries, including joining data from external data sources
  • Transactionally consistent secondary indexes
  • Asynchronous schema changes including database re-organizations
  • Optimistic transactions
  • Automatic change history recording and publishing


F1 servers can communicate with Spanner servers outside their own datacenter. The Spanner servers in turn retrieve their data from the Colossus File System (CFS) in the same datacenter. F1 servers are mostly stateless, allowing a client to communicate with a different F1 server for each request. F1 servers can be quickly added to (or removed from) the system in response to the total load, because F1 servers do not own any data and hence adding or removing a server requires no data movement.

Data model

F1 has a relational schema and extends it with explicit table hierarchy and columns with Protocol Buffer data types. F1 stores each child table clustered with and interleaved within the rows from its parent table. The child table must have a foreign key to its parent table as a prefix of its primary key. For example, the AdWords schema contains a table Customer with primary key (CustomerId), which has a child table Campaign with primary key (CustomerId, CampaignId), which in turn has a child table AdGroup with primary key (CustomerId, CampaignId, AdGroupId). All child table rows corresponding to a root row are clustered together with that root row in a single Spanner directory, meaning that this cluster is normally stored on a single Spanner server. Child rows are stored under their parent row ordered by primary key.
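To make the clustering concrete, here is a minimal Python sketch (mine, not from the paper) of how Campaign and AdGroup rows interleave under their Customer root row. The table names and key structure follow the paper; the row contents are made up.

```python
# Sketch: interleaving child rows under their root row by primary-key prefix.
rows = [
    ("Customer", (1,),      {"name": "acme"}),
    ("Campaign", (1, 3),    {"budget": 100}),
    ("AdGroup",  (1, 3, 6), {"bid": 0.4}),
    ("AdGroup",  (1, 3, 7), {"bid": 0.9}),
    ("Campaign", (1, 4),    {"budget": 50}),
    ("Customer", (2,),      {"name": "globex"}),
    ("Campaign", (2, 5),    {"budget": 75}),
]

# Because every child key starts with the root key (CustomerId), sorting by
# primary key clusters each root row together with all of its descendants.
# Everything sharing a root key lands in one Spanner directory, which is
# normally served by a single Spanner server.
for table, key, data in sorted(rows, key=lambda r: r[1]):
    print(f"directory(root={key[0]})  {table:<8} key={key}  {data}")
```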


Using hierarchy, to the extent that it matches data semantics, is highly beneficial. Making the advertiser (Customer) a root table and clustering related tables under it was critical to achieving acceptable latency in AdWords.

Hierarchical clustering is especially useful for updates, since it reduces the number of Spanner groups involved in a transaction. Because each root row and all of its descendant rows are stored in a single Spanner directory, transactions restricted to a single root will usually avoid 2PC and the associated latency penalty, so most applications try to use single-root transactions as much as possible.

Spanner's original data model was more like Bigtable, but Spanner later adopted F1's data model. The F1 data model supports table columns that contain structured data types. These structured types use the schema and binary encoding format provided by Google’s open source Protocol Buffer library. Protocol Buffers have typed fields that can be required, optional, or repeated; fields can also be nested Protocol Buffers. Putting protocol buffers in the schema gives users a universal data structure they can use both in the database and in application code, without the need to write tedious and error-prone transformations between database rows and in-memory data structures.
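As a toy illustration of the convenience described above, the sketch below uses a Python dataclass as a stand-in for a Protocol Buffer message (the AdSettings type and its fields are invented); the point is that the same structured value serves both as a column value in the database and as an in-memory object in application code.

```python
from dataclasses import dataclass, field
from typing import List

# Stand-in for a Protocol Buffer message: typed fields, possibly repeated or
# nested. In F1, such a message can be stored directly in a table column.
@dataclass
class AdSettings:                                      # invented type, for illustration
    language: str = "en"
    regions: List[str] = field(default_factory=list)   # a "repeated" field

# Application code and the database share the same structure, so there is no
# hand-written mapping between database rows and in-memory data structures.
row = {"CustomerId": 1, "Settings": AdSettings(language="de", regions=["EU"])}
row["Settings"].regions.append("CH")                   # manipulate the column value directly
print(row)
```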

All indexes in F1 are transactional and fully consistent. There are two types of physical storage layout for F1 indexes: local and global.

  • Local index keys must contain the root row primary key as a prefix. For example, an index on (CustomerId, Keyword) used to store unique keywords for each customer is a local index. Like child tables, local indexes are stored in the same Spanner directory as the root row. Consequently, the index entries of local indexes are stored on the same Spanner server as the rows they index, and local index updates add little additional cost to any transaction.
  • In contrast, global index keys do not include the root row primary key as a prefix, and hence cannot be co-located with the rows they index (see the sketch after this list).
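Here is a small sketch of the difference between the two layouts, using the paper's (CustomerId, Keyword) example; the primary-key values are made up.

```python
# Index entries map an index key to the primary key of the indexed row.
row_pk = (42, 7, 11)          # (CustomerId, CampaignId, AdGroupId), made-up values
keyword = "running shoes"

# Local index: the root key (CustomerId) prefixes the index key, so the entry
# lives in the same Spanner directory as the root row and is cheap to update.
local_index_entry = ((42, keyword), row_pk)

# Global index: the index key omits the root key, so the entry is sharded by
# Keyword and generally lives on a different server than the row it indexes;
# updating it drags extra participants into the transaction (2PC).
global_index_entry = ((keyword,), row_pk)

print(local_index_entry)
print(global_index_entry)
```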


Transactions

This paper has the same old tired motivation that Google used in the Spanner paper. (Ok, maybe this paragraph was not yet stale in 2013.)
With eventual consistency systems, developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level. Full transactional consistency is one of the most important properties of F1.

I love transactions, don't get me wrong. My beef with the paper is that it doesn't explain why the AdWords application required transactional consistency, and in particular serializable isolation. Why should we take it for granted that AdWords (I'm guessing auctions) requires transactional consistency and serializable isolation? What were the cases that required this? Could this have been relaxed (with improved performance)? Questions, questions.

F1 implements three types of transactions, all built on top of Spanner's transaction support. F1 clients use optimistic transactions by default, because that mode has the most benefits.

1. Snapshot transactions. These are *read-only transactions* with snapshot semantics, reading repeatable data as of a fixed Spanner snapshot timestamp. By default, snapshot transactions read at Spanner’s global safe timestamp, typically 5-10 seconds old, and read from a local Spanner replica. Snapshot transactions are the default mode for SQL queries and for MapReduces. Snapshot transactions allow multiple client servers to see consistent views of the entire database at the same timestamp.

2. Pessimistic transactions. These transactions map directly on to Spanner transactions. Pessimistic transactions use a stateful communications protocol that requires holding locks, so all requests in a single pessimistic transaction get directed to the same F1 server. If the F1 server restarts, the pessimistic transaction aborts. Reads in pessimistic transactions can request either shared or exclusive locks.

3. Optimistic transactions. Optimistic transactions consist of a read phase, which can take arbitrarily long and never takes Spanner locks, and then a short write phase. To detect row-level conflicts, F1 returns with each row its last modification timestamp. The new commit timestamp is automatically written into the lock column whenever the corresponding data is updated (in either pessimistic or optimistic transactions). The client library collects these timestamps, and passes them back to an F1 server with the write that commits the transaction. The F1 server creates a short-lived Spanner pessimistic transaction and re-reads the last modification timestamps for all read rows. If any of the re-read timestamps differ from what was passed in by the client, there was a conflicting update, and F1 aborts the transaction. Otherwise, F1 sends the writes on to Spanner to finish the commit.
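Here is a minimal sketch of the optimistic commit logic described above, with an in-memory dict standing in for Spanner and the per-row lock column; this is just the shape of the validation, not F1's implementation.

```python
import time

# Fake "Spanner": each row holds its data plus a last-modification timestamp
# (the lock column). Real F1 re-reads these timestamps inside a short-lived
# pessimistic Spanner transaction; a plain dict stands in for all of that here.
store = {"row1": {"value": "a", "ts": 100}, "row2": {"value": "b", "ts": 100}}

def optimistic_read(keys):
    # Read phase: takes no locks; remembers each row's last-modification timestamp.
    return {k: store[k]["ts"] for k in keys}

def optimistic_commit(read_ts, writes):
    # Write phase: re-read timestamps for all read rows; abort on any change.
    for key, ts in read_ts.items():
        if store[key]["ts"] != ts:
            return False                         # conflicting update -> abort
    commit_ts = int(time.time())                 # stand-in for the Spanner commit timestamp
    for key, value in writes.items():
        store[key] = {"value": value, "ts": commit_ts}
    return True

snapshot = optimistic_read(["row1", "row2"])         # arbitrarily long read phase
store["row2"]["ts"] = 101                            # a concurrent writer sneaks in
print(optimistic_commit(snapshot, {"row1": "a2"}))   # False: the transaction aborts
```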

F1 provides row-level locking by default. Each F1 row contains one default lock column that covers all columns in the same row. However, concurrency levels can be changed in the schema. For example, users can increase concurrency by defining additional lock columns in the same row, with each lock column covering a subset of columns.
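A toy sketch of why extra lock columns help: with a single default lock column, two transactions touching disjoint columns of the same row still conflict, while splitting the row's columns across two lock columns (an invented budget/creative split, for illustration) lets them commit independently.

```python
# Each lock column records the last-modification timestamp of the columns it
# covers; conflicts are detected per lock column, so finer-grained lock
# columns mean fewer false conflicts on the same row.

def conflicts(read_ts, current_ts, touched_locks):
    return any(read_ts[lock] != current_ts[lock] for lock in touched_locks)

# Default schema: one lock column covers the whole row. Txn A's budget update
# bumped it to 101 after Txn B's read at 100, so B's creative-field update aborts.
print(conflicts({"_lock": 100}, {"_lock": 101}, ["_lock"]))             # True

# Schema with two lock columns (an invented split: budget vs. creative fields).
# A's update only bumped _lock_budget, so B's creative-field update commits.
current = {"_lock_budget": 101, "_lock_creative": 100}
print(conflicts({"_lock_creative": 100}, current, ["_lock_creative"]))  # False
```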

Query processing

F1 SQL supports both centralized and distributed execution of queries. Centralized execution is used for short OLTP-style queries and the entire query runs on one F1 server node. Distributed execution is used for OLAP-style queries and spreads the query workload over worker tasks in the F1 slave pool. Distributed queries always use read-only snapshot transactions discussed above.



The paper's example query takes all AdClicks on a specific date, finds the corresponding AdGroupCreative and then the Creative, and aggregates to find the number of clicks grouped by campaign, region, and language. Figure 3 of the paper shows a possible query plan for this query. In the query plan, data is streamed bottom-up through each of the operators up until the aggregation operator. The deepest operator performs a scan of the AdClick table. In the same worker node, the data from the AdClick scan flows into a lookup join operator, which looks up AdGroupCreative records using a secondary index key. The plan then repartitions the data stream by a hash of the CustomerId and CreativeId, and performs a lookup in a hash table that is partitioned in the same way (a distributed hash join). After the distributed hash join, the data is once again repartitioned, this time by a hash of the CampaignId, Region and Language fields, and then fed into an aggregation operator that groups by those same fields (a distributed aggregation).

Doesn't Fig 3 look like a map-reduce illustration?

F1 does not store its data locally. F1’s main data store is Spanner, which is a remote data source. Since all data is remote, batching is used heavily to mitigate network latency. The prime example of how F1 SQL takes advantage of batching is found in the lookup join query plan operator. This operator executes a join by reading from the inner table using equality lookup keys. It first retrieves rows from the outer table, extracting the lookup key values from them and deduplicating those keys. This continues until it has gathered 50MB worth of data or 100,000 unique lookup key values. Then it performs a simultaneous lookup of all keys in the inner table. This returns the requested data in arbitrary order. The lookup join operator joins the retrieved inner table rows to the outer table rows which are stored in memory, using a hash table for fast lookup. The results are streamed immediately as output from the lookup join node.
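Here is a sketch of that batching pattern: accumulate and deduplicate outer-table keys up to the thresholds the paper gives, issue one simultaneous lookup against the (remote) inner table, then hash-join and stream the results. The data access layer and helper names below are made up.

```python
MAX_KEYS  = 100_000      # thresholds from the paper: 100,000 unique lookup keys ...
MAX_BYTES = 50 << 20     # ... or 50 MB of buffered outer-table data, whichever comes first

def lookup_join(outer_rows, inner_lookup, key_of):
    """Batched lookup join sketch. outer_rows is an iterator of dicts;
    inner_lookup(keys) performs one simultaneous (remote) read of the inner
    table and returns matching rows in arbitrary order; key_of extracts the
    join key, assumed here to be the same field on both sides."""
    batch, keys, nbytes = [], set(), 0
    for row in outer_rows:
        batch.append(row)
        keys.add(key_of(row))
        nbytes += len(str(row))                  # crude size estimate for the sketch
        if len(keys) >= MAX_KEYS or nbytes >= MAX_BYTES:
            yield from _flush(batch, keys, inner_lookup, key_of)
            batch, keys, nbytes = [], set(), 0
    if batch:
        yield from _flush(batch, keys, inner_lookup, key_of)

def _flush(batch, keys, inner_lookup, key_of):
    # Hash the unordered inner rows, then join the buffered outer rows against
    # them and stream results immediately; input ordering is not preserved.
    inner_by_key = {}
    for inner in inner_lookup(keys):
        inner_by_key.setdefault(key_of(inner), []).append(inner)
    for outer in batch:
        for inner in inner_by_key.get(key_of(outer), []):
            yield {**outer, **inner}
```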

This emphasis on data streaming means that ordering properties of the input data are lost, while allowing for maximum read request concurrency and limiting the space needed for row buffering. Since ordering cannot be guaranteed, F1 eschewed range partitioning altogether and opted to apply only hash partitioning when repartitioning data between processing steps.
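A repartitioning step then amounts to routing each row by a hash of the repartitioning fields, roughly like this (the hash choice and worker count below are arbitrary, not from the paper):

```python
import hashlib

def partition(row, fields, num_workers):
    # Route a row to a worker by hashing the repartitioning fields. Hash
    # partitioning needs no ordering guarantees from upstream operators,
    # which is why it composes well with F1's unordered, streamed inputs.
    key = "|".join(str(row[f]) for f in fields).encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_workers

row = {"CampaignId": 7, "Region": "US", "Language": "en", "clicks": 3}
print(partition(row, ["CampaignId", "Region", "Language"], num_workers=16))
```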

Let's sum up the characteristics of query execution environment. All data is remote, and batching is used heavily to mitigate network latency. Individual query plan operators are designed to stream data to later operators as soon as possible, maximizing pipelining in query plans. Queries use many hash-based repartitioning steps.

Deployment, latency, and throughput

These sections do not include any evaluation, lessons learned, or takeaways. So we make do with what we get here.

The F1 and Spanner clusters deployed for AdWords use five datacenters spread out across the mainland US. The Spanner configuration uses 5-way Paxos replication to provide availability under a double failure. They deploy the five read/write replicas with two each on the east and west coasts of the US, and the fifth centrally. With leaders on the east coast, commits require round trips to the other east coast datacenter, plus the central datacenter, which accounts for the 50ms minimum latency. Multi-group commits require 2PC, which typically doubles the minimum latency.
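A back-of-the-envelope check of that 50ms floor, assuming a Paxos commit needs acks from a majority (3 of 5, i.e., the leader plus its two nearest followers); the RTT numbers below are my assumptions, chosen to be roughly consistent with the paper's figure, not measurements from the paper.

```python
# The five replicas: the leader (east coast), a second east-coast DC, a
# central DC, and two west-coast DCs. A Paxos commit waits for a majority
# (3 of 5): the leader plus its two nearest followers. RTTs are assumed values.
rtt_ms = {"east2": 15, "central": 50, "west1": 70, "west2": 75}

quorum_acks = sorted(rtt_ms.values())[:2]   # the two fastest follower round trips
single_group_commit = max(quorum_acks)      # bounded by the slower of the two
print(single_group_commit)                  # ~50 ms: the central-DC round trip
print(2 * single_group_commit)              # 2PC roughly doubles it for multi-group commits
```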

Despite the higher database latency, overall user-facing latency for the main interactive AdWords web application averages about 200ms, which is similar to the preceding system running on MySQL. Avoiding serial reads in client code accounts for much of that. In fact, while the average is similar, the MySQL application exhibited tail latency much worse than the same application on F1.
 
Large distributed queries run with latency comparable to MySQL. Most of the largest queries actually run faster in F1 because they can use more parallelism than MySQL, which can only parallelize up to the number of MySQL shards. In F1, such queries often see linear speedup when given more resources.

Resource costs are usually higher in F1, where queries often use an order of magnitude more CPU than similar MySQL queries.

Comparison to the previous system

There is no section that compares F1 to the previous sharded MySQL solution it replaced; instead, that comparison is dispersed as a sentence here and a paragraph there. Since we mentioned some MySQL comparisons here, this is a good place to discuss the aftermath (in terms of systems/operations metrics). I don't have insider information, so I go by what I can piece together from the paper.

I think high availability was the biggest gain. They mention this in the paper: "Availability reaches five nines, even in the presence of unplanned outages, and observable latency on our web applications has not increased compared to the old MySQL system."

It seems like scaling up seamlessly was another big challenge with the previous MySQL deployment: "Our sharded database based on MySQL was hard to scale up, and even more difficult to rebalance."

I think another benefit was providing cleaner APIs to their developers and paying off the technical debt they incurred while putting AdWords into production in the first place.

Concluding remarks

This was a huge paper to cover, even after skipping three of its sections:

  • Change history: Change data capture is a first-class citizen in F1.
  • Client design: They dropped the R in the ORM provided to the app developers, to prevent implicit/obscured access to expensive relational joins.
  • Schema changes: F1 is designed to make all schema changes fully non-blocking. This is explained in a dedicated paper, "Online, asynchronous schema change in F1. PVLDB, 6(11), 2013."

I enjoyed reading this paper. To repeat what I said earlier, this paper needs some love from both the systems and databases community.

I also think the F1 team needs to write a follow-up paper to tie up the loose ends. What are the main learnings from F1? What are the systems takeaways? Can we get some evaluation of the improvements from specific optimizations/techniques?

Here are slides from the Sigmod'12 presentation of the paper. Unfortunately, I can't find a video of a presentation on F1.
