Spanner: Becoming a SQL system

This is a VLDB 2017 paper. Last week we reviewed the F1 paper from 2012. F1 seems to have been an experiment and a preview of sorts for adding serious SQL support to Spanner. The original Spanner paper, published in 2012, had little discussion of or support for SQL; it described mostly a "transactional NoSQL core". In the intervening years, though, Spanner has evolved into a relational database system, and many of the SQL features in F1 were incorporated directly into Spanner. Spanner gained a strongly-typed schema system and a SQL query processor, among other features.

This paper describes Spanner's evolution into a full-featured SQL system. It focuses mostly on distributed query execution (in the presence of resharding of the underlying Spanner record space), query restarts upon transient failures, range extraction (which drives query routing and index seeks), and the improved blockwise-columnar storage format. I wish there were also discussion of the evolution of Spanner's data manipulation/modification and transactional components. This paper, like the F1 paper and many other papers from industry, does not have an evaluation section. That is fine; we will make do with what we get, and I will focus more on the lessons-learned part of the paper.


Query distribution and distributed query execution

The process of generating an optimized query plan involves pulling up the Distributed Union operator and pushing down other relational operators, such as Filters. This improves efficiency and performance by reducing unnecessary data movement. Distributed Union is used to ship a subquery to each shard of the underlying persistent or temporary data, and to concatenate the results. This operator provides a building block for more complex distributed operators such as distributed joins between independently sharded tables.
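To make the pull-up/push-down idea concrete, here is a toy Python sketch of my own (not Spanner code; the shard layout and operator names are invented) of a Distributed Union that ships a filtered subquery to each shard and concatenates the results, so the filter runs shard-side rather than on the collected data.

# Toy sketch of Distributed Union with filter pushdown (illustrative only).

def scan(shard):
    """Local scan of one shard: yields rows."""
    yield from shard

def filtered_scan(shard, predicate):
    """Filter pushed below the Distributed Union, so it runs on each shard."""
    return (row for row in scan(shard) if predicate(row))

def distributed_union(shards, subquery):
    """Ship the subquery to every shard and concatenate the results."""
    for shard in shards:
        yield from subquery(shard)

# Example: SELECT * FROM T WHERE balance > 100, with T split into three shards.
shards = [
    [{"id": 1, "balance": 50}, {"id": 2, "balance": 150}],
    [{"id": 3, "balance": 200}],
    [{"id": 4, "balance": 10}],
]
rows = distributed_union(shards, lambda s: filtered_scan(s, lambda r: r["balance"] > 100))
print(list(rows))   # [{'id': 2, 'balance': 150}, {'id': 3, 'balance': 200}]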



Apply-style joins are used aggressively to improve query performance. The query plan is rewritten to transform WHERE clauses into multiple correlated self-joins.





Spanner implements a Distributed Apply operator by extending Distributed Union and implementing the Apply-style join in a batched manner, as two joins: a distributed join that applies batches of rows from the input to the remote subquery, and another that applies the rows from each batch to the original join's subquery locally on a shard. Figure 2 in the paper shows the transformation schematically. The resulting Distributed Apply operator combines the functionality of Cross Apply or Outer Apply with the Distributed Union operator.
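Here is a rough sketch of the batching idea, under simplifying assumptions of my own: a hash-based shard_of function and an invented lookup_on_shard subquery stand in for Spanner's key-range sharding and remote subqueries.

# Rough sketch of a batched Distributed Apply (illustrative only).

def shard_of(key, num_shards=2):
    return key % num_shards                  # stand-in for key-range based sharding

def lookup_on_shard(shard_rows, key):
    """Inner, per-row apply: run the original join's subquery locally on a shard."""
    return [r for r in shard_rows if r["key"] == key]

def distributed_apply(input_rows, shard_rows_by_id, batch_size=3):
    """Outer, batched apply: ship each batch of input rows to the shards that
    own the keys in it, then apply each row of the batch locally."""
    for start in range(0, len(input_rows), batch_size):
        batch = input_rows[start:start + batch_size]
        by_shard = {}
        for row in batch:                    # route the batch to the owning shards
            by_shard.setdefault(shard_of(row["key"]), []).append(row)
        for shard_id, rows in by_shard.items():
            for row in rows:                 # local per-row apply on the shard
                for match in lookup_on_shard(shard_rows_by_id[shard_id], row["key"]):
                    yield (row, match)

shard_rows_by_id = {0: [{"key": 2, "v": "x"}], 1: [{"key": 1, "v": "y"}, {"key": 3, "v": "z"}]}
inputs = [{"key": 1}, {"key": 2}, {"key": 3}, {"key": 4}]
print(list(distributed_apply(inputs, shard_rows_by_id)))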


The filter tree is a runtime data structure used for extracting the query key ranges from the query plan via bottom-up intersection/union of intervals, and for post-filtering the rows emitted by the correlated self-joins. The filter tree is shared across all correlated scans produced by the compile-time rewriting.
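A minimal sketch of the range-extraction half of this, assuming a single key column and a predicate tree representation of my own invention; the real filter tree handles multi-column keys and also does the post-filtering.

# Toy bottom-up key-range extraction from a predicate tree (illustrative only).
# Intervals are (low, high) tuples over one key column; AND intersects, OR unions.

def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return [(lo, hi)] if lo <= hi else []

def extract_ranges(node):
    kind = node[0]
    if kind == "leaf":                       # e.g. ("leaf", (5, 10)) for 5 <= k <= 10
        return [node[1]]
    if kind == "and":
        ranges = [(float("-inf"), float("inf"))]
        for child in node[1:]:
            ranges = [r for a in ranges for c in extract_ranges(child) for r in intersect(a, c)]
        return ranges
    if kind == "or":
        return [r for child in node[1:] for r in extract_ranges(child)]

# (k BETWEEN 5 AND 10 AND k <= 8) OR k = 20
tree = ("or", ("and", ("leaf", (5, 10)), ("leaf", (float("-inf"), 8))), ("leaf", (20, 20)))
print(extract_ranges(tree))   # [(5, 8), (20, 20)]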

The Spanner query engine supports query restarts by leveraging "restart tokens." These tokens prevent duplicate results and ensure forward progress when a query is interrupted by failures during execution.
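Conceptually, the client-visible behavior looks something like the following toy sketch (the execute_query stub and TransientError are invented here; Spanner's client libraries hide this loop from applications).

# Sketch of resuming a streaming query with restart tokens (illustrative only).

import random

class TransientError(Exception):
    pass

DATA = list(range(10))          # stand-in for query results

def execute_query(restart_token=0):
    """Toy server: streams (row, next_token) pairs starting at restart_token,
    occasionally failing mid-stream to simulate a transient fault."""
    for i in range(restart_token, len(DATA)):
        if random.random() < 0.2:
            raise TransientError("server restarted / network blip")
        yield DATA[i], i + 1    # the token marks progress past this row

def run_with_restarts():
    token, results = 0, []
    while True:
        try:
            for row, token in execute_query(restart_token=token):
                results.append(row)          # rows before the token are never re-emitted
            return results
        except TransientError:
            continue                          # resume exactly where we left off

print(run_with_restarts())                    # [0, 1, ..., 9], no duplicates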

Finally, to ensure the robustness of the query engine, the authors use a random query generation tool that works at the level of the query Abstract Syntax Tree (AST). This helps test the engine's correctness across a wide variety of query shapes.
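Here is the flavor of such a tool, as a toy sketch of my own (the tiny grammar and column names are invented); the generated queries could then be fed to the engine, and their results checked against a reference implementation.

# Toy AST-style random query generation (illustrative only).

import random

COLUMNS = ["a", "b", "c"]

def random_expr(depth=0):
    """Build a random scalar expression tree, rendered as a string."""
    if depth > 2 or random.random() < 0.4:
        return random.choice(COLUMNS + ["1", "42"])
    op = random.choice(["+", "*", "-"])
    return f"({random_expr(depth + 1)} {op} {random_expr(depth + 1)})"

def random_query():
    select = ", ".join(random_expr() for _ in range(random.randint(1, 3)))
    where = f"{random_expr()} > {random_expr()}"
    return f"SELECT {select} FROM T WHERE {where}"

for _ in range(3):
    print(random_query())      # candidate queries to run against the engine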


The coprocessor framework

At runtime, Distributed Union minimizes latency by using the Spanner coprocessor framework to route a subquery request addressed to a shard to one of the nearest replicas that can serve the request. The coprocessor framework is a great concept, but to me it is a weird name, given the hardware connotation.

Spanner provides a special RPC framework, called the coprocessor framework, to hide much of the complexity of locating data. The communication layer of Spanner uses remote calls that are addressed to logical data ranges, not to compute nodes or processors directly; calls are routed to the best replica. The coprocessor framework determines which Paxos group (or groups) owns the data being addressed, and finds the nearest replica of that group that is sufficiently up-to-date for the specified concurrency mode. Data shards may move from one Paxos group to another or be split into smaller shards in new groups (usually for load balancing reasons). In this case the coprocessor framework transparently reroutes requests to the new group's replicas --even if the move happens while the request is in progress. The coprocessor framework hides a variety of other transient faults (server failure, network blips, etc) by automatically retrying.
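Here is a toy sketch of what range-addressed routing with transparent retries might look like; the directory, replica metadata, and freshness check are simplified stand-ins of my own, not the coprocessor framework's actual interfaces.

# Sketch of range-addressed routing with transparent retry (illustrative only).

class ShardMoved(Exception):
    pass

# key range -> Paxos group; replicas carry a network distance and a freshness timestamp
DIRECTORY = {("a", "m"): "group1", ("m", "z"): "group2"}
REPLICAS = {
    "group1": [{"name": "us-east", "distance": 5, "safe_ts": 100},
               {"name": "us-west", "distance": 1, "safe_ts": 90}],
    "group2": [{"name": "eu", "distance": 8, "safe_ts": 120}],
}

def owning_group(key):
    for (lo, hi), group in DIRECTORY.items():
        if lo <= key < hi:
            return group
    raise ShardMoved(key)       # range split or moved to another group

def route(key, read_ts):
    """Pick the nearest replica that is sufficiently up-to-date for read_ts."""
    group = owning_group(key)
    fresh = [r for r in REPLICAS[group] if r["safe_ts"] >= read_ts]
    return min(fresh, key=lambda r: r["distance"])

def send_subquery(key, read_ts, attempts=3):
    for _ in range(attempts):
        try:
            return route(key, read_ts)      # in reality: issue the RPC to this replica
        except ShardMoved:
            continue                         # a real system would refresh the directory here
    raise RuntimeError("could not route request")

print(send_subquery("box", read_ts=95))      # nearest sufficiently up-to-date replica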

Wow, just wow! But isn't this maybe a tad too much (with the masking of a variety of transient faults)?
I understand the allure of avoiding complexity at the client-side, but what is the cost of driving it to this extent (masking transient faults by automatically retrying)?

Don't get me wrong, I understand and sympathize with this perspective. Like the F1 paper, this is a paper about building solid infrastructure for applications to build upon. Fault-tolerance for the win (ftftw) seems to be the theme. I understand the allure of going the extra distance to even get rid of retry loops on the client side: those may be problematic and DDoS your system exactly when it needs some slack to recover:
Unlike most distributed query processors, Spanner fully hides transient failures during query execution. This choice results in simpler applications by allowing clients to omit sometimes complex and error-prone retry loops. Moreover, transparent restarts enable rolling upgrades of the server software without disrupting latency-sensitive workloads.


But when I see mentions of masking a variety of faults, and doing so opaquely, I am unnerved. It immediately brings the "robust-yet-fragile" concept to mind. If you optimize a system too much for tolerating anticipated faults, you leave it in a weakened state for tolerating unanticipated faults. I tried to talk about this in this 2013 post.


Blockwise-Columnar Storage

The paper mentions an ongoing effort to replace Spanner's existing storage engine with a columnar engine that uses the PAX layout. PAX (Partition Attributes Across) is a storage organization scheme that, within each block, groups the values of each column together, making scans that touch only a few columns more efficient, which benefits analytical workloads in particular. The paper talks about this effort as follows.

Ressi is the new low-level storage format for Spanner, which will be entering production in the first half of 2017. It is designed from the ground up for handling SQL queries over large-scale, distributed databases comprising both OLTP and OLAP workloads.

Since Spanner is a time-versioned database, there may be many values of a particular row-key/column, with different timestamps. Ressi divides the values into an active file, which contains only the most recent values, and an inactive file, which may contain many older versions (or none). This allows queries for the most recent data to avoid loading old versions. Large (multi-page) values are also segregated into separate files. This allows rapid scans of tables without paying the I/O costs of reading large values until they are actually needed.
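A toy sketch of the active/inactive split for a time-versioned store; the data layout and function names here are my own illustration, not Ressi's actual format.

# Sketch of splitting versioned values into active and inactive files (illustrative only).

# versions: (key, column) -> list of (timestamp, value), newest first
versions = {
    ("user1", "balance"): [(300, 70), (200, 50), (100, 20)],
    ("user2", "balance"): [(250, 10)],
}

active = {}      # only the most recent value per key/column
inactive = {}    # all older versions

for kc, history in versions.items():
    active[kc] = history[0]
    inactive[kc] = history[1:]

def read_latest(key, column):
    """Queries for current data only touch the active file."""
    return active[(key, column)][1]

def read_as_of(key, column, ts):
    """Time-travel reads fall back to the inactive file when needed."""
    for version_ts, value in [active[(key, column)]] + inactive[(key, column)]:
        if version_ts <= ts:
            return value
    return None

print(read_latest("user1", "balance"))       # 70, without scanning old versions
print(read_as_of("user1", "balance", 150))   # 20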



Lessons Learned And Challenges

This was interesting:
Spanner offers a wide array of physical data layout options. These include geographic placement, replication, using protocol buffers vs. interleaved tables, vertical partitioning of tables within shards, etc. Internally, Spanner engineers spend a fraction of their time reviewing the schema designs for new internal applications. In many cases, schema reviews helped avoid production challenges later on.

So, even with 100s of internal services using Spanner, the Spanner engineers vet schema designs for internal applications. (UPDATE: I heard that this is more like "office hours" and not so much like "design reviews".) The paper mentions a lot of optimizations, like query pushdown, the multiple-consumer API, hierarchical/embedded parent-child table storage (awesome for efficient transactions and joins), and range extraction. But complexity accompanies the optimization potential. The price is that schema designs need to be checked against these optimization prerequisites: partitionability (to enable Distributed Union pull-up), parent-child relationships between tables (for colocation), root-partitionability, and the conditions required for range extraction.
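As a toy illustration of one such prerequisite check (the schema representation is invented; in Spanner an interleaved child table's primary key must be prefixed by its parent's primary key):

# Toy colocation check for interleaved parent-child tables (illustrative only).

schemas = {
    "Customers": {"pk": ["customer_id"], "parent": None},
    "Orders":    {"pk": ["customer_id", "order_id"], "parent": "Customers"},
    "Items":     {"pk": ["order_id", "item_id"], "parent": "Orders"},  # bad key prefix
}

def colocation_ok(table):
    """A child can be colocated under its parent only if its primary key
    starts with the parent's primary key."""
    parent = schemas[table]["parent"]
    if parent is None:
        return True
    parent_pk = schemas[parent]["pk"]
    return schemas[table]["pk"][:len(parent_pk)] == parent_pk

for t in schemas:
    print(t, colocation_ok(t))   # Items fails: it cannot be interleaved under Orders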

Maybe it is as they say: You date your OS vendor, but you marry your database vendor.


This was interesting:
Despite the critique of one-fits-all systems, combining OLTP, OLAP, and full-text search capabilities in a single system remains at the top of customer priorities. Large-scale deployment and monitoring of multiple systems with different availability guarantees, transactional semantics, rollout cycles, and language and API quirks is a major burden on customers. It is our goal to make Spanner perform well and be cost-effective across a broad spectrum of use cases over time.

This was even more interesting. They seem to declare victory over the distributed SQL problem:
We observed that both the loosely coupled (in case of F1) and tightly coupled SQL designs can be deployed successfully, and even simultaneously, on a transactional NoSQL core (original Spanner system). A detailed exploration of the pros and cons of these designs is still outstanding. But it is perhaps fair to say that from the perspective of many engineers working on the Google infrastructure, the SQL vs. NoSQL dichotomy may no longer be relevant.

And this is their advice to others:
Our path to making Spanner a SQL system lead us through the milestones of addressing scalability, manageability, ACID transactions, relational model, schema DDL with indexing of nested data, to SQL. For other development organizations that will need to go a similar path we would recommend starting off with the relational model early on, once a scalable, available storage core is in place; well-known relational abstractions speed up the development and reduce the costs of foreseeable future migration.
