What’s Really New with NewSQL?

This paper is by Andy Pavlo and Matthew Aslett, and it appeared in Sigmod 2016.
 

NoSQL managed to scale horizontally, but this came at the expense of losing transaction and rich querying capability. NewSQL followed NoSQL to amend things and restore balance to the force.

NewSQL is a class of modern relational DBMSs that seek to provide on-par scalability to NoSQL for OLTP read-write workloads while still maintaining ACID guarantees for transactions.


Let's dissect this definition.

  1. The biggest benefit of NewSQL is that developers do not have to write code to deal with eventually consistent updates as they would in a NoSQL system, because they will be able to use ACID transactions and SQL-like rich querying capabilities.
  2. NewSQL is about OLTP (online transaction processing), not OLAP (online data analysis like in data warehouse systems). Well, with the caveat that it can also be about HTAP (hybrid transactional-analytical processing), as the paper mentions under the future trends section.
  3. Finally, what does modern mean? It means using shared-nothing distributed architecture, in-memory storage, partitioning/sharding, lockless concurrency control, as the paper discusses in the features section.

Let's start with the summary table, and discuss these NewSQL systems with respect to the three main categories they are categorized under and with respect to their features.




My notes in this summary comes mostly direct from the paper. This is a very readable paper, written in plain English with the objective to communicate ideas and educate people. Unfortunately it is getting harder and harder to say the same about most papers being published at conferences and journals as the need to impress takes precedence.

On this note, it is worth reminding everyone that Andy's database courses (both the intro and advanced courses) are on YouTube and worth checking to refresh your database knowledge.


Three categories of NewSQL

New architectures. All of the DBMSs in this category are based on distributed architectures that operate on shared-nothing resources and support multi-node concurrency control. The advantage of using a new DBMS that is built for distributed execution is that all parts of the system can be optimized for multi-node environments. Every one of the DBMSs in this category (with the exception of Google Spanner) also manages their own primary storage, either in-memory or on disk. This means that the DBMS is responsible for distributing the database across its resources with a custom engine instead of relying on an off-the-shelf distributed filesystem (e.g., HDFS) or storage fabric (e.g., Apache Ignite). This is valuable for sending the query/computation to the data to improve performance.

Transparent sharding middleware. The centralized middleware communicates via a shim layer with each DBMS node which executes queries and return results. This approach presents a single logical database to the application without needing to modify the underlying DBMS.

Database as a service. The table regards only those DBaaS products that are based on a new architecture as NewSQL. A notable example here is Amazon's Aurora for their MySQL RDS, whose  distinguishing feature over InnoDB is that it uses a log-structured storage manager to improve I/O parallelism.

Features

Main memory storage. We have now reached the point where it is affordable to store all but the largest OLTP databases entirely in memory. The in-memory systems can get better performance because many of the components that are necessary to handle disk-based storage cases, like a buffer pool manager or heavy-weight concurrency control schemes, are not needed. 

Partitioning/sharding. The OLTP databases are amenable to partitioning because their schemas can be transposed into a tree-like structure where descendants in the tree have a foreign key relationship to the root. Some of the NewSQL systems support live migration to move data between physical resources to re-balance and alleviate hotspots, or to increase/decrease the DBMS’s capacity without any interruption to service.

Concurrency control. Almost all of the NewSQL systems based on new architectures eschew two-phase locking (2PL) because the complexity of dealing with deadlocks. Instead, the current trend is to use variants of timestamp ordering (TO) concurrency control. The most widely used protocol in NewSQL systems is decentralized multi-version concurrency control (MVCC) where the DBMS creates a new version of a tuple in the database when it is updated by a transaction. Maintaining multiple versions potentially allows transactions to still complete even if another transaction updates the same tuples. It also allows for long-running, read-only transactions to not block on writers. This protocol is used in almost all of the NewSQL systems based on new architectures.

Other systems use a combination of 2PL and MVCC together. (Since the paper was published CockroachDB moved from OCC to 2PL.) With this approach, transactions still have to acquire locks under the 2PL scheme to modify the database. When a transaction modifies a record, the DBMS creates a new version of that record just as it would with MVCC. This scheme still allows read-only queries to avoid having to acquire locks and therefore not block on writing transactions.

These are very important and interesting distributed protocols. I want to write a separate summary post based on Andy Pavlo's DB lectures on concurrency control.

Secondary indexes. A secondary index contains a subset of attributes from a table that are different than its primary key(s). This allows the DBMS to support fast queries beyond primary key or partitioning key look-ups. All of the NewSQL systems based on new architectures are decentralized and use partitioned secondary indexes. This means that each node stores a portion of the index, rather than each node having a complete copy of it. 

Replication. All of the NewSQL systems that we are aware of support strongly consistent replication. But there is nothing novel about how these systems ensure this consistency. One aspect of the NewSQL systems that is different than previous work is the consideration of replication over the wide-area network (WAN).


Conclusions

In conclusion, the authors mention that their main takeaway is that NewSQL database systems are not a radical departure from existing system architectures but rather represent the next chapter in the continuous development of database technologies. What is innovative about these NewSQL DBMSs is that they incorporate these ideas into single platforms.

I asked Andy if he had a followup after 5 years of publishing this paper. He pointed me to his Hydra 2021 talk.

Comments

Popular posts from this blog

Foundational distributed systems papers

Your attitude determines your success

Progress beats perfect

Cores that don't count

Silent data corruptions at scale

Learning about distributed systems: where to start?

Read papers, Not too much, Mostly foundational ones

Sundial: Fault-tolerant Clock Synchronization for Datacenters

Linearizability

Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)