Paper Summary. The Case for Learned Index Structures

This paper was put on Arxiv yesterday and is authored by Tim Kraska, Alex Beutel, Ed Chi, Jeff Dean, Neoklis Polyzotis.

The paper aims to demonstrate that "machine learned models have the potential to provide significant benefits over state-of-the-art database indexes".

If this research bears more fruit, we may look back and say, the indexes were first to fall, and gradually other database components (sorting algorithms, query optimization, joins) were replaced with neural networks (NNs).

In any case this is a promising direction for research, and the paper is really thought provoking.

Motivation

Databases started as general, one-size fits all blackboxes. Over time, this view got refined to "standardized sizes" to OLAP databases and OLTP databases.

Databases use indexes to access data quickly. B-Trees and Hash-maps are common techniques to implement indexes. But along with the blackbox view, the databases treat the data as opaque, and apply these indexes blindly without making any assumptions about the data. However, it is obvious that not knowing about the data distribution leaves performance on the table. Consider this thought experiment. If the keys are from the range of 0 to 500m, it is faster to just use the key as index, rather than using a hash. This observation can be extended to other data distributions, if we know the cumulative distributed function (CDF) of the data. We can generalize by saying "CDF*key*record-size" gives us the approximate position of the record the key refers to.

Ok, so, by knowing about the data distribution, we can achieve performance gains. But now we lost reusability when we go full whitebox. We can't afford to go full whitebox, inspecting the data, and designing the indexes from scratch every time.

The paper shows that by using NNs to learn the data distribution we can have a graybox approach to index design and reap performance benefits by designing the indexing to be data-aware.

The case for applying NNs to indexing is shown over the following three index-types:

  • B-trees, which are used for handling range queries
  • hash-maps, which are used for point-lookup queries
  • bloom-filters, which are used for set inclusion checks

I will only summarize the section on how to replace the B-tree structure. For the hash maps, the learned structure is a straightforward function based on CDF of the data.

B-trees

B-trees provide a hierarchical efficient index.

Why is it even conceivable to replace B-tree with an NN model? Conceptually, b-tree maps a key to a page. We can have a model that also performs key to position mapping, and for the error range, we can do a variant of binary search (or expanded ring search) to locate the page.


OK, then, how do we know min_error and max-error? We train the model with the data we have. The data is static, the NN makes a prediction and then learns from these errors. (Even simple logistic regression may work for simple distributions.)

What potential benefits can we reap by replacing B-tree with a model:

  • smaller indexes: less main-memory or L1 cache storage
  • faster lookup: as a result of smaller indexes
  • more parallelism (TPU), instead of hierarchical if statements as in B-tree.

The key insight here is to trade computation for memory, banking on the trend that computation is getting cheaper (and if you can do it on TPU/GPU you reap more benefits). The evaluation of the paper doesn't even go into using TPUs for this.

The paper includes several strategies to improve the performance of the learned index, including using a recursive model index, hierarchical models, and hybrid models. For evaluation results, please refer to the paper.

Comments

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book