Tuesday, August 21, 2018

Logical index organization in Cosmos DB

This post zooms into the logical indexing subsystem mentioned in my previous post on "Schema-Agnostic Indexing with Azure Cosmos DB". 

With the advent of big data, we face a big data-integration problem. It is very hard to enforce a schema (structure/type system) on data, and irregularities and entropy is a fact of life. You will be better off if you accept this as a given, rather than pretend you are very organized, you can foresee all required fields in your application/database, and every branch of your organization will be disciplined enough to use the same format to collect/store data.

A patch employed by relational databases is to add sparse new columns to accommodate for possibilities and to store a superset of the schemas. However, after you invoke an alter-table on a big data set, you realize this doesn't scale well, and start searching for schema-agnostic solutions.

Achieving schema agnosticism

As we discussed in the previous post, JSON provides a solution for easier schema management. JSON's type system is simple and lightweight (in contrast to XML) and is self-documenting. JSON supports a strict subset of the type systems of Javascript and many modern programming languages. Today it is the lingua franca of Internet and is natively supported in most languages.

Using JSON helps NoSQL datastores to operate without a schema for data ingestion purposes and for possible application changes in the future. But this doesn't automatically make them fully schema agnostic. For querying the database, those solutions still require a schema. The user is typically asked to provide indexing fields, and the queries are performed on those indexes.

The Cosmos DB approach to achieving full schema agnosticism is by automatically indexing everything upon data ingest and allowing the users to query for anything without having to deal with schema or index management.

The question then becomes: what is a good indexing structure to solve the fully schema-agnostic querying problem?

Be the tree

Relational databases have been doing indexing for half century, but indexing there is highly optimized for relational schema databases and has limitations. Often a B-tree index per column is employed. While this achieves very fast reading and querying performance, it becomes inadequate for high volume writes on big data. Newly inserted data would need to be indexed for each column in the schema using B-trees and will cause write amplification problems. A newly inserted column or change in schema would lead to updating all the leafs.

Instead of creating an index tree for each column, Cosmos DB employs one index for the whole database account, i.e., the Cosmos DB container (e.g., a table, a collection or a graph). This one-index-tree-to-rule-them-all grows as new documents get added to the container. Since the schema variance is often not  very wild, the number of shared paths over intermediate schema nodes remain small compared to the number of leaf nodes (instance values). This way the index tree achieves efficiency for updates upon new data/schema insertion and enables for searching (range or point query) for any arbitrary schema or value in the container.

I try to explain how this works in the rest of the post. First I'd like to clarify that we restrict ourselves to the logical organization of the indexing, and don't go down the stack to discuss physical organization of the index structures. At the logical layer, we don't have to think about the various B-tree implementations in the physical layer: we will just treat the index organization as a sorted-map structure (e.g., as in sorted map in Java). At the physical organization layer, to score even more efficiency optimizations, Cosmos DB employs the Bw-tree data structure implementation of this logical index on flash/SSDs. There are many other efficient implementations of B-trees for different storage devices and scenarios based on write-ahead-log and log-structure-merge-tree ideas.

I would like to thank Shireesh Thota at Cosmos DB for giving me a crash course on the logical indexing topic. Without his clear explanations, I would be grappling with these concepts for a long long time.

Logical indexing

In our previous post, we discussed how the tree representation of the JSON documents allows the database engine to treat the structure of the document as well as the instance values homogeneously.

We also introduced the index tree that is constructed out of the union of all of the trees representing the individual documents within the container. Each node of the index tree is an index entry containing the label and position values, called the term, and the ids of the documents containing the term, called the postings.

The logical index organization

For cost effective persistence and lookup, the index tree needs to be converted into a storage efficient representation.  At the logical indexing layer, CosmosDB maps the paths in the index tree to key-value tuples. The value consists of the postings list of the encoded document (or document fragment) ids. The key consists of the term representing the encoded path information of the node/path in the index tree, concatenated with a posting entry selector (PES) that helps partition the postings horizontally.

This way, the terms are mapped to the corresponding Doc IDs (i.e., postings) containing them. The resulting sorted map enables the query processing to identify the documents that match the query predicates very quickly. On this sorted map, byte compare is employed for enabling range queries. There is also a reverse path representation to enable efficient point queries. As we'll see below, the logical indexing has a direct impact on what kind of queries the database can support.

We discuss how Cosmos DB represents the terms and the postings efficiently in the next two sections.

Representing terms efficiently

Cosmos DB uses a combination of partial forward path representation for paths to enable range querying support, and  partial reverse path representation to enable equality/hash support.

The terms for forward paths are byte encoded to be able to enable range queries such as SELECT * FROM root r WHERE r.Country < "Germany". Yes, you read that right, you can compare at the string level, because strings are byte-encoded to allow that.

The terms for the reverse paths are hash encoded for efficient point querying such as SELECT * FROM root r WHERE r.location[0].country = "France".

Finally, the path representations also allow wild card queries such as SELECT c FROM c JOIN w IN c.location WHERE w = "France". This is achieved by bunching the forward and backward paths always in 3 segments, such as location/0/city and 0/city/"Paris" rather than using the full path $/location/0/city/"Paris".  This is like the the n-gram idea the search engines use. This also reduces the storage cost of the index.

Partial forward path encoding scheme. To enable efficient range and spatial querying, the partial forward path encoding is done differently for numeric and non-numeric labels. For non-numeric values, each of the 3 segment paths are encoded based on all the characters. The least significant byte of the resultant hash is assigned for the first and second segments. For the last segment, lexicographical order is preserved by storing the full string or a smaller prefix based on the precision specified for the path.
For the numeric segment appearing as the first or second segments, a special hash function is applied to optimize for the non-leaf numeric values. This hash function exploits the fact that most non-leaf numeric values (e.g. enumerations, array indices etc.) are frequently concentrated between 0-100 and rarely contain negative or large values. A numeric segment occurring in the third position is treated specially: the most significant n bytes (n is the numeric precision specified for the path) of the 8 byte hash are applied, to preserve order.

Partial reverse path encoding scheme.  To enable point querying, the term generated in the reverse order, with the leaf having higher number of bits in the term, placed first. This scheme also serves wildcard queries like finding any node that contains the value "Paris", since the leaf node is the first segment.

Representing posting lists efficiently

The postings list captures the document ids of all the documents which contain the given term. The posting list is bitmap compressed for efficient querying/retrieval as well. In order to represent a postings list dynamically (i.e. without a fixed sized/static scheme or pre-reserved space), compactly and in a manner amenable to computing fast set operations (e.g., to test for document presence during query processing), Cosmos DB uses the below two techniques.

Partitioning a postings list. Each insertion of a new document to a container is assigned a monotonically increasing document ID. The postings list for a given term consists of a variable length list of postings entries partitioned by postings entry selector (PES). A PES is a variable length (up to 7 bytes), offset into the postings entry. The number of PES bytes is a function of the number of documents in a container. The number of postings entries --for a given size of a PES-- is a function of document frequency for the document id range which falls within the PES range. Document ids within 0-16K will use the first postings entry, document ids from 16K-4M will use the next 256 posting entries, and so on. For instance, a container with 2M documents will not use more than 1 byte of PES and will only ever use up to 128 postings entries within a postings list.

Dynamic encoding of posting entries. Within a single partition (pointed by a PES), each document needs only 14 bits which can be captured with a short word. However, Cosmos DB also optimizes this. Depending on the distribution, postings words within a postings entry are encoded dynamically using a set of encoding schemes including (but not restricted to) various bitmap encoding schemes inspired primarily by WAH (Word-Aligned Hybrid). The core idea is to preserve the best encoding for dense distributions (like WAH) but to efficiently work for sparse distributions (unlike WAH).

Customizing the index

The default indexing policy automatically indexes all properties of all documents. Developers can choose certain documents to be excluded or included in the index at the time of inserting or replacing them to the container. Developers can also choose to include or exclude certain paths (including wildcard patterns) to be indexed across documents.

Cosmos DB also supports configuring the consistency of indexing on a container.

Consistent indexing is the default policy. Here the queries on a given container follow the same consistency level as specified for the point-reads (i.e. strong, bounded-staleness, session or eventual). The index is updated synchronously as part of the document update (i.e. insert, replace, update, and delete of a document in a container). Consistent indexing supports consistent queries at the cost of possible reduction in write throughput. This reduction is a function of the unique paths that need to be indexed and the consistency level. The consistent indexing mode is designed for "write quickly, query immediately" workloads.

To allow maximum document ingestion throughput, a container can be configured with lazy consistency; meaning queries are eventually consistent. The index is updated asynchronously when a given replica of a container's partition is quiescent. For "ingest now, query later" workloads requiring unhindered document ingestion, the lazy indexing mode is more suitable.

MAD questions

1. Is this too specialized information?
I am a distributed systems/algorithms person. Logical indexing is a specialized database topic. Does understanding this help me become a better distributed systems researcher?

I would argue yes.  First of all, developing expertise in multiple branches, being a Pi-shaped academician, provides advantages. Aside from that, learning new things stretches your brain and makes it easier to learn other things.

2. How is filtering done within a document?
Cosmos DB represents documents also as binary encodings for efficient storage and querying. When a query returns documents that match the query predicates, instead of filtering records inside the document, Cosmos DB uses the binary encoding features and performs byte-compares to skim within the document quickly to jump/skip over irrelevant parts quickly. A lot of deduplication is also employed at these encoding. In the coming weeks, I may delve in to the physical organization of the index and documents, but I need to track down another expert to help me with that.

For topics that are too outside of my expertise it is very helpful to get a first introduction from an expert. Learning from Shireesh was very fun. An expert makes even the most complicated topics look easy and understandable. This is an interesting paradigm shift which you will have sometime if you haven't already: When you don't understand a topic, often the problem is, it is not presented very competently. The corollary to this epiphany is that if you are unable to explain something simply and in an accessible way, you haven't mastered it yet.

No comments:

Two-phase commit and beyond

In this post, we model and explore the two-phase commit protocol using TLA+. The two-phase commit protocol is practical and is used in man...