Towards a Standard for JSON Document Databases
Despite the ubiquity of the MongoDB aggregation framework, it has been lacking a formal mathematical framework/specification. This paper aims to fix this gap by providing a theoretical foundation, and proposes MQuery. The formalization in MQuery is largely based on the paper published at ICDT 2018 (for which the first author is involved), extending it to include more pipeline operators, relax the assumption that the JSON documents stored in the database comply to a predefined schema, and allow objects that are either ordered or unordered sets of key-value pairs.
Motivation
For decades, SQL proponents have flaunted the rigorous mathematical foundation of relational algebra (courtesy of Edgar Codd). The world of JSON document databases, however, has remained a bit of a Wild West in comparison. The analogy is apt because, like the frontier, there is immense opportunity here. JSON is the undisputed king of data exchange, and the MongoDB aggregation framework has emerged as the widely adopted query language for JSON collections. Thanks to its expressive pipeline model, massive developer base, and popularity, MongoDB aggregation framework has effectively become the de facto standard for querying JSON. The fact that major vendors (including Amazon, Microsoft, Oracle, and Google) seek to provide compatibility with the MongoDB API further underscores its recognition as a common lingua franca. (The authors' words, not mine, so don't think I'm bragging on behalf of MongoDB.)
To further motivate the need for a rigorous mathematical framework, the authors highlight current challenges. They argue that MongoDB's semantics are procedural rather than declarative. While the aggregation pipeline is pragmatic and powerful, its documentation often overlooks edge cases, leading to ambiguity.
The paper illustrates this with an example about query predicates. In MongoDB, the query origin: "UK" matches a document where origin is the string "UK". However, it also matches a document where origin is the array ["UK", "Japan"]. While this loose equality is convenient for developers, it is bad for mathematical logic, as it violates the property of transitivity: ["UK"] matches "UK", and "UK" matches "UK", yet [["UK"]] does not match "UK".
Furthermore, the paper argues MongoDB suffers from path polysemy. A path like origin.country is inherently ambiguous. Does it refer to a nested field in a single object? Or, if origin is an array, does it refer to the country field of every object inside that array? This leads to data-dependent behavior, where a valid query might throw a runtime error simply because a new document with a different structure was inserted into the collection.
MQuery
MQuery (which, admittedly, looks a lot like McQuery, and now that I've said this, you won't be able to read it any other way) serves as a formalized abstraction of the MongoDB language. MQuery formalizes the data model using "d-values" (document values), which encompass literals, arrays, and objects. It also defines 7 core pipeline stages that mirror the MongoDB aggregation framework: $match, $unwind, $project, $group, $lookup, $graphLookup, and $union.
By formalizing these stages, the authors confirm in Section 4 that "the MongoDB aggregation framework is very expressive: at least as expressive as full relational algebra (RA)". They mention:
- the match, unwind, project and group fragment captures RA over a single relation
- the match, unwind, project, group, and lookup fragment captures arbitrary RA
They demonstrate that $match corresponds to selection, $project to projection, and $lookup (or a combination of $unwind and $group) to joins. This confirms that document databases can theoretically perform every operation relational databases can, including complex joins and set operations. They also note the MongoDB aggregation framework goes beyond RA by handling Nested Relational Algebra and linear recursion via $graphLookup.
The Payoff: Algebraic Optimization
Why does all this math matter? The formal definition can help us safely optimize queries. The final section of the paper demonstrates algebraic rewriting rules. Thanks to the formal definitions, the authors can prove when it is safe to reorder pipeline stages without altering the result. They provide rules for filter anticipation (moving $match earlier to reduce data volume), unnesting postponement (moving $unwind later to save memory), and join optimization.
Comments