The Seattle Report on Database Research (2022)
Every 5 years, researchers from academia and industry gather to write a state-of-the-union (SOTU) report on database research. This one was released recently. It is a very readable report; my summary consists of important paragraphs clipped from it. Emphasis mine in bolded sentences. I use square brackets when I paraphrase a longer passage into a more direct statement.
TL;DR: The SOTU is strong (the relational database market alone has revenue upwards of $50B) and growing stronger thanks to the boom in cloud computing and machine learning (ML). For the next 5 years, research should scale up to address emerging challenges in data science, data governance, cloud services, and database engines.
What has changed in the last 5 years
Over the last decade, our research community pioneered the use of columnar storage, which is now used in all commercial data analytic platforms. Database systems offered as cloud services have witnessed explosive growth. Hybrid transactional/analytical processing (HTAP) systems are now an important segment of the industry. Furthermore, memory-optimized data structures, modern compilation, and code generation have significantly enhanced the performance of traditional database engines. All data platforms have embraced SQL-style APIs as the predominant way to query and retrieve data. Database researchers have played an important part in influencing the evolution of streaming data platforms as well as distributed key-value stores. A new generation of data cleaning and data wrangling technology is being actively explored.
The last report identified big data as our field's central challenge. However, in the last five years, the transformation has accelerated well beyond our projections, in part due to technological breakthroughs in machine learning (ML) and artificial intelligence (AI).
The database community has a lot to offer ML users given our expertise in data discovery, versioning, cleaning, and integration. These technologies are critical for machine learning to derive meaningful insights from data. Given that most of the valuable data assets of enterprises are governed by database systems, it has become imperative to explore how SQL querying functionality is seamlessly integrated with ML. The community is also actively pursuing how ML can be leveraged to improve the database platform itself.
As personal data becomes increasingly valuable for customizing the behavior of applications, society has become more concerned about the state of data governance as well as the ethical and fair use of data. Data governance has also led to the rise of confidential cloud computing, whose goal is to let customers leverage the cloud for computation while keeping their data encrypted in the cloud.
Usage of managed cloud data systems, in contrast to simply using virtual machines in the cloud, has grown tremendously. For cloud analytics, the industry has converged on a data lake architecture, which uses on-demand elastic compute services to analyze data stored in cloud storage. This architecture disaggregates compute and storage, enabling each to scale independently. These changes have profound implications on how we design future data systems.
Industrial Internet-of-Things (IoT) deployments have further stress-tested our ability to do efficient data processing at the edge, do fast data ingestion from edge devices into cloud data infrastructure, and support data analytics with minimal delay for real-time scenarios such as monitoring.
Finally, there are significant changes in hardware. With the end of Dennard scaling and the rise of compute-intensive workloads such as Deep Neural Networks (DNN), a new generation of powerful accelerators leveraging FPGAs, GPUs, and ASICs is now available. The memory hierarchy continues to evolve with the advent of faster SSDs and low-latency NVRAM. Improvements in network bandwidth and latency have been remarkable. These developments point to the need to rethink the hardware-software co-design of the next generation of database engines.
Research challenges
The rest of the report describes research challenges under four subsections:
- data science
- data governance
- cloud services
- database engines
ML, IoT, and hardware are cross-cutting themes in these sections.
Data Science
Data science is defined as "processes and systems that enable the extraction of knowledge or insights from structured/unstructured data." From a technical standpoint, data science is about the pipeline from raw input data to insights, which requires data cleaning and transformation, data analytic techniques, and data visualization.
In enterprise database systems, there are well-developed tools to move data from OLTP databases to data warehouses and to extract insights from their curated data warehouses by using complex SQL queries, online analytical processing (OLAP), data mining techniques, and statistical software suites.
Modern data scientists work in a different environment. They heavily use data science notebooks, such as Jupyter, Spark, and Zeppelin, despite their weaknesses in versioning, IDE integration, and support for asynchronous tasks. Data scientists rely on a rich ecosystem of open source libraries such as Pandas for sophisticated analysis, including the latest ML frameworks. They also work with data lakes that hold datasets with varying levels of data quality, a significant departure from carefully curated data warehouses.
Data scientists repeatedly say that data cleaning, integration, and transformation together consume 80%-90% of their time.
Data scientists need to understand and assess the [provenance] of their data and to reason about its impact on the results of their data analysis.
We must develop more effective techniques for discovery, search, understanding, and summarization of data distributed across multiple repositories. Specifically, data profiling, which provides a statistical characterization of data, must be efficient and scale to large data repositories. It should also be able to generate approximate profiles for large datasets at low latency to support interactive data discovery.
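To make approximate profiling concrete, here is a toy Python sketch of my own (not from the report) that estimates per-column statistics from a random sample instead of a full scan. Production systems would use sketches such as HyperLogLog for cardinality rather than the naive lower bound shown here.

```python
import pandas as pd

def approximate_profile(df: pd.DataFrame, sample_frac: float = 0.01) -> pd.DataFrame:
    """Estimate per-column statistics from a random sample instead of a full scan."""
    sample = df.sample(frac=sample_frac, random_state=42)
    rows = []
    for col in sample.columns:
        s = sample[col]
        rows.append({
            "column": col,
            "null_fraction": round(float(s.isna().mean()), 3),  # unbiased estimate of the null rate
            "distinct_in_sample": s.nunique(),  # only a lower bound on true cardinality
            "min": s.min(),  # extremes of the sample, not exact extremes
            "max": s.max(),
        })
    return pd.DataFrame(rows)
```

The point is the tradeoff: a 1% sample gives a profile two orders of magnitude faster, at the cost of approximate answers, which is exactly what interactive data discovery can tolerate.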
Even though popular data science libraries such as Pandas support a tabular view of data through the DataFrame abstraction, their programming paradigms differ from SQL in important ways. [Adapting SQL here can boost programmer productivity, as in RDBMSs.]
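To illustrate the paradigm gap with my own example (not the report's), here is the same aggregation written as eager Pandas method chaining and as declarative SQL that an engine optimizes as a whole:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["alice", "bob", "alice", "carol"],
    "amount": [120.0, 80.0, 45.0, 200.0],
})

# Pandas: imperative method chaining, each step evaluated eagerly.
top = (orders.groupby("customer", as_index=False)["amount"].sum()
             .query("amount > 100")
             .sort_values("amount", ascending=False))

# The equivalent declarative SQL, optimized end-to-end by the engine:
#   SELECT customer, SUM(amount) AS amount
#   FROM orders
#   GROUP BY customer
#   HAVING SUM(amount) > 100
#   ORDER BY amount DESC;
print(top)
```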
ML models can help with metadata management by automating the labeling and annotation of data, such as identification of data types. Another metadata challenge is minimizing the cost of modifying applications as a schema evolves, an old problem for which better solutions continue to be needed. The existing academic solutions to schema evolution are hardly used in practice.
Data governance
All data producers (consumers and enterprises) have an interest in constraining how their data is used by applications while maximizing its utility, including controlled sharing of data.
To implement GDPR and similar data use policies, metadata annotations and provenance must accompany data items as data is shared, moved, or copied according to a data use policy.
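Here is a minimal sketch of what "metadata travels with the data" could look like; the names and policy model are hypothetical, my illustration rather than anything the report prescribes:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GovernedRecord:
    payload: dict                # the data item itself
    source: str                  # where the data originated
    consent_purposes: frozenset  # purposes the data subject consented to
    lineage: tuple = ()          # operations applied so far

def transform(record: GovernedRecord, op_name: str, fn) -> GovernedRecord:
    """Apply a transformation while extending, not dropping, the provenance."""
    return replace(record, payload=fn(record.payload),
                   lineage=record.lineage + (op_name,))

def share(record: GovernedRecord, purpose: str) -> dict:
    """Release the payload only if the stated purpose is covered by consent."""
    if purpose not in record.consent_purposes:
        raise PermissionError(f"purpose {purpose!r} not covered by consent")
    return record.payload

rec = GovernedRecord({"age": 34}, source="signup-form",
                     consent_purposes=frozenset({"analytics"}))
rec = transform(rec, "bucketize-age", lambda p: {"age_bucket": p["age"] // 10})
share(rec, "analytics")      # allowed
# share(rec, "advertising")  # would raise PermissionError
```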
Data privacy includes the challenge of ensuring that aggregation and other data analytic techniques can be applied effectively to a dataset without revealing information about any individual member of the dataset.
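The best-known formal tool for this is differential privacy. As a minimal illustration (mine, not the report's), the Laplace mechanism answers a count query with calibrated noise:

```python
import numpy as np

def private_count(values, predicate, epsilon: float = 0.1) -> float:
    """Differentially private count. A count query has sensitivity 1 (adding or
    removing one person changes the answer by at most 1), so Laplace noise of
    scale 1/epsilon yields epsilon-differential privacy."""
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 29, 61, 45, 52, 38]
print(private_count(ages, lambda a: a > 40))  # noisy count of people over 40
```

Smaller epsilon means more noise and stronger privacy; the research challenge is keeping aggregate answers useful under that noise.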
Cloud services
The movement of workloads to the cloud has led to explosive growth for cloud database services.
In serverless data services, users pay not only for the resources they consume but also for how quickly those resources can be allocated to their workloads. However, today's cloud database systems do not tell users how quickly they will be able to auto-scale (up and down).
Disaggregated architectures must use caching across multiple levels of the memory hierarchy inexpensively and can benefit from limited compute within the storage service to reduce data movement (see "Database Engines"). Database researchers need to develop principled solutions for OLTP and analytics workloads that are suitable for a disaggregated architecture. Finally, leveraging the disaggregation of memory from compute remains a wide-open problem.
The research community can lead by rethinking the resource management aspect of database systems with multitenancy in mind. The range of required innovation here spans reimagining database systems as composite microservices, developing mechanisms for agile response to alleviate resource pressure as demand causes local spikes, and reorganizing resources among active tenants dynamically, all while ensuring tenants are isolated from noisy neighbors.
In an ideal world, on-premises data platforms would seamlessly draw upon compute and storage resources available in the cloud "on-demand." We are far from that vision today even though a single control plane for data split across on-premises and cloud data is beginning to emerge for realizing a hybrid cloud.
Recently we have seen the emergence of data clouds by providers of multi-cloud data services that not only support movement of data across the infrastructure clouds, but also allow their data services to operate over data split across multiple infrastructure clouds.
While auto-tuning has always been desirable, it has become critically important for cloud data services. Studies of cloud workloads indicate that many cloud database applications do not use appropriate configuration settings, schema designs, or access structures. Furthermore, as discussed earlier, cloud databases need to support a diverse set of time-varying multitenant workloads. No single configuration or resource allocation works well universally. A predictive model that helps guide configuration settings and resource reallocation is desirable. Fortunately, telemetry logs are plentiful for cloud services and present a great opportunity to improve the auto-tuning functionality through use of advanced analytics. However, since the cloud provider is not allowed to have access to the tenant's data objects, such telemetry log analysis must be done in an "eyes off" mode, that is, inside of the tenant's compliance boundary.
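As a toy illustration of such a predictive model (my sketch, with made-up telemetry features), one could fit a linear model mapping workload telemetry to a resource setting:

```python
import numpy as np

# Hypothetical telemetry: for each tenant, (queries/sec, working-set GB)
# and the buffer-pool size (GB) that historically met its latency SLO.
telemetry = np.array([
    # qps, working_set_gb, good_buffer_gb
    [100,  2.0,  1.0],
    [400,  8.0,  4.5],
    [900, 16.0, 10.0],
    [250,  4.0,  2.2],
])

X = np.column_stack([telemetry[:, 0], telemetry[:, 1], np.ones(len(telemetry))])
y = telemetry[:, 2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # fit the predictive model

def recommend_buffer_gb(qps: float, working_set_gb: float) -> float:
    """Predict a buffer-pool size for a new workload from its telemetry."""
    return float(np.array([qps, working_set_gb, 1.0]) @ coef)

print(recommend_buffer_gb(600, 12.0))
```

A real system would use richer features and models, and, per the report, would have to train and score entirely inside the tenant's compliance boundary.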
One way to support multitenant SaaS applications is to have all tenants share one database instance with the logic to support multi-tenancy pushed into the application stack. While this is simple to support from a database platform perspective, it makes customization (for example, schema evolution), query optimization, and resource sharing among tenants harder.
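This shared-instance pattern typically boils down to a tenant-id discriminator column, with every query filtered in the application layer. A minimal sketch (my illustration, using SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One shared table for all tenants; a tenant_id column discriminates rows.
conn.execute("CREATE TABLE invoices (tenant_id TEXT, amount REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [("acme", 100.0), ("acme", 250.0), ("globex", 75.0)])

def tenant_total(tenant_id: str) -> float:
    # The application layer must add the tenant filter to every query;
    # the database itself is unaware of tenancy.
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM invoices WHERE tenant_id = ?",
        (tenant_id,)).fetchone()
    return row[0]

print(tenant_total("acme"))  # 350.0
```

The downsides follow directly: per-tenant schema customization is hard because everyone shares one physical schema, and the optimizer and resource manager see one big table rather than per-tenant workloads.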
Database engines
We see an inevitable trend toward heterogeneous computation with the end of Dennard scaling and the advent of new accelerators to offload compute. GPUs and FPGAs are available today, with the software stack for GPUs much better developed than for FPGAs. The progress in networking technology, including adoption of RDMA, is also receiving the attention of the database community. These developments offer the opportunity for database engines to take advantage of stack bypass. The memory and storage hierarchy is more heterogeneous than ever before. The advent of high-speed SSDs has altered the traditional tradeoffs between in-memory systems and disk-based database engines. Engines with the new generation of SSDs are destined to erode some of the key benefits of in-memory systems. Furthermore, the availability of NVRAM may have significant impact on database engines due to its support for persistence and low latency. Re-architecting database engines with the right abstractions to explore hardware-software co-design in this changed landscape, including disaggregation in the cloud context, has great potential.
There is an ongoing debate between two schools of thought: (a) Distributed transactions are hard to process at scale with high throughput and availability and low latency without giving up some traditional transactional guarantees. Therefore, consistency and isolation guarantees are reduced at the expense of increased developer complexity. (b) The complexity of implementing a bug-free application is extremely high unless the system guarantees strong consistency and isolation. Therefore, the system should offer the best throughput, availability, and low-latency service it can, without sacrificing correctness guarantees. This debate will likely not be fully resolved anytime soon, and industry will offer systems consistent with each school of thought.
Instead of a traditional setting where data is ingested into an OLTP store and then swept into a curated data warehouse through an ETL process, perhaps powered by a Big Data framework such as Spark, the data lake is a flexible storage repository. Subsequently, a variety of compute engines can operate on data of varying quality, to curate it or execute complex SQL queries, and store the results back in the data lake or ingest them into an operational system. Thus, data lakes exemplify a disaggregated architecture with the separation of compute and storage. An important challenge for data lakes is finding relevant data for a given task efficiently.
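A toy sketch of this pattern (my illustration, with a local directory standing in for cloud object storage; in practice the files would be Parquet on S3 or ADLS and the engines would be Spark, Trino, and the like):

```python
import pandas as pd
from pathlib import Path

lake = Path("/tmp/lake")              # stand-in for a cloud storage bucket
lake.mkdir(parents=True, exist_ok=True)

# An ingestion job drops raw files into the lake with no upfront schema gate.
pd.DataFrame({"sensor": ["a", "a", "b"], "temp": [21.5, 22.1, 19.8]}) \
  .to_csv(lake / "readings.csv", index=False)

# Later, an independent compute engine (here just pandas) attaches to the
# same storage, curates the data, and writes a derived dataset back.
raw = pd.read_csv(lake / "readings.csv")
curated = raw.groupby("sensor", as_index=False)["temp"].mean()
curated.to_csv(lake / "avg_temp_by_sensor.csv", index=False)
```

Storage and compute never share a machine or a lifetime here, which is the essence of the disaggregation the report highlights.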
ML models that are invoked as part of inferencing must be treated as first-class citizens inside databases. ML models may be browsed and queried as database objects, and database systems need to support popular ML programming frameworks.
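One way to approximate this today is to register a model as a user-defined function so inference can be invoked from SQL. A toy sketch (mine, using SQLite UDFs and a hand-written stand-in for a trained model):

```python
import sqlite3

# Stand-in "model": a real system would load a trained model from a registry.
def churn_score(logins_last_30d: int) -> float:
    return max(0.0, 1.0 - logins_last_30d / 30.0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, logins_last_30d INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", 25), ("bob", 2)])

# Register the model as a SQL function so inference runs inside the query.
conn.create_function("churn_score", 1, churn_score)
for row in conn.execute("SELECT name, churn_score(logins_last_30d) FROM users"):
    print(row)
```

What the report calls for goes further: models as catalogued, versioned, optimizable database objects rather than opaque callbacks.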
Database systems can systematically replace "magic numbers" and thresholds with ML models to auto-tune system configurations. Availability of ample training data also provides opportunities to explore new approaches that take advantage of ML for query optimization or multidimensional index structures, especially as state-of-the-art solutions to these problems have seen only modest improvements in the last two decades.
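The learned-index line of work is a good example: replace a B-tree's search path with a model that predicts a key's position, then correct within the model's error bound. A toy sketch of the idea (my illustration, not code from the report):

```python
import bisect
import numpy as np

# Fit a linear model mapping sorted keys to their array positions.
keys = np.sort(np.random.default_rng(0).integers(0, 1_000_000, size=10_000))
positions = np.arange(len(keys))
slope, intercept = np.polyfit(keys, positions, deg=1)
# Worst-case prediction error bounds the local search window.
max_err = int(np.max(np.abs(slope * keys + intercept - positions))) + 1

def lookup(key: int) -> int:
    guess = int(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    # Binary search only within the model's error bound.
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)
    return i if i < len(keys) and keys[i] == key else -1

print(lookup(int(keys[1234])))  # finds the key's position
```

If the key distribution is learnable, the model plus a small error bound can beat the cache misses of a full tree traversal, which is precisely the "replace a hand-tuned structure with a model" bet.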
Existing benchmarks (for example, TPC-E, TPC-DS, TPC-H) are very useful but do not capture the full breadth of our field, for example, streaming scenarios and analytics on new types of data such as videos. Moreover, without the development of appropriate benchmarks and data sets, a fair comparison between traditional database architectures and ML-inspired architectural modifications to the engine components will not be feasible. Benchmarking in the cloud environment also presents unique challenges since differences in infrastructure across cloud providers make apples-to-apples comparisons more difficult.
MAD Questions
It is time to revive my MAD questions section.
0. Which of these research challenges most resonates with you?
For me it is the disaggregated architecture and distributed protocols required for it.
1. What do you think about the consistency debate?
Which do you prefer: reducing consistency at the expense of increased developer complexity, or providing the best throughput and availability within the constraints of the highest level of consistency?
My answer is yes. Let's determine the sweet spot in consistency and provide that as the default, and give options to go tighter or looser as desired.
2. What are the things this SOTU misses?
The report does not call out energy-efficiency and greenness. I think that is a big miss. I also think there should be some discussion of security given the emphasis on multitenancy.
3. What's the next big thing that the report does not anticipate?
That is a million dollar (is that updated to a billion yet?) question, isn't it? The report talks about fully-managed serverless as a research area, but does not specifically call for developing a fully-managed, serverless, disaggregated, cloud-native, distributed-SQL database. I guess this would be a sum of all the parts, but often the whole is more than the sum of the parts.
4. Why is there no equivalent report on distributed systems research?
If there were one, what entries would we expect there? I think there would be a big overlap: serverless, disaggregation, multitenancy, consistency, and rearchitecting systems to take modern hardware into account.