HPTS day 2, part 2

Continuing with our HPTS series. This is now the afternoon of the second day.

The first session was on HTAP and streaming, and the second one on caching. 


Session 7: HTAP and Streaming 

Who cares about HTAP? - Tianyu Li (MIT)

Tianyu argued that while Hybrid Transactional/Analytical Processing (HTAP) showed great promise in 2014, it has failed to make a significant impact in the decade since. Instead, he proposed that the real disruption in the database world is coming from the migration of workloads to the cloud, with companies like Snowflake and Databricks leading the charge. The trend is moving towards virtualization and cloud-native architectures, composing specialized data engines connected by pipelines and streams rather than relying on monolithic HTAP systems. He highlighted the MITDBG projects on developing modern abstractions and architectures for cloud-native data processing.


Amazon Zero-ETL - Gopal Paliwal and Gokul Soundararajan (AWS)

This talk focused on efficient data movement into Amazon Redshift to support real-time operational analytics for modern enterprises. Gopal and Gokul advocated for a "lazy" approach, suggesting that reinventing the wheel is unnecessary. They contrasted their solution with more complex architectures like ByteDance's HTAP system. Instead, Amazon's Zero-ETL design is built on a log abstraction, managing metadata through an Integration Metadata Service (IMS). The system provides end-to-end security, ease of use through a unified experience without separate pipelines, and reliability. Key features include snapshots, change data capture (CDC), and heartbeat monitoring. A streaming server handles CDC from Aurora databases. During Q&A, when questioned about the absence of the 'T' (Transform) in ETL, Gokul clarified that "Zero-ETL" is more of a name, with transformations occurring on the Redshift side of the process.
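The snapshot-plus-CDC pattern they described is generic enough to sketch. Here is a toy Python illustration of the general idea (my own simplification with made-up names, not AWS internals): a snapshot seeds the target table, and change records from the source's log are then applied in order past a low-water mark.

```python
# Generic sketch of snapshot + CDC replay; hypothetical names, not AWS internals.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeRecord:
    lsn: int                  # position in the source's log
    op: str                   # "insert", "update", or "delete"
    key: str
    value: Optional[dict]

def apply_cdc(target: dict, applied_lsn: int, changes: list[ChangeRecord]) -> int:
    """Apply log records newer than applied_lsn to the target; return the new LSN."""
    for rec in sorted(changes, key=lambda r: r.lsn):
        if rec.lsn <= applied_lsn:
            continue                      # already applied (idempotent replay)
        if rec.op == "delete":
            target.pop(rec.key, None)
        else:
            target[rec.key] = rec.value
        applied_lsn = rec.lsn
    return applied_lsn

# Seed from a snapshot taken at LSN 100, then keep applying changes as they stream in.
target_table = {"1": {"status": "new"}}
lsn = 100
lsn = apply_cdc(target_table, lsn, [
    ChangeRecord(101, "update", "1", {"status": "shipped"}),
    ChangeRecord(102, "insert", "2", {"status": "new"}),
])
print(target_table, lsn)
```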



The Future of Data Processing: Unifying Streaming, OLTP, and OLAP through Apache Flink, Kafka, and Iceberg - Eugene Koblov (Confluent)

This talk gave a high-level overview of the Kafka and Confluent ecosystem. I wasn't able to get much interesting content out of the talk.


Consistency in Motion - Chris Douglas (UC Berkeley)

Chris talked about incremental computing in databases, focusing on invertible writes and efficient read updates. The concept of materialized view maintenance was central to his discussion, which explored the possibility of updating all reads cheaply when data changes.

He highlighted the use of conflict-free data structures, specifically z-sets, which originate from the 2009 paper "Reconcilable Differences" by Todd Green, Zachary Ives, and Val Tannen. We had just reviewed a paper that used c-sets from the same paper: Transactional storage for geo-replicated systems.

Chris introduced z-relations, where every table has a hidden weight column, and said that this approach allows sets, bags, and sets of updates (positive/negative) to be represented, making the system indifferent to operation order. He explained that in z-sets, only rows with positive weights are visible when reading data. He talked about the paper "DBSP: Automatic Incremental View Maintenance for Rich Query Languages," particularly focusing on how joins are handled in this framework. This is an interesting direction indeed. It seems like Mihai Budiu, the first author of that paper, has a blog post series on the design/implementation of z-sets in databases.
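To make the weight-column idea a bit more concrete, here is a toy Python sketch of z-sets as I understood them from the talk (my own illustration, not code from DBSP): each row carries an integer weight, an update batch is itself a z-set with positive and negative weights, and an incrementally maintained join only has to process the delta.

```python
# Toy z-set illustration: a z-set maps each row to an integer weight.
# Positive weights are insertions, negative weights are deletions.
from collections import defaultdict

def zset(pairs):
    """Build a z-set from (row, weight) pairs, dropping zero weights."""
    z = defaultdict(int)
    for row, w in pairs:
        z[row] += w
    return {row: w for row, w in z.items() if w != 0}

def zadd(a, b):
    """Combine two z-sets by summing weights; order of application is irrelevant."""
    return zset(list(a.items()) + list(b.items()))

def zneg(a):
    """Negate a z-set (turn insertions into deletions and vice versa)."""
    return {row: -w for row, w in a.items()}

def distinct(a):
    """Reading a z-relation: only rows with positive weight are visible."""
    return {row: 1 for row, w in a.items() if w > 0}

def zjoin(a, b, key_a, key_b):
    """Join two z-sets; an output row's weight is the product of input weights."""
    out = []
    for ra, wa in a.items():
        for rb, wb in b.items():
            if key_a(ra) == key_b(rb):
                out.append((ra + rb, wa * wb))
    return zset(out)

# Base tables and an update batch, all expressed as z-sets.
orders = zset([(("o1", "alice"), 1), (("o2", "bob"), 1)])
delta  = zset([(("o2", "bob"), -1), (("o3", "carol"), 1)])   # delete o2, insert o3
users  = zset([(("alice", "US"), 1), (("bob", "UK"), 1), (("carol", "DE"), 1)])

# Incremental view maintenance idea: join only the delta against the unchanged
# side, then add it to the old view, instead of recomputing the full join.
old_view   = zjoin(orders, users, key_a=lambda o: o[1], key_b=lambda u: u[0])
delta_view = zjoin(delta,  users, key_a=lambda o: o[1], key_b=lambda u: u[0])
new_view   = zadd(old_view, delta_view)

print(distinct(new_view))   # rows visible to a reader of the maintained view
```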


Session 8: Caching

Stateful services: low latency, efficiency, scalability - pick three - Atul Adya (Databricks)

Atul presented a talk on stateful services, focusing on achieving low latency, efficiency, and scalability simultaneously through caching. He said he attended his first HPTS in 1999, and this was only his second HPTS.

He highlighted challenges in caching, particularly in distributed key-value caches like Memcache and Redis, and pointed out performance issues with remote caches, including network latency and overreads. Atul's HotOS'19 paper "Fast key-value stores: An idea whose time has come and gone" explains the motivations nicely.

He then introduced Databricks Dicer (not to be confused with Google's Slicer, which Atul was involved with earlier), a system that aims to make stateful services as easy to build as stateless ones. Dicer has a centralized control plane and a distributed data plane, and is designed to provide auto-sharding without tying it to storage.

He discussed the trade-offs and challenges in implementing a strong consistency model in caching for stateful services. Strong consistency ensures the cache state is never stale, but this requires Dicer's exclusive ownership capability. It offers better read availability but lower write availability, since writes go only through the owner in the cache. There are also challenges in dealing with delayed writes and potential data races during resharding, and Atul had to get pretty familiar with them as he worked on this project.
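To illustrate the trade-off, here is a toy Python sketch of the general exclusive-ownership pattern (my own simplification, not Dicer's actual design): writes are accepted only by the shard's current owner, so cached state at the owner is never stale, but writes become unavailable while ownership is in flux, for example during resharding.

```python
# Toy exclusive-ownership cache; hypothetical names, not Dicer's implementation.
class Shard:
    def __init__(self, owner):
        self.owner = owner        # node currently holding exclusive ownership
        self.cache = {}           # cached state for this shard

class OwnedCache:
    def __init__(self, shards):
        self.shards = shards      # key -> Shard, assigned by the control plane

    def write(self, node, key, value):
        shard = self.shards[key]
        if shard.owner is None:
            raise RuntimeError("ownership in flux (resharding): write unavailable")
        if node != shard.owner:
            raise RuntimeError(f"write rejected: {node} is not the owner of {key}")
        shard.cache[key] = value  # only the owner mutates, so readers never see stale data

    def read(self, key):
        return self.shards[key].cache.get(key)

# Usage: the control plane assigns ownership; writes through non-owners are refused.
shards = {"user:42": Shard(owner="node-a")}
cache = OwnedCache(shards)
cache.write("node-a", "user:42", {"name": "alice"})
print(cache.read("user:42"))

# During resharding the control plane revokes ownership before handing it over,
# which is exactly when write availability drops.
shards["user:42"].owner = None
try:
    cache.write("node-b", "user:42", {"name": "alice2"})
except RuntimeError as e:
    print(e)
```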


Caches Replicate Everything Around Me - Michael Abebe (Salesforce)

Michael channeled the Wu-Tang Clan and started the talk by saying cache rules everything around me (CREAM). He argued that caching is very similar to replication, so why shouldn't the database manage these caches? The database has access to more information and can make more informed decisions. We can view the result cache as a form of replicated materialized view, and employ learned decisions on what/where to cache. So he emphasized that it could be productive to think of caches as partial materialized views.


Building bridges in the cloud: More co-design for efficient and robust disaggregated architectures - Tiemo Bang (Berkeley)

Tiemo discussed a recent paper, "Online Parallel Paging with Optimal Makespan," how it could prevent the thrashing problem, and how it is compatible with every page replacement algorithm. He said he found this paper through a conversation with Copilot and argued that LLMs make theory very accessible. I wasn't able to understand the problem and the model/setup of the paper from Tiemo's description, so I can't comment more on that. I also don't buy the LLM search argument: it is possible to spend some time on Google Scholar and get to relevant papers easily as well. LLMs are still useful for getting a high-level overview of a new terrain.


Caching & Reuse of Subresults across Queries - Alex Hall (Firebolt)

The talk was about caching subresults in the Firebolt data warehouse. I think I was tired and didn't get much out of this talk either.


Coda

After dinner, there was an "LLM + AI + DB Panel". I couldn't stay for more than 10 minutes because the terms and discussion were too vague, and it seemed hard to disagree with anything anybody said due to the vagueness of the topic. I headed to the Director's cottage alongside a dozen people, who were all hands-on veteran systems people. Coincidence?

Wednesday morning after breakfast, I had to leave because the drive to the SFO airport was 2.5 hours, and I had a flight to the East Coast to catch at noon. Return flights suck the most. You are going back home, which is great. But you are tired after pulling long days at the conference, and there is a lack of novelty: you are returning to your regular routine, and you don't have the conference to look forward to during the flight. And you would have already caught up on the recent movies, the good ones at least, during your flight to the conference.

A peculiar trend this HPTS was that maybe half of the people presented their talks in Google Docs/Slides right from their browser. That is interesting; this was almost non-existent before, when it was all PowerPoint and Keynote presentations. I don't know what it is. Maybe people like that they can get feedback/comments on their slides if they use Google Docs/Slides. Or maybe it is part of the trend of the browser taking over the entire computer.

Finally, I was impressed by the dedication of Pat, Shel, Phil, and Mike. They got started in the field before I was born, and after so many decades, they seem to be as excited about these problems (opportunities?) as on the first day. It is always Day 1 with them.
