HPTS'24 Day 1, part 1

Wow, what a week that was! The two days of HPTS (Monday and Tuesday this week) felt like a week to me. I learned a lot, had a lot of good conversations, and even managed to squeeze in some beach walks.

HPTS has been operating since 1985, convening roughly every two years. It has been described as Davos for database systems. Pat Helland has been running HPTS since 1989, and as usual he kicked it off with some short opening remarks at 8:30am Monday morning.

Pat reminisced about the early HPTSs, which discussed cutting-edge work on scalable computing, punching past 100 transactions per second on a mainframe! Then he quickly set the context for the workshop. This event is about meeting people. There are 130 people, 20+ of them students. He said "your job is to meet everyone, HPTS is about the connections you make". The HPTS website also emphasizes this: "HPTS is about the breaks. The presentations punctuate the breaks! HPTS exists to promote community and relationships. This community comprises people that share a common interest in scalable systems and all their challenges. We emphasize discussion during breaks and deliberately seek out presentations that spark thought-provoking controversy."

Justin Levandoski (Google) was the program committee chair, and he introduced the keynote speaker.


Keynote: Building the Future: A view into OpenAI’s Products, Real-World Impact, and Systems Behind the Magic - Srinivas Narayanan (OpenAI)

Srinivas is the VP of engineering at OpenAI. He has been the person responsible for the technical side of products such as ChatGPT. He is a former database person from IBM Almaden and U Wisconsin.

His presentation covered three points: the current state of ChatGPT, AI+data, and the future outlook. The presentation did not introduce any new concepts or insights about OpenAI and the future of LLMs and ML. I didn't see Srinivas after his keynote; they must be pretty busy at OpenAI.

Current state of ChatGPT

The presentation kicked off by showing the progression of model intelligence on the MMLU (Massive Multitask Language Understanding) benchmark. In 2019, GPT-2 scored 25 out of 100. Now, o1 scores 90 out of 100, where humans score 89.4 on the MMLU benchmark. (Here is a view of MMLU from another source.) Srinivas also mentioned how ChatGPT was crushing it on math competitions, coding, and PhD-level science problems.

ChatGPT was released on Nov 30, 2022, less than 2 years ago! They were not anticipating much press, and did a silent release, thinking that mostly researchers would use it in the initial months. But it became massively popular quickly. Srinivas attributes this to the creation of a completely new interface to computing.

The vision of ChatGPT is to provide an AI superassistant for everyone. He mentioned a 40-50% reduction in the time to complete tasks, including coding, writing, etc.

He said Coca-Cola is experimenting with ChatGPT for image generation for advertising.

Moderna, a big customer, has made them focus on how to get AI to interface with internal documents. Moderna uses 750 custom GPTs. A popular one is Dose ID, which helps with exploring what dosage level to use for a drug. Doing this manually would take a lot of research time.

Srinivas said another big theme is multimodality. He showed a video of a person interacting with ChatGPT on the phone via audio, asking it to evaluate his dad jokes. The emotion in the voice from ChatGPT is impressive. There is also the Be My Eyes app, which takes advantage of multimodality.

Khan Academy is using AI to help teach people, and to provide in-context advice for personalized teaching.

Srinivas said they take safety seriously, and that safety work is getting easier for them because as the models improve (like the o1 release), they are able to reason better.

I am a bit surprised to see people calling the models capable of reasoning, but I think their POV is that if it looks like a duck and quacks like a duck, they can call it a duck. Reasoning is defined as the process of drawing logical conclusions from available information, applying rules, or inferring outcomes. It involves using knowledge, logic, and sometimes abstraction to solve problems or make decisions. LLMs can mimic reasoning to some extent by identifying patterns in vast amounts of data and predicting the next likely sequence of words: they rely on probabilistic associations learned from their training data, and get good mileage out of that. In other words, LLMs simulate reasoning through pattern recognition rather than through true logical or conceptual understanding. But if it works, no need to get stuck on semantics too much.

AI + data

The talk emphasized in-context learning: learning from examples and being able to generalize to new contexts. It is really powerful to be able to provide context via retrieval search. Another useful technique is fine-tuning. And if fine-tuning still isn't cutting it, the next step is custom training to change the model itself to suit your purpose. This is costly, so it is done infrequently.
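
To make the retrieval idea concrete, here is a minimal sketch (mine, not OpenAI's) of providing context via retrieval: fetch the most relevant snippets for a question and prepend them to the prompt, so the model can use them in-context without any retraining. The toy word-overlap scoring stands in for real vector search, and the corpus entries are invented.

```python
def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy relevance score: word overlap with the question.
    q = set(question.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:k]

def build_prompt(question: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(question, corpus))
    return f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"

corpus = ["Dose ID helps explore dosage levels for a drug.",
          "Be My Eyes narrates scenes for visually impaired users.",
          "MMLU is a multitask language understanding benchmark."]
print(build_prompt("What does Dose ID do?", corpus))
```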

Srinivas emphasized an important feature: tool use. The models can learn to use tools: call functions, write SQL, etc. This really helps to reduce hallucinations, because you restrict the output to a set of known/deterministic functions.
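
A hedged sketch of why this helps (this is not OpenAI's actual API; the tool, registry, and JSON format are invented for illustration): the model only names a tool and its arguments, and execution is restricted to a registry of known, deterministic functions.

```python
import json

def get_order_status(order_id: str) -> str:       # hypothetical known tool
    return f"order {order_id}: shipped"

TOOLS = {"get_order_status": get_order_status}    # whitelist of functions

def dispatch(model_output: str) -> str:
    call = json.loads(model_output)               # the model proposes a call
    fn = TOOLS.get(call["name"])
    if fn is None:                                # refuse invented tools
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])                # deterministic execution

print(dispatch('{"name": "get_order_status", "arguments": {"order_id": "42"}}'))
```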

There are several open questions, of course. What does intelligence on structured data mean? Another pragmatic question: can you leverage off-peak GPUs for something productive?

There is also a lot of systems work needed for scaling inference: KV caching, batching, quantization, prompt vs. sampling costs, latency vs. throughput tradeoffs, and demand variance.

Future

There wasn't much meat in this section. Obviously scaling laws will shape the future; there is a lot of work going into training, and we will need to see whether this can keep scaling.

With o1, OpenAI is now using more GPU at inference time as well, as the models generate a chain of thought. Reinforcement learning could be more useful here as well.

Srinivas mentioned multimodality again and said that with video generation things will get really interesting.

The vision was mentioned again: each of us could have a personal assistant, and this may lead to the development of higher-order complex systems in the future.

Q&A session

Data Warehouse Querying Issues: Mike Stonebraker mentioned challenges using ChatGPT for querying a data warehouse with 14 tables. Initial results had 0% accuracy, improving to 30% with tweaks, but complex joins brought the accuracy back to 0%. He argued that such models aren't ready for enterprise data warehouse tasks and MIT can't afford the resources for custom training. He suggested these tools might assist with query-writing rather than doing it entirely.

In reply, Srinivas said that these tools will help the employees hired to write those queries become more productive. He said the ChatGPT effort is still very young, and will improve.

Structured Data: Karsten asked why ChatGPT isn’t optimized for structured data. The speaker replied that it’s due to opportunity cost, implying that focusing on other areas yielded better returns.

Verifiability gap: A participant raised concerns about the difficulty of verifying complex outputs (e.g., SQL queries with joins). Srinivas acknowledged the challenge but expressed optimism about significantly reducing this gap, though maybe not eliminating it entirely.

Emergent Capabilities: There was a question about emergent capabilities as models scale. Srinivas said they can't fully predict these emergent properties, but they do shape the model's development by steering it through tasks like math problems to improve reasoning. These efforts are a big part of training a new model.

Costs: Phil Bernstein inquired about the reduction in training and inference costs. Srinivas noted that training costs depend on hardware improvements, but inference costs have dropped tenfold in a year due to model efficiency, quantization, and architectural improvements.

Overwhelming Output: Amol Deshpande from UMD mentioned the overwhelming volume of output generated by these models and asked about bridging the gap to more manageable and verifiable systems. The speaker acknowledged the issue and repeated the verifiability gap answer.

SearchGPT: In response to a question, this was described as an opportunity to build a product, but not as a direct competitor to Google.


Student Intros - Ippokratis Pandis

Ippo gave each student 1 minute to introduce themselves and mention their research topics and hobbies. Here are random/amusing tidbits from this session.

quarter life crisis (WTF! is that a thing?), magpie:htap, escaped from andy pavlo, joined apple in august, 3 days hiking in big sur, modern txn processing, automatically rewriting distributed protocols, hdfs, now working on incremental computation, asterixdb, big active data system, PL guy at berkeley, pickleball, k-pop, sam madden's student, htap workloads, abstractions for the modern cloud, large scale spatio-temporal databases, data movement on modern hardware, data systems for knowledge intensive AI workloads, algebraic properties in distributed data systems, thinking about computers podcast, lu.ma/sf-distsys, focus on data pipeline, natassa's 30th phD graduate, approx analytics & modern systems, integrating graph processing in relational query engines, duckpgq, competitive ballroom dancing, automatically designing cloud data infrastructures, DBOS, scalable oltp over RDMA.


Session 2: Evolution of the Cloud and Databases

With the keynote from OpenAI, and with LLMs at the top of their hype cycle, you would think that the buzz at HPTS would be about machine learning, right? Wrong! Most of the talk and excitement was about hardware-software codesign, hardware accelerators, and kernel bypass for databases. This session, as well as the session after lunch, gave many examples of this line of thinking.


The "Hyper-model" Database - Chris Taylor (Google)

This was among my favorite talks in the workshop. It gave a cool analogy and symbol for existing system architectures, fruit jello, and from there proposed a better-integrated system architecture through the use of graph database ideas.

Chris is a Google Fellow and tech lead on operational databases, including Bigtable, Spanner, and most recently Spanner Graph. He started the talk with a disclaimer, saying that there are very few new ideas here, and that someone probably did parts of this somewhere before. The goal is not to be novel but to build a simple, robust, maintainable solution.

Ok, what is the fruit jello design pattern? It is the emergent and dominant design pattern for most applications. Think of different databases (e.g., Postgres, MongoDB, Redis, MySQL) as fruits. Why so many? Because each has different strengths, and your data ecosystem has a bit of everything everywhere. What about the jello? The pub/sub and ETL systems (e.g., Kafka, ZeroMQ) are the jello that binds these databases together.

Fresh Berry Jello Fruit Cake

The fruit jello model has several strong points. It is easy to add a new database; it is possible to scale by sharding the tables or the use cases, and by adding serving layers. It provides some isolation: you can protect critical bits and simplify budgeting.

But it also has many disadvantages. It is hard to operate: you need to deal with out-of-sync schemas and out-of-quota problems. It is hard to maintain and evolve: there are too many systems, and they are expensive to change. Security is problematic: "Where does the data go? Do deletions work?" Consistency is problematic due to pipeline delays and user confusion. And there are performance problems in the form of end-user latency.

The hyper-model database is proposed to address these. It can deal with diverse data models better, it can scale, and it provides isolation rather than just relying on being firewalled.

The insight is that the overhead of moving data across systems exceeds the overhead of crossing databases/data models within a single system (including the associated organizational overhead). The approach is to use a graph database (Spanner Graph) to achieve integration of the different databases in your organization. Spanner Graph leverages Spanner, with edge labeling and the like, to implement the property graph approach and provide a GQL (Graph Query Language) API.

Chris showed examples of embedding GQL in SQL. He talked a bit about the performance of Spanner Graph in vectorizing some joins, and about recursive join optimizations. He also talked about what worked in the Spanner Graph implementation: AST mapping from graph to relational was natural, vectorized execution was awesome even in the non-relational setting, PAX worked, query optimization worked, and finally SQL as a bridge worked like magic. SQL has good support even for pgvector, and Bigtable launched a SQL extension.
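
To illustrate why that graph-to-relational mapping is natural, here is a toy sketch (my invention; the table and column names are hypothetical, and the real Spanner Graph surface is much richer) that compiles a single edge pattern into a relational self-join:

```python
def match_to_sql(src_label: str, edge_label: str, dst_label: str) -> str:
    # (src)-[edge]->(dst) becomes: nodes JOIN edges JOIN nodes.
    return (
        "SELECT s.id, d.id\n"
        f"FROM nodes s JOIN edges e ON e.src = s.id AND e.label = '{edge_label}'\n"
        "JOIN nodes d ON d.id = e.dst\n"
        f"WHERE s.label = '{src_label}' AND d.label = '{dst_label}'"
    )

print(match_to_sql("Person", "OWNS", "Account"))
```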

Was this approach a good choice? Chris said skin-deep API transactions would have been a mess. He added that for the queries they are targeting, they are competitive.


Teaching Elephants to Tell Time - David Wein (AWS)

This was another of my favorite talks, because it packed a lot of distributed systems content into barely 30 minutes.

David is a senior principal technologist on AWS Aurora, and previously worked on Aurora PostgreSQL. He talked about transactions in the Aurora Limitless database, emphasizing how they leverage tight time synchronization for consistency and isolation.

The isolation guarantees in question here are read committed and repeatable read (which is snapshot isolation in Postgres), plus external consistency coming from time sync. He said no one asks for serializable, but they could do it in this architecture if there were ever demand.

What do we want out of Aurora Limitless? Limitless write scaling, no single transaction broker, Postgres transaction semantics even for multi-shard transactions, and single-system-like performance for well-sharded access patterns.

The key building blocks are: 4/6 quorum Aurora storage; bounded clocks from Amazon TimeSync; embedded commit time, deeply integrating commit timestamps into PostgreSQL's core functionality; time-based PostgreSQL snapshots, modifying PostgreSQL to create snapshots based on precise timestamps rather than transaction IDs; and finally time-aware two-phase commit (2PC), a distributed commit protocol that leverages precise time synchronization.

Let's dive in. 

Amazon TimeSync: Provides microsecond-accurate time synchronization. It offers a clockbound with *earliest* and *latest* timestamps, guaranteeing the actual time "now" falls within these bounds. The new architecture, implemented on Nitro cards, achieves <50 microsecond uncertainty. This will get even better in the future, and will be deployed in all regions. TimeSync is open to customers and free to use as part of their deployments.

TimeSync combines with a Hybrid Logical Clock (HLC) to create a robust clock: max{clockbound.latest, highest_time_seen_from_other_node} + local_event_counter.

Commits require three-AZ durability and clockbound.earliest > commit_time, ensuring global read-after-write consistency and external consistency.
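
Here is a minimal sketch of that robust clock and the commit-wait rule (my reconstruction from the talk, not AWS code; the uncertainty constant and all names are invented):

```python
import time

class BoundedClock:
    """Stand-in for TimeSync's clockbound: returns (earliest, latest) in us."""
    UNCERTAINTY_US = 50   # <50 microsecond uncertainty, per the talk

    def now_bounds(self) -> tuple[int, int]:
        now_us = time.time_ns() // 1_000
        return now_us - self.UNCERTAINTY_US, now_us + self.UNCERTAINTY_US

class HLC:
    def __init__(self, clock: BoundedClock):
        self.clock = clock
        self.max_seen = 0   # highest time seen from other nodes
        self.counter = 0    # local event counter

    def observe(self, remote_ts: int) -> None:
        self.max_seen = max(self.max_seen, remote_ts)

    def now(self) -> int:
        _, latest = self.clock.now_bounds()
        self.counter += 1   # simplified: real HLCs reset this as time advances
        return max(latest, self.max_seen) + self.counter

def commit(hlc: HLC) -> int:
    commit_ts = hlc.now()
    # ... replicate to 4/6 quorum storage across three AZs here; the wait
    # below overlaps with that I/O, so it rarely adds latency in practice ...
    while hlc.clock.now_bounds()[0] <= commit_ts:   # commit wait: ack only
        pass                                        # once earliest > commit_ts
    return commit_ts

print(commit(HLC(BoundedClock())))
```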

This gives a P50 commit latency of 1.5ms, and by the time the I/O operations return, the time uncertainty is already resolved, so clock synchronization doesn't introduce a performance bottleneck. A 60-shard cluster can perform 2 million commits per second without blocking.

So, what does time sync buy us? Atomic multi-shard writes, simultaneous commits across shards, and consistent point-in-time restore capabilities.

Now for the Postgres part of the system. Aurora Limitless adapts PostgreSQL's MVCC (Multi-Version Concurrency Control) to the distributed environment. Traditional PostgreSQL snapshots are taken "as of now"; Limitless implements snapshots "as of then". Transaction routers establish snapshot times and distribute them to shards. Shards create local snapshots based on the provided timestamp, and multi-shard snapshots use consistent timestamps across all shards, which works well for distributed deployments.
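
A hedged sketch (my simplification) of the "as of then" visibility rule: no running-transaction list is needed, just a comparison of each tuple version's commit timestamp against the snapshot timestamp the router handed out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TupleVersion:
    value: str
    commit_ts: Optional[int]   # None while uncommitted or prepared

def visible(v: TupleVersion, snapshot_ts: int) -> bool:
    # "As of then": the version is in the snapshot iff it committed at or
    # before the snapshot timestamp. Every shard applies the same rule to
    # the same router-chosen snapshot_ts, yielding one consistent cut.
    return v.commit_ts is not None and v.commit_ts <= snapshot_ts

print(visible(TupleVersion("x=1", commit_ts=100), snapshot_ts=150))  # True
print(visible(TupleVersion("x=2", commit_ts=None), snapshot_ts=150)) # False
```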

There are some nuances here and details to address. Transactions become visible at an indeterminate point after commit time. Traditional Postgres handles this by making a list of running transactions, but this is expensive and doesn't work in distributed systems. There is also the long fork anomaly, a potential violation of snapshot isolation (as described in Gotsman's CONCUR 2015 paper).

The solutions to these include careful management of commit timing and visibility checks for prepared tuples. To ensure reads are never blocked by writes, the implementation must minimize a read's wait window on a committing transaction. So Aurora predicts commit completion times, setting commit timestamps slightly in the future, to minimize the commit window and resolve this problem.
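
A small sketch of the idea (assumptions and names are mine): a reader has to wait only when a prepared tuple's proposed commit timestamp could land inside its snapshot, and future-dating that timestamp makes this rare.

```python
import time
from dataclasses import dataclass

@dataclass
class PreparedTuple:
    txn_id: int
    proposed_commit_ts: int   # set slightly in the future at prepare time

def reader_must_wait(p: PreparedTuple, snapshot_ts: int) -> bool:
    # Ambiguous only if the prepared txn might commit within our snapshot;
    # then the reader waits for the 2PC outcome before answering.
    return p.proposed_commit_ts <= snapshot_ts

now_us = time.time_ns() // 1_000
p = PreparedTuple(txn_id=7, proposed_commit_ts=now_us + 500)  # 500us ahead
print(reader_must_wait(p, snapshot_ts=now_us))  # False: the read never blocks
```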

I know there is a lot to unpack here. But we march on.

Here is how 2PC works in Aurora Limitless.

Routers manage transactions and determine whether 1PC or 2PC is needed at commit time. For 2PC, the router initially acts as the coordinator but hands off to a local shard coordinator. Why this handoff? Because we don't want to require high availability at the router; the router should be soft-state and dispensable. We already have high availability at the shards, so the router hands off supervision of 2PC completion to a shard leader, which is already replicated and highly available.
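
A minimal sketch of the commit path (my reconstruction; the class and method names are hypothetical): 1PC when a single shard is involved, otherwise prepare everywhere and hand phase 2 off to a shard-side coordinator so the router stays soft-state.

```python
class Shard:
    def __init__(self, name: str): self.name = name
    def commit_1pc(self, txn: str): print(f"{self.name}: 1PC commit {txn}")
    def prepare(self, txn: str): print(f"{self.name}: prepared {txn}")
    def finish_commit(self, txn: str, shards: list["Shard"]):
        # The shard leader is already replicated and highly available, so it
        # can safely supervise 2PC completion even if the router dies.
        for s in shards:
            print(f"{self.name} -> {s.name}: commit {txn}")

def router_commit(txn: str, shards: list[Shard]):
    if len(shards) == 1:
        return shards[0].commit_1pc(txn)   # fast path, no coordination
    for s in shards:
        s.prepare(txn)                     # phase 1, driven by the router
    shards[0].finish_commit(txn, shards)   # handoff before phase 2

router_commit("txn42", [Shard("shardA"), Shard("shardB")])
```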

There is another interesting case due to zombie shards (a zombie shard doesn't know it has been failed over, and hence that it is dead!). Aurora implements consistency leases to prevent linearizability issues caused by zombie shards. This is a classic application of fencing through leases. The lease timeout is chosen to be less than failover detection plus recovery time. Leases expire after a few seconds, preventing stale reads at zombies.
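
A hedged sketch of consistency leases as fencing (the constant and names are invented): a shard answers reads only while its lease is provably unexpired, judged against the bounded clock's pessimistic earliest edge.

```python
import time

LEASE_SECONDS = 3.0   # chosen < failover detection + recovery time

class ShardLease:
    def __init__(self):
        self.expiry = 0.0

    def renew(self) -> None:
        self.expiry = time.time() + LEASE_SECONDS

    def can_serve_reads(self, clock_earliest: float) -> bool:
        # Use the earliest possible current time: even under maximal clock
        # uncertainty the lease must still be valid, else refuse to serve.
        return clock_earliest < self.expiry

lease = ShardLease()
lease.renew()
print(lease.can_serve_reads(time.time()))        # True: lease held
print(lease.can_serve_reads(time.time() + 10))   # False: zombie is fenced
```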



Through the OLTP Looking Glass - 16 Years Later - Michael Stonebraker (MIT)

Well, this was also one of my favorite talks (I know we are going 3/3 here). Mike was not arrogant or obnoxious about RDBMS supremacy here. He showed vulnerability and acknowledged/identified problems for the community to address. He delivered a really good talk. Mike is a legend, the 2014 Turing Award winner. He is still kicking hard at 80 years old!

The original "Looking Glass" paper (which I summarized here) identified performance issues in OLTP systems by analyzing the Shore DBMS. It found that only 10% of CPU cycles were spent on useful work, with the 90% overhead distributed among buffer pool management (1/3), concurrency control (1/6), crash recovery (1/6), and multithreading (1/6). The message was: if you want to go faster, address these bottlenecks. Lots of innovation happened addressing them, including HANA, H-Store (VoltDB), HyPer, LeanStore, MemSQL, and Silo.

But there were two major issues with the original study. First, it measured backend-only CPU cycles, ignoring the cost of receiving work from remote clients. This overlooked significant overhead in message handling.

Second, it relied on stored procedures (SPs). While SPs improve performance, they can compromise isolation, which is unacceptable in some industries like financial services. So benchmarks should be run twice, with and without isolation.

To address these issues, Mike's team performed a new study. They used VoltDB with the YCSB, Voter, and TPC-C benchmarks, focusing on main-memory data partitioned for single-shard transactions. This setup makes things go as fast as possible, so they could study the bottlenecks associated with the two issues above.

The key findings point to a big network overhead. Server-side TCP/IP cycles are the primary bottleneck, consuming 68% for YCSB, 63% for Voter, and 47% for TPC-C.

The experiments performed with SPs found that SPs reduce some network cycles but raise isolation concerns.

So how can we address these bottlenecks?

For SPs, run them in a separate address space, not inside the DBMS. They should communicate with the DBMS through shared memory, with polling or interrupts. This approach could help address the isolation issues.
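
Here is a toy sketch of that idea (my illustration, not Mike's implementation; the framing protocol is invented): a stored procedure runs in its own process and exchanges requests with the "DBMS" over shared memory with busy-polling.

```python
from multiprocessing import Process, shared_memory
import struct

FLAG_EMPTY, FLAG_REQUEST, FLAG_REPLY = 0, 1, 2

def sp_process(shm_name: str):
    """SP side: poll for a request, 'execute' it, write back a reply."""
    shm = shared_memory.SharedMemory(name=shm_name)
    while shm.buf[0] != FLAG_REQUEST:     # busy-poll instead of a syscall
        pass
    (n,) = struct.unpack_from("<I", shm.buf, 1)
    request = bytes(shm.buf[5:5 + n]).decode()
    reply = f"executed: {request}".encode()
    shm.buf[5:5 + len(reply)] = reply
    struct.pack_into("<I", shm.buf, 1, len(reply))
    shm.buf[0] = FLAG_REPLY
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=4096)
    req = b"debit(account=42, amount=10)"       # hypothetical SP invocation
    shm.buf[5:5 + len(req)] = req
    struct.pack_into("<I", shm.buf, 1, len(req))
    shm.buf[0] = FLAG_REQUEST                   # publish the request
    worker = Process(target=sp_process, args=(shm.name,))
    worker.start()
    while shm.buf[0] != FLAG_REPLY:             # DBMS side polls for reply
        pass
    (n,) = struct.unpack_from("<I", shm.buf, 1)
    print(bytes(shm.buf[5:5 + n]).decode())
    worker.join(); shm.close(); shm.unlink()
```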

For network overheads, shun TCP/IP and explore alternatives. Their preliminary experiments with kernel bypass using DPDK and F-Stack showed a 2x improvement over TCP/IP. It is also possible to consider hardware solutions like AWS Nitro or RDMA.

Here are some other things to try to improve the status quo: leverage hardware memory isolation (e.g., Multics-style rings), develop lighter-weight networking protocols, improve kernel bypass techniques and make them more accessible, explore specialized networking in database-oriented operating systems (as in DBOS), and consider embedding DBMS functionality in Linux with efficient networking.

Here are Mike's conclusions.

  • TCP/IP is inefficient for OLTP systems and needs significant improvement. The exact words Mike used were: "TCP-IP is a pig, OS guys should fix it, it is an embarrassment."
  • Stored procedures need better tooling support (debuggers, version control) to increase adoption.
  • Networking measurements are crucial for accurate performance evaluation. Studies/benchmarks ignoring networking or isolation are fundamentally flawed.

In the Q&A, Margo was quick to point out that these findings are not surprising, and that the database community should look beyond the traditional DB literature for solutions; the systems and networking folks have been working on them for the last decade. Mike was right to point out in turn that these solutions, including DPDK, should be made easier to use, and that the fact remains that all databases still use TCP/IP and incur these costs.
