HPTS'24 day 1, part 2
This is part 2 of day 1 of HPTS'24. (You can tell I did some Lisp programming back in the day, huh?) Here is the first part of day 1; you should check that out as well.
There were 2 sessions, each with 4 talks, in the afternoon of day 1. After dinner, we had a gong show session on miscellaneous topics as well.
Session 3: DBOS
Virtual Memory: a Huge Step Backwards - Daniel Bittman (Elephance)
Daniel argued that virtual memory, despite its widespread use, is fundamentally flawed. He contends that virtual memory presents an abstraction of memory as flat, uniform, contiguous, infinite, isolated, and fast - all of which are untrue.
He pointed to significant performance issues with virtual memory, noting that address translation adds substantial overhead to every memory access. On a TLB miss, the four-level page-table walk costs 4 extra memory accesses and hundreds of nanoseconds of wasted CPU time, which is an eternity in the CPU world. He criticized the Translation Lookaside Buffer (TLB) for its complexity and lack of coherence, especially in multi-threaded systems.
Security is another major concern. The large codebase required for TLB management and the hardware itself present significant attack surfaces. Why do threaded systems have all threads in the same address space? The process model is much more precise with the OS about its isolation boundaries.
He also pointed out the dangers of overcommitting memory and the unpredictability of memory allocation failures through malloc. He ran an experiment fuzzing malloc failures and showed that almost all of the databases he tested had problems.
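To make the kind of experiment concrete, here is a minimal sketch of malloc-failure fuzzing using GNU ld's `--wrap` feature. This is an illustration of the general technique, not Daniel's actual harness; the failure rate and the RNG are arbitrary choices.

```c
/* Sketch: inject malloc failures into a system under test.
 * Build the target with: gcc ... -Wl,--wrap=malloc */
#include <stddef.h>
#include <stdlib.h>

void *__real_malloc(size_t size);        /* resolved by the linker to the real malloc */

void *__wrap_malloc(size_t size)
{
    /* Fail roughly 1 in 1000 allocations; rate and PRNG are arbitrary. */
    if (rand() % 1000 == 0)
        return NULL;
    return __real_malloc(size);
}
```

Running a database under a wrapper like this quickly exposes allocation sites that never check for NULL.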
Daniel argued that the very early segmentation model was a superior memory model and proposed CHERI as a potential improvement. He is working on a new OS called Twizzler, written in Rust, which aims to address these issues by being more precise about memory management. See https://dbittman.com.
He concluded "Just run SPDK on bare metal, man. Peace out dude!".
Daniel is a natural presenter. He is very comfortable in front of crowds and has great delivery. I guess he took after his advisor, Peter Alvaro. What is your secret, guys? Inquiring minds want to know.
Portable Compiler-Centric SIMD Code in Databases - Lawrence Benson (TU Munich)
Lawrence discussed the importance of SIMD (Single Instruction Multiple Data) in database operations. A simple example of SIMD is doing 4 additions in one instruction. SIMD allows for parallel processing, significantly improving performance in tasks like table scans, hash tables, and sorting.
However, he pointed out that SIMD code is challenging to develop, test, and benchmark due to its CPU-specific nature. This is particularly relevant given the increasing presence of ARM processors in cloud environments.
He questioned the need for hand-written SIMD code, suggesting that compilers should handle this optimization. He critiqued the current approach of using multiple layers of abstractions, such as platform intrinsics built on compiler intrinsics.
Instead, he proposed a more direct approach: using compiler intrinsics exclusively. His method aims to simplify SIMD implementation while maintaining portability and performance across different CPU architectures. He has code to back up his proposal.
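To give a flavor of this style, here is a minimal sketch using GCC/Clang's built-in vector extensions rather than platform-specific intrinsics. This is my illustration of the general approach; the surface of Lawrence's actual library may differ.

```c
#include <stdint.h>

/* GCC/Clang vector extension: four 32-bit ints packed into a 128-bit vector. */
typedef int32_t i32x4 __attribute__((vector_size(16)));

/* The compiler lowers the + to a single SIMD add for whatever target it is
 * compiling for (e.g. paddd on x86-64, a NEON add on ARM), so the same
 * source stays portable across CPU architectures. */
i32x4 add4(i32x4 a, i32x4 b)
{
    return a + b;
}
```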
Making Peace Between Mortal Enemies: Running a Database Management System in the Linux Kernel - Matt Butrovich (Carnegie Mellon University)
Matt proposed a radical approach to database management systems (DBMS): running them in the Linux kernel. Why? Because the operating system is often an impediment to database performance.
Instead of conventional user-space operations, Matt suggested pushing DBMS logic into kernel-space using eBPF (extended Berkeley Packet Filter). eBPF allows safe execution of event-driven programs in kernel space, supporting languages like C, Go, and Rust.
He introduced BPF-DB, an embedded transactional database for eBPF. However, implementing this faces challenges due to the eBPF verifier's strict limitations, including instruction limits and restrictions on branches and recursion. To overcome these obstacles, BPF-DB employs continuation-passing style and uses kernel-resident, thread-safe data structures for storage management. Transaction management uses strict multi-version two-phase locking (MV2PL) to ensure ACID properties. A key innovation is the use of a ring buffer to communicate with user space for write-ahead logging, since eBPF can't directly access the disk.
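To show what the ring-buffer trick looks like, here is a generic libbpf-style sketch (not BPF-DB's actual code): the kernel-side program reserves a slot, fills in a log record, and a user-space daemon consumes it and persists it to disk.

```c
// Kernel-side eBPF sketch: hand a WAL record to user space via a BPF ring
// buffer, since eBPF programs cannot touch the disk themselves.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct wal_record {
    __u64 txn_id;
    __u64 key;
    __u64 value;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);              /* 1 MiB of ring-buffer space */
} wal_rb SEC(".maps");

static __always_inline int wal_append(__u64 txn_id, __u64 key, __u64 value)
{
    struct wal_record *rec = bpf_ringbuf_reserve(&wal_rb, sizeof(*rec), 0);
    if (!rec)
        return -1;                              /* buffer full; caller must back off */
    rec->txn_id = txn_id;
    rec->key = key;
    rec->value = value;
    bpf_ringbuf_submit(rec, 0);                 /* user-space logger writes it to disk */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```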
While this was impressive, an expert (Margo) commented that this project is notable more for its audacity than its practicality, likening it to a "dancing bear" - remarkable not for how well it performs, but for the fact that it works at all.
DBOS 2.0: Practical DB-OS Co-Design with Process Virtualization - Xinjing Zhou (MIT)
First, here is some background about the DBOS project and its goals. There is now a company around the project making some headway.
Xinjing's talk focused on database-operating system co-design. This middle-ground solution aims to integrate some database functionalities into the OS without completely bypassing the kernel.
The talk brought up issues like virtual memory snapshotting, where operations like fork() can be blocking. To minimize security risks associated with kernel-level modifications, Xinjing proposed a "privileged kernel bypass" approach using virtualization. This method elevates the privilege level of the database process through a hypervisor, allowing it to access privileged information and use features like paging, interrupts, and protection rings.
A case study on the fork problem demonstrated a 20x reduction in copy operations using this approach. The privileged database remains a process, preserving compatibility with Linux tools and the ecosystem. Xinjing argues that this evolutionary approach is preferable to a unikernel solution, as it doesn't require abandoning the Linux ecosystem.
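For context on why fork() matters here: a common checkpointing pattern (used by Redis, among others) forks the database process and lets the child serialize a copy-on-write snapshot, roughly as sketched below. The blocking page-table copy inside fork() is the cost the 20x reduction targets. The `serialize_tables_to_disk` function is a hypothetical stand-in for the database's snapshot writer.

```c
/* Generic fork()-based checkpointing sketch. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

extern void serialize_tables_to_disk(void);   /* hypothetical snapshot writer */

void checkpoint(void)
{
    pid_t pid = fork();              /* blocks while the page tables are copied */
    if (pid == 0) {
        serialize_tables_to_disk();  /* child sees a frozen copy-on-write snapshot */
        _exit(0);
    } else if (pid > 0) {
        /* a real system would reap asynchronously and keep serving writes;
         * COW faults copy only the pages the parent touches */
        waitpid(pid, NULL, 0);
    } else {
        perror("fork");
    }
}
```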
Margo found the work interesting but questioned how it would avoid the past issues with kernel-extension work from the 1990s. Bruce Lindsay argued that fork isn't a significant problem for databases and that checkpoints don't necessarily require it. He said, "we don't need much but we complain about what we get."
Session 4: Testing, Faults, and Privacy
FoundationDB Testing? Not Very Good, Actually! - Ben Collins (Antithesis)
Ben, formerly of FoundationDB and now with Antithesis, talked about the limitations of deterministic simulation testing (DST) and discussed the extended approach Antithesis takes to system testing.
Ben acknowledged the benefits of DST, such as replayability and time-machine capabilities, but highlighted significant drawbacks: high implementation costs, deep integration requirements, design limitations due to determinism constraints, and prohibitive computational costs for bug search.
To address these issues, Antithesis developed a more generalized DST approach over the past six years. Key features include:
- Simulating at the computer level rather than system components
- Focusing on blackbox bug finding and correctness testing
- Combining multiple signal types: direct feedback, code coverage, and anomaly detection
- Addressing search challenges like local maxima and sparse signals
He said that with FoundationDB they were trying to get lucky with search, whereas here they bias the search strategy toward finding the bugs. Customers ask Antithesis to find violations such as certain types of exceptions and panic statements, and Antithesis nudges the exploration toward these violations efficiently and effectively. To tackle workload issues, they use once-seeded PRNGs and batched validations. I got the impression that determinism is there to cope with not having infinite memory or time, as a sort of compressing mechanism.
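As a toy illustration of the once-seeded idea (my sketch, not Antithesis's code): every random choice in a test run derives from a single logged seed, so a failing run can be replayed exactly by rerunning with that seed.

```c
/* Toy sketch of once-seeded deterministic testing. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t rng_state;

static void rng_seed(uint64_t seed) { rng_state = seed ? seed : 1; }

static uint64_t rng_next(void)        /* xorshift64: tiny, deterministic PRNG */
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

static int should_inject_fault(void)  /* consulted at every fault point */
{
    return rng_next() % 100 == 0;     /* 1% fault rate, arbitrary */
}

int main(int argc, char **argv)
{
    uint64_t seed = argc > 1 ? strtoull(argv[1], NULL, 10) : 42;
    rng_seed(seed);
    printf("test seed = %llu (rerun with this seed to replay)\n",
           (unsigned long long)seed);
    /* ... drive the workload, calling should_inject_fault() at each fault point ... */
    (void)should_inject_fault();
    return 0;
}
```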
Antithesis offers SDK support for various languages, integration with development tools, and actions on CI platforms. An example test demonstrated the ability to find exceptions under heavy loads within 2 hours of wall clock time (96 test hours including parallelization).
When questioned about non-deterministic bugs, Ben explained that their system wraps tests in a deterministic hypervisor. He also noted that fault-injection and thread pausing techniques are employed to quickly identify bugs before applying more complex search strategies.
I know that MongoDB is a satisfied customer of Antithesis. I think they do a good job of balancing comprehensive coverage with practical exploration for large-scale systems.
When Bad Hardware Happens to Good Databases - Scott Andreas (Apple)
Scott, an engineering leader at Apple working on Cassandra, discussed the challenges of building reliable database systems on unreliable hardware. He emphasized that bit flips, seemingly minor errors, can have severe consequences, particularly when they affect critical operations like distinguishing between enum operation types such as "update" and "drop_table".
He outlined various hardware issues affecting different components. DRAM faces challenges with socket connectivity, thermals, and signal integrity. Flash memory suffers from voltage degradation and wear, while CPUs can have manufacturing defects, thermal issues, and cache faults. While protection mechanisms like ECC and page offlining exist for DRAM, they aren't comprehensive solutions.
Flash memory presents other unpleasant surprises, with increasing density leading to higher error rates; high write rates and write amplification make it more error-prone. PCIe-related issues in storage and network interface cards also require attention. Scott stressed the importance of enabling advanced error reporting for these components.
As we learned from Google and Facebook reports, CPUs are prone to silent Corrupt Execution Errors (CEEs), which necessitate burn-in tests and ongoing tracking of hardware performance over time.
A significant portion of the talk focused on checksumming as a crucial strategy for maintaining data integrity. Scott recommended Cyclic Redundancy Check (CRC) for its balance of efficiency and effectiveness, highlighting the textbook by Koopman on CRC polynomials.
In the Q&A, Brooker suggested that modern systems should use GMAC or cryptographic MACs, which are hardware-accelerated and can address potential spoofing attacks on CRCs.
These are scary problems indeed. For maintaining data integrity, we cannot take hardware reliability, including CPU reliability, for granted. We should ponder the overhead involved in addressing computational CPU problems in particular.
Hyper-reliability for Hyper-scale Databases - David Bacon (Google)
David discussed the challenges of maintaining hyper-reliability in hyper-scale databases like Spanner, which manages over 15 exabytes of data. Spanner underlies most of Google's external-facing products and is scaling towards zettabyte levels.
This is another doom talk. It could be depressing to hear about these, but we need to face up to reality and ruggedize our systems.
David highlighted the issue of Corrupt Execution Errors (CEEs), which can lead to silent data corruption or, worse, corrupted program states. To combat this, Google employs various protection techniques, including checksumming 32K blocks with 32-bit checksums before encryption, validated on both write and read operations.
They implemented a corruption monitoring pipeline, processing debug reports and evicting compromised machines. David calls this their sensor array. This system caught a device driver corruption bug, though it took 18 months to debug fully.
Google also developed the Tinfoil library, featuring "paranoid variables" for high-impact values (e.g., the number of records in a file). This approach has a 90% conviction rate and costs only 0.1% in performance overhead.
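The paranoid-variable idea can be sketched as follows (my illustration of the concept, not the Tinfoil library's actual interface): keep a redundant encoding of the value and validate it on every read, so a flipped bit in either copy is caught before it propagates.

```c
/* Illustrative sketch of a "paranoid variable". */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t value;
    uint64_t complement;            /* always ~value when uncorrupted */
} paranoid_u64;

static void paranoid_set(paranoid_u64 *p, uint64_t v)
{
    p->value = v;
    p->complement = ~v;
}

static uint64_t paranoid_get(const paranoid_u64 *p)
{
    if (p->value != ~p->complement)
        abort();                    /* corruption detected: fail loudly, don't persist garbage */
    return p->value;
}
```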
David concluded that silent data corruption will worsen with advancing technology nodes and growing database sizes. While safe languages can help, they're not sufficient at lower abstraction levels. The sensor-array approach shows promise, and interestingly, memory corruption is rarer than CPU corruption.
Privacy-Preserving Databases: What will you favour? Serializability or scalability? (Cuz you aren’t getting both) - Natacha Crooks (UC Berkeley)
Natacha discussed the challenges of creating privacy-preserving databases, focusing on the trade-offs between serializability and scalability while ensuring privacy. She used electronic health records (EHR) in the cloud as a motivating example, emphasizing that encryption alone is not sufficient for privacy, as access patterns by specific types of doctors can reveal sensitive information.
To set the problem up, Natacha talked about their previous work on two systems: Obladi and Snoopy. Obladi achieves serializability and privacy but at the cost of performance, being 10X slower than MySQL on TPC-C benchmarks. It uses a trusted centralized proxy and Oblivious RAM (ORAM) to hide workload patterns (making storage accesses look completely random), and it amortizes those costs with techniques like delayed commit notifications (because serializability only needs to hold at commit time) and epoch-based processing.
Snoopy, on the other hand, prioritizes scalability and privacy, performing 13X better than Obladi. However, it sacrifices transaction support. Snoopy replaces the central proxy with multiple load balancers, each with a trusted enclave, and uses techniques like deduplicating writes at the load balancers and arranging reads before writes to improve performance.
Natacha concluded that while both serializability and scalability can be achieved with privacy, combining all three remains a challenge. The choice between Obladi and Snoopy represents a trade-off between transactional consistency and performance in privacy-preserving database systems. She is trying to deal with this problem in her research. It is really an interesting research question.
Gong show
After the dinner, we had a gong show session. These are quick 5-minute talks on miscellaneous topics. Here are the titles:
- Kyle in Absentia - Jepsen test of Datomic, and unusual Intra-Transaction Semantics - Adrian Cockcroft (OrionX)
- Metadata is underrated - Danica Porobic (Oracle)
- BigQuery Omni: Shipping and supporting the monorepo on AWS/Azure - Jeff Johnson (Google)
- Vector-Relational Databases (and how to avoid historic recurrence) - Viktor Sanca (EPFL)
- Using SmartNICs for Database Management - Phil Bernstein (Microsoft)
- Leaky Memory Abstractions: Hidden Performance Loss on Modern Hardware - Hamish Nicholson (EPFL)
- Shipping up to Space - Robert Bayer (ITU Copenhagen)
- Plus ça change - Sailesh Krishnamurthy (Google)
- Missing the Metaphor for Self-Driving Databases - Johann Schleier-Smith (CrystalDB)
- Sharing Data Loading - Titus Theodorus (Ties) Robroek (ITU Copenhagen)
- Make DBs (for GenAI) Declarative Again - Yannis Papakonstantinou (Google)
- Text2SQL is Not Enough: Unifying AI and Databases with TAG - Liana Patel (Stanford)
- Optimizing Distributed Protocols with Query Rewrites - David Chu (UC Berkeley)
- Fixing Distributed Consensus to achieve scalability - Chuck Carman (Amazon)
David Chu won the best presentation award. This was top-notch storytelling. He took a bold approach, using 4 of his 5 minutes on the build-up and the last minute for the pay-off. At the end of the 4 minutes, I was lost about what the title even meant and what point he could possibly make after all this build-up about Yu-Gi-Oh playing cards. But then, boom, boom, boom, came the pay-off! It was great; I won't ruin it for you. Please watch it yourself.