Best of Metadata in 2023
It is the most wonderful time of the year again. Time to reflect on the best posts at the Metadata blog in 2023.
Distributed systems
Hints for Distributed Systems Design: I have seen these hints successfully applied in distributed systems
design throughout my 25 years in the field, starting from the theory of
distributed systems (98-01), immersing myself in the practice of wireless
sensor networks (01-11), and working on cloud computing systems in both
academia and industry ever since.
Metastable failures in the wild: Metastable failure is defined as permanent overload with low throughput
even after the fault-trigger is removed. It is an emergent behavior of a
system, and it naturally arises from the optimizations for the common
case that lead to sustained work amplification.
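To make the definition concrete, here is a toy simulation of one well-known metastable pattern, a retry storm. All constants and the model itself are my own illustration, not from the paper: a temporary capacity fault pushes the backlog past the clients' timeout threshold, retries double the offered load, and the overload then sustains itself even after full capacity is restored.

```python
# Toy discrete-time model of a retry-driven metastable failure.
# All constants are illustrative, not taken from the paper.
CAPACITY = 100         # requests served per tick at full health
OFFERED = 80           # fresh requests arriving per tick
RETRY_THRESHOLD = 200  # backlog beyond which clients time out and retry

def simulate(ticks, fault_start, fault_end):
    backlog, history = 0, []
    for t in range(ticks):
        # the fault-trigger: capacity temporarily drops to a quarter
        capacity = CAPACITY // 4 if fault_start <= t < fault_end else CAPACITY
        arrivals = OFFERED
        if backlog > RETRY_THRESHOLD:
            arrivals += OFFERED  # timed-out clients retry: work amplification
        backlog = max(0, backlog + arrivals - capacity)
        history.append(backlog)
    return history

history = simulate(ticks=100, fault_start=10, fault_end=20)
```

Before the fault the backlog is zero. After the trigger is removed at tick 20, retries keep arrivals (160 per tick) above capacity (100 per tick), so the backlog keeps growing: the system is stuck in the low-throughput metastable state even though the fault is gone.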
Towards Modern Development of Cloud Applications: This is an easy-to-read paper, but it is not an easy-to-agree-with paper. The
message is controversial: Don't do microservices, write a monolith, and
our runtime will take care of deployment and distribution.
Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis: The paper conducts a comprehensive study of large-scale microservices deployed in Alibaba clusters. They find that the microservice call graphs are dynamic at runtime, most graphs grow like trees, and the size of call graphs follows a heavy-tail distribution.
TLA+
Beyond the Code: TLA+ and the Art of Abstraction: Abstraction is a powerful tool for avoiding distraction. The etymology
of the word abstract comes from Latin for cut and draw away. With
abstraction, you slice out the protocol from a complex system, omit
unnecessary details, and simplify a complex system into a useful model. In his 2019 talk, Leslie Lamport said: "Abstraction, abstraction, abstraction! That's how you win a Turing Award."
Model Checking Guided Testing for Distributed Systems: The paper shows how to generate test-cases from a TLA+ model of a distributed
protocol and apply it to the Java implementation to check for bugs in
the implementation. They applied the technique to the Raft, XRaft, and Zab
protocols and presented the bugs they found.
Going Beyond an Incident Report with TLA+: This paper is about the use of TLA+ to explain the root cause of a Microsoft Azure incident. It looks like the incident went undetected/unreported for 26 days because it was a partial outage. "A majority of requests did not fail -- rather, a specific type of request was disproportionately affected, such that global error rates did not reveal the outage despite a specific group of users being impacted."
A snapshot isolated database modeling in TLA+: This post walks through modeling (and model checking) a snapshot-isolated database, where each transaction makes a copy of the store and OCC merges its copy back to the store upon commit.
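As a minimal Python analogue of that model (a sketch I wrote to mirror the idea, not the TLA+ spec itself): each transaction snapshots the store at begin, buffers its writes, and at commit an optimistic first-committer-wins check aborts it if any of its written keys was committed after its snapshot was taken.

```python
# Sketch of snapshot isolation with optimistic (first-committer-wins)
# commit; illustrative only, not the post's TLA+ model.
class SnapshotStore:
    def __init__(self):
        self.store = {}
        self.committed_at = {}  # key -> version of its last committed write
        self.version = 0

    def begin(self):
        # each transaction works on a private copy of the store
        return {"snapshot": dict(self.store), "start": self.version, "writes": {}}

    def read(self, txn, key):
        return txn["snapshot"].get(key)

    def write(self, txn, key, value):
        txn["snapshot"][key] = value
        txn["writes"][key] = value

    def commit(self, txn):
        # abort on a write-write conflict with any transaction that
        # committed after this transaction's snapshot was taken
        for key in txn["writes"]:
            if self.committed_at.get(key, -1) > txn["start"]:
                return False
        self.version += 1
        for key, value in txn["writes"].items():
            self.store[key] = value
            self.committed_at[key] = self.version
        return True
```

The sketch also exhibits snapshot isolation's signature anomaly: two transactions that write disjoint keys both commit (write skew), while two that write the same key conflict and the second one aborts.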
Cabbage, Goat, and Wolf Puzzle in TLA+: It is important to emphasize that abstraction is an art, not a science, and it
is best learned through studying examples and practicing hands-on with
modeling. TLA+ excels in providing rapid feedback on your modeling and
designs, which facilitates this learning process significantly. Modeling
the "cabbage, goat, and wolf" puzzle taught me that tackling
real/physical-world scenarios is a great way to practice abstraction and
design -- cutting out the clutter and focusing on the core challenge.
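As a rough analogue in code (a brute-force sketch of my own, not TLA+): the puzzle is a tiny state space, and what the model checker does for a spec resembles this breadth-first search, which explores reachable states and prunes those violating the safety invariant.

```python
from collections import deque

ITEMS = {"wolf", "goat", "cabbage"}

def unsafe(bank):
    # a bank without the farmer is unsafe if the goat is left
    # alone with the wolf or the cabbage
    return "goat" in bank and ("wolf" in bank or "cabbage" in bank)

def solve():
    # state = (frozenset of items on the left bank, farmer's side)
    start, goal = (frozenset(ITEMS), "left"), (frozenset(), "right")
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "left" else ITEMS - left
        other = "right" if side == "left" else "left"
        for cargo in [None] + sorted(here):  # cross alone or with one item
            new_left = set(left)
            if cargo is not None:
                (new_left.discard if side == "left" else new_left.add)(cargo)
            new_left = frozenset(new_left)
            # the analogue of checking a safety invariant: skip bad states
            behind = new_left if side == "left" else ITEMS - new_left
            if unsafe(behind):
                continue
            state = (new_left, other)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [(cargo, other)]))
    return None
```

BFS returns the shortest safe plan: seven crossings, necessarily starting with the goat, since every other first move leaves an unsafe bank behind.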
Production systems
Distributed Transactions at Scale in Amazon DynamoDB: Aligned with the predictability tenet, when adding transactions to
DynamoDB, the first and primary constraint was to preserve the
predictable high performance of single-key reads/writes at any scale. The
second big constraint was to implement transactions using
update-in-place operations, without multi-version concurrency control.
The reason for this was that they didn't want to muck with the storage
layer, which did not support multi-versioning. Satisfying both of these
constraints may seem like a fool's errand, as transactions are infamous
for being unscalable and for hurting the performance of normal
operations when done without MVCC, but the team got creative around
these constraints and managed to pull it off.
Kora: A Cloud-Native Event Streaming Platform For Kafka: Kora combines best practices to deliver cloud features such as high
availability, durability, scalability, elasticity, cost efficiency,
performance, and multi-tenancy. For example, the Kora architecture decouples
its storage and compute tiers to facilitate elasticity, performance,
and cost efficiency. As another example, Kora defines a Logical Kafka
Cluster (LKC) abstraction to serve as the user-visible unit of
provisioning, so it can help customers distance themselves from the
underlying hardware and think in terms of application requirements.
Spanner: Becoming a SQL system: The original Spanner paper, published in 2012, had little
discussion of or support for SQL. It was mostly a "transactional NoSQL core".
In the intervening years, though, Spanner has evolved into a relational
database system, and many of the SQL features in F1 got incorporated
directly in Spanner. Spanner got a strongly-typed schema system and a
SQL query processor, among other features. This paper describes
Spanner's evolution to a full featured SQL system. It focuses mostly on
the distributed query execution (in the presence of resharding of the
underlying Spanner record space), query restarts upon transient
failures, range extraction (which drives query routing and index seeks),
and the improved blockwise-columnar storage format.
PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads: PolarDB adopts the canonical primary-secondary architecture of
relational databases. The primary is a read-write (RW) node, and the
secondaries are read-only (RO) nodes. RO nodes help with executing
queries and with scaling out query performance. On top of this, the authors are interested in serving strongly consistent reads from the RO nodes.
TiDB: A Raft-based HTAP Database: TiDB is an open-source
Hybrid Transactional and Analytical Processing (HTAP) database
developed by PingCAP. The TiDB server, written in Go, is the
query/transaction processing component; it is stateless, in the sense
that it does not store data and it is for computing only. The underlying key-value store, TiKV,
is written in Rust, and it uses RocksDB as the storage engine. They add
a columnar store called TiFlash, which gets most of the coverage in
this paper.
Databases
The end of a myth: Distributed transactions can scale: The paper presents NAM-DB, a scalable distributed database system that
uses RDMA (mostly one-sided RDMA) and a novel timestamp oracle to support
snapshot isolation (SI) transactions. NAM stands for
network-attached-memory architecture, which leverages RDMA to enable
compute nodes to talk directly to a pool of memory nodes.
ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks: This is a foundational paper in the databases area. ARIES achieves
long-running transaction recovery in a performant/nonblocking fashion.
It is more complicated than simple write-ahead-log (WAL) based
per-action recovery, as it needs to preserve the Atomicity and
Durability properties of ACID transactions. Any transactional database
worth its salt (including Postgres, Oracle, and MySQL) implements recovery
techniques based on the ARIES principles.
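To illustrate just the write-ahead-log idea the paper builds on (a toy sketch of my own, far simpler than ARIES, which adds LSNs, CLRs, a dirty-page table, and fuzzy checkpointing): log before- and after-images before touching the data, then recover with a redo pass that repeats history and an undo pass that rolls back uncommitted transactions.

```python
# Toy write-ahead log; illustrative only, not ARIES.
class ToyWAL:
    def __init__(self):
        self.log = []  # stable storage: survives crashes
        self.db = {}   # volatile state: lost on crash

    def update(self, txn, key, new_value):
        # write-ahead rule: append (before-image, after-image) to the
        # log before modifying the database
        self.log.append(("update", txn, key, self.db.get(key), new_value))
        self.db[key] = new_value

    def commit(self, txn):
        self.log.append(("commit", txn))

    def crash(self):
        self.db = {}  # volatile state vanishes; the log survives

    def recover(self):
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        # redo pass: repeat history for every logged update
        for rec in self.log:
            if rec[0] == "update":
                _, _, key, _, after = rec
                self.db[key] = after
        # undo pass: roll back losers (uncommitted txns), newest first
        for rec in reversed(self.log):
            if rec[0] == "update" and rec[1] not in committed:
                _, _, key, before, _ = rec
                if before is None:
                    self.db.pop(key, None)
                else:
                    self.db[key] = before
```

After a crash, a committed transaction's writes survive (durability) while an in-flight transaction's writes are rolled back (atomicity), which is exactly the A and D that ARIES preserves at vastly greater scale and concurrency.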
Epoxy: ACID Transactions Across Diverse Data Stores: Epoxy leverages Postgres transactional database as the
primary/coordinator and extends multiversion concurrency control (MVCC)
for cross-data store isolation. It provides isolation as well as
atomicity and durability through its optimistic concurrency control
(OCC) plus two-phase commit (2PC) protocol. Epoxy was implemented as a
bolt-on shim layer for five diverse data stores: Postgres, MySQL,
Elasticsearch, MongoDB, and Google Cloud Storage (GCS). (I guess the
authors had Google Cloud credits to use rather than AWS credits, and so
the experiments were run on Google Cloud.)
Detock: High Performance Multi-region Transactions at Scale: This is a follow-up to the deterministic database work that Daniel Abadi has
been doing for more than a decade. I like this type of continuous
research effort rather than people jumping from one branch to another
before exploring the approach in depth.
Polyjuice: High-Performance Transactions via Learned Concurrency Control: This paper shows a practical application of simple machine learning to an important systems problem, concurrency control. Instead of choosing among a small number of known algorithms, Polyjuice
searches the "policy space" of fine-grained actions by using
evolutionary-based reinforcement learning and offline training to
maximize throughput. Under different configurations of TPC-C and TPC-E,
Polyjuice can achieve throughput numbers higher than the best of
existing algorithms by 15% to 56%.
A Study of Database Performance Sensitivity to Experiment Settings: The paper investigates the following question: many articles compare
against prior work under certain settings, but how many of their
conclusions hold under other settings? They find that the evaluations of the sampled works (and conclusions drawn from
them) are sensitive to experiment settings. They make some
recommendations as to how to proceed for evaluation of future systems
work.
TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study: This paper compares the two standard TPC benchmarks for OLTP: TPC-C, which came out in 1992, and TPC-E, which arrived in 2007. TPC-E
is designed to be a more realistic and sophisticated OLTP benchmark
than TPC-C by incorporating realistic data skews and referential
integrity constraints. However, because of its complexity, TPC-E is more
difficult to implement, and it is hard for others to reproduce its
results. As a result, adoption has been slow and limited.
Is Scalable OLTP in the Cloud a Solved Problem? The paper draws attention to the divide between conventional wisdom on
building scalable OLTP databases (shared-nothing architecture) and how
they are built and deployed on the cloud (shared storage architecture).
There are shared-nothing systems like CockroachDB, YugabyteDB, and
Spanner, but the overwhelming trend/volume in cloud OLTP is shared
storage, often even with a single writer, as in AWS Aurora, Azure SQL Hyperscale, PolarDB, and AlloyDB.
Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate: This is the main message of this paper: You have been doing caching wrong for your transactional workloads!
Designing Access Methods: The RUM Conjecture: Algorithms and data structures for organizing and accessing data are
called access methods. The database research/development community has
been playing catch-up, redesigning and tuning access methods to
accommodate changes in hardware and workloads. As data generation and
workload diversification grow exponentially, and hardware advances
introduce increased complexity, the effort of redesigning and tuning has
been accumulating as well. The paper suggests it is important to solve
this problem once and for all by identifying tradeoffs access methods
face, and designing access methods that can adapt and autotune the
structures and techniques for the new environment.
Miscellaneous
SIGMOD panel: Future of Database System Architectures: Swami said that when Raghu invited him to be on the panel, he didn't
know he would have to disagree with his PhD advisor Gustavo. He said
that disaggregation has already arrived. He took a customer-focused view
and said that the boundary between analytics, transactional, and ML is
irrelevant for customers, and these are artificial distinctions of the
research community that need to die. He built on the
hardware-software codesign theme Anastasia mentioned. He said that
humans are not good at high-cardinality problems, which is where ML
helps, and there is not enough investment in how to use ML for building
databases. Being on-call at 2am, debugging, makes you appreciate these things.
He said, being known as the NoSQL guy, he would controversially claim
that "SQL is going to die" because LLMs are going to reinvent the spec
and allow natural-language-based querying.
SIGMOD/PODS Day 2: Don Chamberlin (IBM fellow retired) is the creator of the SQL language. Why does the title say 49 years and not 50 years of querying? This is because the SQL paper was published 49 years ago at SIGMOD conference. The paper was titled: "SEQUEL: A structured English query language".
But believe it or not, this was not the main show at that conference;
it may even have gone unnoticed. The main show was two influential
people debating.
Review: Performance Modeling and Design of Computer Systems: Queueing Theory in Action: We are A. Jesse Jiryu Davis, Andrew Helwer, and Murat Demirbas,
three enthusiasts of distributed systems and formal methods. We’re
looking for rigorous ways to model the performance of distributed
systems, and we had hoped that this book would point the way.
Keep CALM and CRDT On: This paper focuses on the read/querying problem of conflict-free replicated data
types (CRDTs). To solve this problem, it proposes extending CRDTs with a
SQL API query model, applying the CALM theorem to identify which
queries are safe to execute locally on any replica. The answer is no
surprise: monotonic queries can provide consistent observations without
coordination.
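A concrete instance of a coordination-free monotonic query (a minimal sketch of my own, not the paper's SQL API): a grow-only counter CRDT whose merge is a pointwise max, so its value only grows as more state is merged in, and any replica can answer the query locally.

```python
# Grow-only counter (G-Counter) CRDT sketch; illustrative only.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count contributed by that replica

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # pointwise max is commutative, associative, and idempotent,
        # so replicas converge regardless of message delivery order
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        # monotonic query: merging more state can only grow the answer,
        # so a local read is a consistent (if possibly stale) observation
        return sum(self.counts.values())
```

Two replicas can increment independently and exchange state in either order; after merging, both report the same value with no coordination round.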
Previous years in review
Best of metadata in 2020
Best of metadata in 2019
Best of metadata in 2018
Research, writing, and career advice