Best of Metadata in 2023
It is the most wonderful time of the year again. Time to reflect on the best posts at the Metadata blog in 2023.
Distributed systems
Hints for Distributed Systems Design: I have seen these hints successfully applied in distributed systems
design throughout my 25 years in the field, starting from the theory of
distributed systems (98-01), immersing myself in the practice of wireless
sensor networks (01-11), and working on cloud computing systems in both
academia and industry ever since.
Metastable failures in the wild: Metastable failure is defined as permanent overload with low throughput
even after the fault-trigger is removed. It is an emergent behavior of a
system, and it naturally arises from the optimizations for the common
case that lead to sustained work amplification.
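To make the definition concrete, here is a toy simulation of one well-known metastable pattern, a retry storm. All constants and the model itself are my own illustration, not from the paper: a temporary capacity fault pushes the backlog past the clients' timeout threshold, retries double the offered load, and the overload then sustains itself even after full capacity is restored.

```python
# Toy discrete-time model of a retry-driven metastable failure.
# All constants are illustrative, not taken from the paper.
CAPACITY = 100         # requests served per tick at full health
OFFERED = 80           # fresh requests arriving per tick
RETRY_THRESHOLD = 200  # backlog beyond which clients time out and retry

def simulate(ticks, fault_start, fault_end):
    backlog, history = 0, []
    for t in range(ticks):
        # the fault-trigger: capacity temporarily drops to a quarter
        capacity = CAPACITY // 4 if fault_start <= t < fault_end else CAPACITY
        arrivals = OFFERED
        if backlog > RETRY_THRESHOLD:
            arrivals += OFFERED  # timed-out clients retry: work amplification
        backlog = max(0, backlog + arrivals - capacity)
        history.append(backlog)
    return history

history = simulate(ticks=100, fault_start=10, fault_end=20)
```

Before the fault the backlog is zero. After the trigger is removed at tick 20, retries keep arrivals (160 per tick) above capacity (100 per tick), so the backlog keeps growing: the system is stuck in the low-throughput metastable state even though the fault is gone.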
Towards Modern Development of Cloud Applications: This is an easy-to-read paper, but it is not an easy-to-agree-with paper. The
message is controversial: Don't do microservices, write a monolith, and
our runtime will take care of deployment and distribution.
Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis: The paper conducts a comprehensive study of large-scale microservices deployed in Alibaba clusters. They find that the microservice call graphs are dynamic at runtime, most graphs grow like trees, and the size of call graphs follows a heavy-tail distribution.
TLA+
Beyond the Code: TLA+ and the Art of Abstraction: Abstraction is a powerful tool for avoiding distraction. The etymology
of the word abstract comes from Latin for cut and draw away. With
abstraction, you slice out the protocol from a complex system, omit
unnecessary details, and simplify a complex system into a useful model. In his 2019 talk, Leslie Lamport said: "Abstraction, abstraction, abstraction! That's how you win a Turing Award."
Model Checking Guided Testing for Distributed Systems: The paper shows how to generate test-cases from a TLA+ model of a distributed
protocol and apply it to the Java implementation to check for bugs in
the implementation. They applied the technique to the Raft, XRaft, and Zab
protocols and presented the bugs they found.
Going Beyond an Incident Report with TLA+: This paper is about the use of TLA+ to explain the root cause of a Microsoft Azure incident. It looks like the incident went undetected/unreported for 26 days because it was a partial outage. "A majority of requests did not fail -- rather, a specific type of request was disproportionately affected, such that global error rates did not reveal the outage despite a specific group of users being impacted."
A snapshot isolated database modeling in TLA+: This post walks through modeling (and model checking) a snapshot-isolated database, where each transaction makes a copy of the store and OCC merges its copy back to the store upon commit.
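As a minimal Python analogue of that model (a sketch I wrote to mirror the idea, not the TLA+ spec itself): each transaction snapshots the store at begin, buffers its writes, and at commit an optimistic first-committer-wins check aborts it if any of its written keys was committed after its snapshot was taken.

```python
# Sketch of snapshot isolation with optimistic (first-committer-wins)
# commit; illustrative only, not the post's TLA+ model.
class SnapshotStore:
    def __init__(self):
        self.store = {}
        self.committed_at = {}  # key -> version of its last committed write
        self.version = 0

    def begin(self):
        # each transaction works on a private copy of the store
        return {"snapshot": dict(self.store), "start": self.version, "writes": {}}

    def read(self, txn, key):
        return txn["snapshot"].get(key)

    def write(self, txn, key, value):
        txn["snapshot"][key] = value
        txn["writes"][key] = value

    def commit(self, txn):
        # abort on a write-write conflict with any transaction that
        # committed after this transaction's snapshot was taken
        for key in txn["writes"]:
            if self.committed_at.get(key, -1) > txn["start"]:
                return False
        self.version += 1
        for key, value in txn["writes"].items():
            self.store[key] = value
            self.committed_at[key] = self.version
        return True
```

The sketch also exhibits snapshot isolation's signature anomaly: two transactions that write disjoint keys both commit (write skew), while two that write the same key conflict and the second one aborts.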
Cabbage, Goat, and Wolf Puzzle in TLA+: It is important to emphasize that abstraction is an art, not a science, and it
is best learned through studying examples and practicing hands-on with
modeling. TLA+ excels in providing rapid feedback on your modeling and
designs, which facilitates this learning process significantly. Modeling
the "cabbage, goat, and wolf" puzzle taught me that tackling
real/physical-world scenarios is a great way to practice abstraction and
design -- cutting out the clutter and focusing on the core challenge.
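As a rough analogue in code (a brute-force sketch of my own, not TLA+): the puzzle is a tiny state space, and what the model checker does for a spec resembles this breadth-first search, which explores reachable states and prunes those violating the safety invariant.

```python
from collections import deque

ITEMS = {"wolf", "goat", "cabbage"}

def unsafe(bank):
    # a bank without the farmer is unsafe if the goat is left
    # alone with the wolf or the cabbage
    return "goat" in bank and ("wolf" in bank or "cabbage" in bank)

def solve():
    # state = (frozenset of items on the left bank, farmer's side)
    start, goal = (frozenset(ITEMS), "left"), (frozenset(), "right")
    queue, seen = deque([(start, [])]), {start}
    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "left" else ITEMS - left
        other = "right" if side == "left" else "left"
        for cargo in [None] + sorted(here):  # cross alone or with one item
            new_left = set(left)
            if cargo is not None:
                (new_left.discard if side == "left" else new_left.add)(cargo)
            new_left = frozenset(new_left)
            # the analogue of checking a safety invariant: skip bad states
            behind = new_left if side == "left" else ITEMS - new_left
            if unsafe(behind):
                continue
            state = (new_left, other)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [(cargo, other)]))
    return None
```

BFS returns the shortest safe plan: seven crossings, necessarily starting with the goat, since every other first move leaves an unsafe bank behind.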
Production systems
Distributed Transactions at Scale in Amazon DynamoDB: Aligned with the predictability tenet, when adding transactions to
DynamoDB, the first and primary constraint was to preserve the
predictable high performance of single-key reads/writes at any scale. The
second big constraint was to implement transactions using
update-in-place operations, without multi-version concurrency control.
The reason for this was that they didn't want to muck with the storage
layer, which did not support multi-versioning. Satisfying both of these
constraints may seem like a fool's errand, as transactions are infamous
for being unscalable and for hurting the performance of normal
operations when done without MVCC, but the team got creative around
these constraints and managed to pull it off.
Kora: A Cloud-Native Event Streaming Platform For Kafka: Kora combines best practices to deliver cloud features such as high
availability, durability, scalability, elasticity, cost efficiency,
performance, and multi-tenancy. For example, the Kora architecture decouples
its storage and compute tiers to facilitate elasticity, performance,
and cost efficiency. As another example, Kora defines a Logical Kafka
Cluster (LKC) abstraction to serve as the user-visible unit of
provisioning, so it can help customers distance themselves from the
underlying hardware and think in terms of application requirements.
Spanner: Becoming a SQL system: The original Spanner paper, published in 2012, had little
discussion of or support for SQL. It was mostly a "transactional NoSQL core".
In the intervening years, though, Spanner has evolved into a relational
database system, and many of the SQL features in F1 got incorporated
directly in Spanner. Spanner got a strongly-typed schema system and a
SQL query processor, among other features. This paper describes
Spanner's evolution to a full featured SQL system. It focuses mostly on
the distributed query execution (in the presence of resharding of the
underlying Spanner record space), query restarts upon transient
failures, range extraction (which drives query routing and index seeks),
and the improved blockwise-columnar storage format.
PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads: PolarDB adopts the canonical primary-secondary architecture of
relational databases. The primary is a read-write (RW) node, and the
secondaries are read-only (RO) nodes. RO nodes help with executing
queries and with scaling out query performance. On top of this, the authors are interested in serving strongly consistent reads from the RO nodes.
TiDB: A Raft-based HTAP Database: TiDB is an open-source
Hybrid Transactional and Analytical Processing (HTAP) database
developed by PingCAP. The TiDB server, written in Go, is the
query/transaction processing component; it is stateless, in the sense
that it does not store data and it is for computing only. The underlying key-value store, TiKV,
is written in Rust, and it uses RocksDB as the storage engine. They add
a columnar store called TiFlash, which gets most of the coverage in
this paper.
Databases
The end of a myth: Distributed transactions can scale: The paper presents NAM-DB, a scalable distributed database system that
uses RDMA (mostly one-sided RDMA) and a novel timestamp oracle to support
snapshot isolation (SI) transactions. NAM stands for
network-attached-memory architecture, which leverages RDMA to enable
compute nodes to talk directly to a pool of memory nodes.
ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks: This is a foundational paper in the databases area. ARIES achieves
long-running transaction recovery in a performant/nonblocking fashion.
It is more complicated than simple write-ahead-log (WAL) based
per-action recovery, as it needs to preserve the Atomicity and
Durability properties of ACID transactions. Any transactional database
worth its salt (including Postgres, Oracle, and MySQL) implements recovery
techniques based on the ARIES principles.
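To illustrate just the write-ahead-log idea the paper builds on (a toy sketch of my own, far simpler than ARIES, which adds LSNs, CLRs, a dirty-page table, and fuzzy checkpointing): log before- and after-images before touching the data, then recover with a redo pass that repeats history and an undo pass that rolls back uncommitted transactions.

```python
# Toy write-ahead log; illustrative only, not ARIES.
class ToyWAL:
    def __init__(self):
        self.log = []  # stable storage: survives crashes
        self.db = {}   # volatile state: lost on crash

    def update(self, txn, key, new_value):
        # write-ahead rule: append (before-image, after-image) to the
        # log before modifying the database
        self.log.append(("update", txn, key, self.db.get(key), new_value))
        self.db[key] = new_value

    def commit(self, txn):
        self.log.append(("commit", txn))

    def crash(self):
        self.db = {}  # volatile state vanishes; the log survives

    def recover(self):
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        # redo pass: repeat history for every logged update
        for rec in self.log:
            if rec[0] == "update":
                _, _, key, _, after = rec
                self.db[key] = after
        # undo pass: roll back losers (uncommitted txns), newest first
        for rec in reversed(self.log):
            if rec[0] == "update" and rec[1] not in committed:
                _, _, key, before, _ = rec
                if before is None:
                    self.db.pop(key, None)
                else:
                    self.db[key] = before
```

After a crash, a committed transaction's writes survive (durability) while an in-flight transaction's writes are rolled back (atomicity), which is exactly the A and D that ARIES preserves at vastly greater scale and concurrency.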
Epoxy: ACID Transactions Across Diverse Data Stores: Epoxy leverages Postgres transactional database as the
primary/coordinator and extends multiversion concurrency control (MVCC)
for cross-data store isolation. It provides isolation as well as
atomicity and durability through its optimistic concurrency control
(OCC) plus two-phase commit (2PC) protocol. Epoxy was implemented as a
bolt-on shim layer for five diverse data stores: Postgres, MySQL,
Elasticsearch, MongoDB, and Google Cloud Storage (GCS). (I guess the
authors had Google Cloud credits to use rather than AWS credits, and so
the experiments were run on Google Cloud.)
Detock: High Performance Multi-region Transactions at Scale: This is a follow-up to the deterministic database work that Daniel Abadi has
been doing for more than a decade. I like this type of continuous
research effort rather than people jumping from one branch to another
before exploring the approach in depth.
Polyjuice: High-Performance Transactions via Learned Concurrency Control: This paper shows a practical application of simple machine learning to an important systems problem, concurrency control. Instead of choosing among a small number of known algorithms, Polyjuice
searches the "policy space" of fine-grained actions by using
evolutionary-based reinforcement learning and offline training to
maximize throughput. Under different configurations of TPC-C and TPC-E,
Polyjuice can achieve throughput numbers higher than the best of
existing algorithms by 15% to 56%.
A Study of Database Performance Sensitivity to Experiment Settings: The paper investigates the following question: many articles compare
against prior work under certain settings, but how many of their
conclusions hold under other settings? They find that the evaluations of the sampled works (and conclusions drawn from
them) are sensitive to experiment settings. They make some
recommendations as to how to proceed for evaluation of future systems
work.
TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study: This paper compares the two standard TPC benchmarks for OLTP: TPC-C, which came out in 1992, and TPC-E, which arrived in 2007. TPC-E
is designed to be a more realistic and sophisticated OLTP benchmark
than TPC-C by incorporating realistic data skews and referential
integrity constraints. However, because of its complexity, TPC-E is more
difficult to implement, and it is hard for others to reproduce its
results. As a result, adoption has been slow and limited.
Is Scalable OLTP in the Cloud a Solved Problem? The paper draws attention to the divide between conventional wisdom on
building scalable OLTP databases (shared-nothing architecture) and how
they are built and deployed on the cloud (shared storage architecture).
There are shared-nothing systems like CockroachDB, YugabyteDB, and
Spanner, but the overwhelming trend/volume in cloud OLTP is shared
storage, often even with a single writer, as in AWS Aurora, Azure SQL Hyperscale, PolarDB, and AlloyDB.
Take Out the TraChe: Maximizing (Tra)nsactional Ca(che) Hit Rate: This is the main message of this paper: You have been doing caching wrong for your transactional workloads!
Designing Access Methods: The RUM Conjecture: Algorithms and data structures for organizing and accessing data are
called access methods. The database research/development community has
been playing catch-up, redesigning and tuning access methods to
accommodate changes in hardware and workloads. As data generation and
workload diversification grow exponentially, and hardware advances
introduce increased complexity, the effort of redesigning and tuning has
been accumulating as well. The paper suggests it is important to solve
this problem once and for all by identifying tradeoffs access methods
face, and designing access methods that can adapt and autotune the
structures and techniques for the new environment.
Miscellaneous
SIGMOD panel: Future of Database System Architectures: Swami said that when Raghu invited him to be on the panel, he didn't
know he would have to disagree with his PhD advisor Gustavo. He said
that disaggregation has already arrived. He took a customer-focused view
and said that the boundary between analytics, transactional, and ML is
irrelevant for customers, and these are artificial distinctions of the
research community that need to die. He built on the
hardware-software codesign theme Anastasia mentioned. He said that
humans are not good at high-cardinality problems, which is where ML
helps, and there is not enough investment in how to use ML for building
databases. Being on-call at 2am, debugging, makes you appreciate these things.
He said, being known as the NoSQL guy, he would controversially claim
that "SQL is going to die" because LLMs are going to reinvent the spec
and allow natural-language-based querying.
SIGMOD/PODS Day 2: Don Chamberlin (IBM fellow retired) is the creator of the SQL language. Why does the title say 49 years and not 50 years of querying? This is because the SQL paper was published 49 years ago at SIGMOD conference. The paper was titled: "SEQUEL: A structured English query language".
But believe it or not, this was not the main show at that conference;
it may even have gone unnoticed. The main show was two influential
people debating.
Review: Performance Modeling and Design of Computer Systems: Queueing Theory in Action: We are A. Jesse Jiryu Davis, Andrew Helwer, and Murat Demirbas,
three enthusiasts of distributed systems and formal methods. We’re
looking for rigorous ways to model the performance of distributed
systems, and we had hoped that this book would point the way.
Keep CALM and CRDT On: This paper focuses on the read/querying problem of conflict-free replicated data
types (CRDTs). To solve this problem, it proposes extending CRDTs with a
SQL API query model, applying the CALM theorem to identify which
queries are safe to execute locally on any replica. The answer is no
surprise: monotonic queries can provide consistent observations without
coordination.
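A concrete instance of a coordination-free monotonic query (a minimal sketch of my own, not the paper's SQL API): a grow-only counter CRDT whose merge is a pointwise max, so its value only grows as more state is merged in, and any replica can answer the query locally.

```python
# Grow-only counter (G-Counter) CRDT sketch; illustrative only.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count contributed by that replica

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # pointwise max is commutative, associative, and idempotent,
        # so replicas converge regardless of message delivery order
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        # monotonic query: merging more state can only grow the answer,
        # so a local read is a consistent (if possibly stale) observation
        return sum(self.counts.values())
```

Two replicas can increment independently and exchange state in either order; after merging, both report the same value with no coordination round.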
Previous years in review
Best of metadata in 2020
Best of metadata in 2019
Best of metadata in 2018
Research, writing, and career advice