Posts

Understanding the Performance Implications of Storage-Disaggregated Databases

Storage-compute disaggregation in databases has emerged as a pivotal architecture in cloud environments, as evidenced by Amazon (Aurora), Microsoft (Socrates), Google (AlloyDB), Alibaba (PolarDB), and Huawei (Taurus). This approach decouples compute from storage, allowing compute and storage resources to scale independently and elastically. It provides fault-tolerance at the storage level. You can then share the storage with other services, such as adding read-only replicas for the databases. You can even use the storage level for easier sharding of your database. Finally, you can also use this for exporting a changelog asynchronously to feed into peripheral cloud services, such as analytics. Disaggregated architecture was the topic of a Sigmod 23 panel. I think this quote summarizes the industry's thinking on the topic: "Disaggregated architecture is here, and is not going anywhere. In a disaggregated architecture, storage is fungible, and computing scales independently."
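To make the decoupling concrete, here is a minimal Python sketch of the "log is the database" pattern these systems build on (all names are hypothetical, for illustration only): the primary compute node only appends redo records to a shared storage service, the storage layer materializes pages from the log, and a read replica attached to the same storage sees the writes without copying any data.

class SharedStorage:
    """Durable shared log + page store; fungible across compute nodes."""
    def __init__(self):
        self.redo_log = []   # ordered redo records, the source of truth
        self.pages = {}      # pages materialized from the log by storage

    def append(self, record):
        self.redo_log.append(record)
        key, value = record  # storage applies redo; compute never writes pages
        self.pages[key] = value

    def read(self, key):
        return self.pages.get(key)

class ComputeNode:
    """Compute only runs queries and emits redo; it owns no durable state."""
    def __init__(self, storage, read_only=False):
        self.storage, self.read_only = storage, read_only

    def put(self, key, value):
        assert not self.read_only
        self.storage.append((key, value))  # write = append redo to shared log

    def get(self, key):
        return self.storage.read(key)

storage = SharedStorage()
primary = ComputeNode(storage)
replica = ComputeNode(storage, read_only=True)  # attached without copying data
primary.put("x", 1)
print(replica.get("x"))  # 1: the replica reads through the shared storage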

Unanimous 2PC: Fault-tolerant Distributed Transactions Can be Fast and Simple

This paper (PaPoC'24) is a cute paper. It isn't practical or very novel, but I think it is a good thought-provoking paper. It brought together the work/ideas around transaction commit for me. It also has TLA+ specs in the appendix, which could be helpful to you in your modeling adventures. I don't like the paper's introduction and motivation sections, so I will explain these my way.

The problem with 2PC

2PC is a commit protocol. A coordinator (transaction manager, TM) consistently decides, based on the participants' (resource managers, RMs) feedback, whether to commit or abort. If one RM sees/applies a commit, all RMs should eventually apply the commit. If one RM sees/applies an abort, all RMs should eventually apply the abort. The figure below shows 2PC in action. (If you are looking for a deeper dive into 2PC, read this.) There are three different transactions going on here. All transactions access two RMs, X and Y. T1, the blue transaction, is coordinated by C1 and ends up committing.
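To make the commit rule concrete, here is a minimal single-process Python sketch of 2PC (my names, not the paper's). It deliberately ignores failures and timeouts, which is exactly where 2PC's blocking problem, and this paper's proposal, come in.

class RM:
    """Resource manager (participant)."""
    def __init__(self, name):
        self.name = name
        self.state = "working"

    def prepare(self):
        # A real RM would force the transaction's writes to disk here.
        self.state = "prepared"
        return True  # vote yes; return False to force a global abort

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(rms):
    """Coordinator (TM): commit iff every RM votes yes."""
    # Phase 1: collect votes.
    decision = "commit" if all(rm.prepare() for rm in rms) else "abort"
    # Phase 2: apply the decision everywhere. Atomicity invariant: if one
    # RM applies commit, all RMs eventually apply commit; same for abort.
    for rm in rms:
        rm.commit() if decision == "commit" else rm.abort()
    return decision

x, y = RM("X"), RM("Y")
print(two_phase_commit([x, y]), x.state, y.state)  # commit committed committed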

Optimizing Distributed Protocols with Query Rewrites

This paper (Sigmod 2024) formalizes scalability optimizations as rule-driven rewrites, inspired by SQL query optimizations. It focuses on two well-known and popular techniques: decoupling (distributing logic/code across nodes to introduce pipeline parallelism) and partitioning (distributing data across nodes -- no new components, just more instances of them -- for workload parallelism). Whittaker et al. (hi!) applied decoupling and partitioning optimizations to Paxos in the Compartmentalized Paxos work. That work deconstructed Paxos and showed how to reconstruct it, using decoupling and partitioning, to be more scalable by individually focusing on each component/role in Paxos. This is a simple but effective trick. Even after you learn the trick, you still keep getting surprised by how effective it is. The current paper aims to apply these optimizations via query rewrite rules more methodically, rather than in the traditional/ad-hoc way. To this end, the paper utilizes Dedalus, a Datalog¬ dialect with...
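As a toy illustration of the two rewrites (hypothetical names, not the paper's Dedalus rules): decoupling splits one node's logic into pipeline stages running concurrently, while partitioning runs a stage's state as several instances, each owning a disjoint slice of the keys.

from queue import Queue
from threading import Thread

def decouple(stage_fn, in_q, out_q):
    """Decoupling: run one stage on its own thread (pipeline parallelism)."""
    def loop():
        while True:
            item = in_q.get()
            if item is None:          # end-of-stream marker
                out_q.put(None)
                break
            out_q.put(stage_fn(item))
    Thread(target=loop, daemon=True).start()

NUM_PARTITIONS = 4
state = [{} for _ in range(NUM_PARTITIONS)]  # instances of state, not new components

def owner(key):
    return hash(key) % NUM_PARTITIONS        # each key has exactly one owner

def apply_write(cmd):
    key, value = cmd
    state[owner(key)][key] = value           # disjoint partitions never contend
    return (key, value)

# Pipeline: parse -> apply, each stage decoupled onto its own thread.
q0, q1, q2 = Queue(), Queue(), Queue()
decouple(lambda raw: tuple(raw.split("=")), q0, q1)  # parse stage
decouple(apply_write, q1, q2)                        # apply stage

for raw in ["a=1", "b=2", "c=3"]:
    q0.put(raw)
q0.put(None)
while (item := q2.get()) is not None:
    print("applied", item)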

Always Measure One Level Deeper

This is a great paper (CACM 2018) by John Ousterhout. Ousterhout is well known for his work on the log-structured file system, Tcl/Tk, Raft, and the Magic VLSI CAD tool. His book A Philosophy of Software Design is great, and he has a lot of wisdom about life in general that he shares in his Stanford CS classes. The paper is written very well, so I lift paragraphs verbatim from it to summarize its main points. There are many war stories in the text. Please do read them, because they are fascinating; you will likely see how they apply to your work, and they may save you from making a mistake. At the end, I chime in with my reflections and link to other relevant work.

Key Insights

In academic research, a thorough performance evaluation is considered essential for many publications to prove the value of a new idea. In industry, performance evaluation is necessary to maintain a high level of performance across the lifetime of a product. A good performance evaluation provides a deep understanding of...
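In that spirit, here is a minimal Python sketch of what measuring one level deeper can look like: time the top-level operation and its parts, then check whether the parts account for the whole. The stage names and sleeps are made up for illustration; a large unaccounted gap tells you where to dig next.

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Accumulate wall-clock time spent under this label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start

def handle_request():
    with timed("total"):                 # the top-level number everyone reports
        with timed("parse"):             # ...and one level deeper
            time.sleep(0.002)
        with timed("lookup"):
            time.sleep(0.010)
        with timed("serialize"):
            time.sleep(0.003)

for _ in range(100):
    handle_request()

parts = sum(v for k, v in timings.items() if k != "total")
print({k: round(v * 1000, 1) for k, v in timings.items()}, "ms")
print(f"unaccounted: {(timings['total'] - parts) * 1000:.1f} ms")
# If "unaccounted" is large, the next level down is where to look.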

TLA+ modeling of a single replicaset transaction

For some time I had been playing with transaction modeling, and most recently with replicaset modeling by way of a single log. While playing with these, I realized I could build something cool on top of these building blocks. I just finished building a snapshot isolated transaction model that sits on top of a replicaset log. This is also a high-level approximation of MongoDB-style snapshot isolated transactions on a single replicaset. I put this up on GitHub at https://github.com/muratdem/MDBTLA/tree/main/SingleShardTxn Next, I will work on extending this to model MongoDB multishard transactions. That will be a whole lot more work, but I think I have a good foundation to take on the challenge.

Modeling process

My starting point was these components.

ClientCentricIsolation: This is a module for checking isolation levels on transactions using an ops-log per transaction. I discussed this here two years ago.

KVSnap: This is a high-level snapshot isolated single-node key-value store...
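For readers who prefer code to specs, here is a rough Python analogue of the snapshot isolation being modeled (my illustrative sketch, not the TLA+ spec itself): a transaction reads from the snapshot taken at start, buffers its writes, and commits only if no transaction that committed after its snapshot wrote the same keys (first-committer-wins).

class SnapshotStore:
    def __init__(self):
        self.data = {}       # committed key -> value
        self.version = 0     # commit counter
        self.write_log = []  # (commit_version, keys) for conflict checks

    def begin(self):
        return Txn(self, dict(self.data), self.version)

class Txn:
    def __init__(self, store, snapshot, start_version):
        self.store, self.snapshot, self.start = store, snapshot, start_version
        self.writes = {}     # buffered writes, invisible until commit

    def get(self, key):
        return self.writes.get(key, self.snapshot.get(key))

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        s = self.store
        for ver, keys in s.write_log:
            if ver > self.start and keys & self.writes.keys():
                return False  # write-write conflict since our snapshot: abort
        s.version += 1
        s.write_log.append((s.version, set(self.writes)))
        s.data.update(self.writes)
        return True

store = SnapshotStore()
t1, t2 = store.begin(), store.begin()
t1.put("k", 1)
t2.put("k", 2)
print(t1.commit(), t2.commit())  # True False: the second writer of "k" aborts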

Know Yourself

MongoDB has a nice leadership development program internally. They suggested that filling out and sharing this questionnaire would be useful for getting acquainted with the people you work with daily. I am not super into the being-in-touch-with-your-feelings stuff, so I thought I probably wouldn't do this. But then, the questions were interesting, and I jotted down my answers to them quickly. I don't think this took me more than 5 minutes or so, and I felt good after having answered the questions. Maybe this gave me a little bit of self-acceptance, being kind to my innate nature rather than judging it all the time. The questionnaire description was spot on: "Increase awareness of your own preferences by answering the questions below." Here are my answers to these questions. What would your answers be? I think it is good practice to do this exercise once in a while to know yourself a bit better.

What are you AMAZING at? Where do you want to improve?

I am good at research. I enjoy being...

Understanding Inconsistency in Azure Cosmos DB with TLA+

This paper, by Finn Hackett, Joshua Rowe, and Markus Kuppe, appeared at the International Conference on Software Engineering (ICSE) 2023. It presents a specification of Azure Cosmos DB's consistency behavior as exposed to the clients. During my sabbatical at CosmosDB in 2018, I was involved in a specification of CosmosDB as exposed to the clients. The nice thing about these specs is that they didn't need to model the internal implementation; they just captured the consistency semantics for clients precisely, rather than ambiguously the way English explanations do. They aimed to answer the question of what kind of behavior a client should be able to witness while interacting with the service. The feedback at that time was that customers found this very useful. This 2023 paper improves on our preliminary specs from 2018. It has this great opening paragraph, which echoes the experience of everyone who has painstakingly specified the behavior of a distributed/concurrent system: Consistency guarantees for distributed...
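To give a flavor of what a client-facing spec checks, here is a toy Python checker (illustrative, not the paper's TLA+): given the history one client session observed, it tests properties like read-your-writes and monotonic reads using only what the client could witness.

def check_session(history):
    """history: ("write"|"read", key, version) events from one client session."""
    last_written = {}  # key -> latest version this session wrote
    last_read = {}     # key -> latest version this session observed

    for op, key, ver in history:
        if op == "write":
            last_written[key] = ver
        else:
            if ver < last_written.get(key, 0):
                return f"read-your-writes violated on {key}"
            if ver < last_read.get(key, 0):
                return f"monotonic reads violated on {key}"
            last_read[key] = ver
    return "ok"

print(check_session([("write", "x", 1), ("read", "x", 2), ("read", "x", 1)]))
# -> monotonic reads violated on x: the session saw x's version go backwards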

Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database

Key-value stores have simple semantics, which makes it cumbersome to perform complex operations involving multiple keys. Redis addresses this by providing low-latency access to complex data structures like lists, sets, and hashes. Unfortunately, Redis lacks strong consistency guarantees and durability in the face of node failures, which limits its applicability beyond caching use cases. Amazon MemoryDB is a fully managed in-memory database service that leverages Redis's performance strengths while overcoming its durability limitations. It uses Redis as the in-memory data processing engine but offloads persistence to an AWS-internal transaction log service (internally known as the journal). This decoupled architecture provides in-memory performance with microsecond reads and single-digit millisecond writes, while ensuring durability across availability zones (AZs), 99.99% availability, and strong consistency in the face of failures. This Sigmod 2024 paper describes MemoryDB's architecture...
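Here is a minimal Python sketch of that decoupling (hypothetical API, not AWS's): the in-memory engine executes the command immediately, but the client is acknowledged only after the effect is durably appended to the external transaction log, so losing the node cannot lose acknowledged writes.

class Journal:
    """Stand-in for the AWS-internal transaction log service."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)  # real service: replicated across AZs
        return True                 # ack from the journal = durable

class MemoryFirstDB:
    def __init__(self, journal):
        self.journal = journal
        self.mem = {}               # Redis stand-in: the in-memory engine

    def set(self, key, value):
        self.mem[key] = value                       # fast in-memory execute
        if not self.journal.append(("set", key, value)):
            raise RuntimeError("not durable")       # never ack before the log
        return "OK"                                 # ack only after durability

    @classmethod
    def recover(cls, journal):
        db = cls(journal)                           # rebuild memory by replay
        for op, key, value in journal.entries:
            db.mem[key] = value
        return db

j = Journal()
db = MemoryFirstDB(j)
db.set("user:1", "alice")
restarted = MemoryFirstDB.recover(j)                # simulate losing the node
print(restarted.mem)                                # {'user:1': 'alice'}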

Recent reads (Mar-Apr 2024)

I am late with my book reporting. These are the books I read, mostly in March.

Going Infinite (Michael Lewis, 2023)

I had the highest praise for Michael Lewis several times, and called him one of my favorite authors. I feel sorry to write a bad review for one of my favorite authors. Lewis faced significant backlash from people who questioned his impartiality in portraying Sam Bankman-Fried (SBF), raising concerns about the fairness of his depiction. My complaint is not that. I came to this book for captivating storytelling, which, unfortunately, falls short. This book pales in comparison to Lewis's past works. We are talking about the guy who made me read thick books about the finance and stock-trading worlds, in which I had zero interest, as if they were fast-paced thriller novels. So I was really looking forward to reading this book to get that reader's high, but I was disappointed. I am familiar with the blockchain and cryptocurrency world through my research and my internal...
