Posts

Showing posts with the label cloud computing

Vive la Difference: Practical Diff Testing of Stateful Applications

This Google paper (to appear in VLDB'25) is about not blowing up your production system. That is harder than it sounds, especially with stateful applications with memories. When rolling out new versions of stateful applications, the "shared, persistent, mutable" data means bugs can easily propagate across versions. Modern rollout tricks (canaries, blue/green deployments) don't save you from this. Subtle cross-version issues often slip through pre-production testing and surface in production, sometimes days or weeks later. These bugs can be severe, and the paper categorizes them as data corruption, data incompatibility, and false data assumptions. The paper mentions real-world incidents from Google and open-source projects to emphasize these bugs' long detection and resolution times, and the production outages and revenue loss they cause. So, we need tooling that directly tests v1/v2 interactions on realistic data before a rollout. The paper delivers a prototype o...
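To make the idea of testing v1/v2 interactions on shared data concrete, here is a minimal sketch in Python. It is not the paper's prototype: the snapshot, the two reader functions, and `diff_test` are all hypothetical names made up for illustration. It replays the same requests against an old and a new version over copies of the same stored state and reports where the answers diverge, which is how a "false data assumption" (v2 assuming cents where v1 wrote dollars) would show up before a rollout.

```python
import copy

def read_v1(state, key):
    # Hypothetical v1: stored amounts are whole dollars.
    return state[key]

def read_v2(state, key):
    # Hypothetical v2: wrongly assumes stored amounts are cents.
    return state[key] // 100

def diff_test(snapshot, requests, old, new):
    """Replay each request against both versions and collect mismatches."""
    diffs = []
    for key in requests:
        # Each version gets its own copy so neither can affect the other.
        old_result = old(copy.deepcopy(snapshot), key)
        new_result = new(copy.deepcopy(snapshot), key)
        if old_result != new_result:
            diffs.append((key, old_result, new_result))
    return diffs

snapshot = {"a": 5, "b": 0}   # data written by v1
print(diff_test(snapshot, ["a", "b"], read_v1, read_v2))
```

Key "b" slips through because both versions agree on zero, which hints at why the realism of the test data matters.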

Understanding the Performance Implications of Storage-Disaggregated Databases

Storage-compute disaggregation in databases has emerged as a pivotal architecture in cloud environments, as evidenced by Amazon (Aurora), Microsoft (Socrates), Google (AlloyDB), Alibaba (PolarDB), and Huawei (Taurus). This approach decouples compute from storage, allowing for independent and elastic scaling of compute and storage resources. It provides fault-tolerance at the storage level. You can then share the storage for other services, such as adding read-only replicas for the databases. You can even use the storage level for easier sharding of your database. Finally, you can also use this for exporting a changelog asynchronously to feed into peripheral cloud services, such as analytics. Disaggregated architecture was the topic of a Sigmod 23 panel. I think this quote summarizes the industry's thinking on the topic. "Disaggregated architecture is here, and is not going anywhere. In a disaggregated architecture, storage is fungible, and computing scales independently. ...

Optimizing Distributed Protocols with Query Rewrites

This paper (Sigmod 2024) formalizes scalability optimizations as rule-driven rewrites, inspired by SQL query optimizations. It focuses on two well-known and popular techniques: decoupling (distributing logic/code across nodes to introduce pipeline parallelism) and partitioning (distributing data across nodes, with no new components but instances of them, for workload parallelism). Whittaker et al. (hi!) applied decoupling and partitioning optimizations to Paxos in the Compartmentalized Paxos work. That work deconstructed Paxos and showed how to reconstruct it using decoupling and partitioning to be more scalable by individually focusing on each component/role in Paxos. This is a simple but effective trick. Even after you learn the trick, you still keep getting surprised by how effective it is. The current paper aims to apply these optimizations via query rewrite rules more methodically, rather than in the traditional ad-hoc way. To this end, the paper utilizes Dedalus, a Datalog¬ dialect ...
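As a rough illustration of partitioning as described above (same component, more instances, data split by key), here is a minimal Python sketch. It is not the paper's rewrite machinery and has nothing to do with Dedalus; `partition` and `route` are hypothetical helpers, and hashing with CRC32 is just one convenient stable choice.

```python
import zlib

def partition(key: str, num_instances: int) -> int:
    # Stable hash so a given key always maps to the same instance.
    return zlib.crc32(key.encode("utf-8")) % num_instances

def route(messages, num_instances):
    """Split (key, payload) messages across identical instances by key."""
    buckets = [[] for _ in range(num_instances)]
    for key, payload in messages:
        buckets[partition(key, num_instances)].append((key, payload))
    return buckets

messages = [("x", 1), ("y", 2), ("x", 3), ("z", 4)]
for i, bucket in enumerate(route(messages, 3)):
    print(i, bucket)
```

Each instance sees all messages for the keys it owns, so per-key logic stays unchanged while the instances work in parallel.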

Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows

This paper appeared in USENIX ATC'23. It is a survey of microservices at Meta (née Facebook). We had previously reviewed a microservices survey paper from Alibaba. Motivated maybe by the desire for differentiation, the Meta paper spends the first two sections justifying why we need yet another microservices survey paper. I didn't mind reading this paper at all; it is an easy read. The paper gives another design point/view from industry on microservices topologies, call graphs, and how they evolve over time. It argues that this information will help build more accurate microservices benchmarks and artificial microservice topology/workflow generators, and also help future microservices research and development. I did learn some interesting information and statistics about microservices use in Meta from the paper. But I didn't find any immediately applicable insights/takeaways to improve the quality and reliability of the services we build in the cloud. The con...

Towards Modern Development of Cloud Applications

This paper is from HotOS'23. At 6 pages, it is an easy-to-read paper, but it is not an easy-to-agree-with paper. The message is controversial: Don't do microservices, write a monolith, and our runtime will take care of deployment and distribution. This is a big claim, and we have been burned by ambitious attempts like this many times before. I realize big claims are part of the style of HotOS, where work-in-progress and sometimes provocative papers make a debut to kickstart a discussion. This paper sure does a good job of starting a discussion. Good: There is code, and it is open source, so this is not just a speculation paper. A Go framework does exist, which has been under development for some time inside Google. Given Google's expertise on infrastructure and Go, I think this framework will be a big boon to the Google Cloud Platform (GCP), if it gets into production. To evaluate the framework (let's call it ServiceWeaver, by its GitHub name, shall we?), they consider...

Kora: A Cloud-Native Event Streaming Platform For Kafka

This paper from VLDB'23 (awarded the Best Industry Paper) describes how Confluent built Kora to provide Kafka as a managed cloud event streaming platform. Kora combines best practices to deliver cloud features such as high availability, durability, scalability, elasticity, cost efficiency, performance, and multi-tenancy. For example, the Kora architecture decouples its storage and compute tiers to facilitate elasticity, performance, and cost efficiency. As another example, Kora defines a Logical Kafka Cluster (LKC) abstraction to serve as the user-visible unit of provisioning, so it can help customers distance themselves from the underlying hardware and think in terms of application requirements. The writing of the paper could be much better. I think the paper fails to sympathize with the reader, who lacks the context about Kafka in the first place, and rushes into explaining the mechanics of how Kora makes Kafka a cloud-managed offering. The motivation and use cases of Kafka could hav...

Metastable failures in the wild

This paper appeared in OSDI'22. There is a great summary of the paper by Aleksey (one of the authors and my former PhD student, go Aleksey!). There is also a great conference presentation video from Lexiang. Below I will provide a brief overview of the paper followed by my discussion points. This topic is very interesting and important, so I hope you have fun learning about this. Metastability concept and categories: Metastable failure is defined as permanent overload with low throughput even after the fault-trigger is removed. It is an emergent behavior of a system, and it naturally arises from the optimizations for the common case that lead to sustained work amplification. In this paper, the authors are able to capture/abstract the system behavior of interest in terms of two parameters, the load and capacity. If the load is above capacity, you have work piling up, right? Or if the capacity drops under the sustained load level, the same effect, right? Both of these create a tem...
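The load-versus-capacity framing above can be played with in a toy model. The following Python sketch is an illustration under made-up assumptions, not the paper's model: all numbers are arbitrary, and "clients retry once the backlog exceeds a timeout threshold" is a stand-in for whatever work amplification a real system has. A temporary capacity drop is the trigger, and retries are the sustaining effect.

```python
def simulate(amplification, steps=60, base_load=80, capacity=100,
             degraded_capacity=40, trigger=range(10, 20), timeout_backlog=100):
    """Return the backlog after each tick of a toy queueing model."""
    backlog = 0
    history = []
    for t in range(steps):
        # The trigger: capacity is degraded only during the trigger window.
        cap = degraded_capacity if t in trigger else capacity
        # Work amplification: once the backlog is long enough that requests
        # time out, clients retry and the offered load is multiplied.
        overloaded = backlog > timeout_backlog
        arrivals = base_load * amplification if overloaded else base_load
        backlog = max(0, backlog + arrivals - cap)
        history.append(backlog)
    return history

no_retries = simulate(amplification=1)
with_retries = simulate(amplification=2)
print("final backlog without retries:", no_retries[-1])
print("final backlog with retries:   ", with_retries[-1])
```

Without amplification the backlog drains once capacity is restored. With amplification the amplified load stays above the restored capacity, so the backlog keeps growing long after the trigger is gone, which is the metastable behavior in miniature.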

F1: A Distributed SQL Database That Scales

This is a VLDB 2013 paper (appeared earlier at Sigmod'12 it seems) from Google about paying tech-debt. F1 replaces the hacky sharded MySQL implementation of AdWords with a principled, well-engineered infrastructure that builds a distributed SQL layer on top of Spanner. My reaction to this paper probably up until 5 years ago would have been "ugh, schemas, database stuff... No distributed algorithms, pass!". But I like to think that I improve with age. After having learned the importance of databases as real-world killer applications of distributed systems, I now look at papers like this as a potential Rosetta stone between the two fields. I am looking for papers that can shave off weeks and months from my journey to learn about distributed databases. I think this paper has been very useful for understanding several issues regarding distributed databases, as it gave information about many facets of deploying a large-scale distributed SQL database. But, the paper fails to become a...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems