Posts

DDIA: Chp 5. Replication (Part 1)

Chapter 5 of the Designing Data Intensive Applications (DDIA) book discusses strategies and challenges in replicating data across distributed systems. Replication is critical for ensuring data availability, fault tolerance, and scalability. One of the key challenges in replication is maintaining consistency across multiple replicas. Leader-based replication, the most common method, involves a single leader node accepting all writes, while follower nodes replicate the leader's changes. While single-leader replication does not readily guarantee consistency (due to leader failover corner cases), it gives us a fighting chance. One primary advantage of leader-based replication is its simplicity: all write requests are serialized through the leader, ensuring a consistent order of operations. Followers receive a stream of changes (i.e., the replication log) from the leader and apply them in sequence, ensuring eventual consistency. However, replication lag, the delay between when a write is applied on the leader and when it becomes visible on a follower, means that reads from followers can return stale data.
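To make the mechanics concrete, here is a minimal sketch of leader-based replication (the class and method names are mine, not the book's): the leader serializes writes into a replication log, and followers apply entries strictly in sequence order.

```python
# Minimal sketch of leader-based replication; in-memory and single-process,
# so it only illustrates the ordering logic, not durability or networking.

class Follower:
    def __init__(self):
        self.data = {}
        self.applied_seq = 0

    def apply(self, seq, key, value):
        # Followers apply replication-log entries strictly in sequence.
        assert seq == self.applied_seq + 1, "gap in replication log"
        self.data[key] = value
        self.applied_seq = seq

class Leader:
    def __init__(self, followers):
        self.log = []              # the replication log: (seq, key, value)
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        # All writes funnel through the leader, which fixes a single order.
        seq = len(self.log) + 1
        self.log.append((seq, key, value))
        self.data[key] = value
        for f in self.followers:
            f.apply(seq, key, value)  # asynchronous in practice => replication lag

f1 = Follower()
leader = Leader([f1])
leader.write("x", 1)
assert f1.data["x"] == 1
```

In a real system the apply call is asynchronous, and the gap between the leader's write and the follower's apply is exactly the replication lag discussed above.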

FlexiRaft: Flexible Quorums with Raft

This paper appeared in CIDR'23 and is from Meta (wow, this is the first time I used the new name without needing to mention it is in fact Facebook... wait.. goddammit). The paper talks about how they applied Raft to MySQL replication, and used flexible quorums in the process. This is not a technically deep paper, but it was interesting to see a practical application of the flexible quorums idea to Raft rather than Paxos; that adoption is also the most technically interesting part. What is the difference? The flexible quorums idea was developed for Paxos's two phases. Raft, however, needs to impose an extra requirement on quorums in order to guarantee Leader Completeness: "the new leader must already have all log entries replicated by a majority of nodes in the previous term." The paper does not call this out explicitly. This is what the paper says: "For state machine safety, every data commit quorum needs to intersect with every leader election quorum."
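Here is a toy check of that intersection condition (the quorum sets below are made-up examples, not Meta's configuration): safety demands that every data commit quorum share at least one node with every leader election quorum.

```python
from itertools import product

# Sketch: verify that every data commit quorum intersects every leader
# election quorum. Quorums are just sets of node ids here.

def quorums_intersect(commit_quorums, election_quorums):
    return all(cq & eq for cq, eq in product(commit_quorums, election_quorums))

commit_quorums   = [{"a", "b"}, {"c", "d"}]            # small, fast commit quorums
election_quorums = [{"a", "c", "e"}, {"b", "d", "e"}]  # larger election quorums

print(quorums_intersect(commit_quorums, election_quorums))  # True
print(quorums_intersect([{"a"}], [{"b", "c"}]))             # False: unsafe config
```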

DDIA: Chp 4. Encoding and Evolution (Part 2)

This second part of Chapter 4 of the Designing Data Intensive Applications (DDIA) book discusses methods of data flow in distributed systems, covering dataflow through databases, service calls, and asynchronous message passing. For databases, the process writing to the database encodes the data, and the reading process decodes it. We need both backward and forward compatibility, as older and newer versions of code may coexist during rolling upgrades. The book emphasizes that *data often outlives code*, which makes schema evolution crucial. The techniques we discussed in the first part of the chapter for encoding/decoding and backward/forward compatibility of schemas apply here. Most databases avoid rewriting large datasets when schema changes occur, instead opting for simple changes like adding nullable columns. For service calls, the chapter primarily discusses web services, which use HTTP as the underlying protocol. Web services are used not only for client-server communication, but also for communication between an organization's own services, as in microservice architectures.
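A small sketch of the database-dataflow pitfall the book warns about (the record and field names are hypothetical): if old code reads a record, updates it, and writes it back, a field added by newer code can be silently lost unless unknown fields are preserved.

```python
import json

# Newer code added "photo_url" to the record; older code only knows "name".
record_v2 = json.dumps({"name": "Ada", "photo_url": "http://example.com/ada.png"})

KNOWN_FIELDS_V1 = {"name"}

def old_code_update_lossy(encoded):
    # Decoding into a fixed model object drops fields old code doesn't know.
    rec = {k: v for k, v in json.loads(encoded).items() if k in KNOWN_FIELDS_V1}
    rec["name"] = rec["name"].upper()
    return json.dumps(rec)               # "photo_url" is silently lost!

def old_code_update_safe(encoded):
    rec = json.loads(encoded)            # keep unknown fields intact
    rec["name"] = rec["name"].upper()
    return json.dumps(rec)

print(old_code_update_lossy(record_v2))  # {"name": "ADA"}
print(old_code_update_safe(record_v2))   # photo_url preserved
```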

Taming Consensus in the Wild (with the Shared Log Abstraction)

This paper recently appeared at ACM SIGOPS Operating Systems Review. It provides an overview of the shared log abstraction in distributed systems, particularly focusing on its application in State Machine Replication (SMR) and consensus protocols. The paper argues that this abstraction can simplify the design and implementation of distributed systems, and can make them more reliable and easier to maintain. What is the shared log abstraction? The shared log abstraction proposes to separate the system into two layers: the database layer (proposers/learners) and the log layer (acceptors). The shared log provides a simple API for appending entries, checking the tail, reading entries, and trimming the log. This separation allows the SMR layer to focus on higher-level concerns without dealing with the complexities of consensus protocols. This is a wisdom-packed paper. It approaches the problem more from software engineering and systems/operations perspectives. (Previous work, the Delos OSDI'20 paper, introduced the virtual consensus design that this paper builds on.)
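Here is a minimal in-memory sketch of that four-call API (my naming, not the paper's exact interface); a real shared log would back these calls with a consensus protocol in the log layer.

```python
# Sketch of the shared log API: append, check tail, read, trim.
# In-memory only; it shows the interface, not the replicated implementation.

class SharedLog:
    def __init__(self):
        self.entries = {}   # position -> entry
        self.tail = 0       # next position to be assigned
        self.head = 0       # positions below head have been trimmed

    def append(self, entry) -> int:
        pos = self.tail               # the position gives the total order
        self.entries[pos] = entry
        self.tail += 1
        return pos

    def check_tail(self) -> int:
        return self.tail

    def read(self, pos):
        if pos < self.head:
            raise KeyError("position has been trimmed")
        return self.entries[pos]

    def trim(self, prefix: int):
        # Discard everything below `prefix`, e.g. after a checkpoint.
        for p in range(self.head, min(prefix, self.tail)):
            self.entries.pop(p, None)
        self.head = max(self.head, prefix)

log = SharedLog()
log.append("set x=1")
log.append("set x=2")
assert log.read(1) == "set x=2" and log.check_tail() == 2
```

The database layer then does SMR simply by appending its commands and replaying the log in position order.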

DDIA: Chp 4. Encoding and Evolution (Part 1)

This first part of Chapter 4 of the Designing Data Intensive Applications (DDIA) book discusses the concepts of data encoding and evolution in data-intensive applications. As applications inevitably change over time, it's important to build systems that can adapt to these changes easily, a property referred to as evolvability (under maintainability) in Chapter 1 of the book. Different data models handle change differently. Relational databases typically enforce a single schema at any given time, with changes implemented through schema migrations. In contrast, schema-on-read databases allow for a mix of older and newer data formats. Data format or schema change often necessitates corresponding application code changes. However, in large applications, code changes can't always happen instantaneously. This leads to situations where old and new versions of code and data formats coexist in the system. To handle this, we need to maintain both backward compatibility (newer code can read data written by older code) and forward compatibility (older code can read data written by newer code).
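A small sketch of both compatibility directions, in the spirit of Avro/Protobuf-style schema evolution (the schemas and field names below are made up):

```python
# Version 2 of a hypothetical user schema adds an optional "phone" field.
SCHEMA_V2_DEFAULTS = {"name": None, "email": None, "phone": None}

def decode_v2(record: dict) -> dict:
    # Backward compatibility: new code reading old data fills in defaults
    # for fields the old writer didn't know about.
    return {field: record.get(field, default)
            for field, default in SCHEMA_V2_DEFAULTS.items()}

def decode_v1(record: dict) -> dict:
    # Forward compatibility: old code reading new data ignores unknown fields.
    return {field: record.get(field) for field in ("name", "email")}

old_data = {"name": "Ada", "email": "ada@example.com"}
new_data = {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"}
print(decode_v2(old_data))  # new reader, old data: phone defaults to None
print(decode_v1(new_data))  # old reader, new data: phone is ignored
```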

Looming Liability Machines (LLMs)

As part of our zoom reading group (wow, 4.5 years old now), we discussed a paper that uses LLMs for automatic root cause analysis (RCA) for cloud incidents. This was a pretty straightforward application of LLMs. The proposed system employs an LLM to match incoming incidents to incident handlers based on their alert types, predicts the incident's root cause category, and provides an explanatory narrative. The only customization is through prompt engineering. Since this is a custom domain, I think a more principled and custom-designed machine learning system would be more appropriate rather than adopting LLMs. Anyways, the use of LLMs for RCAs spooked me viscerally. I couldn't find the exact words during the paper discussion, but I can articulate this better now. Let me explain. RCA is serious business. Root cause analysis (RCA) is the process of identifying the underlying causes of a problem/incident, rather than just addressing its symptoms. One RCA heuristic is asking 5 Whys to dig past the symptoms to the underlying cause.

DDIA: Chp 3. Storage and Retrieval (Part 2)

This is Chapter 3, part 2 of the Designing Data Intensive Applications (DDIA) book. This part focuses on storage and retrieval for OLAP databases. Analytics, data warehousing, star and snowflake schemas. A data warehouse is a dedicated database designed for analytical queries. It houses a read-only replica of data from transactional systems within the organization. This separation ensures that analytics (OLAP) operations do not interfere with OLTP operations, which are critical to the operation of the business. Data is extracted from OLTP databases, either through periodic snapshots or a continuous stream of updates. It's then transformed into a format optimized for analysis, cleaned to remove inconsistencies, and loaded into the data warehouse. This process is known as Extract-Transform-Load (ETL). Large enterprises often have massive data warehouses containing petabytes of transaction history stored in a "fact table". Each row of the fact table represents an individual event, such as a customer's purchase of a product.
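Here is a miniature star schema sketch (the tables and columns are hypothetical): a fact table of sales events whose foreign keys point into a product dimension table, plus a typical analytic rollup over a dimension attribute.

```python
# Dimension table: descriptive attributes, keyed by product id.
dim_product = {1: {"name": "espresso", "category": "coffee"},
               2: {"name": "bagel",    "category": "bakery"}}

# Fact table: one row per event, holding foreign keys and measures.
fact_sales = [
    {"product_fk": 1, "quantity": 2, "price": 3.00},
    {"product_fk": 2, "quantity": 1, "price": 2.50},
    {"product_fk": 1, "quantity": 1, "price": 3.00},
]

# A typical OLAP query: total revenue grouped by product category.
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_fk"]]["category"]   # star-schema join
    revenue_by_category[category] = (revenue_by_category.get(category, 0.0)
                                     + row["quantity"] * row["price"])
print(revenue_by_category)   # {'coffee': 9.0, 'bakery': 2.5}
```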

DDIA: Chp 3. Storage and Retrieval (Part 1)

This is Chapter 3, part 1 of the Designing Data Intensive Applications (DDIA) book. This part focuses on storage and retrieval for OLTP databases. Even if you won't be implementing a storage engine from scratch, it is still important to understand how databases handle storage and retrieval internally. This knowledge allows you to select and tune a storage engine that best fits your application's needs. The performance of your application often hinges on the efficiency of the storage engine, especially when dealing with different types of workloads, such as transactional (involving many small, frequent read and write operations) or analytical (involving block reads for executing complex queries over large datasets). There are two main families of storage engines: log-structured storage engines (e.g., LevelDB, RocksDB) and page-oriented storage engines like B-trees (e.g., Postgres, InnoDB/MySQL, WiredTiger/MongoDB). Log-Structured Storage Engines These engines use an append-only log as their core write structure: writes always go to the end of the log, and background compaction merges segments and discards superseded values.
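Here is a Python rendering of the chapter's simplest example, an append-only log paired with the in-memory hash index the chapter describes (the file and function names are illustrative):

```python
import os

DB_FILE = "toy.db"   # hypothetical data file
index = {}           # in-memory hash index: key -> byte offset of latest value

def db_set(key: str, value: str):
    with open(DB_FILE, "ab") as f:
        offset = f.tell()                      # append mode: offset is end of file
        f.write(f"{key},{value}\n".encode())   # old values are never overwritten
    index[key] = offset

def db_get(key: str) -> str:
    with open(DB_FILE, "rb") as f:
        f.seek(index[key])                     # one seek, thanks to the hash index
        _, value = f.readline().decode().rstrip("\n").split(",", 1)
        return value

db_set("x", "1")
db_set("x", "2")             # supersedes the earlier record in the log
assert db_get("x") == "2"    # the index points at the latest value
os.remove(DB_FILE)           # cleanup for this toy demo
```

Compaction, segment merging, and SSTables are the chapter's answer to the obvious problem here: the log grows without bound.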

Making database systems usable

C. J. Date's Sigmod 1983 keynote, "Database Usability", was prescient. Usability is the most important thing for customers. They care less about impressive benchmarks or clever algorithms, and more about whether they can operate and use a database efficiently to query, update, analyze, and persist their data with minimal headache. (BTW, does anyone have a link to the contents of this Sigmod'83 talk? There is no transcript around, except for this short abstract.) The paper we cover today is from Sigmod 2007. It takes on the database usability problem raised in that 1983 keynote head-on, and calls out that the king is still naked. Let's give some context for the year 2007. Yes, XML format was still popular then. The use-case in the paper is XQuery. The paper does not contain any reference to json. MongoDB would be released in 2009 with the document model; and that seems to be great timing for some of the usability pains mentioned in the paper! Web 2.0 was in full swing.

Linearizability: A Correctness Condition for Concurrent Objects

This paper, by Herlihy and Wing, appeared in ACM Transactions on Programming Languages and Systems in 1990. This is the canonical reference for the linearizability definition. I had not read this paper in detail before, so I thought it would be good to go to the source to see if there are additional delightful surprises in the original text. Hence, this post. I will dive into a technical analysis of the paper first, and then discuss some of my takes toward the end. I had written an accessible explanation of linearizability earlier; you may want to read that first. I will assume an understanding of linearizability to keep this review at a reasonable length. Introduction I love how the old papers just barge in with the model, without being bothered by pleasantries such as motivation of the problem. The introduction opens with the model: "A concurrent system consists of a collection of sequential processes that communicate through shared typed objects." This model encompasses both message-passing and shared-memory architectures.
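For intuition, here is a toy brute-force linearizability check for a single register (the operation encoding and example histories are mine, not the paper's notation): a history is linearizable if some total order of its operations respects real-time precedence and is a legal sequential register history.

```python
from itertools import permutations

# An operation is (kind, value, invoke_time, respond_time), kind "write"/"read".

def legal_sequential(order):
    # In a sequential register history, every read returns the latest write.
    reg = None
    for kind, value, _, _ in order:
        if kind == "write":
            reg = value
        elif reg != value:
            return False
    return True

def respects_realtime(order):
    # If op a responded before op b was invoked, a must precede b.
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[a] < pos[b]
               for a in order for b in order if a[3] < b[2])

def linearizable(history):
    return any(respects_realtime(p) and legal_sequential(p)
               for p in permutations(history))

ok = [("write", 1, 0, 3),    # write(1) overlaps the read in real time,
      ("read",  1, 1, 2)]    # so the read may be ordered after the write
bad = [("read", 1, 0, 1),    # this read finished before any write began,
       ("write", 1, 2, 3)]   # so no legal order exists
print(linearizable(ok), linearizable(bad))   # True False
```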
