Posts

DBSP: Automatic Incremental View Maintenance for Rich Query Languages

Incremental computation represents a transformative (!) approach to data processing. Instead of recomputing everything when the input changes slightly, incremental computation aims to reuse the original output and efficiently update the results, where "efficiently" means performing work proportional only to the size of the input and output changes. This paper introduces DBSP, a programming language inspired by signal processing (hence the name DB-SP). DBSP is simple, yet it offers extensive computational capabilities. With just four operators, it covers complex database queries, including the entire relational algebra, set and multiset computations, nested relations, aggregations, recursive queries, and streaming computations. Basic DBSP operators: The language is designed to capture computation on streams. Streams are represented as infinite vectors indexed by consecutive time steps. Each stream can carry values of any type that supports basic mathematical operations: addition, subtraction, and a zero element…
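To make the stream operators concrete, here is a minimal Python sketch of the delay (z^-1), differentiation (D), and integration (I) operators on finite stream prefixes, using plain integers as the stream values; the encoding is my simplification, not the paper's formalism.

    def delay(s):
        # z^-1: shift the stream by one step, prepending the zero element
        return [0] + s[:-1]

    def differentiate(s):
        # D(s)[t] = s[t] - s[t-1]: turn a stream of values into deltas
        return [x - y for x, y in zip(s, delay(s))]

    def integrate(s):
        # I(s)[t] = s[0] + ... + s[t]: turn deltas back into values
        out, acc = [], 0
        for x in s:
            acc += x
            out.append(acc)
        return out

    s = [1, 3, 6, 10]                        # a finite stream prefix
    assert integrate(differentiate(s)) == s  # I inverts D

The incrementalization trick composes these: roughly, the incremental version of a query Q integrates the input changes, applies Q, and differentiates the output (Q^Δ = D ∘ Q ∘ I).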

UB Hacking 2024

I attended the University at Buffalo Hacking event over the weekend. It was fun. There were 90+ projects, and I judged 15 of them. There were some interesting talks as well. It was good to see the youthful energy; it feels good to teach the next generation something. Another thing: GeoGuessr played as a group game under time pressure is a lot of fun. This may be a great family activity. Why should you care about proofs, if all you want to do is coding? Atri and Andrew decided this would be a good talk to give at a hackathon. Daring! They did a good job imparting their passion for the joys and benefits of mathematical thinking. They talked about Paul Erdős's concept of proofs from "The Book", and the difference between a merely correct proof and a great proof from "The Book". They talked about the deep insight that you can achieve through abstract mathematical thinking. They also mentioned that if you don't have good insight into the problem or your program, you will have a hard time debugging…

DDIA: Chp 9. Consistency and Consensus

Chapter 9 of the Designing Data Intensive Applications (DDIA) book has the following 4 sections (which contain a total of 10 subsections): consistency guarantees, linearizability, ordering guarantees, and distributed transactions and consensus. TMI (too much info): The chapter tries to do too much. Almost an entire semester of distributed systems content is force-jammed into this chapter. I don't think the Discord reading group will be able to complete this chapter and make sense of what is going on. For a more manageable load, much of this chapter should be moved to other chapters. The chapter starts by explaining linearizability. It would have been better to explain this in Chapter 5, Replication. Linearizability sounds easy, but it takes time to internalize. I had discussed linearizability here, and later revisited it with this paper. And of course it was mentioned in many other posts. The chapter continues with the CAP tradeoff, rightly calling it "the unhelpful CAP theorem"…
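Since linearizability takes time to internalize, here is a small hedged sketch of what it requires, as a brute-force check over a single read/write register: a history is linearizable if some total order of its operations respects real-time precedence while every read returns the latest preceding write. The encoding and names are mine, and the exhaustive search only works for tiny histories.

    from itertools import permutations

    # Each op is (invoke_time, response_time, kind, value); kind is "r" or "w".
    def linearizable(history):
        n = len(history)
        for order in permutations(range(n)):
            # Real-time precedence: if op b responded before op a was
            # invoked, b must come before a in the linearization.
            if any(history[order[j]][1] < history[order[i]][0]
                   for i in range(n) for j in range(i + 1, n)):
                continue
            # Register semantics: each read returns the latest prior write
            # (None before any write).
            current, valid = None, True
            for idx in order:
                _, _, kind, value = history[idx]
                if kind == "w":
                    current = value
                elif current != value:
                    valid = False
                    break
            if valid:
                return True
        return False

    # w(1) completes before the read starts, so the read must see 1.
    print(linearizable([(0, 1, "w", 1), (2, 3, "r", 1)]))     # True
    print(linearizable([(0, 1, "w", 1), (2, 3, "r", None)]))  # False: stale read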

Auto-WLM: machine learning enhanced workload management in Amazon Redshift

This paper appeared in SIGMOD'23. What? Auto-WLM is a machine learning based *automatic workload manager* currently used in production in Amazon Redshift. I thought this would be a machine learning paper, you know, deep learning and stuff. But this paper turned out to be a practical/applied data systems paper. At its core, this paper is about improving query performance and resource utilization in data warehouses, possibly a first for a database system in production at scale. They are not using deep learning, and rightly so! The main takeaway from the paper is that locally-trained simple models (using XGBoost, a decision-tree-based model built from the query plan trees) outperformed globally trained models, likely due to their ability to "instance optimize" to specific databases and workloads. They are using simple ML. And it works. Why? This is an important problem. If tuning is done prematurely, resources are unnecessarily wasted, and if it is done too late, overall…
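As a toy illustration of the idea (not Redshift's actual implementation), the sketch below trains a small gradient-boosted tree on made-up query-plan features to predict runtime, then uses the prediction to admit a query to a short- or long-query queue; every feature choice and threshold here is invented.

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)

    # Hypothetical plan features: [num_joins, log(est_rows), num_aggregates],
    # with a synthetic runtime in seconds as the training label.
    X = rng.random((500, 3)) * [5, 20, 3]
    y = 0.2 * X[:, 0] ** 2 + 0.05 * X[:, 1] + rng.normal(0, 0.1, 500)

    # A small, locally trained model, in the spirit of the paper's
    # per-cluster XGBoost models built from query plan trees.
    model = xgb.XGBRegressor(n_estimators=50, max_depth=4)
    model.fit(X, y)

    def route(plan_features, short_queue_cutoff_s=1.0):
        predicted = float(model.predict(np.asarray([plan_features]))[0])
        return ("short-query queue" if predicted < short_queue_cutoff_s
                else "long-query queue")

    print(route([1, 8.0, 0]))   # cheap plan: short-query queue
    print(route([5, 19.0, 2]))  # expensive plan: long-query queue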

DDIA: Chp 8. The Trouble with Distributed Systems

This is a long chapter. It touches on so many things. Here is the table of contents: faults and partial failures; unreliable networks (detecting faults, timeouts and unbounded delays, sync vs. async networks); unreliable clocks (monotonic vs. time-of-day clocks, clock sync and accuracy, relying on sync clocks, process pauses); knowledge, truth, and lies (truth defined by majority, byzantine faults, system model and reality). I don't know if listing a deluge of problems is the best way to approach this chapter. Reading these is fine, but it doesn't mean you learn them. I think you need time and hands-on work to internalize these. What can go wrong? Computers can crash. Unfortunately, they don't fail cleanly. Fail-fast is failing fast! And again unfortunately, partial failures (limping computers) are very difficult to deal with. Even worse, with transistor density so high, we now need to deal with silent failures. We have memory corruption and even silent faults from CPUs. The HPTS'24 s…
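As one small, concrete example from the clock discussion: elapsed-time measurements (e.g., for timeouts) should use a monotonic clock, because the time-of-day clock can jump when NTP steps it. A quick Python sketch, with an arbitrary workload standing in for real work:

    import time

    t0_wall = time.time()       # time-of-day: comparable across machines, can jump
    t0_mono = time.monotonic()  # monotonic: arbitrary origin, never goes backwards

    sum(i * i for i in range(1_000_000))  # some work to time

    print(f"wall-clock elapsed: {time.time() - t0_wall:.3f}s (unsafe for timeouts)")
    print(f"monotonic elapsed:  {time.monotonic() - t0_mono:.3f}s (safe for timeouts)")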

DDIA: Chp 7. Transactions (Part 2): Serializability

We are continuing from the first part of our Chapter 7 review. Serializable isolation ensures that the final result of concurrent transactions is equivalent to what they would have produced had they been run one at a time, without any concurrency. This eliminates concurrency anomalies, since it ensures the transactions behave as they would in a sequential environment. Databases offering serializability typically use one of three methods: executing transactions literally in a serial order, two-phase locking (2PL), or optimistic concurrency control, like serializable snapshot isolation (SSI). For now, we will focus on single-node databases; the book discusses how these methods apply to distributed systems in Chapter 9. Actual Serial Execution: Serial transaction execution is used in systems like VoltDB/H-Store, Redis, and Datomic. While limiting the database to a single thread can boost performance by eliminating coordination overhead, it restricts throughput to the capacity of one CPU core. Unlike traditional…
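For intuition on the second method, here is a minimal sketch of (strict) two-phase locking: locks are acquired as the transaction touches keys (the growing phase) and released only at commit (the shrinking phase). Deadlock handling is omitted, and the class and method names are illustrative, not from any real database.

    import threading

    class LockManager:
        def __init__(self):
            self._locks, self._guard = {}, threading.Lock()

        def lock_for(self, key):
            with self._guard:
                return self._locks.setdefault(key, threading.Lock())

    class Transaction:
        def __init__(self, store, lock_manager):
            self.store, self.lm, self.held = store, lock_manager, []

        def _acquire(self, key):          # growing phase
            lock = self.lm.lock_for(key)
            if lock not in self.held:
                lock.acquire()
                self.held.append(lock)

        def read(self, key):
            self._acquire(key)
            return self.store.get(key)

        def write(self, key, value):
            self._acquire(key)
            self.store[key] = value

        def commit(self):                 # shrinking phase: release everything
            for lock in reversed(self.held):
                lock.release()
            self.held.clear()

    store, lm = {"x": 0}, LockManager()
    t = Transaction(store, lm)
    t.write("x", t.read("x") + 1)  # read-modify-write under one held lock
    t.commit()
    print(store)                   # {'x': 1}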

DDIA: Chp 7. Transactions (Part 1)

Chapter 7 of the Designing Data Intensive Applications (DDIA) book discusses transactions, yay! Transactions in database systems group multiple operations into a logical unit, a box of operations if you will. They simplify error handling and help manage concurrency issues. See the introduction and fault-tolerance sections of the Gray and Reuter book for the motivation on this. The ACID properties (Atomicity, Consistency, Isolation, Durability) are often used to describe transaction characteristics, but their precise meaning has been left fuzzy by the implementations. Atomicity ensures all operations in a transaction either complete fully or not at all. It's not about concurrency, but about handling faults during transaction processing. Consistency is an application-level concern, ensuring that transactions maintain application-defined data integrity according to defined rules. This consistency is not the same consistency as in CAP, and it is best to forget about the C in ACID, and understand…
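As a small illustration of atomicity (and of consistency as an application-level rule), here is a sketch using Python's sqlite3: a transfer that would violate the application's no-negative-balance rule is rolled back, leaving no partial debit behind. The schema and amounts are made up for the example.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
    con.executemany("INSERT INTO accounts VALUES (?, ?)",
                    [("alice", 100), ("bob", 0)])
    con.commit()

    try:
        con.execute("UPDATE accounts SET balance = balance - 150 "
                    "WHERE name = 'alice'")
        (balance,) = con.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if balance < 0:                  # application-defined integrity rule
            raise ValueError("insufficient funds")
        con.execute("UPDATE accounts SET balance = balance + 150 "
                    "WHERE name = 'bob'")
        con.commit()                     # both updates become visible together
    except ValueError:
        con.rollback()                   # atomicity: the partial debit is undone

    print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
    # [('alice', 100), ('bob', 0)]: the failed transfer left no trace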

700

This is a special milestone: the 700th post, after 14 years of blogging here. 700 posts is a lot of blogging. But that comes down to 50 posts per year, which is about one post a week, totally doable, right? If I can get another 14 years of blogging at this rate, I will get to 1400. That is more than the EWD documents in terms of sheer number of posts, not that I am comparing my posts with the EWD documents. (BTW, of course I had a blog post about the EWDs.) I hope I can go for many more years, because I do enjoy this. Writing in the open and getting occasional feedback is nice. But I don't do it for the feedback. I wrote about "Why I blog" recently, so I am not going to rehash that topic. Instead, a good use of this post may be to serve as an index to the other posts in this blog. I noticed I keep referring people to my previous posts where I covered a question before. 700 posts is a lot of posts, so maybe an index at this snapshot can help people get more out of this blog. Let…

SRDS Day 2

Ok, continuing on from the SRDS day 1 post, I bring you SRDS day 2. Here are my summaries from the keynote, and from the talks for which I took some notes. Mahesh Balakrishnan's keynote: Mahesh's keynote was titled "An Hourglass Architecture for Distributed Systems". His talk focused on the evolution of his perspective on distributed systems research and the importance of abstraction in managing complexity. He began by reflecting on how, in his PhD days, he believed the goal was to develop better protocols with nice properties. He said he later realized that the true challenge in distributed systems research lies in creating new abstractions that simplify these complex systems. Complexity creeps into distributed systems through failures, asynchrony, and change. Mahesh also confessed that he didn't realize the extent of the importance of managing change until his days in industry. While other fields in computer science have successfully built robust abstractions (such as…

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Foundational distributed systems papers

Advice to the young

Linearizability: A Correctness Condition for Concurrent Objects

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book