Posts

Showing posts with the label big-data

DDIA: Chp 10. Batch Processing

Batch processing allows large-scale data transformations through bulk-synchronous processing. The simplicity of this approach made it possible to build reliable, scalable, and maintainable applications. If you recall, "reliable-scalable-maintainable" was what we set out to learn when we began the DDIA book. The story of MapReduce starts when Google engineers realized that computing over large data involved a lot of repetitive tasks. These tasks often involved individually processing elements and then gathering and fusing their output. Interestingly, this bears a striking resemblance to the electromechanical IBM card-sorting machines of the 1940s-50s. MapReduce also drew some inspiration from the map and reduce operations in Lisp: (map square '(1 2 3 4)) gives us (1 4 9 16), and (reduce + '(1 4 9 16)) gives us 30. The key innovation of Google's MapReduce framework was its ability to simplify parallel processing by abstracting away complex network communication and ...
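Here is the same pipeline expressed in Python (my own sketch, not from the chapter), to make the map and reduce steps concrete:

    # Python analogue of the Lisp example above:
    # map applies a function to each element, reduce folds the results together.
    from functools import reduce

    squared = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
    total = reduce(lambda a, b: a + b, squared)          # 30
    print(squared, total)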

DDIA: Chp 5. Replication (Part 2)

Chapter 5 of the Designing Data Intensive Applications (DDIA) book discusses strategies and challenges in replicating data across distributed systems. I covered the first part last week; here is the second part, on leaderless replication. Leaderless replication abandons the concept of a leader node and allows any replica to directly accept writes from clients. This method, while used in some of the earliest replicated data systems, fell out of favor during the dominance of relational databases. However, it regained popularity after Amazon implemented it in their in-house Dynamo system (not to be confused with DynamoDB). This inspired several open-source datastores like Riak, Cassandra, and Voldemort, which are often referred to as Dynamo-style databases. In leaderless replication systems, the concepts of quorums and quorum consistency are crucial. These systems use three key parameters: n, the total number of replicas; w, the write quorum (the number of nodes that must acknowledge a...
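To make the quorum condition concrete, here is a tiny Python sketch (my own illustration, not from the book's text): with n replicas, any write quorum of w nodes and any read quorum of r nodes are guaranteed to overlap in at least one up-to-date replica as long as w + r > n.

    # Quorum overlap check: a read quorum and a write quorum must intersect
    # in at least one replica, which holds whenever w + r > n.
    def quorums_overlap(n: int, w: int, r: int) -> bool:
        return w + r > n

    print(quorums_overlap(n=3, w=2, r=2))  # True: a typical Dynamo-style setting
    print(quorums_overlap(n=3, w=1, r=1))  # False: a read may miss the latest write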

DDIA: Chp 4. Encoding and Evolution (Part 2)

This second part of Chapter 4 of the Designing Data Intensive Applications (DDIA) book discusses methods of data flow in distributed systems, covering dataflow through databases, service calls, and asynchronous message passing. For databases, the process writing to the database encodes the data, and the reading process decodes it. We need both backward and forward compatibility, as older and newer versions of code may coexist during rolling upgrades. The book emphasizes that *data often outlives code*, which makes schema evolution crucial. The techniques we discussed in the first part of the chapter for encoding/decoding and backward/forward compatibility of schemas apply here. Most databases avoid rewriting large datasets when schema changes occur, instead opting for simple changes like adding nullable columns. For service calls, the chapter primarily discusses web services, which use HTTP as the underlying protocol. Web services are used not only for client-server communicatio...
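As a toy illustration of this kind of compatible schema change (my own Python sketch with made-up field names, not from the chapter): new code adds an optional, defaulted field, so records written by old code still decode (backward compatibility), while old code can simply ignore the extra field written by new code (forward compatibility).

    import json

    def decode_user(raw: bytes) -> dict:
        record = json.loads(raw)
        record.setdefault("nickname", None)  # new nullable field, defaulted for old records
        return record

    old_record = b'{"id": 1, "name": "Ada"}'                      # written by old code
    new_record = b'{"id": 2, "name": "Bob", "nickname": "bob99"}' # written by new code
    print(decode_user(old_record))
    print(decode_user(new_record))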

DDIA: Chp 3. Storage and Retrieval (Part 2)

This is Chapter 3, part 2 of the Designing Data Intensive Applications (DDIA) book. This part focuses on storage and retrieval for OLAP databases. Analytics, data warehousing, star and snowflake schemas: A data warehouse is a dedicated database designed for analytical queries. It houses a read-only replica of data from the transactional systems within the organization. This separation ensures that analytics (OLAP) operations do not interfere with OLTP operations, which are critical to the operation of the business. Data is extracted from OLTP databases, either through periodic snapshots or a continuous stream of updates. It's then transformed into a format optimized for analysis, cleaned to remove inconsistencies, and loaded into the data warehouse. This process is known as Extract-Transform-Load (ETL). Large enterprises often have massive data warehouses containing petabytes of transaction history stored in a "fact table". The "fact table" represents individual events ...
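Here is a toy Python sketch of the star schema idea (my own illustration, with made-up tables): a fact table of sale events references a product dimension table by key, and an analytical query aggregates over the join.

    # Fact table: one row per sale event, referencing the product dimension by key.
    fact_sales = [
        {"product_id": 1, "quantity": 3, "price": 2.50},
        {"product_id": 2, "quantity": 1, "price": 9.99},
        {"product_id": 1, "quantity": 2, "price": 2.50},
    ]
    # Dimension table: descriptive attributes of each product.
    dim_product = {1: {"name": "apple", "category": "fruit"},
                   2: {"name": "soap", "category": "household"}}

    # Analytical query: revenue per product category.
    revenue_by_category = {}
    for row in fact_sales:
        category = dim_product[row["product_id"]]["category"]
        revenue_by_category[category] = (revenue_by_category.get(category, 0.0)
                                         + row["quantity"] * row["price"])
    print(revenue_by_category)  # {'fruit': 12.5, 'household': 9.99}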

DDIA: Chp 3. Storage and Retrieval (Part 1)

This is Chapter 3, part 1 of the Designing Data Intensive Applications (DDIA) book. This part focuses on storage and retrieval for OLTP databases. Even if you won't be implementing a storage engine from scratch, it is still important to understand how databases handle storage and retrieval internally. This knowledge allows you to select and tune a storage engine that best fits your application's needs. The performance of your application often hinges on the efficiency of the storage engine, especially when dealing with different types of workloads, such as transactional (involving many small, frequent read and write operations) or analytical (involving block reads for executing complex queries over large datasets). There are two main families of storage engines: log-structured storage engines (e.g., LevelDB, RocksDB) and page-oriented storage engines like B-trees (e.g., Postgres, InnoDB/MySQL, WiredTiger/MongoDB). Log-Structured Storage Engines These engines use an append-onl...
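In the spirit of the chapter's simplest example, here is a minimal Python sketch (my own, not the book's code) of an append-only log with an in-memory hash index that maps each key to the byte offset of its latest value:

    index = {}  # key -> byte offset of the latest value in the log file

    def db_set(path: str, key: str, value: str) -> None:
        with open(path, "ab") as f:
            offset = f.tell()                     # append mode: position is end of file
            f.write(f"{key},{value}\n".encode())
        index[key] = offset                       # the latest write wins for a key

    def db_get(path: str, key: str):
        if key not in index:
            return None
        with open(path, "rb") as f:
            f.seek(index[key])
            _, value = f.readline().decode().rstrip("\n").split(",", 1)
            return value

    db_set("toy.log", "42", "hello")
    db_set("toy.log", "42", "world")
    print(db_get("toy.log", "42"))                # world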

Designing Data Intensive Applications (DDIA) Book

We started reading this book as part of Alex Petrov's book club. We just got started, so you can join us by joining the Discord channel above. We meet Wednesdays at 11am Eastern Time. Previously we had read the transaction processing book by Gray and Reuter. This page links to my summaries of that book. Chp 1. Reliable, Scalable, and Maintainable Applications I love the diagrams opening each chapter. Beautiful! The first chapter consists of warm-up stuff. It talks about the definitions of reliability, scalability, and maintainability. It is still engaging, because it is written with an educator and technical blogger voice, rather than a dry academic voice. This book came out in 2017. Martin is working on the new version. So if you have comments on things to focus on for the new version, it would be helpful to collect them in a document and email it to Martin. For example, I am curious about how the below paragraph from the Preface will get revised with 8 more years of hindsight...

SIGMOD/PODS Day 1

This week I am attending the SIGMOD/PODS conference in Seattle. I am a bit out of my depth here, because I am a distributed systems person putting on a database hat. But so far, so good. I am happy to see that I am able to follow most talks well. I also enjoyed meeting and talking with many database people during coffee breaks and the receptions. Today, day 1, was the PODS conference. I thought PODS was about storage systems because, you know, in the US we see these PODS boxes that are used to pack/store and transport stuff. As I attended one PODS talk after another and saw blocks of Greek symbols and formulas, it dawned on me to check what the acronym meant. It turns out PODS stands for "Symposium on Principles of Database Systems". Like its namesake "Principles of Distributed Computing (PODC)", PODS is also a theory conference. These two conferences use "principles" as a code word for theory, I guess. I don't see why that should be the case, though; I like my p...

Amazon Redshift Re-invented

This paper (SIGMOD'22) discusses the evolution of Amazon Redshift since 2015, when it launched. Redshift is a cloud data warehouse. A data warehouse basically means a place where analysis/querying/reporting is done over a shitload of data coming from multiple sources. Tens of thousands of customers use Redshift to process exabytes of data daily. Redshift is fully managed to make it simple and cost-effective to efficiently analyze BIG data. The concept art in this blog post is created with Stable Diffusion. Since its launch in 2015, the use cases for Redshift have evolved, and the teams focused on meeting the following customer needs: high-performance execution of complex analytical queries using innovative query execution via C++ code generation; fast scalability in response to changing workloads by disaggregating the storage and compute layers; ease of use via incorporating machine-learning-based autonomics; and seamless integration with the AWS ecosystem and other AWS purpose-built serv...

The Seattle Report on Database Research (2022)

Every 5 years, researchers from academia and industry gather to write a state-of-the-union (SOTU) report on database research. This one was released recently. It is a very readable report, and my summary consists of important paragraphs clipped from the report. Emphasis mine in bolded sentences. I use square brackets when I paraphrase a long text with a more direct statement. TL;DR: The SOTU is strong (the relational database market alone has revenue upwards of $50B) and growing stronger thanks to the boom in cloud computing and machine learning (ML). For the next 5 years, research should scale up to address emerging challenges in data science, data governance, cloud services, and database engines. What has changed in the last 5 years: Over the last decade, our research community pioneered the use of columnar storage, which is used in all commercial data analytic platforms. Database systems offered as cloud services have witnessed explosive growth. Hybrid transactional/analytical...
