This is with apologies to Butler Lampson, who published the " Hints for computer system design " paper 40 years ago in SOSP'83. I don't claim to match that work of course. I just thought I could draft this post to organize my thinking about designing distributed systems and get feedback from others. I start with the same disclaimer Lampson gave. These hints are not novel, not foolproof recipes, not laws of design, not precisely formulated, and not always appropriate. They are just hints. They are context dependent, and some of them may be controversial. That being said, I have seen these hints successfully applied in distributed systems design throughout my 25 years in the field, starting from the theory of distributed systems (98-01), immersing into the practice of wireless sensor networks (01-11), and working on cloud computing systems both in the academia and industry ever since. These heuristic principles have been applied knowingly or unknowingly and has proven...
Back in 2005, when I first joined the SUNY Buffalo CSE department, the department secretary was a wonderful lady named Joann, who was over 60. She explained that my travel reimbursement process was simple: I'd just hand her the receipts after my trip, she'd fill out the necessary forms, submit them to the university, and within a month, the reimbursement check would magically appear in my department mailbox. She handled this for every single faculty member, all while managing her regular secretarial duties. Honestly, despite the 30-day turnaround, it was the most seamless reimbursement experience I've ever had. But over time the department grew, and Joann moved on. The university partnered with Concur, as corporations do, forcing us to file our own travel reimbursements through this system. Fine, I thought, more work for me, but it can't be too bad. But, the department also appointed a staff member to audit our Concur submissions. This person's job wasn't to hel...
2025 was the year of the agent. The goalposts for AGI shifted; we stopped asking AI to merely "talk" and demanded that it "act". As an outsider looking at the architecture of these new agents and agentic system, I noticed something strange. The engineering tricks used to make AI smarter felt oddly familiar. They read less like computer science and more like … self-help advice . The secret to agentic intelligence seems to lie in three very human habits: writing things down, talking to yourself, and pretending to be someone else. They are almost too simple. The Unreasonable Effectiveness of Writing One of the most profound pieces of advice I ever read as a PhD student came from Prof. Manuel Blum, a Turing Award winner. In his essay "Advice to a Beginning Graduate Student", he wrote: "Without writing, you are reduced to a finite automaton. With writing you have the extraordinary power of a Turing machine." If you try to hold a complex argument enti...
This is definitely not a "learn distributed systems in 21 days" post. I recommend a principled, from the foundations-up, studying of distributed systems, which will take a good three months in the first pass, and many more months to build competence after that. If you are practical and coding oriented you may not like my advice much. You may object saying, "Shouldn't I learn distributed systems with coding and hands on? Why can I not get started by deploying a Hadoop cluster, or studying the Raft code." I think that is the wrong way to go about learning distributed systems, because seeing similar code and programming language constructs will make you think this is familiar territory, and will give you a false sense of security. But, nothing can be further from the truth. Distributed systems need radically different software than centralized systems do. --A. Tannenbaum This quotation is literally the first sentence in my distributed systems syllabus. Inst...
I talked about the importance of reading foundational papers last week. To followup, here is my compilation of foundational papers in the distributed systems area. (I focused on the core distributed systems area, and did not cover networking, security, distributed ledgers, verification work etc. I even left out distributed transactions, I hope to cover them at a later date.) I classified the papers by subject, and listed them in chronological order. I also listed expository papers and blog posts at the end of each section. Time and State in Distributed Systems Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Commn. of the ACM, 1978. Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy Leslie Lamport, ACM Transactions on Computer Systems, 1985. Virtual Time and Global States of Distributed Systems. Mattern, F. 1988. Practical uses of synchronized clocks in distributed systems. B. Liskov, 1991. Exp...
In a recent viral post , Matt Shumer declares dramatically that we've crossed an irreversible threshold. He asserts that the latest AI models now exercise independent judgment, that he simply gives an AI plain-English instructions, steps away for a few hours, and returns to a flawlessly finished product better than he could produce. In the near future, he claims, AI will autonomously handle all knowledge work and even build the next generation of AI itself, leaving human creators completely blindsided by the exponential curve. This was a depressing read. The dramatic tone lands well. And by extrapolating from progress in the last six years, it's hard to argue against what AI might achieve in the next six. I forwarded this to a friend of mine, who had the misfortune of reading it before bed. He told me he had a nightmare about it, dreaming of himself as an Uber driver, completely displaced from his high-tech career.
Hold your horses, though. I'm not unveiling a new S3-native database. This paper is from 2008. Many of its protocols feel clunky today. Yet it nails the core idea that defines modern cloud-native databases: separate storage from compute. The authors propose a shared-disk design over Amazon S3, with stateless clients executing transactions. The paper provides a blueprint for serverless before the term existed. SQS as WAL and S3 as Pagestore The 2008 S3 was painfully slow, and 100 ms reads weren't unusual. To hide that latency, the database separates "commit" from "apply". Clients write small, idempotent redo logs to Amazon Simple Queue Service (SQS) instead of touching S3 directly. An asynchronous checkpoint by a client applies those logs to B-tree pages on S3 later. This design shows strong parallels to modern disaggregated architectures . SQS becomes the write-ahead log (WAL) and logstore. S3 becomes the pagestore. Modern Aurora follows a similar logic : t...
I notice I haven't written any advice posts recently. Here is a collection of my advice posts pre 2020. I've been feeling all this elderly wisdom pent up in me, ready to pour at any moment. So here it goes. Get ready to quench your thirst from my fount of wisdom. No man, think for yourself, only get what works for you. It is called foundations, not theory Foundations of computer science (or rather any field of study) are the most important topics you can learn. These lay down the frame of thinking/perspective for that area of study. Yet, I am saddened to hear these called as "theory", and labeled as "unpractical". This couldn't be farther from the truth. Take a look at how I recommend studying distributed systems . Don't you dare call this "theory" and "unpractical". This lays the bedrock that you build your practice on. Don't skimp on the foundations. Don't build your home on quicksand. Keep your hands dirty, your mind cl...
This paper (CIDR'26) presents a comprehensive analysis of cloud hardware trends from 2015 to 2025, focusing on AWS and comparing it with other clouds and on-premise hardware. TL;DR: While network bandwidth per dollar improved by one order of magnitude (10x), CPU and DRAM gains (again in performance per dollar terms) have been much more modest. Most surprisingly, NVMe storage performance in the cloud has stagnated since 2016. Check out the NVMe SSD discussion below for data on this anomaly. CPU Trends Multi-core parallelism has skyrocketed in the cloud. Maximum core counts have increased by an order of magnitude over the last decade. The largest AWS instance u7in now boasts 448 cores. However, simply adding cores hasn't translated linearly into value. To measure real evolution, the authors normalized benchmarks (SPECint, TPC-H, TPC-C) by instance cost. SPECint benchmarking shows that cost-performance improved roughly 3x over ten years. A huge chunk of that gain comes from AWS G...
I remember the early 2010s as the golden age of productivity hacking. Lifehacker, 37signals, and their ilk were everywhere, and it felt like everyone was working on jury-rigging color-coded Moleskine task-trackers and web apps into the perfect Getting Things Done system. So recently I found myself wondering: what happened to all that excitement? Did I just outgrow the productivity movement, or did the movement itself lose stream? After poking around a bit, I think it's both. We collectively grew out of that phase, and productivity itself fundamentally changed. The Trap of Micro-Optimizations Back then, the underlying promise of productivity culture was about outputmaxxing (as we would now call it). We obsessed over efficiency at the margins: how to auto-sync this app with that one, or how to shave 5 seconds off an email reply. We accumulated systems, hacks, and integrations like collectors. Eventually, the whole thing got exhausting. I think we all realized that tweaking task manag...
Comments