This is with apologies to Butler Lampson, who published the "Hints for Computer System Design" paper 40 years ago at SOSP'83. I don't claim to match that work, of course. I just thought I could draft this post to organize my thinking about designing distributed systems and get feedback from others. I start with the same disclaimer Lampson gave. These hints are not novel, not foolproof recipes, not laws of design, not precisely formulated, and not always appropriate. They are just hints. They are context-dependent, and some of them may be controversial. That being said, I have seen these hints successfully applied in distributed systems design throughout my 25 years in the field, starting from the theory of distributed systems (98-01), immersing into the practice of wireless sensor networks (01-11), and working on cloud computing systems in both academia and industry ever since. These heuristic principles have been applied knowingly or unknowingly and have proven...
Back in 2005, when I first joined the SUNY Buffalo CSE department, the department secretary was a wonderful lady named Joann, who was over 60. She explained that my travel reimbursement process was simple: I'd just hand her the receipts after my trip, she'd fill out the necessary forms, submit them to the university, and within a month, the reimbursement check would magically appear in my department mailbox. She handled this for every single faculty member, all while managing her regular secretarial duties. Honestly, despite the 30-day turnaround, it was the most seamless reimbursement experience I've ever had. But over time the department grew, and Joann moved on. The university partnered with Concur, as corporations do, forcing us to file our own travel reimbursements through this system. Fine, I thought, more work for me, but it can't be too bad. But the department also appointed a staff member to audit our Concur submissions. This person's job wasn't to hel...
2025 was the year of the agent. The goalposts for AGI shifted; we stopped asking AI to merely "talk" and demanded that it "act". As an outsider looking at the architecture of these new agents and agentic systems, I noticed something strange. The engineering tricks used to make AI smarter felt oddly familiar. They read less like computer science and more like … self-help advice. The secret to agentic intelligence seems to lie in three very human habits: writing things down, talking to yourself, and pretending to be someone else. They are almost too simple.

The Unreasonable Effectiveness of Writing

One of the most profound pieces of advice I ever read as a PhD student came from Prof. Manuel Blum, a Turing Award winner. In his essay "Advice to a Beginning Graduate Student", he wrote: "Without writing, you are reduced to a finite automaton. With writing you have the extraordinary power of a Turing machine." If you try to hold a complex argument enti...
This is definitely not a "learn distributed systems in 21 days" post. I recommend a principled, foundations-up study of distributed systems, which will take a good three months in the first pass, and many more months to build competence after that. If you are practical and coding-oriented, you may not like my advice much. You may object, saying, "Shouldn't I learn distributed systems hands-on, by coding? Why can I not get started by deploying a Hadoop cluster, or studying the Raft code?" I think that is the wrong way to go about learning distributed systems, because seeing similar code and programming language constructs will make you think this is familiar territory, and will give you a false sense of security. But nothing could be further from the truth. Distributed systems need radically different software than centralized systems do. --A. Tanenbaum This quotation is literally the first sentence in my distributed systems syllabus. Inst...
I talked about the importance of reading foundational papers last week. To follow up, here is my compilation of foundational papers in the distributed systems area. (I focused on the core distributed systems area, and did not cover networking, security, distributed ledgers, verification work, etc. I even left out distributed transactions; I hope to cover them at a later date.) I classified the papers by subject, and listed them in chronological order. I also listed expository papers and blog posts at the end of each section.

Time and State in Distributed Systems

Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Communications of the ACM, 1978.
Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy and Leslie Lamport, ACM Transactions on Computer Systems, 1985.
Virtual Time and Global States of Distributed Systems. Mattern, F. 1988.
Practical uses of synchronized clocks in distributed systems. B. Liskov, 1991.
Exp...
This paper (CIDR'26) presents a comprehensive analysis of cloud hardware trends from 2015 to 2025, focusing on AWS and comparing it with other clouds and on-premise hardware. TL;DR: While network bandwidth per dollar improved by one order of magnitude (10x), CPU and DRAM gains (again in performance-per-dollar terms) have been much more modest. Most surprisingly, NVMe storage performance in the cloud has stagnated since 2016. Check out the NVMe SSD discussion below for data on this anomaly.

CPU Trends

Multi-core parallelism has skyrocketed in the cloud. Maximum core counts have increased by an order of magnitude over the last decade. The largest AWS instance, u7in, now boasts 448 cores. However, simply adding cores hasn't translated linearly into value. To measure real evolution, the authors normalized benchmarks (SPECint, TPC-H, TPC-C) by instance cost. SPECint benchmarking shows that cost-performance improved roughly 3x over ten years. A huge chunk of that gain comes from AWS G...
I notice I haven't written any advice posts recently. Here is a collection of my advice posts pre-2020. I've been feeling all this elderly wisdom pent up in me, ready to pour at any moment. So here it goes. Get ready to quench your thirst from my fount of wisdom. No, man. Think for yourself, and take only what works for you.

It is called foundations, not theory

Foundations of computer science (or rather any field of study) are the most important topics you can learn. These lay down the frame of thinking/perspective for that area of study. Yet, I am saddened to hear these called "theory" and labeled "unpractical". This couldn't be farther from the truth. Take a look at how I recommend studying distributed systems. Don't you dare call this "theory" and "unpractical". This lays the bedrock that you build your practice on. Don't skimp on the foundations. Don't build your home on quicksand.

Keep your hands dirty, your mind cl...
The premise of this position paper is appealing. We know Brooks' Law: adding manpower to a late software project makes it later. That is, human engineering capacity grows sub-linearly with headcount due to communication overhead and ramp-up time. The authors propose that AI agents offer a loophole: "Scalable Agency". Unlike humans, agents do not need days or weeks to ramp up; they load context instantly. So, theoretically, you can spin up 1,000 agents to explore thousands of design hypotheses in parallel, compressing the Time to Integrate (TTI: the duration required to implement/integrate new features/technologies into infrastructure systems) for complex infrastructure from months to days. The paper calls this vision Self-Defining Systems (SDS), and suggests that, thanks to agentic AI, future infrastructure will design, implement, and evolve itself. I began reading with great excitement, but by the final sections my excitement soured into skepticism. The bold claims of the intro...
In a recent viral post, Matt Shumer declares dramatically that we've crossed an irreversible threshold. He asserts that the latest AI models now exercise independent judgment, that he simply gives an AI plain-English instructions, steps away for a few hours, and returns to a flawlessly finished product better than he could produce. In the near future, he claims, AI will autonomously handle all knowledge work and even build the next generation of AI itself, leaving human creators completely blindsided by the exponential curve. This was a depressing read. The dramatic tone lands well. And extrapolating from the progress of the last six years, it's hard to argue against what AI might achieve in the next six. I forwarded this to a friend of mine, who had the misfortune of reading it before bed. He told me he had a nightmare about it, dreaming of himself as an Uber driver, completely displaced from his high-tech career.
Al-Gasr began as an autonomous agent town, but no one remembers now who deployed it. The original design documents were very clear. There were tasks. There were agents. There was persistence. Everything else had been added later by a minister's cousin. Al-Gasr ran on nine ministries. The Ministry of Compute handled execution, except when it didn't, in which case responsibility was transferred to the Ministry of Storage Degradation. The Ministry of Truth published daily bulletins. The Ministry of Previously Accepted Truth issued corrections. The Ministry of Future Truth prepared explanations in advance. Each ministry employed agents whose sole job was to supervise agents supervising their own nephews. At the top sat the Emir. Or possibly the late Emir. Or the Emir-in-Exile, depending on which dashboard you trusted. The system maintained three Emirs simultaneously to ensure high availability. This caused no confusion at all. The Emir du Jour governed by instinct and volume. Each ...