Posts

Welcome to Town Al-Gasr

Al-Gasr began as an autonomous agent town, but no one remembers now who deployed it. The original design documents were very clear. There were tasks. There were agents. There was persistence. Everything else had been added later by a minister's cousin. Al-Gasr ran on nine ministries. The Ministry of Compute handled execution, except when it didn't, in which case responsibility was transferred to the Ministry of Storage Degradation. The Ministry of Truth published daily bulletins. The Ministry of Previously Accepted Truth issued corrections. The Ministry of Future Truth prepared explanations in advance. Each ministry employed agents whose sole job was to supervise agents supervising their own nephews. At the top sat the Emir. Or possibly the late Emir. Or the Emir-in-Exile, depending on which dashboard you trusted. The system maintained three Emirs simultaneously to ensure high availability. This caused no confusion at all. The Emir du Jour governed by instinct and volume. Each ...

Agentic AI and The Mythical Agent-Month

The premise of this position paper is appealing. We know Brooks' Law: adding manpower to a late software project makes it later. That is, human engineering capacity grows sub-linearly with headcount due to communication overhead and ramp-up time. The authors propose that AI agents offer a loophole: "Scalable Agency". Unlike humans, agents do not need days or weeks to ramp up; they load context instantly. So, theoretically, you can spin up 1,000 agents to explore thousands of design hypotheses in parallel, compressing the Time to Integrate (TTI: the duration required to implement and integrate new features or technologies into infrastructure systems) for complex infrastructure from months to days. The paper calls this vision Self-Defining Systems (SDS), and suggests that thanks to Agentic AI, future infrastructure will design, implement, and evolve itself. I began reading with great excitement, but by the final sections my excitement soured into skepticism. The bold claims of the intro...

Rethinking the University in the Age of AI

Three years ago, I wrote a post titled "Getting schooled by AI, colleges must evolve". I argued that as we entered the age of AI, the value of "knowing" was collapsing, and the value of "doing" was skyrocketing. (See Bloom's taxonomy.) Today, that future has arrived. Entry-level hiring has stalled because AI agents absorb the small tasks where new graduates once learned the craft. So how do we prepare students for this reality? Not only do I stand by my original advice, I am doubling down. Surviving this shift requires more than minor curriculum tweaks; it requires a different philosophy of education. I find two old ideas worth reviving: a systems design mindset that emphasizes holistic foundations, and alternative education philosophies of the 1960s that give students real agency and real responsibility.

Holistic Foundations
Three years ago, I begged departments: "Don't raise TensorFlow disk jockeys. Teach databases! Teach compilers! Tea...

Cloudspecs: Cloud Hardware Evolution Through the Looking Glass

This paper (CIDR'26) presents a comprehensive analysis of cloud hardware trends from 2015 to 2025, focusing on AWS and comparing it with other clouds and on-premise hardware. TL;DR: While network bandwidth per dollar improved by one order of magnitude (10x), CPU and DRAM gains (again in performance-per-dollar terms) have been much more modest. Most surprisingly, NVMe storage performance in the cloud has stagnated since 2016. Check out the NVMe SSD discussion below for data on this anomaly.

CPU Trends
Multi-core parallelism has skyrocketed in the cloud. Maximum core counts have increased by an order of magnitude over the last decade. The largest AWS instance, u7in, now boasts 448 cores. However, simply adding cores hasn't translated linearly into value. To measure real evolution, the authors normalized benchmarks (SPECint, TPC-H, TPC-C) by instance cost. SPECint benchmarking shows that cost-performance improved roughly 3x over ten years. A huge chunk of that gain comes from AWS G...

The Sauna Algorithm: Surviving Asynchrony Without a Clock

While sweating it out in my gym's sauna recently, I found a neat way to illustrate the happened-before relationship in distributed systems. Imagine I suffer from a medical condition called dyschronometria, which makes me unable to perceive time reliably, such that 10 seconds and 10 minutes feel exactly the same to me. In this scenario, the sauna lacks a visible clock. I'm flying blind here, yet I want to leave after a healthy session. If I stay too short, I get no health benefits. If I stay too long, I risk passing out on the floor. The question becomes: How do I, a distributed node with no local clock, ensure that I operate within a safety window in an asynchronous environment? Thankfully, the sauna has a uniform arrival of people. Every couple of minutes, a new person walks in. These people don't suffer from dyschronometria and they stay for a healthy session, roughly 10 minutes. My solution is simple: I identify the first person to enter after me, and I leave when he leaves....
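The rule in the excerpt lends itself to a quick sketch. The snippet below is my own illustration, not code from the post; the parameter values (arrivals every 2 minutes, sessions of exactly 10 minutes) are assumptions chosen to match the excerpt's setup. Under them, leaving when the first person who entered after me leaves bounds my stay between 10 and 12 minutes, with no local clock needed.

```python
# Sketch of the sauna rule: with no clock of my own, I leave when the
# first person who entered *after* me leaves.
# Assumed parameters (mine, not the post's):

ARRIVAL_GAP = 2    # minutes between successive arrivals (0, 2, 4, ...)
SESSION = 10       # how long each clock-bearing visitor stays

def my_stay(my_arrival: float) -> float:
    """How long I end up staying if I arrive at time `my_arrival`."""
    # First scheduled arrival strictly after mine.
    next_arrival = ((my_arrival // ARRIVAL_GAP) + 1) * ARRIVAL_GAP
    my_exit = next_arrival + SESSION   # I leave when that person leaves
    return my_exit - my_arrival

# My stay always lands in (SESSION, SESSION + ARRIVAL_GAP]:
for t in [0.5, 1.9, 7.0, 13.3]:
    stay = my_stay(t)
    assert SESSION < stay <= SESSION + ARRIVAL_GAP
    print(f"arrive at {t:5.1f} min -> stay {stay:.1f} min")
```

The happened-before relation does the work here: "person P arrived after me" and "P left after a full session" together order my exit without either of us reading a shared clock.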

Are Database System Researchers Making Correct Assumptions about Transaction Workloads?

In this blog, we have reviewed quite a number of deterministic database papers, including Calvin, SLOG, and Detock, which aim to achieve higher throughput and lower latency. The downside of these systems is sacrificing transaction expressivity. They rely on two critical assumptions: first, that transactions are "non-interactive", meaning they are sent as a single request (one-shot) rather than engaging in a multi-round-trip conversation with the application; and second, that the database can know a transaction's read/write set before execution begins (to lock data deterministically). So when these deterministic database researchers write a paper to validate how these assumptions hold in the real world, we should be skeptical and cautious in our reading. Don't get me wrong, this is a great and valuable paper. But we still need to be critical in our reading.

Summary
The study employed a semi-automated annotation tool to analyze 111 popular open-source web applications...

Too Close to Our Own Image?

Recent work suggests we may be projecting ourselves onto LLMs more than we admit. A paper in Nature reports that GPT-4 exhibits "state anxiety". When exposed to traumatic narratives (such as descriptions of accidents or violence), the model's responses score much higher on a standard psychological anxiety inventory. The jump is large, from "low anxiety" to levels comparable to highly anxious humans. The same study finds that therapy works: mindfulness-style relaxation prompts reduce these scores by about a third, though not back to baseline. The authors argue that managing an LLM's emotional state may be important for safe deployment, especially in mental health settings and perhaps in other mission-critical domains. Another recent paper argues that LLMs can develop a form of brain rot. Continual training on what the authors call junk data (short, viral, sensationalist content typical of social media) leads to models developing weaker reasoning, poorer lon...

The Agentic Self: Parallels Between AI and Self-Improvement

2025 was the year of the agent. The goalposts for AGI shifted; we stopped asking AI to merely "talk" and demanded that it "act". As an outsider looking at the architecture of these new agents and agentic systems, I noticed something strange. The engineering tricks used to make AI smarter felt oddly familiar. They read less like computer science and more like… self-help advice. The secret to agentic intelligence seems to lie in three very human habits: writing things down, talking to yourself, and pretending to be someone else. They are almost too simple.

The Unreasonable Effectiveness of Writing
One of the most profound pieces of advice I ever read as a PhD student came from Prof. Manuel Blum, a Turing Award winner. In his essay "Advice to a Beginning Graduate Student", he wrote: "Without writing, you are reduced to a finite automaton. With writing you have the extraordinary power of a Turing machine." If you try to hold a complex argument enti...

Rethinking the Cost of Distributed Caches for Datacenter Services

This paper (HOTNETS'25) re-teaches a familiar systems lesson: caching is not just about reducing latency, it is also about saving CPU! The paper makes this point concrete by focusing on the second-order effect that often dominates in practice: the monetary cost of computation. The paper shows that caching --even after accounting for the cost of the DRAM you use for caching-- still yields 3-4x better cost efficiency thanks to the reduction in CPU usage. In today's cloud pricing model, that CPU cost dominates. DRAM is cheap. Well, was cheap... The irony is that after this paper was presented, DRAM prices jumped by 3-4x! Damn Machine Learning, ruining everything since 2018! Anyways, let's ignore that point conveniently to get back to the paper. Ok, so caches do help, but when do they help the most? Many database-centric or storage-side cache designs miss this point. Even when data is cached at the storage/database cache, an application read still needs to travel there, pay fo...
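To see how a figure like 3-4x can arise, here is a back-of-envelope cost model. The dollar figures below are my own illustrative placeholders, not numbers from the paper; only the shape of the argument (CPU cost dominates, cache DRAM is comparatively cheap per request) comes from the excerpt.

```python
# Back-of-envelope sketch of the cost argument. All prices are made-up
# placeholders; the model just charges CPU per request and, when a
# cache exists, an amortized DRAM cost against every request.

CPU_COST_MISS = 10e-6   # $ of CPU to serve a read from storage/DB
CPU_COST_HIT = 1e-6     # $ of CPU to serve a read from the cache
DRAM_COST = 1e-6        # amortized $ of cache DRAM per request served

def cost_per_request(hit_rate: float, has_cache: bool) -> float:
    """Expected dollar cost of one read under this simple linear model."""
    cpu = hit_rate * CPU_COST_HIT + (1 - hit_rate) * CPU_COST_MISS
    dram = DRAM_COST if has_cache else 0.0
    return cpu + dram

no_cache = cost_per_request(0.0, has_cache=False)
with_cache = cost_per_request(0.9, has_cache=True)
print(f"cost efficiency gain: {no_cache / with_cache:.2f}x")
```

With these placeholder numbers, a 90% hit rate yields roughly a 3.4x gain: even when DRAM is charged against every request, the avoided CPU work dominates the ledger.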

Randomer Things

I aspire to get bored in the new year
I've realized that chess has been eating my downtime. Because it lives on my phone (Lichess), it is frictionless to start a bullet game and get a quick dopamine hit. The problem is that I no longer get bored. That is bad. I need to get bored so I can start to imagine, daydream, think, self-reflect, plan, or even get mentally prepared for things (like the Stoics talked about). I badly need that empty space back. So bye, chess. Nothing personal. I will play only when teaching/playing with my daughters. I may occasionally cheat and play a bullet game on my wife's phone. But no more chess apps on my phone. While I was at it, I installed the Website Blocker extension for Chrome. I noticed my hands typing reddit or twitter at the first hint of boredom. The blocker is easy to disable, but that is fine. I only need that slight friction to catch myself before opening the site on autopilot.

I am disappointed by online discourse
In 2008, Reddit had a...

LeaseGuard: Raft Leases Done Right!

Many distributed systems have a leader-based consensus protocol at their heart. The protocol elects one server as the "leader" who receives all writes. The other servers are "followers", hot standbys who replicate the leader’s data changes. Paxos and Raft are the most famous leader-based consensus protocols. These protocols ensure consistent state machine replication , but reads are still tricky. Imagine a new leader L1 is elected, while the previous leader L0 thinks it's still in charge. A client might write to L1, then read stale data from L0, violating Read Your Writes . How can we prevent stale reads? The original Raft paper recommended that the leader communicate with a majority of followers before each read, to confirm it's the real leader. This guarantees Read Your Writes but it's slow and expensive. A leader lease is an agreement among a majority of servers that one server will be the only leader for a certain time. This means the leader can run...
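The lease idea described in the excerpt can be sketched generically. This is not the LeaseGuard protocol itself, just the basic mechanism: a majority grants the leader a lease, and the leader may serve reads locally only while the lease is unexpired. The quorum round that grants the lease is elided, and `LEASE_DURATION` is an assumed value for illustration.

```python
# Minimal sketch of a leader lease (generic, not LeaseGuard itself).
# While the lease holds, the leader answers reads locally, avoiding
# the majority round-trip the original Raft paper requires per read.

LEASE_DURATION = 2.0  # seconds; assumed value for illustration

class Leader:
    def __init__(self):
        self.lease_expiry = 0.0
        self.store = {}

    def acquire_lease(self, now: float) -> None:
        """Called after a majority promises not to elect another
        leader for LEASE_DURATION (the quorum round is not shown)."""
        self.lease_expiry = now + LEASE_DURATION

    def read(self, key, now: float):
        # Safe local read: no quorum round-trip needed while leased.
        if now >= self.lease_expiry:
            raise RuntimeError("lease expired: must re-confirm leadership")
        return self.store.get(key)

leader = Leader()
leader.acquire_lease(now=0.0)
leader.store["x"] = 1
print(leader.read("x", now=1.0))   # within the lease: served locally
```

The stale-read scenario from the excerpt is exactly what the expiry check prevents: a deposed leader's lease runs out before a new leader can be elected, so it refuses local reads rather than serve old data.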

TLA+ modeling tips

Model minimalistically
Start from a tiny core, and always keep a working model as you extend. Your default should be omission. Add a component only when you can explain why leaving it out would not work. Most models are about a slice of behavior, not the whole system in full glory: e.g., leader election, repair, reconfiguration. Cut entire layers and components if they do not affect that slice. Abstraction is the art of knowing what to cut. Deleting should spark joy.

Model specification, not implementation
Write declaratively. State what must hold, not how it is achieved. If your spec mirrors control flow, loops, or helper functions, you are simulating code. Cut it out. Every variable must earn its keep. Extra variables multiply the state space (model checking time) and hide bugs. Ask yourself repeatedly: can I derive this instead of storing it? For example, you do not need to maintain a WholeSet variable if you can define it as a state function of existing variables: WholeSet =...
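The "derive, don't store" tip translates outside TLA+ too. As a rough Python analogy (my own, not from the post), a derived property plays the role of a `WholeSet` defined as a function of existing variables: it can never drift out of sync with the parts it is computed from, and it adds no extra state to keep consistent.

```python
# The "derive, don't store" tip in Python terms (an analogy, not TLA+):
# instead of maintaining a whole_set field alongside its parts and
# updating them in lockstep, define it as a function of existing state.

class Cluster:
    def __init__(self):
        self.active = {"a", "b"}
        self.repairing = {"c"}

    @property
    def whole_set(self):
        # Derived on demand, like defining WholeSet as the union of
        # existing spec variables: no separate variable to go stale.
        return self.active | self.repairing

c = Cluster()
c.active.add("d")
print(sorted(c.whole_set))   # reflects the update with no bookkeeping
```

In a spec the payoff is bigger than in code: every stored variable multiplies the state space the model checker must explore, while a derived definition costs nothing.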

Brainrot

I drive my daughter to school as part of a carpool. Along the way, I am learning a new language, Brainrot. So what is brainrot? It is what you get when you marinate your brain in silly TikTok, YouTube Shorts, and Reddit memes. It is slang for "my attention span is fried and I like it". Brainrot is a self-deprecating language. Teens are basically saying: I know this is dumb, but I am choosing to speak it anyway. What makes brainrot different from old-school slang is its speed and scale. When we were teenagers, slang spread by word of mouth. It mostly stayed local in our school hallways or neighborhood. Now memes go global in hours. A meme is born in Seoul at breakfast and is widespread in Ohio by six seven pm. The language mutates at escape velocity and gets weird fast. Someone even built a brainrot programming language. The joke runs deep, and is getting some infrastructure. Here are a few basic brainrot terms you will hear right away. He is cooked: It means he is finis...

Best of metadata in 2025

It is that time of year again to look back on a year of posts. I average about sixty posts annually. I don't explicitly plan for the number, and I sometimes skip weeks for travel or work, yet I somehow hit the number by December. Looking back, I always feel a bit proud. The posts make past Murat look sharp and sensible, and I will not argue with that. Here are some of the more interesting pieces from the roughly sixty posts of 2025.

Advice
Looks like I wrote several advice posts this year. I must be getting old.
The Invisible Curriculum of Research
Academic chat: On PhD
What I'd do as a College Freshman in 2025
My Time at MIT
What makes entrepreneurs entrepreneurial?
Publish and Perish: Why Ponder Stibbons Left the Ivory Tower

Databases
Concurrency Control book reading was fun. Also the series on use of time in distributed databases. And it seems like I got hyperfocused on transaction isolation this year.
Concurrency Control and Recovery in Database Systems Book reading series...
