Posts

Claude Code experiment: Visualizing Hybrid Logical Clocks

Yesterday morning I downloaded Claude Code and wanted to see what this bad boy can do. What better way to learn how it works than coding up a toy example with it? The first thing that occurred to me was to build a visualizer for Hybrid Logical Clocks (HLC). HLC is a simple idea we proposed in 2014: combine physical time with a logical counter to get timestamps that are close to real time but still safe under clock skew. With HLC, you get the best of both worlds: real-time affinity augmented with causality when you need it. Since then HLC has been adopted by many distributed databases, including MongoDB, CockroachDB, Amazon Aurora DSQL, and YugabyteDB. This felt well scoped for a first project with Claude Code. Choosing JavaScript enabled me to host this on GitHub (GitHub Pages) for free. An easy-peasy way of sharing something small yet useful with people. Claude Code is a clever idea. It is essentially an agent wrapped around the Claude LLM. Chat works well for general Q/A, but it ...
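To make the "physical time plus a logical counter" idea concrete, here is a minimal Python sketch of the HLC update rules as I understand them from the 2014 paper. The class name `HybridLogicalClock`, the method names `now` and `update`, and the injectable `physical_time` callback are my own illustrative choices, not part of any database's API.

```python
from typing import Callable, Tuple

# An HLC timestamp is a pair (l, c): l tracks the largest physical time
# seen so far, c is a logical counter that breaks ties within the same l.
Timestamp = Tuple[int, int]


class HybridLogicalClock:
    """Illustrative HLC; `physical_time` is an injected (hypothetical) clock source."""

    def __init__(self, physical_time: Callable[[], int]) -> None:
        self._physical_time = physical_time
        self.l = 0
        self.c = 0

    def now(self) -> Timestamp:
        """Timestamp a local or send event."""
        prev_l = self.l
        self.l = max(prev_l, self._physical_time())
        # If physical time did not advance past l, bump the counter instead.
        self.c = self.c + 1 if self.l == prev_l else 0
        return (self.l, self.c)

    def update(self, remote: Timestamp) -> Timestamp:
        """Timestamp a receive event for a message carrying `remote`."""
        remote_l, remote_c = remote
        prev_l = self.l
        self.l = max(prev_l, remote_l, self._physical_time())
        if self.l == prev_l and self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif self.l == prev_l:
            self.c += 1
        elif self.l == remote_l:
            self.c = remote_c + 1
        else:
            # Local physical time is ahead of both; the counter resets.
            self.c = 0
        return (self.l, self.c)


if __name__ == "__main__":
    fast = HybridLogicalClock(lambda: 10)  # node whose clock reads 10
    slow = HybridLogicalClock(lambda: 5)   # skewed node whose clock reads 5
    sent = fast.now()
    received = slow.update(sent)
    # Python compares tuples lexicographically, which matches HLC ordering.
    print(sent, received, sent < received)
```

Even though the receiver's physical clock is behind the sender's, the receive timestamp still orders after the send timestamp, which is the causality guarantee the excerpt refers to.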

OSTEP Chapter 10: MultiProcessor Scheduling

This chapter from Operating Systems: Three Easy Pieces explores multiprocessor scheduling as we transition from the simpler world of single-CPU systems to the challenges of modern multicore architectures. This is part of our series going through OSTEP book chapters. The OSTEP textbook is freely available at Remzi's website if you'd like to follow along. Core Challenges in Multiprocessor Scheduling The shift to multiple CPUs introduces several hardware challenges that the operating system must manage: Cache Coherence: Hardware caches improve performance by storing frequently used data. In multiprocessor systems, if one CPU modifies data in its local cache without updating main memory immediately, other CPUs may read "stale" (incorrect) data. Synchronization: Accessing shared data structures across multiple CPUs requires mutual exclusion (e.g., locks). Without these, concurrent operations can lead to data corruption, such as double frees in a linked list. Cache Affinity: ...

Measuring Agents in Production

When you are in the TPOT echo chamber, you would think fully autonomous AI agents are running the world. But this December 2025 paper, "Measuring Agents in Production", cuts through the hype. It surveys 306 practitioners and conducts 20 in-depth case studies across 26 domains to document what is actually running in live environments. The reality is far more basic, constrained, and human-dependent than TPOT suggests. The Most Surprising Findings Simplicity and Bounded Autonomy: 80% of case studies use predefined structured workflows rather than open-ended autonomous planning, and 68% execute fewer than 10 steps before requiring human intervention. Frankly, these systems sound to me less like an "autonomous agent" and more like a glorified state machine or multi-step RAG pipeline. Prompting Beats Fine-Tuning: Despite the academic obsession with reinforcement learning and fine-tuning, 70% of teams building production agents simply prompt off-the-shelf proprietary mode...

Modeling Token Buckets in PlusCal and TLA+

Retry storms are infamous in distributed systems. It is easy to run into them. Inevitably, a downstream service experiences a hiccup, so your clients automatically retry their failed requests. Those retries add more load to the struggling service, causing more failures, which trigger more retries. Before you know it, the tiny unavailability cascades into a full-blown self-inflicted denial of service. Token Bucket is a popular technique that helps gracefully avoid retry storms. Here is how the token bucket algorithm works for retries: Sending is always free. When a client sends a brand-new request, it doesn't need a token. It just sends it. Successes give you credit. Every time a request succeeds, the client deposits a small fraction of a token into its bucket (up to a maximum capacity). Retries cost you credit. If a request fails, the client must spend a whole token (or a large fraction of one) to attempt a retry. If the downstream service is healthy, the bucket stays ful...
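The three rules above (free sends, credit on success, cost on retry) fit in a few lines of Python. This is only a sketch under my own assumptions: the class name `RetryTokenBucket`, the parameter names, the default values, and the choice to start with a full bucket are all illustrative. Tokens are counted in integer milli-tokens so that repeated fractional deposits do not drift due to floating-point rounding.

```python
class RetryTokenBucket:
    """Client-side retry budget, counted in milli-tokens (1000 = one token).

    All names and default values here are hypothetical, chosen for illustration.
    """

    def __init__(
        self,
        capacity_milli: int = 10_000,     # bucket holds at most 10 tokens
        success_credit_milli: int = 100,  # each success deposits 0.1 token
        retry_cost_milli: int = 1_000,    # each retry spends 1 whole token
    ) -> None:
        self.capacity_milli = capacity_milli
        self.success_credit_milli = success_credit_milli
        self.retry_cost_milli = retry_cost_milli
        # Assumption: the bucket starts full, as it would after a healthy period.
        self.tokens_milli = capacity_milli

    def on_success(self) -> None:
        """Deposit a fraction of a token, capped at the bucket capacity."""
        self.tokens_milli = min(
            self.capacity_milli, self.tokens_milli + self.success_credit_milli
        )

    def try_acquire_retry(self) -> bool:
        """Spend one retry's worth of credit; return False if the budget is exhausted."""
        if self.tokens_milli >= self.retry_cost_milli:
            self.tokens_milli -= self.retry_cost_milli
            return True
        return False


if __name__ == "__main__":
    bucket = RetryTokenBucket(capacity_milli=2_000)
    # Brand-new requests never consult the bucket; only retries do.
    print([bucket.try_acquire_retry() for _ in range(3)])
```

With these numbers, a client needs ten successes to earn back one retry, so during an outage the retry traffic is bounded by the bucket capacity plus a small fraction of the successful traffic.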

The Serial Safety Net: Efficient Concurrency Control on Modern Hardware

This paper proposes a way to get serializability without completely destroying your system's performance. I quite like the paper, as it flips the script on how we think about database isolation levels. The Idea In modern hardware setups (where we have massive multi-core processors, huge main memory, and I/O is no longer the main bottleneck), strict concurrency control schemes like Two-Phase Locking (2PL) choke the system due to contention on centralized structures. To keep things fast, most systems default to weaker schemes like Snapshot Isolation (SI) or Read Committed (RC) at the cost of allowing dependency cycles and data anomalies. Specifically, RC leaves your application vulnerable to unrepeatable reads as data shifts mid-flight, while SI famously opens the door to write skew, where two concurrent transactions update different halves of the same logical constraint. Can we have our cake and eat it too? The paper introduces the Serial Safety Net (SSN), a certifier that sits...

TLA+ as a Design Accelerator: Lessons from the Industry

After 15+ years of using TLA+, I now think of it as a design accelerator. One of the purest intellectual pleasures is finding a way to simplify and cut out complexity. TLA+ is a thinking tool that lets you do that. TLA+ forces us out of implementation-shaped and operational reasoning into mathematical declarative reasoning about system behavior. Its global state-transition model and its deliberate fiction of shared memory make complex distributed behavior manageable. Safety and liveness become clear and compact predicates over global state. This makes TLA+ powerful for design discovery. It supports fast exploration of protocol variants and convergence on sound designs before code exists. TLA+ especially shines for complex distributed/concurrent systems. In such systems, complexity exceeds human intuition very quickly. (I often point to very simple interleaving/nondeterministic execution puzzles to show how much we suck at reasoning about such systems.) Testing is inadequate for su...

Building a Database on S3

Hold your horses, though. I'm not unveiling a new S3-native database. This paper is from 2008. Many of its protocols feel clunky today. Yet it nails the core idea that defines modern cloud-native databases: separate storage from compute. The authors propose a shared-disk design over Amazon S3, with stateless clients executing transactions. The paper provides a blueprint for serverless before the term existed. SQS as WAL and S3 as Pagestore The 2008 S3 was painfully slow, and 100 ms reads weren't unusual. To hide that latency, the database separates "commit" from "apply". Clients write small, idempotent redo logs to Amazon Simple Queue Service (SQS) instead of touching S3 directly. An asynchronous checkpoint by a client applies those logs to B-tree pages on S3 later. This design shows strong parallels to modern disaggregated architectures. SQS becomes the write-ahead log (WAL) and logstore. S3 becomes the pagestore. Modern Aurora follows a similar logic: t...
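The split between "commit" and "apply" can be illustrated with a toy in-memory model in Python. This is not the paper's protocol: the `deque` merely stands in for SQS, the dict stands in for S3 pages, and the `record_id` set is my own simplification of how redo records could be made idempotent. Every name below is hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Set


@dataclass(frozen=True)
class RedoRecord:
    record_id: int  # hypothetical unique id, used here to make applying idempotent
    page: str       # name of the page the record targets
    key: str
    value: str


class TinyLogStructuredStore:
    """Toy model: a queue plays the log (SQS), a dict plays the pagestore (S3)."""

    def __init__(self) -> None:
        self._log: Deque[RedoRecord] = deque()
        self._pages: Dict[str, Dict[str, str]] = {}
        self._applied: Set[int] = set()

    def commit(self, record: RedoRecord) -> None:
        """Commit = append a small redo record to the log; pages are untouched."""
        self._log.append(record)

    def checkpoint(self) -> int:
        """Apply = drain the log into pages; returns how many records took effect."""
        applied_now = 0
        while self._log:
            record = self._log.popleft()
            if record.record_id in self._applied:
                continue  # duplicate delivery: applying again must be a no-op
            self._pages.setdefault(record.page, {})[record.key] = record.value
            self._applied.add(record.record_id)
            applied_now += 1
        return applied_now

    def read(self, page: str) -> Dict[str, str]:
        """Read the checkpointed state of a page (a copy, empty if absent)."""
        return dict(self._pages.get(page, {}))


if __name__ == "__main__":
    store = TinyLogStructuredStore()
    store.commit(RedoRecord(1, "page-a", "x", "1"))
    store.commit(RedoRecord(1, "page-a", "x", "1"))  # simulated redelivery
    print(store.read("page-a"), store.checkpoint(), store.read("page-a"))
```

Idempotence matters in this setup because a queue like SQS may deliver the same message more than once, so the checkpoint step has to tolerate replays.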

800th blog post: Write that Blog!

I gave an email interview to the "Write That Blog!" newsletter. It came out today, which coincided with my 800th blog post. I am also including my answers here. Why did you start blogging – and why do you continue? In 2010, when I was a professor, one of my colleagues in the department was teaching a cloud computing seminar. I wanted to enter that field, coming from the theory of distributed systems and, later, wireless sensor networks. So I attended the seminar. As I read the papers, I started blogging about them. That is how I learn and retain concepts better, by writing about them. Writing things down helps crystallize ideas for me. It lets me understand papers more deeply and build on that understanding. The post on MapReduce, the first paper discussed in the seminar, seems to have opened the floodgates of my blogging streak, which has been going strong for 15 years. I think a big influence on me has been the EWD documents. I remember the day I came acros...

Popular posts from this blog

Hints for Distributed Systems Design

The F word

The Agentic Self: Parallels Between AI and Self-Improvement

Learning about distributed systems: where to start?

Foundational distributed systems papers

Cloudspecs: Cloud Hardware Evolution Through the Looking Glass

Advice to the young

Agentic AI and The Mythical Agent-Month

Are We Becoming Architects or Butlers to LLMs?

Welcome to Town Al-Gasr