Agentic AI and The Mythical Agent-Month
The premise of this position paper is appealing. We know Brooks' Law: adding manpower to a late software project makes it later. That is, human engineering capacity grows sub-linearly with headcount because of communication overhead and ramp-up time. The authors propose that AI agents offer a loophole they call "Scalable Agency". Unlike humans, agents do not need days or weeks to ramp up; they load context instantly. So, in theory, you can spin up 1,000 agents to explore thousands of design hypotheses in parallel, compressing the Time to Integrate (TTI), the duration required to implement and integrate new features or technologies into infrastructure systems, from months to days.
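To make that sub-linear scaling concrete, here is a toy model of Brooks' Law (my own sketch with invented constants, not something from the paper): each added engineer contributes raw output but also pays a ramp-up tax and a share of the pairwise communication overhead.

```python
# Toy model of Brooks' Law (illustrative only; all constants are invented).
def effective_capacity(n, unit_output=1.0, ramp_up=0.3, comm_cost=0.02):
    """Net output of a team of n engineers, in arbitrary 'feature' units."""
    raw_output = n * unit_output                    # what n people could do in isolation
    ramp_up_tax = n * ramp_up                       # onboarding / context loading
    coordination_tax = comm_cost * n * (n - 1) / 2  # one cost per pairwise channel
    return raw_output - ramp_up_tax - coordination_tax

for n in (1, 5, 10, 25, 50):
    print(n, round(effective_capacity(n), 1))
# Prints 0.7, 3.3, 6.1, 11.5, 10.5: output grows sub-linearly,
# peaks around n = 35, and then declines.
```

In this framing, "Scalable Agency" is the claim that ramp_up drops to (near) zero; whether the quadratic coordination_tax term also shrinks is a separate question.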
The paper calls this vision Self-Defining Systems (SDS) and suggests that, thanks to Agentic AI, future infrastructure will design, implement, and evolve itself. I began reading with great excitement, but by the final sections my excitement had soured into skepticism. The bold claims of the introduction, especially those about TTI, simply evaporated, never to be substantiated or revisited. Worse, the paper leaves even its most basic concepts (e.g., "specification") vague.
Still, the brief mention of Brooks' Law stayed with me. (The introduction glides past it far too casually.) Thinking it through, I concluded that we are not escaping the Mythical Man-Month anytime soon, not even with agents. The claim that "Scalable Agency" bypasses Brooks' Law is not supported by the evidence. Coordination complexity ($N^2$) is a mathematical law, not the sociological suggestion some people take Brooks' Law to be. Until we solve the coordination and verification bottlenecks, adding more agents to a system design problem will likely just be a faster and more expensive way to generate merge conflicts.
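Plugging the paper's own 1,000-agent figure into the pairwise-channel count from the toy model above spells out that $N^2$ claim:

$$ C(N) \;=\; \binom{N}{2} \;=\; \frac{N(N-1)}{2} \;\in\; \Theta(N^2), \qquad C(1000) \;=\; 499{,}500. $$

Instant context loading removes the ramp-up term; it does nothing about roughly half a million potential coordination channels.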
The "Scalable Agency" vs. The Reality
The core premise of "Scalable Agency" seems to rely on a flawed assumption: that software engineering is an embarrassingly parallel task. My intuition, and the paper's own results, suggest the opposite.
The case study in the paper tasked agents with building an LLM inference runtime. The agent-built monolithic-llm-runtime achieved 1.2k tokens/s, while a simple human-written baseline, nano-vLLM, achieved 1.76k tokens/s. The paper says the agents successfully "rediscovered" standard techniques like CUDA graphs (I suspect these were already in the LLMs' training data), but they hit a hard ceiling: they could not propose "qualitatively new designs". They scaled volume, not insight. This reminds me of the famous video of three professional soccer players against one hundred kids: you can add more bodies, but without shared intuition and expert judgment, numbers alone do not win real games.
The results get worse when we look at integration. When tasked with building allmos_v2 on top of an existing codebase (allmos), the agents took 35 days to produce a working system. The paper blames "deployment failures," "GLIBC mismatches," and "driver issues," but I suspect the real difficulty came from the architecture itself. allmos_v2 was not a monolith; it was an integration layered on top of distributed, pre-existing components. The agents were likely thrashing against the exponential complexity of a distributed dependency graph.
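A back-of-the-envelope way to see where that exponential blow-up comes from (my own rough estimate, not an analysis from the paper): if a deployment depends on $k$ independently versioned components (GLIBC, GPU driver, CUDA runtime, Python packages, and so on) and each has $v$ plausible versions or configurations, the number of integration states the agents may have to reason about is

$$ |S| \;=\; v^{k}, $$

so even modest values like $v = 4$ and $k = 10$ already give about a million combinations, most of which only fail at deploy time.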
(As an aside: The paper claims the second system, monolithic-llm-runtime, took only 2 days because the agents reused a "key solutions playbook" generated during the allmos_v2 attempt. While the playbook surely helped, I suspect the monolithic architecture was the real hero here.)
This mirrors the classic Common Knowledge Problem in distributed systems: for a protocol to be robust, every node must know not just the fact itself (mutual knowledge), but also that every other node knows it, and so on (common knowledge). Agentic systems face a similar epistemic gap. "Context loading" is not equivalent to "common knowledge". Agents may ingest 100,000 lines of code instantly, but reading tokens is not the same as understanding the causal chain of changes across the system. As architectures grow (especially non-monolithic ones), the compute required to simulate the downstream effects of a one-line change explodes exponentially. An agent may "see" the entire codebase, but without common knowledge of how, for example, a batcher change propagates through kernel fusion, regressions appear. Multi-agent systems do not escape Brooks' Law; they simply hit the wall of state-space explosion faster than humans do.
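Epistemic logic makes that distinction precise (standard textbook notation, not anything introduced by the paper). Writing $K_i\,p$ for "agent $i$ knows $p$":

$$ E\,p \;=\; \bigwedge_{i} K_i\,p \quad \text{(everyone knows } p\text{)}, \qquad C\,p \;=\; E\,p \,\wedge\, E(E\,p) \,\wedge\, E(E(E\,p)) \,\wedge\, \cdots \quad \text{(common knowledge)}. $$

Loading the same context into every agent buys you at most $E\,p$; the unbounded tower of "everyone knows that everyone knows..." is what tight coordination actually leans on, and parallel context loading does not supply it.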
The Shell Game
In the end, Self-Defining Systems remains a blue-sky vision paper that fails to improve on the concrete results of its predecessor, the ADRS paper ("Barbarians at the Gate"). ADRS admitted to being a tool for optimizing existing codebases: tuning parameters and heuristics within a defined sandbox. SDS promises to go further, claiming the ability to design systems from scratch, but in practice it largely replays the ADRS workflow with new terminology. SDS proposes a progression from Phase 1: Self-Configuring to Phase 5: Self-Managing, yet admits in the fine print that "Our current methodology retains goal setting, architecture decomposition, and evaluation design as human responsibilities..." In other words, we have not actually moved the needle on design at all. We are still doing hyper-parameter tuning with AI as the intern.
This reminds me of the Shell Game podcast, where Evan Ratliff tried to build a startup using only AI agents. (You know what a shell game is, right?) Evan sets out to build a real company, HurumoAI, staffed almost entirely by AI agents, to test whether a one-man unicorn is possible when your employees are AI. The experiment sounds futuristic until you learn how it unravels. Spoiler alert: even with the help of a Stanford technical co-founder, Evan eventually bails out and pivots to a joke/novelty business model: AI agents that procrastinate and browse brainrot so you don't have to.