Writing Code vs. Shipping Code: Productivity Effects Across Generations of AI Coding Tools
The transformative power of LLMs in coding has been irrefutable, and it feels like we are living through a magical computing renaissance. On the socials, we hear impressive numbers of lines of code generated, features delivered, and bugs fixed. But the macroeconomic indicators still seem to be lagging. Heck, if you talk with an engineering manager, you find that their product shipping dates haven't miraculously compressed by a factor of five, either.
This paper just landed 10 days ago. It is from MIT and Wharton by Mert Demirer, Leon Musolff, and Liyuan Yang. Their study attempts to provide a structured economic model for evaluating actual productivity obtained from AI coding tools. By pairing confidential Microsoft telemetry with the public footprints of over 100,000 GitHub developers (tracking everything from open-source utilities to web app repositories), the authors show significant systemic friction downstream of AI code generation.
Of course, I do my usual skeptical critique of the paper. In this case, the skepticism is especially heightened because these are economists peeking into the messy non-linear world of software engineering and trying to impose a "production hierarchy" abstraction onto it. But if we reconsider their analysis from a different perspective, it becomes possible to translate their complex production functions into Amdahl's Law terms, and then we can start doing our own evaluations and draw our own conclusions, as I discuss below.
The Monotonic Decay
The core of the paper rests on this monotonic decay argument. The sheer task-level velocity gains we see from AI coding tools start to bleed out as they move up the production hierarchy. The authors break down AI tool adoption into three distinct generational tiers:
- Autocomplete (intelligent text prediction),
- Synchronous (Sync) Agents (interactive, real-time code modifiers like Claude Code or Cursor),
- Asynchronous (Async) Agents (autonomous async agents).
When we look at the task-level velocity of these tools, we see impressive numbers. The paper's abstract claims that autocomplete, interactive sync agents, and autonomous async agents increase overall commit activity by cumulative totals of 40%, 140%, and 180% respectively. But as that work climbs toward an official production milestone, the improvements decay significantly.
For Autocomplete, the +228.2% explosion in raw lines of code bleeds out layer by layer until it becomes a meager +10.2% increase in actual shipped software releases. For Sync agents, a gigantic +741.3% surge in lines of code shrinks to a modest +20.3% increase in final weekly releases.
My immediate reaction to this vertical hierarchy (Lines -> Files -> Commits -> PRs -> Repositories -> Releases) is skepticism. Treating code production like a neat production line feels superficial. Software engineering is not a linear conveyor belt: coding is highly nonlinear, and a single commit routinely alters fifty files. However, giving the authors the benefit of the doubt and reading onwards, I find that there is still value to this naive abstraction, as it points to human gatekeeping and coordination overhead at higher levels of the CI/CD pipeline. AI can write lines of code instantly, without being bogged down by code syntax at the lower layers. But as that work climbs toward an official production milestone, the structural constraints of the system and human bottlenecks take over, and the massive improvements at the task level decay down to nearly nothing.
Let's dive deeper on the mathematical modeling behind this. By taking only the performance of Autocomplete into account (because it operates exclusively at the code-writing level), the authors chose parameters that minimized the differences between their model's predictions and actual developer behavior. Through this exercise, they extracted a local Upstream Output Elasticity ($\theta$) of approximately 0.75. In this layered production model, $\theta = 0.75$ acts as a vertical pass-through metric. It means that at any given stage (say, turning raw commits into clean pull requests), 75% of that layer's success leans entirely on the upstream technical assets flowing into it from the layer below, while a remaining fraction represents the local human effort added at that layer. Because they model software development as a vertically sequential aggregation process, that human intervention operates like a compounding efficiency tax. A massive initial code productivity surge at the bottom layer gets relentlessly multiplied by 0.75 over and over again as it attempts to climb the hierarchy, mathematically forcing the steep vertical attenuation we see in the empirical data.
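To get a feel for that compounding, here is a minimal sketch, not the paper's estimation procedure. It assumes (my simplification) that the same $\theta$ applies at every transition, that human effort at each layer stays fixed, and that "multiplied by 0.75" is read on a log scale, i.e. each layer passes on 75% of the log gain it receives. The function name `pass_through` and the `STAGES` list are mine, and the printed numbers are illustrative rather than a reproduction of the paper's estimates.

```python
import math

# Stage names follow the hierarchy described in the post.
STAGES = ["Lines", "Files", "Commits", "PRs", "Repositories", "Releases"]


def pass_through(base_gain_pct: float, theta: float, layers: int) -> float:
    """Percent gain left after `layers` transitions.

    Assumption: each transition passes on a fraction `theta` of the
    log gain arriving from the layer below (constant elasticity).
    """
    log_gain = math.log1p(base_gain_pct / 100.0)
    return math.expm1(log_gain * theta ** layers) * 100.0


# Start from the +228.2% lines-of-code gain reported for Autocomplete.
for depth, stage in enumerate(STAGES):
    print(f"{stage:<13} {pass_through(228.2, 0.75, depth):7.1f}%")
```

Under these assumptions the gain shrinks at every step, which is the qualitative shape of the decay. The exact release-level figure depends on details of the paper's model that this toy version leaves out.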
Now, the sync and async agents aren't just typing lines inside an editor. They operate at a level where they directly manage files, commits, and pull requests. This expanded layer coverage allows agentic workflows to drop their productivity contributions closer to the finish line. By short-circuiting the early stages of the vertical decay chain, agents handle the work more efficiently, doubling the final impact on shipped software compared to autocomplete as seen in Figure 1.
Translating the Economics formulas to Amdahl’s Law
To model what happens inside each of these individual layers, the paper transitions to a nested Constant Elasticity of Substitution (CES) production function. Here, they extract an Elasticity of Substitution ($\sigma$) between AI-generated code and human review effort of a rigid 0.25. In economic lingo, an elasticity of substitution well below $1.0$ means the inputs are "strong complements". That means they are tied together like a car chassis and tires. It doesn't matter if an automated factory line can manufacture tires 10,000% faster; if you don't speed up the production of the chassis, you don't get more cars.
This of course looks a whole lot like Amdahl’s Law, which dictates that the overall speedup of a system is strictly limited by its sequential un-parallelizable bottlenecks:
$S_{\text{total}} = \frac{1}{(1-P) + \frac{P}{S_{\text{task}}}}$
where $S_{\text{task}}$ is the speedup achieved at the automated task level, and $P$ is the "Global Parallelizable Fraction" of the entire system workload.
When the elasticity of substitution ($\sigma$) between machine output and human validation drops to 0.25, the economic model behaves almost exactly like a Leontief production function. (A Leontief production function describes a strict, zero-flexibility production process where inputs must be combined in exact, unalterable proportions, meaning an excess of one ingredient cannot substitute for a shortage of another.) This dictates that human code review is a non-negotiable, strictly sequential bottleneck ($1 - P$). If $\sigma$ were infinite, you could completely substitute human verification with raw AI text volume, effectively parallelizing the entire layer. But because $\sigma = 0.25$, throwing an infinite mountain of automated code ($S_{\text{task}} \to \infty$) at the problem does nothing to diminish the fixed, sequential time investment required for a human to review it.
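A small numerical sketch makes the complementarity concrete. It uses the textbook two-input CES form $Y = (a X_1^{\rho} + (1-a) X_2^{\rho})^{1/\rho}$ with $\rho = (\sigma - 1)/\sigma$. The equal share weight $a = 0.5$, the unit baseline inputs, and the function name `ces_output` are my assumptions for illustration; the paper's nested specification and estimated weights will differ.

```python
def ces_output(ai_code: float, human_review: float, sigma: float,
               share: float = 0.5) -> float:
    """Two-input CES aggregate (valid for sigma > 0 and sigma != 1).

    `share` is a hypothetical weight on the AI-code input.
    """
    rho = (sigma - 1.0) / sigma
    return (share * ai_code ** rho
            + (1.0 - share) * human_review ** rho) ** (1.0 / rho)


# Hold human review fixed at 1 and scale up the AI-generated input.
for sigma in (0.25, 4.0):
    for boost in (1.0, 10.0, 1000.0):
        y = ces_output(boost, 1.0, sigma)
        print(f"sigma={sigma:<4} AI input x{boost:<6.0f} -> output x{y:.3f}")
```

With $\sigma = 0.25$ and these weights, output saturates near $2^{1/3} \approx 1.26\times$ no matter how much AI-generated input is added, because the fixed review input binds. With $\sigma = 4$, output keeps growing with the AI input.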
If we run the paper's real-world empirical findings through this equation (isolating the global parallel fraction $P$ for commits against final releases), we find the following:
- Autocomplete: $S_{\text{task}} = 1.359\times$, $S_{\text{total}} = 1.102\times \implies \mathbf{P \approx 35.0\%}$
- Sync Agents: $S_{\text{task}} = 2.091\times$, $S_{\text{total}} = 1.202\times \implies \mathbf{P \approx 32.2\%}$
- Async Agents: $S_{\text{task}} = 2.800\times$, $S_{\text{total}} = 1.300\times \implies \mathbf{P \approx 35.9\%}$
Huh! Even going from a simple inline autocomplete tool to fully autonomous agents that clone repositories and run test suites out-of-band, the global parallelizable fraction ($P$) refuses to budge and hovers around 35% across all three generations of AI!
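These fractions come from inverting Amdahl's Law: rearranging the formula gives $P = \frac{1 - 1/S_{\text{total}}}{1 - 1/S_{\text{task}}}$. The sketch below redoes that arithmetic with the speedups quoted in the list; the function and dictionary names are mine.

```python
def parallel_fraction(s_task: float, s_total: float) -> float:
    """Solve Amdahl's Law for P given task-level and overall speedups."""
    return (1.0 - 1.0 / s_total) / (1.0 - 1.0 / s_task)


# (S_task, S_total) pairs as quoted in the post.
OBSERVED = {
    "Autocomplete": (1.359, 1.102),
    "Sync agents": (2.091, 1.202),
    "Async agents": (2.800, 1.300),
}

for name, (s_task, s_total) in OBSERVED.items():
    print(f"{name:<13} P = {parallel_fraction(s_task, s_total):.1%}")
```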
Then how come Sync and Async agents manage to squeeze out 2-3x the release gains of Autocomplete? While Autocomplete notches a modest 10.2% release expansion, Sync agents push it to 20.3%, and the cumulative stack of Async tools lifts the final output baseline up by 30%. If the global parallelizable envelope ($P$) is locked at 35% from the formulas above, how is the system actually accelerating? This is because, according to Amdahl's Law, a system has two entirely separate levers for optimization. You can either increase $P$, or you can push harder and increase $S_{\text{task}}$. Autocomplete achieved a commit-level speedup ($S_{\text{task}}$) of just $1.359\times$ relative to releases. Sync agents drive that localized task speedup to $2.091\times$, and Async agents push $S_{\text{task}}$ to $2.800\times$. But this hits a wall of diminishing returns quickly. When the parallelizable footprint ($P$) is pinned to 35%, scaling the localized task speedup ($S_{\text{task}}$) toward infinity yields a hard asymptotic ceiling. Mathematically, the absolute maximum total speedup this configuration can ever achieve is $1 / (1 - 0.35) \approx 1.54$, which works out to a hard cap of roughly a 54% overall increase ($1.54\times$) in shipped software. So, no matter how fast an autonomous bot can process a commit, the remaining 65% human sequential bottleneck ($1 - P$) acts as a hard stop.
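The diminishing returns are easy to see numerically. This sketch holds $P$ at 0.35 and sweeps the task-level speedup. The sweep values beyond $2.8\times$ are hypothetical, chosen only to show the approach to the ceiling of $1/(1 - 0.35) \approx 1.538$.

```python
import math


def amdahl_speedup(p: float, s_task: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by s_task."""
    return 1.0 / ((1.0 - p) + p / s_task)


P = 0.35  # parallelizable fraction taken from the post

# 2.8x is the async-agent figure; the larger values are hypothetical.
for s_task in (2.8, 10.0, 100.0, math.inf):
    print(f"S_task = {s_task:>6} -> S_total = {amdahl_speedup(P, s_task):.3f}")
```

Going from a $2.8\times$ to a $100\times$ task speedup moves the overall figure only from about $1.29\times$ to about $1.53\times$, and the infinite case adds almost nothing on top of that.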
Is P=0.35 sensible? I think that 35% parallel fraction passes the smell test. If you ask any developer what percentage of their week is spent actually writing code, they'll give you a number right in this ballpark, around 20% to 40%. The remaining 65% of the developers' time is consumed by finding/defining the task, planning, and paying the inevitable human communication tax of meetings and team alignment.
This is why that flatlining 35% profile across all three tool generations makes sense to me. Generative AI can supercharge the coding sandbox by churning out code at high speed (maxing out $S_{\text{task}}$), but it can't parallelize the systemic reasoning, organizational consensus, and deep problem/model comprehension that dominates the rest of the job. Until we find a way to automate those parts as well, shipping software will remain an inherently human-throttled process.