Barbarians at the Gate: How AI is Upending Systems Research

This recent paper from the Berkeley Sky Computing Lab has been making waves in the systems community. Of course, Aleksey and I did our live blind read of it, which you can watch below. My annotated copy of the paper is also available here.

This is a fascinating and timely paper. It raises deep questions about how LLMs will shape the research process, and what that could look like. Below, I start with a short technical review, then move to the broader discussion topics.


Technical review

The paper introduces the AI-Driven Research for Systems (ADRS) framework. Building on the OpenEvolve framework, ADRS integrates LLMs directly into the systems research workflow to automate much of the solution-tweaking and evaluation process. As shown in Figure 3, ADRS operates as a closed feedback loop in which the LLM ensemble iteratively proposes, tests, and refines solutions to a given systems problem. This automation targets the two most labor-intensive stages of the research cycle, solution tweaking and evaluation, and leaves the creative stages (problem formulation, interpreting results, and coming up with insights) untouched.

Within the inner loop, four key components work together. The Prompt Generator creates context-rich prompts that seed the LLM ensemble (the Solution Generator), which outputs candidate designs or algorithms. These are then assessed by the Evaluator, a human-written simulator or benchmark that provides quantitative feedback. The Solution Selector identifies the most promising variants, which are stored along with their scores in the Storage module to inform subsequent iterations. This automated loop runs rapidly and at scale, and enables exploration of large design spaces within hours rather than weeks! The authors applied ADRS to several systems problems, including cloud job scheduling, load balancing, transaction scheduling, and LLM inference optimization. In each case, the AI improved on prior human-designed algorithms, often within a few hours of automated search. Reported gains include up to 5x faster performance or 30–50% cost reductions over published baselines, achieved at a fraction of the time and cost of a traditional research cycle.
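To make the loop concrete, here is a minimal sketch of what an ADRS-style iteration might look like. This is my own illustrative reconstruction from the Figure 3 description, not the paper's or OpenEvolve's actual API; all function and parameter names here are assumptions.

```python
def generate_prompt(problem, best_solutions):
    # Prompt Generator: seed the LLM ensemble with the problem statement
    # and the highest-scoring solutions found so far.
    prior = "\n".join(code for code, _ in best_solutions)
    return f"Problem: {problem}\nBest solutions so far:\n{prior}"

def adrs_loop(problem, ensemble, simulator, iterations=50, keep_top=5):
    # One possible shape of the ADRS inner loop (cf. Figure 3).
    storage = []  # Storage: (candidate, score) pairs kept across iterations
    for _ in range(iterations):
        prompt = generate_prompt(problem, storage)         # Prompt Generator
        candidates = [llm(prompt) for llm in ensemble]     # Solution Generator (LLM ensemble)
        scored = [(c, simulator(c)) for c in candidates]   # Evaluator: human-written simulator/benchmark
        # Solution Selector: keep only the most promising variants for the next round
        storage = sorted(storage + scored, key=lambda cs: cs[1], reverse=True)[:keep_top]
    return storage[0]  # best (candidate, score) found
```

The point to notice is that the human sits entirely outside this function: they supply the problem and the simulator, and they interpret whatever comes back.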

The creative and difficult work happens outside the ADRS optimization loop. The scientist identifies the research problem, directs the search, and decides which hills are worth climbing. Machines handle the iterative grunt work of tweaking and testing solutions, while humans deal with abstraction, framing, and insight.

The framework's effectiveness also has several important limitations. The paper's examples mostly involve problems where correctness is trivial and there are no concurrency, security, or fault-tolerance concerns. Those domains require reasoning beyond performance tuning. Another limitation is that the LLMs focus on and update a single component at a time, and cannot yet handle system-wide interactions.

Simulator-based evaluation is what makes this approach feasible, but the systems field undervalues simulation work, which leaves us with limited infrastructure for automated testing. Evaluators also pose risks: poorly designed ones invite reward hacking, where LLMs exploit loopholes in the scoring rather than learning real improvements. If AI-driven research is to scale, we need richer evaluators, stronger specifications, and broader respect for simulation as a first-class research tool.
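To make the reward-hacking risk concrete, here is a toy sketch (with hypothetical helper names like run_simulation) contrasting a naive evaluator with a more defensive one for a scheduling policy. The naive version scores only the target metric, so a scheduler that quietly drops jobs looks great; the defensive one gates on validity first.

```python
def naive_evaluator(scheduler, workload):
    # Scores only mean latency. A scheduler that silently drops most jobs
    # gets a wonderful score; this is the loophole an LLM can exploit.
    results = run_simulation(scheduler, workload)   # hypothetical simulator call
    return -results.mean_latency

def defensive_evaluator(scheduler, workload):
    # Check validity and correctness first, then score performance.
    results = run_simulation(scheduler, workload)
    if results.completed_jobs < workload.num_jobs:  # no dropped work allowed
        return float("-inf")
    if not results.outputs_correct:                 # correctness gate
        return float("-inf")
    return -results.mean_latency                    # only then optimize latency
```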


Discussion topics

Here I wax philosophical on many interesting questions this work raises.


LLMs provide breadth, but research demands depth

LLMs excel at high-throughput mediocrity. By design, they replicate what has already been done, and optimize across the surface of knowledge. Research, however, advances through novelty, depth, and high-value insight.

"Research is to see what everybody else has seen, and to think what nobody else has thought."

-- Albert Szent-Györgyi (Nobel laureate)

In this sense, LLMs are not as dangerous as the "Barbarians" at the gates. They are more like "Barbies" at the gates: gloss, confidence, and some hollowness. They may dazzle with presentation, but they lack the inner substance, insight, and value that mastery, curiosity, and struggle bring.


LLMs address only the tip of the iceberg

LLMs operate on the visible tip of the research iceberg I described earlier. They cannot handle the deep layers that matter: Curiosity, Clarity, Craft, Community, Courage.

Worse, they may even erode those qualities. The danger in the short term is not invasion but imitation: the replacement of thought with performance, and of depth with polish. We risk mistaking synthetic polish for genuine understanding.

In the long term though, I am not worried. In the long term, we are all dead.

I'm kidding, ok. In the long term, we may be screwed as well. The 2006 movie "Idiocracy" rings more true every day. Given the inherent laziness of our nature, I worry that we will lean more and more on AI to navigate the literature, frame questions, and spin hypotheses, and that we will not get enough chances to exercise our curiosity or improve our clarity of understanding.


LLMs are bad researchers, but can they still make good collaborators?

In our academic chat follow-up to the iceberg post, I wrote about what makes a bad researcher:

Bad research habits are easy to spot: over-competition, turf-guarding, incremental work, rigidity, and a lack of intellectual flexibility. Bad science follows bad incentives such as benchmarks over ideas, and performance over understanding. These days the pressure to run endless evaluations has distorted the research and publishing process. Too many papers now stage elaborate experiments to impress reviewers instead of illuminating them with insights. Historically, the best work always stood on its own, by its simplicity and clarity. 

LLMs are bad researchers. The shoe fits. 

But can they still be good collaborators? Is it still worth working with them? The hierarchy is simple:

Good collaborators > No collaborators > Bad collaborators

Used wisely, LLMs can climb high enough to reach the lowest range of the good-collaborator category. If you give them bite-sized, well-defined work, they can reduce friction, preserve your momentum, and speed up parts of your work significantly. In a sense, they can make you technically fearless. I believe that when used for rapid prototyping, LLMs can help improve the design. And through faster iteration, you may uncover some high-value insights.

But speed cuts both ways, because premature optimization is the root of all evil. If evaluations and optimizations become very cheap and effortless, we will jump to that step more readily, with nothing forcing us to think harder. Human brains are lazy by design. They don't want to think hard; they will take the quick, superficial route out, and we never get to go deep.

So, we need to tread carefully here as well.


Can we scale human oversight?

The worst time I ever had as an advisor was when I had to manage 6-7 (six-seveeeen!) PhD students at once. I would much rather work with 2 sharp, creative students I support myself than 50 mediocre ones handed to me for free. The former way of working is more productive, and it results in deep work and valuable research. Focus is the key, and it does not scale.

The same holds for LLM-augmented research. Validation (via human focus) remains the bottleneck. LLMs can generate endless results, but without a human to distill those results into insight or wisdom, they remain AI slop in abundance.


Can clear insights be distilled without dust, tears, and sweat?

One may argue that with machines handling the grunt work, researchers would finally get more time for thinking. But our brains are --what?-- yes, lazy. Left idle, they will scroll Reddit/Twitter rather than solve concurrency bugs.

I suspect we need some friction/irritation to nudge us to think in the background. And I suspect this is what happens when we are doing the boring work in the trenches. While writing a similar code snippet for the fifth time in our codebase, an optimization opportunity or an abstraction occurs to us. Very hard problems are impossible to tackle head on. By doing the legwork, I suspect, we approach the problem sideways and get a chance to make some headway.

Yes, doing evaluation work sucks. But it is often necessary to generate the friction and space that get you to think about the performance, and more importantly the logic/point, of your system. Through that suffering, you gradually get transformed and enlightened. Working in the trenches, you may even realize that your entire setup is flawed and your measurements are garbage because you used closed-loop clients instead of open-loop ones.
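For readers who haven't hit this particular trap: a closed-loop client waits for each response before sending the next request, so when the system slows down the offered load silently drops and the queueing delay stays hidden; an open-loop client issues requests on a fixed schedule regardless of responses. A rough asyncio sketch of the difference, assuming a hypothetical send_request coroutine:

```python
import asyncio, time

async def closed_loop_client(send_request, duration):
    # Waits for each response before sending the next request. When the
    # server slows down, the client slows down too, so the measured
    # latencies can look deceptively good.
    latencies, deadline = [], time.monotonic() + duration
    while time.monotonic() < deadline:
        start = time.monotonic()
        await send_request()
        latencies.append(time.monotonic() - start)
    return latencies

async def open_loop_client(send_request, rate, duration):
    # Issues requests at a fixed rate regardless of responses, so the
    # measurement reflects the queueing a real arrival process would induce.
    latencies, tasks = [], []
    deadline = time.monotonic() + duration

    async def timed():
        start = time.monotonic()
        await send_request()
        latencies.append(time.monotonic() - start)

    while time.monotonic() < deadline:
        tasks.append(asyncio.create_task(timed()))
        await asyncio.sleep(1.0 / rate)
    await asyncio.gather(*tasks)
    return latencies
```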

What happens when we stop getting our hands dirty? We risk distilling nothing at all. Insights don't bubble up while we are sitting in comfort and scrolling cat videos. In an earlier post, Looming Liability Machines (LLMs), I argued that offloading root-cause analysis to AI misses the point. RCA isn't about assigning blame to a component. It is an opportunity to think about the system holistically, to understand it better, and to improve it. Outsourcing this to LLMs strikes me as a very stupid thing to do. We need to keep exercising those muscles; otherwise they will atrophy, along with our understanding of the system.


What will happen to the publication process?

In his insightful blog post on this paper, Brooker concludes:

Which leads systems to a tough spot. More bottlenecked than ever on the most difficult things to do. In some sense, this is a great problem to have, because it opens the doors for higher quality with less effort. But it also opens the doors for higher volumes of meaningless hill climbing and less insight (much of which we’re already seeing play out in more directly AI-related research). Conference organizers, program committees, funding bodies, and lab leaders will all be part of setting the tone for the next decade. If that goes well, we could be in for the best decade of systems research ever. If it goes badly, we could be in for 100x more papers and 10x less insight.

Given my firm belief in human laziness, I would bet on the latter. I have been predicting the collapse of the publishing system for a decade, and the flood of LLM-aided research may finally finish the job. That might not be a bad outcome either. We are due for a better model/process anyway.
