Looming Liability Machines (LLMs)

As part of our Zoom reading group (wow, 4.5 years old now), we discussed a paper that uses LLMs for automatic root cause analysis (RCA) for cloud incidents.

This was a pretty straightforward application of LLMs. The proposed system employs an LLM to match incoming incidents to incident handlers based on their alert types, predicts the incident's root cause category, and provides an explanatory narrative. The only customization is through prompt engineering. Since this is a custom domain, I think a more principled, custom-designed machine learning system would be more appropriate than adopting LLMs.
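
To make the shape of such a system concrete, here is a minimal Python sketch of the three steps described above: route by alert type, ask an LLM for a category, and return its narrative. This is not the paper's code. The handler names, the category list, the prompt wording, and the `fake_llm` stand-in are all hypothetical, and a real deployment would call an actual model instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Incident:
    incident_id: str
    alert_type: str
    description: str


# Hypothetical handlers: each one gathers diagnostics for a single alert type.
def disk_handler(incident: Incident) -> str:
    return f"disk usage report for {incident.incident_id}"


def latency_handler(incident: Incident) -> str:
    return f"latency traces for {incident.incident_id}"


HANDLERS: Dict[str, Callable[[Incident], str]] = {
    "disk_full": disk_handler,
    "high_latency": latency_handler,
}

# Hypothetical root cause categories (the paper's own taxonomy may differ).
CATEGORIES = {"capacity", "code_bug", "config_change", "dependency"}


def build_prompt(incident: Incident, diagnostics: str) -> str:
    return (
        "Pick one root cause category from "
        f"{sorted(CATEGORIES)} on the first line, then explain.\n"
        f"Incident: {incident.description}\n"
        f"Diagnostics: {diagnostics}\n"
    )


def analyze(incident: Incident, llm: Callable[[str], str]) -> dict:
    # Step 1: match the incident to a handler based on its alert type.
    handler = HANDLERS.get(incident.alert_type)
    if handler is None:
        return {"category": "unknown", "explanation": "no handler for this alert type"}
    diagnostics = handler(incident)
    # Step 2: ask the LLM for a category; step 3: keep its narrative.
    reply = llm(build_prompt(incident, diagnostics))
    category, _, explanation = reply.partition("\n")
    category = category.strip()
    if category not in CATEGORIES:
        # Do not trust a label that is outside the known set.
        category = "unknown"
    return {"category": category, "explanation": explanation.strip()}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return "capacity\nThe disk filled up because log rotation was not keeping up."


if __name__ == "__main__":
    incident = Incident("INC-1", "disk_full", "Disk usage at 98% on node-7")
    print(analyze(incident, fake_llm))
```

Note how little is left once the prompt is written: the "analysis" is a single model call whose output we mostly have to take on faith.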

Anyways, the use of LLMs for RCAs spooked me viscerally. I couldn't find the exact words during the paper discussion, but I can articulate this better now. Let me explain.


RCA is serious business

Root cause analysis (RCA) is the process of identifying the underlying causes of a problem/incident, rather than just addressing its symptoms. One RCA heuristic is asking 5 Why's to push deeper into the cause-effect relationship. RCA should be done in a holistic (systems thinking) manner exploring the causes of a problem in different dimensions such as People, Process, Equipment, Materials, Environment, and Management. Finally, RCAs should consider relationships between the potential causes, as that may illuminate the pathways that lead to the problem.

Nancy Leveson is a leading expert in safety engineering. She is known for her work on preventing accidents in complex systems like airplanes and power plants. She developed a method called STAMP (Systems-Theoretic Accident Model and Processes) that looks at how accidents happen due to failures in controlling these systems, not just technical faults or human mistakes. Leveson's approach focuses on how different parts interact and influence each other.

The incident analysis Nancy Leveson does for the Bhopal disaster is really eye-opening. The pipe washing operation should have been supervised by a second shift operator, but that position had been eliminated due to cost cutting. But why? As the plant lost money, many of the skilled workers left, and they were either not replaced, or replaced by unskilled workers. (Boeing might have succumbed to cost-cutting pressures according to this article from 2019.)


My worries about systemic failures

So, safety engineering is a whole academic field with a lot of smart experts working on it. On the industry side, a lot of smart experts practice safety engineering, and possess a lot of wisdom. There are specialized go-to people in every big organization for these things. It would be very stupid if management decided that LLMs do a good job for RCA, and that the company doesn't need human experts investigating these issues.

Ok, maybe they won't be that careless, but I am still concerned this may lead to a decline in developing new experts. If LLMs are adopted to perform RCA, companies may stop hiring and training new engineers in this crucial skill.

I bet LLMs would not be able to do root cause identification as deeply as experts can. Consider the "RCA is serious business" section again. LLMs would not be able to dive deep, and would produce superficial results.

Furthermore, we should not get fixated on the "root cause" part of RCA. Most safety experts are allergic to the phrase root cause. An incident is often a systemic complex problem stemming from many things. So the analysis part, or rather actually performing the analysis, is the more important thing. Through the analysis, the goal is to prevent the recurrence of similar issues, thereby improving processes and enhancing safety.

If we offload the RCA learning/categorization part to the LLM (whatever that means), we wouldn't be able to make much progress on the enhancing reliability and safety part.

In sum, I am worried that the use of LLMs for RCA will lead to cost cutting, and this will lead to systemic failures in the mid-to-long term.


My worries about automation surprise

Another problem I can see with using LLMs for doing RCAs, maybe a short-to-mid term one, is the automation surprise problem.

Automation surprise occurs when an automated system behaves in an unexpected way, catching users off guard. This often happens because users don't fully understand how the automation works or the range of possible outcomes the system might produce.

For example, in aviation, pilots might experience automation surprise if an autopilot system suddenly changes the aircraft's behavior due to a mode switch they didn't anticipate. This can lead to confusion, reduced situational awareness, and potentially dangerous situations if the users cannot quickly understand and correct the system's actions.

This highlights the importance of designing automated systems that are predictable and providing adequate training so users are aware of the system's capabilities and limitations.

LLMs are prone to hallucination problems. After some initial success with RCA, people might start placing more trust in LLMs and build some automation around their decisions. They would then be in for a big surprise when things go awry, and they can't figure out the problem. 

The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair. -- Douglas Adams


Some LLM news from AWS

This bit of news is making the rounds. AWS seems to have integrated Amazon Q, their GenAI assistant for software development, into their internal systems and applied it to Java 17 upgrades of their systems: "The average time to upgrade an application to Java 17 plummeted from what’s typically 50 developer-days to just a few hours. We estimate this has saved us the equivalent of 4,500 developer-years of work (yes, that number is crazy but, real)."

I am fazed that there is not even a single negative comment under this announcement. Could this maybe go wrong?

When I was at AWS, I saw first-hand the excellent operational culture there. The weekly operational meeting (was that on Wednesday?) was really instructional in terms of learning the thought processes of these safety/security/availability expert engineers. The correct mindset to apply here is a paranoid mindset, and to scrutinize everything, and even be wary of success. It would be a shame if this culture erodes due to some early success with using LLMs to update software.

OK, so let's revisit that question. What could go wrong with LLMs making the upgrade to Java 17? I will speculate, because I don't know much about this problem. I can see some operational problems. If we put people to do this work, maybe while doing this on certain packages, they will notice, "oh shit, we never thought of this problem, but for these types of packages, upgrading them to be Java 17 compliant might open these security problems". We may be losing this opportunity of engineers going into the field, getting their hands dirty, and discovering certain problematic cases. Another problem, which I mentioned above, is that maybe we are failing to train new engineers for operational challenges.

I am not suggesting a Butlerian Jihad against LLMs. But I am worried that we are too enticed by LLMs. Ok, let's use them, but maybe we shouldn't open the fort doors to let them in.

Comments

Anonymous said…
I find it kind of funny how everyone is wowed by LLMs... except in their own field, where they can be neither reliable nor insightful.
wimaxapp said…
One of the key areas I have been working in is AIOps! In my first projects on Kubernetes events RCA, LLM-based troubleshooting and RCA do work well, depending on how we prepare the data, prompts, and the model. Distributed system RCA is more complex; it requires careful data modeling and prompt engineering depending on the scenario. There are still some challenges. But powerful, large-context LLMs make it possible to deal with data complexities to an extent.
Anonymous said…
> "The average time to upgrade an application to Java 17 plummeted from what’s typically 50 developer-days to just a few hours."

*blink*

...why are they boasting about migrating to Java 17... in 2024?

...and what versions were they migrating from? Java 17 didn't introduce any significant breaking changes (IME) for users on Java 16 or even the next previous LTS version, Java 11 - I don't work at Amazon, but surely Amazon isn't in the habit of running on unsupported JVMs? - so assuming these Java projects were being competently maintained, then the only work actually required to migrate to 17 is changing your `org.gradle.java.home=` path to where JDK 17 is installed, followed by running your test suite. If Amazon was using a monorepo then 1 person could do this in 5 minutes with a 1-liner awk/sed command - it would actually take an AI far longer to do this given LLMs don't (yet) have read/write access to my git repo SSD, and they'd probably want to re-evaluate the prompt for each project file. The "50 days" number he gives, without any context either, is a nice shorthand to communicate his utter disconnection from what really goes on inside his org.

In conclusion: these remarks by leadership unintentionally make the company look bad, not good, once you fill in the blanks to make up for what they didn't say. (What's next...? Big Brother increasing our chocolate ration to 20 grammes per week?)

-----

One more thing: the comment-replies to the post are utterly deranged and I genuinely can't put my sense of unease into words.
