You spent a decade eliminating toil: writing runbooks, automating away the repetitive work. Then AI-augmented engineering teams started shipping code faster than your team can operate it. The toil is back, with a vengeance.

The industry’s answer is to fight fire with fire. Every other booth at KubeCon Amsterdam was pitching an “AI SRE”: agents that triage incidents, run diagnostics, and write runbooks. But ask vendors how their tool handles novel production incidents in complex distributed systems, and the answers get vague fast.

So where do LLMs actually add value in operations, and where do they fall apart? Rootly’s data shows the average number of incidents per user has tripled since 2023. AI agents can handle the SEV-3s and SEV-2s. But when a SEV-0 hits across a complex distributed system, the agent can’t get you there. Is that a fundamental limit, or just a gap in current technology?

And if agents absorb the routine work while humans only see the hardest problems, what happens to the foundational skills that make someone good at the hard stuff? Operational maturity in five years depends on what we do with the answer.

Sylvain Kalache is Head of AI Labs at Rootly, where he’s building open-source, AI-driven reliability tools with backing from Anthropic and Google. Before Rootly, he was a Senior SRE at LinkedIn and co-founder of Holberton School, which has trained software engineers now working at major tech companies worldwide. He’s also building the SRE Skills Bench, an evaluation framework for measuring how well models actually perform on SRE-specific tasks.

Links: