How Machine Learning Can Transform Incident Management from a Burden into a DevOps Enabler
September 25, 2023

Ajay Singh
Zebrium

Among the most vital performance metrics for any DevOps team is mean time to recovery (MTTR) — the time to respond to a software or infrastructure issue and return to expected performance levels. As organizations embrace DevOps methodologies, with an emphasis on close collaboration and rapid iteration, they expect MTTR to improve.

But a long-running study of DevOps practices — Google Cloud's Accelerate State of DevOps Report — suggests that any historical gains in MTTR reduction have now plateaued. For years, the time it takes to restore services has stayed about the same: less than a day for high performers, up to a week for middle-tier teams and up to a month for laggards. The fact that progress is flat despite big investments in people, tools and automation is cause for concern. The likely culprit is complexity, which is growing as quickly as those investments: teams are running harder without actually gaining ground.

The good news is that machine learning (ML) can help development teams break through this seemingly intractable MTTR barrier, transforming incident management (IM) into not only a core competency but also an enabler of successful DevOps.

The IM Conundrum

Development teams struggle with IM for two reasons. First, as organizations move to cloud-native architectures, developers become responsible for a complex environment built around microservices, with every application an assemblage of many discrete, loosely coupled components. Not even the most experienced developer can envision all the pieces of that puzzle and how they fit together.

The second reason the needle hasn't moved on IM is the rapid pace of change in today's applications. In the past, when teams uncovered an issue, they built diagnostic and alert rules to respond faster the next time. They could benefit from those rules for months or even years.

Today, however, teams might roll out new versions of microservices almost daily. Any incident-based rules they build remain relevant for weeks at best. There's simply not a big enough payoff for investing in such semi-automated IM.

The Power of ML-based IM

Just as there are two reasons teams struggle with IM, there are two fundamental IM challenges. First is incident troubleshooting, which can involve sifting through millions or even billions of events. Hunting for an unknown root cause by recognizing patterns and identifying outliers in vast quantities of data is mentally exhausting for developers.

But not for ML.

ML is well-suited to pattern recognition and anomaly detection. It can quickly learn the baseline of time-series metrics, or the normal cadence of events, by training on data from any environment over a period of time.
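
To make that concrete, here is a minimal sketch of the idea in Python. A rolling mean and standard deviation stand in for the baseline a production ML model would actually learn, and the window size, threshold and latency data are illustrative assumptions rather than anything from a specific product.

```python
import numpy as np

def detect_anomalies(metric, window=60, threshold=3.0):
    """Flag points that deviate sharply from a rolling baseline.

    A rolling mean/std stands in for the learned baseline a real ML model
    would build; `metric` is a 1-D array of samples (e.g., latency per minute).
    """
    anomalies = []
    for i in range(window, len(metric)):
        baseline = metric[i - window:i]              # trailing "training" window
        mu, sigma = baseline.mean(), baseline.std()
        if sigma > 0 and abs(metric[i] - mu) > threshold * sigma:
            anomalies.append(i)                      # far outside the normal cadence
    return anomalies

# Example: a steady latency series with one injected spike
rng = np.random.default_rng(0)
latency = rng.normal(200, 10, 500)
latency[400] = 900
print(detect_anomalies(latency))                     # -> [400]
```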

The second IM challenge is root-cause analysis. Human brains aren't designed to identify root causes at the complexity and scale of modern applications. But ML is very good at it — even for event streams, or continuous series of events, which tend to be closer to root causes than time-series metrics.

Something else ML does well is correlation. In a complex system, the symptoms of an incident might show up in a database, but the root cause might originate in an authentication service or a third-party API. Such correlations are much quicker to uncover with an ML model that has been trained on anomaly patterns and correlations from the same system.
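
As a rough illustration, the sketch below buckets anomaly timestamps from two hypothetical services and measures how often they co-occur in time. A trained model would learn far richer correlations than this, but the principle is the same: symptoms in one component line up with anomalies in another.

```python
from collections import Counter

def bucket_counts(timestamps, bucket_secs=60):
    """Count anomalous events per time bucket for one service."""
    return Counter(int(t // bucket_secs) for t in timestamps)

def co_occurrence(a, b, bucket_secs=60):
    """Fraction of time buckets with anomalies in service A that also contain
    anomalies in service B; a crude stand-in for the correlations a trained
    ML model would maintain across an entire system."""
    ca, cb = bucket_counts(a, bucket_secs), bucket_counts(b, bucket_secs)
    if not ca:
        return 0.0
    return sum(1 for bucket in ca if bucket in cb) / len(ca)

# Hypothetical anomaly timestamps (in seconds) from two services
db_errors   = [3605, 3610, 3721, 7205]
auth_errors = [3601, 3608, 3725]
print(co_occurrence(db_errors, auth_errors))   # high overlap hints the two are linked
```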

Development teams are often hampered by a dearth of experienced site reliability engineers (SREs). Less experienced platform or network technicians often end up being the first line of IM defense. Because these team members are less familiar with application details, they rely on SREs to summarize incidents in written language — a costly and time-consuming process.

That's an area where large language models (LLMs) excel. ML can identify errors and distill them into a concise set of events. An LLM can then describe those events in plain language that less experienced staff can use to take action.
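
A minimal sketch of that handoff might look like the following. It uses the OpenAI Python client purely as one example of an LLM integration; the model name, prompt and sample events are illustrative placeholders, not a recommendation.

```python
from openai import OpenAI  # any LLM client would work; this is just one example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_incident(events: list[str]) -> str:
    """Ask an LLM to turn ML-flagged log events into a plain-language summary
    a first-line responder can act on. The model name and prompt are
    illustrative placeholders, not a vendor recommendation."""
    prompt = (
        "In plain language, summarize the likely root cause and suggested "
        "next steps for these anomalous log events:\n" + "\n".join(events)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

events = [
    "ERROR auth-service: token validation timed out after 30s",
    "WARN  db-proxy: connection pool exhausted (100/100 in use)",
]
print(summarize_incident(events))
```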

IM involves another type of connecting the dots, and that's contextualization. For example, have the root-cause events of today's problem been mentioned in past case notes, product documentation, or source code? If so, how do they inform our understanding of today's issue and the best corrective steps? This is an area where generative AI powered by an LLM can help, by connecting ML-generated root-cause reports to the repository of tribal knowledge within each organization.
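
The sketch below shows one way that lookup could work, with TF-IDF similarity standing in for the embedding-based retrieval a production pipeline would more likely use. The knowledge-base entries and the report text are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical repository of tribal knowledge: past case notes, runbooks, docs
knowledge_base = [
    "Case 1042: auth token timeouts traced to an expired signing certificate",
    "Runbook: database pool exhaustion; raise max_connections or recycle pods",
    "Case 0871: payment API latency caused by third-party rate limiting",
]

def find_related(root_cause_report, top_k=2):
    """Rank knowledge-base entries by similarity to an ML-generated root-cause
    report. TF-IDF is a stand-in for the embedding model a production
    retrieval pipeline would use."""
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(knowledge_base + [root_cause_report])
    scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
    return sorted(zip(scores, knowledge_base), reverse=True)[:top_k]

print(find_related("auth-service token validation timed out"))
```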

Achieving Positive IM Outcomes

Leveraging ML to improve IM can drive better outcomes in key ways:

Fast root-cause analysis

ML can uncover anomalies and correlations without anyone needing to know which queries to type, which filters to apply or which events to look for. Once the ML model understands what normal looks like, it can immediately spot outliers and their connections with "bad" events.

Eradication of silent bugs

For catching new bugs, statistical techniques can determine that a cluster of anomalies isn't likely to occur by chance. That can enable you to uncover silent bugs that might not yet be directly associated with an incident.
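
One simple statistical approach is a Poisson test: given the normal anomaly rate the model has learned, how likely is it to see this many anomalies in a short window purely by chance? The snippet below sketches that check; the baseline rate, window length and significance threshold are illustrative assumptions.

```python
from scipy.stats import poisson

def cluster_is_significant(observed, window_minutes, baseline_rate_per_min,
                           alpha=0.001):
    """Test whether `observed` anomalies in a window are unlikely to be chance.

    `baseline_rate_per_min` is the normal anomaly rate the model has learned.
    Under a Poisson assumption, the p-value is the probability of seeing at
    least this many anomalies when nothing is actually wrong.
    """
    expected = baseline_rate_per_min * window_minutes
    p_value = poisson.sf(observed - 1, expected)   # P(X >= observed)
    return p_value < alpha, p_value

# Example: 9 anomalies in a 10-minute window against a baseline of 0.2/minute
print(cluster_is_significant(9, 10, 0.2))          # -> (True, p well below alpha)
```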

Accelerated application recovery

Traditional incident troubleshooting is like hunting for a needle in a haystack without knowing what the needle looks like. ML is dramatically faster: it can present developers with a report that lists a small number of potentially relevant events. A person still has to make the final call, but a 5X to 6X reduction in troubleshooting time is achievable.

Ultimately, ML will transform IM from a necessary evil into a DevOps enabler. In The Lean Startup, author Eric Ries talks about creating an organizational "immune system." The idea is that if you make small adjustments every time you uncover a problem, you eventually build up defenses that allow you to quickly and nearly automatically recover from problems.

That's the eventual goal of leveraging ML for IM. When an incident occurs, the system should be smart enough to recognize that something went wrong, determine what happened, and automatically roll back the change that caused the problem. We might never achieve that level of automation for all incidents, but ML should soon be able to automate IM in 80% of cases. In the meantime, investing in ML-enabled IM can help development teams achieve tangible improvements today in IM and their DevOps effectiveness.

Ajay Singh is CEO of Zebrium