How Machine Learning Can Transform Incident Management from a Burden into a DevOps Enabler
September 25, 2023

Ajay Singh

Among the most vital performance metrics for any DevOps team is mean time to recovery (MTTR) — the time to respond to a software or infrastructure issue and return to expected performance levels. As organizations embrace DevOps methodologies, with an emphasis on close collaboration and rapid iteration, they expect MTTR to improve.

But a long-running study of DevOps practices, Google Cloud's Accelerate State of DevOps Report, suggests that historical gains in MTTR reduction have plateaued. For years now, the time it takes to restore services has stayed about the same: less than a day for high performers but up to a week for middle-tier teams and up to a month for laggards. The fact that progress is flat despite big investments in people, tools and automation is a cause for concern. The likely explanation is that complexity is growing as quickly as these investments, meaning we're running harder without actually making progress.

The good news is that machine learning (ML) can help development teams break through this seemingly intractable MTTR barrier, transforming incident management (IM) into not only a core competency but also an enabler of successful DevOps.

The IM Conundrum

Development teams struggle with IM for two reasons. First, as organizations move to cloud-native architectures, developers become responsible for a complex environment built around microservices, with every application an assemblage of many discrete, loosely coupled components. Not even the most experienced developer can envision all the pieces of that puzzle and how they fit together.

The second reason the needle hasn't moved on IM is the rapid pace of change in today's applications. In the past, when teams uncovered an issue, they built diagnostic and alert rules to respond faster the next time. They could benefit from those rules for months or even years.

Today, however, teams might roll out new versions of microservices almost daily. Any incident-based rules they build remain relevant for weeks at best. There's simply not a big enough payoff for investing in such semi-automated IM.

The Power of ML-based IM

Just as there are two reasons teams struggle with IM, there are two fundamental IM challenges. First is incident troubleshooting, which can involve sifting through millions or even billions of events. Hunting for an unknown root cause by recognizing patterns and identifying outliers in vast quantities of data is mentally exhausting for developers.

But not for ML.

ML is well-suited to pattern recognition and anomaly detection. By training on data from a given environment over a period of time, it can quickly learn the baseline of time-series metrics, or the normal cadence of events, for that environment.
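
As a rough illustration of what "learning a baseline" can mean for a time-series metric, here is a minimal Python sketch. It is not how any particular product works; it simply treats the trailing mean and standard deviation as the baseline and flags points that stray too far from it. The function name, the window size, the threshold, and the latency data are all assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=20, threshold=3.0):
    """Return indices whose value deviates strongly from the trailing baseline.

    The baseline for each point is the mean and sample standard deviation of
    the previous `window` points. Both defaults are illustrative assumptions.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds, with one spike at index 25.
latency_ms = [100 + (i % 5) for i in range(30)]
latency_ms[25] = 400

print(find_anomalies(latency_ms))  # prints [25]
```

A production system would also have to cope with seasonality, trends, and baselines polluted by earlier incidents, which this sketch ignores.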

The second IM challenge is root-cause analysis. Human brains aren't designed to identify root causes at the complexity and scale of modern applications. But ML is very good at it — even for event streams, or continuous series of events, which tend to be closer to root causes than time-series metrics.

Something else ML performs well is correlation. In a complex system, the symptoms of an incident might show up in a database, but the root cause might start in an authentication service or third-party API. Such correlations are much quicker to uncover using an ML model that has been trained on anomaly patterns and correlations from the same system.
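
To make the idea of cross-service correlation concrete, the following Python sketch groups anomalies that occur close together in time and reports which service misbehaved first. This is a simple temporal heuristic standing in for a trained model, and "first to misbehave" is only a hint, not proof of causation. The service names, timestamps, and the 60-second gap are hypothetical.

```python
def group_incidents(anomalies, max_gap_s=60):
    """Group (timestamp_s, service) anomalies into incidents.

    Two consecutive anomalies belong to the same incident when they are at
    most `max_gap_s` seconds apart (an illustrative assumption).
    """
    groups = []
    for ts, service in sorted(anomalies):
        if groups and ts - groups[-1][-1][0] <= max_gap_s:
            groups[-1].append((ts, service))
        else:
            groups.append([(ts, service)])
    return groups


def earliest_service(group):
    """Return the service with the first anomaly in a time-sorted group."""
    return group[0][1]


# Hypothetical anomalies: the database shows symptoms last, but the
# authentication service was the first to deviate.
anomalies = [
    (1045, "orders-db"),
    (1000, "auth-service"),
    (1020, "api-gateway"),
    (5000, "orders-db"),
]

groups = group_incidents(anomalies)
for group in groups:
    services = [service for _, service in group]
    print(earliest_service(group), "->", services)
```

Here the first incident spans three services and points back to the authentication service, while the isolated database anomaly at timestamp 5000 forms its own group.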

Development teams are often hampered by a dearth of experienced site reliability engineers (SREs). Less experienced platform or network technicians often end up being the first line of IM defense. Because these team members are less familiar with application details, they rely on SREs to summarize incidents in written language — a costly and time-consuming process.

That's an area where large language models (LLMs) excel. ML can identify errors and distill them to events. LLMs can then quickly describe the events in plain language that less experienced staff can use to take action.
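
One common way to distill raw log lines into events is to mask the variable parts of each line so that lines with the same structure collapse into one template. The Python sketch below shows that step only; the LLM summarization step is not shown. The regular expressions, placeholder tokens, and sample log lines are assumptions for illustration, and real log parsers are considerably more sophisticated.

```python
import re
from collections import Counter

# Order matters: mask the most specific patterns first.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line):
    """Replace variable tokens (IPs, hex values, numbers) with placeholders."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


# Hypothetical log lines.
logs = [
    "login failed for user 1042 from 10.0.0.7",
    "login failed for user 77 from 10.0.0.9",
    "cache miss for key 0x1f3a",
    "login failed for user 5 from 192.168.1.20",
]

counts = Counter(to_template(line) for line in logs)
for template, count in counts.most_common():
    print(count, template)
```

Four raw lines reduce to two event types, which is a far smaller input for a person, or an LLM, to describe in plain language.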

IM also requires another way of connecting the dots: contextualization. For example, have the root-cause events of today's problem been mentioned in past case notes, product documentation, or source code? If so, how do they inform our understanding of today's issue, and the best corrective steps? This is an area where generative AI powered by an LLM can help, by connecting ML-generated root-cause reports to the repository of tribal knowledge within each organization.
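
The retrieval half of that idea can be sketched without any LLM at all. The Python example below ranks past case notes by simple word overlap (Jaccard similarity) with a root-cause event. This is a deliberately crude stand-in for the embedding-based or LLM-based matching a real system would likely use, and the note identifiers and text are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def rank_notes(event, notes):
    """Rank past notes by Jaccard similarity to the event text, best first."""
    event_tokens = tokens(event)
    scored = []
    for title, body in notes.items():
        note_tokens = tokens(body)
        union = event_tokens | note_tokens
        score = len(event_tokens & note_tokens) / len(union) if union else 0.0
        scored.append((score, title))
    return sorted(scored, reverse=True)


# Hypothetical case notes and root-cause event.
notes = {
    "INC-hypothetical-1": "auth token refresh failed after certificate rotation",
    "INC-hypothetical-2": "disk full on build agent, cleaned old artifacts",
}
event = "token refresh failed: certificate expired"

ranked = rank_notes(event, notes)
for score, title in ranked:
    print(f"{score:.2f} {title}")
```

The best-matching note could then be handed to an LLM alongside the root-cause report so that the summary reflects what the organization already learned from earlier incidents.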

Achieving Positive IM Outcomes

Leveraging ML to improve IM can drive better outcomes in three key ways:

Fast root-cause analysis

ML can uncover anomalies and correlations without anyone having to know which queries to type, which filters to apply, or which events to look for. Once the ML model understands what normal looks like, it can spot outliers and their connections with "bad" events immediately.

Eradication of silent bugs

For catching new bugs, statistical techniques can determine that a cluster of anomalies isn't likely to occur by chance. That can enable you to uncover silent bugs that might not yet be directly associated with an incident.
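
As one hedged example of such a statistical technique, suppose anomalies normally arrive independently at a known average rate, so that the count in a time window follows a Poisson distribution. That modeling choice is an assumption made for this sketch, not something the approach requires. The probability of seeing a cluster at least as large as the one observed can then be computed directly.

```python
import math


def poisson_tail(k, expected):
    """Return P(X >= k) for X ~ Poisson(expected)."""
    below = sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


# Hypothetical numbers: a service normally produces 0.5 anomalies per
# five-minute window, and 6 anomalies were just observed in one window.
p = poisson_tail(6, 0.5)
print(f"chance of 6 or more anomalies in one window: {p:.2e}")

# An illustrative cutoff; a real system would tune this to its alert budget.
if p < 1e-3:
    print("cluster is unlikely to be chance; worth investigating")
```

With these numbers the probability is on the order of one in 100,000, so the cluster deserves a look even if no user-facing incident has been reported yet.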

Accelerated application recovery

Traditional incident troubleshooting is like hunting for a needle in a haystack without knowing what the needle looks like. ML is much faster: it can present developers with a report that lists a small number of potentially relevant events. A person still has to make a decision, but a 5X to 6X reduction in troubleshooting time is achievable.

Ultimately, ML will transform IM from a necessary evil into a DevOps enabler. In The Lean Startup, author Eric Ries talks about creating an organizational "immune system." The idea is that if you make small adjustments every time you uncover a problem, you eventually build up defenses that allow you to quickly and nearly automatically recover from problems.

That's the eventual goal of leveraging ML for IM. When an incident occurs, the system should be smart enough to recognize that something went wrong, determine what happened, and automatically roll back the change that caused the problem. We might never achieve that level of automation for all incidents, but ML should soon be able to automate IM in 80% of cases. In the meantime, investing in ML-enabled IM can help development teams achieve tangible improvements today in IM and their DevOps effectiveness.

Ajay Singh is CEO of Zebrium
