How Artificial Intelligence is Revolutionizing IT Operation Analytics
July 20, 2016

Akhil Sahai
Perspica

After many science fiction plots and decades of research, Artificial Intelligence (AI) is being applied across industries for a wide variety of purposes. AI, Big Data and human domain knowledge are converging to create possibilities formerly only dreamed of. The time is ripe for IT operations to incorporate AI into its processes.

IT infrastructures today are increasingly dynamic and agile but at the same time extraordinarily complex. Humans are no longer able to sift through the variety, volume and velocity of Big Data streaming out of IT infrastructures in real time, making AI—especially machine learning—a powerful and necessary tool for automating analysis and decision making. By helping teams bridge the gap between Big Data and humans, and by capturing human domain knowledge, machine learning is able to provide the necessary operational intelligence to significantly relieve this burden of near real-time, informed decision-making. Industry analysts agree. In fact, Gartner named machine learning among the top 10 strategic technologies for 2016, noting “The explosion of data sources and complexity of information makes manual classification and analysis infeasible and uneconomical.”

IT administrators, IT operators for TechOps and Site Reliability Engineers (SRE) for DevOps are tasked with manually gathering this disparate information and applying their domain expertise in an attempt to make informed decisions. While these professionals are great at what they do, trying to analyze so much data from multiple tools leaves the door wide open for human error. On the other hand, analytics that are based on machine learning are quickly becoming a necessity to ensure the availability, reliability, performance and security of applications in today's digital, virtualized and hybrid-cloud network environments.

The traditional approach centered around using multiple monitoring tools for IT siloes that provided IT operations teams with information about their virtual and physical infrastructure, application infrastructure and application transaction performance. While these tools provide pieces of the puzzle, they offer a narrow view of the IT infrastructure and, therefore, only one aspect of the tool chain. The other aspect is service desk tools that manage tickets and change management. Humans more often than not bridge this gap between the siloed monitoring tools of yesterday and service desk applications with their domain expertise.

What Analytics Can Do Now

Today, the entire application infrastructure stack is overflowing with Big Data. TechOps and DevOps environments need to automate, learn and make intelligent, informed decisions based on real-time analysis of all that data. Following are key analytics for IT operations:

1. Anomaly Detection: Machine learning algorithms should have the ability to look at contextual, historical and sudden changes in the behavior of objects to detect anomalies. Understanding when there is a real anomaly and more importantly, when there is not, is critical to avoid generating false alarms. This is the bedrock of what is typically referred to as diagnostic analytics.

2. Topology Analysis: This type of analytics understands the hierarchal, peer-to-peer and temporal relationship between hybrid cloud elements. Topology is something every IT administrator or SRE should be aware of. This type of analysis should be able to self-learn the inter-relationships of objects and the impact of their performance on one another. Learning those relationships and maintaining that understanding in order to spot trouble in time is extremely important for both TechOps and DevOps environments.

3. Behavior Profiling: This is about understanding the behavior profile of every metric, how that is incorporated into the object behavior and then how the object behaviors relate to other object behaviors across the hybrid cloud environment. It is a multi-dimensional problem, and understanding and adapting to “normal” behavior is extremely important.

4. Root Cause: By finding the specific cause and impact of an incident, root-cause analysis is able to fast-track the resolution and reduce mean time to repair substantially.

5. Predictive: These analytics help operators identify early indicators and provide insights into looming problems that may eventually lead to performance degradation and outages. Predictive analytics are also good at providing early insights into anomalies to better plan for what's ahead.

6. Prescriptive: When you are looking for intelligent and actionable recommendations to remediate an incident, prescriptive analytics are the way to go. These recommendations should capture tribal knowledge gathered over the years in the organization and best practices in the industry, and may even be crowd-sourced to capture state-of-the-art knowledge. These analytics provide the opportunity to finally close the loop in automated IT Operations Management.

Embracing Machine Learning

It's been tough for a while now to be in IT operations, having to constantly react to incidents as well as trying to resolve them after they have spun out of control. Instead, AI provides technologies to help automate many of these tasks in order to handle incidents in advance. The whole notion of automating IT operational tasks, as well as preventing outages in the first place, and getting to the root cause quickly and in an automated way is the next frontier in remediating these issues.

As Gartner so eloquently put it, manual classification and analysis is infeasible and uneconomical. Not even an army of IT staff could review monitoring data quickly and thoroughly enough to identify incidents. Fortunately, AI has the capacity to enable real-time decision making by using multiple analytics capabilities simultaneously to see what's going on across the application stack.

Akhil Sahai, Ph.D., is VP Product Management at Perspica.

Share this

Industry News

October 06, 2022

Platform.sh announced it has partnered with MongoDB.

October 06, 2022

Veracode announced the enhancement of its Continuous Software Security Platform to include container security.

This early access program for Veracode Container Security is now underway for existing customers.

The new Veracode Container Security offering, designed to meet the needs of cloud-native software engineering teams, addresses vulnerability scanning, secure configuration, and secrets management requirements for container images.

October 06, 2022

Mirantis announced that Mirantis Container Runtime – latest generation of the Docker Enterprise Engine, the secure container runtime that forms the foundation of Mirantis Container Cloud and Mirantis Kubernetes Engine and is used at the heart of many other Kubernetes deployments – is now available in the Microsoft Azure Marketplace.

October 05, 2022

Perforce Software announced enhanced support for automated testing with the release of Helix ALM 2022.2.

October 05, 2022

Parasoft announced the latest releases of its API and microservices testing tools, including SOAtest, Virtualize, CTP, and Selenic.

October 05, 2022

Vaadin announced the release of four Acceleration Kits designed to make it faster and easier to build and modernize Java applications for enterprise use.

October 04, 2022

Pegasystems announced the latest release of Robot Studio, the robotic process automation (RPA) low-code authoring environment for Pega's intelligent automation platform.

October 04, 2022

EvolveWare announced the Agile Business Rules Extraction (Agile BRE) solution on its Intellisys platform.

October 04, 2022

Mabl announced new features that empower quality professionals to easily validate APIs as part of their integrated end-to-end tests.

October 03, 2022

Spectro Cloud announced a major new release of its Palette Edge platform.

October 03, 2022

Arcion announced agentless change data capture (CDC) for all of its supported databases and applications.

September 29, 2022

CloudBees announced the acquisition of ReleaseIQ to expand the company’s DevSecOps capabilities, empowering customers with a low-code, end-to-end release orchestration and visibility solution.

September 29, 2022

SmartBear continues expanding its commitment to the Atlassian Marketplace, adding Bugsnag for Jira and SwaggerHub Integration for Confluence.

Bugsnag developers monitoring application stability and documenting in Jira no longer need to interrupt their workflow to access the app. Developers working in SwaggerHub can use the macro to push API definitions and changes directly to other teams and business stakeholders that work within Confluence. By increasing the presence of SmartBear tools on the Atlassian Marketplace, the company continues meeting developers where they are.

September 29, 2022

Ox Security exited stealth today with $34M in funding led by Evolution Equity Partners, Team8, and M12, Microsoft's venture fund, with participation from Rain Capital.

September 29, 2022

cnvrg.io announced that the new Intel Developer Cloud is now available via the cnvrg.io Metacloud platform, providing a fully integrated software and hardware solution.