A Smarter, Data Science-Driven Approach to Operations
December 04, 2015

Amit Sasturkar
OpsClarity

Data science and machine learning algorithms have become pervasive throughout the modern consumer world. There are many successful applications of machine learning in consumer products that we use on a daily basis, including:

■ Movie, music and product recommendations

■ Ad targeting

■ Web search

Yet when we look at the Ops world, we find no breakthrough product that incorporates these same machine learning innovations.

Consider the page-level and host-level features used in search ranking (for example, PageRank for URLs or host rank for hosts). These features are typically computed from the WebMap, the massive graph whose nodes are URLs and whose edges are the hyperlinks between them. The PageRank algorithm ranks the URLs in the WebMap based on those hyperlinks, and it is a very effective way to estimate the overall importance of a URL.
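As a rough illustration of how link structure becomes an importance score, the sketch below computes PageRank by power iteration on a toy three-node graph. It assumes the commonly used damping factor of 0.85 and spreads the rank of pages with no outgoing links evenly across all pages. The function name `pagerank` and the example graph are hypothetical; a web-scale system would run the same computation as a distributed job rather than in a single Python loop.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links maps each node to the list of nodes it links to (hypothetical input format).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(iterations):
        # Rank held by dangling nodes (no outgoing links) is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        # Every node gets the teleport share plus its slice of the dangling mass.
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        # Each node passes the damped remainder of its rank evenly along its out-links.
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy "web": a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

Here `c` ends up with the highest score because both `a` and `b` link to it, which is the intuition behind using link structure to estimate importance.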

What if we used similar ideas to rank the hundreds or sometimes thousands of alerts that operations engineers receive, especially when they are managing hundreds of machines? What is the equivalent of the WebMap in the Ops world?

Duplicate web page detection provides another relevant example. These algorithms run as MapReduce jobs on massive Hadoop clusters (thousands of machines) and detect duplicate pages across tens of billions of web pages. When mappers or reducers fail, or when performance degrades, hundreds of alerts are generated, many of them for the same underlying root cause.

What if we applied the techniques of web page duplicate detection to eliminate the duplicate and unnecessary alerts received by Ops?
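One classic duplicate-detection technique is shingling: break each document into overlapping k-word sequences and compare the resulting sets using Jaccard resemblance. The sketch below applies that idea to alert messages. The digit masking, the 3-word shingles, the 0.5 similarity threshold, the greedy grouping and the sample alerts are all illustrative assumptions; web-scale systems typically approximate the Jaccard comparison with MinHash or SimHash instead of comparing sets directly.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a message, with digits masked."""
    # Mask numbers so alerts that differ only in task IDs, host IDs or counts look alike.
    words = re.sub(r"\d+", "#", text.lower()).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard resemblance |A ∩ B| / |A ∪ B| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_duplicates(alerts, threshold=0.5):
    """Greedily put each alert into the first group whose representative is similar enough."""
    groups = []  # list of (representative shingle set, member alerts)
    for alert in alerts:
        s = shingles(alert)
        for rep, members in groups:
            if jaccard(s, rep) >= threshold:
                members.append(alert)
                break
        else:
            # No sufficiently similar group: this alert starts a new one.
            groups.append((s, [alert]))
    return [members for _, members in groups]


# Hypothetical alerts from a failing MapReduce job plus one unrelated alert.
alerts = [
    "Reducer task 17 failed on host worker-042: out of memory",
    "Reducer task 23 failed on host worker-107: out of memory",
    "Disk usage above 90% on host db-01",
]
for group in group_duplicates(alerts):
    print(group)
```

With these sample alerts, the two reducer failures collapse into a single group while the disk alert stays on its own, so an engineer would see two items instead of three.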

A third example is the personalization of content. Personalization is a well-studied problem in the consumer space, where user feedback, both implicit (clicks and actions) and explicit (reviews and ratings), supplies critical inputs to the learning algorithms. With this type of machine learning, the more time a user spends with a product, the better their experience becomes.

What if we incorporated feedback to learn Ops users’ preferences and continuously improve the accuracy of alert generation and alert ranking?
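One simple way to learn from implicit feedback is to treat each alert type as a coin whose bias is the probability that an engineer acts on it, and to estimate that bias from observed feedback under a Beta prior. The sketch below ranks alert types by that smoothed estimate. The class name, the alert-type labels and the Beta(1, 1) prior are hypothetical choices; a real system would likely use richer features than the alert type alone.

```python
from collections import defaultdict


class FeedbackRanker:
    """Rank alert types by a smoothed estimate of how often users act on them (hypothetical)."""

    def __init__(self, prior_acted=1.0, prior_ignored=1.0):
        # Beta(prior_acted, prior_ignored) prior; (1, 1) is Laplace smoothing.
        self.prior_acted = prior_acted
        self.prior_ignored = prior_ignored
        self.acted = defaultdict(int)
        self.ignored = defaultdict(int)

    def record(self, alert_type, acted_on):
        """Log one piece of implicit feedback (acted on vs. dismissed) for an alert type."""
        if acted_on:
            self.acted[alert_type] += 1
        else:
            self.ignored[alert_type] += 1

    def score(self, alert_type):
        """Posterior mean probability that this alert type is actionable."""
        a = self.acted.get(alert_type, 0) + self.prior_acted
        b = self.ignored.get(alert_type, 0) + self.prior_ignored
        return a / (a + b)

    def rank(self, alert_types):
        """Order alert types from most to least likely to be actionable."""
        return sorted(alert_types, key=self.score, reverse=True)


ranker = FeedbackRanker()
for acted in [True] * 8 + [False] * 2:
    ranker.record("disk_full", acted)
for acted in [True] + [False] * 9:
    ranker.record("cpu_spike", acted)
print(ranker.rank(["cpu_spike", "disk_full", "new_alert_type"]))
# prints ['disk_full', 'new_alert_type', 'cpu_spike']
```

The prior keeps a brand-new alert type at a neutral score of 0.5 instead of burying it, so it ranks above types that engineers routinely ignore and below types they routinely act on. Each new piece of feedback shifts the scores, which is how the ranking can improve the longer a team uses it.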

The answers to these questions will become evident as we bring the data science and machine learning innovations that are commonplace in the consumer world to the Ops world. DevOps teams need, in effect, an “expert assistant” that can learn their application and system environment, detect and correlate failures, and make recommendations that sharpen focus and boost productivity, even as everything around them continuously changes. It’s time for Ops to get smarter.

Amit Sasturkar is Co-Founder and CTO of OpsClarity.
