Bugs in Production: How to Avoid Unpleasant Surprises

May 23, 2019

Frank Huerta
Curtail

In the DevOps rapid iteration cycle, too many organizations push their software and services out without being able to properly test for bugs that surface only under production traffic. The result can be unanticipated downtime and, in the worst case, an outage of the whole service. No one wants that. So, what can be done?

The Perils of Buggy Code

Downtime is expensive, both financially and to the brand. Gartner has estimated that the average cost of downtime is $5,600 a minute. That's well over $300,000 an hour. To provide a real-world example of what this looks like, Microsoft Azure suffered a major outage in November 2018 caused by issues introduced as part of a code update. The outage lasted for 14 hours and affected customers throughout Europe and beyond. With migration from legacy systems to microenvironments in the cloud, outages and downtime pose a growing and serious problem.

The quality-testing tools in use now don't tell developers how a new software version will perform in production, or whether it will work in production at all. The Cloudbleed bug is an example of this problem. A coding error in a software upgrade from security vendor Cloudflare introduced a serious vulnerability that went undetected for months, until a Google researcher discovered it in February 2017.

In addition to having the immediate impacts mentioned above, flaws can lead to serious security issues later. Heartbleed, a vulnerability disclosed in 2014 that stemmed from a programming mistake in the OpenSSL library, left large numbers of private keys and other sensitive information exposed to the internet, enabling theft of data that would otherwise have been protected by SSL/TLS encryption.

The Need to Test with Production Traffic

For today's increasingly frequent and fast development cycles, the way QA testing is typically done is no longer sufficient. Traditionally, DevOps teams haven't been able to do side-by-side testing of the production version and an upgrade candidate. The QA testing used by many organizations is a set of simulated test suites, which may not give comprehensive insight into the myriad ways in which customers may actually make use of the software. Just because upgraded code works under one set of testing parameters doesn't mean it will work in the unpredictable world of production usage.

In the case of the Cloudflare incident, the error went entirely unnoticed by end-users for an extended period of time, and no system errors were logged as a result of the flaw. Just as QA testing isn't sufficient on its own, relying on system logs and user reports also limits what can be detected.

Fixing bugs post-release gets pricey. It's estimated to be five times as expensive as fixing them during design, and it can lead to even costlier development delays. Giving software teams a way to identify potential bugs and security concerns prior to release can alleviate those delays. Clearly, testing with production traffic earlier in the code development process can save time, money and pain. Software and DevOps teams need a way to test quickly and accurately how new releases will perform with real (not just simulated) customer traffic, while maintaining the highest standards.

If teams have the capability to evaluate release versions side-by-side, they can quickly locate any differences or defects. In addition, they can gain real insight into network performance while also verifying the stability of upgrades and patches in a working environment. Doing this efficiently will significantly reduce the likelihood of releasing software that later needs to be rolled back. Rollbacks are expensive, as we saw in the case of the Microsoft Azure incident.
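To make the idea of side-by-side evaluation concrete, here is a minimal sketch in Python. It is not a description of any particular product: the two handler functions, the `Response` type, and the sample requests are all hypothetical stand-ins. The point is only the shape of the technique: send the same request to the current version and to the upgrade candidate, then record every case where the answers differ.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Response:
    """Simplified response: only a status code and a body (an assumption)."""
    status: int
    body: str


Handler = Callable[[str], Response]


def compare_versions(
    requests: Iterable[str], production: Handler, candidate: Handler
) -> List[Tuple[str, Response, Response]]:
    """Send each request to both versions and collect any mismatches."""
    diffs = []
    for request in requests:
        prod_response = production(request)
        cand_response = candidate(request)
        if prod_response != cand_response:
            diffs.append((request, prod_response, cand_response))
    return diffs


# Hypothetical handlers standing in for two deployed versions.
def production_v1(path: str) -> Response:
    return Response(200, f"hello {path}")


def candidate_v2(path: str) -> Response:
    # Simulated regression: the new version mishandles an empty path.
    if path == "":
        return Response(500, "error")
    return Response(200, f"hello {path}")


diffs = compare_versions(["a", "", "b"], production_v1, candidate_v2)
for request, prod_response, cand_response in diffs:
    print(f"request={request!r}: production={prod_response} candidate={cand_response}")
```

In this toy run only the empty-path request is flagged. In practice the requests would come from real traffic rather than a hand-written list, and only the production version's answer would be returned to the customer.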

Teams sometimes stage rollouts, which necessitates running multiple software versions in production. The software teams put a small percentage of users on the new version, while most users run the status quo. Unfortunately, this approach to testing with production traffic is cumbersome to manage, costly and still vulnerable to rollbacks. The other problem with these kinds of rolling deployments is that while failures can be caught early in the process, they are — by design — only caught after they've affected end-users.
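For readers unfamiliar with how a staged rollout decides who gets the new version, the following sketch shows one common way it could be done: hash each user ID into one of 100 buckets and send the lowest buckets to the new release. The function name, the user IDs, and the 10% figure are illustrative assumptions, not details from any specific deployment tool.

```python
import hashlib


def assigned_version(user_id: str, rollout_percent: int) -> str:
    """Deterministically place a user in one of 100 buckets.

    Users whose bucket falls below rollout_percent get the new version;
    everyone else stays on the current one.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "new" if bucket < rollout_percent else "current"


# Hypothetical user population and a 10% rollout.
users = [f"user-{i}" for i in range(10_000)]
on_new = sum(assigned_version(user, 10) == "new" for user in users)
share = on_new / len(users)
print(f"{share:.1%} of users are on the new version")
```

Because the assignment depends only on the user ID, a given customer sees either the new version or the old one, never both. That is exactly why the business cannot compare results from the same customer side-by-side under this approach.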

Issues Remain

Important questions arise at this point. For instance, how do you know whether the new software is causing the "failures"? And because the business cannot observe side-by-side results from the same customer, how many "failures" does it allow before recalling or rolling back the software? Every failure in the meantime disrupts the end-user experience, which ultimately affects business operations and company reputation. Staging also may not provide a large enough sample to gauge how the new release will behave across the entire population of customers.

Another issue that persists is cost. Even if you stage with only 10% of customers on the new version, if a failure costs more than $300,000 an hour, then a failure affecting 10% of users could potentially still cost more than $30,000 per hour. The impact is reduced, of course, but it's still significant, not counting the uncertainty of when to roll back.
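The arithmetic behind these figures can be checked in a few lines. The sketch below uses the $5,600-a-minute Gartner estimate cited earlier and assumes, as the paragraph above does, that cost scales linearly with the share of users affected. The function name and parameters are illustrative.

```python
COST_PER_MINUTE = 5_600  # Gartner estimate cited in this article, in dollars


def hourly_cost(cost_per_minute: int, affected_percent: int = 100) -> float:
    """Hourly downtime cost, assuming cost scales linearly with users affected."""
    return cost_per_minute * 60 * affected_percent / 100


full_outage = hourly_cost(COST_PER_MINUTE)         # all users affected
staged_outage = hourly_cost(COST_PER_MINUTE, 10)   # 10% of users affected

print(f"Full outage:   ${full_outage:,.0f} per hour")
print(f"Staged at 10%: ${staged_outage:,.0f} per hour")
```

At the cited rate, a full outage works out to $336,000 an hour and a failure limited to 10% of users to $33,600 an hour, consistent with the "more than $300,000" and "more than $30,000" figures above.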

A Better Way

Gone are the days when standard QA testing sufficed. Instead, DevOps teams have the option of testing in production and evaluating release versions side-by-side. This reduces the risk of bugs that comes with today's rapid dev cycles, and it helps organizations release products that are secure and high-quality while avoiding expensive rollbacks or staging.

Frank Huerta is CEO of Curtail

