We Don't Need Another Hero
May 27, 2016

Dave Josephsen

Someone recently asked me this question: "What's the first thing that comes to your mind when you hear the word 'DevOps'?"

A loaded question, I agree, and of course I lied, and made up something about the "first way." I mean really, if you want a possibly embarrassing answer to a loaded question you really should confront me face to face with it in a public place. If the asker of that question had done so, I would have had to answer honestly that the first thing I think about when I hear the word DevOps is Brent from The Phoenix Project.

If you haven't read it, Brent is probably you. That person who knows how all the stuff actually works, and who everyone depends on to fix things when they go sideways. Brent is a hero, and DevOps abhors heroes. They constrain the productivity of the system overall and create single points of failure. They enable broken systems to limp along instead of allowing them to fail and be replaced. So in the book, Brent is basically a huge organizational problem.

I was deeply hurt by this plot device on my first reading. In fact if I hadn't been trapped on a plane with nothing else to read, I probably wouldn't have finished The Phoenix Project because of it (which totally would have been my loss). DevOps is an emergent property of failure in classical IT management, and it's just hard to wrap your head around the fact that sometimes failure needs to happen before we can pick up the pieces and make something great from them.

Further, the DevOps end-game is hard to visualize, because in many ways it's orthogonal to what preceded it. Many organizations just can't get there until they've failed. Getting there isn't easy either way, so we talk a lot about improvement-kata and "getting there", and why you'd want to be there in the first place. What it looks like day-to-day once you've arrived is something you almost never hear about.

What does that mean? Well, to understand that, I need to digress a little bit into what we make for a living, which happens to be time-series databases.

One big problem with designing time-series databases is that the read and write patterns are very different. Typically we want to optimize for writes, because we write a lot more than we read, but reads also need to be pretty efficient or no one will want to use the UI. One way we manage to minimize read latency is to use a rotating row key. That is, we simply preface the name of a given metric with a key that changes every so often, which in turn forces the DB to create a new row to store the data. Breaking the data up like this helps us keep the rows down to a predictable size, and allows us to parallelize queries of large datasets (i.e., we can read out a bunch of chunks of predictably sized data at the same time rather than one huge chunk).
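To make the rotating row key concrete, here is a minimal sketch in Python. It is not our actual implementation: the 6-hour interval, the `bucket:metric` key layout, and the function names are all assumptions chosen for illustration.

```python
# Assumed rotation interval: a new row key every 6 hours.
ROW_INTERVAL_SECONDS = 6 * 60 * 60


def row_key(metric: str, timestamp: int, interval: int = ROW_INTERVAL_SECONDS) -> str:
    """Preface the metric name with the start of the interval containing timestamp."""
    bucket = timestamp - (timestamp % interval)
    return f"{bucket}:{metric}"


def row_keys_for_range(metric: str, start: int, end: int,
                       interval: int = ROW_INTERVAL_SECONDS) -> list[str]:
    """Every row key that overlaps [start, end]; each row can be fetched in parallel."""
    first = start - (start % interval)
    return [f"{bucket}:{metric}" for bucket in range(first, end + 1, interval)]
```

Every write whose timestamp lands in the same interval goes to the same row, and a read over a long time range fans out into one fetch per key instead of one huge scan.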

Predictable is something of a key word there, because really what row-keys buy you is a heap of different kinds of very important predictability. With row keys, we know our rows will always be the size of a measurement times the number of measurements in a row-key interval. From there we can extrapolate all kinds of stuff, like how much data we'll need to put on the wire for client reads, how much storage and processing power our data tier will require, and so on. The point is, we can make really important decisions by relying on the predictability that row-keys provide.
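The arithmetic is simple enough to sketch. The numbers used in the comments (16-byte measurements, one measurement per minute, 6-hour rows) are made up for illustration and are not our real figures.

```python
def row_size_bytes(measurement_bytes: int, resolution_seconds: int,
                   interval_seconds: int) -> int:
    """Row size = size of one measurement * measurements per row-key interval."""
    measurements_per_row = interval_seconds // resolution_seconds
    return measurement_bytes * measurements_per_row


def storage_per_metric_per_day(measurement_bytes: int, resolution_seconds: int,
                               interval_seconds: int) -> int:
    """Extrapolate daily storage for one metric from the fixed row size."""
    rows_per_day = (24 * 60 * 60) // interval_seconds
    return rows_per_day * row_size_bytes(
        measurement_bytes, resolution_seconds, interval_seconds
    )


# Hypothetical inputs: 16-byte measurements, 60-second resolution, 6-hour rows.
example_row = row_size_bytes(16, 60, 6 * 60 * 60)
example_day = storage_per_metric_per_day(16, 60, 6 * 60 * 60)
```

With those assumed inputs, each row holds 360 measurements, so every row is the same known size and capacity planning becomes multiplication.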

But here's the rub. We're a multi-tenant system. In other words, we have users, and users have the power to name things. Uh-oh. If users can put an arbitrarily-sized label on a data point, then we no longer have a predictably sized data structure. That's ok, we can isolate that variable by creating a predictably sized UID which maps back to the end-user's human-readable label.
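One way to get a predictably sized UID is to hash the label and keep a table that maps the UID back to the original name. The sketch below assumes a truncated SHA-256 and an 8-byte UID; both choices, and the `NameIndex` class itself, are illustrative rather than a description of what we run.

```python
import hashlib

UID_BYTES = 8  # assumed fixed UID width


class NameIndex:
    """Maps arbitrary-length user labels to fixed-size UIDs and back."""

    def __init__(self) -> None:
        self._label_by_uid: dict[bytes, str] = {}

    def uid_for(self, label: str) -> bytes:
        # Truncating a hash can collide, so a collision is detected and refused.
        uid = hashlib.sha256(label.encode("utf-8")).digest()[:UID_BYTES]
        existing = self._label_by_uid.setdefault(uid, label)
        if existing != label:
            raise ValueError(f"UID collision between {existing!r} and {label!r}")
        return uid

    def label_for(self, uid: bytes) -> str:
        return self._label_by_uid[uid]
```

However long the user's label is, the data tier only ever sees `UID_BYTES` bytes, and the lookup table absorbs the variability.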

So when you use a row key to buy predictable quantities in the data tier, you also often have to pay some taxes in the form of lookup-tables, or indexes if you prefer. In this example we're going to need two indexes: one to keep track of where in our storage tier a given 6-hour block of measurements is stored (because our rows are named after rotating numbers now, so we need some way to actually find the right data when a user asks for 60 minutes of metric:foo), and another to map user-generated variable-length names into either hashes or some other unique identifier (so we know what to even search for in the first index when a user asks for 60 minutes of metric:foo).
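Here is a rough sketch of how those two indexes cooperate on a read. Everything in it is hypothetical: the `MetricCatalog` class, the counter-based UIDs, and the location strings are stand-ins, and plain dictionaries play the role of what would really be index tables.

```python
from itertools import count

BLOCK_SECONDS = 6 * 60 * 60  # the 6-hour block from the text


class MetricCatalog:
    def __init__(self) -> None:
        self._next_uid = count(1)
        # Index 2: user-supplied name -> fixed-size identifier.
        self.uid_by_name: dict[str, int] = {}
        # Index 1: (uid, block start) -> where that block lives in storage.
        self.location_by_block: dict[tuple[int, int], str] = {}

    def record_block(self, name: str, timestamp: int, location: str) -> None:
        if name not in self.uid_by_name:
            self.uid_by_name[name] = next(self._next_uid)
        uid = self.uid_by_name[name]
        block_start = timestamp - (timestamp % BLOCK_SECONDS)
        self.location_by_block[(uid, block_start)] = location

    def locate(self, name: str, start: int, end: int) -> list[str]:
        """Resolve the name first, then find every stored block overlapping [start, end]."""
        uid = self.uid_by_name[name]
        first = start - (start % BLOCK_SECONDS)
        return [
            self.location_by_block[(uid, block)]
            for block in range(first, end + 1, BLOCK_SECONDS)
            if (uid, block) in self.location_by_block
        ]
```

Note that a request for 60 minutes of metric:foo can straddle a block boundary, in which case the first index returns two locations rather than one.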

Ok, now we know pretty much everything we need to re-examine the problem. Ben, the ops guy in that conversation, is tracking what he considers to be a resource allocation problem. The symptom Ben is reacting to here is high CPU utilization.

Dave Josephsen is the Developer Evangelist for Librato.
