Designing a Zero-Downtime System in Kubernetes
August 18, 2020

Gigi Sayfan
Author of "Mastering Kubernetes"

The following is an excerpt from my book, Mastering Kubernetes:

There is no such thing as a zero-downtime system. All systems fail and all software systems definitely fail. Sometimes the failure is serious enough that the system or some of its services will be down. Think about zero downtime as a best-effort distributed system design.

You design for zero downtime in the sense that you provide a lot of redundancy and mechanisms to address expected failures without bringing the system down. As always, remember that, even if there is a business case for zero downtime, it doesn't mean that every component must have zero downtime. Reliable (within reason) systems can be constructed from highly unreliable components.

The plan for zero downtime is as follows:

Redundancy at every level

This is a required condition. You can't have a single point of failure in your design because when it fails, your system is down.

Automated hot swapping of failed components

Redundancy is only as good as the ability of the redundant components to kick into action as soon as the original component has failed. Some components can share the load (for example, stateless web servers), so there is no need for explicit action. In other cases, such as the Kubernetes scheduler and controller manager, you need a leader election in place to make sure the cluster keeps humming along.

Tons of metrics, monitoring, and alerts to detect problems early

Even with careful design, you may miss something or some implicit assumption might invalidate your design. Often, such subtle issues creep up on you and with enough attention, you may discover it before it becomes an all-out system failure.

For example, suppose there is a mechanism in place to clean up old log files when disk space is over 90% full, but for some reason, it doesn't work. If you set an alert for when disk space is over 95% full, then you'll catch it and be able to prevent the system failure.

Tenacious testing before deployment to production

Comprehensive tests have proven themselves as a reliable way to improve quality. It is hard work to have comprehensive tests for something as complicated as a large Kubernetes cluster running a massive distributed system, but you need it.

What should you test? Everything. That's right. For zero downtime, you need to test both the application and the infrastructure together. Your 100% passing unit tests are a good start, but they don't provide much confidence that when you deploy your application on your production Kubernetes cluster, it will still run as expected.

The best tests are, of course, on your production cluster after a blue-green deployment or identical cluster. In lieu of a full-fledged identical cluster, consider a staging environment with as much fidelity as possible to your production environment. Here is a list of tests you should run. Each of these tests should be comprehensive because if you leave something untested, it might be broken:

• Unit tests

• Acceptance tests

• Performance tests

• Stress tests

• Rollback tests

• Data restore tests

• Penetration tests

Does that sound crazy? Good. Zero-downtime, large-scale systems are hard. There is a reason why Microsoft, Google, Amazon, Facebook, and other big companies have tens of thousands of software engineers (combined) just working on infrastructure, operations, and making sure things are up and running.

Keep the raw data

For many systems, the data is the most critical asset. If you keep the raw data, you can recover from any data corruption and processed data loss that happens later. This will not really help you with zero downtime because it can take a while to re-process the raw data, but it will help with zero data loss, which is often more important. The downside to this approach is that the raw data is often huge compared to the processed data. A good option may be to store the raw data in cheaper storage compared to the processed data.

Perceived uptime as a last resort

OK. Some part of the system is down. You may still be able to maintain some level of service. In many situations, you may have access to a slightly stale version of the data or can let the user access some other part of the system. It is not a great user experience, but technically the system is still available.

Gigi Sayfan is a software engineer and author of the book "Mastering Kubernetes"
Share this

Industry News

March 30, 2023

CloudBees announced the integration of CloudBees’ continuous delivery and release orchestration solution, CloudBees CD/RO, with Argo Rollouts.

March 30, 2023

amazee.io, a Mirantis company, announced that its fully-managed application delivery platform is available in AWS Marketplace.

March 30, 2023

env0 secured an additional $18.1 million of funding to conclude its Series A investment round with a total of $35.1 million.

March 29, 2023

Planview announced a new strategic collaboration with UiPath. The integration is designed to fuse the UiPath Business Automation Platform with the Planview Value Stream Management (VSM) solution Planview® Tasktop Hub.

March 29, 2023

Noname Security announced major enhancements to its API security platform to help organizations protect their API ecosystem, secure their applications, and increase cyber resilience.

March 28, 2023

Mirantis announced the latest version of Mirantis Container Cloud -- MCC 2.23 -- that simplifies operations with the ability to monitor applications performance with a new Grafana dashboard and to make updates to Kubernetes clusters with a one-click “upgrade” button from a web interface.

March 28, 2023

Pegasystems announced updates to Pega Cloud supported by an enhanced Global Operations Center to deliver a more scalable, reliable, and secure foundation for its suite of AI-powered decisioning and workflow automation solutions.

March 28, 2023

D2iQ announced the launch of DKP Gov, a new container-management solution optimized for deployment within the government sector.

March 28, 2023

StackHawk announced the availability of StackHawk Pro and StackHawk Enterprise for trial and purchase through the Amazon Web Services (AWS) Marketplace.

March 27, 2023

Octopus Deploy announced the results KinderSystems has seen working with Octopus. Through the use of Octopus, KinderSystems automates its software deployment processes to meet the complex needs of its customers and reduce the time to deploy software.

March 27, 2023

Elastic Path announced Integrations Hub, a library of instant-on, no-code integrations that are fully managed and hosted by Elastic Path.

March 27, 2023

Yugabyte announced key updates to YugabyteDB Managed, including the launch of the YugabyteDB Managed Command Line Interface (CLI).

March 23, 2023

Ambassador Labs released Telepresence for Docker, designed to make it easy for developer teams to build, test and deliver apps at scale across Kubernetes.

March 23, 2023

Fermyon Technologies introduced Spin 1.0, a major new release of the serverless functions framework based on WebAssembly.

March 23, 2023

Torc announced the acquisition of coding performance measurement application Codealike to empower software developers with even more data that increases skills, job opportunities and enterprise value.