Navigating the Complexities of Operating Large-Scale Kubernetes Environments - 2

July 14, 2022

Sayandeb Saha
NetApp

As containers become the default choice for developing and distributing modern applications and Kubernetes (k8s) the de-facto platform for deploying, running, and scaling such applications, enterprises need to scale their Kubernetes environments rapidly to keep up. However, rapidly scaling Kubernetes environments can be challenging and create complexities that may be hard for you to address and difficult to resolve without a clear strategy. Part 2 of this blog specifies a few more common techniques that you can use to navigate the complexities of managing scaled-out Kubernetes environments.

Start with: Navigating the Complexities of Operating Large-Scale Kubernetes Environments - 1

Keeping Up with Kubernetes Updates

Kubernetes is a thriving open-source project delivering rapid innovation with releases three times a year. If using fully managed Kubernetes from public cloud providers, be prepared for Kubernetes service life cycles that are aggressive. Test your applications with newer versions of Kubernetes as they are released to minimize upgrade-related downtime. If possible, avoid in-place upgrades of Kubernetes clusters — create new clusters, clone your applications to the new clusters, divert traffic to the new clusters, and retire the old clusters. Proactively adopt more recent versions of Kubernetes for running your business-critical applications to prevent public cloud providers from upgrading your Kubernetes control plane version after the end of life of a particular version of the Kubernetes control plane.

For self-managed Kubernetes platforms, vendors also release aggressively to keep up with upstream innovation. You will have more control over when to upgrade, but you do not want to fall behind as it becomes difficult to upgrade if you are too far back and vendors discontinue support for the versions you are on.

Most Kubernetes providers document their life cycle. Read, understand, and take the necessary actions to keep up with rapid releases and subsequent end-of-life schedules.

Reduce or Eliminate Application/Cluster Downtime

Like all other applications and environments, Kubernetes applications and clusters can also experience service-impacting disasters or outages, which can be self-inflicted or accidental. To keep up with the rapid upgrades as explained in the previous section and recover from unplanned outages, use commercially licensed or open-source Kubernetes data protection solutions that provide backup, DR, and mobility for Kubernetes applications. While adopting such solutions look for ones can handle scaled out multi-cluster environments providing a single pane of glass for your K8s protection needs.

GitOps for Application Life-Cycle Management

Releasing applications on Kubernetes can be challenging and even more daunting in scaled-out environments. GitOps, which leverages the power of Git, a popular software version control tool, to provide both revision and change control for applications within the Kubernetes platform, is a best practice that you should consider adopting in large Kubernetes environments.

This model stores the system's desired state in a software version control system like Git. Developers make changes to the configuration files representing the desired state instead of using CLI or GUI to directly make changes on the K8s clusters. A delta between the desired state stored in Git and the system's actual state indicates the changeset that needs to be deployed. These changesets can be reviewed and approved (or rejected) through standard Git processes such as pull requests, code reviews, and merges to master. Approved and merged changesets to the main branch are applied to K8s clusters for changing the system's current state to the desired state based on the configuration stored in Git.

You can quickly and easily release applications using this practice and roll back as needed if things don't go according to plan. Using GitOps for change control leverages Kubernetes' core functionality as a reconciliation engine. This process provides an implicit audit trail of actions taken while releasing applications enabling easier troubleshooting and root cause analyses in large K8s environments.

Comprehensive Observability

Rich observability is essential for maintaining large Kubernetes environments so that you can proactively and reactively mitigate issues that can otherwise become a revenue and/or productivity impacting outage. Kubernetes observability is complex as Kubernetes constitutes multiple layers of infrastructure and several distinct, highly distributed services, each producing its own set of monitoring data with no single master source/log.

To maintain large Kubernetes environments, you must implement:

■ Monitoring of K8s infrastructure (cluster, nodes, namespaces, pods, etc.) and application resources (CPU, memory, storage, networking)

■ Log collection and management for all Kubernetes services and infrastructure

■ Alerts and notifications

Monitoring data generated from various sources need to be collected separately, correlated, and sometimes analyzed to provide the full context of each event or change to an admin, who can understand it, and take corrective action(s) as needed to keep your environment humming without disruption.

Summary

If you have started dabbling into Kubernetes or have small/medium K8s environments, it's only a matter of time you will be managing a large K8s environment as developers embrace containers and Kubernetes for new apps and refactor existing apps. Adopting a few strategies outlined here can reduce some of your pains that are associated with large K8s estates. Seek solutions that can help with your data management needs for large scale Kubernetes environments making upgrades easier, recover from disasters faster, and backup your precious application data with support for "Namespace-as-a-Service" operating models commonly used in such environments.

Sayandeb Saha is Sr. Director, Product Management, at NetApp

Industry News

LambdaTest Integrates with Assembla

June 03, 2025

LambdaTest announced its partnership with Assembla, a cloud-based platform for version control and project management.

Salt Security Introduces Salt Illuminate

June 03, 2025

Salt Security unveiled Salt Illuminate, a platform that redefines how organizations adopt API security.

Workday Introduces AI Developer Toolset

June 03, 2025

Workday announced a new unified, AI developer toolset to bring the power of Workday Illuminate directly into the hands of customer and partner developers, enabling them to easily customize and connect AI apps and agents on the Workday platform.

Pega Agentic Process Fabric Introduced

June 02, 2025

Pegasystems introduced Pega Agentic Process Fabric™, a service that orchestrates all AI agents and systems across an open agentic network for more reliable and accurate automation.

Fivetran Expands Connector SDK to Support Any Source With Native-Grade Reliability and Performance

June 02, 2025

Fivetran announced that its Connector SDK now supports custom connectors for any data source.

Copado Robotic Testing Available in AWS Marketplace

June 02, 2025

Copado announced that Copado Robotic Testing is available in AWS Marketplace, a digital catalog with thousands of software listings from independent software vendors that make it easy to find, test, buy, and deploy software that runs on Amazon Web Services (AWS).

AI-Powered Defense at the Edge: Check Point Launches New Branch Office Security Gateways with 4x Faster Threat Prevention Performance

May 29, 2025

Check Point® Software Technologies Ltd.(link is external) announced major advancements to its family of Quantum Force Security Gateways(link is external).

Sauce Labs Releases iOS 18 Testing on Virtual Device Cloud

May 29, 2025

Sauce Labs announced the general availability of iOS 18 testing on its Virtual Device Cloud (VDC).

Infragistics Ultimate 25.1 Released

May 29, 2025

Infragistics announced the launch of Infragistics Ultimate 25.1, the company's flagship UX and UI product.

CIQ Establishes Open Source Program Office

May 29, 2025

CIQ announced the creation of its Open Source Program Office (OSPO).

Check Point Accelerates Threat Detection and Response with AI-Powered Security Management for the Modern Enterprise

May 28, 2025

Check Point® Software Technologies Ltd.(link is external) announced the launch of its next generation Quantum(link is external) Smart-1 Management Appliances, delivering 2X increase in managed gateways and up to 70% higher log rate, with AI-powered security tools designed to meet the demands of hybrid enterprises.

Salesforce Signs Definitive Agreement to Acquire Informatica

May 28, 2025

Salesforce and Informatica have entered into an agreement for Salesforce to acquire Informatica.

Red Hat and Google Cloud Extend Alliance on Agentic AI

May 28, 2025

Red Hat and Google Cloud announced an expanded collaboration to advance AI for enterprise applications by uniting Red Hat’s open source technologies with Google Cloud’s purpose-built infrastructure and Google’s family of open models, Gemma.

Mirantis k0rdent Enterprise and Mirantis k0rdent Virtualization Released

May 28, 2025

Mirantis announced Mirantis k0rdent Enterprise and Mirantis k0rdent Virtualization, unifying infrastructure for AI, containerized, and VM-based workloads through a Kubernetes-native model, streamlining operations for high-performance AI pipelines, modern microservices, and legacy applications alike.

Snyk Announces AI Trust Platform

May 28, 2025

Snyk launched the Snyk AI Trust Platform, an AI-native agentic platform specifically built to secure and govern software development in the AI Era.

DEVOPSdigest

Keeping Up with Kubernetes Updates

Reduce or Eliminate Application/Cluster Downtime

GitOps for Application Life-Cycle Management

Comprehensive Observability

Summary

Industry News

On-Demand Webinars

Analyst Reports

White Papers

Media Partners

The Latest

Hot Topics

Keeping Up with Kubernetes Updates

Reduce or Eliminate Application/Cluster Downtime

GitOps for Application Life-Cycle Management

Comprehensive Observability

Summary

Related Links

Industry News

Search form

On-Demand Webinars

Analyst Reports

White Papers

Media Partners

User login

The Latest

Hot Topics