Kubernetes Phase 2—Key Challenges at Scale

By: Evan Klein

Kubernetes is THE buzzword these days. Almost every IT organization is currently using it or is in the process of implementing it as part of their infrastructure. The transition to Kubernetes is complicated, whether a company is using an on-premises, cloud, hybrid, or managed solution, and it usually involves other changes in the codebase, such as shifting to a microservices architecture. While the implementation phase is led by the DevOps team, it requires the participation of the whole R&D group. It is a complex project that should be planned and tracked.

What happens after the implementation is completed? What are Kubernetes’ phase 2 challenges—the ones that DevOps teams face on a day-to-day basis?

This article will describe some of the challenges that need to be met in order to maintain, monitor, secure, and continue developing a working cluster that supports business needs.

How To Maintain and Scale Kubernetes

The first challenge DevOps teams face after deploying Kubernetes and getting it running is maintenance. Both the Kubernetes master and the kubelets on its nodes regularly receive updates that can improve or secure the cluster but can also break important functionality in the application. Every upgrade, whether to a node kubelet or to the master, should be examined from all relevant angles. Its consequences should be considered and tested in different environments (including the pre-production environment) to prevent service downtime and adverse effects on business operations. Because Kubernetes is constantly evolving and new vulnerabilities keep being discovered, it is recommended to upgrade cluster resources continuously in order to stay current with the latest security patches.
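
One check that can be automated before an upgrade reaches pre-production is version skew between the master and the node kubelets. The sketch below is illustrative only: the function names and the max_minor_skew parameter are assumptions made for this example, and the permitted skew depends on the Kubernetes release in use, so it is passed in rather than hard-coded. The only rule the sketch relies on is that a kubelet should not be newer than the control plane it talks to.

```python
import re


def parse_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as 'v1.29.3'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def upgrade_is_safe(control_plane: str, kubelet: str, max_minor_skew: int) -> bool:
    """Return True if the kubelet is not newer than the control plane
    and lags it by at most max_minor_skew minor versions.

    max_minor_skew is a hypothetical parameter: take the real value
    from the version skew policy of the release you are running.
    """
    cp_major, cp_minor = parse_minor(control_plane)
    kl_major, kl_minor = parse_minor(kubelet)
    if cp_major != kl_major:
        return False
    skew = cp_minor - kl_minor
    return 0 <= skew <= max_minor_skew
```

A check like this could run in the release pipeline for every node before the upgrade plan is approved, so that an out-of-order upgrade is caught before it reaches production.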

The Horizontal Pod Autoscaler (HPA) may be Kubernetes’ most powerful feature. It allows DevOps teams to define rules for each type of service and to scale the number of pods for that service based on CPU usage, memory usage, and other metrics. High load, DDoS attacks, and similar problems become less critical, since Kubernetes can absorb many of them without human intervention. However, the same rules that can immensely improve service to customers can also degrade that service if they are configured incorrectly. For services that cannot be scaled, configuring HPA can produce situations the service was never designed to handle. Some services must maintain a minimum number of running pods, and changing this configuration without understanding its effects can be disastrous. Each service’s HPA configuration (which is defined in the Kubernetes deployment layer) should be reviewed and tested regularly as part of the release process.
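
To see why the minimum and maximum replica settings matter, it helps to look at the core scaling rule described in the Kubernetes HPA documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up. The Python sketch below is a simplified model of that rule, not the controller itself. It ignores stabilization windows, unready pods, and missing metrics, and the function and parameter names are chosen for this example.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Simplified model of the HPA scaling rule.

    desired = ceil(current_replicas * current_metric / target_metric),
    skipped when the ratio is within `tolerance` of 1.0, then clamped
    to the configured [min_replicas, max_replicas] range.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# CPU at 90% against a 60% target with 4 pods scales out to 6.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
# A very quiet service is still held at the configured minimum of 2.
print(desired_replicas(4, 6, 60, min_replicas=2, max_replicas=10))
```

The second call shows the failure mode described above: if the minimum were lowered without understanding the service, the same rule would happily scale it down to a single pod.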

Another interesting feature of Kubernetes is high availability. It allows a DevOps team to spread availability risk across different locations around the globe. Kubernetes’ high availability clusters allow you to run pods in different data centers, minimizing the effects of an unresponsive data center. DevOps teams must be familiar with high availability. They must be able to configure it correctly and to respond to different types of failures, including Kubernetes master or replica incidents.
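
A simple way to reason about whether a service is really spread across data centers is to ask what remains if the most heavily loaded location disappears. The sketch below assumes you already have, for each running pod of a service, the name of the data center (called a "zone" here) it runs in. How that list is collected is outside the scope of the example, and the function names are illustrative.

```python
from collections import Counter


def zone_skew(pod_zones: list[str], zones: list[str]) -> int:
    """Difference between the most and least populated zone.

    `zones` lists every zone that should host pods, so a zone with
    no pods at all counts as zero rather than being ignored.
    """
    if not zones:
        raise ValueError("zones must not be empty")
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in zones]
    return max(per_zone) - min(per_zone)


def survives_zone_loss(pod_zones: list[str], min_available: int) -> bool:
    """True if at least `min_available` pods remain after losing
    the single zone that currently hosts the most pods."""
    counts = Counter(pod_zones)
    if not counts:
        return min_available <= 0
    remaining = sum(counts.values()) - max(counts.values())
    return remaining >= min_available
```

Running checks like these against each critical service gives a quick, testable answer to the question of how much an unresponsive data center would actually hurt.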

Cluster Security

As part of the infrastructure planning and Kubernetes setup process, you should consider how security affects architecture and deployment. While security is always considered in the design phase, it is often forgotten after the transition and initial deployment are completed. Kubernetes security should remain a top priority for DevOps teams, and it should constantly evolve and undergo reviews by the security officers.

A separate blog post discusses the security operations and activities that should be implemented to protect the Kubernetes cluster. DevOps teams must remember that ongoing activities such as adding new services, configuring new ports and firewall rules, and adding new node pools require security review and approval. Every new pod should have its relevant secrets protected (using whatever method was agreed upon), and every node should have the latest OS patches installed. Security review processes and pen tests should be automated (see DevSecOps methodology for implementation guidance) for the cluster as a whole in order to provide a smoother release and prevent possible breaches.

Monitor Everything and Stay Focused!

Most organizations prefer to focus on application delivery. They allocate fewer budgetary resources and less attention to monitoring the cluster and figuring out how it can improve the production environment and the customer experience from day one. Effort should be invested into deploying a monitoring solution—including the decision and configuration of all thresholds and relevant alerts—as part of the cluster design and setup.

Kubernetes monitoring includes keeping track of pod health and readiness, the current number of nodes and pods, and all cluster resources (e.g., CPU, memory, and disk space). Monitoring the cluster gives the DevOps team a broader view of its status, so that during incident handling all important information is available for rapid problem solving. But monitoring is a tricky process.

Old-school best practices preach the generation of an alert for every problematic activity in the application and a human response to every worrisome behavior. When Kubernetes is used correctly, some of these alerts are no longer needed. When a service is unavailable, Kubernetes knows to restart the relevant pods and make sure that the service is working again. When a node is not responsive, Kubernetes knows to reschedule the relevant services onto healthy nodes and, where node autoscaling is configured, to spin up a new node. In many of these cases, a call-to-action alert is not required. Only an informative alert should be generated to trigger a root cause investigation.

When services are running inside Kubernetes, not every restart of a backend service demands a response. In order to prevent false positives and preclude possible credibility problems with the monitoring and alerting systems, alerts must be configured appropriately. They should be generated based on Kubernetes events that were validated as actual problems. Alerts on pod restarts should be sent only if the restarts are recurring and are based on pod readiness failures. Dashboards should be built to round up all relevant data and cluster status metrics in order to allow the user to visualize this information in a clear and meaningful way. Monitoring systems should be selected based on their ability to support Kubernetes infrastructure, not based solely on their logging and monitoring abilities.
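
The restart rule above can be expressed as a small classifier that separates call-to-action alerts from informative ones. This is a sketch under stated assumptions: the PodStatus record, the ten-minute window, and the threshold of three restarts are all hypothetical values chosen for illustration, and a real setup would take its data from whatever monitoring system is in use. The function only covers restart-based alerts, not readiness alerts in general.

```python
from dataclasses import dataclass, field


@dataclass
class PodStatus:
    """Hypothetical snapshot of one pod, as a monitoring system might report it."""
    name: str
    ready: bool
    restart_times: list[float] = field(default_factory=list)  # Unix timestamps


def classify_restart_alert(
    pod: PodStatus,
    now: float,
    window_seconds: float = 600.0,  # assumed look-back window
    max_restarts: int = 3,          # assumed "recurring" threshold
) -> str:
    """Return 'action', 'info', or 'none' for a pod's restart history.

    'action': restarts are recurring within the window AND the pod is
              currently failing readiness.
    'info':   the pod restarted recently but recovered, or restarts
              are not yet recurring.
    'none':   no restarts inside the window.
    """
    recent = [t for t in pod.restart_times if now - window_seconds <= t <= now]
    if not recent:
        return "none"
    if len(recent) >= max_restarts and not pod.ready:
        return "action"
    return "info"
```

With a rule like this, a single restart that Kubernetes has already recovered from produces only an informative record for later root cause investigation, while a pod stuck in a restart loop and failing readiness still pages someone.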

Summary

Kubernetes has rapidly become a valuable and significant component of every IT infrastructure. Nevertheless, achieving real business value from Kubernetes is not a simple task. Kubernetes maintenance and management may end up demanding more attention than planned, and its ROI may drop to zero, or even turn negative, if the cluster is run without expertise in establishing and maintaining a stable, efficient, and valuable Kubernetes environment. Recognizing Kubernetes’ potential weaknesses, preparing a continuation plan for the cluster, and employing a CD process that covers all the points in this article are critical to maximizing the efficacy of this tool.