FinOps Observability: Monitoring Kubernetes Cost
With the current financial climate, cost reduction is top of mind for everyone. IT is one of the biggest cost centers in organizations, and understanding what drives those costs is critical.
Many organizations simply don’t understand the cost of their Kubernetes workloads, or even have observability into basic units of cost. This is where FinOps comes into play, and organizations are beginning to implement its best practices and open standards to understand their costs.
There’s also a fascinating open source project, OpenCost, which aims to provide an open standard for FinOps on Kubernetes under the Cloud Native Computing Foundation.
Kubernetes and Cloud Native Add Complexity to FinOps
Kubernetes is the new kid on the IT infrastructure block. The first edition of the Cloud FinOps book didn’t really cover containers and Kubernetes beyond a brief chapter. But Kubernetes deserves much more attention, as it makes things more complex.
When reviewing the cloud cost of shared Kubernetes resources, it can be challenging to attribute spend per customer, per team or per environment, for example. It can also be challenging to track the cost efficiency of Kubernetes workload allocations over time, across different aggregations, and to perform capacity planning.
The first challenge is that Kubernetes is designed to share infrastructure resources for efficiency, which makes the attribution model tricky: even deployments and namespaces are not really isolated and share the underlying resources. The attribution challenge goes beyond Kubernetes nodes and into persistent volumes, load balancers and other types of resources.
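A common starting point for attribution is to label every workload with ownership metadata, so that cost tooling can aggregate along those dimensions. Here is a minimal sketch of a Deployment carrying such labels; the label keys and values (team, product, env) are illustrative, not any formal standard:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: payments
  labels:
    team: payments       # illustrative keys; define your own taxonomy
    product: checkout
    env: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
        team: payments      # repeat the labels on the pod template,
        product: checkout   # since cost tools typically aggregate at pod level
        env: production
    spec:
      containers:
        - name: api
          image: example.org/checkout-api:1.4.2
```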
The second challenge is the added complexity of microservices architecture, which is commonplace in Kubernetes workloads. In these cases, where each incoming request flows through a chain of interacting microservices, advanced context propagation techniques are required to trace invocations of downstream backends back to the respective business unit, product line or tenant.
The third challenge is the dynamic scalability of cloud native applications, which can make costs vary suddenly as an application scales. This brings up the need to correlate cost spikes with application needs and behavior, and with business KPIs. For example, one spike could stem from a Black Friday sale craze on an e-commerce website, a legitimate business outcome, while another could be a buggy configuration that scales out some backend services with no guardrails.
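One simple guardrail is to put hard bounds on autoscaling. As a sketch, the HorizontalPodAutoscaler below (workload names are hypothetical) keeps a backend between 2 and 20 replicas, so a buggy configuration or a traffic anomaly cannot scale costs without limit:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 2
  maxReplicas: 20   # hard ceiling acting as a cost guardrail
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```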
The need for FinOps becomes obvious when you have to make this attribution, then forecast and plan capacity, and negotiate pricing based on where you expect the business to grow across different teams and product lines, given existing consumption trends.
In an article a couple of years ago I called for the evolution from common infrastructure to common practice to support FinOps workflows inside the organization. Now it’s time to move from common practice to common specification, to create the much-needed common language across organizations and vendors.
Optimizing Resource Utilization: Capacity Management in Kubernetes
The first challenge I mentioned above was that Kubernetes is designed to share infrastructure resources. This is not just a challenge for attribution modeling, but also for capacity management. As our Kubernetes environment expands and our applications evolve, effectively managing resources becomes critical.
At this stage, issues such as overcommitted nodes, pods without memory limits, and resource wastage are encountered more frequently, and their impact is more noticeable due to the scale. Planning resource allocation, monitoring usage and optimizing efficiency become a necessity. Let’s look at some common issues.
High node pressure occurs when the total resource requests from pods on a node exceed the node’s available resources, resulting in node overcommitment. This leads to resource contention and performance degradation as the node struggles to fulfill all requests adequately. Overcommitted nodes can result in increased latency, pod evictions (as part of the automated node-pressure eviction process), and potential system instability, ultimately impacting the reliability and performance of applications running in the cluster.
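To detect overcommitment before it bites, you can compare the sum of pod resource requests on each node against that node’s allocatable capacity. Here is a minimal sketch as a Prometheus alerting rule, assuming kube-state-metrics is installed and exposing its standard kube_pod_container_resource_requests and kube_node_status_allocatable metrics:

```yaml
groups:
  - name: capacity
    rules:
      - alert: NodeMemoryOvercommitted
        # Fires when the memory requested by pods on a node exceeds
        # what the node can actually allocate.
        expr: |
          sum by (node) (kube_pod_container_resource_requests{resource="memory"})
            /
          sum by (node) (kube_node_status_allocatable{resource="memory"})
            > 1
        for: 15m
        labels:
          severity: warning
```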
Running pods without memory limits is generally considered a bad practice, as it can lead to resource contention, where one pod consumes excessive memory and impacts the performance and stability of other pods on the same node. In extreme cases where a pod consumes all available memory on a node, it may render the entire node non-responsive. It also makes resource usage difficult to predict and manage, leading to inefficient allocation and unnecessary infrastructure costs.
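Setting explicit requests and limits avoids these failure modes. A minimal example of a pod with a memory request and limit; the values are illustrative and should be sized from observed usage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
    - name: worker
      image: example.org/worker:2.0
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"   # what the scheduler reserves for this container
        limits:
          memory: "512Mi"   # beyond this, the container gets OOM-killed
```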
Underutilized resources occur when CPU and memory allocated within the cluster sit largely idle, which inflates infrastructure costs. This inefficiency can arise from overprovisioning, where resource requests exceed actual application needs, or from inefficient scheduling that deploys pods with excessive resource requirements.
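One way to quantify this waste is to compare actual usage against requests. The recording rule below is a sketch, assuming cAdvisor’s container_memory_working_set_bytes and kube-state-metrics’ request metrics are available in your Prometheus; a ratio well below 1 indicates overprovisioning you are paying for:

```yaml
groups:
  - name: efficiency
    rules:
      - record: namespace:memory_request_utilization:ratio
        # Memory actually used divided by memory requested, per namespace.
        expr: |
          sum by (namespace) (container_memory_working_set_bytes{container!=""})
            /
          sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})
```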
Kubernetes exposes various metrics that can assist in monitoring resource utilization, as well as various configuration options, such as setting appropriate resource requests and limits, optimizing pod sizes, defining pod priorities and required quality of service, and tuning scheduling algorithms and eviction policies, all of which can help maximize resource utilization and minimize unnecessary expenditure in Kubernetes deployments.
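For instance, a LimitRange can enforce sane defaults for every container in a namespace, so that pods deployed without explicit requests or limits still get bounded allocations (the namespace and values below are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:     # applied when a container specifies no request
        cpu: "100m"
        memory: "128Mi"
      default:            # applied when a container specifies no limit
        cpu: "500m"
        memory: "256Mi"
```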
These options are available at various granularities, from individual containers, pods and nodes up to entire workloads and clusters. For more on these and other common issues, and how to monitor and address them with proper capacity management in Kubernetes, check out this comprehensive guide.
OpenCost: Open Specification for Kubernetes Cost Monitoring
The Cloud Native Computing Foundation (CNCF) is an open source foundation which hosts Kubernetes, as well as many other open source cloud-native projects. In 2022, the CNCF added the OpenCost project to its ranks, to address the specific FinOps challenges of Kubernetes.
OpenCost is not a tool but a vendor-neutral specification for implementing cost monitoring in Kubernetes environments. The project brings vendors and end users together to write a specification for what it means to monitor Kubernetes for cost and how to identify different types of usage, whether idle or allocated.
It was started by Kubecost, whose founders came from Google with combined experience in containerized workloads and their FinOps aspects. Many other companies are involved in the specification, including AWS, Google, Microsoft, D2iQ, Adobe, Armory, New Relic, Mindcurv and Red Hat. OpenCost is also a FinOps Foundation certified solution.
The first focus of the OpenCost specification was allocation monitoring. Essentially, OpenCost queries the cloud APIs of the various providers for the relevant information and compares that with the Kubernetes usage. Taking AWS as an example, the query can tell you how many EC2 instances are running and their cost per hour according to the list price. OpenCost then queries Kubernetes to discover the namespaces, workloads and pods, so you can break down those EC2 instances by all of those Kubernetes primitives.
In addition to the specification, Kubecost also offers a UI that lets you slice and dice this data to explore it. The data is stored in Prometheus, so you can use any Prometheus-compatible tool to analyze it.
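For example, assuming the node_total_hourly_cost metric that OpenCost exports to Prometheus (check the OpenCost documentation for the exact metric names in your version), a recording rule can project an estimated monthly cluster cost:

```yaml
groups:
  - name: opencost
    rules:
      - record: cluster:estimated_monthly_cost:sum
        # Sums the hourly list-price cost of all nodes and projects it
        # over an average month (~730 hours).
        expr: sum(node_total_hourly_cost) * 730
```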
The OpenCost specification is already implemented in some tools, such as Kubecost, Amazon EKS, Prometheus, Vantage and Grafana Labs. As the specification matures, I expect to see more OpenCost-compatible API endpoints out there, so you can pull your financial data out of them in a standard manner.
OpenCost community manager Matt Ray shared with me the project’s goal: “The goal for us with OpenCost is just make it the ubiquitous default monitoring stack for cost. So as soon as you spin up a cluster and a Kubernetes cluster in any public cloud, you just throw an OpenCost on it to keep an eye on it and then put it on the dashboard of your choice. And so that’s what we’re doing with OpenCost.”
OpenCost works out of the box with the top cloud providers: AWS, GCP and Azure. More cloud providers were added later, such as Scaleway, a European provider, and Alibaba Cloud. It uses on-demand pricing for its calculations (without taking into account dynamic changes such as discounts or remaining credits from plans). There is also support for on-premises deployments, by providing static pricing.
OpenCost reports what’s allocated within your Kubernetes cluster, namely instances, storage and networking. The plan is to bring in external asset costs, such as managed database services (e.g. RDS), object storage (e.g. S3) and monitoring.
You can follow the project on opencost.io and on the CNCF Slack in the #opencost channel, as well as in its GitHub repo at github.com/opencost.
Endnote
As more organizations migrate their applications to microservices and containerized workloads over Kubernetes, we need to adapt our FinOps practices to match the dynamic nature and growing complexity that comes with it.
An open standard for reporting and tracking this FinOps data, across the different hosted Kubernetes providers as well as self-hosted clusters, will help us reduce the complexity. OpenCost has an impressive list of contributors, including the three major cloud providers and other prominent vendors.
The real test will be these vendors implementing the specification in their products and services, and making it their ubiquitous format for reporting and consuming cloud cost data.
Want to learn more? Check out my OpenObservability Talks podcast episode: FinOps Observability: Monitoring Kubernetes Cost with OpenCost.