Is Kubernetes Monitoring Flawed?

By: Dotan Horovits

Kubernetes has come a long way, but open source Kubernetes monitoring still needs improvement. One reason is that it produces far more data than most teams actually need.

For example, a 3-node Kubernetes cluster with Prometheus will ship around 40,000 active series by default. Do we really need all that data? Some of the issues that need to be addressed with Kubernetes monitoring include high churn rate of pod metrics, proliferation of metrics with low usage, and configuration complexity.

I discussed this topic on the OpenObservability Talks podcast with Aliaksandr Valialkin, CTO at VictoriaMetrics and creator of its open source time series database and monitoring solutions. Aliaksandr is a Golang engineer who likes writing simple, performant code and creating easy-to-use programs. These requirements are hard to satisfy together, but sometimes they do align, as in the case of VictoriaMetrics.

During the podcast, we discussed the aforementioned common problems, as well as directions and best practices to overcome some of these complexities as individuals and as a community. We also discussed the VictoriaMetrics open source project and how it addresses some of these challenges.

How to Fix the Kubernetes Metrics Problem

By default, Kubernetes exposes a very large number of metrics from the start. Even when you monitor a cluster of only three nodes, for example, you usually end up with tens of thousands of active series, as mentioned above.
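To get a feel for where those series come from, one option is to count how many series each metric name contributes on a scrape target. The following is a minimal sketch, assuming you have already saved the body of a `/metrics` endpoint in the Prometheus text exposition format; the function name `count_series_per_metric` and the `SAMPLE` data are made up for illustration, and the parsing is deliberately simplified.

```python
from collections import Counter


def count_series_per_metric(exposition_text: str) -> Counter:
    """Count series per metric name in Prometheus text exposition output.

    Simplified on purpose: every sample line counts as one series, so
    histogram suffixes such as _bucket show up as their own names.
    """
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or whitespace (value).
        name = line.split("{", 1)[0].split(None, 1)[0]
        counts[name] += 1
    return counts


# Hypothetical sample data, not taken from a real cluster.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",pod="web-1"} 1027
http_requests_total{code="500",pod="web-1"} 3
http_requests_total{code="200",pod="web-2"} 998
process_cpu_seconds_total 12.5
"""

if __name__ == "__main__":
    # List metric names from most to fewest series.
    for name, n in count_series_per_metric(SAMPLE).most_common():
        print(f"{name}: {n}")
```

Metric names with many label combinations (per pod, per status code, per bucket) tend to dominate such a list, which makes them natural candidates to review first.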

The majority of these metrics aren’t used anywhere: not in dashboards, not in alerting rules, and not in recording rules. According to a Grafana study, only 25% of the exposed metrics are actually used.

“It’s possible to just reduce a lot on your monitoring system by removing these unused metrics,” Aliaksandr said.
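As a rough illustration of that idea, the sketch below compares the metric names a target exposes with the names referenced by dashboards and rules, then builds a Prometheus `metric_relabel_configs` entry that drops the rest. It assumes you have already collected both sets of names by some other means; the helper names and the `EXPOSED` and `USED` sets are hypothetical and are not a recommendation of what to keep or drop.

```python
import re

# Valid Prometheus metric names need no regex escaping.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def unused_metrics(exposed: set, used: set) -> list:
    """Return exposed metric names that nothing references, sorted."""
    return sorted(exposed - used)


def build_drop_rule(names: list):
    """Build one metric_relabel_configs entry dropping the given names.

    Returns None when there is nothing to drop.
    """
    if not names:
        return None
    for name in names:
        if not METRIC_NAME.fullmatch(name):
            raise ValueError(f"not a valid metric name: {name!r}")
    # Prometheus anchors relabel regexes, so each alternative
    # matches a whole metric name rather than a substring.
    return {
        "source_labels": ["__name__"],
        "regex": "|".join(names),
        "action": "drop",
    }


# Hypothetical inputs for illustration only.
EXPOSED = {
    "container_cpu_usage_seconds_total",
    "container_memory_working_set_bytes",
    "apiserver_request_duration_seconds_bucket",
    "kubelet_runtime_operations_total",
}
USED = {
    "container_cpu_usage_seconds_total",
    "container_memory_working_set_bytes",
}

if __name__ == "__main__":
    print(build_drop_rule(unused_metrics(EXPOSED, USED)))
```

The resulting dictionary mirrors the fields of a relabel rule and would still need to be serialized to YAML and placed under the relevant scrape job. An allowlist using the `keep` action is the stricter alternative, at the cost of silently losing any metric that was not anticipated.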

According to Aliaksandr, the best way to cut the collection of unused Kubernetes metrics is for Kubernetes developers to decide which metrics really need to be exposed, compose a set of essential metrics, and create a standard for them.

“These standards should describe where these metrics should be collected, and these metrics should be exposed by Kubernetes components, which are included in every Kubernetes installation,” he said. “Third party monitoring solutions should not install additional components for monitoring Kubernetes itself. Right now you need to install additional components…these components aren’t included in Kubernetes itself.”

This is a huge challenge. I wouldn’t expect most users to really know what they need, and most just go with what gets collected by default. The number of metrics exposed by Kubernetes keeps growing, and it is unrealistic to expect end users to keep up with that growth. This is where we, as a community and as its leaders, can help by providing guidance on what is actually used and what is not.

For example, at Logz.io we curated a list that is open sourced as part of the Helm charts we provide. There are lists for both out-of-the-box Kubernetes and the managed versions, such as AKS and GKE, recommending at least what we find useful among our users, whether you use kube-system, kube-dns, or others.

The question is: can we provide overarching best practices for the entire community and industry?

“I think that system kind of metrics such as CPU, user memory, network session and so on can be the list of such metrics that isn’t too big,” Aliaksandr said.

Want to learn more? Check out the latest OpenObservability Talks episode: Is Kubernetes Monitoring Flawed?