The Cardinality Challenge in Monitoring

By: Daniel Berman

What Are Metrics and Cardinality Anyway?

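Monitoring is an essential aspect of any IT system. System metrics such as CPU, RAM, disk usage, and network throughput are the basic building blocks of a monitoring setup. Nowadays, they are often supplemented by higher-level metrics that measure the performance of the application (or microservice) itself as seen by its users (human beings on the internet or other microservices in the same or different clusters). Each metric is usually tagged with dimensions, such as an instance id or an endpoint name. The cardinality of a metric is the number of distinct combinations of dimension values it is recorded with, and each combination forms its own time series.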

The Specific Challenges of Time-Series Databases

Metrics are usually stored in time-series databases (TSDBs). Such databases face a specific challenge: they must ingest a large number of data points (some systems can generate a million data points per second) and still retrieve them efficiently. Some TSDBs are optimized for storing data and some are optimized for retrieving it, and it is difficult to find the right balance between the two. Indexes are required at retrieval time to search and group by dimension, but they are costly to maintain at write time. It is fair to say that all monitoring systems have limitations, whether they are open source like Prometheus or proprietary like SignalFx or New Relic.
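To make the write-versus-read trade-off concrete, here is a minimal sketch of a toy in-memory index. It is not how any particular TSDB is implemented. The class name TinySeriesIndex, its methods, and the reserved "__name__" key are assumptions made for this illustration only. Every new combination of dimension values pays an indexing cost on write, and that index is what makes filtering by dimension cheap on read.

```python
from collections import defaultdict


class TinySeriesIndex:
    """Toy index (hypothetical): maps (dimension, value) pairs to series ids."""

    def __init__(self):
        self.series_ids = {}               # series key -> integer id
        self.postings = defaultdict(set)   # (dimension, value) -> {series ids}
        self.points = defaultdict(list)    # series id -> [(timestamp, value)]

    def write(self, metric, dimensions, timestamp, value):
        key = (metric, tuple(sorted(dimensions.items())))
        series_id = self.series_ids.get(key)
        if series_id is None:
            # First point of a new series: the index maintenance cost is paid here.
            series_id = len(self.series_ids)
            self.series_ids[key] = series_id
            self.postings[("__name__", metric)].add(series_id)
            for dim, val in dimensions.items():
                self.postings[(dim, val)].add(series_id)
        # Appending to an existing series does not touch the index.
        self.points[series_id].append((timestamp, value))

    def select(self, metric, **filters):
        """Return the ids of series matching the metric name and every filter."""
        matches = set(self.postings.get(("__name__", metric), set()))
        for dim, val in filters.items():
            matches &= self.postings.get((dim, val), set())
        return matches


index = TinySeriesIndex()
index.write("cpu_utilization", {"instance_id": "i-1", "region": "eu"}, 0, 0.42)
index.write("cpu_utilization", {"instance_id": "i-2", "region": "eu"}, 0, 0.40)
index.write("cpu_utilization", {"instance_id": "i-1", "region": "eu"}, 10, 0.50)

print(len(index.series_ids))                                       # 2 series
print(sorted(index.select("cpu_utilization", instance_id="i-2")))  # [1]
```

The number of entries in series_ids is the cardinality of the stored data. The index grows with that number, not with the number of data points.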

Current Monitoring Best Practices and Their Impacts on Cardinality

Collecting system metrics (CPU, disk usage and performance, RAM, network throughput, etc.) is still necessary, for reasons that include triggering auto-scaling events. In addition to system metrics, higher-level, integrated metrics are becoming commonplace. They are used to monitor the performance of the application (or service) as seen by its user (whether human or machine). A typical example is how long a request takes to be served, from the time it is received to the time it is answered. Such metrics offer far more choice about which dimensions to attach to them, which compounds the cardinality challenge.

In a microservices architecture, such as a Kubernetes cluster running a number of highly available services, a microservice might not be directly reachable from the outside world. However, it still has clients (probably other microservices), so measuring its performance is very valuable. It is important to be able to tie a microservice-to-microservice transaction to the higher-level transaction it belongs to (e.g., a human interacting with the website). Achieving this might require adding yet another dimension, such as a request id, whose value is different for every request.

The Third Layer of Monitoring of a Docker-based Workload

Interestingly, when using a Docker-based workload and an orchestration tool such as Kubernetes, there is a third layer that requires monitoring—one that is sandwiched between the system level of CPU and RAM and the high-level perceived performance. That middle layer is the cluster itself, which includes monitoring the health of the containers and of the volumes they are using.

The three layers of monitoring, then, are:

First layer: infrastructure (CPU, RAM, etc. of underlying instances, network throughput, etc.)
Second layer: containerized workload (this is mainly about the health of the containers and their volumes)
Third layer: application or microservice-perceived performance

Clearly, opting for a Docker-based workload increases the number of metrics you have to keep track of. If you are interested in Kubernetes monitoring, this page will be helpful.

The Case of Immutable Infrastructures

Immutable infrastructures require careful thought. They are characterized by the method used to make changes to them: once a resource is created, it is never updated. Instead, it is destroyed and recreated. This poses a challenge for the monitoring system. Take an instance as a typical example. When it is destroyed and a new one is created to replace it, the time series of the old instance stop receiving data points. New time series are created for the new instance because one of their dimensions, the instance id, has changed. Because new time series are created every time a resource is replaced, the monitoring system must regularly sweep stale data; otherwise, its cardinality will grow with every deployment. If this housekeeping is neglected, the monitoring system will become slower and slower as time series accumulate, even though most of them became obsolete as soon as the resources they describe were replaced.
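The housekeeping described above can be sketched as a registry that remembers when each time series last received a data point and drops those that have been silent for too long. This is only an illustration under assumed names: SeriesRegistry, record, sweep, and the one-hour TTL are hypothetical, and a real system would more likely archive stale series than simply forget them.

```python
import time


class SeriesRegistry:
    """Tracks when each time series was last seen (hypothetical helper)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.last_seen = {}  # series key -> timestamp of the most recent point

    def record(self, metric, dimensions, now=None):
        now = time.time() if now is None else now
        key = (metric, tuple(sorted(dimensions.items())))
        self.last_seen[key] = now

    def sweep(self, now=None):
        """Remove and return the series that have been silent longer than the TTL."""
        now = time.time() if now is None else now
        stale = [key for key, seen in self.last_seen.items()
                 if now - seen > self.ttl_seconds]
        for key in stale:
            del self.last_seen[key]
        return stale


registry = SeriesRegistry(ttl_seconds=3600)
# The old instance reports once, then is destroyed by a deployment.
registry.record("cpu_utilization", {"instance_id": "i-old"}, now=0)
# Its replacement starts reporting under a new instance id.
registry.record("cpu_utilization", {"instance_id": "i-new"}, now=4000)

removed = registry.sweep(now=4500)
print(len(removed), len(registry.last_seen))  # 1 1
```

Explicit timestamps are passed here so that the example is deterministic; in normal use the now argument would be omitted.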

The Case of High-Level Performance Metrics

The high-level metrics mentioned previously, which are used to measure the perceived performance of the workload, also present unique cardinality challenges. Indeed, such metrics are much more versatile and varied in nature when compared with system metrics. Let’s take the request time example used above. What dimensions should be attached to this metric? The answer is not as obvious as it is with system metrics, where you would want to attach the instance id to the “CPU Utilization” metric or the filesystem mount point to the “disk usage” metric.

How should a request time be indexed? Some of the meaningful dimensions we might want to attach to such a metric include user id, id of the instance that first received the request, id of the product or service being requested, type of request, endpoint name, and microservice name. A balance needs to be struck between recording every dimension that might prove useful and restricting yourself to the ones you already know are relevant. If you're at the beginning of a project and traffic is still quite low, you can probably add as many dimensions as you want. You will be able to trim them later on, when you have more insight into your workload and into which dimensions matter to you.
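A rough way to reason about the cost of each dimension is to multiply the number of distinct values of every dimension attached to the metric, which gives an upper bound on the number of time series. The figures below are made up for illustration, and max_series is a hypothetical helper. Real workloads rarely produce every combination, so the actual count is usually lower than this bound.

```python
from math import prod


def max_series(distinct_values):
    """Upper bound on time series: product of distinct values per dimension."""
    return prod(distinct_values.values())


# Assumed, illustrative counts for a request-time metric.
request_time_dimensions = {
    "endpoint": 20,
    "request_type": 4,
    "microservice": 10,
    "instance_id": 50,
}
print(max_series(request_time_dimensions))  # 40000

# Adding a per-user dimension multiplies the bound by the number of users.
with_user_id = {**request_time_dimensions, "user_id": 100_000}
print(max_series(with_user_id))  # 4000000000
```

The jump from forty thousand to four billion potential series shows why dimensions with many distinct values, such as a user id, deserve the most scrutiny.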

What To Do About The Cardinality Challenge?

How do you manage and mitigate a high cardinality in your monitoring system?

Here are a few steps you can take:

Manage stale data. For example, if you perform immutable deployments every couple of days, you will end up with a lot of stale data. Devise life cycle policies for monitoring data, such as archiving and moving the data to cheaper long-term storage.
Make sure you choose the right solution based on the complexity of your requirements. For example, Prometheus or CloudWatch would work well for a small/medium workload with low cardinality. For a high workload and/or high cardinality, SignalFx or New Relic would be good choices to consider. A very high workload and very high cardinality situation may require custom or more specialized solutions.
Think twice about using a containerized solution. Going for a Docker-based workload will increase the number of metrics you need to keep track of (and make sense of).
Find the right balance between indiscriminately using dimensions and keeping their usage so minimal that the data no longer makes sense. For example, keeping track of an incoming request time is absolutely useless without more context added to it—like which endpoint was hit.
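To decide which dimensions to trim, it can help to measure how much each one contributes to the number of time series. The sketch below works on a small, made-up sample of dimension sets recorded for a request-time metric. The function names and the sample data are assumptions for illustration and do not correspond to any particular tool.

```python
def distinct_values_per_dimension(samples):
    """Count how many distinct values each dimension takes across the samples."""
    values = {}
    for dimensions in samples:
        for name, value in dimensions.items():
            values.setdefault(name, set()).add(value)
    return {name: len(seen) for name, seen in values.items()}


def series_count(samples, drop=()):
    """Count distinct time series, optionally ignoring some dimensions."""
    return len({
        tuple(sorted((k, v) for k, v in dimensions.items() if k not in drop))
        for dimensions in samples
    })


# Made-up dimension sets attached to five request-time data points.
samples = [
    {"endpoint": "/cart", "user_id": "u1"},
    {"endpoint": "/cart", "user_id": "u2"},
    {"endpoint": "/checkout", "user_id": "u1"},
    {"endpoint": "/checkout", "user_id": "u3"},
    {"endpoint": "/cart", "user_id": "u1"},
]

print(distinct_values_per_dimension(samples))    # {'endpoint': 2, 'user_id': 3}
print(series_count(samples))                     # 4
print(series_count(samples, drop=("user_id",)))  # 2
```

Dropping user_id halves the number of series in this sample while keeping the endpoint, the context without which a request time means little.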

Wrapping Up

High cardinality is not a bad thing in and of itself. It is a consequence of evolving best practices for system architecture and performance evaluations at all levels of an IT system. Issues arise when the monitoring system is not designed to cope with the cardinality of the monitoring data that is generated by such an IT system. To deal with this situation efficiently, you will need to spend some time crafting your monitoring system to match the size and complexity of your workload. You will also need to spend time choosing which dimensions you want to attach to your metrics based on what meaningful information you want to extract from your monitoring data. You will need to work your way backward, starting with making explicit what your monitoring system should provide (i.e., your requirements). From there, you can design your monitoring system and choose your metrics and dimensions accordingly. Choosing a monitoring toolset that employs advanced analysis tools to add context and helps you make sense of your large amount of data can be a big help to your team.

Don't be too hard on yourself, and avoid analysis paralysis! Get something up and running and adapt as you go along.

Happy monitoring!