Phantom Metrics: Why Your Monitoring Dashboard May Be Lying to You

By: Dotan Horovits

January 3, 2023

This is the first blog in our series on Monitoring Essentials. To learn more, check out these posts on Expensive Metrics and Unreadable Metrics.

Whether you’re a DevOps engineer, an SRE, or just a data-driven individual, you’re probably addicted to dashboards and metrics. We look at our metrics to see how our system is doing, whether on the infrastructure, the application or the business level. We trust our metrics to show us the status of our system and where it misbehaves. But do our metrics show us what really happened? You’d be surprised how often that’s not the case.

In this post I will look into the math and mechanics behind metrics, some common misconceptions, what it takes to have accurate metrics, and whether there even is such a thing.

Metrics essentials

Metrics are essentially roll ups of raw events. During this roll up process, the events are translated into numerical data points. A simple example is errors occurring in the system, with a simple metric to count the errors. Metrics can also involve multiple variables, such as a count of requests with response time higher than 1 second. When measured over time, these data points form a time series.

Metrics can be of various types, such as Counters, Gauges and Histograms. Counters are used for the cumulative counting of events, as we saw in the above examples. Gauges typically represent the latest value of a measurement. And then there are more elaborate types such as Histograms that can sample the distribution of metric values, by counting events in configurable “buckets” or “bins”. For example, you may want to understand the memory usage percent segmented by pods across your cluster at given points in time.
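To make the three types concrete, here is a minimal Python sketch that rolls a handful of raw events up into a counter, a gauge and a histogram. The events, the tuple layout and the bucket boundaries are made up for illustration and are not taken from any particular metrics library.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical raw events: (timestamp in seconds, response time in seconds, is_error)
events = [
    (0, 0.20, False),
    (1, 1.40, True),
    (2, 0.35, False),
    (3, 2.10, False),
    (4, 0.90, True),
]

# Counter: cumulative count of events that match a condition.
error_count = sum(1 for _, _, is_error in events if is_error)
slow_request_count = sum(1 for _, response_time, _ in events if response_time > 1.0)

# Gauge: the latest observed value of a measurement.
latest_response_time = events[-1][1]

# Histogram: count events per bucket; each bucket is identified by its upper bound.
# These boundaries are an arbitrary choice for this example.
bucket_upper_bounds = [0.25, 0.5, 1.0, 2.5]


def bucket_for(value):
    """Return the upper bound of the first bucket that can hold the value."""
    index = bisect_left(bucket_upper_bounds, value)
    if index < len(bucket_upper_bounds):
        return bucket_upper_bounds[index]
    return float("inf")  # overflow bucket for values above the last boundary


histogram = Counter(bucket_for(response_time) for _, response_time, _ in events)

print("errors:", error_count)
print("requests slower than 1s:", slow_request_count)
print("latest response time:", latest_response_time)
print("histogram:", dict(sorted(histogram.items())))
```

Note that once the events are discarded, only these few numbers remain, which is the trade-off the rest of this post deals with.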

The mechanics of metrics

In an ideal world, we would ingest and store all the raw events, and then calculate the metrics at query time. This would allow us to slice and dice the events in any way we need, and ask any ad-hoc question we desire.

In the real world, however, keeping all the raw events for extended periods of time can be prohibitively expensive, due to the high volumes of data. To overcome this, events are oftentimes rolled up into metrics in the collection pipeline, while discarding the raw events or retaining them for short periods only. This is oftentimes a matter of a simple configuration in your metrics collector agent.

In addition to reducing cost, aggregation upon collection can improve the performance of real-time analytics: the compact data points can be transmitted and ingested at higher rates and frequency, and heavy aggregations and calculations are avoided at query time.

The math of metrics in a nutshell

This rolling up process involves some math. We might want to calculate the mean or median of the response times, or maybe a percentile, or an aggregation over a time window. We might also want to roll up multiple events into one composite metric. For example, I may want to calculate the 95th percentile (commonly known as P95) of response times across all the pods of a specific service in my cluster.

Even if you don’t like math, you cannot avoid it with metrics. You need to understand the different aggregation functions, and the relation between the question you wish to ask and the metric and aggregate you need in order to answer it. Let’s look at the Average function as an example, as many tend to start there. Averages, by definition, smooth things out, and are less suitable for flushing out anomalous behavior and outliers. When investigating latency problems, for example, looking at average values is of little use, and you’d be better off looking at percentiles.
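Here is a small Python sketch of why averages hide latency problems. The latency values are invented, and the percentile helper uses the nearest-rank definition, which is only one of several common ways to define a percentile.

```python
import math
import statistics

# Hypothetical latencies in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [100] * 95 + [5000] * 5


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print("mean:", mean_ms)
print("P50:", p50_ms)
print("P99:", p99_ms)
```

With this data the mean comes out at 345 ms, a latency that no request actually experienced, while the P50 (100 ms) and P99 (5000 ms) together show that most requests are fast and a small group is extremely slow.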

OpenObservability Talks: All Metrics Are Wrong, Some Are Useful

Determine your questions, design your metrics accordingly

In a way, you can think about these metrics as a lossy compression, during which we lose data and context from the raw events. If we don’t keep the raw events, then we need to determine upfront what’s important for us. For example, if I only calculate the average value over the data, I will not be able to ask about the P95 (95th percentile) later over the pre-aggregated data.

You need to determine what questions you want to answer, what’s important for you, and design your metrics and aggregations accordingly. A common mistake is that people skip this design phase, and just use the preset metrics and default values provided out of the box with their metrics collector of choice. While you may think these defaults represent some industry standard, they are oftentimes outdated, and in most cases won’t be in tune with your specific needs.

The measurement problem

Just like in physics, the measurement problem occurs when we measure a (seemingly) continuous property at discrete intervals. The gap between measurements is often called the sampling interval, and it determines the sampling rate. This creates a distorted representation, whereby the metric may not actually reflect the original measured property. For example, if we measure the CPU utilization every 60 seconds, then any CPU spike that starts and ends between two sampling points will be invisible to us. Moreover, in order to draw a continuous line, visualization tools oftentimes average or interpolate across consecutive data points, which gives the misleading appearance of a smooth line.
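The CPU example can be simulated in a few lines of Python. The per-second utilization values below are synthetic, and the collector is assumed to read the instantaneous value at each sampling point.

```python
# Hypothetical per-second CPU utilization (%) over five minutes:
# a steady 20% with a 30-second burst to 95% starting at second 70.
cpu_per_second = [20.0] * 300
for second in range(70, 100):
    cpu_per_second[second] = 95.0

SAMPLING_INTERVAL_S = 60

# The collector only sees the value at each sampling point (seconds 0, 60, 120, ...).
sampled = cpu_per_second[::SAMPLING_INTERVAL_S]

true_peak = max(cpu_per_second)
observed_peak = max(sampled)

print("sampled values:", sampled)
print("true peak:", true_peak)
print("peak visible in the metric:", observed_peak)
```

The burst falls entirely between the samples taken at seconds 60 and 120, so the resulting time series is a flat 20% and the dashboard shows nothing unusual.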

On some occasions the opposite can occur, and you get artifacts in your metrics that aren’t real, like peaks that don’t really exist. This can happen when running aggregations within the storage backend, due to the way in which the calculation is being made.

Mean time to detection

The sampling period also influences how fast a change in the system becomes visible in the metrics. Many detection algorithms need several consecutive data points, typically around five, to detect a trend. If the sampling interval is 60 seconds, simple math says it will take five minutes (that is, 60 sec × 5 data points) before we see that something is wrong. Can you afford to wait 5 minutes to learn that your system crashed? Using shorter sampling intervals (i.e. higher sampling rates) will shorten this period and enable us to detect and react faster. Of course, higher sampling rates incur overhead in CPU and storage, so we need to find the configuration that strikes the right balance for our needs.
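The arithmetic is simple enough to sketch in Python. The figure of five data points comes from the text above; the function name and the alternative intervals are illustrative assumptions.

```python
POINTS_NEEDED = 5  # number of data points assumed necessary to detect a trend


def detection_delay_seconds(sampling_interval_s, points_needed=POINTS_NEEDED):
    """Minimum time until enough data points exist to detect a change."""
    return sampling_interval_s * points_needed


for interval_s in (60, 15, 10):
    delay_s = detection_delay_seconds(interval_s)
    print(f"sampling every {interval_s}s -> at least {delay_s}s to detect")
```

Moving from a 60-second to a 10-second interval cuts the minimum detection delay from 300 seconds to 50, at the price of six times as many data points to collect and store.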

Varying resolution and downscaling

A common practice is to save metrics in different resolutions in a tiered approach, to reduce cost. For example, you may want to save the metric every 10 seconds for the first day, but then every 5 minutes for the next week, and perhaps every hour for the following month or longer. This practice assumes that we need the finest granularity for the near real time period, in which we may need it if there’s an issue in the system, while investigations over longer periods look at larger scale trends.
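To get a feel for the savings, here is a rough Python estimate of the number of data points stored per time series under the tiers from the example. The Tier class is a hypothetical helper, and the last tier is assumed to last exactly 30 days.

```python
from dataclasses import dataclass

DAY_S = 24 * 60 * 60


@dataclass(frozen=True)
class Tier:
    name: str
    resolution_s: int  # one data point is kept every resolution_s seconds
    retention_s: int   # how long this tier covers

    @property
    def points_per_series(self) -> int:
        return self.retention_s // self.resolution_s


tiers = [
    Tier("first day", 10, 1 * DAY_S),
    Tier("next week", 5 * 60, 7 * DAY_S),
    Tier("following month", 60 * 60, 30 * DAY_S),  # assumption: 30 days
]

tiered_points = sum(tier.points_per_series for tier in tiers)

# Baseline: keep the finest (10-second) resolution for the whole period.
total_retention_s = sum(tier.retention_s for tier in tiers)
full_resolution_points = total_retention_s // 10

for tier in tiers:
    print(f"{tier.name}: {tier.points_per_series} points")
print("tiered total:", tiered_points)
print("10s resolution throughout:", full_resolution_points)
```

Under these assumptions the tiered scheme stores 11,376 points per series instead of 328,320, roughly a 29-fold reduction.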

The different granularities can be achieved by downscaling the metrics (also known as downsampling), namely calculating the less granular metric from the more granular one. While this sounds perfectly reasonable, math can interfere here, as some aggregates cannot be aggregated further. For example, percentiles are not additive: they cannot be summed or averaged into a correct percentile of the combined data. So, following the above example, if you have a P99 sampled at 10-second resolution, you can’t roll it up to a 5-minute resolution. It’s important to be cognizant of the compatibility of the aggregation functions, and when using non-compatible functions such as percentiles, to make design decisions about which resolutions we require, and calculate these time series upfront.
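A short Python sketch shows what goes wrong when percentiles are rolled up. The two windows and their latencies are invented, and the percentile helper uses the nearest-rank definition, one of several in common use.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies (ms) from two consecutive 10-second windows:
# a busy window with 1000 fast requests and a quiet one with 10 slow requests.
busy_window = [100] * 1000
quiet_window = [2000] * 10

p99_busy = percentile(busy_window, 99)
p99_quiet = percentile(quiet_window, 99)

# Naive roll-up: average the pre-computed percentiles.
naive_rollup = (p99_busy + p99_quiet) / 2

# Ground truth: the percentile over the raw values of both windows.
true_p99 = percentile(busy_window + quiet_window, 99)

print("P99 per window:", p99_busy, p99_quiet)
print("average of the P99s:", naive_rollup)
print("true P99 over both windows:", true_p99)
```

The averaged value (1050 ms) is far from the true P99 (100 ms), because the pre-computed percentiles no longer carry the information that the quiet window held only 1% of the requests.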

The varying resolution is not limited only to the time factor. Another example is saving per-pod data, and then wishing to “group by” nodes or clusters. The same constraint applies here, meaning that if we expect to be interested in slicing and dicing a percentile based metric per node, per region, per namespace, or across the entire cluster, we need to pre-aggregate accordingly.

Another approach is to give up some accuracy of measurement to gain compatibility in computation, by using histograms. You can take the histograms of a few servers and sum them up, or the histograms of several time windows and sum them up, and then downscale. The problem is that in this case percentiles will be estimates rather than exact values. It’s also important to note that histograms consume more storage and throughput, as every sample is not a single number but several (one per bucket).
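The following Python sketch illustrates both halves of this trade-off: histograms from two servers can simply be summed, but the percentile read from the result is only as precise as the bucket boundaries. The boundaries, server data and helper names are all assumptions for this example, and the estimate returns the bucket's upper bound, whereas real systems often interpolate within the bucket.

```python
import math
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in milliseconds.
BUCKET_UPPER_BOUNDS_MS = [100, 250, 500, 1000, 2500]


def to_histogram(values):
    """Count values per bucket, keyed by the bucket's upper bound."""
    histogram = Counter()
    for value in values:
        index = bisect_left(BUCKET_UPPER_BOUNDS_MS, value)
        if index < len(BUCKET_UPPER_BOUNDS_MS):
            histogram[BUCKET_UPPER_BOUNDS_MS[index]] += 1
        else:
            histogram[math.inf] += 1  # overflow bucket
    return histogram


def estimate_percentile(histogram, p):
    """Upper bound of the bucket that contains the p-th percentile."""
    total = sum(histogram.values())
    rank = max(1, math.ceil(p * total / 100))
    cumulative = 0
    for upper_bound in sorted(histogram):
        cumulative += histogram[upper_bound]
        if cumulative >= rank:
            return upper_bound


def exact_percentile(values, p):
    """Nearest-rank percentile over the raw values, for comparison."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(p * len(ordered) / 100)) - 1]


# Hypothetical latencies (ms) observed on two servers.
server_a = [80] * 900 + [300] * 100
server_b = [90] * 50 + [1800] * 50

# Histograms are additive: summing bucket counts equals bucketing the combined data.
merged = to_histogram(server_a) + to_histogram(server_b)

estimated_p99 = estimate_percentile(merged, 99)
exact_p99 = exact_percentile(server_a + server_b, 99)

print("merged histogram:", dict(sorted(merged.items())))
print("estimated P99:", estimated_p99)
print("exact P99:", exact_p99)
```

Here the estimate lands on the 2500 ms bucket boundary while the exact P99 is 1800 ms, so finer buckets give better estimates but cost more numbers per sample.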

Summary

Metrics are a powerful way to monitor our applications. But they are not necessarily representative of the system’s actual state. It takes an understanding of the math and nature of metrics, as well as careful design, to make sure our metrics are indeed useful for answering the questions we need. Having access to the raw data in addition to the metrics is always good, as this is ultimately the source of truth.
