How Observability Can Get Expensive

How much does monitoring and observability actually cost us?

We all collect logs, metrics, traces, and possibly other telemetry data. Yet this can get expensive quickly, especially in today’s microservices-based systems, and it leads to what is called “the cardinality problem.”

On the latest episode of OpenObservability Talks, I was thrilled to host Ben Sigelman, co-founder and GM of Lightstep (now part of ServiceNow), to get his perspective on the cardinality challenge of observability and the best ways to address the high costs organizations often see.

Ben is very vocal on this topic and his experience includes co-creating the OpenTracing and OpenTelemetry open source projects. He was also one of the people who architected Google’s own metrics and distributed tracing systems.

The Factors Driving Observability Costs

No matter the approach you take, the cost of observability ends up being dominated by the collection and long-term storage of the raw telemetry data itself, Ben said. The largest data volumes come from metrics, logs, and traces.

There are two main factors that contribute to the high cost of observability. The first, quite obviously, is the number of business transactions being performed. As your business volume and revenue grow, it is expected that data volumes and their associated cost will grow too. However, there is a second factor at play: the number of microservices. In today’s systems, which run hundreds or thousands of microservices in production, this additional multiplier quickly gets monitoring and observability costs out of hand. As Ben said:

“So as you go, and both increase your top line in terms of application use, and then also increase the number of just engineers typing code, and number of services that are involved, you are going to see an explosion in the size of the data itself.”

Another issue stems from organizations using monitoring tools to do observability. Ben describes monitoring tools as those that turn metrics data into charts, alerts, etc. If that’s the only tool you have available to understand why something happened in your environment, you’ll have to do it by filtering and grouping the metrics. That can become extraordinarily expensive, and therein lies the cardinality problem.

“I’ve talked to economic buyers of metrics tools where they’ve had single developers add in one line of code, that’s costing them $200,000 a year, steady state, and that’s totally typical,” he said. “The cardinality problem needs to have a solution, or we’re going to end up in a negative ROI place for observability.”

Developers Can, and Should, Only Be So Aware of Cardinality

I often find that developers are not aware that the tags and dimensions they add to metrics can cut across all systems. This can be a major driver of observability cost. Developers may not realize that these factors multiply and can send a metrics bill out of control.

When I asked Ben about this awareness gap, he said it may be too much to expect developers to grasp the aggregated combinatorial implications of the metrics they add.

“You can have several dimensions that seem independently okay, maybe they have a hundred or a thousand values,” he said. “But if you add all of those attributes at once, it’s the combinatorics of those values can still get you way up into the hundreds of thousands, or millions of distinct time series. And at that point, it just gets really expensive. And it’s a lot to expect a developer to know ahead of time what it’s going to be.”
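To make the arithmetic behind Ben’s point concrete, here is a minimal Python sketch. It assumes the worst case, in which every combination of tag values actually occurs, so the number of distinct time series for a single metric is the product of the number of values per tag. The tag names and counts are hypothetical, chosen only for illustration, and the function name is not taken from any real metrics library.

```python
from math import prod


def worst_case_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric.

    Assumes every combination of tag values occurs, so the bound is
    the product of the number of distinct values per tag.
    """
    return prod(tag_cardinalities.values())


# Hypothetical tags on a single request-latency metric.
tags = {"endpoint": 100, "region": 10, "status_code": 5}
before = worst_case_series(tags)

# One extra line of instrumentation adds a tag with 1,000 values.
tags["customer_id"] = 1000
after = worst_case_series(tags)

print(f"{before:,} -> {after:,} potential time series")
```

Each tag looks harmless on its own, yet the single added tag multiplies the worst-case count by a thousand. Real systems usually see fewer series than this bound, because not every combination of values appears in practice.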

In today’s landscape, we’re in a position where we’re only bound by our own ability to implement solutions to support and abstract these observability use cases. It should be possible for a developer to add as much cardinality as they want, and then it’s up to the observability system to do the right thing. The instrumentation should reflect the business logic, not be constrained.

“It is very important that the instrumentation be sufficient to diagnose things, because you really cannot afford to go back, reinstrument and redeploy during an emergency,” Ben said. “The instrumentation, dynamic or static, needs to be able to get this data out. I think that the problem we have is that it’s one size fits all right now. And for a lot of people, when they add tags or attributes, that turns into a bunch of high cardinality metrics, which is ROI negative. That’s the problem, and I think that’s what we need to fix.”

Want to learn more? Check out the OpenObservability Talks episode: Expensive Observability: The Cardinality Challenge.