How to Tackle Spiraling Observability Costs

By: Charlie Klein

The source of observability costs
The opportunity for cost reduction
Pitfalls of common cost reduction strategies
Practical strategies to reduce costs
Next steps

As today’s businesses increasingly rely on their digital services to drive revenue, the tolerance for software bugs, slow web experiences, crashed apps, and other digital service interruptions is next to zero. Developers and engineers bear the immense burden of quickly resolving production issues before they impact customer experience.

To do this, they rely on massive volumes of telemetry data – including logs, metrics, and traces – to understand what’s happening in their systems and diagnose unexpected behavior.

All of this data requires additional computing power, which drives up costs at a startling rate for those already struggling to stay within budget.

To address this problem, many engineering teams turn to strategies that ultimately sacrifice visibility into their environments (which we’ll revisit in a moment) – jeopardizing their ability to quickly identify, investigate, and resolve production issues.

In this guide, we’ll explore strategies to significantly reduce observability costs while maintaining full observability into your environment.

Why are observability costs so high?

To understand how to reduce observability costs, we first need to understand their root cause.

The short answer is simple: the more telemetry data you collect, process, and store for analysis – a prerequisite for observability – the more cloud infrastructure costs you need to pay. Unfortunately for the cost-conscious crowd, telemetry data volumes are exploding.

Why? Larger applications in production + increasing traffic = growing telemetry data volumes.

Of course, we normally think of application development investment and traffic as good things, so there is no reason to think telemetry data volumes will stop growing anytime soon.

This is the case whether you’re running your own open source stack or paying a proprietary observability vendor. As data volumes grow, you can’t escape the growing compute footprint, and the higher costs that come with it.

In other words, the path to decreasing observability costs is to reduce the computing footprint of telemetry data. While this may sound obvious, this principle is rarely applied in modern observability implementations.

The opportunity for observability cost reduction

Cloud-native applications and infrastructure generate mountains of telemetry data – some data is mission critical, some data is only needed occasionally, while other data is useless. Most of this data is processed and stored in the same way, and therefore costs the same amount.

This represents an enormous opportunity to reduce observability costs by minimizing the computing footprint of the occasionally-needed and useless data.

Observability is expensive because of the growing computing footprint of telemetry data, so the only real path to cost reduction is by minimizing the computing power needed to process and store this data.

When most people think of high observability costs, they correctly think of spiraling bills from vendors like Datadog, New Relic, and Splunk. While they justify these growing costs with huge feature sets and growing data volumes, huge portions of that data are junk!

And after analyzing hundreds of telemetry data sets from different engineering organizations, Logz.io has found that about a third of the data is never actually used!

Let’s explore how some engineering teams have attempted to reduce their observability costs, and then dive into some alternative strategies that focus on optimizing large data volumes.

The perils of common observability cost reduction strategies

The two most common ways of reducing observability costs can also reduce visibility into cloud infrastructure and applications. Let’s see how.

Strategy #1: Turn the lights off

The first strategy involves intentionally excluding some services from observability. This limits the amount of telemetry data that is collected, processed, and stored, which lowers the required computing power and, with it, costs.

This strategy forces tough decisions around which services require observability, and which can continue running without visibility into their health and performance.

Strategy #2: Switch from vendors to open source

The second cost reduction strategy is to move to open source. While open source users still need to pay for the infrastructure to store the data, that usually costs far less than paying a proprietary vendor to do it for you.

There are many fantastic open source tools to collect and monitor telemetry data – such as OpenTelemetry, Prometheus, and OpenSearch. For many use cases, open source tools are undoubtedly the best route.

However, for those needing to correlate insights from different telemetry data in a unified interface, open source falls short. Open source tools offer distinct point solutions for log, metric, and trace analytics – preventing fast troubleshooting while creating tool sprawl.

Is there an alternative path to cost reduction?

To summarize the strategies above, many engineering teams believe they need to make a choice between full observability and cost reduction – they believe they can’t have both.

Throughout the rest of this guide, we’ll explore alternative paths to reducing costs without sacrificing visibility, all of which rely on telemetry data optimization.

How to reduce observability costs without jeopardizing visibility

By leveraging practical data optimization techniques, any engineering team can dramatically reduce the overall computing footprint of their telemetry data – and therefore significantly reduce the cost of observability.

Importantly, none of these methods sacrifice the critical insights and visibility engineers need to quickly identify and resolve production issues.

Simplify the process to remove useless data

It seems too obvious, but more often than not, observability practitioners collect huge volumes of data that are never actually used. At Logz.io, we’ve found that about a third of overall data volumes don’t actually provide any useful insights.

The reason for this is simple: identifying and removing useless data is usually unintuitive. Normally, engineering teams need to manually comb through their data to separate the critical information from the noise, and then reconfigure the relevant data collection components (whether it’s Fluentd, Prometheus, Splunk’s data shipper, etc.) to filter out the data.

Like anything in the world of cloud and DevOps, if a process is unintuitive, it often never gets done.

To simplify this process, observability practitioners need a central location to inventory their incoming data and filter out the junk. As an example, Logz.io’s Data Optimization Hub provides a single place to catalog all the incoming data alongside data filters to remove the unneeded information.

From the left to right columns, Data Optimization Hub recognizes the type of log, the log structure, and the amount of data that follows each structure. In the right column, users have the option to filter out what they don’t need.
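To make the idea of a log inventory concrete, here is a minimal sketch in Python. It is not how Data Optimization Hub works internally; it only illustrates grouping log lines by structure, measuring the volume behind each structure, and dropping the ones you decide you don’t need. The function names (log_pattern, inventory, drop_patterns) and the pattern rules are assumptions made for this example.

```python
import re
from collections import Counter


def log_pattern(message: str) -> str:
    """Collapse variable parts of a log line so similar lines share one pattern.

    Hypothetical rule set: long hex strings become <id>, digit runs become <num>.
    """
    message = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message)
    message = re.sub(r"\d+", "<num>", message)
    return message


def inventory(lines):
    """Return (pattern, bytes) pairs, largest volume first."""
    volume = Counter()
    for line in lines:
        volume[log_pattern(line)] += len(line.encode("utf-8"))
    return volume.most_common()


def drop_patterns(lines, unwanted):
    """Keep only the lines whose pattern is not in the unwanted set."""
    return [line for line in lines if log_pattern(line) not in unwanted]


# Illustrative sample data, not taken from a real system.
sample = [
    "GET /health 200 3ms",
    "GET /health 200 5ms",
    "user 42 login failed",
    "GET /health 200 4ms",
]

for pattern, size in inventory(sample):
    print(f"{size:>6} bytes  {pattern}")

kept = drop_patterns(sample, {"GET /health <num> <num>ms"})
```

In practice, the filtering step would live in the data collection component rather than in a script, but the inventory step is what tells you which filters are worth writing.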

We can see the same solution for metric data, along with recommendations to filter out data that isn’t being monitored by alerts or dashboards.
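The recommendation logic for metrics can be approximated with a simple set difference: any ingested metric whose name never appears in a dashboard or alert query is a candidate for filtering. The sketch below assumes PromQL-style query strings and a hypothetical unused_metrics helper; it is an illustration of the idea, not Logz.io’s implementation.

```python
import re

# Matches identifiers that could be metric names in a PromQL-style query.
IDENTIFIER = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def unused_metrics(ingested, dashboard_queries, alert_queries):
    """Return ingested metric names that no dashboard or alert query references."""
    referenced = set()
    for query in list(dashboard_queries) + list(alert_queries):
        # Function names such as "rate" are also matched here; they are
        # harmless because only names present in `ingested` are kept below.
        referenced.update(IDENTIFIER.findall(query))
    return sorted(set(ingested) - referenced)


# Illustrative inputs, not taken from a real environment.
ingested = {"http_requests_total", "node_cpu_seconds_total", "go_gc_duration_seconds"}
dashboards = ["sum(rate(http_requests_total[5m]))"]
alerts = ["rate(node_cpu_seconds_total{mode='idle'}[1m]) < 0.1"]

candidates = unused_metrics(ingested, dashboards, alerts)
print(candidates)
```

Treat the result as a list of candidates to review rather than a list to delete automatically, since some metrics are only queried ad hoc during incidents.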

By making it easier to identify and remove junk data, our customers remove about a third of their data volumes, which drastically lowers the computing footprint and associated costs without jeopardizing observability into the system.

Data transformation can reduce observability computing footprints

As discussed in earlier sections, the path to reducing observability costs is to reduce the computing footprint of the data. One way to do that is through data transformation.

While many logs need to be indexed and queried, others just need to be monitored.

HTTP logs, for example, usually don’t contain the same debugging information that application logs may contain, so they don’t need to be searched and queried in the same way. Rather, HTTP logs are most valuable when they’re visualized, so engineers can monitor for spikes and dips that could indicate production issues or other interesting information.

When logs are used for monitoring use cases, they can be transformed into lightweight metrics to drastically cut required computing power and costs.
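As a rough illustration of this kind of transformation, the sketch below rolls HTTP log records up into per-minute request counts by status class. The record fields (timestamp, status) and the logs_to_metrics function are assumptions for this example; a real pipeline would do this at ingestion time rather than in a script.

```python
from collections import Counter
from datetime import datetime


def logs_to_metrics(records):
    """Aggregate HTTP log records into counts keyed by (minute, status class)."""
    counts = Counter()
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        # Truncate to the minute so all requests in that minute share one key.
        minute = ts.replace(second=0, microsecond=0).isoformat()
        status_class = f"{rec['status'] // 100}xx"
        counts[(minute, status_class)] += 1
    return dict(counts)


# Illustrative records, not real traffic.
records = [
    {"timestamp": "2024-01-01T12:00:05", "status": 200, "path": "/cart"},
    {"timestamp": "2024-01-01T12:00:40", "status": 204, "path": "/cart"},
    {"timestamp": "2024-01-01T12:01:10", "status": 500, "path": "/checkout"},
]

metrics = logs_to_metrics(records)
print(metrics)
```

Three log records collapse into two data points here. At production volumes the reduction is far larger, because the number of data points depends on the number of time buckets and label values rather than on the number of requests.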

AWS ELB logs are another example of data that is most valuable when being monitored on dashboards. In the example below, we can see how Logz.io LogMetrics converted ELB logs into metrics for data visualization – the data provides the same insights, without having to be indexed as costly log data.

Simplify data storage optimization

Any observability practitioner knows the value and use cases of telemetry data vary greatly. If not all data is equally valuable, why should it all cost the same?

Many observability implementations follow a hot-warm-cold architecture to solve this problem, which requires a classic tradeoff between search performance and cost. Hot storage is the most expensive and delivers fast query results, while cold storage is the cheapest and returns slow query results.

This tradeoff is what prevents widespread adoption of cold storage: who wants to wait 10 minutes for query results?

In order to significantly reduce the computing footprint and total costs of telemetry data storage, cold data needs to be more accessible.
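The hot-warm-cold layout described above is often driven by data age. The sketch below shows one possible age-based policy; the thresholds (3 days, 30 days) and the storage_tier function are assumptions chosen for illustration and should be tuned to how far back your team actually searches.

```python
from datetime import timedelta

# Hypothetical thresholds: (maximum age, tier), checked in order.
TIERS = [
    (timedelta(days=3), "hot"),
    (timedelta(days=30), "warm"),
]


def storage_tier(age: timedelta) -> str:
    """Pick a storage tier for data of the given age; anything older is cold."""
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return "cold"


for age in (timedelta(hours=6), timedelta(days=10), timedelta(days=90)):
    print(age, "->", storage_tier(age))
```

The faster cold data can be searched, the lower these thresholds can be set, which is where the cost savings come from.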

Logz.io Cold Search (coming soon!) eliminates the tradeoff between search performance and cost reduction by providing near-real-time search on cold data. Of course, this will encourage engineers to keep more data in cold storage, which ultimately reduces the computing footprint and cost of data storage – especially over the long term.

In the screenshot below, you can see data queried directly from cold storage appear in the query results.

Bonus strategy: Focus on the observability essentials

The previous sections highlight simple ways to significantly reduce observability costs through data optimization. They can apply to anyone for nearly any observability use case.

This strategy is different because it relies on Logz.io’s general approach toward observability, rather than optimizing the computing power and costs needed to process telemetry data.

Our view is that the market is bloated with features and capabilities that vendors use to justify enormous bills – even though many of these features aren’t needed for most observability use cases.

Alternatively, Logz.io focuses on making the essential observability use cases – like log querying and visualization, infrastructure metrics monitoring, distributed tracing, service performance monitoring, data correlation, and anomaly detection – as simple and cost-effective as possible.

By focusing on optimizing data for these mission-critical use cases, we’re able to reduce costs by roughly 50% compared to the traditional vendors, without sacrificing visibility into complex cloud-native environments.

Next steps for data optimization

To reduce your telemetry data computing footprint and costs, start by getting to know your data. After understanding the different use cases for your data, you can easily determine what requires instant access, what can be transformed, what can live in cold storage, and what can be filtered out entirely.
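One way to start is to write the decision down as an explicit rule per data stream. The sketch below is a hypothetical example: the Stream fields and the order of the checks are assumptions, and your own criteria may differ.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    """A hypothetical description of one telemetry data stream."""
    name: str
    searched_by_engineers: bool    # queried during troubleshooting
    used_for_monitoring: bool      # feeds dashboards or alerts
    needed_for_retention: bool     # kept for audits or rare investigations


def plan(stream: Stream) -> str:
    """Map a stream to one of the four treatments discussed in this guide."""
    if stream.searched_by_engineers:
        return "index for instant access"
    if stream.used_for_monitoring:
        return "transform to metrics"
    if stream.needed_for_retention:
        return "cold storage"
    return "filter out"


# Illustrative streams, not a recommendation for any specific system.
streams = [
    Stream("application-errors", True, True, True),
    Stream("elb-access-logs", False, True, False),
    Stream("audit-trail", False, False, True),
    Stream("debug-heartbeats", False, False, False),
]

for s in streams:
    print(f"{s.name}: {plan(s)}")
```

Revisiting this mapping on a regular schedule keeps it aligned with how the data is actually used.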

Your telemetry data volumes and types will change as your team builds more applications and attracts new users, so optimizing your data to reduce costs is a continuous process.

To apply some of these strategies to your observability implementation, try a Logz.io free trial or get in contact with one of our observability specialists.