The Challenge of Cost-Efficient Observability in the Cloud

To gain observability into their systems, teams collect and analyze telemetry data generated by their production environments. In modern Azure environments, the scale and volatility of this telemetry data can increase sharply without warning – which can translate to high costs without the right preparation.

Distributed cloud environments consist of many interconnected and interdependent components such as Azure VMs, databases, AKS clusters, and containers, which generate varying types of telemetry data – consisting of logs, metrics, and traces. Production teams collect and analyze this data to understand what’s happening in their systems, and why it’s happening.

As businesses grow, so does the amount of telemetry data produced by their Azure environments. Without careful planning and preparation, large, unpredictable volumes of data can strain both budgets and the stability of observability systems.

To help prepare for these potential costs without compromising the observability of your system, this guide will explore:

  • The strains that scaling telemetry data can place on an observability system, the costs incurred from those strains, and how to plan around them.
  • What to keep in mind when designing an observability stack to avoid high costs down the road.
  • Practical strategies for identifying and filtering out unneeded telemetry data.

Part One

Gauging the Total Cost of Ownership of your Observability Stack

Weighing the Total Cost of Ownership (TCO) of your observability stack can help you understand the costs your team will incur beyond the upfront price of a solution. TCO accounts not only for the price tag (or lack thereof), but also for:

  • The engineering resources needed to get a solution up and running and to onboard new engineers.
  • The engineering resources needed to keep a solution up and running at scale.
  • How easily a solution integrates into an existing DevOps environment.

When considering TCO, it quickly becomes clear that the cost of observability is far more complicated than the upfront price tag – which is exactly why it’s difficult to measure. While determining a precise TCO for an observability stack may be impossible, the factors that could impact cost are worth careful consideration.

To help conceptualize how to apply these factors to observability stacks, let’s walk through a deliberately oversimplified TCO comparison between the open source ELK Stack (the most popular logging solution) and a proprietary vendor.

A Comparison: Open Source vs Proprietary

In the table below, our example cloud environment produces 100 GB of log data per day, and we need to retain the log data for 7 days. We have a team of two DevOps engineers, and each fully burdened DevOps hour costs $100. For the sake of simplicity, we’ll compare proprietary vs open source, though many real-world setups combine the two.

The purpose of the comparison below is not to show that proprietary solutions have a higher TCO than open source. Rather, it’s to show how many factors can impact TCO.

The table compares the upfront price tag, infrastructure costs, and the time needed to set up and maintain a logging pipeline that can handle 100 GB/day. Additional factors could include security and analytics features – such as Role-Based Access Control, alerting mechanisms, anomaly detection, or compliance requirements.

Clearly, there is more to cost than the price tag. While it’s difficult to foresee all of the overhead needed to run an observability system, understanding and planning for costs beyond the price tag can help inform a more cost-efficient approach to designing an observability stack.

TCO comparison table:

 

How Scale impacts TCO

As an example, let’s compare small and large scale deployments of the most common open source observability stacks: the ELK Stack for logs and a Prometheus / Grafana combination for metrics.

At smaller scales, both are easy to stand up. For the ELK Stack, Beats forwards log data to Logstash, which transforms it so it can be indexed appropriately in Elasticsearch and ultimately analyzed in Kibana. For Prometheus and Grafana, Prometheus scrapes metrics from the data sources and stores them so they can be queried by Grafana for analysis.
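
To illustrate how little configuration the small-scale Prometheus setup needs, here is a minimal, illustrative prometheus.yml – the job name and target address are made up for this sketch – that scrapes a single node_exporter on an Azure VM; Grafana then only needs Prometheus added as a data source:

global:
  scrape_interval: 15s               # how often Prometheus pulls metrics

scrape_configs:
  - job_name: 'azure-vm-node'        # hypothetical job name
    static_configs:
      - targets: ['10.0.0.4:9100']   # hypothetical node_exporter endpoint on an Azure VM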


Small-scale Prometheus deployment

At small scales, these deployments are easy to stand up, easy to manage, and provide powerful monitoring capabilities. At larger scales, however, more components are required to handle the data.

Large-scale ELK deployment

Large-scale Prometheus deployment

There are far more components to manage when environments generate high volumes of logs and metrics. Each additional component is another element to deploy, tune, and monitor – and that overhead adds up over time.

 

Common Tasks for Scaling Open Source Observability Stacks

 

Performance tuning

There are many ways to tweak Elasticsearch to improve performance. For example, you will want to configure the allocations for the different memory types used by Elasticsearch, such as the JVM heap and OS swap. Additionally, the size and number of indices handled by Elasticsearch affect performance, so you will need to shard your indices and make sure you remove or freeze old and unused indices.
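
As a rough illustration of the memory tuning above, here is a hedged Docker Compose sketch – the image version and heap size are assumptions, not recommendations – that pins the JVM heap and locks memory so the heap cannot be swapped out:

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0   # assumed version
    environment:
      - discovery.type=single-node          # single node, for this sketch only
      - ES_JAVA_OPTS=-Xms4g -Xmx4g          # fixed JVM heap: min and max set equal
      - bootstrap.memory_lock=true          # keep the heap out of OS swap
    ulimits:
      memlock:
        soft: -1
        hard: -1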

Handling upgrades

Handling an ELK Stack upgrade is one of the biggest issues you must consider when deciding whether to deploy ELK on your own. In fact, upgrading a large ELK deployment in production is so daunting a task that you will find plenty of companies that are still using extremely old versions.

Horizontal Scaling Limitations

Prometheus is a single-node solution – once a server maxes out, the only way to ingest more data is to spin up another Prometheus instance on another server, which becomes tedious and sometimes impractical as Azure environments scale. It also means configuring Alertmanager and service discovery for every server.

Data retention

Many teams prefer to store metrics for at least a year to conduct long-term trend analysis of their environments. However, storing every data point is unrealistic. Most teams run Prometheus servers that ‘roll up’ metrics from other servers at a fixed interval, condensing many samples into a single data point. It’s common to have roll-up servers for roll-up servers.
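
One common way to implement these roll-ups is Prometheus federation, where an upstream server periodically scrapes an aggregated subset of series from a downstream server’s /federate endpoint. Below is a hedged sketch of the upstream server’s scrape configuration – the match selector and target address are hypothetical:

scrape_configs:
  - job_name: 'federate'              # the upstream 'roll-up' server's scrape job
    scrape_interval: 1m               # condense to one data point per minute
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'              # hypothetical selector: only pull these series
    static_configs:
      - targets: ['prometheus-downstream:9090']   # hypothetical downstream Prometheus server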

Storing log data is a different operation – many teams archive their logs in Azure Blob Storage. You’ll need to be careful about what you decide to store, because log data is much heavier than metrics and can take up a lot of space.

Guiding Questions to help gauge TCO

There is a lot to think about when gauging TCO. Realistically, there will always be tasks that impact TCO that you could not have foreseen. But that doesn’t make planning unimportant. Here are some questions to ask yourself when considering the TCO of your observability stack, beyond the upfront price tag:

  • How long will it take to get my telemetry data pipeline up and running?
  • How long will it take to set up the required dashboards, alerts, and other analytics I need to analyze my data?
  • What are the required engineering resources needed to scale and maintain my data pipeline?
  • What are the infrastructure requirements for running my observability system?
  • How will my observability system integrate with my existing technologies?

Part Two

Identifying and Filtering out Unneeded Telemetry Data

After designing an observability stack, it’s time to ship some combination of logs, metrics and traces – depending on which signals you’re hoping to monitor. Each data type will expose different information about what is happening in a system. However, not all of that information will be helpful.

Part Two will examine which telemetry data can help you monitor critical signals in Azure, and what can be filtered out to improve cost efficiency.

Guidelines for deciding which data to filter out

The first step to removing unneeded data is identifying what you don’t need. Sending data that doesn’t get analyzed wastes time and money by:

  • Impacting the performance and stability of your system by clogging up databases.
  • Requiring users to search through more data when exploring logs, metrics, or traces.
  • Racking up unnecessary infrastructure costs if you’re running self-hosted open source solutions.
  • Triggering steep overage fees from proprietary vendors.

To determine which data can be filtered out, ask two key questions: is the service generating this data something I need to monitor? And does this data help me monitor a critical signal?

If the answer to either question is no, the data should qualify for removal. If the answer to both is yes, the data should probably be tracked on a dashboard and monitored by an alerting mechanism.

The grey area in those criteria is how ‘critical signals’ are defined. Critical signals vary across teams – it all depends on why you need observability into your system in the first place. Examples of critical signals could be:

  • Application latency
  • Infrastructure CPU and memory usage
  • Usage of a new product feature
  • Number of page views
  • Number of errors generated by a specific service

Depending on the scope of your operation, make a list of the signals you need to monitor. Any telemetry data that doesn’t help you monitor those critical signals – for the services worth monitoring – should be up for removal.

An example: identifying unneeded logs

With log data, it’s hard to know what you don’t need until you see it. First, identify your most common log data, and then determine which of it won’t help you monitor a critical signal.

But how do you know which logs are driving your costs? You could start by creating a visualization that buckets logs by type, but there can still be plenty of variation within those buckets.

While this is not always easy, with the right analytics, you can get a clear picture of which log data is impacting costs.

In the screenshot above, Logz.io Log Patterns clusters similar logs together to clearly surface the most common ones. Here, 13% of our logs have the message: “decision check InProgressDecisionCheck returned Number”. If we know this log won’t help us monitor critical signals, keeping it is a waste of time and money – there is no reason to ingest it.

If you’re still unsure whether specific data adds observability value, here are some general guidelines that can suggest whether given telemetry data is worth keeping:

  • Data that is collected, stored, and indexed but rarely analyzed (e.g., once a quarter) should be questioned for removal.
  • Data that is not displayed on a dashboard or monitored by an alerting mechanism should be questioned for removal.

Filtering out telemetry data

Once you’ve identified data that doesn’t help you monitor a critical signal, the next step is to filter out that data before incurring the costs of ingesting it.

Filtering out metrics

In most cases, the easiest way to filter out metrics is to simply remove the instrumentation exposing those metrics.

In cases where you only want to filter out some of your metrics, you can do so with Metricbeat – a popular open source agent that ships metrics to a desired location. By adding ‘processors’ to the configuration file, you can filter out data, enrich it, or perform additional processing. Processors are applied in the order they appear in the configuration.

Below is an example of adding a processor to remove metrics that meet a condition:

processors:
  - drop_event:
      when:
        regexp:
          # illustrative condition: the field and pattern here are just placeholders
          system.process.name: "^test.*"

Filtering out traces

Removing tracing instrumentation often isn’t possible if you’re using auto-instrumentation or instrumentation built into libraries. Instead, sampling is the common approach when tracing distributed systems – you can effectively filter out some of your traces by changing the sampling frequency.

There are many different types of sampling depending on the tracing solution in use. Below are some sampling examples for Jaeger, a popular open source tracing solution:

  • Constant sampler: sample all traces, or none of them.
  • Probabilistic sampler: sample approximately 1 in 10 traces.
  • Rate-limiting sampler: sample up to 5 traces per second.

By changing the sampling rates in the configuration, you can reduce the amount of data you’re sending. Note that this will result in a slightly less representative sample of what’s actually going on in your distributed system.
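
As a hedged illustration, many Jaeger client libraries read their sampler settings from environment variables, so changing the rate can be as simple as editing a deployment manifest. The snippet below is a hypothetical fragment of a Kubernetes container spec:

env:
  - name: JAEGER_SAMPLER_TYPE
    value: "probabilistic"       # sample a fraction of traces rather than all of them
  - name: JAEGER_SAMPLER_PARAM
    value: "0.1"                 # roughly 1 in 10 traces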

Filtering out logs

Instrumenting a service to send log data can expose all sorts of unneeded information along with very helpful information. After identifying unneeded log data (see previous section), the next step is to filter it to the archives.

Why filter logs to the archives?

In the event of a production outage, the last thing you’d want is to be missing log messages that describe what happened. This is why you should archive log data for a few weeks before deleting it. Most centralized logging solutions will allow you to reingest log data that was previously sent to the archives, so you can analyze that data later if needed.

Server level filtering

Filebeat and Fluentd, two of the most popular open source log shippers, both allow you to use regular expressions to modify event streams – giving you some control over which data makes it to the desired destination. For example, with Fluentd, you can use the grep filter directive below to keep only events whose message field matches a given pattern:

<filter **>
  @type grep
  regexp1 message USA
</filter>
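
Filebeat offers similar shipper-level filtering. A hedged filebeat.yml sketch – the log path is hypothetical – that drops debug-level lines before they ever leave the host might look like this:

filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log          # hypothetical application log path
    exclude_lines: ['^DEBUG']         # drop lines beginning with DEBUG at the source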

While this approach is very efficient, it can introduce configuration management challenges at larger scales. How can you ensure you’re pushing the proper configurations to every host?

Log collection and filtering

Syslog servers are a popular way to centralize log collection from multiple sources before analyzing them. Log types can be filtered out on a syslog server (syslog-ng, in this example) with the syntax below:

filter { expression; };
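
As a more concrete – and hypothetical – illustration, a syslog-ng filter that keeps only messages at warning severity or above, wired into a log path whose source and destination names are assumptions, could look like this:

filter f_warnings { level(warning..emerg); };                          # match warning and above
log { source(s_network); filter(f_warnings); destination(d_logs); };   # apply the filter in a log path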
While syslog offers a single place for log collection across multiple data sources, it also represents another server to monitor and maintain.

Filtering log data with Logz.io

Some log management and analytics solutions can centralize log data from multiple services and prevent unneeded data from being indexed all in one place.

Earlier, we saw how the “decision check InProgressDecisionCheck returned Number” log message was taking up a lot of space, but wouldn’t help us monitor a critical signal – so we determined we could filter it out.

With Logz.io’s Drop Filters, we can simply paste that log message into the filter and automatically prevent that log from being indexed – instead sending it to the archives.

After applying the filter, we can toggle it on or off at any time to dynamically manage the flow of incoming logs.

The Path to Cost-Efficient Observability on Azure

The volatility and scale of telemetry data generated by modern Azure environments can quickly rack up costs before teams are even aware of what’s happening. These costs can take different forms.

Proprietary solutions typically charge based on the amount of telemetry data ingested, or on some other measure of the operation’s scale. If you pass the defined data limit, many vendors will send monthly overage bills – a tough pill to swallow given how volatile telemetry volumes can be.

Alternatively, costs for open source solutions are incurred through infrastructure spending, engineering time spent on maintenance, and other factors. If you’re running your own open source system and it cannot scale with the amount of telemetry data being generated, the whole thing could fall over. That’s an especially big problem when an outage caused the data spike, because you’ll be blind to what triggered it just when you need visibility most.

By carefully planning your observability stack and filtering out data that won’t help monitor critical signals, you can guard against the unpredictable costs of Azure observability.