Cost Management: Identify and Filter Telemetry Data

By: Charlie Klein

Reduce Monitoring Costs: How to Identify and Filter Unneeded Telemetry Data

To understand what’s going on in their environment, DevOps teams usually ship some combination of logs, metrics and traces, depending on which signals they’re hoping to monitor. Each data type exposes different information about what is happening in a system. However, not all of that information is helpful on a day-to-day basis, and storing it can rack up unnecessary costs. To avoid that, teams need to filter telemetry data across their observability stacks.

Filtering Guidelines: What to Filter Out and What to Keep?

The first step to removing unneeded data is identifying what you don’t need. Sending data that doesn’t get analyzed wastes time and money by:

Impacting the performance and stability of your system by clogging up databases.
Requiring users to search through more data when exploring logs, metrics, or traces.
Racking up unnecessary infrastructure costs if using open source, on-premises solutions.
Triggering steep overage fees from proprietary vendors.

To determine which data to filter out, the key questions to ask are:

Is the service generating this data something I need to monitor?
Does this data help me monitor a critical signal?

If the answer to either of those questions is no, the data should qualify for removal. If the answer to both is yes, you should probably track it on a dashboard and monitor it with an alerting mechanism.

The grey area in those criteria is how ‘critical signals’ are defined. Critical signals vary across teams; it all depends on why you’re gaining observability into your system. Examples of critical signals could be:

Application latency
Infrastructure CPU and memory usage
Usage of a new product feature
Number of page views
Number of errors that a specific service generates

Depending on the scope of your operation, make a list of the signals you have to monitor. Any telemetry data that does not help you monitor a critical signal for a service worth monitoring should be up for removal.
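The two key questions can be expressed as a simple keep-or-drop check. The following is a minimal sketch, assuming each piece of telemetry can be tagged with the service that emitted it and the signal it supports. The names `MONITORED_SERVICES`, `CRITICAL_SIGNALS`, and `TelemetryRecord` are hypothetical and would be replaced by your own team’s list.

```python
from dataclasses import dataclass

# Hypothetical lists; in practice these come from your own team's priorities.
MONITORED_SERVICES = {"checkout", "auth"}
CRITICAL_SIGNALS = {"latency", "error_count", "cpu_usage"}


@dataclass(frozen=True)
class TelemetryRecord:
    service: str  # service that emitted the data
    signal: str   # signal the data helps monitor


def qualifies_for_removal(record: TelemetryRecord) -> bool:
    """Return True if the answer to either key question is 'no'."""
    service_is_monitored = record.service in MONITORED_SERVICES
    helps_critical_signal = record.signal in CRITICAL_SIGNALS
    return not (service_is_monitored and helps_critical_signal)
```

For example, `qualifies_for_removal(TelemetryRecord("checkout", "debug_trace"))` flags the record for removal because the signal is not on the critical list, even though the service itself is monitored.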

An Example: Identifying Unneeded Logs

With log data, it’s hard to know what you don’t need until you see it. First, identify the most common log data, and then determine which of it won’t help you monitor a critical signal.

One way to better understand which log data is impacting your costs is to group similar logs together to see which log messages are the most common.

In the screenshot above, Logz.io Log Patterns cluster similar logs together to clearly show the most common logs. 13% of our logs have the message: “decision check InProgressDecisionCheck returned Number”. If we know this log won’t help monitor critical signals, it’s a waste of time and money and there is no reason to keep it.
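To illustrate the idea of grouping similar logs, here is a minimal sketch that masks numbers in each message and counts the resulting patterns. This is a much simpler stand-in for a real log clustering feature, not a description of how Logz.io Log Patterns works; the function names are hypothetical.

```python
import re
from collections import Counter


def to_pattern(message: str) -> str:
    """Replace variable tokens (here, just numbers) with a placeholder."""
    return re.sub(r"\d+(\.\d+)?", "Number", message)


def most_common_patterns(messages, top_n=3):
    """Return (pattern, share_of_total) pairs for the most frequent patterns."""
    counts = Counter(to_pattern(m) for m in messages)
    total = sum(counts.values())
    return [(p, c / total) for p, c in counts.most_common(top_n)]
```

Running `most_common_patterns` over a sample of your logs gives a rough ranking of which message shapes dominate your volume, which is the starting point for deciding what to filter.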

If you’re still unsure whether specific data is worth keeping, here are some general guidelines that suggest telemetry data should be considered for removal:

Collected, stored, and indexed data that is rarely analyzed (e.g. once a quarter)
Data that is not displayed on a dashboard or monitored by an alerting mechanism

Filter Telemetry Data

Once you’ve identified data that doesn’t help you monitor a critical signal, the next step is to filter that telemetry data out before incurring the costs for it.

Filtering Out Metrics

In most cases, the easiest way to filter out metrics is to simply remove the instrumentation exposing those metrics.

If you want to filter out only some metrics, you can do so with Metricbeat, a popular open source agent that ships metrics to a desired location. By adding ‘processors’ to the config file, you can filter out data, enrich the data, or perform additional data processing. Processors run in the order in which they are defined in the configuration.

Below is an example of adding a processor to remove metrics that meet a condition:

processors:
  - drop_event:
      when:
        condition
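To make the ordering point concrete, here is a conceptual sketch of a processor chain in which a drop processor and an enrichment processor run in list order. It is an illustrative model only, not Metricbeat’s implementation; the helper names and the `metricset` and `env` fields are hypothetical.

```python
from typing import Callable, List, Optional

Event = dict
Processor = Callable[[Event], Optional[Event]]  # returning None drops the event


def drop_event_when(condition: Callable[[Event], bool]) -> Processor:
    """Build a processor that drops events matching the condition."""
    def processor(event: Event) -> Optional[Event]:
        return None if condition(event) else event
    return processor


def add_field(key: str, value: str) -> Processor:
    """Build a processor that enriches the event with a static field."""
    def processor(event: Event) -> Optional[Event]:
        return {**event, key: value}
    return processor


def run_processors(event: Event, processors: List[Processor]) -> Optional[Event]:
    """Apply processors in list order; stop as soon as one drops the event."""
    current: Optional[Event] = event
    for processor in processors:
        current = processor(current)
        if current is None:
            return None
    return current


# Hypothetical pipeline: drop disk I/O metrics first, then tag what remains.
pipeline = [
    drop_event_when(lambda e: e.get("metricset") == "diskio"),
    add_field("env", "prod"),
]
```

Because the drop processor comes first, dropped events never reach the enrichment step, which is why ordering in the configuration matters.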

Filtering Out Traces

Removing tracing instrumentation isn’t always possible. That’s certainly the case if you are using auto-instrumentation or built-in, library-contained instrumentation. Sampling is very common when tracing distributed systems—you can effectively filter out some of your traces by changing the sampling frequency.

In fact, there are many different types of sampling depending on the tracing solution in use. Below are some sampling examples for Jaeger, a popular open source tracing solution:

Constant sampler examples: Sample all traces, or none of them.
Probabilistic sampler example: Sample approximately 1 in 10 traces.
Rate limiting sampler example: Sample 5 traces per second.

By changing the sampling rates in the configuration, you can reduce the amount of data you’re sending. Note that this will result in a slightly less representative sample of what’s actually going on in your distributed system.
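The three sampling strategies listed above can be sketched as follows. This is an illustrative model of the sampling decisions, not Jaeger’s client code; the class names are hypothetical, and the rate limiting sampler is approximated with a simple token bucket.

```python
import random
import time


class ConstantSampler:
    """Sample all traces (decision=True) or none of them (decision=False)."""
    def __init__(self, decision: bool):
        self.decision = decision

    def is_sampled(self) -> bool:
        return self.decision


class ProbabilisticSampler:
    """Sample each trace independently with the given probability."""
    def __init__(self, rate: float, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self.rng = rng or random.Random()

    def is_sampled(self) -> bool:
        return self.rng.random() < self.rate


class RateLimitingSampler:
    """Token bucket allowing roughly `traces_per_second` sampled traces."""
    def __init__(self, traces_per_second: float, clock=time.monotonic):
        self.rate = traces_per_second
        self.capacity = max(traces_per_second, 1.0)
        self.clock = clock
        self.tokens = self.capacity
        self.last = clock()

    def is_sampled(self) -> bool:
        now = self.clock()
        # Refill tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Under these assumptions, `ProbabilisticSampler(0.1)` keeps approximately 1 in 10 traces, and `RateLimitingSampler(5)` keeps about 5 traces per second regardless of how many arrive.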

Filtering Out Logs

Instrumenting a service to send log data can expose all sorts of unnecessary information along with very helpful information. After identifying unneeded log data, the next step is to filter it out.

Server Level Filtering

Popular open source log shippers Filebeat and Fluentd both let you use regular expressions to modify event streams. This gives you some control over which data makes it to the desired destination. For example, with Fluentd, you can use the directive below to grep the value of one or more fields:

@type grep
regexp1 message USA

While this solution is very efficient, it can introduce configuration management challenges at larger scales. How can you ensure you’re pushing the proper configurations to every host?
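The general idea of regex-based filtering on an event stream can be sketched in a few lines. This is a conceptual illustration, not Fluentd or Filebeat code; the function name `grep_events` and its parameters are hypothetical.

```python
import re
from typing import Iterable, Iterator, Optional


def grep_events(
    events: Iterable[dict],
    field: str,
    keep: Optional[str] = None,
    exclude: Optional[str] = None,
) -> Iterator[dict]:
    """Yield events whose `field` matches `keep` and does not match `exclude`."""
    keep_re = re.compile(keep) if keep else None
    exclude_re = re.compile(exclude) if exclude else None
    for event in events:
        value = str(event.get(field, ""))
        if keep_re and not keep_re.search(value):
            continue  # field does not match the required pattern
        if exclude_re and exclude_re.search(value):
            continue  # field matches a pattern we want filtered out
        yield event
```

Calling `grep_events(events, "message", keep="USA")` keeps only events whose `message` field contains "USA", while the `exclude` argument drops matching events instead.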

Log Collection and Filtering

Syslog servers are a popular way to centralize log collection from multiple sources before analyzing them. Logs can be filtered out on the syslog server itself; syslog-ng, for example, uses the syntax below:

filter <identifier> { expression; };

While syslog offers a single place for log collection across multiple data sources, it also represents another server to monitor and maintain.

Filtering Log Data with Logz.io

Some log management and analytics solutions can centralize log data from multiple services and prevent the indexing of unneeded data all in one place.

Earlier, we saw how the “decision check InProgressDecisionCheck returned Number” log message was taking up a lot of space, but wouldn’t help us monitor a critical signal—so we determined we could filter it out.

With Logz.io’s Drop Filters, we can simply paste that log message into the filter and automatically prevent the indexing of that log—instead sending it to the archives.

After applying the filter, we can toggle it on or off at any time to dynamically manage the flow of incoming logs.
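The behavior described here (keep a noisy message out of the index, still archive it, and toggle the filter at will) can be modeled with a small sketch. This is a conceptual illustration under simple assumptions, not Logz.io’s implementation: matching is by exact message, and in-memory lists stand in for the index and the archive. `DropFilter` and `LogRouter` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DropFilter:
    message: str          # exact log message to keep out of the index
    enabled: bool = True  # filters can be toggled on or off at any time


@dataclass
class LogRouter:
    filters: List[DropFilter] = field(default_factory=list)
    indexed: List[str] = field(default_factory=list)
    archived: List[str] = field(default_factory=list)

    def ingest(self, message: str) -> None:
        """Archive every log; index it only if no enabled filter matches."""
        self.archived.append(message)
        if any(f.enabled and f.message == message for f in self.filters):
            return
        self.indexed.append(message)
```

Setting `enabled = False` on a filter lets the matching message flow back into the index without changing anything upstream.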

Where to Send Filtered Log Data?

In the event of a production outage, the last thing you’d want is to be missing log messages that describe what happened. This is why you should archive log data for a few weeks before deleting it.

Shipping all of your data to an S3 bucket or Azure Blob Storage is a good way to store it cheaply, so you can access it when it’s really needed.

Why bother?

The volatility and scale of telemetry data that modern cloud environments generate can quickly rack up costs before teams are even aware of what’s happening. For this reason, teams should pay careful attention to which data they’re ingesting, and how badly they need that data on a day-to-day basis.

Proprietary solutions typically charge based on the amount of ingested telemetry data or some other measurement indicating the scale of the operation. In general, the more data you send, the more you’ll need to pay.

If you’re using open source, on the other hand, you’ll pay for data in the form of cloud compute and cloud storage costs. Again, the more data you send, the higher the costs.

Regardless of what you’re using, the guidelines above can provide some good first steps in deciding what you may want to filter out to save costs.