Despite what you may have seen and heard, the combination of logging, metrics and tracing does not tell the whole story about observability.

Our systems emit telemetry, and those three signals (logs, metrics and traces) are commonly called the “three pillars” of observability. They’re all important, but by themselves, they aren’t observability.

Many users I see day in and day out find themselves with broken observability even though they’re collecting data for all three pillars. Why is this the case?

In my opinion, observability can’t just be about the external outputs. My definition of observability is the capability for a human to ask and answer questions about the system. Observability needs to be an integral part of any system, and this definition also makes it clear that observability is a data analytics problem.

Those “three pillars” are just the raw data; what we need from observability is insight. It’s time to put aside the reactive monitoring and maintenance mindset and take more of a proactive data analyst approach.

I discussed these concepts and more in my 2023 Container Days talk, “Beyond Logs, Metrics and Traces: The Data Analytics Paradigm Shift in Observability.” This post recaps what was discussed, but you can watch the full replay of my talk here:

What Is Observability Really About?

With this shift in mindset, organizations have to reconsider what observability is really about. It’s about collecting data across different sources, formats and types; structuring and standardizing that data; enriching and correlating it; unifying query and visualization; and managing data volume and the data-to-noise ratio.

We need not limit ourselves to collecting data from the “three pillars” alone. Events and profiling are among the data types that can be considered an expansion of traditional observability. OpenTelemetry, for its part, has added a data model for continuous profiling to help lead this trend.

Observability involves ingesting data signals from multiple sources across different tiers, frameworks and programming languages. Consistently collecting heterogeneous data across all these places has been a serious challenge for many years. Each source has its own way of exposing, collecting and relaying telemetry data. But the biggest pain point is putting it all together.

Data silos prevent us from correlating the telemetry our systems emit. This is where OpenTelemetry can help. It’s a vendor-agnostic, open source framework for generating and collecting telemetry data across different signal types and data sources.
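To make that concrete, here’s a minimal sketch of instrumenting one service with the OpenTelemetry Python SDK. The service name, span name and attribute are hypothetical, and a real deployment would export to a collector or backend rather than the console:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up tracing once at startup; swap ConsoleSpanExporter for an OTLP
# exporter to send spans to a collector or backend in a real deployment.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Each unit of work becomes a span; the format and semantics stay
# consistent no matter which language or framework emitted it.
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "A-1042")  # hypothetical attribute
```

The same pattern applies across languages and frameworks, which is what makes the data consistent enough to put together later.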

The Critical Importance of Structured, Standardized Data

If you’re still using unstructured, plain-text logs in your system, it’s time to move on. No human is going to read through mountains of logs to come up with insights. We’re talking about data analytics here.

We need structured, machine-readable logs that software can ingest at scale to produce insights. Observability is a property of your system, so put thought into the schema. Use consistent conventions in your code so you can correlate and match items across your application and infrastructure. This is an area where platform engineering can play a role.
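As a sketch of what that can look like, here’s JSON-structured logging with Python’s standard library. The field names and the service_name convention are illustrative assumptions, not a prescribed schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent, queryable fields."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical org-wide convention: every service stamps this
            # field so logs can be matched across application and infrastructure.
            "service.name": getattr(record, "service_name", "unknown"),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` attaches the conventional fields to the record.
logger.info("payment authorized", extra={"service_name": "checkout"})
```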

OpenTelemetry and OpenMetrics have made it part of their mission to provide open standards and specifications for structured telemetry data.

In the talk, I go deeper into these areas:

  • Enriching and correlating data: Data enrichment is an important step in data analytics. You need to be able to map a log to a root cause. Is an issue customer-specific? Or is it part of a problematic build? For correlation, you can package your logs as part of your trace data, or add exemplars to your metrics, for more effective observability (see the sketch after this list).
  • Unified query and visualization: How can you easily ask and answer ad hoc questions about your systems and infrastructure? A unified query and visualization layer is critical, but the market is very fragmented. A new CNCF working group is working on a standardized query language, and we need something similar for visualization. In the meantime, you need to implement organizational standards.
  • Data volume and data-to-noise ratio: Data volumes can explode and become difficult to scale, and you often find that a massive amount of that data isn’t valuable. You need to carefully consider the value of your data and apply observability-driven development. Which logs do you actually use for real-time troubleshooting? Which metrics and labels do you really use in dashboards and alerts?
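On the correlation point above, one common pattern is to stamp every log line with the active trace context so your backend can join logs to traces. Here’s a minimal sketch using the OpenTelemetry Python API; the helper function is hypothetical:

```python
from opentelemetry import trace

def current_trace_fields() -> dict:
    """Return the active trace/span IDs for stamping onto a structured log line."""
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:  # no active span, e.g., outside a request
        return {}
    return {
        # W3C Trace Context hex encoding: 128-bit trace ID, 64-bit span ID.
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    }
```

Exemplars play the analogous role for metrics, linking an individual data point back to a representative trace.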

You’ll come away with a new understanding of what observability should be, along with the latest community efforts to make observability beyond logs, metrics and traces more accessible. Watch the full talk in the embedded video above.

At Logz.io, we are committed to helping customers meet their observability goals. If you’d like to see how we can help, sign up for a free trial of the Logz.io Open 360™ observability platform.
