We’ve all grown used to logs, metrics and traces serving as the “three pillars of observability.” And they are indeed very important telemetry signals. But are they the sum of the observability game? Not at all. In fact, one of the key trends in observability is moving beyond the “three pillars”:
These three pillars continue to be critically important. But it’s important not to be confined by the “three pillars” paradigm and to choose the right telemetry data for your needs.
One emerging telemetry type shows particularly interesting potential for observability: Continuous Profiling. At KubeCon Europe 2022, when we discussed the OpenTelemetry roadmap in the OpenTelemetry project meeting, Continuous Profiling was a hot topic. I saw similar excitement in follow-up discussions that culminated in the establishment of a dedicated working group for the topic, as I’ll share in this post.
In fact, in 2021 I devoted an episode of the OpenObservability Talks podcast to exploring Continuous Profiling and its potential for observability. For that episode I had the pleasure of hosting Frederic Branczyk, founder and CEO of Polar Signals, and the creator of the Parca open source project for continuous profiling. Before founding Polar Signals, Frederic was a senior principal engineer and the main architect for all things observability at Red Hat.
In this post I’ll summarize the rise of Continuous Profiling as a new observability signal, what it’s about and where it can help, with insights from the podcast, from KubeCon and more.
Note: In a previous blog post, I discussed another new signal that is of particular interest to developer observability: application snapshots.
Whenever I hear the term “profiling,” I immediately get the shivers from the infamous OOM (Out of Memory) exceptions. Profiling has been with us for decades, as a means to understand where our code spends its time and which hardware resources, primarily memory and CPU, it consumes, down to the individual line number. We used to invoke the profiler for specific investigations, and I recall it being quite a “heavy” operation to execute, as it is typically based on stack traces.
Continuous Profiling offers a different experience, built on the notion of statistical profiling using time-based sampling. Google pioneered this concept in its own data centers. Over a decade ago, in 2010, Google published a research paper called “Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers”:
Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses.
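To make “statistical profiling using time-based sampling” a bit more concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not how GWP or any production profiler is implemented: it samples from inside the process using the CPython-specific `sys._current_frames()`, and the function names `sample_stacks` and `busy_loop` are made up for this example. Instead of instrumenting every call, it peeks at a thread’s stack at a fixed interval and counts how often each stack shows up.

```python
import collections
import sys
import threading
import time
import traceback


def sample_stacks(thread_id, interval_s, duration_s):
    """Peek at one thread's stack every interval_s seconds and count each stack seen."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        # CPython-specific: maps thread id -> that thread's current top frame.
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            # Root-first tuple of (function name, line number) pairs.
            stack = tuple((f.name, f.lineno) for f in traceback.extract_stack(frame))
            counts[stack] += 1
        time.sleep(interval_s)
    return counts


def busy_loop(stop_event):
    """Hypothetical workload that burns CPU until asked to stop."""
    total = 0
    while not stop_event.is_set():
        total += sum(i * i for i in range(1000))
    return total


if __name__ == "__main__":
    stop = threading.Event()
    worker = threading.Thread(target=busy_loop, args=(stop,))
    worker.start()
    profile = sample_stacks(worker.ident, interval_s=0.01, duration_s=0.5)
    stop.set()
    worker.join()
    # The most frequently observed stacks approximate where time was spent.
    for stack, hits in profile.most_common(3):
        print(hits, " -> ".join(f"{name}:{line}" for name, line in stack))
```

The sampling interval is the knob that trades detail for overhead: a longer interval costs less but needs more wall-clock time to build a stable picture, which is why collecting continuously matters.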
Continuous profiling provided the different profiling experience thanks to three main elements:
Frederic attested to how influential Google’s research paper was on his observability journey and in founding his current startup around continuous profiling. Another important source of inspiration for Frederic is Prometheus, the open source project for metrics monitoring under the Cloud Native Computing Foundation (CNCF). Frederic was inspired by the way data is stored and queried in Prometheus, and chose to keep compatibility with its labeling conventions.
What does Continuous Profiling add to our observability? The ability to understand resource usage down to the line number.
For starters, it opens up the option to compare profiles between different runs (think of it as a profile “diff”). This helps pinpoint lines of code that were executed in one run but not the other, and which account for the differences in behavior and resource consumption.
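A profile “diff” can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: a profile is modeled as a plain mapping from a `file:line` location to a sample count, the locations and numbers are invented, and `diff_profiles` is a hypothetical helper rather than the API of any profiling tool. Because two runs rarely collect the same number of samples, the sketch compares each location’s share of the total rather than raw counts.

```python
def diff_profiles(before, after):
    """Return (location, change in share of samples), largest change first."""
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    deltas = {}
    for location in set(before) | set(after):
        share_before = before.get(location, 0) / total_before
        share_after = after.get(location, 0) / total_after
        deltas[location] = share_after - share_before
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented sample counts for two runs of the same service.
run_a = {"handler.py:42": 60, "db.py:10": 40}
run_b = {"handler.py:42": 60, "db.py:10": 40, "json_encode.py:7": 100}

for location, delta in diff_profiles(run_a, run_b):
    print(f"{location:20s} {delta:+.0%}")
```

In this made-up data, `json_encode.py:7` appears only in the second run and takes half of its samples, so it sorts to the top of the diff.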
Continuous profiling can also augment existing telemetry types you already use. Let’s look at a couple of examples of how continuous profiling fits into everyday observability flows.
Let’s say you monitor your application’s performance with Prometheus, and you detect a latency spike. Assuming your profiles use the same labeling system as Prometheus, you can take the same label selector and request the CPU profile for the same period of time. This will instantly show you where CPU time was spent within your application, down to the line number.
This is also a great example of how a specification, such as canonical labeling and metadata, enables easy correlation between telemetry signals, which is needed for full observability. Frederic adopted this design choice and built it into his Parca open source project. There is also work to create an open standard, as I’ll show below.
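Here is a rough Python sketch of that metrics-to-profile flow. It does not call Prometheus or Parca; `PROFILE_SAMPLES` is a hypothetical in-memory stand-in for a profile store, and the field names, labels and numbers are assumptions for illustration. It also supports only equality matchers, whereas real label selectors are richer. The point is simply that one selector plus one time window can be reused across both signals.

```python
from collections import Counter

# Hypothetical profile records, labeled the same way the metrics would be.
PROFILE_SAMPLES = [
    {"labels": {"job": "api", "namespace": "prod"}, "ts": 100, "location": "handler.py:42", "cpu_samples": 30},
    {"labels": {"job": "api", "namespace": "prod"}, "ts": 130, "location": "serializer.py:7", "cpu_samples": 80},
    {"labels": {"job": "api", "namespace": "prod"}, "ts": 150, "location": "serializer.py:7", "cpu_samples": 40},
    {"labels": {"job": "api", "namespace": "staging"}, "ts": 130, "location": "serializer.py:7", "cpu_samples": 500},
    {"labels": {"job": "api", "namespace": "prod"}, "ts": 900, "location": "handler.py:42", "cpu_samples": 999},
    {"labels": {"job": "db", "namespace": "prod"}, "ts": 130, "location": "query.py:3", "cpu_samples": 70},
]


def matches(labels, selector):
    """Equality-only label matching; regex matchers are left out of this sketch."""
    return all(labels.get(key) == value for key, value in selector.items())


def cpu_by_location(samples, selector, start_ts, end_ts):
    """Sum CPU samples per code location for matching labels inside the time window."""
    totals = Counter()
    for sample in samples:
        in_window = start_ts <= sample["ts"] <= end_ts
        if in_window and matches(sample["labels"], selector):
            totals[sample["location"]] += sample["cpu_samples"]
    return totals.most_common()


# The same selector and window you would have used to inspect the latency spike.
selector = {"job": "api", "namespace": "prod"}
hot_spots = cpu_by_location(PROFILE_SAMPLES, selector, start_ts=100, end_ts=160)
print(hot_spots)
```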
Now let’s look at another example, this time of distributed tracing data augmented with profiling data. To make that work, you need to attach metadata about the invoked function to each span (tracing data). Now if you inspect a request trace and detect a span that is taking a long time to execute, you can take the function metadata and query the profiling data, filtering on this specific function, to zoom in on its behavior.
Furthermore, you have the ability to analyze this function’s behavior across the cluster, for example across all the API servers, thanks to the fact that profiles are continuously collected not just from a suspected node or machine but throughout the environment.
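The trace-to-profile flow can be sketched the same way in Python. Everything here is assumed for illustration: the `function` key on each span, the shape of the profile samples, the node names and the numbers. It picks the slowest span in a trace, takes its function metadata, and then sums the profiling samples whose stack contains that function on every node, not only the one that served the slow request.

```python
from collections import Counter

# Hypothetical spans of one request trace, each tagged with the invoked function.
spans = [
    {"name": "GET /orders", "duration_ms": 40, "function": "list_orders"},
    {"name": "render", "duration_ms": 950, "function": "render_invoice"},
    {"name": "auth", "duration_ms": 12, "function": "check_token"},
]

# Hypothetical profiling samples collected continuously across the cluster.
profile_samples = [
    {"node": "api-1", "stack": ["main", "serve", "render_invoice", "encode_pdf"], "count": 120},
    {"node": "api-2", "stack": ["main", "serve", "render_invoice", "encode_pdf"], "count": 15},
    {"node": "api-2", "stack": ["main", "serve", "list_orders"], "count": 30},
    {"node": "api-3", "stack": ["main", "serve", "render_invoice", "load_fonts"], "count": 10},
]


def slowest_span(trace_spans):
    """Return the span with the longest duration."""
    return max(trace_spans, key=lambda span: span["duration_ms"])


def samples_by_node(samples, function):
    """Sum samples per node for stacks that include the given function."""
    per_node = Counter()
    for sample in samples:
        if function in sample["stack"]:
            per_node[sample["node"]] += sample["count"]
    return dict(per_node)


suspect = slowest_span(spans)["function"]
print(suspect, samples_by_node(profile_samples, suspect))
```

In this invented data the function is far hotter on one node than on the others, which is the kind of cluster-wide comparison a one-off profiling session on a single machine would not reveal.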
These examples show not only the need to add more data signals, but a far more fundamental need: to go beyond the “pillars of observability” that are raw telemetry data types, and into observability as a data analytics problem. As we saw in the examples above, it’s not about which data I collect but how I fuse this data together to debug and understand my system.
In addition to its research paper, Google also released pprof, a popular tool for visualization and analysis of profiling data. Pprof also defines an open format for profiling data that is language and runtime independent, based on Protobuf. It is, however, not the only format, and in fact there is currently no single de facto standard format for profiling. Other open formats out there include JFR (Java Flight Recorder) for Java applications and Collapsed. And then there are custom formats by vendors such as Pyroscope and Prodfiler by Elastic. We certainly need an open standard to converge the industry.
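To give a feel for what such a format looks like, here is a small Python sketch that reads the Collapsed format, the simplest of the ones mentioned above. As I understand it, each line holds one call stack as semicolon-separated frames (root first), followed by a space and a sample count. The helper names `parse_collapsed` and `self_samples` and the sample text are made up for this example.

```python
from collections import Counter


def parse_collapsed(text):
    """Parse 'frame;frame;frame count' lines into a Counter keyed by stack tuple."""
    stacks = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last space so frame names may themselves contain spaces.
        frames, _, count = line.rpartition(" ")
        stacks[tuple(frames.split(";"))] += int(count)
    return stacks


def self_samples(stacks):
    """Attribute each stack's samples to its leaf frame (the code actually running)."""
    leaf_totals = Counter()
    for frames, count in stacks.items():
        leaf_totals[frames[-1]] += count
    return leaf_totals


# Invented profile text in Collapsed form.
collapsed_text = """
main;handle;parse 3
main;handle;render 5
main;handle;parse 2
"""

stacks = parse_collapsed(collapsed_text)
print(self_samples(stacks).most_common())
```

The pprof format carries considerably more than this (for example sample types and source locations), which is part of why converging on one standard takes real design work.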
The OpenTelemetry community is looking into supporting Continuous Profiling, alongside the currently supported traces, metrics and logs. Following the discussions at KubeCon Europe 2022, a new working group was established in June 2022 for OpenTelemetry Profiling. The working group is still in its early days, establishing the goals, evaluating existing open and custom formats, and seeing how profiling should be modeled in the OpenTelemetry architecture. You are welcome to take part and influence the discussion: you can join the group calls and read the summary of past calls in this running document. You can also join the #otel-profiles channel on the CNCF Slack.
Want to learn more? Check out the OpenObservability Talks episode “Prometheus Pitfalls and the Rise of Continuous Profiling.”