Every journey in observability begins with instrumenting an application to emit telemetry data – primarily logs, metrics and traces – from each service as it executes. OpenTelemetry is an open source project under the Cloud Native Computing Foundation (CNCF) that offers a unified framework for generating, collecting and transmitting telemetry data. With OpenTelemetry you can instrument your application in a vendor-agnostic way, and then analyze the telemetry data in your backend tool of choice, whether Prometheus, Jaeger, Zipkin, or others. In this tutorial I’ll cover:
- Overview of OpenTelemetry: signals, client libraries, protocol, collector and more.
- The current state of OpenTelemetry: What’s GA, what’s in Beta and what’s expected
- Basic concepts in instrumentation: what’s a span, what’s context propagation
- Automatic vs. manual instrumentation and supported programming languages
- Recommendations and best practices
- Useful cheat sheet of reference links
What is OpenTelemetry?
OpenTelemetry (informally called OTEL or OTel) is an observability framework – software and tools that assist in generating and capturing telemetry data from cloud-native software.
OpenTelemetry aims to address the full range of observability signals across traces, metrics and logs.
OpenTelemetry is a community-driven open source project, which is the result of a merger between the OpenTracing and OpenCensus projects. As of August 2021, OpenTelemetry is a CNCF incubating project. In fact, the recent CNCF dev stats show that OpenTelemetry is the second most active CNCF project behind Kubernetes.
OpenTelemetry offers several components, most notably:
- APIs and SDKs per programming language for generating and emitting telemetry
- Collector component to receive, process and export telemetry data
- OTLP protocol for transmitting telemetry data
We’ll look at each one in the following sections.
OpenTelemetry API & SDK Specification
For each programming language, OpenTelemetry provides a single API and a single SDK (an OpenTelemetry client library) with which you can manually instrument your application to generate metrics and tracing telemetry.
The standard API ensures no code changes will be required when switching between different SDK implementations.
The SDK takes care of sampling, context propagation and other required processing, and then exports the data, typically to OpenTelemetry Collector (see below). OpenTelemetry SDKs can send data to other destinations, using a suite of SDK exporters that support multiple data formats.
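To make the SDK's role more concrete, here is a minimal sketch of sampling followed by export, written with the Python standard library only. It does not use the OpenTelemetry SDK: `should_sample`, `InMemoryExporter` and `finish_span` are hypothetical names chosen for illustration, and the ratio-based decision is just one possible sampling strategy.

```python
import secrets
from typing import Dict, List

_LOW_64_BITS = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide from the trace ID alone, so the same trace always gets the same verdict."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    return (trace_id & _LOW_64_BITS) < bound


class InMemoryExporter:
    """Stand-in for a real exporter: it only collects spans in a list."""

    def __init__(self) -> None:
        self.exported: List[Dict[str, object]] = []

    def export(self, span: Dict[str, object]) -> None:
        self.exported.append(span)


def finish_span(trace_id: int, name: str, ratio: float, exporter: InMemoryExporter) -> bool:
    """Export the span only if its trace was sampled, and return the decision."""
    sampled = should_sample(trace_id, ratio)
    if sampled:
        exporter.export({"trace_id": trace_id, "name": name})
    return sampled


exporter = InMemoryExporter()
trace_id = secrets.randbits(128)  # random 128-bit trace ID
finish_span(trace_id, "GET /cart", ratio=1.0, exporter=exporter)  # always kept
finish_span(trace_id, "GET /cart", ratio=0.0, exporter=exporter)  # always dropped
```

Because the decision depends only on the trace ID, every service that sees the same trace can reach the same verdict without coordination.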
OpenTelemetry also supports automatic instrumentation with integrations to popular frameworks, libraries, storage clients etc. as well as with auto-instrumentation agents. This reduces the manual coding required in your application to capture things such as metrics and traces.
The OpenTelemetry Specification defines the cross-language requirements for the APIs and SDKs, as well as the data specification around semantic conventions and protocol.
OpenTelemetry Collector
The OpenTelemetry Collector can collect data from OpenTelemetry SDKs and other sources, and then export this telemetry to any supported backend, such as Jaeger, Prometheus or Kafka queues.
OpenTelemetry Collector can serve both as a local agent co-located on the same host with the application, and as a central collector service aggregating across multiple application nodes.
OpenTelemetry Collector is built as a processing pipeline in a pluggable architecture, with four main parts:
- Receivers for ingesting incoming data of various formats and protocols, such as OTLP, Jaeger and Zipkin. You can find the list of available receivers here.
- Processors for performing data aggregation, filtering, sampling and other collector processing logic on the telemetry data. Processors can be chained to produce complex processing logic.
- Exporters for emitting the telemetry data to one or more backend destinations (typically analysis tools or higher order aggregators) in various formats and protocols, such as OTLP, Prometheus and Jaeger. You can find the list of available exporters here.
- Connectors for connecting different pipelines in the same Collector. The connector serves as both an exporter and receiver, so it can consume data as an exporter from one pipeline and emit the data as a receiver into another pipeline. For example, Span Metrics Connector can aggregate metrics from spans.
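The four parts above can be pictured with a toy pipeline. This is a conceptual sketch in plain Python, not the Collector's actual implementation or configuration format: the `Pipeline` class, the two processors and `span_count_connector` are hypothetical names, and spans are reduced to small dictionaries.

```python
from collections import Counter
from typing import Callable, Dict, List

Span = Dict[str, str]
Processor = Callable[[List[Span]], List[Span]]
Exporter = Callable[[List[Span]], None]


class Pipeline:
    """Receives a batch, runs the chained processors, then fans out to exporters."""

    def __init__(self, processors: List[Processor], exporters: List[Exporter]) -> None:
        self.processors = processors
        self.exporters = exporters

    def receive(self, batch: List[Span]) -> None:
        for process in self.processors:
            batch = process(batch)
        for export in self.exporters:
            export(batch)


def drop_health_checks(batch: List[Span]) -> List[Span]:
    """Filtering processor: discard noisy health-check spans."""
    return [span for span in batch if span["name"] != "GET /health"]


def tag_environment(batch: List[Span]) -> List[Span]:
    """Enrichment processor: add an illustrative 'env' attribute."""
    return [{**span, "env": "staging"} for span in batch]


backend: List[Span] = []          # stand-in for a tracing backend
span_counts: Counter = Counter()  # stand-in for the input of a metrics pipeline


def span_count_connector(batch: List[Span]) -> None:
    """Acts as an exporter of the traces pipeline and feeds per-service counts onward."""
    span_counts.update(span["service"] for span in batch)


traces = Pipeline(
    processors=[drop_health_checks, tag_environment],
    exporters=[backend.extend, span_count_connector],
)
traces.receive([
    {"service": "checkout", "name": "POST /orders"},
    {"service": "checkout", "name": "GET /health"},
    {"service": "cart", "name": "GET /cart"},
])
```

In the real Collector these stages are declared in configuration rather than code, but the data flow is the same: receive, process in order, then export to every configured destination.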
OTLP: OpenTelemetry Protocol
So what is OTLP?
OpenTelemetry defines a vendor- and tool-agnostic protocol specification called OTLP (OpenTelemetry Protocol) for transmitting trace, metric and log telemetry data. With that in place, replacing a backend analysis tool would be as easy as a configuration change on the collector.
OTLP can be used for transmitting telemetry data from the SDK to the Collector, as well as from the Collector to the backend tool of choice. The OTLP specification defines the encoding, transport, and delivery mechanism for the data, and is the future-proof choice.
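To give a feel for the encoding, here is a rough sketch of what a single span might look like in OTLP's JSON encoding, built with the Python standard library. The field names follow my reading of the OTLP trace data model; the service name, span name, scope name and the 5 ms duration are made up for illustration, so check the OTLP specification before relying on the exact shape.

```python
import json
import secrets
import time


def otlp_trace_payload(service_name: str, span_name: str) -> dict:
    """Build a one-span export request as a plain dictionary."""
    end_ns = time.time_ns()
    start_ns = end_ns - 5_000_000  # pretend the operation took 5 ms
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                ],
            },
            "scopeSpans": [{
                "scope": {"name": "manual-sketch"},  # hypothetical instrumentation scope
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 2,  # server span
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }


body = json.dumps(otlp_trace_payload("checkout", "POST /orders"))
```

With OTLP over HTTP, a body like this would typically be POSTed to the Collector's traces endpoint (commonly port 4318, path `/v1/traces`). OTLP also supports a binary Protobuf encoding and gRPC transport, which this sketch does not cover.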
Your system may, however, be using third party tools and frameworks, which may come with built-in instrumentation other than OTLP, such as Zipkin or Jaeger formats. OpenTelemetry Collector can ingest these other protocols using appropriate Receivers as mentioned above.
You can find a detailed specification of OpenTelemetry’s components here.
OpenTelemetry: Current State
OpenTelemetry is an aggregate of multiple groups, each working on a different component of this huge endeavor. Some groups handle the specification for the different telemetry signals (distributed tracing, logging and metrics), while others focus on the language-specific clients, to name a few. Each group has its own release cadence, which means that different components of OpenTelemetry may be at different stages of the maturity lifecycle:
Draft → Experimental → Stable → Deprecated.
Stable is the equivalent of GA (generally available), which is what you’d be seeking to run in a production environment. Stable means it comes with guarantees around long-term support, backward compatibility, and dependency isolation.
Experimental is a Beta stage, which should enable testing the technology in evaluations and PoC towards integration.
Know your stack
When coming to evaluate OpenTelemetry for your project, you should first know your stack, and then map it to the relevant components for your system. Start with these three basic questions:
- Which programming languages and frameworks? For example, Java & Spring for the backend, and Node.js & Express for the frontend, would be a good start. This will determine the client libraries you’d be using, and potentially also the agents for instrumenting programming frameworks you use in your code.
- Which signal types and protocols? Are you collecting logs/metrics/traces? Do these come from your app via SDK or also from other sources such as Kafka, Docker or MySQL? This will determine the receivers you’d be using in your OpenTelemetry Collector.
- Which analytics tools in the backend? Are you sending your trace data to Jaeger? Are you sending your metrics to Prometheus? Or perhaps to a vendor tool such as Logz.io or to a Kafka cluster for downstream queueing? This will determine the exporters you’d be using in your OpenTelemetry Collector.
Once you map the relevant OpenTelemetry components for your stack, you can then check the status of these components.
The state of OpenTelemetry signals: traces, metrics and logs
Let’s look at the state according to the signal types:
Traces – General Availability (GA)
Distributed tracing was the first signal to have reached GA on OpenTelemetry, back in September 2021. The Tracing API, SDK and Protocol specifications are stable.
Metrics – Release Candidate (RC)
Metrics GA is around the corner, after the Release Candidates were announced during KubeCon in May 2022. This means the Metrics API, SDK and Protocol specifications are stable, and the OpenTelemetry Collector supports metric pipelines.
Since Prometheus is the leading metrics monitoring framework out there, and within the CNCF ecosystem in particular, it’s also important to note the Prometheus support by OpenTelemetry, achieved in collaboration with the Prometheus community. This means that the SDK offers Prometheus exporters, the Collector offers Prometheus receivers and exporters, and there’s alignment of the data model specification between Prometheus and OTLP.
Logs – Beta
Logs are the least mature telemetry data type in OpenTelemetry, and at the time of writing they are still experimental.
As logs are the longest standing signal, and essentially every system out there already has logs baked in, the initial focus of the OpenTelemetry community has been to integrate with existing logging systems and sources. However, current logging convention relies heavily on textual and unstructured formats, which are less suitable for modern observability. Consequently, the next stage of OpenTelemetry is to build a new strongly-typed and machine-readable logging format.
Supporting existing logging sources starts with fetching existing logs and transmitting them over OTLP alongside metrics and traces.
The data model is Stable, as well as the OTLP Logs Protocol.
The Logging SDK specification is experimental, and allows clients to ingest logging data from existing logging systems and output the logs as part of OTLP along with tracing and metrics.
There is also development of log appenders in many languages, to allow appending telemetry data, such as Trace ID or Span ID, to existing logging systems.
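The appender idea can be illustrated with Python's standard `logging` module alone. This is a sketch of the concept, not OpenTelemetry's own appender: `TraceContextFilter` is a hypothetical class, and the trace and span IDs are hard-coded sample values where a real integration would read them from the active span.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace and span IDs to every record passing through the logger."""

    def __init__(self, trace_id: str, span_id: str) -> None:
        super().__init__()
        self.trace_id = trace_id
        self.span_id = span_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # capture output in memory instead of a file or console
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))

logger = logging.getLogger("checkout-demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
logger.addFilter(TraceContextFilter(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",  # sample value
    span_id="00f067aa0ba902b7",                   # sample value
))

logger.info("order placed")
log_line = stream.getvalue()
```

The application keeps calling its existing logger as before. The only change is that each line now carries the identifiers needed to jump from a log entry to the matching trace.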
Log support in the Collector is also experimental, but it already supports log processing for many data formats, thanks to the donation of Stanza to the OpenTelemetry project.
In parallel, OpenTelemetry is starting to address the second phase of building a new strongly-typed and machine-readable logging format. Current work is mostly on establishing the semantic conventions for logs. There is important work to align with Elastic Common Schema (ECS), which is another open standard for log structure, to converge the industry and the open source communities.
As the specifications are still in early stages, the API specification is still in draft mode.
Service Instrumentation: The Basic Concepts
When a service is instrumented for distributed tracing, each invocation of an operation on a service emits a span (and in some cases multiple spans).
You can create spans manually in your code using the API and SDK (a client library sometimes called a tracer). In some cases you can also use auto-instrumentation agents that generate spans automatically, so that no code change is required in your application.
The span contains data on the invoked service and operation, the invocation start and finish timestamps, the span context (trace id, span id, parent span id etc.), and an optional list of user-defined attributes (essentially key-value pairs). The SDK takes care of propagating the context through the involved services, to ensure the causal relationship between the spans is captured.
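Here is a minimal sketch of how span context can travel between two services. It assumes the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`), which the text above does not name explicitly. `SpanContext`, `new_span_context`, `inject` and `extract` are hypothetical helpers written for this illustration, not the OpenTelemetry API.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str                         # 32 hex characters, shared by the whole trace
    span_id: str                          # 16 hex characters, unique per span
    parent_span_id: Optional[str] = None  # None for the root span


def new_span_context(parent: Optional[SpanContext] = None) -> SpanContext:
    """Start a root context, or a child that inherits the parent's trace ID."""
    if parent is None:
        return SpanContext(trace_id=secrets.token_hex(16), span_id=secrets.token_hex(8))
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.span_id)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing request headers (flags 01 = sampled)."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-01"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers, or None if absent or malformed."""
    parts = headers.get("traceparent", "").split("-")
    if [len(part) for part in parts] != [2, 32, 16, 2]:
        return None
    _version, trace_id, span_id, _flags = parts
    return SpanContext(trace_id=trace_id, span_id=span_id)


# Service A makes an outgoing call.
client_span = new_span_context()
headers: Dict[str, str] = {}
inject(client_span, headers)

# Service B handles the request and continues the same trace.
remote_parent = extract(headers)
server_span = new_span_context(remote_parent)
```

The server-side span shares the caller's trace ID and records the caller's span ID as its parent, which is exactly the causal link the backend later needs.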
The span is then formatted to a specific protocol and emitted via the SDK to a collector backend (typically via an agent or a collector component), and from there to a tracing analysis backend tool such as Jaeger.
The spans are ingested and collected on the backend, and traces are reconstructed from the spans according to causality, namely the sequence of invocation.
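Reconstruction by causality boils down to linking each span to its parent. The sketch below assumes spans arrive as simple dictionaries with hypothetical `span_id`, `parent_span_id`, `name` and `start` keys. Real backends handle far more (missing parents, clock skew, multiple traces), so treat this as an outline of the idea only.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

Span = Dict[str, Any]


def build_trace_tree(spans: List[Span]) -> List[str]:
    """Return the trace as indented lines, children nested under their parent."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda span: span["start"])  # order siblings by start time

    def render(parent_id: Optional[str], depth: int) -> List[str]:
        lines: List[str] = []
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span["name"])
            lines.extend(render(span["span_id"], depth + 1))
        return lines

    return render(None, 0)  # the root span has no parent


# Spans may arrive out of order from different services.
collected = [
    {"span_id": "c", "parent_span_id": "b", "name": "SELECT orders", "start": 3},
    {"span_id": "a", "parent_span_id": None, "name": "GET /checkout", "start": 1},
    {"span_id": "b", "parent_span_id": "a", "name": "OrderService.place", "start": 2},
    {"span_id": "d", "parent_span_id": "a", "name": "PaymentService.charge", "start": 4},
]
tree = build_trace_tree(collected)
```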
This is a very basic outline. I’ve left out many details not directly relevant to instrumentation. Now let’s dive into how to instrument our application.
Automatic and Manual Instrumentation
Instrumentation is the ability of our services to emit well-formatted spans with proper context. But how do we generate these spans?
You can instrument your application manually, by adding code to start and finish the span (to designate the start and end timestamp of the executed code block), specify the payload and submit the span data.
Some software frameworks and agents offer automatic instrumentation, which saves you the need to modify your application code for many use cases, and can provide baseline telemetry.
Automatic and manual are not mutually exclusive options. In fact, it is recommended to combine the two, to leverage the benefits of a codeless approach where possible, with fine-grained control where required.
Let’s see how to instrument your code with manual and automatic instrumentation, and the considerations for choosing the right instrumentation method for your needs.
Manual instrumentation means the developer needs to add code to the application to start and finish a span and to define the payload. It makes use of client libraries and SDKs, which are available for a variety of different programming languages, as we’ll see below.
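As an illustration of the start, payload and finish steps, here is a small sketch that uses a context manager to wrap a code block. It relies only on the standard library: `start_span`, `FINISHED_SPANS` and `place_order` are hypothetical names, and a real application would use the tracer from the OpenTelemetry client library for its language instead.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

FINISHED_SPANS: List[Dict[str, Any]] = []  # stand-in for the SDK's export queue


@contextmanager
def start_span(name: str, **attributes: Any) -> Iterator[Dict[str, Any]]:
    """Record start and end timestamps around a block, plus any attributes."""
    span: Dict[str, Any] = {
        "name": name,
        "attributes": dict(attributes),
        "start_ns": time.time_ns(),
        "status": "OK",
    }
    try:
        yield span
    except Exception as exc:
        span["status"] = "ERROR"
        span["attributes"]["error.message"] = str(exc)
        raise  # instrumentation must not swallow application errors
    finally:
        span["end_ns"] = time.time_ns()
        FINISHED_SPANS.append(span)


def place_order(order_id: str, amount: float) -> Dict[str, Any]:
    """Business code with a manually instrumented block."""
    with start_span("place_order", order_id=order_id) as span:
        span["attributes"]["order.amount"] = amount  # custom business data
        return {"order_id": order_id, "charged": amount}


receipt = place_order("A-1", 42.0)
```

In the OpenTelemetry Python API the equivalent shape is a `with tracer.start_as_current_span("place_order") as span:` block combined with `span.set_attribute(...)`. The sketch mirrors that structure without requiring the library.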
Let’s look at the considerations for manually instrumenting our application:
Pros:
- It is the only option in application stacks where auto-instrumentation is not supported.
- It gives you maximum control over the data that is being generated.
- You can instrument custom code blocks.
- It enables capturing business metrics or other custom metrics within the trace, including events or messages you want to use for monitoring or business observability.
Cons:
- It is time consuming.
- There is a learning curve to perfect it.
- It can cause performance overhead.
- There is more room for human error, resulting in broken span context.
- Changing the instrumentation may require recompiling the application.
OpenTelemetry Client Libraries
OpenTelemetry currently offers SDKs for the following programming languages:
Some languages also have agents for auto-instrumentation, which can speed up your instrumentation work.
Automatic instrumentation requires no code changes and no recompilation of the application. This method uses an intelligent agent that attaches to the running application and extracts tracing data.
You can also find auto-instrumentation agents for popular programming languages such as Python, Java, .NET and PHP. In addition, common libraries and frameworks for these languages also offer built-in instrumentation.
Java programmers, for example, can leverage the Java Agent that automatically injects bytecode to capture baseline telemetry without the need to change the application’s Java source code. Java programmers can also find standalone instrumentation for several popular libraries such as Spring, JDBC, RxJava, Log4J and others.
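The Java Agent works through bytecode injection, which I won't reproduce here. The same principle of wrapping existing code from the outside can be sketched in Python by patching methods at runtime. `instrument`, `RECORDED` and `PaymentClient` are hypothetical names for this illustration; this is not how any specific OpenTelemetry agent is implemented.

```python
import functools
import time
from typing import Any, Callable, Dict, List

RECORDED: List[Dict[str, Any]] = []  # stand-in for spans produced by the agent


def _traced(func: Callable[..., Any], span_name: str) -> Callable[..., Any]:
    """Wrap a callable so each call is timed and recorded."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            RECORDED.append({
                "name": span_name,
                "duration_s": time.perf_counter() - start,
            })

    wrapper._agent_wrapped = True  # marker to avoid wrapping twice
    return wrapper


def instrument(owner: type, method_names: List[str]) -> None:
    """Replace the named methods on a class with traced versions."""
    for name in method_names:
        original = getattr(owner, name)
        if getattr(original, "_agent_wrapped", False):
            continue
        setattr(owner, name, _traced(original, f"{owner.__name__}.{name}"))


class PaymentClient:
    """Pretend third-party library whose source we do not edit."""

    def charge(self, amount: int) -> str:
        return f"charged {amount}"


instrument(PaymentClient, ["charge"])  # the "agent" attaches at startup
result = PaymentClient().charge(10)    # application code is unchanged
```

The calling code never mentions tracing, which is the appeal of the approach. The trade-off is that the wrapper only sees what crosses the method boundary.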
There are other ways to reduce instrumentation coding, such as service meshes (via their sidecar proxies) and eBPF (via Linux kernel instrumentation), which are beyond the scope of this introduction.
If you use ORM libraries such as Django, Hibernate and Sequelize, you can use SQLCommenter (contributed by Google to OpenTelemetry in September 2021) to auto-instrument these libraries and enable app-focused database observability.
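The core idea is to append application context to each SQL statement as a comment, so that database-side logs can be tied back to the code that issued the query. The sketch below is a simplified illustration of that idea and not SQLCommenter's actual code: `add_sql_comment` is a hypothetical helper, and the exact comment format (sorted keys, URL-encoded values in single quotes) is my understanding of the convention and should be checked against the SQLCommenter documentation.

```python
from typing import Mapping
from urllib.parse import quote


def add_sql_comment(sql: str, tags: Mapping[str, str]) -> str:
    """Append key='value' pairs as a trailing SQL comment, keeping any final semicolon."""
    if not tags:
        return sql
    pairs = ",".join(
        f"{quote(key, safe='')}='{quote(value, safe='')}'"
        for key, value in sorted(tags.items())
    )
    statement = sql.rstrip()
    trailing = ""
    if statement.endswith(";"):
        statement, trailing = statement[:-1], ";"
    return f"{statement} /*{pairs}*/{trailing}"


annotated = add_sql_comment(
    "SELECT * FROM orders;",
    {
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
        "route": "/orders",
    },
)
```

Because the tags live in a comment, the statement's behavior is unchanged. Only tools that read the query text, such as slow-query logs, see the extra context.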
Let’s look at the considerations for auto-instrumenting our application:
Pros:
- Does not require code changes.
- Provides good coverage of application endpoints and operations.
- Saves time instrumenting your code and lets you focus on the business.
- Reduces the need for code changes due to updates to the instrumentation (such as new metadata captured in the payload).
Cons:
- Not all languages and frameworks provide auto-instrumentation.
- Offers less flexibility than manual instrumentation, as it is typically limited to the scope of a function or a method call.
- Only instruments basic metrics around usage and performance. Business metrics or other custom metrics need to be manually instrumented.
- Often captures error data only in terms of related events or logs within a trace.
Here are some useful guidelines for instrumentation:
- Identify the tools in your application stack that provide built-in instrumentation and enable their instrumentation for infrastructure baseline traces. For each one, verify in which format and protocol it exports trace data, and make sure you can ingest this format (using an appropriate receiver).
- Leverage auto-instrumentation as much as possible. Use agents for your programming language that can generate trace data themselves or via software frameworks, libraries and middleware you use.
- Once out-of-the-box capabilities are fully used, map the gaps in your instrumentation data and observability, and augment with manual instrumentation as needed. Make it a gradual step, starting with the most prominent gaps you’ve mapped. Oftentimes it’s useful to start with the client-facing services and endpoints, and later add more backend services.
- Verify the release and maturity level of each component you use, whether the collector, client library, protocol or others, as each component has its own release lifecycle.
Note: If your instrumentation is still based on OpenTracing, be aware that OpenTracing is being deprecated, and it is advised to migrate to OpenTelemetry.