Every journey in observability begins with instrumenting an application to emit telemetry data – primarily logs, metrics and traces – from each service as it executes. OpenTelemetry is an open source project under the Cloud Native Computing Foundation (CNCF) that offers a unified framework for generating, collecting and transmitting telemetry data. With OpenTelemetry you can instrument your application in a vendor-agnostic way, and then analyze the telemetry data in your backend tool of choice, whether Prometheus, Jaeger, Zipkin, or others. In this tutorial I’ll cover:
- Overview of OpenTelemetry: signals, client libraries, protocol, collector and more.
- The current state of OpenTelemetry: What’s GA, what’s in Beta and what’s expected
- Basic concepts in instrumentation: what’s a span, what’s context propagation
- Automatic vs. manual instrumentation and supported programming languages
- Recommendations and best practices
- Useful cheat sheet of reference links
What is OpenTelemetry?
OpenTelemetry (informally called OTEL or OTel) is an observability framework – software and tools that assist in generating and capturing telemetry data from cloud-native software.
OpenTelemetry aims to address the full range of observability signals across traces, metrics and logs.
OpenTelemetry is a community-driven open source project that resulted from the merger of the OpenTracing and OpenCensus projects. As of August 2021, OpenTelemetry is a CNCF incubating project. In fact, the recent CNCF dev stats show that OpenTelemetry is the second most active CNCF project behind Kubernetes.
OpenTelemetry offers several components, most notably:
- APIs and SDKs per programming language for generating and emitting telemetry
- Collector component to receive, process and export telemetry data
- OTLP protocol for transmitting telemetry data
We’ll look at each one in the following sections.
OpenTelemetry API & SDK Specification
For each programming language, OpenTelemetry provides a single API and a single SDK (an OpenTelemetry client library) with which you can manually instrument your application to generate metrics and tracing telemetry.
The standard API ensures no code changes will be required when switching between different SDK implementations.
The SDK takes care of sampling, context propagation and other required processing, and then exports the data, typically to OpenTelemetry Collector (see below). OpenTelemetry SDKs can send data to other destinations, using a suite of SDK exporters that support multiple data formats.
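To make the sampling step more concrete, here is a minimal sketch of a ratio-based sampler using only the Python standard library. It is not the SDK's implementation: the function names `should_sample` and `new_trace_id` and the 64-bit threshold rule are assumptions chosen for illustration. The idea it shows is that the decision is derived from the trace ID, so the same trace always gets the same answer.

```python
import secrets


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep roughly `ratio` of all traces, deterministically per trace ID.

    Illustrative rule (an assumption, not the SDK's algorithm): compare the
    low 64 bits of the trace ID against a threshold derived from the ratio.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    threshold = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < threshold


def new_trace_id() -> int:
    """Hypothetical helper: a random 128-bit trace ID."""
    return secrets.randbits(128)


if __name__ == "__main__":
    ids = [new_trace_id() for _ in range(10_000)]
    kept = sum(should_sample(t, 0.25) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID and the ratio, two services configured with the same ratio would keep or drop the same traces, which avoids traces with missing spans.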
OpenTelemetry also supports automatic instrumentation with integrations to popular frameworks, libraries, storage clients etc. as well as with auto-instrumentation agents. This reduces the manual coding required in your application to capture things such as metrics and traces.
The OpenTelemetry Specification defines the cross-language requirements for the APIs and SDKs, as well as the data specification around semantic conventions and protocol.
OpenTelemetry Collector
The OpenTelemetry Collector can collect data from OpenTelemetry SDKs and other sources, and then export this telemetry to any supported backend, such as Jaeger, Prometheus or Kafka queues.
OpenTelemetry Collector can serve as both a local agent co-located on the same host with the application, as well as a central collector service aggregating across multiple application nodes.
OpenTelemetry Collector is built as a processing pipeline in a pluggable architecture, with three main parts:
- Receivers for ingesting incoming data of various formats and protocols, such as OTLP, Jaeger and Zipkin. You can find the list of available receivers here.
- Processors for performing data aggregation, filtering, sampling and other collector processing logic for each signal type. For example, SpanMetrics Processor can aggregate metrics from spans. Processors can be chained to produce complex processing logic.
- Exporters for emitting the telemetry data to one or more backend destinations (typically analysis tools or higher order aggregators) in various formats and protocols, such as OTLP, Prometheus and Jaeger. You can find the list of available exporters here.
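The receiver, processor and exporter stages described above can be sketched as a toy pipeline. This is a stdlib-only illustration of the pattern, not the Collector's code or configuration format; the `Span` class, the function names and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Span:
    name: str
    duration_ms: float
    attributes: Dict[str, str] = field(default_factory=dict)


# A processor takes a batch of spans and returns a (possibly changed) batch.
Processor = Callable[[List[Span]], List[Span]]
# An exporter delivers a batch to one destination.
Exporter = Callable[[List[Span]], None]


def receive(raw_records: Iterable[dict]) -> List[Span]:
    """Receiver: translate incoming records into the internal span model."""
    return [
        Span(r["name"], float(r["duration_ms"]), dict(r.get("attributes", {})))
        for r in raw_records
    ]


def drop_health_checks(spans: List[Span]) -> List[Span]:
    """Processor: filter out noisy health-check spans."""
    return [s for s in spans if s.name != "GET /health"]


def tag_environment(spans: List[Span]) -> List[Span]:
    """Processor: enrich every span with an environment attribute."""
    for s in spans:
        s.attributes.setdefault("environment", "demo")
    return spans


def run_pipeline(raw_records: Iterable[dict],
                 processors: List[Processor],
                 exporters: List[Exporter]) -> List[Span]:
    batch = receive(raw_records)
    for process in processors:      # processors are chained in order
        batch = process(batch)
    for export in exporters:        # the same batch fans out to every exporter
        export(batch)
    return batch


if __name__ == "__main__":
    incoming = [
        {"name": "GET /health", "duration_ms": 1},
        {"name": "GET /cart", "duration_ms": 12.5, "attributes": {"user": "a"}},
    ]
    run_pipeline(incoming, [drop_health_checks, tag_environment], [print])
```

Swapping `print` for another exporter function changes the destination without touching the receiver or the processors, which mirrors the pluggable design.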
OTLP: OpenTelemetry Protocol
So what is OTLP?
OpenTelemetry defines a vendor- and tool-agnostic protocol specification called OTLP (OpenTelemetry Protocol) for transmitting traces, metrics and logs telemetry data. With that in place, replacing a backend analysis tool would be as easy as a configuration change on the collector.
OTLP can be used for transmitting telemetry data from the SDK to the Collector, as well as from the Collector to the backend tool of choice. The OTLP specification defines the encoding, transport, and delivery mechanism for the data, and is the future-proof choice.
Your system may, however, be using third party tools and frameworks, which may come with built-in instrumentation other than OTLP, such as Zipkin or Jaeger formats. OpenTelemetry Collector can ingest these other protocols using appropriate Receivers as mentioned above.
You can find a detailed specification of OpenTelemetry’s components here.
OpenTelemetry: Current State
OpenTelemetry is an aggregate of multiple groups, each working on a different component of this huge endeavor. Some groups handle the specification for the different telemetry signals (distributed tracing, logging and metrics), while others focus on the language-specific clients, to name a few. Each group has its own release cadence, which means that different components of OpenTelemetry may be in different stages of the maturity lifecycle:
Draft → Experimental → Stable → Deprecated.
Stable is the equivalent of GA (generally available), which is what you’d look for before running a component in a production environment. Experimental is a Beta stage, suitable for testing the technology in evaluations and PoCs ahead of integration.
When evaluating OpenTelemetry for your project, you should map the status of the relevant components for your system:
- The signal types of interest (traces/metrics/logs)
- The protocols for receiving the signal types
- The client library for the programming language(s) you use. Potentially also agents for instrumenting programming frameworks you use in your code.
Let’s start with the status of the API and SDK specification:
- The distributed tracing specification is generally available (Stable status) as of the v1.0.0 release, built on the solid foundations of OpenTracing.
- The metrics specification is expected to reach v1.0 GA (Stable status) towards the end of 2021 according to the roadmap, with compatibility to OpenCensus, as well as full Prometheus and OpenMetrics compatibility.
- The logging specification is the least advanced, and is not expected before 2022.
- OpenTracing and OpenCensus APIs are deprecated in favor of OpenTelemetry API.
OTLP also has a separate life cycle for each signal: At present, OTLP is in Stable status for tracing and metrics, and is Experimental for logging.
SDKs (client libraries) for different languages are developed independently and are therefore at different maturity levels. For example, at the time of writing the Java SDK was already GA at version 1.4.1, while the Go SDK was still at release candidate 1 (v1.0.0-RC1). Consult the section below for the supported programming languages and review the status of the stack relevant to you.
OpenTelemetry Collector reached GA for tracing in September 2021, with metrics expected to reach GA by the end of 2021, and logging still experimental.
You can find high-level status information on the OpenTelemetry status page.
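The mapping exercise described in this section can be written down as a small checklist. The sketch below is only an illustration: the `Maturity` enum mirrors the lifecycle stages named above, the two example stacks use statuses as reported in this section at the time of writing, and the helper names `production_ready` and `blockers` are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class Maturity(Enum):
    DRAFT = "Draft"
    EXPERIMENTAL = "Experimental"
    STABLE = "Stable"
    DEPRECATED = "Deprecated"


def blockers(components: Dict[str, Maturity]) -> List[str]:
    """Names of the components that are not yet (or no longer) Stable."""
    return sorted(name for name, status in components.items()
                  if status is not Maturity.STABLE)


def production_ready(components: Dict[str, Maturity]) -> bool:
    """A stack is treated as production-ready only if every piece is Stable."""
    return not blockers(components)


# Example stacks, using the statuses reported in this section.
java_tracing_stack = {
    "tracing specification": Maturity.STABLE,
    "OTLP (traces)": Maturity.STABLE,
    "Java SDK": Maturity.STABLE,
    "Collector (traces)": Maturity.STABLE,
}
java_tracing_and_logs_stack = {
    **java_tracing_stack,
    "OTLP (logs)": Maturity.EXPERIMENTAL,
    "Collector (logs)": Maturity.EXPERIMENTAL,
}

if __name__ == "__main__":
    print(production_ready(java_tracing_stack))
    print(blockers(java_tracing_and_logs_stack))
```

Re-run the check against the current status page before relying on it, since each component moves through the lifecycle on its own schedule.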
Service Instrumentation: The Basic Concepts
When a service is instrumented for distributed tracing, each invocation of an operation on a service emits a span (and in some cases multiple spans).
You can create spans manually in your code using the API and SDK (a client library sometimes called a tracer). In some cases you can also use auto-instrumentation agents that generate spans automatically, so that no code change is required in your application.
The span contains data on the invoked service and operation, the invocation start and finish timestamps, the span context (trace id, span id, parent span id etc.), and an optional list of user-defined attributes (essentially key-value pairs). The SDK takes care of propagating the context through the involved services, to ensure the causal relationship between the spans is captured.
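As a rough illustration of what a span carries and how its context travels between services, here is a stdlib-only sketch. The `Span` class and the helper functions are hypothetical, not the OpenTelemetry API. The header format is an assumption on my part: it follows the W3C Trace Context `traceparent` layout (`version-traceid-parentid-flags`), which this tutorial does not otherwise cover.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Span:
    service: str
    operation: str
    trace_id: str                      # 32 hex chars, shared by the whole trace
    span_id: str                       # 16 hex chars, unique to this span
    parent_span_id: Optional[str] = None
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    attributes: Dict[str, str] = field(default_factory=dict)

    def finish(self) -> None:
        self.end_ns = time.time_ns()


def start_span(service: str, operation: str,
               parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child span inside the same process."""
    return Span(
        service=service,
        operation=operation,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_span_id=parent.span_id if parent else None,
    )


def inject(span: Span, headers: Dict[str, str]) -> None:
    """Caller side: write the span context into outgoing request headers."""
    headers["traceparent"] = f"00-{span.trace_id}-{span.span_id}-01"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Callee side: read (trace_id, parent_span_id), or None if absent/malformed."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    return parts[1], parts[2]


def continue_trace(service: str, operation: str,
                   headers: Dict[str, str]) -> Span:
    """Start a span in the callee that is causally linked to the caller's span."""
    context = extract(headers)
    if context is None:
        return start_span(service, operation)   # no context: begin a new trace
    trace_id, parent_span_id = context
    return Span(service, operation, trace_id,
                secrets.token_hex(8), parent_span_id)
```

In this sketch a frontend would call `inject` before sending a request, and the downstream service would call `continue_trace` on the received headers, so both spans share one trace ID and the child points at its parent.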
The span is then formatted to a specific protocol and emitted via the SDK to a collector backend (typically via an agent or a collector component), and from there to a tracing analysis backend tool such as Jaeger.
The spans are ingested and collected on the backend, and traces are reconstructed from the spans according to causality, namely the sequence of invocation.
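Reconstructing a trace from spans amounts to grouping by trace ID and following parent span IDs. The sketch below shows that idea with hypothetical names (`SpanRecord`, `build_traces`, `render`) and made-up sample data; a real backend such as Jaeger handles many more cases, for example late or missing spans.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional


class SpanRecord(NamedTuple):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ns: int


# children index: parent span ID -> spans invoked by that parent
Children = Dict[Optional[str], List[SpanRecord]]


def build_traces(spans: Iterable[SpanRecord]) -> Dict[str, Children]:
    """Group spans by trace ID and index them by their parent span ID."""
    traces: Dict[str, Children] = defaultdict(lambda: defaultdict(list))
    for span in spans:
        traces[span.trace_id][span.parent_span_id].append(span)
    return traces


def render(children: Children, parent_id: Optional[str] = None,
           depth: int = 0) -> List[str]:
    """Walk the tree from the root (parent None), ordering siblings by start time."""
    lines: List[str] = []
    siblings = sorted(children.get(parent_id, []), key=lambda s: s.start_ns)
    for span in siblings:
        lines.append("  " * depth + span.name)
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


if __name__ == "__main__":
    # Spans arrive at the backend in arbitrary order.
    received = [
        SpanRecord("t1", "c", "a", "payments: charge", 30),
        SpanRecord("t1", "a", None, "frontend: GET /checkout", 10),
        SpanRecord("t1", "d", "c", "db: INSERT", 40),
        SpanRecord("t1", "b", "a", "cart: load", 20),
    ]
    for trace_id, children in build_traces(received).items():
        print(trace_id)
        print("\n".join(render(children)))
```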
This is a very basic outline. I’ve left out many details not directly relevant to instrumentation. Now let’s dive into how to instrument our application.
Automatic and Manual Instrumentation
Instrumentation is the ability of our services to emit well-formatted spans with proper context. But how do we generate these spans?
You can instrument your application manually, by adding code to start and finish the span (to designate the start and end timestamp of the executed code block), specify the payload and submit the span data.
Some software frameworks and agents offer automatic instrumentation, which saves you the need to modify your application code for many use cases, and can provide baseline telemetry.
Automatic and manual are not mutually exclusive options. In fact, it is recommended to combine the two, to leverage the benefits of a codeless approach where possible, with fine-grained control where required.
Let’s see how to instrument your code with manual and automatic instrumentation, and the considerations for choosing the right instrumentation method for your needs.
Manual instrumentation means the developer needs to add code to the application to start and finish a span and to define the payload. It makes use of client libraries and SDKs, which are available for a variety of different programming languages, as we’ll see below.
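To show the shape of manual instrumentation without depending on any particular SDK, here is a stdlib-only sketch. The `span` context manager, the `FINISHED_SPANS` list and the `checkout` function are hypothetical stand-ins; real OpenTelemetry client libraries provide their own span-creation calls, so treat this as an illustration of the start, payload and finish steps only.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

FINISHED_SPANS: List[Dict] = []   # stand-in for an exporter queue


@contextmanager
def span(name: str, **attributes) -> Iterator[Dict]:
    """Start a span on entry, finish it on exit, and record any error."""
    record = {
        "name": name,
        "attributes": dict(attributes),
        "start_ns": time.time_ns(),
        "status": "ok",
    }
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["error.message"] = str(exc)
        raise
    finally:
        record["end_ns"] = time.time_ns()   # the finish timestamp is always set
        FINISHED_SPANS.append(record)


def checkout(cart_total: float) -> str:
    """Application code with a manually instrumented block."""
    with span("checkout", currency="USD") as current:
        # Custom business data added to the span payload.
        current["attributes"]["cart.total"] = cart_total
        if cart_total <= 0:
            raise ValueError("empty cart")
        return "order-accepted"
```

The instrumentation lives inside `checkout` itself, which is what gives manual instrumentation its control over the payload and also what makes it a code change.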
Let’s look at the considerations for manually instrumenting our application:
Pros:
- It is the only option in application stacks where auto-instrumentation is not supported.
- It gives you maximum control over the data that is being generated.
- You can instrument custom code blocks.
- It enables capturing business metrics or other custom metrics within the trace, including events or messages you want to use for monitoring or business observability.
Cons:
- It is time consuming.
- There is a learning curve to perfect it.
- It can cause performance overhead.
- There is more room for human error, resulting in broken span context.
- Changing the instrumentation may require recompiling the application.
OpenTelemetry Client Libraries
OpenTelemetry currently offers SDKs for a range of popular programming languages.
Some languages also have agents for auto-instrumentation, which can speed up your instrumentation work.
Automatic instrumentation requires no code changes and no need for recompilation of the application. This method uses an intelligent agent that attaches to the running application and extracts tracing data.
You can also find auto-instrumentation agents for popular programming languages such as Python, Java, .NET and PHP. In addition, common libraries and frameworks for these languages also offer built-in instrumentation.
Java programmers, for example, can leverage the Java Agent, which automatically injects bytecode to capture baseline telemetry without the need to change the application’s Java source code. Java programmers can also find standalone instrumentation for several popular libraries such as Spring, JDBC, RxJava, Log4J and others.
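The Java Agent works by bytecode injection, which cannot be reproduced in a few lines. As a loose analogy only, the sketch below wraps a library function at load time in Python so that calls are recorded without editing the application function. Everything here is hypothetical: `fake_db` stands in for a third-party library, and `instrument_module` is not an OpenTelemetry function.

```python
import functools
import time
import types
from typing import Callable, Dict, Iterable, List

CAPTURED: List[Dict] = []   # stand-in for spans handed to an exporter


def _traced(func: Callable, span_name: str) -> Callable:
    """Return a wrapper that records a span around every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time_ns()
        try:
            return func(*args, **kwargs)
        finally:
            CAPTURED.append(
                {"name": span_name, "start_ns": start, "end_ns": time.time_ns()}
            )
    return wrapper


def instrument_module(module: types.ModuleType,
                      function_names: Iterable[str]) -> None:
    """Replace the named functions of `module` in place with traced versions."""
    for fname in function_names:
        original = getattr(module, fname)
        setattr(module, fname, _traced(original, f"{module.__name__}.{fname}"))


# A pretend third-party library that the application uses unchanged.
fake_db = types.ModuleType("fake_db")


def _query(sql: str):
    return [("row", sql)]


fake_db.query = _query


def list_users():
    """Application code: contains no instrumentation of its own."""
    return fake_db.query("SELECT * FROM users")


# The "agent" step: runs once at startup, outside the application logic.
instrument_module(fake_db, ["query"])
```

After the `instrument_module` call, every `list_users()` invocation appends a record to `CAPTURED`, even though `list_users` was never modified. The trade-off matches the considerations in this section: the span boundary is fixed at the wrapped function call.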
There are other ways to reduce instrumentation coding, such as service meshes (via their sidecar proxies) and eBPF (via Linux kernel instrumentation), which are beyond the scope of this introduction.
If you use ORM libraries such as Django, Hibernate and Sequelize, you can use SQLCommenter (contributed by Google to OpenTelemetry in September 2021) to auto-instrument these libraries and enable app-focused database observability.
Let’s look at the considerations for auto-instrumenting our application:
Pros:
- Does not require code changes.
- Provides good coverage of application endpoints and operations.
- Saves time instrumenting your code and lets you focus on the business.
- Reduces the need for code changes due to updates to the instrumentation (such as new metadata captured in the payload).
Cons:
- Not all languages and frameworks provide auto-instrumentation.
- Offers less flexibility than manual instrumentation; instrumentation is typically limited to the scope of a function or a method call.
- Only instruments basic metrics around usage and performance. Business metrics or other custom metrics need to be manually instrumented.
- Often only captures error data in terms of related events or logs within a trace.
Here are some useful guidelines for instrumentation:
- Identify the tools in your application stack that provide built-in instrumentation and enable their instrumentation for infrastructure baseline traces. For each one, verify in which format and protocol it exports trace data, and make sure you can ingest this format (using an appropriate receiver).
- Leverage auto-instrumentation as much as possible. Use agents for your programming language that can generate trace data themselves or via software frameworks, libraries and middleware you use.
- Once out-of-the-box capabilities are fully used, map the gaps in your instrumentation data and observability, and augment with manual instrumentation as needed. Make it a gradual step, starting with the most prominent gaps you’ve mapped. Oftentimes it’s useful to start with the client-facing services and endpoints, and later add more backend services.
- Verify the release and maturity level of each component you use, whether the collector, client library, protocol or others, as each component has its own release lifecycle.
Note: If your instrumentation is still based on OpenTracing, be aware that OpenTracing is being deprecated, and it is advised to migrate to OpenTelemetry.