Every journey in distributed tracing starts with instrumenting an application to emit or extract trace data from each service as it executes. There are many ways to instrument, including the use of SDKs and pre-configured frameworks, and many protocols for transmitting the trace data to the analysis tool. In this post I’ll cover:
- Basic concepts in instrumentation
- OpenTracing API specification
- Automatic vs. manual instrumentation
- Built-in instrumentation for popular software frameworks
- Instrumentation for various programming languages
- Available protocols for transmitting trace data to the backend
- Recommendations and best practices for instrumentation
- The future of instrumentation with OpenTelemetry
- Useful cheat sheet of reference links
Service Instrumentation: The Basic Concepts
When a service is instrumented, each invocation of an operation of the service emits a span (and in some cases multiple spans).
You can create spans manually in your code using the API and SDK provided by a tracer client library. In some cases you can also use auto-instrumentation agents that generate spans automatically, so that no code change is required in your application.
The span contains data on the invoked service and operation, the invocation timestamps, the span context (trace id, span id, parent span id etc.), and additional metadata such as tags and logs.
The span is formatted to a specific protocol and emitted via a tracer to a distributed tracing backend (typically an agent or a collector component).
The spans are ingested and collected on the backend, which reconstructs each trace by linking the spans that share a trace ID through their parent span IDs, in order of invocation.
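To make the outline above more concrete, here is a minimal sketch of what a span might carry and how a backend could stitch spans back into a trace. The `Span` class, its field names and the `build_trace` and `render` helpers are assumptions made for illustration only; they are not Jaeger's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (not Jaeger's real model)
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_us: int             # start timestamp in microseconds
    duration_us: int
    tags: dict = field(default_factory=dict)


def build_trace(spans):
    """Return the root span and a map of parent span id -> child spans."""
    children = {}
    root = None
    # Sorting by start time keeps siblings in order of invocation
    for span in sorted(spans, key=lambda s: s.start_us):
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    return root, children


def render(span, children, depth=0):
    """Flatten the trace tree into indented text lines."""
    lines = ["  " * depth + f"{span.service}:{span.operation}"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive at the backend in arbitrary order
spans = [
    Span("t1", "c", "a", "db", "query", 130, 40),
    Span("t1", "a", None, "frontend", "GET /checkout", 100, 200),
    Span("t1", "b", "a", "payments", "charge", 110, 15),
]
root, children = build_trace(spans)
print("\n".join(render(root, children)))
```

The point of the sketch is that the span context (trace ID, span ID, parent span ID) is what makes reconstruction possible: if a service drops or corrupts it, the backend has no way to attach that span to the right parent.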
This is a very basic outline. I’ve left out many details not directly relevant to instrumentation. You can find an introduction to distributed tracing and Jaeger, and an overview of Jaeger’s backend components and how to deploy them, in the resources section at the end of this article.
Instrumentation is therefore the ability of our services to emit well-formatted spans with proper context. In order to instrument, we need to answer two questions:
- How do we generate spans?
- In which protocols do we emit the spans?
The answer depends on your application’s specific needs. The following sections will provide guidance on answering these questions.
OpenTracing API Specification
Before delving into how to instrument your code, it’s important to present OpenTracing.
Today there are community-driven open standards such as OpenTracing and its emerging successor OpenTelemetry, which are vendor neutral. They are backed by the Cloud Native Computing Foundation (CNCF) and by all the major observability and cloud vendors, along with the end user community.
When you instrument your code with an OpenTracing-compliant API, changing distributed tracing tools should be as simple as replacing the tracer or auto-instrumentation agent in use, with the rest of the instrumentation code remaining unchanged.
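The idea of swapping the tracer while leaving instrumentation code untouched can be sketched roughly as follows. This is not the OpenTracing API itself: the `Tracer` interface, the two backend classes and `handle_login` are hypothetical names used only to illustrate the principle of coding against a vendor-neutral interface.

```python
from abc import ABC, abstractmethod


class Tracer(ABC):
    """Hypothetical vendor-neutral interface that application code targets."""

    @abstractmethod
    def record(self, operation: str, tags: dict) -> str:
        ...


class BackendATracer(Tracer):
    # Stand-in for one vendor's tracer implementation
    def record(self, operation, tags):
        return f"backend-a span: {operation} {sorted(tags.items())}"


class BackendBTracer(Tracer):
    # Stand-in for a different vendor's tracer implementation
    def record(self, operation, tags):
        return f"backend-b span: {operation} {sorted(tags.items())}"


def handle_login(tracer: Tracer, user_id: str) -> str:
    # Instrumentation code: identical regardless of which tracer is plugged in
    return tracer.record("handle_login", {"user.id": user_id})


print(handle_login(BackendATracer(), "u-17"))
print(handle_login(BackendBTracer(), "u-17"))
```

Only the object passed in changes between the two calls; `handle_login` itself never mentions a specific backend.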
The current preferred path for tracing instrumentation is the OpenTracing specification. You can find OpenTracing-supported tracers for all the popular open source monitoring and APM tools such as Jaeger, Zipkin and Apache SkyWalking, along with proprietary tools such as VMware Wavefront, LightStep and Elastic APM.
OpenTelemetry is a new community-driven open standard, which is the result of a merger between OpenTracing and OpenCensus. It aims to address the full range of observability data across traces, metrics and logs. The OpenTelemetry API is set to succeed OpenTracing as the API specification, and the project also addresses additional aspects of the tracing and observability framework, which will be discussed later in this article.
Automatic and Manual Instrumentation
You can instrument your application manually, by adding code to start and finish the span (to designate the start and end timestamp of the executed code block), specify the payload and submit the span data.
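As a rough illustration of the start, finish, payload and submit steps, here is a minimal sketch using only the Python standard library. The `start_span` helper, the `finished_spans` list (standing in for a reporter that would ship spans to a backend) and `place_order` are all hypothetical; a real tracer client library would provide its own equivalents.

```python
import time
from contextlib import contextmanager

# Stand-in for a reporter that would send finished spans to a tracing backend
finished_spans = []


@contextmanager
def start_span(operation, **tags):
    """Hypothetical helper: times the wrapped block and records it as a span."""
    span = {"operation": operation, "tags": dict(tags), "start": time.time()}
    try:
        yield span
    except Exception as exc:
        # Mark the span as failed, then let the error propagate unchanged
        span["tags"]["error"] = True
        span["tags"]["error.message"] = str(exc)
        raise
    finally:
        span["finish"] = time.time()
        finished_spans.append(span)  # "submit" the span


def place_order(order_id, amount):
    # Manual instrumentation: the developer decides what to wrap and what to tag
    with start_span("place_order", order_id=order_id) as span:
        span["tags"]["amount"] = amount  # custom business data in the payload
        return f"order {order_id} accepted"


place_order(42, 19.99)
print(finished_spans[0]["operation"], finished_spans[0]["tags"])
```

Wrapping the block in a context manager is one way to guarantee the span is finished even when the code raises, which is an easy detail to miss when start and finish are called by hand.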
Some software frameworks and languages offer automatic instrumentation, which saves you the need to modify your application code for many use cases.
Automatic and manual are not mutually exclusive options. In fact, it is recommended to combine the two, to leverage the benefits of a codeless approach where possible, with fine-grained control where required.
Let’s see how to instrument your code with manual and automatic instrumentation, and the considerations for choosing the right instrumentation method for your application.
Automatic instrumentation requires no code changes. This method uses an intelligent agent that attaches to the running application and extracts tracing data.
You can also find auto-instrumentation agents for popular programming languages such as Python, Java, .NET and PHP. In addition, common libraries and frameworks for these languages also offer built-in instrumentation.
Java programmers, for example, can leverage the open source Java Agent for OpenTracing, which automatically instruments third-party libraries such as Spring Web, AWS SDK and various drivers for databases (e.g. Mongo, Cassandra) and queues (e.g. Rabbit, Kafka).
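Real agents rely on language-specific mechanisms (bytecode manipulation in Java, for instance), but the underlying idea can be sketched in a few lines: wrap a library's functions at load time so that calls are recorded without touching the calling code. Everything below (`traced`, `auto_instrument`, `HttpClient`, the `recorded` list) is a hypothetical illustration, not how any particular agent is implemented.

```python
import functools
import time

# Stand-in for the span data an agent would emit
recorded = []


def traced(func):
    """Wrap a callable so each invocation is timed and recorded."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            recorded.append((func.__qualname__, time.perf_counter() - start))
    return wrapper


def auto_instrument(namespace, names):
    """Replace the named callables on `namespace` with traced wrappers."""
    for name in names:
        setattr(namespace, name, traced(getattr(namespace, name)))


class HttpClient:
    # Stands in for a third-party library that knows nothing about tracing
    def get(self, url):
        return f"200 OK from {url}"


# The "agent" patches the library once, at startup
auto_instrument(HttpClient, ["get"])

# Application code is unchanged, yet the call is now recorded
response = HttpClient().get("http://inventory/items")
print(response, recorded[0][0])
```

Note that the wrapper can only see what passes through the function boundary, which hints at why auto-instrumentation tends to be scoped to function or method calls.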
Many popular service meshes and proxies also support automatic instrumentation out of the box.
Let’s look at the considerations for auto-instrumenting our application:
Pros:
- Does not require code changes.
- Provides good coverage of application endpoints and operations.
- Saves time instrumenting your code and lets you focus on the business.
- Reduces the need for code changes due to updates to the instrumentation (such as new metadata captured in the payload).
Cons:
- Not all languages and frameworks provide auto-instrumentation.
- Offers less flexibility than manual instrumentation; spans are typically scoped to a function or a method call.
- Only instruments basic metrics around usage and performance. Business metrics or other custom metrics need to be manually instrumented.
- Often captures error data only as related events or logs within a trace.
Manual instrumentation means the developer needs to add code to the application to start and finish a span and to define the payload. It makes use of client libraries and SDKs, which are available for a variety of different programming languages, as we’ll see below.
Let’s look at the considerations for manually instrumenting our application:
Pros:
- It is the only option in application stacks where auto-instrumentation is not supported.
- Manual instrumentation gives you maximum control over the data that is being generated.
- You can instrument custom code blocks.
- Enables capturing business metrics or other custom metrics within the trace, including events or messages you want to use for monitoring or business observability.
Cons:
- It is time consuming.
- There is a learning curve to perfect it.
- Can cause performance overhead.
- More room for human error, resulting in broken span context.
- Changing the instrumentation may require recompiling the application.
Supported Programming Languages
Jaeger offers client libraries, which support the OpenTracing API specification, for the following programming languages:
Some languages also have agents for auto-instrumentation, which can speed up your instrumentation work (see the section on automatic instrumentation above).
If you can’t find your programming language on that list, don’t worry – it may be under development by the Jaeger community. You can check out the status here.
You can find more information on the supported client libraries and tracers by Jaeger and Zipkin, as well as the OpenTracing APIs per language, in the reference section at the bottom.
The OpenTelemetry project is expected to take over as the standard path for instrumentation. It aims to provide standard SDKs capable of exporting trace data in the OpenTelemetry protocol (OTLP) as well as others such as Jaeger and Zipkin. OpenTelemetry is not yet generally available at the time of writing. A GA release of the tracing portion is currently expected by the end of 2020. As OpenTelemetry matures, the Jaeger community plans to adopt it, probably as soon as Jaeger’s next major release.
Distributed Tracing Protocols
In which format should my application emit spans?
The preferred path for avoiding vendor lock-in and future-proofing your code is to follow an open specification or protocol.
Until recently, that meant primarily the protocols of the popular open source distributed tracing tools Zipkin and Jaeger, which have gained broad industry support and, in Jaeger’s case, also the backing of the Cloud Native Computing Foundation (CNCF).
If you work with a Jaeger backend, your natural choice at present is the Jaeger protocol. If your existing stack contains some legacy components with built-in Zipkin instrumentation, you can use that with your Jaeger backend. The Jaeger collector can accept spans in Zipkin formats, namely Thrift, JSON v1/v2 and Protobuf.
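To give a feel for what "emitting spans in a protocol" means in practice, here is a minimal sketch that serializes one span into the Zipkin v2 JSON shape. The `to_zipkin_v2` helper and the sample IDs, service name and tags are made up for illustration; the field names follow the Zipkin v2 span format, where timestamps and durations are in microseconds and tag values are strings.

```python
import json


def to_zipkin_v2(trace_id, span_id, parent_id, service, operation,
                 start_us, duration_us, tags):
    """Hypothetical helper: build one span in Zipkin v2 JSON shape."""
    span = {
        "traceId": trace_id,           # 16 or 32 lower-hex characters
        "id": span_id,                 # 16 lower-hex characters
        "name": operation,
        "timestamp": start_us,         # epoch microseconds
        "duration": duration_us,       # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": {key: str(value) for key, value in tags.items()},
    }
    if parent_id is not None:
        span["parentId"] = parent_id   # omitted for the root span
    return span


# Zipkin v2 payloads are JSON arrays of spans
payload = json.dumps([
    to_zipkin_v2(
        trace_id="463ac35c9f6413ad48485a3953bb6124",
        span_id="a2fb4a1d1a96d312",
        parent_id="0020000000000001",
        service="payments",
        operation="charge",
        start_us=1600000000000000,
        duration_us=15000,
        tags={"http.status_code": 200},
    )
])
print(payload)
```

A payload like this would typically be sent with an HTTP POST to a Zipkin-compatible endpoint on the collector; the exact host, port and path depend on how your backend is configured, so check its documentation rather than relying on this sketch.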
OpenTelemetry defines a vendor- and tool-agnostic protocol specification called OTLP for transmitting trace, metric and log telemetry data. With that in place, replacing a tool would involve only a configuration change on the backend collector.
Here are some useful guidelines for instrumentation:
- Determine the protocol you wish to use for your trace data, depending on your application stack, tracing backend and use case. I’d recommend future proofing towards OpenTelemetry.
- Leverage auto-instrumentation as much as possible. Identify the frameworks in your application stack that provide built-in instrumentation and start instrumentation there.
- Once out-of-the-box capabilities are fully used, map the gaps in your instrumentation data and observability, and augment with manual instrumentation as needed.
- For manual instrumentation, determine the client libraries and tracers you wish to use for instrumentation, based on your programming language, tracing protocol, tracing backend and use case. I’d recommend choosing ones that are compliant with the OpenTracing API specification.