tracing

While logs can tell us whether a specific request failed to execute or not and metrics can help us monitor how many times this request failed and how long the failed request took, traces help us debug the reason why the request failed, or took so long to execute by breaking up the execution flow and dissecting it into smaller events.

In a microservice architecture, tracing is extremely challenging because requests will span multiple services, each executing one or more processes, across multiple servers. Distributed tracing is a tracing methodology that seeks to overcome this challenge by instrumenting applications in specific junctions in the request’s path and reconstructing its lifecycle.  

Zipkin, based on Google Dapper and initially developed by Twitter, is a Java-based application that is used for distributed tracing and identifying latency issues. Unique identifiers are attached to each request which are then passed downstream through the different waypoints, or services. Zipkin then collects this data and allows users to analyze it in a UI.

collects data

So why hook up Zipkin with the ELK Stack?  

The Zipkin UI provides us with some basic options to analyze traced requests — we can use a dependency diagram to view the execution flow, we can filter or sort the traces collected by Zipkin per application, length of trace and timestamp.

Using ELK though, takes trace analysis up a notch. Elasticsearch can be used for long-term retention of the trace data and Kibana will allow you gain much deeper insight into the data. This article provides you with the steps to integrate the two tools.

Installing Zipkin

The easiest way to install Zipkin is using Docker:

If you want to customize your Zipkin configuration though, it might be easier to pass some variables in the run command. In our case, I want to pass some environment variables and specify Elasticsearch as the storage type (this example assumes a locally running ELK Stack), and so I will use the following commands to download and run Zipkin:

If all goes as expected, you should see this output in your terminal:

Open Zipkin at http://localhost:9411:

find traces

Of course, we have no traces to analyze in Zipkin yet, so our next step is to simulate some requests.

Setting up demo services

Zipkin provides a number of built in instrumentations for Java, Go, Ruby, and #C apps, as well as a decent list of community instrumentations, for collecting tracing information and sending it to Zipkin.

For the sake of demonstrating how to hook up Zipkin with the ELK Stack, I’m going to be using an instrumentation library for Java applications called Brave. Specifically, I will be setting up two Java servlet services that communicate via http. Brave supports a whole lot more but this example will suffice for our purposes (you will need Java JDK 1.8 to run the services in your environment).

First, download the source code:

In this example, there are two services that send tracing info to Zipkin — a backend service and a frontend service. Your next step is to start these services.

The frontend service:

The backend service:

Both services should report the same success message in the run output:

It’s time to send out some requests. The frontend service can be accessed via http://localhost:8081 and calls the backend service and displays a timestamp.

Opening Zipkin again, you should be able to see traces for called requests:

traces for call requests

Because we defined Elasticsearch as the storage type when running Zipkin, the trace data should be indexed in Elasticsearch. To verify, run:

You should see a ‘zipkin*’ index created:

Our next step is to open Kibana and define the new index pattern. Open Kibana at http://localhost:5601, and commence with defining a ‘zipkin*’ index pattern:

zipkin Kibana

As the timestamp field, use the ‘timestamp_millis’ field:

step 2

After creating the index pattern, open the Discover tab and you will be able to see your trace information collected by Zipkin.

discover

As you may notice, some of the fields includes in the trace data are not mapped correctly. You could make do with the existing mapping but if you want to analyze and visualize all available data properly, you will need to do some Elasticsearch mapping adjustments.

For example, to map the ‘remoteEndpoint.port’ field, you would use the following mapping API:

Once you have updated field mapping, refresh the mapping in Kibana to see your trace fields mapped correctly (under management → Index Patterns).

Analyzing trace data in Kibana

Kibana allows you to query the data using various different query types, but things start to get much more interesting when visualizing the data. Here are a few examples.

Average service duration

Using an average aggregation of the ‘duration’ field, you can create some metric visualizations that give you the average time for your services. Use a query to build the visualization on a specific service. In our example:

duration

Average duration per service over time

It’s also interesting to see the trend over time for service duration. Kibana line chart visualizations are perfect for this.

line

Trace list

You can create a data table visualization that gives you a nice breakdown of all the traces collected from your application by Zipkin and stored in Elasticsearch.

To do this you would need to use a sum metric aggregation of the ‘duration’ field and a number of basic bucket aggregations as depicted below.

metrics

You can then add these up into one dashboard that provides a nice overview of your trace data.

dashboards

Endnotes

Logs, metrics and traces are the three underlying tenets of what is being coined today in the industry as observability — instrumenting our environment to provide actionable insight through meaningful data. Of course, applications using microservices make implementing observability a whole lot harder and it’s no surprise that more and more tools are being introduced to help make it easier.   

The ability to sample requests, Zipkin’s instrumentation libraries, as well as its native support for Elasticsearch storage are the main reasons that we at Logz.io use Zipkin to discover latency issues with our services. In a future post we will compare Zipkin to another tool in this space — Jaeger.

Get powerful insights from your data with the Logz.io ELK Stack.