Cracking Performance Issues in Microservices with Distributed Tracing

Distributed Tracing

Microservices architecture is the new norm for building products these days. An application made up of hundreds of independent services enables teams to work independently and accelerate development. However, such highly distributed applications are also harder to monitor.

When hundreds of services are traversed to satisfy a single request, it becomes difficult to investigate system issues, whether a customer request returns a failure code or suddenly becomes very slow to respond.

While logs have long been an established tool for root cause analysis, they fall short in a microservices architecture: hundreds or thousands of services may be involved in a single request, each handling many requests per second.

With log entries scattered across that many services and log files, how can you determine which entries are relevant, or piece them together according to the execution flow?

Distributed Tracing Fundamentals

This challenge has given rise to the discipline of Distributed Tracing. Google released a seminal research paper in 2010 describing its experience building Dapper, its in-house Distributed Tracing system. According to OpenTracing.org:

“Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.”
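To make the idea concrete, here is a minimal sketch of the span data model in plain Python. The field names (`trace_id`, `span_id`, `parent_id`) mirror common conventions but are simplified; real tracing SDKs such as OpenTelemetry carry many more attributes. The `order_by_causality` helper is a hypothetical illustration of how a backend might assemble spans into a parent/child tree for a timeline view.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """A simplified span: one timed operation within a trace."""
    name: str
    trace_id: str                      # shared by all spans in one request
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None    # None marks the root span
    start: float = 0.0                 # seconds relative to trace start
    end: float = 0.0

def order_by_causality(spans):
    """Group spans by parent, with siblings sorted by start time."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start)
    return children

# Simulate one request: a frontend operation calling two downstream services.
trace_id = uuid.uuid4().hex
root = Span("HTTP GET /dispatch", trace_id, start=0.0, end=0.9)
auth = Span("auth.Check", trace_id, parent_id=root.span_id, start=0.1, end=0.3)
db = Span("db.Query", trace_id, parent_id=root.span_id, start=0.3, end=0.8)
tree = order_by_causality([root, auth, db])
```

Walking this tree depth-first, ordered by start time, yields exactly the Gantt-chart layout a tracing UI renders.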

With Distributed Tracing, our application reports tracing data for each service and operation that is invoked as part of the request execution. This data, called spans, is collected by the analytics backend, ordered by causality, and then visualized, typically as a Gantt chart. In the example below, we can see a trace starting with the HTTP GET /dispatch operation invoked on the frontend service and then flowing through a series of services and operations to fulfill the request.

Timeline view visualizing a distributed trace in Jaeger UI. Source: Logz.io Distributed Tracing

In this example, it’s easy to see where most of the time is spent and potential performance inefficiencies, such as a series of sequential calls that, if run concurrently, could reduce the overall request latency.
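The latency win from parallelizing independent calls can be sketched with Python's standard library. The service names and sleep-based stand-ins below are hypothetical; the point is that sequential fan-out costs the *sum* of the call latencies, while concurrent fan-out costs roughly the *maximum*.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_service(name, latency=0.1):
    # Stand-in for a downstream RPC; sleep simulates network latency.
    time.sleep(latency)
    return name

services = ["auth", "inventory", "pricing"]

# Sequential: total latency is the sum of the individual calls (~0.3s here).
t0 = time.perf_counter()
for svc in services:
    call_service(svc)
sequential = time.perf_counter() - t0

# Concurrent: independent calls overlap, so total latency approaches
# the slowest single call (~0.1s here).
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(services)) as pool:
    results = list(pool.map(call_service, services))
concurrent = time.perf_counter() - t0
```

A trace makes this opportunity visible at a glance: sequential calls appear as a staircase in the Gantt chart, while concurrent calls stack on top of each other.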

To learn more about this topic, please read my full article published in DevPro Journal, and watch my recent talk at ContainerDays 2022 conference. If you’re looking to take your distributed tracing practice to the next level, feel free to schedule a Logz.io demo here.
