The Road to OpenTelemetry: How We Got Here

By: Jonah Kowall

May 7, 2020

Monitoring began by using software agents to capture data from infrastructure, operating systems, and applications. These agents would collect metrics and events from these systems to understand the health of the underlying system and the applications. This is what infrastructure monitoring is today.

Each of these components would generate a lot of log data, which was used to diagnose issues. As applications began scaling out, that log data was distributed across many systems, and log analytics became the first centralized debugging system. Most of these systems were used to facilitate business, but as organizations themselves transformed to being digital-first, understanding the software became a business problem.

We needed new technologies to understand the business problems in this new environment. The legacy monitoring systems evolved by developing deeper technologies to instrument applications, mostly for diagnosing problems. This was the evolution of logs towards higher value data needed to debug applications.

We know these products as APM solutions, which is where instrumentation began. These solutions comprised sophisticated agents that would attach to an application and watch the transactions as they executed. The agents would then emit data into a centralized analysis solution.

APM technologies evolved to collect business metrics and not just diagnostic information. The challenge is that these agents need to understand libraries and frameworks to decode the actions in the software. There were many attempts to standardize instrumentation inside libraries and remove that burden from APM vendors. None of these really took off until the open source community had evolved.

While auto-instrumentation is helpful and still in heavy use, software development has evolved. More business-minded engineers and product managers have written instrumentation to collect this data from the software directly and not rely on expensive agents. This allowed businesses to move faster than their competitors. As these new digital-native companies broke out, the landscape shifted.

How Did Open Source Get Involved?

As the mega tech vendors built new businesses, they also open sourced many of their stacks. They invented innovative ways to build software, with major contributions from Netflix, Google, Facebook, Uber, Twitter, and countless others. We highlight these companies because they contributed heavily to monitoring in new ways. Because Google had to deal with scale before anyone else, they invented methods to build and operate software. This required creating new software, methodologies, and even organizations. The creation of SRE concepts was critical in forming the foundation of modern software operations.

Google published many papers along the way about what they were building. These underpin some of today's most popular technologies: Kubernetes is based on Google's Borg, and Prometheus is based on the concepts in Borgmon. One of these papers, the Dapper paper (2010), describes in depth a method of distributed tracing. That work was led by Ben Sigelman, a bright Google engineer, and his team. Ben is now the CEO of LightStep. Based on similar thinking, entrepreneurs and innovators founded several APM companies on these same concepts, including AppDynamics, Dynatrace, and New Relic.

Trace Propagation

The first open source project based on these concepts was Zipkin, created inside Twitter and open sourced in 2012. Zipkin used the first open protocol for propagating traces between instrumented software. This protocol, B3, has been common across many of the open source distributed tracing projects. Trace propagation is critical to making distributed tracing work: the trace context has to be embedded in the protocols the services already use to talk to each other, such as HTTP headers. There is also the challenge of how the collected data is sent to the tool itself. This backend collection hadn't been standardized until recently in OpenTelemetry, which I'll cover below.
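To make propagation concrete, here is a minimal sketch of how a service might read and write trace context using the B3 multi-header form. The header names (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`) come from the B3 specification. The helper functions are hypothetical names for illustration and are not part of Zipkin or any library.

```python
import secrets


def new_id(num_bytes: int) -> str:
    """Return a random lower-hex identifier (B3 uses 8-byte span IDs and 8- or 16-byte trace IDs)."""
    return secrets.token_hex(num_bytes)


def inject_b3(headers, trace_id, span_id, sampled=True, parent_span_id=None):
    """Write the current trace context into outgoing request headers."""
    headers["X-B3-TraceId"] = trace_id
    headers["X-B3-SpanId"] = span_id
    headers["X-B3-Sampled"] = "1" if sampled else "0"
    if parent_span_id is not None:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers


def extract_b3(headers):
    """Read trace context from incoming headers; return None when no trace is present.

    Assumption: `headers` is a plain dict with exact-case keys. Real HTTP
    headers are case-insensitive, so a production version would normalize them.
    """
    trace_id = headers.get("X-B3-TraceId")
    span_id = headers.get("X-B3-SpanId")
    if not trace_id or not span_id:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,  # the caller's span becomes our parent
        "sampled": headers.get("X-B3-Sampled") == "1",
    }


def headers_for_downstream_call(incoming_headers):
    """Continue an existing trace, or start a new one, for an outgoing call."""
    context = extract_b3(incoming_headers)
    if context is None:
        # No upstream trace: this service is the root of a new trace.
        return inject_b3({}, new_id(16), new_id(8))
    # Same trace ID, fresh span ID, caller's span recorded as the parent.
    return inject_b3(
        {},
        context["trace_id"],
        new_id(8),
        sampled=context["sampled"],
        parent_span_id=context["parent_span_id"],
    )
```

The key property is that the trace ID stays constant across every hop while each hop gets its own span ID, which is what lets a backend stitch the spans back into a single trace.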

Instrumentation

One of the biggest challenges of distributed tracing is determining how to extract data from the software. There are many ways to do this, including SDKs or frameworks that ship with instrumentation, or community standards such as ARM (circa 2007), OpenTracing (circa 2015), and most recently OpenTelemetry (circa 2019). These modern attempts have focused on getting instrumentation into the commonly used libraries and frameworks, so that collecting data from software built on them is easier. Recently we've seen this done in every type of library and software system, and even in infrastructure components like proxies, service meshes, and orchestration systems like Kubernetes itself.
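As a rough illustration of what manual instrumentation asks of a developer, the sketch below wraps a unit of work in a span that records its name, attributes, duration, and any error. It uses only the Python standard library. The `span` helper, the `FINISHED_SPANS` list, and the `checkout` function are hypothetical and do not reflect the OpenTracing or OpenTelemetry APIs.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real SDK would hand spans to an exporter.
FINISHED_SPANS = []


@contextmanager
def span(name, **attributes):
    """Time a block of code and record it as a finished span."""
    record = {"name": name, "attributes": dict(attributes), "error": None}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Note the failure on the span, then let the error propagate unchanged.
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        FINISHED_SPANS.append(record)


def checkout(cart_total):
    """Example business function instrumented by hand (tax rate is illustrative)."""
    with span("checkout", cart_total=cart_total) as current:
        current["attributes"]["currency"] = "USD"
        return round(cart_total * 1.08, 2)
```

Every function a team cares about needs a wrapper like this, which is why pushing instrumentation down into shared libraries and frameworks removes so much repeated work.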

Community Instrumentation Initiatives

As Logz.io covered in a previous blog, there have been three major community-based attempts at building standards for instrumentation. As expected, these are slow-moving initiatives with many cooks in the kitchen. Commercial vendors, end users, and standards bodies have all been involved in the process, which has made progress slow and challenging.

OpenTracing

OpenTracing was largely conceived by the team at LightStep, mentioned above. They smartly engaged with the Cloud Native Computing Foundation (CNCF), whose massive marketing engine has made this something folks want to talk about. Unfortunately, some vendors have been using this machine for their commercial benefit, which distinctly goes against the concepts of open source. The challenge with OpenTracing was that the standard focused only on the SDK and developer side and had no standardization for how to get this data into tools. It was also missing auto-instrumentation or easily deployable agents, which meant developers had to do a lot of manual coding to make observability possible. As its name implies, OpenTracing focused only on tracing while ignoring logs and metrics (vital data for observability). This is precisely why the adoption of the technology was limited. Even so, the ecosystem has evolved since its 2015 founding.

OpenCensus

OpenCensus was created internally at Google and was also leveraged inside Stackdriver. The technology went open source in 2018, and at the same time its team expanded to include Microsoft and others. The team built agents that did auto-instrumentation, along with libraries that extract data from software with little to no changes. Unlike other initiatives, OpenCensus included not just tracing but also metrics. More importantly, it had the concept of standard exporters that connect to easily swappable tools. This was a big step toward pluggable observability tools that don't require software changes.
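The exporter idea can be sketched in a few lines: instrumentation code talks to one small interface, and the destination is chosen by configuration. The class and method names below are hypothetical, standard-library-only illustrations and are not the OpenCensus API.

```python
import json
from typing import Protocol


class Exporter(Protocol):
    """Anything that can receive a batch of finished spans."""

    def export(self, spans: list) -> None: ...


class ConsoleExporter:
    """Print each span as a JSON line; handy during local development."""

    def export(self, spans: list) -> None:
        for item in spans:
            print(json.dumps(item))


class InMemoryExporter:
    """Keep spans in a list; stands in here for a vendor backend."""

    def __init__(self) -> None:
        self.received = []

    def export(self, spans: list) -> None:
        self.received.extend(spans)


class Tracer:
    """Buffers spans and hands them to whichever exporter is configured."""

    def __init__(self, exporter: Exporter) -> None:
        self.exporter = exporter
        self._buffer = []

    def record(self, name: str, duration_ms: float) -> None:
        self._buffer.append({"name": name, "duration_ms": duration_ms})

    def flush(self) -> None:
        # Send a copy of the batch, then reset the buffer.
        self.exporter.export(list(self._buffer))
        self._buffer.clear()
```

Switching tools then means constructing `Tracer` with a different exporter; the calls to `record` throughout the application stay the same.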

OpenTelemetry

Although there was competition between OpenTracing and OpenCensus, there were folks involved in both initiatives, since OpenTracing was a developer API specification. OpenCensus didn't have a developer API, but it had many of the other aspects that OpenTracing was missing. The idea of joining both projects made complete sense, but it was also complex. It meant creating yet another set of standards when there were already refined protocols and designs (going back to the creation of the popular Zipkin). Additionally, most of today's tools support Zipkin communication and instrumentation.

Conclusion: The Future of Observability

Tracing and observability are still very much in flux. This important technology should be part of any strategy, and the choices in open source have exploded over the last few years. Maturity is still elusive, and each of these solutions has its own benefits and challenges. Logz.io will support tracing as a core component. This doesn't mean supporting traces as logs; it means providing a user experience for tracing as the open source projects intended. Providing analytics on top of the data will provide substantial benefits. Stay tuned for the public beta soon!