Software monitoring allows developers and IT professionals to observe events occurring within a monitored system. The data gathered by monitoring processes offers visibility into how the monitored entity is behaving and provides warning signs indicating that some aspect of the system deserves greater attention. More and more software is migrating to the cloud, and monolithic software is being decomposed into microservices to create distributed applications. As this happens, it becomes increasingly difficult to observe what your software is doing and how it is doing it. This article will discuss what is meant by observability, how it differs from traditional monitoring practices, and how it can help you. This will include a description of tracing, specifically distributed tracing. We will then investigate two competing standards for implementing software observability: OpenCensus and OpenTracing. We will conclude by examining OpenTelemetry, a new standard that merges the two existing standards into a single framework.
What Is Software Observability?
In engineering, observability is a measure of how well a system’s internal states can be inferred from knowledge of its external outputs. You can improve software observability by using telemetry, the practice of recording what your code is doing as it happens. Unlike traditional monitoring approaches, telemetry does not depend on third-party agents. Instead, you write your own instrumentation code, using open-source libraries or tools provided by the numerous vendors in this space. The downside of this approach is that you must write or adapt additional code and integrate it with your software. The upside is that this code does not force you to comply with rigid workflows or vendor-specific standards.
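As a rough illustration of what writing your own instrumentation code can mean, the sketch below wraps a function so that each call records its name, duration, and outcome. It uses only the Python standard library; the `instrumented` decorator, the `TELEMETRY_LOG` list, and the `lookup_user` function are hypothetical names for this example, and the in-memory list stands in for whatever backend a real telemetry library would send its data to.

```python
import functools
import time

# Hypothetical in-memory sink; a real setup would ship these records to a backend.
TELEMETRY_LOG: list[dict] = []

def instrumented(func):
    """Record the name, duration, and outcome of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "error"
            raise  # telemetry observes the failure but does not swallow it
        finally:
            TELEMETRY_LOG.append({
                "operation": func.__name__,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "status": status,
            })
    return wrapper

@instrumented
def lookup_user(user_id: int) -> str:
    # Stand-in for real work such as a database query.
    if user_id < 0:
        raise ValueError("invalid id")
    return f"user-{user_id}"

lookup_user(42)
try:
    lookup_user(-1)
except ValueError:
    pass

for record in TELEMETRY_LOG:
    print(record["operation"], record["status"], f"{record['duration_ms']:.3f} ms")
```

The instrumented function itself is unchanged; the recording logic lives in one reusable wrapper, which is the kind of extra code the approach above asks you to write and maintain.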
What is Distributed Tracing?
Once you have telemetry in place, you can use it to observe what is happening at the system level. To do this, you will need distributed tracing. Distributed tracing follows the workflow of a transaction, rather than a series of unrelated events, which makes it especially useful in microservices architectures. Instead of monitoring isolated entities, distributed tracing lets you follow a trace: a record of the path an event or request takes across all of the services or systems it touches. Each trace is made up of spans, each of which records the work done on the request by one of the systems it passes through on its way from its point of origin to its final destination. This makes distributed tracing ideal for modern, cloud-based software and distributed architectures such as microservices or serverless computing.
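The following minimal sketch shows how a trace and its spans fit together across three hypothetical services (`frontend`, `orders`, and `inventory`). Each service starts a span that reuses the incoming trace ID and records its caller's span ID as its parent, then passes that context on in a headers dictionary. The header names `x-trace-id` and `x-parent-span-id`, and all function names, are assumptions made for this example; real systems use the propagation format defined by their tracing library, such as the W3C Trace Context `traceparent` header.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str

# Hypothetical collector: every service reports its spans here.
COLLECTED: list[Span] = []

def start_span(service: str, operation: str, headers: dict) -> Span:
    """Continue the trace found in headers, or start a new trace if none is present."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    span = Span(
        trace_id=trace_id,
        span_id=uuid.uuid4().hex[:16],
        parent_id=headers.get("x-parent-span-id"),
        service=service,
        operation=operation,
    )
    COLLECTED.append(span)
    return span

def outgoing_headers(span: Span) -> dict:
    """Propagate the trace context to the next service in the call chain."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}

# Three hypothetical services handling one user request.
def inventory_service(headers: dict) -> None:
    start_span("inventory", "reserve_stock", headers)

def orders_service(headers: dict) -> None:
    span = start_span("orders", "create_order", headers)
    inventory_service(outgoing_headers(span))

def frontend(headers: dict) -> None:
    span = start_span("frontend", "POST /checkout", headers)
    orders_service(outgoing_headers(span))

frontend({})  # no incoming context, so a new trace begins here

for s in COLLECTED:
    print(s.trace_id[:8], s.service, s.operation, "parent:", s.parent_id)
```

Because all three spans share one trace ID and each points to its parent, a backend can reassemble them into the full path of the request, which is what distinguishes a trace from three unrelated log entries.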
Three Tools For Achieving Trace Observability
Observability is achieved by integrating telemetry and distributed tracing tools into your applications. Three of the most popular options are OpenCensus, OpenTracing, and OpenTelemetry.
Like many recent software innovations, OpenCensus started its life at Google as an internal observability platform. Google has since open-sourced it, and it is now available on GitHub. Its supporters include Microsoft and VMware. OpenCensus describes itself as a set of libraries for collecting application metrics and distributed traces. Metrics are data that indicate what is happening within your system. You can use OpenCensus to provide telemetry that collects, for example, the latency of a web service or of access to network resources such as file stores and databases. In addition to metrics, OpenCensus uses traces to track messages, requests, and services from their points of origin to their destinations.
OpenCensus provides libraries for a wide range of high-level languages. Monitored data is collected in real time and can be persisted to your preferred backend and storage technology. This data can be processed to provide analytics and visualizations that show what is happening across your system. It can also help you locate, debug, and fix issues that occur within highly distributed systems. The advantage of OpenCensus is that it gives you everything that you need to achieve observability. The disadvantage is that once you’ve chosen to use OpenCensus, it can be hard to migrate to another platform.
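To make the ideas of recording metrics and persisting them to a backend of your choice more concrete, the sketch below aggregates web-request latencies into a count, a mean, and histogram buckets, then hands a snapshot to a list of pluggable exporters. It is loosely modeled on the way OpenCensus separates recording data from exporting it, but `LatencyView`, `console_exporter`, and `export_all` are hypothetical names written for this example and are not the OpenCensus API.

```python
from bisect import bisect_left
from typing import Callable, List

class LatencyView:
    """Aggregates recorded latencies into a count, a running total, and histogram buckets."""

    def __init__(self, name: str, bucket_bounds_ms: List[float]) -> None:
        self.name = name
        self.bounds = sorted(bucket_bounds_ms)
        # Bucket i counts values <= bounds[i] (and above bounds[i - 1]);
        # the extra final bucket counts values above the largest bound.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total_ms = 0.0

    def record(self, latency_ms: float) -> None:
        self.count += 1
        self.total_ms += latency_ms
        self.bucket_counts[bisect_left(self.bounds, latency_ms)] += 1

    def snapshot(self) -> dict:
        mean = self.total_ms / self.count if self.count else 0.0
        return {"name": self.name, "count": self.count,
                "mean_ms": mean, "buckets": list(self.bucket_counts)}

# An exporter is any callable that ships a snapshot to some backend.
Exporter = Callable[[dict], None]

def console_exporter(snapshot: dict) -> None:
    print(snapshot)

def export_all(views: List[LatencyView], exporters: List[Exporter]) -> None:
    """Send one snapshot per view to every configured exporter."""
    for view in views:
        snapshot = view.snapshot()
        for export in exporters:
            export(snapshot)

# Usage: record some request latencies, then export them to two "backends".
web_latency = LatencyView("web/request_latency", [10, 50, 100])
for ms in [4.2, 12.5, 48.0, 230.0]:
    web_latency.record(ms)

exported: List[dict] = []  # stands in for a storage backend
export_all([web_latency], [console_exporter, exported.append])
```

Because the recording code never refers to a specific backend, switching storage technologies means supplying a different exporter rather than changing how latencies are recorded.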
OpenTracing, a project hosted by the Cloud Native Computing Foundation (CNCF), was developed separately from OpenCensus. Like OpenCensus, OpenTracing aims to provide a standardized approach to observability. It differs from OpenCensus in two important ways. First, as its name suggests, it emphasizes tracing (particularly distributed tracing) over metrics and telemetry.
This is not a problem if your main goal is to track requests, but this tool is less helpful if you want to monitor more traditional types of system resources and measure performance. In other words, if you want your software to achieve full observability, then you may need to incorporate elements from OpenCensus into your approach. Otherwise, you will either need to use telemetry products from third-party vendors or build your own solution from scratch.
The other key difference between the two platforms is that OpenTracing is not an observability platform in the traditional sense. OpenCensus provides both a specification and libraries covering nearly all of the popular high-level languages, along with components for building tools on top of them. OpenTracing, by contrast, defines a vendor-neutral API specification that libraries and frameworks can be instrumented against, without supplying the tracer implementation that actually records and exports the data. In other words, OpenTracing tells you what to do, but not how to do it.
While this approach is highly flexible, it has one major drawback: it leaves the implementation details to vendors and developers.
Until now, the best approach to achieving full observability was to build a hybrid solution by combining the two largest platforms in this space. By using OpenTracing to handle tracing and OpenCensus for telemetry and tooling, you could create a complete solution. The leaders of both observability platforms must have had the same idea, because they recently decided to join forces. In May 2019, the Cloud Native Computing Foundation announced the merging of the OpenTracing and OpenCensus projects into a new standard called OpenTelemetry.
In theory, this merger is a good thing, since the stated aim of OpenTelemetry is to provide a single set of APIs, libraries, agents, and collection services for capturing metrics and traces. The project also promises to support current analytics tools, such as Prometheus (for metrics) and Jaeger (for traces). The merger could also make observability easier to achieve.
The merger also creates a climate of uncertainty, however, since it is still too early to tell what the combination of these two very different platforms will look like in practice. The OpenTelemetry project has stated that it expects to be ready to replace both platforms sometime in 2020, and that existing codebases will continue to be supported until then.
Conclusion: The Future of Observability
In this article, we looked at traditional approaches to software monitoring and why they fall short in an age of distributed, cloud-based systems and microservice architectures. Next, we introduced the concept of observability and how it can be achieved through telemetry and distributed tracing. In the final section, we looked at how to achieve observability using the existing standards (OpenCensus and OpenTracing) and their proposed replacement, OpenTelemetry.
Observability is an important step in improving the quality and reliability of your software while minimizing downtime. Until the OpenTelemetry project delivers production-ready tools and libraries, however, it is hard to determine conclusively which of the current standards is best to use.
While we wait to see how OpenTelemetry evolves, the best way to get started with observability is to begin with OpenCensus, the more established of the existing standards. Let’s hope that OpenTelemetry provides a clear migration path and the necessary tools to make it the obvious solution in the near future.