Software monitoring allows developers and IT professionals to observe events, metrics, and communications occurring within or between monitored systems. The data gathered through monitoring offers visibility into how the monitored entities are behaving. It thus provides warning signs that indicate which parts of a system are changing behavior and warrant investigation. More and more software is migrating to the cloud, and monolithic software is being decomposed into microservices to create distributed applications. As this occurs, it’s getting harder to observe what infrastructure is doing.
This article will discuss the definition of observability, how it differs from traditional monitoring practices, and how it can help you. We will then investigate two recent, competing standards for implementing software observability: OpenCensus and OpenTracing. Finally, we will conclude the article by examining OpenTelemetry, a new standard that merges the two existing standards into a single framework.
Much work has been done: there are now many frameworks (Zipkin, Dropwizard, Micrometer, etc.) and many libraries available.
However, these new standards have gotten the most industry visibility thanks to the Cloud Native Computing Foundation’s support for these projects, not to mention the millions of marketing dollars behind them. That visibility has attracted many more monitoring and cloud vendors. As you might have guessed, more participants make creating standards more complicated.
What Is Software Observability?
In engineering, observability is a measure of how well a system’s internal states can be inferred from knowledge of its external outputs.
You can improve software observability by using telemetry, the process of reading and recording measurements from instruments. In the context of infrastructure and applications, this means capturing and measuring data from components and from additional measurement software. You can also add telemetry to code manually, either by writing your own instrumentation code or by using open source libraries or tools.
The downside of the manual approach is that it involves writing or adapting additional code and integrating it with your software. This may introduce faults or performance issues, but it results in deeper measurement of internal or business processes.
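To make the manual approach concrete, here is a minimal sketch of hand-written instrumentation in Python. It uses only the standard library, and every name in it (`METRICS`, `timed`, `checkout`) is hypothetical and chosen for illustration. It is not the API of any framework mentioned in this article.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store. A real setup would ship these
# measurements to a monitoring backend instead of keeping them here.
METRICS = defaultdict(list)


def timed(name):
    """Record the duration of every call to the wrapped function under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on failure, so failed calls are timed too.
                METRICS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("checkout.total")
def checkout(prices):
    """Stand-in for a business process worth measuring."""
    return sum(prices)


total = checkout([5, 10, 20])
```

Even this small decorator shows the trade-off described above: the wrapper adds extra code to every call, but it can measure a business-level operation that an external agent would not know about.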
What is Distributed Tracing?
Once you have telemetry in place, you can use it to observe what is happening at the system level. To do this, you will need distributed tracing, a set of techniques for following a single transaction, which often requires interactions with multiple downstream components.
Instead of monitoring isolated events or entities, distributed tracing lets you create a trace: a record that is built up across all the components a single transaction interacts with.
Each trace includes a number of spans, each of which records a single component interaction. This makes distributed tracing well suited to any distributed system, and with the explosion of microservices architectures it has become a necessary technology for analyzing modern, cloud-native software.
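The relationship between a trace and its spans can be sketched as a small data structure. The Python below is an illustrative, in-process model only. The `Trace` and `Span` classes and the component names are assumptions made for this example, and a real tracing system would also carry the trace ID across process boundaries (for example in request headers), which is not shown here.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One component interaction within a trace."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    component: str
    start: float
    end: Optional[float] = None


@dataclass
class Trace:
    """All spans recorded for a single transaction."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)
    _stack: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, component):
        # The span currently open (if any) becomes the parent of the new one.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent_id,
                       component, time.time())
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.time()
            self._stack.pop()


# One transaction passing through three hypothetical components.
trace = Trace()
with trace.span("api-gateway"):
    with trace.span("orders-service"):
        with trace.span("orders-db"):
            pass
```

Every span shares the same `trace_id`, and the `parent_id` links let a backend reassemble the call tree for the transaction.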
Three Tools For Achieving Trace Observability
The newest standards pushed on the community by the Cloud Native Computing Foundation have been OpenTracing and OpenTelemetry. Google was behind OpenCensus, and Microsoft quickly joined in the creation of these technologies.
Like many recent software innovations, the trace-collecting tool OpenCensus started its life at Google as an internal observability platform. Google has since open-sourced it, and it is now available on GitHub. Soon after its creation, Microsoft began contributing to and steering the standard. OpenCensus is described as a platform for collecting metrics, which are data that indicate what is happening within your system. You can use OpenCensus to report telemetry to pluggable backends, which are the monitoring or observability platforms. In addition to metrics, OpenCensus uses traces to track messages, requests, and services from their points of origin to their destinations. OpenCensus relies primarily on auto-instrumentation agents built by the community. There is no developer API to embed OpenCensus into code.
You can process this data to provide analytics and visualizations that show what is happening across your system. With the right tools, this technology can help you locate, debug, and fix issues in distributed systems.
OpenTracing has the backing and marketing power of the Cloud Native Computing Foundation (CNCF). The goal of OpenTracing is to provide a standardized developer API for tracing generally and distributed tracing specifically.
This allows easy embedding of instrumentation into common libraries or easy custom coding into applications. Unfortunately, tracing is only one part of observing a system, so there are major gaps when using OpenTracing by itself.
While this approach is extremely flexible, it does have one massive drawback: it leaves the implementation details up to vendors and developers, and those implementations have been largely inconsistent. In the several years of this project, there have been numerous breaking changes in the implementation. This is a big problem, since once libraries are compiled with the instrumentation, it’s very hard to make changes.
OpenTelemetry merges OpenCensus and OpenTracing into a single project. In theory, this merger is a good thing, since the stated aim of OpenTelemetry is to provide a single set of APIs, libraries, agents, and collection services for capturing metrics and traces.
The project also promises to support current analytics tools and make pluggable backends a reality, so that data can be sent to one or more tools: for example, Prometheus and Elasticsearch for metrics, and Jaeger, Zipkin, Skywalking, and others for traces.
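The idea of pluggable backends can be illustrated with a small routing sketch in Python. This is not the OpenTelemetry API. The `Exporter`, `InMemoryExporter`, and `Pipeline` names are hypothetical, and the in-memory exporters simply stand in for real metrics and trace backends.

```python
from collections import defaultdict
from typing import Dict, List, Protocol


class Exporter(Protocol):
    """Anything that can accept a telemetry record counts as a backend."""
    def export(self, record: Dict) -> None: ...


class InMemoryExporter:
    """Stand-in backend that just remembers what it was sent."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.received: List[Dict] = []

    def export(self, record: Dict) -> None:
        self.received.append(record)


class Pipeline:
    """Routes each record to every exporter registered for its kind."""
    def __init__(self) -> None:
        self._routes: Dict[str, List[Exporter]] = defaultdict(list)

    def register(self, kind: str, exporter: Exporter) -> None:
        self._routes[kind].append(exporter)

    def emit(self, kind: str, record: Dict) -> int:
        exporters = self._routes.get(kind, [])
        for exporter in exporters:
            exporter.export(record)
        return len(exporters)  # number of backends that received the record


# Two hypothetical metrics backends and one trace backend.
metrics_a = InMemoryExporter("metrics-backend-a")
metrics_b = InMemoryExporter("metrics-backend-b")
traces = InMemoryExporter("trace-backend")

pipeline = Pipeline()
pipeline.register("metric", metrics_a)
pipeline.register("metric", metrics_b)
pipeline.register("trace", traces)

delivered = pipeline.emit("metric", {"name": "http.requests", "value": 1})
```

Because the instrumented code only calls `emit`, backends can be added or swapped by changing registrations rather than by touching the instrumentation itself.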
The merger creates a climate of uncertainty, however, since it is still too early to tell what the combination of these two very different platforms will look like in practice. There are multiple working groups in progress along with many vendors involving themselves in several aspects.
Logz.io is participating in this community to ensure the best possible support and outcomes.
Conclusion: The Future of Observability
Remember that previous implementations of observability are valid in many ways. They are solving specific problems. The idea of having more common APIs, data formats, and other defined standards will provide users with more choices in the future.
This landscape is still evolving and the outcomes are still uncertain. Standards move very slowly, and the outcomes often get clouded in the process.
What we do know is that distributed, cloud-based systems and microservice architectures need this visibility. Observability is an important step in improving the quality and reliability of your software while minimizing downtime.
Logz.io is currently in beta with support for Jaeger as of April 2020. Please reach out if you are interested in testing our new capabilities, which complement the Observability platform built on our unified data store. The platform provides the best capabilities from Grafana and Kibana along with our Application and Cognitive Insights capabilities.