A Guide to Mastering Metadata for Full Observability

People tend to think of observability through the lens of the core telemetry signals: logs, metrics and traces. In doing so, they often overlook metadata, which is crucial for full observability.

Metadata is the “data about data” that gives us additional context about our telemetry. This could be technical context such as the IT environment on which the application is running, or business context such as the customer account from which a request is invoked.

When looking at the adjacent Data Observability domain, you can see that metadata is in fact positioned as one of the pillars. This is not much different in DevOps observability. After all, observability is a data analytics problem. It’s time to unveil the power of metadata in observability.

Metadata types

Metadata gives us context on the entity producing the telemetry, whether that is an infrastructure resource or the application code. On the infrastructure side, this metadata can include geolocation, environment (prod, staging, etc.), instance name, hostname or cloud region, to name a few common examples.

Typically, operations teams also tag the underlying machines, Kubernetes nodes or other infrastructure resources with resource labels and annotations, which may provide additional relevant infrastructure context for telemetry emitted from those resources. Kubernetes, for instance, exposes an API server for fetching, validating and configuring that metadata. We will explore the Kubernetes use case in more detail later.

Application metadata can give a wide range of context on what goes on in the code itself. Some instrumentation libraries can provide basic out-of-the-box context about the module, package or class that produced the telemetry. Application developers then add custom metadata relevant to the business logic they develop.

Once we have the metadata on our telemetry, we can apply data analytics to it: filtering, grouping, querying or alerting based on the metadata. Let's look at the various telemetry data types and how they can be enriched with relevant metadata.

Log enrichment with metadata

We'll start with logs, the most commonplace telemetry. Logs are verbose and typically contain a lot of intrinsic textual context. Too often, though, these gems are hidden inside the message body, which is hardly a best practice. To enable true data analytics, structure your logs in a meaningful way so that important metadata is placed in its own designated fields.

Typically, logging frameworks enrich logs out of the box with some table-stakes context, such as the log level and the timestamp, which appear in designated fields with conventional formatting for easy querying and filtering. Some also enrich logs with infrastructure-related context such as the Docker container or Kubernetes pod.

On top of that, you should add your own custom metadata to enhance your observability. For example, adding the user ID as metadata can help root-cause a customer-specific issue. Similarly, adding the build number can greatly help map errors to a problematic build version.

{
  "log_level": "INFO",
  "type": "checkoutservice",
  "message": "payment went through (transaction_id: 437d3582be51)",
  "docker": {
    "container_id": "bfc6ddc97255511"
  },
  "env_id": "Astronomy-Shop",
  "timestamp": "2023-08-07T19:17:28.887170752Z"
  "user_id": "dksf70ts",
  "build": "20513097",
}

While logs are oftentimes unstructured, take the time to design your metadata and to structure it with well-defined semantic conventions. Effective data enrichment can turn your logs into more meaningful events.
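As a rough illustration, here is a minimal sketch of structured logging in Python, using only the standard library's logging and json modules. The field names (user_id, build, env_id) mirror the JSON example above and are placeholders you would adapt to your own semantic conventions.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a JSON object with designated metadata fields."""
    def format(self, record):
        payload = {
            "log_level": record.levelname,
            "timestamp": self.formatTime(record),
            "message": record.getMessage(),
            # Custom metadata passed via the `extra` argument, if present.
            "user_id": getattr(record, "user_id", None),
            "build": getattr(record, "build", None),
            "env_id": getattr(record, "env_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("checkoutservice")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Enrich the log event with business context instead of burying it in the message body.
logger.info("payment went through",
            extra={"user_id": "dksf70ts", "build": "20513097", "env_id": "Astronomy-Shop"})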

Distributed tracing metadata enrichment

In today's systems, satisfying a single request may involve dozens of API calls and queries across microservices and databases. Distributed tracing is quickly gaining popularity as a way to pinpoint errors and latency issues in microservices. A trace represents a request's execution flow and consists of spans representing the individual operations in that flow.

At the heart of distributed tracing lies the notion of "trace context" and its propagation through the system. This context consists of metadata such as globally unique identifiers that tie each span to the unique request it is part of, along with timing information such as start time and duration. This notion is formalized in the W3C Trace Context specification, among others.

The trace context is attached as metadata to each span, and effective propagation of this context through the system's calls is vital for reconstructing the full trace at the backend and for establishing the causal relationships between the invoked operations.

Tracing instrumentation client libraries typically take care of creating the trace context, such as assigning the trace ID, span ID, and the start time and duration of each operation invocation. Application developers can add custom metadata through manual instrumentation, to capture diverse additional context such as the URL of an HTTP request or the SQL statement of a database query. This flexibility, coupled with the context propagation mechanism, provides a powerful tool that can serve observability not just in IT, but also in developer experience, business and FinOps use cases, such as propagating the customer ID or owning team down to the backend processes.

SQL Select statement in a trace context. Source: Logz.io Distributed Tracing
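With OpenTelemetry's Python API, for example, such manual enrichment might look roughly like the sketch below. The attribute and baggage keys (app.customer_id, app.owning_team) are illustrative assumptions rather than official conventions, and the snippet assumes the OpenTelemetry Python packages are installed and a tracer provider is configured.

from opentelemetry import baggage, trace

tracer = trace.get_tracer("checkoutservice")

def handle_checkout(customer_id: str, sql_statement: str):
    # Start a span for this operation; the SDK assigns the trace ID, span ID and timing.
    with tracer.start_as_current_span("checkout") as span:
        # Custom metadata attached to the span as attributes.
        span.set_attribute("app.customer_id", customer_id)
        span.set_attribute("db.statement", sql_statement)

        # Baggage propagates key-value context to downstream services
        # alongside the W3C trace context headers.
        ctx = baggage.set_baggage("app.owning_team", "payments")
        # ... invoke downstream calls within `ctx` ...
        return ctx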

Metrics metadata enrichment and the Kubernetes use case

Unlike logs, metrics are essentially numerical data points with little inherent context. This makes the data enrichment process even more crucial with metrics. The good news is that many metrics collection agents provide out-of-the-box metadata enrichment of metrics based on context gathered by the agent, which often runs co-located with the monitored application. 

Let's look at the example of Prometheus, the popular open source monitoring tool. Prometheus' Service Discovery (SD) mechanism knows a lot about the scrape targets (the entities from which Prometheus pulls metrics), and adds this metadata as labels to the scraped metrics, which the targets expose in the Prometheus or OpenMetrics format.

Let's look at the common use case of monitoring Kubernetes containerized workloads. Prometheus' Kubernetes SD exposes metadata for Kubernetes resources such as node, service, pod, endpoints, and ingress. Here is an example of the metadata provided for the Kubernetes node resource type:

__meta_kubernetes_node_name: The name of the node object.
__meta_kubernetes_node_provider_id: The cloud provider's name for the node object.
__meta_kubernetes_node_label_<labelname>: Each label from the node object.
__meta_kubernetes_node_labelpresent_<labelname>: true for each label from the node object.
__meta_kubernetes_node_annotation_<annotationname>: Each annotation from the node object.
__meta_kubernetes_node_annotationpresent_<annotationname>: true for each annotation from the node object.
__meta_kubernetes_node_address_<address_type>: The first address for each node address type, if it exists.

The convention in Prometheus is to prefix these metadata label names with __meta_. In Prometheus, everything attached to a metric is essentially a label, a key-value pair, whether intrinsic or extrinsic to the measured metric.

A Kubernetes node oftentimes has its own metadata, in the form of labels and annotations assigned by the operations team to denote logical classification, designation or similar context, which can be valuable during investigations. In the example above, you can see that these labels and annotations from the scraped Kubernetes node object are also fetched as metric metadata, providing additional context.

In addition to Kubernetes, Prometheus provides a wide range of SDs for other infrastructure types (‘scrape targets’ in Prometheus terms), such as AWS EC2, Azure, and Docker. And of course Prometheus is just one example. Other tools, platforms and vendors expose metadata in various formats.
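To see what service discovery has picked up for your own targets, Prometheus' HTTP API exposes each scrape target along with its discovered labels. Below is a rough Python sketch using the requests library, assuming a Prometheus instance is reachable at http://localhost:9090.

import requests

# Query the Prometheus HTTP API for scrape targets (assumed local instance).
resp = requests.get("http://localhost:9090/api/v1/targets")
resp.raise_for_status()

for target in resp.json()["data"]["activeTargets"]:
    # 'discoveredLabels' holds the raw __meta_* labels from service discovery,
    # before relabeling; 'labels' holds the labels attached after relabeling.
    meta = {k: v for k, v in target["discoveredLabels"].items() if k.startswith("__meta_")}
    print(target["scrapeUrl"], meta)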

It could be wasteful to replicate and attach the same static data, such as the environment, hostname and region, to each individual emitted metric. In such cases, a central approach to static metadata could be a good option. For example, the OpenMetrics specification defines target_info:

target_info{env="prod",hostname="host1",datacenter="dc3",region="europe",owner="frontend"} 1

The metric value in this case is irrelevant; the metric is used to encapsulate the metadata as labels, which you can then 'join' into queries on other metrics.
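With the Python prometheus_client library, such a static metadata metric can be exposed with the Info metric type, as in this minimal sketch; the label values are the illustrative ones from the line above, and the port is an assumption.

from prometheus_client import Info, start_http_server

# Expose a `target_info` metric that carries static metadata as labels
# (the Info type appends the _info suffix on exposition).
target = Info("target", "Static metadata about this target")
target.info({
    "env": "prod",
    "hostname": "host1",
    "datacenter": "dc3",
    "region": "europe",
    "owner": "frontend",
})

# Serve the /metrics endpoint so Prometheus can scrape it;
# in a real service the process keeps running, so the endpoint stays up.
start_http_server(8000)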

So far we've seen how we can enrich individual observability signal types with metadata, and the benefits that brings. But this is only the beginning. The full power of observability is unleashed when we combine these signals together.

Enabling cross-signal correlation with metadata

So far we've explored the use of metadata to provide extrinsic context from outside the telemetry itself. Metadata can also help correlate the different telemetry signal types. This correlation context is essential for breaking down data silos and enabling cross-cutting queries and true data analytics.

One common practice is to add the trace ID as metadata to all the logs, to enable easy log-trace correlation. Better yet, you can emit your application logs as part of the trace data, instead of emitting two different telemetry types. You can output the logs as part of the span, so they are inherently contextualized to that specific operation of that specific request execution flow.
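A rough sketch of the first approach with OpenTelemetry's Python API: read the active span's context and stamp its trace ID onto each log line. The trace_id field name is a common choice rather than a mandated one, and the sketch assumes your log formatter emits it (as in the logging sketch earlier).

import logging
from opentelemetry import trace

logger = logging.getLogger("checkoutservice")

def log_with_trace_context(message: str) -> None:
    # Fetch the currently active span and its trace context
    # (the trace ID is all zeros if no span is active).
    span_context = trace.get_current_span().get_span_context()
    # Format the 128-bit trace ID as the conventional 32-character hex string.
    trace_id = format(span_context.trace_id, "032x")
    logger.info(message, extra={"trace_id": trace_id})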

Another best practice is to add exemplars to metrics, to enable metric-trace correlation. The exemplars mechanism, which has been incorporated into Prometheus, OpenTelemetry and other tools, attaches a trace ID to metric samples, enabling an easy jump from a metric to a sample trace.
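In the Python prometheus_client, for instance, an exemplar can be attached when a metric is incremented, roughly as below. The metric name is illustrative, exemplars are only exposed over the OpenMetrics exposition format, and the exact API may vary between client versions.

from prometheus_client import Counter

# Illustrative counter; prometheus_client appends the _total suffix on exposition.
CHECKOUTS = Counter("checkout_requests", "Number of checkout requests")

def record_checkout(trace_id: str) -> None:
    # Attach the current trace ID as an exemplar, so a backend that
    # understands exemplars can link this sample to a trace.
    CHECKOUTS.inc(1, exemplar={"trace_id": trace_id})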

Other custom correlations can be implemented in a similar fashion using metadata, to create meaningful associations and to speed up incident investigation and root cause analysis. Going back to our Kubernetes use case as an example, you can capture Kubernetes events via the Kubernetes API and plot them as annotations on top of your system metrics to spot correlations.
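Here is a rough sketch of fetching such events with the official Python kubernetes client, assuming kubectl-style credentials are available locally.

from kubernetes import client, config

# Load credentials the same way kubectl does (assumes a local kubeconfig).
config.load_kube_config()
v1 = client.CoreV1Api()

# List recent events across all namespaces, e.g. deployments or ConfigMap updates.
for event in v1.list_event_for_all_namespaces(limit=20).items:
    print(event.last_timestamp, event.involved_object.kind,
          event.involved_object.name, event.reason, event.message)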

In the screenshot below you can easily see that the pod memory usage changes right after a new deployment was made.

Increase in memory usage viewed after a deployment event. Source: Logz.io Kubernetes 360

Similarly, you could triage issues following an update of a ConfigMap, a deletion of a Secret, or any other change to a Kubernetes resource flagged as a Kubernetes event. In fact, events are so valuable that many consider them a distinct observability signal type alongside logs, metrics and traces (known together as MELT).

We see the power of metadata and correlation, but how can we achieve consistency of metadata across different telemetry sources and types, so that correlation and investigation are truly effective?

Standardizing metadata with OpenTelemetry

OpenTelemetry is an open source framework and standard for generating and collecting telemetry data across the different telemetry types. Among others, OpenTelemetry defines an open, unified specification for telemetry data, which enables consistent metadata across logs, metrics and traces. This consistency facilitates cross-signal correlation, such as isolating all the signals originating from a given node, or finding the logs for a given trace.

The OpenTelemetry specification defines Attributes: key-value pairs that can be attached to the signal itself, to the instrumentation scope, or to the resource, each with its respective semantic conventions.
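For instance, with the OpenTelemetry Python SDK, resource attributes can be declared once and attached to every signal the process emits. The sketch below uses well-known semantic convention keys; the values are illustrative.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource attributes describe the entity producing the telemetry and are
# attached to every span (and, with the matching providers, metrics and logs).
resource = Resource.create({
    "service.name": "checkoutservice",
    "service.version": "20513097",
    "deployment.environment": "prod",
})

trace.set_tracer_provider(TracerProvider(resource=resource))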

Let's look at the Kubernetes use case again. The OpenTelemetry Collector has a Kubernetes Attributes Processor that automatically sets resource attributes on spans, metrics and logs using Kubernetes metadata. The processor automatically discovers the Kubernetes pods, extracts metadata from them and adds the extracted metadata to the telemetry as resource attributes.

Here’s an example showing various pod metadata that can be configured in the Kubernetes Attributes Processor:

k8sattributes:
  filter:
    node_from_env_var: KUBE_NODE_NAME
  extract:
    metadata:
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.deployment.name
      - k8s.namespace.name
      - k8s.node.name
      - k8s.pod.start_time
    labels:
      - tag_name: app.label.component
        key: app.kubernetes.io/component
        from: pod
  pod_association:
    - sources:
        - from: resource_attribute
          name: k8s.pod.ip
    - sources:
        - from: resource_attribute
          name: k8s.pod.uid
    - sources:
        - from: connection

With the Kubernetes metadata in place, you can perform a powerful investigation of your Kubernetes workloads.

Logz.io Kubernetes 360

The OpenTelemetry Collector supports many other infrastructure types. In fact, it can even detect the type of infrastructure for you, using the Resource Detection Processor. This processor supports multiple infrastructure types, including Kubernetes, Docker, Heroku, OpenShift and various cloud providers including AWS, Azure and GCP.

Endnote

Observability is a data analytics problem, and metadata plays a key role. Metadata illuminates your telemetry with additional context on the underlying infrastructure, the internals of the executed application, or the business scope. 

Bake data enrichment into your telemetry instrumentation and collection pipeline to add this extrinsic context to your logs, metrics and traces, and to enable correlation between these signals. Leverage open standards and specifications such as OpenTelemetry to provide the needed consistency across your metadata.
