Monitoring Microservices the Right Way

By: Dotan Horovits

January 5, 2021

Key Takeaways

Modern systems are more complex to monitor. They are highly dynamic in nature, and emit huge amounts of high cardinality telemetry data, especially when based on cloud-native and microservices architectures.
New requirements need to be considered when choosing a monitoring solution for the job. These include scalability, query flexibility and metrics collection.
Common monitoring tools and conventions previously fell short of the new requirements, but new ones have emerged, most notably the Prometheus open source project, which has gained massive adoption.
Scalability is Prometheus’s Achilles’ heel. Current monitoring needs a clustered solution that can hold historical data long term, without sacrificing data granularity with aggressive downsampling.
A new generation of open-source time series databases (TSDB) hold the key to overcoming the scalability challenge by providing long-term storage.
Evaluating the new requirements and the new tools, spearheaded by the open source community, against your system’s architecture can put you on the right track when choosing a monitoring solution for your needs.

If you’ve migrated from a monolith to a microservices architecture, you have probably experienced it:

Modern systems today are far more complex to monitor.

Microservices combined with containerized deployment result in highly dynamic systems with many moving parts across multiple layers.

These systems emit massive amounts of highly dimensional telemetry data from hardware and the operating system, through Docker and Kubernetes, all the way to application and business performance metrics.

Many have come to realize that the commonly prescribed Graphite+StatsD monitoring stack is no longer sufficient to cover their backs. Why is that? And what is the current approach to monitoring?

In this post I will look at the characteristics of modern systems and the new challenges they raise for monitoring. I will also discuss common solutions for monitoring, from the days of Graphite and StatsD to the currently dominant Prometheus. I’ll review the benefits of Prometheus as well as its limitations, and how open source time series databases can help overcome some of these limitations.

The Challenges of Modern Systems

Let’s have a look at the changes in software systems and the challenges they introduce in system monitoring.

1. Open Source and Cloud Services

Open source and SaaS increased the selection of readily available libraries, tools, and frameworks for building software systems, including web servers, databases, message brokers, and queues. That turned system architecture into a composite of multiple third-party frameworks. Moreover, many frameworks are now clustered solutions with their own suite of moving parts (just think Hadoop or Kafka, compared to traditional MySQL).

The Challenge: Monitoring systems now need to integrate with a large and dynamic ecosystem of third-party platforms to provide complete observability.

2. Microservices instead of a Monolith

Following a microservice architecture, a typical monolith application would be broken down into a dozen or more microservices, each one potentially running its own programming language and database, each one independently deployed, scaled and upgraded.

Uber, for example, reported in late 2014 over 4,000 proprietary microservices and a growing number of open source systems, which posed a challenge for their monitoring system.

The Challenge: A surge in the number of discrete components you need to monitor.

*The complexity of microservices illustrated through the Amazon and Netflix Deathstars (credit: Amazon and Netflix)*

3. Cloud Native Architecture and Kubernetes

Cloud Native architectures, based on containers and Kubernetes, have grown in popularity for running microservices, but have also added yet another layer of complexity. Now you need to monitor applications spanning multiple containers, pods, and namespaces, potentially over fleets of clusters.

The container framework itself is now a vital system you need to monitor: Kubernetes cluster metrics, node metrics, pod and container metrics, as well as Kubernetes’ own control plane services.

The Challenge: In containerized workloads you have to monitor multiple layers and dimensions:

Infrastructure metrics, such as host CPU and memory
Container runtime and Kubernetes metrics, such as running pods and node resource utilization
Application metrics, such as request rate and duration.

Background on Monitoring Tools

Now that we have seen WHAT we need to monitor in our systems, let’s talk about HOW we monitor them.

Not so long ago a common monitoring practice in the industry was based on the open source combination of Graphite, StatsD and Grafana. In this setup:

System components send metrics to StatsD in a fairly simple way; StatsD performs some aggregation and then flushes them to Graphite.
Graphite uses Carbon to receive and aggregate the metrics from StatsD, then stores them in its Whisper time series database.
Grafana provides dashboarding and visualization on the data.
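
As a rough illustration of the push model in this setup, the sketch below formats a metric in the StatsD line protocol (name:value|type) and sends it over UDP. The host, the port (8125 is the conventional StatsD default), and the metric names are assumptions for illustration only.

```python
import socket


def format_statsd(name: str, value: int, metric_type: str = "c") -> str:
    """Build one StatsD line: <name>:<value>|<type> (c = counter, g = gauge, ms = timer)."""
    return f"{name}:{value}|{metric_type}"


def push_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send to the collector at host:port (assumed address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, formatted but not sent.
counter_line = format_statsd("checkout.orders.placed", 1)
timer_line = format_statsd("checkout.request.duration", 250, "ms")
```

Calling push_metric(counter_line) would emit the datagram. Note that every component has to know the collector address up front, which is the configuration burden discussed later in this post.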

Modern systems brought new requirements which existing solutions weren’t equipped to handle. In fact, when the engineering team at SoundCloud moved to a microservices architecture and encountered limitations with their existing Graphite and StatsD monitoring, that drove them to create the Prometheus open source project.

Since then, Prometheus and Grafana open source have emerged as a popular combination for monitoring. Prometheus is ranked 66th on DB-Engines, while Graphite is behind in 78th place, per the November 2020 report. At the time of writing, Prometheus has 32k stars on GitHub with 4.9k forks, while Grafana has 36.1k stars with 7.2k forks.

*In a way, the transition from Graphite to Prometheus represents the transition to modern monitoring practices.*

The Solution: New Monitoring Capabilities for Microservices

As we saw, systems have undergone significant transformation that challenged traditional monitoring approaches. In order to meet these challenges, modern monitoring systems had to gear up with new capabilities:

Flexible querying with high cardinality
Efficient automated metric scraping
Scalability to handle large volume of metrics

Let’s look into these new requirements and the available monitoring approaches to meet them:

Flexible Querying over High Cardinality

The Solution: The introduction of flexible querying based on tags makes it simpler to slice and dice the data along multiple dimensions like service, instance, endpoint, and method.

Back in the Graphite days, we would work with a hierarchical dot-concatenated metrics naming format. For example, a Kafka metric may look like this:

PROD.KafkaBrokers.us-east-1.1a.kafka1.load.load.shortterm

The hierarchical naming worked well for static machine-centric metrics, but it turned out to be too strict and cumbersome to express the dynamic services metrics with the high cardinality of modern systems.

For example, cloud-native environments assume that at any given moment a container may crash, a pod may be evicted, or a node may die, resulting in a hostname or instance name change, which in Graphite will result in a new set of metrics.

And there’s also the dynamics of deployment changes. For example, if system dynamics require another level in the metrics hierarchy, such as when moving from a single region to multi-region (like “us-east-1” in the middle of the metric name in the above example), it will “break” existing queries, aliases and regex, which are based on the old hierarchy.

Querying across hierarchies, such as error rates across all services, can result in queries which are hard to compose and later hard to understand and maintain.
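
To make the fragility concrete, here is a small sketch that contrasts position-based parsing of a dotted name with label-based selection. The schema field names, the label names, and the pre-migration metric name (the example above with the region segment removed) are my own assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical positional schema for the dotted name before a region level existed.
OLD_SCHEMA = ["env", "cluster", "zone", "host", "plugin", "metric", "window"]


def parse_dotted(name: str, schema: List[str]) -> Dict[str, str]:
    """Map each dot-separated segment to a field purely by its position."""
    return dict(zip(schema, name.split(".")))


def select(series: List[Dict[str, str]], **matchers: str) -> List[Dict[str, str]]:
    """Return the series whose labels equal every given matcher."""
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]


old_name = "PROD.KafkaBrokers.1a.kafka1.load.load.shortterm"
new_name = "PROD.KafkaBrokers.us-east-1.1a.kafka1.load.load.shortterm"

before = parse_dotted(old_name, OLD_SCHEMA)
after = parse_dotted(new_name, OLD_SCHEMA)  # every field after "cluster" shifts by one

# The same two series expressed with labels; the second simply gains a "region" label.
labeled = [
    {"name": "load_shortterm", "env": "PROD", "zone": "1a", "host": "kafka1"},
    {"name": "load_shortterm", "env": "PROD", "region": "us-east-1", "zone": "1a", "host": "kafka1"},
]
by_host = select(labeled, host="kafka1")  # matches both, with or without a region
```

With positional parsing, the host field silently becomes "1a" once the region is inserted, while the label-based selector keeps matching without any change to the query.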

Prometheus offered a tagged data model from the get-go in 2015, which swept the market. As Graphite itself later observed:

“It’s becoming common in monitoring systems to use a tagged (or labelled) metric format. This allows for more flexibility both in naming and retrieving metrics.”

Graphite reacted with delay and launched its own tags support in 2018, admitting that tagging “allows for much more flexibility than the traditional hierarchical layout”.

Metric Scraping and Autodiscovery

The Solution: Automatic service discovery and metric scraping streamline getting your metrics into your monitoring system.

The common practice by StatsD and other traditional solutions was to collect metrics in push mode, which required explicitly configuring each component and third-party tool with the metrics collector destination.

With the many frameworks and languages involved in modern systems it has become challenging to maintain this explicit push-mode sending of metrics. Adding Kubernetes to the mix increased the complexity even further. Teams were looking to offload the work of collecting metrics.

This was a distinct strong point of Prometheus, which offered pull-mode scraping together with service discovery of the components (“targets” in Prometheus terms). In particular, Prometheus shone with its native scraping from Kubernetes, and as demand for Kubernetes skyrocketed, so did demand for Prometheus.
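
The pull model can be sketched roughly as follows. This is a simplified illustration, not how Prometheus is implemented: the discover and fetch callables, the target addresses, and the sample payload are all stand-ins I made up. The point is that the collector asks a discovery source for the current targets on every round, so the services themselves need no collector address.

```python
from typing import Callable, Dict, Iterable


def scrape_once(
    discover: Callable[[], Iterable[str]],
    fetch: Callable[[str], str],
) -> Dict[str, dict]:
    """Pull metrics from every currently discovered target.

    A failed fetch is recorded as up=0 instead of aborting the round,
    loosely mirroring the per-target `up` series that Prometheus records.
    """
    results: Dict[str, dict] = {}
    for target in discover():
        try:
            results[target] = {"up": 1, "body": fetch(target)}
        except OSError:
            results[target] = {"up": 0, "body": ""}
    return results


# Stand-ins for a real service-discovery source and HTTP client (both hypothetical).
def fake_discover() -> Iterable[str]:
    return ["10.0.0.5:8080", "10.0.0.6:8080"]


def fake_fetch(target: str) -> str:
    if target == "10.0.0.6:8080":
        raise ConnectionError("pod was evicted")
    return 'http_requests_total{method="GET"} 42\n'


round_result = scrape_once(fake_discover, fake_fetch)
```

Because the target list is re-read on each round, a pod that disappears or appears between rounds is handled without touching any service configuration.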

As the popularity of Prometheus grew, many open source projects added support for the Prometheus Metrics Exporter format, which has made metrics scraping with Prometheus even more seamless. Today you can find Prometheus exporters for many common systems, including popular databases, messaging systems, web servers, and hardware components.
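
To give a feel for what a scrape returns, here is a minimal sketch that renders one counter in the Prometheus text exposition format. The metric name, help text, and label values are invented for illustration, and a real exporter would normally rely on an official client library rather than hand-built strings.

```python
from typing import Dict, List, Tuple


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(
    name: str,
    help_text: str,
    samples: List[Tuple[Dict[str, str], float]],
) -> str:
    """Render HELP/TYPE metadata followed by one line per labeled sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample: one counter with two labels.
payload = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "code": "200"}, 1027)],
)
```

Serving a payload like this over HTTP, conventionally at a /metrics path, is essentially all a scrape target has to do.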

Clustered Scalable Solution

The Solution: Current metrics volumes require horizontally scalable storage that enables both long retention of historical data and high availability.

As the volume of telemetry data started rising, companies such as Taboola, Wayfair, and Uber started experiencing scaling challenges with their existing Graphite and StatsD monitoring solution. As the Wayfair team put it: “Graphite/Carbon does not support true clustering (replication and high availability)”. Uber’s account states:

“By late 2014, all services, infrastructure, and servers at Uber emitted metrics to a Graphite stack that stored them using the Whisper file format in a sharded Carbon cluster. We used Grafana for dashboarding and Nagios for alerting, issuing Graphite threshold checks via source-controlled scripts. While this worked for a while, expanding the Carbon cluster required a manual resharding process and, due to lack of replication, any single node’s disk failure caused permanent loss of its associated metrics. In short, this solution was not able to meet our needs as the company continued to grow.”

Prometheus did not manage to solve the scaling problem either. Prometheus comes as a single node deployment, storing the data on local storage. This limits the total amount of data you can store, which translates to limits on the historical data you can query and compare against. Uber engineering, for example, had several teams that ran significant Prometheus deployments, but then reported:

“As our dimensionality and usage of metrics increases, common solutions like Prometheus and Graphite become difficult to manage and sometimes cease to work.”

The need for historical data in Prometheus often forces aggressive downsampling strategies, which may severely limit the ability to identify and analyze outliers in the historical data. Alternatively, organizations find themselves manually sharding their metrics across several discrete Prometheus instances, having to deal with challenges of distributed systems.
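
A quick worked example shows why aggressive downsampling hurts outlier analysis. The latency values and bucket size below are invented; the sketch rolls raw samples up by averaging and compares the peaks before and after.

```python
from statistics import mean
from typing import List


def downsample_mean(samples: List[float], bucket_size: int) -> List[float]:
    """Replace each run of `bucket_size` raw samples with their average."""
    return [
        mean(samples[i : i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Hypothetical request latencies in milliseconds, with a single 950 ms spike.
latency_ms = [20, 22, 21, 19, 950, 20, 21, 20, 22, 20, 21, 19]

rolled_up = downsample_mean(latency_ms, bucket_size=6)

raw_peak = max(latency_ms)      # the spike is visible in the raw data
rolled_peak = max(rolled_up)    # the spike is averaged down to roughly 175 ms
```

After the rollup, the 950 ms outlier appears as a bucket averaging about 175 ms, so a query over the historical data can no longer tell a single severe spike from a mild, sustained slowdown.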

Time Series Databases (TSDB) to the rescue

The sheer volume of telemetry data in modern systems calls for new scalable solutions. Prometheus doesn’t meet the scaling challenge. Modern time series databases (TSDB) offer a variety of robust scalable solutions to address this challenge.

The popularity of TSDB has skyrocketed in the past couple of years, far beyond any other type of database. A variety of use cases have helped with this uptick in popularity such as systems metrics collection, IoT, and financial services. Alongside general-purpose TSDB, some are more specialized, including those designed primarily for the purposes of metrics monitoring.

*Database trends from January 2019 through October 2020 showing the large increase in demand for time series databases (credit: DB-engines)*

Many new TSDBs have emerged in the past few years, a significant percentage of which are open source. In fact, open source appears to be a main driving force behind TSDB: some 80% of the TSDBs in use are open source based, the highest OSS rate among all database types.

*Comparison of database popularity by open-source versus commercial licenses (credit: DB-engines)*

TSDB as Prometheus Long Term Storage (LTS)

The new generation of TSDB can serve as long-term storage (LTS) solutions for Prometheus, so you can enjoy the benefits of Prometheus’ metrics scraping and query capabilities, while supporting long-term data retention and the wealth of Grafana dashboards built for a Prometheus datasource.

The standard way in which a TSDB integrates with Prometheus as an LTS is via Prometheus’s remote write feature. Remote write sends samples to the LTS and takes care of relabeling, buffering (it can run multiple queues for multiple remote endpoints), authorization, and other aspects.
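
The relabeling and buffering idea can be pictured with a toy queue like the one below. This is only a conceptual sketch and not Prometheus’s implementation or wire protocol: the class name, the drop pattern, the batch size, and the send callable are all hypothetical.

```python
import re
from typing import Callable, Dict, List

Sample = Dict[str, object]


class RemoteWriteQueue:
    """Toy buffer: drop unwanted series, then ship the rest in fixed-size batches."""

    def __init__(
        self,
        send: Callable[[List[Sample]], None],
        drop_pattern: str,
        batch_size: int = 3,
    ) -> None:
        self._send = send
        self._drop = re.compile(drop_pattern)
        self._batch_size = batch_size
        self._buffer: List[Sample] = []

    def append(self, sample: Sample) -> None:
        """Filter the sample by name, buffer it, and flush when a batch is full."""
        if self._drop.fullmatch(str(sample["name"])):
            return  # relabel-style drop: this series never reaches long-term storage
        self._buffer.append(sample)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Hand any buffered samples to the sender as one batch."""
        if self._buffer:
            self._send(self._buffer)
            self._buffer = []


# Collect batches in a list instead of sending them to a real endpoint.
sent_batches: List[List[Sample]] = []
queue = RemoteWriteQueue(send=sent_batches.append, drop_pattern=r"go_.*", batch_size=2)
for metric_name in ["http_requests_total", "go_goroutines", "http_errors_total", "up"]:
    queue.append({"name": metric_name, "value": 1.0})
queue.flush()  # ship the final partial batch
```

Here the runtime metric matching go_.* is filtered out, and the remaining three samples leave in two batches. A production queue would also need retries and back-pressure, which this sketch leaves out.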

When choosing a TSDB for a Prometheus LTS, we look for:

Prometheus compatibility, supporting PromQL (Prometheus Query Language) and the respective APIs
Horizontal scalability and high availability to handle large-scale metrics
Downsampling of data with regular rollups and aggregation
Backfilling and out of order writes (for network lags or retrospective historical feed)
Multi-tenancy to support multiple teams, business units, customers, partners etc.

Many modern open source TSDBs offer integration with Prometheus. Within the Cloud Native Computing Foundation (CNCF) incubation projects, you can find both Cortex and Thanos, with some cross-pollination between the two. Other open source TSDBs include M3DB by Uber, VictoriaMetrics, and InfluxDB by InfluxData, to name a few. These solutions take different approaches and present different tradeoffs as Prometheus LTS. Some of the solutions come with their own UI, data collection mechanisms, and even query language variants; however, using Prometheus and Grafana remains the common practice.

Some of these TSDBs are fairly young OSS projects, with all that implies for maturity, documentation, and accumulated expertise in the user community. If you require Prometheus long-term storage for your production system, a TSDB is a good path, though I highly recommend making a thorough evaluation based on your workload, needs, and risk tolerance.

Endnote

Modern systems are highly dynamic in nature, and emit huge amounts of high cardinality telemetry data, especially when based on cloud-native and microservices architectures. These systems also make ever-increasing use of third-party open source tools and cloud services which need to be monitored as well.

Prometheus sets a good standard with its metrics scraping and flexible tagged query language. This approach, best exemplified by the seamless integration with Kubernetes, encouraged many projects and tools to start offering Prometheus exporters and Grafana dashboards as a standard for their monitoring.

Running only on a single node proved to be a severe limitation of Prometheus.

Current monitoring needs a clustered solution that can hold historical data long term, without sacrificing data granularity with aggressive downsampling.

Whether you’re choosing the monitoring system for a new project, or whether you’re facing scaling issues with your current monitoring solution, integrating TSDB as a long-term storage for Prometheus enables you to enjoy the best of both worlds: metric scraping, querying and visualization based on Prometheus and Grafana, while handling high scale of telemetry data.

*This article was originally published on InfoQ on December 3rd, 2020.*