How To ‘Translate’ Grafana Dashboards from Graphite to Elasticsearch

Grafana is the de facto open source tool for visualizing metrics. Grafana supports many different backends for data sources and handles each one slightly differently. This blog post is geared towards helping convert Grafana dashboards from using the Graphite backend to using Elasticsearch as a metrics data source. There are many similarities between how to use both as data sources and how to plot graphs from them, but there are also many differences that need to be accounted for.

Graphite is a dot-delimited time series database (TSDB). Each metric is treated as a point in a hierarchy and addressed by a dotted name like PROD.KafkaBrokers.us-east-1.1a.kafka1.load.load.shortterm. A variety of functions, for example aliasByNode, can be applied to the metrics to alter or aggregate them. Each metric is a separate series of data points, and querying a fully qualified dotted name returns a single metric each time.
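To make this concrete, here is a sketch of both forms; the dotted path comes from the example above, and the function arguments are illustrative:

```
# A single series addressed by its full dotted path:
PROD.KafkaBrokers.us-east-1.1a.kafka1.load.load.shortterm

# The same hierarchy queried with a wildcard, relabelling each series by
# its fifth dot-segment (index 4, the hostname):
aliasByNode(PROD.KafkaBrokers.us-east-1.1a.*.load.load.shortterm, 4)
```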

By contrast, Elasticsearch is a log search platform that also has the capability to store metrics as a time series database. Because of Elasticsearch’s logging origins, it treats metrics as JSON documents rather than the simple time-stamped data points of Graphite. As an example, system network metrics, such as bytes in, bytes out, and errors in, will be aggregated together into one record that can be searched using the Lucene query language, just as logs in Elasticsearch are searched.
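For illustration, an abridged network metric record might look roughly like the following; the values are made up, but the field names follow the Metricbeat system module schema discussed below:

```json
{
  "@timestamp": "2021-03-01T12:00:00.000Z",
  "host": { "name": "kafka1" },
  "metricset": { "name": "network" },
  "system": {
    "network": {
      "name": "eth0",
      "in":  { "bytes": 184467440, "errors": 0 },
      "out": { "bytes": 92233720, "errors": 0 }
    }
  }
}
```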

Each metric record also contains metadata about the record, such as the hostname of the machine reporting the metrics, the metricset name, disk names, network interface names, etc. Understanding how these metric records are formatted is extremely important for understanding how to query and aggregate our metrics. As an example, network metrics can be aggregated and aliased based on which network interface is reporting the data, or a specific network interface can be selected in the query with something as simple as:

system.network.name: eth0

Graphite Dashboard Translation Examples

I will provide some examples of dashboards I have converted from Graphite queries to Elasticsearch queries while migrating our Kafka metrics dashboard. As helpful background for these examples, I’ll first explain how the metrics were actually shipped. Both Graphite and Elasticsearch function as TSDBs, but neither actually scrapes metrics. For that, both rely on external shippers, typically running on the hosts being monitored, to gather metrics and push them to the TSDB.

In my examples, I am primarily using collectd as the metrics shipper to send to Graphite. The Graphite ecosystem has a wide variety of metrics shippers that can be utilized to write to the Graphite backend. In the Elasticsearch world, by contrast, the vast majority of metrics shipping can be done with Elastic’s own Metricbeat shipper. In the examples below, I am using the Metricbeat system module to replace the collectd shipper.
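As a rough sketch, a minimal metricbeat.yml for this kind of setup could look like the following (the metricsets, period, and output host are illustrative, not the exact configuration used here):

```yaml
metricbeat.modules:
  # The system module replaces collectd for host-level metrics
  - module: system
    metricsets: ["cpu", "load", "network"]
    period: 10s

output.elasticsearch:
  hosts: ["localhost:9200"]
```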

Basic Queries

Most system metrics you will need to migrate are fairly simple. An example of this simple type of metric is CPU load. The metric will simply be a set of points that need to be graphed by some aggregation such as Max, Average, or Sum, and will require no further calculations.

Here you can see an example of load metrics for Kafka brokers in Graphite. In this example, I am using variables to search for the load metric for particular nodes, by machine name.
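The query behind a graph like that looks something like this, reusing the illustrative dotted path from earlier:

```
# $broker is a Grafana template variable holding the machine name(s)
PROD.KafkaBrokers.us-east-1.*.$broker.load.load.shortterm
```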

Here is the same metric query in the Lucene language to pull the metrics from Elasticsearch. Much as you would create a search for logs, you can search on the host.name field from the metric record. Here I am also using the $broker variable to search for the specific hosts I want. Next, the Metric field is where you specify what metric type you want to graph from the record. The system.load.1 metric is the Metricbeat system module metric that most closely corresponds to the load.shortterm metric from Graphite. You can select the aggregation to use (in this case, Average).
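Put together, the query editor settings amount to roughly:

```
Query:    host.name: $broker
Metric:   Average system.load.1
Group by: Date Histogram, @timestamp
```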

Choosing the correct aggregation should typically be fairly straightforward. Look at the function used in Graphite such as consolidateBy or highestMax to determine which aggregation was used in the original graph and select the corresponding Elasticsearch aggregation.

In this example, you can see a query using consolidateBy with 'max' as the argument.
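Such a query would look something like this (the path is illustrative, as before):

```
consolidateBy(PROD.KafkaBrokers.us-east-1.*.$broker.load.load.shortterm, 'max')
```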

This corresponds very neatly to the ‘Max’ dropdown when selecting the metric to graph with Elasticsearch.

In most cases, I found that the relationship between the two queries was this straightforward. Occasionally, you may run into an example with more complex queries, such as finding the average of max values, where some experimentation with different aggregations and metric calculations may come into play. I’ll cover some of these in the following sections.

Counter Metrics Requiring Derivatives

Some metrics will require calculations to find changes over time or a specific rate (e.g., megabytes per second). A great example of this is graphing network metrics. Network metrics from the Metricbeat system module are counters: they accumulate the total number of bytes over time.

Looking at an example from Graphite, the network values already correspond to a rate, so the query itself doesn’t necessarily give any pointers on where to begin. In some cases, though, you may see a metric wrapped in a derivative or perSecond function, which is a clear signal that a rate calculation is needed.

First, I created a Lucene query to find the metric I wanted. Note the use of the system.network.name metadata to find results specifically from the eth0 network interface.
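With the fields from the earlier examples, the Lucene query is as simple as:

```
host.name: $broker AND system.network.name: eth0
```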

Next, I selected system.network.in.bytes with a Max aggregation as the metric to graph, but hid the metric by clicking on the eye icon. Then I added the Derivative function to measure the rate of change over time in the Max system.network.in.bytes metric I selected. Finally, I opened the ‘Options’ dropdown of the Derivative query and set the unit to seconds to get per-second byte rates.
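Under the hood, Grafana builds an Elasticsearch query along these lines. This is a hand-written sketch rather than Grafana’s literal output, and the exact syntax (e.g., fixed_interval vs. interval) varies by Elasticsearch version:

```json
{
  "size": 0,
  "query": {
    "query_string": { "query": "host.name: kafka1 AND system.network.name: eth0" }
  },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "10s", "min_doc_count": 1 },
      "aggs": {
        "max_in_bytes": { "max": { "field": "system.network.in.bytes" } },
        "in_bytes_per_sec": {
          "derivative": { "buckets_path": "max_in_bytes", "unit": "1s" }
        }
      }
    }
  }
}
```

The derivative’s unit of 1s is what turns the per-bucket difference into a per-second rate.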

Pay attention to the time interval (under ‘Group By’). Auto is fine, but you may want to set the interval to the lowest possible setting that still retrieves metrics (probably 10s).

If there are drops or negative values at the beginning or end of the time range, use the ‘Trim Edges’ feature to trim the ends of the graph. Elasticsearch can report a drop when a bucket at the edge of the graph has not yet been filled (typically because the shipper hasn’t shipped enough data yet), and ‘Trim Edges’ prevents these incomplete data points from being used in the derivative calculation. You can also set the ‘Min Doc Count’ field to 1 so that empty buckets are excluded from the graph.

Grouping Metrics by Nodes

Grouping metrics by node is very useful. In Graphite it is performed by the aliasByNode function.

You can do the same type of aggregation in Elasticsearch by using the ‘Group By’ function. Here, I have used the host.name metadata to aggregate the metric graphs by hostname, set the Size to ‘No limit’, and ordered by ‘Term value’. It’s also possible to use a combination of the ‘Size’ and ‘Order By’ functions to display only the top x or bottom y results (e.g., top 10, bottom 5).
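In query DSL terms, this is a terms aggregation wrapped around the date histogram. Again, a hand-written sketch rather than Grafana’s literal output:

```json
{
  "size": 0,
  "aggs": {
    "by_host": {
      "terms": { "field": "host.name", "order": { "_key": "asc" } },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "10s" },
          "aggs": { "avg_load": { "avg": { "field": "system.load.1" } } }
        }
      }
    }
  }
}
```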

The ‘Order By’ function can also be adjusted to either display the results in the tooltip alphabetically, as is the case here, or by the metric value.

This type of aggregation can be performed on any metadata field provided by the Elasticsearch metric records as well (e.g., disk name, network interface, etc.).

Using Aliases

Aliases in Graphite are functions that apply a regex to the dotted namespace to extract terms to use when labelling the graph.

In this example, you can see the aliasSub function that is performing a regex on the metric name to extract the text used in the graph.
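An illustrative version of such a query, with a made-up regex that captures the broker hostname from the dotted path:

```
# The captured group (e.g. kafka1) becomes the series label
aliasSub(PROD.KafkaBrokers.us-east-1.*.*.load.load.shortterm, '.*\.(kafka[0-9]+)\..*', '\1')
```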

Elasticsearch uses the Alias box in the query editor, with templating to extract metadata from the metric record (in this case, host.name). Metadata keys are inserted by enclosing the field name in double curly braces {{ }}. Any free text can also be added.
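For example, assuming host.name is one of your ‘Group By’ terms, an alias pattern of:

```
{{host.name}} load
```

labels each series with its hostname followed by the word ‘load’.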

And now the legend for my graph looks like this:

One caveat to using the alias feature is that a metadata key must be one of the ‘Group By’ terms before it can be used in an alias template.

Using the Explore feature in open source Grafana and in Logz.io

Reading through the shipper documentation helps greatly with understanding the structure of Elasticsearch metrics. If you are using Logz.io’s Infrastructure Monitoring service, which is based on open source Grafana, we have also added features on top of open source Grafana’s “Explore” feature to help you browse and search metrics.

In the Explore tab, you will find a query editor with a number of backends, including both logs and metrics. Select the metric datasource you want to search on and click on the box that says “Metrics.”

You will see metric records in this window in JSON format that can be expanded to a table, as shown in the network metric screenshot earlier. Here, you can analyze what different types of metrics look like in “raw” form and even practice searching for specific fields using the Lucene query editor. This gives a good idea of the queries you will want to run when creating your metrics dashboards. I recommend keeping a separate browser window open to the Explore query editor to help find the data you want and to troubleshoot when you run into graphing issues.

If you are running open source Grafana, the Explore function is still available; however, you will only be able to query metric data points and graph the results, as there is no built-in functionality to display the underlying JSON structure. Explore can still be useful for testing metric queries, even without the ability to view the JSON.

Final Note

Although Elasticsearch’s Lucene-based queries and Graphite’s function-based queries seem radically different from one another, translating Graphite dashboards into Elasticsearch is not particularly difficult.

The main hurdles to remain cognizant of are ensuring that

  1. you take derivatives on data points that should represent a rate,
  2. you pay particular attention to the time units on derivative functions, and
  3. you understand how aliases function differently in Graphite and Elasticsearch.

Elasticsearch provides much richer metadata on metrics data than Graphite, which makes it a compelling alternative for creating much more useful visualizations of your data. If you already use Elasticsearch to analyze your logs, it may be even more compelling. At Logz.io, we offer Log Management and Infrastructure Monitoring services based on the popular open source Grafana, Kibana, and Elasticsearch, so you can stick with the open source you know and like while having it managed for you in one place, with the ability to correlate across logs and metrics, as well as security and distributed tracing (based on the Jaeger open source project, now in beta). You can find more information on Logz.io open source based observability here.
