If you’ve followed our latest blog posts, you’ll have learned how to send metric data to Logz.io and visualize that data on Infrastructure Monitoring—our Grafana-based metrics monitoring solution that we made generally available on Monday. This blog will walk through configuring Infrastructure Monitoring alerts, Log-Metric Correlation, and markers.
At this point, you’ll have some nice-looking Grafana dashboards in your account. While Grafana visualizations are beautiful and satisfying to build, you probably built them to monitor the health of your cloud environment. Below, we’ll examine some of the features we’ve built on top of Grafana to help you stay notified of production issues and quickly understand what’s causing them.
Configuring Infrastructure Monitoring Alerts
Infrastructure Monitoring Alerts are built within Grafana visualizations. After clicking ‘Edit’ on a visualization (from the drop-down next to the title), click the bell icon (left) to open the alerting page.
The first step is to give your alert a name and determine how often it will evaluate your metrics (the default is every minute). Since alerts are based on metrics meeting defined conditions (B), you’ll also specify how long the data must meet those conditions before an alert is sent (A). This prevents an alert from triggering on a momentary blip in the metric data.
The first time the alert discovers that your metrics meet the defined condition, the alert status will be “pending”. If the condition is met for the predefined threshold time, the status will switch to “alerting” and a notification will be sent.
The next step is to determine the condition that will trigger an alert (B). First, decide how you want to measure your metrics: average, maximum, minimum, sum, last value, count, difference, or another measurement.
Then, choose the query you’d like to monitor (which is defined in the “Query” section of the visualization). The timeframe in the query can range anywhere from 10 sec to 48 hr.
Finally, choose the threshold that triggers the alert. You can type in the value, or drag the threshold line on the visualization to the desired level (below). You can also set up multiple conditions and queries for more advanced alerts.
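To make the moving parts concrete, here’s a rough Python sketch of how this kind of rule evaluation works. The reducer names, threshold, and evaluation-window logic below are illustrative only – not Logz.io’s or Grafana’s actual implementation:

```python
from statistics import mean

# Illustrative reducers, mirroring the measurement options described above.
REDUCERS = {
    "avg": mean,
    "max": max,
    "min": min,
    "sum": sum,
    "last": lambda values: values[-1],
    "count": len,
    "diff": lambda values: values[-1] - values[0],
}

def evaluate_condition(datapoints, reducer="avg", threshold=80.0):
    """Reduce the datapoints in the query timeframe, compare to the threshold."""
    if not datapoints:
        return None  # "no data" case -- handled separately (C)
    return REDUCERS[reducer](datapoints) > threshold

def alert_state(condition_history, for_periods=5):
    """Stay 'pending' until the condition holds for `for_periods` consecutive
    evaluations, then switch to 'alerting' (when a notification would fire)."""
    if not condition_history or not condition_history[-1]:
        return "ok"
    streak = 0
    for met in reversed(condition_history):  # count the trailing streak
        if not met:
            break
        streak += 1
    return "alerting" if streak >= for_periods else "pending"
```

For example, with a CPU threshold of 80 and a five-evaluation window, `evaluate_condition([70, 85, 90])` returns `True` (the average is above 80), but `alert_state` only reports `"alerting"` once five consecutive evaluations have met the condition – before that, it reports `"pending"`, matching the behavior described above.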
You can also decide what will happen if the visualization contains no data or if there is an error (C).
Lastly, specify the endpoint for the notification and the message you’d like it to contain. In this case, we’re sending it to Slack.
After you’ve configured an alert, you can test the endpoint connection by hitting “test alert” in the top right corner, which will show whether the connection is established.
Setting up Log-Metric Correlation
Log-Metric Correlation allows you to seamlessly navigate to the logs associated with your metrics in a given visualization. While metrics are great for identifying the symptoms of a production issue, you’ll need to investigate the logs associated with those metrics to diagnose the problem.
This feature makes that an instant, seamless task. Personally, I think this is the most exciting feature we’ve built on top of Grafana.
To add this feature to a panel, begin editing the visualization and scroll to the ‘General’ section. At the bottom, you’ll see the ‘Panel links’ box.
To set up the correlation link, you’ll need to add the correct URL. The highlighted part of the URL is the same for every correlation link; if you want to link to the associated logs in Kibana, there is no reason to change it. Where you’ll need to make edits is the non-highlighted part.
If you’ve queried log data in Kibana, the non-highlighted part of the link should look familiar – it’s a Kibana query! The full link (highlighted + non-highlighted) takes you to Kibana and automatically applies the query you define here.
In this case, you can see that the query contains a cluster, namespace, service, and pod-name – which is the same metric query this Grafana visualization is monitoring.
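As a rough illustration, here’s how such a link could be assembled programmatically. The base URL, field names, and values below are placeholders, not the actual Logz.io link – copy the highlighted part from your own account:

```python
from urllib.parse import quote

# Hypothetical base -- in practice, use the highlighted part of the link from
# your own account; this placeholder is NOT the real Logz.io/Kibana URL.
KIBANA_BASE = "https://app.example-logs.io/#/dashboard/kibana?query="

def correlation_link(cluster, namespace, service, pod_name):
    # Filter on the same fields the visualization's metric query uses.
    # Field names here are illustrative; match them to your own log documents.
    lucene = (
        f'kubernetes.cluster: "{cluster}" AND '
        f'kubernetes.namespace: "{namespace}" AND '
        f'kubernetes.service: "{service}" AND '
        f'kubernetes.pod_name: "{pod_name}"'
    )
    return KIBANA_BASE + quote(lucene)  # URL-encode the Lucene query

print(correlation_link("prod-us-east", "payments", "checkout", "checkout-7d9f"))
```

The key point is that the non-highlighted tail is just a URL-encoded Kibana query scoped to the same cluster, namespace, service, and pod the panel is monitoring.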
Once you’ve added a Kibana query after the highlighted part of the link, hit ‘Add link’. This will add the link to the top right corner of the visualization. When you hover over it, you can hit ‘Explore in Kibana’ to see the logs associated with the metrics in your visualization. Below is the result! Neat, huh?
The Kibana query is the most important thing to get right here, so give it a try to make sure it brings up the log data you’d expect to be associated with the metrics in the visualization of interest.
After all, the next time you click on it, you may be urgently troubleshooting a production incident. Below is the query I tested in Kibana before adding it to the Correlation link in my Pod Memory Grafana visualization.
Adding Markers: understand how changes in production affect metrics
Another way to correlate logs and metrics is by marking production events on your dashboards. Having both your logs and metrics under the same Logz.io account enables you to find events indicating changes in production. This makes it easy to see those events layered over your metrics as Grafana annotations.
This helps correlate metric behavior with specific events in your system so you can see how your environment is impacted by recent deployments, patches, etc.
In the image below, the purple and red lines are Markers. Notice that the sudden spike in CPU happened just after the Kubernetes configuration change. The red Markers indicate failures.
To add Markers to your dashboard, go to “Settings” in the top toolbar and open “Annotations.” Like Log-Metric Correlation, the annotation is defined through a log query! You’ll therefore need to choose the right logs account as the data source and phrase the Lucene query to match those specific log documents.
Infrastructure Monitoring will add Markers to your visualizations based on logs identified in this query, so make sure the query won’t return a constant flow of logs.
Like all log queries, it helps to understand your logs so you can extract the right information. Test the query in Kibana first to make sure it’s giving you exactly the logs representing the production events you want to mark in Grafana. This ensures you correlate your metrics with meaningful production information.
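One quick way to sanity-check a query’s selectivity is to run the same filter logic over a sample of log documents and confirm it returns only the rare production events you want annotated – not a constant stream. The field names and messages below are made up for illustration:

```python
# Made-up sample of log documents; substitute logs from your own account.
sample_logs = [
    {"message": "configmap checkout-config updated", "type": "k8s-event"},
    {"message": "GET /healthz 200", "type": "access"},
    {"message": "deployment checkout rolled out", "type": "k8s-event"},
    {"message": "GET /cart 200", "type": "access"},
]

def marker_query(doc):
    # Python equivalent of an illustrative Lucene query such as:
    #   type:k8s-event AND (message:updated OR message:"rolled out")
    return doc["type"] == "k8s-event" and (
        "updated" in doc["message"] or "rolled out" in doc["message"]
    )

events = [doc for doc in sample_logs if marker_query(doc)]
print(f"{len(events)} of {len(sample_logs)} logs would become Markers")
```

If the matched set is small and consists only of genuine production events (here, two Kubernetes change events out of four logs), the query is a good candidate for a Marker; if it matches routine traffic too, tighten it before adding the annotation.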
For example, before adding the K8s configuration change Marker above, I tested the query in Kibana (below) to make sure it only gave me the production events I wanted to see in my Grafana dashboard. You can see they are limited to these logs indicating configuration changes.
That looks like a reasonable amount of information. Too many logs matching this query would result in Marker overload, which would make the Markers useless.
Lastly, specify the log fields to present as Text and Tags; these will appear when you hover over the Marker to provide additional information about the event.
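For reference, annotations like these also live in the dashboard’s JSON model. The fragment below is a rough sketch using Grafana’s Elasticsearch-style annotation fields – the data source name, query, and field names are placeholders you’d replace with values from your own account:

```json
{
  "annotations": {
    "list": [
      {
        "name": "K8s config changes",
        "datasource": "my-logzio-logs-account",
        "enable": true,
        "iconColor": "purple",
        "query": "type:k8s-event AND message:\"configuration change\"",
        "textField": "message",
        "tagsField": "kubernetes.namespace"
      }
    ]
  }
}
```

The `query` is the Lucene query you tested in Kibana, and `textField`/`tagsField` name the log fields shown as the Marker’s Text and Tags on hover.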
Once the annotations are defined, you will see them on the panel, and you can toggle them on and off via the toggle buttons at the top of the panel.
Infrastructure Monitoring Alerts make it easy to stay notified of production issues as they show up in your metrics. But metrics monitoring shouldn’t only help you identify production issues – it should help you troubleshoot and resolve them as well. With the ability to correlate logs and metrics, that becomes a quick, seamless task.
This concludes our blog series on Infrastructure Monitoring. If you haven’t already, make sure to check out previous blogs written by my colleagues on shipping metrics to Logz.io and building visualizations on Infrastructure Monitoring.