Announcing Logz.io Alert Manager for Metrics

Logz.io alerts are a critical capability for our customers monitoring their production environments. By keeping a watchful eye out for data that indicates an issue – like spiking memory metrics or 3xx-4xx response codes – alerting quickly notifies engineers that something is going wrong. Setting an actionable alert to immediately notify engineers of emerging problems can be the difference between a minor issue and a major event with widespread customer impact.

While Logz.io users can set alerts to monitor any telemetry data, many prefer to base alerts on metric data, which is great for providing near real-time measurements of key signals that could indicate issues in infrastructure, applications, or services.

This is why we’re excited to announce that Alert Manager for metrics monitoring is in Public Beta – this will make building, editing, and managing metrics alerts in Logz.io easier than ever. Alert Manager will also help customers configure actionable and informative alerts so teams can quickly understand whether a notification indicates a minor concern or an all-hands-on-deck situation.

Logz.io’s Alert Manager is fully compatible with the Prometheus Alert Manager, which will make it fast and easy to migrate existing Prometheus alerts to Logz.io – furthering the promise of delivering an enhanced Prometheus-based monitoring experience.

In this post, we’ll walk through the new capabilities for building and managing alerts for metric data in Logz.io. The next five sections will cover the five tabs that you’ll see when opening Alert Manager within Logz.io’s Infrastructure Monitoring product.

[Image: Alerting]

Alert rules – creating a new alert

The first step is to create a query for the alert to monitor, which is based on PromQL – the Prometheus query language. In this example, we are going to monitor the latency of our backend services.
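
As a rough illustration, a PromQL query for backend latency might look like the sketch below; the metric and label names are hypothetical and would need to match whatever your services actually expose:

    # 95th-percentile request latency per service, over a 5-minute window
    histogram_quantile(0.95,
      sum(rate(request_duration_seconds_bucket{service="orders"}[5m])) by (le, service)
    )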

While you can create an alert from scratch, it’s far easier to go to a monitoring panel first, edit the panel, and hit ‘Create alert rule from this panel’ from there. This automatically carries over the query and the variables to the alert configuration page (rather than requiring that you add it all manually). Let’s go to one of our backend services, called ‘Orders.’

[Image: Orders latency]

Starting at the ‘Rule type’ section, give your alert a name. We’ll put it in the ‘Sockshop’ folder and the ‘Backend’ group within the folder to keep things organized.

Now we can scroll down to look at the queries we’ll be monitoring. Since we created this alert from an existing monitoring panel, it carried over the query from that panel. We can simply edit the query so that it monitors metrics from all of our backend services rather than just ‘Orders’ metrics (this is much easier than starting the query from scratch). 

We can see that the ‘Cart’, ‘Catalogue,’ and ‘Orders’ services currently match the query conditions. We have other services that have lower latency, so they aren’t currently shown.
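
Continuing the hypothetical query above, broadening the match is typically just a matter of swapping the exact label matcher for a regular expression (the service names here are illustrative):

    # match every backend service instead of only the 'orders' service
    histogram_quantile(0.95,
      sum(rate(request_duration_seconds_bucket{service=~"orders|cart|catalogue|shipping|payment"}[5m])) by (le, service)
    )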

Now that we have the query we want to monitor in the A table, we can scroll down to the expression table (D) to define the function that will perform calculations on the data. 

In the ‘Operation’ drop down, we will use ‘Reduce’ to reduce the time series to one data point (required for the alert to run).

For the function itself, we will use the ‘Last’ function to get the most recent data point. Because we already defined the actual threshold in the query section above, this only affects what will be presented in the notification rather than the alerting condition itself.

[Image: Create query]

Now we can define the alert condition. Under ‘Condition,’ we will choose D, which will monitor the expression we just configured. Under ‘Evaluate,’ we’ll specify that the alert should evaluate the expression every minute, and that if the condition remains pending over a span of 5 minutes, the alert will trigger a notification. This prevents the alert from triggering if the query only briefly crosses the alerting threshold.
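
Since Logz.io’s Alert Manager is compatible with the Prometheus Alert Manager, it can help to think of this timing in Prometheus rule terms; very roughly, it corresponds to a rule group’s evaluation interval plus a ‘for’ clause (the group and alert names below are placeholders):

    groups:
      - name: backend                              # placeholder rule group
        interval: 1m                               # evaluate the rule every minute
        rules:
          - alert: BackendLatencyAlert
            expr: <latency query from the panel>
            for: 5m                                # stay pending for 5 minutes before firing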

Under ‘Configure no data and error handling,’ we can also configure the alert to trigger a notification if the query returns no data or produces errors – this way we can fix any problems that would prevent the alert from firing.
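
For comparison, in a plain Prometheus setup a common way to catch a silent data source is a separate rule built on the absent() function; the metric name below is only illustrative:

    # fire if the latency metric stops reporting for 10 minutes
    - alert: BackendLatencyMetricsMissing
      expr: absent(request_duration_seconds_bucket{service="orders"})
      for: 10m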

We can also preview the alert to verify it is successfully monitoring the data. Hit ‘Preview alerts’ and Logz.io will run the query and show the alerting result.

Next, we can add details and annotations to give the alert context when it triggers. In the summary, we used templates like {{$labels.path}} and {{$values.D}} to automatically populate the summary with information from the relevant services.
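
For reference, Prometheus-style rules express the same idea in an annotations block, though the template syntax differs slightly (for example, Prometheus exposes the measured number as $value rather than $values.D); the label names below are illustrative:

    annotations:
      summary: "High latency on {{ $labels.service }} ({{ $labels.path }}): currently {{ $value }} seconds"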

[Image: Alert details]

Finally, we can hit ‘Save’ or ‘Save and exit’ to save the alert. Going back to the Alert Manager homepage (which is in the left menu), we can see our new alert in the ‘Sockshop’ folder.

[Image: Alerting]

Now, let’s determine where the alert will send notifications if it’s triggered.

Contact Points – define your notification endpoint

This is where Logz.io users can define the destinations for their notifications and the messages they contain. Common alerting endpoints for Logz.io include Slack, PagerDuty, Gmail, OpsGenie, and others. Go to the ‘Contact points’ tab in Alert Manager and hit ‘New contact point.’

Give the contact point a name and select the application you’ll use as an endpoint. In this example, we’ll choose Slack. We can also specify the Slack channel we want to send the notification to.

Next, we can provide the Slack API token or add the Webhook URL. Hit ‘Save contact point.’
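
Because Logz.io’s Alert Manager is compatible with the Prometheus Alert Manager, the same contact point maps roughly to an Alertmanager receiver; the receiver name, channel, and webhook URL below are placeholders:

    receivers:
      - name: slack-backend-alerts
        slack_configs:
          - channel: '#backend-alerts'                         # placeholder channel
            api_url: 'https://hooks.slack.com/services/XXXXX'  # placeholder webhook URL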

[Image: Create contact point]

Now that we have a contact point, we can link the alerting rule with it by going to ‘Notification Policies.’

Notification policies – configure your alerting notifications

This is where you can match alerts with your contact points (which contain the notification endpoints) and define other components of the alerting notification.

To start, go to the ‘Notification Policies’ tab and hit ‘New Policy,’ and then ‘Add Matcher.’

We will add the label (alertname) and the value for the label (Backend Latency Alert) for the alert we’re interested in. Then we’ll select the contact point we just configured (Slack Backend Alerts).

By enabling ‘Override grouping’ and selecting a label, we can group all the firing alerts with that label to ensure we aren’t getting annoying duplicates. 

We can also configure a mute timing to pause notifications during predefined periods – like weekends, for example.
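
In Prometheus Alertmanager terms, this notification policy corresponds roughly to a route; the matcher value, grouping label, and mute timing name below are illustrative:

    route:
      routes:
        - matchers:
            - 'alertname="Backend Latency Alert"'
          receiver: slack-backend-alerts
          group_by: [service]                 # collapse firing alerts that share this label
          mute_time_intervals: [weekends]     # pause notifications during a predefined window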

[Image: Notifications]

Silences – temporarily mute notifications

Production issues, scheduled maintenance, or other production events can cause an overwhelming barrage of alerts. Silences provide an easy way to get some peace and quiet during these events. 

Simply open up an alert within the ‘Alert Rules’ tab and hit ‘Silence.’ This will bring you to the ‘Silences’ tab where you can quickly prevent the alert from firing.

[Image: Sockshop]
[Image: Alerting]

Alert Groups – organize your alerts

Alert Groups can condense multiple alerts into single notifications to prevent alerting overload. This also prevents duplicate alerts from triggering separate notifications.

Simply search for labels and their corresponding values under ‘Search by label’ and/or select a label under the ‘Custom group by’ drop down to identify groups of alerts to consolidate.
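
The underlying behavior is the same grouping the Prometheus Alertmanager expresses in its route configuration; the labels and timings below are illustrative:

    route:
      group_by: [alertname, service]  # alerts sharing these labels are bundled into one notification
      group_wait: 30s                 # wait before sending the first notification for a new group
      group_interval: 5m              # wait before notifying about new alerts added to the group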

[Image: Alert group]

Try Alert Manager today!

Logz.io’s Infrastructure Monitoring product is getting a huge boost with Alert Manager, which makes alerting for metrics data easier and more customizable for complex alerts.

Alert Manager is in Public Beta. To join the Beta program and try it for yourself, contact your Logz.io Account Manager or reach out to us through our contact form.

Get started for free

Completely free for 14 days, no strings attached.