Monitoring and incident management: a winning combination

By: Evan Klein

September 24, 2019

Monitoring systems gather and log a wide range of performance data from diverse targets: applications, user experience, networks, servers, and more. Monitoring is usually conducted under runtime conditions, but synthetic monitoring can also be used, for example, to simulate loads and test the resilience of web services.

Incident management systems use monitoring system outputs (and other relevant inputs) to quickly detect, prioritize, diagnose, and resolve performance issues that disrupt normal service operation. The monitoring system output may be the log data itself, an event-triggered alert indicating that a performance threshold has been breached, or both.
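To make the idea of an event-triggered alert concrete, here is a minimal sketch of a threshold check. The metric names, values, and thresholds are hypothetical and chosen only for illustration; real monitoring systems typically add time windows, deduplication, and more.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    metric: str
    value: float
    threshold: float
    message: str


def check_threshold(metric: str, value: float, threshold: float) -> Optional[Alert]:
    """Return an Alert when the sample exceeds its threshold, else None."""
    if value > threshold:
        return Alert(metric, value, threshold,
                     f"{metric}={value} exceeded threshold {threshold}")
    return None


# Hypothetical samples: (metric name, observed value, threshold)
samples = [
    ("cpu_percent", 72.0, 90.0),
    ("p95_latency_ms", 850.0, 500.0),
]

# Keep only the samples that actually breached their threshold.
alerts = [a for a in (check_threshold(*s) for s in samples) if a is not None]
for alert in alerts:
    print(alert.message)
```

In this sketch only the latency sample produces an alert; the CPU sample stays below its threshold and is recorded as ordinary data.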

Closely coupling monitoring and incident management systems creates a synergy that is far more powerful than either process alone. This blog post explores how monitoring and incident management systems complement each other to achieve proactive, real-time incident responses that accelerate business outcomes.

The benefits of integrating monitoring and incident management

The security realm has long acknowledged that a carefully defined and implemented integration of security monitoring with security incident management can significantly improve security posture and mitigate security risks. In many cases, this integration is achieved through a Security Information and Event Management (SIEM) technology platform. This blog post, however, focuses on system health and performance, and on the benefits of integrating system monitoring with incident response management, including:

Context: Incident management systems aggregate and prioritize a wide variety of inputs, creating a single source of truth by putting monitoring alerts in context along with web forms, call center inputs, technical staff reports, and other data.
Triage: Streams of distributed, disparate monitoring data (or actual alerts) can flow into a tier that prioritizes detected anomalies and directs incidents to the right response team.
Enhanced end-user satisfaction: Tight integration of monitoring and incident management can dramatically reduce mean time to resolution (MTTR) through seamless, highly automated transitions from data collection to actionable insights to targeted responses.
Intelligence: The bidirectional flow between the monitoring and incident management systems can be used to improve both the monitoring algorithms and thresholds and the response workflows.
Agility: The integration by design of monitoring and incident management systems provides a shared resource for development, QA, operations, and ITSM teams to work together efficiently, coordinate resources, and reduce errors.

Integrated monitoring and incident management also supports scalability, minimizes alert fatigue, and can lower support costs.
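The triage benefit described above can be sketched in a few lines. This is an illustrative example only: the severity ranking, the routing table, and the team names are all hypothetical, and a real incident management system would make them configurable.

```python
from dataclasses import dataclass

# Hypothetical severity ranking and routing table (assumptions for this sketch).
SEVERITY_RANK = {"critical": 0, "error": 1, "warning": 2, "info": 3}
ROUTES = {"database": "dba-oncall", "network": "netops-oncall"}
DEFAULT_TEAM = "sre-oncall"


@dataclass
class Incident:
    source: str      # where the input came from (monitoring, web form, ...)
    component: str   # affected part of the system
    severity: str
    summary: str


def triage(incidents):
    """Sort incidents by severity and pair each one with a response team."""
    ordered = sorted(
        incidents,
        key=lambda i: SEVERITY_RANK.get(i.severity, len(SEVERITY_RANK)),
    )
    return [(ROUTES.get(i.component, DEFAULT_TEAM), i) for i in ordered]


incoming = [
    Incident("web form", "checkout", "warning", "Slow page reported by user"),
    Incident("monitoring", "database", "critical", "Replication lag above limit"),
    Incident("call center", "network", "error", "VPN drops for remote staff"),
]

queue = triage(incoming)
for team, incident in queue:
    print(f"{team}: [{incident.severity}] {incident.summary}")
```

Note that inputs from a web form, a monitoring alert, and a call center land in one prioritized queue, which is the "single source of truth" idea from the Context item.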

The most popular incident management systems

A couple of years ago, we reviewed in detail our top five incident management systems: PagerDuty, OpsGenie, VictorOps, Jira Service Desk, and Freshservice. All five are cloud- and mobile-based systems, and we compared them across features such as on-call routing, automatic assignment, real-time dashboards, ease of configuration, and multiple alerting methods.

In a more recent article from Software Testing (last updated August 21, 2019) that ranks the ten best incident management systems, our top five systems figure prominently in their top six choices, which are (in this order):

Jira Service Desk from Atlassian: This Java-based solution was developed for agility and collaboration, including real-time updates with knowledge bases and SLAs. Although it provides excellent templates, it is also highly customizable. It automatically sends emails to designated response personnel and provides a single, fully documented portal for testers and developers.
Mantis BT (Bug Tracker), an open-source tool: With multiple plugins and strong system and wiki integration hooks, Mantis BT can easily track multiple projects and users. It supports audit trails and change logs. Its features are easy to use, but it requires a skilled person to set up, configure, and maintain the system.
PagerDuty: This trusted incident management tool features real-time collaboration and incident management, event grouping and rich alerts, and automated escalation workflows. It also provides powerful API and email integration, and its scheduler is easy to use.
VictorOps: This incident management tool is specially designed for DevOps teams. It features strong collaboration, integration, automation, and measurement capabilities as well as a reliable on-call scheduler. It suppresses alert noise and accelerates time to resolution with features such as live call routing, reporting, chats, and delivery insights.
Freshservice: This customer support platform has a powerful ticketing system and knowledge base. It has an excellent track record for precise prioritization, quick analysis, and highly automated resolution of issues. It supports incident, change, and release management. It is extremely flexible and customizable and provides enterprise-grade reporting.
OpsGenie: This incident management solution includes its own monitoring system that tracks and checks the end-to-end flow of applications. With a powerful dashboard and targeted automated alerts, OpsGenie is designed to ensure that critical alerts are not missed, and it provides actionable insights to improve operational efficiency. It has strong collaboration tools and integrates easily with other tools and applications.

The other four incident management tools that made it into Software Testing’s top ten are: LogicManager, Zendesk, Spiceworks, and Plutora.

Case in point: Logz.io and PagerDuty integration

Logz.io is a fully managed monitoring and troubleshooting service based on the ELK Stack. The highly scalable Logz.io system accelerates incident detection, analysis, and resolution with crowdsourced machine learning, advanced clustering, and smart alerts.

Realizing that alert velocity and verbosity often hinder the responsiveness of development and operations teams, Logz.io built an alerting mechanism that lets users customize log-based alerts in Kibana. After defining an alert, users choose between two views: JSON (the default) or Table. If they select Table, they can specify which alert fields are displayed, up to seven per log type. The alert fields can also be used to group alerts logically, with up to three grouping levels. Finally, users can filter out unwanted pieces of data within a field using a regex pattern. Customized alert notifications are then delivered to the selected endpoints in an easy-to-consume table format.
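The customization steps above (choosing fields, grouping, and regex filtering) can be illustrated with a small sketch. This is not Logz.io's implementation; the function names and sample logs are hypothetical, and the regex filter is interpreted here, as an assumption, as keeping only the part of a field that matches the pattern. The limits of seven fields and three grouping levels come from the text.

```python
import re
from collections import defaultdict

MAX_FIELDS = 7        # limit stated in the text
MAX_GROUP_LEVELS = 3  # limit stated in the text


def build_rows(logs, fields, pattern=None, filter_field=None):
    """Project each log onto the chosen fields; optionally keep only the
    regex match inside one field (an assumed reading of the filter)."""
    if len(fields) > MAX_FIELDS:
        raise ValueError(f"at most {MAX_FIELDS} fields are allowed")
    regex = re.compile(pattern) if pattern else None
    rows = []
    for log in logs:
        row = {f: str(log.get(f, "")) for f in fields}
        if regex and filter_field in row:
            match = regex.search(row[filter_field])
            row[filter_field] = match.group(0) if match else ""
        rows.append(row)
    return rows


def group_rows(rows, group_by):
    """Group table rows by up to three fields."""
    if len(group_by) > MAX_GROUP_LEVELS:
        raise ValueError(f"at most {MAX_GROUP_LEVELS} grouping levels are allowed")
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row.get(f, "") for f in group_by)].append(row)
    return dict(groups)


# Hypothetical log records, for illustration only.
logs = [
    {"host": "web-1", "level": "ERROR", "message": "timeout after 3000ms calling /cart"},
    {"host": "web-1", "level": "ERROR", "message": "timeout after 4500ms calling /pay"},
    {"host": "web-2", "level": "WARN", "message": "retry 2 calling /cart"},
]

rows = build_rows(logs, ["host", "level", "message"],
                  pattern=r"/\w+", filter_field="message")
grouped = group_rows(rows, ["host", "level"])
for key, members in grouped.items():
    print(key, [m["message"] for m in members])
```

Here the message field is reduced to the endpoint path, and the rows are grouped by host and then level, which mirrors how a compact table can replace a verbose JSON payload in a notification.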

When it comes to endpoints, Logz.io can be integrated with any paging, messaging, or incident management application that uses webhooks. As of today, it supports sending notifications (for example, a triggered alert, a shared Kibana object, or a new Logz.io insight) to BigPanda, Datadog, PagerDuty, Slack, VictorOps, or a custom endpoint.

A concrete example of how Logz.io can seamlessly work with an incident management system is its integration with PagerDuty. The two systems can be configured to work together in two simple steps. First, in the PagerDuty Configuration/Services menu, add Logz.io as a New Integration to a new or existing service, and copy the Integration Key for the new integration. Second, go to the Logz.io Alerts/Alert Endpoints page, select PagerDuty as the alert type, give it a name, and enter the Integration Key. Once PagerDuty has been added as a Notification Endpoint, PagerDuty will notify you when predefined alerts are triggered in your ELK Stack environment.
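With the built-in integration, Logz.io handles delivery for you. Purely to illustrate what an Integration Key is used for, the sketch below builds (but does not send) a trigger event for PagerDuty's public Events API v2, where the key is passed as the routing key. This is not necessarily what Logz.io sends internally; the summary and source values, the helper name, and the placeholder key are assumptions made for this example.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_request(integration_key, summary, source, severity="error"):
    """Build an HTTP request that would trigger a PagerDuty incident."""
    body = {
        "routing_key": integration_key,  # the Integration Key from the service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # one of: critical, error, warning, info
        },
    }
    return urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_trigger_request(
    integration_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    summary="Error rate above threshold in checkout logs",  # hypothetical alert
    source="logz.io-alert",                                 # hypothetical source
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send the event: urllib.request.urlopen(request)
```

The request is only constructed and printed here, so the example can be run without a PagerDuty account.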

Summary

In today’s highly competitive market, maintaining performance SLAs and minimizing downtime have become critical business KPIs. Coupling your monitoring and incident management systems creates a synergy that can dramatically accelerate incident response times and even prevent incidents before they occur.