Reduce MTTR and Address the Talent Gap with Logz.io Alert Recommendations

By: Matt Hines

When our CEO and co-founder Tomer Levy delivered his “Observability is Broken” presentation at last year’s AWS re:Invent, he highlighted numerous challenges faced by today’s organizations as they seek to advance their observability practices.

Of the six individual points that he noted, two specifically dealt with the current shortage of available engineering expertise, with another two focused on data overload. If you read into those points, there’s a clear trend around resourcing and efficiency.

Now turn to the recently published DevOps Pulse Report 2023. For the third straight year, the key metric of MTTR continued to grow at a troubling rate despite growing observability and DevOps maturity.

I could go on citing data points, but the conclusion is neither new nor especially surprising: organizations simply need better ways to translate their observability data into actionable insights.

Modeling and Scaling Human Interactions

What’s the solution? You guessed it: AI, specifically AI backed by supervised machine learning that helps teams take action faster while relying less on experienced human expertise (since this isn’t a limitless resource).

The ideas behind supervised machine learning date back to the middle of the last century. These days countless software systems use it to make recommendations by modeling the input of human operators. So, when it comes to the previously cited observability challenges, applying AI built on supervised machine learning has to be (forgive me) a no-brainer.

Helping our customers get better guidance so they can troubleshoot their software faster is the precise goal of our latest innovation, the newly released Alert Recommendations capability in the Logz.io Open 360™ platform.

Based on a unique, patent-pending form of supervised ML, Alert Recommendations models actions taken by platform users and then advises subsequent users what to do when faced with similar issues, recommending the fastest, most effective resolution path.

Why is this so helpful? Whenever an alert triggers in production, most teams follow a common troubleshooting routine. But beyond the sheer number of alerts we face, different users will always respond in different ways, and some of those tactics are ultimately more efficient than others.

In the past, we have also relied on more static assets such as runbooks to document common practices aimed at accelerating and standardizing response. However, as with all such materials, they quickly become outdated or even obsolete in our dynamic, ephemeral cloud environments.

What’s really needed is a way to automate intelligence derived from our best and brightest engineers and analysts, not only to reduce resolution times but to inform and improve the actions of less experienced teammates. Supervised ML is the best way to do this, addressing many of Tomer’s initial observability critiques in a productized fashion.

Alert Recommendations – How it Works

The quick summary of how Alert Recommendations works is as follows:

The system records the actions taken during investigations, from opening dashboards to running specific searches across various data types to creating alerts.
Next, it tracks those actions per type of investigation and type of issue, starting from the moment someone clicks on the alert that notified them of a problem.
Meanwhile, related algorithms monitor how effective these steps are and how quickly they resolve the related issues in production.
The system then applies a clustering algorithm to group the relevant actions, including queries, filters, keywords, application types and more.
Clustering also draws on a combination of the related terms’ frequency, density and distance to help determine which actions count as “similar.”
Within each cluster, a learning algorithm assigns every action a score based on multiple features, such as popularity, user experience, time and frequency, total investigation time, and more.
From there, the system generates recommended actions per problem, building a unique model based on all the relevant parameters.
Going forward, the platform embeds recommendation links to the most relevant solution in any similar alerts, or suggests them during the investigation phase.
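
To make the clustering and ranking steps above a little more concrete, here is a minimal Python sketch. It is purely illustrative and is not the Logz.io implementation: the Action schema, the term-frequency cosine similarity, the greedy clustering pass, the 0.5 threshold and the scoring formula are all assumptions chosen for brevity.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from math import sqrt


@dataclass
class Action:
    """One recorded investigation step (hypothetical schema)."""
    text: str                  # e.g. the query or filter the user ran
    minutes_to_resolve: float  # total investigation time for that session


def term_vector(text: str) -> Counter:
    # Term frequency over lowercase, whitespace-separated tokens.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(actions: list[Action], threshold: float = 0.5) -> list[list[Action]]:
    # Greedy single pass: join the first cluster whose first member is
    # similar enough, otherwise start a new cluster.
    clusters: list[list[Action]] = []
    for action in actions:
        vec = term_vector(action.text)
        for members in clusters:
            if cosine(vec, term_vector(members[0].text)) >= threshold:
                members.append(action)
                break
        else:
            clusters.append([action])
    return clusters


def score(members: list[Action], total: int) -> float:
    # Assumed scoring: popularity (share of all recorded actions) divided
    # by mean resolution time, so popular and fast clusters rank higher.
    popularity = len(members) / total
    mean_minutes = sum(a.minutes_to_resolve for a in members) / len(members)
    return popularity / (1.0 + mean_minutes)


def recommend(actions: list[Action]) -> list[tuple[str, float]]:
    # Return one representative action per cluster, best score first.
    clusters = cluster(actions)
    ranked = sorted(clusters, key=lambda c: score(c, len(actions)), reverse=True)
    return [(c[0].text, score(c, len(actions))) for c in ranked]


history = [
    Action("search logs level:error service:checkout", 12),
    Action("search logs level:error service:checkout env:prod", 8),
    Action("open dashboard kubernetes nodes", 40),
]
for text, value in recommend(history):
    print(f"{value:.4f}  {text}")
```

In this toy history, the two near-identical log searches fall into one cluster that is both more popular and faster to resolve, so it is ranked ahead of the dashboard action. A production system would learn its weights from many more features, as the steps above describe.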

Bottom line: the more an organization uses Open 360 alerting, the more informed and refined the recommendations become. That reduces MTTR and automates insight, building on the experience of the organization’s best users so that everyone else benefits from their expertise.

We think this type of Alert Recommendations capability represents the future potential of “AIOps” and advances the growing ability of users and platforms to automate even more of the work that currently consumes so much time and so many critical resources.

If observability is indeed broken, we need to focus on the innovations that truly move the needle in allowing us to work smarter and faster.

See how the Open 360 platform can help transform your observability strategy by signing up for a free trial today.