AI-Powered Observability: Picking Up Where AIOps Failed

By: Asaf Yigal

September 27, 2024

GenAI promises evolutionary changes in how we use observability tools, but meeting expectations means heeding the lessons of our AIOps mistakes.

The emergence of generative AI in observability tools was inevitable, but there’s already been an extreme degree of hype in the market. Monitoring, DevOps and ITOps have never been immune to trends, and with GenAI capabilities, the hype machine is running out of control.

Organizations looking to ride the wave of GenAI undoubtedly recall the massive hype around AIOps tools in the not-so-distant past. The core purpose of AIOps was to address the complexity, volume and velocity of operational telemetry, enabling proactive incident response and reducing manual intervention.

Many believed that AIOps was the future that could solve problems within systems, but adoption lagged because AIOps didn’t meet the needs of critical IT use cases. What were organizations trying to get out of AIOps? What were the right tools? Those questions were never answered.

To succeed, AIOps needed organizations to change their processes, and many organizations were reluctant to do that. Failure to realize benefits from those solutions wasn’t due to the technology — it was because organizations weren’t making the changes required to get those benefits.

How AI-Powered Observability Can Meet Expectations

Organizations are looking for productivity gains in their IT environments. Many ask: “How can we complete tasks faster? How can we shorten our time-to-value? What can we do to remediate issues faster so we can focus on the core of our business?”

GenAI and AI-powered observability tools can help in all of these areas. Surfacing insights about system behavior — and providing direct knowledge on how to remediate issues that arise in telemetry data (logs, metrics and traces) — is what observability should provide.

Traditionally, these insights haven’t been available to anyone except technical experts and analysts who understand complex query languages or have an intimate understanding of the telemetry data flowing through a system. But what if AI-powered observability can take things even a step further? What if you could interact with your system using natural language?

There’s potential for these tools to open up deeper insights to a much broader user base. This could significantly increase awareness of system behavior, democratize observability to nontechnical users and provide greater understanding of points of failure or difficulty in environments.

In an era of IT staffing knowledge gaps and hiring difficulties, AI-powered observability could fill some of those needs. What would it mean for your team to have the equivalent of a junior developer working directly within your technology platform?

The strongest applications of observability today involve strategic capabilities delivered through GenAI integration. These range from automatic collection of relevant contextual insights and anomaly detection, to the ability to pinpoint which data is critical so that data volumes and costs can be optimized.

AI-powered capabilities can transform the day-to-day interactions of engineering and DevOps teams by reinventing core monitoring and troubleshooting practices, spanning from querying to root cause analysis.

These types of AI-powered systems — with full dashboarding, data visualizations and answers to pressing questions in seconds — can help meet the promise AIOps was intended to provide.

The core idea of AIOps is to pull in as much telemetry data as possible to identify anomalies. However, this is different from what observability solutions provide. Observability solutions work on selected telemetry data and display real-time metrics, such as CPU usage or other areas of interest.

While incorporating AI for anomaly detection within these metrics might seem like an AIOps feature, it actually is an enhancement to an observability solution. In contrast, AIOps starts with AI and might not offer a single dashboard.
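To make the idea of anomaly detection layered on a single metric concrete, here is a minimal sketch that flags CPU usage samples that stray far from their recent history. It uses a simple rolling z-score as a stand-in for whatever model a real product would apply; the function name, window size, threshold and sample values are all illustrative assumptions, not taken from any particular tool.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations.

    Hypothetical helper for illustration only; a rolling z-score is a
    basic statistical technique, not a GenAI method.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against; skip it.
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up CPU usage percentages with one obvious spike at index 6.
cpu_usage = [41.0, 43.0, 42.0, 40.0, 44.0, 42.0, 95.0, 43.0, 41.0]
print(find_anomalies(cpu_usage))
```

The point of the sketch is the placement of the logic: the detection runs on a metric the observability tool already collects and displays, which is what distinguishes it from an AIOps approach that starts from the AI.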

The Revolution Is Waiting, But We Must Evolve First

The lessons from AIOps must be applied to the next generation of observability tools for them to help organizations meet varied and intricate use cases around cloud-native, ephemeral architectures.

Thanks to GenAI, there is potential for evolutionary changes in the way we interact with our observability tools, as well as revolutionary changes in how we organize our operations teams.

We’re already seeing benefits of bringing GenAI into observability tools, such as:

Teams can use these capabilities to filter out irrelevant data and speed troubleshooting.
AI can identify top errors and suggest potential mitigation strategies.
Manual processes can be automated to save engineers hours of work, so they can focus on bigger-picture strategies and projects.
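As a rough illustration of the first two benefits, the sketch below filters a batch of log lines down to errors and ranks the most frequent ones. This is only the plain counting step that an AI-assisted tool might build on before suggesting mitigations; the log format (a level followed by a message), the function name and the sample lines are assumptions made for the example.

```python
from collections import Counter


def top_errors(log_lines, n=3):
    """Return the `n` most frequent ERROR messages as (message, count) pairs.

    Hypothetical helper: assumes each line looks like "<LEVEL> <message>".
    """
    errors = Counter()
    for line in log_lines:
        level, _, message = line.partition(" ")
        if level == "ERROR":
            # Everything that is not an error is treated as irrelevant here.
            errors[message] += 1
    return errors.most_common(n)


# Made-up log lines for illustration.
logs = [
    "INFO service started",
    "ERROR db timeout",
    "ERROR db timeout",
    "DEBUG cache hit",
    "ERROR disk full",
]
print(top_errors(logs, n=2))
```

A real system would need to group messages that differ only in IDs or timestamps, which is one place where a learned model can add value over exact-match counting.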

It is one thing to talk about implementing these capabilities and another to take advantage of them. The question remains what benefit organizations can realistically get from these shifts. Use cases have to be met, and productivity gains must be realized. It can be challenging for organizations to understand and accept the necessary changes; if the barriers are too great, the benefits won’t materialize.

The next-generation approach to system monitoring and management, which leverages GenAI and machine learning to automatically detect, diagnose and resolve issues without human intervention, isn’t far off. This evolution will allow technical teams to focus on strategic tasks while ensuring optimal system performance and reliability.

Teams are best served by remembering the successes and failures of past rapid technology shifts. Be prepared to shift mindsets across an organization to meet your goals.