Manual vs. AI-Driven Alert Triage and RCA: Who Will Win?
August 6, 2025
Curious to see how AI actually performs in a real-world production scenario?
Watch the webinar “AI-Driven Alert Triage and RCA” with Logz.io Customer Success Engineer Seth King. Below, we cover the main highlights of the webinar.
AI promises to make engineers more efficient and agile by shortening processes and surfacing insights that help drive decisions. As an AI-driven observability company, we decided to put an AI agent to the test: investigate the same alert manually and with AI, and compare the results.
The Manual Alert Triage Challenge
Manual alert triage is a time-consuming process. Engineers spend a lot of time sifting through logs, correlating events, and hunting for context. This is a reactive process that slows down root cause analysis and increases mean time to resolution (MTTR). In addition, scanning thousands of logs in real time across systems isn’t scalable without automated help.
Example Use Case: gRPC Error 14 – Manual Investigation
Let’s take an example of a common issue, a recurring gRPC Error 14 (“unavailable”), and see what a manual investigation looks like.
Manual Log Investigation
1. The engineer is notified on Slack about an alert.
2. The engineer views the relevant logs, trying to find what triggered the alert. The logs indicate a gRPC error 14, pointing to possible connectivity issues or a downstream service failure.
3. To start the investigation, the engineer runs a Lucene query that searches for all logs from the frontend container with an “unavailable” string in the message field, from the past three hours.
Let’s say ~1,400 logs were found. This suggests a widespread, persistent issue that requires further investigation.
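For illustration, the search could look roughly like the Lucene query below. The field names (kubernetes.container_name, message) are assumptions that depend on how your logs are shipped and parsed, and the three-hour window is set in the time picker rather than in the query itself:

```
kubernetes.container_name: "frontend" AND message: "unavailable"
```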
Manual Service Investigation
4. To identify whether other parts of the stack are affected and which services the frontend is unable to reach, a search is run for the string “unavailable” in the message field of non-frontend containers.
Let’s say no other services were found to be affected. This indicates the issue is isolated to the frontend, possibly a failing connection to a single downstream dependency.
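A sketch of that search, under the same assumed field names, simply inverts the container filter:

```
message: "unavailable" AND NOT kubernetes.container_name: "frontend"
```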
Manual Infrastructure Investigation
5. Now, we’ll check the infrastructure by examining the Kubernetes node the frontend was running on.
Let’s say the node appears to be healthy. This rules out infrastructure instability and points back to a connectivity or service-to-service communication issue.
Manual Dependency Investigation
6. It’s time to check for errors in the other services the frontend depends on, for example, the cart and recommendation services.
No relevant errors were found, suggesting these services were operating normally and dependencies are healthy.
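A query along these lines could cover that check. The service names (cartservice, recommendationservice) and the parsed level field are hypothetical; adjust both to your environment:

```
kubernetes.container_name: ("cartservice" OR "recommendationservice") AND level: "error"
```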
What we achieved with manual alert investigation:
The issue appears to be isolated to the frontend service, with consistent errors occurring repeatedly, while the infrastructure and dependent services seem healthy. This points to a likely networking or connectivity problem, though the exact root cause is still unknown.
How an AI Agent Automates Investigation and Identifies Root Cause
AI agents in observability platforms can automatically scan logs and metrics, identify root causes, and generate actionable insights, often catching issues humans miss. Unlike static automation, these agents are non-deterministic, and they adapt their analysis steps dynamically. This allows for deeper, context-aware investigations through autonomous workflows.
Now, let’s see how an AI agent investigates the same issue from above.
gRPC Error 14 – AI Investigation
The AI agent runs the investigation automatically, but by reviewing the agent’s chat history we can see the steps it took to analyze the gRPC error 14 issue.
Observing AI Agent Investigation
1. The AI agent followed an investigation path similar to the engineer’s, starting with the log views. However, the AI agent broadened the log search beyond the initial alert, expanding on the human engineer’s actions.
2. The AI agent identified a consistent error pattern, confirming the engineer’s conclusion that this is a widespread issue.
3. Then, the AI agent checked for the impact across additional services. It found that the issue was isolated to the frontend, just like the engineer identified.
4. Unlike the human engineer, the AI agent also searched for additional errors beyond error 14.
5. The AI agent went deeper than the human investigation and successfully identified the root cause: an issue with the recommendations endpoint. The problem affects multiple pods and results in 500 errors across various product IDs.
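For readers who want to corroborate such a finding by hand, a search of roughly this shape would surface the failing recommendation calls. The tokens searched in the message field are hypothetical, not the agent’s actual query:

```
kubernetes.container_name: "frontend" AND message: "recommendation" AND message: "500"
```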
AI Agent Querying
6. To enhance the AI analysis and our understanding of the issue, we can ask the AI agent follow-up questions. For example:
- Which services does the frontend depend on?
- Which service is most impacted by the frontend’s error 14?
The AI agent provides both direct answers and contextual explanations, giving a clear picture of service relationships and impact. The conversation can continue from there to advance the investigation and get guidance on how to resolve the issue.
AI Agent Identifies the Root Cause
7. The agent identified the root cause as not just a frontend issue, but a problem with the recommendation service, with signs like CrashLoopBackOff status and a high restart count.
These findings weren’t surfaced during the manual investigation.
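If Kubernetes events or pod logs are shipped into the same account (an assumption about the setup), a manual search like the following could have surfaced those same symptoms. The field names and pod name pattern are hypothetical:

```
kubernetes.pod_name: recommendationservice* AND (message: "CrashLoopBackOff" OR message: "Back-off restarting failed container")
```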
Additional AI Agent Capabilities
Engineers can leverage additional AI agent capabilities for advanced RCA, including:
- Inputting context the agent will use for analysis the next time a similar alert fires.
- Visualizing the data and insights as dashboards through natural language prompts.
- Going back to the agent’s query history for further investigations.
- Remediating issues through AI-suggested next steps.
AI Agent vs. Manual Investigation: Main Benefits
The AI agent outperformed the manual investigation through a quicker, more insightful and iterative process. The main advantages:
- The AI agent completed the investigation faster than the manual process.
- The AI agent identified the true root cause of the issue, which wasn’t uncovered in the manual investigation.
- The AI agent can dynamically adapt its analysis based on human input, similar to the iterative process a human goes through.
- Using an AI agent future-proofs troubleshooting and triage, since the context the engineer provides is reused in future investigations.
- The AI agent shortens MTTR, thanks to the accelerated investigation and contextual remediation guidance.
Making an AI Agent Work for You
Watch the entire webinar in the video above.
To see a personalized demo of an AI agent and how it can fit your environments, click here.