Webinar: A Four-Step Blueprint for Faster Root Cause Analysis

By: Libi Michelson

June 29, 2026

Incident investigations take so long not because the fix is hard, but because finding the right fix is. Most engineers spend 20 to 60 minutes just understanding what’s wrong before they can act: not fixing anything, just trying to see the full picture. The framework that changes this has four steps (Orient, Isolate, Hypothesize, and Verify), and the order matters more than the tools.

On June 23rd, Logz.io hosted a live webinar titled “From Raw Telemetry to Actionable RCA: Logz.io’s Blueprint and Customer Insights,” bringing together engineers, SREs, and platform teams to work through this problem in depth. The session drew attendees from across the SRE, DevOps, NOC, and platform engineering community, and the recording is now available on demand.

The Problem: Investigation Time, Not Fix Time

The webinar opened with a scenario familiar to most in the audience: a 2:47 AM checkout latency alert, and an engineer named Maya opening five, six, seven tools before understanding what is actually wrong. Metrics in one tab, logs in another, then Slack, traces, and a runbook last updated six months ago.

David Lotan Bolotnikoff, VP of Product at Logz.io, framed the real cost: “Everyone tracks MTTR, time to resolve. But most of that time is time to understand, not time to fix. The fix is often a small rollback. Finding out it’s the right rollback is the expensive part.”

A live poll confirmed the point. Most attendees reported spending 20 to 60 minutes just understanding an incident before they could act on it.

Framework First, AI Second

The session’s central argument was that AI tools fail when deployed on top of a broken process. “Without an RCA method, you’re just automating a messy process,” David said. “You don’t get order. You just get faster mess.”

Kevin Klein, the AI engineer who leads OrionIQ’s RCA specialization and built the core reasoning system, echoed this: “Getting real value from automated RCA requires a clear RCA process first. The agent speeds that process up. If the process is bad, garbage in, garbage out.”

According to Logz.io’s field data, about 70 percent of incidents trace back to a change made in the hour before the incident, which makes “what changed?” the highest-leverage question in any investigation.

The framework they presented has four steps, and the order matters: Orient (build one timeline), Isolate (ask what changed), Hypothesize (rank suspects by evidence, not opinion), and Verify and act (confirm before touching production).

Orient. Put everything on one timeline with one shared clock, including not just logs, metrics, and traces, but recent deploys, config changes, feature flags, alert history, runbooks, past incidents, and Slack threads. The answer is usually already in the data, but the challenge is seeing it all at once.
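
As a rough illustration of the "one timeline, one shared clock" idea, the sketch below merges events from several feeds and normalizes every timestamp to UTC before sorting. The Event type, the feed names, and the sample data are all hypothetical and are not part of OrionIQ or any Logz.io API. It assumes every timestamp is timezone-aware.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    at: datetime   # assumed to be timezone-aware
    source: str    # hypothetical feed name, e.g. "deploys" or "slack"
    summary: str


def build_timeline(*feeds):
    """Merge every feed into one list ordered on a single UTC clock."""
    merged = [
        Event(e.at.astimezone(timezone.utc), e.source, e.summary)
        for feed in feeds
        for e in feed
    ]
    return sorted(merged, key=lambda e: e.at)


# Hypothetical sample data; one feed reports in a UTC-7 local offset.
utc = timezone.utc
minus_seven = timezone(timedelta(hours=-7))
deploys = [Event(datetime(2026, 1, 15, 2, 46, tzinfo=utc), "deploys", "checkout service deployed")]
alerts = [Event(datetime(2026, 1, 15, 2, 47, tzinfo=utc), "alerts", "checkout latency alert fired")]
chat = [Event(datetime(2026, 1, 14, 19, 40, tzinfo=minus_seven), "slack", "feature flag change announced")]

timeline = build_timeline(deploys, alerts, chat)
for event in timeline:
    print(event.at.isoformat(), event.source, event.summary)
```

The chat message carries the previous day's date in its local offset, yet it lands first on the merged timeline because every event is compared on the same clock.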

Isolate. The core question here is what changed. Deploys, config updates, scaling events, dependency shifts, access changes: all of it needs to be lined up on the same clock as the telemetry data.
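
One way to make "what changed?" concrete is to filter the change history down to a window just before the incident started. The sketch below is a minimal version of that filter. The function name, the change descriptions, and the one-hour default lookback are illustrative assumptions, the last one chosen to match the field-data figure cited earlier.

```python
from datetime import datetime, timedelta, timezone


def changes_before(incident_start, changes, lookback=timedelta(hours=1)):
    """Return (when, what) changes inside the lookback window, most recent first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c[0] <= incident_start]
    return sorted(recent, key=lambda c: c[0], reverse=True)


# Hypothetical change history, all on one UTC clock.
utc = timezone.utc
incident_start = datetime(2026, 1, 15, 2, 47, tzinfo=utc)
changes = [
    (datetime(2026, 1, 15, 0, 30, tzinfo=utc), "scaling: worker pool resized"),
    (datetime(2026, 1, 15, 2, 10, tzinfo=utc), "config: connection pool size"),
    (datetime(2026, 1, 15, 2, 46, tzinfo=utc), "deploy: checkout service"),
    (datetime(2026, 1, 15, 3, 5, tzinfo=utc), "access: new API key issued"),
]

suspects = changes_before(incident_start, changes)
for when, what in suspects:
    print(when.isoformat(), what)
```

Changes made after the incident began, or well before the window, drop out, leaving a short list of suspects ordered by how close they sit to the alert.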

Hypothesize. Rank suspects by evidence, not by who talks loudest in the incident channel. Every suspect needs supporting data. “The database is slow” is a guess. “Database p99 jumped 40ms right after the 2:46 deploy that added a slow query, here’s the trace” is a hypothesis. A past fix for the same pattern is also strong evidence and can move a hypothesis to the top of the list.
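
The difference between a guess and a hypothesis can be expressed as a simple data rule: a claim with no attached evidence is never ranked. The sketch below shows one possible scoring scheme. The evidence kinds and their weights are assumptions made for illustration, including the higher weight for a past fix, and do not describe how OrionIQ scores anything.

```python
from dataclasses import dataclass, field

# Hypothetical weights; a past fix for the same pattern counts the most.
EVIDENCE_WEIGHTS = {"metric": 2, "log": 2, "trace": 3, "change": 3, "past_fix": 4}


@dataclass
class Hypothesis:
    claim: str
    evidence: list = field(default_factory=list)  # (kind, reference) pairs

    def score(self):
        # Unknown evidence kinds still count, but only minimally.
        return sum(EVIDENCE_WEIGHTS.get(kind, 1) for kind, _ in self.evidence)


def rank(hypotheses):
    """Return (supported hypotheses by descending score, unsupported guesses)."""
    supported = sorted(
        (h for h in hypotheses if h.evidence),
        key=lambda h: h.score(),
        reverse=True,
    )
    guesses = [h for h in hypotheses if not h.evidence]
    return supported, guesses


candidates = [
    Hypothesis("The database is slow"),
    Hypothesis("Traffic spike", [("metric", "requests per second chart")]),
    Hypothesis(
        "Database p99 jumped 40ms right after the 2:46 deploy that added a slow query",
        [("metric", "p99 chart"), ("change", "2:46 deploy"), ("trace", "slow query trace")],
    ),
]

supported, guesses = rank(candidates)
for h in supported:
    print(h.score(), h.claim)
print("unsupported:", [h.claim for h in guesses])
```

Keeping the evidence references next to the claim means anyone in the incident channel can check the ranking rather than argue with it.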

Verify and act. Confirm the hypothesis before touching production, then take the smallest safe step and write down what you learned, because that note becomes the context for the next incident.

What OrionIQ Actually Does

The second half of the session covered what an AI agent looks like when the framework is already in place. Kevin was direct about where the hype ends: most AI in observability today is text summarization at best, and OrionIQ is built to do something more substantive than that.

What distinguishes an investigation agent, Kevin explained, comes down to four things: it works across all available context including runbooks, tickets, past incidents, and chat rather than just telemetry; it defends its reasoning when challenged; it can act with human approval; and it learns from feedback over time.

The live demo made this tangible. Kevin showed IQon, OrionIQ’s chat-driven interface, working through the same 2:47 AM checkout incident from the opening story. He asked it to look at the incidents channel and decide what mattered, and rather than surfacing every alert, it found the one that counted and ignored the rest.

From there, it ran the framework automatically: one timeline, changes overlaid, a hypothesis ranked by evidence. It surfaced a Confluence runbook from a similar incident three weeks earlier. Then Kevin pushed back and asked how the agent could know it wasn’t just a traffic spike. The agent walked through its reasoning: latency stayed flat as traffic climbed, then broke only after the deploy, with the log line, the trace, and the deploy timestamp all visible. The agent recommended an immediate rollback of the deploy that introduced the error, then waited for a human decision before touching anything in production.
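
The "traffic spike or deploy?" reasoning can be approximated with a small check: if latency was flat across low and high traffic before the deploy and jumped only afterward, the deploy is the better explanation. The sketch below is an illustrative heuristic only, not the agent's actual method. The function name, the sample numbers, and the 10 ms tolerance are assumptions.

```python
from statistics import median


def deploy_explains_latency(samples, deploy_minute, tolerance_ms=10.0):
    """samples: (minute, requests_per_sec, latency_ms) tuples.

    True when pre-deploy latency was flat across traffic levels
    and latency jumped only after the deploy.
    """
    before = [s for s in samples if s[0] < deploy_minute]
    after = [s for s in samples if s[0] >= deploy_minute]
    if len(before) < 2 or not after:
        return False  # not enough data to compare

    # Split pre-deploy samples into a quieter half and a busier half.
    by_traffic = sorted(before, key=lambda s: s[1])
    half = len(by_traffic) // 2
    quiet = median(s[2] for s in by_traffic[:half])
    busy = median(s[2] for s in by_traffic[half:])

    flat_under_load = abs(busy - quiet) <= tolerance_ms
    jump = median(s[2] for s in after) - median(s[2] for s in before)
    return flat_under_load and jump > tolerance_ms


# Hypothetical data: traffic climbs steadily, latency breaks at minute 6.
deploy_case = [
    (0, 100, 120), (1, 150, 122), (2, 200, 119),
    (3, 250, 121), (4, 300, 123), (5, 350, 120),
    (6, 360, 410), (7, 365, 430),
]
# Hypothetical data: latency simply tracks traffic.
spike_case = [(0, 100, 100), (1, 200, 150), (2, 300, 200), (3, 400, 250), (4, 500, 300)]

print(deploy_explains_latency(deploy_case, deploy_minute=6))
print(deploy_explains_latency(spike_case, deploy_minute=4))
```

A real investigation would also want the supporting log line and trace, as in the demo; a check like this only rules the traffic explanation in or out.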

David was direct about what this means for teams: “We’re not saying this replaces the team. It changes what the team works on, away from doing the same repetitive work again and again, towards work that only people can do.”

Results in Production

The session included a customer story that grounded the demo in reality. ThetaRay automated nearly their entire NOC using OrionIQ, as described in a case study published together with Logz.io. The people using it daily rate it at 87% on average. That is not a marketing number but a live trust score generated by humans reviewing and grading the agent’s work on every run.

That score feeds back into the system as well. Every review, approval, or correction both measures the agent empirically and sharpens the playbooks it draws from. The goal, as David put it, is that the agent never quietly gets worse.

Human Control, Always

A recurring question throughout the Q&A was what happens when the agent gets it wrong. The answer was consistent: every action requires your approval. You can give the agent read-only APIs, gate any action that touches production, and maintain a full audit record of everything it did and why. You also control how much autonomy the agent has for each type of incident.
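
The control model described here (approval by default, autonomy granted per incident type, and a full audit record) might look something like the sketch below. The ActionGate class, the incident types, and the action names are hypothetical and do not reflect OrionIQ's configuration or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActionGate:
    # Hypothetical policy: incident type -> actions allowed without approval.
    # Empty by default, so every action needs a human approver.
    auto_allowed: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def request(self, incident_type, action, reason, approved_by=None):
        """Decide whether an action may run, and record the decision either way."""
        pre_approved = action in self.auto_allowed.get(incident_type, set())
        allowed = pre_approved or approved_by is not None
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "incident_type": incident_type,
            "action": action,
            "reason": reason,
            "approved_by": approved_by,
            "allowed": allowed,
        })
        return allowed


gate = ActionGate(auto_allowed={"disk_full": {"rotate_logs"}})

# A production rollback is blocked until a named human approves it.
blocked = gate.request("checkout_latency", "rollback_deploy", "latency broke after deploy")
approved = gate.request("checkout_latency", "rollback_deploy", "latency broke after deploy", approved_by="maya")
routine = gate.request("disk_full", "rotate_logs", "disk usage above threshold")

print(blocked, approved, routine)
print(len(gate.audit_log), "audit entries")
```

Denied requests are logged alongside approved ones, so the record shows what the agent wanted to do and why, not only what it was allowed to do.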

David also offered practical advice for getting started: involve your CISO early. An agent with full context means letting it connect to your systems, which is a data access and security conversation. The teams that move fastest are the ones that start that conversation before the technology is in place.

What to Do Next

The session closed with practical steps: adopt the four-step framework whether you plan to automate it or run it by hand; prune alert noise before automating anything; involve your CISO early; and pick one pilot on a high-value, reversible alert to get one agent earning trust before expanding.

A new version of OrionIQ’s RCA capabilities begins rolling out in the coming days, and customers who upgrade will receive the new capabilities as they land.

Watch the Recording

The full session, including the live demo, Q&A, and the challenge moment, is available on demand here.

Talk to the Team

If you want to see OrionIQ reasoning through incidents on your own telemetry, the OrionIQ team is available for working sessions. Request a demo here.

FAQs

What is the difference between MTTR and time to understand?

MTTR measures how long it takes to resolve an incident from start to finish. But most of that window is spent figuring out what is wrong, not fixing it. The fix is often a one-line rollback. Finding the right rollback is where the time goes. Separating these two phases is the first step to improving both.

Do I need to rewrite my runbooks before using an AI agent?

No. OrionIQ reads the runbooks you already have, as they are. After each investigation, you review the agent’s work and correct it where needed. That feedback sharpens the playbooks over time from real, reviewed incidents rather than from someone maintaining a wiki.

What does “Framework First, AI Second” mean in practice?

It means setting up a consistent four-step process (Orient, Isolate, Hypothesize, Verify) before introducing any automation. An AI agent that runs on top of a clear process gets faster and more accurate over time. One running on top of an inconsistent process just produces inconsistent results faster.

How does OrionIQ handle alert fatigue?

Rather than sending every alert to a human to triage, OrionIQ reads the full alert stream and surfaces the ones that actually matter. In the demo, Kevin gave it access to the incidents channel and asked what needed attention. It found the one real issue and ignored the noise without being told what to look for.