How RCA Works
Investigation Triggered
RCA begins when an incident is created or manually initiated. The system creates a dedicated AI conversation and queues an RCA task in the background.
Agent Activation
Based on your connected infrastructure, relevant specialized agents are activated. Agent Anna coordinates the investigation, while specialists (Alex, Tony, Kai, Oliver) focus on their domains.
Context Gathering
Agents explore infrastructure topology, collect baseline metrics, identify affected services, and examine recent deployments and configuration changes.
Analysis
The AI forms 2-5 competing hypotheses about potential causes. Each hypothesis is tested by examining logs, traces, and dependencies. Evidence either confirms or rules out each theory.
Investigation Phases
RCA follows a structured three-phase workflow. When agents move to a new phase, the previous phase automatically completes if still in progress.
Phase 1: Context Gathering
Agents collect infrastructure context to establish baseline conditions:
- Map affected services and dependencies via topology
- Gather metrics from CloudWatch, Prometheus, Datadog
- Compare incident metrics to historical baselines
- Identify recent deployments and configuration changes
Phase 2: Analysis & Hypothesis Testing
Agents narrow down root cause through hypothesis testing:
- Generate 2-5 competing theories based on symptoms
- Collect evidence: logs, traces, dependencies, resource metrics
- Test each hypothesis against evidence
- Rule out hypotheses when evidence contradicts them
- Track confidence scoring as evidence accumulates
Phase 3: Resolution
Finalize root cause with evidence and remediation:
- Resolve all remaining hypotheses (confirmed or ruled out)
- Confirm the winning hypothesis as root cause
- Curate 3-6 strongest evidence items
- Generate 1-3 actionable remediation steps
- Set disposition (IDENTIFIED, NOT_FOUND, FALSE_ALARM, ON_HOLD)
- Record confidence score (0.0-1.0)
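The phase transition rule above (a phase still in progress completes automatically when agents enter a new one) can be sketched as follows. This is a minimal illustration only: `Phase`, `PhaseTracker`, and the status strings are hypothetical names, not the product's API.

```python
from enum import Enum


class Phase(Enum):
    CONTEXT_GATHERING = 1
    ANALYSIS = 2
    RESOLUTION = 3


class PhaseTracker:
    """Tracks the status of each phase (hypothetical helper, for illustration)."""

    def __init__(self):
        self.status = {phase: "pending" for phase in Phase}

    def start(self, phase):
        # Entering a new phase completes any other phase still in progress.
        for other, state in self.status.items():
            if other is not phase and state == "in_progress":
                self.status[other] = "completed"
        self.status[phase] = "in_progress"


tracker = PhaseTracker()
tracker.start(Phase.CONTEXT_GATHERING)
tracker.start(Phase.ANALYSIS)  # Context Gathering is completed automatically
```

After the second call, Context Gathering is marked completed, Analysis is in progress, and Resolution is still pending.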
Evidence Chain
RCA builds a structured evidence chain with automatic calculations. Evidence can be linked to specific hypotheses to show which findings support each theory:
Metrics
Incident vs baseline comparison with auto-calculated deviation percentage. Example: “CPU 95% vs 25% baseline = 280% deviation”
Fields: incident_value, baseline_value, baseline_period, threshold, unit
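The deviation figure in the CPU example can be reproduced by hand. The sketch below assumes deviation is the relative change from baseline, which matches the 280% in the example; the function name is hypothetical and the actual calculation used by RCA may differ in edge cases.

```python
def deviation_percent(incident_value: float, baseline_value: float) -> float:
    """Relative change of the incident value from the baseline, in percent."""
    if baseline_value == 0:
        # Assumption: a zero baseline has no meaningful relative deviation.
        raise ValueError("baseline_value must be non-zero")
    return (incident_value - baseline_value) * 100.0 / abs(baseline_value)


print(deviation_percent(95, 25))  # 280.0
```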
Deployments & Changes
Recent changes with auto-calculated time delta from incident start. Positive = before incident (likely causative).
Fields: type, description, timestamp, correlation, service
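The sign convention for the time delta can be illustrated with a short sketch. It assumes the delta is computed as incident start minus change timestamp, so that a change before the incident yields a positive value; the function name, the unit (minutes), and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime, timezone


def minutes_before_incident(change_time: datetime, incident_start: datetime) -> float:
    """Positive when the change happened before the incident started."""
    return (incident_start - change_time).total_seconds() / 60.0


# Made-up timestamps for illustration only.
incident_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deploy_time = datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc)

print(minutes_before_incident(deploy_time, incident_start))  # 15.0
```

A negative result would mean the change landed after the incident had already started, which makes it an unlikely cause.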
Logs
Relevant log entries with deep links to log consoles (CloudWatch, Splunk, Datadog).
Fields: source, description, deep_link, timestamp, severity
Traces
Distributed trace data showing request flow and latency breakdowns.
Fields: source, description, raw_data
Configuration
Configuration changes with exact parameter modifications.
Fields: source, description, timestamp
Alerts
Related alerts from monitoring systems during the incident window.
Fields: source, severity, description
Confidence Scoring
The AI provides a confidence score (0.0-1.0) for the identified root cause:
| Score Range | Category | Meaning | Action |
|---|---|---|---|
| 0.9 - 1.0 | Very High | Root cause identified with overwhelming evidence | Implement remediation immediately |
| 0.7 - 0.9 | High | Root cause identified with strong evidence | Implement remediation with normal priority |
| 0.5 - 0.7 | Medium | Probable root cause, but gaps remain | Implement remediation; monitor for alternatives |
| 0.3 - 0.5 | Low | Possible root cause, evidence is circumstantial | Validate findings manually before action |
| 0.0 - 0.3 | Uncertain | Insufficient evidence to establish root cause | Cannot determine; consider NOT_FOUND |
- Factors that raise confidence: temporal correlation, metric anomalies (>50% deviation), error patterns, hypotheses ruled out, multiple data sources
- Factors that lower confidence: alternative explanations, weak temporal correlation, missing verification, conflicting evidence
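The score ranges in the table can be turned into a simple lookup. The table's ranges share their boundary values (0.9, 0.7, 0.5, 0.3), so this sketch assumes each lower bound is inclusive; that choice and the function name are assumptions, not documented behavior.

```python
def confidence_category(score: float) -> str:
    """Map a 0.0-1.0 confidence score to the category names in the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Assumption: the lower bound of each range is inclusive.
    if score >= 0.9:
        return "Very High"
    if score >= 0.7:
        return "High"
    if score >= 0.5:
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "Uncertain"


print(confidence_category(0.82))  # High
```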
Hypothesis Tracking
RCA implements hypothesis-driven investigation inspired by “5 Whys” and Fishbone methodologies.
Hypothesis Workflow
Hypothesis States
- Created: Initial theory with 0.0-1.0 confidence estimate
- Investigating: Gathering evidence to test the theory
- Confirmed: Sufficient evidence supports this as root cause
- Ruled Out: Evidence contradicts or disproves hypothesis
Example Timeline
- Root cause cannot be set until at least one hypothesis is confirmed
- Hypotheses require supporting evidence before confirmation
- All hypotheses must be resolved (confirmed or ruled out) before closing the investigation
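The three rules above can be expressed as simple checks over the hypothesis states. This is a minimal sketch: the `Hypothesis` class, its fields, the state strings, and the helper functions are hypothetical and only mirror the rules as stated.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    title: str
    state: str = "created"  # created, investigating, confirmed, ruled_out
    evidence: list = field(default_factory=list)

    def confirm(self):
        # Rule: hypotheses require supporting evidence before confirmation.
        if not self.evidence:
            raise ValueError("supporting evidence required before confirmation")
        self.state = "confirmed"

    def rule_out(self):
        self.state = "ruled_out"


def can_set_root_cause(hypotheses):
    # Rule: at least one hypothesis must be confirmed.
    return any(h.state == "confirmed" for h in hypotheses)


def can_close(hypotheses):
    # Rule: every hypothesis must be confirmed or ruled out.
    return all(h.state in ("confirmed", "ruled_out") for h in hypotheses)


deploy = Hypothesis("Bad deployment", evidence=["error spike after release"])
leak = Hypothesis("Memory leak")
deploy.confirm()
print(can_set_root_cause([deploy, leak]), can_close([deploy, leak]))  # True False
leak.rule_out()
print(can_close([deploy, leak]))  # True
```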
Investigation Timeline
RCA generates a real-time investigation timeline showing every step of the AI’s reasoning.
Timeline Entry Types
- info: General investigation note
- finding: Specific discovery impacting analysis
- warning: Potential issue requiring verification
- error: Failed investigation attempt
- success: Confirmed finding
- hypothesis_created: New theory proposed
- hypothesis_ruled_out: Theory disproven
- hypothesis_confirmed: Hypothesis validated as root cause
Disposition Status
Every investigation must conclude with a disposition. This is the only way to close an incident and move it out of “Investigating” status.
| Status | Meaning | Investigation Continues? |
|---|---|---|
| IDENTIFIED | Root cause found with supporting evidence (requires confirmed hypothesis, confidence 0.7+) | ❌ NO (Terminal) |
| NOT_FOUND | Investigation exhausted, no clear root cause | ❌ NO (Terminal) |
| FALSE_ALARM | Issue was not a real incident | ❌ NO (Terminal) |
| ON_HOLD | Awaiting external input or additional data | ✅ YES (Resumable) |
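The disposition table can be read as a small set of checks: three statuses are terminal, ON_HOLD is resumable, and IDENTIFIED requires a confirmed hypothesis plus confidence of 0.7 or higher. The sketch below encodes only what the table states; the enum and function names are hypothetical.

```python
from enum import Enum


class Disposition(Enum):
    IDENTIFIED = "IDENTIFIED"
    NOT_FOUND = "NOT_FOUND"
    FALSE_ALARM = "FALSE_ALARM"
    ON_HOLD = "ON_HOLD"


# Terminal dispositions end the investigation; ON_HOLD can be resumed.
TERMINAL = {Disposition.IDENTIFIED, Disposition.NOT_FOUND, Disposition.FALSE_ALARM}


def is_terminal(disposition: Disposition) -> bool:
    return disposition in TERMINAL


def can_mark_identified(has_confirmed_hypothesis: bool, confidence: float) -> bool:
    # IDENTIFIED requires a confirmed hypothesis and confidence of 0.7 or higher.
    return has_confirmed_hypothesis and confidence >= 0.7


print(is_terminal(Disposition.ON_HOLD))  # False
print(can_mark_identified(True, 0.65))   # False
```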
Triggering RCA
Automatic Trigger
Configure webhooks to auto-trigger RCA.
Manual Trigger
- Open incident detail page
- Click Start RCA Analysis button
- System validates no duplicate RCA is running
- Investigation begins within 1-3 seconds
- Real-time timeline appears as findings are discovered
Viewing RCA Results
Root Cause Summary
Clear explanation of root cause with confidence score (0.0-1.0) and identification timestamp.
Hypothesis Tracking
All hypotheses with lifecycle: creation → testing → confirmation/ruling out with reasoning.
Evidence Chain
Evidence organized by type with severity ranking, source attribution, and deep links.
Investigation Timeline
Real-time chronological log of AI investigation steps with phase transitions.
Remediation Actions
AI-suggested fixes (1-3 items) with priority levels (critical|high|medium|low).
Affected Services
Services impacted during the investigation, with blast radius visualization if topology is connected.
Multiple RCA Runs
Run multiple investigations on the same incident with version tracking (v1, v2, v3…). Use when additional information becomes available or the initial investigation was inconclusive. Compare results via the history dropdown.
Best Practices
Investigation Setup:
- Connect topology for blast radius analysis and service correlation
- Configure webhooks to auto-trigger RCA for medium+ severity incidents
- Provide incident context in description to guide investigation
- Monitor timeline in real-time to follow hypothesis testing
- Review hypothesis chain to understand which theories were tested and ruled out
- Verify evidence timestamps correlate with incident start
- Validate root cause if confidence < 0.7 before implementing remediation
- Focus on critical-priority remediation actions first
- Use version history if new information emerges (run RCA multiple times)