Root Cause Analysis is CloudThinker’s AI-powered investigation engine that automatically diagnoses incidents across your infrastructure. Using specialized agents with deep domain expertise, RCA conducts hypothesis-driven investigations, builds structured evidence chains, and suggests actionable remediation steps with full transparency into the AI’s reasoning process.

How RCA Works

1. Investigation Triggered

RCA begins when an incident is created or manually initiated. The system creates a dedicated AI conversation and queues an RCA task in the background.

2. Agent Activation

Based on your connected infrastructure, relevant specialized agents are activated. Agent Anna coordinates the investigation, while specialists (Alex, Tony, Kai, Oliver) focus on their domains.

3. Context Gathering

Agents explore infrastructure topology, collect baseline metrics, identify affected services, and examine recent deployments and configuration changes.

4. Analysis

AI forms 2-5 competing hypotheses about potential causes. Each hypothesis is tested by examining logs, traces, and dependencies. Evidence confirms or rules out each theory.

5. Resolution

The confirmed hypothesis becomes the root cause. Evidence is curated (3-6 items), remediation suggestions are generated (1-3 actions), and disposition is set with confidence scoring.

Investigation Phases

RCA follows a structured three-phase workflow. When agents move to a new phase, the previous phase automatically completes if still in progress.
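The auto-completion rule can be illustrated with a small sketch. This is not CloudThinker's implementation; the `PhaseTracker` class, the phase identifiers, and the status names are hypothetical and only model the behavior described above.

```python
from enum import Enum


class PhaseStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


# Hypothetical identifiers for the three phases described in this section.
PHASES = ["context_gathering", "analysis", "resolution"]


class PhaseTracker:
    """Tracks phase status; entering a phase completes any phase still in progress."""

    def __init__(self) -> None:
        self.status = {name: PhaseStatus.PENDING for name in PHASES}

    def enter(self, phase: str) -> None:
        if phase not in self.status:
            raise ValueError(f"unknown phase: {phase}")
        # Auto-complete whichever other phase is still in progress.
        for name, state in self.status.items():
            if name != phase and state is PhaseStatus.IN_PROGRESS:
                self.status[name] = PhaseStatus.COMPLETED
        self.status[phase] = PhaseStatus.IN_PROGRESS


tracker = PhaseTracker()
tracker.enter("context_gathering")
tracker.enter("analysis")  # context_gathering is completed automatically
```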

Phase 1: Context Gathering

Agents collect infrastructure context to establish baseline conditions:
  • Map affected services and dependencies via topology
  • Gather metrics from CloudWatch, Prometheus, Datadog
  • Compare incident metrics to historical baselines
  • Identify recent deployments and configuration changes

Phase 2: Analysis & Hypothesis Testing

Agents narrow down root cause through hypothesis testing:
  • Generate 2-5 competing theories based on symptoms
  • Collect evidence: logs, traces, dependencies, resource metrics
  • Test each hypothesis against evidence
  • Rule out hypotheses when evidence contradicts them
  • Track confidence scoring as evidence accumulates
Quality Requirements: Agents must gather supporting evidence and investigate for sufficient time before confirming any hypothesis.

Phase 3: Resolution

Finalize root cause with evidence and remediation:
  • Resolve all remaining hypotheses (confirmed or ruled out)
  • Confirm the winning hypothesis as root cause
  • Curate 3-6 strongest evidence items
  • Generate 1-3 actionable remediation steps
  • Set disposition (IDENTIFIED, NOT_FOUND, FALSE_ALARM, ON_HOLD)
  • Record confidence score (0.0-1.0)
Critical: Setting disposition is mandatory to close the investigation. Without it, the incident remains in “Investigating” status.
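As a rough illustration, the Phase 3 limits can be expressed as a checklist. The `check_resolution` function below is a hypothetical helper, not part of the product, and it assumes the item counts apply to an investigation that reached a root cause.

```python
VALID_DISPOSITIONS = {"IDENTIFIED", "NOT_FOUND", "FALSE_ALARM", "ON_HOLD"}


def check_resolution(evidence, remediation, disposition, confidence):
    """Return a list of problems; an empty list means the resolution passes."""
    problems = []
    if not 3 <= len(evidence) <= 6:
        problems.append("evidence must contain 3-6 items")
    if not 1 <= len(remediation) <= 3:
        problems.append("remediation must contain 1-3 steps")
    if disposition not in VALID_DISPOSITIONS:
        problems.append("disposition is missing or invalid")
    if not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be between 0.0 and 1.0")
    return problems


# Hypothetical resolution payload with placeholder item names.
issues = check_resolution(
    evidence=["metric", "deployment", "log"],
    remediation=["restore previous memory setting"],
    disposition="IDENTIFIED",
    confidence=0.92,
)
```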

Evidence Chain

RCA builds a structured evidence chain with automatic calculations. Evidence can be linked to specific hypotheses to show which findings support each theory:

Metrics

Incident vs baseline comparison with auto-calculated deviation percentage. Example: “CPU 95% vs 25% baseline = 280% deviation”
Fields: incident_value, baseline_value, baseline_period, threshold, unit
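The example figure is consistent with a relative-change formula. The sketch below assumes that formula; `deviation_percent` is a hypothetical name, and only the `incident_value` and `baseline_value` field names come from the list above.

```python
def deviation_percent(incident_value: float, baseline_value: float) -> float:
    """Relative change of the incident value against the baseline, in percent."""
    if baseline_value == 0:
        raise ValueError("baseline_value must be non-zero")
    return (incident_value - baseline_value) * 100.0 / abs(baseline_value)


cpu_deviation = deviation_percent(incident_value=95, baseline_value=25)
print(cpu_deviation)  # 280.0, matching the CPU example above
```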

Deployments & Changes

Recent changes with auto-calculated time delta from incident start. A positive delta means the change landed before the incident (likely causative).
Fields: type, description, timestamp, correlation, service
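The sign convention can be sketched as follows. The function name, the choice of minutes as the unit, and the sample timestamps are assumptions made for illustration.

```python
from datetime import datetime, timezone


def change_delta_minutes(incident_start: datetime, change_time: datetime) -> float:
    """Positive when the change landed before the incident started."""
    return (incident_start - change_time).total_seconds() / 60.0


# Hypothetical timestamps: a deployment 15 minutes before the incident.
incident_start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deploy_time = datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc)

delta = change_delta_minutes(incident_start, deploy_time)
print(delta)  # 15.0
```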

Logs

Relevant log entries with deep links to log consoles (CloudWatch, Splunk, Datadog).
Fields: source, description, deep_link, timestamp, severity

Traces

Distributed trace data showing request flow and latency breakdowns.
Fields: source, description, raw_data

Configuration

Configuration changes with exact parameter modifications.
Fields: source, description, timestamp

Alerts

Related alerts from monitoring systems during the incident window.
Fields: source, severity, description
Evidence is ranked by severity: Critical (direct cause) → High (strong support) → Medium (context) → Low (background).
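One way to picture the ranking is a sort keyed on severity. The lowercase severity strings and the sample items below are assumptions, not the product's data model.

```python
# Lower rank sorts first: critical evidence leads the chain.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def rank_evidence(items):
    """Return evidence sorted from critical to low; ties keep their input order."""
    return sorted(items, key=lambda item: SEVERITY_ORDER[item["severity"]])


# Hypothetical evidence items for illustration only.
evidence = [
    {"type": "alert", "severity": "medium"},
    {"type": "deployment", "severity": "critical"},
    {"type": "log", "severity": "low"},
    {"type": "metric", "severity": "high"},
]

ranked = rank_evidence(evidence)
```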

Confidence Scoring

The AI provides a confidence score (0.0-1.0) for the identified root cause:
Score Range | Category | Meaning | Action
0.9 - 1.0 | Very High | Root cause identified with overwhelming evidence | Implement remediation immediately
0.7 - 0.9 | High | Root cause identified with strong evidence | Implement remediation with normal priority
0.5 - 0.7 | Medium | Probable root cause, but gaps remain | Implement remediation; monitor for alternatives
0.3 - 0.5 | Low | Possible root cause, evidence is circumstantial | Validate findings manually before action
0.0 - 0.3 | Uncertain | Insufficient evidence to establish root cause | Cannot determine; consider NOT_FOUND
Confidence Factors:
  • Positive: Temporal correlation, metric anomalies (>50% deviation), error patterns, hypotheses ruled out, multiple data sources
  • Negative: Alternative explanations, weak temporal correlation, missing verification, conflicting evidence
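The score ranges can be read as a simple lookup. The table does not say which category a boundary value such as 0.7 belongs to, so this sketch assumes the lower bound of each range is inclusive; `confidence_category` is a hypothetical helper.

```python
def confidence_category(score: float) -> str:
    """Map a 0.0-1.0 confidence score to its category (lower bounds inclusive)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.9:
        return "Very High"
    if score >= 0.7:
        return "High"
    if score >= 0.5:
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "Uncertain"


print(confidence_category(0.92))  # Very High
```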

Hypothesis Tracking

RCA implements hypothesis-driven investigation inspired by “5 Whys” and Fishbone methodologies.

Hypothesis Workflow

CREATE → INVESTIGATING → CONFIRM or RULE OUT

Hypothesis States

  • Created: Initial theory with 0.0-1.0 confidence estimate
  • Investigating: Gathering evidence to test the theory
  • Confirmed: Sufficient evidence supports this as root cause
  • Ruled Out: Evidence contradicts or disproves hypothesis

Example Timeline

Timeline Entry 1: hypothesis_created
├── Hypothesis 1: "Database connection pool exhaustion"
├── Confidence: 0.75
└── Message: "Pool exhaustion likely given 500s response times"

Timeline Entry 3: hypothesis_ruled_out
├── Hypothesis 1: Ruled Out
├── Reason: "DB metrics show 45/100 connections—well within limits"
└── Evidence: Max concurrent connections remained stable

Timeline Entry 6: hypothesis_created
├── Hypothesis 2: "Lambda cold start latency after memory reduction"
└── Confidence: 0.85

Timeline Entry 8: hypothesis_confirmed
├── Hypothesis 2: Confirmed
├── Updated Confidence: 0.92
└── Evidence: CloudWatch init duration spike, deployment timing match
Quality Enforcement:
  • Root cause cannot be set until at least one hypothesis is confirmed
  • Hypotheses require supporting evidence before confirmation
  • All hypotheses must be resolved (confirmed or ruled out) before closing the investigation
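The workflow and the quality rules can be modeled as a small state machine. This is a sketch under stated assumptions: the `Hypothesis` class, the lowercase state names, and the helper functions are hypothetical and do not reflect the product's internal API.

```python
from dataclasses import dataclass, field

# Allowed transitions: CREATE -> INVESTIGATING -> CONFIRM or RULE OUT.
ALLOWED = {
    "created": {"investigating"},
    "investigating": {"confirmed", "ruled_out"},
    "confirmed": set(),
    "ruled_out": set(),
}
RESOLVED = {"confirmed", "ruled_out"}


@dataclass
class Hypothesis:
    title: str
    confidence: float
    state: str = "created"
    evidence: list = field(default_factory=list)

    def transition(self, new_state: str) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        # Hypotheses require supporting evidence before confirmation.
        if new_state == "confirmed" and not self.evidence:
            raise ValueError("confirmation requires supporting evidence")
        self.state = new_state


def can_set_root_cause(hypotheses) -> bool:
    """Root cause needs at least one confirmed hypothesis."""
    return any(h.state == "confirmed" for h in hypotheses)


def can_close(hypotheses) -> bool:
    """Every hypothesis must be confirmed or ruled out before closing."""
    return all(h.state in RESOLVED for h in hypotheses)


# Replays the example timeline above.
h1 = Hypothesis("Database connection pool exhaustion", 0.75)
h1.transition("investigating")
h1.transition("ruled_out")

h2 = Hypothesis("Lambda cold start latency after memory reduction", 0.85)
h2.transition("investigating")
h2.evidence.append("CloudWatch init duration spike")
h2.transition("confirmed")
h2.confidence = 0.92
```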

Investigation Timeline

RCA generates a real-time investigation timeline showing every step of the AI’s reasoning.

Timeline Entry Types

  • info: General investigation note
  • finding: Specific discovery impacting analysis
  • warning: Potential issue requiring verification
  • error: Failed investigation attempt
  • success: Confirmed finding
  • hypothesis_created: New theory proposed
  • hypothesis_ruled_out: Theory disproven
  • hypothesis_confirmed: Hypothesis validated as root cause
Timeline entries appear instantly as agents discover findings, showing phase progress, hypothesis testing, and evidence collection with timestamps.
Limit: 100 entries per investigation (enforced at database level).
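The entry types and the 100-entry cap can be sketched as an in-memory model. The real limit is enforced at the database level; the `Timeline` class below, and its choice to raise an error when the cap is hit, are assumptions for illustration.

```python
from datetime import datetime, timezone

ENTRY_TYPES = {
    "info", "finding", "warning", "error", "success",
    "hypothesis_created", "hypothesis_ruled_out", "hypothesis_confirmed",
}
MAX_ENTRIES = 100  # per investigation


class Timeline:
    def __init__(self) -> None:
        self.entries = []

    def add(self, entry_type: str, message: str) -> None:
        if entry_type not in ENTRY_TYPES:
            raise ValueError(f"unknown entry type: {entry_type}")
        if len(self.entries) >= MAX_ENTRIES:
            # Assumed behavior; the source does not say how overflow is handled.
            raise RuntimeError("timeline entry limit reached")
        self.entries.append({
            "type": entry_type,
            "message": message,
            "timestamp": datetime.now(timezone.utc),
        })


timeline = Timeline()
timeline.add("hypothesis_created", "Database connection pool exhaustion")
timeline.add("finding", "DB metrics show 45/100 connections")
```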

Disposition Status

Every investigation must conclude with a disposition. This is the only way to close an incident and move it out of “Investigating” status.
Status | Meaning | Investigation Continues?
IDENTIFIED | Root cause found with supporting evidence (requires confirmed hypothesis, confidence 0.7+) | ❌ NO (Terminal)
NOT_FOUND | Investigation exhausted, no clear root cause | ❌ NO (Terminal)
FALSE_ALARM | Issue was not a real incident | ❌ NO (Terminal)
ON_HOLD | Awaiting external input or additional data | ✅ YES (Resumable)
Important: Before setting a disposition, all hypotheses must be resolved (confirmed or ruled out). Terminal dispositions (IDENTIFIED, NOT_FOUND, FALSE_ALARM) prevent further investigation. The resumable disposition (ON_HOLD) allows the investigation to continue when new information becomes available.
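These disposition rules can be sketched as a validation step. The `validate_disposition` function and the lowercase hypothesis state names are hypothetical; the status names and the 0.7 threshold come from the text above.

```python
TERMINAL = {"IDENTIFIED", "NOT_FOUND", "FALSE_ALARM"}
RESUMABLE = {"ON_HOLD"}


def validate_disposition(disposition, hypothesis_states, confidence):
    """Raise ValueError if the disposition breaks a rule; return True if terminal."""
    if disposition not in TERMINAL | RESUMABLE:
        raise ValueError(f"unknown disposition: {disposition}")
    # All hypotheses must be resolved before any disposition is set.
    if any(s not in {"confirmed", "ruled_out"} for s in hypothesis_states):
        raise ValueError("all hypotheses must be confirmed or ruled out")
    if disposition == "IDENTIFIED":
        if "confirmed" not in hypothesis_states:
            raise ValueError("IDENTIFIED requires a confirmed hypothesis")
        if confidence < 0.7:
            raise ValueError("IDENTIFIED requires confidence of 0.7 or higher")
    return disposition in TERMINAL


is_terminal = validate_disposition("IDENTIFIED", ["ruled_out", "confirmed"], 0.92)
```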

Triggering RCA

Automatic Trigger

Configure webhooks to auto-trigger RCA:
{
  "auto_trigger_rca": true,
  "auto_trigger_rca_min_severity": "medium"
}
When an incident meets the severity threshold, RCA starts automatically in the background.
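The threshold check might look like the sketch below. The two configuration keys come from the example above; the `should_auto_trigger` function and the four-level severity scale (low, medium, high, critical) are assumptions.

```python
# Assumed severity scale, ordered from least to most severe.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def should_auto_trigger(config: dict, incident_severity: str) -> bool:
    """True when auto-trigger is enabled and the incident meets the minimum severity."""
    if not config.get("auto_trigger_rca", False):
        return False
    minimum = config["auto_trigger_rca_min_severity"]
    return SEVERITY_RANK[incident_severity] >= SEVERITY_RANK[minimum]


config = {"auto_trigger_rca": True, "auto_trigger_rca_min_severity": "medium"}
print(should_auto_trigger(config, "high"))  # True
print(should_auto_trigger(config, "low"))   # False
```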

Manual Trigger

  1. Open incident detail page
  2. Click Start RCA Analysis button
  3. System validates no duplicate RCA is running
  4. Investigation begins within 1-3 seconds
  5. Real-time timeline appears as findings are discovered

Viewing RCA Results

Root Cause Summary

Clear explanation of root cause with confidence score (0.0-1.0) and identification timestamp.

Hypothesis Tracking

All hypotheses with lifecycle: creation → testing → confirmation/ruling out with reasoning.

Evidence Chain

Evidence organized by type with severity ranking, source attribution, and deep links.

Investigation Timeline

Real-time chronological log of AI investigation steps with phase transitions.

Remediation Actions

AI-suggested fixes (1-3 items) with priority levels (critical|high|medium|low).

Affected Services

Services identified as impacted during the investigation, with blast radius visualization when topology is connected.

Multiple RCA Runs

Run multiple investigations on the same incident with version tracking (v1, v2, v3…). Use this when additional information becomes available or the initial investigation was inconclusive. Compare results via the history dropdown.

Best Practices

Investigation Setup:
  • Connect topology for blast radius analysis and service correlation
  • Configure webhooks to auto-trigger RCA for medium+ severity incidents
  • Provide incident context in description to guide investigation
During Investigation:
  • Monitor timeline in real-time to follow hypothesis testing
  • Review hypothesis chain to understand which theories were tested and ruled out
  • Verify evidence timestamps correlate with incident start
After Investigation:
  • If confidence is below 0.7, validate the root cause before implementing remediation
  • Focus on critical-priority remediation actions first
  • Use version history if new information emerges (run RCA multiple times)