How it works - CloudThinker

Root Cause Analysis is CloudThinker’s AI-powered investigation engine that automatically diagnoses incidents across your infrastructure. Using specialized agents with deep domain expertise, RCA conducts hypothesis-driven investigations, builds structured evidence chains, and suggests actionable remediation steps with full transparency into the AI’s reasoning process.

The Problem With Manual RCA

Traditional incident investigation is a manual, high-pressure process. An alert fires, an engineer is paged, and then the real work begins: pulling up CloudWatch logs, checking Datadog metrics, reviewing recent deployments in your CI/CD system, cross-referencing Kubernetes events, and asking teammates what changed recently. The average MTTR for complex incidents is 4–8 hours, most of it spent in this investigation phase.

The problems with manual RCA:

Knowledge bottleneck: only senior engineers with system-wide context can investigate complex incidents effectively
Tool fragmentation: logs are in one place, metrics in another, deployments in a third — manual correlation is error-prone
No systematic hypothesis testing: engineers follow intuition, potentially missing non-obvious root causes
Post-mortems don’t improve future response: findings aren’t stored in a queryable form to help the next incident

How Existing Tools Compare

Tool	What It Does	What’s Missing
PagerDuty / Opsgenie	Alert routing and escalation	Routes you to the right person, but you still investigate manually
Datadog AIOps / Watchdog	Alert correlation and anomaly detection	Groups related alerts but doesn’t investigate or form hypotheses
Splunk ITSI / IT Essentials	Service health monitoring and event correlation	Correlation rules require manual configuration; no AI-driven investigation
Blameless / FireHydrant / Rootly	Incident workflow and post-mortem tooling	Coordination and documentation, not real-time investigation
AWS Systems Manager OpsCenter	Operational issues aggregation	Aggregates findings, no cross-service RCA

Among the tools compared above, CloudThinker RCA is the only one that actively investigates: it forms and tests hypotheses, correlates evidence across domains, and produces a structured chain of reasoning with confidence scoring.

How RCA Works

Investigation Triggered

RCA begins when an incident is created — automatically when a Pulse cluster escalates, or manually from the incident detail page. When triggered from a cluster escalation, the cluster summary and all member signals are injected into the agent’s context automatically, so investigation starts with the full signal history already loaded. The system queues an RCA task in the background and opens a dedicated AI conversation.

Agent Activation

Based on your connected infrastructure, relevant specialized agents are activated. Agent Anna coordinates the investigation, while specialists (Alex, Tony, Kai, Oliver) focus on their domains.

Context Gathering

Agents explore infrastructure topology, collect baseline metrics, identify affected services, and examine recent deployments and configuration changes.

Analysis

AI forms competing hypotheses about potential causes. Each hypothesis is tested by examining logs, traces, and dependencies. Evidence confirms or rules out each theory.

Resolution

The confirmed hypothesis becomes the root cause. Evidence is curated, remediation suggestions are generated, and disposition is set with confidence scoring.

Investigation Phases

RCA follows a structured three-phase workflow. When agents move to a new phase, the previous phase automatically completes if still in progress.
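The auto-completion rule can be pictured as a small state tracker. This is a minimal sketch, not CloudThinker's implementation: the `PhaseTracker` class, the phase identifiers, and the status names are all assumptions made for illustration.

```python
from enum import Enum


class PhaseStatus(Enum):
    # Status names are assumed; the product may use different labels.
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


# Phase identifiers mirror the three phase headings in this document.
PHASES = ["context_gathering", "analysis", "resolution"]


class PhaseTracker:
    """Hypothetical tracker illustrating the auto-complete rule."""

    def __init__(self):
        self.status = {name: PhaseStatus.PENDING for name in PHASES}

    def start(self, phase):
        if phase not in self.status:
            raise ValueError(f"unknown phase: {phase}")
        # Any other phase still in progress is completed automatically.
        for name, state in self.status.items():
            if name != phase and state is PhaseStatus.IN_PROGRESS:
                self.status[name] = PhaseStatus.COMPLETED
        self.status[phase] = PhaseStatus.IN_PROGRESS


tracker = PhaseTracker()
tracker.start("context_gathering")
tracker.start("analysis")  # context_gathering is now marked completed
```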

Phase 1: Context Gathering

Agents collect infrastructure context to establish baseline conditions:

Map affected services and dependencies via topology
Gather metrics from CloudWatch, Prometheus, Datadog
Compare incident metrics to historical baselines
Identify recent deployments and configuration changes

Phase 2: Analysis & Hypothesis Testing

Agents narrow down root cause through hypothesis testing:

Generate competing theories based on symptoms
Collect evidence: logs, traces, dependencies, resource metrics
Test each hypothesis against evidence
Rule out hypotheses when evidence contradicts them
Track confidence scoring as evidence accumulates

Quality Requirements: Agents must gather supporting evidence and investigate for sufficient time before confirming any hypothesis.

Phase 3: Resolution

Finalize root cause with evidence and remediation:

Resolve all remaining hypotheses (confirmed or ruled out)
Confirm the winning hypothesis as root cause
Curate the strongest evidence items
Generate actionable remediation steps
Set disposition (IDENTIFIED, NOT_FOUND, FALSE_ALARM, ON_HOLD)
Record confidence score (0.0-1.0)

Critical: Setting disposition is mandatory to close the investigation. Without it, the incident remains in “Investigating” status.

Evidence Chain

RCA builds a structured evidence chain with automatic calculations. Evidence can be linked to specific hypotheses to show which findings support each theory:

Metrics

Incident vs baseline comparison with auto-calculated deviation percentage. Example: “CPU 95% vs 25% baseline = 280% deviation”.

Fields: incident_value, baseline_value, baseline_period, threshold, unit
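The deviation figure in the example is consistent with a relative change against the baseline. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the exact formula the product uses is assumed from the example above rather than confirmed.

```python
def deviation_percent(incident_value: float, baseline_value: float) -> float:
    """Relative change of the incident value against the baseline, in percent."""
    if baseline_value == 0:
        raise ValueError("baseline_value must be non-zero")
    return (incident_value - baseline_value) * 100.0 / abs(baseline_value)


# CPU at 95% during the incident against a 25% baseline.
cpu_deviation = deviation_percent(95.0, 25.0)
print(cpu_deviation)  # 280.0
```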

Deployments & Changes

Recent changes with auto-calculated time delta from incident start. A positive delta means the change happened before the incident started, which makes it a likely cause.

Fields: type, description, timestamp, correlation, service
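The sign convention for the time delta can be sketched as follows. The function name and the choice of minutes as the unit are assumptions for illustration; only the "positive means before the incident" rule comes from the text above.

```python
from datetime import datetime, timezone


def minutes_before_incident(change_time: datetime, incident_start: datetime) -> float:
    """Positive when the change happened before the incident started."""
    return (incident_start - change_time).total_seconds() / 60.0


incident_start = datetime(2024, 1, 15, 10, 12, tzinfo=timezone.utc)
deploy_time = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
config_time = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)

deploy_delta = minutes_before_incident(deploy_time, incident_start)  # 12.0
config_delta = minutes_before_incident(config_time, incident_start)  # -18.0
```

A change with a negative delta happened after the incident began, so it cannot have triggered it.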

Logs

Relevant log entries with deep links to log consoles (CloudWatch, Splunk, Datadog).

Fields: source, description, deep_link, timestamp, severity

Traces

Distributed trace data showing request flow and latency breakdowns.

Fields: source, description, raw_data

Configuration

Configuration changes with exact parameter modifications.

Fields: source, description, timestamp

Alerts

Related alerts from monitoring systems during the incident window.

Fields: source, severity, description

Evidence is ranked by severity: Critical (direct cause) → High (strong support) → Medium (context) → Low (background).
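As an illustration of the ranking, evidence items can be ordered by a severity rank. The `Evidence` class and lowercase severity strings below are hypothetical stand-ins, not the product's schema.

```python
from dataclasses import dataclass

# Lower rank sorts first: Critical, then High, Medium, Low.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Evidence:
    # Hypothetical shape; real evidence items carry type-specific fields.
    kind: str
    description: str
    severity: str


def rank_evidence(items):
    """Return evidence ordered from most to least severe (stable sort)."""
    return sorted(items, key=lambda e: SEVERITY_RANK[e.severity])


items = [
    Evidence("alert", "Latency alarm fired", "medium"),
    Evidence("deployment", "Memory setting reduced", "critical"),
    Evidence("log", "Timeout errors in function logs", "high"),
]
ranked = rank_evidence(items)
```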

Confidence Scoring

The AI provides a confidence score (0.0-1.0) for the identified root cause:

Score Range	Category	Meaning	Action
0.9 - 1.0	Very High	Root cause identified with overwhelming evidence	Implement remediation immediately
0.7 - 0.9	High	Root cause identified with strong evidence	Implement remediation with normal priority
0.5 - 0.7	Medium	Probable root cause, but gaps remain	Implement remediation; monitor for alternatives
0.3 - 0.5	Low	Possible root cause, evidence is circumstantial	Validate findings manually before action
0.0 - 0.3	Uncertain	Insufficient evidence to establish root cause	Cannot determine; consider NOT_FOUND

Confidence Factors:

Positive: Temporal correlation, metric anomalies (>50% deviation), error patterns, hypotheses ruled out, multiple data sources
Negative: Alternative explanations, weak temporal correlation, missing verification, conflicting evidence
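The score bands in the table can be read as a simple lookup. Because the ranges in the table share their boundary values, this sketch assumes a boundary score belongs to the higher band (for example, 0.7 counts as High); that choice and the function name are assumptions.

```python
def confidence_category(score: float) -> str:
    """Map a confidence score to the category names used in the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Assumption: a score on a boundary falls into the higher band.
    if score >= 0.9:
        return "Very High"
    if score >= 0.7:
        return "High"
    if score >= 0.5:
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "Uncertain"


print(confidence_category(0.92))  # Very High
print(confidence_category(0.45))  # Low
```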

Hypothesis Tracking

RCA implements hypothesis-driven investigation inspired by “5 Whys” and Fishbone methodologies.

Hypothesis Workflow

Hypothesis States

Investigating: Actively gathering evidence to test the theory
Confirmed: Sufficient evidence supports this as root cause
Ruled Out: Evidence contradicts or disproves hypothesis

Example Timeline

Timeline Entry 1: hypothesis_created
├── Hypothesis 1: "Database connection pool exhaustion"
├── Confidence: 0.75
└── Message: "Pool exhaustion likely given 500s response times"

Timeline Entry 3: hypothesis_ruled_out
├── Hypothesis 1: Ruled Out
├── Reason: "DB metrics show 45/100 connections—well within limits"
└── Evidence: Max concurrent connections remained stable

Timeline Entry 6: hypothesis_created
├── Hypothesis 2: "Lambda cold start latency after memory reduction"
└── Confidence: 0.85

Timeline Entry 8: hypothesis_confirmed
├── Hypothesis 2: Confirmed
├── Updated Confidence: 0.92
└── Evidence: CloudWatch init duration spike, deployment timing match

Quality Guidelines:

The AI aims to confirm at least one hypothesis before setting root cause
Hypotheses should have supporting evidence before confirmation
The AI resolves all hypotheses (confirmed or ruled out) before closing the investigation
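The hypothesis states and quality guidelines can be modeled as a small state machine. This is a sketch under assumptions: the class, method, and field names are hypothetical, and the checks only mirror the guidelines listed above using the hypotheses from the example timeline.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class HypothesisState(Enum):
    INVESTIGATING = "investigating"
    CONFIRMED = "confirmed"
    RULED_OUT = "ruled_out"


@dataclass
class Hypothesis:
    title: str
    confidence: float
    state: HypothesisState = HypothesisState.INVESTIGATING
    evidence: List[str] = field(default_factory=list)
    ruled_out_reason: Optional[str] = None

    def confirm(self, confidence: float) -> None:
        # Guideline: supporting evidence is expected before confirmation.
        if not self.evidence:
            raise ValueError("supporting evidence is required before confirmation")
        self.state = HypothesisState.CONFIRMED
        self.confidence = confidence

    def rule_out(self, reason: str) -> None:
        self.state = HypothesisState.RULED_OUT
        self.ruled_out_reason = reason


def can_set_root_cause(hypotheses: List[Hypothesis]) -> bool:
    """True when every hypothesis is resolved and at least one is confirmed."""
    all_resolved = all(h.state is not HypothesisState.INVESTIGATING for h in hypotheses)
    any_confirmed = any(h.state is HypothesisState.CONFIRMED for h in hypotheses)
    return all_resolved and any_confirmed


h1 = Hypothesis("Database connection pool exhaustion", 0.75)
h2 = Hypothesis("Lambda cold start latency after memory reduction", 0.85)
h1.rule_out("DB metrics show 45/100 connections")
before = can_set_root_cause([h1, h2])  # False: h2 is still being investigated
h2.evidence.append("CloudWatch init duration spike")
h2.confirm(0.92)
after = can_set_root_cause([h1, h2])  # True
```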

Investigation Timeline

RCA generates a real-time investigation timeline showing every step of the AI’s reasoning.

Timeline Entry Types

info: General investigation note
finding: Specific discovery impacting analysis
warning: Potential issue requiring verification
error: Failed investigation attempt
success: Confirmed finding
hypothesis_created: New theory proposed
hypothesis_ruled_out: Theory disproven
hypothesis_confirmed: Hypothesis validated as root cause

Timeline entries appear instantly as agents discover findings, showing phase progress, hypothesis testing, and evidence collection with timestamps.

Limit: 100 entries per investigation (enforced at the database level).
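A sketch of a timeline that accepts only the entry types listed above and stops at the 100-entry limit. The source does not say what happens when the limit is reached, so raising an error here is an assumption, as are the `Timeline` class and its method names.

```python
from datetime import datetime, timezone

ENTRY_TYPES = {
    "info", "finding", "warning", "error", "success",
    "hypothesis_created", "hypothesis_ruled_out", "hypothesis_confirmed",
}
MAX_ENTRIES = 100  # limit stated in this document


class Timeline:
    """Hypothetical in-memory stand-in for the investigation timeline."""

    def __init__(self):
        self.entries = []

    def add(self, entry_type: str, message: str) -> None:
        if entry_type not in ENTRY_TYPES:
            raise ValueError(f"unknown entry type: {entry_type}")
        # Assumed behavior: reject entries beyond the limit.
        if len(self.entries) >= MAX_ENTRIES:
            raise RuntimeError("timeline entry limit reached")
        self.entries.append({
            "type": entry_type,
            "message": message,
            "timestamp": datetime.now(timezone.utc),
        })


timeline = Timeline()
timeline.add("hypothesis_created", "Database connection pool exhaustion")
timeline.add("finding", "DB metrics show 45/100 connections")
```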

Disposition Status

Every investigation must conclude with a disposition. Setting disposition updates the incident status accordingly.

Disposition	Meaning	Investigation Continues?
IDENTIFIED	Root cause found with supporting evidence	❌ NO (Terminal)
NOT_FOUND	Investigation exhausted, no clear root cause	❌ NO (Terminal)
FALSE_ALARM	Issue was not a real incident	❌ NO (Terminal)
ON_HOLD	Awaiting external input or additional data	✅ YES (Resumable)

Terminal dispositions (IDENTIFIED, NOT_FOUND, FALSE_ALARM) prevent further investigation. ON_HOLD allows investigation to continue when new information becomes available. After disposition is set, the incident can progress through additional lifecycle statuses (Resolved, Post-Mortem, Closed) as your team completes follow-up actions.
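The terminal versus resumable distinction can be expressed as an enum with a membership check. The disposition names come from the table above; the helper function and set name are hypothetical.

```python
from enum import Enum


class Disposition(Enum):
    IDENTIFIED = "IDENTIFIED"
    NOT_FOUND = "NOT_FOUND"
    FALSE_ALARM = "FALSE_ALARM"
    ON_HOLD = "ON_HOLD"


# Terminal dispositions end the investigation; ON_HOLD is resumable.
TERMINAL_DISPOSITIONS = {
    Disposition.IDENTIFIED,
    Disposition.NOT_FOUND,
    Disposition.FALSE_ALARM,
}


def investigation_can_continue(disposition: Disposition) -> bool:
    return disposition not in TERMINAL_DISPOSITIONS


print(investigation_can_continue(Disposition.ON_HOLD))     # True
print(investigation_can_continue(Disposition.IDENTIFIED))  # False
```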

Triggering RCA

Automatic Trigger

Configure webhooks to auto-trigger RCA:

{
  "auto_trigger_rca": true,
  "auto_trigger_rca_min_severity": "medium"
}

When an incident meets the severity threshold, RCA starts automatically in the background.
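The threshold check might look like the sketch below. The two config keys come from the example above; the severity ordering (low, medium, high, critical) and the function name are assumptions, since this document does not list the incident severity levels.

```python
# Assumed ordering of incident severities, lowest first.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def should_auto_trigger(config: dict, incident_severity: str) -> bool:
    """True when auto-trigger is enabled and the incident meets the minimum severity."""
    if not config.get("auto_trigger_rca", False):
        return False
    minimum = config["auto_trigger_rca_min_severity"]
    return SEVERITY_ORDER.index(incident_severity) >= SEVERITY_ORDER.index(minimum)


config = {"auto_trigger_rca": True, "auto_trigger_rca_min_severity": "medium"}
print(should_auto_trigger(config, "high"))  # True
print(should_auto_trigger(config, "low"))   # False
```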

Manual Trigger

Open incident detail page
Click Start RCA Analysis button
System validates no duplicate RCA is running
Investigation begins within 1-3 seconds
Real-time timeline appears as findings are discovered

Viewing RCA Results

Root Cause Summary

Clear explanation of root cause with confidence score (0.0-1.0) and identification timestamp.

Hypothesis Tracking

All hypotheses with lifecycle: creation → testing → confirmation/ruling out with reasoning.

Evidence Chain

Evidence organized by type with severity ranking, source attribution, and deep links.

Investigation Timeline

Real-time chronological log of AI investigation steps with phase transitions.

Remediation Actions

AI-suggested fixes with priority levels (critical, high, medium, low).

Affected Services

Services impacted by the incident, with blast radius visualization if topology is connected.

Multiple RCA Runs

Run multiple investigations on the same incident with version tracking (v1, v2, v3…). Use this when additional information becomes available or the initial investigation was inconclusive. Compare results via the history dropdown.

Best Practices

Investigation Setup:

Connect topology for blast radius analysis and service correlation
Configure webhooks to auto-trigger RCA for medium+ severity incidents
Provide incident context in description to guide investigation

During Investigation:

Monitor timeline in real-time to follow hypothesis testing
Review hypothesis chain to understand which theories were tested and ruled out
Verify evidence timestamps correlate with incident start

After Investigation:

Validate root cause if confidence < 0.7 before implementing remediation
Focus on critical-priority remediation actions first
Use version history if new information emerges (run RCA multiple times)
Connect Runbooks so agents can automatically find and execute remediation procedures during future investigations
