# How it works

> AI-powered investigation with hypothesis tracking and evidence chains

[Root Cause Analysis](/guide/incident/root-cause-analysis) is CloudThinker's AI-powered investigation engine that automatically diagnoses incidents across your infrastructure. Using specialized agents with deep domain expertise, RCA conducts hypothesis-driven investigations, builds structured evidence chains, and suggests actionable remediation steps with full transparency into the AI's reasoning process.

***

## The Problem With Manual RCA

Traditional incident investigation is a manual, high-pressure process. An alert fires, an engineer is paged, and then the real work begins: pulling up CloudWatch logs, checking Datadog metrics, reviewing recent deployments in your CI/CD system, cross-referencing Kubernetes events, asking teammates what changed recently. The average MTTR for complex incidents is 4–8 hours — mostly spent in this investigation phase.

The problems with manual RCA:

* **Knowledge bottleneck**: only senior engineers with system-wide context can investigate complex incidents effectively
* **Tool fragmentation**: logs are in one place, metrics in another, deployments in a third — manual correlation is error-prone
* **No systematic hypothesis testing**: engineers follow intuition, potentially missing non-obvious root causes
* **Post-mortems don't improve future response**: findings aren't stored in a queryable form to help the next incident

***

## How Existing Tools Compare

| Tool                                 | What It Does                                    | What's Missing                                                             |
| ------------------------------------ | ----------------------------------------------- | -------------------------------------------------------------------------- |
| **PagerDuty / Opsgenie**             | Alert routing and escalation                    | Routes you to the right person, but you still investigate manually         |
| **Datadog AIOps / Watchdog**         | Alert correlation and anomaly detection         | Groups related alerts but doesn't investigate or form hypotheses           |
| **Splunk ITSI / IT Essentials**      | Service health monitoring and event correlation | Correlation rules require manual configuration; no AI-driven investigation |
| **Blameless / FireHydrant / Rootly** | Incident workflow and post-mortem tooling       | Coordination and documentation, not real-time investigation                |
| **AWS Systems Manager OpsCenter**    | Operational issues aggregation                  | Aggregates findings, no cross-service RCA                                  |

Unlike the tools above, CloudThinker RCA actively investigates: it forms and tests hypotheses, correlates evidence across domains, and produces a structured chain of reasoning with confidence scoring.

***

## How RCA Works

<Steps>
  <Step title="Investigation Triggered">
    RCA begins when an incident is created — automatically when a Pulse cluster escalates, or manually from the incident detail page. When triggered from a cluster escalation, the cluster summary and all member signals are injected into the agent's context automatically, so investigation starts with the full signal history already loaded. The system queues an RCA task in the background and opens a dedicated AI conversation.
  </Step>

  <Step title="Agent Activation">
    Based on your connected infrastructure, relevant specialized agents are activated. Agent [Anna](/guide/agents/anna) coordinates the investigation, while specialists ([Alex](/guide/agents/alex), [Tony](/guide/agents/tony), [Kai](/guide/agents/kai), [Oliver](/guide/agents/oliver)) focus on their domains.
  </Step>

  <Step title="Context Gathering">
    Agents explore infrastructure [topology](/guide/infrastructure/topology), collect baseline metrics, identify affected services, and examine recent deployments and configuration changes.
  </Step>

  <Step title="Analysis">
    The AI forms competing hypotheses about potential causes, then tests each one by examining logs, traces, and dependencies. The evidence either confirms or rules out each theory.
  </Step>

  <Step title="Resolution">
    The confirmed hypothesis becomes the root cause. Evidence is curated, remediation suggestions are generated, and disposition is set with confidence scoring.
  </Step>
</Steps>

***

## Investigation Phases

RCA follows a structured three-phase workflow. When agents move to a new phase, the previous phase automatically completes if still in progress.

### Phase 1: Context Gathering

Agents collect infrastructure context to establish baseline conditions:

* Map affected services and dependencies via [topology](/guide/infrastructure/topology)
* Gather metrics from CloudWatch, Prometheus, Datadog
* Compare incident metrics to historical baselines
* Identify recent deployments and configuration changes

### Phase 2: Analysis & Hypothesis Testing

Agents narrow down the root cause through hypothesis testing:

* Generate competing theories based on symptoms
* Collect evidence: logs, traces, dependencies, resource metrics
* Test each hypothesis against evidence
* Rule out hypotheses when evidence contradicts them
* Track confidence scoring as evidence accumulates

**Quality Requirements:** Agents must gather supporting evidence and investigate for sufficient time before confirming any hypothesis.

### Phase 3: Resolution

Agents finalize the root cause with evidence and remediation:

* Resolve all remaining hypotheses (confirmed or ruled out)
* Confirm the winning hypothesis as root cause
* Curate the strongest evidence items
* Generate actionable remediation steps
* Set disposition (IDENTIFIED, NOT\_FOUND, FALSE\_ALARM, ON\_HOLD)
* Record confidence score (0.0-1.0)

**Critical:** Setting disposition is mandatory to close the investigation. Without it, the incident remains in "Investigating" status.
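
The phase rules above can be sketched in a few lines of Python. This is an illustrative model only, not CloudThinker's implementation: the `Phase` and `Investigation` names, the status strings, and the `can_close` helper are all hypothetical. It shows two behaviors described in this section: entering a new phase completes any phase still in progress, and an investigation cannot close until a disposition is set.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Phase(Enum):
    CONTEXT_GATHERING = 1
    ANALYSIS = 2
    RESOLUTION = 3


@dataclass
class Investigation:
    # Hypothetical model: maps a phase name to "in_progress" or "completed".
    phase_status: dict = field(default_factory=dict)
    disposition: Optional[str] = None

    def enter_phase(self, phase: Phase) -> None:
        # Any phase still in progress is completed automatically.
        for name in list(self.phase_status):
            if self.phase_status[name] == "in_progress":
                self.phase_status[name] = "completed"
        self.phase_status[phase.name] = "in_progress"

    def can_close(self) -> bool:
        # Without a disposition the incident stays in "Investigating".
        return self.disposition is not None


inv = Investigation()
inv.enter_phase(Phase.CONTEXT_GATHERING)
inv.enter_phase(Phase.ANALYSIS)  # Context Gathering is completed here
print(inv.phase_status)
print(inv.can_close())
```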

***

## Evidence Chain

RCA builds a structured evidence chain with automatic calculations. Evidence can be linked to specific hypotheses to show which findings support each theory:

<CardGroup cols={2}>
  <Card title="Metrics" icon="chart-line">
    Incident vs baseline comparison with auto-calculated deviation percentage. Example: "CPU 95% vs 25% baseline = 280% deviation"

    **Fields**: incident\_value, baseline\_value, baseline\_period, threshold, unit
  </Card>

  <Card title="Deployments & Changes" icon="rocket">
    Recent changes with auto-calculated time delta from incident start. A positive delta means the change happened before the incident started and is therefore more likely to be causative.

    **Fields**: type, description, timestamp, correlation, service
  </Card>

  <Card title="Logs" icon="file-lines">
    Relevant log entries with deep links to log consoles (CloudWatch, Splunk, Datadog).

    **Fields**: source, description, deep\_link, timestamp, severity
  </Card>

  <Card title="Traces" icon="route">
    Distributed trace data showing request flow and latency breakdowns.

    **Fields**: source, description, raw\_data
  </Card>

  <Card title="Configuration" icon="gear">
    Configuration changes with exact parameter modifications.

    **Fields**: source, description, timestamp
  </Card>

  <Card title="Alerts" icon="bell">
    Related alerts from monitoring systems during incident window.

    **Fields**: source, severity, description
  </Card>
</CardGroup>

Evidence is ranked by severity: **Critical** (direct cause) → **High** (strong support) → **Medium** (context) → **Low** (background).
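
The automatic calculations described in the cards can be approximated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the deviation formula is inferred from the "95% vs 25% = 280%" example, and the sign convention for the time delta follows the "positive = before incident" rule. The actual calculation inside CloudThinker may differ.

```python
from datetime import datetime, timedelta

# Ranking order taken from the severity levels listed above.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def deviation_pct(incident_value: float, baseline_value: float) -> float:
    """Percentage change of the incident value relative to the baseline."""
    if baseline_value == 0:
        raise ValueError("baseline_value must be non-zero")
    return (incident_value - baseline_value) * 100 / baseline_value


def delta_before_incident(change_at: datetime, incident_start: datetime) -> timedelta:
    """Positive when the change happened before the incident started."""
    return incident_start - change_at


def rank_evidence(items: list) -> list:
    """Sort evidence dicts so the most severe items come first."""
    return sorted(items, key=lambda item: SEVERITY_ORDER[item["severity"]])


print(deviation_pct(95, 25))  # 280.0, matching the CPU example above
```

A deployment at 11:45 for an incident starting at 12:00 yields a delta of +15 minutes under this convention, which marks it as a candidate cause.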

***

## Confidence Scoring

The AI provides a confidence score (0.0-1.0) for the identified root cause:

| Score Range | Category      | Meaning                                          | Action                                          |
| ----------- | ------------- | ------------------------------------------------ | ----------------------------------------------- |
| 0.9 - 1.0   | **Very High** | Root cause identified with overwhelming evidence | Implement remediation immediately               |
| 0.7 - 0.9   | **High**      | Root cause identified with strong evidence       | Implement remediation with normal priority      |
| 0.5 - 0.7   | **Medium**    | Probable root cause, but gaps remain             | Implement remediation; monitor for alternatives |
| 0.3 - 0.5   | **Low**       | Possible root cause, evidence is circumstantial  | Validate findings manually before action        |
| 0.0 - 0.3   | **Uncertain** | Insufficient evidence to establish root cause    | Cannot determine; consider NOT\_FOUND           |

**Confidence Factors:**

* **Positive**: Temporal correlation, metric anomalies (>50% deviation), error patterns, hypotheses ruled out, multiple data sources
* **Negative**: Alternative explanations, weak temporal correlation, missing verification, conflicting evidence
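
If you consume confidence scores programmatically, the table above can be expressed as a simple lookup. The sketch below is illustrative: `confidence_category` is a hypothetical helper, and because the table's ranges share their boundary values, it assumes each lower bound is inclusive (so 0.9 is "Very High" and 0.7 is "High"). Check how CloudThinker treats boundary values before relying on this.

```python
def confidence_category(score: float) -> str:
    """Map a 0.0-1.0 confidence score to its category name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Assumption: lower bounds are inclusive, checked from highest to lowest.
    bands = [(0.9, "Very High"), (0.7, "High"), (0.5, "Medium"), (0.3, "Low")]
    for lower_bound, name in bands:
        if score >= lower_bound:
            return name
    return "Uncertain"


print(confidence_category(0.92))  # Very High
print(confidence_category(0.45))  # Low
```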

***

## Hypothesis Tracking

RCA implements hypothesis-driven investigation inspired by "5 Whys" and Fishbone methodologies.

### Hypothesis Workflow

<div style={{display: 'flex', justifyContent: 'center'}}>
  ```mermaid
  stateDiagram-v2
      direction LR
      [*] --> Investigating : hypothesis created
      Investigating --> Confirmed : evidence supports
      Investigating --> RuledOut : evidence contradicts
      Confirmed --> [*]
      RuledOut --> [*]
  ```
</div>

### Hypothesis States

* **Investigating**: Actively gathering evidence to test the theory
* **Confirmed**: Sufficient evidence supports this as root cause
* **Ruled Out**: Evidence contradicts or disproves hypothesis
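
The workflow above is a small state machine with one starting state and two final states. A minimal sketch, assuming hypothetical names (`HypothesisState`, `transition`, `all_resolved`) that do not come from the CloudThinker API:

```python
from enum import Enum


class HypothesisState(Enum):
    INVESTIGATING = "investigating"
    CONFIRMED = "confirmed"
    RULED_OUT = "ruled_out"


# Only Investigating has outgoing transitions; the other two states are final.
ALLOWED = {
    HypothesisState.INVESTIGATING: {HypothesisState.CONFIRMED, HypothesisState.RULED_OUT},
    HypothesisState.CONFIRMED: set(),
    HypothesisState.RULED_OUT: set(),
}


def transition(current: HypothesisState, target: HypothesisState) -> HypothesisState:
    """Return the new state, or raise if the move is not in the diagram."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


def all_resolved(states: list) -> bool:
    """True when no hypothesis is still being investigated."""
    return all(state is not HypothesisState.INVESTIGATING for state in states)


state = transition(HypothesisState.INVESTIGATING, HypothesisState.RULED_OUT)
print(state.name)  # RULED_OUT
```

Modeling the final states with empty transition sets makes it explicit that a ruled-out hypothesis is never reopened; a new hypothesis is created instead.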

### Example Timeline

```
Timeline Entry 1: hypothesis_created
├── Hypothesis 1: "Database connection pool exhaustion"
├── Confidence: 0.75
└── Message: "Pool exhaustion likely given 500s response times"

Timeline Entry 3: hypothesis_ruled_out
├── Hypothesis 1: Ruled Out
├── Reason: "DB metrics show 45/100 connections—well within limits"
└── Evidence: Max concurrent connections remained stable

Timeline Entry 6: hypothesis_created
├── Hypothesis 2: "Lambda cold start latency after memory reduction"
└── Confidence: 0.85

Timeline Entry 8: hypothesis_confirmed
├── Hypothesis 2: Confirmed
├── Updated Confidence: 0.92
└── Evidence: CloudWatch init duration spike, deployment timing match
```

**Quality Guidelines:**

* The AI aims to confirm at least one hypothesis before setting root cause
* Hypotheses should have supporting evidence before confirmation
* The AI resolves all hypotheses (confirmed or ruled out) before closing the investigation

***

## Investigation Timeline

RCA generates a real-time investigation timeline showing every step of the AI's reasoning.

### Timeline Entry Types

* **info**: General investigation note
* **finding**: Specific discovery impacting analysis
* **warning**: Potential issue requiring verification
* **error**: Failed investigation attempt
* **success**: Confirmed finding
* **hypothesis\_created**: New theory proposed
* **hypothesis\_ruled\_out**: Theory disproven
* **hypothesis\_confirmed**: Hypothesis validated as root cause

Timeline entries appear instantly as agents discover findings, showing phase progress, hypothesis testing, and evidence collection with timestamps.

**Limit**: 100 entries per investigation (enforced at database level)
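
To make the entry types and the 100-entry cap concrete, here is a small in-memory model. It is a sketch only: the `Timeline` class is hypothetical, and raising an error on the 101st entry is an assumption, since this page does not describe what happens when the database-level limit is reached.

```python
from datetime import datetime, timezone

ENTRY_TYPES = {
    "info", "finding", "warning", "error", "success",
    "hypothesis_created", "hypothesis_ruled_out", "hypothesis_confirmed",
}
MAX_ENTRIES = 100  # limit stated above


class Timeline:
    def __init__(self) -> None:
        self.entries = []

    def add(self, entry_type: str, message: str) -> dict:
        if entry_type not in ENTRY_TYPES:
            raise ValueError(f"unknown entry type: {entry_type!r}")
        # Assumption: exceeding the cap is rejected rather than truncated.
        if len(self.entries) >= MAX_ENTRIES:
            raise RuntimeError("timeline entry limit reached")
        entry = {
            "type": entry_type,
            "message": message,
            "timestamp": datetime.now(timezone.utc),
        }
        self.entries.append(entry)
        return entry


timeline = Timeline()
timeline.add("hypothesis_created", "Database connection pool exhaustion")
print(len(timeline.entries))  # 1
```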

***

## Disposition Status

Every investigation must conclude with a disposition. Setting disposition updates the incident status accordingly.

| Disposition      | Meaning                                      | Investigation Continues? |
| ---------------- | -------------------------------------------- | ------------------------ |
| **IDENTIFIED**   | Root cause found with supporting evidence    | ❌ NO (Terminal)          |
| **NOT\_FOUND**   | Investigation exhausted, no clear root cause | ❌ NO (Terminal)          |
| **FALSE\_ALARM** | Issue was not a real incident                | ❌ NO (Terminal)          |
| **ON\_HOLD**     | Awaiting external input or additional data   | ✅ YES (Resumable)        |

Terminal dispositions (IDENTIFIED, NOT\_FOUND, FALSE\_ALARM) prevent further investigation. ON\_HOLD allows investigation to continue when new information becomes available.

After disposition is set, the incident can progress through additional lifecycle statuses (Resolved, Post-Mortem, Closed) as your team completes follow-up actions.
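
When automating around dispositions, the terminal and resumable split in the table above can be captured as two sets. The helper name below is hypothetical and the snippet is only a sketch of the rule, not part of any CloudThinker SDK.

```python
TERMINAL_DISPOSITIONS = {"IDENTIFIED", "NOT_FOUND", "FALSE_ALARM"}
RESUMABLE_DISPOSITIONS = {"ON_HOLD"}


def investigation_continues(disposition: str) -> bool:
    """Return True only for dispositions the table marks as resumable."""
    if disposition in RESUMABLE_DISPOSITIONS:
        return True
    if disposition in TERMINAL_DISPOSITIONS:
        return False
    raise ValueError(f"unknown disposition: {disposition!r}")


print(investigation_continues("ON_HOLD"))     # True
print(investigation_continues("IDENTIFIED"))  # False
```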

***

## Triggering RCA

### Automatic Trigger

Configure webhooks to auto-trigger RCA:

```json
{
  "auto_trigger_rca": true,
  "auto_trigger_rca_min_severity": "medium"
}
```

When an incident meets the severity threshold, RCA starts automatically in the background.
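
The threshold check can be pictured as a comparison on an ordered severity scale. The sketch below is an assumption-heavy illustration: the `should_auto_trigger` function is hypothetical, and the severity ordering (low, medium, high, critical) is assumed rather than documented on this page. Only the two configuration keys come from the example above.

```python
# Assumed ordering of incident severities, lowest to highest.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def should_auto_trigger(config: dict, incident_severity: str) -> bool:
    """Return True when the incident meets the configured minimum severity."""
    if not config.get("auto_trigger_rca", False):
        return False
    minimum = config["auto_trigger_rca_min_severity"]
    return SEVERITY_RANK[incident_severity] >= SEVERITY_RANK[minimum]


config = {"auto_trigger_rca": True, "auto_trigger_rca_min_severity": "medium"}
print(should_auto_trigger(config, "high"))  # True
print(should_auto_trigger(config, "low"))   # False
```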

### Manual Trigger

1. Open the incident detail page
2. Click the **Start RCA Analysis** button
3. The system validates that no duplicate RCA is already running
4. The investigation begins within 1-3 seconds
5. The real-time timeline appears as findings are discovered

***

## Viewing RCA Results

<CardGroup cols={2}>
  <Card title="Root Cause Summary" icon="crosshairs">
    Clear explanation of root cause with confidence score (0.0-1.0) and identification timestamp.
  </Card>

  <Card title="Hypothesis Tracking" icon="lightbulb">
    All hypotheses with lifecycle: creation → testing → confirmation/ruling out with reasoning.
  </Card>

  <Card title="Evidence Chain" icon="link">
    Evidence organized by type with severity ranking, source attribution, and deep links.
  </Card>

  <Card title="Investigation Timeline" icon="clock">
    Real-time chronological log of AI investigation steps with phase transitions.
  </Card>

  <Card title="Remediation Actions" icon="list-check">
    AI-suggested fixes with priority levels (critical, high, medium, low).
  </Card>

  <Card title="Affected Services" icon="server">
    Services impacted during the investigation, with blast radius visualization if [topology](/guide/infrastructure/topology) is connected.
  </Card>
</CardGroup>

***

## Multiple RCA Runs

You can run multiple investigations on the same incident; each run is tracked as a version (v1, v2, v3...). Rerun RCA when additional information becomes available or the initial investigation was inconclusive, then compare results via the history dropdown.

***

## Best Practices

**Investigation Setup:**

* Connect [topology](/guide/infrastructure/topology) for blast radius analysis and service correlation
* Configure [webhooks](/guide/incident/webhook-integrations) to auto-trigger RCA for medium+ severity incidents
* Provide incident context in description to guide investigation

**During Investigation:**

* Monitor timeline in real-time to follow hypothesis testing
* Review hypothesis chain to understand which theories were tested and ruled out
* Verify evidence timestamps correlate with incident start

**After Investigation:**

* Validate root cause if confidence \< 0.7 before implementing remediation
* Focus on critical-priority remediation actions first
* Use version history if new information emerges (run RCA multiple times)
* Connect [Runbooks](/guide/incident/runbooks) so agents can automatically find and execute remediation procedures during future investigations
