# Root Cause Analysis

> Accelerating Root Cause Analysis for cloud incidents with CloudThinker on AWS

### **The role of observability in incident response**

Production observability systems must meet three key requirements:

* **Unified visibility**: Correlate logs, metrics, and traces from multiple services to understand how failures propagate through microservices and trace issues back to their origin
* **Real-time correlation**: Preserve relationships between applications, databases, queues, and infrastructure to quickly determine if issues are local or dependency-driven
* **Automated analysis with contextual intelligence**: AWS provides CloudWatch, X-Ray, and CloudWatch Logs, but these generate massive data volumes requiring human interpretation. CloudThinker adds intelligent analysis that automatically correlates signals, identifies anomalies, and explains root causes in business context

Traditional siloed monitoring creates blind spots where symptom-to-root-cause relationships remain obscured, leading to extended investigation times and potential misdiagnosis.

<Note>
  This page focuses on the Incident pillar of the [Deep Response Engine](/guide/incident/overview). Upstream of investigation, **[Pulse](/guide/pulse/overview)** ingests signals from 10+ sources, applies seven suppression layers to remove \~98% of noise, and correlates the rest into clusters — so RCA only fires for events that are actually actionable.
</Note>

### **Challenges with traditional incident response workflows**

**Manual investigation delays**: Engineers manually grep CloudWatch Logs across log groups, correlate the results with CloudWatch metrics, and trace requests through X-Ray, extending customer impact while the investigation runs.

**Siloed expertise**: Application teams, DBAs, and infrastructure teams investigate sequentially within their own domains and often discover, after losing time, that the root cause lies outside their area. This handoff-intensive process increases MTTR.

**Cross-service correlation**: Applications span RDS Aurora, DocumentDB, Lambda, ECS/EKS, and external dependencies. Understanding how database timeouts cause checkout failures requires pulling data from RDS metrics, VPC Flow Logs, ALB access logs, and application logs, each with different query languages and access patterns. Reconstructing a failure timeline this way can consume hours.
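To make the timeline-reconstruction problem concrete, here is a minimal Python sketch of merging events from several sources into one chronological view. It is not CloudThinker's implementation: the `Event` type, the `build_timeline` helper, the source labels, and all sample events are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from heapq import merge


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # illustrative label, e.g. "rds-metrics" or "alb-access-log"
    message: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge per-source event lists into one chronologically ordered timeline."""
    # Sort each source on its own, then let heapq.merge interleave them.
    ordered = (sorted(events, key=lambda e: e.timestamp) for events in sources)
    return list(merge(*ordered, key=lambda e: e.timestamp))


def ts(hhmmss: str) -> datetime:
    """Sample-data helper: a UTC time on an arbitrary example day."""
    parsed = datetime.strptime(f"2025-01-01 {hhmmss}", "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical events, invented for this example.
rds_events = [Event(ts("10:00:05"), "rds-metrics", "connection count at limit")]
alb_events = [Event(ts("10:00:20"), "alb-access-log", "HTTP 504 on /checkout")]
app_events = [
    Event(ts("10:00:12"), "app-log", "database timeout in checkout handler"),
    Event(ts("10:00:01"), "app-log", "checkout latency rising"),
]

timeline = build_timeline(rds_events, alb_events, app_events)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.message)
```

Even in this toy form, the hard part in practice is not the merge itself but getting each source into a common shape with comparable timestamps, which is where the different query languages and access patterns cost time.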

### **Solution: CloudThinker's agentic approach to automated RCA**

CloudThinker uses specialized AI agents coordinated by a Multi-Agent System that automatically investigate across the entire stack, correlate signals, and generate comprehensive RCA documentation within minutes.

**Specialized agents:**

* **[Tony](/guide/agents/tony) (Database Administrator)**: Analyzes RDS Aurora and DocumentDB performance, identifies slow queries, connection pool exhaustion, resource constraints
* **[Alex](/guide/agents/alex) (Cloud Engineer)**: Examines AWS infrastructure including EC2, load balancers, VPC networking
* **[Kai](/guide/agents/kai) (Kubernetes Specialist)**: Investigates pod health, container restarts, resource limits, service mesh configurations on Amazon EKS
* **[Oliver](/guide/agents/oliver) (Security Engineer)**: Analyzes security groups, network policies, IAM permissions, security-related failure modes

The agentic approach investigates traditionally siloed domains in parallel. Agents simultaneously query AWS APIs and observability data sources and correlate findings in real time to identify fault domains quickly, validate or eliminate hypotheses, and pinpoint upstream triggers, even when symptoms appear far from the underlying issue.
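The parallel pattern described above can be sketched in a few lines of Python. This is an illustration of the general technique only, not how CloudThinker's agents are built: the `check_*` functions are hypothetical stand-ins that return canned findings, where a real investigation would query AWS APIs and observability data.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for per-domain investigations (canned results).
def check_database() -> dict:
    return {"domain": "database", "anomaly": False, "detail": "no slow queries"}


def check_infrastructure() -> dict:
    return {"domain": "infrastructure", "anomaly": True, "detail": "daily scale-down events"}


def check_kubernetes() -> dict:
    return {"domain": "kubernetes", "anomaly": True, "detail": "network interface failures"}


def check_security() -> dict:
    return {"domain": "security", "anomaly": False, "detail": "no denied API calls"}


def investigate_in_parallel(checks) -> list[dict]:
    """Run every check concurrently and return findings in submission order."""
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check) for check in checks]
        return [future.result() for future in futures]


findings = investigate_in_parallel(
    [check_database, check_infrastructure, check_kubernetes, check_security]
)
# Domains without an anomaly are eliminated; the rest are candidate fault domains.
fault_domains = [f["domain"] for f in findings if f["anomaly"]]
print(fault_domains)
```

Because all checks start at once, the total wait is bounded by the slowest check rather than the sum of all of them, which is the core advantage over sequential team handoffs.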

### **Real-World Investigation: EC2 Terminations and EKS Network Issues**

Consider a scenario where CloudThinker's continuous monitoring detects two concerning patterns:

**Critical Finding 1: Frequent EC2 Instance Terminations**

**Critical Finding 2: EKS Network Issues**

This investigation demonstrates how to use CloudThinker's agents to analyze these interconnected infrastructure issues systematically.

### **Step 1: Analyze EC2 termination patterns with Alex**

**Analyze AutoScaling termination patterns:**

```
@alex investigate EC2 termination patterns for the past 60 days

Focus on:
- AutoScaling group policies and scaling triggers
- Time-based patterns (analyze the 17:15 UTC daily terminations)
- Correlation between scale-down events and application load metrics
- CloudWatch alarms triggering the scale-down decisions
- Cost impact of frequent instance churn vs stability risks

Break down by:
- AutoScaling group name and configuration
- Termination reason (scale-down, health check failure, manual)
- Instance types and sizes affected
- Availability zone distribution
- Time to replacement for terminated instances
```

<Frame>
  <img src="https://mintcdn.com/cloudthinker/0IKJjKZJEIROke98/images/use-cases/incident-root-cause-analysis/01-ec2-termination-analysis.jpg?fit=max&auto=format&n=0IKJjKZJEIROke98&q=85&s=5365c6080b4971af1fc30fa9db1ff918" alt="EC2 termination pattern analysis showing AutoScaling events and timeline" width="1680" height="1014" data-path="images/use-cases/incident-root-cause-analysis/01-ec2-termination-analysis.jpg" />
</Frame>

<p style={{textAlign: 'center', fontSize: '0.9em', color: '#666', marginTop: '8px'}}>EC2 termination pattern analysis showing AutoScaling events</p>
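For readers who want to see what the "break down by" analysis amounts to, the following Python sketch groups termination records by reason and by time of day. The record fields (`instance_id`, `reason`, `terminated_at`) and the sample data are assumptions made for this example; they are not a CloudThinker or AWS data format, and timestamps are assumed to already be in UTC.

```python
from collections import Counter
from datetime import datetime


def summarize_terminations(records: list[dict]) -> dict:
    """Count terminations by reason and by HH:MM time of day."""
    by_reason = Counter(r["reason"] for r in records)
    by_time = Counter(
        # Timestamps are assumed to be UTC already; no conversion is applied.
        datetime.fromisoformat(r["terminated_at"]).strftime("%H:%M")
        for r in records
    )
    return {"by_reason": by_reason, "by_time": by_time}


# Hypothetical records, invented for this example.
records = [
    {"instance_id": "i-example1", "reason": "scale-down", "terminated_at": "2025-01-06T17:15:00+00:00"},
    {"instance_id": "i-example2", "reason": "scale-down", "terminated_at": "2025-01-07T17:15:00+00:00"},
    {"instance_id": "i-example3", "reason": "health-check", "terminated_at": "2025-01-07T03:40:00+00:00"},
    {"instance_id": "i-example4", "reason": "manual", "terminated_at": "2025-01-08T11:02:00+00:00"},
    {"instance_id": "i-example5", "reason": "scale-down", "terminated_at": "2025-01-08T17:15:00+00:00"},
]

summary = summarize_terminations(records)
peak_time, peak_count = summary["by_time"].most_common(1)[0]
print(dict(summary["by_reason"]))
print(f"Most common termination time: {peak_time} UTC ({peak_count} events)")
```

A spike at a single time of day, as in the sample data, points toward a scheduled action or a recurring alarm rather than random health-check failures.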

### **Step 2: Investigate EKS network interface failures with Kai and Alex**

**Cross-correlate network failures with instance terminations:**

```
@alex correlate CreateNetworkInterface failures with EC2 termination events

Analysis goals:
- Identify if aggressive scale-downs contribute to IP address fragmentation
- Determine if rapid instance churn prevents IP address reclamation
- Calculate optimal scaling strategy to maintain IP address availability
- Assess whether termination lifecycle hooks allow proper ENI cleanup

Timeline correlation window: 60 days
```

<Frame>
  <img src="https://mintcdn.com/cloudthinker/0IKJjKZJEIROke98/images/use-cases/incident-root-cause-analysis/02-network-failure-correlation.jpg?fit=max&auto=format&n=0IKJjKZJEIROke98&q=85&s=6c2aa5a8e8ab2fc67cedc1a330e50dc3" alt="Network failure correlation with CreateNetworkInterface errors and IP exhaustion" width="1678" height="672" data-path="images/use-cases/incident-root-cause-analysis/02-network-failure-correlation.jpg" />
</Frame>

<p style={{textAlign: 'center', fontSize: '0.9em', color: '#666', marginTop: '8px'}}>Network failure correlation with CreateNetworkInterface errors and IP exhaustion</p>
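The correlation requested in this step can be approximated with a simple time-window check, sketched below in Python. The helper names, the 30-minute window, the example CIDR, and all timestamps are hypothetical choices for illustration. The one fixed fact used is that AWS reserves five IP addresses in every subnet, so a subnet has five fewer usable addresses than its CIDR size suggests.

```python
import ipaddress
from datetime import datetime, timedelta, timezone


def usable_ips(cidr: str) -> int:
    """Addresses available in an AWS subnet (AWS reserves 5 per subnet)."""
    return ipaddress.ip_network(cidr).num_addresses - 5


def failures_after_terminations(failures, terminations, window):
    """Return the failures that occurred within `window` after any termination."""
    return [
        f for f in failures
        if any(timedelta(0) <= f - t <= window for t in terminations)
    ]


def utc(day: int, hour: int, minute: int) -> datetime:
    """Sample-data helper: a UTC timestamp in an arbitrary example month."""
    return datetime(2025, 1, day, hour, minute, tzinfo=timezone.utc)


# Hypothetical timestamps, invented for this example.
terminations = [utc(6, 17, 15), utc(7, 17, 15)]
eni_failures = [utc(6, 17, 20), utc(7, 8, 0), utc(7, 17, 40)]

related = failures_after_terminations(eni_failures, terminations, timedelta(minutes=30))
print(f"{len(related)} of {len(eni_failures)} failures within 30 minutes of a termination")
print("Usable IPs in a /24 subnet:", usable_ips("10.0.1.0/24"))
```

A fixed window is a blunt instrument: failures that surface hours later, for example during the next scale-up, would need a wider window or a different join key, so treat the count as a starting signal rather than proof of causation.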

### **Step 3: Generate comprehensive RCA report with Anna**

After completing the multi-agent investigation, use [Anna](/guide/agents/anna) to synthesize all findings into an executive RCA report:

**Generate complete RCA documentation:**

```
@anna #report comprehensive RCA for infrastructure stability issues

Incident ID: INFRA-2025-001
Title: "Frequent EC2 Terminations and EKS Network Interface Failures"

Include:
- Executive summary with business impact
- Detailed timeline of 60-day pattern analysis
- Root cause analysis combining findings from Alex (EC2/cost), Kai (EKS/networking), and Oliver (security)
- Visual topology snapshots showing subnet IP exhaustion
- Cost-benefit analysis of remediation options
- Immediate remediation steps (0-7 days)
- Strategic improvements (30-90 days)
- Preventive measures to avoid recurrence

Context:
- 30+ EC2 instances terminated in 60 days (28 by AutoScaling, 2 manual)
- 47 CreateNetworkInterface failures due to IP exhaustion
- Service degradation incidents during morning scaling events
- Vicious cycle of aggressive scale-down and scaling failures

Target audience: Engineering leadership and Infrastructure team
Required output: Action items with owners and due dates
```

<Frame>
  <img src="https://mintcdn.com/cloudthinker/0IKJjKZJEIROke98/images/use-cases/incident-root-cause-analysis/03-comprehensive-rca-report.jpg?fit=max&auto=format&n=0IKJjKZJEIROke98&q=85&s=eeb6c73da96c64dc68e83663ddf7b88b" alt="Comprehensive RCA report with findings, remediation steps, and preventive measures" width="1674" height="964" data-path="images/use-cases/incident-root-cause-analysis/03-comprehensive-rca-report.jpg" />
</Frame>

<p style={{textAlign: 'center', fontSize: '0.9em', color: '#666', marginTop: '8px'}}>Comprehensive RCA report with findings and remediation steps</p>

This comprehensive RCA report demonstrates how CloudThinker synthesizes findings from multiple agents ([Alex](/guide/agents/alex) for infrastructure and cost, [Kai](/guide/agents/kai) for EKS networking, [Oliver](/guide/agents/oliver) for security) into actionable documentation that includes technical details, business impact quantification, remediation steps, and preventive measures—all generated automatically in minutes rather than requiring hours of manual documentation effort.

## **Comparing CloudThinker with traditional incident response approaches**

| Dimension                    | Traditional Incident Response                                                                    | CloudThinker Automated RCA                                                                                 |
| ---------------------------- | ------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------- |
| **Time to Root Cause**       | 30-120 minutes with sequential investigation across teams                                        | 2-10 minutes with parallel agent investigation                                                             |
| **Expertise Required**       | Deep knowledge of grep, AWS CLI, CloudWatch Logs Insights, and log correlation                   | Visual pattern recognition on topology plus natural language prompts                                       |
| **Cross-Service Visibility** | Very difficult—requires querying multiple systems sequentially and manually correlating findings | Immediately visible through [Topology Explorer](/guide/infrastructure/topology) with automated correlation |
| **Documentation Time**       | Manual hours after incident resolution, often delayed or incomplete                              | Instant generation via #report command with comprehensive analysis                                         |
| **Investigation Approach**   | Reactive, sequential troubleshooting: check app logs, then DB metrics, then network, etc.        | Proactive, parallel analysis: all agents investigate simultaneously across the stack                       |
| **Dependency Mapping**       | Mental model or outdated architecture diagrams, often incomplete                                 | Real-time topology from actual AWS service dependencies and traffic patterns                               |
| **Alert Fatigue**            | High: multiple teams receive alerts, and many false positives require manual triage              | Reduced: agents provide context and severity assessment automatically                                      |
| **Post-Incident Learning**   | Varies widely: depends on the engineer's documentation thoroughness and time availability        | Consistent high-quality RCA reports with actionable recommendations                                        |

## What's Next

<CardGroup cols={2}>
  <Card title="Pulse" icon="wave-pulse" href="/guide/pulse/overview">
    Upstream signal intelligence — 7 suppression layers, AI classification, cluster correlation
  </Card>

  <Card title="RCA Deep Dive" icon="magnifying-glass" href="/guide/incident/root-cause-analysis">
    Hypothesis tracking, evidence chains, confidence scoring, and disposition workflow
  </Card>

  <Card title="Webhook Integrations" icon="webhook" href="/guide/incident/webhook-integrations">
    Auto-trigger RCA from PagerDuty, Datadog, Prometheus, and 11+ platforms
  </Card>

  <Card title="Topology Explorer" icon="diagram-project" href="/guide/infrastructure/topology">
    Build live dependency maps for faster blast radius analysis during incidents
  </Card>
</CardGroup>
