# Root Cause Analysis with Topology

> How to use Topology Explorer for faster incident resolution and RCA

[Topology Explorer](/guide/infrastructure/topology) transforms how teams perform [Root Cause Analysis](/guide/incident/root-cause-analysis) (RCA). Instead of hunting through logs and metrics, visually trace issues through your infrastructure to find root causes in minutes, not hours.

***

## The Problem with Traditional RCA

**Traditional approach:**

1. Alert fires → Check dashboard → Looks fine
2. SSH into servers → Grep logs → Nothing obvious
3. Check database → Metrics look normal
4. Call more engineers → War room starts
5. 2 hours later → Find the cause was a failing dependency

**With Topology:**

1. Alert fires → Open Topology
2. See the failing service highlighted in red
3. Trace upstream → Find the actual root cause
4. **Time to root cause: 5 minutes**

***

## Real-World RCA Scenarios

### Scenario 1: E-Commerce Checkout Failures

**The alert:** "Checkout success rate dropped to 60%"

**Traditional debugging:**

* Check checkout service logs: Some timeout errors
* Check payment service: Healthy
* Check database: CPU looks fine
* Check Redis: No errors
* **30 minutes in, still searching...**

**With Topology:**

```bash theme={null}
@alex show topology centered on checkout-service with health status
```

<Frame>
  <img src="https://mintcdn.com/cloudthinker/M-utUm-TaqDSbEEK/images/infrastructure/topology-explorer.jpg?fit=max&auto=format&n=M-utUm-TaqDSbEEK&q=85&s=b931c4d83fd7741964ff90dca6583bcc" alt="Topology showing service dependencies" width="3578" height="2010" data-path="images/infrastructure/topology-explorer.jpg" />
</Frame>

**Instant visibility:**

* Checkout → Payment Gateway → **External API (degraded)**
* Third-party payment provider having issues
* Root cause found in **2 minutes**

**Action:** Enable backup payment provider, notify customers.
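
The trace in this scenario amounts to walking the dependency graph from the alerting service and looking for unhealthy nodes whose own dependencies are all healthy. The following is a minimal Python sketch of that idea, not Topology Explorer's implementation. The graph, the health snapshot, the service names other than `checkout-service`, and the function `root_cause_candidates` are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical dependency graph: service -> services it calls.
DEPENDENCIES: Dict[str, List[str]] = {
    "checkout-service": ["payment-gateway", "postgres", "redis"],
    "payment-gateway": ["external-payment-api"],
    "external-payment-api": [],
    "postgres": [],
    "redis": [],
}

# Hypothetical health snapshot; in practice this would come from health checks.
HEALTH: Dict[str, str] = {
    "checkout-service": "degraded",
    "payment-gateway": "degraded",
    "external-payment-api": "degraded",
    "postgres": "healthy",
    "redis": "healthy",
}


def root_cause_candidates(
    start: str, deps: Dict[str, List[str]], health: Dict[str, str]
) -> List[str]:
    """Return unhealthy services reachable from `start` whose own
    dependencies are all healthy. Unknown health counts as unhealthy."""
    candidates: List[str] = []
    seen = set()
    stack = [start]
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        children = deps.get(service, [])
        unhealthy = health.get(service, "unknown") != "healthy"
        if unhealthy and all(health.get(c, "unknown") == "healthy" for c in children):
            candidates.append(service)
        stack.extend(children)
    return sorted(candidates)


print(root_cause_candidates("checkout-service", DEPENDENCIES, HEALTH))
# ['external-payment-api']
```

The checkout service and the payment gateway are both degraded, but each has a degraded dependency, so only the external API is reported as a candidate.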

***

### Scenario 2: Mysterious Latency Spike

**The alert:** "API P99 latency exceeded 5 seconds"

```bash theme={null}
@alex overlay latency metrics on production topology
@tony identify slowest component in the request path
```

**Topology reveals the hot path:**

```
Client → CDN (2ms) → ALB (5ms) → API (100ms) → Cache Miss → RDS (4500ms)
                                             ↘ Cache Hit → Redis (3ms)
```

**Root cause:** Cache invalidation bug causing 100% cache misses

**Without Topology:** The team would likely have blamed the database and spent hours optimizing queries.

**With Topology:** The cache-bypass pattern was visible immediately, and the bug was fixed in 15 minutes.
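
To make the arithmetic behind the hot path explicit, here is a small Python sketch that finds the slowest hop on the cache-miss path and checks the cache hit ratio. The per-hop latencies are copied from the diagram above. The hit and miss counters and the function names `slowest_hop` and `cache_hit_ratio` are assumptions for illustration only.

```python
# Per-hop latencies in milliseconds, copied from the cache-miss path above.
CACHE_MISS_PATH = [("CDN", 2), ("ALB", 5), ("API", 100), ("RDS", 4500)]


def slowest_hop(path):
    """Return (name, latency_ms, share_of_total) for the slowest hop."""
    total = sum(ms for _, ms in path)
    name, ms = max(path, key=lambda hop: hop[1])
    return name, ms, ms / total


def cache_hit_ratio(hits, misses):
    """Return the fraction of lookups served from cache (0.0 if there were none)."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


name, ms, share = slowest_hop(CACHE_MISS_PATH)
print(f"{name}: {ms} ms ({share:.1%} of the request)")
# RDS: 4500 ms (97.7% of the request)

# Hypothetical counters: every lookup is a miss, as in this scenario.
print(f"cache hit ratio: {cache_hit_ratio(hits=0, misses=1200):.0%}")
# cache hit ratio: 0%
```

The slowest hop points at RDS, but the zero hit ratio is what shows the database is absorbing traffic the cache should have served.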

***

### Scenario 3: Cascading Service Failures

**The alert:** "Multiple services returning 500 errors"

When everything is failing, where do you start?

```bash theme={null}
@anna trace failure propagation timeline across topology
@alex identify the origin point of cascading failures
```

**Topology timeline view:**

| Time     | Service          | Status | Cause                 |
| -------- | ---------------- | ------ | --------------------- |
| 10:00:00 | **Auth Service** | Failed | Certificate expired   |
| 10:00:05 | User Service     | Failed | Can't validate tokens |
| 10:00:08 | Order Service    | Failed | Auth dependency       |
| 10:00:10 | Payment Service  | Failed | Auth dependency       |
| 10:00:15 | All Services     | Failed | Cascade complete      |

**Root cause:** Expired SSL certificate on Auth Service

**Without Topology:** Teams would investigate each service independently and take hours to correlate the failures.

**With Topology:** Single view shows the failure origin and propagation path.
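
The timeline view boils down to ordering failure events by time and measuring how long each one trails the first. Below is a minimal Python sketch under that assumption, using the timestamps from the table above (the "All Services" summary row is left out). The function name `propagation_order` is hypothetical, and the earliest failure is only a candidate origin until the dependency path confirms it.

```python
from datetime import datetime

# Failure events copied from the timeline table above, deliberately unsorted.
FAILURES = [
    ("Payment Service", "10:00:10"),
    ("User Service", "10:00:05"),
    ("Auth Service", "10:00:00"),
    ("Order Service", "10:00:08"),
]


def propagation_order(failures):
    """Sort failures by time and return (service, seconds_after_origin) pairs."""
    parsed = sorted((datetime.strptime(t, "%H:%M:%S"), name) for name, t in failures)
    origin_time = parsed[0][0]
    return [(name, int((t - origin_time).total_seconds())) for t, name in parsed]


order = propagation_order(FAILURES)
print("candidate origin:", order[0][0])
for service, delay in order:
    print(f"+{delay:>2}s  {service}")
```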

***

### Scenario 4: Database Connection Exhaustion

**The alert:** "PostgreSQL: too many connections"

```bash theme={null}
@tony map all services connecting to production-db in topology
@alex show connection pool sizes per service
```

**Topology visualization:**

```
                 ┌─────────────────────────┐
                 │    PostgreSQL (RDS)     │
                 │   Max: 500 | Used: 498  │
                 └─────────────────────────┘
                    ↑      ↑      ↑      ↑
              ┌─────┴──┐ ┌─┴───┐ ┌┴────┐ ┌┴─────┐
              │API     │ │Jobs │ │Cron │ │Admin │
              │Pool:200│ │300  │ │ 50  │ │ 20   │
              └────────┘ └─────┘ └─────┘ └──────┘
                           ↑
                    PROBLEM: Pool too large
```

**Root cause:** The Jobs service is configured with a pool of 300 connections but needs only 50. The configured pools total 570 against a database limit of 500.

**Fix:** Right-size connection pools based on actual usage patterns.
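
The check behind this diagnosis is simple arithmetic: add up the configured pool sizes and compare the total with the database limit. A short Python sketch using the numbers from the diagram above follows. The function name `oversubscription` is an assumption, and the resized Jobs pool of 50 comes from the root cause statement rather than from measured usage.

```python
MAX_CONNECTIONS = 500  # database limit from the diagram above

# Configured pool size per service, from the diagram above.
POOLS = {"API": 200, "Jobs": 300, "Cron": 50, "Admin": 20}


def oversubscription(pools, max_connections):
    """Return how many connections the pools can request beyond the limit."""
    return max(0, sum(pools.values()) - max_connections)


total = sum(POOLS.values())
over = oversubscription(POOLS, MAX_CONNECTIONS)
print(f"configured: {total}, limit: {MAX_CONNECTIONS}, over by: {over}")
# configured: 570, limit: 500, over by: 70

# Resize the Jobs pool to the 50 connections it is said to need.
resized = {**POOLS, "Jobs": 50}
print(f"after resize: {sum(resized.values())} configured")
# after resize: 320 configured
```

After the resize, the pools fit within the limit with headroom for ad hoc sessions and failover.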

***

### Scenario 5: Kubernetes Pod Crashloop

**The alert:** "Pod inventory-service in CrashLoopBackOff"

```bash theme={null}
@kai show topology for inventory-service with all dependencies
@kai correlate crash events with dependency health
```

**Topology + Events:**

* inventory-service → **MongoDB (connection timeout)**
* MongoDB is running, but a network policy blocks traffic from the pod
* Recent change: the security team updated network policies

**Root cause:** Network policy change broke pod-to-database connectivity.

**Resolution:** Update network policy to allow inventory-service → MongoDB.
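
Correlating crashes with recent changes can be approximated by listing every change that landed shortly before the first crash. The Python sketch below illustrates the idea. The change log entries, version numbers, timestamps, and the function name `changes_before` are all made up for illustration and do not come from a real incident or from Topology Explorer.

```python
from datetime import datetime, timedelta

# Hypothetical change log; descriptions and timestamps are illustrative only.
CHANGES = [
    ("deploy inventory-service v1.8.2", datetime(2024, 5, 1, 9, 0)),
    ("network policy update (security team)", datetime(2024, 5, 1, 13, 40)),
    ("deploy order-service v3.1.0", datetime(2024, 5, 1, 15, 0)),
]

# Hypothetical time of the first CrashLoopBackOff event.
FIRST_CRASH = datetime(2024, 5, 1, 13, 45)


def changes_before(changes, event_time, window):
    """Return changes made within `window` before `event_time`, most recent first."""
    hits = [
        (desc, ts)
        for desc, ts in changes
        if timedelta(0) <= event_time - ts <= window
    ]
    return sorted(hits, key=lambda change: change[1], reverse=True)


for desc, ts in changes_before(CHANGES, FIRST_CRASH, timedelta(hours=1)):
    print(f"{ts:%H:%M}  {desc}")
# 13:40  network policy update (security team)
```

A change that falls inside the window is a lead to verify, not proof of cause. Here it would be checked against the connection timeout seen on the MongoDB edge.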

***

## Why Topology Beats Log Diving

| Aspect                   | Log Analysis           | Topology RCA               |
| ------------------------ | ---------------------- | -------------------------- |
| **Time to root cause**   | 30-120 minutes         | 2-10 minutes               |
| **Expertise required**   | Deep system knowledge  | Visual pattern recognition |
| **Cross-service issues** | Very difficult         | Immediately visible        |
| **Context switching**    | Multiple tools/screens | Single unified view        |
| **Knowledge transfer**   | Hard to explain        | Visual, shareable          |

***

## RCA Best Practices with Topology

### 1. Start from the Alert

```bash theme={null}
@alex show topology centered on [alerting-service] with health overlay
```

Don't start from logs. Start from the visual representation of your system.

### 2. Trace Upstream First

Many incidents originate in a dependency rather than in the alerting service itself, so check what the failing service depends on before digging into the service.

```bash theme={null}
@alex trace upstream dependencies from [failing-service]
```

### 3. Overlay Metrics

Add context to your topology with real-time metrics.

```bash theme={null}
@alex overlay [latency/errors/cpu/memory] metrics on topology
```

### 4. Check Recent Changes

Correlate the incident with the deployment timeline.

```bash theme={null}
@alex show topology changes in last 24 hours
@alex highlight recently deployed services
```

### 5. Document for Postmortem

Export topology snapshots for incident documentation.

```bash theme={null}
@alex export topology snapshot with annotations for incident INC-1234
```

***

## Setting Up for Success

### Pre-Incident Preparation

1. **Build your topology** before incidents happen
2. **Connect all data sources** (cloud, Kubernetes, databases)
3. **Set up health checks** so topology shows real-time status
4. **Train the team** on topology navigation

### During Incident

1. Open Topology as **first response action**
2. Share topology view in war room
3. Annotate findings in real-time
4. Track resolution progress visually

### Post-Incident

1. Export topology snapshot for postmortem
2. Document the failure path
3. Identify missing monitoring
4. Update runbooks with topology-based procedures

***

## Get Started

<CardGroup cols={2}>
  <Card title="Topology Explorer" icon="diagram-project" href="/guide/infrastructure/topology">
    Learn how to build and use topology maps
  </Card>

  <Card title="Infrastructure Setup" icon="server" href="/guide/infrastructure/resources">
    Connect your infrastructure for topology discovery
  </Card>
</CardGroup>

***

## Key Takeaways

* **Topology reduces MTTR** from hours to minutes
* **Visual RCA** requires less expertise than log analysis
* **Dependency mapping** reveals issues traditional monitoring misses
* **Pre-built topology** is essential for fast incident response
* **Share topology views** to align teams during incidents

> "We went from 2-hour war rooms to 15-minute fixes. Topology changed everything about how we do incident response." — Platform Engineering Lead
