Root Cause Analysis with Topology

Topology Explorer transforms how teams perform Root Cause Analysis (RCA). Instead of hunting through logs and metrics, visually trace issues through your infrastructure to find root causes in minutes, not hours.

The Problem with Traditional RCA

Traditional approach:
  1. Alert fires → Check dashboard → Looks fine
  2. SSH into servers → Grep logs → Nothing obvious
  3. Check database → Metrics look normal
  4. Call more engineers → War room starts
  5. 2 hours later → Find it was a downstream dependency
With Topology:
  1. Alert fires → Open Topology
  2. See the failing service highlighted in red
  3. Trace upstream → Find the actual root cause
  4. Time to resolution: 5 minutes

Real-World RCA Scenarios

Scenario 1: E-Commerce Checkout Failures

The alert: “Checkout success rate dropped to 60%”
Traditional debugging:
  • Check checkout service logs: Some timeout errors
  • Check payment service: Healthy
  • Check database: CPU looks fine
  • Check Redis: No errors
  • 30 minutes in, still searching…
With Topology:
@alex show topology centered on checkout-service with health status
[Topology view showing service dependencies]
Instant visibility:
  • Checkout → Payment Gateway → External API (degraded)
  • Third-party payment provider having issues
  • Root cause found in 2 minutes
Action: Enable backup payment provider, notify customers.
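If the checkout service already has a secondary provider configured, the switch itself can be scripted. A minimal Python sketch of the failover logic, assuming hypothetical primary and backup gateway clients that expose a charge() method (adapt to your actual payment SDK):

```python
import logging

logger = logging.getLogger("checkout")

class PaymentError(Exception):
    """Raised when a payment provider cannot complete a charge."""

def charge_with_failover(order, primary, backup, timeout_s=3.0):
    # `primary` and `backup` are hypothetical gateway clients exposing
    # charge(order, timeout_s); swap in your real payment SDK.
    try:
        return primary.charge(order, timeout_s=timeout_s)
    except (PaymentError, TimeoutError) as exc:
        logger.warning("Primary provider failed (%s); switching to backup", exc)
        return backup.charge(order, timeout_s=timeout_s)
```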

Scenario 2: Mysterious Latency Spike

The alert: “API P99 latency exceeded 5 seconds”
@alex overlay latency metrics on production topology
@tony identify slowest component in the request path
Topology reveals the hot path:
Client → CDN (2ms) → ALB (5ms) → API (100ms) → Cache Miss → RDS (4500ms)
                                             ↘ Cache Hit → Redis (3ms)
Root cause: Cache invalidation bug causing 100% cache misses.
Without Topology: Would have blamed the database and spent hours optimizing queries.
With Topology: Immediately saw the cache bypass pattern; fixed in 15 minutes.
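A 100% miss rate like this usually means the read path and the write path no longer agree on the cache key. A minimal illustrative sketch in Python using the standard redis-py client (the key names and db.fetch_product are hypothetical stand-ins for the actual bug, not this incident’s real code):

```python
import json
import redis  # standard redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_product(product):
    # Populated by the catalog updater after every write.
    r.setex(f"product:{product['id']}", 300, json.dumps(product))

def get_product(product_id, db):
    # Read path in the API. Illustrative bug: after a refactor the lookup
    # key ("products:", plural) no longer matches the key written above,
    # so every request misses and falls through to the slow database query.
    cached = r.get(f"products:{product_id}")
    if cached is not None:
        return json.loads(cached)        # the 3 ms Redis path
    return db.fetch_product(product_id)  # the 4500 ms RDS path
```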

Scenario 3: Cascading Service Failures

The alert: “Multiple services returning 500 errors”
When everything is failing, where do you start?
@anna trace failure propagation timeline across topology
@alex identify the origin point of cascading failures
Topology timeline view:
Time      Service          Status   Cause
10:00:00  Auth Service     Failed   Certificate expired
10:00:05  User Service     Failed   Can’t validate tokens
10:00:08  Order Service    Failed   Auth dependency
10:00:10  Payment Service  Failed   Auth dependency
10:00:15  All Services     Failed   Cascade complete
Root cause: Expired SSL certificate on the Auth Service.
Without Topology: Teams would investigate each service independently, taking hours to correlate.
With Topology: A single view shows the failure origin and propagation path.
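Certificate expiry is also cheap to catch before it cascades. A small Python sketch using only the standard library (the host name and 14-day threshold are placeholders to adapt):

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(host, port=443):
    """Return the number of days left on the TLS certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like 'Jun  1 12:00:00 2026 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    days = days_until_cert_expiry("auth.example.internal")  # placeholder host
    if days < 14:
        print(f"WARNING: auth certificate expires in {days} days")
```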

Scenario 4: Database Connection Exhaustion

The alert: “PostgreSQL: too many connections”
@tony map all services connecting to production-db in topology
@alex show connection pool sizes per service
Topology visualization:
                 ┌─────────────────────────┐
                 │    PostgreSQL (RDS)     │
                 │   Max: 500 | Used: 498  │
                 └─────────────────────────┘
                    ↑      ↑      ↑      ↑
              ┌─────┴──┐ ┌─┴───┐ ┌┴────┐ ┌┴─────┐
              │API     │ │Jobs │ │Cron │ │Admin │
              │Pool:200│ │300  │ │ 50  │ │ 20   │
              └────────┘ └─────┘ └─────┘ └──────┘

                    PROBLEM: Pool too large
Root cause: The configured pools total 570 connections against a 500-connection limit; the Jobs service alone is configured with 300 connections but only needs 50.
Fix: Right-size connection pools based on actual usage patterns.
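The underlying arithmetic is simple enough to automate as a guardrail: the sum of all per-service pool sizes should stay below the database’s max_connections, with some headroom. A minimal sketch (pool numbers taken from the diagram above; in practice you would read them from service configuration):

```python
# Connection-pool budget check: flag when configured pools can exhaust the DB.
MAX_CONNECTIONS = 500   # PostgreSQL max_connections
HEADROOM = 0.8          # keep ~20% free for admin and replication sessions

pools = {               # values from the topology diagram above
    "api": 200,
    "jobs": 300,        # actual usage suggests ~50 is enough
    "cron": 50,
    "admin": 20,
}

total = sum(pools.values())
budget = int(MAX_CONNECTIONS * HEADROOM)
print(f"Configured pools: {total} / budget: {budget}")
if total > budget:
    worst = max(pools, key=pools.get)
    print(f"Over budget by {total - budget}; largest pool is {worst} ({pools[worst]})")
```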

Scenario 5: Kubernetes Pod Crashloop

The alert: “Pod inventory-service in CrashLoopBackOff”
@kai show topology for inventory-service with all dependencies
@kai correlate crash events with dependency health
Topology + Events:
  • inventory-service → MongoDB (connection timeout)
  • MongoDB running but network policy blocking pod
  • Recent change: Security team updated network policies
Root cause: A network policy change broke pod-to-database connectivity.
Resolution: Update the network policy to allow inventory-service → MongoDB.
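To confirm the policy change is what is blocking traffic, a quick TCP probe from inside the pod settles it. A standard-library Python sketch (the MongoDB service name and port are placeholders for your cluster):

```python
import socket

def can_reach(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Run from inside the inventory-service pod (e.g. via kubectl exec).
# Host and port are placeholders for the MongoDB service in your cluster.
if can_reach("mongodb.data.svc.cluster.local", 27017):
    print("MongoDB reachable: the network path is open")
else:
    print("MongoDB unreachable: check the NetworkPolicy rules for this pod")
```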

Why Topology Beats Log Diving

Aspect                 Log Analysis             Topology RCA
Time to root cause     30-120 minutes           2-10 minutes
Expertise required     Deep system knowledge    Visual pattern recognition
Cross-service issues   Very difficult           Immediately visible
Context switching      Multiple tools/screens   Single unified view
Knowledge transfer     Hard to explain          Visual, shareable

RCA Best Practices with Topology

1. Start from the Alert

@alex show topology centered on [alerting-service] with health overlay
Don’t start from logs. Start from the visual representation of your system.

2. Trace Upstream First

Most issues are caused by dependencies, not the alerting service itself.
@alex trace upstream dependencies from [failing-service]

3. Overlay Metrics

Add context to your topology with real-time metrics.
@alex overlay [latency/errors/cpu/memory] metrics on topology

4. Check Recent Changes

Correlate with deployment timeline.
@alex show topology changes in last 24 hours
@alex highlight recently deployed services

5. Document for Postmortem

Export topology snapshots for incident documentation.
@alex export topology snapshot with annotations for incident INC-1234

Setting Up for Success

Pre-Incident Preparation

  1. Build your topology before incidents happen
  2. Connect all data sources (cloud, Kubernetes, databases)
  3. Set up health checks so topology shows real-time status (a minimal example follows this list)
  4. Train the team on topology navigation
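For point 3, every service needs an endpoint the topology’s health checks can poll. A minimal sketch using only the Python standard library (the /healthz path and port 8080 are common conventions to adapt, not requirements of any particular tool):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Extend with real checks (database ping, queue depth) as needed.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```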

During Incident

  1. Open Topology as first response action
  2. Share topology view in war room
  3. Annotate findings in real-time
  4. Track resolution progress visually

Post-Incident

  1. Export topology snapshot for postmortem
  2. Document the failure path
  3. Identify missing monitoring
  4. Update runbooks with topology-based procedures

Key Takeaways

  • Topology reduces MTTR from hours to minutes
  • Visual RCA requires less expertise than log analysis
  • Dependency mapping reveals issues traditional monitoring misses
  • Pre-built topology is essential for fast incident response
  • Share topology views to align teams during incidents
“We went from 2-hour war rooms to 15-minute fixes. Topology changed everything about how we do incident response.” — Platform Engineering Lead