Root Cause Analysis with Topology

Topology Explorer transforms how teams perform Root Cause Analysis (RCA). Instead of hunting through logs and metrics, visually trace issues through your infrastructure to find root causes in minutes, not hours.

The Problem with Traditional RCA

Traditional approach:
  1. Alert fires → Check dashboard → Looks fine
  2. SSH into servers → Grep logs → Nothing obvious
  3. Check database → Metrics look normal
  4. Call more engineers → War room starts
  5. 2 hours later → Find it was a downstream dependency
With Topology:
  1. Alert fires → Open Topology
  2. See the failing service highlighted in red
  3. Trace upstream → Find the actual root cause
  4. Time to resolution: 5 minutes

Real-World RCA Scenarios

Scenario 1: E-Commerce Checkout Failures

The alert: “Checkout success rate dropped to 60%”
Traditional debugging:
  • Check checkout service logs: Some timeout errors
  • Check payment service: Healthy
  • Check database: CPU looks fine
  • Check Redis: No errors
  • 30 minutes in, still searching…
With Topology:
@alex show topology centered on checkout-service with health status
[Topology view showing service dependencies]
Instant visibility:
  • Checkout → Payment Gateway → External API (degraded)
  • Third-party payment provider having issues
  • Root cause found in 2 minutes
Action: Enable backup payment provider, notify customers.
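If the checkout service already has a secondary provider configured, the switch itself can be scripted. A minimal Python sketch of the failover logic, assuming hypothetical primary and backup gateway clients that expose a charge() method (adapt to your actual payment SDK):

```python
import logging

logger = logging.getLogger("checkout")

class PaymentError(Exception):
    """Raised when a payment provider cannot complete a charge."""

def charge_with_failover(order, primary, backup, timeout_s=3.0):
    # `primary` and `backup` are hypothetical gateway clients exposing
    # charge(order, timeout_s); swap in your real payment SDK.
    try:
        return primary.charge(order, timeout_s=timeout_s)
    except (PaymentError, TimeoutError) as exc:
        logger.warning("Primary provider failed (%s); switching to backup", exc)
        return backup.charge(order, timeout_s=timeout_s)
```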

Scenario 2: Mysterious Latency Spike

The alert: “API P99 latency exceeded 5 seconds”
@alex overlay latency metrics on production topology
@tony identify slowest component in the request path
Topology reveals the hot path:
Client → CDN (2ms) → ALB (5ms) → API (100ms) → Cache Miss → RDS (4500ms)
                                             ↘ Cache Hit → Redis (3ms)
Root cause: Cache invalidation bug causing 100% cache misses.
Without Topology: Would have blamed the database and spent hours optimizing queries.
With Topology: Immediately saw the cache bypass pattern; fixed in 15 minutes.
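A 100% miss rate like this usually means the read path and the write path no longer agree on the cache key. A minimal illustrative sketch in Python using the standard redis-py client (the key names and db.fetch_product are hypothetical stand-ins for the actual bug, not this incident’s real code):

```python
import json
import redis  # standard redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_product(product):
    # Populated by the catalog updater after every write.
    r.setex(f"product:{product['id']}", 300, json.dumps(product))

def get_product(product_id, db):
    # Read path in the API. Illustrative bug: after a refactor the lookup
    # key ("products:", plural) no longer matches the key written above,
    # so every request misses and falls through to the slow database query.
    cached = r.get(f"products:{product_id}")
    if cached is not None:
        return json.loads(cached)        # the 3 ms Redis path
    return db.fetch_product(product_id)  # the 4500 ms RDS path
```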

Scenario 3: Cascading Service Failures

The alert: “Multiple services returning 500 errors”
When everything is failing, where do you start?
@anna trace failure propagation timeline across topology
@alex identify the origin point of cascading failures
Topology timeline view:
Time      Service          Status   Cause
10:00:00  Auth Service     Failed   Certificate expired
10:00:05  User Service     Failed   Can’t validate tokens
10:00:08  Order Service    Failed   Auth dependency
10:00:10  Payment Service  Failed   Auth dependency
10:00:15  All Services     Failed   Cascade complete
Root cause: Expired SSL certificate on the Auth Service.
Without Topology: Teams would investigate each service independently, taking hours to correlate.
With Topology: A single view shows the failure origin and propagation path.
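Certificate expiry is also cheap to catch before it cascades. A small Python sketch using only the standard library (the host name and 14-day threshold are placeholders to adapt):

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(host, port=443):
    """Return the number of days left on the TLS certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'notAfter' looks like 'Jun  1 12:00:00 2026 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (expires.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    days = days_until_cert_expiry("auth.example.internal")  # placeholder host
    if days < 14:
        print(f"WARNING: auth certificate expires in {days} days")
```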

Scenario 4: Database Connection Exhaustion

The alert: “PostgreSQL: too many connections”
@tony map all services connecting to production-db in topology
@alex show connection pool sizes per service
Topology visualization:
                 ┌─────────────────────────┐
                 │    PostgreSQL (RDS)     │
                 │   Max: 500 | Used: 498  │
                 └─────────────────────────┘
                    ↑      ↑      ↑      ↑
              ┌─────┴──┐ ┌─┴───┐ ┌┴────┐ ┌┴─────┐
              │API     │ │Jobs │ │Cron │ │Admin │
              │Pool:200│ │300  │ │ 50  │ │ 20   │
              └────────┘ └─────┘ └─────┘ └──────┘

                    PROBLEM: Pool too large
Root cause: The configured pools total 570 connections against a 500-connection limit; the Jobs service alone is configured with 300 connections but only needs 50.
Fix: Right-size connection pools based on actual usage patterns.
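The underlying arithmetic is simple enough to automate as a guardrail: the sum of all per-service pool sizes should stay below the database’s max_connections, with some headroom. A minimal sketch (pool numbers taken from the diagram above; in practice you would read them from service configuration):

```python
# Connection-pool budget check: flag when configured pools can exhaust the DB.
MAX_CONNECTIONS = 500   # PostgreSQL max_connections
HEADROOM = 0.8          # keep ~20% free for admin and replication sessions

pools = {               # values from the topology diagram above
    "api": 200,
    "jobs": 300,        # actual usage suggests ~50 is enough
    "cron": 50,
    "admin": 20,
}

total = sum(pools.values())
budget = int(MAX_CONNECTIONS * HEADROOM)
print(f"Configured pools: {total} / budget: {budget}")
if total > budget:
    worst = max(pools, key=pools.get)
    print(f"Over budget by {total - budget}; largest pool is {worst} ({pools[worst]})")
```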

Scenario 5: Kubernetes Pod Crashloop

The alert: “Pod inventory-service in CrashLoopBackOff”
@kai show topology for inventory-service with all dependencies
@kai correlate crash events with dependency health
Topology + Events:
  • inventory-service → MongoDB (connection timeout)
  • MongoDB running but network policy blocking pod
  • Recent change: Security team updated network policies
Root cause: A network policy change broke pod-to-database connectivity.
Resolution: Update the network policy to allow inventory-service → MongoDB.
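To confirm the policy change is what is blocking traffic, a quick TCP probe from inside the pod settles it. A standard-library Python sketch (the MongoDB service name and port are placeholders for your cluster):

```python
import socket

def can_reach(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Run from inside the inventory-service pod (e.g. via kubectl exec).
# Host and port are placeholders for the MongoDB service in your cluster.
if can_reach("mongodb.data.svc.cluster.local", 27017):
    print("MongoDB reachable: the network path is open")
else:
    print("MongoDB unreachable: check the NetworkPolicy rules for this pod")
```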

Why Topology Beats Log Diving

Aspect                 Log Analysis             Topology RCA
Time to root cause     30-120 minutes           2-10 minutes
Expertise required     Deep system knowledge    Visual pattern recognition
Cross-service issues   Very difficult           Immediately visible
Context switching      Multiple tools/screens   Single unified view
Knowledge transfer     Hard to explain          Visual, shareable

RCA Best Practices with Topology

1. Start from the Alert

@alex show topology centered on [alerting-service] with health overlay
Don’t start from logs. Start from the visual representation of your system.

2. Trace Upstream First

Most issues are caused by dependencies, not the alerting service itself.
@alex trace upstream dependencies from [failing-service]

3. Overlay Metrics

Add context to your topology with real-time metrics.
@alex overlay [latency/errors/cpu/memory] metrics on topology

4. Check Recent Changes

Correlate with deployment timeline.
@alex show topology changes in last 24 hours
@alex highlight recently deployed services

5. Document for Postmortem

Export topology snapshots for incident documentation.
@alex export topology snapshot with annotations for incident INC-1234

Setting Up for Success

Pre-Incident Preparation

  1. Build your topology before incidents happen
  2. Connect all data sources (cloud, Kubernetes, databases)
  3. Set up health checks so topology shows real-time status (a minimal example follows this list)
  4. Train the team on topology navigation
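For point 3, every service needs an endpoint the topology’s health checks can poll. A minimal sketch using only the Python standard library (the /healthz path and port 8080 are common conventions to adapt, not requirements of any particular tool):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Extend with real checks (database ping, queue depth) as needed.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```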

During Incident

  1. Open Topology as first response action
  2. Share topology view in war room
  3. Annotate findings in real-time
  4. Track resolution progress visually

Post-Incident

  1. Export topology snapshot for postmortem
  2. Document the failure path
  3. Identify missing monitoring
  4. Update runbooks with topology-based procedures

Key Takeaways

  • Topology reduces MTTR from hours to minutes
  • Visual RCA requires less expertise than log analysis
  • Dependency mapping reveals issues traditional monitoring misses
  • Pre-built topology is essential for fast incident response
  • Share topology views to align teams during incidents
“We went from 2-hour war rooms to 15-minute fixes. Topology changed everything about how we do incident response.” — Platform Engineering Lead