
The role of observability in incident response

Production observability systems must meet three key requirements:
  • Unified visibility: Correlate logs, metrics, and traces from multiple services to understand failure propagation through microservices and trace issues to origin
  • Real-time correlation: Preserve relationships between applications, databases, queues, and infrastructure to quickly determine if issues are local or dependency-driven
  • Automated analysis with contextual intelligence: AWS provides CloudWatch, X-Ray, and CloudWatch Logs, but these generate massive data volumes requiring human interpretation. CloudThinker adds intelligent analysis that automatically correlates signals, identifies anomalies, and explains root causes in business context
Traditional siloed monitoring creates blind spots where symptom-to-root-cause relationships remain obscured, leading to extended investigation times and potential misdiagnosis.
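
To make the unified-visibility requirement concrete, the sketch below shows what correlating even two signals looks like by hand with boto3: pull an ALB error metric, then run one CloudWatch Logs Insights query across several service log groups over the same window. The load balancer dimension and log group names are placeholders, and the snippet only illustrates the underlying AWS APIs; it is not CloudThinker's implementation.

"""Minimal sketch of manual metric/log correlation with boto3.

Assumptions (not from the article): the ALB dimension value and the
log group names are placeholders; the one-hour window is arbitrary.
"""
import time
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
logs = boto3.client("logs")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# 1. Pull an error-rate metric to see when the symptom started.
metric = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/checkout-alb/abc123"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Sum"],
)

# 2. Search several service log groups in one Logs Insights query
#    to see which service logged errors in the same window.
query = logs.start_query(
    logGroupNames=["/ecs/checkout-api", "/ecs/payment-api"],  # placeholders
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=(
        "fields @timestamp, @logStream, @message"
        " | filter @message like /ERROR|timeout/"
        " | sort @timestamp desc | limit 50"
    ),
)

# Logs Insights queries run asynchronously; poll until finished.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
        break
    time.sleep(1)

print(len(metric["Datapoints"]), "metric datapoints,",
      len(results["results"]), "matching log lines")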

Challenges with traditional incident response workflows

  • Manual investigation delays: Engineers manually grep CloudWatch Logs across log groups, correlate the results with CloudWatch metrics, and trace requests through X-Ray, extending customer impact while the investigation drags on (the sketch after this list illustrates the X-Ray step alone).
  • Siloed expertise: Application teams, DBAs, and infrastructure teams investigate sequentially within their own domains, often discovering only after losing time that the root cause lies outside their area. This handoff-intensive process inflates MTTR.
  • Cross-service correlation: Applications span RDS Aurora, DocumentDB, Lambda, ECS/EKS, and external dependencies. Understanding how a database timeout causes checkout failures requires pulling data from RDS metrics, VPC Flow Logs, ALB access logs, and application logs, each with its own query language and access pattern, and reconstructing a failure timeline can consume hours.
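
The X-Ray step alone shows the overhead. The hedged sketch below pulls trace summaries for faulted requests and then fetches individual traces, each of which still has to be read and cross-referenced against logs and database metrics by hand. The time window and filter are assumptions for illustration.

"""Sketch of the manual X-Ray step in a traditional investigation.

The one-hour window is arbitrary and the filter expression only keeps
traces that ended in a fault (5xx); adjust to your own services.
"""
from datetime import datetime, timedelta, timezone

import boto3

xray = boto3.client("xray")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# Find traces that ended in a fault, the query an engineer typically
# runs after spotting an error-rate spike in the metrics.
faulty_ids = []
paginator = xray.get_paginator("get_trace_summaries")
for page in paginator.paginate(StartTime=start, EndTime=end,
                               FilterExpression="fault = true"):
    faulty_ids.extend(summary["Id"] for summary in page["TraceSummaries"])

# Each trace still has to be fetched and read individually, then
# cross-referenced against logs and database metrics by hand.
if faulty_ids:
    traces = xray.batch_get_traces(TraceIds=faulty_ids[:5])  # max 5 IDs per call
    print("Inspecting", len(traces["Traces"]), "of", len(faulty_ids), "faulty traces")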

Solution: CloudThinker’s agentic approach to automated RCA

CloudThinker uses specialized AI agents, coordinated by a Multi-Agent System, that automatically investigate across the entire stack, correlate signals, and generate comprehensive RCA documentation within minutes. The specialized agents are:
  • Tony (Database Administrator): Analyzes RDS Aurora and DocumentDB performance, identifies slow queries, connection pool exhaustion, resource constraints
  • Alex (Cloud Engineer): Examines AWS infrastructure including EC2, load balancers, VPC networking
  • Kai (Kubernetes Specialist): Investigates pod health, container restarts, resource limits, service mesh configurations on Amazon EKS
  • Oliver (Security Engineer): Analyzes security groups, network policies, IAM permissions, security-related failure modes
The agentic approach performs parallel investigation across traditionally siloed domains. Agents simultaneously query AWS APIs and observability data sources, correlating findings in real-time to quickly identify fault domains, validate/eliminate hypotheses, and pinpoint upstream triggers—even when symptoms appear far from the underlying issue.
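
As a rough illustration of the difference, the sketch below gathers evidence from three domains concurrently with plain boto3. It is not CloudThinker's agent code, and the Auto Scaling group name is a placeholder; it only shows the idea of parallel, multi-domain querying instead of sequential handoffs.

"""Conceptual sketch of parallel, multi-domain evidence gathering.

Not CloudThinker's implementation; the Auto Scaling group name is a
placeholder and each check is deliberately minimal.
"""
from concurrent.futures import ThreadPoolExecutor

import boto3


def check_autoscaling():
    # Infrastructure domain: recent scaling activity and its causes.
    asg = boto3.client("autoscaling")
    acts = asg.describe_scaling_activities(AutoScalingGroupName="web-asg", MaxRecords=20)
    return [(a["StartTime"], a["Cause"]) for a in acts["Activities"]]


def check_database():
    # Database domain: recent RDS events (failovers, restarts, ...).
    rds = boto3.client("rds")
    events = rds.describe_events(Duration=360)  # last 6 hours, in minutes
    return [(e["Date"], e["Message"]) for e in events["Events"]]


def check_network():
    # Networking domain: remaining IP headroom per subnet.
    ec2 = boto3.client("ec2")
    return [(s["SubnetId"], s["AvailableIpAddressCount"])
            for s in ec2.describe_subnets()["Subnets"]]


# Query all three domains simultaneously, mirroring how Alex, Tony,
# and Kai investigate in parallel instead of handing off sequentially.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {
        "infrastructure": pool.submit(check_autoscaling),
        "database": pool.submit(check_database),
        "network": pool.submit(check_network),
    }
    findings = {domain: f.result() for domain, f in futures.items()}

for domain, items in findings.items():
    print(domain, "->", len(items), "data points")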

Real-world investigation: EC2 terminations and EKS network issues

Consider a scenario where CloudThinker’s continuous monitoring detects two concerning patterns:
  • Critical Finding 1: Frequent EC2 instance terminations
  • Critical Finding 2: EKS network issues
This investigation demonstrates how to use CloudThinker’s agents to analyze these interconnected infrastructure issues systematically.

Step 1: Analyze EC2 termination patterns with Alex

Analyze AutoScaling termination patterns:
@alex investigate EC2 termination patterns for the past 60 days

Focus on:
- AutoScaling group policies and scaling triggers
- Time-based patterns (analyze the 17:15 UTC daily terminations)
- Correlation between scale-down events and application load metrics
- CloudWatch alarms triggering the scale-down decisions
- Cost impact of frequent instance churn vs stability risks

Break down by:
- AutoScaling group name and configuration
- Termination reason (scale-down, health check failure, manual)
- Instance types and sizes affected
- Availability zone distribution
- Time to replacement for terminated instances
EC2 termination pattern analysis showing AutoScaling events and timeline
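
Independently of the prompt above, the hour-of-day termination pattern can be sanity-checked straight from CloudTrail, which retains 90 days of management events. The sketch below is a hedged, standalone check, not Alex's implementation; a spike in a single hourly bucket (for example 17:00-18:00 UTC) points at a scheduled or load-correlated scale-down rather than random instance failures.

"""Sketch: surface daily EC2 termination patterns from CloudTrail.

Illustrative only; CloudTrail lookup_events covers roughly the last
90 days of management events, so a 60-day window fits.
"""
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")

end = datetime.now(timezone.utc)
start = end - timedelta(days=60)

# Collect TerminateInstances calls and bucket them by hour of day (UTC).
hour_histogram = Counter()
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[{"AttributeName": "EventName",
                       "AttributeValue": "TerminateInstances"}],
    StartTime=start,
    EndTime=end,
):
    for event in page["Events"]:
        hour_histogram[event["EventTime"].hour] += 1

for hour, count in sorted(hour_histogram.items()):
    print(f"{hour:02d}:00 UTC  {count} terminations")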

Step 2: Investigate EKS network interface failures with Kai and Alex

Cross-correlate network failures with instance terminations:
@alex correlate CreateNetworkInterface failures with EC2 termination events

Analysis goals:
- Identify if aggressive scale-downs contribute to IP address fragmentation
- Determine if rapid instance churn prevents IP address reclamation
- Calculate optimal scaling strategy to maintain IP address availability
- Assess whether termination lifecycle hooks allow proper ENI cleanup

Timeline correlation window: 60 days
Network failure correlation with CreateNetworkInterface errors and IP exhaustion
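
For comparison, the raw AWS-side evidence behind this correlation can be gathered with two calls: CloudTrail events for failed CreateNetworkInterface attempts, and current per-subnet IP headroom. The sketch below is illustrative; in practice you would restrict it to the subnets used by the affected EKS node groups.

"""Sketch: cross-check CreateNetworkInterface failures against subnet
IP headroom. describe_subnets lists every subnet here; filter to the
subnets your EKS node groups actually use.
"""
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2")
cloudtrail = boto3.client("cloudtrail")

end = datetime.now(timezone.utc)
start = end - timedelta(days=60)

# 1. Count failed CreateNetworkInterface calls by error code.
failures = Counter()
paginator = cloudtrail.get_paginator("lookup_events")
for page in paginator.paginate(
    LookupAttributes=[{"AttributeName": "EventName",
                       "AttributeValue": "CreateNetworkInterface"}],
    StartTime=start,
    EndTime=end,
):
    for event in page["Events"]:
        detail = json.loads(event["CloudTrailEvent"])
        if "errorCode" in detail:  # only failed attempts carry an error code
            failures[detail["errorCode"]] += 1

# 2. Current IP headroom per subnet; counts near zero confirm exhaustion.
print("CreateNetworkInterface failures by error code:", dict(failures))
for subnet in ec2.describe_subnets()["Subnets"]:
    print(subnet["SubnetId"], subnet["AvailabilityZone"],
          "free IPs:", subnet["AvailableIpAddressCount"])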

Step 3: Generate comprehensive RCA report with Anna

After completing the multi-agent investigation, use Anna to synthesize all findings into an executive RCA report. Generate the complete RCA documentation:
@anna #report comprehensive RCA for infrastructure stability issues

Incident ID: INFRA-2025-001
Title: "Frequent EC2 Terminations and EKS Network Interface Failures"

Include:
- Executive summary with business impact
- Detailed timeline of 60-day pattern analysis
- Root cause analysis combining findings from Alex (EC2/cost), Kai (EKS/networking), and Oliver (security)
- Visual topology snapshots showing subnet IP exhaustion
- Cost-benefit analysis of remediation options
- Immediate remediation steps (0-7 days)
- Strategic improvements (30-90 days)
- Preventive measures to avoid recurrence

Context:
- 30+ EC2 instances terminated in 60 days (28 by AutoScaling, 2 manual)
- 47 CreateNetworkInterface failures due to IP exhaustion
- Service degradation incidents during morning scaling events
- Vicious cycle of aggressive scale-down and scaling failures

Target audience: Engineering leadership and Infrastructure team
Required output: Action items with owners and due dates
Comprehensive RCA report with findings, remediation steps, and preventive measures

This comprehensive RCA report demonstrates how CloudThinker synthesizes findings from multiple agents (Alex for infrastructure and cost, Kai for EKS networking, Oliver for security) into actionable documentation that includes technical details, business impact quantification, remediation steps, and preventive measures—all generated automatically in minutes rather than requiring hours of manual documentation effort.

Comparing CloudThinker with traditional incident response approaches

Each dimension below contrasts traditional incident response with CloudThinker’s automated RCA:
  • Time to Root Cause. Traditional: 30-120 minutes with sequential investigation across teams. CloudThinker: 2-10 minutes with parallel agent investigation.
  • Expertise Required. Traditional: deep knowledge of grep, the AWS CLI, CloudWatch Logs Insights, and log correlation. CloudThinker: visual pattern recognition on the topology plus natural-language prompts.
  • Cross-Service Visibility. Traditional: very difficult; requires querying multiple systems sequentially and manually correlating findings. CloudThinker: immediately visible through the Topology Explorer with automated correlation.
  • Documentation Time. Traditional: hours of manual work after incident resolution, often delayed or incomplete. CloudThinker: instant generation via the #report command with comprehensive analysis.
  • Investigation Approach. Traditional: reactive, sequential troubleshooting (check app logs, then DB metrics, then network, and so on). CloudThinker: proactive, parallel analysis in which all agents investigate simultaneously across the stack.
  • Dependency Mapping. Traditional: mental models or outdated architecture diagrams, often incomplete. CloudThinker: real-time topology built from actual AWS service dependencies and traffic patterns.
  • Alert Fatigue. Traditional: high; multiple teams receive alerts and many false positives require manual triage. CloudThinker: reduced; agents provide context and severity assessment automatically.
  • Post-Incident Learning. Traditional: varies widely, depending on the engineer’s documentation thoroughness and time availability. CloudThinker: consistent, high-quality RCA reports with actionable recommendations.

Conclusion

CloudThinker automates cloud failure investigations by replacing manual, sequential workflows with specialized AI agents that perform parallel analysis across infrastructure, security, and database domains. By utilizing a Multi-Agent System and a Topology Explorer to trace issues through complex microservices, the platform identifies root causes in minutes rather than hours. This approach eliminates traditional handoff delays and instantly generates comprehensive RCA documentation, significantly improving incident response efficiency and consistency.