Infrastructure Analytics gives you a unified view of health, performance, and cost signals across all connected clouds — correlating data from CloudWatch, Kubernetes metrics, database performance, and cost telemetry into a single operational picture.

The Problem With Siloed Infrastructure Data

Modern cloud infrastructure generates data from dozens of sources: CloudWatch metrics for AWS, Azure Monitor for Azure, Prometheus for Kubernetes, RDS Performance Insights for databases, Datadog or Grafana for APM. Each tool has its own query language, its own dashboard format, and its own access model. The result: answering a question like “Is the performance degradation I’m seeing cost-related or resource-contention-related?” requires pulling data from 3–4 separate tools, correlating timestamps manually, and hoping you’re looking at the same time window.

Infrastructure Analytics connects these signals into a coherent picture. Ask agents questions in plain language: they query the right sources, correlate the data, and surface actionable insights.

Key Dashboards

Resource Utilization

Track compute, memory, storage, and network utilization across your entire infrastructure:
# Multi-cloud utilization overview
@alex #dashboard resource utilization across all accounts last 7 days

# Per-service breakdown
@alex #dashboard EC2 CPU and memory utilization by service tag

# Kubernetes workload efficiency
@kai #dashboard pod resource requests vs actual usage by namespace
What you can see:
  • CPU/memory utilization trends per service, region, and account
  • Overprovisioned vs underprovisioned resources at a glance
  • Peak vs average utilization (P95, P99) for capacity planning
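
To make the peak-versus-average distinction concrete, here is a minimal Python sketch of how P95 and P99 could be derived from raw utilization samples. It is an illustration only, not how CloudThinker computes its dashboards; the function name `utilization_summary` and the sample data are hypothetical.

```python
import statistics

def utilization_summary(samples):
    """Summarize utilization samples (percent) for capacity planning."""
    # quantiles(n=100) returns 99 cut points: index 94 is P95, index 98 is P99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "average": statistics.fmean(samples),
        "p95": cuts[94],
        "p99": cuts[98],
        "peak": max(samples),
    }

# Hypothetical data: a service idling at 20% CPU with bursts to 90%.
samples = [20.0] * 90 + [90.0] * 10
summary = utilization_summary(samples)
print(summary)
```

With bursty workloads like this one, the average sits far below the high percentiles, which is why sizing on the average alone tends to underprovision.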

Performance Health

Monitor application performance signals correlated with infrastructure state:
# API latency correlation with infrastructure
@alex #dashboard API response times vs infrastructure load

# Database performance trends
@tony #dashboard query latency P50/P95/P99 over last 30 days

# Kubernetes cluster performance
@kai #dashboard cluster CPU pressure and OOMKill events by namespace

Cost Correlation

Connect infrastructure changes to cost impact:
# Cost vs utilization efficiency
@alex #dashboard cost per unit of utilization across services

# Anomaly detection
@alex identify infrastructure changes correlated with cost spikes

# Waste attribution
@alex #dashboard unused and underutilized resources by team tag

Trend Analysis

Understand how your infrastructure evolves over time:
# Growth trends
@alex analyze infrastructure growth patterns over last 6 months

# Capacity forecasting
@anna forecast infrastructure needs for 2x traffic growth

# Efficiency trends
@alex show improvement in resource utilization since last quarter

Anomaly Detection

CloudThinker agents continuously monitor for anomalies and surface them automatically:
| Signal | What’s Detected | Alert Threshold |
| Cost spike | Spend increase >20% day-over-day | Configurable |
| CPU pressure | Sustained >85% CPU across cluster | Configurable |
| Memory growth | Steady memory growth without release (leak pattern) | >10% per hour |
| Latency degradation | P95 latency increase >2x baseline | Configurable |
| OOMKills | Pod terminated due to memory limit | Any occurrence |
| Replication lag | Database replica falling behind | >30 seconds |
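
As a rough illustration of what three of these signals mean numerically, the following Python sketch expresses them as simple checks. The function names and default thresholds mirror the table but are otherwise hypothetical; the agents' actual detection logic is not documented here and may differ.

```python
def cost_spike(yesterday, today, threshold=0.20):
    """True when spend rose by more than `threshold` day-over-day."""
    if yesterday <= 0:
        return False  # no meaningful baseline to compare against
    return (today - yesterday) / yesterday > threshold

def latency_degraded(p95_now, p95_baseline, factor=2.0):
    """True when current P95 latency exceeds `factor` times the baseline."""
    return p95_now > factor * p95_baseline

def memory_leak_pattern(hourly_mb, growth=0.10):
    """True when every hour-over-hour change grows by more than `growth`,
    i.e. memory keeps climbing and is never released."""
    if len(hourly_mb) < 2:
        return False
    pairs = zip(hourly_mb, hourly_mb[1:])
    return all(prev > 0 and (cur - prev) / prev > growth for prev, cur in pairs)

# Hypothetical readings
print(cost_spike(4000, 5000))
print(latency_degraded(450, 200))
print(memory_leak_pattern([1000, 1150, 1300, 1500]))
```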
Configure thresholds to match your environment:
# Set a cost anomaly alert
@alex #alert when daily spend exceeds $5,000 or increases >25% day-over-day

# Set a performance alert
@tony #alert when P95 query latency exceeds 500ms for 5 consecutive minutes

# Set a K8s health alert
@kai #alert on OOMKilled events or nodes with >90% memory pressure
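
A condition such as "exceeds 500ms for 5 consecutive minutes" differs from a single-sample threshold: one spike should not fire the alert. The sketch below shows one way such a sustained-breach rule could be evaluated over per-minute samples. It is an assumption for illustration; `sustained_breach` is a hypothetical helper, not a CloudThinker API.

```python
def sustained_breach(values, limit, window):
    """True if `window` consecutive values all exceed `limit`."""
    run = 0
    for value in values:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > limit else 0
        if run >= window:
            return True
    return False

# Hypothetical per-minute P95 query latency in milliseconds
steady_breach = [510, 520, 530, 540, 550]
interrupted = [510, 520, 490, 540, 550, 560]
print(sustained_breach(steady_breach, limit=500, window=5))
print(sustained_breach(interrupted, limit=500, window=5))
```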

Infrastructure Insights Dashboard

The built-in Infrastructure Insights dashboard (accessible at Infrastructure → Analytics) provides:

Health Score

Composite health score across compute, network, database, and Kubernetes — updated continuously

Cost Efficiency

Ratio of actual resource utilization to what you’re paying for — identifies waste at a glance

Reliability Indicators

Error rates, availability, recent incidents, and MTTR trends over time

Capacity Headroom

How much runway you have before resource constraints impact performance
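
Cost Efficiency and Capacity Headroom both reduce to simple arithmetic. The sketch below shows one plausible reading of each: efficiency as used capacity divided by provisioned capacity, and headroom as days until a limit is reached under linear growth. The formulas and function names are assumptions for illustration; the dashboard's actual calculations are not specified in this page.

```python
def efficiency_ratio(used_units, provisioned_units):
    """Fraction of paid-for capacity that is actually used."""
    if provisioned_units <= 0:
        raise ValueError("provisioned_units must be positive")
    return used_units / provisioned_units

def wasted_spend(cost, ratio):
    """Portion of `cost` attributable to unused capacity."""
    return cost * (1 - ratio)

def days_of_runway(current, capacity, daily_growth):
    """Days until `capacity` is reached, assuming linear growth.
    Returns None when usage is flat or shrinking."""
    if daily_growth <= 0:
        return None
    return max(capacity - current, 0) / daily_growth

# Hypothetical figures: 250 of 1000 vCPU-hours used on a $2,000 bill,
# and 700 GB of a 1000 GB volume growing by 10 GB per day.
ratio = efficiency_ratio(250, 1000)
print(ratio, wasted_spend(2000.0, ratio))
print(days_of_runway(700, 1000, 10))
```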

How to Interpret Analytics

  • High utilization + high cost → Appropriately sized; consider reserved capacity purchases
  • Low utilization + high cost → Right-sizing opportunity; talk to @alex
  • High latency + normal utilization → Application-layer issue or database bottleneck; talk to @tony
  • Utilization spikes + OOMKills → Resource limits misconfigured; talk to @kai
  • Cost spike without traffic change → Configuration drift or orphaned resource; talk to @alex or check CloudKeepers findings
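
The interpretation guide above can be read as a small rule table. The Python sketch below encodes it that way, purely as a reading aid: the signal names (for example `flat_traffic`) are hypothetical labels invented for this example, and the agents do not necessarily reason this way internally.

```python
# Each rule: (set of signals that must all be present, suggested reading)
RULES = [
    ({"high_utilization", "high_cost"},
     "Appropriately sized; consider reserved capacity purchases"),
    ({"low_utilization", "high_cost"},
     "Right-sizing opportunity; talk to @alex"),
    ({"high_latency", "normal_utilization"},
     "Application-layer issue or database bottleneck; talk to @tony"),
    ({"utilization_spikes", "oomkills"},
     "Resource limits misconfigured; talk to @kai"),
    ({"cost_spike", "flat_traffic"},
     "Configuration drift or orphaned resource; talk to @alex"),
]

def interpret(signals):
    """Return every reading whose required signals are all observed."""
    observed = set(signals)
    return [advice for required, advice in RULES if required <= observed]

print(interpret({"low_utilization", "high_cost"}))
```

Several rules can match at once, so the function returns a list rather than a single answer.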

What’s Next

CloudKeepers

Set up autonomous pilots to detect and alert on anomalies automatically

Cost Analytics

Deep-dive into cloud spend trends and cost attribution

Assessment

Run a Well-Architected assessment to baseline infrastructure health

Topology

Correlate analytics signals with your infrastructure dependency graph