The Problem With Siloed Infrastructure Data
Modern cloud infrastructure generates data from dozens of sources: CloudWatch metrics for AWS, Azure Monitor for Azure, Prometheus for Kubernetes, RDS Performance Insights for databases, Datadog or Grafana for APM. Each tool has its own query language, its own dashboard format, and its own access model. The result: answering a question like “Is the performance degradation I’m seeing cost-related or resource-contention-related?” requires pulling data from 3–4 separate tools, correlating timestamps manually, and hoping you’re looking at the same time window.
Infrastructure Analytics connects these signals into a coherent picture. Ask agents questions in plain language: they query the right sources, correlate the data, and surface actionable insights.
Key Dashboards
Resource Utilization
Track compute, memory, storage, and network utilization across your entire infrastructure:
- CPU/memory utilization trends per service, region, and account
- Overprovisioned vs underprovisioned resources at a glance
- Peak vs average utilization (P95, P99) for capacity planning
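To make the peak-versus-average comparison concrete, here is a minimal sketch of how P95 and P99 could be derived from raw utilization samples. It is an illustration, not CloudThinker's implementation: the function names, the nearest-rank percentile method, and the 20% / 85% classification thresholds are all assumptions.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is 0..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_utilization(samples):
    """Average and peak (P95, P99) utilization for one resource."""
    return {
        "avg": statistics.fmean(samples),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }


def classify(summary, low=20.0, high=85.0):
    """Label a resource from its P95 utilization.

    The low/high thresholds are hypothetical defaults chosen for this sketch.
    """
    if summary["p95"] < low:
        return "overprovisioned"
    if summary["p95"] > high:
        return "underprovisioned"
    return "within range"
```

Sizing against P95 rather than the average keeps short bursts from being hidden: a service that idles at 10% but regularly bursts to 90% has a low average and a high P95.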
Performance Health
Monitor application performance signals correlated with infrastructure state.
Cost Correlation
Connect infrastructure changes to cost impact.
Trend Analysis
Understand how your infrastructure evolves over time.
Anomaly Detection
CloudThinker agents continuously monitor for anomalies and surface them automatically:
| Signal | What’s Detected | Alert Threshold |
|---|---|---|
| Cost spike | Spend increase >20% day-over-day | Configurable |
| CPU pressure | Sustained >85% CPU across cluster | Configurable |
| Memory growth | Steady memory growth without release (leak pattern) | >10% per hour |
| Latency degradation | P95 latency increase >2x baseline | Configurable |
| OOMKills | Pod terminated due to memory limit | Any occurrence |
| Replication lag | Database replica falling behind | >30 seconds |
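The checks in the table above can be read as simple predicates over recent measurements. The sketch below encodes the thresholds from the table as default arguments. The function names, input shapes, and the reading of "sustained" as "every sample in the window" are assumptions made for illustration, not the agents' actual detection logic.

```python
def cost_spike(yesterday, today, threshold=0.20):
    """True when spend rose by more than 20% day-over-day."""
    if yesterday <= 0:
        return False  # no meaningful baseline to compare against
    return (today - yesterday) / yesterday > threshold


def cpu_pressure(cluster_cpu_samples, limit=85.0):
    """True when every sample in the window exceeds the limit (sustained)."""
    return bool(cluster_cpu_samples) and all(s > limit for s in cluster_cpu_samples)


def memory_leak_pattern(hourly_usage, growth=0.10):
    """True when usage grows by more than 10% every hour with no release."""
    pairs = list(zip(hourly_usage, hourly_usage[1:]))
    return bool(pairs) and all(
        prev > 0 and (cur - prev) / prev > growth for prev, cur in pairs
    )


def latency_degraded(p95_ms, baseline_p95_ms, factor=2.0):
    """True when P95 latency exceeds twice the baseline."""
    return p95_ms > factor * baseline_p95_ms


def replication_lagging(lag_seconds, limit=30.0):
    """True when a replica is more than 30 seconds behind."""
    return lag_seconds > limit
```

Keeping each threshold as a parameter mirrors the "Configurable" entries in the table: the comparison stays fixed while the limit can be tuned per environment.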
Infrastructure Insights Dashboard
The built-in Infrastructure Insights dashboard (accessible at Infrastructure → Analytics) provides:
Health Score
Composite health score across compute, network, database, and Kubernetes — updated continuously
Cost Efficiency
Ratio of actual resource utilization to what you’re paying for — identifies waste at a glance
Reliability Indicators
Error rates, availability, recent incidents, and MTTR trends over time
Capacity Headroom
How much runway you have before resource constraints impact performance
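As a rough illustration of the Cost Efficiency and Capacity Headroom panels, the sketch below computes a utilization-to-provisioned ratio and a runway estimate. The dashboard's real formulas are not documented here, so both functions are assumptions: the names are hypothetical, and the runway calculation assumes usage grows linearly.

```python
def cost_efficiency(used_units, provisioned_units):
    """Share of paid-for capacity that is actually used (0.0 to 1.0)."""
    if provisioned_units <= 0:
        raise ValueError("provisioned_units must be positive")
    return used_units / provisioned_units


def runway_days(current_usage, capacity, growth_per_day):
    """Days until usage reaches capacity, assuming linear growth.

    Returns None when usage is flat or shrinking, since no exhaustion
    date can be projected in that case.
    """
    if growth_per_day <= 0:
        return None
    return max(0.0, (capacity - current_usage) / growth_per_day)
```

For example, a volume at 600 GB of 1,000 GB that grows by 20 GB per day has 400 / 20 = 20 days of runway under this linear assumption.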
How to Interpret Analytics
High utilization + high cost → Appropriately sized; consider reserved capacity purchases
Low utilization + high cost → Right-sizing opportunity: talk to @alex
High latency + normal utilization → Application-layer issue or database bottleneck: talk to @tony
Utilization spikes + OOMKills → Resource limits misconfigured: talk to @kai
Cost spike without traffic change → Configuration drift or orphaned resource: talk to @alex or check CloudKeepers findings
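The interpretation rules above amount to a small decision table. The sketch below expresses them as code so the combinations are explicit. The `Signals` fields and the `interpret` function are hypothetical names, and treating "low" or "normal" utilization as simply "not high" is a simplification made for this example.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Boolean observations for one service (hypothetical shape)."""
    high_utilization: bool = False
    high_cost: bool = False
    high_latency: bool = False
    utilization_spikes: bool = False
    oom_kills: bool = False
    cost_spike: bool = False
    traffic_changed: bool = False


def interpret(s):
    """Return every interpretation whose conditions match the signals."""
    findings = []
    if s.high_utilization and s.high_cost:
        findings.append("Appropriately sized; consider reserved capacity purchases")
    if not s.high_utilization and s.high_cost:
        findings.append("Right-sizing opportunity: talk to @alex")
    if s.high_latency and not s.high_utilization:
        findings.append("Application-layer issue or database bottleneck: talk to @tony")
    if s.utilization_spikes and s.oom_kills:
        findings.append("Resource limits misconfigured: talk to @kai")
    if s.cost_spike and not s.traffic_changed:
        findings.append(
            "Configuration drift or orphaned resource: "
            "talk to @alex or check CloudKeepers findings"
        )
    return findings
```

The function returns a list rather than a single answer because several patterns can hold at once, for example a cost spike on a service that is also hitting OOMKills.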
What’s Next
CloudKeepers
Set up autonomous pilots to detect and alert on anomalies automatically
Cost Analytics
Deep-dive into cloud spend trends and cost attribution
Assessment
Run a Well-Architected assessment to baseline infrastructure health
Topology
Correlate analytics signals with your infrastructure dependency graph