Infrastructure Analytics

Query utilization, performance, and reliability data across all connected cloud environments using agent-led dashboards, charts, and alerts. Cost anomaly analysis and spending trends live in Cost Analytics.

Prompt syntax

The general form for an analytics query:

@agent #tool your query [time range]

Component	Description	Values
`@agent`	Who executes the query	`@alex` (cloud and compute), `@tony` (databases), `@kai` (Kubernetes), `@anna` (forecasts and coordination)
`#tool`	Output format	`#dashboard` (visual), `#chart` (inline), `#report` (exportable), `#alert` (threshold rule)
`time range`	Optional; defaults to 7 days	`last 7 days`, `last 30 days`, `last quarter`, `since last quarter`

See CloudThinker Language for the complete syntax reference.
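To make the components above concrete, here is a minimal Python sketch that splits a prompt into agent, tool, query, and time range. It is an illustration only, not the product's parser: the `Prompt` class, the `parse_prompt` function, and the regular expressions are hypothetical, and the sketch recognizes only the time-range phrases listed in the table.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Agents and tools listed in the syntax table.
AGENTS = {"alex", "tony", "kai", "anna"}
TOOLS = {"dashboard", "chart", "report", "alert"}

# Assumption: a time range is one of the table's phrases at the end of the prompt.
_TIME_RANGE = re.compile(
    r"\s*\b((?:since\s+)?last\s+(?:\d+\s+days|quarter))\s*$", re.IGNORECASE
)


@dataclass
class Prompt:
    agent: str
    tool: Optional[str]        # None when the prompt has no #tool
    query: str
    time_range: Optional[str]  # None means the documented 7-day default applies


def parse_prompt(text: str) -> Prompt:
    """Split '@agent #tool your query [time range]' into its components."""
    match = re.match(r"^@(\w+)(?:\s+#(\w+))?\s+(.+)$", text.strip())
    if match is None:
        raise ValueError(f"not an agent prompt: {text!r}")
    agent, tool, rest = match.group(1), match.group(2), match.group(3)
    if agent not in AGENTS:
        raise ValueError(f"unknown agent: @{agent}")
    if tool is not None and tool not in TOOLS:
        raise ValueError(f"unknown tool: #{tool}")
    time_range = None
    found = _TIME_RANGE.search(rest)
    if found:
        time_range = found.group(1).lower()
        rest = rest[: found.start()].strip()
    return Prompt(agent, tool, rest, time_range)
```

For example, `parse_prompt("@alex #dashboard resource utilization across all accounts last 7 days")` yields agent `alex`, tool `dashboard`, and time range `last 7 days`, with the remaining words as the query.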

Dashboards

Dashboard	What it shows	Agent
Resource utilization	CPU, memory, storage, and network usage per service, region, and account	Alex, Kai
Kubernetes workload efficiency	Pod resource requests vs. actual usage by namespace	Kai
Application performance	API response times and error rates correlated with infrastructure load	Alex
Database performance	Query latency at P50/P95/P99, slow queries, connection counts	Tony
Cluster health	CPU pressure, OOMKill events, and node status	Kai
Capacity headroom	Resource runway before performance is impacted	Alex, Anna
Utilization trends	Improvement or regression over a chosen period	Alex

Alerts

Signal	What is detected	Default threshold
CPU pressure	Sustained high CPU across a cluster	>85% for a configurable duration
Memory growth	Steady increase without release (leak pattern)	>10% per hour
Latency degradation	P95 latency rise above baseline	>2× baseline
OOMKills	Pod terminated due to memory limit	Any occurrence
Replication lag	Database replica falling behind primary	>30 seconds

All thresholds are configurable per environment using `#alert` prompts.
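The default thresholds can be read as simple comparisons. The following Python sketch applies them to one set of metrics. It assumes a hypothetical `Sample` record and `triggered_signals` function, treats the CPU value as already averaged over the configured duration, and does not reflect how the platform evaluates alerts internally.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical metrics for one environment (field names are assumptions)."""
    cpu_percent: float                 # sustained cluster CPU over the chosen duration
    memory_growth_pct_per_hour: float  # growth without release
    p95_latency_ms: float
    baseline_p95_latency_ms: float
    oomkills: int
    replication_lag_seconds: float


def triggered_signals(s: Sample) -> list[str]:
    """Return the signals whose default thresholds from the table are exceeded."""
    checks = [
        ("CPU pressure", s.cpu_percent > 85),
        ("Memory growth", s.memory_growth_pct_per_hour > 10),
        ("Latency degradation", s.p95_latency_ms > 2 * s.baseline_p95_latency_ms),
        ("OOMKills", s.oomkills > 0),
        ("Replication lag", s.replication_lag_seconds > 30),
    ]
    return [name for name, fired in checks if fired]
```

With 90% CPU, a P95 latency of 450 ms against a 200 ms baseline, and 31 seconds of replication lag, the function reports CPU pressure, latency degradation, and replication lag.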

Signal interpretation

Pattern	Likely cause	Next step
High utilization + normal latency	Appropriately sized workload	Consider reserved capacity — ask `@alex`
Low utilization + high cost	Overprovisioned resources	Right-size with `@alex`
High latency + normal utilization	Application or database bottleneck	Ask `@tony`
Utilization spikes + OOMKills	Resource limits misconfigured	Ask `@kai`
Cost spike without traffic change	Configuration drift or orphaned resource	Ask `@alex` or check CloudKeepers findings
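The table above works as a lookup from observed signals to a next step. A minimal Python sketch of that lookup follows. The `RULES` list and `interpret` function are hypothetical, and the signal labels are plain strings chosen for illustration. Deciding what counts as "high" or "low" is left to the caller.

```python
from typing import Optional

# Each row of the table: observed signals -> (likely cause, next step).
# The string labels are illustrative, not identifiers used by the product.
RULES = [
    ({"high utilization", "normal latency"},
     "Appropriately sized workload", "Consider reserved capacity; ask @alex"),
    ({"low utilization", "high cost"},
     "Overprovisioned resources", "Right-size with @alex"),
    ({"high latency", "normal utilization"},
     "Application or database bottleneck", "Ask @tony"),
    ({"utilization spikes", "OOMKills"},
     "Resource limits misconfigured", "Ask @kai"),
    ({"cost spike", "no traffic change"},
     "Configuration drift or orphaned resource",
     "Ask @alex or check CloudKeepers findings"),
]


def interpret(observed: set[str]) -> Optional[tuple[str, str]]:
    """Return (likely cause, next step) for the first row fully present in `observed`."""
    for signals, cause, next_step in RULES:
        if signals <= observed:  # every signal in the row was observed
            return cause, next_step
    return None
```

Calling `interpret({"high latency", "normal utilization"})` returns the database-bottleneck row, while a partial match such as `interpret({"high latency"})` returns `None`.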

Examples

Utilization dashboards:

@alex #dashboard resource utilization across all accounts last 7 days
@kai #dashboard pod resource requests vs actual usage by namespace

Performance dashboards:

@tony #dashboard query latency P50/P95/P99 over last 30 days
@kai #dashboard cluster CPU pressure and OOMKill events by namespace

Alert configuration:

@tony #alert when P95 query latency exceeds 500ms for 5 consecutive minutes
@kai #alert on OOMKilled events or nodes with more than 90% memory pressure

Trends and forecasting:

@anna forecast infrastructure needs for 2x traffic growth
@alex show improvement in resource utilization since last quarter

Related

CloudKeepers: Set up keepers that continuously monitor infrastructure and surface findings.

Cost Analytics: Analyze spending patterns and anomalies across connected accounts.

Assessment: Run a Well-Architected assessment to baseline infrastructure health.

Topology: Correlate analytics signals with your infrastructure dependency graph.
