What You’ll Set Up

By the end of this tutorial, alerts from your monitoring tools will automatically trigger AI-powered investigation — agents correlate evidence across metrics, logs, traces, and topology to identify root cause and suggest remediation.
Step 1: Navigate to Incident Settings

Go to Incident in your workspace. You’ll see the incident dashboard and configuration options.
Step 2: Connect Monitoring Tools via Webhooks

CloudThinker ingests alerts from 15+ monitoring platforms through webhooks:
  • PagerDuty: Add the CloudThinker webhook URL as a service integration
  • Datadog: Create a webhook notification in Monitors
  • Prometheus / Alertmanager: Add a webhook receiver configuration
  • AWS CloudWatch: Route alarms through SNS to the webhook
  • Grafana: Add a webhook contact point
  • Opsgenie: Configure a webhook integration
  • New Relic: Add a webhook notification channel
  • Sentry: Configure a webhook integration for issues
To connect:
  1. Go to Incident > Integrations
  2. Select your monitoring platform
  3. Copy the generated webhook URL
  4. Paste it in your monitoring tool’s webhook/notification settings
  5. Send a test alert to verify the connection
See Webhook Integrations for detailed setup instructions for each platform.
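Step 5 of the connection flow (send a test alert) can also be done from a script. Below is a minimal sketch: the payload fields (`title`, `severity`, `source`, `status`) and the webhook URL are assumptions for illustration, not CloudThinker's documented schema — your monitoring platform defines the actual format it sends.

```python
import json
import urllib.request

def build_test_alert(title, severity="high", source="manual-test"):
    """Build a minimal alert payload. Field names here are illustrative;
    the real schema depends on your monitoring platform's webhook format."""
    return {
        "title": title,
        "severity": severity,
        "source": source,
        "status": "firing",
    }

def send_test_alert(webhook_url, payload):
    """POST the payload as JSON to the webhook URL copied from
    Incident > Integrations."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    payload = build_test_alert("Test alert from setup walkthrough")
    print(json.dumps(payload, indent=2))
    # send_test_alert("https://YOUR-WEBHOOK-URL", payload)  # uncomment with your real URL
```

If the connection is working, the test alert should appear in the Incident dashboard within a few seconds.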
Step 3: Configure Alert Routing

Once webhooks are connected, configure how alerts are handled:
  • Auto-investigate: Automatically start AI investigation when an alert arrives (recommended)
  • Severity mapping: Map your monitoring tool’s severity levels to CloudThinker’s (Critical, High, Medium, Low)
  • Deduplication: Prevent duplicate incidents from related alerts
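To make the three routing options concrete, here is a hypothetical sketch of severity mapping and deduplication in Python. The mapping table, field names, and dedup key are all illustrative stand-ins, not CloudThinker's actual configuration schema.

```python
# Illustrative severity map: your monitoring tool's levels on the left,
# CloudThinker's four levels on the right.
SEVERITY_MAP = {
    "P1": "Critical", "P2": "High", "P3": "Medium", "P4": "Low",
    "critical": "Critical", "warning": "Medium", "info": "Low",
}

def route_alert(alert, open_incidents):
    """Map severity and suppress duplicates of an already-open incident.

    open_incidents: dict mapping (source, resource) -> incident id.
    """
    severity = SEVERITY_MAP.get(alert["severity"], "Medium")
    # Dedup key: same source + same resource means the same incident,
    # so related alerts attach instead of opening a new one.
    dedup_key = (alert["source"], alert["resource"])
    if dedup_key in open_incidents:
        return {"action": "attach", "incident": open_incidents[dedup_key]}
    # Auto-investigate on by default, per the recommendation above.
    return {"action": "create", "severity": severity, "auto_investigate": True}
```

The dedup key is the design decision that matters most here: too coarse and unrelated alerts get merged, too fine and one outage opens a dozen incidents.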
Step 4: Trigger Your First Incident

You can either:
  • Wait for a real alert: Let your monitoring tools trigger a real incident
  • Send a test webhook: Use your monitoring tool’s test feature to send a sample alert
  • Log manually: Go to Incident > Manual Logging to create a test incident
For your first run, manual logging lets you see the full investigation flow immediately:
  1. Click New Incident
  2. Describe the issue: “High CPU utilization on production web server”
  3. Set severity and affected resources
  4. Submit
Step 5: Watch the AI Investigation

Once an incident is created, the AI agent starts a hypothesis-driven investigation:
  1. Initial hypothesis: Forms possible root causes based on the alert data
  2. Evidence gathering: Pulls metrics, logs, traces, configs, and recent deployments
  3. Timeline correlation: Maps events across systems to a unified timeline
  4. Topology analysis: Traces service dependencies to understand blast radius
  5. Root cause identification: Narrows down to the most likely cause with a confidence score
You can watch the investigation in real time as the agent works through each step.
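The five steps above can be sketched as a scoring loop: form hypotheses, accumulate support from each evidence source, and pick the best-supported cause. This is a toy model — the agent's real hypothesis generation and scoring are internal to CloudThinker — but it shows the shape of the flow.

```python
from dataclasses import dataclass

@dataclass
class EvidenceSource:
    """A stand-in for one evidence channel (metrics, logs, traces, ...)."""
    name: str
    signals: dict  # hypothesis -> support weight

    def support_for(self, hypothesis):
        return self.signals.get(hypothesis, 0)

def form_hypotheses(alert):
    # In the real agent, hypotheses come from the alert data; here we
    # hard-code a few plausible causes for a high-CPU alert.
    return ["runaway process", "traffic spike", "bad deployment"]

def investigate(alert, sources):
    """Score each hypothesis against all evidence and return the
    best-supported cause with a naive confidence score."""
    hypotheses = form_hypotheses(alert)
    scores = {h: sum(s.support_for(h) for s in sources) for h in hypotheses}
    best = max(scores, key=scores.get)
    total = sum(scores.values()) or 1
    return best, scores[best] / total
```

A hypothesis with zero support across all sources is effectively "ruled out", which is the dismissed-hypothesis trail you can review after the investigation completes.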
Step 6: Review the Root Cause Analysis

The completed investigation shows:
  • Root cause: The identified issue with confidence score
  • Evidence chain: All data points that support the conclusion
  • Blast radius: Which services and users are affected
  • Timeline: Sequence of events leading to the incident
  • Remediation: Recommended actions to resolve and prevent recurrence
The agent’s reasoning is fully transparent — you can see every hypothesis it considered and why it was confirmed or ruled out.
Step 7: Resolve and Learn

After resolving the incident:
  1. Mark the incident as Resolved
  2. The agent stores the investigation in its memory system
  3. Future similar incidents benefit from learned patterns
Over time, the system gets faster and more accurate at diagnosing issues it has seen before.
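The learn-from-resolution step can be pictured as a memory of resolved incidents that future investigations query for similar cases. The sketch below uses naive keyword overlap as the similarity measure — CloudThinker's actual memory system is internal and certainly more sophisticated, so treat every name here as hypothetical.

```python
def similarity(a, b):
    """Jaccard overlap of lowercased words -- a crude stand-in for
    whatever similarity the real memory system uses."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

class IncidentMemory:
    def __init__(self):
        self.resolved = []  # list of (description, root_cause) pairs

    def store(self, description, root_cause):
        """Called when an incident is marked Resolved."""
        self.resolved.append((description, root_cause))

    def recall(self, description, threshold=0.3):
        """Return the root cause of the closest past incident, if any
        match clears the similarity threshold."""
        matches = [(similarity(description, d), rc) for d, rc in self.resolved]
        best = max(matches, default=(0, None))
        return best[1] if best[0] >= threshold else None
```

This is why repeat incidents resolve faster: a strong recall lets the agent start from a previously confirmed root cause instead of a blank hypothesis list.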

How Investigation Works

Alert arrives → AI forms hypotheses → Gathers evidence (metrics, logs, traces)
→ Correlates timeline → Analyzes topology → Identifies root cause
→ Recommends remediation → Learns from resolution
The entire flow — from alert to root cause — typically completes in under 5 minutes.

Tips

  • Connect multiple monitoring tools: The more data sources agents can access, the more accurate the root cause analysis
  • Start with auto-investigate on: Let the AI investigate every alert automatically — you can always tune later
  • Review dismissed hypotheses: Understanding why the agent ruled out alternatives builds trust in its reasoning
  • Enable Slack notifications: Route incident updates to your #incidents channel so the team stays informed
  • Combine with CloudKeepers: Many incidents are preventable — CloudKeepers catch drift before it causes outages

Tutorial Complete

You’ve now set up the CloudThinker incident workflow end-to-end: alert ingestion via webhooks, alert routing, AI-powered investigation, root cause review, and learning from resolution.

What’s Next