The Problem
Cloud drift is constant. Security groups get opened during incidents and never closed. Dev instances left running over weekends compound into hundreds of dollars of monthly waste. New resources get provisioned without required tags, skip encryption, or use overly permissive IAM roles. Kubernetes pods crash-loop unnoticed while CPU and memory requests stay oversized. Most teams only discover these issues during quarterly audits or, worse, after a security event.

Continuous monitoring currently requires:
- AWS Config rules: powerful but complex to write, maintain, and interpret
- Terraform Sentinel or OPA policies: code-based, developer-only, no plain-language rules
- Cloud Custodian: YAML-based, requires DevOps expertise to maintain
- Manual scheduled audits: infrequent, incomplete, and immediately stale
How Existing Tools Compare
| Tool | What It Does | What’s Missing |
|---|---|---|
| AWS Config | Rules-based configuration drift detection | Complex rule authoring, no AI analysis, no remediation guidance, AWS-only |
| Terraform Sentinel / OPA | Policy-as-code enforcement | Developer-only, requires code changes to add rules, no AI recommendations |
| Cloud Custodian | YAML-based cloud governance automation | Complex setup, no natural-language interface, no prioritization |
| Wiz / Orca (CSPM) | Cloud security posture management | Security-only (no cost/performance), expensive, requires dedicated analyst |
| AWS Trusted Advisor | Basic well-architected checks | ~50 fixed checks, no customization, no daily operational cadence |
Keeper Architecture
CloudKeepers organizes monitoring into a 3 × 3 matrix of providers and pillars:

| Provider | Cost | Security | Performance |
|---|---|---|---|
| AWS | AWS-COST | AWS-SEC | AWS-PERF |
| GCP | GCP-COST | GCP-SEC | GCP-PERF |
| Kubernetes | K8S-COST | K8S-SEC | K8S-PERF |
- Cost rules: idle compute instances, unattached storage, old snapshots, unused static IPs, oversized databases, idle load balancers, over-requested pod resources, and more
- Security rules: public S3 buckets, unused IAM roles, MFA disabled on root, open security groups, secrets in parameter store, and more
- Performance rules: RDS connection limits, missing health probes, CrashLooping pods, throttled resources, and more
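To make the matrix concrete, the nine keeper IDs can be derived as the cross product of providers and pillars. This is only an illustrative sketch: the prefix and suffix mappings are read off the table above, and the function name `keeper_ids` is hypothetical, not part of CloudKeepers.

```python
from itertools import product

# Short codes as they appear in the keeper IDs of the matrix above
# (Kubernetes is abbreviated to K8S, Security to SEC, Performance to PERF).
PROVIDER_PREFIX = {"AWS": "AWS", "GCP": "GCP", "Kubernetes": "K8S"}
PILLAR_SUFFIX = {"Cost": "COST", "Security": "SEC", "Performance": "PERF"}


def keeper_ids() -> list[str]:
    """Return one keeper ID per (provider, pillar) cell, row by row."""
    return [
        f"{PROVIDER_PREFIX[provider]}-{PILLAR_SUFFIX[pillar]}"
        for provider, pillar in product(PROVIDER_PREFIX, PILLAR_SUFFIX)
    ]
```

Each ID corresponds to one keeper that can be enabled independently, so a team could start with a single cell such as `AWS-COST` and expand from there.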
Autonomy Levels
Every detection rule can operate at one of three autonomy levels:

| Level | Name | Behavior |
|---|---|---|
| 1 | Suggest | Read-only. The keeper analyzes infrastructure and reports findings. No changes are made. |
| 2 | Approve | The keeper drafts actions for each finding. You review and approve before anything runs. |
| 3 | Autonomous | The keeper executes approved command types automatically. You are notified after each action. |
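One way to picture the three levels is as a dispatch step that decides what happens to a proposed remediation. The sketch below is a simplified model, not the product's implementation: the names `Autonomy`, `next_step`, and the returned step labels are hypothetical, and falling back to approval for a command type that has not been approved is an assumption.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST = 1      # read-only: report findings, change nothing
    APPROVE = 2      # draft actions, wait for a human to approve
    AUTONOMOUS = 3   # run approved command types, notify afterwards


def next_step(level: Autonomy, command_type: str, approved_types: set[str]) -> str:
    """Decide what happens to a proposed action for one finding."""
    if level is Autonomy.SUGGEST:
        return "report"
    if level is Autonomy.APPROVE:
        return "await_approval"
    # Autonomous: only approved command types run without a human in the loop.
    if command_type in approved_types:
        return "execute_and_notify"
    # Assumption: anything not pre-approved falls back to manual approval.
    return "await_approval"
```

Modeling the levels as an ordered enum keeps the most conservative behavior at the lowest value, which makes it easy to cap a rule at a maximum autonomy level.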
What Makes This Different
- Keepers, not rules: instead of writing policy code, you enable provider-pillar keepers and configure detection rules — the keepers decide what matters
- Three pillars: cost optimization, security monitoring, and performance analysis in a single system
- Configurable autonomy: choose Suggest, Approve, or Autonomous per rule — from read-only observation to full auto-remediation
- Tunable thresholds: adjust detection sensitivity per rule (e.g., idle CPU threshold, snapshot max age, lookback period)
- Daily operational cadence: designed to run daily (configurable cron), not quarterly — catching drift before it compounds
- Plain-language findings: each finding explains the risk and impact in business terms, not just a rule name
- Remediation playbooks attached: every finding includes impact analysis, before/after estimates, and step-by-step implementation guidance
- Multi-cloud + Kubernetes: scans AWS, GCP, and Kubernetes in a single system — not separate tools per provider
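As an illustration of tunable thresholds, an idle-compute rule might compare average CPU over a lookback window against a configurable limit. This is a minimal sketch under stated assumptions: `IdleRule`, `is_idle`, the default values, and the use of a simple mean are all hypothetical and may differ from how CloudKeepers evaluates idleness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class IdleRule:
    cpu_threshold_pct: float = 5.0  # hypothetical default, tune per rule
    lookback_days: int = 7          # hypothetical default, tune per rule


def is_idle(daily_avg_cpu_pct: list[float], rule: IdleRule) -> bool:
    """Flag an instance whose mean CPU over the lookback window is below the threshold."""
    if rule.lookback_days < 1:
        raise ValueError("lookback_days must be at least 1")
    window = daily_avg_cpu_pct[-rule.lookback_days:]
    if len(window) < rule.lookback_days:
        return False  # not enough history to judge, so do not flag
    return mean(window) < rule.cpu_threshold_pct
```

Raising the threshold or shortening the lookback makes the rule more sensitive; lowering the threshold or lengthening the lookback reduces false positives on bursty workloads.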
Responsibilities
- Policy enforcement: apply cost, security, and performance guardrails through specialized keepers for day-to-day operations.
- Drift detection: continuously scan for misconfigurations, risky defaults, resource bloat, and performance bottlenecks.
- Remediation playbooks: attach implementation steps and automation options to every finding.
- Alerting: notify the right channels by severity so teams can triage quickly.
Prerequisites
- At least one cloud account or Kubernetes cluster connected with permissions for read/monitoring and (optionally) remediation.
- Slack Integration, Microsoft Teams, or email destinations configured if you want outbound alerts in addition to in-app Notifications.
- Optional: tags or filters ready if you plan to scope findings to specific environments.
Quick Start
Open CloudKeepers
Go to CloudKeepers to see the onboarding view. It walks you through three steps: connect a cloud account, enable keepers, and run your first detection scan. Click Enable Your First Keepers to begin.

Select and configure keepers
The setup wizard has two steps. In Select Keepers, choose which keepers to activate — filter by provider (AWS, Kubernetes) or pillar (Cost, Security, Performance). In Review & Configure, fine-tune detection rules per keeper, set the autonomy level (Suggest, Approve, or Autonomous), and adjust which rules are enabled.

Review the dashboard
Once keepers are enabled, select one from the sidebar to see its Dashboard tab. Four stat cards — Open Findings, Critical & High, Potential Savings, and This Week — give you a quick pulse. The Findings Over Time chart breaks down trends by severity.

Triage findings
Switch to the Findings tab to see a Kanban board with columns for Pending, In Progress, Implemented, and Ignored. Each finding card shows the title, estimated savings, effort level, and risk severity. Click a card to drill into details, or drag it between columns to update its status.

Review detection runs
The Runs tab shows every detection run with its status, summary, duration, and how many findings were created or updated. Use this as an audit trail to verify keepers are running on schedule.

How enforcement and drift detection work
- Keepers run on the cron schedules you set (default: daily at 7 AM UTC) or on demand. Each run scans all permitted resources for cost, security, and performance risk, not only the resources you previously discovered.
- Each detection run produces an audit trail in the Runs tab showing status, timing, findings created/updated/closed, and any errors.
- Findings are tagged with pillar, severity (Low / Medium / High), effort, and estimated savings to prioritize the highest-value fixes.
- Findings start as drafts; promote them to active recommendations, then save to Plan when you are ready for approvals, scheduling, and execution tracking.
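The prioritization described above can be sketched as a sort over finding records. The `Finding` fields mirror the tags listed in this section, but the class itself, the `prioritize` function, and the choice to break severity ties by larger savings are assumptions made for illustration.

```python
from dataclasses import dataclass

# Lower rank sorts first; the scale follows the severities listed above.
SEVERITY_RANK = {"High": 0, "Medium": 1, "Low": 2}


@dataclass(frozen=True)
class Finding:
    title: str
    pillar: str               # "Cost", "Security", or "Performance"
    severity: str             # "Low", "Medium", or "High"
    effort: str               # free-form effort label
    estimated_savings: float  # assumed to be a monthly amount


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Order findings by severity first, then by larger estimated savings."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_RANK[f.severity], -f.estimated_savings),
    )
```

A team focused on cost could swap the key order so that savings dominates, while a security team might filter by pillar before sorting.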
Finding statuses
Findings move through a Kanban workflow:

| Status | Meaning |
|---|---|
| New | Just detected — awaiting triage |
| Acknowledged | Team is aware, not yet acting |
| Active | Remediation in progress |
| Resolved | Fix implemented and verified |
| Dismissed | Intentionally skipped — keeper will not re-flag |
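The effect of the Dismissed status on later detection runs can be sketched as follows. Only the rule that dismissed findings are not re-flagged comes from the table above; the `Status` enum, the `on_detection` function, the idea of a stable finding key, and the handling of every other status are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    NEW = "New"
    ACKNOWLEDGED = "Acknowledged"
    ACTIVE = "Active"
    RESOLVED = "Resolved"
    DISMISSED = "Dismissed"


def on_detection(existing_status: Optional[Status]) -> str:
    """Decide what a detection run does when it sees an issue again.

    existing_status is the status of the finding already stored under the
    same (hypothetical) key, or None if the issue has never been seen.
    """
    if existing_status is None:
        return "create"
    if existing_status is Status.DISMISSED:
        return "skip"    # from the table: the keeper will not re-flag
    if existing_status is Status.RESOLVED:
        return "create"  # assumption: a reappearing issue is a new finding
    return "update"      # assumption: open findings are refreshed, not duplicated
```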
CloudKeepers is your daily operational guardrail. Assessment is a deeper, periodic evaluation and is not meant for day-to-day runs.
Keeper settings
Each keeper has a dedicated Settings tab where you can:
- Schedule: set a cron expression for automated runs (minimum 1-hour interval).
- Detection rules: toggle individual rules, adjust their autonomy level (Suggest / Approve / Autonomous), and configure per-rule thresholds (e.g., idle CPU %, lookback days, snapshot max age).
- Commands & permissions: manage which cloud commands each rule is allowed to execute, with per-command effects (Allow / Require Approval / Deny).
- Notifications: configure Email, Slack, and Teams channels with per-channel minimum severity thresholds.
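Two of these settings lend themselves to a short sketch: checking that a schedule respects the 1-hour minimum, and looking up the effect assigned to a command. The code assumes standard five-field cron syntax, and the names `respects_hourly_minimum`, `Effect`, and `effect_for`, the example command keys, and the deny-by-default fallback are hypothetical rather than documented CloudKeepers behavior.

```python
from enum import Enum


def respects_hourly_minimum(cron_expr: str) -> bool:
    """Return True if a five-field cron expression fires at most once per hour.

    A single fixed minute value means no two runs can fall within the same
    hour, so consecutive runs are at least 60 minutes apart.
    """
    fields = cron_expr.split()
    if len(fields) != 5:
        raise ValueError("expected a five-field cron expression")
    minute = fields[0]
    return minute.isdigit() and 0 <= int(minute) <= 59


class Effect(Enum):
    ALLOW = "Allow"
    REQUIRE_APPROVAL = "Require Approval"
    DENY = "Deny"


def effect_for(command: str, effects: dict[str, Effect]) -> Effect:
    """Look up the configured effect for a command.

    Assumption: commands without an explicit entry are denied.
    """
    return effects.get(command, Effect.DENY)
```

For example, `0 7 * * *` (daily at 07:00, matching the default schedule) passes the check, while `*/15 * * * *` does not because it fires four times per hour.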
Remediation playbooks
- Every finding includes an impact analysis with before/after estimates and a step-by-step playbook.
- Use Impact Analytics for deeper analysis, Generate Guidelines for shareable runbooks, Custom Prompt to explore edge cases, or Implement to execute changes.
- Track status and outcomes in Plan so governance, FinOps, and security teams share the same source of truth.
Alerting and routing
- Set per-channel minimum severities to keep noise low while still surfacing critical issues quickly.
- Slack: real-time triage with action links back to CloudThinker.
- Email: audit trails with workspace-aware links.
- Teams: team-channel delivery with severity filtering.
- In-app Notifications are always delivered regardless of channel settings.
- Combine alerting with Plan workflows to ensure findings get reviewed, approved, and closed.
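The routing rules above can be sketched as a small filter: in-app delivery always happens, and each outbound channel receives a finding only if its severity meets that channel's minimum. The function name `channels_for`, the channel keys, and the three-step severity scale (taken from the finding tags described earlier) are assumptions for illustration.

```python
# Ordered from least to most severe.
SEVERITY_ORDER = ["Low", "Medium", "High"]


def channels_for(severity: str, channel_minimums: dict[str, str]) -> list[str]:
    """Return the destinations that should receive a finding of this severity."""
    rank = SEVERITY_ORDER.index(severity)
    targets = ["in-app"]  # in-app notifications are always delivered
    for channel, minimum in channel_minimums.items():
        if rank >= SEVERITY_ORDER.index(minimum):
            targets.append(channel)
    return targets
```

With minimums of High for Slack, Low for email, and Medium for Teams, a Medium finding would reach in-app, email, and Teams but stay out of the Slack channel.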
What’s Next
- Plan: save findings to Plan for approvals, scheduling, and execution tracking.
- Assessment: run deeper periodic Well-Architected assessments alongside daily CloudKeepers runs.
- Slack Integration: route CloudKeepers alerts to Slack channels for real-time triage.
- Recurring Tasks: schedule additional recurring analysis to complement CloudKeepers.
