3.1 A working definition
DEFINITION
Agentic infrastructure operations (AgenticOps) is an operating model in which autonomous AI agents carry out the core loop of operational work (detecting conditions, analyzing causes, resolving issues, and validating outcomes) across cloud and on-premise infrastructure, under explicit human-defined policy, with humans supervising on the loop rather than executing in the loop.
An agent in this model is:
- Goal-directed. It is given outcomes (“keep checkout latency under 300ms”; “keep monthly cloud spend within budget”), not step-by-step instructions.
- Perceptive. It continuously consumes telemetry — metrics, logs, traces, events, configuration state, cost data — rather than waiting to be prompted.
- Reasoning. It forms and tests causal hypotheses, weighs alternative remediations, and explains its thinking in language an engineer can audit.
- Tool-using. It acts through the same interfaces engineers use — cloud APIs, kubectl, Terraform, SQL, CI/CD — with scoped, auditable credentials.
- Self-verifying. After acting, it checks whether the intended outcome was achieved, and escalates or rolls back when it wasn’t.
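The goal-directed and self-verifying properties above can be sketched as one pass of a detect, act, validate loop. This is a minimal illustration, not a reference implementation: `Goal`, `run_once`, and the `observe`, `remediate`, and `rollback` callables are hypothetical names introduced here, and the reasoning step (hypothesis testing, choosing among remediations) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Goal:
    """An outcome to maintain, e.g. 'checkout latency under 300 ms'."""
    name: str
    threshold: float  # the goal is met while the observed value stays below this


def run_once(
    goal: Goal,
    observe: Callable[[], float],    # hypothetical: reads the relevant telemetry
    remediate: Callable[[], None],   # hypothetical: the chosen corrective action
    rollback: Callable[[], None],    # hypothetical: undoes the corrective action
) -> str:
    """Run one detect -> act -> validate pass and report what happened."""
    # Detect: compare telemetry against the goal, not against a script.
    if observe() < goal.threshold:
        return "healthy"

    # Act: apply the remediation through whatever tool interface is in use.
    remediate()

    # Validate: re-read telemetry to confirm the outcome was achieved.
    if observe() < goal.threshold:
        return "resolved"

    # The action did not help: undo it and hand the problem to a human.
    rollback()
    return "escalated"
```

In this sketch the agent is handed a threshold rather than a runbook, and it never reports success without a second observation. A real agent would run this continuously and record each step for audit.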
3.2 The autonomy spectrum
Autonomy is not binary. Mature agentic platforms expose autonomy as a policy dial, typically per action class and per environment:

| Level | Name | Agent behavior | Human role |
|---|---|---|---|
| L0 | Observe | Monitors and reports; takes no action | Executes everything |
| L1 | Advise | Investigates and recommends with evidence | Decides and executes |
| L2 | Act with approval | Prepares full remediation; waits for sign-off | One-click approve/reject |
| L3 | Act with notification | Executes pre-approved action classes; informs humans | Reviews after the fact |
| L4 | Autonomous in domain | Owns a bounded domain end-to-end within policy | Sets policy; audits outcomes |
BIG TECH PRACTICE: THE SPECTRUM IS NOW PRODUCT REALITY
The L0–L4 spectrum is not a theoretical construct; it is how the hyperscalers ship. Google’s Gemini Cloud Assist proactive investigations run at L1 by explicit design (investigate everything, change nothing). AWS’s own adoption guidance for DevOps Agent is to start in recommendation-only mode and measure for weeks before granting action. Azure SRE Agent exposes the dial directly: a Review mode where every action awaits an “Approve” click, and a privileged mode for pre-authorized action classes, governed per tool. When all three clouds independently converge on the same graduated-autonomy posture, that is the industry’s collective answer to how much trust an agent starts with: none; it earns it.
Figure 3 — The autonomy dial: action classes graduate from L0 to L4 on evidence, per environment.
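A policy dial keyed by action class and environment can be sketched as a simple lookup. The example below is illustrative only: the action classes, environments, and the `POLICY`, `granted_level`, and `decide` names are all assumptions made for this sketch, not the configuration format of any vendor named above. Unlisted combinations default to L0, which reflects the idea that an agent starts with no trust.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    OBSERVE = 0                # L0: monitor and report only
    ADVISE = 1                 # L1: recommend with evidence
    ACT_WITH_APPROVAL = 2      # L2: prepare remediation, wait for sign-off
    ACT_WITH_NOTIFICATION = 3  # L3: execute pre-approved classes, inform humans
    AUTONOMOUS = 4             # L4: own a bounded domain within policy


# Hypothetical policy: (action class, environment) -> granted level.
POLICY = {
    ("restart_pod", "staging"): Autonomy.AUTONOMOUS,
    ("restart_pod", "production"): Autonomy.ACT_WITH_NOTIFICATION,
    ("resize_database", "production"): Autonomy.ACT_WITH_APPROVAL,
}


def granted_level(action_class: str, environment: str) -> Autonomy:
    """Return the granted level; anything not listed stays at L0."""
    return POLICY.get((action_class, environment), Autonomy.OBSERVE)


def decide(action_class: str, environment: str, approved: bool = False) -> str:
    """Map the granted level to what the agent may do right now."""
    level = granted_level(action_class, environment)
    if level == Autonomy.OBSERVE:
        return "report"
    if level == Autonomy.ADVISE:
        return "recommend"
    if level == Autonomy.ACT_WITH_APPROVAL:
        return "execute" if approved else "await_approval"
    if level == Autonomy.ACT_WITH_NOTIFICATION:
        return "execute_and_notify"
    return "execute"  # L4: humans audit outcomes after the fact


print(decide("restart_pod", "production"))      # execute_and_notify
print(decide("resize_database", "production"))  # await_approval
print(decide("delete_volume", "production"))    # report
```

Graduating an action class then amounts to changing one entry in the policy, which keeps the change reviewable and reversible.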
3.3 What agentic operations is not
“Agent washing” is now common enough that Gartner has named it: vendors rebranding assistants, chatbots, and RPA as “agents” without meaningful agentic capability. In mid-2025, Gartner estimated that of the thousands of vendors claiming agentic AI, only around 130 were real. A precise negative definition is therefore a buyer’s best defense:
- Not a chatbot over your dashboards. Conversational access to telemetry is a feature, not the model. If a human must read the answer and then go do the work, you are still in Gen 3, whatever the marketing says.
- Not lights-out operations. No credible practitioner advocates removing humans. The target is human leverage: one engineer supervising work that used to take a team.
- Not a replacement for engineering discipline. Agents amplify the environment they are given. Weak observability, absent IaC, and undocumented systems produce weak agents. Garbage context in, garbage autonomy out.
- Not one giant model that does everything. As the next chapter shows, production systems are converging on orchestrated teams of specialists, not monolithic super-models.
3.4 The scope of operational work agents can own today
| Domain | Representative agent tasks | Typical autonomy (2026) |
|---|---|---|
| Incident response | Triage, correlation, root-cause analysis, remediation, post-incident reports | L1–L3 |
| Cloud cost (FinOps) | Rightsizing, idle-resource cleanup, commitment planning, anomaly detection | L2–L4 |
| Kubernetes operations | Pod/node health, resource tuning, upgrade assistance, capacity planning | L2–L3 |
| Database operations | Slow-query analysis, index advice, replication health, storage forecasting | L1–L3 |
| Security operations | Misconfiguration detection, CVE triage, IAM hygiene, compliance evidence | L1–L2 |
| Change & release | Pre-deploy risk analysis, canary monitoring, automated rollback | L2–L3 |
| Infrastructure as Code | Drift detection, module generation, plan review, state hygiene | L1–L3 |