Chapter 3 · What Agentic Infrastructure Operations Actually Is

Definitions matter, because “agent” is the most abused word in enterprise software.

3.1 A working definition

DEFINITION

Agentic infrastructure operations (AgenticOps) is an operating model in which autonomous AI agents carry out the core loop of operational work (detecting conditions, analyzing causes, resolving issues, and validating outcomes) across cloud and on-premises infrastructure, under explicit human-defined policy, with humans supervising on the loop rather than executing in the loop.

Unpacking the definition: an agent in this sense is not a chatbot with a runbook, and not a script with an LLM bolted on. A true operations agent has five properties:

Goal-directed. It is given outcomes (“keep checkout latency under 300ms”; “keep monthly cloud spend within budget”), not step-by-step instructions.
Perceptive. It continuously consumes telemetry — metrics, logs, traces, events, configuration state, cost data — rather than waiting to be prompted.
Reasoning. It forms and tests causal hypotheses, weighs alternative remediations, and explains its thinking in language an engineer can audit.
Tool-using. It acts through the same interfaces engineers use — cloud APIs, kubectl, Terraform, SQL, CI/CD — with scoped, auditable credentials.
Self-verifying. After acting, it checks whether the intended outcome was achieved, and escalates or rolls back when it wasn’t.
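The five properties can be read as one control loop. The sketch below is a minimal illustration, not any vendor's implementation: `Remediation`, `run_once`, and the toy latency environment are hypothetical names introduced here, and the "reasoning" step is reduced to a caller-supplied `propose` function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Remediation:
    """A candidate fix paired with the way to undo it (hypothetical type)."""
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]


def run_once(
    read_metric: Callable[[], float],
    goal_met: Callable[[float], bool],
    propose: Callable[[float], Optional[Remediation]],
    escalate: Callable[[str], None],
) -> str:
    """Run one detect -> analyze -> resolve -> validate pass."""
    observed = read_metric()                    # perceive: pull telemetry
    if goal_met(observed):                      # goal-directed: outcome check
        return "healthy"

    remediation = propose(observed)             # reason: choose a fix
    if remediation is None:
        escalate(f"no known remediation (observed={observed})")
        return "escalated"

    remediation.apply()                         # tool use: act on the system
    if goal_met(read_metric()):                 # self-verify: re-measure
        return "resolved"

    remediation.rollback()                      # undo the failed change
    escalate(f"{remediation.name} did not restore the goal; rolled back")
    return "rolled_back"


# Toy environment (assumed): checkout latency in ms, goal is under 300 ms.
state = {"latency_ms": 450.0}
alerts: list[str] = []

scale_out = Remediation(
    name="scale_out",
    apply=lambda: state.update(latency_ms=state["latency_ms"] - 200),
    rollback=lambda: state.update(latency_ms=state["latency_ms"] + 200),
)

outcome = run_once(
    read_metric=lambda: state["latency_ms"],
    goal_met=lambda ms: ms < 300,
    propose=lambda ms: scale_out,
    escalate=alerts.append,
)
print(outcome)  # resolved
```

The point of the shape is that verification and rollback sit inside the loop itself, so a remediation that does not restore the goal is undone and handed to a human rather than left in place.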

3.2 The autonomy spectrum

Autonomy is not binary. Mature agentic platforms expose autonomy as a policy dial, typically per action class and per environment:

Level	Name	Agent behavior	Human role
L0	Observe	Monitors and reports; takes no action	Executes everything
L1	Advise	Investigates and recommends with evidence	Decides and executes
L2	Act with approval	Prepares full remediation; waits for sign-off	One-click approve/reject
L3	Act with notification	Executes pre-approved action classes; informs humans	Reviews after the fact
L4	Autonomous in domain	Owns a bounded domain end-to-end within policy	Sets policy; audits outcomes
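One way to picture the dial is as a lookup keyed by action class and environment. The following is a minimal sketch under stated assumptions: the `POLICY` table, the action-class names, and the default of L1 for unlisted actions are illustrative choices, not a description of any particular platform.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """The five levels from the table above."""
    OBSERVE = 0
    ADVISE = 1
    ACT_WITH_APPROVAL = 2
    ACT_WITH_NOTIFICATION = 3
    AUTONOMOUS_IN_DOMAIN = 4


# Hypothetical policy: one entry per (action class, environment).
POLICY = {
    ("restart_pod", "staging"): Autonomy.AUTONOMOUS_IN_DOMAIN,
    ("restart_pod", "production"): Autonomy.ACT_WITH_NOTIFICATION,
    ("schema_migration", "production"): Autonomy.ACT_WITH_APPROVAL,
}

# Assumption: anything not listed is treated as novel and stays advisory.
DEFAULT_LEVEL = Autonomy.ADVISE


def allowed_level(action_class: str, environment: str) -> Autonomy:
    """Return the autonomy level granted for this action in this environment."""
    return POLICY.get((action_class, environment), DEFAULT_LEVEL)


def may_execute(action_class: str, environment: str, approved: bool = False) -> bool:
    """Decide whether the agent may run the action right now."""
    level = allowed_level(action_class, environment)
    if level >= Autonomy.ACT_WITH_NOTIFICATION:
        return True                # L3 and L4: pre-approved action classes
    if level == Autonomy.ACT_WITH_APPROVAL:
        return approved            # L2: only after a human signs off
    return False                   # L0 and L1: the agent never executes


print(may_execute("restart_pod", "production"))                      # True
print(may_execute("schema_migration", "production"))                 # False
print(may_execute("schema_migration", "production", approved=True))  # True
```

Keeping the policy as data rather than code means a level change is a reviewed configuration edit, which fits the idea that humans set policy while agents operate inside it.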

In practice, organizations run different levels simultaneously: L3–L4 for reversible, low-blast-radius actions (restart a pod, clear a cache, scale a replica set, rotate a credential), L2 for consequential changes (schema migrations, security group changes, failovers), and L1 for anything novel. The art of agentic operations is moving action classes up the ladder as evidence accumulates — never faster.
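"As evidence accumulates, never faster" can be expressed as a promotion gate. The sketch below is illustrative only: the `Evidence` fields, the minimum run count, and the 2% failure threshold are assumed values chosen for the example, not recommended settings.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Track record of one action class in one environment (hypothetical)."""
    runs: int
    rollbacks: int
    human_interventions: int


def next_level(
    current: int,
    evidence: Evidence,
    min_runs: int = 50,               # assumed threshold
    max_failure_rate: float = 0.02,   # assumed threshold
    ceiling: int = 4,
) -> int:
    """Promote by at most one level, and only when the record supports it."""
    if evidence.runs < min_runs:
        return current                # not enough evidence yet
    failures = evidence.rollbacks + evidence.human_interventions
    if failures / evidence.runs > max_failure_rate:
        return current                # record is not clean enough
    return min(current + 1, ceiling)  # one step up, never past L4


print(next_level(2, Evidence(runs=100, rollbacks=1, human_interventions=0)))  # 3
print(next_level(2, Evidence(runs=10, rollbacks=0, human_interventions=0)))   # 2
```

The one-step limit is deliberate: an action class cannot jump from L1 to L4 on a single good quarter, because each level has to produce its own evidence.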

BIG TECH PRACTICE: THE SPECTRUM IS NOW PRODUCT REALITY

The L0–L4 spectrum is not a theoretical construct; it is how the hyperscalers ship. Google’s Gemini Cloud Assist proactive investigations run at L1 by explicit design (investigate everything, change nothing). AWS’s own adoption guidance for DevOps Agent is to start in recommendation-only mode and measure for weeks before granting action. Azure SRE Agent exposes the dial directly: a Review mode where every action awaits an “Approve” click, and a privileged mode for pre-authorized action classes, governed per tool. When all three clouds independently converge on the same graduated-autonomy posture, that is the industry’s collective answer to how much trust an agent starts with: none; it earns it.

Figure 3 — The autonomy dial: action classes graduate from L0 to L4 on evidence, per environment.

3.3 What agentic operations is not

“Agent washing” is now common enough that Gartner has named it: vendors rebranding assistants, chatbots, and RPA as “agents” without meaningful agentic capability. In mid-2025, Gartner estimated that of the thousands of vendors claiming agentic AI, only around 130 were real. A precise negative definition is therefore a buyer’s best defense:

Not a chatbot over your dashboards. Conversational access to telemetry is a feature, not the model. If a human must read the answer and then go do the work, you are still in Gen 3 — whatever the marketing says.
Not lights-out operations. No credible practitioner advocates removing humans. The target is human leverage: one engineer supervising work that used to take a team.
Not a replacement for engineering discipline. Agents amplify the environment they are given. Weak observability, absent IaC, and undocumented systems produce weak agents. Garbage context in, garbage autonomy out.
Not one giant model that does everything. As the next chapter shows, production systems are converging on orchestrated teams of specialists, not monolithic super-models.

THE FIVE-QUESTION VENDOR TEST

Ask any “agentic” vendor:

Can the system execute a remediation end-to-end, or only recommend?
Does it verify its own outcomes and roll back on failure?
Can autonomy be set per action class and per environment?
Does every action carry a full, immutable reasoning trail?
What were its rollback and intervention rates in its last three production deployments?

A real platform answers all five with evidence. Agent washing fails by question two.
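To make question four concrete, here is one way a reasoning trail can be made tamper-evident: each entry stores the hash of the entry before it. This is a minimal sketch using only Python's standard `hashlib` and `json` modules; `append_entry`, `verify`, and the entry fields are hypothetical, and a production system would additionally need append-only storage, which this example does not provide.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry (assumption)


def _digest(body: dict) -> str:
    """Hash an entry body deterministically."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(trail: list, action: str, reasoning: str) -> None:
    """Add an entry that commits to the one before it."""
    prev_hash = trail[-1]["hash"] if trail else GENESIS
    body = {"action": action, "reasoning": reasoning, "prev_hash": prev_hash}
    trail.append({**body, "hash": _digest(body)})


def verify(trail: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in trail:
        body = {key: entry[key] for key in ("action", "reasoning", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


trail: list = []
append_entry(trail, "restart_pod", "error rate rose after OOM kills on one replica")
append_entry(trail, "verify_outcome", "error rate back under threshold")
print(verify(trail))  # True

trail[0]["reasoning"] = "edited after the fact"
print(verify(trail))  # False
```

A chain like this does not stop someone from rewriting the whole log, so it shows evidence of tampering rather than preventing it; the useful question for a vendor is what anchors the chain outside the system being audited.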

3.4 The scope of operational work agents can own today

Domain	Representative agent tasks	Typical autonomy (2026)
Incident response	Triage, correlation, root-cause analysis, remediation, post-incident reports	L1–L3
Cloud cost (FinOps)	Rightsizing, idle-resource cleanup, commitment planning, anomaly detection	L2–L4
Kubernetes operations	Pod/node health, resource tuning, upgrade assistance, capacity planning	L2–L3
Database operations	Slow-query analysis, index advice, replication health, storage forecasting	L1–L3
Security operations	Misconfiguration detection, CVE triage, IAM hygiene, compliance evidence	L1–L2
Change & release	Pre-deploy risk analysis, canary monitoring, automated rollback	L2–L3
Infrastructure as Code	Drift detection, module generation, plan review, state hygiene	L1–L3
