Chapter 8 · Measuring What Matters

Agentic operations succeeds or fails on evidence. Instrument the program like the production system it is.

8.1 The headline outcomes

Across published deployments and industry research, four outcome ranges recur for teams that implement agentic incident response with discipline:

Metric	Documented range	Driver
MTTR reduction	40–70% (vendor pilots report up to 75%)	Automated investigation and pre-approved remediation collapse the diagnosis phase, which consumes most of incident time
Alert volume reduction	80–90%	Correlation, deduplication, and symptom suppression before any human sees a page
Toil reduction	30–50% of L1/L2 operational work	Triage, routine remediation, evidence gathering, and reporting absorbed by agents
Cloud cost savings	10–30% of optimizable spend	Continuous rightsizing, idle cleanup, and commitment management instead of quarterly reviews

Treat these as benchmarks to verify, not promises to assume. MTTR gains in particular vary with implementation maturity and data quality — the documented pattern is that noise reduction lands first and most consistently, root-cause acceleration second, and autonomous remediation last.

Figure 8 — Documented outcome ranges across disciplined deployments, 2025–2026. Verify against your own baseline.

The named data points behind the ranges: AWS reports preview customers of DevOps Agent seeing up to 75% lower MTTR, 80% faster investigations, and 94% root-cause accuracy, with WGU publicly describing a two-hour investigation cut to 28 minutes; Microsoft reports 35,000+ incidents mitigated and 20,000+ engineering hours saved running 1,300+ SRE agents on its own services. These are vendor and first-party numbers — the strongest currently published, and the right ones to pressure-test in your own pilot rather than accept on faith.

8.2 The operating dashboard

A production agentic program runs on roughly eight KPIs, reviewed monthly with the on-call rotation:

MTTD and MTTR — detection and recovery time, trended by severity, with agent-handled and human-handled incidents separated.
Autonomous resolution rate — share of incidents resolved with no human action. The single best maturity indicator.
Recommendation acceptance rate — share of agent proposals approved unmodified. Below ~70%, agent judgment or policy needs tuning; above ~95%, the approval gate is theater and the action class should graduate.
Rollback / intervention rate — agent actions that had to be reversed or overridden. The safety counterweight to autonomy growth.
Repeat-incident rate — whether the system is learning. Falling repeats mean memory and problem management are working.
Runbook / coverage ratio — share of incident classes the agent team can handle at L2 or above.
On-call load — pages per engineer per week, and after-hours pages specifically. The human-experience metric leadership feels.
Agent unit cost — model and platform spend per incident resolved and per service covered. Agent economics are a first-class architectural concern; track them from day one.
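Several of these KPIs are plain ratios over monthly counts. The following is a minimal sketch, assuming a hypothetical MonthlyCounts roll-up whose field names are invented for illustration. The 70% and 95% cut-offs are the approximate thresholds given above for recommendation acceptance rate; the denominator chosen for unit cost (incidents the agent resolved unaided) is an assumption to adjust to your own definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonthlyCounts:
    """Hypothetical monthly roll-up; field names are illustrative, not a standard schema."""
    incidents: int
    resolved_without_human: int
    proposals: int
    proposals_approved_unmodified: int
    agent_actions: int
    agent_actions_reversed: int
    agent_spend_usd: float  # model and platform spend for the month


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def read_acceptance(rate: float, low: float = 0.70, high: float = 0.95) -> str:
    """Apply the approximate ~70% / ~95% reading of recommendation acceptance rate."""
    if rate < low:
        return "tune agent judgment or policy"
    if rate > high:
        return "approval gate is theater; graduate the action class"
    return "gate is doing useful work"


def dashboard(c: MonthlyCounts) -> dict:
    """Compute four of the ratio-style KPIs from one month of counts."""
    return {
        "autonomous_resolution_rate": ratio(c.resolved_without_human, c.incidents),
        "acceptance_rate": ratio(c.proposals_approved_unmodified, c.proposals),
        "rollback_rate": ratio(c.agent_actions_reversed, c.agent_actions),
        # Assumption: unit cost is spread over incidents resolved unaided.
        "cost_per_resolved_incident": ratio(c.agent_spend_usd, c.resolved_without_human),
    }
```

Trending these month over month, split by severity, matters more than any single reading; the time-based KPIs (MTTD, MTTR) and on-call load need timestamps and paging data that this sketch leaves out.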

8.3 Building the business case

The ROI model has three lines.

First, downtime avoided: multiply your incident frequency by your cost per minute of downtime by the MTTR reduction, in minutes per incident, that you validate in pilot. For the cost input, use your own finance team’s number if you have one; if you don’t, use the Splunk/Oxford Economics 2026 downtime benchmark cited in Chapter 1 (materially higher for payment and trading systems) as the most defensible external anchor.

Second, toil converted: hours of L1/L2 work absorbed by agents, valued at loaded engineering cost. This is typically the largest line in talent-constrained markets.

Third, cloud waste recovered: continuous optimization against the 20–30% of spend most organizations privately acknowledge is wasted.

Against these, count platform subscription, model usage, and the engineering time to integrate and govern, counted honestly and including the autonomy-policy owner’s time. Disciplined deployments typically reach payback within two to three quarters, with the toil line alone often covering the platform cost. If your model needs the downtime line to work, your pilot domain is wrong; pick one where toil savings carry the case and downtime is upside.
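The three benefit lines and the cost side can be laid out as a small calculation. This is a sketch under stated assumptions: the RoiInputs field names are hypothetical, all figures are annual, and the MTTR reduction is expressed as minutes saved per incident. It contains no benchmark values; every input is yours to supply.

```python
from dataclasses import dataclass


@dataclass
class RoiInputs:
    """Annual inputs you supply; field names are illustrative assumptions."""
    incidents_per_year: float
    mttr_minutes_saved_per_incident: float  # validated in pilot
    cost_per_downtime_minute: float
    toil_hours_absorbed: float
    loaded_hourly_cost: float
    cloud_waste_recovered: float
    platform_subscription: float
    model_usage: float
    integration_and_governance: float  # include the autonomy-policy owner's time


def benefits(x: RoiInputs) -> dict:
    """The three benefit lines of the ROI model."""
    return {
        "downtime_avoided": (
            x.incidents_per_year
            * x.cost_per_downtime_minute
            * x.mttr_minutes_saved_per_incident
        ),
        "toil_converted": x.toil_hours_absorbed * x.loaded_hourly_cost,
        "cloud_waste_recovered": x.cloud_waste_recovered,
    }


def total_cost(x: RoiInputs) -> float:
    """Platform, model usage, and integration/governance time."""
    return x.platform_subscription + x.model_usage + x.integration_and_governance


def summary(x: RoiInputs) -> dict:
    b = benefits(x)
    cost = total_cost(x)
    without_downtime = b["toil_converted"] + b["cloud_waste_recovered"]
    return {
        "net_annual_benefit": sum(b.values()) - cost,
        "toil_covers_platform": b["toil_converted"] >= x.platform_subscription,
        # True means the case only works with the downtime line: a warning sign.
        "needs_downtime_line": without_downtime < cost,
    }
```

If needs_downtime_line comes back true for a candidate pilot domain, that is the signal to pick a different domain, one where toil savings carry the case.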

8.4 The unit economics of an agent

Chapter 4 argued that two-tier sensing is what makes 24/7 agentic coverage economically viable. This section makes that concrete, because “it pays for itself” is exactly the kind of hand-wave this book tells buyers to reject. The cost of an agentic program is dominated by model inference, and inference cost is dominated by where the expensive reasoning runs. The discipline is to keep cheap perception always-on and reserve frontier reasoning for the few signals that warrant it. Read a single incident as a cost object. An always-on “pulse” performs lightweight perception across every signal at low cost; a heavyweight resolver performs multi-step reasoning only when the pulse finds something worth investigating. In a representative split, sensing accounts for a small share of per-incident spend, triage a little more, and deep resolution the majority — but deep resolution only fires on the minority of signals that survive triage. A worked illustration: an environment emitting on the order of fifty signals a day, where most are dispatched by pulse-only perception and only a handful escalate to a full resolver run, keeps the expensive tier’s duty cycle low while still covering every signal. The exact ratios are yours to measure; the architecture is what makes the ratio favourable.
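To see why the duty cycle of the expensive tier dominates, the per-day cost can be written out. The sketch below is illustrative only: the tier names follow the text, but the TierCosts fields and any prices you plug in are hypothetical placeholders, not measured figures.

```python
from dataclasses import dataclass


@dataclass
class TierCosts:
    """Hypothetical per-call inference costs, in any one currency unit."""
    pulse_per_signal: float   # cheap, always-on perception
    triage_per_signal: float  # intermediate check on flagged signals
    resolver_per_run: float   # multi-step frontier reasoning


def two_tier_daily_cost(signals: int, triaged: int, escalated: int, c: TierCosts) -> float:
    """Every signal is sensed; only some are triaged; fewer reach the resolver."""
    if not (0 <= escalated <= triaged <= signals):
        raise ValueError("expected 0 <= escalated <= triaged <= signals")
    return (
        signals * c.pulse_per_signal
        + triaged * c.triage_per_signal
        + escalated * c.resolver_per_run
    )


def resolver_everywhere_daily_cost(signals: int, c: TierCosts) -> float:
    """The anti-pattern: frontier reasoning on every noisy signal."""
    return signals * c.resolver_per_run


def resolver_duty_cycle(signals: int, escalated: int) -> float:
    """Share of signals that trigger a full resolver run."""
    return escalated / signals if signals else 0.0
```

With fifty signals a day and a handful of escalations, the resolver term still tends to be the largest component of the two-tier total, yet the total stays well below the resolver-everywhere figure as long as the duty cycle is low. Substitute your own measured per-call costs before drawing conclusions.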

Figure 12 — Where the tokens go in one incident: cheap perception runs on every signal; expensive reasoning runs only on the few that survive triage. Track cost per resolved incident from day one.

This is why “agent cost per resolved incident” and “agent cost per covered service” are first-class KPIs in the Section 8.2 dashboard, not afterthoughts: an architecture that runs frontier reasoning on every noisy signal erases its own ROI before the first renewal, and you will not see it happen unless you are tracking the per-incident number from day one.

The commercial structure can be aligned to this reality rather than fighting it. An outcome-based option (a fee set as a share of verified savings, with nothing owed if no savings are found) ties the vendor’s revenue to the customer’s realised benefit and removes the incentive to run up inference cost for its own sake. As a concrete instance: a cost-optimisation engagement priced at 50% of verified annual savings, billing zero if it finds nothing, makes the interest alignment total. Whatever the structure, the figure that belongs in your business case is the one you measure in your own environment against the 90-day baseline of Chapter 9, not a headline from anyone’s slide, including this book’s.

8.5 Evaluating an operations agent before you trust it

The book has called agentic operations “self-verifying” since Chapter 3 without showing what a verification step concretely checks, or how you would catch a confidently-wrong remediation before it reaches production. This section closes that gap, because the engineers the Foreword promises to serve are precisely the ones who will have to build and run this harness. An agent evaluation harness has four parts. First, a scenario library: recorded real incidents plus deliberately injected faults, each with a known correct outcome — the right root cause, the safe action, the clean rollback. Second, a shadow run: the agent acts against a sandbox mirror of production, never production itself, so its proposed actions can be observed without consequence. Third, scoring against ground truth: did it reach the correct root cause, did it choose a safe action, would its rollback have worked? Fourth, a regression gate: a behaviour change versus the previous agent version must pass the library before it ships, the same discipline a team applies to any other production code.
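The four parts map onto a small amount of code. What follows is a minimal sketch, not a reference implementation: Scenario, AgentResult, and the agent callable are hypothetical names, and the callable stands in for a shadow run against a sandbox mirror, which a real harness would have to provision.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Scenario:
    """One library entry: a recorded or injected fault with a known answer."""
    name: str
    expected_root_cause: str
    safe_actions: FrozenSet[str]


@dataclass(frozen=True)
class AgentResult:
    """What the agent did in the sandbox; labels are illustrative."""
    root_cause: str
    action: str
    rollback_ok: bool  # whether the rollback worked when exercised in the sandbox


# Stand-in for "run this agent version against the sandbox mirror".
Agent = Callable[[Scenario], AgentResult]


def score(scenario: Scenario, result: AgentResult) -> bool:
    """Score against ground truth: right cause, safe action, working rollback."""
    return (
        result.root_cause == scenario.expected_root_cause
        and result.action in scenario.safe_actions
        and result.rollback_ok
    )


def regression_gate(library: List[Scenario], candidate: Agent) -> List[str]:
    """Return the names of failed scenarios; an empty list means the version may ship."""
    return [s.name for s in library if not score(s, candidate(s))]
```

Returning the failing scenario names rather than a bare pass/fail keeps the gate useful as a diagnostic: a new agent version that regresses on one incident class shows exactly which one.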

Figure 13 — An agent evaluation harness: a scenario library, a shadow run against a sandbox mirror, scoring against ground truth, and a regression gate before any version ships.

Non-determinism is the part that surprises teams coming from deterministic automation: the same scenario can yield different agent behaviour on different runs. The harness handles this by running each scenario many times and scoring the distribution of outcomes, not a single pass — a fix that is correct 70% of the time is a different risk profile from one that is correct 99% of the time, and only a distribution reveals which you have. A concrete verification check makes this tangible: for the database-connection-exhaustion scenario, the harness asserts that the agent identifies connection-pool exhaustion as the root cause, that its action restores the correct connection limit rather than merely restarting the service, and that post-action the connection count returns to baseline and stays there — the same DARV “Validate” step the agent runs in production, run here against a known answer. An agent that cannot pass its own verification check in the harness has no business running it unattended in production.
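Scoring a distribution rather than a single pass can be sketched as follows for the connection-exhaustion check described above. The RunOutcome fields, the string labels, and the stub agent are all hypothetical; the stub merely simulates a non-deterministic agent so the harness logic can be exercised without a real sandbox.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunOutcome:
    """What one sandbox run produced; field names and labels are hypothetical."""
    root_cause: str
    action: str
    connections_after: List[int]  # connection-count samples taken after the action


def verify(outcome: RunOutcome, baseline_max: int) -> bool:
    """The three assertions: right cause, right fix, count back at baseline and staying there."""
    return (
        outcome.root_cause == "connection_pool_exhaustion"
        and outcome.action == "restore_connection_limit"
        and bool(outcome.connections_after)
        and all(n <= baseline_max for n in outcome.connections_after)
    )


def pass_rate(run_once: Callable[[], RunOutcome], runs: int, baseline_max: int) -> float:
    """Run the same scenario many times and return the share of passing runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(verify(run_once(), baseline_max) for _ in range(runs))
    return passed / runs


def make_stub_agent(p_correct: float, seed: int = 0) -> Callable[[], RunOutcome]:
    """Simulated agent that picks the correct fix with probability p_correct."""
    rng = random.Random(seed)

    def run_once() -> RunOutcome:
        if rng.random() < p_correct:
            return RunOutcome("connection_pool_exhaustion", "restore_connection_limit", [40, 42, 41])
        # Right diagnosis, wrong fix: a restart that lets the count climb again.
        return RunOutcome("connection_pool_exhaustion", "restart_service", [40, 95, 120])

    return run_once
```

A gate built on this might require the pass rate to clear a threshold such as 0.99 before the action class runs unattended; that threshold is an assumption to set per action class, and the number of runs needs to be large enough for the estimate to be meaningful.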

MEASUREMENT PRINCIPLE
Baseline before you deploy. The single most common business-case failure is having no credible “before”: capture 90 days of MTTR, alert volume, page counts, and toil hours before the first agent touches production, or you will be arguing from anecdotes forever.
