Frameworks
Measuring AI ROI Without Lying to Yourself
A four-tier KPI framework for teams who need the numbers to survive an audit.
The ROI number that nobody believes
The ROI slide in a typical enterprise AI steering committee is ceremonial. It is produced by a team whose bonus depends on it being large, reviewed by a CFO who knows it cannot be right, and approved by a board that does not want to be the one that killed the AI program. Everyone knows. Nobody says. The result: one of the most consequential spending categories in modern enterprise is governed by numbers that a first-year auditor could puncture. That is a governance problem disguised as a measurement problem. The framework we use solves the measurement problem first.
Four tiers, not one
The core mistake in AI ROI reporting is collapsing four very different types of value into a single number. They compound on different timelines and require different evidence. We separate them explicitly.
Tier 1 — Direct cost displacement
The only tier that can be audited against a general ledger. It answers: what did we stop paying for? Contract renewals that did not happen, seats that were removed, BPO minutes that were not invoiced, overtime that was not paid. The rule: if it is not a line item that changed on an invoice or a payroll run, it is not Tier 1. A 15% productivity improvement is not Tier 1. A cancelled $340,000 Zendesk contract is Tier 1. This is the tier that survives an audit. It is also almost always the smallest of the four, which is why teams are tempted to inflate the others.
Tier 2 — Direct revenue attribution
Revenue that can be traced, by a specific mechanism, to the AI system. "Our conversion rate went up after we launched" is not attribution. "Of 12,400 conversions last month, 3,100 included the agent interaction and converted at 2.3x the control" is attribution. This tier requires experimental design, usually a holdout or a geographic A/B. Teams skip it because it is hard, slow, and might produce an unflattering answer — which is exactly why it is worth doing.
We borrow here from Douglas Hubbard's *How to Measure Anything*. The right question is not "what is the precise ROI?" but "is the expected value greater than the cost, and with what confidence?" A Tier 2 program does not need to prove 40x ROI. It needs to prove positive expected value at 85% confidence. That is a much easier bar.
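Hubbard's bar, positive expected value at a stated confidence, can be checked with a small simulation over holdout results. A minimal sketch in Python; the function name, arm sizes, revenue figure, and cost are all hypothetical, not client data:

```python
import random

def prob_positive_ev(treat_conv, treat_n, ctrl_conv, ctrl_n,
                     revenue_per_conv, monthly_cost,
                     n_sims=20_000, seed=7):
    """Probability that incremental revenue beats cost, via a simple
    beta-posterior simulation of the two arms' conversion rates.
    (Illustrative sketch; all inputs are hypothetical.)"""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        # Draw each arm's conversion rate from a Beta(successes+1, failures+1) posterior
        p_t = rng.betavariate(treat_conv + 1, treat_n - treat_conv + 1)
        p_c = rng.betavariate(ctrl_conv + 1, ctrl_n - ctrl_conv + 1)
        incremental = (p_t - p_c) * treat_n * revenue_per_conv
        if incremental > monthly_cost:
            wins += 1
    return wins / n_sims
```

The question the board then needs answered is a single comparison: does this probability clear the stated bar (say, 0.85) for the program's actual figures?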
Tier 3 — Reallocated capacity
This is where most AI ROI decks live, and where most of the lying happens. It answers: what are the humans doing now that they were not doing before? The correct way to measure Tier 3 is to require that the reallocated capacity produce a Tier 1 or Tier 2 outcome downstream. If the AI saved the support team 400 hours and those hours produced a measurable increase in first-contact resolution that reduced escalations that reduced BPO spend, that chain is a Tier 3 argument that eventually becomes Tier 1. The incorrect way, multiplying saved hours by a fully loaded hourly rate and calling it ROI, is ubiquitous, and it monetizes a counterfactual that never happened. We allow Tier 3 claims only when paired with a written downstream hypothesis and a test date.
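The pairing rule can be enforced mechanically: a Tier 3 claim that arrives without a downstream hypothesis, or that does not link to a Tier 1 or Tier 2 outcome, is simply rejected. A sketch, with hypothetical field names:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Tier3Claim:
    """A reallocated-capacity claim, valid only when paired with a written
    downstream hypothesis and a date by which it will be tested.
    (Field names are illustrative, not a standard schema.)"""
    hours_saved: float
    team: str
    downstream_hypothesis: str  # the Tier 1/2 outcome these hours should produce
    test_date: date
    linked_tier: int            # 1 or 2: where the value should eventually land

    def __post_init__(self):
        if not self.downstream_hypothesis.strip():
            raise ValueError("Tier 3 claim rejected: no downstream hypothesis")
        if self.linked_tier not in (1, 2):
            raise ValueError("Tier 3 must link to a Tier 1 or Tier 2 outcome")
```

The point is not the dataclass; it is that the claim cannot exist in the ledger without the hypothesis and the test date attached.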
Tier 4 — Strategic optionality
The hardest to measure and often the most real. It answers: what can we now do that we could not do before? A new product category becomes viable, a new customer segment addressable, a response to a competitor possible on a timeline that would not have existed otherwise. We measure Tier 4 by enumerated optionality — a written list of specific bets the organization is now in position to take, with a rough probability and a rough payoff. This produces a range, not a number, and it is honest about being a range. Nassim Taleb's framing of "being right versus being sensible" applies: you do not need a precise Tier 4 number, you need to articulate what the optionality is and what it would be worth if realized.
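Enumerated optionality reduces to a short list of (bet, rough probability, rough payoff) entries and a range rather than a point estimate. A minimal sketch, with invented bets:

```python
def optionality_range(bets):
    """Combine enumerated Tier 4 bets into a range, not a single number.
    Each bet is (description, rough_probability, rough_payoff_usd).
    Low assumes nothing pays off; high assumes everything does."""
    expected = sum(p * payoff for _desc, p, payoff in bets)
    high = sum(payoff for _desc, _p, payoff in bets)
    return 0.0, expected, high

# Hypothetical enumerated bets, not client figures
bets = [
    ("launch self-serve tier for SMB segment", 0.2, 1_000_000),
    ("match competitor response time in enterprise deals", 0.5, 400_000),
]
low, ev, high = optionality_range(bets)  # (0.0, 400000.0, 1400000.0)
```

Reporting all three numbers keeps the tier honest about being a range.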
The report card
We report each tier separately on every engagement. The format is stable by design:

- Tier 1: YTD dollars, audited against the GL (high confidence)
- Tier 2: YTD revenue, with attribution method (medium-high confidence)
- Tier 3: hours, mapped to downstream hypotheses with linkage status (medium confidence)
- Tier 4: enumerated bets, with combined expected value at a stated probability (ranged)

This is less flattering than a blended number and considerably harder to dispute. Boards uniformly prefer it after the second quarter because it shows which tiers are strengthening and which are weakening, which a blended number hides.
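Captured as data, the report card is four entries, each carrying its own evidence standard and confidence label. Every key name and figure below is a placeholder, not a client result:

```python
# One engagement's quarterly report card. (All figures are invented.)
report_card = {
    "tier_1_cost_displacement": {
        "ytd_usd": 520_000, "evidence": "audited against GL", "confidence": "high"},
    "tier_2_revenue_attribution": {
        "ytd_usd": 310_000, "evidence": "holdout A/B", "confidence": "medium-high"},
    "tier_3_reallocated_capacity": {
        "hours": 1_400, "evidence": "downstream hypotheses, linkage status", "confidence": "medium"},
    "tier_4_strategic_optionality": {
        "ev_usd_range": (0, 2_100_000), "evidence": "enumerated bets", "confidence": "ranged"},
}
```

Keeping the confidence label attached to each entry is what makes the quarter-over-quarter comparison meaningful: a tier cannot quietly upgrade its own evidence standard.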
The three things that almost always go missing
Even organizations that adopt the four-tier framework miss three things on the first pass.

- Counterfactual baselines: you cannot claim a Tier 1 saving without documenting what you were paying before.
- Cost of the measurement itself: observability, eval infrastructure, and human review belong on the cost side of the ledger. We have seen clients spend 18% of their inference budget on measurement.
- Negative value: agents sometimes produce bad outcomes that cost money, such as refunds to the wrong customer, escalations from bad handoffs, and brand risk. These belong on the ledger too.

An ROI model that reports only the wins stops being believable the first time leadership sees an incident that did not appear on it.
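The second and third gaps are ledger-shape problems: measurement cost and negative outcomes sit on the cost side, not in a footnote. A minimal sketch of a period ledger that refuses to report only the wins (all names and figures hypothetical):

```python
def period_ledger(tier1_savings, tier2_revenue, measurement_costs, incident_costs):
    """Net value for one reporting period. Observability, eval infrastructure,
    and human review land on the cost side; incidents (wrong refunds, bad
    handoffs) are negative line items. (Illustrative, not the firm's model.)"""
    gross = sum(tier1_savings) + sum(tier2_revenue)
    costs = sum(measurement_costs) + sum(incident_costs)
    return {"gross": gross, "costs": costs, "net": gross - costs}

ledger = period_ledger(
    tier1_savings=[340_000],      # cancelled contract
    tier2_revenue=[120_000],      # attributed incremental revenue
    measurement_costs=[45_000],   # evals, observability, human review
    incident_costs=[12_000],      # refunds, escalations from bad handoffs
)  # → {'gross': 460000, 'costs': 57000, 'net': 403000}
```

A model structured this way cannot silently omit an incident: leaving it off the ledger is a visible gap, not a default.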
The honesty dividend
Clients who adopt this framework report a counterintuitive outcome: their AI budgets survive renewal cycles better, not worse. The inflated slide is the one that gets cut. The quieter, defensible four-tier report is the one that gets expanded. The larger point is that AI programs are not failing to produce value. They are failing to report it credibly. The specific taxonomy matters less than the discipline of separating the tiers.
If your ROI slide has stopped being trusted by your CFO, the path back is not a better number. It is a better framework. [Cadence Advisors Group can help you build one](/contact).