Cloud chargeback works because the unit of cost is boring. A vCPU-hour costs what it costs. A seat is a seat. Consumption tracks linearly with headcount or provisioned capacity, and finance teams build allocation models that hold up quarter over quarter because the inputs don't surprise anyone.
Token-based AI spend breaks this. A single prompt template change can swing a team's weekly cost by 10x. An agentic workflow that ran fine on Tuesday can cost 30x more on Wednesday, same task, same inputs. The unit of cost exists — it's the token — but consumption per unit of work is volatile in ways no existing chargeback model was designed to absorb.
The FinOps Foundation now treats AI as a first-class budgetable "scope," meaning it gets its own dedicated governance treatment, separate from cloud infrastructure. Their annual survey tells the story in one trajectory: the percentage of practitioners managing AI spend went from 31% in 2024 to 63% in 2025 to 98% in 2026. That trajectory is a phase change. AI stopped being a planning exercise and became an operating cost that virtually every FinOps team now governs.
A quick note on vocabulary, because it matters in your account. "Token" in your world means an OAuth access token or a SAML assertion, a credential with a defined lifetime and a predictable cost profile. "Token" in AI means a chunk of text (roughly three-quarters of a word) that gets priced per unit consumed. When your buyer's finance team says "token budget," they mean the second one. When their security team says "token policy," they might mean either. The collision is real and you'll encounter it.
Recap:
- FinOps for AI: The organizational discipline of budgeting, allocating, and forecasting AI token spend, built because token consumption is volatile enough to break chargeback models designed for predictable cloud infrastructure.
- Scope: The FinOps Foundation's term for a segment of technology spending that gets its own governance. AI is now one, and 98% of FinOps practitioners are managing it.
Why the Variance Is Structural
The volatility comes from how LLMs process work, and it gets worse as systems get more capable. This is architecture, not early-tooling roughness that smooths out with maturity.
Take the simplest case. A single API call: user sends a prompt, model responds. Input tokens billed at one rate, output at another. If you know the prompt length and expected response size, the total is roughly predictable. This part feels like cloud billing. Manageable.
Add turns, and the economics shift. LLMs are stateless. Every API call in a multi-turn conversation resends the full conversation history as input. Turn one might send 5,000 input tokens. By turn fifteen, the model is carrying 25,000+ input tokens of accumulated context on every request. You're paying for turn three's context again on turns four through fifteen. If each turn adds 1,500 tokens of new context, by turn ten you've paid for that first turn's context ten times. Cost doesn't grow linearly with turns. It grows closer to quadratically, because every new turn pays for all prior turns' accumulated context.
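The arithmetic behind that paragraph can be sketched in a few lines. This is a minimal illustration using the figures above (5,000 tokens on turn one, 1,500 new tokens per turn, fifteen turns); the function name is hypothetical, and it assumes no caching and no context truncation.

```python
def input_tokens_per_turn(first_turn: int, added_per_turn: int, turns: int) -> list[int]:
    """Input tokens billed on each turn when the full history is resent every time."""
    return [first_turn + added_per_turn * i for i in range(turns)]


per_turn = input_tokens_per_turn(first_turn=5_000, added_per_turn=1_500, turns=15)
total_billed = sum(per_turn)

print(per_turn[-1])   # 26000 input tokens on turn fifteen alone
print(total_billed)   # 232500 input tokens billed across the session
```

The conversation only ever contains 26,000 tokens of distinct context, but the session is billed for 232,500 input tokens, roughly nine times as much, because earlier turns are paid for again on every later turn.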
Tool calls compound the problem further. Agentic systems that interact with external tools include the full set of tool schemas in every request. GitHub's analysis of its own agentic workflows found that 40 tools can add 10–15 KB of schema per turn, whether the agent uses two tools or twenty. That is pure overhead, repeated on every call.
Then retries. When an agent hits a failure, tries a fix, fails again, and tries another approach, each retry is a full round-trip at the current inflated context size. A session that runs twice as many turns can cost three to four times as much.
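To see where "three to four times" comes from, here is a rough sketch that compares a clean session with one whose retries double the turn count. The starting size and growth per turn reuse the illustrative figures from earlier in this section; `schema_tokens` is a hypothetical knob for fixed tool-schema overhead per request and defaults to zero.

```python
def session_input_tokens(turns: int, first_turn: int = 5_000,
                         added_per_turn: int = 1_500, schema_tokens: int = 0) -> int:
    """Total input tokens billed when every turn resends history plus tool schemas."""
    return sum(first_turn + schema_tokens + added_per_turn * i for i in range(turns))


clean_run = session_input_tokens(turns=15)
retry_run = session_input_tokens(turns=30)  # same task, but retries doubled the turns
ratio = retry_run / clean_run

print(clean_run)        # 232500
print(retry_run)        # 802500
print(round(ratio, 2))  # 3.45
```

Doubling the turns roughly triples and a half the bill under these assumptions, because the extra turns all happen at the largest context sizes.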
A Stanford Digital Economy Lab paper published in April 2026 (co-authored by Erik Brynjolfsson; this is a serious research group with a track record in technology economics, not a blog post) quantified the result: even running the same agent on the same task, costs varied by up to 30x across runs. The paper describes agent trajectories as "inherently stochastic." The agent can't predict its own spending because it can't know ahead of time how much context it will accumulate or how many retries it will need.
The same paper found agentic tasks consuming roughly 1,000x more tokens than standard chat interactions, with input tokens driving the cost. The expensive part is the context the agent carries, not the answers it generates. For budgeting purposes, that's the number that matters.
Recap:
- Context accumulation: Every turn in an agentic workflow resends all prior context as input tokens. Costs compound closer to quadratically than linearly because each new turn pays for every prior turn's context.
- Stochastic variance: Identical tasks can produce 30x cost differences across runs because agent trajectories are unpredictable. This is structural, not a tooling gap.
Token chargeback resembles per-user license allocation in structure: assign a cost unit to a consumer, set quotas per team, track consumption per use case, build reports. The analogy holds through the allocation step. It breaks at variance. A seat costs the same whether the user logs in once or a hundred times. A token budget can be consumed in a single agentic session that hits a retry loop. Your IDAM intuition about predictable per-unit costs will mislead you here: the unit exists, but consumption per unit of work is content-driven, not seat-driven.
How FinOps Governs What Chargeback Can't
The Foundation's 2025 Framework update introduced Scopes as a core concept, and their AI working group has been building out the specifics. The priority activities they identify: allocation, data ingestion, reporting, anomaly detection, and planning and forecasting.
Optimization didn't make the priority list yet. Their own survey data found that most teams are still trying to achieve basic visibility into AI spending. You can't optimize what you can't see, and most organizations can't see it yet. Worth knowing because your buyer may be at the visibility stage, not the optimization stage, and the conversation is different.
That visibility stage is where the distinction between chargeback and showback matters. Chargeback allocates actual costs to business units — it hits their budget. Showback makes costs visible to teams without billing them directly. Many organizations are starting with showback for AI precisely because the variance makes actual cost allocation politically fraught. If a team's agentic workflow hits a retry loop and burns through a month's token budget in a week, does that team eat the cost? Showback lets organizations build the attribution muscle before they answer that question. When your buyer says "cost visibility," they may mean showback, not chargeback. Knowing which one signals fluency.
The practical mechanisms the Foundation describes: resource quotas by team and use case, token proxies layered in front of APIs to identify the source and allocation of cost, prompt-template-level spend dashboards, and cache-hit-rate tracking. There's also provisioned throughput to account for: some organizations commit to reserved inference capacity at a fixed rate rather than paying per token. This creates its own showback problem. Is the team that justified the reserved capacity actually using it, or is it sitting idle while other teams queue for on-demand tokens? If you've ever walked a buyer through an enterprise license true-up, you recognize the shape of this conversation.
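As a rough picture of what a token proxy records, the sketch below keeps a per-team, per-use-case ledger and flags teams that exceed their quota. Everything here is hypothetical: the class name, team names, and quota sizes are made up for illustration, and a real proxy would persist this data and read token counts from the provider's usage metadata.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TokenLedger:
    """Hypothetical in-memory ledger a token proxy might keep for one billing period."""
    quotas: dict[str, int]  # token quota per team for the period
    used: dict[tuple[str, str], int] = field(default_factory=lambda: defaultdict(int))

    def record(self, team: str, use_case: str, input_tokens: int, output_tokens: int) -> None:
        # Attribute every request to the team and use case that generated it.
        self.used[(team, use_case)] += input_tokens + output_tokens

    def team_total(self, team: str) -> int:
        return sum(n for (t, _), n in self.used.items() if t == team)

    def over_quota(self) -> list[str]:
        return [team for team, quota in self.quotas.items() if self.team_total(team) > quota]


ledger = TokenLedger(quotas={"identity-eng": 1_000_000, "support": 500_000})
ledger.record("identity-eng", "policy-review", 200_000, 10_000)
ledger.record("support", "ticket-triage", 400_000, 50_000)
ledger.record("support", "agent-retry-loop", 120_000, 5_000)

print(ledger.team_total("support"))  # 575000
print(ledger.over_quota())           # ['support']
```

Whether an over-quota team is billed or simply shown the number is the showback-versus-chargeback decision; the ledger is the same either way.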
The Foundation's forecasting guidance makes an explicit point that usage optimization isn't just a cost lever. It prevents significant forecasting errors, because a small workflow change can blow a quarterly forecast when the underlying cost dynamics are this unpredictable.
The emerging unit of measurement: cost per task, or token budget per project. As the Foundation's agentic use cases paper frames it, understanding the API costs associated with running agents is becoming as important as understanding the cost of the underlying compute. This is the metric that makes AI spend governable, because it ties cost to business output rather than raw consumption.
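A cost-per-task figure can be computed from run-level usage records along these lines. This is a sketch under stated assumptions: the record fields are hypothetical, the prices ($3 and $15 per million input and output tokens) are placeholders rather than any provider's actual rates, and failed runs are counted in spend but not in the task count.

```python
def cost_per_task(runs: list[dict], input_price_per_mtok: float,
                  output_price_per_mtok: float) -> float:
    """Total spend across all runs divided by the number of completed tasks."""
    spend = sum(
        r["input_tokens"] * input_price_per_mtok / 1_000_000
        + r["output_tokens"] * output_price_per_mtok / 1_000_000
        for r in runs
    )
    completed = sum(1 for r in runs if r["completed"])
    if completed == 0:
        raise ValueError("no completed tasks in this period")
    return spend / completed


# Three hypothetical runs of the same task, including one failure.
runs = [
    {"input_tokens": 100_000, "output_tokens": 2_000, "completed": True},
    {"input_tokens": 3_000_000, "output_tokens": 10_000, "completed": True},
    {"input_tokens": 500_000, "output_tokens": 5_000, "completed": False},
]
unit_cost = cost_per_task(runs, input_price_per_mtok=3.00, output_price_per_mtok=15.00)
print(f"${unit_cost:.2f} per completed task")
```

Dividing by completed tasks rather than total runs keeps the cost of retries and failures inside the unit price, which is where a finance team will want to see it.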
Recap:
- Showback before chargeback: Most organizations start by making AI costs visible to teams without billing them, because variance makes hard allocation politically difficult.
- Visibility before optimization: Most teams are still building basic cost attribution. The Foundation says this is the right sequence.
- Cost per task: The emerging metric that ties token spend to business output, making AI budgets defensible.
The IAM Policy Review That Your Buyer Will Recognize
The FinOps Foundation's own forecasting paper includes a case study that lands squarely in territory you know.
A financial services organization had more than 60,000 custom IAM policies, each with multiple versions. The expertise to review them was scarce and expensive, and manual review was estimated to take years. They pointed a generative AI workflow at the problem: send each policy document to the model with a prompt requesting a summary and a flag for potential issues such as broad access grants and unexpected conditions. The model returned structured reports, which the workflow stored in a database and linked to static analysis with concrete remediation items. Stakeholders got clear action items sorted by urgency.
It worked. It was also expensive.
The optimization was an MD5 checksum. They hashed each policy and compared it to the previous week's hash. Only policies that had actually changed triggered the AI summarization workflow. Result: a 99% reduction in ongoing token usage. No smarter prompt, no better model. A hash comparison.
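The change-detection step can be sketched as follows. This is not the organization's actual code; the function names and the sample policies are hypothetical, and the only technique taken from the case study is hashing each policy with MD5 and comparing against the previous run's hash.

```python
import hashlib


def policy_digest(policy_text: str) -> str:
    """MD5 hex digest of a policy document, used only to detect changes."""
    return hashlib.md5(policy_text.encode("utf-8")).hexdigest()


def changed_policies(current: dict[str, str],
                     previous_digests: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Return IDs of new or modified policies, plus digests to store for next time."""
    changed, new_digests = [], {}
    for policy_id, text in current.items():
        digest = policy_digest(text)
        new_digests[policy_id] = digest
        if previous_digests.get(policy_id) != digest:
            changed.append(policy_id)
    return changed, new_digests


last_week = {"p1": '{"Action": "s3:GetObject"}', "p2": '{"Action": "s3:*"}'}
this_week = {"p1": '{"Action": "s3:GetObject"}', "p2": '{"Action": "s3:PutObject"}',
             "p3": '{"Action": "iam:PassRole"}'}

_, stored = changed_policies(last_week, {})   # first run: everything counts as changed
to_review, stored = changed_policies(this_week, stored)
print(to_review)  # ['p2', 'p3'] -> only these go to the model
```

MD5 is used here purely as a change detector, not as a security control; SHA-256 from the same module would be a drop-in substitute where MD5 is disallowed.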
The forecasting implications are worth pausing on. Before the optimization, monthly token consumption was the full corpus times the review frequency. After, it was proportional to the change rate of the policies. That's a fundamentally different cost curve. It's the kind of change that makes a quarterly AI budget defensible instead of speculative.
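The two cost curves can be compared with simple arithmetic. In this sketch, only the 60,000-policy count comes from the case study; the tokens per policy, the weekly cadence, and the 600 changed policies per review (a 1% change rate, chosen to reproduce the reported 99% reduction) are illustrative assumptions.

```python
def monthly_tokens(policies_processed: int, tokens_per_policy: int,
                   reviews_per_month: int) -> int:
    """Tokens consumed per month for a given number of policies sent to the model."""
    return policies_processed * tokens_per_policy * reviews_per_month


TOKENS_PER_POLICY = 2_000   # assumed average, prompt plus response
REVIEWS_PER_MONTH = 4       # assumed weekly cadence

before = monthly_tokens(60_000, TOKENS_PER_POLICY, REVIEWS_PER_MONTH)  # full corpus
after = monthly_tokens(600, TOKENS_PER_POLICY, REVIEWS_PER_MONTH)      # changed only
reduction = 1 - after / before

print(before)              # 480000000
print(after)               # 4800000
print(f"{reduction:.0%}")  # 99%
```

After the optimization, the forecast input is the policy change rate, which a security team can estimate from its own change history, rather than the size of the corpus.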
The pattern is portable. Your buyer's security team has workflows shaped exactly like this: policy review, entitlement analysis, access certification, compliance reporting against large document sets. Federal environments especially — think ATO documentation review cycles, continuous monitoring against NIST frameworks, large policy corpuses that change incrementally. The naive AI implementation processes everything every time. The biggest cost lever in this workflow turned out to be a traditional software engineering technique applied to the workflow surrounding the prompts, rather than to the prompts themselves.
Recap:
- The IAM case study: 60,000+ custom IAM policies reviewed by AI, then optimized to 99% token reduction by processing only changed policies via MD5 checksums.
- The portable lesson: The biggest AI cost lever was a software engineering technique applied to the workflow surrounding the prompts. The pattern maps directly to federal compliance workflows.
Prompt Caching as a Finance Metric
Prompt caching stores large portions of a prompt that stay identical across requests, so the model reuses prior computation instead of reprocessing from scratch. Both major providers offer it, with meaningfully different economics.
Anthropic's implementation is explicit: developers mark which content blocks to cache. OpenAI's implementation is automatic: no code changes required, the system attempts to route requests to cached prefixes. Both require a minimum cacheable prefix, typically 1,024 tokens, though some Anthropic models set a higher minimum.
| | Anthropic | OpenAI |
|---|---|---|
| Mechanism | Explicit (developer marks cache blocks) | Automatic (system routes to cached prefixes) |
| Cache read discount | 90% reduction (0.1x base input price) | 50%+ reduction, deeper on newer models at high hit rates |
| Write premium | 1.25x (5-min TTL) to 2x (1-hr TTL) | None (automatic) |
| Observed hit rate | ~100% when explicitly requested | ~50% on immediate resends |
| Tradeoff | Predictable savings, requires design effort | Lower implementation cost, less predictable savings |
The hit-rate observations come from a practitioner comparison by ngrok, an API gateway provider with direct observability into request routing patterns. These numbers will shift as both providers iterate, but the architectural difference (explicit vs. automatic) reflects genuinely different design philosophies.
For FinOps purposes, the difference is control versus convenience. Anthropic gives you predictable cache behavior and higher hit rates when you design for it. OpenAI gives you lower implementation cost but less predictable savings.
At production scale, this becomes a CFO line item. Systems with stable system prompts, consistent document retrieval patterns, and repetitive query structures can cache 70%+ of input tokens. For an organization running billions of tokens monthly, the difference between cached and uncached processing is the difference between a defensible budget line and an unpredictable cost center. The FinOps Foundation's forecasting guidance explicitly identifies caching as infrastructure that strengthens the foundation on which forecasts are built.
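To put rough numbers on that, the sketch below estimates monthly input spend at a given cached fraction. It assumes the 0.1x cache-read multiplier from the table, a placeholder base price of $3 per million input tokens, and a steady state in which cache-write premiums are small enough to ignore; the function name is hypothetical.

```python
def effective_input_cost(total_input_tokens: int, cached_fraction: float,
                         base_price_per_mtok: float, read_multiplier: float = 0.1) -> float:
    """Input spend when a fraction of tokens is served from cache at a discount."""
    cached = total_input_tokens * cached_fraction
    uncached = total_input_tokens - cached
    billable = uncached + cached * read_multiplier  # cached tokens cost 0.1x
    return billable * base_price_per_mtok / 1_000_000


uncached_cost = effective_input_cost(1_000_000_000, 0.0, base_price_per_mtok=3.00)
cached_cost = effective_input_cost(1_000_000_000, 0.7, base_price_per_mtok=3.00)

print(f"${uncached_cost:,.0f} without caching")
print(f"${cached_cost:,.0f} at a 70% cached fraction")
```

Under these assumptions, a billion input tokens drops from about $3,000 to about $1,110, a 63% reduction, and the saving moves directly with the hit rate, which is why the hit rate is worth forecasting.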
Cache-hit rate becomes a KPI. A finance KPI, tracked at the CFO level.
Recap:
- Prompt caching: Reuses previously computed prompt prefixes. Anthropic offers 90% reduction with explicit control; OpenAI offers 50%+ reduction automatically.
- CFO visibility: At production scale, cache-hit rate directly determines whether AI spend is forecastable. The FinOps Foundation treats it as a forecasting input.
If you've worked with token refresh flows in OAuth, you understand the economics of avoiding unnecessary recomputation: a cached token avoids a round-trip to the authorization server. Prompt caching follows the same logic at a different scale. The analogy holds on the mechanism — reuse prior work to avoid redundant processing. It breaks on cost magnitude. A missed OAuth token cache costs milliseconds of latency. A missed prompt cache on a 100,000-token context window costs real money, multiplied by every request in the batch. Cache governance in AI is a budget control, full stop. The cost magnitude dwarfs the latency concern.
What This Means in Your Account
When your buyer's CAIO or CISO raises AI cost management, they're dealing with a specific problem: their existing cloud chargeback model assumes consumption tracks predictably with provisioned resources or headcount, and AI breaks that assumption.
The vocabulary you'll hear: scopes, token quotas, showback dashboards, cache-hit rates, cost-per-task unit economics. These are FinOps concepts being extended to AI, and the Foundation has formalized the framework. Your buyer's finance team is almost certainly already involved. If they're not, the CFO will be soon, and the conversation often starts with a CFO pushing back on an AI pilot because they can't forecast the cost. That's the moment where knowing the showback-before-chargeback sequence, understanding cache-hit rates as forecasting inputs, and being able to reference cost-per-task unit economics makes you genuinely useful in the room. You're helping them see a path from "we can't predict this" to "we can govern this."
The identity connection is real but specific. Don't force it. Token quotas need to be allocated by team and use case, which means they need to be tied to identity. Cost attribution requires knowing which user, service, or agent generated the consumption. Access controls around resource allocation are governance mechanisms that sit in the identity layer. When the buyer describes their AI cost governance challenge, the identity dependency surfaces naturally in the conversation. You don't need to introduce it.
For public sector accounts, the moments when this conversation becomes acute are predictable: ATO renewal cycles where AI workloads need cost justification, budget formulation windows where the buyer needs defensible spend projections, and any procurement conversation where a CFO is comparing the cost profile of an AI capability against the manual process it replaces. Knowing the FinOps vocabulary and the structural reasons AI costs behave differently gives you standing in those conversations.
The question to listen for: "How do we attribute AI costs to the right team when consumption is this unpredictable?" That's a FinOps question with an identity dependency, and it's where your expertise is genuinely useful.
Things to follow up on...
- GitHub's tool schema overhead: GitHub published a detailed engineering analysis of how tool definitions inflate token costs in agentic workflows, with production data showing that unused MCP tool schemas can add 10–15 KB of context per turn.
- Stanford's 30x variance paper: The Stanford Digital Economy Lab's April 2026 paper on agentic token consumption found that models consistently underestimate their own token spend, which has direct implications for any automated budgeting or forecasting system.
- FinOps Foundation AI certification: The Foundation launched FinOps Certified: FinOps for AI, covering AI cost allocation, anomaly detection, unit economics, and workload optimization as a formal practitioner credential your buyer's finance team may be pursuing.
- Anthropic's ZDR-compatible caching: Anthropic's documentation confirms that prompt caching is ZDR-eligible, with KV cache representations held in memory only and not stored at rest, which matters for public sector accounts where data retention constraints intersect with cost optimization.

