Your cloud chargeback model was built around a stable assumption: cost follows capacity. You provision a VM, a block storage volume, a database instance. The meter runs on what you reserved, and what you reserved doesn't change unless someone files a ticket. Finance can forecast it. Engineering can be accountable for it. The model works because the unit of cost is a resource with a relatively stable size.
Tokens are not that.
A token is a unit of content, not capacity. Its cost is a function of what gets sent to the model and what comes back — and both of those are determined at runtime, by prompt templates that engineers can change in an afternoon without touching infrastructure. The FinOps Foundation's AI/ML cost management guidance, published in 2024, names this directly: token-based pricing creates a cost surface that is "highly sensitive to application-layer decisions," which is a polite way of saying that a single pull request can detonate your monthly budget without triggering any of the alerts your existing tooling was built to catch.
The Variance Problem Is Structural
A team ships a customer-facing summarization feature. The initial prompt template has a 150-token system prompt. Three weeks later, a developer improves output quality by expanding the system prompt to 1,800 tokens, adding examples and formatting instructions. The feature call volume hasn't changed. The infrastructure footprint hasn't changed. But the system prompt is now twelve times larger, so every call carries 1,650 extra input tokens. How far that multiplies the bill depends on how large the rest of each request is; on short requests, where the system prompt dominates, the per-call cost can rise several-fold. Because the cost shows up as a single line item under the team's API spend, nobody in finance sees the prompt template change that caused it. They see a number that is a multiple of what it was last month.
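The arithmetic behind this scenario can be sketched in a few lines. Only the 150-token and 1,800-token system prompt sizes come from the example above; the per-token prices and the user and output token counts are illustrative assumptions, not figures from any provider.

```python
# Illustrative per-call cost model. The prices and the user/output token
# counts are assumptions; only 150 and 1,800 come from the scenario above.
INPUT_PRICE_PER_1K = 0.003   # assumed USD per 1,000 input tokens
OUTPUT_PRICE_PER_1K = 0.015  # assumed USD per 1,000 output tokens


def per_call_cost(system_tokens: int, user_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call under the assumed prices."""
    input_cost = (system_tokens + user_tokens) / 1000 * INPUT_PRICE_PER_1K
    output_cost = output_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost


def cost_multiple(user_tokens: int, output_tokens: int) -> float:
    """How many times more one call costs after the system prompt grows."""
    before = per_call_cost(150, user_tokens, output_tokens)
    after = per_call_cost(1800, user_tokens, output_tokens)
    return after / before


system_prompt_growth = 1800 / 150  # the prompt itself is 12x larger
short_request = cost_multiple(user_tokens=100, output_tokens=100)
long_request = cost_multiple(user_tokens=3000, output_tokens=300)

print(f"system prompt: {system_prompt_growth:.0f}x")
print(f"short request: {short_request:.2f}x per call")
print(f"long request:  {long_request:.2f}x per call")
```

Under these assumed numbers the same template change hits short requests much harder than long ones, which is why the multiple finance eventually sees depends on the traffic the template rides on.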
This is the default behavior of any system where the cost unit is content rather than capacity, and where the people who control content (developers) are not the people who watch cost (finance and platform engineering). The gap between those two groups is where AI budget surprises live.
What Governance Actually Looks Like
The FinOps Foundation's framework treats tokens as a first-class budgetable resource, which means governing them the way you'd govern any other metered resource: with quotas, dashboards, and showback.
Team-level quotas assign a monthly or weekly token budget to a cost center or product team. In practice, this means the platform team configures limits at the API gateway or AI proxy layer. Tools like Azure API Management, a gateway or proxy placed in front of Amazon Bedrock, or purpose-built AI observability platforms like Helicone or LangSmith can enforce or alert on these thresholds. The quota doesn't have to be a hard cutoff; many organizations start with soft limits that generate alerts rather than rejections, because a hard cutoff on a production inference call is a different kind of incident. The team owns a number, and that number is visible before the invoice.
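A soft-limit-first quota might be modeled roughly as follows. This is a minimal sketch, not the configuration format of any gateway named above: `TeamQuota`, `evaluate`, the 80% alert threshold, and the budget figure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TeamQuota:
    """Hypothetical monthly token budget for one team or cost center."""
    team: str
    monthly_tokens: int
    soft_limit_ratio: float = 0.8  # assumed: alert once 80% of budget is used
    hard_cutoff: bool = False      # start with alerts rather than rejections


def evaluate(quota: TeamQuota, tokens_used: int) -> str:
    """Return 'ok', 'alert', or 'reject' for the team's next request."""
    if tokens_used >= quota.monthly_tokens:
        # Over budget: reject only if the team has opted into a hard cutoff.
        return "reject" if quota.hard_cutoff else "alert"
    if tokens_used >= quota.soft_limit_ratio * quota.monthly_tokens:
        return "alert"
    return "ok"


quota = TeamQuota(team="summarization", monthly_tokens=50_000_000)
for used in (10_000_000, 42_000_000, 55_000_000):
    print(used, evaluate(quota, used))
```

Keeping `hard_cutoff` off by default mirrors the soft-limit starting point described above: an over-budget team gets an alert, and rejection is an explicit later decision.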
Prompt-template-level spend dashboards are the mechanism that closes the gap between developer decisions and financial visibility. By tagging API calls with a template identifier — a header, a metadata field, whatever the gateway supports — you can break down token spend by the specific prompt template driving it. When the summarization feature's system prompt expanded, a template-level dashboard would have shown the per-call cost increase within hours of the deploy, not weeks later when finance noticed the line item. Template-level attribution is what gives chargeback actual teeth.
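The aggregation behind such a dashboard can be sketched as below. The record shape, the `template_id` field name, the template names, and the prices are assumptions for illustration; a real gateway's log schema will differ.

```python
from collections import defaultdict

# Hypothetical usage records, as a gateway might log them once each call
# is tagged with the prompt template that produced it.
calls = [
    {"template_id": "summarize-v1", "input_tokens": 550, "output_tokens": 200},
    {"template_id": "summarize-v1", "input_tokens": 600, "output_tokens": 180},
    {"template_id": "summarize-v2", "input_tokens": 2200, "output_tokens": 200},
    {"template_id": "summarize-v2", "input_tokens": 2250, "output_tokens": 190},
]


def spend_by_template(records, input_price_per_1k, output_price_per_1k):
    """Group call records by template and report calls and cost per call."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0})
    for record in records:
        cost = (record["input_tokens"] / 1000 * input_price_per_1k
                + record["output_tokens"] / 1000 * output_price_per_1k)
        entry = totals[record["template_id"]]
        entry["calls"] += 1
        entry["cost"] += cost
    return {
        template_id: {
            "calls": entry["calls"],
            "cost_per_call": entry["cost"] / entry["calls"],
        }
        for template_id, entry in totals.items()
    }


# Assumed prices in USD per 1,000 tokens.
report = spend_by_template(calls, input_price_per_1k=0.003, output_price_per_1k=0.015)
for template_id, row in report.items():
    print(template_id, row["calls"], f"{row['cost_per_call']:.5f}")
```

Comparing cost per call across template versions, rather than total spend, is what isolates a template change from a change in traffic.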
Cache-hit-rate tracking belongs in front of the CFO because cached input tokens are billed at a fraction of the standard input rate. Azure OpenAI's prompt caching, for example, charges roughly 50% of the standard input token rate for cache hits; Anthropic's Claude API discounts cache reads more steeply, to roughly 10% of the base input rate, but charges a premium to write to the cache. Exact rates vary by provider, model, and deployment type. A team running a high-volume feature with repetitive system prompts and a 20% cache hit rate is leaving meaningful money on the table compared to a team that's engineered for 60%. Reporting cache hit rate to finance, not only to engineering, creates the right incentive. It also gives finance a lever: if the cache hit rate drops, that's a signal worth investigating before it becomes a variance explanation.
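The gap between a 20% and a 60% hit rate can be put in dollar terms with a short calculation. This sketch assumes the hit rate is measured as the share of input tokens served from cache, a 50% discount on those tokens, and an illustrative price and monthly volume; none of these are quoted provider figures.

```python
def effective_input_cost(total_input_tokens: int,
                         cache_hit_rate: float,
                         price_per_1k: float,
                         cached_discount: float = 0.5) -> float:
    """Monthly input cost in USD when a share of input tokens hits the cache.

    cache_hit_rate is the fraction of input tokens served from cache (0 to 1).
    cached_discount is the assumed discount on those tokens (0.5 = half price).
    """
    cached_tokens = total_input_tokens * cache_hit_rate
    fresh_tokens = total_input_tokens - cached_tokens
    fresh_cost = fresh_tokens / 1000 * price_per_1k
    cached_cost = cached_tokens / 1000 * price_per_1k * (1 - cached_discount)
    return fresh_cost + cached_cost


MONTHLY_INPUT_TOKENS = 1_000_000_000  # assumed volume
PRICE_PER_1K = 0.003                  # assumed USD per 1,000 input tokens

cost_at_20 = effective_input_cost(MONTHLY_INPUT_TOKENS, 0.20, PRICE_PER_1K)
cost_at_60 = effective_input_cost(MONTHLY_INPUT_TOKENS, 0.60, PRICE_PER_1K)
monthly_savings = cost_at_20 - cost_at_60

print(f"20% hit rate: ${cost_at_20:,.2f}")
print(f"60% hit rate: ${cost_at_60:,.2f}")
print(f"difference:   ${monthly_savings:,.2f}")
```

A steeper discount on cached tokens widens the difference, so the same hit-rate gap is worth more on providers that price cache reads lower.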
Provisioned throughput showback is the AI analog of reserved instance reporting. When a team purchases provisioned throughput (provisioned throughput units, or PTUs, in Azure's framing; Provisioned Throughput in Amazon Bedrock), they're paying for a committed rate of inference regardless of actual usage. Showback, which reports what was committed against what was consumed, makes underutilization visible. A team that purchased PTUs for a feature that didn't scale as projected is carrying a cost that needs to be reallocated or released. Without showback, that cost sits in a platform budget line and nobody asks the question.
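A committed-versus-consumed report reduces to a small calculation per commitment. The sketch below is illustrative: `ProvisionedCommitment`, its fields, and the dollar and utilization figures are hypothetical, and real utilization would come from the provider's own metrics.

```python
from dataclasses import dataclass


@dataclass
class ProvisionedCommitment:
    """Hypothetical record of one team's provisioned throughput purchase."""
    team: str
    monthly_cost: float      # USD committed per month, paid regardless of use
    avg_utilization: float   # fraction (0 to 1) of committed throughput consumed


def showback(commitment: ProvisionedCommitment) -> dict:
    """Split the committed cost into the part used and the part left idle."""
    used_cost = commitment.monthly_cost * commitment.avg_utilization
    idle_cost = commitment.monthly_cost - used_cost
    return {
        "team": commitment.team,
        "committed": commitment.monthly_cost,
        "used": round(used_cost, 2),
        "idle": round(idle_cost, 2),
        "utilization_pct": round(commitment.avg_utilization * 100, 1),
    }


row = showback(ProvisionedCommitment(team="summarization",
                                     monthly_cost=20_000.0,
                                     avg_utilization=0.35))
print(row)
```

Reporting the idle dollar figure alongside the utilization percentage gives finance a number to reallocate or release, not just a ratio to note.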
When You'll Need This Conversation
Budget reviews are where this surfaces. A CIO or CFO looking at AI spend for the first time will ask a version of the same question: "Why did this cost 8x what we projected?" If the answer is "a developer changed a prompt template," that answer needs to land in a room where finance understands what a prompt template is and why it has cost implications. The FinOps framework gives you the vocabulary for that conversation — token budgets, template-level attribution, cache efficiency, provisioned utilization — but the vocabulary only helps if the governance infrastructure exists to produce the numbers behind it.
For public sector organizations working through AI procurement cycles, this is increasingly a pre-award question. Agencies that have stood up AI pilots are now being asked to demonstrate cost governance before scaling. Having a token budget framework in place is the difference between "we have a governance model" and "we have a bill we can't explain."
Okta Concept Mapping
Team-level token quotas rhyme with RBAC resource assignments: a group gets scoped access to a resource with defined limits, and the platform enforces those limits. The analogy holds for the organizational structure: cost centers map to groups, quotas map to entitlements. Where it breaks is enforcement timing. RBAC denies an unauthorized request before the resource is accessed. Token quotas, by contrast, are metered after consumption: the inference call happens, the tokens are spent, and the quota system records the draw. You can alert on threshold breach or throttle future requests, but you cannot un-spend tokens. Revoking a session stops further access; nothing recovers tokens already consumed. The enforcement model is fundamentally different, and any governance design that assumes token quotas behave like access controls will have gaps exactly where the costs are highest.

