When a federal buyer tells you they're "using Claude" or "evaluating GPT," they are almost certainly not calling Anthropic or OpenAI directly. They're calling the same model through AWS Bedrock, Azure AI Foundry, or Google Vertex AI. The model is identical. The infrastructure around it — compliance posture, procurement path, audit controls — is where the decision actually gets made.
This piece covers the consumption layer: what it means to pay per token rather than run AI infrastructure, how the pricing model works, and why hyperscaler-hosted endpoints are the regulated-enterprise default. Understanding this gives you a clean mental model for how AI spend actually flows in the accounts you're selling into.
What a token is
A token is the billing unit for AI model usage. Think of it as a chunk of text, roughly four English characters, that the model processes as a single unit. OpenAI's rule of thumb: one token is about three-quarters of a word. Google's Gemini documentation gives a similar figure. "Authentication" is two or three tokens depending on the provider. "Hi" is one.
Tokenization is not standardized across providers. Different models use different encoding schemes, so the same paragraph fed to Claude and GPT will produce different token counts. You cannot port a token count from one provider to another without re-counting. This matters because tokens are what you're billed for.
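For back-of-envelope sizing, the four-characters-per-token rule of thumb can be turned into a quick estimator. This is a minimal sketch: `estimate_tokens` and `CHARS_PER_TOKEN` are illustrative names, and the heuristic only approximates English text. It is not any provider's tokenizer, so billing-grade counts still have to come from the provider's own counting tool.

```python
import math

# Rule of thumb quoted above: roughly four English characters per token.
# An approximation only; real tokenizers differ by provider and by model.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from character count, rounded up."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


print(estimate_tokens("Hi"))       # 2 characters, rounds up to 1 token
print(estimate_tokens("a" * 400))  # 400 characters, about 100 tokens
```

An estimate like this is good enough to size a conversation with a buyer. It is not good enough to compare providers, because each one will count the same paragraph differently.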
The alternative to paying per token is running your own GPU infrastructure: hosting model weights, managing scaling, handling the operational burden yourself. API consumption is the pattern where someone else does all of that and you pay for what you use. For regulated enterprises, it's overwhelmingly the default.
The pricing model, not the prices
Specific per-token rates shift every few months as providers compete on price. The structure underneath does not.
Every major model provider bills input tokens and output tokens separately. You pay one rate for what you send in (the prompt, the context, the question) and a higher rate for what comes back (the model's response). Across flagship models, output tokens typically cost 4–6× as much as input tokens. The reason is computational: the model generates output one token at a time, and that sequential processing consumes more GPU time than reading input tokens in parallel. For the same total token count, a workload that asks short questions and gets long answers costs meaningfully more than one that sends long documents and gets back a classification label.
The rate card is a starting point. Every provider publishes a per-million-token price for each model tier. As of spring 2026, flagship models cluster around $2–5 per million input tokens and $10–25 per million output tokens. Budget-tier models run 10–50× cheaper. Model selection is the single largest cost lever available: CloudZero's analysis shows the difference between a flagship and a budget-tier model can turn a $1,200/month workload into a $100/month workload, with no infrastructure change at all.
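The input/output asymmetry is easier to see with numbers. The sketch below prices two requests with the same total token count but opposite shapes. The rates ($3 input, $15 output per million tokens) are hypothetical values chosen from inside the ranges quoted above, not any provider's actual price, and `request_cost` is an illustrative helper.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in dollars for one request; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical flagship rates, picked from the ranges cited in the text.
INPUT_RATE = 3.00
OUTPUT_RATE = 15.00

# Same 2,100 total tokens, opposite shapes.
chatty = request_cost(100, 2_000, INPUT_RATE, OUTPUT_RATE)      # short question, long answer
classifier = request_cost(2_000, 100, INPUT_RATE, OUTPUT_RATE)  # long document, short label

print(f"short question, long answer: ${chatty:.4f}")
print(f"long document, short label:  ${classifier:.4f}")
print(f"ratio: {chatty / classifier:.1f}x")
```

Under these assumed rates the answer-heavy request costs about four times as much as the document-heavy one, even though both move the same number of tokens.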
The modifier stack is where real costs live. On top of the base rate:
- Prompt caching lets you reuse repeated context (system prompts, policy documents, conversation history) at dramatically lower rates. Anthropic offers up to 90% off cached input tokens. AWS Bedrock and Vertex AI offer similar discounts.
- Batch processing runs at a 50% discount across all major providers, in exchange for asynchronous results within a 24-hour window.
- Data residency adds a surcharge. OpenAI charges a 10% uplift for regional processing endpoints. Hyperscalers charge similar cross-region premiums.
- GovCloud adds another 20–30% on top of standard rates on AWS.
- Provisioned throughput flips the model entirely: you reserve capacity at an hourly rate rather than paying per token, which makes sense for predictable high-volume workloads.
These modifiers stack. Practitioners consistently report that real-world enterprise AI spend runs well above the rate-card estimate once modifiers accumulate. One Bedrock-focused analysis put the multiplier at 1.5–2×. Take the specific number with appropriate skepticism — it's a single study, not a survey — but the directional point holds: the advertised per-token rate is the sticker price. Nobody pays sticker.
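To see how the stack moves a bill, here is a minimal sketch that applies the modifiers listed above to a rate-card estimate. The `Modifiers` class and `effective_cost` function are illustrative, the percentages come from the list above, and the way they combine (discounts applied first, surcharges multiplied on top) is an assumption. Actual invoicing rules vary by provider and contract.

```python
from dataclasses import dataclass


@dataclass
class Modifiers:
    cached_input_fraction: float = 0.0  # share of input tokens served from cache
    cache_discount: float = 0.90        # "up to 90% off" cached input tokens
    batch: bool = False                 # 50% batch discount
    residency_uplift: float = 0.0       # e.g. 0.10 for regional processing
    govcloud_uplift: float = 0.0        # e.g. 0.25, within the 20-30% range above


def effective_cost(input_cost: float, output_cost: float, m: Modifiers) -> float:
    """Apply the modifier stack to rate-card input and output costs (dollars)."""
    cached = input_cost * m.cached_input_fraction * (1 - m.cache_discount)
    uncached = input_cost * (1 - m.cached_input_fraction)
    subtotal = cached + uncached + output_cost
    if m.batch:
        subtotal *= 0.5
    # Assumption: surcharges compound multiplicatively on the discounted subtotal.
    return subtotal * (1 + m.residency_uplift) * (1 + m.govcloud_uplift)


# Hypothetical workload: $400 of input and $600 of output at rate card.
surcharged = effective_cost(400, 600, Modifiers(residency_uplift=0.10, govcloud_uplift=0.25))
with_cache = effective_cost(400, 600, Modifiers(cached_input_fraction=0.5,
                                                residency_uplift=0.10,
                                                govcloud_uplift=0.25))
print(f"rate card: $1000.00, with surcharges: ${surcharged:.2f}, plus caching: ${with_cache:.2f}")
```

In this made-up case the two surcharges alone lift a $1,000 rate-card estimate to roughly $1,375, and caching half the input claws some of that back. The sketch illustrates the mechanics only; it does not reproduce the 1.5–2× figure cited above.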
When Anthropic released Claude Opus 4.7 in April 2026, the headline per-token price stayed at $5/$25 per million tokens. But per <a href="https://www.finout.io/blog/claude-opus-4.7-pricing-the-real-cost-story-behind-the-unchanged-price-tag">Finout's analysis</a>, the new tokenizer produces up to 35% more tokens for the same input text. Same price per token, more tokens per sentence, higher bill.
Anthropic hasn't published a direct comparison figure, so treat the 35% as an outside estimate. But the mechanism is real: the rate stayed flat while the bill went up. It's the kind of thing that hides inside a rate card and shows up on next month's invoice.
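The arithmetic behind a flat rate and a higher bill is short enough to write out. This sketch assumes a hypothetical monthly volume of 100 million input tokens and uses the 35% figure as an upper bound from the outside estimate above, not a measured value. `monthly_bill` is an illustrative helper.

```python
def monthly_bill(tokens: int, rate_per_million: float) -> float:
    """Dollars billed for a token volume at a per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million


RATE = 5.00               # input rate from the text, unchanged between versions
OLD_TOKENS = 100_000_000  # hypothetical monthly input volume, old tokenizer
INFLATION = 0.35          # upper-bound outside estimate, not a vendor figure

new_tokens = round(OLD_TOKENS * (1 + INFLATION))
before = monthly_bill(OLD_TOKENS, RATE)
after = monthly_bill(new_tokens, RATE)

print(f"before: ${before:.2f}  after: ${after:.2f}")
```

The same text at the same price per token costs about 35% more in this worst case, which is why token counts have to be re-measured whenever the model version changes.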
Why the hyperscaler wins
The pricing mechanics are the same whether you call the provider directly or go through a hyperscaler. Which raises the obvious question.
When Bedrock hosts Claude, the per-token rate matches Anthropic's direct pricing. Same model, same price. Google's Vertex AI documentation confirms the same for Claude and Gemini models. The hyperscaler is not charging a premium for the proprietary frontier models your buyer cares about. (Open-source models like Llama do carry a significant markup on hyperscaler platforms, because you're paying for managed infrastructure you could theoretically run yourself. But for Claude, GPT, and Gemini, the rate is effectively the same.)
If the price is the same, why does the hyperscaler win? Three reasons, and they're mutually reinforcing.
Compliance inheritance. This is the structural mechanism that makes everything else possible. The FedRAMP authorization belongs to the cloud service — the Bedrock endpoint, the Azure AI Foundry instance — and covers everything the service wraps around the model. When an agency calls Claude through Bedrock in AWS GovCloud, the authorization covers the Bedrock service: the infrastructure, data routing, encryption, access controls, and audit logging. Google's FedRAMP implementation documentation states this explicitly:
"Individual LLMs aren't independently authorized under FedRAMP... the Marketplace reflects authorizations for cloud services like Generative AI on Vertex AI."
In practice: calling a model through Bedrock or Azure AI Foundry stays inside the agency's existing trust boundary. Calling OpenAI directly exits that boundary and requires a new compliance evaluation from scratch. For a public-sector buyer, that's the difference between adding a service to an existing ATO and starting a new one.
The current state of authorizations, as of early 2026:
| Hyperscaler | FedRAMP High | DoD IL4/5 | Higher Classifications |
|---|---|---|---|
| Azure AI Foundry | ✓ | ✓ | IL6, Top Secret (specialized) |
| AWS Bedrock (GovCloud) | ✓ | ✓ | — |
| Google Vertex AI | ✓ (March 2025) | — | — |
These authorizations move; verify the current state before citing them in a meeting.
The most capable frontier models are often not yet available in the most restricted government environments. Azure Government's latest models are GPT-4.1 and o3-mini, not GPT-5.x. AWS GovCloud Bedrock offers Claude Sonnet 4.5, not Opus 4.7. Your buyer may be authorized for AI but not for the <a href="https://www.sweetspot.so/articles/fedramp-ai-models/">specific model</a> they want.
Procurement vehicles. AI tokens flow through existing cloud contracts. The GSA OneGov agreement with AWS provides up to $1 billion in savings for federal cloud and AI access. The OneGov agreement with Microsoft covers Azure AI through existing enterprise agreements. Organizations already consuming Microsoft 365 or Azure credits can apply those toward AI workloads. No new contract vehicle required.
Infrastructure integration. The hyperscaler wrapper adds IAM-based access control, VPC endpoints (traffic that never traverses the public internet), audit logging via CloudTrail, Azure Monitor, or Cloud Audit Logs, customer-managed encryption keys, and data residency controls. All three hyperscalers commit contractually that customer data sent through their AI endpoints is not used to train the underlying models. All of this plugs into the governance stack the agency already operates.
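As an illustration of what "plugs into the governance stack" can look like, here is a sketch of an AWS IAM policy that allows one role to invoke a single model and only through a specific VPC endpoint. It is built as a Python dictionary so it can be printed as JSON. The model ID, region, and VPC endpoint ID are placeholders, and the policy is an assumed example for discussion, not a recommended production configuration.

```python
import json

# Placeholders (hypothetical): substitute real values from the target account.
MODEL_ARN = "arn:aws-us-gov:bedrock:us-gov-west-1::foundation-model/EXAMPLE-MODEL-ID"
VPC_ENDPOINT_ID = "vpce-EXAMPLE"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeOneModelViaVpcEndpoint",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",  # permission to call the model
            "Resource": MODEL_ARN,            # limited to a single model
            # Only honor requests that arrive through the named VPC endpoint.
            "Condition": {"StringEquals": {"aws:SourceVpce": VPC_ENDPOINT_ID}},
        }
    ],
}

print(json.dumps(policy, indent=2))
```

The point for a sales conversation is that access to the model is expressed in the same policy language, and reviewed by the same team, as access to every other resource in the account.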
The model itself is unchanged by the wrapper. Anthropic confirms that Claude on Bedrock behaves identically to Claude on their direct API. Same capabilities, same context window, same output quality. The hyperscaler adds modest latency and nothing else at the model layer.
Where your IDAM intuition applies, and where it breaks
If you've spent time in federation, the compliance inheritance mechanism should feel familiar. When an agency's identity provider federates with a new SaaS application, the agency extends its existing trust boundary to cover the new service without establishing a new identity silo. The SAML assertion carries the agency's authentication decision into the service provider's domain. The trust relationship carries forward.
Hyperscaler AI hosting works the same way at the infrastructure layer. The existing trust boundary — FedRAMP ATO, IAM policies, audit logging — extends to cover the AI endpoint.
This is where your IDAM intuition helps. Here's where it starts to mislead you.
A federated SaaS application operates within defined authorization scopes. The same request with the same authorization produces the same access decision. A model endpoint is different. It produces non-deterministic outputs. The compliance posture covers the infrastructure: who can call the model, how the data is encrypted, where it's stored, whether it's used for training. It does not govern what the model says in response, or whether the context retrieved for a given prompt was appropriately scoped to the requesting user's permissions. The trust boundary covers the pipe. What comes out of the pipe is a different problem entirely.
That gap between infrastructure compliance and output governance is where a lot of the hard problems in enterprise AI sit right now. This piece isn't the place to solve them, but you should know the gap exists, because your buyer is probably staring at it.
What this means Tuesday
When your buyer says "we're using Claude" or "we're evaluating Gemini," the right follow-up is to ask where they're accessing it. That single question does more work than it looks like.
It tells you which cloud they're standardized on, which compliance boundary they're operating in, and which procurement vehicle is funding it. Most importantly for you, it tells you whose IAM stack is governing access to the model endpoint. That's the thread that connects AI consumption to everything you already know how to sell.
The pattern across regulated enterprises is consistent and structurally predictable: AWS shops use Bedrock. Azure shops use Azure AI Foundry. Google shops use Vertex AI. The infrastructure integration drives the choice. The model catalog follows. Your buyer chose their AI platform the same way they chose their cloud provider — years ago, for reasons that had nothing to do with AI.
The consumption model is pay-per-token through an API. The compliance model is inheritance from the cloud provider. The procurement model is existing contract vehicles. These three facts together explain why calling a model provider directly is increasingly the exception in the accounts you'll talk to.
Things to follow up on...
- GAO found agencies struggling: An April 2026 GAO report found federal agencies more than doubled their AI use from 2023 to 2024 but reported difficulty understanding AI-related costs and accessing technical experts to evaluate proposals.
- The frontier-in-GovCloud gap narrows: A useful tracker from Sweetspot maps which frontier models are available inside each compliance boundary, from commercial FedRAMP through GovCloud IL5 and ITAR.
- AI FinOps is accelerating: The State of FinOps 2025 reported 63% of respondents managing AI costs, up from 31% the prior year, and Finout's platform comparison breaks down the cost-attribution differences across Bedrock, Vertex, and Azure.
- Vertex AI's context cliff: Google's Gemini Pro models double their input token price when prompts exceed 200,000 tokens, a hard threshold that RAG pipelines pulling long documents can silently cross.
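
The context cliff in the last item is easy to model. This sketch assumes that the doubled rate applies to the entire prompt once it crosses the threshold, and it uses a hypothetical base rate of $2 per million input tokens. `tiered_input_cost` is an illustrative name; check Google's current pricing page for the actual rule and rates.

```python
def tiered_input_cost(prompt_tokens: int, base_rate: float,
                      threshold: int = 200_000, multiplier: float = 2.0) -> float:
    """Input cost in dollars when the per-million rate jumps past a prompt-size threshold.

    Assumption: the higher rate applies to every token in the prompt,
    not only to the tokens above the threshold.
    """
    rate = base_rate * multiplier if prompt_tokens > threshold else base_rate
    return prompt_tokens * rate / 1_000_000


BASE_RATE = 2.00  # hypothetical dollars per million input tokens

at_limit = tiered_input_cost(200_000, BASE_RATE)
over_limit = tiered_input_cost(200_001, BASE_RATE)
print(f"200,000 tokens: ${at_limit:.6f}")
print(f"200,001 tokens: ${over_limit:.6f}")
```

Under that assumption, one extra token roughly doubles the input cost of the request, which is why a retrieval pipeline that occasionally pulls one more long document can change the bill without anyone changing the code.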

