When your buyer starts talking about AI inference costs, four pricing structures cover the ground: per-token(pay-as-you-go, the default), provisioned throughput(reserved capacity at a fixed hourly rate), prompt caching(discounts for reusing repeated input), and batch API(half-price processing, 24-hour turnaround). You'll hit per-token first because it's the baseline every vendor quotes. Provisioned throughput comes up when a buyer describes production workloads with steady volume. Caching surfaces when their architects mention repeated system prompts or shared context. Batch shows up in any conversation about overnight processing or bulk document work. Same product, same model, different purchase vehicle. The price gap between the cheapest and most expensive option for the same work runs 2–10×. The sentence that buys you credibility in a room with a CAIO: "How are you planning to purchase your inference?"
Per-Token Pricing
What it is: The default billing model. You pay for each unit of text processed, measured in tokens — roughly ¾ of a word.
What it does: Every API call gets metered on both sides. You send input tokens (your prompt, your context, your instructions). You receive output tokens (the model's response). The formula worth carrying into every meeting:
Your cost = input tokens × input rate + output tokens × output rate
No commitments, no minimums, no contract. Use it once, pay once.
Who's behind it: Every model provider publishes per-token rates as the baseline — OpenAI, Anthropic, Google. The hyperscalers (AWS Bedrock, Azure AI Foundry, Google Vertex AI) host the same models at per-token parity for frontier proprietary models (checkthat.ai is a pricing comparison tool that cross-references published provider rate cards). Claude on Bedrock costs the same per token as Claude on Anthropic's direct API. The markup shows up elsewhere. More on that in the comparison section.
What makes it distinct: Zero commitment, zero waste, and you absorb the full output-token premium. This is the number most people get wrong. Output tokens cost 3–10× more than input tokens depending on the model. Claude Sonnet charges $3 per million input tokens and $15 per million output — a 5× multiplier. GPT-5 runs $1.25 input, $10 output — 8×. Most people estimate costs based on what they send. The bill is dominated by what the model generates.
All pricing in this piece reflects published rates as of May 2026. This market moves fast. Verify specific numbers before using them in a buyer conversation.
In IDAM, a token is a credential — OAuth access tokens, OIDC ID tokens, SAML assertions. In AI pricing, a token is a billing unit: a chunk of text, roughly ¾ of a word. The word is identical; the concepts are unrelated. Your intuition about token lifecycle (issuance, expiration, revocation) won't help you here. But one overlap matters: when AI agents hold credentials to access enterprise systems, the agent needs both kinds of tokens simultaneously — an AI token budget to run and an OAuth token to authenticate.
Provisioned Throughput
What it is: Dedicated inference capacity purchased by the hour, billed whether you use it or not.
What it does: You reserve a block of GPU processing capacity instead of paying per token. AWS Bedrock calls these Model Units. Azure uses Provisioned Throughput Units. Google Vertex AI uses Generative AI Scale Units. Different names, same idea: a guaranteed slice of GPU at a fixed hourly rate. Commitment terms range from hourly (Azure) to weekly (Google) to monthly or six-month (AWS). Longer commitments, deeper discounts.
Who's behind it: This is a hyperscaler play. You won't find provisioned throughput on OpenAI's direct API or Anthropic's direct API. It exists because Bedrock, Azure, and Vertex are selling infrastructure, not just model access. Pricing for frontier models is opaque across all three — AWS redirects to account managers, Azure requires the pricing calculator, Google requires a sales conversation.
What makes it distinct: You're paying for capacity whether you use it or not. Third-party estimates put savings at 15–40% versus on-demand for sustained workloads (tokenmix.ai is a commercial inference cost aggregator; these figures are directional, not official provider numbers). Break-even sits around $30–50/day of consistent spend on a single model. Below that threshold, you're paying for idle GPUs. Above it, savings compound. The buyer who mentions provisioned throughput is telling you they've moved past experimentation into production inference.
Prompt Caching
What it is: A discount mechanism that reduces the cost of repeated input content by storing previously processed prompt prefixes.
What it does: Most enterprise AI calls share significant repeated content: system instructions, policy documents, few-shot examples (sample question-answer pairs included in the prompt), or retrieval-augmented context (documents pulled from a knowledge base and injected into the prompt). Prompt caching stores the processed version of that shared prefix so subsequent calls skip reprocessing it. OpenAI's caching is automatic — no code changes — and gives 50% off cached input tokens. Anthropic charges 1.25× for the initial cache write but reads at 0.1× base price, a 90% discount. Google's implicit caching is automatic and delivers up to 90% off for Gemini 2.5 models.
Who's behind it: All three major providers, with meaningfully different implementations. OpenAI: automatic, 50% off, short TTL (5–10 minutes of inactivity before expiry). Anthropic: steeper discount with explicit API configuration on hyperscaler platforms, though an automatic mode now exists on Anthropic's direct API. Google: both automatic and explicit modes, with the longest default TTL (up to 24 hours implicit).
What makes it distinct: Caching rewards architectural consistency. If your buyer's applications send the same 4,000-token system prompt with every call, they're paying full price for those tokens every time without it. The constraint is the TTL — how long the cached content persists. Calls too infrequent, the cache expires, and you pay the write cost again. Anthropic reports latency reductions up to 85% on long prompts, so the savings are in both dollars and speed.
The TTL concept maps directly from session management. A cached prompt prefix works like a cached SAML assertion or a session cookie: it persists for a defined window, resets on use in some implementations, and expires if idle too long. Your intuition about session timeout policies — balancing security against user experience — is the same tradeoff buyers face with cache TTL. Different mechanism, same governance question.
Batch API
What it is: Asynchronous processing at a flat 50% discount with a 24-hour completion window.
What it does: You submit a file of requests — up to 50,000 on OpenAI, 10,000 on Anthropic — and the provider processes them in the background using spare capacity. No streaming, no real-time responses. Results come back within 24 hours, though most batches finish in under an hour in practice.
Who's behind it: OpenAI, Anthropic, and Google all offer batch processing at the same 50% discount. Implementations differ in limits and features. Anthropic offers a 300,000 output token limit per request in batch mode on their direct API (a beta feature not yet available on Bedrock or Vertex), versus roughly 64K–128K synchronous, which matters for document-heavy workloads. Google's batch mode carries no SLA and cannot be combined with provisioned throughput.
What makes it distinct: Half price. The tradeoff is latency. Any workload that can wait for its answer — nightly report generation, document classification, bulk summarization, compliance scanning — should be a batch job. The spec calls this "asynchronous processing," which is a generous word for it — the provider runs your work when their GPUs aren't busy with someone paying full price. Discounts stack with caching: OpenAI's batch + cache combination can reduce input costs by roughly 75% (tokenmix.ai, a commercial inference cost aggregator that calculates combined discount rates from published provider pricing). Google's batch mode includes implicit caching by default for Gemini 2.5 models.
Which Structure Wins Where
I'm using scenario mapping here because a flat comparison table would imply these four structures compete on a level field. They don't. Each one wins under specific workload conditions and loses under others. The buyer's usage pattern determines the answer, so the comparison follows the usage pattern.
Scenario 1: Steady, high-volume production workload An agency running 10M+ tokens/day through a single model. Provisioned throughput wins. The fixed hourly cost amortizes well above the $30–50/day break-even. Per-token would cost 15–40% more for the same work. Batch doesn't apply because the workload needs real-time responses. Caching helps on top of provisioned throughput — Google's GSU model gives 10× efficiency on cached tokens (cached tokens burn down at 1/10th the rate of uncached tokens) — but provisioned capacity is the structural savings.
Scenario 2: Repetitive prompts with large shared context A chatbot with a 5,000-token system prompt answering hundreds of queries per hour. Prompt caching wins. The shared prefix gets cached once and read hundreds of times. On Anthropic, after the initial 1.25× write cost, every subsequent call pays 0.1× for that prefix — break-even at roughly two reads per cached prefix. On OpenAI, it's automatic and 50% off with no write premium. The volume might also justify provisioned throughput underneath, but caching is the lever that changes the economics.
Scenario 3: Non-urgent bulk processing Classifying 50,000 documents overnight. Generating weekly compliance summaries. Batch API wins. Flat 50% off, no architectural changes beyond submitting requests as a file instead of individual API calls. If the documents share common instructions, caching stacks on top. Batch with cached prompts on OpenAI hits roughly 75% total savings on input tokens versus standard per-token pricing.
Scenario 4: Variable, exploratory, or low-volume usage An agency piloting AI with a few hundred calls per day. Per-token wins. No commitments, no wasted capacity, no cache management overhead. The per-call cost is highest, but total spend is low enough that optimization doesn't move the needle. Most agencies start here. It's the right place to start.
The Hyperscaler Markup Factor
One cost dimension cuts across all four scenarios. For frontier proprietary models (Claude, GPT, Gemini), the per-token rate on a hyperscaler matches the direct API exactly (checkthat.ai cross-references published rate cards; Anthropic's own documentation confirms pricing parity across Bedrock, Vertex, and their direct API). The 15–40% premium shows up in surrounding services: guardrails, logging, network egress, knowledge bases, and compliance tooling (inference.net, an inference platform vendor — treat their overhead estimates as directional rather than precise).
For open-weight models (Llama, Mistral), the math flips. Bedrock charges roughly 2–3× specialized hosts for the same Llama model (tokenmix.ai aggregates published rates; the comparison is directional, and the premium reflects managed infrastructure costs, not pure markup). That premium buys VPC integration, IAM, and compliance tooling. Real value for public sector buyers who need those things. And a cost your buyer should know they're paying.
Provisioned throughput maps to per-seat licensing with a minimum commitment. Your buyer already understands paying for capacity whether every seat is active — that's how they buy Okta. Above a utilization threshold, reserved pricing wins; below it, consumption pricing wins. Where the analogy breaks: AI provisioned throughput is model-specific and region-specific. You can't reassign unused GPU capacity across models the way you might reassign unused licenses across departments.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "AI is expensive" | "The same model costs 2–10× different amounts depending on how you buy inference" | Positions you as someone who understands the procurement lever |
| "Tokens are like words" | "A token is roughly three-quarters of a word — it's the billing unit for AI inference, completely unrelated to auth tokens" | Preempts the IDAM/AI token confusion before it derails the room |
| "Output is more expensive" | "Output tokens run 3–10× the input rate — most of the bill comes from what the model generates, not what you send it" | Reframes cost estimation around the number buyers usually undercount |
| "You should use caching" | "If your applications share a common system prompt or policy context, caching cuts input costs 50–90% with no model change" | Ties the recommendation to a condition the buyer can verify |
| "Batch is cheaper" | "Any workload that doesn't need a real-time answer — overnight processing, weekly reports, bulk classification — runs at half price through the batch API" | Gives the buyer a concrete filter for their own use cases |
| "Provisioned throughput saves money" | "Once you're spending $30–50 a day on a single model consistently, reserved capacity saves 15–40% — same economics as reserved instances" | Anchors to a concept every public sector infrastructure buyer holds |
| "The hyperscalers charge more" | "For Claude and GPT, the per-token rate is identical on Bedrock versus the direct API — the premium is in the platform services around it" | Corrects a common misconception with precision |
| "Open-source models are cheaper" | "Open-weight models on a hyperscaler cost 2–3× what you'd pay on a specialized host — the premium buys managed infrastructure and compliance tooling" | Acknowledges the tradeoff without taking a side; "open-weight" is the accurate term — these models publish weights but not always training data or full licenses |
| "Which model are you using?" | "How are you planning to purchase your inference — on-demand, reserved, or batch?" | Shifts the conversation to procurement strategy, where you can add value |
| "Caching is automatic" | "On OpenAI it's automatic. On Anthropic you get a steeper discount but need explicit configuration through the hyperscalers — their direct API now has an automatic option too." | Shows you know the implementation differences across providers |
| "You can stack discounts" | "Batch plus caching on the same workload can cut input costs by 75% — that's real money at agency scale" | Gives the buyer a number they can take to their CFO |
Things to follow up on...
- Provisioned pricing is opaque: None of the three hyperscalers publish self-serve provisioned throughput rates for frontier models like Claude or GPT-4o, so the savings estimates in this piece come from third-party aggregators rather than official rate cards.
- Batch and provisioned don't mix: Google's Vertex AI explicitly prohibits combining batch inference with provisioned throughput, which means buyers running both workload types need to plan two separate cost structures.
- Open-weight markup varies wildly: Bedrock charges roughly 201% more than Together AI for Llama 3.3 70B while Vertex AI charges about 55% more, a gap documented by TokenMix's per-model cost analysis that shows the hyperscaler premium depends heavily on which hyperscaler.
- Cache TTL is evolving fast: OpenAI's December 2025 changelog introduced extended prompt cache retention up to 24 hours, closing the TTL gap with Google's implicit caching and potentially changing the caching calculus for lower-frequency workloads.

