AI Inference Pricing: Four Structures, One Framework

By Carey Whitten, May 5, 2026

Per-Token

What it is: Pay for each token processed, with input and output tokens priced separately.

What it does: Meters inference cost at the most granular unit. You pay for what you consume, nothing more. Input tokens (what you send to the model) are typically priced at roughly 20–33% of output token rates, a 3–5x gap. The model does more compute generating than reading, and the pricing reflects that. On a model like GPT-4o, illustrative benchmarks run roughly $5/1M input tokens and $15/1M output tokens, a 3x gap (verify against current provider documentation before quoting).
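A minimal sketch of the per-token arithmetic, assuming the illustrative $5 and $15 per million rates quoted above. The function name and the example token counts are hypothetical, and the defaults are not current list prices.

```python
def per_token_cost(input_tokens: int, output_tokens: int,
                   input_rate_per_m: float = 5.00,
                   output_rate_per_m: float = 15.00) -> float:
    """Return the dollar cost of one request under per-token pricing.

    Rates are dollars per 1M tokens. The defaults are the illustrative
    figures from the text, not verified provider prices.
    """
    input_cost = input_tokens / 1_000_000 * input_rate_per_m
    output_cost = output_tokens / 1_000_000 * output_rate_per_m
    return input_cost + output_cost


# Hypothetical request: 3,000 tokens in, 500 tokens out.
request_cost = per_token_cost(3_000, 500)
print(f"${request_cost:.4f} per request")
```

Keeping input and output as separate arguments makes it hard to quote a single blended "token cost" by accident.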

Who's behind it: Every major lab publishes per-token rates. OpenAI, Anthropic, Google DeepMind, Mistral. It's the default pricing structure for direct API access and the baseline against which every other structure is measured.

What makes it distinct: Maximum flexibility, highest per-unit cost at scale. No commitment, no minimum. The right structure when workloads are unpredictable or early-stage. The wrong structure when they're not.

Per-Call

What it is: A flat fee per API invocation, regardless of token count.

What it does: Converts inference cost into a per-transaction unit. You pay the same whether the model processes 50 tokens or 500. This makes cost modeling straightforward when transaction volume is predictable, but it creates a hidden risk: the flat fee bakes in an assumed token count. A per-call rate that looks cheap can be expensive on a per-token basis if your average call carries fewer tokens than the rate assumes.

Who's behind it: More common in specialized AI services than in general-purpose LLM access. Vision APIs, classification endpoints, and embedding services — AWS Rekognition, Google Vision AI, Azure AI Services — use this model for non-generative tasks. Some agentic workflow platforms also surface per-call pricing at the orchestration layer, even when the underlying model charges per-token.

What makes it distinct: Predictable cost per operation. When a buyer is building a workflow with fixed transaction volumes and bounded prompt sizes, per-call pricing simplifies the finance conversation. When prompt sizes vary widely, it obscures the actual cost driver.
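One way to surface the cost driver that a flat fee hides is to convert it into an implied per-token rate. The sketch below assumes a hypothetical $0.01 per-call fee; the function name and figures are illustrative only.

```python
def implied_rate_per_million(per_call_fee: float, tokens_per_call: int) -> float:
    """Return the effective dollars per 1M tokens behind a flat per-call fee."""
    if tokens_per_call <= 0:
        raise ValueError("tokens_per_call must be positive")
    return per_call_fee / tokens_per_call * 1_000_000


PER_CALL_FEE = 0.01  # hypothetical flat fee in dollars per call

# The same fee implies very different per-token rates as call size changes.
for tokens in (50, 500, 5_000):
    rate = implied_rate_per_million(PER_CALL_FEE, tokens)
    print(f"{tokens:>5} tokens/call -> ${rate:,.2f} per 1M tokens")
```

Comparing the implied rate against the published per-token rate for a similar model shows whether the flat fee is a convenience or a premium for a given workload.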

Provisioned Throughput

What it is: Reserved inference capacity: a committed tokens-per-minute or requests-per-second rate, billed whether you use it or not.

What it does: Trades flexibility for unit economics. At sustained utilization, provisioned rates run 30–50% below equivalent on-demand per-token pricing (illustrative benchmark; verify against current provider documentation). The provider guarantees your capacity; you guarantee your spend. Azure OpenAI sells this capacity as Provisioned Throughput Units (PTUs) through its provisioned deployment tiers. AWS Bedrock offers provisioned throughput for supported models. Google Vertex AI has analogous committed-use options.

Who's behind it: The three major hyperscalers (Microsoft Azure, AWS, and Google Cloud) each offer some version of this. The mechanics differ (unit sizing, commitment periods, and minimum purchase thresholds vary), but the economic structure is consistent across providers.

What makes it distinct: The only pricing structure where your cost is determined by your commitment, not your consumption. Upside and risk, same line item.

Prompt Caching

What it is: Reduced pricing on repeated prompt prefixes — the model processes the prefix once, caches the result, and charges a fraction of normal input rates on subsequent cache hits.

What it does: Cuts input token costs for workloads where the same large context appears in many requests: a system prompt, a policy document, a RAG knowledge base preamble. Anthropic's implementation charges roughly 10% of normal input rates on cache hits, with a one-time cache write cost at approximately 125% of input rates (illustrative benchmark). OpenAI applies automatic caching for prompts exceeding 1,024 tokens. Google Gemini offers explicit context caching with configurable TTLs.

Who's behind it: Anthropic (explicit prompt caching API), OpenAI (automatic caching), Google Gemini (context caching). Not universally available across all models or deployment tiers — worth confirming before building it into a cost model.

What makes it distinct: The only structure where your prompt architecture directly controls your unit economics. A well-structured prompt with a stable, front-loaded prefix can cut effective input costs by 80% or more on high-volume workloads. The savings are real, but capturing them requires prompt architecture discipline, not just a pricing tier selection.
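To see how prompt structure drives the outcome, the sketch below blends cached and uncached input costs. It assumes the illustrative 10% cache-hit and 125% cache-write multipliers mentioned earlier, and it assumes every cache miss is billed as a write on the prefix. The function name and the workload figures are hypothetical.

```python
def effective_input_multiplier(prefix_share: float, hit_ratio: float,
                               cached_multiplier: float = 0.10,
                               write_multiplier: float = 1.25) -> float:
    """Return blended input cost as a fraction of fully uncached input cost.

    prefix_share: fraction of each prompt's input tokens in the stable prefix.
    hit_ratio: fraction of calls that find the prefix in cache.
    Assumption: every miss is billed as a cache write on the prefix.
    """
    if not (0.0 <= prefix_share <= 1.0 and 0.0 <= hit_ratio <= 1.0):
        raise ValueError("prefix_share and hit_ratio must be between 0 and 1")
    prefix_cost = hit_ratio * cached_multiplier + (1.0 - hit_ratio) * write_multiplier
    return prefix_share * prefix_cost + (1.0 - prefix_share)


# Hypothetical workload: 90% of input tokens sit in the prefix, 99% of calls hit.
front_loaded = effective_input_multiplier(prefix_share=0.90, hit_ratio=0.99)
# Same hit ratio, but only 30% of the prompt is cacheable prefix.
small_prefix = effective_input_multiplier(prefix_share=0.30, hit_ratio=0.99)

print(f"front-loaded prompt: {1 - front_loaded:.1%} input cost reduction")
print(f"small prefix:        {1 - small_prefix:.1%} input cost reduction")
```

The two cases use the same pricing tier; only the share of the prompt that sits in the stable prefix changes.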

Comparison: When Each Structure Wins

Comparison strategy: scenario mapping. The four structures don't cluster cleanly by mechanism — per-token and per-call are both consumption-based but measure different units; provisioned and cached are both optimization plays but operate at different layers. Scenario mapping shows when each structure wins without forcing false equivalences. All four structures appear in each scenario.

Scenario 1: Early-stage deployment, variable workload

Per-token is the right default. No commitment risk, full flexibility. Per-call is viable if the workload is transaction-based with predictable call volumes and bounded prompt sizes. Provisioned throughput is the wrong choice here — you'll pay for capacity you don't use, and you don't yet know what capacity you need. Prompt caching is worth evaluating even at low volume if your system prompt is large (above 2,000 tokens), because the per-call savings compound quickly and the architecture discipline pays dividends later.

Scenario 2: Production workload, sustained high volume

Provisioned throughput becomes the right question. The back-of-envelope break-even:

If sustained utilization exceeds ~65% of committed capacity, provisioned throughput typically breaks even against on-demand within the commitment period.

Break-even utilization is one minus the discount. At a 40% provisioned discount, you need to use roughly 60% of your committed capacity to match what you'd have paid on-demand; at a 35% discount, about 65%. At 80% utilization, the savings are real and compounding. At 40% utilization, you've paid more than on-demand pricing for the privilege of a commitment.
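The break-even logic can be sketched in a few lines. This assumes a simple model in which fully used provisioned capacity costs (1 − discount) times the on-demand price of the same tokens; real contracts add commitment periods and minimums that this ignores. The function names are hypothetical.

```python
def breakeven_utilization(discount: float) -> float:
    """Return the utilization at which provisioned spend equals on-demand spend."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def provisioned_cost_ratio(utilization: float, discount: float) -> float:
    """Return provisioned spend divided by on-demand spend for the tokens used.

    Below 1.0 the commitment saves money; above 1.0 it costs more.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return (1.0 - discount) / utilization


DISCOUNT = 0.40  # illustrative provisioned discount from the text
print(f"break-even utilization: {breakeven_utilization(DISCOUNT):.0%}")
for utilization in (0.40, 0.60, 0.80):
    ratio = provisioned_cost_ratio(utilization, DISCOUNT)
    print(f"{utilization:.0%} utilization -> {ratio:.2f}x on-demand spend")
```

Running the loop across a buyer's forecast utilization range, not a single point estimate, shows how much forecast error the commitment can absorb.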

Prompt caching compounds on top of provisioned savings when prompts qualify. These structures aren't mutually exclusive. A RAG pipeline running on provisioned throughput with cached system prompts is a common production configuration.

Scenario 3: RAG pipeline or agent with large, repeated context

Prompt caching is the primary lever. Break-even:

Cache savings per call = (cached prefix tokens) × (normal input rate − cache-hit input rate)
Daily savings = cache savings per call × daily call volume
If daily savings > amortized cache write cost, caching pays.

Take a 4,000-token system prompt at $5/1M input tokens, with cache hits at $0.50/1M (illustrative figures). Each cache hit saves $0.018 (4,000 tokens × $4.50/1M). At 10,000 calls per day, that's $180/day in input cost reduction. Cache write cost is approximately $0.025 per write (4,000 tokens at 125% of the input rate). If the cache entry survives even a few hundred calls before expiring, the economics are clear. The variable to watch is cache lifetime, which providers control, not you.
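The same worked example as a short script, so the inputs can be swapped for a buyer's own numbers. It uses the illustrative rates from the paragraph above and assumes a 125% cache-write multiplier; the function names are hypothetical.

```python
def cache_savings_per_hit(prefix_tokens: int, input_rate_per_m: float,
                          cached_rate_per_m: float) -> float:
    """Return dollars saved on one call when the prefix is served from cache."""
    return prefix_tokens * (input_rate_per_m - cached_rate_per_m) / 1_000_000


def cache_write_cost(prefix_tokens: int, input_rate_per_m: float,
                     write_multiplier: float = 1.25) -> float:
    """Return dollars charged to write the prefix into the cache once."""
    return prefix_tokens * input_rate_per_m * write_multiplier / 1_000_000


PREFIX_TOKENS = 4_000
INPUT_RATE = 5.00    # dollars per 1M input tokens (illustrative)
CACHED_RATE = 0.50   # dollars per 1M cached input tokens (illustrative)
DAILY_CALLS = 10_000

per_hit = cache_savings_per_hit(PREFIX_TOKENS, INPUT_RATE, CACHED_RATE)
daily = per_hit * DAILY_CALLS
write = cache_write_cost(PREFIX_TOKENS, INPUT_RATE)

print(f"savings per hit: ${per_hit:.4f}")
print(f"daily savings:   ${daily:,.2f}")
print(f"cost per write:  ${write:.4f}")
```

Daily savings here assume every call is a cache hit; multiply by the observed hit ratio for a more conservative figure.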

Scenario 4: Federal or enterprise buyer already on a hyperscaler

The hyperscaler markup question surfaces here. Azure OpenAI, AWS Bedrock, and Google Vertex AI typically price 15–40% above direct API rates for equivalent models (illustrative benchmark range; actual rates vary by enterprise agreement and model tier). What the markup buys: FedRAMP authorization, data residency controls, VNet and private endpoint access, existing enterprise billing relationships, and SLA commitments that direct API access doesn't provide.

For a federal buyer, "why are we paying more through Azure?" often has a one-word answer: FedRAMP. The agency can't use the direct OpenAI API in a FedRAMP-authorized environment. That premium is the compliance cost, already negotiated into the existing cloud relationship.

For an enterprise buyer, the effective markup may be lower than list price suggests. Enterprise Agreement discounts on the hyperscaler's broader cloud spend often reduce the marginal cost of AI services below the published delta. Ask what their EA discount structure does to that number before assuming the full list-price markup applies.
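A rough sketch of that adjustment, assuming the EA discount applies as a flat percentage off the hyperscaler's list price for AI services. That is a simplification, since real agreements vary by service and commitment level. The function name and the example figures are hypothetical.

```python
def effective_markup(list_markup: float, ea_discount: float) -> float:
    """Return the hyperscaler markup over direct API rates after an EA discount.

    list_markup: published premium over direct rates, e.g. 0.25 for 25%.
    ea_discount: assumed flat discount off hyperscaler list, e.g. 0.10 for 10%.
    A negative result means the hyperscaler route is cheaper than direct.
    """
    if not 0.0 <= ea_discount < 1.0:
        raise ValueError("ea_discount must be in [0, 1)")
    return (1.0 + list_markup) * (1.0 - ea_discount) - 1.0


# Hypothetical buyer: 25% list markup, 10% EA discount on AI services.
print(f"effective markup: {effective_markup(0.25, 0.10):.1%}")
```

The point of the calculation is the question it prompts: whether the buyer's agreement actually covers AI services at the same discount as the rest of their cloud spend.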

Field Language Guide

Don't say	Do say	Why it matters
"AI usage fees"	"Per-token inference costs"	Buyers who've read their vendor quotes use token-based language; matching it signals fluency
"The model charges per request"	"This endpoint uses per-call pricing"	"Request" is ambiguous; "call" maps to the API invocation unit buyers see in billing
"Provisioned is cheaper"	"Provisioned throughput breaks even at sustained utilization above roughly 65%"	"Cheaper" without a utilization condition is wrong half the time
"You're paying extra for Azure"	"The hyperscaler tier includes FedRAMP authorization and data residency controls"	Framing the markup as a compliance cost preserves the buyer's existing cloud relationship
"Caching saves money"	"Prompt caching reduces input costs on repeated prefixes — the savings depend on prefix length and call volume"	Unqualified savings claims get tested; qualified claims get trusted
"Token costs"	"Input token costs" or "output token costs"	Input and output rates differ by 3–5x on most models; conflating them produces wrong estimates
"Committed use"	"Provisioned throughput"	"Committed use" is GCP's term; "provisioned throughput" is the cross-provider term that appears in most contracts
"This will save you 30–50%"	"At sustained utilization above 65%, provisioned throughput typically reduces per-token costs 30–50% versus on-demand"	The savings figure without the utilization condition sets up a disappointed buyer
"The cache hit rate"	"The cache hit rate, which depends on provider infrastructure and prompt structure"	Buyers asking about SLAs need to know cache hits aren't guaranteed by contract
"AI API pricing"	"Inference pricing"	"Inference" is the term in vendor contracts and procurement documents; "API pricing" is vague and doesn't survive a procurement review

Okta Concept Mapping

The IDAM analog for provisioned throughput is concurrent session licensing — the model where you buy a pool of capacity and pay whether the pool is full or not. The mapping holds structurally: both involve committing to a capacity ceiling, both reward predictable utilization patterns, and both penalize overcommitment. Where it breaks: concurrent session licenses are sized against user counts, which are relatively stable and forecastable. AI inference workloads are sized against tokens-per-minute, which can spike by orders of magnitude within a single agent run. A buyer comfortable sizing concurrent session licenses may underestimate how variable AI throughput requirements actually are, and undercommit to provisioned capacity in a way that leaves them paying on-demand rates for their peak load anyway. Before provisioned throughput enters the proposal, ask: "What does your peak-to-average inference ratio look like?" If they don't know, the commitment is premature.
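If the buyer can export tokens-per-minute samples from their usage logs, the peak-to-average ratio takes a few lines to compute. The sample values below are invented for illustration, and the function name is hypothetical.

```python
from statistics import mean


def peak_to_average(tpm_samples: list[float]) -> float:
    """Return peak tokens-per-minute divided by mean tokens-per-minute."""
    if not tpm_samples:
        raise ValueError("need at least one sample")
    average = mean(tpm_samples)
    if average <= 0:
        raise ValueError("average throughput must be positive")
    return max(tpm_samples) / average


# Invented samples: steady traffic with one agent-driven spike.
samples = [10_000, 12_000, 8_000, 90_000, 10_000]
print(f"peak-to-average ratio: {peak_to_average(samples):.2f}")
```

A ratio near 1 suggests a commitment sized to the average will cover most traffic; a high ratio means a commitment sized to the average leaves the peaks on on-demand rates.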

All pricing figures in this piece are illustrative benchmarks derived from published provider documentation as of the date of writing. Inference pricing changes frequently. Verify current rates against provider documentation before using figures in a customer conversation or proposal.