The Four Structures
Per-Token Pricing
What it is: You pay for each token processed, billed separately for input and output.
What it does: Every inference call generates a bill based on the tokens consumed — the prompt going in, the completion coming out. Input and output tokens price differently, and output tokens are almost always more expensive, sometimes by a factor of three to five. The meter runs on every call, with no commitment required and no minimum spend.
Who uses it: Every major frontier model provider — Anthropic, OpenAI, Google — publishes per-token rates as their default pricing tier. It's what you get if you sign up, add a credit card, and start calling the API.
What makes it distinct: Zero commitment, infinite flexibility, and the highest per-unit cost of any structure here. Per-token is the right default for experimentation and low-volume workloads. At scale, it becomes the most expensive option by design: providers price it to make provisioned throughput look attractive once your volume justifies the commitment.
Illustrative benchmark: A frontier-class model (GPT-4 tier or equivalent) runs roughly $3–$5 per million input tokens and $15–$20 per million output tokens at on-demand rates. These figures change without notice and should be verified against current provider pricing pages before any proposal conversation.
Provisioned Throughput
What it is: You commit to a reserved inference capacity in exchange for a discounted per-unit rate, typically 30–50% below on-demand.
What it does: Instead of drawing from shared capacity at spot prices, you reserve a dedicated throughput allocation, measured in tokens per minute or requests per second depending on the provider. The discount is real, but so is the commitment: you're paying for the reservation whether you use it or not, usually on a monthly or annual basis. Unused capacity doesn't roll over.
Who uses it: OpenAI calls this "Provisioned Throughput Units" (PTUs). Anthropic offers similar capacity reservations for enterprise accounts. Google Cloud's Vertex AI has a committed-use model for Gemini. The mechanism is structurally similar across providers even when the terminology differs.
What makes it distinct: It's the only structure where your cost is partially fixed regardless of actual usage. That's a feature for workloads with predictable, sustained volume, and a liability for anything bursty or seasonal. The break-even point against on-demand pricing is typically somewhere between 60–70% utilization of the reserved capacity. Below that, you're paying more than you would on per-token.
Illustrative benchmark: 30–50% discount versus on-demand rates at full utilization. Actual discount varies by provider, model tier, and commitment length. Annual commitments typically yield better rates than monthly.
Prompt Caching
What it is: Repeated input context — system prompts, document chunks, few-shot examples — is cached server-side and billed at a steep discount on subsequent calls.
What it does: Many production AI workloads send the same large context block with every request: a detailed system prompt, a reference document, a set of examples. Without caching, you pay full input-token rates on that context every single time. With caching enabled, the provider stores the processed representation of that context, and subsequent calls that hit the cache pay a fraction of the normal input rate, typically 75–90% less on the cached portion.
Who uses it: Anthropic's prompt caching is the most mature implementation, offering up to 90% reduction on cached input tokens. OpenAI introduced a similar mechanism. The feature is model-specific and not universally available across all tiers.
What makes it distinct: Prompt caching is the only structure that targets a specific cost driver — repeated input context — rather than overall volume or capacity. Its value is entirely dependent on workload shape. If your prompts are unique every call, caching saves you nothing. If 70% of your input tokens are a fixed system prompt that repeats across millions of calls, caching is the highest-leverage cost lever available.
Illustrative benchmark: Up to 90% reduction on cached input tokens. A workload where 70% of input is repeated context could see overall input costs drop by 60–65% with caching enabled, depending on cache hit rate.
Hyperscaler Markup
What it is: When you access a frontier model through a cloud platform (Azure OpenAI, AWS Bedrock, Google Cloud Vertex AI) rather than directly from the model provider, you pay the platform's rate, which includes a markup over the provider's direct API price.
What it does: The hyperscaler bundles the model access with their infrastructure, compliance posture, billing relationship, and support tier. The convenience is real: your AI spend consolidates into an existing cloud contract, procurement already has the vendor relationship, and data residency questions have cleaner answers. The markup is also real, running 15–40% over what you'd pay going direct, depending on the model, the platform, and any enterprise agreement discounts already in place.
Who uses it: Azure OpenAI Service, AWS Bedrock, and Google Cloud Vertex AI all operate this way for third-party models. The markup is not always visible as a line item. It's embedded in the per-token rate the platform publishes, which is higher than the model provider's direct rate.
What makes it distinct: It's the only structure where the pricing decision is primarily driven by procurement and compliance logic rather than workload economics. The "data goes to Azure because our lawyers said so" pattern is real, and it has a real cost attached. That cost is worth paying when the alternative is a months-long security review. It's worth understanding when someone asks why the AI budget is running 25% over the direct-API estimate.
Illustrative benchmark: 15–40% markup over direct API rates. Actual markup varies by model, platform, and whether enterprise agreement pricing applies. Azure OpenAI and AWS Bedrock markups are not always published explicitly — compare the platform's published per-token rate against the model provider's direct rate.
How They Interact: A Scenario Map
Scenario mapping works better here than a trait table because these structures layer. A buyer can be paying hyperscaler markup on a provisioned throughput commitment with prompt caching enabled. Asking "which is cheapest?" produces no useful answer; the right analysis is which combination fits the workload.
Scenario 1: Exploration / low volume Per-token, direct API, no caching. No commitment risk, maximum flexibility, highest per-unit cost. Appropriate until you have enough production data to estimate sustained volume.
Scenario 2: High-volume, predictable workload, direct API Per-token baseline plus provisioned throughput once utilization justifies the commitment. Add prompt caching if your workload has significant repeated context. This is the cost-optimal path for a mature production deployment at scale.
Scenario 3: High-volume, predictable workload, enterprise cloud requirement Same as Scenario 2, but add 15–40% for the hyperscaler layer. The markup doesn't disappear because you've negotiated provisioned throughput — it applies on top. Buyers who assume their Azure enterprise agreement eliminates the markup are often wrong; check the per-token rate against the model provider's direct rate.
Scenario 4: Workload with heavy repeated context (RAG, document Q&A, consistent system prompts) Prompt caching is the primary lever regardless of volume tier. Even at low volume, the caching discount on repeated input context can be larger than the provisioned throughput discount. These are not mutually exclusive — you can run caching on top of provisioned throughput — but for this workload shape, caching should be evaluated first.
When a vendor proposal lands, ask which structure the rate is based on, and whether that structure matches how you'll actually use the model. A competitive rate built on the wrong structure is still the wrong structure.
Back-of-Envelope Estimation Framework
This is arithmetic you can run in fifteen minutes before a procurement conversation. All figures are illustrative benchmarks — verify against current provider pricing before any actual commitment.
Step 1: Estimate monthly token volume, separately for input and output.
Don't average them. Output tokens price at 3–5x input tokens on most frontier models. A workload that generates short completions from long prompts has a very different cost profile than one generating long-form content from short queries.
Example workload: 10 million input tokens/day, 2 million output tokens/day.
Step 2: Calculate on-demand baseline.
Using illustrative rates of $4/million input, $16/million output:
- Input: 10M × $4 × 30 days = $1,200/month
- Output: 2M × $16 × 30 days = $960/month
- On-demand baseline: $2,160/month
Step 3: Apply provisioned throughput discount (if volume is sustained).
At 40% discount with full utilization:
- Provisioned cost: ~$1,296/month
- Break-even vs. on-demand: ~65% utilization. Below that, on-demand is cheaper.
Step 4: Apply prompt caching (if workload has repeated context).
Assume 70% of input tokens are a repeated system prompt, 85% cache discount on that portion:
- Cached input: 7M tokens × $4/M × 0.15 = $4.20/day (vs. $28/day uncached)
- Uncached input: 3M tokens × $4/M = $12/day
- Input cost with caching: $16.20/day × 30 = $486/month (vs. $1,200/month)
- Output unchanged: $960/month
- With caching: ~$1,446/month — a 33% reduction from baseline without any volume commitment
Step 5: Apply hyperscaler markup (if using Azure/AWS/GCP).
At 25% markup on the on-demand baseline:
- Hyperscaler on-demand: ~$2,700/month
At 25% markup on provisioned throughput:
- Hyperscaler provisioned: ~$1,620/month
The markup doesn't disappear when you negotiate capacity. It compounds.
What this tells you: For this workload, prompt caching alone (no commitment required) saves more per month than provisioned throughput alone. If the workload has heavy repeated context, that's the first question to ask the vendor, not "can we get a volume discount?"
Field Language Guide
| Don't say | Do say | Why it matters |
|---|---|---|
| "We're paying per call" | "We're on per-token pricing — input and output billed separately" | Per-call implies a flat rate; token pricing has asymmetric input/output costs that matter for cost estimation |
| "AI pricing is unpredictable" | "Per-token pricing has variable cost; provisioned throughput gives you a fixed capacity cost with predictable per-unit rates" | Buyers asking about cost predictability have a real concern — the answer depends on which structure you're using |
| "We'll just use Azure, it's the same price" | "Azure OpenAI carries a 15–40% markup over direct API rates; we should check whether your enterprise agreement offsets that" | The markup is real and often invisible; naming it specifically opens the conversation |
| "We can cache the prompts to save money" | "Prompt caching applies to repeated input context — system prompts, document chunks. If your prompts are unique per call, caching won't help" | Caching is workload-specific; overpromising it creates a credibility problem when the savings don't materialize |
| "Provisioned throughput is always cheaper" | "Provisioned throughput breaks even against on-demand at roughly 60–70% utilization; below that, you're paying for capacity you're not using" | The commitment has a real downside for bursty or seasonal workloads |
| "We'll negotiate a volume discount" | "Volume discounts in AI inference usually come through provisioned throughput commitments, not ad hoc negotiation — we should ask what the commitment terms look like" | Positions the buyer correctly for the conversation they're about to have |
| "The model costs X per month" | "At current on-demand rates and our estimated token volume, the model runs roughly X per month — that's before any caching or capacity discounts" | Separates the baseline from the optimization, which is where the real conversation happens |
| "Output tokens are the same price as input" | "Output tokens typically cost 3–5x more than input tokens on frontier models — your cost estimate needs to account for that ratio" | The asymmetry is the most common source of underestimated AI budgets |
| "The hyperscaler handles all the AI billing" | "The hyperscaler consolidates billing, but the underlying model costs are still there — plus the platform markup" | Prevents the misconception that cloud consolidation eliminates model costs |
| "Caching saves up to 90%" | "Caching saves up to 90% on the cached portion of input tokens — overall savings depend on what percentage of your input is repeated context" | The 90% figure is real but applies to a subset of tokens; misquoting it creates expectation problems |
Okta Concept Mapping
The sharpest analog here isn't per-seat licensing — it's the way identity providers meter authentication events. Okta's MAU (Monthly Active User) pricing is consumption-based: you pay for what fires, and a spike in authentication volume produces a spike in cost. That's per-token pricing. When Okta enterprise customers negotiate a committed authentication volume at a discounted rate, that's provisioned throughput: you're buying capacity in advance, accepting some utilization risk in exchange for a lower per-unit cost. The analogy holds well through the provisioned/on-demand distinction. Where it breaks: prompt caching has no real identity analog, because authentication events don't have "repeated context" that can be cached server-side — each auth event is discrete. And hyperscaler markup maps loosely to running Okta through an MSSP or reseller, but the economics are different enough that the comparison gets thin quickly. In a buyer conversation, the MAU analogy is useful for explaining why AI costs are variable and why commitments exist — but stop there before the analogy starts doing damage.
Pricing figures in this piece are illustrative benchmarks consistent with published ranges as of early 2026. All figures should be verified against current provider pricing pages before use in a proposal or procurement conversation. Prices change without notice.

