Four pricing structures appear in AI contracts: per-token (the default), provisioned throughput (the volume commitment), prompt caching (the repeated-context discount), and per-call (the specialty API exception). A fifth factor applies across all four: the hyperscaler markup, running 15–40% over going direct to the lab.
All pricing figures in this piece are illustrative benchmarks. Verify against current vendor pricing pages before any procurement conversation.
The Four Structures
Per-Token Pricing
What it is: Billing based on the number of tokens processed, with input and output counted separately at different rates.
What it does: Creates a direct relationship between usage and cost, with no commitment required. Input tokens — what you send to the model — are priced lower than output tokens because generation is computationally more expensive than reading. A frontier model might charge $15 per million input tokens and $60 per million output tokens. A mid-tier model might run $3 input / $15 output. The gap between input and output rates matters: workloads that generate long responses cost disproportionately more than workloads that process long documents.
Who sets it / where it appears: All major labs publish per-token rates on their pricing pages — OpenAI, Anthropic, Google DeepMind. This is the default structure for direct API access and the baseline against which every other structure is compared. It's what an agency gets if they sign up, start calling the API, and never negotiate anything else.
What makes it distinct: No floor. You pay for exactly what you use, which means costs scale linearly with volume in both directions. At low volume, it's the cheapest option. At high volume, it's usually not.
Provisioned Throughput
What it is: A reserved capacity commitment — you pay for a guaranteed level of throughput, measured in tokens per minute or requests per second, regardless of whether you use it fully.
What it does: Eliminates the per-token variable cost in exchange for a fixed reservation fee. At sufficient volume, the effective per-token cost drops 30–50% compared to on-demand pricing. It also provides latency guarantees that on-demand pricing doesn't — reserved capacity doesn't queue behind other users during peak demand. For a production workload where an agency's employees are waiting on a response, that queue matters.
Who sets it / where it appears: Azure OpenAI Service calls these Provisioned Throughput Units (PTUs) — a term you'll encounter in federal RFPs because Azure is the dominant hyperscaler path for FedRAMP-authorized AI workloads. AWS Bedrock and Google Vertex AI have equivalent mechanisms under different names. Not all models are available in provisioned configurations; newly released models typically start on-demand only and add provisioned options later.
What makes it distinct: The commitment is the point. The purchase is capacity, not tokens. Savings are real at scale, but the floor is real at low volume — you pay for the reservation whether the agency uses it or not.
Prompt Caching
What it is: A mechanism that stores the processed representation of repeated prompt content — system prompts, document context, instruction sets — so it doesn't need to be reprocessed on every API call.
What it does: Reduces the effective input token cost for workloads where a large portion of the prompt is identical across requests. The cached portion is billed at a fraction of the standard input rate, up to 90% reduction on cached tokens. Uncached tokens, the unique portion of each request, are still billed at standard rates. The math only works if the use case actually has repeated context.
Who sets it / where it appears: Anthropic introduced prompt caching for Claude in 2024; OpenAI followed with a similar mechanism. Implementation details differ — Anthropic requires explicit cache control markers in the API call; OpenAI's implementation is more automatic. Availability through hyperscalers may lag direct API availability, which is worth asking about explicitly when a vendor proposes a caching-dependent cost model.
What makes it distinct: The savings are use-case-dependent in a way the other structures aren't. A RAG pipeline where every query pulls different documents gets minimal benefit. An agent with a fixed 50,000-token system prompt that runs 10,000 times a day gets dramatic benefit. Same model, same pricing structure, two completely different cost outcomes.
Per-Call Pricing
What it is: A flat fee per API request, regardless of the token count of the input or output.
What it does: Decouples cost from prompt complexity. A one-sentence query and a 10,000-token document analysis cost the same per call. This creates predictable per-transaction costs but can be economically inefficient for short, simple queries where the flat fee exceeds what per-token pricing would have cost.
Who sets it / where it appears: Appears primarily in specialty or embedded model APIs: image generation (DALL-E, Stable Diffusion API), speech-to-text (Whisper), embedding models in some configurations, and fine-tuned or domain-specific model endpoints. Less common for general-purpose text generation, where per-token is the standard. When an agency is procuring a multi-modal AI capability — document processing, image analysis, transcription — expect to see per-call pricing for the non-text components alongside per-token pricing for the text components.
What makes it distinct: The unit of pricing is the transaction, not the content. Useful for budgeting when call volume is predictable but query complexity varies. Harder to optimize through prompt engineering, because the cost doesn't change with prompt length.
Reading the Pricing Structure in the Room
Comparison structure: scenario mapping rather than a trait matrix. In a procurement conversation, the relevant question is which structure fits what the agency just described, and what that implies for the total cost.
The hyperscaler markup is a cross-cutting factor that applies to every scenario below. When an agency accesses AI through Azure, AWS, or GCP rather than directly through the lab, the hyperscaler adds 15–40% to the lab's published rates. That premium buys FedRAMP authorization, enterprise support agreements, data residency controls, and integration with existing cloud infrastructure, all legitimate value for a federal buyer. It doesn't change which pricing structure applies; it changes what that structure costs.
Scenario 1: Exploratory or low-volume use case. An agency is piloting an AI assistant for a small team — a few hundred queries per day, no production SLA, no committed timeline. Per-token pricing is the right structure. No commitment makes sense. The hyperscaler markup applies if they're going through Azure or AWS, which for a federal agency they almost certainly are. Total monthly cost at this scale is typically in the hundreds of dollars, not thousands — the markup is real but not the dominant variable.
Scenario 2: High-volume, steady-state production workload. An agency has moved past pilot and is running an AI capability that processes thousands of requests per day with a defined user population and predictable query patterns. This is the provisioned throughput conversation. The question to ask: what's the current or projected monthly per-token spend? If it's approaching $5,000–$10,000 per month, provisioned throughput is worth modeling. Apply the 30–50% discount to the per-token baseline and compare against the commitment floor — what's the minimum reservation, and what's the penalty for underutilization? The hyperscaler markup applies to the provisioned rate, not the lab's published rate.
Scenario 3: Agent or assistant with fixed large context. An agency is deploying an AI assistant that has a large, consistent system prompt — agency policies, regulatory context, a fixed knowledge base — plus variable user queries. This is the prompt caching opportunity. Estimate what percentage of total input tokens are repeated context. If a 40,000-token system prompt appears in every call, and the average user query is 500 tokens, roughly 98% of input tokens are cacheable. Apply the 80–90% caching discount to that portion. The uncached portion (the user query) is still billed at standard input rates. Prompt caching and provisioned throughput are not mutually exclusive; a high-volume agent workload may benefit from both.
Scenario 4: Multi-modal or specialty capability. An agency is procuring document processing, image analysis, or transcription alongside text generation. Expect a hybrid pricing structure: per-call for the specialty components, per-token for the text generation. Budget these separately. The per-call components are often where cost surprises hide, because buyers focus on the text generation pricing and undercount the specialty API calls.
Back-of-Envelope Estimation Framework
Five steps, usable before a procurement conversation:
Step 1 — Estimate monthly token volume.
- Input tokens: (average prompt length in tokens) × (calls per day) × 30
- Output tokens: (average response length in tokens) × (calls per day) × 30
- If you don't have these numbers, ask the agency what a typical interaction looks like. A 500-word prompt is roughly 650 tokens. A 300-word response is roughly 400 tokens.
Step 2 — Apply per-token rates.
- Frontier model (illustrative): $15/million input, $60/million output
- Mid-tier model (illustrative): $3/million input, $15/million output
- These figures give you the on-demand baseline before any discounts or markups.
Step 3 — Check provisioned throughput threshold.
- If monthly on-demand spend exceeds ~$5,000–$10,000, model provisioned throughput.
- Apply 30–50% discount to the per-token baseline.
- Compare against the commitment floor — what's the minimum reservation, and what's the penalty for underutilization?
Step 4 — Check caching opportunity.
- What percentage of input tokens are repeated context?
- Apply 80–90% discount to that portion of input costs.
- Confirm caching is available for the specific model through the specific hyperscaler.
Step 5 — Apply hyperscaler markup.
- Add 15–40% to the lab's published rates if going through Azure, AWS, or GCP.
- The exact markup varies by model, by hyperscaler, and by the agency's enterprise agreement terms. Ask the hyperscaler rep directly — this number is negotiable in enterprise agreements.
The output is a range, not a precise cost projection. Enough to tell you whether the vendor's number is in the right order of magnitude, and which variables are worth negotiating.
Field Language Guide
| Don't say | Do say | Why it matters |
|---|---|---|
| "What's the cost per query?" | "What's the average token count per request, and how does the structure price input and output separately?" | Per-query framing misses the input/output split, which is where cost surprises live |
| "AI costs" | "Inference costs" or "token consumption costs" | Buyers use "AI costs" to mean everything from compute to licensing to headcount; precision signals fluency |
| "We'll pay as we go" | "We're starting on on-demand pricing — at what volume does provisioned throughput become worth modeling?" | Signals you know there's a threshold, which opens the negotiation |
| "The caching discount" | "Prompt caching — the discount applies to the repeated portion of the input, not the full request" | Buyers who've heard "90% savings" need to understand the condition attached |
| "It's cheaper through Azure" | "The hyperscaler markup runs 15–40% over direct lab pricing — what's included in that premium for this agency's compliance posture?" | Reframes the markup as a value question, not a cost complaint |
| "Per request" | "Per call" (specialty APIs) or "per token" (general-purpose models) | The distinction matters for budgeting; conflating them produces wrong estimates |
| "We can scale it later" | "What's the provisioned throughput commitment period, and what's the utilization floor before the economics flip?" | Surfaces the commitment risk before the agency signs |
| "Fixed-price AI contract" | "Provisioned throughput commitment" | If that's what they mean — using the right term confirms you understand what they're buying |
| "Token budget" | "Monthly token volume estimate" (cost context) or "context window allocation" (prompt design context) | "Token budget" is ambiguous between cost planning and prompt engineering; the buyer may mean either |
| "The model costs X" | "The model's per-token rate is X, but effective cost depends on prompt structure, caching eligibility, and whether we're going direct or through a hyperscaler" | Prevents the buyer from anchoring on a number that doesn't reflect their actual workload |
Okta Concept Mapping
The closest IDAM analog to provisioned throughput is the enterprise seat license with a committed user floor — you pay for capacity whether or not every seat is active, and the per-unit cost drops significantly compared to consumption-based pricing. The structural logic is identical: commit to volume, get a lower rate, absorb the floor risk.
The analogy breaks at the unit of measurement. A seat is a person: visible in the directory, auditable in a report, countable without instrumentation. A token is invisible to the buyer without active monitoring. You know how many users are licensed. You do not automatically know how many tokens your agency consumed last month, how they were distributed across input and output, or what percentage were cached — unless you've built the observability layer to capture that data. When a CAIO asks "how do we track this," the honest answer is that token consumption requires instrumentation that seat-based licensing never did. Make observability part of the procurement conversation before the contract is signed.

