The Three-Question Framework
Apply these in sequence. Skipping to Q3 before answering Q2 is how you waste forty minutes on model comparisons before discovering the agency already has a cloud contract that settles the question.
Q1: What capability tier is actually needed?
Structured tasks — classification, extraction, summarization against a defined schema — belong at the efficient or small tier. Open-ended reasoning, novel synthesis, complex judgment calls belong at frontier. The default instinct is to reach for frontier. Route instead. The capability gap between frontier and efficient has narrowed faster than the pricing gap has.
Q2: What procurement and compliance constraints already exist?
This question usually determines the hyperscaler, which usually determines the available model menu. FedRAMP High authorization currently covers Bedrock and Azure OpenAI; Vertex is in process. An existing Azure EA makes Azure OpenAI the path of least resistance regardless of which lab's model the buyer prefers. An AWS GovCloud footprint points to Bedrock. No existing cloud contract means you have a vendor onboarding conversation before you have a model conversation.
Q3: What is the volume profile?
Per-token pricing is efficient under roughly 10M tokens per month. Above that threshold, per-token costs compound fast enough that provisioned throughput becomes worth the fixed commitment: guaranteed queue position at predictable monthly cost, which is a different value proposition than per-token savings. Bursty workloads and steady-state workloads price differently even at the same total volume.
If you remember nothing else: The hyperscaler is usually determined before the model. Sequence the questions in order.
Core Vocabulary
Frontier model — The highest-capability model currently available from a given lab; priced accordingly. When it comes up: Buyers ask for it by name. GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro are frontier-tier references. Don't confuse with: Frontier lab. The lab trains the model; the model is what you're pricing. A frontier lab may not offer a frontier model on every hyperscaler platform.
Efficient model — Mid-tier capability optimized for cost-to-performance on structured tasks. When it comes up: The right answer for most production workloads that don't require open-ended reasoning. Don't confuse with: A degraded frontier model. Efficient models are purpose-built, not just cheaper versions of their frontier siblings.
Open-weight model — A model whose weights are publicly released, allowing local deployment and fine-tuning. When it comes up: Buyers ask about "open source AI" and mean this. The distinction matters for licensing. Don't confuse with: Open source. Open-weight means the weights are available; it says nothing about training data, code, or commercial use rights. Meta's Llama license includes a MAU clause that triggers commercial restrictions above 700M monthly active users — Apache 2.0 and MIT carry no such condition.
Provisioned throughput — A fixed monthly commitment that reserves model capacity and guarantees response latency. When it comes up: High-volume production deployments, latency-sensitive workloads, budget predictability requirements. Don't confuse with: A volume discount. Provisioned throughput buys queue priority, not cheaper tokens. Per-token cost may be identical or higher; you're paying for reliability, not savings.
Prompt caching — Reuse of processed prompt segments across API calls to reduce billable input tokens. When it comes up: Long system prompts repeated across many calls — common in agentic workflows. Don't confuse with: RAG. Both reduce token costs, but RAG retrieves external content at query time; prompt caching reuses already-processed context. Different mechanism, different tradeoff.
Hyperscaler markup — The price delta between a lab's direct API rate and the same model accessed through a hyperscaler's hosted platform. When it comes up: Pricing comparisons between direct lab access and Bedrock/Azure OpenAI/Vertex. Don't confuse with: Pure margin. The markup buys FedRAMP posture, existing contract vehicle, data residency guarantees, and enterprise SLA. For public sector buyers, that's the product.
Telemetry — Data that leaves your environment during model inference: prompts, completions, usage metadata. When it comes up: Every geopolitics and data residency conversation. The question is not what trained the model; it's what your inference calls send and where. Don't confuse with: Training data. What trained the model is historical. Telemetry is live. These are separate risk surfaces.
If you remember nothing else: "Open source" means Apache 2.0 to a lawyer and "I can see the weights" to an engineer. Know which conversation you're in.
Mapping Tables
Table 1: Pricing Models → Enterprise Software Contracts
| AI Term | What It Means in AI | IDAM Equivalent | Key Divergence |
|---|---|---|---|
| Per-token pricing | Pay per unit of text processed (input + output separately) | Per-API-call / per-transaction | One "call" can cost 10x another based on prompt length — there's no fixed unit price |
| Provisioned throughput | Fixed monthly commitment for reserved capacity and latency guarantee | Reserved instance / committed use | Doesn't reduce per-token cost; buys queue priority and predictable spend ceiling |
| Prompt caching | Reuse of processed prompt segments to reduce billable input tokens | Session persistence / token reuse | Cache hits reduce cost but don't persist state the way a session token does |
| Hyperscaler markup | Price delta for hosting a lab's model on a cloud platform | Reseller margin / distribution fee | Markup includes compliance posture and contract vehicle — not separable from the product |
Table 2: Capability Tiers → Procurement Language
| AI Term | What It Means in AI | IDAM Equivalent | Key Divergence |
|---|---|---|---|
| Frontier model | Highest-capability, highest-cost; at the research edge | Enterprise tier / full feature set | "Frontier" is not a stable designation — the frontier shifts every 6–12 months |
| Efficient model | Strong on structured tasks; meaningfully lower cost | Professional / Business tier | Capability gap vs. frontier has narrowed; pricing gap has not fully adjusted |
| Small / edge model | Runs in constrained environments or on-device | Lite tier / on-prem deployment | "Small" is an architectural designation, not just a feature reduction — inference location changes |
Table 3: Vendor Layers → Familiar Vendor Relationships
| AI Term | What It Means in AI | IDAM Equivalent | Key Divergence |
|---|---|---|---|
| Frontier lab | Trains and owns the model weights | Software ISV / IP owner | Labs may lack enterprise sales motion or FedRAMP posture — direct purchase is often not an option |
| Hyperscaler | Hosts models from multiple labs; provides enterprise wrapper | Cloud marketplace / distribution channel | Hyperscaler is often pre-determined by existing cloud contract, not model preference |
| Aggregator | API layer across multiple models; adds routing, fallback, observability | Identity broker / federation hub | Aggregators add latency and a new vendor dependency that may not be visible to the buyer |
| Specialty provider | Optimized hosting for specific model families or workloads | Niche ISV / point solution | Procurement path is typically outside existing contract vehicles — new vendor onboarding required |
If you remember nothing else: The vendor layer determines what's actually procurable. Model preference is downstream of contract reality.
Geopolitics: One Principle
Who runs the datacenter matters. Who trained the weights does not — at least not for data residency.
DeepSeek trained its models in China. DeepSeek's models are available on Amazon Bedrock. When an agency accesses DeepSeek via Bedrock, inference runs on AWS infrastructure in US-based datacenters under AWS's FedRAMP authorization. The agency's prompts and completions do not touch DeepSeek's infrastructure. The Chinese lab's involvement ends at the weights.
The risk surface that remains is telemetry: what usage metadata leaves your environment and where it goes. Model origin and telemetry destination are separate questions, and only one of them is live. "Where was this model trained?" is historical. "Where does my inference traffic go, and what does the platform log?" is what the security team actually needs answered.
The same distinction applies wherever a buyer conflates model origin with data exposure.
If you remember nothing else: Model origin is a procurement optics question. Data residency is an architecture question. Don't let buyers conflate them, and don't let yourself conflate them either.
For More Information, See…
| Topic | Source Lesson |
|---|---|
| Four-layer stack (frontier labs → hyperscalers → aggregators → enterprise) | Opening: The Market Structure |
| Frontier lab profiles: OpenAI, Anthropic, Google DeepMind, xAI | Lesson 1: The Labs |
| Open-weight licensing spectrum: Apache 2.0, MIT, Llama MAU clause, OSI definition | Lesson 2: Open Weight |
| Hyperscaler hosting platforms: Bedrock, Azure OpenAI, Vertex; procurement reality | Lesson 3: The Hyperscalers |
| Pricing structures: per-token, provisioned throughput, prompt caching, markup | Lesson 4: Pricing |
| Capability tiers: frontier, efficient, small; routing pattern | Lesson 5: Tiers |
| Geopolitics: trained model vs. hosting vs. telemetry; DeepSeek-on-Bedrock | Lesson 6: Geopolitics |
| Specialty providers: Cohere, Databricks, Together, Fireworks, Replicate | Lesson 7: Specialty Providers |

