Models & Vendors: Recap

By Carey Whitten— May 5, 2026

The Three-Question Framework

Apply these in sequence. Skipping to Q3 before answering Q2 is how you waste forty minutes on model comparisons before discovering the agency already has a cloud contract that settles the question.

Q1: What capability tier is actually needed?

Structured tasks — classification, extraction, summarization against a defined schema — belong at the efficient or small tier. Open-ended reasoning, novel synthesis, complex judgment calls belong at frontier. The default instinct is to reach for frontier. Route instead. The capability gap between frontier and efficient has narrowed faster than the pricing gap has.

Q2: What procurement and compliance constraints already exist?

This question usually determines the hyperscaler, which usually determines the available model menu. FedRAMP High authorization currently covers Bedrock and Azure OpenAI; Vertex is in process. An existing Azure EA makes Azure OpenAI the path of least resistance regardless of which lab's model the buyer prefers. An AWS GovCloud footprint points to Bedrock. No existing cloud contract means you have a vendor onboarding conversation before you have a model conversation.

Q3: What is the volume profile?

Per-token pricing is efficient under roughly 10M tokens per month. Above that threshold, per-token costs compound fast enough that provisioned throughput becomes worth the fixed commitment: guaranteed queue position at predictable monthly cost, which is a different value proposition than per-token savings. Bursty workloads and steady-state workloads price differently even at the same total volume.

If you remember nothing else: The hyperscaler is usually determined before the model. Sequence the questions in order.

Core Vocabulary

Frontier model — The highest-capability model currently available from a given lab; priced accordingly. When it comes up: Buyers ask for it by name. GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro are frontier-tier references. Don't confuse with: Frontier lab. The lab trains the model; the model is what you're pricing. A frontier lab may not offer a frontier model on every hyperscaler platform.

Efficient model — Mid-tier capability optimized for cost-to-performance on structured tasks. When it comes up: The right answer for most production workloads that don't require open-ended reasoning. Don't confuse with: A degraded frontier model. Efficient models are purpose-built, not just cheaper versions of their frontier siblings.

Open-weight model — A model whose weights are publicly released, allowing local deployment and fine-tuning. When it comes up: Buyers ask about "open source AI" and mean this. The distinction matters for licensing. Don't confuse with: Open source. Open-weight means the weights are available; it says nothing about training data, code, or commercial use rights. Meta's Llama license includes a MAU clause that triggers commercial restrictions above 700M monthly active users — Apache 2.0 and MIT carry no such condition.

Provisioned throughput — A fixed monthly commitment that reserves model capacity and guarantees response latency. When it comes up: High-volume production deployments, latency-sensitive workloads, budget predictability requirements. Don't confuse with: A volume discount. Provisioned throughput buys queue priority, not cheaper tokens. Per-token cost may be identical or higher; you're paying for reliability, not savings.

Prompt caching — Reuse of processed prompt segments across API calls to reduce billable input tokens. When it comes up: Long system prompts repeated across many calls — common in agentic workflows. Don't confuse with: RAG. Both reduce token costs, but RAG retrieves external content at query time; prompt caching reuses already-processed context. Different mechanism, different tradeoff.

Hyperscaler markup — The price delta between a lab's direct API rate and the same model accessed through a hyperscaler's hosted platform. When it comes up: Pricing comparisons between direct lab access and Bedrock/Azure OpenAI/Vertex. Don't confuse with: Pure margin. The markup buys FedRAMP posture, existing contract vehicle, data residency guarantees, and enterprise SLA. For public sector buyers, that's the product.

Telemetry — Data that leaves your environment during model inference: prompts, completions, usage metadata. When it comes up: Every geopolitics and data residency conversation. The question is not what trained the model; it's what your inference calls send and where. Don't confuse with: Training data. What trained the model is historical. Telemetry is live. These are separate risk surfaces.

If you remember nothing else: "Open source" means Apache 2.0 to a lawyer and "I can see the weights" to an engineer. Know which conversation you're in.

Mapping Tables

Table 1: Pricing Models → Enterprise Software Contracts

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Per-token pricing	Pay per unit of text processed (input + output separately)	Per-API-call / per-transaction	One "call" can cost 10x another based on prompt length — there's no fixed unit price
Provisioned throughput	Fixed monthly commitment for reserved capacity and latency guarantee	Reserved instance / committed use	Doesn't reduce per-token cost; buys queue priority and predictable spend ceiling
Prompt caching	Reuse of processed prompt segments to reduce billable input tokens	Session persistence / token reuse	Cache hits reduce cost but don't persist state the way a session token does
Hyperscaler markup	Price delta for hosting a lab's model on a cloud platform	Reseller margin / distribution fee	Markup includes compliance posture and contract vehicle — not separable from the product

Table 2: Capability Tiers → Procurement Language

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Frontier model	Highest-capability, highest-cost; at the research edge	Enterprise tier / full feature set	"Frontier" is not a stable designation — the frontier shifts every 6–12 months
Efficient model	Strong on structured tasks; meaningfully lower cost	Professional / Business tier	Capability gap vs. frontier has narrowed; pricing gap has not fully adjusted
Small / edge model	Runs in constrained environments or on-device	Lite tier / on-prem deployment	"Small" is an architectural designation, not just a feature reduction — inference location changes

Table 3: Vendor Layers → Familiar Vendor Relationships

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Frontier lab	Trains and owns the model weights	Software ISV / IP owner	Labs may lack enterprise sales motion or FedRAMP posture — direct purchase is often not an option
Hyperscaler	Hosts models from multiple labs; provides enterprise wrapper	Cloud marketplace / distribution channel	Hyperscaler is often pre-determined by existing cloud contract, not model preference
Aggregator	API layer across multiple models; adds routing, fallback, observability	Identity broker / federation hub	Aggregators add latency and a new vendor dependency that may not be visible to the buyer
Specialty provider	Optimized hosting for specific model families or workloads	Niche ISV / point solution	Procurement path is typically outside existing contract vehicles — new vendor onboarding required

If you remember nothing else: The vendor layer determines what's actually procurable. Model preference is downstream of contract reality.

Geopolitics: One Principle

Who runs the datacenter matters. Who trained the weights does not — at least not for data residency.

DeepSeek trained its models in China. DeepSeek's models are available on Amazon Bedrock. When an agency accesses DeepSeek via Bedrock, inference runs on AWS infrastructure in US-based datacenters under AWS's FedRAMP authorization. The agency's prompts and completions do not touch DeepSeek's infrastructure. The Chinese lab's involvement ends at the weights.

The risk surface that remains is telemetry: what usage metadata leaves your environment and where it goes. Model origin and telemetry destination are separate questions, and only one of them is live. "Where was this model trained?" is historical. "Where does my inference traffic go, and what does the platform log?" is what the security team actually needs answered.

The same distinction applies wherever a buyer conflates model origin with data exposure.

If you remember nothing else: Model origin is a procurement optics question. Data residency is an architecture question. Don't let buyers conflate them, and don't let yourself conflate them either.

For More Information, See…

Topic	Source Lesson
Four-layer stack (frontier labs → hyperscalers → aggregators → enterprise)	Opening: The Market Structure
Frontier lab profiles: OpenAI, Anthropic, Google DeepMind, xAI	Lesson 1: The Labs
Open-weight licensing spectrum: Apache 2.0, MIT, Llama MAU clause, OSI definition	Lesson 2: Open Weight
Hyperscaler hosting platforms: Bedrock, Azure OpenAI, Vertex; procurement reality	Lesson 3: The Hyperscalers
Pricing structures: per-token, provisioned throughput, prompt caching, markup	Lesson 4: Pricing
Capability tiers: frontier, efficient, small; routing pattern	Lesson 5: Tiers
Geopolitics: trained model vs. hosting vs. telemetry; DeepSeek-on-Bedrock	Lesson 6: Geopolitics
Specialty providers: Cohere, Databricks, Together, Fireworks, Replicate	Lesson 7: Specialty Providers