Models & Vendors: The Three Questions, in Order

By Leigh Garrity, May 6, 2026

Nine pieces. What follows is organized by decision point, not by lesson, so you can find the right section before a call rather than reconstruct the whole sequence.

The Decision Framework

Run these in order. The answer to each one constrains the next.

Question 1: What capability tier does this task actually require?

Three tiers, established in Lesson 5:

Small — Classification, extraction, routing, simple lookup. Runs fast, costs almost nothing, handles the majority of enterprise AI tasks once you break them apart.
Efficient — Drafting, summarization, moderate reasoning, structured output. The workhorse tier. Most enterprise use cases land here.
Frontier — Complex multi-step reasoning, novel synthesis, agentic task chains. Genuinely necessary for a narrow slice of what agencies are actually building.

The most common enterprise mistake, per Lesson 5: defaulting to frontier because it feels safer. Slower, more expensive, harder to audit. That's what frontier delivers when the task doesn't require it. Route by what the task requires. Vendor marketing will always point you toward frontier.
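
A minimal sketch of routing by task requirement, assuming the task types named in the tier list above. The task-type labels and the `route` function are hypothetical names for illustration, not part of any vendor API.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    EFFICIENT = "efficient"
    FRONTIER = "frontier"


# Hypothetical labels, taken from the task types listed under each tier.
TASK_TIER = {
    "classification": Tier.SMALL,
    "extraction": Tier.SMALL,
    "routing": Tier.SMALL,
    "simple_lookup": Tier.SMALL,
    "drafting": Tier.EFFICIENT,
    "summarization": Tier.EFFICIENT,
    "moderate_reasoning": Tier.EFFICIENT,
    "structured_output": Tier.EFFICIENT,
    "multi_step_reasoning": Tier.FRONTIER,
    "novel_synthesis": Tier.FRONTIER,
    "agentic_task_chain": Tier.FRONTIER,
}


def route(task_type: str) -> Tier:
    """Return the tier mapped to a task type.

    Unknown task types raise instead of silently falling back to
    frontier, so an unclassified task gets reviewed, not over-provisioned.
    """
    try:
        return TASK_TIER[task_type]
    except KeyError:
        raise ValueError(f"Unclassified task type: {task_type!r}") from None
```

The design choice worth noting is the missing default: there is no fallback to frontier, which is the mistake Lesson 5 warns about.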

Open-weight models on a hyperscaler (Lesson 7) are often the right answer at the Efficient tier — comparable output quality, no proprietary lock-in, deployed inside infrastructure you already control.

If you remember nothing else: Route by task complexity first. Everything else follows from that.

Question 2: What procurement and compliance constraints already exist?

This question usually answers itself. From Lesson 3:

Existing AWS enterprise agreement → Bedrock
Microsoft EA or Azure credits → Azure OpenAI Service
GCP contract or Google Workspace → Vertex AI

Technical merit is rarely the selection driver. The hyperscaler is selected by procurement, and the model catalog comes with it. Your job in the call is to confirm which hyperscaler the agency is already committed to, then work within that catalog. The infrastructure decision is already made.

Data residency requirements and FedRAMP authorization status narrow the model list further. The hyperscaler's compliance posture covers the infrastructure layer. Model-level authorization is a separate question, and the list of FedRAMP-authorized model deployments is shorter than most customers assume.
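
The two filters in this question can be sketched as a lookup followed by a catalog filter. This is an illustration under stated assumptions: the commitment keys, the `CatalogEntry` record, and the model names in the usage lines are all hypothetical, and real authorization status has to come from the hyperscaler's current documentation.

```python
from dataclasses import dataclass

# Commitment-to-platform mapping from the list above; the keys are hypothetical labels.
PLATFORM_BY_COMMITMENT = {
    "aws_enterprise_agreement": "Bedrock",
    "microsoft_ea": "Azure OpenAI Service",
    "azure_credits": "Azure OpenAI Service",
    "gcp_contract": "Vertex AI",
    "google_workspace": "Vertex AI",
}


@dataclass(frozen=True)
class CatalogEntry:
    model: str                # hypothetical model name
    platform: str
    fedramp_authorized: bool  # model-level status, not infrastructure-level


def candidate_platforms(commitments: list[str]) -> list[str]:
    """Platforms implied by existing commitments, in order, without duplicates."""
    platforms: list[str] = []
    for commitment in commitments:
        platform = PLATFORM_BY_COMMITMENT.get(commitment)
        if platform and platform not in platforms:
            platforms.append(platform)
    return platforms


def shortlist(catalog: list[CatalogEntry], commitments: list[str],
              require_fedramp: bool) -> list[str]:
    """Models on an already-committed platform that pass the model-level check."""
    platforms = candidate_platforms(commitments)
    return [
        entry.model
        for entry in catalog
        if entry.platform in platforms
        and (entry.fedramp_authorized or not require_fedramp)
    ]


# Usage with made-up catalog entries.
catalog = [
    CatalogEntry("model-a", "Bedrock", True),
    CatalogEntry("model-b", "Bedrock", False),
    CatalogEntry("model-c", "Vertex AI", True),
]
print(shortlist(catalog, ["aws_enterprise_agreement"], require_fedramp=True))
```

Keeping the authorization flag on the model record, not the platform, mirrors the point above: infrastructure compliance and model-level authorization are separate questions.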

This is also where the geopolitics question resolves (covered below).

If you remember nothing else: The hyperscaler is usually already chosen. Find out which one before you discuss any specific model.

Question 3: What is the volume profile?

From Lesson 4:

Under roughly 10 million API calls per month: Per-token pricing. Predictable enough to budget, flexible enough to absorb variable workloads. Watch for prompt caching opportunities when the same system prompt appears across many calls — the savings are real and the math is immediate.
Over roughly 10 million calls per month: Provisioned throughput. The 30–50% cost reduction threshold from Lesson 4 applies here. Treat it like a capacity reservation, not a subscription — you're committing to a specific model version at a specific throughput level.

Bursty workloads complicate this. A use case that averages 8 million calls per month but spikes to 25 million during peak periods is not cleanly a per-token account. Lesson 4 has the framework for that.
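
A rough sketch of the volume check, assuming the approximate 10 million calls-per-month crossover stated above. The function name and the "review" outcome are hypothetical; the sketch only flags bursty workloads and does not reproduce the Lesson 4 framework.

```python
# Approximate crossover from the text; treat it as a rule of thumb, not a hard limit.
THRESHOLD_CALLS_PER_MONTH = 10_000_000


def pricing_mode(avg_calls: int, peak_calls: int) -> str:
    """Classify a workload by average and peak monthly call volume.

    Returns "per-token", "provisioned", or "review" when the average sits
    under the threshold but the peak crosses it (the bursty case).
    """
    if peak_calls < avg_calls:
        raise ValueError("peak_calls must be at least avg_calls")
    if avg_calls >= THRESHOLD_CALLS_PER_MONTH:
        return "provisioned"
    if peak_calls >= THRESHOLD_CALLS_PER_MONTH:
        return "review"
    return "per-token"


# The bursty example from the text: averages 8 million, spikes to 25 million.
print(pricing_mode(8_000_000, 25_000_000))
```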

If you remember nothing else: Per-token under 10 million. Provisioned above. The math in Lesson 4 shows you when the crossover pays.

Vocabulary Mapping: Pricing Terms

The collision zone where AI vendor pricing vocabulary lands on enterprise procurement ears trained by software deals.

AI Term	What It Means in AI	IDAM / Procurement Equivalent	Key Divergence
Token (pricing unit)	Roughly ¾ of a word; the unit vendors charge against	API call / transaction	Token count varies by content length and language. You cannot predict monthly cost from call volume alone without knowing average prompt and response length.
Per-token pricing	Pay for what you use, billed by token consumed	Consumption-based / metered billing	Familiar model, unfamiliar unit. The budget conversation requires token estimates, not just call estimates.
Provisioned throughput	Reserved processing capacity for a specific model at a committed rate	Reserved capacity / committed use discount	Commitment is to a model version, not a service tier. If the vendor deprecates that version, the reservation terms change. Read the contract.
Prompt caching	Storing repeated prompt prefixes so they aren't re-processed on each call	Session persistence / stateful connection	Caching reduces cost but does not maintain state. Each call is still stateless. The savings come from not re-processing the same system prompt 10,000 times a day.
Context window	Maximum tokens the model can hold in a single call	Session timeout / token lifetime	A context window is a capacity limit, not a time limit. It doesn't expire; it fills. When it fills, earlier content is dropped, not archived.
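
The token and caching rows above can be made concrete with a small cost estimator. This is a sketch under loud assumptions: the prices are placeholder numbers, not any vendor's rates, and the caching model (a flat discount on the cached prefix, no charge for writing to the cache) is a simplification that varies by vendor.

```python
def estimate_tokens(word_count: int) -> int:
    """Apply the rough 3/4-word-per-token rule: tokens = words * 4/3, rounded up."""
    return -(-word_count * 4 // 3)


def monthly_cost(calls: int, prompt_tokens: int, response_tokens: int,
                 input_price_per_m: float, output_price_per_m: float,
                 cached_prefix_tokens: int = 0,
                 cache_discount: float = 0.0) -> float:
    """Estimate monthly spend from call volume AND average token lengths.

    Prices are per million tokens. cache_discount is the assumed fraction
    saved on the cached prefix (0.5 means cached tokens cost half).
    """
    if cached_prefix_tokens > prompt_tokens:
        raise ValueError("cached prefix cannot exceed the prompt length")
    billable_input = (prompt_tokens - cached_prefix_tokens) \
        + cached_prefix_tokens * (1.0 - cache_discount)
    input_cost = calls * billable_input / 1_000_000 * input_price_per_m
    output_cost = calls * response_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Same call volume, different prompt lengths, placeholder prices.
long_prompts = monthly_cost(1_000_000, 1000, 500, 1.0, 2.0)
short_prompts = monthly_cost(1_000_000, 100, 50, 1.0, 2.0)
with_caching = monthly_cost(1_000_000, 1000, 500, 1.0, 2.0,
                            cached_prefix_tokens=800, cache_discount=0.5)
print(long_prompts, short_prompts, with_caching)
```

The two uncached estimates share a call count and differ tenfold in cost, which is why call volume alone cannot set the budget.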

If you remember nothing else: Provisioned throughput is a capacity reservation against a specific model version. It behaves more like a reserved instance than a software subscription.

Vocabulary Mapping: Architecture and Access Terms

The collision zone where AI system design vocabulary overlaps with IDAM vocabulary — same words, different mechanisms.

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Session (model context)	The active context window for a conversation or task chain	Auth session / SSO session	An AI session has no authentication state. It holds prompt history. An auth session holds identity assertions. These are unrelated objects that happen to share a name.
Context (prompt context)	The accumulated input the model can see in a given call	Identity context / claims	Identity context is about who the user is. Prompt context is about what the model has been told. Conflating them in a customer conversation creates real confusion about what data the model can access.
Scope (model access)	What data or tools an agent or model has been granted access to	OAuth scope	OAuth scope is enforced by an authorization server. Model scope is enforced by prompt design and system instructions — a fundamentally weaker guarantee.
Agent (AI agent)	A model configured to take sequential actions using tools and APIs	Endpoint agent / device agent	An AI agent acts on behalf of a user or system using credentials it was given. It is not an identity principal. It does not authenticate. This is the gap Okta's agentic identity work addresses.
Open weights	Model parameters published publicly; can be run on any infrastructure	Open source software	Open weights ≠ open source. The weights are available; the training data and methodology often aren't. Governance and licensing terms vary significantly by model family.

If you remember nothing else: When a customer says "session," ask which kind. An AI context session and an auth session have nothing to do with each other, and the confusion will surface in a security review if not in the sales call.

Geopolitics, Resolved

Lesson 6 established the question. Here's the answer you can say in a meeting:

Who runs the datacenter matters. Who trained the weights does not.

A model trained by a Chinese lab and deployed on AWS Bedrock inside a FedRAMP-authorized region is, from a compliance standpoint, an AWS workload. The datacenter operator — the hyperscaler — determines the applicable compliance posture, data residency, and audit rights. The model's country of origin affects none of those things once it's running inside a controlled environment.

That's a procurement framing, not a political one. The datacenter operator is the answer to Question 2.

Where this breaks: if the model calls home to an external API during inference, or otherwise isn't fully self-contained in the hyperscaler environment, the datacenter argument doesn't hold. Confirm that the deployment has no runtime dependency on the model vendor's infrastructure before using this framing with a customer.
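
The conditions above can be written down as a pre-call checklist. This is only an illustrative sketch: the `Deployment` fields are hypothetical names, and the answers have to come from the hyperscaler's and model vendor's documentation, not from code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    operator: str                        # the hyperscaler running the datacenter
    in_authorized_region: bool           # deployed inside an authorized region
    weights_hosted_by_operator: bool     # model runs fully inside the operator's environment
    calls_vendor_api_at_inference: bool  # any call-home to the model vendor during inference


def datacenter_framing_holds(d: Deployment) -> bool:
    """True only when the deployment is self-contained in the operator's environment."""
    return (
        d.in_authorized_region
        and d.weights_hosted_by_operator
        and not d.calls_vendor_api_at_inference
    )


# Usage with made-up answers for two deployments.
self_contained = Deployment("AWS", True, True, False)
calls_home = Deployment("AWS", True, True, True)
print(datacenter_framing_holds(self_contained), datacenter_framing_holds(calls_home))
```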

For More Information

Recap Entry	Source	Section
Four-layer stack	Opening	Models & Vendors
Frontier / Efficient / Small tier taxonomy; routing pattern; defaulting-to-frontier mistake	Lesson 5	Models & Vendors
Hyperscaler selection; procurement as selection driver; data residency	Lesson 3	Models & Vendors
Per-token vs. provisioned throughput; 30–50% savings threshold; prompt caching	Lesson 4	Models & Vendors
Open-weight models on hyperscalers	Lesson 7	Models & Vendors
Geopolitics and the stack; datacenter operator vs. model origin	Lesson 6	Models & Vendors
Token definition; context window mechanics	Lesson 1	Models & Vendors
Agent architecture; scope enforcement gaps	Lesson 2	Models & Vendors