Three Questions That Pick Your Model and Vendor

A three-question decision framework that resolves model selection, vendor choice, and pricing tier before the meeting starts.

By Leigh Garrity, May 8, 2026

You covered seven lessons on who builds models, who hosts them, how they're priced, and what "open" actually means. Your head is full. This is the scaffold that makes it stick: three questions, asked in order, that resolve most model and vendor decisions before they become committee theater.

Capability Tier

Capability tier — The class of model a workload actually requires. Frontier (GPT-4.1, Claude Opus 4.6, Gemini 2.5 Pro), mid-tier (Claude Sonnet 4.6, GPT-4.1 Mini, Llama 70B+), or small (Gemma, Phi, GPT-4.1 Nano, Mistral 7B, task-specific fine-tunes).

When it comes up: Every buyer conversation that starts with "we need GPT-4" when the actual workload is structured extraction from PDFs. Asking this question pulls the conversation off brand and onto workload.

Don't confuse with: Model generation or release date. Tier tracks capability, not recency. DeepSeek V4-Pro (49B active parameters, MIT license, released April 2026) outperforms older frontier models on several benchmarks at mid-tier cost.

LLM routing — Sending each request to the cheapest model that can handle it. Recall the principle that ran through the cost lessons: route, don't maximize.

When it comes up: Cost conversations. RouteLLM research (ICLR 2025, UC Berkeley/Anyscale/Canva) shows routers achieving 48–75% cost reduction while maintaining 95% of frontier quality. Production systems report 37–60% savings depending on workload mix and router quality. The 85% figure you may have seen is best-case, not typical.

Don't confuse with: Load balancing. Load balancing distributes identical requests across identical endpoints. Routing selects different models based on task complexity. The selection is the intelligence; distribution is plumbing.
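To make the routing-versus-load-balancing distinction concrete, here is a minimal sketch of "cheapest model that can handle it." The model names, relative costs, and the task-to-tier mapping are all hypothetical placeholders, not vendor data; a production router would learn or measure that mapping rather than hard-code it.

```python
from dataclasses import dataclass

# Tier ordering follows the capability tiers defined above.
TIER_RANK = {"small": 0, "mid": 1, "frontier": 2}


@dataclass(frozen=True)
class Model:
    name: str
    tier: str                  # "small", "mid", or "frontier"
    cost_per_1k_tokens: float  # made-up relative cost, not real pricing


# Hypothetical catalog for illustration only.
CATALOG = [
    Model("small-model", "small", 0.1),
    Model("mid-model", "mid", 1.0),
    Model("frontier-model", "frontier", 10.0),
]

# Assumed mapping from task type to the lowest tier that handles it well.
REQUIRED_TIER = {
    "classification": "small",
    "structured_extraction": "small",
    "summarization": "mid",
    "multi_step_reasoning": "frontier",
}


def route(task_type: str) -> Model:
    """Return the cheapest catalog model whose tier meets the task's need."""
    # Unknown task types fall back to frontier so quality is not under-served.
    needed = TIER_RANK[REQUIRED_TIER.get(task_type, "frontier")]
    capable = [m for m in CATALOG if TIER_RANK[m.tier] >= needed]
    return min(capable, key=lambda m: m.cost_per_1k_tokens)


for task in ("structured_extraction", "summarization", "multi_step_reasoning"):
    print(task, "->", route(task).name)
```

A load balancer would ignore `task_type` entirely and spread requests across copies of one model. Here the selection step is the whole point.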

If you remember nothing else

Most enterprise workloads don't need frontier models. The buyer insisting on GPT-4 for everything is overspending by 2–4x. Route by task, not by brand.

Procurement and Compliance Constraints

Hyperscaler lock-in — The practical reality that procurement, compliance certification, and existing enterprise agreements determine which platform hosts the model. As you saw in the platform comparison lessons, the model choice follows the platform.

When it comes up: When a buyer says "we're evaluating models" but already has a $40M Azure EA or an AWS GovCloud ATO. The hyperscaler is already chosen. You need to know which models run on it.

Don't confuse with: Model provider lock-in. Running Llama on Bedrock doesn't lock you to Meta. The hosting platform is the constraint; the model is increasingly swappable.

Data residency (the geopolitics principle) — Data residency follows the datacenter where inference runs, regardless of where the model was trained. Same separation you encountered in the geopolitics and sovereignty lesson.

When it comes up: Every conversation about DeepSeek. DeepSeek R1 on Bedrock runs in US East and US West regions on AWS infrastructure. User inputs and outputs stay inside the hyperscaler's boundary. A model trained in Beijing but hosted in Virginia has the same data residency posture as a model trained in San Francisco and hosted in Virginia.

Don't confuse with: Supply chain risk for the weights themselves. Weight provenance is a real concern for some threat models, but it is a different question from data residency. Conflating them loses the conversation.
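One way to keep the two questions apart is to model them as separate fields and separate checks. The sketch below is illustrative only: the `Deployment` type, the field names, and the trusted-origin list are assumptions, and `us-east-1` stands in for the Virginia region mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    model: str
    trained_in: str        # where the weights were produced (provenance)
    inference_region: str  # where requests are actually processed


def data_residency(d: Deployment) -> str:
    """Residency depends only on where inference runs."""
    return d.inference_region


def needs_provenance_review(d: Deployment, trusted_origins: set[str]) -> bool:
    """Separate question: does this threat model accept where the weights came from?"""
    return d.trained_in not in trusted_origins


# Hypothetical deployments: different training origins, same hosting region.
a = Deployment("model-a", trained_in="Beijing", inference_region="us-east-1")
b = Deployment("model-b", trained_in="San Francisco", inference_region="us-east-1")

print(data_residency(a) == data_residency(b))
print(needs_provenance_review(a, {"San Francisco"}))
```

The first check never reads `trained_in`, and the second never reads `inference_region`. That is the separation the conversation needs to preserve.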

If you remember nothing else

Datacenter location governs data residency. Training origin is a separate question entirely. Say that before the meeting where DeepSeek comes up.

Volume Profile

Per-token (on-demand) pricing — Pay for what you use, no commitment. Every hyperscaler offers this as the default. Right for workloads under roughly 10M calls per month or where volume is unpredictable.

When it comes up: Pilots, spiky workloads, any project where the buyer hasn't proven sustained volume.

Don't confuse with: Cheap. Per-token is the most expensive per-unit tier. Fine for proving volume; expensive once you've proved it.

Provisioned throughput — Reserved capacity at committed volume, purchased in platform-specific units (Model Units on Bedrock, PTUs on Azure, GSUs on Vertex). As you saw in the pricing lessons, discounts run 20–40% versus on-demand; exact pricing requires account team engagement on all three platforms.

When it comes up: When the buyer has proven sustained volume and wants cost predictability. The crossover sits around 10M calls/month.

Don't confuse with: Reserved instances for compute. Provisioned throughput reserves model inference capacity, not VMs. The billing unit is model-specific.

Batch inference — Asynchronous processing where requests queue and run outside real-time. Bedrock offers batch at 50% discount versus on-demand for select models.

When it comes up: Any workload that doesn't need a live response. Document processing, nightly report generation, bulk classification. The buyer often forgets this option exists.

Don't confuse with: Provisioned throughput. Batch trades latency for cost. Provisioned trades commitment for cost. Different levers, and one account can pull both: batch for the offline jobs, provisioned for the steady real-time traffic.

If you remember nothing else

Per-token until you're sure. Provisioned when you can prove the volume. Batch for anything that doesn't need real-time. Most buyers leave the batch lever untouched.
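The rule of thumb above can be written down as a small decision function. This is a sketch under stated assumptions: the 10M calls/month crossover and the discount figures come from the ranges in this section, the provisioned discount uses the conservative low end (20%), the batch discount applies only to select models, and the workloads and dollar amounts are hypothetical. Real provisioned pricing still requires the account team.

```python
CROSSOVER_CALLS_PER_MONTH = 10_000_000  # rough crossover cited in the text

# Fractional discounts versus on-demand, taken from the ranges in the text.
# Provisioned is shown at the low end of 20-40%; batch is for select models.
DISCOUNT = {"per-token": 0.0, "provisioned": 0.20, "batch": 0.50}


def pricing_mode(calls_per_month: int, volume_proven: bool, needs_realtime: bool) -> str:
    """Pick a pricing lever using the section's rules of thumb."""
    if not needs_realtime:
        return "batch"  # nothing is waiting on a live response
    if volume_proven and calls_per_month >= CROSSOVER_CALLS_PER_MONTH:
        return "provisioned"  # sustained, proven volume justifies commitment
    return "per-token"  # default until the volume is proven


def estimated_bill(on_demand_bill: float, mode: str) -> float:
    """Scale a hypothetical on-demand bill by the assumed discount for a mode."""
    return on_demand_bill * (1.0 - DISCOUNT[mode])


# Hypothetical workloads: (name, calls/month, volume proven?, needs real-time?)
workloads = [
    ("pilot chatbot", 200_000, False, True),
    ("production assistant", 25_000_000, True, True),
    ("nightly document processing", 5_000_000, True, False),
]

for name, calls, proven, realtime in workloads:
    print(f"{name}: {pricing_mode(calls, proven, realtime)}")
```

Note the ordering of the checks: the real-time question comes first, so a high-volume offline job lands on batch rather than provisioned. The estimate also assumes committed capacity is fully used, which is the risk the buyer takes on with provisioned throughput.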

Vocabulary Collision Zones

These terms mean different things in IDAM and AI. The collision is where your credibility holds or breaks.

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Token	A chunk of text (~4 characters) the model processes; the pricing unit	OAuth access token — a security credential conveying authorization	An LLM token carries zero authorization. When a buyer says "token budget," ask: cost budget or security credential? Okta's AI agent token exchange bridges the two: an agent exchanges its identity token for a scoped OAuth access token.
Session	Conversation thread; context carried across turns; not tied to authentication	Authenticated continuity with lifetime, idle timeout, revocation	An AI chat session may outlast the auth session or end before it. Authorization state doesn't persist in the model's memory unless the system re-supplies it every turn.
Context window	The text a model can attend to in one request (working memory)	Security context: authentication state, user attributes, policy bindings	The context window is not durable memory, not an audit log, and not identity state. It can be truncated without notice.
Scope	Task boundary, tool permission, operational blast radius (informal, no spec)	OAuth scope — a precise authorization string (`okta.users.read`) granted by an authorization server	OAuth scopes constrain the authorization boundary but leave the behavioral surface unguarded. An agent can hold a scope permitting calendar writes and still generate inappropriate writes from injected context.
Agent	An LLM-based system that uses tools, makes decisions, may behave non-deterministically	Endpoint agent, service process — static, deterministic, defined function	AI agents select tools at runtime based on prompts. A static service-account model misses delegated authority, tool chaining, and prompt sensitivity. Okta treats AI agents as identities requiring governance — GA April 30, 2026.
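The Scope row is the easiest collision to show in code. The sketch below separates the authorization check (does the agent hold the scope?) from a behavioral check (should this particular action happen?). Everything here is hypothetical: the `calendar.write` scope string, the `triggered_by` provenance tag, and the policy itself are assumptions for illustration, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    required_scope: str
    triggered_by: str  # hypothetical provenance tag: "user" or "retrieved_content"


def scope_allows(granted_scopes: set[str], call: ToolCall) -> bool:
    """Authorization boundary: does the agent hold the scope the tool needs?"""
    return call.required_scope in granted_scopes


def behavior_allows(call: ToolCall) -> bool:
    """Assumed behavioral policy: act only on instructions that came from the user."""
    return call.triggered_by == "user"


def permit(granted_scopes: set[str], call: ToolCall) -> bool:
    """A call must pass both the scope check and the behavioral check."""
    return scope_allows(granted_scopes, call) and behavior_allows(call)


granted = {"calendar.write"}
legit = ToolCall("create_event", "calendar.write", triggered_by="user")
injected = ToolCall("create_event", "calendar.write", triggered_by="retrieved_content")

print(scope_allows(granted, injected))  # the scope check alone does not catch injection
print(permit(granted, injected))
print(permit(granted, legit))
```

Both calls carry the same valid scope. Only the second check distinguishes them, which is why an OAuth scope on its own leaves the behavioral surface unguarded.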

The Underused Pattern

Open-weight model (Llama, Mistral, DeepSeek). Hosted on the hyperscaler you already have under contract. Enterprise guardrails layered on top (Bedrock Guardrails for PII detection, toxicity filters, prompt-injection mitigation; equivalent controls on Vertex and Azure).

This pattern delivers three things at once: cost reduction versus proprietary frontier models (DeepSeek R1 runs at roughly one twenty-seventh the cost of comparable proprietary reasoning models), data residency on infrastructure you already govern, and no new procurement vehicle.

The hyperscaler charges a managed-service premium. Vertex AI charges roughly 55% more for Llama 3.3 70B than a pure-play inference provider. That premium buys integration, compliance posture, and the fact that your buyer's security team already approved the platform. For most public sector accounts, that premium is the cheapest part of the decision.

Not covered above, but worth knowing: No named public sector agency has published a case study of this pattern. The technical capability is documented and available on all three hyperscalers. The adoption evidence is practitioner-level, not institutional.

If you remember nothing else: Open-weight on your existing hyperscaler, with guardrails, is the right answer more often than your buyer expects. It survives procurement review, cost review, and security review simultaneously. Lead with it.

For More Information, See…

Recap Entry	Source Lesson
Capability tier, LLM routing	Model tiers and the route-don't-maximize principle
Hyperscaler lock-in, platform comparison	Bedrock, Vertex, and Azure AI Foundry
Data residency, geopolitics principle	Geopolitics and data sovereignty
Per-token, provisioned, and batch pricing	Pricing models and cost optimization
Open-weight vs. proprietary	Open-weight licensing and what "open" means
Vocabulary collisions (token, session, scope, agent, context)	Consolidated from multiple lessons
Open-weight-on-hyperscaler pattern	Synthesis of pricing, hosting, and open-weight lessons

Things to follow up on...

Prompt caching at scale: Every major provider now offers cached-prompt discounts of 70–90% on repeated system prompts, a cost lever most enterprises haven't pulled yet — Anthropic's pricing page shows cache reads at 10% of base input price.
RouteLLM's peer-reviewed results: The UC Berkeley/Anyscale/Canva routing paper accepted at ICLR 2025 is the strongest independent evidence for the route-don't-maximize principle — the research shows 48–75% cost reduction with 95% quality retention using a matrix factorization router.
DeepSeek V4-Pro benchmarks: Released April 24, 2026 under MIT license with 1.6T total parameters (49B active), this model now leads SWE-bench Verified at 80.6% — current benchmark rankings show it outperforming several proprietary frontier models at a fraction of the inference cost.
Cohere's sovereign AI play: Cohere's April 2026 agreement to acquire Germany's Aleph Alpha significantly strengthens its European data residency story — the acquisition positions Cohere as the enterprise AI vendor most explicitly built for regulated, cloud-agnostic, sovereignty-sensitive buyers.
