Three model tiers show up in your accounts whether you name them or not: frontier models (GPT-4o, Claude Opus, Gemini Ultra class), efficient mid-tier models (Claude Sonnet, GPT-4o-mini, Gemini Flash class), and small specialized models built for classification and routing. Buyers use all three, often without a coherent strategy for which task goes where. When you can name the tiers and explain the routing logic between them, you move from vendor to advisor — because the question "which AI model should we use?" has a better answer than "the best one," and that answer is worth knowing before your next discovery call.
Frontier Models
What it is: A frontier model is the current capability ceiling from a major AI lab — the model that benchmarks highest on complex reasoning, multi-step synthesis, and novel problem-solving.
What it does: Frontier models handle tasks where the problem itself is hard to specify in advance. Writing a policy brief that synthesizes conflicting regulatory guidance. Generating code for a novel integration pattern with no prior examples. Running a multi-step agentic workflow where each step depends on the output of the last and the failure modes aren't predictable. These are tasks where the model needs to hold many constraints simultaneously, reason across them, and produce output that a domain expert would recognize as genuinely useful rather than plausibly formatted.
Who's behind it / where it comes from: OpenAI's GPT-4o and o3 series, Anthropic's Claude Opus line (Claude 3 Opus, Claude Opus 4), and Google's Gemini Ultra class. These are the models that get announced at keynotes and cited in congressional testimony. This is not an exhaustive list; the frontier shifts every few months.
What makes it distinct: Frontier models are the only tier that handles genuinely novel, multi-constraint reasoning at production quality. On tasks that require synthesizing ambiguous inputs into structured, defensible outputs, the gap between frontier and mid-tier is real and measurable. On tasks that don't require that — which is most tasks — the gap narrows to the point where you're paying for capability you're not using.
Efficient Mid-Tier Models
What it is: A mid-tier model is optimized for the performance-to-cost ratio — strong on structured, well-defined tasks, at a fraction of the cost of frontier.
What it does: Mid-tier models are where most enterprise workloads actually live. Document summarization. Drafting standardized responses to common queries. Extracting structured data from unstructured text. Code generation for well-understood patterns. Translation. These are tasks where the problem is well-specified, the output format is predictable, and quality is measurable against a rubric. On structured tasks, mid-tier models typically score within 3 to 5 percentage points of frontier models on standard benchmarks, close enough that the cost differential dominates the decision.
Who's behind it / where it comes from: Anthropic's Claude Sonnet series, OpenAI's GPT-4o-mini, Google's Gemini 1.5 Flash and 2.0 Flash. These are the models that don't get the keynote but handle the volume. Most of what a federal agency's AI deployment actually does day-to-day runs here.
What makes it distinct: The cost profile relative to frontier is the defining characteristic. Illustrative benchmarks from published pricing pages put mid-tier input costs roughly 10 to 20 times lower per token than frontier equivalents — figures that shift with each model generation but have held directionally consistent. For workloads running millions of inferences per month, that differential is a budget line, not a rounding error.
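The arithmetic behind that budget line can be sketched in a few lines of Python. This is a minimal illustration, not a pricing tool: the frontier price, request volume, and tokens per request are hypothetical placeholders, and only the 10x to 20x ratio comes from the text above.

```python
# Hypothetical frontier input price, for illustration only.
# This is NOT any provider's actual rate.
FRONTIER_USD_PER_MILLION_INPUT_TOKENS = 10.00


def monthly_input_cost(requests_per_month, tokens_per_request, usd_per_million_tokens):
    """Monthly input-token spend for a workload at a given price."""
    total_tokens = requests_per_month * tokens_per_request
    return total_tokens / 1_000_000 * usd_per_million_tokens


def mid_tier_cost_range(frontier_cost, low_ratio=10, high_ratio=20):
    """Return (lowest, highest) mid-tier cost implied by a 10x to 20x price gap."""
    return frontier_cost / high_ratio, frontier_cost / low_ratio


# Hypothetical workload: 2 million requests per month, 1,500 input tokens each.
frontier = monthly_input_cost(2_000_000, 1_500, FRONTIER_USD_PER_MILLION_INPUT_TOKENS)
low, high = mid_tier_cost_range(frontier)
print(f"frontier: ${frontier:,.0f} per month")
print(f"mid-tier: ${low:,.0f} to ${high:,.0f} per month")
```

Under these assumed inputs the frontier bill is $30,000 per month and the mid-tier equivalent lands between $1,500 and $3,000. Swap in current published rates before using any figure with a customer.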
Small Specialized Models
What it is: A small model is purpose-built for a narrow task — classification, routing, intent detection, similarity scoring — and optimized for speed and cost at that task rather than general capability.
What it does: Small models handle the work that happens before the work. Classifying an incoming request as "benefits inquiry" versus "security incident" before routing it to the right workflow. Detecting whether a document contains PII before sending it to a larger model for processing. Scoring semantic similarity between a query and a knowledge base entry to decide whether RAG retrieval is even necessary. Tagging, filtering, labeling at scale. These tasks are narrow, well-defined, and high-volume — exactly the conditions where a general-purpose frontier model is both overkill and slow.
Who's behind it / where it comes from: Fine-tuned variants of Microsoft's Phi-3-mini, DistilBERT derivatives, custom fine-tunes on domain-specific data, embedding models from OpenAI and Cohere. Some agencies build these internally on open-weight base models. The category is less about brand and more about fit — a small model is often the right answer precisely because it was trained on a specific task rather than everything.
What makes it distinct: Small models are the only tier where speed and cost approach zero as practical constraints. Illustrative benchmarks put small model inference costs two to three orders of magnitude below frontier: fractions of a cent per thousand inferences rather than dollars. More importantly, on the narrow tasks they're built for, small models frequently outperform larger ones. A fine-tuned classifier trained on your agency's document taxonomy will usually beat a frontier model on that taxonomy, because it was trained specifically on it.
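To make "the work that happens before the work" concrete, here is a minimal Python sketch of a pre-check step that labels intent and flags possible PII before anything reaches a larger model. The keyword lists and the single regular expression are hypothetical stand-ins for a fine-tuned classifier and a real PII detector; they only show where such a step sits in the flow.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists standing in for a fine-tuned intent classifier.
INTENT_KEYWORDS = {
    "security incident": ("breach", "phishing", "compromised"),
    "benefits inquiry": ("benefits", "eligibility", "claim"),
}

# Matches US Social Security number formatting only.
# Real PII detection covers far more than this one pattern.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass(frozen=True)
class PreCheck:
    intent: str
    contains_pii: bool


def pre_check(text: str) -> PreCheck:
    """Label the request and flag PII before it is sent to a larger model."""
    lowered = text.lower()
    intent = "unclassified"
    for label, keywords in INTENT_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            intent = label
            break
    return PreCheck(intent=intent, contains_pii=bool(SSN_PATTERN.search(text)))


print(pre_check("Question about my benefits claim, SSN 123-45-6789"))
```

In a production system a trained model would replace the keyword match, but the shape is the same: a cheap, fast decision that determines which workflow and which tier sees the request.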
Comparison Strategy: Scenario Mapping
The scenarios below map each tier to the work it does in a realistic deployment. Buyers describe their needs as use cases, not capability dimensions, so task-type fit is the primary decision variable. Each scenario shows what all three tiers contribute, so you can recognize a buyer's situation and apply the framework immediately.
Scenario 1: Novel Policy Synthesis
An agency needs to draft a regulatory impact analysis that synthesizes guidance from three conflicting federal frameworks, identifies gaps, and proposes reconciliation language. The problem is genuinely novel — no prior examples, no template, domain expertise required to evaluate the output.
Frontier. This is the use case frontier models exist for. The task requires holding multiple constraint sets simultaneously, reasoning across them, and producing output that a policy expert would recognize as substantively correct rather than superficially plausible. Mid-tier models will produce something that looks right. Whether it is right requires a human reviewer who understands the domain — which is true of frontier output too, but the gap in starting quality is real here.
Mid-tier handles the drafting of individual sections once the synthesis framework is established. Once a human (or a frontier model) has mapped the reconciliation logic, mid-tier can execute against it reliably.
Small handles pre-processing: classifying which sections of the source documents are relevant, flagging citations that require legal review, routing outputs to the appropriate reviewer queue.
Cost profile: Frontier for the synthesis step, mid-tier for execution, small for routing. The frontier inference might represent 5% of total query volume and 60% of total inference cost. That split is the point.
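A quick way to see how a small volume share can dominate spend is to multiply each tier's share of requests by its relative unit cost. The sketch below does that in Python. The mid-tier and small volume shares and all three relative costs are hypothetical values chosen for illustration; only the 5% frontier volume share comes from the scenario.

```python
# Hypothetical volume shares (only the 5% frontier share is from the scenario).
VOLUME_SHARE = {"frontier": 0.05, "mid": 0.45, "small": 0.50}

# Hypothetical relative per-request costs, with mid-tier normalized to 1.0.
RELATIVE_COST = {"frontier": 15.0, "mid": 1.0, "small": 0.01}


def cost_shares(volume_share, relative_cost):
    """Return each tier's fraction of total inference spend."""
    spend = {tier: volume_share[tier] * relative_cost[tier] for tier in volume_share}
    total = sum(spend.values())
    return {tier: amount / total for tier, amount in spend.items()}


shares = cost_shares(VOLUME_SHARE, RELATIVE_COST)
for tier, share in shares.items():
    print(f"{tier}: {VOLUME_SHARE[tier]:.0%} of volume, {share:.0%} of cost")
```

With these assumed inputs, frontier accounts for 5% of requests and roughly 62% of spend. The exact figure moves with the inputs; the lopsided shape is what matters in a buyer conversation.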
Scenario 2: High-Volume Citizen Service Processing
An agency processes 50,000 citizen service requests per month. Requests need to be categorized, routed to the appropriate program office, and answered with a standardized response where one exists.
Small handles classification and routing. Intent detection at this volume is a narrow, well-defined task — exactly what small models are built for. Running it through a frontier model would be the AI equivalent of routing every network packet through your SIEM.
Mid-tier drafts responses for the 80% of requests that fall into known categories with established answer patterns. The task is structured, the output format is predictable, and quality is measurable. This is the mid-tier's natural habitat.
Frontier handles the 5% of requests that are genuinely novel, legally complex, or require synthesis across multiple program areas — the specific requests where that capability justifies the cost.
Cost profile: Small for volume routing, mid-tier for standard response generation, frontier for exception handling. An agency that routes all 50,000 requests through frontier is paying frontier prices for work that small and mid-tier handle at equivalent or better quality. The overspend is not marginal.
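The size of that overspend can be estimated with a short calculation. The Python sketch below compares sending all 50,000 requests to frontier against the tiered split described in this scenario. The per-request costs are hypothetical. The scenario assigns 80% of requests to mid-tier and 5% to frontier and does not say where the remainder goes, so this sketch conservatively prices the unassigned share at the frontier rate.

```python
# Hypothetical per-request costs in USD, for illustration only.
COST_PER_REQUEST = {"small": 0.00003, "mid": 0.002, "frontier": 0.03}


def all_frontier_cost(requests: int) -> float:
    """Monthly cost if every request goes to a frontier model."""
    return requests * COST_PER_REQUEST["frontier"]


def tiered_cost(requests: int, mid_share: float = 0.80, frontier_share: float = 0.05) -> float:
    """Monthly cost if requests are classified first, then routed by tier."""
    mid_requests = round(requests * mid_share)
    frontier_requests = round(requests * frontier_share)
    # Assumption: requests not assigned to a tier in the scenario are
    # priced at the frontier rate, the most expensive option.
    unassigned_requests = requests - mid_requests - frontier_requests
    classification = requests * COST_PER_REQUEST["small"]  # every request is classified
    return (
        classification
        + mid_requests * COST_PER_REQUEST["mid"]
        + (frontier_requests + unassigned_requests) * COST_PER_REQUEST["frontier"]
    )


baseline = all_frontier_cost(50_000)
routed = tiered_cost(50_000)
print(f"all frontier: ${baseline:,.2f} per month")
print(f"tiered:       ${routed:,.2f} per month")
```

Even with the conservative treatment of the unassigned share, the tiered total under these assumed costs is roughly a quarter of the all-frontier total.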
The Frontier-for-Everything Mistake
This is the most common enterprise AI deployment error, and it happens for understandable reasons.
Capability is visible. Cost is invisible until the invoice arrives. When a buyer asks "which model should we use?" and the answer is "the best one," that answer is defensible in a procurement conversation, safe in a risk review, and wrong as an architecture decision. It's the AI equivalent of using a senior partner to file expense reports. Technically capable. Financially catastrophic at scale.
A second driver: frontier models are more forgiving of poorly specified prompts. When a team is still figuring out what they're asking the model to do, frontier models produce usable output even from vague instructions. Mid-tier models on vague prompts produce vague outputs. So teams that haven't invested in prompt engineering default to frontier because it hides the cost of unclear thinking. The bill comes later.
A third: procurement. In federal acquisition, "we used the most capable available model" is a defensible audit position. "We used a cheaper model to save money" requires a justification that procurement officers aren't always equipped to write. This is a real constraint, not a failure of intelligence — and it's worth naming in a buyer conversation because CAIOs know it and appreciate that you do too.
The routing pattern addresses all three. It reframes the procurement question from "which model is best?" to "which model is right for this task?" That second question has an answer that is defensible, auditable, and grounded in task complexity rather than a default to maximum capability. A CAIO at a large civilian agency, speaking at an industry forum in March 2026, put it plainly: "We stopped asking which model to buy and started asking which tasks belong at which tier. That's when the budget conversation became tractable."
Field Language Guide
| Don't say | Do say | Why it matters |
|---|---|---|
| "You should use the best model" | "The right model depends on what the task actually requires" | Positions you as an architect, not a vendor; opens the task-mapping conversation |
| "Frontier models are too expensive" | "Frontier models cost significantly more per inference — the question is whether the task justifies that cost" | Avoids editorializing; keeps the buyer in the decision seat |
| "Small models are less capable" | "Small models are purpose-built for narrow tasks and often outperform larger models on those specific tasks" | Accurate; prevents the buyer from dismissing the tier that handles their highest-volume work |
| "AI is one-size-fits-all" | "Most production deployments use multiple tiers, routed by task type" | Sets accurate expectations about architecture before procurement begins |
| "The mid-tier is good enough" | "Mid-tier models handle the majority of structured enterprise workloads at a fraction of frontier cost" | "Good enough" implies compromise; the accurate framing is fit-for-purpose |
| "You need a routing layer" | "The architecture routes tasks to the appropriate model tier based on complexity" | "Routing layer" sounds like an add-on; framing it as architecture makes it a design decision, not a workaround |
| "Which model do you want?" | "What are the task types you're trying to support, and what does quality look like for each?" | Shifts the conversation from product selection to requirements definition |
| "Frontier models are overkill for simple tasks" | "Frontier inference costs on high-volume, structured tasks add up quickly without a proportional quality return" | Specific and quantifiable; "overkill" is informal and doesn't survive a procurement conversation |
| "Small models are cheap AI" | "Small models are specialized inference — fast, low-cost, and highly accurate on the narrow tasks they're built for" | Reframes value; "cheap" implies inferior |
| "The model handles everything" | "The system routes requests across model tiers based on task complexity" | Accurate description of how production deployments actually work |
Okta Concept Mapping
The closest IDAM analog to model tier routing is risk-based authentication: the practice of routing authentication requests to different assurance levels based on contextual risk signals. Low-risk requests skip additional verification; high-risk requests trigger MFA or step-up authentication. The logic maps cleanly: route by what the request requires, not by default to the maximum.
Where it breaks: in risk-based authentication, the routing policy is explicit, auditable, and deterministic — a policy engine evaluates signals and applies rules. In AI tier routing, the routing decision is often itself a model inference. You have a model deciding which model to use, which introduces a layer of indirection that IDAM doesn't have and that governance frameworks haven't fully caught up to.
In a buyer conversation, this distinction matters when a CAIO asks "how do we govern which model handles sensitive data?" The answer isn't just policy configuration — it's architecture, and the routing layer is part of the attack surface. That's a conversation worth having before the RFP drops.
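One way to picture the governance answer is a deterministic policy check that sits on top of the model-driven routing suggestion, in the same spirit as an authentication policy engine. The Python sketch below is a hypothetical illustration, not a reference design: the tier names, the idea of an approved list for sensitive data, and the choice of which tier is approved are all assumptions made for the example.

```python
from dataclasses import dataclass

TIERS = ("small", "mid", "frontier")


@dataclass(frozen=True)
class RoutingDecision:
    tier: str
    reason: str  # recorded so each decision can be audited later


def apply_policy(suggested_tier, contains_sensitive_data, approved_for_sensitive=("frontier",)):
    """Accept or override a classifier's tier suggestion using explicit rules.

    approved_for_sensitive is a hypothetical allow-list of tiers cleared
    to process sensitive data; the first entry is used as the fallback.
    """
    if suggested_tier not in TIERS:
        raise ValueError(f"unknown tier: {suggested_tier}")
    if contains_sensitive_data and suggested_tier not in approved_for_sensitive:
        return RoutingDecision(
            tier=approved_for_sensitive[0],
            reason=f"override: {suggested_tier} not approved for sensitive data",
        )
    return RoutingDecision(tier=suggested_tier, reason="accepted classifier suggestion")


audit_log = [
    apply_policy("mid", contains_sensitive_data=False),
    apply_policy("mid", contains_sensitive_data=True),
]
for decision in audit_log:
    print(decision)
```

The model still proposes the tier, but the rule that governs sensitive data is explicit and every decision carries a recorded reason. That is the part of the architecture a CAIO can review and an auditor can trace.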
Pricing figures in this lesson are illustrative benchmarks reflecting published rates as of Q1 2026. Model pricing changes frequently; verify against current provider documentation before using in customer conversations. Specific pricing mechanics — per-token, provisioned throughput, cached inference — are covered in Lesson 4.

