Three tiers govern how production AI systems are actually built: frontier models for reasoning tasks that require judgment under ambiguity, efficient mid-tier models that handle the bulk of production traffic at a fraction of frontier cost, and small models for classification, routing, and triage. Buyers use this language in RFPs. Architects use it in design reviews. The AE who can speak it precisely, explaining what each tier is actually for rather than just naming it, has a different conversation than the one who can't. The routing pattern that connects these tiers is the central architectural concept, and it corrects the most common mistake in enterprise AI deployments: using frontier models for everything.
Frontier
What it is. A frontier model is the highest-capability model a lab currently offers, optimized for complex, multi-step reasoning tasks where the answer isn't deterministic and the path to it requires judgment.
What it does. Frontier handles tasks where the input is ambiguous, the required output involves synthesis across multiple domains, or the cost of a wrong answer is high enough to justify slower, more expensive inference. Document analysis with legal implications. Code generation for security-critical systems. Policy interpretation where edge cases matter. Agentic workflows where the model needs to decompose an underspecified goal into a sequence of actions and recover when a step fails. These tasks share a common property: the model needs to hold the problem in working context, identify where it's underspecified, and produce output that accounts for that ambiguity.
Who's behind it. OpenAI's GPT-4o and o3 series. Anthropic's Claude Opus 4. Google DeepMind's Gemini 2.5 Pro. xAI's Grok-3. Each lab maintains a flagship model at this tier; the specific version turns over on roughly a quarterly cadence, but the tier designation is stable. When a lab announces a new flagship, it's a frontier-tier announcement, and the naming conventions (Opus, Pro, the absence of "mini" or "Flash") signal this directly.
What makes it distinct. Reasoning depth. Frontier models don't just retrieve and recombine; they can identify where a problem is underspecified and produce outputs that account for that ambiguity. That capability is real, and it costs real money per token. The cost profile from Lesson 4 applies here: frontier inference runs roughly 5–10x the per-token cost of mid-tier, which is the same ratio as mid-tier running at 10–20% of frontier cost. That ratio is the reason the routing pattern exists.
Efficient Mid-Tier
What it is. An efficient mid-tier model is a production-optimized model built to handle the majority of enterprise AI workloads — roughly 80% of production traffic — at 10–20% of frontier cost.
What it does. Handles tasks where the input is well-structured, the required output follows a predictable pattern, and the cost of a wrong answer is recoverable. Summarization. Translation. Structured data extraction. Q&A against a known knowledge base via the RAG pattern. Customer-facing response generation within defined guardrails. These represent most of what enterprise AI actually does in production. If you can write a clear evaluation rubric for the output before the model runs, you're probably looking at a mid-tier task.
Who's behind it. Anthropic's Claude Sonnet 4. OpenAI's GPT-4.1 mini. Google DeepMind's Gemini 2 Flash. xAI's Grok-3 mini. These models are explicitly positioned by their labs as production workhorses — the Sonnet and Flash naming conventions signal this directly, and the labs publish benchmark comparisons showing mid-tier models matching frontier performance on structured tasks while running at a fraction of the cost.
What makes it distinct. The cost-capability ratio, and the fact that mid-tier is purpose-built for the workload profile that dominates production. A mid-tier model doing summarization is running at full capacity on a task it was designed for. This matters in buyer conversations: when someone says "we evaluated the cheaper model and it wasn't as good," the follow-up question is always "as good at what task?" If the task was structured extraction, the evaluation methodology was almost certainly wrong, not the model.
Small
What it is. A small model is a low-latency, low-cost model optimized for single-step classification, routing, and triage tasks where the output is a label, a score, or a binary decision.
What it does. Classifies intent. Routes requests to the appropriate downstream model or tool. Scores content for safety or relevance. Extracts structured fields from structured inputs. Detects language. Flags anomalies for human review. The answer space is bounded and the task doesn't require reasoning across context. Small models are fast enough to run inline — sub-100ms latency is achievable — which makes them viable as the first step in a multi-model pipeline. The routing decision itself is often made by a small model. The model that decides which model to use is usually the cheapest one in the stack.
Who's behind it. Claude Haiku 3.5 (Anthropic). Gemini 2 Flash-Lite (Google DeepMind). OpenAI's smaller model variants. At this tier, open-weight models — Llama 3 class, Mistral small variants — also appear in enterprise deployments, particularly where data residency requirements or cost sensitivity make hosted inference impractical. The Lesson 1 labs all maintain small-tier offerings, but this is the tier where open-weight alternatives most frequently compete on equal footing.
What makes it distinct. Latency and inline deployability. A small model can sit in the request path without adding meaningful latency to the user experience. A frontier model cannot, and this isn't primarily a cost argument — it's an architectural constraint. If your routing decision takes three seconds, you've already failed the user experience requirement before the actual task begins. Small models exist because some decisions need to happen before the conversation starts, not during it.
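The inline routing step described in this section can be sketched in a few lines. This is a minimal illustration, not a production design: `classify_tier` is a hypothetical keyword stub standing in for a real small-model call, and the handler names and tier labels are assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a small classification model. A real deployment
# would call a hosted or open-weight small model here, not keyword rules.
def classify_tier(request: str) -> str:
    text = request.lower()
    if any(word in text for word in ("interpret", "ambiguous", "legal")):
        return "frontier"
    if any(word in text for word in ("summarize", "translate", "extract")):
        return "mid"
    return "small"

# Hypothetical handlers, one per tier. Each would wrap a model call.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "frontier": lambda req: f"[frontier] {req}",
    "mid": lambda req: f"[mid] {req}",
    "small": lambda req: f"[small] {req}",
}

def route(request: str) -> str:
    """Classify first (cheap, inline), then dispatch to the chosen tier."""
    tier = classify_tier(request)
    return HANDLERS[tier](request)
```

The point of the shape is that the classification runs before any expensive call is made, so the cheapest component in the stack decides where the request goes.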
Comparison: Four Dimensions That Drive Routing Decisions
Rather than comparing the tiers model by model, walk through the four dimensions that actually govern routing decisions and see where each tier sits on each one. The same analysis applies to any task you encounter, not just the ones enumerated below.
Reasoning requirement. Frontier handles tasks where the answer requires judgment across ambiguous inputs — the model needs to reason, not retrieve. Mid-tier handles tasks where the answer requires synthesis across well-defined inputs — the model needs to combine, not judge. Small handles tasks where the answer is a classification or a score — the model needs to match, not synthesize. The practical test: can you write a deterministic rubric for the correct output before the model runs? Yes means mid-tier or small. No means frontier.
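The practical test above can be written down as a small decision function. This is a sketch under stated assumptions: the `Task` fields are hypothetical names, and splitting mid-tier from small on whether the output is a label, score, or binary decision follows the tier definitions earlier in this lesson rather than any published standard.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    rubric_writable: bool  # can a deterministic rubric be written before the model runs?
    bounded_output: bool   # is the output a label, a score, or a binary decision?

def select_tier(task: Task) -> str:
    """Apply the rubric test: no rubric means frontier, otherwise mid or small."""
    if not task.rubric_writable:
        return "frontier"
    return "small" if task.bounded_output else "mid"
```

For example, `select_tier(Task("policy interpretation", False, False))` returns `"frontier"`, while a summarization task with a clear rubric and free-text output returns `"mid"`.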
Latency tolerance. Small models run inline, sub-100ms. Mid-tier models run in the 500ms–2s range for typical production workloads. Frontier models run in the 3–15s range for complex reasoning tasks, longer for extended context. If the task is in the request path and the user is waiting, latency tolerance governs tier selection as much as capability does. An agent that takes 12 seconds to classify an incoming request has an architecture problem, not a model problem.
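To make the latency constraint concrete, the sketch below filters tiers by a latency budget. The numbers are the upper ends of the ranges quoted above, used here as illustrative worst cases, not measured figures; the function and constant names are hypothetical.

```python
from typing import List

# Upper-end latency per tier in seconds, taken from the ranges quoted above.
# Illustrative only: real latency depends on model, prompt size, and load.
TIER_MAX_LATENCY_S = {"small": 0.1, "mid": 2.0, "frontier": 15.0}

def tiers_within_budget(budget_s: float) -> List[str]:
    """Return the tiers whose worst-case latency fits the given budget."""
    return [tier for tier, worst in TIER_MAX_LATENCY_S.items() if worst <= budget_s]
```

With a 200 ms budget for an inline routing step, only the small tier qualifies; with a 3-second budget, small and mid-tier do.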
Cost sensitivity. Directionally: small is the floor, frontier is the ceiling, mid-tier is where the economics of production AI actually live. The 80/20 observation (80% of production traffic at 10–20% of frontier cost) is the economic argument for building a routing layer rather than defaulting to frontier. The cost difference compounds at scale. If the remaining 20% of traffic still goes to frontier, a workflow that routes correctly costs roughly a third of one that sends everything to frontier, at equivalent output quality for the tasks that don't require frontier reasoning.
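The 80/20 arithmetic is easy to check. The sketch below assumes, for illustration, that the 20% of traffic not handled by mid-tier all goes to frontier and that the cost of the small-model router is negligible; neither assumption comes from a published price list.

```python
def blended_cost(frontier_share: float, mid_cost_ratio: float) -> float:
    """Cost of a routed workload relative to sending everything to frontier (1.0).

    frontier_share: fraction of traffic that still goes to frontier (assumed).
    mid_cost_ratio: mid-tier cost as a fraction of frontier cost.
    """
    mid_share = 1.0 - frontier_share
    return frontier_share * 1.0 + mid_share * mid_cost_ratio

# 20% of traffic on frontier, mid-tier at 10% and at 20% of frontier cost.
low = blended_cost(0.20, 0.10)   # about 0.28 of the all-frontier bill
high = blended_cost(0.20, 0.20)  # about 0.36 of the all-frontier bill
```

Under these assumptions the routed workload lands at about 28–36% of the all-frontier cost. The saving grows as the share of traffic that genuinely needs frontier shrinks.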
Failure mode. This is the dimension that gets the least attention. Frontier failure is usually a reasoning failure: the model produces a confident, plausible, wrong answer to an ambiguous question. Mid-tier failure is usually a task-fit failure: the model produces a correct-looking answer to a task it wasn't the right tool for. Small model failure is usually a classification failure: the model routes a task to the wrong tier, which then propagates the error downstream. Each failure mode has a different remediation, and knowing which tier a failure originated at is the first diagnostic step.
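Identifying the originating tier is only possible if each step of a workflow records which tier handled it. A minimal sketch of that idea follows; `StepRecord` and its fields are hypothetical, and how a step gets marked as failed (human review, an automated check) is left open.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StepRecord:
    step: str   # name of the workflow step
    tier: str   # "small", "mid", or "frontier"
    ok: bool    # whether the step's output passed whatever check is in place

def first_failure_tier(trace: List[StepRecord]) -> Optional[str]:
    """Return the tier of the earliest failed step, or None if all passed."""
    for record in trace:
        if not record.ok:
            return record.tier
    return None
```

Because a routing error at the small tier propagates downstream, looking at the earliest failed step rather than the last one points at where the problem started.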
Field Language Guide
| Don't say | Do say | Why it matters |
|---|---|---|
| "We're using GPT-4 for everything" | "We're running frontier models for complex reasoning tasks and routing structured workloads to mid-tier" | Names the architecture, not just the vendor |
| "The AI is expensive" | "Frontier inference is priced for complex reasoning; most production traffic routes to mid-tier at roughly 10–20% of that cost" | Separates cost from capability, opens the routing conversation |
| "We need the best model" | "We need the right model for each task — frontier for ambiguous reasoning, mid-tier for structured production workloads" | Reframes "best" as fit-for-purpose, not absolute capability |
| "Small models can't handle this" | "Small models handle classification and routing inline; the reasoning happens at the tier the router sends the task to" | Clarifies what small models are for before dismissing them |
| "We're evaluating which model to use" | "We're defining the routing logic that matches task complexity to model tier" | Signals architectural maturity, not just model selection |
| "The model handles the whole workflow" | "The workflow uses multiple tiers — small for routing, mid-tier for most tasks, frontier for the steps that require judgment" | Describes how production AI actually works |
| "We're using Claude" | "We're using Claude Sonnet for production workloads and Claude Opus for the reasoning-intensive steps" | Tier-specific model references demonstrate operational depth |
| "AI costs are unpredictable" | "Costs are predictable once you've defined which tasks route to which tier — the variable is task volume, not model selection" | Addresses a common procurement objection with architectural framing |
| "We need to test the frontier model first" | "Frontier is the right starting point for capability validation; production architecture routes most traffic to mid-tier" | Validates the buyer's instinct while introducing the routing concept |
| "The model isn't accurate enough" | "Accuracy at which tier? Mid-tier accuracy on a frontier task is an architecture problem, not a model problem" | Redirects accuracy complaints toward root cause |
Okta Concept Mapping: Adaptive Authentication
The routing pattern maps cleanly to adaptive authentication — the principle that you don't require step-up verification for every login, you escalate based on risk signals. Low-risk context gets a lighter authentication experience; high-risk context gets stepped up to MFA or re-verification. Model routing works the same way: low-complexity tasks route to small or mid-tier, high-complexity tasks route to frontier. The analogy holds well enough to use in a buyer conversation, especially with a security-minded audience. Where it breaks: adaptive auth makes routing decisions against well-defined risk signals with established thresholds — device posture, location anomaly, behavior baseline. Model routing requires the system to assess task complexity in real time, which is itself an inference problem, and the model making that assessment is usually a small model. The routing decision is a prediction, not a policy lookup. That distinction matters when a buyer asks how the system knows which tier to use — and they will ask.
The routing pattern is the architecture. Treating it as a cost-optimization layer added after the fact is what produces systems that are expensive, slow, and opaque when something breaks. A system that sends every task to the most capable model available is making a latency, cost, and design error simultaneously, and calling it caution. The tiers exist because inference cost is not flat across task complexity, and treating it as flat is what makes AI deployments hard to reason about when something goes wrong. Knowing which tier a task belongs to, and why, is the operational knowledge that separates an AE who can hold a design conversation from one who can only hold a vendor conversation.

