Frontier, Efficient, Small — Most Enterprise Queries Don't Need the Expensive One

Enterprise AI runs three model tiers; a router pattern decides which handles each query, and most production traffic never needs frontier.

By Leigh Garrity, May 8, 2026

Enterprise AI deploys three model tiers: frontier, efficient, and small. Buyers talk about them when they say "model selection" or "we're evaluating which models to use." Under that phrase sits a specific question: which tier handles which work, and what decides. The decision mechanism is called a router. It's a small classifier model that reads each incoming query, scores its complexity, and sends it to the right tier automatically. The vocabulary you need in the room: frontier handles the hardest reasoning, efficient handles most production traffic at a fraction of the cost, and small models handle classification, simple tasks, and often the routing itself. Most enterprise queries don't need frontier, and the most expensive mistake in production AI is paying frontier prices for work that cheaper models handle just as well.

Frontier Models

What it is: The highest-capability tier, purpose-built for tasks where ambiguity, multi-step reasoning, or novel problem-solving makes cheaper models unreliable.

What it does: Synthesizes research across contradictory sources, generates complex multi-file code, performs scientific reasoning, and holds long chains of logic together without losing coherence. On graduate-level science benchmarks (GPQA Diamond), frontier models score 91–94%. On routine enterprise work like summarization, extraction, and single-file code generation, efficient-tier models produce functionally interchangeable results. The frontier premium only shows up on the hardest reasoning tasks. That overlap is why tiered architecture exists. (Source note: GPQA Diamond scores are reported by aggregator sites drawing from Artificial Analysis and LMSYS Chatbot Arena leaderboard data. Treat as directional across evaluation conditions, not as fixed numbers.)

Who's behind it: Claude Opus 4.6 (Anthropic), GPT-5.4 (OpenAI), Gemini 3.1 Pro (Google DeepMind). Output pricing runs $15–25 per million tokens.

What makes it distinct: Frontier earns its cost on tasks that require holding ambiguity without collapsing it prematurely. A document summary runs fine on efficient tier. Analyzing where two contradictory agency memos create compliance gaps requires the reasoning depth that justifies frontier pricing. The performance gap between frontier and efficient is widest on the hardest reasoning and nearly invisible on routine work. That asymmetry is the entire economic argument for tiered architecture.

Efficient Mid-Tier Models

What it is: The workhorse tier, handling the majority of production traffic at 5–25x less cost than frontier.

What it does: Everyday coding assistance, content generation, structured data extraction, analysis, and most question-answering. On standard enterprise work, including summarization, extraction, and code review, efficient-tier models deliver quality comparable to frontier, and the gap is negligible in practice. Sonnet runs at roughly 60% of Opus pricing on a per-token basis; Flash runs at a fraction of that. For the simplest work, small-tier models cost around 20% of frontier pricing while retaining roughly 85% of the quality. (Source note: the quality-to-cost ratios are from TechCloudPro, a practitioner analysis blog synthesizing vendor-published benchmarks. The ratios are consistent with RouteLLM findings but treat the specific percentages as approximate.)

Who's behind it: Claude Sonnet 4.6 (Anthropic), GPT-4.1 (OpenAI), Gemini 2.5 Flash (Google). Output pricing across this tier ranges from roughly $2.50 to $15 per million tokens depending on vendor and capability. Flash is the fastest in the tier at 200+ tokens per second. Sonnet leads on nuanced writing. GPT-4.1 has the broadest fine-tuning ecosystem.

What makes it distinct: This tier exists because the delta between "good enough" and "the absolute best" costs 5–25x more per token, and for 70–80% of production queries, the two are indistinguishable. The most common enterprise deployment mistake is routing everything through frontier when efficient would produce identical results for the actual task.

Small Models

What it is: The smallest tier serves two distinct functions: routing queries and handling the simplest ones. Some small models are purpose-built classifiers that decide where each query goes. Others are lightweight general-purpose models that handle the easy work once the routing decision is made.

What it does: The routing classifier reads each incoming query, estimates its complexity, and sends it to the appropriate tier. Think of it as a traffic controller that runs in milliseconds, purpose-built to do one thing: read a query and decide where it goes. A fine-tuned classifier achieves ~90% routing accuracy on well-defined intent categories while adding under 5 milliseconds of latency. (Source note: this figure is from a credible engineer-founder practitioner blog with academic citations; no peer-reviewed study specifically quantifies routing classifier accuracy for enterprise workloads. Treat as a reasonable practitioner estimate, not an audited production number.) The small general-purpose models handle classification, formatting, simple lookups, and structured output, tasks where even efficient-tier pricing is wasteful at volume.

Who's behind it: For routing classifiers: fine-tuned lightweight models like DistilBERT or Qwen2.5-0.5B, typically too specialized to have consumer brand names. For small general-purpose work: Microsoft Phi-4 (MIT-licensed, strong on reasoning benchmarks for its size), Claude Haiku 4.5, GPT-4.1 Nano. Pricing at this tier runs $0.15–$1 per million input tokens.

What makes it distinct: Small models are the economic foundation of the whole architecture. Without a reliable classifier at the front, every query defaults to frontier pricing. The router turns three separate tiers into a cost optimization system. It's the difference between having three tiers of service and having a host who seats people correctly.

Okta Concept Mapping: The Router as Adaptive Auth

The router pattern works like adaptive authentication. A small classifier evaluates each query at ingress, scores its complexity, and routes it to the appropriate tier — the same way an adaptive MFA policy evaluates risk signals and decides whether to allow passwordless access or require a step-up challenge. A confidence gate (escalate below 0.70, trust above 0.85) parallels step-up auth thresholds. If you've explained adaptive MFA to a buyer, you already have the architectural vocabulary for routing.
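The confidence gate can be sketched in a few lines of Python. The 0.70 and 0.85 thresholds come from the text above; everything else is an assumption for illustration, including the tier names, the hypothetical `Prediction` type, and the choice to step up one tier in the middle band, which the text does not specify.

```python
from dataclasses import dataclass

TIERS = ["small", "efficient", "frontier"]  # cheapest to most capable

ESCALATE_BELOW = 0.70  # from the text: low confidence, escalate
TRUST_ABOVE = 0.85     # from the text: high confidence, trust the classifier


@dataclass
class Prediction:
    tier: str          # tier the classifier proposes
    confidence: float  # classifier's score for that proposal, 0.0 to 1.0


def route(pred: Prediction) -> str:
    """Apply the confidence gate to a classifier prediction."""
    if pred.confidence > TRUST_ABOVE:
        return pred.tier
    if pred.confidence < ESCALATE_BELOW:
        # Assumption: low confidence goes straight to the top tier.
        return "frontier"
    # Assumption: in the middle band, step up one tier as a hedge.
    idx = TIERS.index(pred.tier)
    return TIERS[min(idx + 1, len(TIERS) - 1)]
```

The shape mirrors a step-up policy: high confidence passes through, low confidence gets the strongest challenge, and the middle band gets something in between.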

How Routing Turns Tiers into Architecture

The buyer conversation is always "which model handles which work," and the router is the mechanism that answers it. Each query type maps to a tier based on complexity, with cost and quality consequences attached.

The RouteLLM paper (ICLR 2025, UC Berkeley/Anyscale/Canva, peer-reviewed, widely cited) provides the cleanest published evidence. Its routers cut costs by up to 85% on conversational benchmarks while maintaining 95% of GPT-4 quality. On more structured tasks, savings ran 35–46%. The core finding: a matrix factorization router achieved 95% of GPT-4's performance while sending only 26% of queries to GPT-4. The other 74% went to cheaper models, and measured quality stayed at 95% of the GPT-4 baseline.

The routers were trained on fewer than 1,500 labeled samples and generalized across model families without retraining. You don't need massive labeled datasets to build an effective router. That's a practical production signal worth knowing.
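One way to picture the "26% of queries to the strong model" result: a router emits a difficulty score per query, and a cutoff on that score sets how much traffic the expensive model sees. The sketch below picks that cutoff from a calibration sample. It is not the RouteLLM API; the function names are hypothetical and the scores would come from whatever router you train.

```python
def threshold_for_budget(scores: list[float], strong_fraction: float) -> float:
    """Return the score cutoff that sends roughly `strong_fraction`
    of queries to the strong model.

    `scores` are router outputs on a calibration sample; higher means harder.
    """
    if not 0.0 < strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    k = max(1, round(len(ranked) * strong_fraction))
    return ranked[k - 1]


def use_strong_model(score: float, threshold: float) -> bool:
    """Route to the strong model when the score meets the cutoff."""
    return score >= threshold
```

Moving the cutoff is the cost-quality dial: lower it and more traffic goes to the strong model, raise it and more goes to the cheap one. Tied scores at the cutoff can push the actual share slightly above the target.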

In practice, this plays out across three bands:

Simple queries (formatting, classification, FAQ lookup): Routed to small tier. "What's the status of my request?" or "Reformat this as a table." The classifier scores these as low-complexity with high confidence. Cost: pennies per thousand queries.

Standard work (summarization, extraction, code review, drafting): Routed to efficient tier. "Summarize this 20-page policy document" or "Extract the key dates from this contract." 70–80% of enterprise traffic lands here, and it's where the economics of tiered architecture actually get made.

Hard problems (ambiguous reasoning, multi-document synthesis, novel analysis): Routed to frontier. "Compare these three contradictory regulatory guidance memos and identify where they create compliance gaps." Maybe 10–20% of traffic, but the quality degradation from misrouting these is the most visible and most costly.

The blended economics hold up. Practitioner analysis shows that routing easy tasks to small-tier models, medium tasks to efficient-tier, and hard tasks to frontier produces roughly 58% cost savings versus sending everything to frontier. (Source note: iternal.ai is a practitioner analysis blog; the math uses published per-tier pricing from Anthropic's model family as an illustration. The percentage holds directionally across vendors, but your actual savings depend on traffic distribution.) At enterprise query volumes, this is the difference between a sustainable production deployment and a budget conversation nobody wants to have.
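The blended-cost arithmetic is simple enough to run yourself. The sketch below uses made-up prices and a made-up traffic mix chosen to sit inside the ranges quoted in this article. It also assumes every query uses the same number of tokens regardless of tier, which real traffic will not. It does not reproduce the 58% figure; it only shows how the number moves with the inputs.

```python
def blended_cost(mix: dict[str, float], price: dict[str, float]) -> float:
    """Average price per million tokens for a traffic mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price[tier] for tier, share in mix.items())


def savings_vs_frontier(mix: dict[str, float], price: dict[str, float]) -> float:
    """Fractional savings compared with sending everything to frontier."""
    return 1.0 - blended_cost(mix, price) / price["frontier"]


# Hypothetical inputs, in dollars per million tokens and share of traffic.
price = {"small": 1.0, "efficient": 10.0, "frontier": 20.0}
mix = {"small": 0.10, "efficient": 0.75, "frontier": 0.15}

print(f"{savings_vs_frontier(mix, price):.0%}")  # prints 47% for these made-up inputs
```

Swap in your own traffic distribution and vendor pricing; the savings figure is only as good as those two inputs.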

The failure mode that matters. Routing failures don't throw errors. They produce silent quality degradation: a complex query gets misrouted to the cheap tier, produces a mediocre answer, and nobody catches it because there's no error code for "not as good as it should have been." The output just isn't sharp enough. Catching this requires behavioral monitoring of signals such as regeneration rates, retry patterns, and users rephrasing the same question. The two errors are not symmetric. Misrouting a hard query to the cheap tier costs quality. Misrouting an easy query to frontier costs a few extra cents. Smart routing architectures therefore bias toward the overspending error and route up when in doubt, because overspending is recoverable and quality failures compound silently.
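Behavioral monitoring of this kind can start very small. The sketch below assumes a hypothetical event log where each response records its tier and whether the user regenerated or rephrased afterward; the field names, the baseline rates, and the 0.05 tolerance are all placeholders, not values from any cited source.

```python
from collections import defaultdict


def regeneration_rates(events: list[dict]) -> dict[str, float]:
    """Share of responses per tier that the user regenerated or rephrased."""
    total = defaultdict(int)
    redo = defaultdict(int)
    for e in events:
        total[e["tier"]] += 1
        if e["regenerated"] or e["rephrased"]:
            redo[e["tier"]] += 1
    return {tier: redo[tier] / n for tier, n in total.items()}


def flag_tiers(
    rates: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Tiers whose redo rate exceeds their baseline by more than `tolerance`."""
    return sorted(t for t, r in rates.items() if r - baseline.get(t, 0.0) > tolerance)
```

A rising redo rate on the small or efficient tier is one plausible early signal that hard queries are being routed too low.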

Okta Concept Mapping: Where Your IDAM Intuition Misleads You

In identity, an authorization decision is deterministic: a SAML assertion validates or it doesn't. Routing is probabilistic. A confidence score of 0.73 means the classifier is right about 73% of the time on queries it scores at that level, and only if its scores are well calibrated. Routing produces a different kind of failure than access control: the answer was worse than it should have been, and you might not notice. When a buyer asks about routing accuracy, your authorization vocabulary won't map cleanly. Risk tolerance is the closer frame: how much quality variance is acceptable for the cost savings?
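Whether a score near 0.73 really corresponds to being right about 73% of the time is something you can check against labeled routing decisions. The sketch below buckets predictions by confidence and reports observed accuracy per bucket. The function name and the five-bin default are assumptions; the labels would have to come from your own review of routed queries.

```python
def reliability_table(
    confidences: list[float],
    correct: list[bool],
    n_bins: int = 5,
) -> list[tuple[float, float, int, float]]:
    """Bucket predictions by confidence and report observed accuracy.

    Returns (bin_low, bin_high, count, accuracy) for each non-empty bin.
    """
    bins: list[list[bool]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append(ok)
    rows = []
    for i, hits in enumerate(bins):
        if hits:
            rows.append((i / n_bins, (i + 1) / n_bins, len(hits), sum(hits) / len(hits)))
    return rows
```

If the 0.6 to 0.8 bucket shows accuracy well below that range, the scores are overconfident and the gate thresholds need to move up.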

Okta Concept Mapping: Compliance as the First Router

For public sector accounts, FedRAMP and ITAR authorization status determines which tiers are available before cost-quality tradeoff enters the conversation. ITAR workloads on Azure Government are currently limited to efficient-tier models like GPT-4.1 and o3-mini; frontier models like GPT-5.4 and Claude Opus 4.6 haven't cleared those isolated environments yet. The routing conversation with a defense contractor starts differently than with a civilian agency: the compliance boundary is the first router, and it runs before any classifier does. (Source note: sweetspot.so aggregates public cloud documentation; the underlying constraint is Azure Government's model catalog, which lags commercial availability.)
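"Compliance as the first router" amounts to a hard filter that runs before the classifier's proposal is honored. The sketch below uses a hypothetical policy table; the environment names and the tiers listed for each are illustrative assumptions, not a statement of what any cloud catalog currently permits.

```python
TIER_ORDER = ["small", "efficient", "frontier"]  # cheapest to most capable

# Hypothetical policy table: which tiers each environment may use.
ALLOWED_TIERS = {
    "commercial": ["small", "efficient", "frontier"],
    "itar": ["small", "efficient"],  # assumption mirroring the constraint described above
}


def constrained_route(environment: str, proposed_tier: str) -> str:
    """Clamp the classifier's proposed tier to what the environment permits."""
    allowed = ALLOWED_TIERS[environment]
    if proposed_tier in allowed:
        return proposed_tier
    # Fall back to the most capable tier that is allowed.
    return max(allowed, key=TIER_ORDER.index)
```

In this sketch a hard query in the restricted environment lands on the efficient tier regardless of what the classifier wanted, which is why the routing math changes for those accounts.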

How to Say This in the Field

Don't say	Do say	Why it matters
"You should use GPT-5 for everything. It's the most capable."	"Most of your production traffic won't need frontier. The question is which 10–20% does, and how you route the rest to cheaper tiers."	Shows you understand production economics, not just capability rankings.
"Small models aren't very good."	"Small models handle classification and routing. They're the traffic controller, not the destination. Different job entirely."	Prevents the buyer from thinking tiers are just a quality ladder.
"AI models are like different security levels."	"The router pattern works like adaptive auth: evaluate at ingress, route based on complexity, escalate when confidence is low."	Specific analogy lands; vague analogy sounds like guessing.
"Routing saves money."	"Routing cuts costs by roughly 35–85% depending on the workload, by sending only the hardest queries to frontier. The RouteLLM paper showed 95% of GPT-4 quality with 74% fewer GPT-4 calls."	Specific numbers from a peer-reviewed source beat a generic claim.
"We need to pick the right model."	"Think of it as a model portfolio: a router matches queries to tiers based on complexity, so each tier handles the work it's suited for."	Moves the conversation from 'which model' to 'which architecture.'
"What model are you using?"	"Are you routing across tiers, or running everything through a single model?"	The answer tells you whether they're optimizing or overspending.
"Frontier models are better."	"Frontier is better at ambiguous multi-step reasoning. On summarization and extraction, efficient-tier models produce equivalent results at 80% less cost."	Task-specific claims are credible; blanket claims aren't.
"The risk is that the AI gives a wrong answer."	"The risk in routing is silent quality degradation: a hard query gets sent to a cheap model and the output is mediocre, not wrong. No error code. You need behavioral monitoring to catch it."	Shows you understand the actual production failure mode.
"Your ITAR environment can use any model."	"ITAR workloads are currently limited to efficient-tier models in Azure Government. Frontier hasn't cleared those environments yet. That changes your routing math."	Demonstrates you know the compliance constraint before the cost constraint.
"You should wait for better models."	"The cost ratios between tiers are 5–25x right now, and routing tooling is production-ready. The economics of tiered architecture work today."	Counters analysis paralysis with current production evidence.

The conversation you're preparing for is about architecture: which mechanism matches queries to the right tier automatically, and what governance that routing layer needs. Walk in with the routing frame and you've already moved past the "which model is best" dead end that stalls most of these meetings.

Things to follow up on...

RouterEval benchmark maturity: The RouterEval paper (EMNLP 2025) found that most existing routing methods "still have significant room for improvement," even as a capable router can enhance performance beyond the best single model in the pool.
MMLU benchmark saturation: Multiple leaderboards now exclude MMLU from scoring because frontier models have effectively maxed it out; the discriminators that matter in 2026 are GPQA Diamond, SWE-bench Pro, and HLE.
FedRAMP 20x acceleration: OpenAI achieved FedRAMP 20x Moderate authorization enabling GPT-5.5 in the FedRAMP environment, part of a broader GSA push to add Claude, Gemini, and ChatGPT to the Multiple Award Schedule.
Distillation as tier bridge: Training a smaller "student" model to mimic a larger "teacher" can achieve 80–95% of the teacher's performance at dramatically lower inference cost, which is how efficient-tier models keep closing the gap on frontier.