The most expensive enterprise AI mistake is using a flagship model for everything. Here's the architecture that fixes it — and the vocabulary to explain it.
When a Chief AI Officer (CAIO) tells you their pilot inference costs are "way higher than projected," they're usually not describing a vendor pricing problem. They're describing an architecture problem: someone defaulted to the most capable model available, routed every task through it regardless of complexity, and is now paying frontier prices for work that a model at a tenth of the cost could handle without measurable quality loss. This happens constantly in early enterprise AI deployments, and it's almost always fixable.
The vocabulary that opens that conversation is model capability tiers. There are three: frontier, efficient mid-tier, and small. Every major AI vendor has offerings in each. The buyer's job, and increasingly the mark of technically credible AI procurement, is knowing which tier matches which workload and building the routing architecture that makes that matching automatic. The 80–90% inference cost reduction you'll see cited in AI architecture discussions isn't marketing math. It's what happens when you stop treating your most expensive model as the default.
The Tiers
Frontier
What it is: The highest-capability class of large language models currently available, optimized for complex reasoning, nuanced synthesis, and tasks where output quality is the binding constraint.
What it does: Multi-step reasoning across long documents, production-quality code generation, complex instruction following with ambiguous inputs, tasks requiring the model to hold and reconcile contradictory information, and any output that will be read by a human without intermediate review. Frontier models are also where you go when the task requires genuine reasoning: not retrieval, not pattern-matching, but working through a problem the model hasn't seen before.
Who's behind it: OpenAI's GPT-4o and the o-series reasoning models occupy this tier. Anthropic's Claude 3 Opus and its Opus-class successors sit here, as does Google's Gemini Ultra line. These are the models that get the benchmark headlines and the capability announcements. They are also the models with the highest per-token inference costs in the industry.
What makes it distinct: Reasoning depth. It's not that frontier models know more — it's that they can hold more context, follow longer chains of logic without losing the thread, and produce output that requires substantially less human correction on complex tasks. When a federal agency needs a model to synthesize a 200-page acquisition regulation against a vendor proposal and surface genuine conflicts, that's frontier territory. Smaller models can read the documents. Identifying a conflict that spans three non-adjacent sections requires reasoning that's qualitatively different from extraction or summarization.
Efficient Mid-Tier
What it is: Models engineered specifically for the performance-cost tradeoff — capable enough for the majority of enterprise tasks, priced to run at scale without requiring a separate budget conversation every quarter.
What it does: Document summarization, drafting and editing, Q&A over structured knowledge bases, moderate-complexity code tasks, structured data extraction from unstructured text, and multi-turn conversational interfaces where the user's questions are reasonably predictable. This is the workhorse tier. Most enterprise AI use cases — the ones that actually get deployed beyond pilot — live here.
Who's behind it: Anthropic's Claude 3.5 Sonnet is the reference model for this tier and has become something of an industry benchmark for the performance-cost balance. OpenAI's GPT-4o mini occupies similar territory. Google's Gemini Flash class. These are separately optimized models, not stripped-down versions of their frontier siblings. The objective isn't a cheaper frontier experience. It's a model capable enough for 75% of enterprise tasks at 10% of the cost.
What makes it distinct: The throughput-to-quality ratio. These models are specifically engineered to handle high-volume workloads without the latency and cost overhead of frontier inference. An agency running 50,000 document summaries per day cannot do that on frontier economics. On efficient mid-tier, it's a budget line. On frontier, it's a budget crisis.
Small
What it is: Compact models purpose-built for high-volume, low-complexity tasks — classification, routing, extraction, and structured output generation where speed and cost matter more than reasoning depth.
What it does: Intent classification, sentiment analysis, entity extraction, document routing, binary decisions (does this ticket belong to HR or IT?), keyword tagging, and any task where the output is a label, a category, or a short structured response rather than generated prose. These are the models that handle the front of the pipeline — the intake, the triage, the sorting — so that more expensive models only see the work they're actually needed for.
Who's behind it: This tier is more fragmented than the others. Fine-tuned BERT (Bidirectional Encoder Representations from Transformers) variants remain common for classification tasks in enterprise deployments. Mistral's 7B-class models are widely used for on-premises or air-gapped deployments where data residency matters. Distilled versions of larger models — where a frontier model's behavior is compressed into a much smaller architecture — are increasingly common. OpenAI's text-embedding models sit at the edge of this tier. The defining characteristic isn't the vendor; it's the parameter count and the task specificity.
What makes it distinct: Latency and cost. A classification task that costs $0.30 per thousand API calls on a frontier model costs approximately $0.003 on a purpose-built small model. That's not a rounding error — it's a factor of 100. At the volumes that enterprise AI actually runs (millions of classification events per day in a large agency deployment), that difference is the difference between a sustainable operating cost and an infrastructure emergency.
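To see what the factor of 100 means at volume, here is a minimal Python sketch. The two per-thousand-call prices are the figures quoted above; the daily volume of 5,000,000 events is an assumed stand-in for "millions per day," not a measured number.

```python
# Per-thousand-call prices quoted in the text (USD).
FRONTIER_COST_PER_1K_CALLS = 0.30
SMALL_COST_PER_1K_CALLS = 0.003

# Assumed volume for illustration only.
CALLS_PER_DAY = 5_000_000


def daily_cost(calls_per_day: int, cost_per_1k_calls: float) -> float:
    """Daily spend in USD for a given call volume and per-1k price."""
    return calls_per_day / 1000 * cost_per_1k_calls


frontier_daily = daily_cost(CALLS_PER_DAY, FRONTIER_COST_PER_1K_CALLS)
small_daily = daily_cost(CALLS_PER_DAY, SMALL_COST_PER_1K_CALLS)

print(f"frontier: ${frontier_daily:,.2f}/day")
print(f"small:    ${small_daily:,.2f}/day")
print(f"ratio:    {frontier_daily / small_daily:.0f}x")
```

The ratio stays at 100 regardless of volume; what changes with volume is whether the absolute gap is a footnote or a budget line.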
Comparison Strategy: Scenario Mapping
Three scenarios, three tiers, one routing decision each. The goal is to make the cost case concrete rather than abstract — so the 80–90% reduction claim has a mechanism, not just a number.
Scenario 1: Incoming support ticket classification. A federal agency's IT help desk receives 8,000 tickets per day. Each ticket needs to be routed to the right team: network, identity, endpoint, or application. The routing decision requires reading the ticket subject and first paragraph and assigning a category. This is a classification task. The output is a label. There is no reasoning required beyond pattern recognition. This is small-tier work. Running it through frontier would cost roughly $120/day at current pricing, or about 1.5 cents per ticket. Running it through a fine-tuned classification model costs under $2/day. The quality difference on this specific task is negligible, because the classification model was trained for exactly this.
Scenario 2: Employee benefits Q&A. An agency deploys a conversational interface for HR questions — leave balances, benefits enrollment windows, policy clarifications. The questions are predictable in structure, the knowledge base is bounded, and the answers need to be accurate but not synthesized from scratch. This is efficient mid-tier work. The model needs to retrieve relevant policy text, understand the employee's specific question, and produce a clear response. It doesn't need to reason through ambiguity or handle novel edge cases — those get escalated to a human. At mid-tier pricing, 50,000 employee interactions per month is a manageable line item. At frontier pricing, it's a quarterly budget conversation.
Scenario 3: Acquisition regulation conflict analysis. A contracting officer needs to know whether a vendor's proposed technical approach conflicts with FAR (Federal Acquisition Regulation) Part 39 and the agency's own IT security addendum. The documents are long, the conflicts are subtle, and the stakes of a missed conflict are real. This is frontier territory. The other tiers can read the documents. Identifying a conflict between a vendor's proposed authentication architecture and a clause buried in an agency addendum requires multi-step reasoning that mid-tier models handle inconsistently. Frontier costs more. For this task, it's worth it.
The routing architecture is the system that makes these assignments automatically. An incoming task is evaluated — usually by a lightweight classifier, which is itself a small-tier model — and directed to the appropriate tier based on complexity signals: task type, document length, output requirements, confidence threshold. The router doesn't need to be perfect. It needs to be right often enough that the cost savings from correct routing outweigh the occasional misrouting cost.
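The routing decision described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference design: the `Task` fields, the task-type sets, the 50,000-token cutoff, and the 0.8 confidence floor are all hypothetical, and a production router would more likely be a trained small-tier classifier than hand-written rules.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    EFFICIENT = "efficient"
    FRONTIER = "frontier"


@dataclass
class Task:
    task_type: str        # e.g. "classification", "summarization"
    document_tokens: int  # rough input length
    needs_prose: bool     # True if the output is generated text, not a label


# Hypothetical signal definitions, chosen only for illustration.
LABEL_TASKS = {"classification", "extraction", "routing"}
REASONING_TASKS = {"conflict_analysis", "multi_document_synthesis"}
LONG_DOCUMENT_TOKENS = 50_000


def route(task: Task, router_confidence: float, min_confidence: float = 0.8) -> Tier:
    """Assign a tier from task type, document length, output type, and confidence."""
    # If the router is unsure, escalate instead of risking a weak answer.
    if router_confidence < min_confidence:
        return Tier.FRONTIER
    if task.task_type in REASONING_TASKS or task.document_tokens > LONG_DOCUMENT_TOKENS:
        return Tier.FRONTIER
    if task.task_type in LABEL_TASKS and not task.needs_prose:
        return Tier.SMALL
    return Tier.EFFICIENT


ticket = Task("classification", document_tokens=300, needs_prose=False)
print(route(ticket, router_confidence=0.95))
```

Escalating low-confidence tasks to frontier is one possible policy. It trades some cost for safety; escalating to the efficient tier instead would be the cheaper variant.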
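The math behind the 80–90% headline: in a typical enterprise AI deployment, roughly 60–70% of tasks are classification or extraction (small tier), 20–30% are summarization or moderate Q&A (efficient tier), and 5–10% require genuine frontier reasoning. Price the efficient tier at about 10% of frontier and the small tier at about 1%, the ratios used above, and assume tasks of similar size. Route correctly, and your weighted average cost per task drops by roughly 85–92% compared to running everything through frontier, which lands in the upper half of the headline range or just above it. The exact number depends on your task mix and your vendors' actual price ratios. The direction is always the same.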
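The weighted average can be checked directly. The sketch below assumes the relative prices used earlier in this piece (efficient tier at about 10% of frontier, small tier at about 1%) and treats every task as the same size, which is a simplification; the two task mixes are the ends of the ranges given above.

```python
# Relative cost per task, with frontier = 1.0. Ratios are taken from the
# figures quoted earlier in the text; equal task size is an assumption.
RELATIVE_COST = {"small": 0.01, "efficient": 0.10, "frontier": 1.00}


def cost_reduction(task_mix: dict[str, float]) -> float:
    """Fractional saving versus sending every task to frontier."""
    if abs(sum(task_mix.values()) - 1.0) > 1e-9:
        raise ValueError("task mix shares must sum to 1.0")
    blended = sum(share * RELATIVE_COST[tier] for tier, share in task_mix.items())
    return 1.0 - blended


frontier_light = {"small": 0.70, "efficient": 0.25, "frontier": 0.05}
frontier_heavy = {"small": 0.60, "efficient": 0.30, "frontier": 0.10}

for name, mix in [("frontier-light", frontier_light), ("frontier-heavy", frontier_heavy)]:
    print(f"{name}: {cost_reduction(mix):.1%} cheaper than all-frontier")
```

Under these assumptions the two mixes come out at about 92% and 86%. Almost all of the remaining spend is the frontier share, so that share is the number to measure first in a real deployment.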
The most common objection is that routing adds complexity. It does. A routing layer is an architectural component with its own error rate, its own latency overhead, and its own governance requirements. Routing is not simpler than defaulting to frontier. The cost savings have to justify the architectural investment. At enterprise scale, they almost always do. At federal agency scale, where inference costs are a line item that procurement officers scrutinize, they're often the difference between a sustainable program and a pilot that doesn't survive its first budget cycle.
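One way to frame "do the savings justify the router" is a simple expected-cost model. The sketch below is a deliberately crude assumption: every task pays for a router call, and any task routed too low is caught and re-run on frontier. The input values are illustrative, not measurements, and the model ignores latency, engineering effort, and misroutes that go undetected.

```python
def routed_cost(blended_model_cost: float, router_cost: float, rerun_rate: float) -> float:
    """Expected cost per task with routing, relative to frontier = 1.0.

    blended_model_cost: average model cost per task when routing is correct
    router_cost:        cost of the router's own call on every task
    rerun_rate:         share of tasks routed too low and re-run on frontier
    """
    return router_cost + blended_model_cost + rerun_rate * 1.0


def break_even_rerun_rate(blended_model_cost: float, router_cost: float) -> float:
    """Re-run rate at which routing costs exactly as much as all-frontier."""
    return 1.0 - router_cost - blended_model_cost


# Assumed, illustrative inputs.
cost = routed_cost(blended_model_cost=0.11, router_cost=0.01, rerun_rate=0.05)
print(f"routed cost: {cost:.2f} of all-frontier")
print(f"break-even re-run rate: {break_even_rerun_rate(0.11, 0.01):.0%}")
```

With these inputs the router would have to send most tasks to the wrong tier before routing stopped paying for itself on inference cost alone, which is why the practical limit on router error is usually output quality rather than price.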
Field Language Guide
| Don't say | Do say | Why it matters |
|---|---|---|
| "the AI" | "the model tier" | Buyers who distinguish capability levels will hear the imprecision immediately |
| "GPT" as a generic term | "frontier-class model" | Avoids vendor lock-in assumptions and is architecturally accurate |
| "it's expensive" | "inference costs scale with model tier" | Frames cost as an architectural variable, not a fixed overhead |
| "we use the best model" | "we route tasks to the appropriate tier" | Signals architectural sophistication rather than vendor preference |
| "small models are less capable" | "small models are purpose-optimized for classification workloads" | Avoids framing that implies compromise; these models are doing what they were built for |
| "we'll need the full model for that" | "this use case warrants frontier-tier reasoning" | Positions frontier as a deliberate architectural choice, not the default |
| "the model decides" | "the router determines tier assignment" | Buyers need to understand routing as an explicit architectural layer, not emergent behavior |
| "AI costs are unpredictable" | "inference costs are predictable once tasks are mapped to tiers" | Addresses the CFO concern directly and positions you as someone who understands the economics |
| "it's just summarization" | "this is efficient-tier work" | Connects the task type to the cost tier; shows you've done the mapping |
| "we can run everything through Sonnet" | "Sonnet handles the mid-tier workload; classification routes to a smaller model" | Demonstrates routing awareness, which is what technically credible buyers are listening for |
Callout: The IDAM Analog — And Where It Breaks
Step-up authentication maps to model routing — until it doesn't.
In identity and access management (IDAM), you don't require the same level of assurance for every transaction. Reading a public document requires no additional verification. Accessing PII triggers step-up to a phishing-resistant authenticator. The assurance level matches the risk level of the resource being accessed, a policy decision made in advance and enforced at runtime. Model routing looks structurally similar: the model tier matches the complexity level of the task, and a routing layer enforces the assignment.
The analogy breaks at a specific point. In step-up authentication, the assurance requirement is determined by the resource classification, a policy set by a human and applied consistently. In model routing, the tier assignment is determined by the task's complexity, which has to be assessed dynamically. The router is itself a model, making an inference about what level of inference is required. That's a different kind of trust decision than anything in the IDAM stack — you're using intelligence to route to intelligence, and the router has its own error rate.
In a buyer conversation, this matters when a CAIO asks "how do you decide which model handles which task?" The answer isn't "we configured it." It's "we have a routing layer that classifies incoming tasks — and that routing layer is itself a governed component with its own accuracy metrics and fallback behavior." That answer earns credibility. The simpler answer raises the next question the buyer is already thinking.
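What "its own accuracy metrics and fallback behavior" might look like in practice can be sketched as follows. Everything here is hypothetical: the audit-sample format (pairs of the tier the router chose and the tier a human reviewer judged necessary), the metric names, and the 5% under-routing budget are assumptions for illustration.

```python
from collections import Counter

TIER_RANK = {"small": 0, "efficient": 1, "frontier": 2}


def router_report(audit_sample: list[tuple[str, str]]) -> dict[str, float]:
    """Summarize router quality from (routed_tier, reviewer_tier) pairs.

    under_routed: sent to a weaker tier than the reviewer required (quality risk)
    over_routed:  sent to a stronger tier than required (cost leak)
    """
    if not audit_sample:
        raise ValueError("audit sample is empty")
    counts = Counter()
    for routed, required in audit_sample:
        diff = TIER_RANK[routed] - TIER_RANK[required]
        if diff == 0:
            counts["correct"] += 1
        elif diff < 0:
            counts["under_routed"] += 1
        else:
            counts["over_routed"] += 1
    total = len(audit_sample)
    return {key: counts[key] / total for key in ("correct", "under_routed", "over_routed")}


def needs_fallback(report: dict[str, float], max_under_routed: float = 0.05) -> bool:
    """True when under-routing exceeds the agreed budget and a safer default should apply."""
    return report["under_routed"] > max_under_routed


sample = (
    [("small", "small")] * 6
    + [("efficient", "efficient")] * 2
    + [("small", "efficient")]      # under-routed
    + [("frontier", "efficient")]   # over-routed
)
report = router_report(sample)
print(report, "fallback:", needs_fallback(report))
```

Separating under-routing from over-routing matters because they fail differently: one degrades answers, the other quietly inflates the bill.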
The tier vocabulary is already in CAIO conversations. When a buyer says "we're evaluating whether to use Sonnet or Opus for our document pipeline," they're asking a routing question without using routing language. The useful response isn't a model comparison — it's a task complexity question: what does the document pipeline actually need to do, and at what volume? The answer to that question determines the tier. The tier determines the cost. The cost determines whether the program survives its second year.
That's the conversation. The tiers are just the map.

