A note before we get into specifics: every price cited here comes from provider documentation or pricing aggregators as of early May 2026. Model pricing changes without notice. Treat dollar figures as directional. The cost mechanisms are durable even when the numbers move.
Base/Chat Models
What it is: A language model trained on next-token prediction, then fine-tuned to follow instructions and hold conversations.
What it does: Drafts, summarizes, answers questions, writes code, analyzes documents. You send text in, you get text out. It handles the broadest range of general tasks, and because you pay only for the tokens you can see, it is usually the cheapest option per request. For most workloads that don't require extended deliberation, it's the right tool.
Who's behind it: Everyone. OpenAI's GPT series is the most widely deployed. Google's Gemini, Anthropic's Claude, and Meta's Llama all live here. When someone says "we're using an LLM" without further qualification, they mean a base/chat model.
What makes it distinct: Broadest capability range at the most predictable cost per request. You send tokens in, you get tokens out, and the billed token counts track what you'd estimate from reading the prompt and the response. A 500-word answer to a 200-word question costs about what you'd expect. GPT-5.5 runs $5/$30 per million input/output tokens at the high end; smaller models come in far lower. The token economy is transparent and controllable, which matters when you're budgeting for thousands of daily requests.
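To make the "costs about what you'd expect" claim concrete, here is a minimal sketch of the arithmetic for the 200-word question and 500-word answer. The rates are the GPT-5.5 figures quoted above. The tokens-per-word ratio and the function names are illustrative assumptions, not provider constants; real tokenizers vary by model and by text.

```python
# Rates quoted in the text for GPT-5.5, in dollars per million tokens.
INPUT_RATE_PER_M = 5.00
OUTPUT_RATE_PER_M = 30.00

# Assumption (not from the source): about 4/3 tokens per English word.
TOKENS_PER_WORD = 4 / 3


def words_to_tokens(words: int) -> int:
    """Rough token estimate from a word count (assumed ratio)."""
    return round(words * TOKENS_PER_WORD)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the rates above."""
    return (input_tokens * INPUT_RATE_PER_M
            + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000


prompt_tokens = words_to_tokens(200)
answer_tokens = words_to_tokens(500)
per_request = request_cost(prompt_tokens, answer_tokens)

print(f"tokens in/out: {prompt_tokens}/{answer_tokens}")
print(f"cost per request: ${per_request:.4f}")
```

Multiplying `per_request` by an expected daily request count gives a first-pass budget line, which is the kind of estimate this model class makes easy.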
Reasoning Models
What it is: A model trained (via reinforcement learning) to deliberate internally before answering, generating a chain of reasoning tokens you typically don't see but always pay for.
What it does: Works through problems that benefit from step-by-step analysis: multi-constraint compliance checks, mathematical proofs, code debugging, policy interpretation where the answer depends on reconciling several rules at once. The model thinks before it speaks. That thinking shows up on your bill as output tokens.
Who's behind it: OpenAI's o-series (o3, o4-mini) and DeepSeek's R1 are the clearest exemplars. o3 runs $2/$8 per million input/output tokens. o4-mini comes in at $1.10/$4.40. DeepSeek R1-0528, an open-weights model with 671 billion parameters (37 billion active per query), sits around $0.50/$2.19 through third-party providers. (That R1 pricing comes from Artificial Analysis, an independent benchmarking service that tracks model pricing across providers. DeepSeek's own pricing page should be checked for the latest official rates, as a September 2025 architecture update may have shifted them.) Those per-token rates look cheap until you see how many tokens reasoning actually burns.
What makes it distinct: The cost mechanism. A reasoning model answering a complex question might generate 10,000 internal reasoning tokens before writing a 2,000-token visible response. You pay for all 12,000. R1-0528 nearly doubled average token usage compared to its predecessor on hard math problems, from roughly 12,000 to 23,000 tokens per question on the AIME benchmark, according to a technical guide from BentoML, a model deployment platform that published a detailed analysis of DeepSeek's model family. Treat that as a secondary source and confirm the numbers against the official R1-0528 model card on Hugging Face. The per-token rate is lower than GPT-5.5's, but the total token count per request can be dramatically higher, and it's unpredictable. A slight rephrasing of the same question can swing reasoning token usage from 500 to 5,000.
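The billing mechanics above can be sketched in a few lines. The output rate and input rate are the o3 figures quoted earlier; the 1,000-token prompt, the 2,000-token visible answer in the second example, and the function name are assumptions for illustration only.

```python
def reasoning_request_cost(input_tokens: int,
                           reasoning_tokens: int,
                           visible_tokens: int,
                           input_rate_per_m: float,
                           output_rate_per_m: float) -> float:
    """Hidden reasoning tokens are billed as output tokens,
    alongside the visible response."""
    billed_output = reasoning_tokens + visible_tokens
    return (input_tokens * input_rate_per_m
            + billed_output * output_rate_per_m) / 1_000_000


O3_INPUT, O3_OUTPUT = 2.00, 8.00  # dollars per million tokens, from the text

# The 10,000 + 2,000 example, with an assumed 1,000-token prompt.
typical = reasoning_request_cost(1_000, 10_000, 2_000, O3_INPUT, O3_OUTPUT)

# Same question rephrased: reasoning swings from 500 to 5,000 tokens.
low = reasoning_request_cost(1_000, 500, 2_000, O3_INPUT, O3_OUTPUT)
high = reasoning_request_cost(1_000, 5_000, 2_000, O3_INPUT, O3_OUTPUT)

print(f"typical: ${typical:.3f}")
print(f"rephrasing swing: ${low:.3f} to ${high:.3f}")
```

The visible answer is identical in the last two calls; only the hidden deliberation changed, and the cost more than doubled.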
The mechanical reality worth sitting with: a peer-reviewed study (Venhoff et al., accepted at the NeurIPS 2025 Mechanistic Interpretability Workshop) found that the reinforcement learning used to train reasoning models primarily teaches when to activate reasoning skills the base model already has. The researchers built hybrid models that recovered 91% of the performance gap between base and reasoning models on math benchmarks by steering only 12% of tokens, with no weight changes to the base model. The base model already knew how to do the math. The reasoning training taught it when to show its work.
That gives you a precise, defensible framing for buyer conversations: reasoning models are base models that have learned to deliberate at the right moments, and you pay for that deliberation in tokens.
Two caveats on that finding. The benchmarks were mathematical (GSM8K, MATH500), so the result may not transfer identically to legal analysis or policy interpretation. And the models tested were smaller (up to 32B parameters), not the full-scale o3 or R1. The directional insight holds. The universal claim needs more evidence.
OpenAI's reasoning models support a reasoning effort parameter: none, low, medium, high, xhigh. Think of it like an adaptive MFA step-up policy: low-risk transactions get a lightweight check, high-risk ones trigger the full ceremony. The analogy holds for the cost structure (more deliberation = more tokens = higher cost) but breaks on predictability. An MFA policy fires deterministically based on rules you wrote. A reasoning model's token consumption for a given effort level varies per query. You set the ceiling, not the floor.
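One way to picture the step-up policy is a lookup from task risk to effort level. This is a hypothetical sketch: the effort names come from the text, but the risk tiers, the mapping, and the function name are invented for illustration, and no provider API is called.

```python
# Effort levels as listed in the text.
EFFORT_LEVELS = ("none", "low", "medium", "high", "xhigh")

# Hypothetical policy: the tier names and assignments are assumptions.
EFFORT_BY_RISK = {
    "routine": "none",
    "standard": "low",
    "sensitive": "medium",
    "high_stakes": "high",
}


def pick_effort(risk_tier: str, default: str = "medium") -> str:
    """Return the effort level for a risk tier, falling back to a default."""
    effort = EFFORT_BY_RISK.get(risk_tier, default)
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort}")
    return effort


print(pick_effort("routine"))
print(pick_effort("high_stakes"))
```

The lookup itself is deterministic, like an MFA rule. What it cannot fix is the variance underneath: two queries sent at the same effort level can still consume very different token counts.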
Multimodal Models
What it is: A model that accepts or produces more than text: images, audio, video, or some combination, as inputs or outputs alongside text.
What it does: Processes visual and audio information directly. A buyer scanning intake documents that include photographs, charts, or handwritten notes needs a model that can read those inputs natively rather than requiring a separate OCR or transcription pipeline. Multimodal models handle this in a single pass.
Who's behind it: GPT-5.5 accepts text and image inputs. Gemini 2.5 Pro goes further: text, images, audio, and video, with a million-token context window. Anthropic's Claude supports vision (image input) as well. All three are available through enterprise platforms with the compliance certifications public sector buyers expect: Azure OpenAI and Vertex AI carry FedRAMP authorization; Anthropic offers AWS GovCloud deployment.
What makes it distinct: Modality-specific token consumption. Everything converts to tokens, and each modality converts at its own rate. On Gemini 2.5 Pro, a 1024×1024 image consumes roughly 1,290 tokens. Video runs 258 tokens per second at one frame per second, and if the video has audio, you pay for both the video tokens and the audio tokens separately. A request that includes a 30-second video clip has consumed ~7,740 video tokens (30 × 258), before any audio tokens, by the time the model reads a single word of your text prompt. Gemini 2.5 Pro's base rate is $1.25/$10 per million input/output tokens for prompts under 200K tokens, but that rate roughly doubles once your prompt exceeds 200K. A pipeline feeding in large documents alongside images can silently cross that threshold on every request.
Think of a model's modalities like OAuth scopes: each one (image, audio, video) is a capability the model can exercise, and each one has a cost. But unlike OAuth scopes, which are binary (granted or not), modalities scale continuously. A higher-resolution image costs more tokens. A longer video costs more tokens. The "scope" isn't just present or absent; it has a volume dimension that directly affects the bill.
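The volume dimension is easy to estimate before a request is sent. The sketch below uses the Gemini 2.5 Pro conversion figures quoted above and looks at input cost only. It treats "roughly doubles" as exactly 2x, which is an assumption, and it leaves audio out because the text gives no audio rate. Function names are illustrative.

```python
# Conversion figures quoted in the text for Gemini 2.5 Pro.
IMAGE_TOKENS_1024 = 1_290        # per 1024x1024 image
VIDEO_TOKENS_PER_SECOND = 258    # at one frame per second, audio excluded

BASE_INPUT_RATE_PER_M = 1.25     # dollars per million input tokens
LONG_PROMPT_THRESHOLD = 200_000  # tokens
LONG_PROMPT_MULTIPLIER = 2.0     # assumption: "roughly doubles" taken as 2x


def estimate_input_tokens(text_tokens: int, images: int = 0,
                          video_seconds: int = 0) -> int:
    """Total input tokens across text, images, and video frames."""
    return (text_tokens
            + images * IMAGE_TOKENS_1024
            + video_seconds * VIDEO_TOKENS_PER_SECOND)


def input_cost(total_tokens: int) -> float:
    """Input cost in dollars, with the higher rate above the threshold."""
    rate = BASE_INPUT_RATE_PER_M
    if total_tokens > LONG_PROMPT_THRESHOLD:
        rate *= LONG_PROMPT_MULTIPLIER
    return total_tokens * rate / 1_000_000


clip_tokens = estimate_input_tokens(0, video_seconds=30)

# A large document that sits just under the threshold on its own...
doc_only = estimate_input_tokens(195_000)
# ...crosses it once four images are attached.
doc_plus_images = estimate_input_tokens(195_000, images=4)

print(clip_tokens)
print(f"${input_cost(doc_only):.4f} vs ${input_cost(doc_plus_images):.4f}")
```

In the second comparison, four images add under 3% to the token count but roughly double the input bill, because they push the whole prompt into the higher tier.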
Quick Pricing Reference
All figures per million tokens, as of early May 2026. These shift without notice; the relative positions are more stable than the absolute numbers.
| Model | Class | Input | Output | Note |
|---|---|---|---|---|
| GPT-5.5 | Base/chat + multimodal | $5.00 | $30.00 | Highest per-token rate; includes image input and reasoning dial |
| o3 | Reasoning | $2.00 | $8.00 | Lower rate, but total tokens per request can be 5–10× a base model |
| o4-mini | Reasoning | $1.10 | $4.40 | Budget reasoning option in the o-series |
| DeepSeek R1-0528 | Reasoning (open-weights) | ~$0.50 | ~$2.19 | Third-party provider pricing; hosting/compliance burden shifts to deployer |
| Gemini 2.5 Pro | Multimodal | $1.25 | $10.00 | Rate roughly doubles above 200K input tokens |
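A short sketch shows why the table's per-token column cannot be read in isolation. The prices are copied from the table. The request size and the choice to apply the 5× and 10× multipliers to output tokens only are simplifying assumptions, and GPT-5.5 is assumed to run with no reasoning tokens.

```python
# (input, output) dollars per million tokens, copied from the table above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "o3": (2.00, 8.00),
    "o4-mini": (1.10, 4.40),
    "DeepSeek R1-0528": (0.50, 2.19),
    "Gemini 2.5 Pro": (1.25, 10.00),
}


def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request for the given model."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate
            + output_tokens * output_rate) / 1_000_000


# Assumed request: 1,000 input tokens, 500 visible output tokens.
base = cost("GPT-5.5", 1_000, 500)

# Reasoning models billed for 5x and 10x the output tokens (assumption).
o3_5x = cost("o3", 1_000, 500 * 5)
o3_10x = cost("o3", 1_000, 500 * 10)
o4_mini_5x = cost("o4-mini", 1_000, 500 * 5)

for label, value in [("GPT-5.5", base), ("o3 at 5x", o3_5x),
                     ("o3 at 10x", o3_10x), ("o4-mini at 5x", o4_mini_5x)]:
    print(f"{label}: ${value:.4f}")
```

Under these assumptions o3 costs slightly more than GPT-5.5 at 5× and about double at 10×, despite a per-token output rate that is less than a third of GPT-5.5's.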
Which Class Fits Which Task
I'm using scenario mapping here because these three classes don't compete on the same axis. A flat comparison grid would imply they're interchangeable options for the same job, and they aren't. Each scenario below represents a task type a public sector buyer might describe. The mapping shows which class fits, why, and where the other classes fall short.
High-volume document summarization. An agency processes 10,000 case files per month and needs summaries. Repetitive, well-defined, no multi-step reasoning required. Base/chat model. Cost scales linearly with volume, and this is where the lowest per-token rate matters most because you're multiplying it by millions of tokens monthly. A reasoning model would burn deliberation tokens on each of those 10,000 requests with no quality gain on a straightforward summarization task. Multimodal adds nothing if the documents are already text.
Multi-step policy compliance analysis. A buyer needs to check whether a proposed action complies with three overlapping regulatory frameworks. The model has to hold multiple constraints, reason through conflicts, and produce a structured determination. Reasoning model. The deliberation tokens are the point. A base model might produce a confident-sounding answer that misses a constraint interaction because it doesn't work through the relationships step by step. The reasoning model's extended chain-of-thought catches it. Cost per request is higher and less predictable, but task volume is typically lower. Use the reasoning effort parameter to control the ceiling.
Intake processing with mixed media. An agency receives applications that include photographs, scanned forms, and occasionally short video clips. The pipeline needs to extract structured data from all of it. Multimodal model. Cost scales with the volume and resolution of non-text inputs. A batch of 500 applications with two images each is more than "500 requests" in token terms; the image tokens may dominate the bill. A text-only model would require a separate OCR/transcription pipeline upstream, adding latency and integration complexity for a task the multimodal model handles natively.
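A rough breakdown for the 500-application batch shows how image tokens can dominate. The per-image figure is the Gemini 2.5 Pro number quoted earlier, applied here as an assumption since the scenario names no model. The 400 text tokens per application is also an assumed value.

```python
IMAGE_TOKENS_EACH = 1_290          # per 1024x1024 image, from the text
TEXT_TOKENS_PER_APPLICATION = 400  # assumption for illustration


def batch_token_breakdown(applications: int, images_per_application: int,
                          text_tokens_each: int = TEXT_TOKENS_PER_APPLICATION,
                          image_tokens_each: int = IMAGE_TOKENS_EACH) -> dict:
    """Split a batch's input tokens into image and text portions."""
    image_tokens = applications * images_per_application * image_tokens_each
    text_tokens = applications * text_tokens_each
    total = image_tokens + text_tokens
    return {
        "image_tokens": image_tokens,
        "text_tokens": text_tokens,
        "image_share": image_tokens / total,
    }


batch = batch_token_breakdown(applications=500, images_per_application=2)
print(batch["image_tokens"], batch["text_tokens"])
print(f"image share of input tokens: {batch['image_share']:.0%}")
```

Higher-resolution scans or a third image per application would push the image share higher still, so the request count alone is a poor budgeting unit for this workload.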
Internal knowledge base Q&A. Staff ask questions about agency policies and procedures. Answers come from a curated document set. Base/chat model with retrieval augmentation. The questions are typically single-hop, and the cost of reasoning tokens on thousands of daily queries adds up fast with no quality benefit. Reasoning models are overkill here.
Strategic threat assessment. An analyst synthesizes intelligence from multiple sources, identifies contradictions, and produces a structured assessment with confidence levels. Reasoning model for the analysis itself. Multi-source reconciliation and explicit uncertainty quantification are exactly what extended deliberation buys you. But if the source material includes satellite imagery or intercepted audio, the input pipeline is multimodal even if the reasoning task is text-based. A base model lacks the deliberation depth for contradiction analysis across sources. These classes combine in real architectures. Expect that.
This maps loosely to how you'd architect IDAM tiers. Base/chat models are your standard SSO tier: handles most traffic, low friction, predictable cost. Reasoning models are your PAM tier: higher cost per session, justified by what's being accessed. The analogy breaks when you remember that model selection happens per-task, not per-application, and a single workflow might chain multiple model classes in sequence.
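The per-task selection described above can be sketched as a small routing table. The scenario keys and the function name are hypothetical labels for the five scenarios in this section, not part of any product.

```python
from typing import List

BASE_CHAT = "base/chat"
REASONING = "reasoning"
MULTIMODAL = "multimodal"

# Primary model class per scenario, following the mapping in this section.
SCENARIO_CLASS = {
    "document_summarization": BASE_CHAT,
    "policy_compliance_analysis": REASONING,
    "mixed_media_intake": MULTIMODAL,
    "knowledge_base_qa": BASE_CHAT,   # paired with retrieval augmentation
    "threat_assessment": REASONING,
}


def plan_pipeline(task: str, has_non_text_inputs: bool = False) -> List[str]:
    """Return the model classes a task would chain, in order."""
    primary = SCENARIO_CLASS[task]
    # Non-text sources need a multimodal intake stage ahead of the main task.
    if has_non_text_inputs and primary != MULTIMODAL:
        return [MULTIMODAL, primary]
    return [primary]


print(plan_pipeline("document_summarization"))
print(plan_pipeline("threat_assessment", has_non_text_inputs=True))
```

Keying the table on the task rather than the application is the point where the SSO/PAM analogy stops applying: one workflow can return more than one class.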
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "Reasoning models are smarter than regular models." | "Reasoning models spend more tokens deliberating before answering. That deliberation is the cost, and it's what buys you better performance on multi-step problems." | Frames capability as a cost tradeoff, which is what procurement cares about. |
| "You need the most advanced model for this use case." | "For high-volume summarization, a base model at a lower per-token rate will outperform a reasoning model on cost without sacrificing quality. Reasoning models earn their cost on multi-constraint problems." | Matches model class to task type, which is the actual procurement question. |
| "Reasoning models have a new kind of intelligence." | "Research shows reasoning models primarily learn when to activate skills the base model already has. You're paying for better judgment about when to think harder." | Gives the buyer a defensible, non-hype framing backed by peer-reviewed work. |
| "Multimodal is just adding image support." | "Each modality — image, audio, video — consumes tokens at its own rate. A 30-second video clip converts to roughly 7,700 tokens before the model reads any text. Multimodal pricing is modality-specific." | Prevents cost surprises when the buyer's pipeline includes non-text inputs. |
| "DeepSeek is the cheap option." | "DeepSeek R1 is open-weights, around $0.50 per million input tokens through third-party providers. For agencies concerned about vendor lock-in or data training restrictions, open-weights models shift compliance responsibility to the deployer but eliminate the vendor dependency." | Connects pricing to the procurement concerns buyers actually raise. |
| "The reasoning effort setting controls how smart the model is." | "Reasoning effort controls how many tokens the model spends deliberating. Higher effort means more tokens, which means higher cost per request. You can set it per query based on task complexity." | Makes the cost lever explicit and actionable. |
| "You should use GPT-5.5 for everything; it handles reasoning and images." | "GPT-5.5 has a configurable reasoning dial and multimodal input, but at $30 per million output tokens, every reasoning token costs roughly four to seven times what you'd pay on o3 ($8) or o4-mini ($4.40). Use GPT-5.5 when you need its full context window or image input. Use o-series when reasoning is the primary requirement." | Prevents defaulting to the most expensive option when a cheaper model class fits. |
| "AI models all basically work the same way." | "These three model classes have different cost structures because they do mechanically different things with tokens. Knowing which class fits a given task is the difference between a cost-effective deployment and one that scales badly." | Positions you as someone who understands the procurement tradeoff. |
| "We can figure out the model later — let's focus on the use case." | "The model class shapes the cost model. A reasoning-heavy pipeline on the wrong model class can cost five to ten times what it should. Let's map your use cases to model classes early so the budget conversation is grounded." | Shows you understand that model selection is a procurement decision. |
| "Open-source models aren't enterprise-ready." | "Open-weights models like DeepSeek R1 are production-capable, but compliance validation, security hardening, and hosting fall to your team or your integrator. The tradeoff is control versus operational burden." | Respects vendor lock-in concerns without overselling open-weights. |
A Note on Numbers
Every price in this piece is drawn from provider API documentation and pricing aggregators as of early May 2026. Model pricing changes without notice. Context windows shift. New models launch and old ones deprecate. The specific dollar figures are directional, not contractual.
The structural insight survives even when the numbers move: reasoning models cost more per request because they consume more tokens deliberating, and multimodal models cost more because non-text inputs convert to tokens at modality-specific rates. Those mechanisms don't change when the price sheet does.
One more thing worth naming. The line between these three classes is already blurring. GPT-5.5 accepts images and has a reasoning effort dial, which makes it simultaneously a base model, a reasoning model, and a multimodal model depending on how you configure it. Gemini 2.5 Pro does the same across even more modalities. Providers have a structural incentive to push you toward these consolidated models: one model that does everything means one price tier, and it's usually the highest one. The model class framework gives you the language to push back. When a buyer's architecture calls for high-volume summarization plus occasional compliance analysis plus image intake, the answer is probably three model classes, each matched to the task where it earns its cost. The classes remain useful for understanding cost and capability tradeoffs even as the product boundaries soften. You're just pulling more than one lever at a time.
Things to follow up on...
- OMB procurement playbooks pending: M-25-22 committed to publishing AI-specific procurement playbooks covering generative AI and biometrics, but as of May 2026 those playbooks remain unpublished — when they drop, expect explicit model-class distinctions that will shape how agencies write RFPs.
- DeepSeek pricing ambiguity: A September 2025 architecture update unified DeepSeek's chat and reasoner endpoints under V3.2, and the official pricing page may now reflect different rates than the R1-0528 figures cited here — verify before any buyer-facing conversation.
- The "when not how" caveat: The arXiv 2510.07364 finding on reasoning models was tested on math benchmarks with models up to 32B parameters, and a related study reports conflicting observations about whether RLVR reduces overall reasoning diversity even as it improves sampling efficiency.
- Gemini thinking token controls: Google's Gemini models support configurable thinking budgets that cap deliberation tokens, but the exact API parameter behavior differs across model versions — confirm whether your buyer's target model (2.5 Pro vs. Flash vs. 3.x) supports the same controls.

