Three Model Classes, One Buying Decision

By Leigh Garrity, May 8, 2026

The Three Profiles

Base and Chat Models

What it is: A language model trained to generate text responses directly from a prompt, without a deliberation step.

What it does: Handles the broad middle of enterprise AI work — document summarization, drafting, classification, question-answering over documents, code generation, conversational interfaces. These are the workhorses. When an agency wants to stand up a chatbot for benefits inquiries or a tool that summarizes policy documents, they're reaching for this class.

Who's behind it: OpenAI's GPT-4o family, Anthropic's Claude Sonnet class, Google's Gemini Flash, Meta's Llama (open-weight), Mistral. This is the most populated category in the market.

What makes it distinct: No deliberation. The model reads the prompt and begins generating its response immediately, token by token, with no separate thinking phase. That makes these models fast and cost-efficient for tasks that don't require working through intermediate steps. The model answers from what it knows, directly.

Reasoning Models

What it is: A language model trained to generate intermediate reasoning steps — thinking tokens — before producing a final response.

What it does: Handles tasks that require multi-step logic: complex code debugging, mathematical problem-solving, legal or regulatory analysis with nested conditions, anything where the answer depends on correctly sequencing several sub-problems. The model works through the problem before it answers, which produces more reliable outputs on structured, high-stakes tasks.

Who's behind it: OpenAI's o-series (o3, o4-mini) and DeepSeek's R1 are the two most-referenced examples. OpenAI developed the o-series as a distinct product line from GPT-4o; DeepSeek, a Chinese AI research lab, released R1 as an open-weight model trained using reinforcement learning to develop similar deliberation behavior. Different labs and different release models, similar output behavior: both generate an extended chain of thought before the final answer.

What makes it distinct: The deliberation step is literal. Before the model returns a response, it generates tokens working through the problem, sometimes thousands of them. Those tokens cost money and take time. A reasoning model on a simple summarization task isn't more accurate; it's just slower and more expensive. The class earns its cost on problems where the intermediate steps matter.

Multimodal Models

What it is: A language model extended to accept and/or generate content across multiple modalities — text, image, audio, and video.

What it does: Processes inputs that aren't text: reads charts and diagrams in documents, transcribes and analyzes audio, interprets video frames, extracts data from scanned forms. In government contexts, this matters for document-heavy workflows where the documents include images, for accessibility tooling, and for any process that ingests audio or video as source material.

Who's behind it: GPT-4o (OpenAI), Gemini 1.5/2.0 (Google, with the strongest native video story), and Claude 3.5/3.7 (Anthropic, with strong document vision). Multimodality is increasingly a feature of base and chat models rather than a separate product line — most frontier models now accept image inputs at minimum.

What makes it distinct: Modality is an input/output expansion, not a reasoning upgrade. A multimodal model can see a chart; it doesn't necessarily reason about the chart better than a text-only model reasons about the same data in table form. The capability is real and useful for specific workflows. General reasoning ability is a separate dimension.

IDAM Callout — "Token" means something different here

In your professional vocabulary, "token" means a bearer credential: an OAuth access token, a hardware token, a session token. In an LLM context, a token is a unit of text, roughly three-quarters of a word in English. When a vendor says "reasoning models cost more per token," they mean per unit of text processed, not per authentication event. The analogy holds in one direction (both are units of something being exchanged) and breaks immediately after (a security token carries identity claims; an LLM token carries no semantic weight on its own, and meaning emerges from sequences). In a buyer conversation, this disambiguation is worth thirty seconds. It prevents a lot of confusion that looks like technical disagreement but is actually vocabulary mismatch.
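To make the three-quarters rule of thumb concrete, here is a minimal sketch that estimates a token count from a word count. The constant and the function name `estimate_tokens` are illustrative assumptions, not part of any vendor's tooling. Real counts depend on each vendor's tokenizer, so treat the result as a budgeting estimate only.

```python
# Rule of thumb from the callout: one token is roughly 0.75 English words.
# This is an approximation; real tokenizers differ by vendor and by language.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the LLM token count of `text` from its whitespace word count."""
    word_count = len(text.split())
    return round(word_count / WORDS_PER_TOKEN)


if __name__ == "__main__":
    sample = "Summarize the attached policy memo in three sentences."
    print(estimate_tokens(sample))
```

For anything that feeds a budget or a contract, a vendor's own tokenizer or usage report is the better source; the estimate is only meant to give a buyer a feel for scale.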

Comparison: Trait-Led Analysis

I'm using trait-led analysis here, anchoring on cost, latency, and task fit as dimensions and positioning all three subjects against each. The alternative structures (clustering by deliberation behavior, scenario mapping) would work, but trait-led analysis makes the "model class is not a quality ranking" point most directly, which is the thing this audience needs to hold in a live conversation.

Cost

Base and chat models are the cheapest per token, often by a significant margin. They're optimized for throughput. For high-volume, low-complexity tasks — processing thousands of documents, running a chatbot at scale — the economics favor this class clearly.

Reasoning models cost more because they generate more tokens. A model that spends 2,000 thinking tokens before producing a 500-token response is billing you for 2,500 tokens of work. On a complex legal analysis task, that's a reasonable trade. On a task that didn't require deliberation, it's waste. The cost premium is real and task-dependent, not a fixed quality surcharge. (Specific per-token pricing changes frequently; treat any vendor quote as a point-in-time figure.)
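The arithmetic above can be sketched in a few lines. This is a minimal illustration under stated assumptions: thinking tokens are billed at the same rate as output tokens (common, but confirm per vendor), and the prices are hypothetical placeholders, not quotes. The names `Usage` and `request_cost` are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts for one request."""
    input_tokens: int
    thinking_tokens: int  # deliberation tokens; zero for base and chat models
    output_tokens: int    # tokens in the visible final response

    @property
    def billed_output_tokens(self) -> int:
        # Assumption: thinking tokens are billed like output tokens.
        return self.thinking_tokens + self.output_tokens


def request_cost(usage: Usage, input_price: float, output_price: float) -> float:
    """Cost of one request; prices are per one million tokens (hypothetical)."""
    return (usage.input_tokens * input_price
            + usage.billed_output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # The example from the text: 2,000 thinking tokens plus a 500-token response.
    reasoning = Usage(input_tokens=1_000, thinking_tokens=2_000, output_tokens=500)
    direct = Usage(input_tokens=1_000, thinking_tokens=0, output_tokens=500)
    print(reasoning.billed_output_tokens)  # 2500
    # Placeholder prices, not vendor quotes.
    print(request_cost(reasoning, input_price=1.0, output_price=4.0))
    print(request_cost(direct, input_price=1.0, output_price=4.0))
```

The sketch holds the per-token price constant to isolate the effect of token volume. In practice the two classes are often priced differently per token as well, which widens or narrows the gap depending on the vendor.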

Multimodal models typically carry a cost premium for non-text inputs: image tokens, audio processing, and video frames are priced differently from text tokens, and the pricing varies by vendor. If a workflow is text-only, using a multimodal model doesn't by itself add cost for the unused modalities, but it's worth confirming against each vendor's pricing structure.

Latency

Base and chat models are fastest. No deliberation step means the model starts generating the response immediately. For real-time applications — conversational interfaces, live document Q&A — this class is the right default.

Reasoning models are slower, and the latency scales with problem complexity. A simple question routed to a reasoning model will still incur the deliberation overhead. A complex multi-step problem will incur more. This is not a bug; it's the mechanism. Buyers who want reasoning model accuracy on a real-time interface are asking for a tradeoff the architecture doesn't easily support.
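A rough way to show a buyer why deliberation and real-time interfaces pull against each other is to estimate how long a user waits before the answer starts. The sketch below assumes thinking tokens are generated at the same rate as answer tokens and that the whole deliberation finishes before any answer text appears. The function name, the generation rate, and the half-second startup delay are all hypothetical values chosen for illustration.

```python
def seconds_until_answer_starts(thinking_tokens: int,
                                tokens_per_second: float,
                                first_token_delay: float = 0.5) -> float:
    """Estimate the wait before the first visible answer token.

    Assumes deliberation must complete before the answer begins and that
    thinking tokens are produced at `tokens_per_second` (both assumptions).
    """
    return first_token_delay + thinking_tokens / tokens_per_second


if __name__ == "__main__":
    # Base or chat model: no thinking tokens, so only the startup delay applies.
    print(seconds_until_answer_starts(0, tokens_per_second=100.0))
    # Reasoning model spending 2,000 thinking tokens at the same rate.
    print(seconds_until_answer_starts(2_000, tokens_per_second=100.0))
```

Under these placeholder numbers the wait grows linearly with the number of thinking tokens, which is the point to surface: harder problems mean more deliberation and a longer pause before anything reaches the user.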

Multimodal models add processing time for non-text inputs. Image analysis, audio transcription, and video processing all add latency before the model begins generating a text response. For batch workflows, this is usually acceptable. For real-time applications with image or audio inputs, it's a design constraint worth surfacing early.

Task Fit

Base and chat models fit the majority of enterprise AI tasks: summarization, drafting, classification, search, conversational Q&A, code generation at speed. If the task is well-defined and doesn't require working through intermediate logical steps, start here.

Reasoning models fit tasks where the answer depends on correctly sequencing sub-problems: complex code debugging, multi-step regulatory analysis, mathematical reasoning, scenarios where an incorrect intermediate step produces a confidently wrong final answer. The deliberation step is most valuable when the task has internal structure that can be worked through, not just pattern-matched.

Multimodal models fit tasks where the source material isn't text: scanned documents with embedded images, audio recordings, video content, forms with visual structure. They also fit workflows where the output needs to be visual — generating diagrams, annotating images. If the workflow is text-in, text-out, multimodal capability is available but not exercised.
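The task-fit logic above can be written down as a small decision sketch. It is a simplification for illustration, not a procurement tool: the `Task` fields and the function `classes_to_evaluate` are hypothetical names, and real tasks rarely reduce to two yes/no traits.

```python
from dataclasses import dataclass
from enum import Enum


class ModelClass(Enum):
    BASE_CHAT = "base/chat"
    REASONING = "reasoning"
    MULTIMODAL = "multimodal"


@dataclass(frozen=True)
class Task:
    """Two traits of a workload, as described in the task-fit section."""
    non_text_io: bool = False       # images, audio, video, or visual output
    multi_step_logic: bool = False  # answer depends on sequenced sub-problems


def classes_to_evaluate(task: Task) -> list[ModelClass]:
    """Return the capabilities the task calls for; default to base/chat."""
    needed = []
    if task.non_text_io:
        needed.append(ModelClass.MULTIMODAL)
    if task.multi_step_logic:
        needed.append(ModelClass.REASONING)
    # Neither trait present: start with the cheapest, fastest class.
    return needed or [ModelClass.BASE_CHAT]


if __name__ == "__main__":
    print(classes_to_evaluate(Task()))
    print(classes_to_evaluate(Task(multi_step_logic=True)))
    print(classes_to_evaluate(Task(non_text_io=True, multi_step_logic=True)))
```

A task with both traits returns both entries, which mirrors the point that modality and deliberation are separate dimensions rather than rungs on one ladder.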

IDAM Callout — Model class vs. model quality :: authorization vs. authentication

Buyers who conflate model class with quality are making the same category error as conflating authentication strength with authorization scope. A strongly authenticated user can still be unauthorized for a specific resource. A highly capable reasoning model can still be wrong for a specific task, or right but at unnecessary cost and latency. The mapping isn't perfect, but the core holds: class determines what the model is designed to do; quality is a separate axis. In a buyer conversation, this reframe lands because your audience already holds the authentication/authorization distinction as a hard-won professional insight. They've already done the work of separating two things that look related but operate differently.

How to Say This in the Field

The scenario: a buyer is conflating model class with model quality — "GPT-4 is the best," "reasoning models are always smarter," "multimodal means more capable." These are the substitutions that work verbatim.

Don't say	Do say	Why it matters
"That's not quite right..."	"Model class and model quality are different dimensions. Let me show you what I mean."	Corrects without condescending; opens a framework conversation instead of a debate
"GPT-4 is the industry standard"	"GPT-4o is a strong general-purpose model. For multi-step reasoning tasks, OpenAI's o-series is a different class designed specifically for that, not just a newer version."	Prevents the buyer from anchoring on a brand name as a proxy for capability
"Reasoning models are always better"	"Reasoning models are better at specific task types: complex logic, multi-step analysis, structured problem-solving. For fast summarization or conversational tasks, they're slower and more expensive without a quality benefit."	Protects the buyer from over-spending on deliberation they don't need
"Multimodal means more capable"	"Multimodal means it can process additional input types: images, audio, video. It doesn't mean it reasons better or costs less. Capability and modality are separate dimensions."	Prevents a procurement decision based on a feature the workflow doesn't use
"We want the most advanced model"	"What's the task? Advanced for document summarization looks different from advanced for contract analysis. Let's match the class to the work."	Redirects from brand preference to task requirement
"DeepSeek is a cheap knockoff of GPT"	"DeepSeek-R1 is a reasoning model in the same class as OpenAI's o-series, and it uses a similar deliberation approach. It's open-weight, which is a different deployment model, not a quality statement."	Prevents a dismissal that could close off a legitimate option, especially for on-premise deployments
"The expensive model must be better"	"Cost reflects deliberation time and token usage, not general intelligence. A reasoning model costs more because it generates more tokens working through the problem. That's valuable for complex tasks and wasteful for simple ones."	Gives the buyer a mechanical explanation they can use internally
"We should just use the multimodal model for everything"	"You can, but on text-only workflows the extra modalities go unused. If the pipeline is text-in, text-out, compare it against a text-optimized model, which may be faster and cheaper for the same output quality."	Prevents over-procurement on capability that won't be exercised
"Reasoning models think like humans"	"Reasoning models generate additional tokens before responding. It's a computational process, not cognition. The output is more reliable on structured problems because the model works through intermediate steps."	Removes the anthropomorphism that makes buyers over-trust outputs on tasks outside the model's strength
"OpenAI is the only serious option for government"	"OpenAI's models are available through FedRAMP-authorized offerings. So are Anthropic's and Google's; confirm the current authorization level for the specific service. For on-premise or air-gapped requirements, open-weight models like Meta's Llama or DeepSeek-R1 are worth evaluating. The question is which class fits the task and which deployment model fits the compliance posture."	Keeps the conversation on requirements rather than brand loyalty

IDAM Callout — Multimodality and data classification in public sector

For your agency buyers, every new input modality is a data classification question before it's a capability question. Adding image processing to a workflow means asking: where does the image go, who can see it, and is it retained? Audio inputs mean asking whether transcripts are stored and under what controls. This is where your IDAM instincts are directly applicable — the question isn't just "can the model do this" but "can the model do this in a way that satisfies the agency's data handling requirements." Surfacing this early positions you as someone who understands their operational constraints, not just the vendor's feature list. It also surfaces the real procurement question: which model in which deployment configuration.

The taxonomy here isn't exhaustive. There are fine-grained distinctions within each class, and the specific models populating each category will shift as the market moves. The underlying logic holds regardless: deliberation costs tokens, modality expands inputs, and neither fact is a quality statement. A buyer who understands that can make a purchasing decision. A buyer who doesn't will keep asking which model is "the best," and that question doesn't have an answer worth giving.