Three Model Classes: A Purchasing Framework for Base, Reasoning, and Multimodal AI

By Carey Whitten, May 5, 2026

Lesson 7 — AI Foundations | Technical Comparison

What's in the Field and Why It Shows Up in Your Accounts

Base and chat models, reasoning models, and multimodal models are the three purchasing categories you'll encounter in federal AI procurements — in RFIs that ask vendors to specify which model class their solution uses, in CAIO conversations about cost-per-query trade-offs, and in technical evaluations where the wrong model class produces the wrong results regardless of how well everything else is configured. Knowing the distinction precisely — not just that reasoning models "think harder" but what that means architecturally and what it costs — is what separates an AE who can hold a technical conversation from one who defers to the SE every time a buyer asks a follow-up.

Profile: Base and Chat Models

What it is. A language model trained to predict and generate text, optimized for low-latency response at scale.

What it does. Handles the majority of deployed enterprise AI workloads: document summarization, policy Q&A, email drafting, classification, conversational interfaces, and retrieval-augmented generation (RAG) pipelines where the model synthesizes retrieved content into a coherent answer. These are the models behind most AI products currently in production in federal civilian agencies. When a buyer says their agency has "deployed an AI assistant," they almost certainly mean a chat model.

Who's behind it / where it comes from. Representative examples are GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Flash (Google DeepMind), and Llama 3 (Meta, open weights). These models underpin most enterprise AI integrations available today. Llama 3 is notable because its open weights let agencies deploy it in air-gapped environments, a procurement consideration that comes up frequently in defense and intelligence contexts.

What makes it distinct. A chat model generates its answer directly, token by token, with no separate deliberation phase before it responds. It doesn't check its work or allocate additional computation to harder questions. Response time is fast (typically under two seconds for most queries) and cost per query is low relative to other model classes. Fast response at low cost is the design intent, not a trade-off. For the vast majority of enterprise use cases, direct generation is exactly what's needed.

Profile: Reasoning Models

What it is. A model class that generates extended internal deliberation before producing a final answer, trading latency and cost for accuracy on complex multi-step problems.

What it does. Works through problems sequentially — generating intermediate reasoning steps, evaluating them, and revising before producing output. This makes reasoning models substantially more reliable on tasks that require logical chains: regulatory compliance analysis, benefits adjudication, contract review, code generation for complex systems, and any workflow where a wrong intermediate step cascades into a wrong final answer. The output is slower and more expensive than a chat model response. On a simple summarization task, a reasoning model produces no better result than a chat model and costs significantly more.

Who's behind it / where it comes from. OpenAI's o-series (o3, o4-mini) is the most widely cited line; "o-series" is OpenAI's product designation for this class. Claude 3.7 Sonnet with extended thinking is Anthropic's equivalent capability, marketed under a different term. Google DeepMind's Gemini 2.0 Flash Thinking covers the same class. DeepSeek-R1 is an open-weights reasoning model from the Chinese AI lab DeepSeek. It is procurement-relevant because its open-weights availability makes it a candidate for self-hosted federal deployments, and it has appeared in federal AI evaluation discussions. The cost differential is real and significant: reasoning models typically run at 5–15x the per-token cost of their chat-class counterparts from the same provider (illustrative benchmark; subject to change by provider).
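
To see how the per-token multiplier and the extra internal tokens compound, here is a minimal Python sketch. Every price and token count in it is a made-up illustrative number, not a provider rate. The names Pricing and cost_per_query are hypothetical, and the sketch assumes internal reasoning tokens are billed at the output-token rate, which should be verified against current provider pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not real provider rates)."""
    input_per_million: float
    output_per_million: float


def cost_per_query(pricing, input_tokens, visible_output_tokens, reasoning_tokens=0):
    """Estimate the cost of one query.

    Assumption: internal reasoning tokens are billed like output tokens.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_per_million
             + billed_output * pricing.output_per_million)
    return total / 1_000_000


# Illustrative prices: the reasoning model is priced at 10x the chat model,
# which sits inside the 5-15x range quoted above.
chat = Pricing(input_per_million=1.0, output_per_million=4.0)
reasoning = Pricing(input_per_million=10.0, output_per_million=40.0)

# Same question and same visible answer length for both models; the
# reasoning model also generates 3,000 internal tokens (assumed figure).
chat_cost = cost_per_query(chat, input_tokens=2000, visible_output_tokens=500)
reasoning_cost = cost_per_query(reasoning, input_tokens=2000,
                                visible_output_tokens=500, reasoning_tokens=3000)

QUERIES_PER_DAY = 5000  # assumed volume
print(f"chat:      ${chat_cost:.4f}/query, ${chat_cost * QUERIES_PER_DAY:,.2f}/day")
print(f"reasoning: ${reasoning_cost:.4f}/query, ${reasoning_cost * QUERIES_PER_DAY:,.2f}/day")
```

With these assumed numbers the reasoning-class query costs about 40 times the chat-class query, because the 10x price multiplier is compounded by the internal tokens. The per-token ratio alone understates the per-query difference.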

What makes it distinct. The extended deliberation happens at inference time and is not a prompt engineering technique. Asking a chat model to "think step by step" is a prompting approach that improves chat model output at the margins; it does not replicate reasoning-model behavior. A reasoning model is trained to spend additional compute on internal token generation before it produces its response, so the capability is built into the model rather than added by the prompt. CAIOs sometimes ask whether their existing chat model deployment can be "upgraded" to reasoning capability through better prompting. The answer is no.

Profile: Multimodal Models

What it is. A model that accepts inputs beyond text — images, audio, video — and reasons across modalities within a single inference call.

What it does. Processes satellite imagery, interprets charts and diagrams embedded in documents, analyzes audio recordings, handles video frames. The practical federal use cases span document processing (forms with handwritten fields, scanned contracts, diagrams in technical specifications), geospatial analysis, and accessibility workflows that require audio transcription paired with content analysis. The capability axes — image, audio, video — are not uniformly present across all multimodal models. A buyer specifying "multimodal" without naming the required modality may be describing a model that handles images but not audio, or video but not real-time audio.

Who's behind it / where it comes from. GPT-4o (OpenAI — image and audio input, generally available). Gemini 1.5 Pro (Google DeepMind — image, audio, and video input). Claude 3.5 Sonnet (Anthropic — image input; audio and video support varies by version and availability). Capability coverage is model-version-specific and changes with releases — in a procurement context, verify the specific modalities against current provider documentation rather than assuming the model name implies full multimodal coverage.

What makes it distinct. The differentiating trait is input modality breadth, not reasoning architecture. A multimodal model may or may not have reasoning-class performance; these are orthogonal dimensions. GPT-4o is multimodal and chat-class. A model can also be both multimodal and reasoning-class: some reasoning-class models, such as OpenAI's o3, also accept image input. When a buyer asks for "a multimodal reasoning model," they're specifying both axes simultaneously, and the AE needs to know whether the solution covers both.
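
One way to keep the two axes separate in a technical evaluation is to record them as independent fields and check each one against the buyer's requirement. The sketch below is a hypothetical illustration: ModelProfile, unmet_requirements, and the candidate model name are invented for the example, and the capability values are placeholders rather than facts about any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Two independent axes: which inputs a model accepts, and its class."""
    name: str
    input_modalities: frozenset  # e.g. {"text", "image"}
    reasoning_class: bool


def unmet_requirements(model, required_modalities, needs_reasoning):
    """Return a list of gaps between a model and a buyer's requirement."""
    missing = sorted(set(required_modalities) - model.input_modalities)
    gaps = [f"missing {modality} input" for modality in missing]
    if needs_reasoning and not model.reasoning_class:
        gaps.append("not reasoning-class")
    return gaps


# Placeholder candidate: accepts text and images, chat-class.
candidate = ModelProfile(
    name="example-multimodal-chat",
    input_modalities=frozenset({"text", "image"}),
    reasoning_class=False,
)

# The buyer asked for "a multimodal reasoning model" that must handle audio.
gaps = unmet_requirements(candidate, {"text", "image", "audio"}, needs_reasoning=True)
print(gaps)
```

An empty list means the candidate covers both axes. Anything else is a gap to raise before the solution goes into a response.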

Comparison: Scenario Mapping

The comparison below uses scenario mapping as its organizing frame. It presents three federal task scenarios, each matched to the model class that fits it, with explicit reasoning for why the alternatives don't. Every model class appears in every scenario, either as the right answer or as the explicit wrong answer with a reason.

Scenario 1: Policy Q&A assistant for a civilian agency's benefits portal

A benefits agency deploys a conversational interface that helps claimants understand eligibility criteria, navigate the application process, and get status updates. Queries are natural language, answers draw from a curated document corpus via RAG, and response time matters for user experience.

Right answer: Chat model. Fast, low cost, adequate accuracy for well-scoped retrieval tasks. The query complexity doesn't require extended deliberation — the answer is in the retrieved document or it isn't.

Why not reasoning: Overkill and expensive. A reasoning model on a factual retrieval task produces no accuracy improvement and adds latency and cost. If the agency is running thousands of queries per day, the cost differential is material.

Why not multimodal: Wrong axis. Unless the portal handles uploaded documents with images or audio queries, multimodal capability adds nothing. Specifying it adds cost and complexity without benefit.

Scenario 2: Regulatory compliance analysis for a financial regulator

An agency's legal team needs to evaluate whether proposed rule changes conflict with existing regulations across a 400-page regulatory corpus. The analysis requires identifying logical dependencies, flagging contradictions, and producing a structured memo with citations.

Right answer: Reasoning model. Multi-step logical analysis with cascading dependencies is exactly the task class where reasoning models outperform chat models. A wrong intermediate inference — "this clause doesn't apply to X" — produces a wrong final memo. The extended deliberation catches those errors.

Why not chat: Accuracy risk on complex chains. A chat model can produce a plausible-looking compliance memo that contains logical errors invisible to a non-expert reviewer. For a legal team that will rely on the output, that's a liability, not a cost saving.

Why not multimodal: Not the constraint. The input is text. Multimodal capability is irrelevant unless the regulatory corpus includes charts or diagrams requiring visual interpretation — in which case a multimodal reasoning model covers both requirements.

Scenario 3: Processing scanned federal forms with handwritten fields

An agency's intake workflow receives thousands of paper forms monthly, scanned to PDF. The forms include handwritten entries, checkboxes, and occasional diagrams. The agency needs to extract structured data and flag incomplete or inconsistent submissions.

Right answer: Multimodal model with image input. The input is an image of a document. A text-only model cannot process it. The task requires visual interpretation of handwriting, layout, and form structure.

Why not chat (text-only): Wrong input modality. A chat model receives text. A scanned form is an image. The model cannot process what it cannot receive.

Why not reasoning: Depends on the task. For straightforward field extraction, a multimodal chat-class model is sufficient and cheaper. If the agency needs the model to evaluate whether the submitted information is logically consistent with other records — a more complex analytical task — a multimodal reasoning model is the right specification. The modality and the reasoning architecture are separate decisions.
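
The three scenarios reduce to two questions asked separately: what does the input look like, and does the task need multi-step analysis? A minimal sketch of that decision follows. The function name recommend_spec and its labels are hypothetical, and a real evaluation would weigh more factors (latency targets, hosting constraints, budget) than these two.

```python
def recommend_spec(input_modalities, multi_step_analysis):
    """Map a task to (model class label, required non-text modalities).

    The modality question and the reasoning question are answered
    independently, then combined into one specification.
    """
    non_text = sorted(set(input_modalities) - {"text"})
    model_class = "reasoning" if multi_step_analysis else "chat"
    if non_text:
        return f"multimodal {model_class}-class", non_text
    return f"{model_class}-class", []


# Scenario 1: policy Q&A over retrieved text.
scenario_1 = recommend_spec({"text"}, multi_step_analysis=False)
# Scenario 2: regulatory conflict analysis over a text corpus.
scenario_2 = recommend_spec({"text"}, multi_step_analysis=True)
# Scenario 3: field extraction from scanned forms.
scenario_3 = recommend_spec({"image"}, multi_step_analysis=False)
# Scenario 3 variant: also check consistency against other records.
scenario_3_variant = recommend_spec({"image", "text"}, multi_step_analysis=True)

for label, spec in [("1", scenario_1), ("2", scenario_2),
                    ("3", scenario_3), ("3 variant", scenario_3_variant)]:
    print(f"Scenario {label}: {spec}")
```

The Scenario 3 variant is the case where both answers are yes, so the specification names a modality and a reasoning-class model together.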

Field Language Guide

Don't say	Do say	Why it matters
"GPT-4" (as a generic term for any LLM)	"Large language model" or name the specific model	Buyers use GPT-4 as a brand name for all AI; using it precisely signals you know the difference
"Thinking model"	"Reasoning model"	"Thinking" is vendor product language (Anthropic's extended thinking, Google's Gemini 2.0 Flash Thinking); "reasoning model" is the class name that works across vendors
"Smarter model"	"Reasoning model — appropriate for complex multi-step tasks"	"Smarter" implies the chat model is deficient; reasoning models are a task-fit choice, not a quality upgrade
"Vision model"	"Multimodal model with image input"	Vision is one modality; assuming it covers audio or video creates a spec mismatch
"Extended thinking" (without context)	"Anthropic's reasoning model capability, equivalent to OpenAI's o-series"	Extended thinking is Anthropic-specific; cross-vendor conversations need the class name
"o-series" (without context)	"OpenAI's reasoning model line — o3, o4-mini"	o-series is OpenAI's product naming; buyers outside the OpenAI ecosystem may not recognize it
"Base model"	"Chat model" or "foundation model" depending on context	Base model technically means pre-instruction-tuning; most deployed models are instruction-tuned chat models — the terms aren't interchangeable
"Frontier model"	"Leading-capability model from [provider name]"	Frontier is a marketing term with no standard definition; it signals nothing specific about model class
"The model reasons through it"	"The model uses extended deliberation — it's a reasoning-class model"	"Reasons" is used loosely to mean "produces good output"; when it has a technical meaning, the collision creates a spec misunderstanding
"Multimodal AI"	"A model that accepts [image / audio / video] input — specify the modality"	Multimodal is often used as a marketing umbrella; buyers need to know which modalities are actually covered
"DeepSeek"	"DeepSeek-R1, an open-weights reasoning model available for self-hosted deployment"	DeepSeek has multiple models; R1 is the reasoning-class one that's procurement-relevant, and the open-weights distinction matters for federal air-gap requirements
"Cost per query"	"Cost per token, which scales with model class — reasoning models generate more tokens internally"	Buyers think in per-query terms; the actual cost driver is token volume, and reasoning models generate significantly more tokens per query

Okta Concept Mapping: Adaptive Authentication

Adaptive authentication matches the authentication method to the risk context of the transaction — low-risk access gets a password, high-risk privileged access gets MFA plus device trust. The logic maps to model class selection: match the model's capability to the task's complexity, and let cost and friction scale with what the task actually requires. Both frameworks say don't use your most expensive, highest-friction method for every transaction. The analogy breaks at timing: adaptive auth is a runtime decision made automatically by policy; model class selection is a design-time decision made by the architect before deployment. When a CAIO asks "can the system automatically use a more powerful model when the query is complex?" — that's a routing architecture question, and the answer depends entirely on how the agent pipeline is built.
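
For the routing question, the sketch below shows the general shape of a runtime router: a cheap check runs first, and only queries that look complex are sent to the reasoning-class model. It is a toy under stated assumptions. The function name route, the keyword list, and the threshold are all hypothetical; production routers more often use a trained classifier or a small model to make this call, and nothing here reflects a specific vendor's routing feature.

```python
# Hypothetical signals that a query needs multi-step analysis.
COMPLEXITY_SIGNALS = ("compare", "conflict", "contradict", "depends on", "consistent")


def route(query, threshold=2):
    """Return "reasoning" or "chat" for a query using a naive keyword score.

    Assumption: counting signal phrases is a stand-in for a real
    complexity classifier.
    """
    text = query.lower()
    score = sum(1 for signal in COMPLEXITY_SIGNALS if signal in text)
    return "reasoning" if score >= threshold else "chat"


print(route("What is the status of my claim?"))
print(route("Compare rule A with rule B and flag any conflict"))
```

The point for a CAIO conversation is that this decision lives in the pipeline, not in the model: someone has to design, build, and tune the router, and its mistakes send either cost or accuracy in the wrong direction.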

Lesson 6 covers sampling parameters (temperature, top-p, seed). Lesson 4 covers training regimes and inference cost as a training-side variable. Model class selection as a purchasing decision is the scope of this lesson.