Three model classes show up in almost every enterprise AI conversation right now: base and chat models, reasoning models, and multimodal models. Vendors use these terms to justify pricing tiers, position capabilities, and — occasionally — obscure what their product actually does. An AE who can decode the taxonomy in real time can ask better questions, catch positioning that doesn't hold up, and help a buyer match the right tool to the right problem. That's what this piece is for.
The Three Profiles
Base and Chat Models
What it is. A base model is a large language model trained to predict the next token across a large body of text; left alone, it continues text rather than following instructions. A chat model is a base model that has been further trained to follow instructions and hold a conversation. In practice, when a vendor says "we use GPT-4o" or "our platform runs on Claude 3.5 Sonnet," they mean a chat model. The distinction between base and chat matters for researchers; for enterprise buyers, chat models are the default.
What it does. The model accepts a prompt and generates a response directly, with no deliberation phase in between. The model reads the input, runs it through its parameters, and produces output. This piece calls that a "single forward pass" as shorthand; strictly, the model runs one pass per output token, but nothing happens between reading the prompt and starting to answer. Latency is bounded and predictable: typically one to ten seconds for most enterprise use cases, sub-second for smaller variants like GPT-4o mini or Claude 3 Haiku. For problems requiring multi-step inference, a single forward pass sometimes isn't enough.
Who's behind it. OpenAI (GPT-4o, GPT-4o mini), Anthropic (Claude 3.5 Sonnet, Claude 3 Haiku), Google DeepMind (Gemini 1.5 Flash, Gemini 1.5 Pro), and Meta (Llama 3.x series, available as open weights). Most enterprise AI products are built on one of these, either through API access or a fine-tuned derivative.
What makes it distinct. Cost is per-token, and the token count is bounded by what you put in and what comes out. No hidden compute. No deliberation overhead. For tasks where the answer is retrievable — summarization, classification, drafting, Q&A against a document — this class handles it well and handles it cheaply.
Reasoning Models
What it is. A reasoning model generates a chain of thought before producing its final answer. The model works through intermediate steps (checking its own logic, considering alternatives, catching errors) before committing to a response. OpenAI's o-series documentation calls the tokens generated in this phase "reasoning tokens." Anthropic calls the equivalent mode in Claude 3.7 Sonnet "extended thinking." Whatever the label, the thinking tokens are real tokens, consumed during inference, and they're billed accordingly.
What it does. The model spends additional compute on deliberation. In practice, this means the model produces a longer internal monologue — sometimes visible to the developer, sometimes not — before generating the output the user sees. For a complex coding problem or a multi-step legal analysis, this deliberation catches errors that a single forward pass would miss. For a request to summarize a paragraph, it adds cost and latency without adding value.
Who's behind it. OpenAI's o1, o3, and o3-mini; Anthropic's Claude 3.7 Sonnet with extended thinking enabled; Google's Gemini 2.0 Flash Thinking. DeepSeek-R1 is worth naming separately: it's an open-weights reasoning model from a Chinese AI lab that demonstrated competitive reasoning performance at significantly lower inference cost than the OpenAI o-series, which changed the market's assumptions about what reasoning capability costs. (Pricing numbers move fast in this space; treat any specific figures as illustrative rather than current.)
What makes it distinct. Latency is higher — measured in tens of seconds for complex problems, not seconds. Token cost is higher because the thinking tokens add to the total bill, even when they're not surfaced to the end user. The accuracy improvement on genuinely complex tasks is real and documented. The mistake buyers make is applying reasoning models to tasks that don't require them, which is like running every authentication request through step-up MFA: technically possible, practically expensive, and operationally painful.
Multimodal Models
What it is. A multimodal model accepts non-text inputs — images, audio, video — alongside text prompts. The model processes multiple input types in a single inference call. "Multimodal" describes the input channel, not a performance tier. A multimodal model can be a fast chat model (GPT-4o accepts images and audio) or a reasoning model (Gemini 2.0 with thinking enabled accepts video). The term tells you what goes in, not how the model processes it.
What it does. For image input: the model interprets visual content — documents, screenshots, diagrams, photos — and responds to questions about it. For audio input: the model processes spoken language, sometimes with speaker tone and cadence as additional signal. For video input: the model processes sequences of frames, which is computationally expensive and slower than image processing. The maturity varies significantly by modality. Image input is production-ready across multiple providers. Audio is newer and more variable. Video is real but expensive enough that most enterprise deployments treat it as a specialized capability rather than a default.
Who's behind it. GPT-4o (image, audio), Gemini 1.5 Pro and Gemini 2.0 Flash (image, audio, video), Claude 3.x series (image). Meta's Llama 3.2 brought image input to the Llama open-weights family. Pricing for non-text inputs is typically calculated differently from text tokens: image inputs are often priced per image or per tile, which means the cost model needs its own conversation, separate from the one for text-only deployments.
What makes it distinct. The capability is real; the use case has to justify the input type. A vendor who says "our platform is multimodal" is telling you something about what the model can accept, not necessarily about what their product does with it. The useful follow-up question is: which modalities, in which workflows, and what's the latency and cost profile for each?
Comparing the Three Classes
Multimodal is not a point on the same spectrum as base/chat and reasoning. It's a different axis entirely. Base/chat versus reasoning describes how the model processes input. Multimodal describes what input types the model accepts. You can have a multimodal chat model (GPT-4o) or a multimodal reasoning model (Gemini 2.0 Flash Thinking). Treating all three as equivalent tiers on a capability ladder is a vendor slide habit that doesn't survive contact with the billing dashboard.
The comparison runs across four dimensions.
Cost structure. Base and chat models charge per input token plus per output token. The bill is predictable because the token count is bounded by what you send and what you receive. Reasoning models add a third cost component: thinking tokens, which are generated during deliberation and billed even when they're not shown to the user. This means the actual cost of a reasoning model call can be two to five times higher than the quoted output token price suggests, depending on how much the model deliberates. Multimodal models introduce modality-specific pricing — image inputs are often priced per image tile rather than per token, audio by the second, video by the frame or second. Any cost comparison across vendors that doesn't account for modality pricing is incomplete.
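The thinking-token arithmetic is easier to see with numbers. The sketch below is illustrative only: the prices are made up, the `TokenPrices` and `call_cost` names are hypothetical, and it assumes thinking tokens are billed at the output-token rate, which should be verified against each provider's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    """Hypothetical per-million-token prices in USD; not real vendor rates."""
    input_per_million: float
    output_per_million: float


def call_cost(prices: TokenPrices, input_tokens: int, output_tokens: int,
              thinking_tokens: int = 0) -> float:
    """Estimate the cost of one call.

    Assumption: thinking tokens are billed at the output-token rate.
    """
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * prices.input_per_million
            + billed_output * prices.output_per_million) / 1_000_000


# Illustrative numbers only.
prices = TokenPrices(input_per_million=2.0, output_per_million=10.0)
visible_only = call_cost(prices, input_tokens=1_000, output_tokens=500)
with_thinking = call_cost(prices, input_tokens=1_000, output_tokens=500,
                          thinking_tokens=2_000)

print(f"visible only:  ${visible_only:.4f}")
print(f"with thinking: ${with_thinking:.4f}")
print(f"multiplier:    {with_thinking / visible_only:.1f}x")
```

With these made-up figures, the same visible answer costs several times more once 2,000 thinking tokens are added. The useful buyer question is how many thinking tokens a typical request in their workload generates, because that number drives the multiplier.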
Latency profile. Base and chat models return responses in one to ten seconds for most enterprise prompts. Smaller, distilled variants return faster. Reasoning models are slower by design — the deliberation phase adds time, and for complex problems, response times of thirty to ninety seconds are documented. This isn't a bug; it's the mechanism. But it has architectural implications: you can't block a user interface waiting for a reasoning model the way you might for a chat model. Multimodal latency depends on the modality. Image processing adds modest overhead. Video processing is substantially slower, often requiring asynchronous handling.
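For technical buyers, the "don't block the interface" point can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference architecture: `slow_reasoning_call` is a hypothetical stand-in that sleeps instead of calling any real API, and the delay is shortened so the example runs quickly.

```python
import asyncio


async def slow_reasoning_call(prompt: str, delay_s: float) -> str:
    """Hypothetical stand-in for a reasoning-model request; no real API is called."""
    await asyncio.sleep(delay_s)
    return f"answer to: {prompt}"


async def handle_request(prompt: str, delay_s: float = 0.2) -> list[str]:
    """Start the slow call in the background and keep reporting progress."""
    events = []
    task = asyncio.create_task(slow_reasoning_call(prompt, delay_s))
    events.append("accepted")  # immediate acknowledgement to the user
    while not task.done():
        events.append("working")  # e.g. a progress indicator in the UI
        await asyncio.sleep(delay_s / 4)
    events.append(task.result())
    return events


events = asyncio.run(handle_request("reconcile these ledgers"))
print(events[0], "...", events[-1])
```

The user sees an acknowledgement right away and the answer arrives when the model finishes. The same pattern, typically with a job queue rather than an in-process task, is what "asynchronous handling" means for long video or reasoning workloads.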
Task fit. Base and chat models handle the majority of enterprise AI tasks well: summarization, classification, drafting, retrieval-augmented Q&A, simple code generation. Reasoning models earn their cost on tasks where intermediate steps matter — complex debugging, multi-step financial analysis, legal reasoning across long documents, agentic workflows where the model needs to plan before acting. A task warrants a reasoning model when intermediate steps matter, early errors compound, and accuracy outweighs speed. Multimodal models are appropriate when the input genuinely exists in a non-text form — a scanned document, a screenshot of an error, a recorded support call. Converting those inputs to text first and then using a chat model is sometimes cheaper; sometimes the modality carries signal that text conversion loses.
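The task-fit test above can be written down as a simple routing rule. The sketch below is one hypothetical encoding, with invented field and function names; it treats the three criteria as a strict "all must hold" check and keeps modality as a separate question, consistent with the two-axis framing in this piece.

```python
from dataclasses import dataclass
from enum import Enum


class ModelClass(Enum):
    CHAT = "chat"
    REASONING = "reasoning"


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a workload; field names are illustrative."""
    intermediate_steps_matter: bool
    early_errors_compound: bool
    accuracy_outweighs_speed: bool
    input_modality: str = "text"  # "text", "image", "audio", or "video"


def choose_model_class(task: Task) -> ModelClass:
    """Pick a reasoning model only when all three criteria hold."""
    if (task.intermediate_steps_matter
            and task.early_errors_compound
            and task.accuracy_outweighs_speed):
        return ModelClass.REASONING
    return ModelClass.CHAT


def needs_multimodal(task: Task) -> bool:
    """Modality is a separate axis: it depends only on the input type."""
    return task.input_modality != "text"


summarize_memo = Task(False, False, False)
debug_from_screenshot = Task(True, True, True, input_modality="image")

print(choose_model_class(summarize_memo).value, needs_multimodal(summarize_memo))
print(choose_model_class(debug_from_screenshot).value,
      needs_multimodal(debug_from_screenshot))
```

Summarizing a memo routes to a chat model with text input; debugging from a screenshot routes to a reasoning model and also needs image input. The two answers are independent, which is the point of treating modality as its own axis.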
Vendor positioning tells. When a vendor says their product uses a reasoning model for all requests, ask about latency and cost at scale — the numbers will either justify the choice or reveal that the model isn't actually being used for reasoning-intensive tasks. When a vendor says "multimodal" without specifying which modalities and which workflows, the term is doing marketing work rather than technical work. When a vendor positions a base model as "fast and efficient," that's accurate — and the question is whether the tasks their product handles actually require more than a single forward pass. Most do not.
Field Language Guide
| Instead of saying | Say this | Why it matters |
|---|---|---|
| "It's smarter" | "It's designed for tasks where multi-step reasoning improves accuracy" | "Smarter" is unmeasurable; task fit is a purchasing criterion |
| "The reasoning model thinks longer" | "It generates additional tokens during a deliberation phase before producing the final response" | Buyers need to understand the cost mechanism, not the anthropomorphism |
| "It can see images" | "It accepts image input alongside text prompts" | "See" implies comprehension; "accepts" is accurate and testable |
| "Reasoning models are always better" | "Reasoning models are appropriate when the task requires multi-step inference; for retrieval or summarization, they add cost without benefit" | Prevents buyers from over-specifying and then complaining about cost |
| "Multimodal means it can do everything" | "Multimodal refers to which input types the model accepts — image, audio, or video alongside text" | Scopes the capability accurately before the demo |
| "Base models are the cheap version" | "Base and chat models are optimized for speed and cost at tasks where extended reasoning isn't required" | Avoids implying inferiority; frames the tradeoff correctly |
| "The thinking is hidden" | "The deliberation tokens are generated and billed but not always surfaced in the user interface" | Surfaces the cost implication without sounding evasive |
| "It's expensive because it's better" | "Reasoning models cost more because they consume additional tokens during the deliberation phase" | Connects price to mechanism, which buyers can evaluate |
| "We use the latest model" | "We use [specific model], which is a [chat/reasoning] model optimized for [specific task type]" | Forces specificity that vendor slides often avoid |
| "Multimodal is more capable" | "Multimodal expands which input types the workflow can handle; whether that matters depends on where your data lives" | Keeps the buyer focused on their actual use case |
| "It processes video" | "It accepts video input; latency and cost for video are higher than for image or text, and most deployments treat it as a specialized capability" | Sets accurate expectations before the proof of concept |
Callout: Okta Concept Mapping
The analog: step-up authentication.
Base and chat models behave like standard authentication: fast, appropriate for most requests, the default path. Reasoning models behave like step-up auth: triggered for high-stakes decisions, slower by design, more expensive per transaction, but justified when the consequences of a wrong answer are significant. Multimodal models are like adding a biometric factor, a different input channel that requires different infrastructure and different cost assumptions. Where the analogy holds: latency expectations, and the principle that you don't route every request through the most expensive path just because it exists. Where it breaks: the deliberation inside a reasoning model isn't auditable the way an authentication event is. You can inspect a SAML assertion; you often can't inspect the chain of thought that produced a reasoning model's output, even when it's technically available. In a buyer conversation, this is worth naming: the model's reasoning process is more opaque than the authentication flows your buyers are used to governing, and that has compliance implications they'll eventually ask about.
Specific pricing figures, latency benchmarks, and context window sizes for individual models change frequently. Treat any numbers in this piece as illustrative of relative relationships rather than current specifications. Always verify against provider documentation before using in a customer conversation.

