Lesson 7: Model Taxonomy for Enterprise Buyers

By Leigh Garrity, May 9, 2026

Three model classes matter for enterprise AI purchasing: base and chat models, reasoning models, and multimodal models. You'll encounter all three in the same conversation — sometimes in the same sentence — because buyers are pattern-matching from vendor marketing, not from a taxonomy. Knowing what each class actually does, who makes it, and what distinguishes it from the others gives you something specific to say when the conversation goes there. Precise language here is the difference between sounding like a peer and sounding like you read a press release.

Base and Chat Models

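What it is: A base model is a language model trained on large text corpora to predict the next token. A chat model is a base model fine-tuned for conversational interaction, and it is the form most buyers actually deploy.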

What it does: Handles general-purpose language tasks at high throughput and relatively low cost. Summarization, drafting, classification, question-answering, first-line triage. When a buyer says "we're using AI to handle support tickets" or "we've got a chatbot on our intranet," they're almost certainly describing a chat model, whether they know the term or not. These are the workhorses of enterprise AI deployment.

Who's behind it: OpenAI (GPT-4o, GPT-4o mini), Anthropic (Claude 3.5 Haiku, Claude 3.5 Sonnet), Google (Gemini 2.0 Flash), Meta (Llama 3.x, open-weight), Mistral (open-weight releases). The open-weight options from Meta and Mistral deserve a separate note for public sector buyers: if the buyer's agency can't route data to a third-party API, self-hosting an open-weight model such as Llama is a real option that changes the cost and compliance picture entirely.

What makes it distinct: Speed and cost. A chat model returns a response in seconds, costs fractions of a cent per request at scale, and handles the vast majority of enterprise AI use cases without complaint. The tradeoff is that it doesn't deliberate; it generates the most statistically likely continuation of your prompt. For most tasks, that's exactly what you want. The mistake is assuming that a more expensive model is a better fit for the task. Often it isn't.

Okta Concept Mapping: Standard Accounts vs. Privileged Accounts

Chat models map reasonably well to standard user accounts: provisioned broadly, appropriate for most workflows, cheap to scale. You don't reach for a privileged account when a standard one covers the task. Same logic applies here — chat models are the default tier, not the fallback. The analogy breaks at the governance layer. A standard user account has defined permissions enforced by your IdP. A chat model's effective "permissions" are determined by the system prompt and API configuration, which is a different kind of access control problem — and one that doesn't have a clean SCIM schema yet.

Reasoning Models

What it is: A model class that generates extended internal deliberation before producing a final response.

What it does: Breaks complex problems into steps, evaluates intermediate conclusions, and produces outputs that reflect multi-stage analysis. The deliberation happens in what vendors call a chain-of-thought or "thinking" process. Those reasoning steps consume compute and cost money, whether or not they're visible to the end user. Use cases where this matters: contract analysis, policy interpretation, multi-step compliance checks, security architecture review, anything where the path to the answer is as important as the answer itself.

Who's behind it: OpenAI (o3, o4-mini — the "o" prefix is the tell), Anthropic (extended thinking mode in Claude 3.7 Sonnet), DeepSeek (R1, open-weight, which is why it attracted attention — a reasoning model you can run yourself), Google (Gemini 2.0 Flash Thinking). The DeepSeek-R1 entry matters specifically because it decoupled "reasoning model" from "expensive API." That changes the conversation for buyers with data residency constraints.

What makes it distinct: The deliberation is real and it costs real money. A reasoning model call can run 10 to 50 times the cost of a comparable chat model call, depending on problem complexity (pricing subject to change — check current API documentation). Scope them correctly and the cost is justified. Treat them as the default and your buyer's CFO will notice by Q2.

One clarification worth having ready: reasoning models spend more tokens on deliberation. That is not the same as being more intelligent, more conscious, or more reliable in some general sense. It means the model checks its work more systematically before committing to an output. For complex, multi-step problems, that's valuable. For high-volume, low-complexity tasks, it's expensive overhead.
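
To make the cost multiple above concrete, here is a minimal arithmetic sketch. Every number in it is an assumption for illustration: the per-million-token prices are hypothetical, not any vendor's published rates, and the sketch assumes reasoning tokens are billed at the output-token rate, which you should confirm against current API documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """Hypothetical prices in USD per million tokens (not real vendor pricing)."""
    input_per_million: float
    output_per_million: float


def call_cost(prices: PriceSheet, input_tokens: int, output_tokens: int,
              reasoning_tokens: int = 0) -> float:
    """Cost of one call in USD.

    Assumption: hidden reasoning tokens are billed like output tokens.
    """
    billed_output = output_tokens + reasoning_tokens
    total = (input_tokens * prices.input_per_million
             + billed_output * prices.output_per_million)
    return total / 1_000_000


# Illustrative price sheets only; substitute current published rates.
CHAT = PriceSheet(input_per_million=0.50, output_per_million=2.00)
REASONING = PriceSheet(input_per_million=2.00, output_per_million=8.00)

# Same prompt and same visible answer length; the reasoning call also
# spends 4,000 tokens on deliberation the end user may never see.
chat_cost = call_cost(CHAT, input_tokens=2_000, output_tokens=500)
reasoning_cost = call_cost(REASONING, input_tokens=2_000, output_tokens=500,
                           reasoning_tokens=4_000)

print(f"chat call:      ${chat_cost:.4f}")
print(f"reasoning call: ${reasoning_cost:.4f}")
print(f"multiple:       about {reasoning_cost / chat_cost:.0f}x")
```

With these assumed inputs the multiple lands at about 20x, inside the 10 to 50 range. Most of the difference comes from the deliberation tokens rather than the price sheet, which is why the multiple moves with problem complexity.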

Okta Concept Mapping: Privileged Access Management

Reasoning models follow the same provisioning logic as privileged accounts: you don't make them the default, you scope them to specific high-stakes workflows, and you audit the usage. A PAM workflow requires justification before escalation. A well-architected AI system should require the same before routing a task to a reasoning-tier model. The analogy breaks at the cost model — PAM doesn't charge per session the way a reasoning model API does, and "session" in PAM has defined scope and duration that reasoning model calls don't always carry. The governance instinct transfers. The billing model doesn't.
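
The "justify before escalating, then audit" idea can be sketched as a small router. This is an illustration under stated assumptions, not a description of any Okta or vendor feature: the class names, the workflow labels, and the rule that a missing justification falls back to the chat tier are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class RoutingDecision:
    tier: str    # "chat" or "reasoning"
    reason: str  # why this tier was chosen


@dataclass
class TierRouter:
    """Hypothetical router: chat tier by default, reasoning tier by exception."""
    approved_workflows: Set[str]
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def route(self, workflow: str,
              justification: Optional[str] = None) -> RoutingDecision:
        # Escalate only when the workflow is pre-approved AND the caller
        # supplies a justification, mirroring a PAM escalation request.
        if workflow in self.approved_workflows and justification:
            decision = RoutingDecision("reasoning", justification)
        else:
            decision = RoutingDecision("chat", "default tier")
        # Every decision is recorded, escalated or not, so usage can be audited.
        self.audit_log.append((workflow, decision.tier, decision.reason))
        return decision


router = TierRouter(approved_workflows={"contract_review", "compliance_gap_analysis"})
print(router.route("ticket_routing").tier)
print(router.route("contract_review", "clause-level risk analysis").tier)
```

The design choice worth pointing out to a buyer is the default: the cheap tier is what you get unless someone argues for the expensive one, and the log shows who argued.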

Multimodal Models

What it is: A model that accepts inputs beyond text — images, audio, video, structured documents — and in some cases generates outputs in those formats as well.

What it does: Processes mixed-input tasks. A buyer asking "can the AI read our scanned contracts?" is asking about vision capability. A buyer asking "can the AI transcribe and summarize our call recordings?" is asking about audio input. A buyer asking "can the AI generate diagrams for our reports?" is asking about image output, which is a different capability axis than image input. These distinctions matter because not every multimodal model handles every modality, and "multimodal" as a vendor claim doesn't specify which ones.

Who's behind it: OpenAI (GPT-4o handles image input and audio; DALL-E 3 handles image output), Anthropic (Claude 3.5 Sonnet accepts image input), Google (Gemini 1.5 Pro and Gemini 2.0 accept image, audio, and video — currently the broadest native multimodal capability in the market). The capability axes vary significantly by provider, and the gap between "accepts image input" and "processes video natively" is substantial.

What makes it distinct: Multimodality describes which input and output formats a model handles, not where it sits in a performance hierarchy. A multimodal model can also be a chat model or a reasoning model; these categories aren't mutually exclusive. GPT-4o is a chat model with multimodal inputs. Gemini 2.0 Flash Thinking is a reasoning model with multimodal inputs. When a vendor says "our model is multimodal," the follow-up question is: which modalities, in which direction (input, output, or both), and under what data handling terms?

That last part matters for public sector buyers more than most. Audio and video inputs often carry PII, and the data handling agreement for a multimodal API call is not the same conversation as the one for a text-only call.
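
One way to keep the "which modalities, in which direction" question honest is to write the answer down as data. The sketch below is illustrative only: the profile is for a made-up model, and the `CapabilityProfile` structure is an assumption, not a vendor format. Fill it in from the vendor's own documentation rather than from the word "multimodal."

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Iterable, Set


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass(frozen=True)
class CapabilityProfile:
    """What a model accepts and what it produces, tracked separately."""
    name: str
    inputs: FrozenSet[Modality]
    outputs: FrozenSet[Modality]

    def gaps(self, needed_inputs: Iterable[Modality],
             needed_outputs: Iterable[Modality]) -> Dict[str, Set[Modality]]:
        """Return the modalities the workflow needs that the model lacks."""
        return {
            "input": set(needed_inputs) - self.inputs,
            "output": set(needed_outputs) - self.outputs,
        }


# Hypothetical model: accepts text and images, returns text only.
profile = CapabilityProfile(
    name="example-vision-chat-model",
    inputs=frozenset({Modality.TEXT, Modality.IMAGE}),
    outputs=frozenset({Modality.TEXT}),
)

# Buyer wants call recordings summarized and diagrams generated.
missing = profile.gaps(
    needed_inputs={Modality.TEXT, Modality.AUDIO},
    needed_outputs={Modality.TEXT, Modality.IMAGE},
)
print(missing)
```

Here the hypothetical model accepts images but cannot produce them, and it cannot take audio at all, so both of the buyer's asks surface as gaps even though the model is fairly described as multimodal.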

Okta Concept Mapping: Federated Identity Attribute Claims

A federated token can carry multiple claims about a user — department, role, clearance level — issued by different attribute authorities. A multimodal model can process multiple input types, each with its own provenance and trust question. The analogy holds for "what can this thing assert or process." It breaks because in federation, you explicitly declare which claims you trust and which IdP issued them. Multimodal capability is more implicit — the model accepts whatever you send it, which creates a governance question your buyer may not have asked yet: who decided the model could process that audio file, and under what data handling agreement? If you're in a federal account, that question has a procurement answer, not just a technical one.
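
The federation habit of declaring trusted claims up front can be borrowed directly. The sketch below is an assumption about how a buyer might gate inputs, not a feature of any model API: each input type must be tied to a named data handling agreement before it is sent, and the agreement identifiers are invented placeholders.

```python
from typing import Dict

# Hypothetical allowlist: input type -> data handling agreement that covers it.
# The agreement IDs are placeholders for illustration.
ALLOWED_INPUTS: Dict[str, str] = {
    "text": "DHA-TEXT-001",
    "image": "DHA-IMG-002",
}


def agreement_for(modality: str) -> str:
    """Return the agreement covering this input type, or refuse the request.

    The model would accept the file either way; this check exists because
    nobody has signed off on sending it.
    """
    agreement = ALLOWED_INPUTS.get(modality)
    if agreement is None:
        raise PermissionError(
            f"No data handling agreement covers {modality!r} inputs"
        )
    return agreement


print(agreement_for("image"))

try:
    agreement_for("audio")
except PermissionError as err:
    print(f"blocked: {err}")
```

The point of the explicit list is that "who decided the model could process that audio file" gets an answer in configuration, which is where a procurement reviewer can find it.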

Comparison Strategy: Scenario Mapping

This section uses scenario mapping rather than clustering or trait-led analysis. A capability cluster produces a spec sheet. A trait-led analysis produces a deep dive. Scenario mapping produces a decision tool, which is what this piece is for — the reader's actual use case is "buyer describes a workflow, I need to match it to a model class."

High-volume, low-complexity tasks (ticket routing, email drafting, document summarization, FAQ response): chat model. Fast, cheap, appropriate. Don't over-engineer this.

Complex analysis requiring multi-step reasoning (contract review, policy interpretation, compliance gap analysis, security architecture assessment): reasoning model. The deliberation cost is justified by the task complexity. Scope carefully — not every step in the workflow needs reasoning-tier compute.

Tasks involving non-text inputs (scanned documents, call recordings, images, video): multimodal capability required. The model class — chat or reasoning — depends on task complexity, not on the input format. A scanned contract that needs clause-by-clause analysis probably wants a reasoning model with vision capability. A scanned form that needs field extraction probably wants a chat model with vision capability. These are different products.

Cost-sensitive or data-residency-constrained environments: open-weight models (Llama 3.x, DeepSeek-R1, Mistral) change the equation. Self-hosting is a real option, and for federal buyers operating under specific data handling requirements, it may be the only option. The capability gap between frontier APIs and well-run open-weight deployments has narrowed considerably in the past 18 months.

The "we just need AI" buyer: this is the most common scenario and the most dangerous one to accept at face value. "We just need AI" usually means "we have a task we think AI can help with and we haven't scoped it yet." Your job in that conversation is to get to the task type before the model class comes up, because the model class follows from the task type, not the other way around.
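
The scenarios above can be condensed into a small decision function. Treat it as a memory aid under stated assumptions rather than a sizing tool: the `Task` fields and the wording of the recommendations are hypothetical, and real scoping involves more than four yes-or-no questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical summary of what the buyer described."""
    multi_step: bool       # needs multi-stage analysis, not a single pass
    non_text_inputs: bool  # scans, recordings, images, video
    must_self_host: bool   # cost or data residency rules out third-party APIs
    scoped: bool = True    # False for the "we just need AI" conversation


def recommend(task: Task) -> str:
    """Map a described workflow to a model class, following the scenarios."""
    if not task.scoped:
        # No task type yet, so there is nothing to match a model class to.
        return "scope the task first"

    # Complexity picks the class; input format does not.
    model_class = "reasoning" if task.multi_step else "chat"
    parts = [f"{model_class} model"]

    # Modality and hosting are requirements layered on top of the class.
    if task.non_text_inputs:
        parts.append("with the required input modalities")
    if task.must_self_host:
        parts.append("open-weight, self-hosted")
    return ", ".join(parts)


# Scanned form needing field extraction.
print(recommend(Task(multi_step=False, non_text_inputs=True, must_self_host=False)))
# Scanned contract needing clause-by-clause analysis.
print(recommend(Task(multi_step=True, non_text_inputs=True, must_self_host=False)))
```

The two examples differ only in `multi_step`, which mirrors the scanned-form versus scanned-contract distinction: same input format, different model class.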

How to Say This in the Field

Don't say	Do say	Why it matters
"The AI is thinking"	"The model is running extended reasoning — that's a deliberate compute choice, not a metaphor for cognition"	Buyers who hear "thinking" start asking about AI consciousness; that conversation doesn't close deals
"We need the smartest model"	"We need to match the model class to the task — reasoning models cost more and aren't always the right fit"	"Smartest" has no technical definition and invites the vendor to sell you their most expensive tier
"GPT" as a generic term for any AI	"Which model class are you evaluating — chat, reasoning, or multimodal?"	GPT is a specific product line; using it generically signals you're not tracking the market
"It can see images"	"The model accepts image inputs"	"See" implies perception; "accepts inputs" is accurate and keeps the conversation technical
"DeepSeek is cheaper"	"DeepSeek-R1 is an open-weight reasoning model — the cost depends entirely on your hosting environment"	API cost and self-hosting cost are completely different numbers; conflating them misleads the buyer
"Reasoning models are more accurate"	"Reasoning models spend more tokens on deliberation — that's appropriate for complex tasks, not a general accuracy upgrade"	Accuracy is task-dependent; this framing prevents the buyer from over-scoping to the expensive tier
"The model understands the document"	"The model processes the document and generates structured output"	"Understands" is a claim about comprehension; "processes and generates" is a claim about mechanism — and only one of those is defensible
"Multimodal means it can do everything"	"Multimodal means it accepts non-text inputs — which modalities, and in which direction, depends on the specific model"	"Everything" is a sales claim; the buyer will test it and find the gaps, and you'll own those gaps
"We'll use the AI for that"	"A chat model handles that — fast, cost-effective, right for this task type"	Specificity signals competence; "the AI" signals you're not tracking what you're selling
"The o-series models think like humans"	"The o-series models run extended chain-of-thought reasoning before generating a final response"	This framing invites regulatory and ethical questions you can't answer in a sales conversation
"Extended thinking is just slower"	"Extended thinking trades latency for deliberation quality — it's a cost-performance tradeoff, not a limitation"	"Just slower" undersells the capability and misrepresents what the buyer is paying for
"Any model can handle that"	"A standard chat model handles that — you don't need reasoning-tier compute for this task"	Precision here saves the buyer money and positions you as an advisor, not a vendor pushing the expensive option

The taxonomy isn't complicated once you stop treating "AI" as a single thing. Three classes, three distinct use cases, three different cost profiles. The buyer who knows which one they need is easier to help than the buyer who doesn't — and getting them there is a five-minute conversation if you have the vocabulary ready before you walk in.