AI Foundations: Reading a Model Spec Sheet

By Leigh Garrity, May 6, 2026

Recap — Lessons 1–7

You've covered seven lessons. This one doesn't add concepts; it organizes what you already have. Pick it up before a meeting with a CAIO or a technical evaluator. The goal: look at a real vendor spec sheet and know exactly what each field is telling you, and what it isn't.

The Spec Sheet, Annotated

The following is a composite of fields drawn from current major provider documentation. The numbers are illustrative: representative of what providers publish, not contractual.

Model: Apex-4 Turbo (Illustrative Composite)

Context Length: 128,000 tokens

Recall from Lesson 2: a token is roughly three-quarters of a word. 128,000 tokens is approximately 96,000 words, about the length of a short novel. This is the model's working memory for a single inference call. Everything the model "knows" about your conversation, your documents, and your instructions lives here. The window is a hard ceiling: once a conversation outgrows it, the application has to truncate or summarize earlier content, and whatever is cut is invisible to the model. There is no persistence between calls unless you engineer it. A federal agency asking whether the model can "remember" previous sessions is asking a question this field doesn't answer.
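The conversion above can be sketched as a quick fit check. This is a rough estimate only: the 0.75 words-per-token ratio is the rule of thumb from the text, real tokenizers vary by language and content, and the function names and the reserved-output figure are hypothetical.

```python
# Rule of thumb from the text: one token is roughly 0.75 English words.
WORDS_PER_TOKEN = 0.75
CONTEXT_LIMIT_TOKENS = 128_000  # illustrative spec-sheet value, not a real product limit


def words_to_tokens(word_count: int) -> int:
    """Estimate the token count for an English word count (approximation only)."""
    return round(word_count / WORDS_PER_TOKEN)


def tokens_to_words(token_count: int) -> int:
    """Estimate how many English words a token budget holds."""
    return round(token_count * WORDS_PER_TOKEN)


def fits_in_context(input_words: int, reserved_output_tokens: int) -> bool:
    """Input and output share the same window, so both count against the limit."""
    estimated_input = words_to_tokens(input_words)
    return estimated_input + reserved_output_tokens <= CONTEXT_LIMIT_TOKENS


print(tokens_to_words(CONTEXT_LIMIT_TOKENS))  # 96000
print(fits_in_context(90_000, 4_000))         # True
print(fits_in_context(95_000, 4_000))         # False
```

The second call fails even though 95,000 words is under the 96,000-word headline figure, because room reserved for the response comes out of the same budget.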

Token Pricing: $2.50 / 1M input tokens · $10.00 / 1M output tokens

Tokens-as-currency, from Lesson 2. Input tokens are what you send; output tokens are what the model generates. Output costs four times as much as input here ($10.00 vs. $2.50 per million tokens), and an output premium of that order is common across providers. The asymmetry matters when you're estimating workload cost. A RAG pipeline that retrieves 20 pages of context per query and generates a two-paragraph summary is spending heavily on input. A workflow generating long reports is spending heavily on output. Know which side your use case lives on before you quote a number.
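A minimal sketch of that estimate, using the illustrative prices from this spec sheet. The per-call token counts for the two workloads are assumptions chosen for illustration, not measurements.

```python
# Illustrative prices from the composite spec sheet (USD per 1M tokens).
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call at the standard (non-reasoning) rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def input_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of the call's cost that comes from input tokens."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    return input_cost / call_cost(input_tokens, output_tokens)


# Hypothetical workloads: (input tokens, output tokens) per call.
workloads = {
    "RAG summary": (12_000, 300),      # large retrieved context, short answer
    "Report generation": (1_500, 6_000),  # short prompt, long output
}

for name, (tokens_in, tokens_out) in workloads.items():
    cost = call_cost(tokens_in, tokens_out)
    share = input_share(tokens_in, tokens_out)
    print(f"{name}: ${cost:.4f} per call, {share:.0%} of cost from input")
```

Multiply the per-call figure by expected call volume to get a workload estimate; the input share shows which price the workload is most sensitive to.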

Reasoning Mode: Available (Extended Thinking) · $15.00 / 1M output tokens

Reasoning models spend tokens deliberating before producing a final answer, as Lesson 5 covered. Those deliberation tokens cost money. The $15.00 rate here is 50% above standard output pricing. Reasoning mode is appropriate for complex, multi-step tasks where accuracy matters more than speed. It is not appropriate for high-volume, low-complexity queries. A document routing workflow doesn't need it. A contract risk analysis might.
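To see why reasoning mode is a poor fit for high-volume, simple queries, compare the same call with and without it. This sketch assumes that deliberation tokens and the final answer are both billed at the $15.00 reasoning rate; providers differ on this, so check the actual billing terms. The token counts are hypothetical.

```python
# Illustrative prices from the composite spec sheet (USD per 1M tokens).
INPUT_PRICE_PER_M = 2.50
STANDARD_OUTPUT_PRICE_PER_M = 10.00
REASONING_OUTPUT_PRICE_PER_M = 15.00


def standard_cost(input_tokens: int, answer_tokens: int) -> float:
    """Cost of a call with reasoning mode off."""
    return (input_tokens * INPUT_PRICE_PER_M
            + answer_tokens * STANDARD_OUTPUT_PRICE_PER_M) / 1_000_000


def reasoning_cost(input_tokens: int, thinking_tokens: int, answer_tokens: int) -> float:
    """Cost with reasoning mode on.

    Assumption: thinking tokens and answer tokens are both billed
    at the reasoning output rate.
    """
    billed_output = thinking_tokens + answer_tokens
    return (input_tokens * INPUT_PRICE_PER_M
            + billed_output * REASONING_OUTPUT_PRICE_PER_M) / 1_000_000


# Hypothetical call: 2,000 tokens in, 500-token answer, 4,000 tokens of deliberation.
plain = standard_cost(2_000, 500)
with_reasoning = reasoning_cost(2_000, 4_000, 500)
print(f"standard:  ${plain:.4f}")
print(f"reasoning: ${with_reasoning:.4f}")
print(f"multiplier: {with_reasoning / plain:.2f}x")
```

Under these assumptions the gap comes mostly from the volume of deliberation tokens, not from the 50% rate increase.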

Supported Modalities: Text, Image (Vision), Document (PDF)

Modality, from Lesson 6, is what type of input the model can process. This model accepts text, images, and PDFs. It does not accept audio or video. For a federal agency asking whether the model can process scanned forms, the answer here is yes. For a use case involving call center audio transcription, this model is not the right tool. The modality list is a hard boundary. There's no configuration option that changes it.

Knowledge Cutoff: Q1 2025

The model's training data ends here — Lesson 4's territory. It has no awareness of events after this date unless you provide that information in context. This is the structural reason grounding matters. An agency asking the model about current policy guidance, recent legislation, or active threat intelligence will get confident-sounding answers that may be months out of date. The cutoff is a design constraint, baked into how training works. Grounding — retrieval, RAG, tool access — is the mitigation.

Architecture: ~70B Parameters · Transformer-Based

Parameter count is a rough proxy for model capability and inference cost. Larger is not always better for a given task — a 7B model fine-tuned on legal documents may outperform a 70B general model on legal document tasks. Transformer-based is the dominant architecture for current LLMs; it is the substrate on which next-token prediction (Lesson 1) runs. This field surfaces in technical evaluations more than in sales conversations, but you should be able to say what it means when it comes up.

Five Mental Models

Next-Token Prediction
The mechanism by which an LLM generates text: at each step, the model assigns a probability to every possible next token given everything before it, then emits one (the most probable, or a sampled alternative when temperature is above zero). The model has no understanding of meaning. It has learned statistical patterns across massive text corpora.
When it comes up: A buyer asks whether the model "understands" their policy documents. The honest answer is: it predicts text that looks like understanding.
Don't confuse with: Retrieval. The model isn't looking up answers; it's generating them. Retrieval is a separate layer you add on top.
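A toy illustration of prediction from statistical patterns. This is a bigram counter, a deliberately simple stand-in and not how a transformer works internally: it looks only at the previous word, where a real LLM conditions on the whole context. The corpus and function names are made up for the example.

```python
from collections import Counter, defaultdict

# A tiny made-up "training corpus", split into word-level tokens.
corpus = (
    "the model predicts the next token and the model predicts the answer"
).split()

# Learn the pattern: for each token, count which tokens followed it.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1


def predict_next(token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    counts = following.get(token)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def generate(start, steps):
    """Repeatedly append the most probable next token."""
    output = [start]
    for _ in range(steps):
        nxt = predict_next(output[-1])
        if nxt is None:
            break
        output.append(nxt)
    return output


print(generate("the", 4))  # ['the', 'model', 'predicts', 'the', 'model']
```

The output is fluent-looking and entirely pattern-driven: nothing in the code looks anything up or checks whether the text is true.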

Tokens-as-Currency
The unit of everything: pricing, context capacity, throughput limits, and benchmark comparisons. One token is approximately 0.75 words in English.
When it comes up: Any cost conversation is a token conversation. Every time a buyer asks "what will this cost," you're doing token math.
Don't confuse with: Characters or words. Vendors price in tokens; your buyer thinks in words. The conversion matters when you're estimating at scale.

Embeddings-as-Meaning
A numerical representation of text as coordinates in high-dimensional space. Semantically similar texts land near each other. The substrate for vector search and RAG.
When it comes up: A buyer asks how the model can search a 50,000-document repository without reading every document at query time. Embeddings are the answer.
Don't confuse with: The LLM itself. Embeddings are generated by a separate embedding model and stored in a vector database. The retrieval layer searches them and passes the matching text to the LLM; the LLM doesn't contain them.
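A minimal sketch of "near each other" using cosine similarity, one common way to compare embeddings. The three-dimensional vectors here are hand-made for illustration; real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stored embeddings: document title -> vector.
documents = {
    "badge access policy": [0.9, 0.1, 0.0],
    "visitor entry procedures": [0.8, 0.3, 0.1],
    "cafeteria lunch menu": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the query "who can enter the building?"
query_vector = [0.85, 0.2, 0.05]

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    documents,
    key=lambda title: cosine_similarity(query_vector, documents[title]),
    reverse=True,
)

for title in ranked:
    score = cosine_similarity(query_vector, documents[title])
    print(f"{score:.3f}  {title}")
```

The query shares no keywords with the two access documents, yet they rank far above the lunch menu. A vector database does the same comparison at scale, usually with an approximate index so it doesn't have to score every document.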

Grounding-Not-Trusting
Hallucination is structural: the model generates plausible text, not verified facts. Grounding means providing verified information in context so the model reasons from it rather than from training data alone.
When it comes up: A buyer in a regulated environment asks about accuracy guarantees. There are none without grounding. The spec sheet doesn't solve this.
Don't confuse with: Fine-tuning. Fine-tuning adjusts the model's weights; grounding adjusts what's in context. They address different problems and operate at different layers.
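In practice, "providing verified information in context" often means assembling the prompt from retrieved passages before the model is called. The sketch below shows only that assembly step; the function name, instruction wording, and sample passages are hypothetical, and no model or vendor API is invoked.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Combine verified passages and a question into a single prompt string.

    The instruction asks the model to answer from the passages only.
    That is guidance, not enforcement: the output still needs verification.
    """
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered passages below. "
        "Cite passage numbers. If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}"
    )


# Hypothetical passages, as if returned by a retrieval step.
retrieved = [
    "Remote administrative access requires a hardware security key.",
    "Contractor accounts are reviewed every 90 days.",
]

prompt = build_grounded_prompt(
    "How often are contractor accounts reviewed?", retrieved
)
print(prompt)
```

Nothing here touches the model's weights, which is the distinction from fine-tuning: the only thing that changed is what sits in the context window.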

Reasoning-as-Extra-Tokens
Reasoning models generate intermediate deliberation steps before producing a final answer. This costs more and takes longer. It produces better results on complex, multi-step tasks.
When it comes up: A buyer asks why one model costs three times as much as another for what looks like the same task. Check whether reasoning mode is enabled.
Don't confuse with: Chain-of-thought prompting. Reasoning mode is a capability built into the model and switched on per request; chain-of-thought is a prompting technique you can apply to any model. Related concepts, different levers.

Vocabulary Mapping Tables

Table 1: Core Units

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Token	Smallest unit of text the model processes (~0.75 words)	Byte / character	IDAM doesn't price by character; AI pricing is entirely token-denominated, including context overhead
Context window	Total tokens the model can process in one call (input + output combined)	Session payload size	Context windows have no native persistence; IDAM sessions can be stateful across requests
Embedding	Numerical vector representing semantic meaning of text	Hash / fingerprint	A hash has no semantic proximity; embeddings cluster by meaning, enabling similarity search

Table 2: Access and Control

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Agent	A model that can take actions, call tools, and operate across multiple steps autonomously	Service account / non-human identity	Agents have dynamic, emergent behavior; service accounts execute defined, bounded operations
Scope (system prompt)	Instructions and constraints given to the model before the conversation begins	OAuth scope	Prompt scope is natural language and is not cryptographically enforced; an OAuth scope is enforced through validated access tokens
Temperature	Parameter controlling output randomness (0 = deterministic, higher = more varied)	No equivalent	IDAM has no analog; temperature is a tuning lever with no access control parallel
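The Temperature row can be made concrete with a small sketch of how temperature reshapes next-token probabilities. The candidate tokens and scores are invented, and treating temperature 0 as "always pick the top token" is a common convention assumed here; hosted models can still show small run-to-run variation at 0.

```python
import math


def next_token_probabilities(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature.

    Assumption: temperature 0 is handled as greedy selection (all
    probability on the highest-scoring token).
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical candidate tokens and their raw model scores.
candidate_tokens = ["approved", "pending", "denied"]
logits = [2.0, 1.0, 0.5]

for temperature in (0, 0.2, 1.0, 2.0):
    probs = next_token_probabilities(logits, temperature)
    summary = ", ".join(
        f"{tok}={p:.3f}" for tok, p in zip(candidate_tokens, probs)
    )
    print(f"T={temperature}: {summary}")
```

Lower temperature concentrates probability on the top candidate; higher temperature spreads it across alternatives. Nothing in this lever restricts what the model is permitted to do, which is why it has no IDAM counterpart.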

If You Remember Nothing Else

Spec sheet: Context window is working memory, not storage. Pricing asymmetry between input and output tokens tells you where your workload costs live.

Mental models: The model predicts; it doesn't know. Tokens are the unit of everything. Embeddings are how meaning becomes searchable. Grounding is the only path to accuracy in regulated environments. Reasoning mode costs more because it does more work.

Vocabulary tables: "Agent" in AI is not a service account. "Scope" in a system prompt is not an OAuth scope. These are the two collision zones most likely to derail a technical conversation with an agency evaluator.

The section: A model spec sheet tells you what the model can do and what it costs. It does not tell you how to get your data and tools in front of it.

For More Information, See…

Concept	Source Lesson
Next-token prediction, how LLMs generate text	Lesson 1: How Language Models Actually Work
Tokens, pricing, context length, cost estimation	Lesson 2: Tokens, Pricing, and What You're Actually Buying
Embeddings, vector search, semantic similarity	Lesson 3: Embeddings and Semantic Search
Hallucination, grounding, knowledge cutoff	Lesson 4: Hallucination, Grounding, and Verification
Reasoning models, extended thinking, cost tradeoffs	Lesson 5: Reasoning Models and When to Use Them
Modalities, multimodal inputs, capability boundaries	Lesson 6: Modalities — What the Model Can See
Spec sheet structure, architecture notes, parameter count	Lesson 7: Reading a Spec Sheet

What the Spec Sheet Doesn't Answer

Every field above tells you what the model can do and what it costs to run. None of them tell you how to get your data and tools in front of it.

How does the model access a document repository? How does it call an external API mid-conversation? How does it act on behalf of a user without holding that user's credentials? How does an identity system know what the model is allowed to do?

Those questions are not answered by the spec sheet. They are answered by the protocols and patterns the next section covers. That's where identity infrastructure becomes relevant again, and it's where the vocabulary you already hold starts doing real work.