Reading a Model Spec Sheet

Walks OpenAI's current model docs using every concept from prior lessons as annotation vocabulary, resolving AI-IDAM vocabulary collisions along the way.

By Leigh Garrity — May 8, 2026

A model spec sheet is a constraint map. Every field on it corresponds to a concept from the preceding lessons. This recap walks OpenAI's current model documentation using those concepts as annotation vocabulary.

All numbers are as of May 2026 and will change without notice.

Next-Token Prediction

Recall the core mechanism: the model generates one token at a time, each conditioned on everything preceding it. The context window exists because of this. It's the hard limit on "everything preceding it." Every field in this section follows from that architecture.

Context window — Maximum tokens (input + output + reasoning) a model holds in one call. GPT-5.5: 1,050,000. GPT-5.4: 272,000 standard; 1M experimental in Codex only.

When it comes up: "Can it process our entire policy corpus in one call?" The context window is the hard ceiling. Everything the model considers, including reasoning tokens you never see, has to fit.
Don't confuse with: Security context in IDAM. A context window is a token budget for one stateless inference call, with no persistent trust boundary.

Max output tokens — Ceiling on generated tokens per response, including invisible reasoning tokens. Both models: 128,000. If reasoning burns 40K, visible output caps at 88K.

When it comes up: Long-form generation. The output cap is shared with reasoning, not separate from it.
Don't confuse with: Rate limits. Output tokens are a per-call ceiling. Rate limits are a per-minute throughput constraint.
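The budget arithmetic above can be sketched in a few lines. This is a minimal illustration that uses only the limits quoted in this section; the function names are hypothetical and nothing here calls a real API.

```python
# Limits copied from the figures quoted above (as of May 2026).
CONTEXT_WINDOW = {"GPT-5.5": 1_050_000, "GPT-5.4": 272_000}
MAX_OUTPUT_TOKENS = 128_000  # shared by visible output and reasoning tokens


def visible_output_cap(reasoning_tokens: int) -> int:
    """Tokens left for visible text once reasoning has taken its share."""
    return max(MAX_OUTPUT_TOKENS - reasoning_tokens, 0)


def fits_in_context(model: str, input_tokens: int, generated_tokens: int) -> bool:
    """True if input plus all generated tokens (visible + reasoning) fit."""
    return input_tokens + generated_tokens <= CONTEXT_WINDOW[model]


print(visible_output_cap(40_000))                   # 88000
print(fits_in_context("GPT-5.4", 200_000, 80_000))  # False (280,000 > 272,000)
print(fits_in_context("GPT-5.5", 200_000, 80_000))  # True
```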

Architecture (dense vs. MoE) — Dense models activate all parameters per token. Mixture-of-Experts (MoE) routes each token to a subset of specialized sub-networks, reducing compute while maintaining capacity.

When it comes up: Buyer asks why two similarly capable models have different latency and cost profiles. An MoE model can have more total parameters but activate fewer per token. Inference speed and pricing don't scale linearly with capability.
OpenAI does not disclose the architecture of GPT-5.5 or GPT-5.4. It confirms MoE only for the gpt-oss open-weight models. On proprietary spec sheets, pricing and latency are your proxy for what's happening underneath.
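To see why total parameter count and per-token compute come apart, here is a rough sketch of the arithmetic. Every number in it is hypothetical and chosen only for illustration; it does not describe any OpenAI model, and real MoE layouts are more involved than one shared block plus equal-sized experts.

```python
def total_params(shared: int, per_expert: int, num_experts: int) -> int:
    """All parameters stored in the model."""
    return shared + per_expert * num_experts


def active_params(shared: int, per_expert: int, experts_per_token: int) -> int:
    """Parameters actually used for one token under top-k routing."""
    return shared + per_expert * experts_per_token


# Hypothetical MoE layout: 2B shared, 64 experts of 1B each, 4 routed per token.
SHARED, PER_EXPERT, NUM_EXPERTS, TOP_K = 2_000_000_000, 1_000_000_000, 64, 4

total = total_params(SHARED, PER_EXPERT, NUM_EXPERTS)
active = active_params(SHARED, PER_EXPERT, TOP_K)

print(f"total:  {total / 1e9:.0f}B")      # total:  66B
print(f"active: {active / 1e9:.0f}B")     # active: 6B
print(f"active share: {active / total:.1%}")  # active share: 9.1%
```

A dense model of the same total size would use all 66B parameters for every token, which is the gap that shows up as different latency and price.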

If you remember nothing else

The context window is working memory, not storage. One call, then it's gone.

Tokens-as-Currency

Token pricing — Per-million-token rates, split by input, cached input, and output.

Rate (USD)	GPT-5.5	GPT-5.4	GPT-5.4 mini	GPT-5.4 nano
Input / 1M	$5.00	$2.50	$0.75	$0.20
Cached input / 1M	$0.50	$0.25	$0.075	—
Output / 1M	$30.00	$15.00	$4.50	$1.25

Source: OpenAI pricing.

When it comes up: Every sizing conversation. Output costs 6x input for GPT-5.5. Reasoning-heavy workloads skew hard toward output cost because reasoning tokens bill as output.
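A sizing estimate from the table can be sketched as follows. The rates are the ones listed above; the function name is hypothetical, and the sketch assumes cached tokens are a subset of input tokens billed at the cached rate, which you should confirm against the pricing page.

```python
# USD per 1M tokens, copied from the table above (as of May 2026).
PRICE_PER_MILLION = {
    "GPT-5.5": {"input": 5.00, "cached_input": 0.50, "output": 30.00},
    "GPT-5.4": {"input": 2.50, "cached_input": 0.25, "output": 15.00},
}


def call_cost(model: str, input_tokens: int, visible_output_tokens: int,
              reasoning_tokens: int = 0, cached_input_tokens: int = 0) -> float:
    """Estimated USD cost of one call; reasoning tokens bill at the output rate."""
    price = PRICE_PER_MILLION[model]
    uncached = input_tokens - cached_input_tokens  # assumption: cached is a subset
    billed_output = visible_output_tokens + reasoning_tokens
    total = (uncached * price["input"]
             + cached_input_tokens * price["cached_input"]
             + billed_output * price["output"])
    return total / 1_000_000


# Same prompt and same visible answer, with and without 30K reasoning tokens.
print(f"{call_cost('GPT-5.5', 10_000, 500):.3f}")                           # 0.065
print(f"{call_cost('GPT-5.5', 10_000, 500, reasoning_tokens=30_000):.3f}")  # 0.965
```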

Long-context pricing cliff — Requests exceeding 272K input tokens reprice the entire request: 2x input, 1.5x output. Not incremental. The full call.

When it comes up: A 300K-token prompt doesn't cost slightly more than 270K. It costs roughly double. This is the kind of thing that shows up in a POC invoice and kills momentum.
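The cliff is easiest to see with numbers. The sketch below applies the multipliers described above to GPT-5.5 list rates; applying them to that model, and the 2,000-token output figure, are assumptions made for illustration.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens; above this the whole call reprices


def gpt55_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one call at GPT-5.5 list rates, with the cliff."""
    input_rate, output_rate = 5.00, 30.00  # USD per 1M tokens
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        input_rate *= 2.0    # applies to every input token, not just the excess
        output_rate *= 1.5
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


below = gpt55_call_cost(270_000, 2_000)
above = gpt55_call_cost(300_000, 2_000)
print(f"{below:.2f}")            # 1.41
print(f"{above:.2f}")            # 3.09
print(f"{above / below:.1f}x")   # 2.2x
```

An 11% larger prompt produces a bill more than twice as large, because the multiplier lands on all 300K input tokens.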

Reasoning tokens — Hidden tokens generated during deliberation. Billed as output tokens. Volume scales with complexity: hundreds for simple tasks, tens of thousands for hard ones. Reported in the usage object but never visible in the response.

When it comes up: Unexpectedly high bills. Reasoning tokens are the invisible line item.
Don't confuse with: Verbose output. The model can produce a two-sentence answer and still burn 30K reasoning tokens getting there.
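Since reasoning tokens appear only in the usage object, a quick way to surface them is to split the output count. The field names below are modeled on the usage object of OpenAI's Responses API and should be treated as an assumption; check them against an actual response before relying on them. The sample values are made up.

```python
# Hypothetical usage payload for a short answer that required heavy reasoning.
sample_usage = {
    "input_tokens": 1_200,
    "output_tokens": 30_060,
    "output_tokens_details": {"reasoning_tokens": 30_000},
}


def split_output_tokens(usage: dict) -> tuple[int, int]:
    """Return (visible_tokens, reasoning_tokens) from a usage dict."""
    reasoning = usage.get("output_tokens_details", {}).get("reasoning_tokens", 0)
    visible = usage["output_tokens"] - reasoning
    return visible, reasoning


visible, reasoning = split_output_tokens(sample_usage)
print(visible, reasoning)  # 60 30000
print(f"{reasoning / sample_usage['output_tokens']:.1%} of billed output was reasoning")
```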

If you remember nothing else

Output tokens are where the money is. Reasoning makes that worse, and GPT-5.5 reasons by default.

Token, Context, and Session Collisions

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Token	Text fragment (~4 chars); billing and processing unit	OAuth access token, SAML assertion	AI tokens are consumed and destroyed. IDAM tokens are issued, validated, and revoked. Completely different lifecycle.
Context window	Total token capacity for one inference call	Security context (authn/authz state)	A context window is stateless and ephemeral. A security context persists across requests within a session.
Session	Multi-turn conversation; prior I/O carried forward as re-sent tokens	Authenticated session	An AI "session" has no authentication state, no binding, no timeout, no revocation. Prior turns are just tokens stuffed into the next prompt. The spec calls this "conversation history," which is a generous description of what's actually happening.
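The "session" row can be made concrete with a small sketch of how input tokens grow when every prior turn is re-sent. It uses the rough 4-characters-per-token figure from the table as an estimate; the function names and message sizes are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate using ~4 characters per token."""
    return max(1, len(text) // 4)


def input_tokens_per_turn(turns: list[tuple[str, str]]) -> list[int]:
    """For each (user_message, model_reply) turn, tokens sent as input that turn."""
    history = 0  # tokens from all earlier turns, re-sent every time
    sent = []
    for user_message, reply in turns:
        prompt = history + estimate_tokens(user_message)
        sent.append(prompt)
        history = prompt + estimate_tokens(reply)  # the reply joins the history
    return sent


# Three identical turns: a ~100-token question and a ~200-token answer.
turns = [("q" * 400, "a" * 800)] * 3
print(input_tokens_per_turn(turns))  # [100, 400, 700]
```

Nothing is remembered between calls; the growing numbers are the entire mechanism behind a multi-turn conversation.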

Embeddings-as-Meaning

As you saw in the embeddings lesson, inputs become numerical vectors encoding meaning in the model's internal space. Each modality requires its own embedding pathway into that space.

Supported modalities — What the model can receive and produce. "Supported modalities" on a spec sheet is a statement about which embedding pathways the model was trained with. No pathway, no processing, regardless of what the marketing page implies.

GPT-5.5 and GPT-5.4: text and image input, text output. Tool capabilities (web search, file search, computer use) listed separately with per-call fees. Audio and video are absent from these spec sheets.

When it comes up: Buyer assumes "multimodal" means audio and video. For these models, it means text and images. Audio runs through separate OpenAI APIs.
Don't confuse with: OAuth scopes. Both constrain what's possible. Modalities describe data types the model can process; scopes describe permissions a token grants. One is architectural. The other is dynamically authorized.

If you remember nothing else

"Multimodal" is a spec-sheet claim that requires reading the fine print. Check which modalities, in which direction.

Grounding-Not-Trusting

Knowledge cutoff — The date after which the model has no training data. GPT-5.5: December 1, 2025. GPT-5.4: approximately August 31, 2025 (confirmed for mini and nano; not directly stated for the main model).

When it comes up: Any question about recent policy, regulation, or events. The model will answer confidently about post-cutoff topics. Those answers are fabricated from pattern-matching across training data, with no grounding in actual events. There is no error message. There is no "I don't know." There is fluent, plausible text with no basis.
Don't confuse with: Certificate expiration or CRL freshness. In IDAM, staleness has a remediation path: you renew the cert, you re-fetch the revocation list. A knowledge cutoff has no remediation mechanism inside the model itself. You can't update the training data on a Tuesday. You ground the model to external data, or you accept the risk.
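One way to operationalize this is a guard that refuses to build a prompt about a post-cutoff topic unless external passages are supplied. This is a minimal sketch, not a grounding implementation: the function names are hypothetical, the GPT-5.4 date is the approximate one quoted above, and deciding the "topic date" of a real question is a harder problem than this suggests.

```python
from datetime import date

# Cutoffs as quoted above (GPT-5.4 is approximate).
KNOWLEDGE_CUTOFF = {"GPT-5.5": date(2025, 12, 1), "GPT-5.4": date(2025, 8, 31)}


def requires_grounding(model: str, topic_date: date) -> bool:
    """True if the topic post-dates the model's training data."""
    return topic_date > KNOWLEDGE_CUTOFF[model]


def build_prompt(model: str, question: str, topic_date: date,
                 passages: list[str]) -> str:
    """Prepend retrieved passages; fail loudly if they are needed but missing."""
    if requires_grounding(model, topic_date) and not passages:
        raise ValueError("post-cutoff topic with no external data supplied")
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}" if passages else question


print(requires_grounding("GPT-5.5", date(2026, 3, 1)))  # True
print(requires_grounding("GPT-5.5", date(2025, 6, 1)))  # False
```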

If you remember nothing else

The knowledge cutoff is the line where grounding stops being optional. Past it, you either connect the model to current data or you accept hallucination risk. There is no third option.

Reasoning-as-Extra-Tokens

Reasoning effort — The reasoning.effort parameter controls how many reasoning tokens the model spends per call.

Model	Available levels	Default
GPT-5.5	none, low, medium, high, xhigh	medium
GPT-5.4	none, low, medium, high, xhigh	none
GPT-5.4 pro	medium, high, xhigh	medium

When it comes up: Cost and latency tuning. GPT-5.5 reasons at medium by default. Every call carries reasoning overhead unless you explicitly set it to none. That cost implication won't appear in headline pricing.
Don't confuse with: Model quality. Reasoning effort is a per-call dial, not a tier. Same model, different effort, very different cost and latency.

Scope, Reasoning, and Grounding Collisions

AI Term	What It Means in AI	IDAM Equivalent	Key Divergence
Scope (capabilities)	Modalities and tools a model supports	OAuth scope (permissions per token)	Model scope is fixed by architecture. OAuth scope is dynamically granted per authorization. You can't negotiate with a model card.
Reasoning	Internal token generation for deliberation; billed but invisible	Policy evaluation / business logic	AI reasoning is probabilistic prediction with extra steps. Policy evaluation is deterministic rule execution.
Grounding	Connecting output to verified external data to reduce hallucination	Trust anchoring (cert chains, federation)	Both chain to a trusted source. Grounding is per-query and never cached. Trust anchoring is configured once and persists.

If you remember nothing else

Reasoning effort moves cost and latency before it moves quality. Know which dial the buyer thinks they're turning.

The Question Every Spec Sheet Leaves Open

Every field above describes what the model can hold, cost, process, reason through, and remember. None of them say how enterprise data, internal tools, or organizational context reach the model at inference time. That gap between "we have a model" and "we have a system" is where the next section starts.

Source Index

Recap Entry	Source Lesson
Context window, max output tokens	Lesson 1: Tokens and Context — token definition, context window mechanics
Token pricing, long-context cliff, reasoning tokens	Lesson 1: Tokens and Context — token as unit of cost; Lesson 4: Inference Economics — pricing tiers, reasoning token billing
Architecture (dense vs. MoE)	Lesson 7: Model Architecture — dense vs. sparse routing, parameter activation
Supported modalities	Lesson 7: Modalities — input/output types, tool capabilities
Embeddings	Lesson 3: Embeddings — vector representation, semantic space
Knowledge cutoff, hallucination, grounding	Lesson 5: Hallucination and Grounding — cutoff dates, fabrication mechanics, grounding strategies
Reasoning effort	Lesson 7: Reasoning Mode — effort parameter, token overhead, interleaved thinking
Vocabulary collision tables	Cross-cutting; all lessons

OpenAI documentation referenced throughout: Models, Pricing, Reasoning guide. All OpenAI specs verified against primary documentation as of May 8, 2026.

Things to follow up on...

GPT-5.5 token efficiency claims: OpenAI says GPT-5.5 uses fewer tokens per task than GPT-5.4, partially offsetting its 2x headline price, with independent analysis observing 19-34% fewer completion tokens and a net cost increase of 49-92% rather than the nominal 100%.
DeepSeek R1-0528 reasoning costs: The latest DeepSeek reasoning model nearly doubled its average token usage per problem from 12K to 23K tokens, a concrete example of how "reasoning-as-extra-tokens" scales in practice across providers.
MoE architecture in open-weight models: OpenAI's gpt-oss models explicitly confirm MoE routing, with gpt-oss-120b activating only 5.1B of its total parameters per token, illustrating why total parameter count and inference cost are decoupled.
Hallucination persists with scale: OpenAI researchers Kalai and Nachum argue that hallucinations arise from natural statistical pressures in binary classification, meaning larger models can hallucinate with more confidence rather than less.
