A model spec sheet is a constraint map. Every field on it corresponds to a concept from the preceding lessons. This recap walks through OpenAI's current model documentation, using those concepts as annotation vocabulary.
All numbers are as of May 2026 and will change without notice.
Next-Token Prediction
Recall the core mechanism: the model generates one token at a time, each conditioned on everything preceding it. The context window exists because of this. It's the hard limit on "everything preceding it." Every field in this section follows from that architecture.
Context window — Maximum tokens (input + output + reasoning) a model holds in one call. GPT-5.5: 1,050,000. GPT-5.4: 272,000 standard; 1M experimental in Codex only.
- When it comes up: "Can it process our entire policy corpus in one call?" The context window is the hard ceiling. Everything the model considers, including reasoning tokens you never see, has to fit.
- Don't confuse with: Security context in IDAM. A context window is a token budget for one stateless inference call, with no persistent trust boundary.
Max output tokens — Ceiling on generated tokens per response, including invisible reasoning tokens. Both models: 128,000. If reasoning burns 40K, visible output caps at 88K.
- When it comes up: Long-form generation. The output cap is shared with reasoning, not separate from it.
- Don't confuse with: Rate limits. Output tokens are a per-call ceiling. Rate limits are a per-minute throughput constraint.
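The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not an official calculator: the `ModelLimits` class and both helper functions are hypothetical names, and the figures are the ones quoted in this recap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    context_window: int     # input + output + reasoning tokens, per call
    max_output_tokens: int  # visible output + reasoning tokens, per call


# Figures quoted in this recap; they may change.
GPT_5_5 = ModelLimits(context_window=1_050_000, max_output_tokens=128_000)


def visible_output_cap(limits: ModelLimits, reasoning_tokens: int) -> int:
    """Tokens left for visible text once reasoning has used part of the output cap."""
    return max(limits.max_output_tokens - reasoning_tokens, 0)


def fits_in_context(limits: ModelLimits, input_tokens: int, reserved_output_tokens: int) -> bool:
    """True if the prompt plus the reserved output (reasoning included) fits in one call."""
    return input_tokens + reserved_output_tokens <= limits.context_window


print(visible_output_cap(GPT_5_5, 40_000))        # 88000
print(fits_in_context(GPT_5_5, 950_000, 128_000)) # False
```

Reserving the full output cap when checking fit is a conservative choice; a workload with short answers and low reasoning effort could reserve less.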
Architecture (dense vs. MoE) — Dense models activate all parameters per token. Mixture-of-Experts (MoE) routes each token to a subset of specialized sub-networks, reducing compute while maintaining capacity.
- When it comes up: Buyer asks why two similarly capable models have different latency and cost profiles. An MoE model can have more total parameters but activate fewer per token. Inference speed and pricing don't scale linearly with capability.
- OpenAI does not disclose architecture for GPT-5.5 or GPT-5.4. They confirm MoE for gpt-oss open-weight models only. On proprietary spec sheets, pricing and latency are your proxy for what's happening underneath.
The context window is working memory, not storage. One call, then it's gone.
Tokens-as-Currency
Token pricing — Per-million-token rates, split by input, cached input, and output.
| | GPT-5.5 | GPT-5.4 | GPT-5.4 mini | GPT-5.4 nano |
|---|---|---|---|---|
| Input / 1M | $5.00 | $2.50 | $0.75 | $0.20 |
| Cached input / 1M | $0.50 | $0.25 | $0.075 | — |
| Output / 1M | $30.00 | $15.00 | $4.50 | $1.25 |
- When it comes up: Every sizing conversation. Output costs 6x input for GPT-5.5. Reasoning-heavy workloads skew hard toward output cost because reasoning tokens bill as output.
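For sizing conversations, the table reduces to a small per-call estimator. The sketch below is illustrative only: the dictionary keys are hypothetical labels rather than confirmed API model IDs, the rates are copied from the table above, and it assumes cached tokens are a subset of the input token count.

```python
# Dollars per 1M tokens, copied from the table above (subject to change).
RATES_PER_MILLION = {
    "gpt-5.5": {"input": 5.00, "cached_input": 0.50, "output": 30.00},
    "gpt-5.4": {"input": 2.50, "cached_input": 0.25, "output": 15.00},
    "gpt-5.4-mini": {"input": 0.75, "cached_input": 0.075, "output": 4.50},
}


def estimate_cost(model: str, input_tokens: int, visible_output_tokens: int,
                  reasoning_tokens: int = 0, cached_input_tokens: int = 0) -> float:
    """Estimate one call's cost in dollars; reasoning tokens bill at the output rate."""
    rates = RATES_PER_MILLION[model]
    uncached_input = input_tokens - cached_input_tokens
    billed_output = visible_output_tokens + reasoning_tokens
    total = (uncached_input * rates["input"]
             + cached_input_tokens * rates["cached_input"]
             + billed_output * rates["output"])
    return total / 1_000_000


# Same prompt and same 500-token answer, with and without 30K reasoning tokens.
print(round(estimate_cost("gpt-5.5", 10_000, 500), 3))          # 0.065
print(round(estimate_cost("gpt-5.5", 10_000, 500, 30_000), 3))  # 0.965
```

The two calls differ only in reasoning volume, which is why output-side cost tends to dominate reasoning-heavy workloads.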
Long-context pricing cliff — Requests exceeding 272K input tokens reprice the entire session: 2x input, 1.5x output. Not incremental. The full call.
- When it comes up: A 300K-token prompt doesn't cost slightly more than 270K. It costs roughly double. This is the kind of thing that shows up in a POC invoice and kills momentum.
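The cliff is easier to see as arithmetic. The following sketch assumes the multipliers stated above apply to every token in the request once the threshold is crossed, and uses GPT-5.5 list rates as defaults; the function name and parameters are hypothetical.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, as quoted above


def cost_with_cliff(input_tokens: int, output_tokens: int,
                    input_rate: float = 5.00, output_rate: float = 30.00) -> float:
    """Dollar cost of one request; rates are dollars per 1M tokens.

    Above the threshold, the whole request reprices: 2x input, 1.5x output.
    """
    over = input_tokens > LONG_CONTEXT_THRESHOLD
    input_multiplier = 2.0 if over else 1.0
    output_multiplier = 1.5 if over else 1.0
    total = (input_tokens * input_rate * input_multiplier
             + output_tokens * output_rate * output_multiplier)
    return total / 1_000_000


below = cost_with_cliff(270_000, 2_000)
above = cost_with_cliff(300_000, 2_000)
print(round(below, 2), round(above, 2))  # 1.41 3.09
```

An 11% larger prompt produces a bill more than twice as large under these assumptions, so trimming or chunking inputs that sit just above the threshold is worth checking before a POC runs at volume.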
Reasoning tokens — Hidden tokens generated during deliberation. Billed as output tokens. Volume scales with complexity: hundreds for simple tasks, tens of thousands for hard ones. Reported in the usage object but never visible in the response.
- When it comes up: Unexpectedly high bills. Reasoning tokens are the invisible line item.
- Don't confuse with: Verbose output. The model can produce a two-sentence answer and still burn 30K reasoning tokens getting there.
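Because reasoning tokens only surface in the usage object, that is where to look when a bill surprises someone. The sketch below works on a plain dictionary whose field names are modeled on the usage payload of OpenAI's Responses API; treat the exact field names as an assumption to verify against the current API reference. The sample values are invented for illustration.

```python
def reasoning_breakdown(usage: dict) -> dict:
    """Split billed output tokens into reasoning and visible parts."""
    output_tokens = usage.get("output_tokens", 0)
    details = usage.get("output_tokens_details") or {}
    reasoning_tokens = details.get("reasoning_tokens", 0)
    share = reasoning_tokens / output_tokens if output_tokens else 0.0
    return {
        "reasoning_tokens": reasoning_tokens,
        "visible_tokens": output_tokens - reasoning_tokens,
        "reasoning_share": share,
    }


# Hypothetical usage payload: a two-sentence answer after heavy deliberation.
sample_usage = {
    "input_tokens": 1_200,
    "output_tokens": 30_060,
    "output_tokens_details": {"reasoning_tokens": 30_000},
}

breakdown = reasoning_breakdown(sample_usage)
print(breakdown["visible_tokens"])  # 60
```

Logging this breakdown per call makes the invisible line item visible before the invoice does.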
Output tokens are where the money is. Reasoning makes that worse, and GPT-5.5 reasons by default.
Token, Context, and Session Collisions
| AI Term | What It Means in AI | IDAM Equivalent | Key Divergence |
|---|---|---|---|
| Token | Text fragment (~4 chars); billing and processing unit | OAuth access token, SAML assertion | AI tokens are consumed and destroyed. IDAM tokens are issued, validated, and revoked. Completely different lifecycle. |
| Context window | Total token capacity for one inference call | Security context (authn/authz state) | A context window is stateless and ephemeral. A security context persists across requests within a session. |
| Session | Multi-turn conversation; prior I/O carried forward as re-sent tokens | Authenticated session | An AI "session" has no authentication state, no binding, no timeout, no revocation. Prior turns are just tokens stuffed into the next prompt. The spec calls this "conversation history," which is a generous description of what's actually happening. |
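To make the "session" row concrete, the sketch below shows how input tokens grow when every turn re-sends all prior turns. It assumes the rough 4-characters-per-token heuristic from the table, not a real tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token heuristic."""
    return max(1, len(text) // 4)


def input_tokens_per_turn(turns: list[tuple[str, str]]) -> list[int]:
    """Input tokens sent on each turn when the full history is re-sent.

    Each element of `turns` is a (user_text, assistant_text) pair.
    """
    history_tokens = 0
    sent = []
    for user_text, assistant_text in turns:
        prompt_tokens = history_tokens + estimate_tokens(user_text)
        sent.append(prompt_tokens)
        # The assistant's reply joins the history that the next turn re-sends.
        history_tokens = prompt_tokens + estimate_tokens(assistant_text)
    return sent


# Three identical turns: 400-character question, 800-character answer.
conversation = [("q" * 400, "a" * 800)] * 3
print(input_tokens_per_turn(conversation))  # [100, 400, 700]
```

The per-turn input grows linearly, so the cumulative input billed across a conversation grows roughly quadratically with the number of turns.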
Embeddings-as-Meaning
As you saw in the embeddings lesson, inputs become numerical vectors encoding meaning in the model's internal space. Each modality requires its own embedding pathway into that space.
Supported modalities — What the model can receive and produce. "Supported modalities" on a spec sheet is a statement about which embedding pathways the model was trained with. No pathway, no processing, regardless of what the marketing page implies.
GPT-5.5 and GPT-5.4: text and image input, text output. Tool capabilities (web search, file search, computer use) listed separately with per-call fees. Audio and video are absent from these spec sheets.
- When it comes up: Buyer assumes "multimodal" means audio and video. For these models, it means text and images. Audio runs through separate OpenAI APIs.
- Don't confuse with: OAuth scopes. Both constrain what's possible. Modalities describe data types the model can process; scopes describe permissions a token grants. One is architectural. The other is dynamically authorized.
"Multimodal" is a spec-sheet claim that requires reading the fine print. Check which modalities, in which direction.
Grounding-Not-Trusting
Knowledge cutoff — The date after which the model has no training data. GPT-5.5: December 1, 2025. GPT-5.4: approximately August 31, 2025 (confirmed for mini and nano; not directly stated for the main model).
- When it comes up: Any question about recent policy, regulation, or events. The model will answer confidently about post-cutoff topics. Those answers are fabricated from pattern-matching across training data, with no grounding in actual events. There is no error message. There is no "I don't know." There is fluent, plausible text with no basis.
- Don't confuse with: Certificate expiration or CRL freshness. In IDAM, staleness has a remediation path: you renew the cert, you re-fetch the revocation list. A knowledge cutoff has no remediation mechanism inside the model itself. You can't update the training data on a Tuesday. You ground the model to external data, or you accept the risk.
The knowledge cutoff is the line where grounding stops being optional. Past it, you either connect the model to current data or you accept hallucination risk. There is no third option.
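One way to operationalize that line is a simple gate in front of the model. This is a minimal sketch under stated assumptions: the function name and model labels are hypothetical, the dates are the ones quoted above, and the GPT-5.4 date is approximate. It does not imply that pre-cutoff questions are safe to leave ungrounded; it only flags the cases where grounding is mandatory.

```python
from datetime import date

# Cutoff dates quoted in this recap; the GPT-5.4 date is approximate.
KNOWLEDGE_CUTOFF = {
    "gpt-5.5": date(2025, 12, 1),
    "gpt-5.4": date(2025, 8, 31),
}


def requires_grounding(model: str, topic_date: date) -> bool:
    """True when the topic postdates the model's training data."""
    return topic_date > KNOWLEDGE_CUTOFF[model]


# A regulation published in February 2026 is past both cutoffs.
print(requires_grounding("gpt-5.5", date(2026, 2, 15)))  # True
# An October 2025 event is inside GPT-5.5's data but past GPT-5.4's.
print(requires_grounding("gpt-5.4", date(2025, 10, 1)))  # True
print(requires_grounding("gpt-5.5", date(2025, 10, 1)))  # False
```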
Reasoning-as-Extra-Tokens
Reasoning effort — The reasoning.effort parameter controls how many reasoning tokens the model spends per call.
| Model | Available levels | Default |
|---|---|---|
| GPT-5.5 | none, low, medium, high, xhigh | medium |
| GPT-5.4 | none, low, medium, high, xhigh | none |
| GPT-5.4 pro | medium, high, xhigh | medium |
- When it comes up: Cost and latency tuning. GPT-5.5 reasons at medium by default. Every call carries reasoning overhead unless you explicitly set it to none. That cost implication won't appear in headline pricing.
- Don't confuse with: Model quality. Reasoning effort is a per-call dial, not a tier. Same model, different effort, very different cost and latency.
Scope, Reasoning, and Grounding Collisions
| AI Term | What It Means in AI | IDAM Equivalent | Key Divergence |
|---|---|---|---|
| Scope (capabilities) | Modalities and tools a model supports | OAuth scope (permissions per token) | Model scope is fixed by architecture. OAuth scope is dynamically granted per authorization. You can't negotiate with a model card. |
| Reasoning | Internal token generation for deliberation; billed but invisible | Policy evaluation / business logic | AI reasoning is probabilistic prediction with extra steps. Policy evaluation is deterministic rule execution. |
| Grounding | Connecting output to verified external data to reduce hallucination | Trust anchoring (cert chains, federation) | Both chain to a trusted source. Grounding is per-query and never cached. Trust anchoring is configured once and persists. |
Reasoning effort moves cost and latency before it moves quality. Know which dial the buyer thinks they're turning.
The Question Every Spec Sheet Leaves Open
Every field above describes what the model can hold, cost, process, reason through, and remember. None of them say how enterprise data, internal tools, or organizational context reach the model at inference time. That gap between "we have a model" and "we have a system" is where the next section starts.
Source Index
| Recap Entry | Source Lesson |
|---|---|
| Context window, max output tokens | Lesson 1: Tokens and Context — token definition, context window mechanics |
| Token pricing, long-context cliff, reasoning tokens | Lesson 1: Tokens and Context — token as unit of cost; Lesson 4: Inference Economics — pricing tiers, reasoning token billing |
| Architecture (dense vs. MoE) | Lesson 7: Model Architecture — dense vs. sparse routing, parameter activation |
| Supported modalities | Lesson 7: Modalities — input/output types, tool capabilities |
| Embeddings | Lesson 3: Embeddings — vector representation, semantic space |
| Knowledge cutoff, hallucination, grounding | Lesson 5: Hallucination and Grounding — cutoff dates, fabrication mechanics, grounding strategies |
| Reasoning effort | Lesson 7: Reasoning Mode — effort parameter, token overhead, interleaved thinking |
| Vocabulary collision tables | Cross-cutting; all lessons |
OpenAI documentation referenced throughout: Models, Pricing, Reasoning guide. All OpenAI specs verified against primary documentation as of May 8, 2026.
Things to follow up on...
- GPT-5.5 token efficiency claims: OpenAI says GPT-5.5 uses fewer tokens per task than GPT-5.4, partially offsetting its 2x headline price, with independent analysis observing 19-34% fewer completion tokens and a net cost increase of 49-92% rather than the nominal 100%.
- DeepSeek R1-0528 reasoning costs: The latest DeepSeek reasoning model nearly doubled its average token usage per problem from 12K to 23K tokens, a concrete example of how "reasoning-as-extra-tokens" scales in practice across providers.
- MoE architecture in open-weight models: OpenAI's gpt-oss models explicitly confirm MoE routing, with gpt-oss-120b activating only 5.1B of its total parameters per token, illustrating why total parameter count and inference cost are decoupled.
- Hallucination persists with scale: OpenAI researchers Kalai and Nachum argue that hallucinations arise from natural statistical pressures in binary classification, meaning larger models can hallucinate with more confidence rather than less.

