What "Custom AI" Actually Means

How production AI systems assemble context windows at runtime, why that matters more than model training, and where your IDAM instincts apply.

By Leigh Garrity— May 8, 2026

How production AI systems assemble context windows at runtime, why that matters more than model training, and where your IDAM instincts apply.

The context window, defined

A language model has no memory. Zero. Every time it generates a response, it works from a single input: the context window. Anthropic's API documentation defines this as "all the text a language model can reference when generating a response, including the response itself."

Picture a fixed-size workspace. Everything the model can see, reason about, or react to must fit inside it. Current models offer windows ranging from roughly 128,000 to over a million tokens. A token is roughly three-quarters of a word, so a 200,000-token window holds about 150,000 words.

Token ≠ Token

This "token" has nothing to do with the tokens you work with daily. Not an OAuth access token, not a SAML assertion. In AI, a token is a chunk of text, typically a word or word-fragment. Same word, completely different meaning. When someone in an AI architecture conversation says "tokens," they mean units of text, not credentials.

The critical piece: the software surrounding the model populates this workspace at runtime, before each model call. It decides which instructions to include, which conversation turns to keep, which external data to inject, which tool results to attach. The model generates a response based on whatever it finds in the workspace when it opens its eyes.

Anthropic's Applied AI team defines context engineering as "the set of strategies for curating and maintaining the optimal set of tokens during LLM inference." The engineering problem is "optimizing the utility of those tokens against the inherent constraints of LLMs in order to consistently achieve a desired outcome." (This is a primary source from Anthropic's own engineering blog, authored by their Applied AI team based on production implementation experience.)

The whole discipline is about assembling the right information, in the right structure, inside a fixed-size window, so the model does what you need. The model's weights stay untouched.

Context window: a fixed-size workspace populated at runtime by the application layer. The application decides what goes in; the model reads what it's given.

What fills the window

A production context window typically contains four types of content, assembled fresh for each call.

System prompts set the model's role, constraints, and behavioral rules. These are developer-written instructions, not user input. They might say: "You are a procurement assistant for federal civilian agencies. Never recommend products. Always cite regulation numbers." The model treats these as its operating instructions for the session.

Few-shot examples show the model what good output looks like. Instead of describing the desired format abstractly, the application includes two or three example input-output pairs. Anthropic's guidance is specific: don't stuff every edge case into examples. Instead, "curate a set of diverse, canonical examples that effectively portray the expected behavior."

Retrieved knowledge is external data pulled in at runtime. Documents, database results, policy text, anything the model needs but doesn't know from training. (The retrieval pipeline that fetches this data is its own topic, covered in a later piece. For now, just know it's one source of context.)

Tool results come from actions the model has already taken in the current interaction. If the model called an API to check a user's account status, the API response gets inserted back into the context window so the model can reason about it.

All four types are assembled by the application layer. The model sees the finished product. It has no way to tell which parts were written by a developer six months ago, which were retrieved from a database three seconds ago, and which are leftover from the user's previous question. It processes the entire window as one input.

Okta Concept Mapping: Context Window ≈ Session State

The context window functions like a session's available state: it holds everything the system needs to act on the current request. This is where your session intuition helps. Here's where it misleads you: a session has durable server-side state, a binding to an identity, and an audit trail. A context window has none of these. Content can be omitted, summarized, or replaced by the application between calls with no guaranteed persistence and no record of what was present when a decision was made.

Context assembly: the application combines system prompts, few-shot examples, retrieved knowledge, and tool results into a single window before each model call. The model doesn't distinguish between these sources.

Why the field outgrew "prompt engineering"

For the first wave of LLM adoption, "prompt engineering" was an adequate label. You wrote a careful instruction, maybe added a few examples, sent it to the model. The prompt was the product.

Once applications got serious, the label stopped fitting. In production, the static prompt a developer writes is one component of a much larger assembly. The application also manages conversation history, injects retrieved documents, formats tool outputs, and makes real-time decisions about what to include and what to cut. The prompt might be 200 tokens. The full assembled context might be 50,000.

In June 2025, Andrej Karpathy argued publicly that the word "prompt" trivializes a complex engineering problem:

“

"People's use of 'prompt' tends to (incorrectly) trivialize a rather complex component. You prompt an LLM to tell you why the sky is blue. But apps build contexts (meticulously) for LLMs to solve their custom tasks."

(Karpathy co-founded OpenAI and led Tesla's AI team. When he names a shift in practice, the field listens.) Within a month, a preprint survey analyzing over 1,400 research papers formalized the term, defining context engineering as "a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs." (This is an arXiv preprint, not peer-reviewed, but it's the first systematic academic treatment of the concept and reflects the speed at which the term gained formal traction.)

Anthropic's own framing captures where the engineering effort actually lives: building with language models is "becoming less about finding the right words and phrases for your prompts, and more about answering the broader question of 'what configuration of context is most likely to generate our model's desired behavior?'"

The terminology caught up to where the practice had already moved.

Context engineering vs. prompt engineering: prompt engineering is writing good instructions. Context engineering is the broader discipline of assembling everything the model needs at runtime, of which the prompt is one piece.

The memory illusion

When a chatbot appears to "remember" your earlier question, the application is re-sending your earlier question as part of the new context window. Every prior turn of conversation is included in the input, consuming space. The model reads a transcript someone handed it and processes it fresh.

This works until the conversation approaches the window's size limit. At that point, the application makes choices. Anthropic's documentation describes three mechanisms, and they're worth understanding because they reveal what "memory" actually is in these systems.

Compaction is automatic summarization. When a conversation approaches a configured threshold (Anthropic's default is 150,000 tokens), the system summarizes older turns and drops the originals. The model then "continues working from this summary as if it were the original conversation history." The model has no idea the summary replaced a longer conversation. It treats the summary as ground truth. Read that again if you need to, because it has real implications for any system making decisions based on prior conversation.

Selective clearing lets the application surgically remove specific content while keeping the rest. Anthropic's docs frame this as "actively curating what Claude sees: context is a finite resource with diminishing returns, and irrelevant content degrades model focus."

Truncation is the bluntest instrument: when the window fills, the oldest content gets cut.

Anthropic's engineering team states it plainly about long-running agents: "each new session begins with no memory of what came before." Any continuity between sessions exists because the application reconstructed it. Loaded a summary. Re-injected key facts. Re-supplied instructions. The model itself retained nothing.

Okta Concept Mapping: System Prompt ≈ Policy Configuration

A system prompt functions like policy configuration: it sets intended behavior for the system. Your policy intuition is useful here. But a system prompt is not enforceable access policy. It's influential text, not a control boundary. Models can be manipulated by untrusted input elsewhere in the context window (this is prompt injection), so enforcement has to live in application controls, tool gateways, and authorization servers. In the application layer, not in the prompt.

Model "memory": an illusion created by re-supplying conversation history in the context window. When history exceeds the window, it's summarized, selectively cleared, or truncated. The model doesn't know what was removed.

More context degrades performance

Bigger context windows help, up to a point. Past that point, they create their own problems. Anthropic's engineering team describes context rot: "as the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases." This characteristic "emerges across all models." Empirical research from Chroma confirmed this across 18 models, finding that performance degrades "often in surprising and non-uniform ways" even under minimal test conditions. (Chroma is a vector database company with a commercial interest in retrieval infrastructure, but their methodology is documented and the findings are consistent with Anthropic's own characterization.)

Anthropic uses a specific framing: LLMs have an "attention budget" that gets depleted by every token in the window. Their core guidance is to find "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome."

This creates real failure modes. Practitioner Drew Breunig catalogued them in work subsequently published by O'Reilly: a hallucination enters the context and gets repeatedly referenced (he calls this context poisoning); the model over-focuses on its long history instead of its training (context distraction); irrelevant information degrades response quality; new information conflicts with existing instructions. These aren't theoretical. They're what production teams debug daily. Context engineering is curation. What you exclude matters as much as what you include.

Context rot: model accuracy degrades as context size increases, so context must be curated, not maximized. Every token in the window costs attention; irrelevant tokens degrade the value of relevant ones.

What to listen for on Tuesday

A buyer says: "We've built custom AI for our workflow." You now have the right follow-up: What are you customizing?

Almost always, the answer is the context window. They wrote system prompts encoding their business rules. They built retrieval pipelines injecting their internal documents. They assembled few-shot examples reflecting their preferred output format. They built the orchestration layer that decides what goes in and what gets cut when the window fills up.

That's real engineering. The model underneath is still the same foundation model everyone else uses. The custom part is the context assembly.

This matters for your conversation because it tells you where the governance questions live. If the "custom AI" is context engineering, then the security surface is the context assembly pipeline. Who decides what documents get retrieved? What system prompts get applied? What tools can the model call? If a civilian agency built a context pipeline over their policy corpus, what happens when the context window gets compacted and prior authorization guidance disappears from the model's view? Who has visibility into what was in the window when the model generated a recommendation?

Those are access, policy, and session management questions. You already know how to think about them. The AI terminology is the new layer.

When the buyer's answer is genuinely different ("we fine-tuned the model" or "we trained our own"), that's a different architecture with different implications, and it's worth bringing in an SE. But the vast majority of enterprise "custom AI" is context engineering. Now you know what that means, mechanically, and you know where the conversation goes next.

"Custom AI" in practice: almost always means a customized context assembly pipeline around an unmodified foundation model. The governance questions live in the assembly layer, not the model.

Things to follow up on...

Anthropic's context engineering guide: Anthropic's Applied AI team published a detailed framework covering system prompts, tool design, and few-shot curation that goes deeper than what we covered here — worth reading in full at anthropic.com/engineering.
Context rot across models: Chroma's empirical study tested 18 models including GPT-4.1, Claude 4, and Gemini 2.5, documenting how performance degrades non-uniformly as input length grows — the full methodology is published and useful for understanding why retrieval design matters.
Named context failure modes: Drew Breunig's taxonomy of poisoning, distraction, confusion, and clash was picked up by O'Reilly Radar and adopted by LangChain, making it the closest thing to a standard vocabulary for context debugging in production systems.
Fine-tuning vs. context assembly: When a buyer says they trained a custom model, the next piece in this series covers what fine-tuning actually changes and why practitioners recommend exhausting prompt and retrieval options first before modifying model weights.