The Context Window | The Plumbing — Issue 1

The Plumbing

VERSION 1 | Sunday, May 10

When buyers say 'the AI does X,' what actually happened is that the model wrote a request and something else ran it. This section maps that something else, because that's where your questions belong.

Section Opener

The Model Doesn't Do Anything

By Leigh Garrity, May 9, 2026

When your buyer says "the AI reads production email," here's what's mechanically happening: a text-completion engine outputs a JSON object requesting the email, and a separate system, using a credential the model never touches, actually retrieves it. The model proposes. Infrastructure disposes. Every agentic capability is an identity problem wearing an AI hat.

This section maps the components behind that execution layer — the harness, the tools, the credential surface — using Anthropic's four-part decomposition and the application ladder from single prompts through multi-agent systems. Your IDAM intuition is useful here, up to a point. The specific places where it breaks are the ones that matter.

PART ONE

What "Tool Calling" Actually Means

When your buyer says "agentic AI," they almost always mean a model that calls tools. Two components do the work, and every identity question you'll face in that conversation depends on knowing which one does what. The model emits a structured JSON request — a tool name, arguments, nothing else — then stops. It executes nothing. It holds no credential. The harness, built by the buyer's engineering team, carries the credential, makes the actual API call, and deals with whatever comes back. This piece walks the single-turn loop phase by phase: who acts, who waits, where the failure modes land, and where the credential lives. Every "who authorized this action" question points to the harness side.
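A minimal Python sketch of that single-turn loop, seen from the harness side. It assumes a hypothetical `get_email` tool and a simplified request shape of `{"name": ..., "arguments": ...}`; each provider defines its own exact schema. The model's entire contribution is the input string. Everything that could touch an API lives in harness code, which is also where each failure mode gets caught and turned into a result the model can read.

```python
import json

def get_email(message_id: str) -> dict:
    """Hypothetical tool. A real one would call a mail API using a credential
    held by the harness; the model never sees that credential."""
    return {"id": message_id, "subject": "Q3 renewal"}

# The harness decides which tools exist; the model can only name them.
TOOLS = {"get_email": get_email}

def run_tool_call(model_output: str) -> dict:
    """Execute one tool-call request emitted by the model.

    Returns a tool-result message for the model's next turn. Failures come
    back as structured errors instead of exceptions, so the model can react.
    """
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return {"is_error": True, "content": "malformed tool call"}
    if not isinstance(request, dict):
        return {"is_error": True, "content": "malformed tool call"}

    name = request.get("name")
    tool = TOOLS.get(name) if isinstance(name, str) else None
    if tool is None:
        return {"is_error": True, "content": f"unknown tool: {name}"}

    try:
        result = tool(**request.get("arguments", {}))
    except TypeError as exc:  # usually wrong, missing, or non-mapping arguments
        return {"is_error": True, "content": f"bad arguments: {exc}"}
    except Exception as exc:  # the tool itself failed (network, permissions, ...)
        return {"is_error": True, "content": f"tool failed: {exc}"}
    return {"is_error": False, "content": json.dumps(result)}

print(run_tool_call('{"name": "delete_all"}'))
# {'is_error': True, 'content': 'unknown tool: delete_all'}
```

Returning failures as structured results, rather than raising, is one design choice: the model sees the error and can try again, and the harness gets a single place to log every attempt.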

PART TWO

Function Calling vs. XML Tool Calls — Same Mechanic, Different Wire Format

When an AI model "calls a tool," it emits structured text requesting an action. Your orchestration code does the rest. The two dominant formats for that request — native JSON function calling and XML tool calls — differ in portability, debuggability, and validation. They don't differ in what the model can do. If you've ever explained the relationship between SAML and OIDC, you already hold the right mental model. Your instinct that format choice carries security implications comes from identity, where it's well-earned. Here, the security lives entirely in the harness.
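To make "same mechanic, different wire format" concrete, here is a sketch of two parsers that reduce both formats to the same `(name, arguments)` pair. The `<invoke>`/`<parameter>` tag names are an assumption, loosely modeled on XML tool-call styles some harnesses use, not any vendor's exact specification.

```python
import json
import xml.etree.ElementTree as ET

def parse_json_call(text: str) -> tuple[str, dict]:
    """JSON style: {"name": "...", "arguments": {...}}."""
    obj = json.loads(text)
    return obj["name"], obj["arguments"]

def parse_xml_call(text: str) -> tuple[str, dict]:
    """XML style (assumed tags): <invoke name="..."><parameter name="...">v</parameter></invoke>."""
    root = ET.fromstring(text)
    # XML has no native types: every parameter value arrives as a string.
    args = {p.get("name"): (p.text or "") for p in root.findall("parameter")}
    return root.get("name"), args

as_json = parse_json_call('{"name": "get_email", "arguments": {"message_id": "m-42"}}')
as_xml = parse_xml_call(
    '<invoke name="get_email"><parameter name="message_id">m-42</parameter></invoke>'
)
print(as_json == as_xml)  # True
```

One validation difference shows up immediately: JSON preserves an integer like `3`, while the XML parser hands back the string `"3"`. Type coercion and schema checks have to happen in the harness either way.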

PART THREE

MCP Default Loading vs. Skills Progressive Disclosure

Connect five MCP servers to an agent and you've burned 55,000 tokens in tool definitions before anyone types a word. That's two-thirds of some models' working memory, spent on a catalog of things the agent might need. Anthropic's Skills take the opposite approach: load a one-line description per skill, and pull full instructions only when the model decides they're relevant. This piece breaks down the token math, the accuracy data, and why production teams are using both as complementary layers.
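The token math is simple arithmetic. This sketch takes the 55,000-token figure for five servers from the text and assumes it splits evenly across servers; every Skills number (skill count, description length, body length) is a hypothetical placeholder, not a measurement.

```python
# From the text: five MCP servers, 55,000 tokens of definitions.
MCP_SERVERS = 5
TOKENS_PER_SERVER = 11_000      # assumption: 55,000 split evenly across servers

# Hypothetical Skills numbers, for illustration only.
SKILLS = 40
TOKENS_PER_DESCRIPTION = 100    # one-line description, always loaded
TOKENS_PER_SKILL_BODY = 3_000   # full instructions, loaded only on activation

def eager_mcp_tokens(servers: int) -> int:
    """Default loading: every tool definition enters context up front."""
    return servers * TOKENS_PER_SERVER

def progressive_tokens(skills: int, activated: int) -> int:
    """Progressive disclosure: all descriptions, plus bodies for activated skills only."""
    return skills * TOKENS_PER_DESCRIPTION + activated * TOKENS_PER_SKILL_BODY

print(eager_mcp_tokens(MCP_SERVERS))   # 55000
print(progressive_tokens(SKILLS, 2))   # 10000
```

Under these placeholder numbers, activating two skills costs 10,000 tokens against 55,000 for eager loading. Real ratios depend on how verbose the definitions are and how many skills a task actually triggers.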

PART FOUR

RAG vs. Agentic Search — Two Ways to Feed a Model Before It Thinks

Anthropic's Claude Code team built RAG, tested it against agentic search, and switched. That story is circulating in buyer conversations now, often without the context that makes it useful: the win was specific to codebases, the performance benchmark was self-described as "mostly vibes," and for large stable corpora the math reverses entirely. This piece profiles both retrieval strategies with parallel structure, maps them against three data environments your buyers actually operate in, and gives you ten rows of verbatim-usable field language. The practical 2026 answer depends on corpus shape: which approach fits which data, and where identity governance enters the architecture.
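The difference is mostly control flow: RAG retrieves once, before the model reasons, while agentic search lets the model issue searches in a loop and follow leads. A toy sketch under stated assumptions: the three-file corpus is hypothetical, the RAG scorer is naive term overlap standing in for a real pre-built index, and `planned_queries` stands in for choices a real model would make one step at a time.

```python
import re

# Hypothetical three-file corpus.
CORPUS = {
    "auth.py": "def refresh_token(session): ...",
    "billing.py": "def charge(card, amount): ...",
    "README.md": "Token refresh happens in auth.py via refresh_token.",
}

def rag_retrieve(query: str, k: int = 2) -> list[str]:
    """RAG style: score every document once against the query, return the top k.

    Naive term overlap stands in for a real embedding or keyword index.
    """
    terms = set(query.lower().split())

    def score(name: str) -> int:
        return len(terms & set(re.findall(r"\w+", CORPUS[name].lower())))

    return sorted(CORPUS, key=score, reverse=True)[:k]

def grep(pattern: str) -> list[str]:
    """A search tool the agent can call: regex match across the corpus."""
    return [name for name, text in CORPUS.items() if re.search(pattern, text)]

def agentic_search(planned_queries: list[str]) -> list[str]:
    """Agentic style: run searches in sequence, accumulating hits.

    `planned_queries` is a hypothetical stand-in for the model, which in a real
    loop would choose each query after reading the previous results.
    """
    found: list[str] = []
    for pattern in planned_queries:
        for hit in grep(pattern):
            if hit not in found:
                found.append(hit)
    return found

# The README names the function; a second, exact search follows that lead.
print(agentic_search(["[Tt]oken refresh", "def refresh_token"]))  # ['README.md', 'auth.py']
```

Note where each path touches the data. The RAG path scores a corpus prepared before anyone asked; in the agentic path, each `grep` call runs at query time through harness code.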

PART FIVE

Vector, Keyword, and Hybrid Retrieval — What Each Finds and What Each Leaks

Your buyer says "we're building RAG" and within ninety seconds the conversation turns to retrieval. Three mechanisms own that space: keyword search, vector search, and hybrid. Each finds different things well. Each drops different things silently. Vector search smears contract numbers into semantic neighborhoods. Keyword search misses documents that use different vocabulary for the same concept. Hybrid covers both failure modes, with Microsoft's production benchmarks showing roughly 10% higher relevance scores. And underneath all three sits a document-level authorization gap that OWASP already named.
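Hybrid systems need a way to merge two ranked lists. Reciprocal rank fusion (RRF) is one widely used method, and Azure AI Search documents it for hybrid queries. A minimal sketch with hypothetical document IDs and rankings; `k=60` is the constant commonly used as a default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists. Each document scores sum(1 / (k + rank)) over the
    lists it appears in (ranks start at 1); higher totals rank first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from each mechanism for the same query.
keyword_hits = ["contract-4471", "msa-template", "renewal-faq"]
vector_hits = ["renewal-faq", "pricing-memo", "msa-template"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['renewal-faq', 'msa-template', 'contract-4471', 'pricing-memo']
```

Documents that rank reasonably in both lists rise to the top, and a document found by only one mechanism still survives. Nothing in the fusion step knows who is asking, so a document-level permission check has to sit before or after it.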

PART SIX

MCP Connects the Tools. Governance Rides on Top.

Model Context Protocol is the wire connecting tools to your buyer's AI stack. Over 16,000 MCP servers indexed, every major AI client supporting it, the Linux Foundation hosting it as foundational infrastructure. Adoption is real. What MCP actually governs is narrower than most people assume. The original spec explicitly excluded authentication and authorization. Four revisions in thirteen months have tightened the auth story considerably, but 53% of deployed servers still rely on static API keys. Your OAuth intuition maps cleanly onto parts of this architecture. It breaks in specific, consequential places. Knowing exactly where that break happens is worth five minutes.
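The gap between a static API key and an OAuth-style token is easy to show. A minimal sketch with loud assumptions: tokens are plain dataclasses rather than signed JWTs, and the server URI, key, and `tools:<name>` scope format are all hypothetical. The audience check reflects the later MCP authorization revisions, which call for servers to validate that a token was issued for them.

```python
from dataclasses import dataclass, field

SERVER_URI = "https://mcp.example.com"  # hypothetical resource URI of this MCP server
STATIC_KEY = "sk-shared-123"            # hypothetical shared API key

@dataclass
class AccessToken:
    """Simplified claims; a real server would verify a signed token or introspect it."""
    subject: str
    audience: str
    scopes: set[str] = field(default_factory=set)

def allow_static_key(presented: str, tool: str) -> bool:
    # `tool` is deliberately ignored: a shared key carries no user and no scope,
    # so any holder can call any tool.
    return presented == STATIC_KEY

def allow_oauth(token: AccessToken, tool: str) -> bool:
    # Reject tokens minted for a different resource (the audience check).
    if token.audience != SERVER_URI:
        return False
    # Then require a scope for this specific tool (scope format is assumed).
    return f"tools:{tool}" in token.scopes
```

The static key answers only "do you hold the secret?" The token also answers "issued for which server?" and "for which action?", which is where your OAuth intuition starts to apply.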

PART SEVEN

Agent Failure Modes That Aren't Model Failures

Amazon's AI coding assistant retrieved superseded documentation from an internal wiki. No freshness signal, no version check. An engineer followed the advice. Result: 120,000 lost orders. The model reasoned correctly. The infrastructure failed. When a buyer says their agent "started hallucinating," three distinct infrastructure failures are usually hiding behind that word: context rot, instruction centrifugation, and stale-world reasoning. Each has a different root cause, a different fix, and a better model solves zero of them. The diagnostic question that earns you the room, "which layer failed?", is one you can ask Tuesday once you know what the answers mean.
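The Amazon example reads as a stale-world failure in the retrieval path, not in the model. A minimal sketch of a freshness gate that runs before documents reach context. It assumes documents carry hypothetical `updated` and `superseded_by` metadata; the 180-day cutoff is an arbitrary placeholder, not a recommendation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Doc:
    doc_id: str
    updated: date
    superseded_by: Optional[str] = None  # hypothetical metadata set by the wiki

def fresh_only(docs: list[Doc], today: date, max_age_days: int = 180) -> list[Doc]:
    """Keep documents that are neither superseded nor older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return [d for d in docs if d.superseded_by is None and d.updated >= cutoff]

docs = [
    Doc("deploy-runbook-v1", date(2025, 1, 3), superseded_by="deploy-runbook-v2"),
    Doc("deploy-runbook-v2", date(2026, 4, 1)),
    Doc("old-faq", date(2024, 6, 1)),
]
print([d.doc_id for d in fresh_only(docs, today=date(2026, 5, 10))])  # ['deploy-runbook-v2']
```

Dropping is one option; a harness could instead pass the document through with its age attached so the model can weigh it. Either way the fix lives in retrieval, because no model can supply a freshness signal the infrastructure never provided.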

Recap — From Black Box to Plumbing

Nine articles on the mechanical reality behind every "AI-powered" feature your buyers describe. This is the reference version: four layers, the terms that live in each, the sales moment where they surface, and the adjacent concept that causes confusion. Organized for the parking lot before a call, not for a weekend read. One insight holds the whole thing together. An AI agent is a loop: the model emits requests, the harness executes them, credentials flow through infrastructure you can audit. Your IDAM instincts are correct — they just attach to the harness and protocol layers, not the model. Vocabulary mapping tables at the end translate buyer language into identity questions you already know how to ask.

Layer Quick Reference

Agents have five architectural layers. Your buyers are building on all of them, and most AI conversations mash the layers together until nobody can point to where a control should live or who to blame when something breaks.

You think in layers already. Federation taught you that. Zero trust reinforced it. This is the same discipline applied to a stack that's new enough that half the people building on it haven't bothered to learn it yet.

Five layers. Each one does one thing. Each one raises one identity question. When a CAIO starts walking you through their agent architecture, this is how the pieces connect.

Model Layer

The Model Emits Requests. It Executes Nothing.

The model outputs structured tool-call blocks. It never touches an API, opens a file, or holds a credential. It does, however, reason over everything in its context window, including content the requesting user shouldn't see. Who filtered what reached context before reasoning began?

Harness Layer

The Harness Carries the Credential and the Blame

The harness takes tool-call requests and actually runs them. It routes errors, carries credentials, and manages growing context. In Anthropic's production pattern, a proxy fetches credentials from a vault so the agent never handles tokens. When an agent does the wrong thing, the harness is where it happened. What scope does that credential carry?
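A minimal sketch of the proxy idea, under loud assumptions: `VAULT`, `REQUIRED_SCOPE`, the scope strings, and `call_service_api` are hypothetical stand-ins, and this is not Anthropic's implementation. Two properties are worth checking: the agent cannot pick its own scope, and nothing returned to the agent contains the token.

```python
# Hypothetical vault: a real deployment would use a secrets manager.
VAULT = {"crm": {"token": "crm-secret-abc", "scopes": {"contacts:read"}}}

# Hypothetical mapping from HTTP method to the scope the call needs.
# The proxy decides this; the agent never gets to name a scope.
REQUIRED_SCOPE = {"GET": "contacts:read", "POST": "contacts:write"}

def call_service_api(method: str, path: str, token: str) -> dict:
    """Stand-in for an authenticated HTTP request; a real call would send
    `token` in an Authorization header. Returns only a response body."""
    return {"method": method, "path": path, "status": 200}

def proxy_call(service: str, method: str, path: str) -> dict:
    """What the agent can call. It names a service, method, and path, never a token."""
    entry = VAULT.get(service)
    if entry is None:
        return {"error": f"no credential for {service}"}
    needed = REQUIRED_SCOPE.get(method)
    if needed is None or needed not in entry["scopes"]:
        return {"error": f"credential for {service} cannot {method} {path}"}
    body = call_service_api(method, path, entry["token"])
    return {"body": body}  # the token never leaves this function
```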

Retrieval Layer

Retrieval Decides What the Model Gets to See

Embeddings find similar content. Grep finds exact matches. Hybrid search runs both and merges results, outperforming either alone by 10–30% in published benchmarks. All three feed content into context before the model reasons over it. The question nobody wants to own yet: were those documents filtered by the user's authorization?
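Permission trimming can be as simple as intersecting each chunk's ACL with the user's groups before the prompt is built. A sketch assuming each retrieved chunk carries a hypothetical `allowed_groups` set copied from the source system at index time; keeping that copy in sync with the source is the hard part this sketch skips.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset[str]  # hypothetical ACL copied from the source system

def authorized_context(results: list[Chunk], user_groups: set[str]) -> list[str]:
    """Drop any retrieved chunk the requesting user could not open in the source system."""
    return [c.text for c in results if c.allowed_groups & user_groups]

hits = [
    Chunk("Q3 board deck: acquisition targets", frozenset({"execs"})),
    Chunk("Public pricing sheet", frozenset({"everyone"})),
]
print(authorized_context(hits, {"everyone", "sales"}))  # ['Public pricing sheet']
```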

Protocol Layer

MCP Connects Tools. It Does Not Authorize Them.

MCP standardizes how models discover and connect to tools. Thousands of servers already exist. The 2025 spec added OAuth 2.1 requirements for HTTP transports, which is real progress. But MCP handles connectivity. Authorization on the tool side? Still yours to solve.
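MCP defines the message shape for invoking a tool (a JSON-RPC `tools/call` request carrying `name` and `arguments`), not who may send it. A sketch of tool-side authorization layered on top, assuming the caller's identity was already established from a validated token; `POLICY`, the caller names, and the tool names are hypothetical.

```python
# Hypothetical tool-side policy: which authenticated callers may invoke which tools.
POLICY = {"alice": {"read_email"}, "bob": {"read_email", "send_email"}}

def text_result(text: str, is_error: bool) -> dict:
    """Build a tools/call result with a single text content item."""
    return {"content": [{"type": "text", "text": text}], "isError": is_error}

def handle_tools_call(message: dict, caller: str) -> dict:
    """Answer a JSON-RPC tools/call request, checking tool-side authorization first.

    `caller` is assumed to come from an access token the transport already validated.
    """
    tool = message["params"]["name"]
    if tool not in POLICY.get(caller, set()):
        result = text_result(f"{caller} is not permitted to call {tool}", True)
    else:
        result = text_result(f"ran {tool}", False)  # real tool execution would go here
    return {"jsonrpc": "2.0", "id": message["id"], "result": result}

request = {
    "jsonrpc": "2.0", "id": 7, "method": "tools/call",
    "params": {"name": "send_email", "arguments": {"to": "cfo@example.com"}},
}
print(handle_tools_call(request, "alice")["result"]["isError"])  # True
```

Whether a denial should surface as a tool result with `isError` set or as a JSON-RPC error is a design choice; this sketch uses the former so the model can see why the call failed.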

Context Layer

Context Rot Degrades Inputs, Not the Model Itself

Tool outputs, error messages, intermediate reasoning. All of it accumulates across turns. Chroma's 2025 study found every one of 18 frontier models tested lost accuracy as context grew. Production fixes exist: compaction, clearing, sub-agent isolation. The audit question lurking underneath: can you reconstruct what the agent saw when it acted?
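Compaction and auditability pull against each other: compaction discards what the agent saw, and the audit question needs it back. A minimal sketch that snapshots the full context before compacting, assuming a hypothetical in-memory `AUDIT_LOG` (real systems would use durable, append-only storage) and a stub summary where a real harness would have a model write one.

```python
import hashlib
import json

# Hypothetical in-memory audit store; production would use durable, append-only storage.
AUDIT_LOG: list[dict] = []

def snapshot(turn: int, context: list[str]) -> str:
    """Record exactly what the model saw at this turn and return its SHA-256 hash."""
    digest = hashlib.sha256(json.dumps(context).encode("utf-8")).hexdigest()
    AUDIT_LOG.append({"turn": turn, "sha256": digest, "context": list(context)})
    return digest

def compact(context: list[str], keep_last: int = 3) -> list[str]:
    """Collapse older items into one summary line; keep the newest verbatim.

    The summary is a stub; a real harness would have a model write it.
    """
    if keep_last < 0:
        raise ValueError("keep_last must be >= 0")
    if len(context) <= keep_last:
        return list(context)
    split = len(context) - keep_last
    summary = f"[summary of {split} earlier items]"
    return [summary] + context[split:]

history = ["tool: 40 rows", "error: timeout", "tool: retry ok", "user: next?", "tool: 3 rows"]
snapshot(1, history)             # full context preserved before compaction
history = compact(history, keep_last=2)
snapshot(2, history)
print(history)  # ['[summary of 3 earlier items]', 'user: next?', 'tool: 3 rows']
```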