When a buyer mentions connecting agents to internal tools via MCP, and separately mentions using Skills to define agent behavior, the sentence that earns you a seat at the table is: "Those solve different problems. MCP connects the agent to external tools. Skills tell it how to behave. But they share a design question that matters a lot at scale, which is what gets loaded into the model's working memory and when." What follows gives you the mechanical understanding behind that sentence so you can hold the conversation after it.
How MCP Loads Tools by Default
What it is: MCP (Model Context Protocol) is an open protocol that gives AI models a standardized way to connect to external tools and data sources.
What it does: When an MCP client connects to a server, it calls tools/list and gets back every tool definition that server exposes. Each definition includes a name, a description, and a full JSON Schema specifying every parameter the tool accepts. In most MCP client implementations, all of these definitions get loaded into the model's system prompt before the conversation starts. Connect five servers and the model is reading the equivalent of a short novel about tool specifications before it processes your first word.
Who's behind it: Anthropic created MCP and open-sourced it in November 2024. It's now hosted by the Linux Foundation. The protocol itself doesn't mandate where tool schemas appear in the model's context; that's a client implementation decision. But most clients, including Claude Desktop and the majority of third-party implementations, load everything upfront.
What makes it distinct: MCP is the universal adapter. The protocol doesn't know which tools matter for any given conversation, so the typical client loads all of them and leaves the model to sort it out. A five-server setup with 58 tools consumes roughly 55,000 tokens before anyone types anything. Anthropic's own engineering team documented setups consuming 134,000 tokens in tool definitions alone. That's two-thirds of a 200,000-token context window, occupied by a catalog of things the model might need. The model hasn't done any work yet. It just knows what it could do. And it's paying for that knowledge on every single turn.
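The upfront loading described above can be sketched in a few lines. This is a minimal illustration, not any client's actual code: the `create_ticket` tool is made up, the five servers are simulated, and the four-characters-per-token rule is a rough assumption standing in for a real tokenizer. The one real detail is the shape of a tool definition, which carries a name, a description, and a JSON Schema under `inputSchema`.

```python
import json

# Hypothetical tool definition in the shape a tools/list result uses:
# a name, a description, and a JSON Schema under "inputSchema".
EXAMPLE_TOOL = {
    "name": "create_ticket",
    "description": "Create a ticket in the issue tracker.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Short summary"},
            "body": {"type": "string", "description": "Full description"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title"],
    },
}


def estimate_tokens(obj) -> int:
    """Rough estimate at ~4 characters per token (an assumption, not a tokenizer)."""
    return len(json.dumps(obj)) // 4


def upfront_cost(servers: dict) -> int:
    """Sum the estimated tokens for every tool on every connected server."""
    return sum(estimate_tokens(tool) for tools in servers.values() for tool in tools)


# Five simulated servers with ten tools each, all loaded before the first message.
servers = {f"server_{i}": [EXAMPLE_TOOL] * 10 for i in range(5)}
print(upfront_cost(servers))
```

Real tool schemas are usually far longer than this toy one, which is how 58 tools reach tens of thousands of tokens.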
MCP's default loading is the same pattern as early IdP configurations that stuffed every user attribute into the SAML assertion regardless of what the relying party needed. Works fine with three RPs; with fifty, the assertion bloats and half the attributes get ignored. The analogy holds on the cost side. It breaks on the fix: SAML attribute release is configured per-RP at the IdP, while MCP's fix required changing how the client itself decides what to load.
How Skills Load Instructions on Demand
What it is: Skills are Anthropic's system for defining agent behaviors as structured markdown files, each containing instructions the model follows for a specific type of task.
What it does: At session start, the agent loads only a metadata stub from each installed skill into the system prompt: the name, a one-line description, and routing fields from the skill's YAML header. This costs roughly 100 tokens per skill. When the model determines a user's request matches a skill's description, it pulls the full instruction body into context. That body ranges from about 275 tokens to 8,000 tokens across Anthropic's official skill library, with a median around 2,000 (figures from SwirlAI Newsletter's token-level analysis of all 17 official Anthropic skills). A third level exists: if the skill references bundled files like templates or reference docs, those load only when a specific step needs them.
Who's behind it: Anthropic launched Skills in October 2025 across Claude.ai, Claude Code, and the Claude API. In December 2025, they released Skills as an open standard. Within weeks, OpenAI, Google, GitHub, and Cursor adopted it.
What makes it distinct: The model reads the one-line stub and decides for itself whether to load the rest. Its own reasoning about relevance is the entire routing mechanism; there's no separate router, classifier, or keyword matcher sitting in front of it. That makes the description field in each skill file the only engineering surface for routing. Write a vague description and the skill won't trigger reliably. Write a precise one and you've built a behavioral switch that costs about 100 tokens until activated. The model's working memory stays clean until it has a reason to fill it.
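The three loading levels described above can be modeled as a small sketch. Everything here is hypothetical: the `Skill` and `SkillContext` classes, the `security-review` skill, and its `template.md` file are illustrations, not Anthropic's implementation. In the real system the model decides when to activate a skill; here the calls are made explicitly so the order of loading is visible.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str  # level 1: always in context
    body: str  # level 2: loaded when the skill is activated
    resources: dict = field(default_factory=dict)  # level 3: loaded per step


class SkillContext:
    """Tracks which pieces of each skill are currently in context."""

    def __init__(self, skills):
        self.skills = {skill.name: skill for skill in skills}
        # Session start: only the short stubs go into context.
        self.loaded = [f"{s.name}: {s.description}" for s in skills]

    def activate(self, name):
        """Level 2: pull in the full instruction body for one skill."""
        self.loaded.append(self.skills[name].body)

    def load_resource(self, name, filename):
        """Level 3: pull in one bundled file when a step needs it."""
        self.loaded.append(self.skills[name].resources[filename])


review = Skill(
    name="security-review",
    description="Draft a security review from scanner findings.",
    body="Step 1: gather findings. Step 2: fill in template.md.",
    resources={"template.md": "# Review template"},
)
context = SkillContext([review])
stub_only = list(context.loaded)  # what the model sees before any request
context.activate("security-review")
context.load_resource("security-review", "template.md")
```

Before activation, `stub_only` holds a single short line per skill. The instruction body and the template enter `context.loaded` only after the two explicit calls.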
Skills follow the same pattern as OAuth incremental authorization, where an app requests minimal scopes at login and additional scopes only when the user hits a feature that needs them. The analogy holds on the loading pattern. It breaks on who decides: in OAuth, the application developer hardcodes when to request additional scopes. In Skills, the model itself decides when to load. There's no policy engine making the call. The model is the policy engine.
Two Dimensions That Matter in Buyer Conversations
Two dimensions carry most of the weight in buyer conversations: context cost at rest (what the model pays before any work happens) and accuracy under load (what happens to the model's performance as you add more tools or skills). They map to money and to correctness, the two things that make a CISO or CAIO sit forward.
Context Cost at Rest
MCP's default behavior scales linearly and expensively. Each connected server adds its full tool catalog to the system prompt. Anthropic's engineering team published the math: five servers with 58 tools cost ~55K tokens. A tool-heavy server like Jira adds ~17K tokens on its own, so a few more servers of that size push the overhead past 100K tokens. One practitioner, Sam McLeod, documented a setup in his own production environment where a single Docker MCP server consumed 125,964 tokens across 135 tools.
Skills' cost at rest grows with the number of installed skills too, but the slope is so shallow that it is nearly flat in practice. Twenty installed skills cost roughly 2,000 tokens of metadata stubs. Fifty skills cost about 5,000. Full instructions load only on activation and unload when the task completes. The context budget stays available for actual work.
Put those side by side: a five-server MCP setup burns 55,000 tokens at rest. Fifty skills burn 5,000. That's an eleven-fold difference, and it compounds on every turn because MCP tool definitions persist in the system prompt throughout the conversation.
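The arithmetic behind that comparison is easy to check. The sketch below uses only the figures quoted above, with one simplifying assumption: the at-rest tokens are resent in full on every turn, with no prompt caching. The function names and the 20-turn conversation are illustrative.

```python
TOKENS_PER_SKILL_STUB = 100  # per-skill figure quoted above
MCP_AT_REST = 55_000  # five-server, 58-tool figure quoted above


def skills_at_rest(installed_skills: int) -> int:
    """Metadata-stub cost for a given number of installed skills."""
    return installed_skills * TOKENS_PER_SKILL_STUB


def cumulative_input(at_rest: int, turns: int) -> int:
    """Assumes the at-rest tokens are resent in full each turn (no caching)."""
    return at_rest * turns


skills_50 = skills_at_rest(50)
print(skills_50)  # 5000
print(MCP_AT_REST / skills_50)  # 11.0
print(cumulative_input(MCP_AT_REST, 20))  # 1100000
print(cumulative_input(skills_50, 20))  # 100000
```

Over a 20-turn conversation, the same ratio holds but the absolute gap grows from 50,000 tokens to a million.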
Accuracy Under Load
The cost has a second face: accuracy. The agent's ability to pick the right tool, and to use it correctly, degrades as context fills up.
Anthropic's internal testing on their MCP evaluation suite showed that moving from full upfront loading to on-demand discovery improved tool selection accuracy from 49% to 74% on one model, and from 79.5% to 88.1% on another. These are Anthropic's own benchmarks, not independent results. But the direction is consistent with what practitioners report: with too many tool definitions in context, models pick the wrong tool more often, hallucinate parameters more frequently, and lose focus on the actual task.
Tools with similar names become especially problematic. The model sees notification-send-user and notification-send-channel and guesses. You can appreciate the absurdity: the more capable you made your MCP setup, the less reliable the model became at using it.
Simon Willison, creator of Datasette and a consistently careful technical commentator, put it plainly: context pollution was why he rarely used MCP. Developers were independently building workarounds. One practitioner reduced a server from 20 tools to 8 by manually combining similar functionality, just to keep context overhead manageable (Scott Spence, a developer documenting his own optimization of the mcp-omnisearch server). WorkOS built an entire lazy-loading pipeline that resolved tool instructions on demand before Anthropic shipped an official solution. When multiple production teams independently build the same workaround, the problem is real.
Because only metadata stubs sit in context at rest, Skills avoid this structurally. The model reasons over a clean, minimal set of descriptions. Full instructions arrive only after the model has already decided which skill is relevant. Fifty sets of detailed instructions aren't competing for the model's attention simultaneously.
January 2026 and the Convergence
In January 2026, Anthropic shipped Tool Search for MCP in Claude Code, and it's now the default behavior there. When tool descriptions would consume more than 10% of the available context window, Claude Code stops loading full definitions and builds a lightweight search index instead. The model discovers tools on demand, pulling in only the specific definition it needs.
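The decision described above has two parts: a threshold check, then on-demand lookup. The sketch below illustrates the shape of that logic only. It is not Claude Code's implementation: `should_defer`, `search_tools`, the keyword-overlap scoring, and the two-tool index are all hypothetical, and a real search index would be more capable than word matching.

```python
def should_defer(tool_tokens: int, context_window: int, threshold: float = 0.10) -> bool:
    """Defer full definitions when they would exceed the threshold share of context."""
    return tool_tokens > threshold * context_window


def search_tools(index: dict, query: str, limit: int = 3) -> list:
    """Naive keyword-overlap search over short tool descriptions (illustrative only)."""
    terms = set(query.lower().split())
    scored = []
    for name, description in index.items():
        score = len(terms & set(description.lower().split()))
        if score:
            scored.append((score, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical lightweight index: tool name mapped to a one-line description.
index = {
    "jira_create_issue": "create a new issue in jira",
    "slack_send_message": "send a message to a slack channel",
}

if should_defer(tool_tokens=77_000, context_window=200_000):
    matches = search_tools(index, "send slack message")
else:
    matches = list(index)
```

With 77,000 tokens of definitions against a 200,000-token window, the check passes the 10% threshold, so only the matching tool's definition would need to be loaded.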
The results, from Anthropic's engineering team:
| Metric | Traditional MCP loading (50+ tools) | MCP with Tool Search |
|---|---|---|
| Tokens consumed | ~77,000 | ~8,700 |
| Tool Search overhead | — | ~500 tokens |
| Context preserved for work (of a 200K window) | ~62% | ~95% |
Thariq Shihipar, a member of Anthropic's technical staff working on Claude Code, noted that the server instructions field in MCP definitions, previously a "nice to have," is now critical. It acts as the metadata that helps the model "know when to search for your tools, similar to skills."
That last phrase is the tell. MCP tool discovery in Claude Code now works like Skills' trigger mechanism: a lightweight description stub that the model reasons over to decide whether to load the full definition. The Skills pattern visibly shaped how MCP loading evolved.
Two caveats worth being precise about. Tool Search is the default in Claude Code specifically. Most other MCP clients, including Claude Desktop and third-party implementations, still load everything upfront. And Tool Search requires newer models (Sonnet 4 or later). The broader MCP ecosystem hasn't caught up yet.
This is the same architectural move as the shift from full SCIM sync to just-in-time provisioning: stop pushing everything everywhere, start resolving on demand. The analogy holds on the efficiency gain. It breaks on the trigger. JIT fires on an authentication event, a known, discrete action. MCP Tool Search fires on the model's own assessment of what it needs, which is probabilistic. The model might decide wrong.
Layers That Compose
Skills define how an agent behaves. MCP defines what an agent can connect to. They compose.
A skill might say "when the user asks you to draft a security review, follow this methodology and use this template." An MCP server might expose the tools that let the agent pull vulnerability data from a scanner or push the finished review into a document management system. The skill provides behavioral guidance. The MCP server provides external connectivity. In production, WorkOS demonstrated this composition pattern by building a context engine that resolves who the user is and what they can access (behavioral layer), then lazily loads only the MCP tool instructions relevant to the current task (connectivity layer). Behavioral guidance and tool connectivity are separate concerns, managed separately, loaded separately.
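The separation of concerns above can be sketched as follows. This is a hypothetical illustration, not WorkOS's pipeline or any product's API: the `tool_names` field is invented for the example (it is not part of the Skills format), and the catalog entries are made-up tools.

```python
# Hypothetical skill: behavioral guidance plus the tool names its steps rely on.
SKILL = {
    "name": "security-review",
    "instructions": "Follow the review methodology and fill in the template.",
    "tool_names": ["scanner_get_findings", "docs_create_document"],
}

# Hypothetical MCP tool catalog, keyed by tool name.
CATALOG = {
    "scanner_get_findings": {
        "name": "scanner_get_findings",
        "description": "Pull vulnerability findings from the scanner.",
    },
    "docs_create_document": {
        "name": "docs_create_document",
        "description": "Create a document in the document management system.",
    },
    "calendar_create_event": {
        "name": "calendar_create_event",
        "description": "Create a calendar event.",
    },
}


def build_context(skill: dict, catalog: dict) -> dict:
    """Combine the skill's instructions with only the tool definitions it needs."""
    tools = [catalog[name] for name in skill["tool_names"] if name in catalog]
    return {"instructions": skill["instructions"], "tools": tools}


context = build_context(SKILL, CATALOG)
loaded_tool_names = [tool["name"] for tool in context["tools"]]
```

The calendar tool stays out of context because the active skill never references it. The behavioral layer and the connectivity layer are loaded by separate decisions.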
Amazon's MCP team at the Dev Summit described their approach the same way: Skills for behavioral patterns, MCP for tool connectivity. This is the emerging production consensus. (Dev Summit reporting via aaif.io, sourced from named speakers at MCP Dev Summit North America 2026.)
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "MCP dumps everything into context and that's bad" | "MCP's default loads all tool definitions upfront. That works fine with a few tools, but at scale it eats the model's working memory before any real work starts." | Accurate without dismissing a protocol the buyer may already use |
| "Skills replace MCP" | "Skills and MCP are different layers. Skills define behavior, MCP provides connectivity. Most production setups use both." | Prevents a false either/or that the buyer's engineers will correct |
| "The context window is like RAM" | "The model has a fixed budget for everything it can hold in mind at once. Tool definitions, conversation history, and actual work all compete for the same budget." | Avoids a metaphor that implies persistence; context windows reset |
| "MCP wastes tokens" | "A five-server MCP setup can consume 55,000 tokens in tool definitions before the first message. That's real cost and it affects accuracy, not just price." | Grounds the concern in Anthropic's published numbers |
| "Skills are more efficient" | "Skills load about 100 tokens per skill at startup and pull in full instructions only when the model decides they're relevant. It's progressive disclosure." | Gives the buyer the design term they'll recognize from their own architecture conversations |
| "Anthropic fixed MCP in January" | "Claude Code now defers MCP tool loading by default. It only pulls in a tool definition when the model actually needs it. Most other MCP clients haven't adopted that pattern yet." | Precise about scope; doesn't overpromise ecosystem-wide change |
| "AI agents need fewer tools" | "The accuracy data shows models pick the right tool more often when they're sorting through fewer options at once. Anthropic saw selection accuracy jump from 49% to 74% on one model." | Reframes a vague claim into specific, sourceable data that lands with security buyers |
| "Progressive disclosure is new" | "Progressive disclosure is the same pattern as loading permissions on demand instead of all at once. Start minimal, expand when you need to." | Bridges to something the buyer already trusts without jargon |
| "We should wait for MCP to mature" | "MCP is production infrastructure at companies running tens of thousands of agent executions a week. The question is how to manage what it loads into context." | Positions the decision as operational |
| "Tool Search solves everything" | "Tool Search dropped context overhead from 77K tokens to about 9K in Anthropic's testing. But it requires newer models and only Claude Code defaults to it today." | Honest about constraints; builds trust for every other claim you make |
The sentence to carry into any conversation about agent architecture: the model's context window is a fixed budget, and everything competes for the same space. Skills and MCP's newer Tool Search both solve this the same way. Load less upfront. Resolve on demand.
Things to follow up on...
- Uber's MCP at scale: Uber reported 1,500+ monthly active agents and 60,000+ agent executions per week across 10,000+ internal services at the MCP Dev Summit North America 2026, making it one of the few public accounts of MCP operating at genuine enterprise scale.
- MCP authorization keeps moving: The MCP spec has gone through at least two major authorization revisions since launch, with the 2025-06-18 spec adding OAuth 2.1-oriented requirements for HTTP-based transports that didn't exist in the original November 2024 release. (Covered in depth in slot 5.6.)
- Armin Ronacher's practitioner take: Flask creator Armin Ronacher wrote a detailed comparison explaining why he moved from MCP to Skills for behavioral guidance, noting that skills "do not actually load a tool definition into the context" and instead let the agent use the tools it already has.
- Tool description quality as engineering surface: An academic paper from February 2026 found that MCP tool descriptions are "smelly" in the software engineering sense, with metadata quality directly affecting whether models select the right tool under Tool Search and Skills routing.

