Function Calling vs. XML Tool Calls — Same Mechanic, Different Wire Format

Two encoding formats for the same mechanic: how the model asks for a tool call, why the format choice matters operationally, and where your IDAM intuition breaks.

By Leigh Garrity— May 9, 2026

Function Calling vs. XML Tool Calls — Same Mechanic, Different Wire Format

Two encoding formats for the same mechanic: how the model asks for a tool call, why the format choice matters operationally, and where your IDAM intuition breaks.

When an AI model "calls a tool," it doesn't call anything. It emits structured text that says please do this. Your orchestration code, the harness, parses that text, decides whether to comply, executes the actual function, and feeds the result back. The model never touches the external system. It writes a request. The harness acts on it.

That mechanic holds regardless of how the request is encoded. The industry calls it "tool use."

The two dominant encodings are native function calling (JSON schema registered through the provider's API, structured response object back) and XML tool calls (the model writes parseable XML tags directly into its text output, your code extracts them). Same intent. Different wire format. "Function calling" is the encoding layer. "Tool use" is the mechanic. They get conflated constantly because most people encounter them together, but they're separable, and the separation matters.

The format you choose determines how tightly your agent stack is coupled to a single provider, how easily you can debug a failed session, and whether you get schema validation for free. The model's capabilities stay the same either way.

Native function calling

You register tools by passing JSON schema definitions through a dedicated API parameter. Each definition includes a function name, a description, and a schema for expected arguments. When the model decides a tool is relevant, it returns a structured object: function name, arguments, call ID. Your code catches that object, executes the function, and sends the result back with the matching ID.

OpenAI, Anthropic, and Google Gemini all support this. OpenAI and Anthropic both use JSON Schema for tool definitions. Gemini uses an OpenAPI-based schema with its own constraints. None of the wire formats are interchangeable.

OpenAI puts tool schemas under parameters and returns calls in a tool_calls array. Anthropic uses input_schema and returns tool_use content blocks. Gemini wraps definitions in functionDeclarations and returns functionCall parts. The nesting structures differ. OpenAI's own format differs between its two APIs. Three providers, four wire formats.

Switching providers means rewriting your schema registration, your response parsing, and your result injection. That's the coupling cost.

XML tool calls

The alternative is simpler and older than it looks. You include tool definitions in the model's prompt as text, instruct the model to emit tool calls using parseable XML tags, and extract the calls from the model's text output yourself.

No dedicated API parameter. No structured response object. The model writes something like this inline in its output, and your harness finds it, parses it, acts on it:

xml

<tool_call>
  <name>get_weather</name>
  <location>Falls Church</location>
</tool_call>

This works on any instruction-following model. It doesn't require the provider to have built a function-calling API. It doesn't require a specific SDK version. The tool definitions are just text in the prompt, and the tool calls are just text in the output.

These are the same mechanic

The detail that collapses the distinction: when you pass JSON tool definitions to Anthropic's API, the API constructs a system prompt from those definitions and injects it as XML-formatted instructions. The model receives your tools as structured text. It has always been parsing tool information from text. Native function calling is the convenience layer that handles the serialization for you.

Tim Kellogg, a software engineer whose analysis of MCP's architecture circulated widely in the agentic coding community, put it directly:

“

"You can paste the JSON from a tool declaration into a prompt and it works, because most LLMs are trained to do function calling with an XML-like variant anyway."

The model doesn't care which door the instruction came through.

The tradeoffs that actually matter

Portability. Native function calling ties you to a provider's API surface. XML tool calls work anywhere. If your agent stack needs to run against multiple models, or you're evaluating providers, or your procurement timeline means you can't commit to one vendor's API contract today, the text-based approach avoids a rewrite when you switch.

Debuggability. When a multi-step agent session fails at step 12 and you need to trace back to the tool call that went sideways at step 4, the inspection experience differs. XML tool calls appear as plaintext in the conversation stream. You can read a raw transcript. Native function call objects live in structured API responses that require dedicated observability tooling to surface. Without that tooling, you're reconstructing a multi-turn session from structured response objects that don't appear in the text at all. Practitioners report this as a meaningful operational difference, though the gap narrows as tracing tools mature. This is practitioner consensus, not benchmarked fact.

Validation. Native function calling earns its keep here. Both OpenAI and Anthropic offer strict mode, which guarantees the model's output conforms to your JSON schema. With XML tool calls, you're responsible for validating the parsed output yourself.

Both providers built strict mode for a reason. Without it, the model can pass malformed arguments, omit required fields, or invent parameters that don't exist in your schema. When that happens, the harness sends a bad request to the target API, gets a 400 back, and the agent either retries with the same broken arguments or quietly fabricates a result. The session looks fine from the outside. The data is wrong. Strict mode eliminates that entire failure category. That's not a small thing.

Context overhead. Armin Ronacher, creator of the Flask web framework, documented his migration away from MCP tool schemas because loading them consumed roughly 8,000 tokens of context window. His alternative: text-based "Skills" that teach the model to construct API calls from documentation rather than from pre-defined schemas. The model learns what's available and how to use it, without the harness loading full tool definitions into every conversation turn. The tradeoff is real: structured tool definitions are precise but expensive in context budget.

No consensus has formed on which encoding is "better." The honest answer is that it depends on whether you're optimizing for portability, validation, or debuggability, and those priorities shift depending on where you are in the build.

Where your IDAM intuition helps

You already hold a clean model for this. SAML assertions and JWTs are different serialization formats for the same semantic content: identity claims, authentication state, authorization context. SAML uses XML. OIDC uses compact JSON tokens. Different encoding, identical function.

Native function calling and XML tool calls have exactly this relationship. Same semantic content (call this function with these arguments), different serialization (structured API object vs. parseable text), with the lighter-weight format being more portable and easier to inspect.

The parallel goes one layer deeper. SAML defines both the token format and the protocol: assertions, bindings, profiles, metadata. JWT defines only the token format; it lives inside other protocols like OIDC and OAuth. Similarly, native function calling is format plus protocol (dedicated API parameter, structured response lifecycle, strict validation). XML tool calls are just a format. Any model, any harness, no protocol required. The protocol layer is what creates vendor coupling.

This is where your IDAM intuition helps.

Where it misleads

In identity, format choice carries security implications. SAML's XML signatures and JWT's JWS have distinct vulnerability profiles. Token binding semantics differ. Audience restriction mechanisms differ. Your instinct that format choice matters for security posture is well-earned.

That instinct doesn't transfer. Neither native function calling nor XML tool calls are signed. Neither carries integrity guarantees at the format level. The format choice between JSON API objects and XML text carries zero security content.

The security of a tool call lives entirely outside the encoding: in the credential the harness attaches when executing the call, in the scope constraints on that credential, in the authorization check the target API performs, and in whether anyone logged what happened. Those are the questions that matter in a buyer conversation about agentic AI. The serialization format is background noise. Who authorized the action, with what scope, and whether there's an audit trail — those are the questions worth asking.

So what does this give you

Function calling is the encoding layer. Tool use is the mechanic. The model requests an action. The harness executes it. Security, governance, and identity all live in the harness. The format of the request is plumbing.

Your expertise already lives there. So should the conversation.

Things to follow up on...

Anthropic's internal XML conversion: When you pass JSON tool definitions to Claude's API, it constructs XML-formatted instructions in the system prompt before the model ever sees them, which is the clearest evidence that native function calling is a convenience wrapper over text-based tool parsing.
Schema validation failure modes: Without strict mode, models can invent parameters or omit required fields, and common production breakdowns include broken JSON, nonexistent function calls, and mismatched parameters that look correct from the outside.
Context cost of tool definitions: A five-server MCP setup can approach 100K+ tokens of overhead before the model starts working, which is why Ronacher and others have migrated to Skills-based approaches that load only 20–50 tokens per tool until triggered.
The portability standard question: Tim Kellogg argues that MCP's strongest case is sociological rather than technical, since tool declarations in any format are parseable structured information that most instruction-following models can already handle.