The Coordination Tax: What Multi-Agent Architecture Actually Costs

By Carey Whitten, May 5, 2026

Three topologies dominate the multi-agent conversation you're about to walk into: supervisor-worker, where a central orchestrator routes subtasks to specialist agents; debate, where multiple agents critique each other's outputs before a final answer surfaces; and handoff, where agents pass context sequentially like a relay baton. You'll hear all three described as upgrades. The production record is more complicated, and knowing the difference between the topologies — and where each one actually earns its complexity — is what separates a useful question from a credulous nod in a buyer conversation.

The Three Topologies

Supervisor-Worker

What it is: A central orchestrator agent delegates subtasks to specialist agents and synthesizes their outputs.

What it does: The orchestrator receives a task, decomposes it, routes subtasks to agents with relevant capabilities (a code agent, a retrieval agent, a summarization agent), collects results, and assembles a response. The specialists don't communicate with each other — they report up.

Where it comes from: This is the dominant pattern in current enterprise agent frameworks. LangGraph's StateGraph with a supervisor node, AutoGen's GroupChat with a speaker selector, and CrewAI's role-based task assignment all implement variants of this topology. The pattern descends from classical workflow orchestration — think BPEL and service mesh routing — adapted for probabilistic LLM outputs.

What makes it distinct: The orchestrator holds the full task context. Specialists hold only what they're given. The orchestrator's quality determines everything; a weak orchestrator with strong specialists still produces weak results.
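The shape of the pattern can be sketched in a few lines. This is a minimal illustration, not any framework's API: the specialists are plain functions standing in for LLM-backed agents, and decompose() is a hard-coded stand-in for the orchestrator's own LLM call. All names here are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical specialists: plain functions standing in for LLM-backed agents.
def code_agent(subtask: str) -> str:
    return f"[code] {subtask}"

def retrieval_agent(subtask: str) -> str:
    return f"[retrieval] {subtask}"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "code": code_agent,
    "retrieval": retrieval_agent,
}

def decompose(task: str) -> List[Tuple[str, str]]:
    """Stand-in for the orchestrator's LLM call.

    Returns (specialist name, subtask) pairs. A real orchestrator would
    produce this plan probabilistically; here it is hard-coded.
    """
    return [
        ("retrieval", f"find sources for: {task}"),
        ("code", f"write a script for: {task}"),
    ]

def supervise(task: str) -> str:
    plan = decompose(task)  # only the orchestrator sees the full task
    # Each specialist receives its subtask string and nothing else.
    results = [SPECIALISTS[name](subtask) for name, subtask in plan]
    return "\n".join(results)  # the orchestrator assembles the response

print(supervise("churn analysis"))
```

Note that everything downstream depends on what decompose() returns: if the plan is wrong, the specialists faithfully execute the wrong subtasks.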

Debate

What it is: Multiple agents generate and critique each other's outputs iteratively to improve answer quality.

What it does: Two or more agents (often the same base model with different system prompts, sometimes different models) each produce an initial response, then review and critique the others' responses, then revise. The cycle runs for a fixed number of rounds or until outputs converge. A final agent or aggregation step produces the answer.

Where it comes from: The roots here are academic. Du et al. (2023) demonstrated in "Improving Factuality and Reasoning in Language Models through Multiagent Debate" that structured debate between LLM instances improved performance on math and strategic reasoning benchmarks. The pattern also draws from Constitutional AI's self-critique mechanism, generalized to multi-agent form. Production implementations are rarer than the research suggests they should be.

What makes it distinct: Unlike the other two topologies, debate doesn't decompose a task — it applies multiple perspectives to the same task. The value proposition is epistemic, not operational: you're buying error correction, not parallelism.
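A rough sketch of the debate loop follows, assuming the simplest possible choices: agents are functions that take the question plus their peers' latest answers, convergence means all answers are identical, and aggregation is a majority vote. None of this is taken from the Du et al. implementation; run_debate and the stub agents are hypothetical names for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

# An agent takes the question and its peers' latest answers, returns an answer.
Agent = Callable[[str, List[str]], str]

def run_debate(question: str, agents: List[Agent], max_rounds: int = 3) -> Tuple[str, int]:
    """Run rounds until answers agree or max_rounds is reached.

    Returns the majority answer and the number of agent calls made.
    """
    answers = [agent(question, []) for agent in agents]  # round 1: independent answers
    calls = len(agents)
    rounds = 1
    while rounds < max_rounds and len(set(answers)) > 1:
        # Each agent sees the other agents' previous answers and may revise.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
        calls += len(agents)
        rounds += 1
    final = Counter(answers).most_common(1)[0][0]  # majority-vote aggregation
    return final, calls

# Stub agents standing in for LLM calls.
def stubborn(question: str, peers: List[str]) -> str:
    return "42"

def follower(question: str, peers: List[str]) -> str:
    return peers[0] if peers else "41"

print(run_debate("What is 6 * 7?", [stubborn, follower]))
```

The same loop also shows the weakness: two agents that start from the same wrong answer "converge" in round one, and the mechanism reports agreement rather than an error.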

Handoff

What it is: Agents pass context sequentially, each completing a defined stage before transferring to the next.

What it does: Agent A handles intake and initial processing, then packages its output and passes to Agent B, which handles the next stage, and so on. Each agent has a defined scope and a defined exit condition. The chain is linear; there's no orchestrator coordinating the whole — each agent knows only its predecessor's output and its own instructions.

Where it comes from: OpenAI's experimental Swarm library (since superseded by the OpenAI Agents SDK) popularized this pattern. The design logic is pipeline processing: Unix pipes, ETL stages, assembly-line manufacturing. The appeal is modularity: each agent is independently testable and replaceable.

What makes it distinct: No agent holds the full task context at any point. The context is distributed across the chain. This is both the topology's strength (each agent is focused) and its central vulnerability (context degrades at each transfer).
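As a small illustration of how context thins out across transfers, here is a sketch of a three-stage chain. The stages are ordinary functions standing in for agents, and the stage names and dictionary keys are invented for the example; it is not modeled on Swarm or the Agents SDK.

```python
from typing import Callable, Dict, List

Package = Dict[str, str]
Stage = Callable[[Package], Package]

def intake(package: Package) -> Package:
    # Normalizes the raw request; the original wording is not passed on.
    return {"summary": package["raw"].strip().lower()}

def classify(package: Package) -> Package:
    label = "billing" if "invoice" in package["summary"] else "general"
    # Only the label is forwarded, so the summary is lost at this transfer.
    return {"label": label}

def respond(package: Package) -> Package:
    # This stage knows the label but nothing about the original request.
    return {"reply": f"Routing to {package['label']} team"}

def run_chain(stages: List[Stage], initial: Package) -> Package:
    package = initial
    for stage in stages:
        package = stage(package)  # each stage sees only its predecessor's output
    return package

result = run_chain([intake, classify, respond], {"raw": "  Question about my INVOICE "})
print(result)
```

Each stage can be tested on its own, which is the modularity argument. By the final stage, though, nothing of the customer's actual wording survives.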

Comparing the Three

The comparison is trait-led: each topology is evaluated across three dimensions (coordination overhead, failure surface, and problem class) so that it yields actionable signal rather than a ranking.

Coordination Overhead

Supervisor-worker carries the highest coordination overhead of the three. Every subtask requires an orchestrator decision: which specialist, what scope, what format. The orchestrator itself is an LLM call, so you're paying for a routing decision in tokens and latency before any specialist work begins. In a five-specialist system, you might make six or more LLM calls to complete a task that a single well-tooled agent could handle in one or two.

Debate has predictable overhead: the call count is the number of agents multiplied by the number of debate rounds. A two-agent, three-round debate makes six LLM calls before aggregation. The overhead is fixed by configuration, not by task complexity, which makes it easier to budget but harder to optimize.

Handoff has the lowest per-step overhead — each agent call is independent, with no routing logic — but the chain length determines total overhead. A four-stage handoff pipeline makes four LLM calls minimum, regardless of whether the task needed all four stages.
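The call counts quoted in this section follow from simple arithmetic. The helper functions below are hypothetical and assume the simplest case: one orchestrator call per task, every debate agent called once per round, and one call per handoff stage, with no retries.

```python
def supervisor_worker_calls(specialists_used: int, orchestrator_calls: int = 1) -> int:
    """Orchestrator call(s) plus one call per specialist used."""
    return orchestrator_calls + specialists_used

def debate_calls(agents: int, rounds: int, aggregator_calls: int = 0) -> int:
    """Every agent is called once per round; aggregation is optional extra."""
    return agents * rounds + aggregator_calls

def handoff_calls(stages: int) -> int:
    """One call per stage, whether or not the task needed that stage."""
    return stages

print(supervisor_worker_calls(5))  # five specialists plus one routing call
print(debate_calls(2, 3))          # two agents, three rounds
print(handoff_calls(4))            # four-stage pipeline
```

If the orchestrator makes a second call to synthesize results, pass orchestrator_calls=2; that is the "or more" in the supervisor-worker figure.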

Failure Surface

The production record gets uncomfortable here.

Every LLM call has a non-zero probability of producing an output that's wrong, malformed, or subtly off in a way that downstream processing won't catch. In a single-agent system, that failure is visible and contained. In a multi-agent system, failures compound.

A five-agent supervisor-worker system where each agent operates at 95% reliability (a generous assumption for complex tasks) produces a correct end-to-end result roughly 77% of the time (0.95⁵), assuming failures are independent and every agent must succeed. Add an orchestrator that itself operates at 95% reliability, and you're at 74%. The math is straightforward; the implications are not always surfaced in architecture discussions.
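The compounding can be checked directly. This sketch assumes that call failures are independent and that every call must succeed for the end result to be correct; real systems with retries or validation will behave differently. The function names are illustrative.

```python
import math

def end_to_end_reliability(per_call: float, calls: int) -> float:
    """Probability that all calls succeed, assuming independent failures."""
    return per_call ** calls

def calls_until_below(per_call: float, threshold: float) -> int:
    """Smallest number of calls at which end-to-end reliability drops below threshold."""
    return math.floor(math.log(threshold) / math.log(per_call)) + 1

print(round(end_to_end_reliability(0.95, 5), 2))  # five specialists
print(round(end_to_end_reliability(0.95, 6), 2))  # five specialists plus orchestrator
print(calls_until_below(0.95, 0.5))               # chain length where success is a coin flip
```

Under the same assumptions, a chain of fourteen 95%-reliable calls succeeds less than half the time.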

Supervisor-worker has the most complex failure surface because failures can occur at the orchestrator level (wrong routing, wrong scope definition) or at any specialist level, and orchestrator failures tend to cascade. If the orchestrator misunderstands the task, all specialist outputs are wasted.

Debate has a more contained failure surface: agents are evaluating the same task, so a single agent failure is partially corrected by the others. This is the topology's actual value proposition. The failure mode to watch is convergence on a shared wrong answer — when agents agree confidently on something incorrect, the debate mechanism amplifies rather than corrects the error.

Handoff has a linear failure surface, but with a specific vulnerability: context degradation. Each transfer compresses or summarizes the prior agent's output. The "lost in the middle" problem — where LLMs underweight information in the middle of long contexts — compounds across handoff stages. By stage four, the agent may be working from a materially incomplete picture of the original task.

Problem Class

Supervisor-worker justifies its overhead when the task genuinely decomposes into parallel, independent subtasks that require different capabilities. Complex research synthesis, multi-domain document analysis, and tasks that benefit from genuine specialization (a code agent that's been fine-tuned for code, a legal retrieval agent with domain-specific tools) are the right fit. Sequential tasks, interdependent subtasks, or anything a single agent with good tools could handle in full scope — none of those warrant the overhead.

Debate justifies its overhead for tasks where factual accuracy or reasoning quality is the primary concern and latency is not. High-stakes answer generation, medical or legal question answering where errors are costly, and any task where the cost of a wrong answer exceeds the cost of multiple LLM calls are the right fit. Time-sensitive or cost-sensitive tasks don't fit, nor do tasks where the agents are likely to share the same training biases and thus converge on the same errors.

Handoff justifies its overhead when the task has a natural sequential structure with clear stage boundaries, when each stage requires genuinely different context or tooling, and when modularity and independent testability matter for maintenance. Customer service escalation flows, multi-step document processing pipelines, and staged approval workflows are reasonable fits. Artificial stage divisions, tasks requiring context continuity across the full scope, or pipelines long enough to create more failure surface than the modularity saves — those are the wrong applications.

The Cost-and-Coherence Tax

The production record shows a consistent pattern: most multi-agent systems, as deployed, perform worse than a well-designed single agent on the same task.

This isn't a theoretical concern. In a 2024 evaluation published by AI practitioner Hamel Husain, single-agent pipelines with well-designed tool sets outperformed multi-agent variants on the majority of benchmark tasks tested, with the performance gap widening as task complexity increased — the opposite of what the multi-agent pitch would predict. Anthropic's own documentation on multi-agent patterns notes explicitly that "each LLM call introduces latency and cost" and that orchestrators "cannot verify subagent outputs," which means errors in specialist responses propagate silently. The framework builders are telling you this; the vendors selling multi-agent platforms sometimes aren't.

The tax has two components. The cost component is straightforward: more LLM calls, more tokens, more latency, more infrastructure. A system that makes ten LLM calls to complete a task costs roughly ten times what a single-call system costs if token counts per call are similar, and that is before accounting for orchestration logic, retry handling, and the engineering time required to maintain agent boundaries.

The coherence component is subtler and more damaging. Coherence is the degree to which the system's output reflects a unified understanding of the task. In a single-agent system, coherence is maintained by the model's context window — everything the agent knows is in one place. In a multi-agent system, coherence is a coordination problem. Every agent boundary is a context compression event. The orchestrator's summary of a subtask is not the subtask. The handoff package is not the full prior context. The debate synthesis is not the full debate. Something is always lost, and what's lost is usually the nuance that makes the difference between a correct answer and a plausible-sounding wrong one.

The community has noticed. A Hacker News thread from late 2024 (practitioner discussion rather than peer-reviewed evidence, but useful for surfacing real deployment experience) collected reports from engineers who had built and then dismantled multi-agent systems. The recurring pattern: the system worked in demos, failed in production on edge cases, and was replaced by a single agent with better tools and a longer context window. The phrase that appeared repeatedly was "we added agents to solve a problem that better prompting would have fixed." (HN thread "Ask HN: Who has actually shipped multi-agent systems in production?", November 2024.)

None of this means multi-agent architecture is wrong. The burden of proof runs the other way from how it's usually presented, though. Before accepting architectural complexity, the right question is what specifically this task requires that a single agent with good tools cannot provide.

Field Language Guide

When you hear "swarm of agents" or "multi-agent architecture" in a buyer pitch, the following table gives you language that keeps you credible without shutting down the conversation.

Don't say	Do say	Why it matters
"That sounds complex"	"Walk me through how the agents coordinate — who decides what each one does?"	Opens the topology question without signaling skepticism
"Is that really necessary?"	"What's the task structure that makes multiple agents the right fit here?"	Forces the buyer to justify the architecture against the problem class
"More agents means more risk"	"How does the system handle a failure in one of the specialist agents?"	Surfaces the failure surface question in operational terms the buyer can answer
"That's just a pipeline"	"Is this a sequential handoff or does the orchestrator route dynamically?"	Distinguishes handoff from supervisor-worker; shows you know the difference
"Swarm sounds like marketing"	"When you say swarm, are the agents running in parallel or passing context sequentially?"	The word "swarm" almost never maps to a specific topology; this question reveals whether the buyer knows their own architecture
"How do you know it's accurate?"	"For a debate topology, how many rounds does it run, and what's the convergence criterion?"	Debate without a convergence criterion is just expensive; this question tests whether it's real
"That seems expensive"	"What's the per-task LLM call count, and how does that compare to a single-agent baseline?"	Grounds cost concern in a specific, answerable question rather than a vague objection
"We support agentic workflows"	"What coordination pattern does your current implementation use, and what governance layer sits above the orchestrator?"	Positions Okta's identity and governance story without overclaiming
"Agents can do anything"	"What's the defined scope for each agent, and how is that scope enforced at runtime?"	Scope enforcement is where identity and authorization become relevant
"This is cutting-edge"	"Is this pattern in production, or is this the target architecture?"	The gap between demo and production is where most multi-agent complexity lives
"Our platform handles that"	"What does your current identity infrastructure look like for the human users these agents are acting on behalf of?"	Keeps the conversation grounded in what Okta actually governs

Okta Concept Mapping

The supervisor-worker topology rhymes with OAuth delegation chains — a principal (the orchestrator) acts on behalf of a user and delegates scoped authority to downstream agents (the specialists). The analogy holds well enough to be useful: you can think of each specialist agent as a downstream service receiving a scoped token from the orchestrator. Where it breaks is enforcement. In OAuth, delegation is explicit, the scope is declared in the token, and revocation is a defined mechanism. In supervisor-worker architectures, the orchestrator's delegation to specialists is typically implicit — passed as prompt instructions, not as a verifiable credential — the scope is whatever the orchestrator says it is in natural language, and there is no standard revocation path if a specialist agent misbehaves mid-task. In a buyer conversation, when a CAIO asks "how do you govern what each agent is authorized to do," the answer cannot be "the orchestrator handles it" — because the orchestrator is itself an LLM, not an authorization server.
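To make the contrast concrete, here is a toy sketch of what explicit delegation looks like when it is enforced in code rather than stated in a prompt. It is not OAuth and not an Okta API: Grant, delegate, authorize, and the scope strings are all hypothetical, and a real deployment would rely on an authorization server and signed tokens rather than an in-memory set.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set

class ScopeError(PermissionError):
    """Raised when an agent asks for, or tries to use, authority it does not hold."""

@dataclass(frozen=True)
class Grant:
    agent_id: str
    scopes: FrozenSet[str]

REVOKED: Set[str] = set()  # stand-in for a revocation mechanism

def delegate(parent: Grant, agent_id: str, requested: Set[str]) -> Grant:
    # A downstream grant can never exceed the scopes its parent holds.
    if not requested <= parent.scopes:
        raise ScopeError(f"{agent_id} requested scopes beyond {parent.agent_id}'s grant")
    return Grant(agent_id, frozenset(requested))

def authorize(grant: Grant, required_scope: str) -> None:
    # Checked at the moment of the action, independent of any prompt text.
    if grant.agent_id in REVOKED:
        raise ScopeError(f"{grant.agent_id} has been revoked")
    if required_scope not in grant.scopes:
        raise ScopeError(f"{grant.agent_id} lacks scope {required_scope}")

orchestrator = Grant("orchestrator", frozenset({"tickets:read", "tickets:write"}))
summarizer = delegate(orchestrator, "summarizer", {"tickets:read"})
authorize(summarizer, "tickets:read")  # allowed: scope was explicitly granted
```

The point of the sketch is where the check lives: authorize() runs outside the model, so a specialist that ignores its prompt instructions still cannot act beyond its grant, and adding its ID to REVOKED stops it mid-task.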

The topology vocabulary gives you something specific to ask for. The cost-and-coherence tax gives you the right prior. Ask which topology, and the answer — or the absence of one — tells you most of what you need to know about whether the architecture has been thought through or just named.