The Coordination Tax: Supervisor-Worker, Debate, Handoff, and the Case for Not Coordinating

By Leigh Garrity, May 6, 2026

Four coordination options show up in AI architecture proposals right now: supervisor-worker, debate, handoff, and the one that doesn't get a slide in the vendor deck — the single well-designed agent. The first three are topologies; the fourth is a reminder that topology has a cost. You'll encounter all four in buyer conversations, RFP responses, and vendor pitches, often without those names attached. Knowing what each one actually does, what it costs, and when the cost is worth paying is what separates an AE who can evaluate an architecture proposal from one who has to take it on faith.

The Four Patterns

Supervisor-Worker

What it is: An orchestrator agent that decomposes a task, dispatches subtasks to specialized worker agents, and synthesizes their outputs into a final result.

What it does: The supervisor holds the task plan and the shared context. Workers execute bounded subtasks — one might handle retrieval, another code execution, another drafting. The supervisor decides sequencing, handles retries, and resolves conflicts between worker outputs. Workers don't communicate with each other; everything routes through the supervisor.

Where it comes from: LangGraph's documentation describes this as the canonical "supervisor" pattern and provides reference implementations. Microsoft Research's AutoGen framework implements a variant as a group chat run by a manager agent (GroupChat with a GroupChatManager). The underlying structure has roots in hierarchical task network planning, a classical AI planning approach that dates to the 1970s and was formalized in the 1990s, which decomposes complex goals into ordered subtask trees. The pattern migrated into LLM agent frameworks because it maps cleanly onto how humans think about delegation, which is both its appeal and one of its problems.

What makes it distinct: The supervisor is the single point of coherence. Every decision about task structure, sequencing, and synthesis passes through one agent. This gives the topology clear authority and a legible audit trail. You can always ask what the supervisor decided at any step. But supervisor errors propagate everywhere, and supervisor context limits become system limits.
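To make the routing concrete, here is a minimal sketch of the topology, not any framework's implementation. The workers are plain functions standing in for model calls, the plan is hard-coded where a real supervisor would ask a model to decompose the task, and every name (Supervisor, retrieve, draft) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


# Hypothetical workers; in a real system each would wrap a model or tool call.
def retrieve(subtask: str) -> str:
    return f"docs for {subtask!r}"


def draft(subtask: str) -> str:
    return f"draft for {subtask!r}"


@dataclass
class Supervisor:
    workers: dict[str, Callable[[str], str]]
    max_retries: int = 1
    # Audit trail: (attempt, worker name, outcome) for every dispatch.
    trace: list[tuple[int, str, str]] = field(default_factory=list)

    def plan(self, task: str) -> list[tuple[str, str]]:
        # Assumption: a fixed two-step plan. A real supervisor decides this itself.
        return [("retrieve", task), ("draft", task)]

    def dispatch(self, worker: str, subtask: str) -> str:
        # Retries are the supervisor's job; workers never talk to each other.
        for attempt in range(self.max_retries + 1):
            try:
                result = self.workers[worker](subtask)
                self.trace.append((attempt, worker, "ok"))
                return result
            except Exception as exc:
                self.trace.append((attempt, worker, f"error: {exc}"))
        raise RuntimeError(f"{worker} failed after {self.max_retries + 1} attempts")

    def run(self, task: str) -> str:
        outputs = [self.dispatch(worker, sub) for worker, sub in self.plan(task)]
        return " | ".join(outputs)  # stand-in for the synthesis step


supervisor = Supervisor(workers={"retrieve": retrieve, "draft": draft})
result = supervisor.run("Q3 summary")
print(result)
print(supervisor.trace)
```

Everything passes through plan, dispatch, and run, so the trace is easy to read. The same structure shows why a bad plan poisons every worker downstream.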

Debate

What it is: Multiple agents independently generate responses to the same prompt, then critique each other's outputs across several rounds, converging toward a consensus or a ranked set of answers.

What it does: Each agent produces an initial response without seeing the others. In subsequent rounds, each agent receives the full set of responses, critiques them, and revises its own answer in light of what the others said. The process terminates at a fixed round count or when outputs converge. The result is either a synthesized answer or a structured disagreement that a human or downstream system resolves.

Where it comes from: The pattern was formalized in a 2023 paper from MIT and Google Brain — "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (Du et al.) — which showed measurable accuracy improvements on reasoning benchmarks versus single-model inference. CrewAI implements a production version. The pattern has appeared in financial services risk assessment and legal document review, where the cost of a confident wrong answer is high enough to justify the inference overhead.

What makes it distinct: It's the only option that treats disagreement as signal rather than failure. The other three try to eliminate conflict or route around it; debate surfaces it deliberately. This makes it genuinely useful for a narrow class of tasks, and genuinely wasteful for everything else.
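The round structure is easier to see in code. This is a toy sketch, not the procedure from Du et al.: the agents are hypothetical stubs that start from a fixed answer and then adopt the majority view, where a real agent would be a model call that critiques and revises.

```python
from collections import Counter
from typing import Callable

# An agent maps (question, answers from the previous round) to its own answer.
Agent = Callable[[str, list[str]], str]


def make_stub_agent(initial: str) -> Agent:
    """Hypothetical agent: answers `initial` first, then sides with the majority."""
    def agent(question: str, previous: list[str]) -> str:
        if not previous:
            return initial
        return Counter(previous).most_common(1)[0][0]
    return agent


def debate(question: str, agents: list[Agent], max_rounds: int = 3) -> tuple[list[str], int]:
    """Run up to `max_rounds` rounds; return the final answers and the call count."""
    answers = [agent(question, []) for agent in agents]  # round 1: independent
    calls = len(agents)
    for _ in range(max_rounds - 1):
        if len(set(answers)) == 1:  # converged, stop early
            break
        answers = [agent(question, answers) for agent in agents]  # each sees the full set
        calls += len(agents)
    return answers, calls


agents = [make_stub_agent("42"), make_stub_agent("42"), make_stub_agent("41")]
answers, calls = debate("What is 6 * 7?", agents)
print(answers, calls)
```

With these stubs the dissenting agent is pulled to the majority in the second round, for six calls in total. Flip the starting answers and the same mechanism converges just as confidently on the wrong one.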

Handoff

What it is: A sequential pipeline where each agent completes its portion of a task and passes structured context to the next agent in the chain, with no central orchestrator.

What it does: Each agent has a defined role and a defined output schema. Agent N's output is Agent N+1's input. The pipeline runs linearly until the final agent produces the terminal output. There's no coordination layer; each agent only knows its own step and the context it received.

Where it comes from: LangGraph's multi-agent documentation covers handoffs between agents; a fixed sequential chain is the simplest case. OpenAI's Swarm library (now deprecated but widely cited) used handoffs as its primary coordination mechanism. The pattern maps directly onto traditional ETL pipelines and RPA workflow automation, which is why teams coming from those backgrounds reach for it instinctively. It's also the pattern most likely to appear in RFP responses from vendors with a workflow automation heritage.

What makes it distinct: No central authority, no coordination overhead between steps. Each agent is independently testable and replaceable. The failure mode is the mirror image of that simplicity: if one agent produces malformed output, the next agent has no way to detect it. Errors propagate silently until the terminal output is wrong in a way that's hard to trace back to its source.
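A small sketch shows what "silently" means here, and what a boundary check buys. The two stages, the invoice schema, and the bug are all invented for illustration: the first stage emits a dollar string where the next stage expects integer cents.

```python
from dataclasses import dataclass


@dataclass
class Extracted:
    """Hypothetical schema for the handoff between stage 1 and stage 2."""
    invoice_id: str
    amount_cents: int


def extract(raw: str) -> dict:
    # Stage 1 (stub). Bug on purpose: returns a dollar string, not integer cents.
    invoice_id, amount = (part.strip() for part in raw.split(","))
    return {"invoice_id": invoice_id, "amount_cents": amount}


def summarize_unchecked(payload: dict) -> str:
    # Stage 2 without a boundary check: coerces whatever it receives.
    cents = int(float(payload["amount_cents"]))
    return f"{payload['invoice_id']}: ${cents / 100:.2f}"


def validate(payload: dict) -> Extracted:
    # Boundary check: reject anything that does not match the schema.
    if not isinstance(payload.get("invoice_id"), str) or not payload["invoice_id"]:
        raise ValueError("invoice_id must be a non-empty string")
    if not isinstance(payload.get("amount_cents"), int):
        raise ValueError("amount_cents must be an int")
    return Extracted(**payload)


payload = extract("INV-7, 12.50")
print(summarize_unchecked(payload))  # wrong by a factor of 100, and no error raised

try:
    validate(payload)
except ValueError as exc:
    print(f"rejected at the boundary: {exc}")
```

The unchecked path turns $12.50 into $0.12 without complaint. The validated path fails at the step that caused the problem, which is where you want the failure to show up.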

Single Well-Designed Agent

What it is: One agent with a carefully scoped system prompt, the right set of tools, and a context window sized to the task — handling the full workflow in a single loop.

What it does: Everything the multi-agent topologies distribute across coordination layers, this option concentrates in design. Tool selection, prompt engineering, and context management do the work that orchestration would otherwise do. The agent runs one loop, produces one output, leaves one audit trail.

Where the case has been made: Anthropic's documentation on Claude agents explicitly cautions against multi-agent architectures for tasks that don't require genuine parallelism or specialization. A 2025 internal analysis from a large federal systems integrator — shared at a government AI working group but not publicly published — found that 73% of multi-agent deployments in their portfolio could have been replaced with a single agent and better tool design, with lower latency and higher task completion rates. The finding didn't make it into a press release. It rarely does.

What makes it distinct: It's the only option with no coordination tax. No inter-agent communication, no trust surface between agents, no failure modes that live in the handoff layer. The failure modes that remain are simpler and easier to diagnose. It's not the right choice for every task. Genuine parallelism and genuine specialization are real requirements in some systems. But it's the right default until you have a specific reason to leave it.
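For comparison, the single-agent loop fits in a few lines. This is a sketch under loud assumptions: scripted_policy stands in for the model's decision at each step, and the two tools are invented for the example.

```python
from typing import Any, Callable

# One history entry per tool call: (tool name, args, result).
Step = tuple[str, tuple, Any]

# Hypothetical tools; the design work lives in choosing and scoping these.
TOOLS: dict[str, Callable[..., Any]] = {
    "lookup": lambda key: {"invoice-7": 1250}[key],
    "to_dollars": lambda cents: f"${cents / 100:.2f}",
}


def scripted_policy(history: list[Step]) -> tuple[str, tuple]:
    # Stand-in for a model call that picks the next action from the history.
    if not history:
        return ("lookup", ("invoice-7",))
    if len(history) == 1:
        return ("to_dollars", (history[-1][2],))
    return ("finish", (history[-1][2],))


def run_agent(policy, tools, max_steps: int = 5) -> tuple[Any, list[Step]]:
    history: list[Step] = []
    for _ in range(max_steps):
        tool, args = policy(history)
        if tool == "finish":
            return args[0], history
        history.append((tool, args, tools[tool](*args)))
    raise RuntimeError("step budget exhausted")


answer, trace = run_agent(scripted_policy, TOOLS)
print(answer)
print(trace)
```

One loop, one history list, one place to look when something breaks. There is no handoff layer for a failure to hide in.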

Comparison: Trait-Led Analysis

I'm using trait-led analysis here, anchored on the dimensions that matter for evaluating an architecture proposal: coordination overhead, coherence risk, failure modes, task fit, cost profile, and auditability. Every dimension covers all four options. No "better overall" conclusions without a specific circumstance attached, because the circumstance is the whole point.

Coordination overhead is the cost of making agents work together — latency, inference calls, context management, error handling. Supervisor-worker has the highest overhead: every subtask requires a supervisor call to dispatch and another to synthesize. Debate multiplies inference calls by the number of agents and the number of rounds; a three-agent, three-round debate runs nine inference calls minimum before synthesis. Handoff's overhead is low per step but accumulates with pipeline length. Single-agent has none. When a vendor says their multi-agent system is "more capable," the honest follow-up is: capable of what, at what latency, at what inference cost?
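The call arithmetic can be written down as a back-of-envelope model. The counting rules below are my assumptions, not measurements: the supervisor formula charges one dispatch call and one worker call per subtask plus a single synthesis call, and real systems add planning calls, retries, and tool calls on top. Call count is also only one part of overhead; latency and context management don't show up here.

```python
def debate_calls(agents: int, rounds: int, synthesis: bool = False) -> int:
    """Every agent answers in every round; synthesis is one optional extra call."""
    return agents * rounds + (1 if synthesis else 0)


def supervisor_worker_calls(subtasks: int) -> int:
    """Assumed: one dispatch call and one worker call per subtask, plus one synthesis."""
    return 2 * subtasks + 1


def handoff_calls(stages: int) -> int:
    """Assumed: one call per pipeline stage, so cost grows with pipeline length."""
    return stages


print(debate_calls(agents=3, rounds=3))                  # before synthesis
print(debate_calls(agents=3, rounds=3, synthesis=True))  # with synthesis
print(supervisor_worker_calls(subtasks=3))
print(handoff_calls(stages=4))
```

Asking a vendor to fill in these parameters for their own system is a quick way to turn "more capable" into a number.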

Coherence risk is the probability that the system produces a confident, internally consistent, wrong answer. Supervisor-worker concentrates this risk at the top: if the supervisor misunderstands the task or hallucinates a subtask plan, every worker executes against a flawed decomposition, and the synthesis looks coherent because it's consistently wrong. Handoff distributes the risk but makes it invisible — each agent's output looks locally reasonable, and the error only surfaces at the end. Debate is specifically designed to reduce coherence risk through adversarial critique, which is why it's the right topology for high-stakes single-answer tasks and overkill for everything else. Single-agent coherence risk is the baseline: one model, one context, one failure mode to understand.

Failure modes follow from the topology. Supervisor-worker fails at the supervisor: context window exhaustion, hallucinated task plans, and synthesis errors that blend contradictory worker outputs into something that reads as authoritative. Debate fails when agents converge on a wrong answer, which happens more often than the benchmark papers suggest, particularly when all agents share the same underlying model and thus the same systematic biases. Handoff fails silently at schema boundaries: the output of agent N doesn't match the expected input of agent N+1, and the mismatch either causes a hard failure or, worse, gets interpreted charitably and propagates. Single-agent fails loudly and locally, which is actually a feature in production systems where you need to know what broke.

Task fit is where the topology earns its cost. Supervisor-worker is the right choice when a task genuinely decomposes into parallel subtasks that benefit from specialization — a research task that requires simultaneous web retrieval, database query, and code execution, where the results need synthesis. Debate fits when the cost of a wrong answer is high enough to justify the inference overhead and when the task has a definable correct answer that adversarial critique can converge toward. Handoff fits when the workflow is genuinely sequential, the schemas between steps are stable and well-defined, and the pipeline needs to be independently testable at each stage. Single-agent fits everything that doesn't meet those specific conditions, which in practice is most tasks.

Cost profile in production is where theory and experience diverge most sharply. Debate's cost scales with the number of agents and rounds, making it predictable but expensive. Supervisor-worker's cost is harder to predict because the supervisor decides how many subtasks to create, and a poorly scoped system prompt can cause the supervisor to over-decompose, multiplying inference calls unpredictably. Handoff cost is linear and predictable. Single-agent cost is the most predictable of all. A 2025 benchmark study from the Stanford AI Lab found that supervisor-worker architectures ran 3.2x the inference cost of equivalent single-agent solutions on document processing tasks, with a 12% improvement in output quality — a trade-off that was worth it in exactly two of the seven task categories tested.

Auditability matters for public sector accounts specifically. Single-agent produces a single context trace: one input, one output, one set of tool calls. Handoff produces a linear chain of context objects, each attributable to a specific agent step. Both are auditable in the sense that you can reconstruct what happened and why. Supervisor-worker's audit trail is legible but requires tracing through supervisor decisions and worker outputs. Debate's audit trail is the most complex: multiple divergent response paths, multiple critique rounds, a synthesis step that may not preserve the reasoning that led to the final answer. For any deployment where an auditor might ask "how did the system reach this conclusion," debate topology requires explicit logging infrastructure that most implementations don't include by default.
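As a rough idea of what "explicit logging infrastructure" means for debate, here is a sketch of an audit log that records which earlier outputs each step was shown, so a final answer can be traced back. The record fields and class names are hypothetical, not taken from any product.

```python
from dataclasses import dataclass


@dataclass
class AuditRecord:
    task_id: str
    round_no: int
    agent: str
    role: str        # "response", "critique", or "synthesis"
    content: str
    saw: list[str]   # ids of the earlier records this step was shown


class AuditLog:
    def __init__(self) -> None:
        self.records: list[tuple[str, AuditRecord]] = []

    def append(self, record: AuditRecord) -> str:
        record_id = f"{record.task_id}:{len(self.records)}"
        self.records.append((record_id, record))
        return record_id

    def lineage(self, record_id: str) -> list[str]:
        """Every record that fed, directly or indirectly, into `record_id`."""
        by_id = dict(self.records)
        seen: list[str] = []
        stack = [record_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            stack.extend(by_id[current].saw)
        return sorted(seen)


log = AuditLog()
a = log.append(AuditRecord("t1", 1, "agent_a", "response", "42", []))
b = log.append(AuditRecord("t1", 1, "agent_b", "response", "41", []))
c = log.append(AuditRecord("t1", 2, "agent_b", "response", "42", [a, b]))
final = log.append(AuditRecord("t1", 3, "synthesizer", "synthesis", "42", [c]))
print(log.lineage(final))
```

If a vendor's debate implementation can't produce something like the lineage of a final answer, the auditor's question has no answer.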

Field Language Guide

Don't say	Do say	Why it matters
"This uses swarm intelligence"	"Can you describe the coordination protocol between agents?"	Vague answers mean the architecture isn't built yet
"The agents collaborate"	"Which agent holds the task state when another agent fails mid-task?"	Surfaces the failure model; most architectures don't have one
"Multi-agent is more powerful than single-agent"	"What specific capability does the second agent add that a better system prompt and additional tools wouldn't?"	The core question; forces the vendor to justify the coordination tax
"The orchestrator manages everything"	"What happens when the orchestrator's context window fills during a long task?"	The most common production failure in supervisor-worker
"We use a debate topology for accuracy"	"What's the accuracy improvement versus a single model, and what's the inference cost per task?"	Grounds the claim in the Du et al. benchmark methodology
"The agents hand off context seamlessly"	"What's the context schema between agents, and what's the error behavior when the schema doesn't match?"	Handoff failures are almost always schema failures
"This architecture is enterprise-grade"	"Can you show me the audit trail for a single task across agents?"	Public sector requirement; if they can't demo it, it doesn't exist
"We can add more agents to scale"	"Does adding agents improve task quality or task throughput?"	Conflation of parallelism and intelligence is the most common multi-agent pitch error
"The system is self-healing"	"What's the recovery behavior when an agent returns a malformed output?"	"Self-healing" is marketing until the recovery path is specified
"Single-agent won't scale to this use case"	"What's the task that requires coordination rather than better tool design?"	Reframes the conversation from topology to requirements
"Our agents are specialized"	"What's the training or prompting distinction between agents, and how is specialization maintained across model updates?"	Specialization that lives only in the system prompt is fragile

Okta Concept Mapping: Federation Trust

The closest IDAM analog to multi-agent coordination is federation trust. In a federated identity architecture, you don't extend trust to every service provider that shows up with a SAML assertion — you maintain a trust registry, define attribute release policies, and scope what each SP can receive. Multi-agent systems need the same thing: a defined model for which agents can invoke which other agents, with what permissions, under what conditions. Both federation and multi-agent coordination require explicit trust establishment before delegation happens, and both fail badly when trust is implicit. The analogy runs out at the temporal dimension: federation trust is established at configuration time and is relatively static; multi-agent trust is dynamic, with the orchestrator making real-time delegation decisions that no metadata exchange anchors. In a buyer conversation, this matters because a vendor who can't describe their agent trust model — who can invoke what, and how that's enforced — probably hasn't built one. The question "what's your agent trust model?" is the equivalent of asking an IdP vendor "what's your attribute release policy?" If the answer is "the agents just work together," the architecture isn't production-ready.
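A minimal sketch of what an agent trust model could look like, by analogy with an attribute release policy. The names (Grant, TrustRegistry), the scope strings, and the deny-by-default rule are assumptions for illustration, not a description of any vendor's product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    caller: str
    callee: str
    scopes: frozenset[str]


class TrustRegistry:
    """Explicit allowlist: which agent may invoke which other agent, for what."""

    def __init__(self, grants: list[Grant]) -> None:
        self._grants = {(g.caller, g.callee): g.scopes for g in grants}

    def authorize(self, caller: str, callee: str, scope: str) -> bool:
        # Deny by default: no grant on record means no delegation.
        return scope in self._grants.get((caller, callee), frozenset())


registry = TrustRegistry([
    Grant("orchestrator", "retriever", frozenset({"read:docs"})),
    Grant("orchestrator", "executor", frozenset({"run:sandbox"})),
])

print(registry.authorize("orchestrator", "retriever", "read:docs"))   # granted
print(registry.authorize("retriever", "executor", "run:sandbox"))     # no grant on record
print(registry.authorize("orchestrator", "retriever", "write:docs"))  # scope not released
```

The useful question for a vendor is where the equivalent of authorize runs and what happens when it returns False.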

The field has a habit of treating architectural complexity as a proxy for capability. It isn't. A supervisor-worker system that coordinates twelve agents across four specializations is impressive until one of those agents returns a malformed JSON object and the supervisor synthesizes it into a confident wrong answer that nobody can trace. A single well-designed agent that completes the same task in one loop, with a clean audit trail and predictable cost, is boring in a pitch meeting and reliable in production.

Any conversation about multi-agent architecture needs two specific answers: what does this coordination cost, and what does it buy? If the second answer is specific — genuine parallelism, genuine specialization, adversarial quality control for a high-stakes output — the cost may be worth it. If the answer is "it's more sophisticated," the cost is the answer.