Three multi-agent topologies account for most of what you'll hear in buyer conversations about AI systems that coordinate multiple models: supervisor-worker, debate (multi-voice), and handoff/swarm. Supervisor-worker shows up in domain-routing RFPs where a buyer wants one system triaging tasks across specialized agents. Debate surfaces in governance and assurance discussions, especially when a CAIO needs to demonstrate that outputs get independently verified before they reach a human. Handoff/swarm appears in vendor pitches about autonomous orchestration, usually on the "fleet of agents" slide. The phrase that buys you credibility across all three conversations: "What's the coordination story?" Every topology has one. And every topology pays a coordination tax: in tokens, in coherence, in debuggability. The question for any buyer conversation is whether the task justifies the bill.
A vocabulary note before we go further. Three words in this piece mean something different than they do in your daily work. Token here means a unit of text the model processes and bills for — an auth token it is not. Agent means an LLM-driven system that takes autonomous actions, as distinct from a service account or user agent. Context means the information window the model reasons over, as distinct from a security context. Same words, different machinery. The collision is worth naming because buyers use these terms loosely, and knowing which definition is in play tells you whether the conversation is about identity or AI architecture.
One more distinction that earns its keep early. Anthropic's published framework for building with LLMs draws a line between workflows (deterministic orchestration where routing is fixed in code) and agents (systems where the model decides what to do next). This was contested as recently as 2024 but is now widely accepted. It matters here because some of what buyers describe as "multi-agent" is actually a workflow with fixed routing. If the routing logic is hardcoded, the coordination tax changes entirely. Ask.
Supervisor-Worker
What it is: A single orchestrator agent routes tasks to specialized worker agents and synthesizes their outputs.
What it does: The supervisor receives a request, decides which worker or workers should handle it, dispatches the task with relevant context, collects results, and assembles a final response. One brain decides who does what, then reconciles the answers.
Who's behind it / where it comes from: This is the default starting point in most enterprise agent frameworks. Anthropic's published workflow patterns include an orchestrator-workers pattern that maps closely to this topology. Practitioner consensus treats supervisor-worker as the baseline for multi-agent coordination. LangGraph, CrewAI, and most orchestration toolkits implement some version of it. It maps to how enterprises already think about task routing, which is why it gets adopted first.
What makes it distinct: The supervisor holds the full context. Every other topology distributes context across agents; this one centralizes it. That's its strength and its bill. A controlled scaling study across 260 configurations and six benchmarks measured roughly 285% token overhead for centralized architectures compared to a single agent performing the same work. The supervisor must ingest, reason over, and forward context that a single agent would simply hold in its window. You're paying for coordination with every call.
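The dispatch, collect, and synthesize loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any framework's API: the worker functions, the `KEYWORDS` table, and the `route` and `supervise` names are all hypothetical, and the workers are plain functions standing in for LLM-backed agents. Note that the routing here is a hardcoded keyword match, which by the workflow/agent distinction above makes this a workflow; in an agent system the model itself would make the routing call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical workers: plain functions standing in for LLM-backed agents.
def hr_worker(task: str) -> str:
    return f"[hr] handled: {task}"

def it_worker(task: str) -> str:
    return f"[it] handled: {task}"

WORKERS: Dict[str, Callable[[str], str]] = {"hr": hr_worker, "it": it_worker}

# Assumed routing table. Fixed in code, so this is "workflow" routing.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "hr": ("vacation", "payroll"),
    "it": ("laptop", "password"),
}

def route(request: str) -> List[str]:
    """Stand-in for the supervisor's decision about which workers to call."""
    text = request.lower()
    chosen = [name for name, words in KEYWORDS.items()
              if any(word in text for word in words)]
    return chosen or ["it"]  # arbitrary fallback when nothing matches

def supervise(request: str) -> str:
    """Dispatch the full request to each chosen worker, then merge results."""
    results = [WORKERS[name](request) for name in route(request)]
    return " | ".join(results)

print(supervise("Payroll question and a broken laptop"))
```

Notice that `supervise` forwards the entire request to every worker it selects. That forwarding is where the token overhead comes from: the same context is paid for once by the supervisor and again by each worker.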
Debate / Multi-Voice
What it is: Multiple agents independently reason about the same problem, then argue toward consensus through structured rounds of critique.
What it does: Each agent produces an answer. Agents then see each other's answers and reasoning, critique them, and revise their positions over one or more rounds. The final output is either a consensus answer or a judgment by a separate evaluator agent. The premise: adversarial pressure improves accuracy. Same logic behind red-teaming.
Who's behind it / where it comes from: Multi-agent debate emerged from academic AI safety research as a way to improve LLM reasoning without retraining. It shows up in governance-oriented architectures where a buyer needs to demonstrate that AI outputs were independently verified before action. More common in research settings and compliance-driven designs than in production enterprise deployments.
What makes it distinct: It's the only topology where agents are designed to disagree. Supervisor-worker and handoff/swarm both assume agents cooperate toward a shared goal. Debate assumes the output gets better when agents challenge each other. That adversarial structure is what makes it appealing for governance and assurance use cases, and it's also what gives it unique failure modes. (More on those failures in the comparison section below.)
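To make the round structure concrete, here is a small sketch of a debate loop. Everything in it is an assumption for illustration: the `Agent` signature, the `fixed` stub (an agent that never changes its answer), and the majority vote used in place of a separate evaluator agent. It shows the shape of the protocol and counts model calls; it does not model how real agents revise their positions.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical agent signature: (question, peer_answers) -> answer
Agent = Callable[[str, List[str]], str]

def fixed(answer: str) -> Agent:
    """Stub agent that always returns the same answer, whatever peers say."""
    def agent(question: str, peer_answers: List[str]) -> str:
        return answer
    return agent

def run_debate(question: str, agents: List[Agent], rounds: int) -> Tuple[str, int]:
    calls = 0
    # Phase 1: every agent answers independently (no peer answers visible).
    answers = []
    for agent in agents:
        answers.append(agent(question, []))
        calls += 1
    # Phase 2: each round, every agent sees the others' answers and may revise.
    for _ in range(rounds):
        revised = []
        for i, agent in enumerate(agents):
            peers = answers[:i] + answers[i + 1:]
            revised.append(agent(question, peers))
            calls += 1
        answers = revised
    # Phase 3: majority vote stands in for a separate evaluator agent.
    winner = Counter(answers).most_common(1)[0][0]
    return winner, calls

winner, calls = run_debate("Is the config safe?", [fixed("A"), fixed("B"), fixed("B")], rounds=2)
print(winner, calls)
```

With three agents and two critique rounds, the loop makes nine model calls (three initial answers plus three per round) to produce one answer. Each later call also carries the peers' answers in its input, so the cost per call grows as well.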
Handoff / Swarm
What it is: Agents pass control and context to each other sequentially, with each agent deciding when its part is done and which agent should take over.
What it does: Agent A handles the first phase of a task, then hands off to Agent B with a context payload (a summary of what happened and what's needed next). Agent B picks up, does its work, and either completes the task or hands off again. In more complex versions, agents can dynamically spawn new agents or route to any agent in the network, creating emergent workflows.
Who's behind it / where it comes from: OpenAI's original Swarm framework popularized this pattern before being deprecated in March 2025 and superseded by the Agents SDK, which retains the handoff model with added guardrails. The pattern also appears in Microsoft's AutoGen and in various "agent mesh" architectures from startups. It's the topology vendors reach for when they want to demonstrate autonomous, flexible orchestration.
What makes it distinct: No single agent holds the full picture. Context degrades at every handoff because each agent receives a compressed summary of what came before, not the full reasoning chain. The same scaling study that measured supervisor overhead found that independent multi-agent setups amplify errors 17.2x through unchecked propagation (meaning errors compound without correction), while centralized coordination contains error amplification to 4.4x. Swarm sits somewhere between those poles depending on how tightly handoffs are structured. The direction of the risk is clear.
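The control flow of a handoff chain can be sketched as follows. The agent names (`intake`, `research`, `writer`), the `AgentFn` signature, and the `max_hops` guard are all hypothetical, and each agent is a stub that returns a fixed next hop. The sketch shows only the structure: each agent receives a summary, does its part, and names its successor. No component ever sees the whole chain.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical agent signature: incoming summary -> (outgoing summary, next agent or None)
AgentFn = Callable[[str], Tuple[str, Optional[str]]]

def intake(summary: str) -> Tuple[str, Optional[str]]:
    # A real agent would write a compressed summary here; that is where loss occurs.
    return f"intake done; {summary}", "research"

def research(summary: str) -> Tuple[str, Optional[str]]:
    return f"research done; {summary}", "writer"

def writer(summary: str) -> Tuple[str, Optional[str]]:
    return f"draft done; {summary}", None  # None means the task is complete

AGENTS: Dict[str, AgentFn] = {"intake": intake, "research": research, "writer": writer}

def run_chain(start: str, request: str, max_hops: int = 10) -> Tuple[List[str], str]:
    """Follow handoffs until an agent returns None or the hop limit is reached."""
    current: Optional[str] = start
    summary = request
    path: List[str] = []
    while current is not None and len(path) < max_hops:
        path.append(current)
        summary, current = AGENTS[current](summary)
    return path, summary

path, final_summary = run_chain("intake", "refund request")
print(path)
print(final_summary)
```

The `max_hops` limit is the kind of guardrail worth asking a vendor about: without one, agents that route to each other can loop indefinitely, and nothing in the topology itself stops them.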
If you've explained how a PDP routes authentication requests to the right enforcement point based on context, you already have the model for supervisor-worker. The analog holds for routing logic but breaks because a PDP passes a decision (allow/deny) while a supervisor passes full problem context, which is where the token cost comes from. In buyer conversations, this mapping helps a CISO see the pattern without you teaching agent architecture from scratch.
Comparing Topologies by Buyer Conversation
I'm organizing this comparison by scenario mapping: the buyer context where each topology surfaces, which pattern fits, and what it costs. This structure serves you better than trait-by-trait comparison because you encounter these patterns in specific conversations, not in the abstract. A flat comparison grid would tell you which topology has lower token costs. Scenario mapping tells you which topology fits the conversation you're actually in.
Before the scenarios, one finding that frames all three: coordination gains plateau. The scaling study (arXiv:2512.08296, December 2025, co-presented by Google Research) found statistically significant diminishing returns (p<0.001) once a single agent already performs above roughly 45% on the task being measured. Below that threshold, adding coordination helps. Above it, you're paying the token and coherence tax for marginal improvement. So before you reach for a topology, the prior question: is this task hard enough that a single agent can't handle it?
"We need to route requests to specialized AI capabilities across domains."
Supervisor-worker territory. The buyer wants one entry point dispatching to specialized agents: one for HR queries, one for IT tickets, one for procurement. Works well for parallelizable tasks. The scaling study found centralized coordination improved performance by 80.9% on parallelizable tasks like financial reasoning. (That figure is from Google Research's blog summary of the same arXiv paper cited above.) But the 285% token overhead is real, and for sequential reasoning tasks, all multi-agent variants in the study degraded performance by 39–70% compared to a single agent. Worth asking: is the task actually parallelizable, or is the buyer breaking a sequential workflow into pieces to justify the architecture?
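A quick back-of-the-envelope calculation shows what that overhead does to a budget. The sketch below assumes "285% overhead" means 285% extra tokens on top of the single-agent baseline (so 3.85 times the baseline in total); the task volume, token count, and price per million tokens are made-up numbers for illustration, not vendor pricing.

```python
def centralized_tokens(single_agent_tokens: int, overhead_pct: float = 285.0) -> int:
    """Tokens per task for a centralized setup.

    Assumption: 'overhead' means extra tokens on top of the single-agent baseline.
    """
    return round(single_agent_tokens * (1 + overhead_pct / 100))

def monthly_cost(tokens_per_task: int, tasks_per_month: int,
                 usd_per_million_tokens: float) -> float:
    """Simple linear cost model: total tokens times price per token."""
    return tokens_per_task * tasks_per_month * usd_per_million_tokens / 1_000_000

# Hypothetical workload: 10,000 tokens per task, 50,000 tasks a month, $3 per million tokens.
baseline = monthly_cost(10_000, 50_000, 3.0)
centralized = monthly_cost(centralized_tokens(10_000), 50_000, 3.0)
print(f"single agent: ${baseline:,.0f}/month, supervisor-worker: ${centralized:,.0f}/month")
```

Under those assumptions the same workload moves from about $1,500 to about $5,775 a month. The useful buyer question is whether the accuracy gain on their task is worth that difference.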
"We need to demonstrate that AI outputs are independently verified before production."
Debate surfaces here, usually in governance or compliance contexts. Multiple agents checking each other's work sounds like the AI equivalent of separation of duties, which is exactly why it appeals. Token overhead scales with the number of debate rounds times the number of agents. And the independence may be less real than it appears. A 2025 study on debate conformity (Wu, Li, and Li; arXiv preprint, not yet peer-reviewed) using logic puzzles with verifiable ground truth found that under majority pressure, agents tend to conform to the group position rather than maintain a dissenting answer, even when the dissent is correct. The study did find genuine deliberation occurring, but only under demanding conditions: diverse model architectures and strong individual reasoning capability. If the models are similar (same family, same training data), "debate" can collapse into expensive agreement. Ask the buyer: what model diversity exists across the debating agents, and has anyone tested whether the debate actually changes outcomes versus a single agent with a self-critique prompt?
"We want agents that autonomously handle complex, multi-step workflows end to end."
Swarm/handoff territory, and where vendor pitches get most ambitious. Flexible, emergent workflows that adapt to novel situations. That's the promise. Every handoff is a potential drift point, context degrades cumulatively, and debugging a failure across a chain of autonomous handoffs is genuinely hard. That's the reality. The MAST study (peer-reviewed, NeurIPS 2025) analyzed 1,642 execution traces across seven multi-agent frameworks and found failure rates between 41% and 86.7%, with 32.3% of failures attributable to inter-agent misalignment: agents proceeding on wrong assumptions, ignoring each other's input, derailing tasks. Those numbers come from open-source research frameworks (MetaGPT, ChatDev, AG2, and others), and the rates may differ in vendor deployments. But the failure modes are baked into the architecture. The paper's authors argue that better base models won't fix them. Many failures trace to organizational design and coordination challenges.
Gartner's June 2025 projection that over 40% of agentic AI projects will be canceled by end of 2027 cites escalating costs and unclear business value as primary drivers. That projection is an analyst forecast (attributed to Anushree Verma, Senior Director Analyst) with no disclosed methodology. But it circulates in procurement conversations, and your buyer may reference it. Better to know the source and its limits than to be caught flat-footed.
A swarm handoff is structurally similar to a federated auth chain: context (like a SAML assertion) gets passed from one trust domain to the next. The analog holds for trust boundaries — each agent must decide whether to trust what it received — but breaks because SAML assertions have standardized format and signature verification, while agent handoffs pass natural language with no schema enforcement. When a buyer describes a handoff architecture, ask what the "assertion format" is between agents; if the answer is "the agent just passes a summary," that's where drift enters.
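If a buyer asks what an "assertion format" between agents might look like, a typed payload with a validation step is one way to picture it. The sketch below is hypothetical throughout: the `Handoff` fields, the `validate` function, and the agent names are illustrative and not drawn from any framework. It checks structure only. Unlike a signed SAML assertion, nothing here proves who wrote the summary or that the summary is accurate.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass(frozen=True)
class Handoff:
    """Hypothetical structured payload passed from one agent to the next."""
    task_id: str
    from_agent: str
    to_agent: str
    summary: str
    open_questions: List[str] = field(default_factory=list)

def validate(handoff: Handoff, known_agents: Set[str]) -> List[str]:
    """Return a list of structural problems; an empty list means the payload passes."""
    problems: List[str] = []
    if not handoff.task_id:
        problems.append("missing task_id")
    if handoff.from_agent not in known_agents:
        problems.append(f"unknown sender: {handoff.from_agent}")
    if handoff.to_agent not in known_agents:
        problems.append(f"unknown recipient: {handoff.to_agent}")
    if not handoff.summary.strip():
        problems.append("empty summary")
    return problems

agents = {"intake", "research", "writer"}
good = Handoff("T-1", "intake", "research", "Customer wants a refund; order is verified.")
bad = Handoff("", "intake", "ghost", "   ")
print(validate(good, agents))
print(validate(bad, agents))
```

A schema like this catches malformed or misrouted handoffs, which makes failures easier to debug. It does not catch a well-formed summary that quietly drops the detail the next agent needed, so drift can still enter through the `summary` field.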
Debate borrows the same logic as MFA — multiple independent signals must converge before a decision is trusted — which helps a buyer see why debate appeals in governance contexts. It breaks in an important place: MFA factors are genuinely independent (a fingerprint doesn't know what the password was), while debate agents share context and influence each other, which is exactly where the conformity research bites. If a buyer says "our agents check each other like multi-factor," that's your opening to ask whether the agents are actually independent or just expensively agreeing.
How to Say This in the Field
| Don't say | Do say | Why it matters |
|---|---|---|
| "Multi-agent is the future of AI architecture." | "Multi-agent solves specific problems. What's the task that a single agent can't handle?" | Forces the buyer to justify complexity before you're defending it. |
| "Supervisor-worker is the best pattern." | "Supervisor-worker fits when the task is parallelizable. For sequential work, research shows it can actually degrade performance." | Prevents you from recommending a pattern that fails on their actual workflow. |
| "Debate ensures AI outputs are accurate." | "Debate improves accuracy under specific conditions: diverse models, strong individual reasoning. Without those, research shows agents tend to agree rather than genuinely challenge each other." | Positions you as someone who's read the research. |
| "Swarm agents handle complex workflows autonomously." | "Swarm architectures are flexible, but context degrades at every handoff. The question is how they maintain coherence across the chain." | Gives the buyer a concrete thing to evaluate in their vendor's demo. |
| "AI agents are like human teams." | "Agent coordination has real costs in tokens, coherence, and debuggability. More agents doesn't mean better outcomes past a certain point." | Breaks the intuition that "more agents = more capability" before it drives a bad architecture decision. |
| "We can integrate with any multi-agent framework." | "Which coordination pattern are you using? That determines where identity and access controls need to sit." | Shows you understand the topology behind the buzzword. |
| "Gartner says 40% of these projects will fail." | "Gartner projects over 40% of agentic AI projects will be canceled by 2027, mostly due to cost and unclear value. That's an analyst forecast, not a study, but it tracks with what the research on coordination overhead shows." | Uses the Gartner number without over-relying on it. Buyers respect honest source annotation. |
| "The research says multi-agent doesn't work." | "The research says coordination gains plateau once a single agent is already performing well on the task. Below that threshold, multi-agent helps. Above it, you're paying overhead for marginal improvement." | Precise enough to survive a follow-up question from a technical buyer. |
| "What AI tools are you evaluating?" | "What's the coordination story between your agents? Who holds the full context, and what happens when a handoff fails?" | The single most useful question in any multi-agent conversation. Separates vendors with answers from vendors with slides. |
| "I'd need to bring in my SE for that." | "I can walk through the three main coordination patterns and where identity fits. If we need to go deeper on implementation specifics, I'll bring in my SE for that part." | Handle the architectural conversation yourself; bring the SE for implementation. That's the right split. |
Things to follow up on...
- MAST failure taxonomy paper: The NeurIPS 2025 study cataloging 14 distinct failure modes across 1,642 multi-agent traces is worth reading directly if you want to know exactly how agent coordination breaks down in practice.
- Google Research scaling study: The controlled comparison of 260 agent configurations across six benchmarks is the most rigorous source available on when multi-agent coordination helps versus when it just adds cost.
- Single agents matching multi-agent performance: A separate April 2026 study found that single-agent systems consistently match or outperform multi-agent systems on multi-hop reasoning when reasoning tokens are held constant, suggesting many reported multi-agent advantages are explained by unaccounted computation rather than architectural benefits.
- Natural language as coordination bottleneck: A June 2025 paper examining why AI agents communicate in human language finds that natural language inter-agent messaging creates semantic misalignment and information loss, contributing to the behavioral drift that compounds across handoffs.

