Multi-agent systems are overrated in demos, underrated in production.
The buzzy pitch: "swarms of specialized agents collaborate!" The reality: two well-scoped agents, one orchestrator and one specialist, handle most of what you need. Swarms collapse into spaghetti.
This module is 90 minutes of sober multi-agent practice: the patterns that work, the patterns that don't, and how to build a two-agent pipeline that solves a real problem.
- Three patterns that ship. Two that don't.
- A working orchestrator + specialist pipeline.
- Guardrails: when to collapse back into one agent.
Thinker.
Three multi-agent patterns that work:
- Orchestrator + specialists. One agent plans and routes. Others do specific tasks.
- Pipeline. Linear chain. Agent A produces, agent B critiques, agent C revises.
- Generator + judge. One agent produces candidates. Another selects.
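The three shapes differ mainly in control flow. A minimal sketch, assuming each agent is a plain Python callable (the function names and signatures here are illustrative, not from any SDK):

```python
from typing import Callable, Dict, List

Agent = Callable[[str], str]  # assumption: an agent maps text to text


def orchestrate(route: Callable[[str], str],
                specialists: Dict[str, Agent], task: str) -> str:
    """Orchestrator + specialists: one agent picks who handles the task."""
    return specialists[route(task)](task)


def pipeline(stages: List[Agent], task: str) -> str:
    """Pipeline: each stage consumes the previous stage's output."""
    result = task
    for stage in stages:
        result = stage(result)
    return result


def generate_and_judge(generator: Callable[[str], List[str]],
                       judge: Callable[[List[str]], int], task: str) -> str:
    """Generator + judge: one agent proposes candidates, another picks one."""
    candidates = generator(task)
    return candidates[judge(candidates)]
```

In a real system each callable would wrap a model call; the control flow stays this small.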
Two that don't:
- The swarm. N agents negotiating. Fine in research papers, disaster in production. Non-deterministic, expensive, hard to debug.
- The debate. Two agents arguing until one "wins." Theoretically elegant; in practice it produces long traces and worse outputs than a single well-prompted agent.
The collapse rule
Before adding agent #2, ask: can one well-prompted agent do this with tools? Usually yes. Multi-agent is the answer when the single-agent approach consistently fails for a reason you can name.
Talker.
The orchestrator prompt
You are an orchestrator. You have access to these specialist
agents:
- researcher(query): returns factual research.
- drafter(brief): returns a draft of writing.
- critic(draft): returns critique of a draft.
- filer(content, path): saves output to a file.
Rules:
- Plan the steps in order before executing.
- Call the minimum number of specialists needed.
- Never call the same specialist twice with the same input.
- If a specialist returns an error, stop and report.
- Return the final output, not intermediate steps.
Task: [user request]
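Prompt rules are requests, not guarantees. Two of the rules above (no duplicate calls, stop on error) can also be enforced in code. A minimal sketch, assuming specialists are plain callables; `Dispatcher` and `SpecialistError` are hypothetical names:

```python
from typing import Callable, Dict, Set, Tuple


class SpecialistError(Exception):
    """Raised when a specialist fails or an orchestrator rule is broken."""


class Dispatcher:
    def __init__(self, specialists: Dict[str, Callable[[str], str]]):
        self.specialists = specialists
        self.seen: Set[Tuple[str, str]] = set()

    def call(self, name: str, payload: str) -> str:
        if name not in self.specialists:
            raise SpecialistError(f"unknown specialist: {name}")
        key = (name, payload)
        if key in self.seen:
            # Rule: never call the same specialist twice with the same input.
            raise SpecialistError(f"duplicate call: {name}")
        self.seen.add(key)
        try:
            return self.specialists[name](payload)
        except Exception as exc:
            # Rule: if a specialist returns an error, stop and report.
            raise SpecialistError(f"{name} failed: {exc}") from exc
```

The orchestrator model proposes calls; the dispatcher decides whether they are allowed.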
The specialist prompt
Each specialist has a tight contract. Input shape, output shape, nothing else. Specialists don't know they're in a pipeline.
You are [role]. Input: [shape]. Output: [shape]. Do exactly
this task, nothing more. If input is malformed, return an
error message.
Tight specialists compose well. Loose ones produce unpredictable pipelines.
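One way to keep contracts tight is to generate every specialist prompt from the same template. A small sketch; the `Contract` class is a hypothetical helper, and the example shapes are made up:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    role: str
    input_shape: str
    output_shape: str

    def prompt(self) -> str:
        # Mirrors the specialist prompt template above, field for field.
        return (
            f"You are {self.role}. Input: {self.input_shape}. "
            f"Output: {self.output_shape}. Do exactly this task, nothing more. "
            "If input is malformed, return an error message."
        )


critic_contract = Contract(
    role="a critic",
    input_shape="a draft memo as plain text",
    output_shape="exactly 3 issues, one per line",
)
```

Because the contract is data, the same object can drive both the prompt and any later output checks.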
Rememberer.
Multi-agent systems need shared state. Where it lives determines whether the system scales.
Two patterns
- Message-passing. Orchestrator holds state; specialists receive inputs, return outputs, stateless. Easiest to reason about.
- Shared blackboard. All agents read/write a common state store. More flexible, harder to debug.
Start with message-passing. Move to blackboard only when you have a specific reason.
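The difference is easiest to see side by side. A rough sketch, assuming agents are plain callables and state is a dict (both function names are illustrative):

```python
from typing import Callable, Dict, List


def run_message_passing(drafter: Callable[[str], str],
                        critic: Callable[[str], str],
                        brief: str) -> Dict[str, str]:
    # The orchestrator owns all state. Specialists only see their input
    # and return an output, so each call can be tested in isolation.
    state = {"brief": brief}
    state["draft"] = drafter(state["brief"])
    state["critique"] = critic(state["draft"])
    return state


def run_blackboard(agents: List[Callable[[Dict[str, str]], None]],
                   board: Dict[str, str]) -> Dict[str, str]:
    # Every agent reads and writes the same mutable store. Flexible, but
    # any agent can change any key, which is what makes it hard to debug.
    for agent in agents:
        agent(board)
    return board
```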
Traces
Every specialist call logs input, output, duration, cost. Traces for multi-agent systems are much longer than single-agent ones. Structured logs are the only way to debug.
[YYYY-MM-DD HH:MM] orchestrator: planning
[HH:MM] researcher("X"): returned 3 facts (1.2s)
[HH:MM] drafter(brief={...}): returned draft (3.4s)
[HH:MM] critic(draft): returned 3 issues (2.1s)
[HH:MM] drafter(brief, critique): returned v2 (3.1s)
[HH:MM] orchestrator: done. total 14.3s, $0.12
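A wrapper is usually enough to get this kind of trace. A minimal sketch that records each call as a dict and can dump JSON lines; the field names are assumptions, and cost is passed in as a fixed number here, where a real system would derive it from token usage:

```python
import json
import time
from typing import Any, Callable, Dict, List

TRACE: List[Dict[str, Any]] = []  # in-memory for the sketch


def traced(name: str, fn: Callable[[str], str],
           cost_usd: float = 0.0) -> Callable[[str], str]:
    """Wrap a specialist so every call logs input, output, duration, cost."""
    def wrapper(payload: str) -> str:
        start = time.perf_counter()
        output = fn(payload)
        TRACE.append({
            "agent": name,
            "input": payload,
            "output": output,
            "duration_s": round(time.perf_counter() - start, 3),
            "cost_usd": cost_usd,  # assumption: supplied by the caller
        })
        return output
    return wrapper


def dump_trace() -> str:
    """One JSON object per line, easy to grep or load later."""
    return "\n".join(json.dumps(entry) for entry in TRACE)
```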
Doer.
Twelve minutes. Ship a two-agent pipeline for a real task.
Step 1. Pick the task (2 min)
Pick something where you want higher quality than a single agent gives you. "Generate a draft memo, have it critiqued, revise once" is a good candidate.
Step 2. Build the specialists (5 min)
Drafter: takes a brief, returns a draft. Critic: takes a draft, returns 3 issues. Both are single-agent SDK calls (Module 016).
Step 3. Build the orchestrator (3 min)
Python script: call drafter, then critic, then drafter again with the critique as extra context. Return the v2 draft. Log every step.
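A sketch of that script, assuming a hypothetical `model` callable (prompt in, completion out) in place of whatever SDK call you built in Step 2. The prompt wording is illustrative:

```python
import logging
from typing import Callable, Dict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

Model = Callable[[str], str]  # hypothetical: prompt in, completion out


def drafter(model: Model, brief: str, critique: str = "") -> str:
    prompt = f"Write a memo for this brief:\n{brief}"
    if critique:
        prompt += f"\n\nAddress this critique:\n{critique}"
    return model(prompt)


def critic(model: Model, draft: str) -> str:
    return model(f"List the 3 most serious issues with this draft:\n{draft}")


def run_pipeline(model: Model, brief: str) -> Dict[str, str]:
    v1 = drafter(model, brief)
    log.info("drafter: returned v1 (%d chars)", len(v1))
    critique = critic(model, v1)
    log.info("critic: returned critique (%d chars)", len(critique))
    v2 = drafter(model, brief, critique)
    log.info("drafter: returned v2 (%d chars)", len(v2))
    # Keep v1 and the critique so you can read all three in Step 4.
    return {"v1": v1, "critique": critique, "v2": v2}
```

Passing the model in as a parameter makes it easy to swap in a stub while you wire things up.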
Step 4. Run on real input (2 min)
Give it an actual brief. Read v1, read the critique, read v2. Which do you prefer?
Step 5. Decide: did this beat single-agent?
Run the same brief through a single well-prompted agent. Compare. If the pipeline is clearly better, keep it. If not, collapse back.
A working pipeline. An honest comparison with single-agent. A decision based on quality, not novelty.
- Pipeline output worse than single-agent: your critic is weak. Sharpen it with explicit rubrics.
- Pipeline costs 3x with no quality gain: collapse. Most work doesn't need multi-agent.
- Pipeline is right sometimes, wrong sometimes: non-determinism. Lower the sampling temperature. Add retries only at the specialist level, not the orchestrator.
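Retrying at the specialist level means wrapping the individual call, so a flaky step is repeated without rerunning the whole pipeline. A minimal sketch; `with_retries` is a hypothetical helper and the defaults are arbitrary:

```python
import time
from typing import Callable


def with_retries(fn: Callable[[str], str], attempts: int = 3,
                 delay_s: float = 0.0) -> Callable[[str], str]:
    """Retry one specialist call; the orchestrator itself never retries."""
    def wrapper(payload: str) -> str:
        last_error = None
        for attempt in range(attempts):
            try:
                return fn(payload)
            except Exception as exc:
                last_error = exc
                if attempt < attempts - 1:
                    time.sleep(delay_s)
        raise RuntimeError(f"failed after {attempts} attempts") from last_error
    return wrapper
```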
Rookie.
Failure 1. Agent count inflation
You start with 2 agents. You add a 3rd "for research." Then a 4th "for formatting." By the time you have 6, debugging is impossible and quality is worse than when you had 2.
Fix: each new agent must demonstrably improve an output the others couldn't. If you can't measure the lift, don't add.
Failure 2. Loose specialists
Your "researcher" agent sometimes drafts content. Your "critic" sometimes researches. Roles blur. Output is unpredictable.
Fix: tight contracts. Input shape, output shape, role. Enforce in code; reject outputs that don't match the expected shape.
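Enforcing a contract in code can be as small as a parser that refuses anything off-shape. A sketch for the critic, assuming it has been asked to reply with a JSON array of exactly 3 issue strings (that output format is an assumption, not something the module prescribes):

```python
import json
from typing import List


def parse_critique(raw: str, expected: int = 3) -> List[str]:
    """Return the issues, or raise ValueError if the shape is wrong."""
    try:
        issues = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("critic output is not valid JSON") from exc
    if not isinstance(issues, list) or len(issues) != expected:
        raise ValueError(f"expected a list of {expected} issues")
    if not all(isinstance(item, str) and item.strip() for item in issues):
        raise ValueError("every issue must be a non-empty string")
    return issues
```

A rejected output is a clear signal in the trace, which beats a blurred role silently leaking downstream.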
Failure 3. Debugging without traces
Something in the pipeline goes wrong. You don't know where. You add print statements. You rerun. You add more print statements. Ten iterations later, you find the bug.
Fix: structured logs from day one. Every specialist call logs in, out, duration, cost. Grep finds the bug in seconds.
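With one JSON object per call, finding the failing or slow step is a filter, not a hunt. A sketch assuming JSON-lines logs with `agent`, `duration_s`, and `error` fields (the field names are assumptions):

```python
import json
from typing import Any, Dict, Iterable, List


def load_records(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse JSON-lines log output, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def failed_calls(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Calls whose 'error' field is set to something truthy."""
    return [r for r in records if r.get("error")]


def slowest_call(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """The single call with the largest recorded duration."""
    return max(records, key=lambda r: r.get("duration_s", 0.0))
```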
Manager.
One repo, one orchestrator
Each multi-agent system lives in its own repo. One orchestrator file. N specialist files. Named owner. Shared with the team when it's proven.
Specialist reuse
A good specialist is reusable. Your "critic" agent should be callable from the drafting pipeline, the coding pipeline, and the spec review pipeline. Package specialists as shared libraries, not one-offs per system.
Eval suites matter more, not less
Multi-agent systems have more failure modes than single-agent. The eval suite (Module 006) is non-negotiable. Run it against the full pipeline, not just individual specialists.
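Running evals against the full pipeline means treating the whole system as one callable and checking only its final output. A minimal sketch; the `Case` shape and `pass_rate` are hypothetical and stand in for whatever your eval suite already uses:

```python
from typing import Callable, List, Tuple

# (input brief, check applied to the system's final output)
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(system: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases whose final output passes its check."""
    if not cases:
        raise ValueError("no eval cases")
    passed = sum(1 for brief, check in cases if check(system(brief)))
    return passed / len(cases)
```

The same function scores a single-agent baseline, so the two numbers are directly comparable.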
Chief.
Risk 1. Cost multiplication
A two-agent pipeline costs roughly 2-3x a single-agent call; three agents, 4-5x. That doesn't have to hurt if the quality justifies it, but it does need to be planned.
Governance: cost-per-outcome metric (Module 023) applies. Multi-agent systems justify themselves or get simplified.
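One plausible reading of cost-per-outcome is total spend divided by the number of outputs that were actually accepted. A sketch under that assumption; the definition and the numbers in the usage note are illustrative:

```python
from typing import Sequence


def cost_per_outcome(call_costs_usd: Sequence[float],
                     successful_outcomes: int) -> float:
    """Total spend across all calls divided by accepted outcomes."""
    if successful_outcomes <= 0:
        raise ValueError("need at least one successful outcome")
    return sum(call_costs_usd) / successful_outcomes
```

With made-up numbers: 100 single-agent runs at $0.04 with 80 accepted come to about $0.05 per outcome, while 100 pipeline runs at $0.12 with 96 accepted come to about $0.125. On those numbers the pipeline has to justify more than double the cost per accepted output.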
Risk 2. Latency
Multi-agent is slower. Sequential specialists add up. User-facing pipelines taking 20+ seconds lose users.
Governance: latency budget per pipeline. If the system can't meet it, collapse to single-agent or move work offline (background jobs).
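A latency budget is easy to check if the whole pipeline is one callable. A rough sketch that measures wall-clock time and reports whether the run stayed inside the budget (`within_budget` is a hypothetical helper; it measures after the fact and does not cancel a slow run):

```python
import time
from typing import Callable, Tuple


def within_budget(system: Callable[[str], str], payload: str,
                  budget_s: float) -> Tuple[str, float, bool]:
    """Run the system once; return (output, elapsed seconds, met budget?)."""
    start = time.perf_counter()
    output = system(payload)
    elapsed = time.perf_counter() - start
    return output, elapsed, elapsed <= budget_s
```

Tracking the third value over many runs shows whether a pipeline belongs in the request path or in a background job.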
Risk 3. Debugging at scale
In production, a multi-agent pipeline that fails is harder to debug than a single-agent app that produces the same outcome. The engineering burden is real.
Governance: only use multi-agent in production when you have the engineering capacity to operate it. Otherwise, it's a science project.
Founder.
Solo founder: use multi-agent exactly when single-agent isn't enough, and not a moment sooner.
The solo multi-agent kit
- One orchestrator. One or two specialists. Never more.
- Always start with single-agent. Prove you need the second.
- Message-passing, not blackboard.
- Logs that a human can read in 30 seconds.
When it actually pays off
Writing pipelines (drafter + critic + reviser from Module 012). Research pipelines (researcher + writer + fact-checker). Quality-critical tasks where the second pass catches real issues.
When it doesn't
Everything else. Classification, summarization, extraction, drafting of low-stakes content. Single agents with good prompts win.
Two agents sometimes, many agents almost never.
The path from 1 to 2 agents is justified by measurable quality gain. The path from 2 to 5 is justified by nothing. Resist the swarm fantasy. Ship two well-scoped agents that compose cleanly.