Chapter 15. Sub-agents

Previously: permissions, trust labels, sandbox interfaces. One well-governed agent can do a lot. Some tasks, though, decompose better into parallel or specialized work — and that's the case for sub-agents.

Two countervailing results bound the design space. Anthropic's June 2025 "How We Built Our Multi-Agent Research System" reported that a multi-agent setup (Opus 4 orchestrator + Sonnet 4 sub-agents) outperformed single-agent baselines by over 90%, with the gain correlating strongly with token usage and the ability to distribute reasoning across independent context windows. On the other side, Towards Data Science's 2025 "The Multi-Agent Trap" — focused on systems with low per-step accuracy — showed that naive multi-agent decomposition compounds error rates: three 85% agents in series give you about 61% end-to-end. Cemri et al.'s 2025 MAST study (cited in Chapter 1 for the "why harnesses are hard" discussion) backed that second finding empirically at scale, tracing 36.9% of observed multi-agent failures to coordination breakdowns rather than individual-agent errors. The research arc around formal multi-agent frameworks — Hong et al.'s 2024 ICLR paper "MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework" is the most-cited — shows that both outcomes are reachable from the same primitive; the design choices determine which.

All of this is reconcilable once you name the determining factor: what the sub-agent decomposition is for. Parallel research over independent sub-questions is a big win from multi-agent, because each sub-agent's context stays narrowly focused. A multi-step chain where agent B consumes agent A's output is a loss, because errors accumulate. Rewrite that chain as a single agent with structured state, and it works better.

This chapter builds the sub-agent primitive and the guardrails that keep it from becoming a trap. Three specific constraints, consciously chosen:

One level deep. Sub-agents cannot spawn sub-agents. Same choice as Claude Code, same reasoning — nested delegation compounds failure rates fast.
Bounded spawning. A per-session budget caps how many sub-agents can run, and every spawn carries a justification string for audit.
Compact results. Sub-agents return structured summaries, not full transcripts. The parent's context inflates by what the sub-agent concluded, not by the full trace.

Parent context

plan · synth · summaries

spec →

← summary

sub-agent A · fresh ctx

sub-agent B · fresh ctx

sub-agent C · fresh ctx

Parent sees compact summaries, never the full child transcripts.

15.1 The Sub-agent Contract

A sub-agent is itself an agent — same loop, same tools, same harness. What distinguishes it is how the parent invokes it and what comes back:

# src/harness/subagents/subagent.py
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SubagentSpec:
    """What a sub-agent is told to do."""
    objective: str               # the specific task, operationally specific
    output_format: str           # how the result should be structured
    tools_allowed: list[str]     # tool names available to the sub-agent
    max_iterations: int = 20
    max_tokens: int = 50_000     # hard context budget
    system_override: str | None = None  # override parent's system prompt


@dataclass
class SubagentResult:
    """What a sub-agent returns to its parent."""
    summary: str                 # the synthesized answer
    tokens_used: int
    iterations_used: int
    error: str | None = None     # non-null if sub-agent failed

The key design choice is in SubagentResult. The sub-agent's full transcript doesn't come back. Only its summary does — a final text output that the parent inserts into its own context as one message. A 40-turn sub-agent run returns as ~500 tokens in the parent's context, not 50,000.

This is the pattern Anthropic's multi-agent research system documents: compact, structured summaries from sub-agents are what make multi-agent work at scale. A parent that received a full transcript from each sub-agent would suffer the context explosion — exactly the O(n×m) problem AutoGen's GroupChat hits.

15.2 The Spawner

# src/harness/subagents/spawner.py
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field

from ..agent import arun
from ..context.accountant import ContextAccountant, ContextBudget
from ..providers.base import Provider
from ..tools.selector import ToolCatalog
from .subagent import SubagentResult, SubagentSpec


class SubagentBudgetExceeded(Exception):
    pass


@dataclass
class SubagentSpawner:
    provider: Provider
    catalog: ToolCatalog
    max_subagents_per_session: int = 5
    _spawn_count: int = field(default=0, init=False)

    async def spawn(
        self,
        spec: SubagentSpec,
        parent_scratchpad_root: str | None = None,
        justification: str = "",
    ) -> SubagentResult:
        if self._spawn_count >= self.max_subagents_per_session:
            raise SubagentBudgetExceeded(
                f"spawn budget of {self.max_subagents_per_session} exhausted"
            )
        self._spawn_count += 1

        # restrict the catalog to the tools the sub-agent is allowed
        allowed = [t for t in self.catalog.tools if t.name in spec.tools_allowed]
        sub_catalog = ToolCatalog(tools=allowed)

        # constrain context budget
        budget = ContextBudget(window_size=spec.max_tokens)
        accountant = ContextAccountant(budget=budget)

        system = spec.system_override or _default_subagent_system(spec)

        try:
            result = await arun(
                provider=self.provider,
                catalog=sub_catalog,
                user_message=(
                    f"Objective: {spec.objective}\n\n"
                    f"Return format: {spec.output_format}"
                ),
                system=system,
                accountant=accountant,
            )
            return SubagentResult(
                summary=result.summary,
                tokens_used=result.tokens_used,
                iterations_used=result.iterations_used,
            )
        except SubagentBudgetExceeded:
            raise  # let budget failures propagate to the caller
        except Exception as e:
            return SubagentResult(
                summary="", tokens_used=0, iterations_used=0, error=str(e),
            )


def _default_subagent_system(spec: SubagentSpec) -> str:
    header = f"""\
You are a sub-agent. You have one objective:

{spec.objective}

Return your answer in this format:
{spec.output_format}

You have the following tools available: {spec.tools_allowed}.
You have a maximum of {spec.max_iterations} iterations.
You have a maximum of {spec.max_tokens} tokens of context.
"""
    # Smaller / weaker models sometimes narrate tool calls in text instead of
    # actually dispatching them. When the spec allows tools at all, make the
    # mandatory-execute rule explicit — see the callout below this snippet.
    mandate = ""
    if spec.tools_allowed:
        mandate = (
            "\nYou MUST call at least one tool from your allowed list before\n"
            "producing your final answer. Do not describe what you would do — do it.\n"
            "Describing a tool call in prose without actually invoking it is a failure.\n"
        )
    footer = """
When you have completed the objective, produce your final answer in the
requested format. Do not continue working after you have the answer.
If you cannot complete the objective (missing data, scope unclear), say
so explicitly — do not fabricate.
"""
    return header + mandate + footer

A few pragmatic notes.

Sub-agents run in fresh contexts. The spawner creates a new Transcript implicitly via arun. The parent's context never touches the sub-agent's. This is the Anthropic finding — independent context windows are most of the multi-agent value.

Tool restriction via catalog filtering. The sub-agent only sees tools in tools_allowed. A researcher sub-agent might get search_docs, read_file_viewport, scratchpad_read. It does not get edit_lines, write_file, bash. Scope restriction is automatic and enforced at the tool level, not trusted to the sub-agent's system prompt.

The spawner owns the budget counter. A malicious sub-agent cannot spawn more sub-agents because sub-agents are one level deep by construction (we don't expose spawn as a tool they can call). But even a well-behaved parent can over-spawn; the manager caps it.

Sub-agents that narrate instead of execute. There's a third failure mode the spec contract does not catch by itself, and the "must call a tool" clause above is what guards against it: sub-agents that understand their objective, describe what bash commands would run, and return a confident-sounding narrative without ever dispatching a single tool call. iterations_used=1, no tool calls in the transcript, a summary full of plausible-looking numbers. Frontier models comply with implicit "use your tools" expectations; smaller local models (7B-class and below — think Gemma via Ollama) will sometimes hallucinate the execution and write a plan instead. The spec's tools_allowed, output_format, and justification are all present and correct — the contract held, but the model no-op'd the work. The fix is the explicit imperative in the system prompt above, paired with a check in your eval harness (Chapter 19) that counts sub-agent runs where iterations_used == 1 and tools_allowed was non-empty. That number should be zero; if it's not, your prompt isn't strong enough for the model you're running.

arun returns an AgentRunResult, not a bare string. Earlier chapters showed arun returning str for brevity; at the point where sub-agents come in, the parent needs real accounting for each sub-run — token cost, iteration count, the sub-agent's transcript if you want to log it. So arun is promoted here to return a small dataclass:

# src/harness/agent.py (add alongside arun)
from dataclasses import dataclass

from .messages import Transcript


@dataclass
class AgentRunResult:
    summary: str               # the final answer text (was arun's bare return)
    tokens_used: int           # input + output across all turns
    iterations_used: int       # how many turns the loop took
    transcript: Transcript     # full record — useful for logs / debugging

arun's signature changes from -> str to -> AgentRunResult. Everything downstream uses .summary where it used to use the raw return value. The spawner below reads .tokens_used and .iterations_used directly; Chapter 19's eval runner uses .transcript for tool-call recording; Chapter 20's budget enforcer uses .tokens_used for post-run accounting.

15.3 The Spawn Tool

The parent uses sub-agents via a tool, the same way it uses anything else. Because the spawner itself is async — it calls arun — the tool must be an async tool. Chapter 13 §13.3 extended Tool with an optional arun: Callable[..., Awaitable[str]] field and added an @async_tool decorator that wires async def functions into it. We use that here:

# src/harness/subagents/spawn_tool.py
from __future__ import annotations

from ..tools.base import Tool
from ..tools.decorator import async_tool
from .spawner import SubagentSpawner
from .subagent import SubagentSpec


def spawn_tool(spawner: SubagentSpawner) -> Tool:

    @async_tool(side_effects={"mutate"})  # conservative — sub-agents may do anything
    async def spawn_subagent(
        objective: str,
        output_format: str,
        tools_allowed: list[str],
        max_iterations: int = 15,
        justification: str = "",
    ) -> str:
        """Spawn a sub-agent to handle a delegated task.

        objective: the specific task for the sub-agent, operationally
                   specific ("read files X and Y and report their schemas"),
                   NOT vague ("look into the database").
        output_format: exact format the sub-agent should return its answer
                       in. Examples: "a JSON object with keys X and Y";
                       "a three-paragraph summary, each under 100 words".
        tools_allowed: names of tools the sub-agent is permitted to use.
                       Narrower is better; the sub-agent can't use tools
                       not in this list.
        max_iterations: hard cap on sub-agent turns. Default 15; reduce
                        for simple lookups.
        justification: one sentence explaining WHY a sub-agent is better
                       than handling this in-line. Required; if you can't
                       articulate why, don't spawn.

        Returns the sub-agent's summary, prefixed with its token cost.
        Side effects: depend on sub-agent's tools; pessimistically 'mutate'.
        """
        if not justification:
            return ("error: justification is required. If you cannot explain "
                    "why a sub-agent is better than inline handling, do not "
                    "spawn one.")
        if not tools_allowed:
            return ("error: tools_allowed must be non-empty. Specify which "
                    "tools the sub-agent needs.")

        spec = SubagentSpec(
            objective=objective,
            output_format=output_format,
            tools_allowed=tools_allowed,
            max_iterations=max_iterations,
        )
        result = await spawner.spawn(spec, justification=justification)
        if result.error:
            return f"sub-agent error: {result.error}"
        return (f"[sub-agent result; {result.tokens_used} tokens, "
                f"{result.iterations_used} iters]\n{result.summary}")

    return spawn_subagent

Three deliberate frictions.

Justification is required. An empty justification returns an error. This forces the model to articulate why it's spawning — the Multi-Agent Trap paper's finding that over-delegation happens because spawning feels like doing work is the failure mode this counters directly.

tools_allowed must be non-empty and specific. An unset tool list would give the sub-agent everything, which negates the blast-radius argument for sub-agents in the first place. Forcing the parent to list specific tools makes the scope contract explicit.

Output format is required. "Return your findings" produces rambling summaries; "Return a JSON object with keys files_found, bugs_identified, recommendations" produces structured output the parent can parse. The Anthropic research post was emphatic: the biggest multi-agent quality lever is precise output format specifications.

15.4 A Two-agent Scenario

# examples/ch15_research_and_report.py
import asyncio
from pathlib import Path

from harness.agent import arun
from harness.providers.anthropic import AnthropicProvider
from harness.subagents.spawn_tool import spawn_tool
from harness.subagents.spawner import SubagentSpawner
from harness.tools.selector import ToolCatalog, discovery_tool
from harness.tools.std import STANDARD_TOOLS


SYSTEM = """\
You are a research coordinator. You have a spawn_subagent tool to delegate
specific research tasks to sub-agents. A sub-agent is appropriate when:
- The subtask can be stated operationally in one sentence.
- The subtask uses a narrow set of tools.
- You want the sub-agent to return a structured summary you synthesize.

Do NOT use sub-agents for:
- Simple lookups you can do in-line.
- Multi-step chains where each step depends on the last — do those yourself.

For each sub-agent, provide a justification explaining WHY it's better than
handling the work in-line.
"""


async def main() -> None:
    provider = AnthropicProvider()
    catalog = ToolCatalog(tools=STANDARD_TOOLS)
    spawner = SubagentSpawner(provider=provider, catalog=catalog,
                              max_subagents_per_session=3)
    coordinator_catalog = ToolCatalog(
        tools=STANDARD_TOOLS + [spawn_tool(spawner), discovery_tool(catalog)]
    )

    await arun(
        provider=provider,
        catalog=coordinator_catalog,
        system=SYSTEM,
        user_message=(
            "Investigate this machine's package management setup. "
            "Spawn one sub-agent for each package manager likely installed "
            "(apt, brew, pip, npm). Each sub-agent should return: "
            "(1) whether the package manager is installed, "
            "(2) the version, "
            "(3) a count of installed packages. "
            "Then synthesize a one-paragraph summary."
        ),
    )


asyncio.run(main())

Run it. The coordinator spawns three or four sub-agents in sequence, each with a narrow tool list (["bash"] probably), and each returns a tiny structured result ("apt: installed, version X, N packages"). The coordinator synthesizes. Total context in the parent: well under what a single-agent version of the same task would accumulate.

A note: this example is sequential — each spawn_subagent call blocks until the sub-agent finishes. Chapter 17 covers parallel spawning and the shared-state problems that introduces.

If you run this on a small local model (Gemma via Ollama, a 7B-class open model) and the sub-agents come back with paragraphs describing what brew list would print instead of what it actually printed, you're seeing the narrate-instead-of-execute mode from §15.2. iterations_used=1, no tool calls, a confident-sounding summary. The mandatory-tool clause in _default_subagent_system is what's meant to prevent this; if it still happens, your model needs the clause stronger, or your sub-agent objective needs to be operationally specific enough that "just describe" isn't a plausible interpretation. Frontier sub-agents (Opus, Sonnet) won't hit this; harness authors testing locally will, and it's worth watching for.

15.5 Agent-as-tool vs. Handoffs

The OpenAI Agents SDK distinguishes two multi-agent patterns: agent-as-tool (what we built — the parent calls a sub-agent, gets a result, continues) and handoff (control transfers permanently to the new agent; the original agent doesn't get control back).

We built agent-as-tool. Handoffs are rarely the right pattern for the harness cases this book cares about:

Handoffs are hard to observe — you don't have a parent that can summarize across delegations.
Handoffs make the control flow hard to reason about — the agent you're looking at isn't necessarily the one that started.
Handoffs can be simulated with agent-as-tool plus explicit return-from-delegate logic; the reverse is harder.

For workflows that feel like handoffs — a triage agent that routes to a specialist — agent-as-tool plus the coordinator pattern from 15.4 is usually clearer.

15.6 Permissions for Sub-agents

A sub-agent shares the parent's permission manager by default — any tool it calls goes through the same policy check. This matters: a sub-agent cannot escalate privilege by being a sub-agent. If the parent can't bash rm -rf /, neither can the sub-agent.

For some deployments, you want tighter sub-agent permissions. A research sub-agent might be fully read-only even though the parent has write permissions. The pattern:

spec = SubagentSpec(
    objective="...",
    output_format="...",
    tools_allowed=["search_docs", "read_file_viewport"],  # read-only subset
    ...
)

The spawner filters the catalog by tools_allowed; tools not in the list aren't even visible to the sub-agent. Combined with the permission manager, this gives you layered scoping: tool list for positive restriction, permission policy for negative enforcement.

15.7 Commit

git add -A && git commit -m "ch15: sub-agents — agent-as-tool with spawn budget"
git tag ch15-subagents

15.8 Try It Yourself

Measure the overhead. Run a task that can be done either in-line or by spawning one sub-agent. Compare total tokens, total latency, final output quality. Is the sub-agent ever a net win on a small task? When, if so?
Force the Trap. Rewrite the scenario so the parent's second sub-agent depends on the first's output. Does the compounding-error pattern from the Multi-Agent Trap paper show up? How often does the second sub-agent misread the first's summary?
Add a pre-spawn approval policy. Extend your permission manager so spawn_subagent triggers an ask decision every time. Run a long session. Does the prompt frequency match your intuition? Does the justification field tell you enough to decide well?

What you now understand

The harness can delegate. Sub-agents run in fresh contexts with narrow tool lists and bounded iteration budgets; parents call them via a tool that enforces justifications and output formats; results come back as compact summaries, not full transcripts. One level deep, bounded spawn count per session. The permission layer scopes what a sub-agent can do. The Multi-Agent Trap is mitigated by the friction we built into the spawn tool — empty justifications return errors, tools_allowed must be explicit, output_format is required.

What's still missing: sub-agents today run sequentially. Real research parallelism wants four sub-agents running at once, each pursuing a sub-question, returning when ready. That's coordination across concurrent agents writing to potentially shared state — the subject of Chapter 17. Before that, Chapter 16 fixes a different open gap: the agent's "plan" and its "completion criteria" have been implicit in prose all along. Making them structured objects unlocks the plan-execution consistency check that catches premature finalization.

Chapter 14. Sandboxing and Permissions

Previously: MCP lets any external tool server plug into the harness. The harness has also been running happily without any permission controls. This is the moment both facts become untenable.

Chapter 16. Structured Plans and Verified Completion

Previously: sub-agents with bounded delegation. A coordinator can now split work across sub-agents and synthesize. What it cannot yet do — and what single-agent runs also cannot — is verify that what it claims to have done, it actually did.