Previously: sub-agents with bounded delegation. A coordinator can now split work across sub-agents and synthesize. What it cannot yet do — and what single-agent runs also cannot — is verify that what it claims to have done, it actually did.
Two specific failure modes motivate this chapter.
Premature finalization. The agent says "task complete" after processing half the items. Galileo's production analysis named this as one of the top agent failure patterns — an agent asked to process all items in a list processes four of six and claims completion. The model's training rewards coherent-sounding completions; the agent cannot distinguish "said X" from "did X."
Plan-execution mismatch. The MAST paper (Cemri et al., 2025) identified reasoning-action mismatch as a distinct failure mode across 1,642 multi-agent traces. The agent's plan says "read A, then modify B." The action reads A and modifies C. The plan and the action are generated in separate forward passes; nothing connects them.
Both are addressable by making the plan a structured object the harness enforces rather than an unstructured string the agent produces and then may or may not follow. This is Kambhampati et al.'s 2024 ICML position paper "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks" spelled out in concrete code: the paper's core argument is that language models produce plausible plans but cannot reliably self-verify completion, and the right architecture pairs the model with an external verifier that decides when the model's work is actually done. The harness is the external verifier. This chapter introduces Plan, a dataclass with steps, preconditions, and postconditions. A step has to be marked complete with evidence; a plan has to have all postconditions satisfied before the agent can declare final. The harness checks, not the model.
# src/harness/plans/model.py
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from uuid import uuid4


class StepStatus(str, Enum):
    pending = "pending"
    in_progress = "in_progress"
    done = "done"
    blocked = "blocked"


@dataclass
class Step:
    id: str
    description: str
    status: StepStatus = StepStatus.pending
    evidence: str | None = None  # what proved the step done
    notes: str = ""

    def is_terminal(self) -> bool:
        return self.status in (StepStatus.done, StepStatus.blocked)


@dataclass
class Postcondition:
    description: str
    satisfied: bool = False
    evidence: str | None = None


@dataclass
class Plan:
    objective: str
    steps: list[Step] = field(default_factory=list)
    postconditions: list[Postcondition] = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    id: str = field(default_factory=lambda: str(uuid4()))

    def all_steps_terminal(self) -> bool:
        return all(s.is_terminal() for s in self.steps)

    def all_postconditions_satisfied(self) -> bool:
        return all(pc.satisfied for pc in self.postconditions)

    def is_ready_to_finalize(self) -> bool:
        return self.all_steps_terminal() and self.all_postconditions_satisfied()

    def to_render(self) -> str:
        """Render the plan as a string the model can read."""
        lines = [f"# Plan: {self.objective}\n"]
        lines.append("## Steps")
        for i, s in enumerate(self.steps, start=1):
            mark = {"pending": "[ ]", "in_progress": "[.]",
                    "done": "[x]", "blocked": "[!]"}[s.status.value]
            lines.append(f"{i}. {mark} {s.description}")
            if s.evidence:
                lines.append(f"   evidence: {s.evidence}")
            if s.notes:
                lines.append(f"   notes: {s.notes}")
        lines.append("\n## Postconditions")
        for i, pc in enumerate(self.postconditions, start=1):
            mark = "[x]" if pc.satisfied else "[ ]"
            lines.append(f"{i}. {mark} {pc.description}")
            if pc.evidence:
                lines.append(f"   evidence: {pc.evidence}")
        return "\n".join(lines)
Two distinctions worth naming.
Steps vs. postconditions. A step is what you do. A postcondition is what must be true at the end. They overlap, but not fully: a plan might have five steps and two postconditions, and both kinds matter. Steps let the agent track progress mid-task; postconditions let the harness check that outcomes actually landed.
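To make the split concrete, here is a toy plan using abbreviated stand-ins for the dataclasses above (trimmed so the snippet runs on its own; the rename task is invented for illustration):

```python
from dataclasses import dataclass, field

# Abbreviated stand-ins mirroring the chapter's model.py.
@dataclass
class Step:
    description: str
    done: bool = False

@dataclass
class Postcondition:
    description: str
    satisfied: bool = False

@dataclass
class Plan:
    objective: str
    steps: list[Step] = field(default_factory=list)
    postconditions: list[Postcondition] = field(default_factory=list)

    def is_ready_to_finalize(self) -> bool:
        return (all(s.done for s in self.steps)
                and all(pc.satisfied for pc in self.postconditions))

# Three steps describe the work; two postconditions describe the outcomes.
# They overlap, but neither list is derivable from the other.
plan = Plan(
    objective="Rename config key 'timeout' to 'timeout_s'",
    steps=[Step("grep for usages of 'timeout'"),
           Step("edit each usage to 'timeout_s'"),
           Step("run the test suite")],
    postconditions=[Postcondition("no references to the old key remain"),
                    Postcondition("test suite passes")],
)
print(plan.is_ready_to_finalize())  # False: nothing done, nothing verified
```

Finishing all three steps is not enough; the harness still waits for both postconditions to be verified.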
Evidence is a string, not a boolean. When the model marks a step done, it must provide evidence — "ran tests; all passed, see scratchpad key test-results"; "wrote file; confirmed with read_file_viewport at lines 47-52." Evidence gives the harness something to check later and gives debugging a paper trail. The harness doesn't parse the evidence; it only requires that it be non-empty. The model knows to put useful things there because the plan's render shows the evidence field.
The agent interacts with the plan via tools.
# src/harness/plans/tools.py
from __future__ import annotations

from dataclasses import dataclass

from ..tools.base import Tool
from ..tools.decorator import tool
from .model import Plan, Postcondition, Step, StepStatus


@dataclass
class PlanHolder:
    """Wraps a mutable Plan so tools can mutate it through a shared reference."""
    plan: Plan | None = None

    def require(self) -> Plan:
        if self.plan is None:
            raise RuntimeError("no active plan")
        return self.plan


def plan_tools(holder: PlanHolder) -> list[Tool]:
    @tool(side_effects={"write"})
    def plan_create(objective: str, steps: list[str],
                    postconditions: list[str]) -> str:
        """Create or replace the plan for this session.

        objective: one-sentence description of what you're trying to
            accomplish.
        steps: ordered list of step descriptions. Each is a specific
            actionable item, not a vague intent.
        postconditions: list of conditions that must be true when you
            declare the task complete. Examples: "file X exists and
            contains Y"; "tests in module Z pass".

        Call this once at the start of any non-trivial task, before
        beginning work. If the plan is wrong mid-task, call this again
        to replace it — the harness records the rewrite.
        """
        holder.plan = Plan(
            objective=objective,
            steps=[Step(id=f"s{i}", description=d) for i, d in enumerate(steps)],
            postconditions=[Postcondition(description=d) for d in postconditions],
        )
        return (f"plan created with {len(steps)} steps "
                f"and {len(postconditions)} postconditions")

    @tool(side_effects={"read"})
    def plan_show() -> str:
        """Display the current plan with its step and postcondition status.

        Use this any time you want to re-orient — especially after long
        sub-tasks or compaction. The plan is durable; compaction won't
        lose it.
        """
        return holder.require().to_render()

    @tool(side_effects={"write"})
    def step_update(step_number: int, status: str, evidence: str = "",
                    notes: str = "") -> str:
        """Update a step's status.

        step_number: 1-based index from `plan_show`.
        status: one of 'pending', 'in_progress', 'done', 'blocked'.
        evidence: required for 'done'. One-sentence proof of completion
            (reference to a tool result, a scratchpad key, etc.).
        notes: optional free text.
        """
        plan = holder.require()
        if step_number < 1 or step_number > len(plan.steps):
            return f"step_number {step_number} out of range (1..{len(plan.steps)})"
        try:
            new_status = StepStatus(status)
        except ValueError:
            return f"invalid status {status!r}; use pending/in_progress/done/blocked"
        if new_status == StepStatus.done and not evidence:
            return ("error: marking a step 'done' requires evidence. Describe "
                    "what proved the step complete (e.g., 'wrote file and "
                    "read it back; content matches').")
        s = plan.steps[step_number - 1]
        plan.steps[step_number - 1] = Step(
            id=s.id, description=s.description,
            status=new_status, evidence=evidence or None,
            notes=notes,
        )
        return f"step {step_number} → {status}"

    @tool(side_effects={"write"})
    def postcondition_verify(postcondition_number: int, evidence: str) -> str:
        """Mark a postcondition as verified.

        postcondition_number: 1-based index.
        evidence: required. Concrete proof the postcondition holds.

        This is what the harness checks before letting you declare the
        task complete. Do not verify a postcondition without evidence.
        """
        plan = holder.require()
        if postcondition_number < 1 or postcondition_number > len(plan.postconditions):
            return (f"postcondition_number {postcondition_number} out of range "
                    f"(1..{len(plan.postconditions)})")
        if not evidence:
            return "error: evidence is required to verify a postcondition"
        pc = plan.postconditions[postcondition_number - 1]
        plan.postconditions[postcondition_number - 1] = Postcondition(
            description=pc.description, satisfied=True, evidence=evidence,
        )
        return f"postcondition {postcondition_number} verified"

    return [plan_create, plan_show, step_update, postcondition_verify]
Three points of discipline enforced at the tool level.
Evidence is mandatory for done and verified. If the model calls step_update(step_number=3, status="done") without evidence, the tool returns an error explaining why, and the model has to come back with a non-empty evidence string. This isn't security; it's a habit trainer: a model forced to produce evidence here produces better evidence everywhere.
Plan rewriting is allowed, not hidden. A mid-task plan rewrite is a legitimate move; circumstances change. What we don't want is the model silently drifting from the old plan. Calling plan_create again replaces the plan — and Chapter 18's observability will log it, so you can see whether rewriting is happening too often (a sign the initial planning is weak).
plan_show is read-only and cheap. It's the first tool the agent should call after compaction or any long tool sequence, as a re-orientation. The system prompt should explicitly teach this.
The loop checks the plan before accepting a final answer.
# src/harness/agent.py (plan-aware version, sketch)
async def arun(
    provider: Provider,
    catalog: ToolCatalog,
    user_message: str,
    system: str | None = None,
    # ...
    plan_holder: "PlanHolder | None" = None,
) -> str:
    # ... existing setup
    for _ in range(MAX_ITERATIONS):
        # ... existing tool selection, compaction, turn execution
        if response.is_final:
            final_text = response.text or ""
            if plan_holder and plan_holder.plan:
                plan = plan_holder.plan
                if not plan.is_ready_to_finalize():
                    # reject the finalization; ask the agent to continue
                    synthetic = (
                        "The plan is not complete. Before declaring the "
                        "task done, either mark remaining steps as done "
                        "with evidence, verify outstanding postconditions, "
                        "or mark them blocked with a reason. Current plan:\n\n"
                        + plan.to_render()
                    )
                    transcript.append(Message.from_assistant_response(response))
                    transcript.append(Message.user_text(synthetic))
                    continue
            transcript.append(Message.from_assistant_response(response))
            return final_text
        # ... existing tool dispatch
The logic: when the model produces what it thinks is a final answer, and a plan is attached, the harness checks the plan. If every step is terminal (done or blocked) and every postcondition is satisfied, accept. Otherwise, append the model's attempted final answer to the transcript, append a synthetic user message explaining what's missing, and continue the loop.
This is the specific intervention that kills premature finalization. The model says "all done"; the harness says "no, you haven't marked step 3 done and postcondition 2 isn't verified — here's the current state." The model either finishes properly or explicitly marks the missing work as blocked with a reason.
Why not just enforce during step_update? Because "done" is about a specific step, not the whole plan. A step can be legitimately done while the overall plan isn't. The completion check has to happen at finalization, not at each step update.
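The gate's semantics can be checked in isolation. A sketch with abbreviated stand-ins for the chapter's dataclasses (trimmed to run standalone; the example steps are invented): a blocked step counts as terminal, so it doesn't bar finalization, but an unverified postcondition does.

```python
from dataclasses import dataclass, field
from enum import Enum

# Abbreviated stand-ins for the chapter's StepStatus/Step/Postcondition/Plan.
class StepStatus(str, Enum):
    pending = "pending"
    done = "done"
    blocked = "blocked"

@dataclass
class Step:
    description: str
    status: StepStatus = StepStatus.pending

    def is_terminal(self) -> bool:
        return self.status in (StepStatus.done, StepStatus.blocked)

@dataclass
class Postcondition:
    description: str
    satisfied: bool = False

@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)
    postconditions: list[Postcondition] = field(default_factory=list)

    def is_ready_to_finalize(self) -> bool:
        return (all(s.is_terminal() for s in self.steps)
                and all(pc.satisfied for pc in self.postconditions))

plan = Plan(
    steps=[Step("migrate data", StepStatus.done),
           Step("notify users", StepStatus.blocked)],  # blocked is terminal
    postconditions=[Postcondition("backup exists")],
)
print(plan.is_ready_to_finalize())  # False: the postcondition is unverified
plan.postconditions[0].satisfied = True
print(plan.is_ready_to_finalize())  # True: a blocked step doesn't bar finalizing
```

This is the behavior the synthetic rejection message relies on: "blocked with a reason" is an acceptable terminal state, but unverified postconditions never are.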
# examples/ch16_plan.py
import asyncio

from harness.agent import arun
from harness.plans.tools import PlanHolder, plan_tools
from harness.providers.anthropic import AnthropicProvider
from harness.tools.selector import ToolCatalog
from harness.tools.std import STANDARD_TOOLS

SYSTEM = """\
You are an agent that works from structured plans. For any non-trivial
task, your first action is to call plan_create with an explicit objective,
steps, and postconditions. After each major action, call step_update to
record progress with evidence. Before declaring the task complete, verify
every postcondition with postcondition_verify.

The harness will reject your final answer if the plan is not complete.
Do not claim completion prematurely. If a step cannot be completed, mark
it 'blocked' with a reason rather than skipping it.
"""


async def main() -> None:
    provider = AnthropicProvider()
    holder = PlanHolder()
    catalog = ToolCatalog(tools=STANDARD_TOOLS + plan_tools(holder))
    answer = await arun(
        provider=provider,
        catalog=catalog,
        system=SYSTEM,
        plan_holder=holder,
        user_message=(
            "For each of these three files, verify they exist, measure "
            "their size in bytes, and report the largest one: "
            "/etc/hostname, /etc/os-release, /etc/machine-id. "
            "Produce a summary with file paths, sizes, and the winner."
        ),
    )
    print("---")
    print(answer)
    print("---")
    print(holder.plan.to_render() if holder.plan else "(no plan)")


asyncio.run(main())
Run it. Notice two behaviors.
At the start, the agent calls plan_create with three steps (one per file) and three postconditions (one per file's size reported, plus "winner identified"). Without the plan_holder mechanism, a careless agent might check two files and claim completion. With it, if the agent skips step 3 and tries to finalize, the harness rejects the final answer and the agent goes back to check the third file.
At the end, the plan's rendered form shows every step done with evidence and every postcondition verified. The agent's summary matches what the plan claims it did. These two are now guaranteed to match — by construction.
Three limitations worth naming.
The plan doesn't validate evidence. The agent can write evidence="I did the thing" and the harness accepts it. Evidence is about making the agent think — not about cryptographic proof. If you want verifiable evidence (the agent must prove postcondition X by running a specific test), you write a custom postcondition-verifier tool that actually runs the test. Chapter 19's eval harness is how you do this for regression testing.
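One way to get verifiable evidence is a verifier tool whose return value is the check's actual outcome. Here is a sketch of the body such a tool might have — the name, parameters, and messages are illustrative, not part of the harness; in practice you would wrap it with the chapter's @tool decorator and have the agent quote its output as the evidence string:

```python
from pathlib import Path

# Hypothetical verifier: instead of trusting the model's prose, the harness
# runs a real check, and the tool result *is* the evidence. The PASS/FAIL
# message format is an assumption for this sketch.
def verify_file_exists(path: str, expected_content_contains: str = "") -> str:
    p = Path(path)
    if not p.is_file():
        return f"FAIL: {path} does not exist"
    if expected_content_contains and expected_content_contains not in p.read_text():
        return f"FAIL: {path} exists but does not contain the expected content"
    detail = " and contains the expected content" if expected_content_contains else ""
    return f"PASS: {path} exists ({p.stat().st_size} bytes){detail}"
```

Evidence produced this way is a tool-call outcome rather than a claim, which is exactly the distinction this chapter cares about.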
The plan doesn't prevent drift. The agent can rewrite the plan to remove inconvenient steps. We log the rewrite (Chapter 18) but don't prevent it. Preventing it would make the agent brittle — plans legitimately need revision when the initial plan was wrong. The right audit is observational, not enforcement.
The plan doesn't compose across sub-agents. If the parent has a plan and spawns a sub-agent, the sub-agent either gets its own plan (usually the right choice) or no plan (often fine for narrow objectives). There is no notion of a shared plan hierarchy. Sub-agents report results; parents synthesize; the parent's plan marks the synthesis as a completed step.
git add -A && git commit -m "ch16: structured plans with enforced completion"
git tag ch16-plans
Exercise: write a tool verify_file_exists(path, expected_content_contains) that returns true/false, and have the agent call it during postcondition verification. The evidence then carries an actual tool-call outcome.

Chapter 15. Sub-agents
Previously: permissions, trust labels, sandbox interfaces. One well-governed agent can do a lot. Some tasks, though, decompose better into parallel or specialized work — and that's the case for sub-agents.
Chapter 17. Parallelism and Shared State
Previously: structured plans with evidence-backed completion. Sub-agents still run sequentially. The payoff for sub-agents comes from running them in parallel — the Anthropic multi-agent finding of 90%+ improvement over single-agent baselines rests on parallelism plus independent context windows, not sub-agents in series.