Previously: sub-agents with bounded delegation. A coordinator can now split work across sub-agents and synthesize. What it cannot yet do — and what single-agent runs also cannot — is verify that what it claims to have done, it actually did.
Two specific failure modes motivate this chapter.
Premature finalization. The agent says "task complete" after processing half the items. Galileo's production analysis named this as one of the top agent failure patterns — an agent asked to process all items in a list processes four of six and claims completion. The model's training rewards coherent-sounding completions; the agent cannot distinguish "said X" from "did X."
Plan-execution mismatch. The MAST paper (Cemri et al., 2025) identified reasoning-action mismatch as a distinct failure mode across 1,642 multi-agent traces. The agent's plan says "read A, then modify B." The action reads A and modifies C. The plan and the action are generated in separate forward passes; nothing connects them.
Both are addressable by making the plan a structured object the harness enforces rather than an unstructured string the agent produces and then may or may not follow. This is Kambhampati et al.'s 2024 ICML position paper "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks" spelled out in concrete code: the paper's core argument is that language models produce plausible plans but cannot reliably self-verify completion, and the right architecture pairs the model with an external verifier that decides when the model's work is actually done. The harness is the external verifier. This chapter introduces Plan, a dataclass with steps, preconditions, and postconditions. A step has to be marked complete with evidence; a plan has to have all postconditions satisfied before the agent can declare final. The harness checks, not the model.
# src/harness/plans/model.py
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from uuid import uuid4


class StepStatus(str, Enum):
    pending = "pending"
    in_progress = "in_progress"
    done = "done"
    blocked = "blocked"


@dataclass
class Step:
    id: str
    description: str
    status: StepStatus = StepStatus.pending
    evidence: str | None = None  # what proved the step done
    notes: str = ""

    def is_terminal(self) -> bool:
        return self.status in (StepStatus.done, StepStatus.blocked)


@dataclass
class Postcondition:
    description: str
    satisfied: bool = False
    evidence: str | None = None


@dataclass
class Plan:
    objective: str
    steps: list[Step] = field(default_factory=list)
    postconditions: list[Postcondition] = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    id: str = field(default_factory=lambda: str(uuid4()))

    def all_steps_terminal(self) -> bool:
        return all(s.is_terminal() for s in self.steps)

    def all_postconditions_satisfied(self) -> bool:
        return all(pc.satisfied for pc in self.postconditions)

    def is_ready_to_finalize(self) -> bool:
        return self.all_steps_terminal() and self.all_postconditions_satisfied()

    def to_render(self) -> str:
        """Render the plan as a string the model can read."""
        lines = [f"# Plan: {self.objective}\n"]
        lines.append("## Steps")
        for i, s in enumerate(self.steps, start=1):
            mark = {"pending": "[ ]", "in_progress": "[.]",
                    "done": "[x]", "blocked": "[!]"}[s.status.value]
            lines.append(f"{i}. {mark} {s.description}")
            if s.evidence:
                lines.append(f"   evidence: {s.evidence}")
            if s.notes:
                lines.append(f"   notes: {s.notes}")
        lines.append("\n## Postconditions")
        for i, pc in enumerate(self.postconditions, start=1):
            mark = "[x]" if pc.satisfied else "[ ]"
            lines.append(f"{i}. {mark} {pc.description}")
            if pc.evidence:
                lines.append(f"   evidence: {pc.evidence}")
        return "\n".join(lines)
Two distinctions worth naming.
Steps vs. postconditions. A step is what you do. A postcondition is what must be true at the end. They overlap, but not fully: a plan might have five steps and two postconditions, and both kinds matter. Steps let the agent track progress mid-task; postconditions let the harness check that outcomes actually landed.
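To make the split concrete, here is a toy plan using abbreviated stand-ins for the dataclasses above (trimmed so the snippet runs on its own; the rename task is invented for illustration):

```python
from dataclasses import dataclass, field

# Abbreviated stand-ins mirroring the chapter's model.py.
@dataclass
class Step:
    description: str
    done: bool = False

@dataclass
class Postcondition:
    description: str
    satisfied: bool = False

@dataclass
class Plan:
    objective: str
    steps: list[Step] = field(default_factory=list)
    postconditions: list[Postcondition] = field(default_factory=list)

    def is_ready_to_finalize(self) -> bool:
        return (all(s.done for s in self.steps)
                and all(pc.satisfied for pc in self.postconditions))

# Three steps describe the work; two postconditions describe the outcomes.
# They overlap, but neither list is derivable from the other.
plan = Plan(
    objective="Rename config key 'timeout' to 'timeout_s'",
    steps=[Step("grep for usages of 'timeout'"),
           Step("edit each usage to 'timeout_s'"),
           Step("run the test suite")],
    postconditions=[Postcondition("no references to the old key remain"),
                    Postcondition("test suite passes")],
)
print(plan.is_ready_to_finalize())  # False: nothing done, nothing verified
```

Finishing all three steps is not enough; the harness still waits for both postconditions to be verified.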
Evidence is a string, not a boolean. When the model marks a step done, it must provide evidence — "ran tests; all passed, see scratchpad key test-results"; "wrote file; confirmed with read_file_viewport at lines 47-52." Evidence gives the harness something to check later and gives debugging a paper trail. The harness doesn't parse the evidence; it only requires that it be non-empty. The model knows to put useful things there because the plan's render shows the evidence field.
The agent interacts with the plan via tools.
# src/harness/plans/tools.py
from __future__ import annotations

from dataclasses import dataclass

from ..tools.base import Tool
from ..tools.decorator import tool
from .model import Plan, Postcondition, Step, StepStatus


@dataclass
class PlanHolder:
    """Wraps a mutable Plan so tools can mutate it through a shared reference."""
    plan: Plan | None = None

    def require(self) -> Plan:
        if self.plan is None:
            raise RuntimeError("no active plan")
        return self.plan


def plan_tools(holder: PlanHolder) -> list[Tool]:
    @tool(side_effects={"write"})
    def plan_create(objective: str, steps: list[str],
                    postconditions: list[str]) -> str:
        """Create or replace the plan for this session.

        objective: one-sentence description of what you're trying to
            accomplish.
        steps: ordered list of step descriptions. Each is a specific
            actionable item, not a vague intent.
        postconditions: list of conditions that must be true when you
            declare the task complete. Examples: "file X exists and
            contains Y"; "tests in module Z pass".

        Call this once at the start of any non-trivial task, before
        beginning work. If the plan is wrong mid-task, call this again
        to replace it — the harness records the rewrite.
        """
        holder.plan = Plan(
            objective=objective,
            steps=[Step(id=f"s{i}", description=d) for i, d in enumerate(steps)],
            postconditions=[Postcondition(description=d) for d in postconditions],
        )
        return (f"plan created with {len(steps)} steps "
                f"and {len(postconditions)} postconditions")

    @tool(side_effects={"read"})
    def plan_show() -> str:
        """Display the current plan with its step and postcondition status.

        Use this any time you want to re-orient — especially after long
        sub-tasks or compaction. The plan is durable; compaction won't
        lose it.
        """
        return holder.require().to_render()

    @tool(side_effects={"write"})
    def step_update(step_number: int, status: str, evidence: str = "",
                    notes: str = "") -> str:
        """Update a step's status.

        step_number: 1-based index from `plan_show`.
        status: one of 'pending', 'in_progress', 'done', 'blocked'.
        evidence: required for 'done'. One-sentence proof of completion
            (reference to a tool result, a scratchpad key, etc.).
        notes: optional free text.
        """
        plan = holder.require()
        if step_number < 1 or step_number > len(plan.steps):
            return f"step_number {step_number} out of range (1..{len(plan.steps)})"
        try:
            new_status = StepStatus(status)
        except ValueError:
            return f"invalid status {status!r}; use pending/in_progress/done/blocked"
        if new_status == StepStatus.done and not evidence:
            return ("error: marking a step 'done' requires evidence. Describe "
                    "what proved the step complete (e.g., 'wrote file and "
                    "read it back; content matches').")
        s = plan.steps[step_number - 1]
        plan.steps[step_number - 1] = Step(
            id=s.id, description=s.description,
            status=new_status, evidence=evidence or None,
            notes=notes,
        )
        return f"step {step_number} → {status}"

    @tool(side_effects={"write"})
    def postcondition_verify(postcondition_number: int, evidence: str) -> str:
        """Mark a postcondition as verified.

        postcondition_number: 1-based index.
        evidence: required. Concrete proof the postcondition holds.

        This is what the harness checks before letting you declare the
        task complete. Do not verify a postcondition without evidence.
        """
        plan = holder.require()
        if postcondition_number < 1 or postcondition_number > len(plan.postconditions):
            return (f"postcondition_number {postcondition_number} out of range "
                    f"(1..{len(plan.postconditions)})")
        if not evidence:
            return "error: evidence is required to verify a postcondition"
        pc = plan.postconditions[postcondition_number - 1]
        plan.postconditions[postcondition_number - 1] = Postcondition(
            description=pc.description, satisfied=True, evidence=evidence,
        )
        return f"postcondition {postcondition_number} verified"

    return [plan_create, plan_show, step_update, postcondition_verify]
Three points of discipline enforced at the tool level.
Evidence is mandatory for done and verified. If the model calls step_update(step_number=3, status="done") without evidence, the tool returns an error explaining why, and the model has to come back with a non-empty evidence string. This isn't security; it's a habit trainer: a model forced to produce evidence here produces better evidence everywhere.
Plan rewriting is allowed, not hidden. A mid-task plan rewrite is a legitimate move; circumstances change. What we don't want is the model silently drifting from the old plan. Calling plan_create again replaces the plan — and Chapter 18's observability will log it, so you can see whether rewriting is happening too often (a sign the initial planning is weak).
plan_show is read-only and cheap. It's the first tool the agent should call after compaction or any long tool sequence, as a re-orientation. The system prompt should explicitly teach this.
The loop checks the plan before accepting a final answer.
# src/harness/agent.py (plan-aware version, sketch)
async def arun(
    provider: Provider,
    catalog: ToolCatalog,
    user_message: str,
    system: str | None = None,
    # ...
    plan_holder: "PlanHolder | None" = None,
) -> str:
    # ... existing setup
    for _ in range(MAX_ITERATIONS):
        # ... existing tool selection, compaction, turn execution
        if response.is_final:
            final_text = response.text or ""
            if plan_holder and plan_holder.plan:
                plan = plan_holder.plan
                if not plan.is_ready_to_finalize():
                    # reject the finalization; ask the agent to continue
                    synthetic = (
                        "The plan is not complete. Before declaring the "
                        "task done, either mark remaining steps as done "
                        "with evidence, verify outstanding postconditions, "
                        "or mark them blocked with a reason. Current plan:\n\n"
                        + plan.to_render()
                    )
                    transcript.append(Message.from_assistant_response(response))
                    transcript.append(Message.user_text(synthetic))
                    continue
            transcript.append(Message.from_assistant_response(response))
            return final_text
        # ... existing tool dispatch
The logic: when the model produces what it thinks is a final answer, and a plan is attached, the harness checks the plan. If every step is terminal (done or blocked) and every postcondition is satisfied, accept. Otherwise, append the model's attempted final answer to the transcript, append a synthetic user message explaining what's missing, and continue the loop.
This is the specific intervention that kills premature finalization. The model says "all done"; the harness says "no, you haven't marked step 3 done and postcondition 2 isn't verified — here's the current state." The model either finishes properly or explicitly marks the missing work as blocked with a reason.
Why not just enforce during step_update? Because "done" is about a specific step, not the whole plan. A step can be legitimately done while the overall plan isn't. The completion check has to happen at finalization, not at each step update.
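The gate's semantics can be checked in isolation. A sketch with abbreviated stand-ins for the chapter's dataclasses (trimmed to run standalone; the example steps are invented): a blocked step counts as terminal, so it doesn't bar finalization, but an unverified postcondition does.

```python
from dataclasses import dataclass, field
from enum import Enum

# Abbreviated stand-ins for the chapter's StepStatus/Step/Postcondition/Plan.
class StepStatus(str, Enum):
    pending = "pending"
    done = "done"
    blocked = "blocked"

@dataclass
class Step:
    description: str
    status: StepStatus = StepStatus.pending

    def is_terminal(self) -> bool:
        return self.status in (StepStatus.done, StepStatus.blocked)

@dataclass
class Postcondition:
    description: str
    satisfied: bool = False

@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)
    postconditions: list[Postcondition] = field(default_factory=list)

    def is_ready_to_finalize(self) -> bool:
        return (all(s.is_terminal() for s in self.steps)
                and all(pc.satisfied for pc in self.postconditions))

plan = Plan(
    steps=[Step("migrate data", StepStatus.done),
           Step("notify users", StepStatus.blocked)],  # blocked is terminal
    postconditions=[Postcondition("backup exists")],
)
print(plan.is_ready_to_finalize())  # False: the postcondition is unverified
plan.postconditions[0].satisfied = True
print(plan.is_ready_to_finalize())  # True: a blocked step doesn't bar finalizing
```

This is the behavior the synthetic rejection message relies on: "blocked with a reason" is an acceptable terminal state, but unverified postconditions never are.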
# examples/ch16_plan.py
import asyncio

from harness.agent import arun
from harness.plans.tools import PlanHolder, plan_tools
from harness.providers.anthropic import AnthropicProvider
from harness.tools.selector import ToolCatalog
from harness.tools.std import STANDARD_TOOLS

SYSTEM = """\
You are an agent that works from structured plans. For any non-trivial
task, your first action is to call plan_create with an explicit objective,
steps, and postconditions. After each major action, call step_update to
record progress with evidence. Before declaring the task complete, verify
every postcondition with postcondition_verify.

The harness will reject your final answer if the plan is not complete.
Do not claim completion prematurely. If a step cannot be completed, mark
it 'blocked' with a reason rather than skipping it.
"""


async def main() -> None:
    provider = AnthropicProvider()
    holder = PlanHolder()
    catalog = ToolCatalog(tools=STANDARD_TOOLS + plan_tools(holder))
    answer = await arun(
        provider=provider,
        catalog=catalog,
        system=SYSTEM,
        plan_holder=holder,
        user_message=(
            "For each of these three files, verify they exist, measure "
            "their size in bytes, and report the largest one: "
            "/etc/hostname, /etc/os-release, /etc/machine-id. "
            "Produce a summary with file paths, sizes, and the winner."
        ),
    )
    print("---")
    print(answer)
    print("---")
    print(holder.plan.to_render() if holder.plan else "(no plan)")


asyncio.run(main())
Run it. Notice two behaviors.
At the start, the agent calls plan_create with three steps (one per file) and three postconditions (one per file's size reported, plus "winner identified"). Without the plan_holder mechanism, a careless agent might check two files and claim completion. With it, if the agent skips step 3 and tries to finalize, the harness rejects the final answer and the agent goes back to check the third file.
At the end, the plan's rendered form shows every step done with evidence and every postcondition verified. The agent's summary matches what the plan claims it did. These two are now guaranteed to match — by construction.
Three limitations worth naming.
The plan doesn't validate evidence. The agent can write evidence="I did the thing" and the harness accepts it. Evidence is about making the agent think — not about cryptographic proof. If you want verifiable evidence (the agent must prove postcondition X by running a specific test), you write a custom postcondition-verifier tool that actually runs the test. Chapter 19's eval harness is how you do this for regression testing.
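One way to get verifiable evidence is a verifier tool whose return value is the check's actual outcome. Here is a sketch of the body such a tool might have — the name, parameters, and messages are illustrative, not part of the harness; in practice you would wrap it with the chapter's @tool decorator and have the agent quote its output as the evidence string:

```python
from pathlib import Path

# Hypothetical verifier: instead of trusting the model's prose, the harness
# runs a real check, and the tool result *is* the evidence. The PASS/FAIL
# message format is an assumption for this sketch.
def verify_file_exists(path: str, expected_content_contains: str = "") -> str:
    p = Path(path)
    if not p.is_file():
        return f"FAIL: {path} does not exist"
    if expected_content_contains and expected_content_contains not in p.read_text():
        return f"FAIL: {path} exists but does not contain the expected content"
    detail = " and contains the expected content" if expected_content_contains else ""
    return f"PASS: {path} exists ({p.stat().st_size} bytes){detail}"
```

Evidence produced this way is a tool-call outcome rather than a claim, which is exactly the distinction this chapter cares about.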
The plan doesn't prevent drift. The agent can rewrite the plan to remove inconvenient steps. We log the rewrite (Chapter 18) but don't prevent it. Preventing it would make the agent brittle — plans legitimately need revision when the initial plan was wrong. The right audit is observational, not enforcement.
The plan doesn't compose across sub-agents. If the parent has a plan and spawns a sub-agent, the sub-agent either gets its own plan (usually the right choice) or no plan (often fine for narrow objectives). There is no notion of a shared plan hierarchy. Sub-agents report results; parents synthesize; the parent's plan marks the synthesis as a completed step.
git add -A && git commit -m "ch16: structured plans with enforced completion"
git tag ch16-plans
Exercise: write a tool verify_file_exists(path, expected_content_contains) that returns true/false, and have the agent call it during postcondition verification. The evidence then carries an actual tool-call outcome.

Chapter 15. Sub-agents
Previously: permissions, trust labels, sandbox interfaces. One well-governed agent can do a lot. Some tasks, though, decompose better into parallel or specialized work — and that's the case for sub-agents.
Chapter 17. Parallelism and Shared State
Previously: structured plans with evidence-backed completion. Sub-agents still run sequentially. The payoff for sub-agents comes from running them in parallel — the Anthropic multi-agent finding of 90%+ improvement over single-agent baselines rests on parallelism plus independent context windows, not sub-agents in series.