Chapter 6. Safe Tool Execution

Previously: streaming, interruption, retries. The loop survives network failures and closes cleanly on Ctrl-C. But a misnamed argument still fails inside the tool function, and the registry still can't tell when the model is spinning.

Two of the five breaks from Chapter 2 are still open. Break 2: the model passes a wrong argument shape ({"expr": "..."} instead of {"expression": "..."}), and we discover it only when Python raises TypeError inside the function body. Break 4: the model keeps calling the same tool with the same arguments and never converges — a tool-call loop that our MAX_ITERATIONS catches too late, after twenty wasted calls.

Both are fixable at the registry level, cheaply, with patterns every serious harness implements. This chapter closes them. We also tighten the error messages we return to the model, because the error message is how it learns to do better on the next turn.

name exists? → args match schema? → loop detector? → execute
(any "no" → structured error → back to model)

Four gates before a tool runs. Every "no" short-circuits to a structured error the model can learn from on its next turn.

6.1 Why Validate Before Dispatch

There are two reasons to validate arguments before calling the tool function, and the first of them is backed by specific research on how LLM agents recover from failure.

Better error messages for the model. When calc raises TypeError: calc() got an unexpected keyword argument 'expr', the registry currently returns that string to the model. It's not wrong, but it's not great — the model has to reverse-engineer which argument was expected from a Python-flavored error message aimed at a human debugger. A schema-aware validator can say "tool calc requires expression (string); got expr" and the model's next attempt is usually right on the first try. Shinn et al.'s 2023 "Reflexion: Language Agents with Verbal Reinforcement Learning" (NeurIPS 2023) demonstrated this effect empirically across several agent benchmarks: agents that received structured feedback about their failures — what specifically was wrong, in the model's own vocabulary — recovered substantially faster than agents that received only raw error traces, and the effect compounded across multi-step tasks. In production, that difference is measurable on any real system: one saved turn per misnamed argument, multiplied by every turn you run.
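The contrast is easy to see in isolation. Here is a hypothetical stand-in for calc (the body is irrelevant); compare the raw Python message to the message a schema-aware validator can construct:

```python
def calc(expression: str) -> str:
    # stand-in for the chapter's calc tool; the body doesn't matter here
    return expression


try:
    calc(expr="2+2")  # the misnamed argument from the text
except TypeError as e:
    raw = f"{type(e).__name__}: {e}"

print(raw)
# TypeError: calc() got an unexpected keyword argument 'expr'

# versus what a schema-aware validator can say:
print("tool calc requires expression (string); got expr")
```

The first message asks the model to infer the fix; the second states it.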

Safety. A tool like write_file that receives {"path": "/etc/passwd", "content": "..."} should not have reached the function body. Validating the argument shape in the registry gives us one clean place to enforce invariants, before the tool can do any damage. This chapter's validation is structural only — "is this the shape I expect?" Chapter 14 adds semantic checks — "is this path allowed?" — but it layers on the same machinery.

Production harnesses overwhelmingly use Pydantic or jsonschema for this. Pydantic is more ergonomic for Python-native types; jsonschema is the reference implementation for the JSON Schema spec. We'll use jsonschema because our tool schemas are JSON Schemas; the validation is exactly what the library was designed for.

Add the dependency:

uv add 'jsonschema>=4.22'

6.2 The Validator

# src/harness/tools/validation.py
from __future__ import annotations

from dataclasses import dataclass

from jsonschema import Draft202012Validator


@dataclass(frozen=True)
class ValidationError:
    message: str
    path: str  # JSON-pointer-ish; e.g. "args.expression"

    def __str__(self) -> str:
        return f"{self.path}: {self.message}"


def validate(args: dict, schema: dict) -> list[ValidationError]:
    """Return a list of validation errors. Empty list == valid."""
    validator = Draft202012Validator(schema)
    errors: list[ValidationError] = []
    for err in validator.iter_errors(args):
        # absolute_path mixes dict keys (str) and list indices (int);
        # render indices as [0] so paths read like args.items[0].name
        path = "args" + "".join(
            f"[{p}]" if isinstance(p, int) else f".{p}" for p in err.absolute_path
        )
        errors.append(ValidationError(message=err.message, path=path))
    return errors

Two design points.

We return a list, not raise. A single call can have multiple problems (wrong type and missing required argument and extra unknown argument). The model learns faster from one error message listing all three than from three consecutive turns fixing them one at a time.

The path is human-readable. args.expression and args.items[0].name are the shapes we emit. The model reads these as fluently as humans do; "at $.items.0.name" is harder to parse.


6.3 Threading Validation Through Dispatch

The registry gains a validation step. When validation fails, we return the errors to the model instead of running the tool.

# src/harness/tools/registry.py (updated)
from __future__ import annotations

from ..messages import ToolResult
from .base import Tool
from .validation import ValidationError, validate


MAX_REPEAT_CALLS = 3  # same (tool, args) this many times → halt


class ToolRegistry:
    def __init__(self, tools: list[Tool] | None = None) -> None:
        self.tools: dict[str, Tool] = {}
        self._call_history: list[tuple[str, str]] = []
        for t in tools or []:
            self.add(t)

    def add(self, tool: Tool) -> None:
        if tool.name in self.tools:
            raise ValueError(f"duplicate tool name: {tool.name}")
        self.tools[tool.name] = tool

    def schemas(self) -> list[dict]:
        return [t.schema_for_provider() for t in self.tools.values()]

    def dispatch(self, name: str, args: dict, call_id: str) -> ToolResult:
        if name not in self.tools:
            return self._unknown_tool(name, call_id)

        tool = self.tools[name]
        errors = validate(args, tool.input_schema)
        if errors:
            return self._validation_failure(name, errors, call_id)

        self._record(name, args)
        loop_result = self._check_loop(name, args, call_id)
        if loop_result is not None:
            return loop_result

        try:
            content = tool.run(**args)
        except Exception as e:
            return ToolResult(
                call_id=call_id,
                content=f"{name} raised {type(e).__name__}: {e}",
                is_error=True,
            )
        return ToolResult(call_id=call_id, content=content)

    # --- helpers ---

    def _unknown_tool(self, name: str, call_id: str) -> ToolResult:
        # Try to suggest a close match. We lower difflib's default
        # cutoff from 0.6 to 0.5 — the ratio for `calculator` vs `calc`
        # is ~0.57, and prefix-heavy misspellings like that are exactly
        # the case we want to catch. 0.5 still rejects unrelated names.
        import difflib
        close = difflib.get_close_matches(
            name, list(self.tools.keys()), n=1, cutoff=0.5,
        )
        suggestion = f" Did you mean {close[0]!r}?" if close else ""
        return ToolResult(
            call_id=call_id,
            content=(
                f"unknown tool: {name!r}.{suggestion} "
                f"Available: {sorted(self.tools.keys())}"
            ),
            is_error=True,
        )

    def _validation_failure(
        self, name: str, errors: list[ValidationError], call_id: str
    ) -> ToolResult:
        summary = "; ".join(str(e) for e in errors)
        return ToolResult(
            call_id=call_id,
            content=f"{name}: invalid arguments. {summary}",
            is_error=True,
        )

    def _record(self, name: str, args: dict) -> None:
        import json
        self._call_history.append((name, json.dumps(args, sort_keys=True)))
        if len(self._call_history) > 100:
            self._call_history = self._call_history[-100:]

    def _check_loop(self, name: str, args: dict, call_id: str) -> ToolResult | None:
        import json
        key = (name, json.dumps(args, sort_keys=True))
        repeats = sum(1 for k in self._call_history[-MAX_REPEAT_CALLS:] if k == key)
        if repeats >= MAX_REPEAT_CALLS:
            return ToolResult(
                call_id=call_id,
                content=(
                    f"tool-call loop detected: {name} called with identical "
                    f"arguments {MAX_REPEAT_CALLS} times in a row. "
                    "Try a different approach or different arguments, or "
                    "stop and return your current best answer."
                ),
                is_error=True,
            )
        return None

Three new behaviors.

Unknown tools suggest alternatives. If the model asks for calculator and we only have calc, difflib.get_close_matches produces "Did you mean 'calc'?". In my experience, this recovers about 80% of misnamed tool calls in one turn. It costs us one import difflib and three lines.
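The cutoff arithmetic is easy to verify directly — difflib's ratio is 2·M/(len(a)+len(b)), and "calculator" vs "calc" shares four characters, so 8/14 ≈ 0.571:

```python
import difflib

# 2 * 4 matching chars / (10 + 4) total chars ≈ 0.571
ratio = difflib.SequenceMatcher(None, "calculator", "calc").ratio()
print(round(ratio, 3))  # 0.571

tools = ["calc", "read_file"]
# the default cutoff of 0.6 misses the match...
print(difflib.get_close_matches("calculator", tools, n=1))              # []
# ...while 0.5 catches it without admitting unrelated names
print(difflib.get_close_matches("calculator", tools, n=1, cutoff=0.5))  # ['calc']
```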

Validation errors come back structured. The model reads calc: invalid arguments. args.expression: 'expression' is a required property and, in the typical case, fixes it on the next turn. Compare to the pre-validation version where it sees the Python TypeError message — both work, but structured is faster.

Tool-call loops are detected and explained. After three consecutive identical calls, the registry returns a synthetic error explaining what happened. This is the key intervention, and it's the same principle as §6.1's Reflexion framing applied to a different failure mode: the model gets a structured, external hint that it's stuck, rather than more turns of the same unhelpful output it's already producing. Most models recover — they try different arguments, try a different tool, or stop and return their best current answer.


6.4 What "Identical" Means

The loop detector uses (name, json.dumps(args, sort_keys=True)) as the dedup key. That's exact-match. A model that calls calc("1+1") and then calc("1 + 1") would bypass it. That's usually fine — the model is making progress if it's varying the arguments, even trivially. The failure mode we care about is the one where the model has nothing left to try.
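A quick check of what the key does and doesn't normalize — sort_keys canonicalizes argument order, but any textual difference inside a value defeats the match:

```python
import json


def call_key(name: str, args: dict) -> tuple[str, str]:
    # the same key construction the registry's loop detector uses
    return (name, json.dumps(args, sort_keys=True))


# key order is canonicalized away...
assert call_key("calc", {"a": 1, "b": 2}) == call_key("calc", {"b": 2, "a": 1})

# ...but whitespace inside a value is not: these count as different calls
assert call_key("calc", {"expression": "1+1"}) != call_key("calc", {"expression": "1 + 1"})
```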

Two extensions are tempting. Neither made it in:

Fuzzy match. Collapse whitespace, normalize casing, round floats. Catches trivial variations but also catches legitimate ones — "read lines 1-50" and "read lines 1-51" look fuzzy-identical but the second is a real step forward. False positives on progress are worse than false negatives on loops.

Semantic match. An LLM-based judge of whether two calls are "really the same." Expensive, non-deterministic, and a great way to have a bug you can't reproduce.

The exact-match version catches the nasty case — a genuine stuck loop — without stepping on real progress. If you hit a case where it misses, bump MAX_REPEAT_CALLS down or look for a heuristic specific to your domain.


6.5 The MAX_ITERATIONS Question

Up until this chapter, the loop's outer bound has been MAX_ITERATIONS = 20. The loop detector at the registry level gives us a smarter inner bound: we stop before twenty iterations if the model is spinning. But MAX_ITERATIONS itself is still a coarse safety net, and the right answer to "how many iterations is too many" isn't a number — it's a budget.

A cost budget based on tokens (Chapter 20) or a time budget (wall-clock seconds) is more honest than iteration count. A short task with a handful of 100K-token tool results hits cost ceilings fast; a long task with tiny tool results can legitimately run 40 iterations cheaply. Counting iterations is a proxy; cost is the thing.

We'll upgrade MAX_ITERATIONS to a proper budget in Chapter 20. For now, we keep the iteration cap as a fail-safe and note the _check_loop intervention is the real signal.
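To make the distinction concrete, here is a minimal sketch of the shape that budget might take — the Budget name and fields are illustrative assumptions, not Chapter 20's actual API:

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Illustrative token budget; Chapter 20's real version tracks cost."""
    max_tokens: int
    used_tokens: int = 0

    def charge(self, tokens: int) -> None:
        self.used_tokens += tokens

    @property
    def exhausted(self) -> bool:
        return self.used_tokens >= self.max_tokens


# the loop would check budget.exhausted instead of counting iterations
budget = Budget(max_tokens=200_000)
budget.charge(120_000)   # a single large tool result
print(budget.exhausted)  # False — but one more result like it ends the run
```

The point of the sketch: a handful of huge tool results exhausts this budget in three iterations, while forty cheap calls never touch it. Iteration count can't express that.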


6.6 A Small Test Suite

Now is the time to start testing the loop's error paths deliberately, not just its happy path. We've been relying on examples that run to completion; we need tests that exercise the five-break table from Chapter 2 and confirm they all fail gracefully.

# tests/test_registry.py
from harness.tools.registry import ToolRegistry
from harness.tools.std import calc


def test_unknown_tool_with_suggestion():
    registry = ToolRegistry(tools=[calc])
    result = registry.dispatch("calculator", {"expression": "2+2"}, "call-1")
    assert result.is_error
    assert "Did you mean 'calc'?" in result.content


def test_validation_missing_required():
    registry = ToolRegistry(tools=[calc])
    result = registry.dispatch("calc", {}, "call-1")
    assert result.is_error
    assert "expression" in result.content
    assert "required" in result.content


def test_validation_wrong_type():
    registry = ToolRegistry(tools=[calc])
    result = registry.dispatch("calc", {"expression": 42}, "call-1")
    assert result.is_error
    assert "string" in result.content.lower() or "str" in result.content.lower()


def test_loop_detection():
    registry = ToolRegistry(tools=[calc])
    # the first two identical calls are allowed through
    for i in range(2):
        result = registry.dispatch("calc", {"expression": "1+1"}, f"call-{i}")
        assert not result.is_error
    # the third identical call in a row trips the detector
    result = registry.dispatch("calc", {"expression": "1+1"}, "call-2")
    assert result.is_error
    assert "tool-call loop" in result.content


def test_happy_path():
    registry = ToolRegistry(tools=[calc])
    result = registry.dispatch("calc", {"expression": "2+2"}, "call-1")
    assert not result.is_error
    assert result.content == "4"

Run it:

uv run pytest tests/test_registry.py -q

All five pass. The test suite isn't comprehensive — Chapter 19 will build proper trajectory evals — but it's enough to catch regressions in the registry, which is now the central component of the harness.


6.7 A Second Tool Worth Writing

The registry is robust enough now to handle a tool that genuinely has a narrow contract. Let's add json_query, which takes a JSON string and a dot-path expression. It's a good stress test of the two layers: the schema has two required arguments, both strings, so the validator catches shape errors, while the contract's own failure modes — invalid JSON, a path that doesn't resolve — surface inside the tool body and are caught by the registry's try/except.

# src/harness/tools/std.py (add)
import json

@tool(side_effects={"read"})
def json_query(data: str, path: str) -> str:
    """Query JSON data with a simple dot-path expression.

    data: a JSON string (object or array).
    path: a dot-separated path; e.g. "items.0.name" or "user.email".
          Array indices are integers; object keys are dot-separated.

    Returns the queried value as JSON, or an error string if the path
    doesn't exist.
    Side effects: none.
    """
    obj = json.loads(data)  # will raise on invalid JSON; registry catches it
    current = obj
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]
        elif isinstance(current, dict):
            if part not in current:
                raise KeyError(f"path not found: {part}")
            current = current[part]
        else:
            raise TypeError(f"cannot index {type(current).__name__} with {part}")
    return json.dumps(current)
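A quick exercise of the tool outside the registry — the @tool decorator is dropped here so the snippet stands alone, but the body is the same:

```python
import json


def json_query(data: str, path: str) -> str:
    # standalone copy of the chapter's tool, minus the decorator
    current = json.loads(data)
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]
        elif isinstance(current, dict):
            if part not in current:
                raise KeyError(f"path not found: {part}")
            current = current[part]
        else:
            raise TypeError(f"cannot index {type(current).__name__} with {part}")
    return json.dumps(current)


doc = json.dumps({"items": [{"name": "alpha"}, {"name": "beta"}]})
print(json_query(doc, "items.1.name"))  # "beta"  (a JSON string, quotes included)
print(json_query(doc, "items.0"))       # {"name": "alpha"}
```

Note the return value is always JSON, so a string result comes back with its quotes — that's deliberate, since the model may feed the output straight back into another call.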

Now the registry has five tools. Run them through the loop against your preferred provider and observe: when the model passes malformed arguments, the registry's error message arrives structured, the model corrects, the next call works. Most of the time.


6.8 What the Registry Still Doesn't Do

Three things worth naming now, each of which gets a chapter later.

No permissions. Anyone can call write_file on any path. Chapter 14 builds the permission layer.

No observability. The registry logs nothing; a failed call is invisible in post-hoc analysis. Chapter 18 adds OpenTelemetry spans per dispatch.

No cost accounting. The registry doesn't know — or care — how much the model spent to make each call. Chapter 20 wires in budget-aware dispatch.

Each of these slots in cleanly because the registry is the sole dispatch point. We didn't have to thread permission checks through every tool; we didn't have to teach each tool to log. The registry is the interception layer, by design, and the book's cost of adding these features is proportional to what they do — not to how many tools we have.


6.9 Commit

git add -A && git commit -m "ch06: schema validation and loop detection at the registry"
git tag ch06-safety

6.10 Try It Yourself

  1. Find a legitimate loop. Construct a prompt where the agent genuinely needs to retry the same tool with the same arguments — for example, polling a tool that represents a slow operation. Does the loop detector fire? If so, is that the right behavior? How would you distinguish a polling loop from a stuck loop?
  2. Measure the recovery rate. Run your harness against thirty prompts that commonly trigger malformed tool calls. Log how often the model recovers on the next turn after receiving a validation error versus how often it gives up. That number is a proxy for how well-designed your schemas and error messages are.
  3. Write a test for the close-match suggestion. Prove that renaming calc to calculate breaks the hard-coded "Did you mean 'calc'?" assertion in the test suite. What would you change so that the test stays green regardless of the specific name? Your answer is a sketch of what a larger test suite needs.