Previously: typed messages, typed transcripts, three provider adapters. The loop no longer crashes on unknown tools, but its fix is ad hoc — a try/except in the dispatch. We owe ourselves a proper tool abstraction.
A tool is a contract. It has a name a model can guess, a description a model can read, a schema a model must match, a callable we execute on a match, and a side-effect profile that determines who has to ask permission before we run it. The tools you ship are the surface through which your agent reaches into the world, and every well-documented production failure in this space — hallucinated tool calls, output truncation, the tool-selection cliff, prompt injection — is a failure of the tool surface.
This chapter builds the Tool abstraction we'll use for the rest of the book. By the end, three things are true:
- Tool is a first-class object: name, description, input schema, callable, and side-effect tags travel together.
- The @tool decorator turns a plain typed function into a Tool, inferring the schema from type hints and the description from the docstring.
- ToolRegistry dispatches calls and rejects unknown names before they reach your code.

We still don't validate argument shapes before dispatch — that's Chapter 6. We're building the object; Chapter 6 attaches the validator.
The research arc here is short but worth naming. Schick et al.'s 2023 "Toolformer: Language Models Can Teach Themselves to Use Tools" was the paper that opened the tool-use research area as a distinct subfield — it demonstrated that language models could learn to call external APIs (calculators, translators, search) in the middle of text generation, which in turn established tool use as a research problem rather than a prompting trick. Yang et al.'s 2024 "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering" made the sharp design point that followed from Toolformer's opening: tool design is interface design for a non-human user with very specific cognitive constraints. A model reads your description before it decides what to call, it doesn't remember your README, it can't ask a clarifying question, and it infers behavior from names. Three consequences follow:
Names are semantic. send_message is very different from send_email. read_file implies non-destructive; open_file is ambiguous. If two tools could be confused, they will be.
Descriptions are contracts. "Sends a notification" doesn't tell a model whether the notification costs money, reaches production customers, is rate-limited, or requires a subject line. A well-specified tool description covers what it does, what it does not do, what it requires as preconditions, and what side effects it produces. "Sends a notification" is a bug report waiting to happen.
Schemas are hard edges. A tool with an optional argument that's actually required, or a string argument that should have been an enum, gets misused in proportion to how soft its edges are. Every string where an enum would fit is a chance for the model to be creative in ways you don't want.
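To make the hard-edge point concrete, here is a sketch contrasting a soft schema with a strict one for the same argument; the priority field and its values are made up for illustration:

```python
# Soft edge: any string is legal, so the model is free to invent
# "urgent!!", "P0", or "asap" and your handler has to cope.
loose = {
    "type": "object",
    "properties": {"priority": {"type": "string"}},
    "required": ["priority"],
}

# Hard edge: the schema names the only legal values, so misuse becomes
# a validation error instead of a runtime surprise downstream.
strict = {
    "type": "object",
    "properties": {
        "priority": {"type": "string", "enum": ["low", "normal", "high"]},
    },
    "required": ["priority"],
}

print(strict["properties"]["priority"]["enum"])  # ['low', 'normal', 'high']
```

The enum costs nothing to write and removes an entire class of creative misuse.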
We'll encode all three — name, description, schema — as first-class fields of Tool. Side effects get a field too, because the permission layer in Chapter 14 will need to gate on them.
# src/harness/tools/base.py
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable, Literal
SideEffect = Literal["read", "write", "network", "mutate"]
@dataclass(frozen=True)
class Tool:
"""A callable exposed to the model.
name -- stable identifier the model calls by.
description -- contract text the model reads. Must state scope,
preconditions, and side effects in plain English.
input_schema -- JSON Schema for the arguments dict.
run -- the callable. Accepts kwargs matching the schema;
returns a string (what the model will see as the result).
side_effects -- declared effect tags. Used by the permission layer.
"""
name: str
description: str
input_schema: dict
run: Callable[..., str]
side_effects: frozenset[SideEffect] = field(default_factory=frozenset)
def schema_for_provider(self) -> dict:
"""The dict shape providers expect (Anthropic-flavored)."""
return {
"name": self.name,
"description": self.description,
"input_schema": self.input_schema,
}
Four tags in SideEffect:
- read — the tool only reads state. Safe to retry, safe to run in parallel, never needs an idempotency key.
- write — modifies local state (files, scratchpad). Needs write ownership; usually safe to retry with idempotency.
- network — reaches an external service. Needs egress permission; retries require vendor-side idempotency.
- mutate — has externally visible, irreversible side effects (sending an email, charging a card, deleting a row). The permission layer may require human approval; retries need real idempotency keys.

These tags aren't load-bearing yet. In Chapter 14, the PermissionManager uses them to decide what gets gated. In Chapter 21, the Checkpointer uses them to decide what needs idempotency protection. We declare them now so every tool we write has them from the start.
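To see the fields working together, here is a tool built by hand. The Tool dataclass is re-declared in stripped-down form so the snippet runs standalone, and the echo tool itself is a made-up example:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass(frozen=True)
class Tool:  # stripped-down stand-in for harness.tools.base.Tool
    name: str
    description: str
    input_schema: dict
    run: Callable[..., str]
    side_effects: frozenset = field(default_factory=frozenset)

    def schema_for_provider(self) -> dict:
        return {"name": self.name, "description": self.description,
                "input_schema": self.input_schema}

# Every field spelled out explicitly: name, contract text, schema,
# callable, and the declared side-effect tags.
echo = Tool(
    name="echo",
    description="Return the input string unchanged. Side effects: none.",
    input_schema={"type": "object",
                  "properties": {"text": {"type": "string"}},
                  "required": ["text"]},
    run=lambda text: text,
    side_effects=frozenset({"read"}),
)

print(echo.schema_for_provider()["name"], echo.run(text="hi"))  # echo hi
```

Nothing here is hard, but four of the six lines of Tool(...) are boilerplate that the decorator below can infer.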
Building Tool instances by hand every time is tedious, and the boilerplate obscures what's actually specific to each tool. A decorator gives us lighter ergonomics and lets the tool's function signature, docstring, and type hints do most of the work:
# src/harness/tools/decorator.py
from __future__ import annotations
import inspect
import typing
from typing import Callable, get_type_hints
from .base import SideEffect, Tool
def tool(
name: str | None = None,
description: str | None = None,
side_effects: set[SideEffect] | frozenset[SideEffect] = frozenset(),
) -> Callable[[Callable[..., str]], Tool]:
"""Turn a plain function into a Tool.
The input schema is inferred from type hints. The function's docstring
is used as the description if not provided explicitly.
"""
def wrap(fn: Callable[..., str]) -> Tool:
actual_name = name or fn.__name__
actual_description = description or (fn.__doc__ or "").strip()
if not actual_description:
raise ValueError(f"tool {actual_name!r} has no description")
schema = _schema_from_signature(fn)
return Tool(
name=actual_name,
description=actual_description,
input_schema=schema,
run=fn,
side_effects=frozenset(side_effects),
)
return wrap
def _schema_from_signature(fn: Callable[..., str]) -> dict:
sig = inspect.signature(fn)
hints = get_type_hints(fn)
properties: dict[str, dict] = {}
required: list[str] = []
for pname, param in sig.parameters.items():
if pname == "self":
continue
hint = hints.get(pname, str)
properties[pname] = _type_to_schema(hint)
if param.default is inspect.Parameter.empty:
required.append(pname)
return {
"type": "object",
"properties": properties,
"required": required,
}
def _type_to_schema(t: type) -> dict:
    import types  # types.UnionType is the X | Y union form (Python 3.10+)
    origin = typing.get_origin(t)
    # Optional[T] / T | None: strip the None and recurse on the payload type.
    if origin is typing.Union or origin is types.UnionType:
        args = [a for a in typing.get_args(t) if a is not type(None)]
        if len(args) == 1:
            return _type_to_schema(args[0])
    if t is str:
        return {"type": "string"}
    if t is int:
        return {"type": "integer"}
    if t is float:
        return {"type": "number"}
    if t is bool:
        return {"type": "boolean"}
    if origin is list and typing.get_args(t):
        return {"type": "array", "items": _type_to_schema(typing.get_args(t)[0])}
    return {"type": "string"}  # fallback for anything we don't model
_schema_from_signature is deliberately simple. It handles str, int, float, bool, list[T], and Optional[T] — enough to cover every tool we write in the book. If you need enums, nested objects, or constraints (minLength, pattern, minimum), you can either extend this function or pass an explicit input_schema to a manual Tool(...) call. Most serious production harnesses use Pydantic or pydantic.TypeAdapter for this; we'll adopt that in Chapter 6 when schema validation starts mattering.
Usage:
from harness.tools.decorator import tool
@tool(side_effects={"read"})
def calc(expression: str) -> str:
"""Evaluate a Python arithmetic expression.
Accepts: standard Python syntax with +, -, *, /, **, parentheses.
Rejects: imports, function calls, attribute access.
Side effects: none.
"""
import ast
tree = ast.parse(expression, mode="eval")
for node in ast.walk(tree):
        if not isinstance(node, (ast.Expression, ast.BinOp, ast.UnaryOp,
                                 ast.Constant, ast.operator,
                                 ast.unaryop, ast.Load)):
raise ValueError(f"not allowed in expression: {type(node).__name__}")
return str(eval(compile(tree, "<expr>", mode="eval"), {"__builtins__": {}}))
Notice the docstring is doing real work. The model reads it; "accepts / rejects / side effects" is a pattern that makes the tool hard to misuse. Compare that to "evaluates math" and you can see the difference it makes to the model's selection behavior.
The registry holds tools, renders their schemas for providers, and dispatches calls by name.
# src/harness/tools/registry.py
from __future__ import annotations
from ..messages import ToolResult
from .base import Tool

class UnknownToolError(Exception):
    pass

class ToolRegistry:
    tools: dict[str, Tool]

    def __init__(self, tools: list[Tool] | None = None) -> None:
        self.tools = {}
        for t in tools or []:
            self.add(t)

    def add(self, tool: Tool) -> None:
        if tool.name in self.tools:
            raise ValueError(f"duplicate tool name: {tool.name}")
        self.tools[tool.name] = tool
def schemas(self) -> list[dict]:
return [t.schema_for_provider() for t in self.tools.values()]
def dispatch(self, name: str, args: dict, call_id: str) -> ToolResult:
if name not in self.tools:
return ToolResult(
call_id=call_id,
content=(f"unknown tool: {name}. "
f"available: {sorted(self.tools.keys())}"),
is_error=True,
)
tool = self.tools[name]
try:
content = tool.run(**args)
        except TypeError as e:
            # Caveat: this also catches TypeErrors raised inside the tool
            # body, not just bad argument shapes. Chapter 6's schema
            # validation will reject malformed arguments before dispatch.
return ToolResult(
call_id=call_id,
content=f"argument error for {name}: {e}",
is_error=True,
)
except Exception as e:
return ToolResult(
call_id=call_id,
content=f"{name} raised {type(e).__name__}: {e}",
is_error=True,
)
return ToolResult(call_id=call_id, content=content)
The registry's contract is narrow: it knows tools, it knows schemas, it dispatches. What it does not do yet: validate arguments against the schema before dispatch, detect loops, enforce permissions, measure cost. All of those are chapters of their own. The registry is the seam they'll plug into.
Two design choices worth naming.
dispatch returns a ToolResult, never raises. The loop should not have to wrap every call in try/except; that's what the registry is for. An unknown tool is not an exception; it's a structured error the model can read and recover from. This is the explicit handling the Chapter 2 try/except was approximating.
The error messages name the available tools. When the model calls calculator instead of calc, the error tells it so. That's not decoration — it's how the model corrects itself on the next turn. A registry that says "unknown tool" without naming alternatives is throwing away a free learning signal.
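That recovery signal is easy to see in isolation. A sketch using stripped-down stand-ins for ToolResult and the dispatch logic (not the harness modules themselves):

```python
from dataclasses import dataclass

@dataclass
class ToolResult:  # stand-in for harness.messages.ToolResult
    call_id: str
    content: str
    is_error: bool = False

def dispatch(tools: dict, name: str, args: dict, call_id: str) -> ToolResult:
    if name not in tools:
        # Name the alternatives: the model reads this and self-corrects.
        return ToolResult(call_id,
                          f"unknown tool: {name}. available: {sorted(tools)}",
                          is_error=True)
    try:
        return ToolResult(call_id, tools[name](**args))
    except Exception as e:
        return ToolResult(call_id, f"{name} raised {type(e).__name__}: {e}",
                          is_error=True)

tools = {"calc": lambda expression: str(2 + 2)}
bad = dispatch(tools, "calculator", {"expression": "2+2"}, "call_1")
print(bad.content)   # unknown tool: calculator. available: ['calc']
good = dispatch(tools, "calc", {"expression": "2+2"}, "call_2")
print(good.content)  # 4
```

The model that called calculator sees the spelling it should have used on the very next turn, at the cost of one wasted round trip rather than a dead loop.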
The Chapter 3 loop built a dict[str, Callable] and called it directly. Now it uses the registry.
# src/harness/agent.py
from __future__ import annotations
from .messages import Message, Transcript, ToolCall
from .providers.base import Provider
from .tools.registry import ToolRegistry
MAX_ITERATIONS = 20
def run(
provider: Provider,
registry: ToolRegistry,
user_message: str,
transcript: Transcript | None = None,
system: str | None = None,
) -> str:
if transcript is None:
transcript = Transcript(system=system)
transcript.append(Message.user_text(user_message))
for _ in range(MAX_ITERATIONS):
response = provider.complete(transcript, registry.schemas())
if response.is_final:
transcript.append(Message.from_assistant_response(response))
return response.text or ""
# One assistant message with every ToolCall block from this turn.
transcript.append(Message.from_assistant_response(response))
# Dispatch each call in arrival order (Chapter 5 details the
# ProviderResponse.tool_calls tuple; here it's usually one call).
for ref in response.tool_calls:
result = registry.dispatch(ref.name, ref.args, ref.id)
transcript.append(Message.tool_result(result))
raise RuntimeError(f"agent did not finish in {MAX_ITERATIONS} iterations")
The loop has gotten smaller, not larger. That's the point of the abstraction: the complexity moves into the registry, where it belongs, and the loop keeps its focus on the three decisions from Chapter 2.
Let's build the tools we'll actually use in later chapters, so each one is deliberate.
# src/harness/tools/std.py
from __future__ import annotations
import ast
import subprocess
from pathlib import Path
from .decorator import tool
@tool(side_effects={"read"})
def calc(expression: str) -> str:
"""Evaluate a Python arithmetic expression.
Accepts: +, -, *, /, **, parentheses, integer and float literals.
Does NOT allow function calls, imports, attribute access, subscripts,
comprehensions, names, or anything else not explicitly listed here.
Side effects: none. Safe to retry.
"""
ALLOWED = (
ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
ast.operator, ast.unaryop, ast.Load,
)
tree = ast.parse(expression, mode="eval")
for node in ast.walk(tree):
if not isinstance(node, ALLOWED):
raise ValueError(f"forbidden in expression: {type(node).__name__}")
return str(eval(compile(tree, "<expr>", mode="eval"),
{"__builtins__": {}}, {}))
@tool(side_effects={"read"})
def read_file(path: str) -> str:
"""Read a UTF-8 text file and return its contents.
path: relative or absolute filesystem path.
Side effects: reads the filesystem, no writes.
    Returns the file contents. For very large files, prefer Chapter 11's
    viewport reader.
"""
return Path(path).read_text(encoding="utf-8")
@tool(side_effects={"write"})
def write_file(path: str, content: str) -> str:
"""Overwrite a file with the given content.
path: relative or absolute filesystem path. The file will be CREATED
or OVERWRITTEN; its previous contents are lost.
Side effects: writes to the filesystem. Not safe to call twice with
different content expecting either version to survive.
"""
Path(path).write_text(content, encoding="utf-8")
return f"wrote {len(content)} bytes to {path}"
@tool(side_effects={"read", "network"})
def bash(command: str, timeout_seconds: int = 30) -> str:
"""Run a shell command in the current working directory.
command: a shell command line.
timeout_seconds: hard time limit; default 30, cap 300.
Side effects: MAY read/write files, MAY make network calls — depends on
the command. Caller is responsible for the blast radius.
    Returns the exit code followed by stdout and stderr sections.
"""
timeout = min(int(timeout_seconds), 300)
result = subprocess.run(
command, shell=True, capture_output=True, text=True,
timeout=timeout,
)
return (f"exit={result.returncode}\n"
f"---stdout---\n{result.stdout}\n"
f"---stderr---\n{result.stderr}")
Three things to notice.
calc is now safe enough to use. The AST walk forbids calls, attributes, imports, and names — eval("__import__('os').system('rm -rf /')") fails parse-validation before evaluation. It's not a sandbox (Chapter 14 builds one), but it's defensible against accidents.
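That claim is easy to verify directly. A standalone probe that copies calc's validation logic from std.py above:

```python
import ast

ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
           ast.operator, ast.unaryop, ast.Load)

def calc(expression: str) -> str:
    # Walk the parse tree first; anything outside the allow-list aborts
    # before eval is ever reached.
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED):
            raise ValueError(f"forbidden in expression: {type(node).__name__}")
    return str(eval(compile(tree, "<expr>", mode="eval"),
                    {"__builtins__": {}}, {}))

print(calc("2 ** 10 + (3 - 1) / 2"))  # 1025.0
try:
    calc("__import__('os').system('echo pwned')")
except ValueError as e:
    print("rejected:", e)  # the Call node fails validation; eval never runs
```

The attack string never reaches eval: the parse tree contains Call, Name, and Attribute nodes, and the first disallowed node raises before any evaluation happens.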
bash is tagged read, network. That's an optimistic guess, not a conservative one. In reality, bash can do anything — read, write, reach the network, or mutate — depending on the command. We'll revisit this in Chapter 14; for now we mark it with the tags we know apply most of the time. The permission layer can tighten it later.
Docstrings do double duty. They're the description the model reads and the documentation the human reads. Notice how each one declares preconditions, side effects, and failure modes — that pattern is what makes a tool hard to misuse.
An end-to-end example using the Chapter 3 providers and the new tool system:
# examples/ch04_tools.py
import os
from harness.agent import run
from harness.providers.anthropic import AnthropicProvider
from harness.providers.openai import OpenAIProvider
from harness.providers.local import LocalProvider
from harness.tools.registry import ToolRegistry
from harness.tools.std import calc, read_file, write_file, bash
provider = {
"anthropic": AnthropicProvider,
"openai": OpenAIProvider,
"local": LocalProvider,
}[os.environ.get("PROVIDER", "anthropic")]()
registry = ToolRegistry(tools=[calc, read_file, write_file, bash])
answer = run(
provider=provider,
registry=registry,
user_message=(
"Write the string 'hello world' to /tmp/ch04-test.txt, "
"then read it back, then tell me what the file contained."
),
)
print(answer)
Switching providers is still one environment variable, and the tool definitions are unchanged regardless of which provider answers — that's the layering from Chapter 3 paying off against a more interesting workload than the single-tool calculator demo.
Commit:
git add -A && git commit -m "ch04: Tool abstraction + ToolRegistry + std toolset"
git tag ch04-tools
We have four tools, and that is intentional. Jenova AI's 2025 "AI Tool Overload" analysis put hard numbers on the problem: model tool-selection accuracy drops off a cliff somewhere between 20 and 50 tools, and the cliff is steep enough that a 100-tool agent performs worse than a 10-tool one by a wide margin. That analysis did not appear in a vacuum: it sits inside the broader context of Anthropic's Model Context Protocol (MCP), introduced in November 2024 as the industry's attempt at a standardized tool-interop protocol. MCP lets an agent connect to third-party tool servers without custom adapters per vendor, which is enormously useful, but it also makes the tool-count problem easier to trigger by accident: once tools become cheap to add via an external registry, adding them starts to feel free. Chapter 13 picks up MCP in detail, including the selector machinery that keeps an MCP-heavy agent below the cliff.
Our discipline for the rest of the book follows from the Jenova finding directly: keep the registry small, and make every tool earn its slot. Chapter 12 builds the dynamic tool loader that scales past this cliff for systems that genuinely need 50+ tools. Until then, we hold the line at a handful.
Exercises:

- Add a list_tools tool that returns the names and descriptions of every tool in the registry. Watch what happens when the agent gets confused and calls this tool to re-read its options. Is the behavior useful or noisy?
- Write a tool with a list[dict[str, int]] parameter. What does _schema_from_signature produce? Is it correct? If not, write the version that would be. (Spoiler: you've just reinvented a small fraction of Pydantic's TypeAdapter.)
- Take write_file and change its docstring to say "appends to the file." Don't change the implementation. Prompt the agent to "add a note to /tmp/note.txt without losing what's there." Observe how the agent uses the tool based on its (now-lying) description. What does this tell you about how much the model trusts descriptions?
Chapter 5. Streaming, Interruption, and Error Handling
Previously: tools are first-class and dispatch through a registry. The loop is tight, typed, and provider-agnostic. What it doesn't yet do is stream output to the user, stop cleanly when the user changes their mind, or survive a transient network failure.