Chapter 11. Designing Tools Models Can Actually Use

Previously: context-engineering pillars are in place — accounting, compaction, scratchpad, retrieval. What's left is the source of most of the context pressure we've been managing: tools that return too much because they were designed for humans, not models.

Yang et al.'s 2024 SWE-agent paper (cited in Chapter 4, where we first used its "tool design is interface design" framing) made a sharper central claim: the interface between the LLM and the computer — what the paper names the Agent-Computer Interface, or ACI — matters as much as the LLM itself. Their headline empirical result was that the same model, evaluated on Jimenez et al.'s 2024 SWE-bench benchmark of real GitHub issues, went from near-zero to 12.5% pass@1 by changing nothing but the ACI. Most of that improvement came from tool designs that constrained what the model could see and do in ways that matched its actual cognitive affordances: small viewport into a file rather than the whole file, line-range edits rather than full-file rewrites, errors that suggested what to do next rather than just saying what went wrong.

Our read_file returns the whole file. Our write_file overwrites the whole file. That's wrong for models in the same way handing a user the raw output of cat /etc/passwd in Notepad would be wrong: too much data, no structure, no navigation. This chapter rebuilds the file tools — and establishes the discipline — around the ACI principles.

  read_file(path)
    lines 1–100
    lines 101–200
    lines 201–300
    lines 301–400   ← overflows budget
  → 400 lines dumped, no navigation.

  read_file_viewport(path, offset)
    lines 100–199 (viewport)
    scroll → offset=200
    scroll → offset=300
  → 100-line window, agent navigates.

Viewport reads keep the context budget bounded; the agent scrolls on demand.

11.1 Four Principles of ACI Design

These are the SWE-agent findings, lightly reframed for our purposes.

Viewport, not dump. A model reading a 2000-line file through a single tool call processes those 2000 lines with no structural affordances — it can't scroll, it can't search visually, it can't hold a mental map of where it is. Better: a tool that returns a window (50–100 lines) with explicit position indicators and a scroll command to move.

Targeted edit, not rewrite. A model that wants to change line 47 of a 2000-line file shouldn't have to return all 2000 lines. It should return the change. Targeted edits also make the intent auditable — the diff is minimal, the review is easy, the revert is trivial.

Explicit envelopes. Every tool result needs a machine-readable frame: what was returned, what was truncated, what the next step would be. [file: /etc/passwd; lines 1-100 of 423; call again with offset=100 for more] is cheap to write and saves the model from having to guess.

Error messages as instructions. "File not found" is information. "File not found: /etc/passwd. Did you mean /etc/passwd.bak (found by fuzzy search)? Or, use list_files('/etc') to see available files" is instruction. The model does better with instruction.

The rest of the chapter applies these to our file tools.


11.2 The Viewport File Reader

# src/harness/tools/files.py
from __future__ import annotations

from pathlib import Path

from .base import Tool
from .decorator import tool


VIEWPORT_DEFAULT = 100
VIEWPORT_MAX = 500


@tool(side_effects={"read"})
def read_file_viewport(path: str, offset: int = 0, limit: int = VIEWPORT_DEFAULT) -> str:
    """Read a slice of a text file, like `less` or `head -n ... | tail -n ...`.

    path: filesystem path.
    offset: zero-based line number to start reading from. Default 0.
    limit: max lines to return. Default 100, max 500.

    Returns a rendered viewport with line numbers. The last line of the
    output describes what's visible and what's NOT, so you can call this
    tool again with a different offset to keep reading.

    Side effects: reads the filesystem.

    Use this in preference to reading whole files. For files <50 lines,
    the whole file fits in one call.
    """
    limit = min(max(1, limit), VIEWPORT_MAX)
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"file does not exist: {path}")
    if not p.is_file():
        raise IsADirectoryError(f"not a regular file: {path}")

    lines = p.read_text(encoding="utf-8", errors="replace").splitlines()
    total = len(lines)
    start = max(0, offset)
    if total and start >= total:
        # error-as-instruction: tell the model the valid range, not just "bad offset"
        raise ValueError(
            f"offset {offset} is past the end of {path} "
            f"({total} lines); the last valid offset is {total - 1}"
        )
    end = min(total, start + limit)
    visible = lines[start:end]

    width = len(str(total))
    numbered = [f"{i + 1:>{width}}  {line}" for i, line in enumerate(visible, start=start)]
    body = "\n".join(numbered)
    footer = (f"\n[file {path}; lines {start + 1}-{end} of {total}"
              + (f"; MORE below — call with offset={end}" if end < total else "; end of file")
              + ("; MORE above — call with offset=0" if start > 0 else "")
              + "]")
    return body + footer

Four details of this design are deliberate.

Line numbers in the rendered output. The model reads line numbers alongside content and can refer back to them in subsequent edits. The line-range edit tool (next section) uses these directly.

The footer tells the model what's missing. lines 1-100 of 423; MORE below — call with offset=100. The model doesn't have to infer that there's more; it's told, with the exact call that would fetch it. This maps directly to the "explicit envelopes" principle.

The offset is zero-based; the display is one-based. One-based display is natural for humans and models alike (editors number lines from 1); the offset parameter is zero-based because it's a programmatic slice. The footer makes the difference visible: it labels lines X-Y one-based while telling the model which zero-based offset fetches the next window. This is a small inconsistency, but it matches what editors and compilers do and is easy to explain in the docstring.

Error messages are specific. file does not exist: ... and not a regular file: ... give the model enough to fix the call. A more aggressive version (which we'll add in Chapter 14's sandboxing) would also check allowed paths and say why a path is rejected.
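To see the envelope concretely, here is a standalone sketch of the viewport logic — `render_viewport` is a hypothetical stand-in that mirrors the core of read_file_viewport without the @tool decorator or the harness imports:

```python
from pathlib import Path
import tempfile

def render_viewport(path: str, offset: int = 0, limit: int = 100) -> str:
    # mirrors read_file_viewport: numbered window plus an explicit envelope
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    total = len(lines)
    start = max(0, offset)
    end = min(total, start + limit)
    width = len(str(total))
    body = "\n".join(
        f"{i + 1:>{width}}  {line}"
        for i, line in enumerate(lines[start:end], start=start)
    )
    footer = (f"[file {path}; lines {start + 1}-{end} of {total}"
              + (f"; MORE below — call with offset={end}" if end < total
                 else "; end of file")
              + "]")
    return body + "\n" + footer

# a 423-line file, viewed 100 lines at a time
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(f"line {n}" for n in range(1, 424)))
    tmp = f.name

first = render_viewport(tmp, offset=0)
last = render_viewport(tmp, offset=400)
print(first.splitlines()[-1])  # envelope: lines 1-100 of 423, MORE below, offset=100
print(last.splitlines()[-1])   # envelope: lines 401-423 of 423, end of file
```

The envelope line is the part that matters: each call ends by telling the model exactly what to call next.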


11.3 The Line-Range Editor

# src/harness/tools/files.py (continued)

@tool(side_effects={"write"})
def edit_lines(
    path: str,
    start_line: int,
    end_line: int,
    replacement: str,
) -> str:
    """Replace a line range in a file with new content.

    path: filesystem path (file must exist).
    start_line: one-based starting line (inclusive).
    end_line: one-based ending line (inclusive).
    replacement: text to insert in place of the removed lines. Empty string
                 deletes the range without replacement. Include trailing
                 newlines if you want blank lines.

    Returns a confirmation with the diff summary and the lines around the
    edit (for verification).

    Side effects: writes the file. Preserves content outside the range.

    To INSERT new lines at position N without removing: use start_line=N,
    end_line=N-1 and replacement=your_new_content.
    To APPEND: use start_line=last+1, end_line=last.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"file does not exist: {path}")

    original = p.read_text(encoding="utf-8")
    lines = original.splitlines(keepends=True)
    total = len(lines)

    if start_line < 1 or start_line > total + 1:
        raise ValueError(f"start_line {start_line} out of range (1..{total + 1})")
    if end_line < start_line - 1 or end_line > total:
        raise ValueError(f"end_line {end_line} out of range ({start_line - 1}..{total})")

    # normalize: start is zero-based slice, end is zero-based exclusive
    s = start_line - 1
    e = end_line  # slice end is exclusive of end_line, so this works for deletes too

    replacement_lines = replacement.splitlines(keepends=True)
    if replacement and not replacement.endswith("\n"):
        # make sure we don't glue onto the next line without a newline
        if e < total:
            replacement_lines[-1] = replacement_lines[-1] + "\n"
    if s == total and lines and not lines[-1].endswith("\n") and replacement_lines:
        # appending after a final line that lacks a trailing newline:
        # add one so the new content doesn't glue onto it
        lines[-1] = lines[-1] + "\n"

    new_lines = lines[:s] + replacement_lines + lines[e:]
    p.write_text("".join(new_lines), encoding="utf-8")

    removed = end_line - start_line + 1 if end_line >= start_line else 0
    added = len(replacement_lines)

    # render context around the edit
    context_start = max(0, s - 2)
    context_end = min(len(new_lines), s + len(replacement_lines) + 2)
    preview = "".join(
        f"{i + 1:>5}  {new_lines[i]}" for i in range(context_start, context_end)
    )
    return (f"edited {path}: removed {removed} lines, "
            f"added {added} lines at {start_line}..{end_line}\n"
            f"context:\n{preview}")

The edit tool is more complicated than the viewport reader because editing has more edge cases. We handle:

  • Pure replacement. Lines 5–10 become other content.
  • Pure delete. Lines 5–10 removed (replacement = "").
  • Insert. start_line=5, end_line=4, replacement="new content" inserts before line 5 without removing anything.
  • Append. start_line=total+1, end_line=total, replacement="..." adds to the end.

The return value shows the context around the edit — a few lines before and after — so the model can verify. This is the SWE-agent trick of making tool outputs self-validating: the agent doesn't have to read the file back to confirm; the edit tool shows the result.

Two things worth highlighting:

  • Line-ending preservation. We use splitlines(keepends=True) and add "\n" to replacement content if the next line expects one. This prevents the edit from silently mangling newlines, a common bug in naive diff-apply code.
  • Bounds checks with explicit ranges. "start_line 500 out of range (1..423)" tells the model the specific valid range. A model that miscounts lines (which models routinely do) gets enough signal to correct on the next turn.
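The insert and append conventions are easy to misread, so here is a standalone sketch of the splice arithmetic — `splice_lines` is a hypothetical stand-in for the core of edit_lines, using the same one-based, inclusive convention:

```python
def splice_lines(text: str, start_line: int, end_line: int, replacement: str) -> str:
    # same convention as edit_lines: one-based, inclusive;
    # end_line = start_line - 1 means "insert before start_line"
    lines = text.splitlines(keepends=True)
    s, e = start_line - 1, end_line
    repl = replacement.splitlines(keepends=True)
    if replacement and not replacement.endswith("\n") and e < len(lines):
        repl[-1] += "\n"  # don't glue onto the following line
    return "".join(lines[:s] + repl + lines[e:])

doc = "alpha\nbeta\ngamma\n"
print(splice_lines(doc, 2, 2, "BETA\n"))      # replace line 2: alpha / BETA / gamma
print(splice_lines(doc, 2, 1, "inserted\n"))  # insert before line 2
print(splice_lines(doc, 2, 2, ""))            # delete line 2: alpha / gamma
print(splice_lines(doc, 4, 3, "delta\n"))     # append after line 3
```

All four cases reduce to the same slice expression; only the start_line/end_line arithmetic changes.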


11.4 Replacing the Old Tools

The Chapter 4 read_file and write_file go into a deprecated path. We don't delete them — write_file is still useful for creating files that don't exist, and there are cases where rewriting a whole file is the right call. But the default tools shipped with the harness switch to viewport-and-edit:

# src/harness/tools/std.py (updated)
from .files import read_file_viewport, edit_lines

# calc and bash unchanged
# read_file stays available but is no longer in the "standard" set
# write_file stays available but is no longer in the "standard" set

STANDARD_TOOLS = [calc, bash, read_file_viewport, edit_lines]

Swap these into an agent:

# examples/ch11_viewport.py
import asyncio

from harness.agent import arun
from harness.providers.anthropic import AnthropicProvider
from harness.tools.registry import ToolRegistry
from harness.tools.std import STANDARD_TOOLS


async def main() -> None:
    provider = AnthropicProvider()
    registry = ToolRegistry(tools=STANDARD_TOOLS)
    await arun(
        provider=provider,
        registry=registry,
        user_message=(
            "Read /etc/passwd. There's probably a user called 'nobody' — "
            "find its entry and tell me the shell and home directory."
        ),
    )


asyncio.run(main())

Run it. The model calls read_file_viewport("/etc/passwd", offset=0, limit=100); sees the whole file (it's under 100 lines on a typical system); finds the line for "nobody"; reports back. Compare against the old read_file — same outcome, but the token cost of a larger file would be dramatically different. For a 5,000-line log file, the viewport keeps the tool result under 500 lines; a full read would eat much of the context window in one call.


11.5 Truncation Envelopes for Other Tools

The viewport pattern is specific to files, but the explicit-envelope principle generalizes. Every tool output that can be large should have the same shape:

<content>
[tool_result: <N> items/lines/bytes returned; <M> more omitted.
 Call <suggestion> to see more.]

Apply it to bash:

# src/harness/tools/std.py (bash, updated)

BASH_OUTPUT_LIMIT = 4000  # characters


@tool(side_effects={"read", "network"})
def bash(command: str, timeout_seconds: int = 30) -> str:
    """Run a shell command in the current working directory.
    [... description ...]
    """
    import subprocess
    timeout = min(int(timeout_seconds), 300)
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # error-as-instruction: say what happened and what to try next
        return (f"error: command timed out after {timeout}s. Re-run with a "
                "larger timeout_seconds, or split the work into smaller commands.")
    out = result.stdout
    err = result.stderr

    out_truncated = len(out) > BASH_OUTPUT_LIMIT
    err_truncated = len(err) > BASH_OUTPUT_LIMIT // 2
    if out_truncated:
        out = out[:BASH_OUTPUT_LIMIT] + f"\n...[truncated at {BASH_OUTPUT_LIMIT} chars]"
    if err_truncated:
        err = err[:BASH_OUTPUT_LIMIT // 2] + "\n...[truncated]"

    note = ""
    if out_truncated or err_truncated:
        note = ("\n[note: output was truncated. For large output, "
                "pipe through `head`, `tail`, `grep`, or save to a file "
                "and use read_file_viewport.]")

    return (f"exit={result.returncode}\n"
            f"---stdout---\n{out}\n"
            f"---stderr---\n{err}"
            + note)

The bash tool now caps output, labels the truncation explicitly, and tells the model what to do about it. The suggestion ("pipe through head, tail, grep") is a small LLM-aware design: it's the idiomatic shell way to reduce output, and the model already knows those tools.

Apply the same to any tool that could return a lot: search_docs from Chapter 10, scratchpad_read if an entry gets huge, any HTTP GET tool you add. Consistency of the envelope across tools is itself a feature — the model learns the shape once and applies it everywhere.
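A small helper makes the envelope cheap to reuse across tools. This is a sketch under our own naming — `truncate_with_envelope` is a hypothetical helper, not part of the harness:

```python
def truncate_with_envelope(content: str, limit: int, unit: str = "chars",
                           suggestion: str = "narrow the query") -> str:
    # return content unchanged if it fits; otherwise cut it and append an
    # explicit, machine-readable note about what was omitted and what to do
    if len(content) <= limit:
        return content
    omitted = len(content) - limit
    return (content[:limit]
            + f"\n[tool_result: {limit} {unit} returned; {omitted} more omitted. "
            + f"{suggestion} to see more.]")

print(truncate_with_envelope("x" * 50, limit=20,
                             suggestion="call again with offset=20"))
```

Every large-output tool wraps its return value in this one call, so the envelope shape stays identical everywhere — which is the point.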


11.6 Description Hygiene

The tool descriptions in this chapter are longer than the Chapter 4 versions. Deliberately. A tool with three paragraphs of description — covering what it does, when to use it, how to call it, what the output envelope means — is less likely to be misused than a one-line description. The AWS Heroes 2024 post "MCP Tool Design: Why Your AI Agent Is Failing" put it bluntly: "Sends a notification" gets abused. "Sends an email to the address in args.to. Delivery is asynchronous. Idempotent on message_id. Do not call twice for the same logical message." doesn't.

A checklist for a good tool description:

  1. What it does. One sentence.
  2. What it requires. Preconditions: file exists, user exists, process running.
  3. What it does not do. Scope limits: "does not fetch URLs"; "does not modify git state."
  4. Side effects. Read/write/network/mutate, in plain English.
  5. Output envelope. What the return value looks like, including truncation behavior.
  6. When to prefer it. "Use this rather than X when..."

The viewport reader docstring hits all six. Every tool we've written from Chapter 4 onward will be retrofitted to the same standard as we revisit them.
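To make the checklist concrete, here is the AWS post's email example fleshed out as a docstring that hits all six points. send_email, its parameters, and its behavior are invented for illustration — only the description style is the point:

```python
def send_email(to: str, subject: str, body: str, message_id: str) -> str:
    """Send one email to the address in `to`.

    Requires: `to` is a valid address and SMTP is configured (precondition).
    Does NOT render HTML or fetch attachments from URLs (scope limit).
    Side effects: one outbound network send per unique message_id.
    Returns "queued <message_id>"; delivery is asynchronous (output envelope).
    Idempotent on message_id — do not call twice for the same logical message.
    Prefer this over `bash` with sendmail for any outbound mail.
    """
    # stub for illustration; a real tool would hand off to an SMTP client
    return f"queued {message_id}"

print(send_email("a@example.com", "hi", "hello there", "msg-1"))
```

What it does, preconditions, scope limits, side effects, output envelope, when to prefer it: six sentences, six checklist items.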


11.7 What SWE-agent Got Wrong (And Why It's Instructive)

The original SWE-agent ACI includes custom commands like find_file, search_dir, create, and a detailed file-viewer state machine. The mini-SWE-agent follow-up threw most of it out and used just bash — and achieved comparable SWE-bench results with ~100 lines of code.

What changed? Frontier models got better at using general-purpose tools. The elaborate ACI commands that SWE-agent built to compensate for GPT-4's clumsiness aren't necessary for Claude 3.5 and beyond, which can drive a shell competently as long as the outputs are framed well.

The lesson: design tools to augment the model's weaknesses, not to reinvent capabilities it already has. Viewport reads are still worth it — no model, however good, does well with 50,000-token tool outputs. Line-range edits are still worth it — they're how diffs work, and they make the agent's intent auditable. But re-implementing ls or grep when the model can call bash is rarely worth the maintenance burden.

Our design hits the sweet spot: we add what constrains token flow (viewport, envelopes) and let the model use bash for the general-purpose cases.


11.8 Commit

git add -A && git commit -m "ch11: viewport reader, line-range editor, truncation envelopes"
git tag ch11-tools

11.9 Try It Yourself

  1. Measure the token impact. Run a task that reads a 1000-line file, first with read_file (Chapter 4 version), then with read_file_viewport. Compare total tokens consumed. Compare quality of the output. Is viewport always better, or only for large files?
  2. Extend the edit tool. Add a dry_run parameter that returns the diff but doesn't write. The agent can use this to verify before committing. What tradeoffs does the dry-run option introduce?
  3. Write a bad tool on purpose. Write read_file_unbounded that returns the whole file with no envelope, and hand it to the agent alongside the viewport version. Watch which one the model picks. Does it drift toward the worse tool when the prompt is short? What does that tell you about description discipline?