Chapter 10. Retrieval

Previously: the scratchpad gave the agent durable state for what it produces. What it doesn't cover is what the agent needs to read from but didn't write — a codebase it's exploring, documentation, a knowledge base that wouldn't fit in the context window even if the window were otherwise empty.

Retrieval is how an agent works over a corpus too large to fit in context. The idea is not new: Lewis et al.'s 2020 NeurIPS paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" established the RAG pattern — retrieve relevant passages, inject them into the prompt, generate an answer conditioned on both — and every production retrieval system in LLM-land is a descendant of that work. What most implementations miss is a subtle point that post-2020 research made inescapable: retrieval is not just about getting the right content, it's about getting the right content in the right place. The lost-in-the-middle effect Liu et al. documented in 2023 is real and quantified. A relevant document shoved into the middle of a 100K-token context gets less attention than a less-relevant one placed at the end. You can have perfect recall and terrible answers.

This chapter builds a small retrieval system for the harness with three specific disciplines:

  1. Agent-driven, not passive. The agent chooses when to retrieve, via a tool, rather than retrieval happening every turn.
  2. Edge-placed. Retrieved content goes at the end of the context, right before the user's current turn — the position with the highest attention weight.
  3. Explicit cost. Every retrieval declares what it will add to the context so the agent can make informed choices.

By the end, the agent can search a directory of documents, get relevant chunks with scores, and be trusted not to drown itself.

[Figure: attention retention by context position, a U-curve — ~90% at the start, ~55% in the middle, ~90% at the end — spanning system prompt, history, and current turn. Place critical retrieval results at the edges; end preferred.]
Lost-in-the-middle: attention retention dips hardest in the centre of long contexts.

10.1 Naive RAG and What's Wrong With It

The classic pattern: on every user turn, embed the user's message, search a vector store, take top-K results, prepend them to the prompt. Many tutorials stop there.

Three problems with the naive version.

It retrieves whether or not retrieval is needed. A simple arithmetic prompt triggers a vector search; the top-K results are irrelevant; the model now has irrelevant content in its context, which — per context rot — degrades rather than improves its output.

Placement is wrong. Prepending to the system prompt is the worst spot: middle of the context as soon as history accumulates. The U-curve bites.

The agent can't see the retrieval. If the search was bad, the model doesn't know; it just knows its context contains weird stuff. An agent-driven retrieval tool means the agent decides, sees the results, and can re-query with a better term.

We'll do agent-driven retrieval with edge placement, backed by the cheapest index that can possibly work.


10.2 The Index

For the book's scenarios, we don't need a vector database. A BM25 index over a directory of text documents is accurate enough, fast enough, and — importantly — runs without a network call or an embedding model. The BM25 scoring function itself dates back to Robertson and Zaragoza's 2009 survey "The Probabilistic Relevance Framework: BM25 and Beyond" and the decades of information-retrieval work it consolidated; it is not a stopgap or a simplification, it is the algorithm classical IR converged on for keyword relevance and the one against which every embedding-based retriever is still benchmarked. Chapter 22 discusses when you'd upgrade to embeddings or hybrid retrieval; most harnesses under 10K documents are fine without.

uv add 'rank-bm25>=0.2.2'
# src/harness/retrieval/index.py
from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path

from rank_bm25 import BM25Okapi


def _tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


@dataclass
class Chunk:
    doc_id: str
    chunk_id: int
    text: str


@dataclass
class SearchHit:
    chunk: Chunk
    score: float


class DocumentIndex:
    """A BM25 index over text files in a directory.

    Chunks files into ~500-token pieces with 50-token overlap.
    """

    def __init__(self, root: Path | str, chunk_tokens: int = 500,
                 overlap: int = 50) -> None:
        self.root = Path(root)
        self.chunks: list[Chunk] = []
        self._build(chunk_tokens, overlap)
        tokenized = [_tokenize(c.text) for c in self.chunks]
        self._bm25 = BM25Okapi(tokenized)

    def _build(self, chunk_tokens: int, overlap: int) -> None:
        for path in sorted(self.root.rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, PermissionError):
                continue
            words = text.split()
            for i, start in enumerate(range(0, len(words),
                                             chunk_tokens - overlap)):
                chunk_text = " ".join(words[start:start + chunk_tokens])
                if chunk_text.strip():
                    self.chunks.append(Chunk(
                        doc_id=str(path.relative_to(self.root)),
                        chunk_id=i,
                        text=chunk_text,
                    ))

    def search(self, query: str, k: int = 5) -> list[SearchHit]:
        tokenized_query = _tokenize(query)
        scores = self._bm25.get_scores(tokenized_query)
        indexed = sorted(enumerate(scores), key=lambda x: -x[1])[:k]
        return [SearchHit(chunk=self.chunks[i], score=s)
                for i, s in indexed if s > 0]

Four design choices worth noting.

Word-based chunking, ~500 tokens, 50-token overlap. Good enough for the book's scenarios; production systems use semantic chunking, sentence-aware splitters, or recursive structure-aware approaches. We optimize for readability, not SOTA retrieval quality. The overlap prevents information loss at chunk boundaries.
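The stride arithmetic behind that overlap is worth seeing once. A standalone sketch, mirroring the loop in `_build`: each chunk starts `chunk_tokens - overlap` words after the previous one, so consecutive chunks share `overlap` words.

```python
# Standalone sketch of the chunk-boundary arithmetic in _build:
# each chunk starts (chunk_tokens - overlap) words after the last,
# so consecutive chunks share `overlap` words.
def chunk_starts(n_words: int, chunk_tokens: int = 500,
                 overlap: int = 50) -> list[int]:
    stride = chunk_tokens - overlap  # 450 with the defaults
    return list(range(0, n_words, stride))

starts = chunk_starts(1200)
print(starts)  # [0, 450, 900]
# Chunk 0 covers words 0..499 and chunk 1 covers 450..949, so a fact
# straddling the 500-word boundary appears intact in chunk 1.
```

That shared 50-word window is the whole trick: a sentence cut by one chunk boundary is whole in the neighbouring chunk.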

BM25, not embeddings. BM25 is a bag-of-words score: TF-IDF on steroids. It works shockingly well on technical documentation, code, and any corpus with meaningful keywords. Embeddings are better for semantic similarity (paraphrase queries) but require an embedding model, a vector store, and a network hop. The book's harness can index 5,000 documents in seconds and search them in milliseconds; that's the right engineering budget here.

Filter zero-score hits. BM25 returns a score for every chunk, many near zero. Returning them would pollute the agent's context with pretend-relevant noise. We cap at k and require positive score; if the query matches nothing, we return empty.

Chunks carry doc_id and chunk_id. The agent sees where each hit came from. It can refer back to "the third chunk of config.yaml" in its reasoning; Chapter 13's viewport reader can render the full chunk if needed.


10.3 The Retrieve Tool

# src/harness/tools/retrieval.py
from __future__ import annotations

from ..retrieval.index import DocumentIndex
from .base import Tool
from .decorator import tool


class RetrievalInterface:
    def __init__(self, index: DocumentIndex) -> None:
        self.index = index

    def as_tools(self) -> list[Tool]:
        idx = self.index

        @tool(side_effects={"read"})
        def search_docs(query: str, k: int = 5) -> str:
            """Search the document corpus for chunks matching a query.

            query: keywords or a short sentence describing what you're
                   looking for.
            k: number of hits to return (default 5, max 10).

            Returns up to k hits, each with: doc_id, chunk_id, score,
            and the chunk text. Chunks are ~500 tokens each; plan your
            context budget before calling with k > 3.

            Side effects: reads the in-memory index.
            """
            k = min(max(1, k), 10)
            hits = idx.search(query, k=k)
            if not hits:
                return "(no results)"

            lines: list[str] = []
            total_chars = 0
            for hit in hits:
                c = hit.chunk
                lines.append(f"\n--- {c.doc_id}#{c.chunk_id} "
                             f"(score={hit.score:.2f}) ---")
                lines.append(c.text)
                total_chars += len(c.text)
            lines.append(f"\n[{len(hits)} hits, ~{total_chars} chars "
                         f"(~{total_chars // 4} tokens)]")
            return "\n".join(lines)

        return [search_docs]

The tool description carries three specific instructions. It names the cost (chunks are ~500 tokens). It caps k at 10. It includes the total token estimate in the result text, so the agent knows what it just paid for.

The last line of the result — [5 hits, ~12500 chars (~3125 tokens)] — is a deliberate choice. Without it, the agent has no way to feel the cost of retrieval. With it, the agent learns: "this query cost me 3K tokens; I should synthesize rather than retrieve again."


10.4 Edge Placement

Retrieval hits come back as a ToolResult, which ends up in the transcript like any other tool result. By the time the next turn runs, the hit is somewhere in the history. If the session is long, the hit is in the middle — the worst position.

The fix: we want retrieved content to be freshly placed at the end of the context on the turn the agent wants to act on it. Two ways to do this.

The agent chooses placement. The agent reads the hit from the tool result and rewrites it into its own reasoning on the next turn. "I found: .... Based on this, I will..." The retrieved content now occupies the fresh assistant-message position. This is how most agents work naturally, and it works as long as the agent has the discipline.

The harness places it. The harness intercepts search results and re-inserts them as a synthesized recent message, right before the next user turn. This is more invasive — and can confuse the model about what happened — but it guarantees placement regardless of agent discipline.

We do the first, with a small assist: the retrieval tool's result is structured so the agent can easily lift it verbatim. Chapter 16's structured plans build on this pattern — the plan is the thing the agent reads every turn, and it sits at the end of context by construction.
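For completeness, the second option could look something like this. A minimal sketch, assuming messages are plain `{"role", "content"}` dicts — the function name and message shape are illustrative, not part of the harness as built:

```python
# Hypothetical sketch of harness-side placement (option two). Assumes
# messages are {"role": ..., "content": ...} dicts; names are
# illustrative, not part of the harness as built.
def place_retrieval_at_edge(messages: list[dict], retrieved: str) -> list[dict]:
    """Splice retrieved text in just before the final user message,
    so it lands at the high-attention end of the context."""
    note = {"role": "user",
            "content": f"[retrieved context]\n{retrieved}"}
    # Walk backwards to find the final user turn.
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "user":
            return messages[:i] + [note] + messages[i:]
    return messages + [note]  # no user turn yet: append at the end

history = [
    {"role": "user", "content": "explain retry budgets"},
    {"role": "assistant", "content": "searching..."},
    {"role": "user", "content": "and quote the passage"},
]
placed = place_retrieval_at_edge(history, "retry budgets: ...")
print([m["role"] for m in placed])
# ['user', 'assistant', 'user', 'user'] — note sits just before the turn
```

The splice guarantees edge placement, but the model now sees a user message it never received from the user — which is exactly the confusion risk noted above.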


10.5 The Scenario

# examples/ch10_corpus.py
import asyncio
from pathlib import Path

from harness.agent import arun
from harness.context.accountant import ContextAccountant
from harness.context.compactor import Compactor
from harness.providers.anthropic import AnthropicProvider
from harness.retrieval.index import DocumentIndex
from harness.tools.registry import ToolRegistry
from harness.tools.retrieval import RetrievalInterface
from harness.tools.std import calc, read_file, write_file


SYSTEM = """\
You have a tool `search_docs(query, k)` that searches a corpus of
documentation. Use it when the user asks questions that likely have answers
in the docs, rather than guessing. Each result is ~500 tokens; prefer k=3
or k=5 over k=10 unless you need breadth. After getting results, quote the
relevant passages in your reasoning — do not rely on memory of them across
many turns. If the first query is not useful, refine the query; do not
give up after one search.
"""


async def main() -> None:
    provider = AnthropicProvider()
    index = DocumentIndex(root=Path("./docs-corpus"))
    retriever = RetrievalInterface(index)
    registry = ToolRegistry(tools=[calc, read_file, write_file,
                                    *retriever.as_tools()])
    accountant = ContextAccountant()
    compactor = Compactor(accountant, provider)

    await arun(
        provider=provider,
        registry=registry,
        system=SYSTEM,
        accountant=accountant,
        compactor=compactor,
        user_message=(
            "Look through the docs and explain how retry budgets are "
            "configured. Quote the relevant passage. If retry budgets "
            "aren't documented, say so explicitly."
        ),
    )


asyncio.run(main())

Point this at any directory with docs — the book's own research/ directory works, or a cloned project's docs. The agent now queries the index instead of trying to divine the answer; when the query is weak, it retries with a better one; when the answer isn't in the corpus, it says so, because the retrieved chunks don't mention retry budgets and the agent knows not to invent.


10.6 When Retrieval Hurts

Three failure modes to recognize.

Distractor interference. The query returns chunks that look related but aren't. The model latches onto them and answers confidently wrong. Mitigation: higher score thresholds (our code filters score > 0, but you can lift the floor to 0.5 or 1.0 depending on your corpus); smaller k; better chunk boundaries. Evals — Chapter 19 — are how you discover whether your thresholds are right for your corpus.
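Lifting the score floor is a one-line filter. A sketch over plain `(score, text)` pairs — the 0.5 floor is an assumption to tune against your eval set, not a universal constant:

```python
# Sketch of a score floor on top of the zero-score filter, over plain
# (score, text) pairs. min_score=0.5 is an illustrative default;
# the right floor is corpus-dependent and found via evals.
def apply_score_floor(hits: list[tuple[float, str]],
                      min_score: float = 0.5) -> list[tuple[float, str]]:
    return [(s, t) for s, t in hits if s >= min_score]

hits = [(4.2, "retry budgets are configured in..."),
        (0.9, "budget planning for teams"),
        (0.3, "unrelated changelog entry")]
print(apply_score_floor(hits, min_score=0.5))
# keeps the first two; the 0.3 distractor is dropped
```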

Query-document mismatch. The user asks about "rate limiting"; the docs use "throttling"; BM25 doesn't know they're synonyms. An embedding-based index would handle this; BM25 requires the agent to re-query with broader terms. Well-written tool descriptions that tell the agent to refine queries help a lot here.

Redundancy within top-K. Two of the five hits are the same content from overlapping chunks. The model burns tokens on a duplicate. Mitigation: de-duplicate by doc/chunk proximity in the retriever, or enlarge the chunk size and reduce K. Simple post-filtering in search_docs would be: after the top-K, skip any chunk that overlaps with an already-included one by more than X tokens.
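That post-filter could be sketched like this — a greedy pass over the ranked chunk texts, using word-set overlap as a cheap stand-in for token overlap (the 0.6 threshold is an assumption to tune):

```python
# Sketch of the de-duplication post-filter described above: greedily
# keep hits best-first, skipping any chunk whose word overlap with an
# already-kept chunk exceeds a threshold. The 0.6 cutoff is illustrative.
def dedupe_hits(texts: list[str], max_overlap: float = 0.6) -> list[str]:
    kept: list[set[str]] = []
    out: list[str] = []
    for text in texts:  # texts arrive ranked best-first
        words = set(text.lower().split())
        if any(len(words & prev) / max(len(words), 1) > max_overlap
               for prev in kept):
            continue  # mostly a repeat of a better-scored chunk
        kept.append(words)
        out.append(text)
    return out

hits = ["retry budgets cap the retries per window",
        "budgets cap the retries per window and reset hourly",
        "timeouts are configured separately"]
print(len(dedupe_hits(hits)))  # 2: the near-duplicate second hit is dropped
```

In the real index you'd also check `doc_id`/`chunk_id` adjacency — overlapping chunks from the same file are the common culprit.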


10.7 Hybrid Retrieval and Why We're Not Building It

Production retrieval systems usually combine BM25 (keyword precision) with embeddings (semantic recall) via reciprocal rank fusion. The harness supports this straightforwardly — swap DocumentIndex for a hybrid implementation, keep the same search method — but the book doesn't need it. The scenarios we run are keyword-rich (technical docs, code, configs), and BM25 dominates on those.
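Reciprocal rank fusion itself is only a few lines. A sketch, if you ever make that swap: each document scores the sum of `1 / (k + rank)` over every ranking it appears in, with `k = 60` as the conventional smoothing constant.

```python
# Sketch of reciprocal rank fusion (RRF): merge several rankings by
# summing 1 / (k + rank) per document; k=60 is the conventional
# smoothing constant. Doc IDs below are illustrative.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["a", "b", "c"]   # keyword precision
embed_ranking = ["c", "a", "d"]  # semantic recall
print(rrf([bm25_ranking, embed_ranking]))
# "a" and "c" appear in both lists, so they fuse to the top
```

The appeal is that RRF needs only ranks, never raw scores, so it sidesteps the problem of BM25 and cosine similarity living on incomparable scales.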

When you'd switch:

  • Paraphrase-heavy queries. Users asking "how do I make my agent remember things?" when the docs say "context persistence."
  • Cross-lingual. Queries in one language, docs in another.
  • Very short documents. Tweets, SMS, short FAQ entries — BM25 starves on short texts because the TF component has nothing to work with.

For everything the book builds, BM25 is sufficient. Chapter 22 lists hybrid retrieval as a first-class upgrade path.


10.8 Commit

git add -A && git commit -m "ch10: BM25 document index + agent-driven retrieval tool"
git tag ch10-retrieval

10.9 Try It Yourself

  1. Index the book itself. Point DocumentIndex at this book's chapters/ directory and ask the agent "how does compaction work in this harness?" Does the retrieval find Chapter 8? If not, what's wrong with the chunking or the query?
  2. Stress the retrieval. Index a directory with 10,000+ files (a cloned open-source project's source tree, say). Time the index build and the query. Acceptable? If not, what would you profile first?
  3. Build a distractor test. Index two directories — one with docs on a topic, one with docs on an unrelated topic. Ask a question whose answer is in the first. Measure how often the second directory's chunks appear in top-5. That's your distractor rate; it tells you whether to raise your score threshold or rewrite your chunks.