Previously: tool design for a non-human reader. The harness now has a handful of well-designed tools. What happens when you need thirty of them?
The "tool cliff" is a non-linear performance collapse. Jenova AI's 2025 "AI Tool Overload" analysis (the same empirical finding we cited in Chapter 4) and several independent replications since then found that models routinely handle 10 tools near-perfectly, degrade noticeably at 20, and fall off a cliff somewhere between 30 and 50: tool-selection accuracy drops sharply, argument shapes get confused across tools, and context consumption from tool schemas alone eats 5–7% of the window before the user says anything.
Three distinct problems hide inside that one observation.
Token cost of schemas. Each tool's schema in the prompt is 100–500 tokens. Fifty tools is 5K–25K tokens of overhead per turn, before the user gets a word in edgewise.
Attention dilution. The model has to "choose the right tool" from a list. The longer the list, the harder the choice. Selection accuracy drops even when the right tool's description would be unambiguous if it were the only option.
Name and parameter collision. Two tools called search_docs and search_code with similar parameter shapes get confused. The model calls one expecting the other's behavior. This is a specific failure mode: the model isn't picking the wrong tool because it doesn't know the difference; it's picking the right tool and passing the wrong arguments because it's blending two similar schemas.
The fix is dynamic tool loading. Instead of showing the model all tools at every turn, we show it a small selection relevant to the current task. EclipseSource's 2026 "MCP and Context Overload" analysis frames this as "tool selection as a retrieval problem" — and that's exactly how we'll implement it.
Before committing to an implementation, it's worth seeing the design space.
Static subsetting. Define a few fixed tool subsets ("read-only mode", "code-editing mode") and switch between them explicitly. Simple, predictable, needs no retrieval. The cost: the agent can't mix tool subsets mid-task. Works well for sharply-divided workloads (chat mode vs code mode in Cursor).
Dynamic top-K by embedding. Embed the tool description and the current task. Fetch the top-K most relevant tools. The agent sees K schemas per turn. Accurate, but introduces an embedding dependency and a latency hit. Production systems use this at scale.
Dynamic top-K by BM25. Same as above but keyword-based. Cheaper, no embedding model, works well when tool descriptions use domain vocabulary. Less accurate on paraphrase queries — but we control the queries (they come from the agent or from a classifier), so we can write them in the same vocabulary as the tools.
We'll build the BM25 version. The upgrade to embeddings is a twenty-line swap if you hit its limits.
BM25 is the same ranking function Chapter 10 used for document retrieval, formalized in Robertson and Zaragoza's 2009 "The Probabilistic Relevance Framework" we cited there. Tool selection is a retrieval problem — rank documents (tools) by relevance to a query (the current task) — and the same machinery applies with only the corpus changed.
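Before wrapping that machinery in a class, the scoring itself can be sketched from scratch — a hand-rolled Okapi BM25 over three invented tool descriptions (names and descriptions here are illustrative; the real implementation below uses the rank_bm25 package instead of this toy scorer):

```python
import math
import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())

# Invented tools for illustration only.
tools = {
    "read_file_viewport": "Read a slice of a file by line range.",
    "edit_lines": "Replace a range of lines in a file with new text.",
    "slack_post": "Post a message to a Slack channel.",
}

# Index name + description, exactly as the catalog will.
docs = {name: tokenize(f"{name} {desc}") for name, desc in tools.items()}
avgdl = sum(len(d) for d in docs.values()) / len(docs)
N = len(docs)

def bm25_score(query: str, doc: list[str], k1: float = 1.5, b: float = 0.75) -> float:
    score = 0.0
    for term in tokenize(query):
        df = sum(term in d for d in docs.values())
        if df == 0:
            continue  # term appears in no tool: contributes nothing
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

query = "read the log file and find errors"
ranked = sorted(docs, key=lambda n: -bm25_score(query, docs[n]))
# The file-related tools outrank slack_post for this query;
# slack_post shares no terms with the query and scores exactly 0.
```

The zero score for `slack_post` is the behavior the catalog's score floor relies on: non-matching tools don't just rank low, they can be dropped entirely.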
# src/harness/tools/selector.py
from __future__ import annotations

import re
from dataclasses import dataclass

from rank_bm25 import BM25Okapi

from .base import Tool


def _tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


@dataclass
class ToolCatalog:
    """A catalog of tools, with a BM25 index over names + descriptions."""

    tools: list[Tool]

    def __post_init__(self) -> None:
        self._tokenized = [
            _tokenize(f"{t.name} {t.description}") for t in self.tools
        ]
        self._bm25 = BM25Okapi(self._tokenized)
        self._by_name = {t.name: t for t in self.tools}

    def select(
        self,
        query: str,
        k: int = 7,
        must_include: set[str] | None = None,
    ) -> list[Tool]:
        """Return up to k tools most relevant to the query.

        must_include: tool names that must appear in the result regardless
        of score — typically "core" tools the agent always has.
        """
        must_include = must_include or set()
        pinned = [self._by_name[n] for n in must_include if n in self._by_name]
        scores = self._bm25.get_scores(_tokenize(query))
        ranked = sorted(enumerate(scores), key=lambda x: -x[1])
        remaining_slots = max(0, k - len(pinned))
        picks: list[Tool] = list(pinned)
        seen = {t.name for t in pinned}
        for i, score in ranked:
            if remaining_slots <= 0:
                break
            tool = self.tools[i]
            if tool.name in seen:
                continue
            if score <= 0:
                continue
            picks.append(tool)
            seen.add(tool.name)
            remaining_slots -= 1
        return picks

    def get(self, name: str) -> Tool | None:
        return self._by_name.get(name)

    def all_names(self) -> list[str]:
        return list(self._by_name.keys())
The catalog is a searchable tool registry. Two features worth naming.
must_include for pinned tools. Some tools should always be present — scratchpad_read, scratchpad_list, maybe a help tool. Pinning keeps them available regardless of what the query retrieved. This is how we prevent the selector from accidentally hiding essential capabilities.
Score floor. We don't include tools with score ≤ 0. If the query doesn't match any tool, we return just the pinned ones. The model learns that an empty selection means "nothing in the catalog looks relevant."
Why must_include is load-bearing, not a nice-to-have. The score floor has an uncomfortable failure mode: a query that matches nothing produces an empty selection. On the first user turn — "hi", "help", "what can you do?" — every tool scores 0 and the selector returns nothing. The agent sees zero tools, can only respond with text, and has no path to discover what the harness can do. The same failure mode fires on mid-task pivots: five turns of file work, then "now post a summary to Slack", and BM25's transcript-derived query is dominated by filesystem vocabulary rather than slack. Pinning a single discovery tool — we build it in §12.5 — closes both holes. Without it, you've shipped a selector-backed agent that can go blind in ways the rest of the chapter's machinery can't recover from.
The selector needs a query — a string that describes what tools would be useful right now. Two workable strategies.
Classify the user's message. The user says "read the log file and find errors." A small classifier (could be a cheap model, could be rules) extracts "read file" and "find errors" as the task intent, and those keywords drive the BM25 query. Works well when user turns are clean task descriptions; falls apart on conversational multi-step interactions.
Use the agent's running transcript. Take the last user message, the last assistant thought, and maybe the last tool call, and use that text as the query. This is more robust — the agent's own reasoning naturally reaches for relevant vocabulary — but it requires that the agent is making progress at all (on the first turn, you have only the user's message).
We use a hybrid: the user's original message as a base, augmented by the last couple of turns if they exist. This gives us a query that reflects both initial intent and current direction.
# src/harness/tools/selector.py (continued)
from ..messages import Transcript, TextBlock, ToolCall


def query_from_transcript(transcript: Transcript) -> str:
    """Derive a search query from the transcript: user intent plus recent activity."""
    parts: list[str] = []
    # The first user message anchors the query with the original intent.
    if transcript.messages:
        first = transcript.messages[0]
        for b in first.blocks:
            if isinstance(b, TextBlock):
                parts.append(b.text)
    # Assistant messages from the last six turns (text and tool calls)
    # capture the agent's current focus.
    recent = [m for m in transcript.messages[-6:] if m.role == "assistant"]
    for m in recent:
        for b in m.blocks:
            if isinstance(b, TextBlock):
                parts.append(b.text[:500])
            elif isinstance(b, ToolCall):
                parts.append(f"{b.name} {list(b.args.keys())}")
    return " ".join(parts)
Not sophisticated. Works surprisingly well.
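To see what that query actually looks like, here's a toy run. The dataclasses are minimal stand-ins that only mimic the fields the function reads (the harness's real Transcript, Message, TextBlock, and ToolCall live in the messages module), and the function is restated inline so the snippet is self-contained:

```python
from dataclasses import dataclass, field

# Stand-in message types, illustrative only.
@dataclass
class TextBlock:
    text: str

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Message:
    role: str
    blocks: list

@dataclass
class Transcript:
    messages: list = field(default_factory=list)

def query_from_transcript(transcript: Transcript) -> str:
    parts: list[str] = []
    if transcript.messages:
        for b in transcript.messages[0].blocks:
            if isinstance(b, TextBlock):
                parts.append(b.text)
    for m in [m for m in transcript.messages[-6:] if m.role == "assistant"]:
        for b in m.blocks:
            if isinstance(b, TextBlock):
                parts.append(b.text[:500])
            elif isinstance(b, ToolCall):
                parts.append(f"{b.name} {list(b.args.keys())}")
    return " ".join(parts)

t = Transcript(messages=[
    Message("user", [TextBlock("read the log file and find errors")]),
    Message("assistant", [ToolCall("read_file_viewport", {"path": "app.log"})]),
])
q = query_from_transcript(t)
# → "read the log file and find errors read_file_viewport ['path']"
```

The query is just vocabulary soup — user intent plus tool names and argument keys — which is exactly what BM25 wants.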
The loop now picks tools per turn instead of using a fixed registry. Signature note: this chapter changes arun's tool parameter from registry: to catalog:. Earlier chapters' examples (Chs 8–11) that called arun(..., registry=registry, ...) need to be updated to arun(..., catalog=ToolCatalog(tools=list(registry.tools.values())), ...) — or use the convenience ToolCatalog.from_registry(registry) if you wire one up.
# src/harness/agent.py (selector-aware version)
from .tools.selector import ToolCatalog, query_from_transcript


async def arun(
    provider: Provider,
    catalog: ToolCatalog,
    user_message: str,
    transcript: Transcript | None = None,
    system: str | None = None,
    on_event: "callable | None" = None,
    on_tool_call: "callable | None" = None,
    on_tool_result: "callable | None" = None,
    on_snapshot: "callable | None" = None,
    accountant: ContextAccountant | None = None,
    compactor: Compactor | None = None,
    pinned_tools: set[str] | None = None,
    tools_per_turn: int = 7,
) -> str:
    if transcript is None:
        transcript = Transcript(system=system)
    transcript.append(Message.user_text(user_message))
    accountant = accountant or ContextAccountant()
    compactor = compactor or Compactor(accountant, provider)
    for _ in range(MAX_ITERATIONS):
        # Select tools for this turn.
        query = query_from_transcript(transcript)
        selected = catalog.select(query, k=tools_per_turn,
                                  must_include=pinned_tools)
        registry = ToolRegistry(tools=selected)
        snapshot = accountant.snapshot(transcript, tools=registry.schemas())
        if on_snapshot is not None:
            on_snapshot(snapshot)
        if snapshot.state == "red":
            await compactor.compact_if_needed(transcript, registry.schemas())
        response = await _one_turn(provider, registry, transcript, on_event=on_event)
        transcript.append(Message.from_assistant_response(response))
        if response.is_final:
            return response.text or ""
        for ref in response.tool_calls:
            result = registry.dispatch(ref.name, ref.args, ref.id)
            transcript.append(Message.tool_result(result))
    raise RuntimeError(f"agent did not finish in {MAX_ITERATIONS} iterations")
One concern: what if the model wants to call a tool that wasn't selected this turn? Three cases.
The tool was filtered out. This is OK and informative — the model gets an "unknown tool" error from the registry, and next turn the query (which now includes the model's attempted tool name) is likely to bring that tool back into the selection. Try-fail-retry is the mechanism, and it converges fast.
The tool doesn't exist in the catalog. Same error. No recovery possible; the model has genuinely hallucinated.
The model doesn't know the tool exists. This is the one try-fail-retry cannot fix: the model can't attempt a tool whose name it hasn't seen, and the selector only surfaces what the current query matches. On first turns or mid-task pivots, that's often nothing useful. §12.5 builds a discovery tool you pin into every turn so the model always has a way to ask "what can I do?" — without it, the other two recovery paths above are dead letters.
The registry already handles the first two cases with the same close-match suggestion from Chapter 6. The catalog approach doesn't need new error paths for them — but it does need the discovery tool for the third.
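The close-match suggestion itself is a few lines of standard library. A sketch of that error path, assuming difflib-based matching (the message wording here is illustrative, not the harness's exact format):

```python
import difflib

def unknown_tool_error(name: str, known: list[str]) -> str:
    """Build an 'unknown tool' error, suggesting the nearest known name."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=0.6)
    hint = f" Did you mean '{matches[0]}'?" if matches else ""
    return f"Unknown tool '{name}'.{hint}"

# A near-miss gets a suggestion the model can act on next turn:
print(unknown_tool_error("search_doc", ["search_docs", "search_code"]))
# → Unknown tool 'search_doc'. Did you mean 'search_docs'?
```

The suggestion string does double duty under the selector: it feeds the corrected tool name into the transcript, so the next turn's query pulls that tool back into the selection.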
Two scenarios break the selector if you only rely on per-turn BM25 matching. The first: a vague opener — "hi", "help", "what can you do?" — produces a query that scores every tool at zero, and the selection is empty. The second: mid-task pivots — the user asks for a capability BM25 doesn't associate with the current transcript ("now post a summary to Slack" after five turns of file work). In both cases, the fix is the same. The agent needs a tool it can always call to see the full catalog, so it can decide for itself whether the capability it needs exists. That tool is list_available_tools:
# src/harness/tools/selector.py (continued)


def discovery_tool(catalog: ToolCatalog) -> Tool:
    from .decorator import tool as tool_decorator

    @tool_decorator(side_effects={"read"})
    def list_available_tools(filter_term: str | None = None) -> str:
        """List tools available in this harness.

        filter_term: optional substring to match against tool name or
        description. Use this to narrow a large catalog.

        Returns a newline-separated list of `name — one-line summary`.

        Use this when you think a capability you need exists but isn't in
        your current tool list. After discovering a tool name, you can call
        it directly — the tool will be loaded for your next turn.
        """
        results = []
        for t in catalog.tools:
            first_line = t.description.split("\n", 1)[0]
            text = f"{t.name} — {first_line}"
            if filter_term and filter_term.lower() not in text.lower():
                continue
            results.append(text)
        return "\n".join(results) if results else "(no matching tools)"

    return list_available_tools
Pin this tool. "Pinning" means it shows up in every turn's selection regardless of BM25 score — which is exactly how you build it into the arun call:
# wiring: build a base catalog, derive the discovery tool from it, then
# rebuild the catalog with the discovery tool included and pin its name so
# every turn's selection contains at least this one entry. (The discovery
# tool lists the base catalog, so it won't list itself — harmless.)
base_catalog = ToolCatalog(tools=all_tools)
catalog = ToolCatalog(tools=all_tools + [discovery_tool(base_catalog)])
await arun(
    provider=provider,
    catalog=catalog,
    user_message=user_message,
    pinned_tools={"list_available_tools"},  # always surfaces, score be damned
    tools_per_turn=7,
)
The tool's docstring instruction ("call it directly after discovery") works because the next turn's query will include the tool name the model just tried, and normal BM25 will surface it without needing a second discovery round-trip.
This is Cursor's pattern, approximately: the agent has a codebase search tool as a first-class primitive, and uses it to discover what's relevant. We've generalized the idea to tool discovery.
The selector is cheap to try. Build a catalog with thirty tools (invent plausible ones: github_search, npm_info, read_file_viewport, edit_lines, run_tests, diff, git_status, git_diff, git_log, http_get, http_post, and so on — any plausible names will do), pin list_available_tools, scratchpad_list, scratchpad_read, and watch what happens in a real task.
Three observations typically hold.
Selection is mostly right. For a clear task ("read this file and fix the bug"), the top-7 selection includes read_file_viewport, edit_lines, maybe run_tests. The irrelevant twenty tools stay out of context.
The model rarely hits missing tools. When it does, it often recovers by calling list_available_tools and trying again. Pinning that discovery tool pays for itself many times over.
Schema overhead drops roughly linearly with the selected-tool count. Going from 30 tools to 7 reduces tool-schema tokens by about 75% on our examples. That's real context budget returned.
One counter-observation: even with tools_per_turn=7 or higher, the selector will occasionally miss a tool the model needs mid-task — a Slack tool when the transcript is dominated by file operations, say. This is the case §12.5's discovery tool handles: the model calls list_available_tools("slack"), sees slack_post exists, calls it, and the next turn's query (now containing slack_post) surfaces it through normal selection. Tuning tools_per_turn reduces but doesn't eliminate this — pinning discovery is the reliable fix. Chapter 19's eval harness is how you tune both knobs empirically.
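The schema-savings arithmetic behind the third observation is easy to check, assuming a uniform (hypothetical) 300 tokens per tool schema:

```python
# Back-of-envelope: schema tokens before and after the selector,
# at an assumed flat 300 tokens per schema.
per_schema = 300
before = 30 * per_schema   # full catalog in context: 9,000 tokens
after = 7 * per_schema     # selected subset: 2,100 tokens
savings = 1 - after / before
# ≈ 0.77 — roughly the 75% reduction observed
```

Real schemas vary in size, so the measured number drifts around this estimate, but the proportion is driven almost entirely by the tool count.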
If your harness has five tools, use them all, all the time. The selector costs more (BM25 index, query building) than it saves. The cliff doesn't exist below ~20 tools.
If your tools are sharply siloed — a codebase search tool, a shell tool, a deployment tool — and the user clearly wants one silo at a time, a simple mode switch is cleaner than dynamic retrieval. Cursor's "agent mode" vs "ask mode" is this pattern.
If you have 200+ tools, BM25 starts to miss; you want embeddings. The interface (catalog.select(query, k)) doesn't change. The implementation does.
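That swap can be sketched with a toy embedder — a bag-of-words vector standing in for a real embedding model — to show the interface holding steady while the scoring changes (all names here are illustrative; a production version would call out to an actual embedding model):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a sparse bag-of-words vector. A real implementation
    would call an embedding model here; nothing else below changes."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_by_embedding(tools: dict[str, str], query: str, k: int) -> list[str]:
    """Same shape as catalog.select(query, k): rank tools by similarity."""
    q = embed(query)
    scored = sorted(tools, key=lambda n: -cosine(q, embed(f"{n} {tools[n]}")))
    return scored[:k]

tools = {
    "read_file_viewport": "Read a slice of a file by line range.",
    "slack_post": "Post a message to a Slack channel.",
}
# → ["read_file_viewport"]
select_by_embedding(tools, "read a file", 1)
```

Swapping `embed` for a model call is the whole upgrade; the `select(query, k)` contract and everything downstream stay untouched.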
We use the selector in this book's harness from Chapter 13 onward — where we integrate MCP tools (potentially many) — because that's the point where the tool count crosses over into selector-justifying territory.
git add -A && git commit -m "ch12: dynamic tool loading with ToolCatalog"
git tag ch12-selector
Rename your tools to tool_1, tool_2, ... and give them vague descriptions. Run the selector. Observe the degradation. This is a direct measure of how much description quality matters — a lesson that applies regardless of whether you use a selector.