Chapter 22. What Transfers, Where to Go

Previously: twenty-one chapters of cumulative engineering. The harness has a loop, a transcript, adapters for three providers, a tool registry with validation and loop detection, streaming, async, permissions, MCP integration, a scratchpad, retrieval, compaction, sub-agents, structured plans, parallel coordination, observability, evals, cost control, and durable checkpointing.

One chapter left. Not for more machinery — the machinery is done. For stepping back. We run the full harness against three providers to prove the adapter seam earns its name. We name what the harness doesn't do and where each gap would be filled. We close with a scorecard for the next framework you evaluate.

[Figure: three providers — Anthropic, OpenAI, Local/OSS — behind one Provider interface; the harness (loop / tools / context) is unchanged.]

provider    tokens  iterations  compactions  seconds
Anthropic    6,412           4            0      7.1
OpenAI       6,980           5            0      8.4
Local        9,108           6            1     22.3
Three providers, one Provider interface, one harness. Same code runs against each; the numbers shift, the shape doesn't.

22.1 Running Against Three Providers

The commitment from Chapter 1, tested. Provider-agnostic means the core harness — loop, tools, registry, context engineering — works unchanged against any Provider. The AnthropicProvider, OpenAIProvider, and LocalProvider are the three we built. Let's run the same example against each and observe.

# examples/ch22_multi_provider.py
import asyncio
import os
import time
from pathlib import Path

from harness.agent import arun
from harness.context.accountant import ContextAccountant
from harness.context.compactor import Compactor
from harness.observability.tracing import setup_tracing
from harness.providers.anthropic import AnthropicProvider
from harness.providers.openai import OpenAIProvider
from harness.providers.local import LocalProvider
from harness.tools.scratchpad import Scratchpad
from harness.tools.selector import ToolCatalog
from harness.tools.std import STANDARD_TOOLS


TASK = (
    "Read the file /etc/hostname. Using the calculator, compute the length "
    "of its contents squared. Write the result to /tmp/hostname-square.txt. "
    "Report: the hostname, its length, the square, and the path you wrote."
)


async def run_with(provider) -> dict:
    pad = Scratchpad(root=Path(f".scratchpad-{provider.name}"))
    catalog = ToolCatalog(tools=STANDARD_TOOLS + pad.as_tools())
    accountant = ContextAccountant()
    compactor = Compactor(accountant, provider)

    tool_call_count = 0
    compaction_count = 0

    def on_snapshot(snap):
        nonlocal compaction_count
        if snap.state == "red":
            compaction_count += 1

    start = time.time()
    result = await arun(
        provider=provider,
        catalog=catalog,
        user_message=TASK,
        accountant=accountant,
        compactor=compactor,
        on_snapshot=on_snapshot,
    )
    return {
        "provider": provider.name,
        "duration_s": round(time.time() - start, 2),
        "tokens_used": result.tokens_used,
        "iterations_used": result.iterations_used,
        "compactions": compaction_count,
        "summary": result.summary,
    }


async def main() -> None:
    setup_tracing()  # Ch 18 — spans per provider go through the same exporter

    providers = [AnthropicProvider(), OpenAIProvider()]
    if os.environ.get("LOCAL_ENDPOINT"):
        providers.append(LocalProvider(base_url=os.environ["LOCAL_ENDPOINT"]))

    results = await asyncio.gather(*(run_with(p) for p in providers))

    # Comparison table — this is the adapter-seam payoff.
    print(f"\n{'provider':<12}  {'tokens':>8}  {'iters':>6}  {'compact':>8}  {'sec':>6}")
    for r in results:
        print(f"{r['provider']:<12}  {r['tokens_used']:>8}  "
              f"{r['iterations_used']:>6}  {r['compactions']:>8}  "
              f"{r['duration_s']:>6}")

    for r in results:
        print(f"\n=== {r['provider']} ===")
        print(r["summary"])


if __name__ == "__main__":
    asyncio.run(main())

The same code, three providers. Three sets of output, plus a comparison table that makes the delta visible: tokens used, iterations taken, compactions triggered, wall-clock time. The hostname is the same across all three (deterministic tool); the phrasing varies with model style; the iteration count varies with how the model chose to decompose the task. The harness doesn't change — the numbers tell you how each provider drives it.

This is the claim of the book, made operational. Your agent logic, your tools, your context strategy, your evals — all reusable across providers. A model deprecation (they will happen), a price change (they will happen), a capability gap in a specific vendor (they will happen) — none of these force a rewrite of the agent. They force a configuration change.
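The seam that makes this a configuration change can be sketched as a small structural protocol. This is an illustrative shape, not the book's exact signatures — the names `Provider`, `complete`, and `Completion` are assumptions here:

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Completion:
    """Normalized response every adapter returns (illustrative shape)."""
    text: str
    tool_calls: list = field(default_factory=list)
    tokens_used: int = 0


class Provider(Protocol):
    """The seam: the loop depends on this, never on a vendor SDK."""
    name: str

    async def complete(self, messages: list[dict], tools: list[dict]) -> Completion:
        ...


class DummyProvider:
    """Satisfies Provider structurally -- no inheritance, no vendor import."""
    name = "dummy"

    async def complete(self, messages, tools) -> Completion:
        return Completion(text="ok", tokens_used=1)
```

Because the protocol is structural, a new vendor is one adapter file that normalizes messages in and completions out; the loop never changes.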


22.2 What The Harness Does Not Do

Honest list. Every one of these is a place a real production deployment might extend, and every one is a small to medium project on top of what we built — not a rewrite.

No fine-tuning support. Toolformer showed that tool-use behavior is learnable; we assume a capable RLHF-trained frontier model and don't try to improve it. If your tools are novel enough that the model misuses them systematically, fine-tuning is an option worth knowing about.

No tree search / best-of-N. Tree of Thoughts, Self-Refine, Reflexion — all patterns where the harness generates multiple candidates and scores them. Useful for verifiable-answer tasks (code, math). Not in our harness; adding it is one chapter of work at the loop level.

No embedding-based retrieval. BM25 is our baseline. Swapping DocumentIndex for an embedding-backed version is a drop-in upgrade; we named the interface so it would be. When you cross paraphrase-heavy use cases, do it.
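What the drop-in looks like can be sketched in a few lines. The `search(query, k)` surface is the assumption here; the `embed` function is a toy deterministic stand-in (a term-frequency vector) where a production version would call an embedding model:

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: a term-frequency vector.
    A real version would call an embedding API here."""
    return dict(Counter(text.lower().split()))


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class EmbeddingIndex:
    """Same search(query, k) surface as a BM25 index; only scoring changes."""

    def __init__(self, docs: list[str]):
        self._docs = [(d, embed(d)) for d in docs]

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self._docs, key=lambda dv: cosine(q, dv[1]), reverse=True)
        return [d for d, _ in ranked[:k]]
```

Callers never see the change: same method, same return type, different ranking underneath. That is what "we named the interface so it would be" buys.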

No genuine Firecracker/gVisor sandbox. We defined the ToolSandbox interface (Chapter 14); we ship a subprocess-with-allowlist implementation. Production deployments hand this to E2B, Modal, or a self-hosted Firecracker setup.

No first-class voice or multimodal support. Text-only input and output. MCP has resource types for images; we didn't plumb them through. Add-on project.

No UI. CLI streaming works; there's no built-in TUI, web UI, or IDE extension. That's application-level work; the harness is a library.

No team deployment. Single-user assumed throughout. Multi-tenant deployments need per-user isolation, quota management, authentication — all of which the session_id threading already supports but which the book didn't formalize.

No learned routing. Chapter 20's ModelRouter is rules-based. Production routing often uses a learned classifier; the research on this is improving but not yet packaged. Worth watching.

Every one of these is a deliberate stop. The book's goal was a harness you understand end-to-end, not a harness that does everything.


22.3 The Danger List, Revisited

The failure-mode literature we surveyed at the start of this book catalogued twenty-eight distinct failure modes — the danger list reproduced in the book's Research Brief 4 (Failure Mode Catalog). Look back at the cross-reference table: every one is addressed in this harness.

The Chapter 2 five-break itinerary was the narrow version. The twenty-eight-entry failure catalog was the exhaustive one. Between them, they gave every design decision in this book a specific motivation — a place in a real production post-mortem where someone wishes they'd had this exact thing. If any of your design choices don't trace back to one of those entries, it's worth asking whether they're earning their place.


22.4 A Scorecard for the Next Framework

A framework will ship tomorrow that claims to supersede LangGraph or the Agents SDK or Claude Code. The vocabulary of this book is what lets you evaluate it honestly. A scorecard, in the form of questions I ask:

On the loop.

  • What triggers the loop to stop — a tool-call-absent response, a final tool, an iteration cap, something else?
  • Is the loop pluggable (can I insert a compaction step, an observability hook) or is it opaque?
  • Can I read the loop's source, and how long is it? Under 500 lines is a good sign.

On messages and transcripts.

  • Are messages typed or dicts?
  • Is the transcript a first-class object with its own accounting?
  • How does the framework handle provider differences in message shape? Adapters? Coupling?

On tools.

  • Are tool schemas inferred from types or hand-written?
  • Is there a registry with pre-dispatch validation?
  • Is loop detection built in or my problem?
  • How does the framework handle more than 20 tools?

On context.

  • Is there automatic compaction? What does it compact first — tool outputs, middle turns, everything? Is the policy configurable?
  • Is the context window tracked as a budgetable resource, or is it "the model decides"?
  • Is there a scratchpad or external state pattern built in?
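"Budgetable resource" has a concrete minimum: a classifier over context pressure, like the green/yellow/red states the snapshot hook in the Section 22.1 example reacts to. The thresholds here are illustrative, not the book's exact numbers:

```python
def budget_state(tokens_used: int, window: int,
                 yellow: float = 0.6, red: float = 0.8) -> str:
    """Classify context pressure; 'red' is the signal to compact.
    Thresholds are illustrative defaults, tunable per deployment."""
    frac = tokens_used / window
    if frac >= red:
        return "red"
    if frac >= yellow:
        return "yellow"
    return "green"
```

If a framework can't surface even this number, compaction policy is "the model decides" by default.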

On sub-agents.

  • Does the framework enforce that sub-agent results are compact summaries rather than full transcripts?
  • Is there a spawn budget and justification requirement, or can the parent spawn unbounded?
  • Can sub-agents spawn sub-agents? (If yes, has the framework noticed this is usually a mistake?)
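The spawn-budget question has a small concrete answer worth looking for. A sketch, with hypothetical names — the point is that spawning costs something and requires a stated reason:

```python
class SpawnBudget:
    """Caps sub-agent creation and requires a justification per spawn."""

    def __init__(self, max_spawns: int = 3):
        self.max_spawns = max_spawns
        self.spawned = 0

    def authorize(self, justification: str) -> bool:
        if not justification.strip():
            return False               # no free spawns: say why
        if self.spawned >= self.max_spawns:
            return False               # budget exhausted
        self.spawned += 1
        return True
```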

On permissions.

  • Is there a permission layer at all?
  • Is it policy-composable, or a single "allow/deny" list?
  • Does it handle trust-labeled outputs for indirect prompt injection?

On cost.

  • Are hard budgets enforced in-process, out-of-process, or only via alerts?
  • Is there built-in support for prompt caching?
  • Can I see per-agent cost attribution?

On observability.

  • Does the framework emit OpenTelemetry spans, proprietary traces, or just logs?
  • Can I correlate across sub-agents, tools, and LLM calls via standard IDs?

On durability.

  • Does the framework checkpoint at all?
  • Does it handle side-effecting tool idempotency, or is that my problem?
  • Can I resume across processes, or only within one?
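The idempotency question has a standard shape you can check for: derive a key from the tool call, and on resume replay the cached result instead of re-executing the side effect. A sketch under assumed names (`IdempotentExecutor`, an in-memory store standing in for a real checkpoint store):

```python
import hashlib
import json


class IdempotentExecutor:
    """Replays cached results for tool calls already executed before a crash."""

    def __init__(self):
        self._done: dict[str, object] = {}  # persisted in a real checkpoint store

    @staticmethod
    def key(tool: str, args: dict) -> str:
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, tool: str, args: dict, fn):
        k = self.key(tool, args)
        if k in self._done:
            return self._done[k]       # side effect already happened: replay
        result = fn(**args)
        self._done[k] = result
        return result
```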

On evaluation.

  • Is there a regression harness?
  • Can I run golden trajectories with structural checks (required tools, forbidden tools) and outcome checks?

A framework that scores well on most of these is worth adopting. A framework that scores poorly is a tool you're going to outgrow — and the book you just read is the outline of what you'll end up building yourself anyway.


22.5 What I Wish I'd Known When I Started

A short list, written as if to a reader about to start their own harness from scratch.

The model is the easy part. You'll spend 10% of your time on model choices and 90% on everything around them. This is a surprise if you came to the space as a prompt engineer. Adjust your budget.

Build the Provider abstraction on day one. The moment your loop imports a vendor SDK directly, migration costs compound. A one-file adapter is ten minutes of work and saves months.

Types beat dicts, especially for messages. The time you save by skipping type definitions you pay back the first time you ship to a new provider and a message silently takes the wrong shape.

Context is the real fight. You will rebuild your compactor three times. The first version is naive. The second version is complicated. The third version is principled. Skip to the third version when you can.

Tool design matters more than model choice. A mediocre model with well-designed tools beats a flagship model with sloppy tools. Spend time on tool descriptions, viewport reads, truncation envelopes.

Evals are non-optional. You will not know if a change made the harness better without them. Getting even a small suite in place pays for itself the first time someone says "I think this got worse."

Alerts are not enforcement. The $47K lesson. Budget caps run in their own path or they don't run at all.
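In-process enforcement can be as small as a check that raises before every provider call — a sketch with illustrative names, the point being that overspend blocks the request rather than paging someone after the fact:

```python
class BudgetExceeded(Exception):
    pass


class HardBudget:
    """Blocks the next LLM call once spend crosses the cap -- not an alert."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

    def check(self) -> None:
        """Call before every provider request; raises instead of warning."""
        if self.spent_usd >= self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f}"
            )
```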

Compaction can fail silently. If your compactor loses something the agent needed, the agent won't tell you; it'll just produce wrong answers. Instrument compaction events explicitly; treat each one as a data point.

Trust labels are necessary but not sufficient. You will not solve prompt injection with clever prompting. Defense in depth: permission layer, trust labels, network allowlists, behavioral monitoring.

Ship the boring version first. The fancy version of every feature — learned routing, tree search, semantic tool selection, hybrid retrieval — comes later. The non-fancy version of each, shipped and measured, is always the right first step.


22.6 Further Reading

Not exhaustive; curated.

If you want to go deeper on context. Anthropic's "Effective Context Engineering for AI Agents" (Sep 2025) is the canonical current treatment. The Chroma "Context Rot" study (2025) is the empirical basis for most of what Chapter 7 builds on.

If you want to go deeper on multi-agent. Anthropic's "How We Built Our Multi-Agent Research System" is the clearest production case study. The MAST paper (Cemri et al., 2025) is the systematic failure taxonomy.

If you want to go deeper on ACI / tool design. The SWE-agent paper (Yang et al., 2024) and the mini-SWE-agent code (100 lines) together teach the complete arc: custom tools help, then models catch up, then simpler tools suffice.

If you want to evaluate harnesses. Read the smolagents source (~1000 lines). Read mini-swe-agent. Skim Claude Code's public docs. You'll find the ideas in this book instantiated in each, with different tradeoffs. Recognizing those tradeoffs is the skill.

If you want to stay current. Anthropic's engineering blog. OpenAI's cookbook. The Modal and E2B blogs (for sandboxing). Simon Willison on prompt injection. Hamel Husain on evals.


22.7 The Last Commit

git add -A && git commit -m "ch22: multi-provider demonstration and closing"
git tag ch22-final

22.8 Try It Yourself

  1. Score your own harness. Take whatever agent system you currently work on — or plan to build — and run it through the scorecard from Section 22.4. Write down the gaps. Prioritize the top three.
  2. Write the missing chapter. Pick one of the "what the harness doesn't do" items from Section 22.2 and add it. Embedding-based retrieval, tree search, a fine-tuned tool model — pick one. How long does it take? Does it fit the interfaces the book established, or does it want to break them?
  3. Evaluate one more harness. Clone smolagents, mini-swe-agent, or the OpenAI Agents SDK. Read the core loop. Map it onto the vocabulary of this book. Where does it agree with the design choices we made? Where does it differ, and why?