Previously: evals measure correctness. Nothing in the harness caps spend. The $47K agent-loop incident (DEV Community, Nov 2025) was two agents ping-ponging requests for eleven days; alerts fired, no one stopped them. Alerts are not enforcement.
Three cost problems, addressed in rough order of impact.
Caching. The stable prefix of every request — system prompt, tool schemas, early history — is repeated on every turn. Without caching, every turn pays full input-token rates on the same content. Anthropic's explicit caching can reduce cache-read costs by an order of magnitude. OpenAI's implicit caching offers similar savings with less control.
Model routing. Not every turn needs the most capable model. A summarization pass, a classification step, a simple tool-calling turn — these can run on a cheaper model for one-tenth the cost without material quality loss. Production systems that measure before routing typically recover 40–60% of cost.
Hard budgets. Alerts say "you're over." Enforcement says "stop now." A per-session hard cap, enforced in a separate thread so a runaway loop can't avoid it, is the single most important cost-safety primitive. Chapter 5's RetryPolicy capped retry-on-transient-error spend per call; this chapter's BudgetEnforcer caps total session spend. Both defend against the same class of failure — unbounded iteration in a cost-per-call system — at different levels of the stack.
This chapter builds all three.
Anthropic's explicit cache_control markers are the powerful case: you tell Anthropic where to cache. Writing the cache costs 1.25× the base input rate (5-minute TTL) or 2× (1-hour TTL); subsequent calls that share the cached prefix read it at 0.1× input cost.
The catch: 1,024 tokens minimum per cache breakpoint. Below that, cache_control is silently ignored — one of the most-cited gotchas in Anthropic's prompt-caching docs.
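A cheap guard against that silent failure is to estimate the prefix size before attaching the marker. A minimal sketch, assuming the usual ~4-characters-per-token heuristic; the helper name is ours, not part of the harness or Anthropic's SDK:

```python
CACHE_MIN_TOKENS = 1024  # Anthropic's documented minimum per breakpoint

def warn_below_cache_minimum(system: str, tool_schemas_json: str) -> bool:
    """Pre-flight check using the rough ~4-characters-per-token heuristic.

    Returns True when the stable prefix is likely under the caching
    minimum, in which case cache_control would be silently ignored.
    """
    approx_tokens = (len(system) + len(tool_schemas_json)) // 4
    if approx_tokens < CACHE_MIN_TOKENS:
        print(f"[CACHE WARNING] prefix ~{approx_tokens} tokens, below the "
              f"{CACHE_MIN_TOKENS}-token minimum; cache_control will be ignored")
        return True
    return False
```

Run it once at provider construction; a warning here explains an otherwise invisible 10× cost difference in production.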
Our stable prefix — system prompt plus tool schemas — is often 1,500–3,000 tokens in this harness, comfortably over the minimum. We add cache_control when we send the request:
# src/harness/providers/anthropic.py (cache-aware addition)
def _to_anthropic_system(system: str | None, use_cache: bool) -> str | list[dict] | None:
    if system is None:
        return None
    if not use_cache:
        return system
    # Structured system with cache_control on the last block.
    return [{
        "type": "text",
        "text": system,
        "cache_control": {"type": "ephemeral"},  # default TTL: 5 min
    }]
class AnthropicProvider:
    def __init__(
        self,
        model: str = "claude-sonnet-4-6",
        client: Any | None = None,
        cache_enabled: bool = True,
    ) -> None:
        self.model = model
        self.cache_enabled = cache_enabled
        # ... rest

    async def astream(self, transcript, tools):
        kwargs: dict = {
            "model": self.model,
            "max_tokens": 4096,
            "messages": [_to_anthropic(m) for m in transcript.messages],
            "tools": _tools_with_cache(tools, self.cache_enabled),
            "system": _to_anthropic_system(transcript.system, self.cache_enabled),
        }
        # ... rest unchanged

def _tools_with_cache(tools: list[dict], enabled: bool) -> list[dict]:
    if not enabled or not tools:
        return tools
    # Mark the last tool with cache_control; this caches the full tools
    # array up to that point as a single breakpoint.
    result = list(tools)
    result[-1] = {**result[-1], "cache_control": {"type": "ephemeral"}}
    return result
One breakpoint on the tool schemas, one on the system prompt. The messages list itself — user turns, tool results — isn't cached because it changes every turn. What gets cached is the stable prefix.
Your first call writes the cache (slightly more expensive). Every subsequent call within 5 minutes that shares the same prefix reads from cache (10× cheaper). For an agent running 15 turns in a session, that's 14 cache reads vs 1 cache write — a net cost reduction around 80–90% on the prefix portion.
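The arithmetic behind that range, as a sketch. The 1.25× write and 0.1× read multipliers are Anthropic's published 5-minute-TTL factors; the 2,000-token prefix and Sonnet-tier base rate are assumptions for illustration:

```python
# Hypothetical numbers: a 2,000-token stable prefix, 15 turns in a session,
# Sonnet-tier base input rate of $3 per million tokens.
base_rate = 3.0 / 1_000_000
prefix_tokens = 2_000
turns = 15

uncached = turns * prefix_tokens * base_rate                # full price every turn
cached = (prefix_tokens * base_rate * 1.25                  # 1 cache write at 1.25x
          + (turns - 1) * prefix_tokens * base_rate * 0.1)  # 14 reads at 0.1x

savings = 1 - cached / uncached
print(f"uncached ${uncached:.4f} vs cached ${cached:.4f}: "
      f"{savings:.0%} saved on the prefix")  # ~82% with these numbers
```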
Budget for cache misses, though. In March 2026, Anthropic silently regressed the default cache TTL from 1 hour to 5 minutes (GitHub issue anthropics/claude-code#46829; see also byteiota's cache-TTL analysis of the incident). Sessions that expected cache hits between turns got cache misses every turn because the TTL had dropped below the inter-turn interval; Claude Code Max quotas were exhausted in 19 minutes instead of hours, with 17–32% cost inflation observed. The lesson: measure cache hit rate in production, don't assume it. OTel attribute anthropic.cache_read_tokens (if the SDK exposes it) is what you track.
Back-compat note on the constructor. This adds one parameter (cache_enabled=True) to AnthropicProvider.__init__ from §3.4. Existing AnthropicProvider() and AnthropicProvider(model=..., client=...) calls in every example from Ch 5 onward continue to work unchanged because the default is enabled — only callers who want to disable caching for testing or for a fair A/B comparison pass cache_enabled=False. No change needed in examples from earlier chapters.
Some turns are easier than others. A classifier turn — "what kind of question is this?" — doesn't need Opus. A summarization of a tool result — "compact this 50K-token output" — can run on Haiku. Routing reduces average cost per turn at the expense of one pre-call decision.
Three routing signals, in order of what pays off most.
Task type. Classification, extraction, simple lookup — route to the cheapest capable model. Code generation, multi-step reasoning — route to the flagship. The split is usually 60/40 by volume.
Input length. Very long contexts (>100K tokens) often require premium models because smaller ones lose fidelity. This is counterintuitive — you'd expect cheap models for cheap tasks — but model recall on long contexts is a capability gap, not a cost gap.
Uncertainty. A cheap model produces an answer with low confidence; route to a premium model for a second opinion. This is the "evaluator-optimizer" pattern from Self-Refine, repurposed for cost.
# src/harness/cost/router.py
from __future__ import annotations
from dataclasses import dataclass
from typing import Literal
from ..messages import Transcript
from ..providers.base import Provider
Tier = Literal["economy", "mid", "premium"]
@dataclass
class ModelRouter:
    economy: Provider
    mid: Provider
    premium: Provider

    def choose(
        self,
        transcript: Transcript,
        task_hint: str | None = None,
    ) -> Provider:
        """Pick a provider based on what the next turn is likely to need."""
        # Heuristic 1: long contexts go premium. Rough size proxy:
        # first text block of each message, ~4 characters per token.
        approx_tokens = sum(
            len(getattr(m.blocks[0], "text", "") or "")
            for m in transcript.messages if m.blocks
        ) // 4
        if approx_tokens > 50_000:
            return self.premium
        # Heuristic 2: task-type hints.
        if task_hint in ("classify", "extract", "summarize"):
            return self.economy
        if task_hint in ("code", "plan", "reason"):
            return self.premium
        # Default: mid-tier.
        return self.mid
This is a rules-based router — simple and explicit. Production routers get more sophisticated (learned classifiers, uncertainty estimation) but the principle is the same.
A router isn't itself a Provider; it chooses among providers. The loop calls router.choose(transcript).astream(...) instead of provider.astream(...).
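A runnable sketch of that call-site change, with stub providers standing in for the real adapters (the stub classes and their names are hypothetical):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StubProvider:
    name: str

    async def astream(self, transcript, tools):
        # Real adapters stream; the stub just reports who handled the turn.
        return f"[{self.name}] handled turn"

@dataclass
class StubRouter:
    economy: StubProvider
    premium: StubProvider

    def choose(self, transcript, task_hint=None):
        # Cut-down version of ModelRouter.choose for the demo.
        return self.economy if task_hint == "summarize" else self.premium

async def main():
    router = StubRouter(StubProvider("haiku"), StubProvider("opus"))
    # The only loop change: router.choose(...) replaces the fixed provider.
    provider = router.choose(transcript=[], task_hint="summarize")
    print(await provider.astream([], []))  # → [haiku] handled turn

asyncio.run(main())
```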
When routing hurts. Gitar's 2025 "We switched to a 5× cheaper LLM and our costs went up" flagged the trap: a cheap model that produces worse tool JSON may require more retries, take more turns, and end up costing more than the expensive model would have. The fix is measurement. Your evals should run with the router in place; if cost per passing case goes up, your routing is wrong.
Reasoning effort is a cheaper knob than model switching. Before escalating a hard task from Sonnet to Opus, try Sonnet with enable_thinking=True (Anthropic) or GPT-5 with reasoning_effort="high" (OpenAI Responses). Reasoning tokens are billed as output, so the cost increase is bounded by your thinking_budget_tokens or by the effort level — and you keep the cheaper model. Good routers consider effort before they consider tier. Concretely: if the task type is "reasoning" and the current provider supports a reasoning knob, turn the knob up one notch; escalate tier only when the top effort still fails. The ModelRouter here doesn't wire this in — it's a one-chapter sketch — but the adapter seam already gives you what you need: each provider's reasoning knob is a constructor argument, and choose() can return a different pre-configured instance.
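A sketch of that effort-before-tier ladder. The `LADDER` table and `escalate` helper are ours, not harness API; each rung would map to a pre-configured provider instance:

```python
from typing import Literal

Tier = Literal["mid", "premium"]
Effort = Literal["normal", "high"]

# Each rung would map to a pre-configured provider instance whose
# reasoning knob (enable_thinking / reasoning_effort) is a constructor arg.
LADDER: list[tuple[Tier, Effort]] = [
    ("mid", "normal"),
    ("mid", "high"),        # turn the reasoning knob up first ...
    ("premium", "normal"),  # ... escalate tier only after top effort fails
    ("premium", "high"),
]

def escalate(current: tuple[Tier, Effort]) -> tuple[Tier, Effort]:
    """Next rung on the ladder; stays put once at the top."""
    i = LADDER.index(current)
    return LADDER[min(i + 1, len(LADDER) - 1)]

print(escalate(("mid", "normal")))    # → ('mid', 'high')
print(escalate(("premium", "high")))  # → ('premium', 'high')
```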
The hardest of the three, and the one without which the other two are decoration. A runaway loop generates cost in the inner loop, not at turn boundaries; an enforcement check at the start of each turn doesn't stop the turn in progress.
The pattern that works: enforce in a separate thread. The main loop runs the agent. A watchdog tracks session cost; when the cap is reached, it cancels the main task.
# src/harness/cost/enforcer.py
from __future__ import annotations
import asyncio
from dataclasses import dataclass, field
class BudgetExceeded(Exception):
    pass


@dataclass
class BudgetEnforcer:
    max_usd: float  # hard session cap
    alert_thresholds: list[float] = field(default_factory=lambda: [0.5, 0.8])
    spent_usd: float = field(default=0.0, init=False)
    alerted: set[float] = field(default_factory=set, init=False)
    _cancelled_task: "asyncio.Task | None" = field(default=None, init=False)

    def attach_task(self, task: asyncio.Task) -> None:
        self._cancelled_task = task

    def record(self, input_tokens: int, output_tokens: int,
               model: str) -> None:
        cost = self._price(model, input_tokens, output_tokens)
        self.spent_usd += cost
        for t in self.alert_thresholds:
            if t not in self.alerted and self.spent_usd / self.max_usd >= t:
                self.alerted.add(t)
                print(f"[BUDGET WARNING] {self.spent_usd:.2f} / {self.max_usd:.2f} "
                      f"({t*100:.0f}% reached)")
        if self.spent_usd >= self.max_usd:
            self._halt()

    def _halt(self) -> None:
        if self._cancelled_task and not self._cancelled_task.done():
            print(f"[BUDGET HALT] {self.spent_usd:.2f} exceeds "
                  f"{self.max_usd:.2f}; cancelling session")
            self._cancelled_task.cancel()
        raise BudgetExceeded(
            f"session budget ${self.max_usd} exceeded: ${self.spent_usd:.2f}"
        )

    def _price(self, model: str, in_toks: int, out_toks: int) -> float:
        # April 2026 pricing. Move to a dated appendix or config file
        # so this doesn't rot inside the harness code.
        prices = {
            "claude-sonnet-4-6": (3.0, 15.0),
            "claude-opus-4-6": (5.0, 25.0),
            "claude-haiku": (0.8, 4.0),
            "gpt-5": (1.25, 10.0),
            "gpt-5.2": (1.75, 14.0),
            # Local/free providers — zero-rated so LocalProvider demos
            # from Chapters 7–17 don't log fictitious Opus-tier cost.
            "local": (0.0, 0.0),
            "stub": (0.0, 0.0),
        }
        # Unknown-model fallback: Opus-tier, deliberately. Over-reporting
        # cost for a provider we don't have rates for is safer than
        # silently under-reporting it. See the paragraph below.
        in_rate, out_rate = prices.get(model, (5.0, 25.0))
        return (in_toks * in_rate + out_toks * out_rate) / 1_000_000
The loop registers the running task with the enforcer at the start; every turn's ProviderResponse gets recorded. When cumulative cost crosses the cap, record() cancels the attached task and raises BudgetExceeded. Two mechanisms, not one, for a concrete reason.
Why both cancel() and raise. record() runs synchronously inside the loop's own stack — so raise BudgetExceeded(...) is what actually stops the current session: it propagates up through record() → the arun loop → the caller's await arun(...). The self._cancelled_task.cancel() call on the line above is belt-and-braces for a different scenario: when a parent coroutine is await asyncio.gather(...)-ing multiple agents (the parallel spawner from §17.4, for instance), the raise stops this session's stack, but sibling sessions running on other tasks in the gather would keep burning tokens until their own turn ended. cancel() propagates an asyncio.CancelledError to sibling awaits so they stop too. Callers wrap await arun(...) in try/except that catches both BudgetExceeded (the expected case — this session hit its cap) and asyncio.CancelledError (the sibling case — another session hit the cap first and took us down with it). §20.4's example does exactly that.
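The sibling scenario in miniature, with stub sessions instead of real agents (the session bodies are hypothetical; the two `except` arms mirror the two failure paths described above):

```python
import asyncio

class BudgetExceeded(Exception):
    pass

async def sibling():
    # Stand-in for another agent session in the same gather.
    try:
        await asyncio.sleep(10)  # would keep burning tokens ...
        return "sibling finished"
    except asyncio.CancelledError:
        return "sibling stopped: another session hit the cap"

async def capped(sibling_task: asyncio.Task):
    # Stand-in for the session whose record() crosses the cap.
    try:
        await asyncio.sleep(0.01)  # one turn of "work"
        sibling_task.cancel()      # what _halt() does for siblings
        raise BudgetExceeded("session budget exceeded")  # stops our own stack
    except BudgetExceeded as e:
        return f"stopped: {e}"

async def main():
    t = asyncio.ensure_future(sibling())
    results = await asyncio.gather(capped(t), t)
    print(results)

asyncio.run(main())
```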
On the unknown-model fallback. The prices.get(model, (5.0, 25.0)) default treats anything not in the table as Opus-tier. That's a deliberate safety choice — under-reporting cost for an unknown provider is worse than over-reporting it — but it means adapter names that aren't in the table (a user-added GroqProvider.name == "groq", say) log inflated costs until you add a row. If you see a bewildering bill on provider.name = "something", the first suspect is a missing row in prices, not a real overrun. "local" and "stub" are pre-seeded at zero specifically because the book's own LocalProvider-based examples (Ch 7, 8, 12, 17) would otherwise produce misleading cost numbers.
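The fallback is easy to check in isolation. A standalone sketch of the same `prices.get` logic, with a hypothetical unknown "groq" model name:

```python
# Same table shape as BudgetEnforcer._price, trimmed for the demo.
prices = {
    "claude-haiku": (0.8, 4.0),
    "local": (0.0, 0.0),
}

def price(model: str, in_toks: int, out_toks: int) -> float:
    # Unknown models fall back to Opus-tier rates: over-report, never under.
    in_rate, out_rate = prices.get(model, (5.0, 25.0))
    return (in_toks * in_rate + out_toks * out_rate) / 1_000_000

print(price("claude-haiku", 10_000, 1_000))  # known row: $0.012
print(price("groq", 10_000, 1_000))          # unknown: Opus-tier $0.075
print(price("local", 10_000, 1_000))         # zero-rated: $0.0
```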
Wiring:
# src/harness/agent.py (budget-aware, sketch)
async def arun(..., budget_enforcer: "BudgetEnforcer | None" = None) -> str:
    current = asyncio.current_task()
    if budget_enforcer and current:
        budget_enforcer.attach_task(current)
    # ... existing setup
    for _ in range(MAX_ITERATIONS):
        # ... turn execution
        response = await _one_turn(...)
        if budget_enforcer:
            budget_enforcer.record(
                input_tokens=response.input_tokens,
                output_tokens=response.output_tokens,
                model=provider.name,  # or a more specific identifier
            )
        # ... continue
One subtlety worth naming. record() fires synchronously after each turn. A turn that itself takes 10 seconds and produces 100K output tokens is already expensive before record() runs. Fine for most cases — the next turn gets halted. For pathological cases where one turn alone exceeds the budget, you'd add streaming enforcement that watches output tokens as they arrive. We don't build it here; the hard cap at turn boundaries catches the vast majority of real runaway patterns.
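If you ever do need mid-turn enforcement, the shape is roughly this: a self-contained sketch, not harness code (the `Chunk` type and field names are assumptions about what a provider stream yields):

```python
import asyncio
from dataclasses import dataclass

class BudgetExceeded(Exception):
    pass

@dataclass
class Chunk:
    output_tokens: int  # hypothetical per-chunk token count

async def stream_with_budget(stream, spent_usd: float, max_usd: float,
                             out_rate: float):
    """Halt mid-turn once projected spend crosses the cap.

    Sketch only: real provider events differ per SDK; here `stream`
    yields Chunk objects carrying an output-token count.
    """
    seen = 0
    async for chunk in stream:
        seen += chunk.output_tokens
        if spent_usd + seen * out_rate >= max_usd:
            raise BudgetExceeded(f"mid-turn halt after {seen} output tokens")
        yield chunk

async def fake_stream():
    for _ in range(100):  # a degenerate turn that never stops emitting
        yield Chunk(output_tokens=1_000)

async def main():
    got = []
    try:
        # $0.40 already spent, $0.50 cap, $15 per million output tokens.
        async for c in stream_with_budget(fake_stream(), 0.40, 0.50, 15.0 / 1e6):
            got.append(c)
    except BudgetExceeded as e:
        print(f"{e}; {len(got)} chunks streamed before the halt")

asyncio.run(main())
```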
# examples/ch20_cost_controlled.py
import asyncio
from harness.agent import arun
from harness.cost.enforcer import BudgetEnforcer, BudgetExceeded
from harness.cost.router import ModelRouter
from harness.providers.anthropic import AnthropicProvider
from harness.tools.selector import ToolCatalog
from harness.tools.std import STANDARD_TOOLS
async def main() -> None:
    # A router would typically select between Haiku, Sonnet, and Opus.
    # For simplicity, we use one provider here but demonstrate the enforcer.
    provider = AnthropicProvider(cache_enabled=True)
    catalog = ToolCatalog(tools=STANDARD_TOOLS)
    enforcer = BudgetEnforcer(max_usd=0.50)
    try:
        await arun(
            provider=provider,
            catalog=catalog,
            user_message="Investigate the machine: OS, CPU, memory, disk.",
            budget_enforcer=enforcer,
        )
    except BudgetExceeded as e:
        # This session tripped the cap; record() raised on our own stack.
        print(f"Session terminated: {e}")
    except asyncio.CancelledError:
        # A sibling session (in a gather-based parallel spawn) tripped
        # the cap; the enforcer cancelled us. See §20.3's "Why both
        # cancel() and raise" note.
        print(f"Session cancelled by budget enforcer at ${enforcer.spent_usd:.2f}")
    print(f"Total spent: ${enforcer.spent_usd:.4f}")

asyncio.run(main())
A 50-cent cap. On a typical run, the agent comes in well under. On a degenerate case where the agent enters a loop, the cap fires — you get a traceback with the spent amount, and no further spending happens.
With Chapter 18's per-agent attribution and this chapter's cost tracking, production dashboards can show spend per session, cache hit rate, routing mix, and budget halts. When _halt() fires, log it: a spike in halts means either your budget is too tight or your harness has a regression. None of this requires changes to the harness code beyond what we've built. It's all queries on the OTel traces plus the enforcer's logged events.
git add -A && git commit -m "ch20: caching, model routing, budget enforcement"
git tag ch20-cost
Chapter 19. Evals
Previously: observability — every operation in the harness emits a structured span, per-agent cost attribution works, dashboards show drift. Observability says what happened; it doesn't say whether what happened was right.
Chapter 21. Resumability and Durable State
Previously: cost control. A budget-enforced harness can't run away. What it still can't do is survive a crash. The machine reboots, the process is killed, the laptop lid closes — and the session is gone.