Previously: evals measure correctness. Nothing in the harness caps spend. The $47K agent-loop incident (DEV Community, Nov 2025) was two agents ping-ponging requests for eleven days; alerts fired, no one stopped them. Alerts are not enforcement.
Three cost problems, addressed in rough order of impact.
Caching. The stable prefix of every request — system prompt, tool schemas, early history — is repeated on every turn. Without caching, every turn pays full input-token rates on the same content. Anthropic's explicit caching can reduce cache-read costs by an order of magnitude. OpenAI's implicit caching offers similar savings with less control.
Model routing. Not every turn needs the most capable model. A summarization pass, a classification step, a simple tool-calling turn — these can run on a cheaper model for one-tenth the cost without material quality loss. Production systems that measure before routing typically recover 40–60% of cost.
Hard budgets. Alerts say "you're over." Enforcement says "stop now." A per-session hard cap, enforced in a separate thread so a runaway loop can't avoid it, is the single most important cost-safety primitive. Chapter 5's RetryPolicy capped retry-on-transient-error spend per call; this chapter's BudgetEnforcer caps total session spend. Both defend against the same class of failure — unbounded iteration in a cost-per-call system — at different levels of the stack.
This chapter builds all three.
Anthropic's explicit cache_control markers are the powerful case: you tell Anthropic where to cache. Writing the cache costs 1.25× the base input rate (5-minute TTL) or 2× (1-hour TTL); subsequent calls that share the cached prefix read it at 0.1× input cost.
The catch: 1,024 tokens minimum per cache breakpoint. Below that, cache_control is silently ignored — one of the most-cited gotchas in Anthropic's prompt-caching docs.
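A cheap guard against that silent failure is to estimate the prefix size before attaching the marker. A minimal sketch, assuming the usual ~4-characters-per-token heuristic; the helper name is ours, not part of the harness or Anthropic's SDK:

```python
CACHE_MIN_TOKENS = 1024  # Anthropic's documented minimum per breakpoint

def warn_below_cache_minimum(system: str, tool_schemas_json: str) -> bool:
    """Pre-flight check using the rough ~4-characters-per-token heuristic.

    Returns True when the stable prefix is likely under the caching
    minimum, in which case cache_control would be silently ignored.
    """
    approx_tokens = (len(system) + len(tool_schemas_json)) // 4
    if approx_tokens < CACHE_MIN_TOKENS:
        print(f"[CACHE WARNING] prefix ~{approx_tokens} tokens, below the "
              f"{CACHE_MIN_TOKENS}-token minimum; cache_control will be ignored")
        return True
    return False
```

Run it once at provider construction; a warning here explains an otherwise invisible 10× cost difference in production.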
Our stable prefix — system prompt plus tool schemas — is often 1,500–3,000 tokens in this harness, comfortably over the minimum. We add cache_control when we send the request:
# src/harness/providers/anthropic.py (cache-aware addition)
def _to_anthropic_system(system: str | None, use_cache: bool) -> str | list[dict] | None:
    if system is None:
        return None
    if not use_cache:
        return system
    # Structured system with cache_control on the last block.
    return [{
        "type": "text",
        "text": system,
        "cache_control": {"type": "ephemeral"},  # default TTL: 5 min
    }]
class AnthropicProvider:
    def __init__(
        self,
        model: str = "claude-sonnet-4-6",
        client: Any | None = None,
        cache_enabled: bool = True,
    ) -> None:
        self.model = model
        self.cache_enabled = cache_enabled
        # ... rest

    async def astream(self, transcript, tools):
        kwargs: dict = {
            "model": self.model,
            "max_tokens": 4096,
            "messages": [_to_anthropic(m) for m in transcript.messages],
            "tools": _tools_with_cache(tools, self.cache_enabled),
            "system": _to_anthropic_system(transcript.system, self.cache_enabled),
        }
        # ... rest unchanged

def _tools_with_cache(tools: list[dict], enabled: bool) -> list[dict]:
    if not enabled or not tools:
        return tools
    # Mark the last tool with cache_control; this caches the full tools
    # array up to that point as a single breakpoint.
    result = list(tools)
    result[-1] = {**result[-1], "cache_control": {"type": "ephemeral"}}
    return result
One breakpoint on the tool schemas, one on the system prompt. The messages list itself — user turns, tool results — isn't cached because it changes every turn. What gets cached is the stable prefix.
Your first call writes the cache (slightly more expensive). Every subsequent call within 5 minutes that shares the same prefix reads from cache (10× cheaper). For an agent running 15 turns in a session, that's 14 cache reads vs 1 cache write — a net cost reduction around 80–90% on the prefix portion.
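The arithmetic behind that range, as a sketch. The 1.25× write and 0.1× read multipliers are Anthropic's published 5-minute-TTL factors; the 2,000-token prefix and Sonnet-tier base rate are assumptions for illustration:

```python
# Hypothetical numbers: a 2,000-token stable prefix, 15 turns in a session,
# Sonnet-tier base input rate of $3 per million tokens.
base_rate = 3.0 / 1_000_000
prefix_tokens = 2_000
turns = 15

uncached = turns * prefix_tokens * base_rate                # full price every turn
cached = (prefix_tokens * base_rate * 1.25                  # 1 cache write at 1.25x
          + (turns - 1) * prefix_tokens * base_rate * 0.1)  # 14 reads at 0.1x

savings = 1 - cached / uncached
print(f"uncached ${uncached:.4f} vs cached ${cached:.4f}: "
      f"{savings:.0%} saved on the prefix")  # ~82% with these numbers
```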
Budget for cache misses, though. In March 2026, Anthropic silently regressed the default cache TTL from 1 hour to 5 minutes (GitHub issue anthropics/claude-code#46829; see also byteiota's cache-TTL analysis of the incident). Sessions that expected cache hits between turns got cache misses every turn because the TTL had dropped below the inter-turn interval; Claude Code Max quotas were exhausted in 19 minutes instead of hours, with 17–32% cost inflation observed. The lesson: measure cache hit rate in production, don't assume it. OTel attribute anthropic.cache_read_tokens (if the SDK exposes it) is what you track.
Back-compat note on the constructor. This adds one parameter (cache_enabled=True) to AnthropicProvider.__init__ from §3.4. Existing AnthropicProvider() and AnthropicProvider(model=..., client=...) calls in every example from Ch 5 onward continue to work unchanged because the default is enabled — only callers who want to disable caching for testing or for a fair A/B comparison pass cache_enabled=False. No change needed in examples from earlier chapters.
Some turns are easier than others. A classifier turn — "what kind of question is this?" — doesn't need Opus. A summarization of a tool result — "compact this 50K-token output" — can run on Haiku. Routing reduces average cost per turn at the expense of one pre-call decision.
Three routing signals, in order of what pays off most.
Task type. Classification, extraction, simple lookup — route to the cheapest capable model. Code generation, multi-step reasoning — route to the flagship. The split is usually 60/40 by volume.
Input length. Very long contexts (>100K tokens) often require premium models because smaller ones lose fidelity. This is counterintuitive — you'd expect cheap models for cheap tasks — but model recall on long contexts is a capability gap, not a cost gap.
Uncertainty. A cheap model produces an answer with low confidence; route to a premium model for a second opinion. This is the "evaluator-optimizer" pattern from Self-Refine, repurposed for cost.
# src/harness/cost/router.py
from __future__ import annotations
from dataclasses import dataclass
from typing import Literal
from ..messages import Transcript
from ..providers.base import Provider
Tier = Literal["economy", "mid", "premium"]
@dataclass
class ModelRouter:
    economy: Provider
    mid: Provider
    premium: Provider

    def choose(
        self,
        transcript: Transcript,
        task_hint: str | None = None,
    ) -> Provider:
        """Pick a provider based on what the next turn is likely to need."""
        # Heuristic 1: long contexts go premium. Rough size proxy:
        # first text block of each message, ~4 characters per token.
        approx_tokens = sum(
            len(getattr(m.blocks[0], "text", "") or "")
            for m in transcript.messages if m.blocks
        ) // 4
        if approx_tokens > 50_000:
            return self.premium
        # Heuristic 2: task-type hints.
        if task_hint in ("classify", "extract", "summarize"):
            return self.economy
        if task_hint in ("code", "plan", "reason"):
            return self.premium
        # Default: mid-tier.
        return self.mid
This is a rules-based router — simple and explicit. Production routers get more sophisticated (learned classifiers, uncertainty estimation) but the principle is the same.
A router isn't itself a Provider; it chooses among providers. The loop calls router.choose(transcript).astream(...) instead of provider.astream(...).
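A runnable sketch of that call-site change, with stub providers standing in for the real adapters (the stub classes and their names are hypothetical):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StubProvider:
    name: str

    async def astream(self, transcript, tools):
        # Real adapters stream; the stub just reports who handled the turn.
        return f"[{self.name}] handled turn"

@dataclass
class StubRouter:
    economy: StubProvider
    premium: StubProvider

    def choose(self, transcript, task_hint=None):
        # Cut-down version of ModelRouter.choose for the demo.
        return self.economy if task_hint == "summarize" else self.premium

async def main():
    router = StubRouter(StubProvider("haiku"), StubProvider("opus"))
    # The only loop change: router.choose(...) replaces the fixed provider.
    provider = router.choose(transcript=[], task_hint="summarize")
    print(await provider.astream([], []))  # → [haiku] handled turn

asyncio.run(main())
```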
When routing hurts. Gitar's 2025 "We switched to a 5× cheaper LLM and our costs went up" flagged the trap: a cheap model that produces worse tool JSON may require more retries, take more turns, and end up costing more than the expensive model would have. The fix is measurement. Your evals should run with the router in place; if cost per passing case goes up, your routing is wrong.
Reasoning effort is a cheaper knob than model switching. Before escalating a hard task from Sonnet to Opus, try Sonnet with enable_thinking=True (Anthropic) or GPT-5 with reasoning_effort="high" (OpenAI Responses). Reasoning tokens are billed as output, so the cost increase is bounded by your thinking_budget_tokens or by the effort level — and you keep the cheaper model. Good routers consider effort before they consider tier. Concretely: if the task type is "reasoning" and the current provider supports a reasoning knob, turn the knob up one notch; escalate tier only when the top effort still fails. The ModelRouter here doesn't wire this in — it's a one-chapter sketch — but the adapter seam already gives you what you need: each provider's reasoning knob is a constructor argument, and choose() can return a different pre-configured instance.
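A sketch of that effort-before-tier ladder. The `LADDER` table and `escalate` helper are ours, not harness API; each rung would map to a pre-configured provider instance:

```python
from typing import Literal

Tier = Literal["mid", "premium"]
Effort = Literal["normal", "high"]

# Each rung would map to a pre-configured provider instance whose
# reasoning knob (enable_thinking / reasoning_effort) is a constructor arg.
LADDER: list[tuple[Tier, Effort]] = [
    ("mid", "normal"),
    ("mid", "high"),        # turn the reasoning knob up first ...
    ("premium", "normal"),  # ... escalate tier only after top effort fails
    ("premium", "high"),
]

def escalate(current: tuple[Tier, Effort]) -> tuple[Tier, Effort]:
    """Next rung on the ladder; stays put once at the top."""
    i = LADDER.index(current)
    return LADDER[min(i + 1, len(LADDER) - 1)]

print(escalate(("mid", "normal")))    # → ('mid', 'high')
print(escalate(("premium", "high")))  # → ('premium', 'high')
```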
The hardest of the three, and the one without which the other two are decoration. A runaway loop generates cost in the inner loop, not at turn boundaries; an enforcement check at the start of each turn doesn't stop the turn in progress.
The pattern that works: enforce in a separate thread. The main loop runs the agent. A watchdog tracks session cost; when the cap is reached, it cancels the main task.
# src/harness/cost/enforcer.py
from __future__ import annotations
import asyncio
from dataclasses import dataclass, field
class BudgetExceeded(Exception):
    pass


@dataclass
class BudgetEnforcer:
    max_usd: float  # hard session cap
    alert_thresholds: list[float] = field(default_factory=lambda: [0.5, 0.8])
    spent_usd: float = field(default=0.0, init=False)
    alerted: set[float] = field(default_factory=set, init=False)
    _cancelled_task: "asyncio.Task | None" = field(default=None, init=False)

    def attach_task(self, task: asyncio.Task) -> None:
        self._cancelled_task = task

    def record(self, input_tokens: int, output_tokens: int,
               model: str) -> None:
        cost = self._price(model, input_tokens, output_tokens)
        self.spent_usd += cost
        for t in self.alert_thresholds:
            if t not in self.alerted and self.spent_usd / self.max_usd >= t:
                self.alerted.add(t)
                print(f"[BUDGET WARNING] {self.spent_usd:.2f} / {self.max_usd:.2f} "
                      f"({t*100:.0f}% reached)")
        if self.spent_usd >= self.max_usd:
            self._halt()

    def _halt(self) -> None:
        if self._cancelled_task and not self._cancelled_task.done():
            print(f"[BUDGET HALT] {self.spent_usd:.2f} exceeds "
                  f"{self.max_usd:.2f}; cancelling session")
            self._cancelled_task.cancel()
        raise BudgetExceeded(
            f"session budget ${self.max_usd} exceeded: ${self.spent_usd:.2f}"
        )

    def _price(self, model: str, in_toks: int, out_toks: int) -> float:
        # April 2026 pricing. Move to a dated appendix or config file
        # so this doesn't rot inside the harness code.
        prices = {
            "claude-sonnet-4-6": (3.0, 15.0),
            "claude-opus-4-6": (5.0, 25.0),
            "claude-haiku": (0.8, 4.0),
            "gpt-5": (1.25, 10.0),
            "gpt-5.2": (1.75, 14.0),
            # Local/free providers — zero-rated so LocalProvider demos
            # from Chapters 7–17 don't log fictitious Opus-tier cost.
            "local": (0.0, 0.0),
            "stub": (0.0, 0.0),
        }
        # Unknown-model fallback: Opus-tier, deliberately. Over-reporting
        # cost for a provider we don't have rates for is safer than
        # silently under-reporting it. See the paragraph below.
        in_rate, out_rate = prices.get(model, (5.0, 25.0))
        return (in_toks * in_rate + out_toks * out_rate) / 1_000_000
The loop registers the running task with the enforcer at the start; every turn's ProviderResponse gets recorded. When cumulative cost crosses the cap, record() cancels the attached task and raises BudgetExceeded. Two mechanisms, not one, for a concrete reason.
Why both cancel() and raise. record() runs synchronously inside the loop's own stack — so raise BudgetExceeded(...) is what actually stops the current session: it propagates up through record() → the arun loop → the caller's await arun(...). The self._cancelled_task.cancel() call on the line above is belt-and-braces for a different scenario: when a parent coroutine is await asyncio.gather(...)-ing multiple agents (the parallel spawner from §17.4, for instance), the raise stops this session's stack, but sibling sessions running on other tasks in the gather would keep burning tokens until their own turn ended. cancel() propagates an asyncio.CancelledError to sibling awaits so they stop too. Callers wrap await arun(...) in try/except that catches both BudgetExceeded (the expected case — this session hit its cap) and asyncio.CancelledError (the sibling case — another session hit the cap first and took us down with it). §20.4's example does exactly that.
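The sibling scenario in miniature, with stub sessions instead of real agents (the session bodies are hypothetical; the two `except` arms mirror the two failure paths described above):

```python
import asyncio

class BudgetExceeded(Exception):
    pass

async def sibling():
    # Stand-in for another agent session in the same gather.
    try:
        await asyncio.sleep(10)  # would keep burning tokens ...
        return "sibling finished"
    except asyncio.CancelledError:
        return "sibling stopped: another session hit the cap"

async def capped(sibling_task: asyncio.Task):
    # Stand-in for the session whose record() crosses the cap.
    try:
        await asyncio.sleep(0.01)  # one turn of "work"
        sibling_task.cancel()      # what _halt() does for siblings
        raise BudgetExceeded("session budget exceeded")  # stops our own stack
    except BudgetExceeded as e:
        return f"stopped: {e}"

async def main():
    t = asyncio.ensure_future(sibling())
    results = await asyncio.gather(capped(t), t)
    print(results)

asyncio.run(main())
```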
On the unknown-model fallback. The prices.get(model, (5.0, 25.0)) default treats anything not in the table as Opus-tier. That's a deliberate safety choice — under-reporting cost for an unknown provider is worse than over-reporting it — but it means adapter names that aren't in the table (a user-added GroqProvider.name == "groq", say) log inflated costs until you add a row. If you see a bewildering bill on provider.name = "something", the first suspect is a missing row in prices, not a real overrun. "local" and "stub" are pre-seeded at zero specifically because the book's own LocalProvider-based examples (Ch 7, 8, 12, 17) would otherwise produce misleading cost numbers.
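The fallback is easy to check in isolation. A standalone sketch of the same `prices.get` logic, with a hypothetical unknown "groq" model name:

```python
# Same table shape as BudgetEnforcer._price, trimmed for the demo.
prices = {
    "claude-haiku": (0.8, 4.0),
    "local": (0.0, 0.0),
}

def price(model: str, in_toks: int, out_toks: int) -> float:
    # Unknown models fall back to Opus-tier rates: over-report, never under.
    in_rate, out_rate = prices.get(model, (5.0, 25.0))
    return (in_toks * in_rate + out_toks * out_rate) / 1_000_000

print(price("claude-haiku", 10_000, 1_000))  # known row: $0.012
print(price("groq", 10_000, 1_000))          # unknown: Opus-tier $0.075
print(price("local", 10_000, 1_000))         # zero-rated: $0.0
```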
Wiring:
# src/harness/agent.py (budget-aware, sketch)
async def arun(..., budget_enforcer: "BudgetEnforcer | None" = None) -> str:
    current = asyncio.current_task()
    if budget_enforcer and current:
        budget_enforcer.attach_task(current)
    # ... existing setup
    for _ in range(MAX_ITERATIONS):
        # ... turn execution
        response = await _one_turn(...)
        if budget_enforcer:
            budget_enforcer.record(
                input_tokens=response.input_tokens,
                output_tokens=response.output_tokens,
                model=provider.name,  # or a more specific identifier
            )
        # ... continue
One subtlety worth naming. record() fires synchronously after each turn. A turn that itself takes 10 seconds and produces 100K output tokens is already expensive before record() runs. Fine for most cases — the next turn gets halted. For pathological cases where one turn alone exceeds the budget, you'd add streaming enforcement that watches output tokens as they arrive. We don't build it here; the hard cap at turn boundaries catches the vast majority of real runaway patterns.
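If you ever do need mid-turn enforcement, the shape is roughly this: a self-contained sketch, not harness code (the `Chunk` type and field names are assumptions about what a provider stream yields):

```python
import asyncio
from dataclasses import dataclass

class BudgetExceeded(Exception):
    pass

@dataclass
class Chunk:
    output_tokens: int  # hypothetical per-chunk token count

async def stream_with_budget(stream, spent_usd: float, max_usd: float,
                             out_rate: float):
    """Halt mid-turn once projected spend crosses the cap.

    Sketch only: real provider events differ per SDK; here `stream`
    yields Chunk objects carrying an output-token count.
    """
    seen = 0
    async for chunk in stream:
        seen += chunk.output_tokens
        if spent_usd + seen * out_rate >= max_usd:
            raise BudgetExceeded(f"mid-turn halt after {seen} output tokens")
        yield chunk

async def fake_stream():
    for _ in range(100):  # a degenerate turn that never stops emitting
        yield Chunk(output_tokens=1_000)

async def main():
    got = []
    try:
        # $0.40 already spent, $0.50 cap, $15 per million output tokens.
        async for c in stream_with_budget(fake_stream(), 0.40, 0.50, 15.0 / 1e6):
            got.append(c)
    except BudgetExceeded as e:
        print(f"{e}; {len(got)} chunks streamed before the halt")

asyncio.run(main())
```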
# examples/ch20_cost_controlled.py
import asyncio
from harness.agent import arun
from harness.cost.enforcer import BudgetEnforcer, BudgetExceeded
from harness.cost.router import ModelRouter
from harness.providers.anthropic import AnthropicProvider
from harness.tools.selector import ToolCatalog
from harness.tools.std import STANDARD_TOOLS
async def main() -> None:
    # A router would typically select between Haiku, Sonnet, and Opus.
    # For simplicity, we use one provider here but demonstrate the enforcer.
    provider = AnthropicProvider(cache_enabled=True)
    catalog = ToolCatalog(tools=STANDARD_TOOLS)
    enforcer = BudgetEnforcer(max_usd=0.50)
    try:
        await arun(
            provider=provider,
            catalog=catalog,
            user_message="Investigate the machine: OS, CPU, memory, disk.",
            budget_enforcer=enforcer,
        )
    except BudgetExceeded as e:
        # This session tripped the cap; record() raised on our own stack.
        print(f"Session terminated: {e}")
    except asyncio.CancelledError:
        # A sibling session (in a gather-based parallel spawn) tripped
        # the cap; the enforcer cancelled us. See §20.3's "Why both
        # cancel() and raise" note.
        print(f"Session cancelled by budget enforcer at ${enforcer.spent_usd:.2f}")
    print(f"Total spent: ${enforcer.spent_usd:.4f}")

asyncio.run(main())
A 50-cent cap. On a typical run, the agent comes in well under. On a degenerate case where the agent enters a loop, the cap fires — you get a traceback with the spent amount, and no further spending happens.
With Chapter 18's per-agent attribution and this chapter's cost tracking, production dashboards can show spend per session, cache hit rate, routing mix, and budget halts. When _halt() fires, log it: a spike in halts means either your budget is too tight or your harness has a regression. None of this requires changes to the harness code beyond what we've built. It's all queries on the OTel traces plus the enforcer's logged events.
git add -A && git commit -m "ch20: caching, model routing, budget enforcement"
git tag ch20-cost
Chapter 19. Evals
Previously: observability — every operation in the harness emits a structured span, per-agent cost attribution works, dashboards show drift. Observability says what happened; it doesn't say whether what happened was right.
Chapter 21. Resumability and Durable State
Previously: cost control. A budget-enforced harness can't run away. What it still can't do is survive a crash. The machine reboots, the process is killed, the laptop lid closes — and the session is gone.