Chapter 7. The Context Window Is a Resource

Previously: the registry validates arguments and detects loops. Four of the five breaks from Chapter 2 are now handled; Break 5 is the subject of the next five chapters. A tool that returns 200KB of JSON still poisons the loop on turn four.

The context window is the single most misunderstood resource in agent engineering. People treat it like disk space: fixed size, linear consumption, obvious when it's full. All three of those intuitions are wrong.

The size is not fixed in a useful sense — every provider quotes a headline number (200K, 1M) but model performance degrades continuously as you approach it. Consumption is not linear — tool results, retrieved documents, and prior turns have very different ratios of tokens-to-value. And it's not obvious when it's full — models don't gracefully degrade; they fail silently, get lost in the middle, invent facts to fill gaps they can't find evidence for.

This chapter turns the context window into something your harness can see. We build a ContextAccountant that tracks what's in the window, broken down by component, and exposes utilization thresholds that the next three chapters will use to drive compaction, scratchpad offloading, and retrieval.

We don't act on the accounting yet — that's Chapter 8. This chapter is strictly about measurement, because you can't decide how to react to context pressure until you can see it.

[Figure: context as a layered budget, not a bag of tokens. A stacked bar of components (system ~500 tokens, schemas ~2K, history ~12K, retrieved ~4K, headroom ~180K) with utilization bands: < 60% green, 60–80% yellow, > 80% red. Utilization thresholds drive every compaction decision in Chapters 8 through 11.]

7.1 What the Research Actually Says

Three findings anchor this chapter, and together they make the case that context deserves its own accounting layer rather than being treated as a generous pile you don't need to watch.

Chroma's 2025 "Context Rot" study tested 18 SOTA models on synthetic retrieval tasks — needle-in-a-haystack scenarios where the "needle" is a specific fact the model must find and use. Performance degraded continuously with input length, even when the input was 10% of the model's quoted context window. The degradation was model-specific but universal: no tested model was immune. Two separate mechanisms were at play: dilution of attention across more tokens, and interference from semantically-similar-but-irrelevant distractor content.

Liu et al.'s 2023 "Lost in the Middle: How Language Models Use Long Contexts" showed that retrieval accuracy follows a U-curve: content at the beginning and end of the context is retrieved at high accuracy, content in the middle significantly less so. This is not a bug in any single model — it's an artifact of how attention is trained — and it has stayed consistent across GPT, Claude, and open-source models through multiple generations.

Hsieh et al.'s 2024 "RULER: What's the Real Context Size of Your Long-Context Language Models?" formalized the gap between a model's claimed context window and its effective one. RULER tested models across thirteen task types at escalating context lengths — needle-in-a-haystack at varying depths, multi-hop tracing, aggregation, frequent-words extraction — and found that every model's effective length (the length at which it still performs comparably to short-context baselines) was substantially shorter than its nominal window, often by 4–8×. A model advertised at 128K might be meaningfully reliable only up to 32K; a 1M-token model might rot noticeably past 128K. The RULER numbers are the empirical backbone of the rule of thumb this chapter leans on.

Together these three findings imply a practical rule: a 200K context window is not a 200K budget. The effective budget — the amount you can fill before quality degrades — is typically 50–70% of the headline number for retrieval-heavy work, and the placement within that budget matters. Chapter 10 handles placement (put critical facts at the ends of the window). This chapter handles budgeting.
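The budgeting rule is simple enough to encode directly. A minimal sketch; the helper name and the 0.6 default are our assumptions, and the discount factor is exactly the kind of number Chapter 19's evals let you calibrate per model and workload:

```python
# Hypothetical helper for the 50-70% rule of thumb above. The discount
# factor is an assumption, not a measured constant; calibrate it with
# RULER-style evals for your own model and workload.
def effective_budget(nominal_window: int, discount: float = 0.6) -> int:
    """Tokens you can fill before quality measurably degrades."""
    return int(nominal_window * discount)

print(effective_budget(200_000))    # a 200K headline window
print(effective_budget(1_000_000))  # a 1M headline window
```

Treat the output as a planning number, not a hard limit: a 128K model with a RULER-measured effective length near 32K would want a discount closer to 0.25.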


7.2 What to Count

A context window is not a pile of tokens; it's a layered composition. Most production harnesses track at least five components:

System prompt. The instructions, persona, tool-use guidelines, and safety policies that run before any user input. Typically 500–3000 tokens, stable across a session.

Tool schemas. Every tool's schema — name, description, input schema — rendered into the prompt once, per provider convention. Our four tools cost perhaps 400 tokens. A 50-tool harness might spend 5000+ tokens here, which — remember the tool cliff — is not just a cost concern but a quality one.

Conversation history. The user messages, assistant messages, and tool results accumulated across the session. Grows monotonically in the naive loop.

Retrieved context. Any documents, search results, or scratchpad contents pulled in for the current turn. In Chapter 10 we'll make this dynamic; for now we count whatever's there.

Headroom. The room we need to leave for the model's own response: Anthropic's max_tokens parameter, OpenAI's max_completion_tokens. A minimum we subtract from the total.

Reasoning tokens, if preserved. When a ReasoningBlock (Chapter 3) survives in the transcript — which happens with Anthropic's extended thinking + tools combo, or when a consumer chose to preserve reasoning for auditability — those tokens count against history like any other block. The accountant's _count_block handles ReasoningBlock alongside the others, using the text body as its weight. If you turn extended thinking on, expect history to grow noticeably faster per turn; reasoning can easily be 5–10× the size of the final answer on hard tasks.

Total = sum of the above. Utilization = total / context window size. The critical thresholds, by rule of thumb:

  • ≤ 60%: green. No action needed.
  • 60–80%: yellow. Consider pruning, summarizing, or offloading soon.
  • > 80%: red. Compact now; you're in the rot zone.
  • > 95%: emergency. The next turn probably won't fit.

These numbers are defensible rules of thumb, not laws. You'll tune them for your workload; Chapter 19 gives you the evals that let you tune them empirically.
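The band logic is small enough to sketch standalone. A minimal illustration using the rule-of-thumb thresholds above; the function name is ours, not a harness API, and the emergency band here is prose-only in the accountant that follows:

```python
def context_state(used: int, usable: int,
                  yellow: float = 0.60, red: float = 0.80,
                  emergency: float = 0.95) -> str:
    """Classify utilization into the bands described above."""
    u = used / max(usable, 1)   # guard against a zero-size budget
    if u > emergency:
        return "emergency"      # the next turn probably won't fit
    if u >= red:
        return "red"            # compact now; you're in the rot zone
    if u >= yellow:
        return "yellow"         # plan to prune, summarize, or offload
    return "green"              # no action needed
```

With a 200K window and 4K headroom (196K usable), 140K of content lands in yellow and 170K in red, which matches the intuition that a "200K" window starts hurting well before it's full.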


7.3 Counting Tokens

Every provider has its own tokenizer. Counting on one and estimating for another is a recipe for mid-session surprises. Three approaches, each with tradeoffs.

Use the provider's official counter. Anthropic's count_tokens endpoint returns exact, billing-grade counts; for OpenAI models, the open-source tiktoken library provides the official tokenizers locally. Accurate, but the network round-trip to Anthropic's endpoint makes it unsuitable for per-message counting (latency adds up). Use it for calibration, not hot-path accounting.

Use a local approximation. tiktoken with the appropriate encoding (cl100k_base for GPT-4, o200k_base for GPT-4o and newer) gives you byte-exact counts for OpenAI models. For Anthropic and others, the closest local approximation is still tiktoken's cl100k_base, which is off by perhaps 5% on typical English text: usable as a budget proxy, not for billing.

Rely on the provider's response. Every ProviderResponse from Chapter 3 carries input_tokens and output_tokens. This is ground truth for the last turn but tells you nothing about what the next turn will cost, since you don't yet know what the model will produce.

Our accountant uses a combination: local tiktoken for estimates before a call, provider-reported counts after a call for reconciliation.
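One way to implement that reconciliation is to learn a correction factor from the provider's reported counts. A hypothetical sketch; the Calibrator name and shape are ours, not part of the harness:

```python
class Calibrator:
    """Tracks the ratio of provider-reported to locally estimated input
    tokens and applies the running average as a correction factor.
    A hypothetical sketch, not a harness API."""

    def __init__(self) -> None:
        self._ratios: list[float] = []

    def observe(self, estimated: int, reported: int) -> None:
        # Call after each provider response, with the accountant's
        # pre-call estimate and the provider's reported input_tokens.
        if estimated > 0:
            self._ratios.append(reported / estimated)

    def correct(self, estimate: int) -> int:
        # Nudge a fresh local estimate by the average observed ratio.
        if not self._ratios:
            return estimate
        factor = sum(self._ratios) / len(self._ratios)
        return int(estimate * factor)
```

If your tiktoken estimates run about 5% low against the provider's reported counts, the factor converges near 1.05 and future estimates get nudged up accordingly.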

uv add 'tiktoken>=0.8'

7.4 The Accountant

# src/harness/context/accountant.py
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Literal

import tiktoken

from ..messages import (
    Block, Message, ReasoningBlock, TextBlock, ToolCall, ToolResult, Transcript,
)


Component = Literal["system", "tools", "history", "retrieved", "headroom"]


@dataclass
class ContextBudget:
    window_size: int = 200_000
    headroom: int = 4096  # reserved for the model's response
    yellow_threshold: float = 0.60
    red_threshold: float = 0.80

    @property
    def usable(self) -> int:
        return self.window_size - self.headroom


@dataclass
class ContextSnapshot:
    totals: dict[Component, int] = field(default_factory=dict)
    budget: ContextBudget = field(default_factory=ContextBudget)

    @property
    def total_used(self) -> int:
        return sum(v for k, v in self.totals.items() if k != "headroom")

    @property
    def utilization(self) -> float:
        return self.total_used / max(self.budget.usable, 1)

    @property
    def state(self) -> Literal["green", "yellow", "red"]:
        u = self.utilization
        if u >= self.budget.red_threshold:
            return "red"
        if u >= self.budget.yellow_threshold:
            return "yellow"
        return "green"


class ContextAccountant:
    """Counts tokens per component across a transcript."""

    def __init__(self, encoding_name: str = "cl100k_base",
                 budget: ContextBudget | None = None) -> None:
        self._enc = tiktoken.get_encoding(encoding_name)
        self.budget = budget or ContextBudget()

    def snapshot(
        self,
        transcript: Transcript,
        tools: list[dict] | None = None,
        retrieved: list[str] | None = None,
    ) -> ContextSnapshot:
        totals: dict[Component, int] = {
            "system": self._count_text(transcript.system or ""),
            "tools": sum(self._count_text(json.dumps(t)) for t in (tools or [])),
            "history": sum(self._count_message(m) for m in transcript.messages),
            "retrieved": sum(self._count_text(r) for r in (retrieved or [])),
            "headroom": self.budget.headroom,
        }
        return ContextSnapshot(totals=totals, budget=self.budget)

    def _count_text(self, s: str) -> int:
        return len(self._enc.encode(s))

    def _count_message(self, m: Message) -> int:
        # message overhead is ~4 tokens per message in most providers' formats
        total = 4
        for block in m.blocks:
            total += self._count_block(block)
        return total

    def _count_block(self, block: Block) -> int:
        match block:
            case TextBlock(text=t):
                return self._count_text(t)
            case ToolCall(name=n, args=a):
                return self._count_text(n) + self._count_text(json.dumps(a)) + 6
            case ToolResult(content=c):
                return self._count_text(c) + 4
            case ReasoningBlock(text=t):
                # Only present when an adapter preserves reasoning on the
                # transcript (Anthropic thinking + tools, or explicit
                # consumer choice). Weight is the text body; the opaque
                # signature / encrypted_content in metadata is negligible.
                return self._count_text(t)
            case _:
                # Defensive fallthrough: new block types added later should
                # be undercounted (not crash the measurement component).
                return 0

The accountant is pure measurement. It doesn't mutate the transcript, it doesn't prune anything, it doesn't call the provider. It answers one question: given this transcript and these tools, how much of my usable window am I consuming, broken down by where it went?


7.5 Per-Turn Accounting in the Loop

The loop threads the accountant through every turn. After each provider response, we reconcile the estimated counts with the provider's reported counts (for calibration), and we expose the snapshot to any caller that wants to observe.

The merged version keeps the Chapter 5 interrupt-safety (partial_text + CancelledError handling) and adds the accountant + snapshot hook. Paste the whole thing — don't cherry-pick just the new lines:

# src/harness/agent.py (updated)
import asyncio
from typing import Callable

from .context.accountant import ContextAccountant, ContextSnapshot
from .providers.events import StreamEvent

# Provider, ToolRegistry, Transcript, Message, and MAX_ITERATIONS carry
# over from the Chapter 5 version of this file; their imports are elided.


async def arun(
    provider: Provider,
    registry: ToolRegistry,
    user_message: str,
    transcript: Transcript | None = None,
    system: str | None = None,
    on_event: Callable[[StreamEvent], None] | None = None,
    on_tool_call: Callable[[ToolCall], None] | None = None,
    on_tool_result: Callable[[ToolResult], None] | None = None,
    on_snapshot: Callable[[ContextSnapshot], None] | None = None,   # NEW
    accountant: ContextAccountant | None = None,                    # NEW
) -> str:
    if transcript is None:
        transcript = Transcript(system=system)
    transcript.append(Message.user_text(user_message))
    accountant = accountant or ContextAccountant()                  # NEW

    for _ in range(MAX_ITERATIONS):
        # NEW — measure before each turn.
        snapshot = accountant.snapshot(transcript, tools=registry.schemas())
        if on_snapshot is not None:
            on_snapshot(snapshot)
        if snapshot.state == "red":
            # Chapter 8 drops the compactor in here.
            # For now: observe only.
            pass

        # Unchanged from Ch 5: partial-text rescue + cancel handling around
        # _one_turn. Don't drop this when merging — it's how Ctrl-C still
        # captures streamed tokens into the transcript cleanly.
        partial_text: list[str] = []
        try:
            response = await _one_turn(
                provider, registry, transcript, partial_text, on_event,
            )
        except asyncio.CancelledError:
            if partial_text:
                transcript.append(Message.assistant_text(
                    "".join(partial_text) + " [interrupted]"
                ))
            raise

        if response.is_final:
            transcript.append(Message.from_assistant_response(response))
            return response.text or ""

        transcript.append(Message.from_assistant_response(response))
        for ref in response.tool_calls:
            result = registry.dispatch(ref.name, ref.args, ref.id)
            transcript.append(Message.tool_result(result))

    raise RuntimeError(f"agent did not finish in {MAX_ITERATIONS} iterations")

Three observations.

The lines marked # NEW are all this chapter adds. Everything else — the partial_text / CancelledError rescue, the from_assistant_response commit that preserves ReasoningBlock, the tool dispatch / result append — carries forward from Chapter 5 unchanged. If you're diffing your copy against §5.6, only the snapshot measurement, the on_snapshot callback, the accountant default, and the empty red-state branch should be new.

The red-state hook is empty on purpose. Chapter 8 drops in the compactor. Leaving the hook here now means Chapter 8's patch is about three lines of code.

on_snapshot callback per turn. This is how you'd wire a CLI or TUI to display live context usage ("67% / yellow"). In production harnesses, the same hook feeds your observability pipeline (Chapter 18).


7.6 Making It Visible

If you can't see your context filling up, you can't reason about when to compact. A small text visualizer turns the abstract percentages into something you glance at:

# examples/ch07_context_usage.py
import asyncio

from harness.agent import arun
from harness.context.accountant import ContextAccountant
from harness.providers.anthropic import AnthropicProvider
from harness.tools.registry import ToolRegistry
from harness.tools.std import bash, calc, read_file


def display(snap) -> None:
    bar_width = 40
    u = snap.utilization
    filled = int(u * bar_width)
    empty = bar_width - filled
    bar = "█" * filled + "░" * empty
    state_color = {"green": "\033[92m", "yellow": "\033[93m", "red": "\033[91m"}
    color = state_color[snap.state]
    reset = "\033[0m"

    print(f"\n{color}[{bar}] {u*100:.0f}% ({snap.state}){reset}")
    for k, v in snap.totals.items():
        if k == "headroom":
            continue
        print(f"  {k:10s} {v:>8,d}")
    print(f"  {'usable':10s} {snap.budget.usable:>8,d}")


async def main():
    provider = AnthropicProvider()
    registry = ToolRegistry(tools=[calc, read_file, bash])
    accountant = ContextAccountant()

    await arun(
        provider=provider,
        registry=registry,
        user_message=(
            "Read the file /etc/hostname, the file /etc/os-release, "
            "the file /proc/cpuinfo, and summarize the machine."
        ),
        on_snapshot=display,
        accountant=accountant,
    )


asyncio.run(main())

Run it. You'll see the context usage grow turn-by-turn. The jumps come from tool outputs — /proc/cpuinfo on a typical machine is ~20KB ≈ 5000 tokens, one tool result that shifts your utilization several percentage points. Do this with a prompt that reads three large files and you'll watch the bar walk toward yellow in real time. That's the point. What was invisible is now something you watch.


7.7 Observations Worth Keeping

Three patterns show up reliably once you start watching the accountant.

Tool results dominate. In most agentic workloads, by turn ten, tool results are 70–90% of the transcript. System prompts and tool schemas are rounding errors; the history is mostly what the tools returned. That's why Chapter 11 is devoted to tool output design — smaller, structured outputs are the single highest-leverage context intervention.

User messages are tiny. The human at the other end writes a paragraph per turn, maybe. The model reads kilobytes. This asymmetry is one reason why the naive "just make the context window bigger" intuition fails: the user isn't the one filling it.

Assistant reasoning is the third bulge. When you run an agent that thinks out loud, with extended thinking or ReAct-style verbose reasoning, the assistant's own text can approach the size of tool results. The decision to keep reasoning in the transcript (useful for debugging and auditability, expensive for context) is one you make consciously once you're accounting.


7.8 What About Cache Discounts?

Both Anthropic and OpenAI support prompt caching: a long, stable prefix (system prompt + tool schemas) can be marked for caching, and subsequent calls that share that prefix are billed at a steep discount on the cached portion — roughly 10% of the input rate for Anthropic's cache reads, roughly half for OpenAI's implicit caching.

A cached prefix takes the same space in the context window — caching is a billing optimization, not a window optimization. Our accountant counts the raw tokens regardless of cache state. If you want to track cache-effective cost separately, Chapter 20 introduces a CostAccountant that pairs with this one.
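To make the billing-versus-window distinction concrete, here is a sketch of cache-adjusted input cost. The function name, the $/MTok rate in the example, and the 10% read discount are illustrative assumptions; check your provider's current pricing:

```python
def input_cost_usd(cached_tokens: int, fresh_tokens: int,
                   rate_per_mtok: float,
                   cache_read_discount: float = 0.10) -> float:
    """Dollar cost of one call's input. Cached tokens are billed at a
    fraction of the rate but occupy the same window space as fresh ones."""
    billable = fresh_tokens + cached_tokens * cache_read_discount
    return billable * rate_per_mtok / 1_000_000

# A 100K cached prefix plus 10K fresh tokens at a hypothetical $3/MTok:
# window usage is 110K tokens either way; only the bill shrinks.
print(round(input_cost_usd(100_000, 10_000, 3.0), 4))
```

The accountant in this chapter deliberately ignores the discount: 110K tokens of window pressure is 110K regardless of what you paid for it.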


7.9 Commit

git add -A && git commit -m "ch07: ContextAccountant and per-component token accounting"
git tag ch07-accounting

7.10 Try It Yourself

  1. Measure the naive loop. Run the Chapter 2 calculator example through the accountant. How much of the budget did a simple arithmetic task consume? Compare to a prompt that reads three medium-sized files. Where did the budget go?
  2. Calibrate against ground truth. After each provider call, compare the accountant's estimate to response.input_tokens. How far off is the local tiktoken estimate for your primary provider? Write a small report and keep it — you'll want it in Chapter 20 when cost accounting gets serious.
  3. Find your red line. Design a prompt that forces the agent to pull in enough context to push past 80% utilization. Run it. Does the model's behavior change as utilization climbs through yellow into red? You now have a bench test for Chapter 8's compaction.