Previously: we set up a repo skeleton and agreed on vocabulary. Model is a function; agent is a loop; harness is the engineering around the loop. No code yet.
A while loop is the smallest thing that separates a chat interface from an agent, and it's what turns one API call into many — the model reads the output of its previous turn and decides what to do next, a pattern Yao et al. formalized in 2022 as ReAct (reason, act, observe) and that nearly every modern LLM agent is some variation on. That decision, made once per iteration, is the whole point: a model that cannot observe its own last action cannot debug, recover, or finish, while a model that can — even badly — is the start of an agent.
We are going to write that loop now. Forty lines in one file, no frameworks, no abstractions we haven't earned — and then we are going to break it in five specific ways, on purpose, and watch each break ripple through the design. Those five breaks will become the itinerary for the rest of the book, and most of the engineering in subsequent chapters is traceable back to one of them.
By the end of this chapter, your harness can answer a question by calling a calculator tool in a loop, against a mock provider rather than a real API. The mock provider is not a placeholder we'll throw away; it is the first piece of real architecture we lay down, the seam that makes your harness provider-agnostic from day one, and the reason every subsequent chapter can add capability without ever hard-coding a vendor's SDK into the core.
Three decisions happen on every iteration of the loop. They map cleanly onto the think-act-observe cycle of ReAct and onto the Planning → Tools → Memory → Action decomposition that Lilian Weng's widely-read 2023 post "LLM Powered Autonomous Agents" offered as a reference model for the field:

1. Call the model with the current transcript (the "think" step).
2. Interpret the response: a tool call means run the tool and append the result to the transcript (act, then observe); text means return it as the final answer.
3. Decide whether to continue: check the loop's bound before going around again.
That is the whole shape of the thing. Everything else in this book is accretion on top of those three decisions: compaction, sub-agents, streaming, evals — they all live inside, around, or between steps 1 and 2, while step 3 is where the cost-runaway failure modes get caught and bounded.
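Stripped to a skeleton, the three decisions look like this. The names here are placeholders for illustration only, not the harness API we build in this chapter:

```python
def run(provider, tools, user_message):
    transcript = [{"role": "user", "content": user_message}]
    for _ in range(20):                              # decision 3: bound the loop
        response = provider(transcript)              # decision 1: call the model
        if response["kind"] == "text":               # decision 2a: answer, stop
            return response["text"]
        result = tools[response["tool"]](**response["args"])    # decision 2b: act
        transcript.append({"role": "user", "content": result})  # ...and observe
    raise RuntimeError("did not finish")

# A scripted stand-in for a model: one tool request, then a final answer.
script = iter([
    {"kind": "tool_call", "tool": "echo", "args": {"s": "hi"}},
    {"kind": "text", "text": "done"},
])
print(run(lambda transcript: next(script), {"echo": lambda s: s}, "say hi"))
# -> done
```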
Two subtle points are worth naming before we write any code.
The transcript is the state. The loop has no other memory of what happened turn to turn; if a fact needs to persist across turns, it either lives in the transcript (and costs tokens forever) or it doesn't survive at all. Later chapters introduce external state — scratchpads, checkpointers, retrieval — but every one of them exists precisely because the transcript is too narrow a container for durable memory.
The provider is a dependency, not the protagonist. The loop doesn't care whether the response came from Anthropic, OpenAI, a locally-hosted Llama, or a mock; it cares only that something returns a response in a shape it can interpret. Designing that shape is the work of Chapter 3, and for this chapter a mock is all we need — strictly better than a real API for our purposes here, because it runs offline, deterministically, and costs nothing.
Most tutorials start by calling anthropic.Anthropic() or OpenAI() directly in the loop — the right thing to do when you're exploring, and exactly the wrong thing when you're building something you expect to last. The moment a vendor SDK is imported from your core loop, you have taken on the vendor's quirks as part of your design: response envelope shape, streaming protocol, token-counting method, error taxonomy, all of it. Refactoring later means touching every file that ever touched the loop, and by then there are usually many.
Instead, we'll define a Provider protocol — a small, stable interface — and write a mock implementation of it. Chapter 3 writes real Anthropic and OpenAI adapters to the same protocol, and every subsequent chapter depends only on the protocol, never on a specific vendor's API surface.
```python
# src/harness/providers/base.py
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ProviderResponse:
    """What a provider gives us back: either text, or a tool call."""
    kind: str  # "text" or "tool_call"
    text: str | None = None
    tool_name: str | None = None
    tool_args: dict | None = None
    tool_call_id: str | None = None


class Provider(Protocol):
    def complete(self, transcript: list[dict], tools: list[dict]) -> ProviderResponse:
        """Given a transcript and available tools, produce one response."""
        ...
```
Two notes on this protocol are worth pausing on before we use it.
The transcript and tools are plain list[dict] for now, which is a deliberate simplification; Chapter 3 promotes both of them to typed dataclasses with proper block structure and a Transcript wrapper. Using dicts here keeps the mock trivially easy to write, and the protocol stays small enough to read in about five seconds — an honest test of whether an abstraction is paying its way.
ProviderResponse collapses two cases into one shape. Real provider responses are richer than this — they carry token counts, finish reasons, multiple content blocks, streaming chunks, reasoning traces — but none of that matters for the loop at this stage. The loop wants to know one thing: did the model ask me to call a tool, or did it give me an answer? Everything else is someone else's problem until we need it, and dragging it in now would be premature.
Now the mock. It implements a tiny scripted scenario: asked about 2 + 2, it calls a calculator tool, reads the result, and produces the answer.
```python
# src/harness/providers/mock.py
from __future__ import annotations

from .base import Provider, ProviderResponse


class MockProvider(Provider):
    """A scripted provider for teaching and testing.

    Walks through a fixed list of responses, one per call.
    """

    def __init__(self, responses: list[ProviderResponse]) -> None:
        self._responses = list(responses)
        self._index = 0

    def complete(self, transcript: list[dict], tools: list[dict]) -> ProviderResponse:
        if self._index >= len(self._responses):
            raise RuntimeError("mock ran out of responses")
        response = self._responses[self._index]
        self._index += 1
        return response
```
A note on the MockProvider(Provider) line. In Python, a Protocol is satisfied structurally — any class with matching methods counts, no inheritance required. So why inherit? Two reasons. It documents intent: a reader sees "this class implements the Provider contract" without having to diff method signatures. And it turns a silent mismatch into a type-checker error at class definition time: forget to add complete, or change its signature, and mypy/pyright flag the class instead of letting the bug surface later in the loop. The real-provider adapters in Chapter 3 do the same thing.
That's the whole provider abstraction for this chapter. Thirty lines of code and we have a seam we will keep for the entire book.
Here is the naive loop. I am calling it naive because it's about to break in five ways we already know about. It's still a useful starting point — everything it doesn't do will be motivated by a specific failure.
```python
# src/harness/agent.py
from __future__ import annotations

from typing import Callable

from .providers.base import Provider, ProviderResponse

MAX_ITERATIONS = 20


def run(
    provider: Provider,
    tools: dict[str, Callable[..., str]],
    tool_schemas: list[dict],
    user_message: str,
) -> str:
    transcript: list[dict] = [{"role": "user", "content": user_message}]
    for _ in range(MAX_ITERATIONS):
        response = provider.complete(transcript, tool_schemas)
        if response.kind == "text":
            transcript.append({"role": "assistant", "content": response.text})
            return response.text or ""
        if response.kind == "tool_call":
            if response.tool_name is None:
                raise RuntimeError("tool_call response is missing tool_name")
            if response.tool_name not in tools:
                raise RuntimeError(f"unknown tool: {response.tool_name!r}")
            tool_fn = tools[response.tool_name]
            result = tool_fn(**(response.tool_args or {}))
            transcript.append({
                "role": "assistant",
                "content": [{"type": "tool_use", "name": response.tool_name,
                             "id": response.tool_call_id, "input": response.tool_args}],
            })
            transcript.append({
                "role": "user",
                "content": [{"type": "tool_result", "tool_use_id": response.tool_call_id,
                             "content": result}],
            })
            continue
        raise RuntimeError(f"unexpected response kind: {response.kind!r}")
    raise RuntimeError(f"agent did not finish in {MAX_ITERATIONS} iterations")
```
Notice the three explicit guard clauses inside the tool-call branch: tool_name is None, tool_name not in tools, and the final fall-through raise on an unexpected response.kind. The comment-based assumption # response.kind == "tool_call" that a lot of tutorial code relies on is, in practice, a silent None-deref or an opaque KeyError waiting for its turn. The rule for this book is simple: if the type system doesn't narrow the case for you, narrow it yourself and raise a descriptive error. Later chapters replace these raises with structured ToolResults that the model can read and recover from, but even then every branch is still enumerated — defensive enumeration of cases is the engineering discipline the harness rests on.
Read that loop twice. Everything after it in the book is about one of the implicit choices you can see right here:
- `dict[str, Callable]` — no schema check, no side-effect declaration, no permission gate.
- `**response.tool_args` — we trust the model's arguments implicitly.
- `result` is a string. If the tool returned 50,000 characters of JSON, it goes straight into the transcript.
- `MAX_ITERATIONS = 20` is the only thing standing between this loop and an unbounded cost runaway.

That is the point — we are not going to pretend to be surprised when it breaks.
Let's make it work before we make it fail. A calculator tool, a mock scenario.
```python
# examples/ch02_calculator.py
from harness.agent import run
from harness.providers.base import ProviderResponse
from harness.providers.mock import MockProvider


def calc(expression: str) -> str:
    # dangerous in real life; fine for a mock
    return str(eval(expression, {"__builtins__": {}}, {}))


mock = MockProvider([
    ProviderResponse(
        kind="tool_call",
        tool_name="calc",
        tool_args={"expression": "2 + 2"},
        tool_call_id="call-1",
    ),
    ProviderResponse(kind="text", text="2 + 2 is 4."),
])

tool_schemas = [{
    "name": "calc",
    "description": "Evaluate a Python arithmetic expression.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

answer = run(
    provider=mock,
    tools={"calc": calc},
    tool_schemas=tool_schemas,
    user_message="What is 2 + 2?",
)
print(answer)  # -> "2 + 2 is 4."
```
Run it:

```shell
uv run examples/ch02_calculator.py
```
You should see 2 + 2 is 4. printed. Two turns — the model asks for the calculator, we run it, the model reads the result, the model produces a final answer — and that is an agent. A small, contrived, brittle one, but structurally the real thing; every harness in the rest of the book, and every production harness in the wild, is a variation on this same two-turn pattern with progressively more engineering layered between the asks.
Commit:

```shell
git add -A && git commit -m "ch02: minimum viable loop with mock provider"
git tag ch02-minimum-loop
```
Now the pedagogically useful part. We are going to feed the loop five specific failure scenarios, one at a time, and watch each one reveal a missing piece of engineering the naive version quietly assumed would not matter.
Change the mock's first response to call a tool we didn't register:
```python
ProviderResponse(
    kind="tool_call",
    tool_name="calculator",  # not "calc"
    tool_args={"expression": "2 + 2"},
    tool_call_id="call-1",
),
```
Run it and you get RuntimeError: unknown tool: 'calculator'. The guard clause we added catches the unrecognized name and the loop crashes deliberately, rather than stumbling into a silent KeyError deep inside the tool lookup — which is strictly better, because the error now names the actual problem. The model, though, has no chance to recover: the exception unwinds the whole loop, and a misnamed tool in turn three kills a session that had nine turns of useful work behind it.
What's missing. A dispatch layer that catches "unknown tool" and returns a structured error to the model as a tool result, so the model gets one more chance to call the right tool. Chapter 4 introduces the ToolRegistry that does this.
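The shape of that fix can be sketched now. This is a hypothetical dispatch helper, not the Chapter 4 ToolRegistry itself; the message format follows the transcript shapes used in the loop above:

```python
from typing import Callable


def dispatch(tools: dict[str, Callable[..., str]], name: str,
             args: dict, call_id: str) -> dict:
    """Run a tool, or return a structured error the model can read and retry."""
    if name not in tools:
        content, is_error = (
            f"unknown tool {name!r}; available tools: {sorted(tools)}", True)
    else:
        content, is_error = tools[name](**args), False
    return {"role": "user", "content": [{
        "type": "tool_result", "tool_use_id": call_id,
        "content": content, "is_error": is_error}]}


msg = dispatch({"calc": lambda expression: "4"}, "calculator",
               {"expression": "2 + 2"}, "call-1")
print(msg["content"][0]["content"])
# -> unknown tool 'calculator'; available tools: ['calc']
```

The key design move is that the error travels inside the transcript, as a tool result the model sees on its next turn, instead of unwinding the loop as an exception.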
Next, keep the tool name correct but misname its argument:

```python
ProviderResponse(
    kind="tool_call",
    tool_name="calc",
    tool_args={"expr": "2 + 2"},  # wrong key name
    tool_call_id="call-1",
),
```
You get TypeError: calc() got an unexpected keyword argument 'expr', and the loop dies. Same class of failure as Break 1, but one level deeper: a model that misnamed a parameter never gets the chance to see what it did wrong, because the exception unwinds the loop before the next turn can happen.
What's missing. Schema validation before dispatch. The loop should notice the args don't match, return a validation error to the model, and give it a chance to correct. Chapter 6 builds this.
Third, make the tool itself raise. The calculator is unchanged:

```python
def calc(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}, {}))

# And the mock asks for:
ProviderResponse(
    kind="tool_call",
    tool_name="calc",
    tool_args={"expression": "1 / 0"},  # guaranteed ZeroDivisionError
    tool_call_id="call-1",
),
```
ZeroDivisionError, and again the loop unwinds. That alone would be fine to handle locally, but consider the subtler versions that show up in any non-trivial system: a network tool that times out, a file read that hits a permission error, a shell command that returns exit code 1, a remote API that returns a 503. All of these are expected failures in a harness that does anything interesting, and the loop currently has no place to put them other than "crash the session."
What's missing. A tool-dispatch wrapper that converts tool exceptions into structured tool-result errors, visible to the model. Chapter 6 again.
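The wrapper can be sketched in a few lines — a hypothetical safe_call helper; Chapter 6's version is richer about which exceptions to catch and how to format them:

```python
def safe_call(fn, args: dict, call_id: str) -> dict:
    """Run a tool; convert any exception into an error-flagged tool result."""
    try:
        content, is_error = fn(**args), False
    except Exception as exc:  # timeouts, permission errors, 503s, bad input...
        content, is_error = f"{type(exc).__name__}: {exc}", True
    return {"type": "tool_result", "tool_use_id": call_id,
            "content": content, "is_error": is_error}


def calc(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}, {}))


result = safe_call(calc, {"expression": "1 / 0"}, "call-1")
print(result["is_error"], result["content"])
# -> True ZeroDivisionError: division by zero
```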
Fourth, script a model that never stops calling the tool:

```python
mock = MockProvider([
    ProviderResponse(kind="tool_call", tool_name="calc",
                     tool_args={"expression": "1"}, tool_call_id=f"call-{i}")
    for i in range(100)
])
```
The model keeps calling the tool, iteration after iteration, and MAX_ITERATIONS = 20 catches it — but we raise a RuntimeError that discards the partial transcript along with everything useful for debugging. The number itself is arbitrary, too: twenty is too low for a real task and too high for a true runaway. We need a principled bound, one rooted in cost rather than iteration count, and we need to preserve the transcript when the bound fires so a human can figure out what went wrong.
What's missing. Three things. A token budget that triggers termination based on cost, not iteration count (Chapter 20). An observability layer that preserves the transcript for debugging when we do terminate (Chapter 18). And — looking ahead — a dedup check that notices the model is calling the same tool with the same args over and over, and halts before the budget fires (Chapter 6).
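The dedup idea fits in a few lines — a hypothetical is_stuck helper; the real check in Chapter 6 is smarter about near-duplicate calls:

```python
import json


def is_stuck(history: list, name: str, args: dict, limit: int = 3) -> bool:
    """True once the same tool+args combination repeats `limit` times in a row."""
    key = (name, json.dumps(args, sort_keys=True))  # args dict made hashable
    history.append(key)
    tail = history[-limit:]
    return len(tail) == limit and len(set(tail)) == 1


history: list = []
calls = [("calc", {"expression": "1"})] * 5
flags = [is_stuck(history, name, args) for name, args in calls]
print(flags)  # -> [False, False, True, True, True]
```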
Finally, make the tool return far more than the transcript can absorb:

```python
def calc(expression: str) -> str:
    # imagine a "read_file" tool that reads 60,000 tokens of JSON
    return "X" * 200_000  # 200KB of X
```
The loop still works — technically — but the transcript now has 200KB of X in it, and the next turn sends that whole transcript back to the provider. By turn five we're well past the context window, the provider returns an error, and the session crashes in a way that looks mysterious only because nobody was tracking the cost of what the tool was returning.
This is the central problem that shapes the rest of the book. The loop has no awareness of how much context it's using, no strategy for summarizing or truncating tool outputs, no concept of a scratchpad for state that shouldn't be in context at all — and no way to tell the difference between 200 bytes of useful signal and 200KB of noise that happens to look similar at the protocol level.
What's missing. Context accounting (Chapter 7), compaction (Chapter 8), external state (Chapter 9), retrieval (Chapter 10), and deliberate tool design that avoids producing these blobs in the first place (Chapter 11).
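As a taste of the simplest of those mitigations, here is a sketch of clipping oversized tool output — a hypothetical clip helper with an assumed character budget; Chapter 7 accounts in tokens, not characters, and does better than blind truncation:

```python
MAX_TOOL_RESULT_CHARS = 4_000  # assumed per-result budget, not a real constant


def clip(result: str) -> str:
    """Truncate oversized tool output, noting how much was dropped."""
    if len(result) <= MAX_TOOL_RESULT_CHARS:
        return result
    dropped = len(result) - MAX_TOOL_RESULT_CHARS
    return result[:MAX_TOOL_RESULT_CHARS] + f"\n[...truncated {dropped} chars]"


print(clip("fine"))  # -> fine
print(clip("X" * 200_000).endswith("[...truncated 196000 chars]"))  # -> True
```

Crude as it is, this one guard turns Break 5 from a mysterious context-window crash into a visible, bounded loss of information.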
Those five breaks are the book. Look at them laid out:
| Break | What's Missing | Chapter |
|---|---|---|
| 1. Tool doesn't exist | Dispatch layer with structured errors | 4, 6 |
| 2. Args don't match schema | Pre-dispatch validation | 6 |
| 3. Tool raises | Wrapped tool execution | 6 |
| 4. Model never stops | Budget + dedup + observability | 6, 18, 20 |
| 5. Tool returns too much | Context engineering | 7–11 |
Every chapter from here to Chapter 11 is motivated by one of these five breaks. The chapters after that — orchestration, observability, evals, cost control, resumability — extend the harness into production territory once the core is solid.
If at any point the design feels over-engineered, come back to this table. Every piece of machinery is there because we watched the absence of it crash a loop.
Before we close the chapter, a sanity check: do real harnesses actually look like this — a while loop with a dispatch inside? The answer, across the ones worth looking at, is yes.
Claude Code, per Anthropic's public documentation, has an agent loop described in roughly 88 lines internally. The core is exactly what we wrote: the model produces a response, if the response contains tool calls the harness dispatches them and appends results, then it loops; otherwise, it returns. What Claude Code adds on top is production-grade error handling, a permission gate in front of every side-effecting tool, and the compaction and checkpointing we'll build in later chapters.
smolagents, Hugging Face's open-source agent library, fits in about a thousand lines total. Its MultiStepAgent.run() method is a for _ in range(self.max_steps) loop with the same three decisions we just wrote, plus a richer error taxonomy and an observation-formatting layer.
mini-swe-agent, the minimal variant of SWE-agent, is about a hundred lines and uses the same structure with a single bash tool instead of a registry — a useful reference for how thin a working harness can get when the problem shape is narrow enough to assume a single tool.
LangGraph looks different on the surface — it's a compiled graph, not a while — but the graph compiles down to a Pregel-style execution model in which a ReAct-style agent is a cycle between an LLM node and a tool node. Same three decisions as our naive loop, different packaging, and the "ReAct" naming here refers directly to Yao et al.'s 2022 paper that introduced the reason-act-observe paradigm the graph implements.
The loop is not a simplification for pedagogy; it is the actual shape of the thing, and Wang et al.'s 2024 "Survey on LLM-based Autonomous Agents" — which analyzed more than a hundred LLM-agent implementations across research and production — found the same core structure in the overwhelming majority of them. Everything else is engineering around it.
An exercise to close the chapter: rewrite run() with TypedDict message shapes instead of raw dicts. Does the type checker catch any of the five breaks at authoring time? Which ones does it not catch, and why? (The answer to the second question is where most of Chapter 3's motivation comes from.)
Chapter 3. Messages, Turns, and the Transcript
Previously: we built a forty-line loop against a mock provider and watched it break five ways. Break 5 — tool output overwhelming the transcript — hinted that the transcript was doing too much work as a pile of dicts. It's time to give it some structure, and at the same time plug in real providers.