Chapter 1. What an Agent Actually Is
A few months ago a friend asked me to look at a system he was building. He said: "My agent keeps forgetting what it's doing."
He showed me the code — a while loop, a call to messages.create(), a list of three tools. The agent was calling the right tool most of the time, but after about forty turns it would get confused, start over, re-derive facts it had established an hour ago, apologize for being slow. He wanted to know how to give it better memory.
The answer wasn't memory. His while loop was not a harness — it was a prompt running on repeat, and the forgetting he was chasing was just the most visible symptom of everything else the loop wasn't doing.
That gap is the problem this book exists to solve. The model — Claude, GPT, whatever — is the easy part: you call an API, you get a response. What's hard is everything around the model: the loop that decides when to stop, the protocol that shapes each turn, the context that shrinks faster than your agent grows smarter, the tools that have to work ten thousand times without corrupting state, the observability that tells you why last night's run went wrong. That "everything around the model" has a name — the harness — and building a good one is its own distinct craft, with its own failure modes, its own literature, and its own accumulated body of operational wisdom that is rarely written down in one place.
This chapter is about getting the vocabulary right before we write a single line. If the right name for your system is "workflow" when you've been calling it an "agent," you will solve the wrong problems for six months. If your harness is a bare while loop, you will ship something that doesn't know how to bound itself — and the failure mode, when it arrives, will look like a mystery rather than an oversight. So we start here, with definitions sharp enough that the rest of the book has stable ground to stand on.
By the end of this chapter, you'll have a repo skeleton, an understanding of the design space we're navigating, and a preview of the twenty-one chapters that follow. No agent yet; the next chapter builds one in forty lines and then breaks it in five different ways, and those five breakages become the itinerary for everything after.
1.1 Models, Agents, and the Category Error
Three definitions worth getting straight, because conflating them is the root cause of roughly every bad architecture I've seen in the past year.
A model is a function: you hand it a context, and it returns a probability distribution over next tokens. That is the whole contract. A model has no memory, no goals, and no capacity to act in the world — it responds to what it is given, and then it is done. This is true whether the model is a frontier system with hundreds of billions of parameters or a locally-hosted open model a fraction the size; the protocol is tokens → tokens, and everything else lives outside. Russell and Norvig's Artificial Intelligence: A Modern Approach draws the same line in classical terms: the model is the function, and an agent is the system that uses the function to perceive and act.
An agent is a loop around a model — plus bounded state, a set of tools, and a policy for managing context across turns. Where a model is stateless, an agent has memory, imperfect because context is finite, but persistent across turns through the transcript the loop keeps feeding back in. Where a model has no goals, an agent pursues goals expressed through the context it is given and re-given each turn. And where a model cannot act, an agent acts through tools that the loop dispatches on the model's behalf. The classical agent literature — Franklin and Graesser's 1996 "Is it an agent, or just a program? A Taxonomy for Autonomous Agents" is the most-cited piece — tends to require four properties: autonomy (it makes its own decisions), reactivity (it responds to its environment), proactivity (it pursues goals rather than only reacting), and situatedness (it is embedded in a world where its actions have consequences). Production LLM agents satisfy all four, though the "world" in question is usually a bounded environment defined by the tools you chose to give them.
A harness is the engineering that surrounds the model and turns it into an agent. The word has useful history: a test harness runs test code inside a controlled environment that provides setup, isolation, and teardown; an operating system is, in a practical sense, a harness around the processes it runs, giving them addresses, scheduling, I/O, and crash recovery while the processes themselves do only compute. An LLM harness is the same kind of thing for a model. Because a model's contract is narrow — tokens in, tokens out — everything else required to make the model useful in a production setting has to live outside the model. That "everything else" is the harness, and its scope is broader than most first attempts reckon with:
- A loop that decides when to invoke the model, when to call a tool, when to retry a failed call, and when to stop and return a final answer.
- A turn protocol that structures each call's input (system prompt, history, tool schemas, available state) and parses each call's output (text, tool calls, reasoning traces, stop reasons).
- Context management — the policy that decides what the model sees turn after turn as the session grows past the window, through compaction, retrieval, scratchpad offload, or observation masking.
- Tool orchestration that registers the agent's available actions, dispatches them safely, validates arguments against schemas, and routes results back into context without corrupting anything upstream.
- Error handling that keeps a single bad tool call, malformed response, or transient provider failure from poisoning the rest of the session.
- Observability that tells you, after the fact, why a run went the way it did — which tool took twelve seconds, which compaction dropped the fact the final answer needed, which sub-agent burned the tokens.
- Persistence that lets a session survive a process crash or a deliberate pause, with side-effecting tool calls de-duplicated on resume so nothing runs twice.
- Permission and budget controls that prevent the agent from taking actions you didn't authorize or spending money you can't afford.
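Compressed into code, that division of labor looks something like the sketch below. The names (`run_agent`, the dict-shaped messages) are hypothetical placeholders, and this is deliberately tiny — the real loop arrives in Chapter 2 — but each comment maps to one of the responsibilities above:

```python
def run_agent(model, tools, prompt, max_turns=10):
    """A compressed sketch of the harness responsibilities above.

    Hypothetical shapes throughout; the real loop is built in Chapter 2.
    """
    transcript = [{"role": "user", "content": prompt}]   # turn protocol: structured input
    for _ in range(max_turns):                           # budget control: bounded by design
        reply = model(transcript)                        # the loop invokes the model
        transcript.append(reply)
        if reply.get("tool") is None:                    # the loop decides when to stop
            return reply["content"]
        name, args = reply["tool"]
        try:
            result = tools[name](**args)                 # tool orchestration: dispatch
        except Exception as exc:                         # error handling: one bad call
            result = f"tool error: {exc}"                #   must not poison the session
        transcript.append({"role": "tool", "content": str(result)})
    return "stopped: turn budget exhausted"              # the harness bounds itself
```

Even this toy version makes the point: persistence, observability, and permissions have no home in a bare while loop, which is exactly why accidental harnesses lack them.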
None of these is optional for a production system. They are either present by design, because you built them, or present by accident, because you didn't — and accidental harnesses are where most of the public post-mortems come from. The 2025 $47K agent-loop incident, in which two agents ping-ponged requests for eleven days while token-budget alerts fired but no enforcement existed to stop them, is one recent example; the MAST study of multi-agent failure modes (Cemri et al., 2025) traces 36.9% of observed failures to coordination breakdowns the harness was supposed to mediate. The pattern these share is that a harness is what decides whether a model's capability becomes a system's capability. The book's argument, elaborated across twenty-two chapters, is that a harness deserves the same engineering discipline you'd give a database schema or a service mesh — because, for this new class of system, that is what it is.
The category error — and it is everywhere — is the claim that an agent simply is a model, or that "building an agent" means "picking a good model." It sneaks in through framing: you ask a vendor for an agent and they sell you a model; you type a prompt into a chat interface, watch the interface use a tool, and the whole exchange gets called "agentic." Sometimes the usage is just loose. Sometimes it is load-bearing, and that is where it hurts — if your mental model says "the model does the agent things," you will not build the parts that actually need building, and your users will feel it in reliability, latency, cost, and safety, usually in that order.
A useful test. If the model gets twice as good tomorrow — OpenAI ships GPT-6, Anthropic ships Claude 5 — does your agent get twice as good? If the answer is yes, you have built well: your harness is a thin, honest conductor for the model's capability, and a rising tide lifts it cleanly. If the answer is no — and in practice it is almost always no — then something else is the bottleneck, and that something lives in the harness. Most systems that call themselves agents fall into the second category, and so does every system the book teaches you to build.
Claude Code is a harness. LangGraph is a harness. The OpenAI Agents SDK is a harness. Claude itself is a model. The shape of the harness determines almost everything about how the assembled system behaves — its failure modes, its operating costs, its observability story, its user experience — and the rest of this book is about how to shape yours well.
1.2 The Four-Axis Design Space
Every agent harness can be located on four axes. I'll return to these throughout the book; they are the coordinate system we'll navigate. They are a pragmatic taxonomy rather than a formal one — if you prefer Russell and Norvig's classical simple reflex / model-based / goal-based / utility-based / learning hierarchy, our axes span roughly the same design space, but cut it along lines that matter for LLM-powered systems specifically.
Autonomy. How much decision-making the harness delegates to the model. At one extreme is a deterministic workflow with LLM calls inserted at fixed points — no loop, no model-initiated choice, the control flow is yours and the model is an ingredient. At the other is a bare loop that hands the model every decision, including when to stop. Most production systems sit somewhere in between, and pretending to be more autonomous than you actually are — "our pipeline is fully agentic!" — misleads both you and your users, usually in ways that only become visible under load.
State. What the harness remembers across turns. Three tiers appear often enough to be worth naming. Context-only state lives entirely in the model's window and is lost on compaction. Session state is persisted structurally outside the window — a scratchpad, a plan object, a running tally — and survives within a single run. Durable state survives process crashes and deployments, stored in a database row or a serialized checkpoint on disk. Most naive harnesses sit at tier one and don't realize they've chosen that tier; most production harnesses need all three, and this book will build all three across the course of its chapters.
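The difference between tiers two and three is worth seeing in code. A minimal sketch, with hypothetical names and assuming a JSON file as the durable checkpoint (the book's real persistence layer comes later):

```python
import json
import os
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Tiers two and three of the taxonomy above. Tier one is the
    context window itself, so it lives in the transcript, not here.
    Hypothetical shape, for illustration only."""
    scratchpad: dict = field(default_factory=dict)  # tier 2: survives compaction
    checkpoint_path: str = "session.json"           # tier 3: survives crashes

    def save(self) -> None:
        # Durable state is just session state written somewhere that
        # outlives the process.
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.scratchpad, f)

    @classmethod
    def load(cls, path: str) -> "SessionState":
        state = cls(checkpoint_path=path)
        if os.path.exists(path):
            with open(path) as f:
                state.scratchpad = json.load(f)
        return state
```

The point of the sketch is how little separates the tiers: a harness stuck at tier one isn't missing infrastructure, it's missing the decision to keep anything outside the window at all.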
Tools. What the agent can do in the world. Zero tools and you have a chatbot. A handful of well-designed tools — shell, file read and edit, search, retrieval — and you have a capable assistant. Fifty tools and you have a scalability problem most teams discover too late, long after the tool list has calcified into something nobody wants to touch. Chapter 12 is devoted to the "tool cliff," the non-linear performance collapse that happens somewhere north of twenty tools for most models, and the two standard fixes (BM25-based selection and a pinned discovery tool) that keep capacity growing after the cliff.
Context. How the harness manages what the model sees, turn after turn. The options form a rough progression: pure append (the naive default, which fails around turn forty in most setups), sliding window, summarization, compaction, retrieval, scratchpad offload. Each has its own cost, failure mode, and appropriate home. Context engineering — the discipline of doing this well — has, in the last year, become the central skill of production agent work, to the point that several large vendors now treat it as a distinct specialty within applied AI. Chapters 7 through 11 live there.
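The first rung after pure append — a sliding window that pins the system prompt and drops the oldest turns — fits in a few lines. This sketch uses a crude four-characters-per-token estimate as a stand-in for a real tokenizer, which is an assumption, not a recommendation:

```python
def sliding_window(messages: list[dict], budget: int) -> list[dict]:
    """Pin the system prompt, drop oldest turns until under budget.

    Token counts are a rough chars/4 estimate, not a real tokenizer;
    this is the simplest context policy after pure append.
    """
    def estimate(m: dict) -> int:
        return len(m["content"]) // 4

    system, rest = messages[0], list(messages[1:])
    while rest and estimate(system) + sum(estimate(m) for m in rest) > budget:
        rest.pop(0)  # oldest turn goes first; recency is preserved
    return [system, *rest]
```

Even this naive policy has a visible failure mode — it silently forgets whatever happened first — which is why the progression continues through summarization, compaction, and retrieval rather than stopping here.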
An honest mapping of your own system onto these four axes will tell you what to build next. A good harness does not maximize all four; it picks the point that fits the problem and then engineers that point hard. The harness we build in this book lands at medium autonomy, all three state tiers, about a dozen carefully-designed tools, and aggressive context engineering — a specific choice, not a universal answer. You should feel free to locate your own projects elsewhere, and part of what this chapter's taxonomy is for is giving you the vocabulary to defend that choice when you do.
1.3 Workflow vs. Agent — Anthropic's Useful Distinction
In December 2024 Anthropic published a post titled "Building Effective Agents" that drew a line I'll lean on throughout the book. In paraphrase: a workflow is a predefined code path where LLM calls happen at specific, pre-decided steps — the control flow is yours, and the model is an ingredient you drop into slots you've chosen. An agent, by contrast, is a system where the LLM directs its own control flow — deciding which tools to call, in what order, and when to stop — without the code above it dictating the next step.
Both are legitimate architectures, and most production systems are in fact mixtures of the two. The Anthropic post's most-skipped observation, and the one most worth internalizing early, is that workflows win more often than builders expect. If your problem has predictable structure, a workflow will be faster, cheaper, more reliable, and easier to debug than an agent doing the same work; agents earn their complexity only when the problem genuinely requires dynamic tool selection, iterative refinement, or open-ended exploration.
The foundational paper behind the agent-as-a-loop view is Yao et al.'s 2022 "ReAct: Synergizing Reasoning and Acting in Language Models," which established that interleaving reasoning steps with tool calls — think, act, observe, think, act, observe — outperforms either pure reasoning or pure tool-use in isolation. Nearly every LLM agent built since has been a variation on that loop. Simon Willison has pushed Yao's framing further into the vernacular, arguing that the word "agent" has become so overloaded in marketing copy that it's almost unusable, and that the operational definition practitioners have converged on is simply tools in a loop. That is a deliberately modest framing — it names what the software actually does without promising autonomy it doesn't have — and it is the framing this book will adopt throughout.
Both positions matter here. We are going to build a loop around a model — a real, production-capable loop — but we are not going to pretend the loop has wisdom it lacks. Every design decision we make will be about giving the model enough structure to do useful work, without enough rope to hang our users.
Here is a concrete diagnostic, worth running the next time someone on your team says "let's make this agentic":
- Can the problem be solved by a pipeline of N fixed LLM calls? If yes, build that, and don't build an agent.
- Does the problem require the model to decide, at runtime, which of several tools to invoke and in what order? If yes, you have the beginning of a case for an agent.
- Does the problem require the agent to iterate — call a tool, read the result, decide on a next tool based on what it learned? If yes, you need a real loop.
- Does the problem require the agent to plan, fail, and re-plan over a long horizon? If yes, you need a full harness of the sort this book builds, and you need evaluations alongside it, because the failure modes at that horizon are not obvious and will not surface until you measure them.
Three out of four production "agents" I've seen should have been workflows. That is not a criticism — workflows are often the right answer, and a well-built workflow with three carefully-placed LLM calls will frequently outperform a loosely-wired agent on the same task. The problem is that calling a workflow an agent hides the real engineering that keeps it working, and that hiding is what lets systems drift into the accidental-harness territory §1.1 warned about.
1.4 What Makes a Harness Hard
Three things, roughly in order of how often they'll bite you.
Context is a moving target. Models have fixed context windows — typically 200K tokens at the time of writing, sometimes 1M on the premium tier — but tool outputs are unbounded, and tool outputs go into context. Do the arithmetic: your agent reads a 60,000-token file and half its window is gone; read two such files and the session is effectively over, because there is no room left for the actual reasoning the model is supposed to do. The hard limit is only half the story. Liu et al.'s 2023 "Lost in the Middle: How Language Models Use Long Contexts" showed that models attend disproportionately to information at the beginning and end of the window and systematically miss content in the middle, and the Chroma research team's July 2025 study extended that finding into what they named context rot: performance on retrieval-style tasks degrades continuously as context fills, well before any hard limit is reached. The window, in practice, is both smaller and more hostile than the spec sheet suggests. Every production harness has some answer to this, and the quality of the answer is more or less the quality of the harness itself.
Tools lie, and the model believes them. A tool that returns truncated output without saying so will make the model confidently wrong downstream. A tool whose description says "sends a message" but actually sends five will get abused in ways the description gave no warning about. A tool that fails silently, or returns a null value that could plausibly mean either "not found" or "permission denied," will send the agent into a loop while it tries to disambiguate a situation the tool's author already forgot about. Tools are not a programming interface for humans; they are an interface for a non-human consumer with very specific cognitive constraints, and designing them is its own discipline. It's also the discipline most under-invested in practice, which is why Chapter 11 spends an entire chapter on getting this right and why the tool-cliff problem in Chapter 12 exists in the first place.
Failure compounds. An agent with 95% per-step accuracy has roughly a 60% chance of completing a ten-step task cleanly; with 85% per-step accuracy — which is not a low number in absolute terms — the end-to-end success rate drops to about 20%. Long-horizon agents fail not because any single step is catastrophic, but because every step is an independent coin flip and the flips multiply. The MAST study (Cemri et al., 2025) measured this empirically across real multi-agent systems and found that most failures come from accumulated small errors — misinterpretations, specification misunderstandings, coordination slips — rather than from any single dramatic mistake. A harness earns its keep by turning coin flips into decisions: validating each step, recovering from errors, preventing one small mistake from poisoning the next ten. This is why most of the engineering in this book is defensive in character — not a disappointment, it's the job.
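The arithmetic behind those numbers is one line, under the simplifying assumption that steps fail independently:

```python
def horizon_success(per_step: float, steps: int) -> float:
    """End-to-end success when every step is an independent coin flip:
    the probabilities multiply, so reliability decays geometrically."""
    return per_step ** steps

# 95% per-step over ten steps: barely better than a coin flip end to end.
# 85% per-step over ten steps: roughly one clean completion in five.
```

Real steps aren't fully independent — one misinterpretation often causes the next — which in practice makes the decay steeper than the formula, not gentler.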
Every chapter of the book addresses one of these three forces. You will notice as we go that the interesting design questions almost never concern the model itself. They concern the protocol around it.
- windows are finite
- tool output is unbounded
- attention degrades long before the limit
- silent truncation
- underspecified schemas
- ambiguous nulls
- side-effect surprises
- 0.95^10 ≈ 0.60
- 0.85^10 ≈ 0.20
- every step is a new coin flip
1.5 What This Book Builds
By the end of Chapter 22 your repo will contain a harness that:
- Runs against any provider through a thin adapter layer — Anthropic, OpenAI, a locally-hosted open-source model — without changing agent logic. Provider-agnostic from Chapter 2 on; we never hard-code a vendor in the core.
- Manages its own context window through compaction, observation masking, and external scratchpad storage.
- Supports bounded sub-agent delegation with per-agent cost attribution and a spawning budget.
- Sandboxes tool execution with filesystem allowlists and an explicit permission gate.
- Consumes external tools via the Model Context Protocol, inside the same permission model as built-ins.
- Streams responses, handles interruption, and resumes durably after a crash.
- Emits OpenTelemetry traces with per-session, per-task, per-agent attribution.
- Runs a golden-trajectory regression suite before any model upgrade.
That list looks long because it is, and the book builds it piece by piece — every chapter takes one subsystem from naive to correct, and the chapter after that uses it, so the harness grows in a way that stays runnable at every stage rather than requiring a big-bang reveal at the end.
The harness will not be a clone of Claude Code, LangGraph, or the OpenAI Agents SDK. It steals shamelessly from all three, and from SWE-agent, smolagents, and every public post-mortem I could find while writing. What it will be is yours — a working system you understand end-to-end, small enough to read in an afternoon, capable enough to do real work.
One constraint is deliberate and worth flagging early: the harness is provider-agnostic from the start. The very first loop we write, in Chapter 2, runs against a mock provider — not because mocks are pedagogically cute, but because the moment you hard-code one vendor's SDK into your harness you have built a thing that's harder to migrate than it needs to be, and harder to test than it needs to be, and more tightly coupled to a pricing model you don't control than is comfortable. The adapter seam is the first piece of real architecture we lay down, and every subsequent chapter respects it. By Chapter 22 you will run the same agent code against Anthropic, OpenAI, and a local open-source model side-by-side — swapping providers is a configuration change, not a rewrite, which is the concrete payoff of an investment made in the opening chapters.
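The seam itself can be as small as a Protocol with one method. A sketch with hypothetical names — the book's real Provider protocol is defined in Chapter 3:

```python
from typing import Protocol


class Provider(Protocol):
    """The adapter seam: agent code calls complete() and never imports
    a vendor SDK. Hypothetical shape, refined in Chapter 3."""
    def complete(self, messages: list[dict], tools: list[dict]) -> dict: ...


class MockProvider:
    """Replays scripted responses — the kind of in-memory fake the
    first loop in Chapter 2 runs against."""
    def __init__(self, scripted: list[dict]) -> None:
        self.scripted = list(scripted)

    def complete(self, messages: list[dict], tools: list[dict]) -> dict:
        return self.scripted.pop(0)
```

A vendor-specific adapter implements the same single method by translating to and from that vendor's wire format; nothing above the seam changes, which is what makes provider swaps a configuration change rather than a rewrite.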
1.6 How to Read This Book
Two modes are supported, and most readers will use a bit of both.
Linear. Open Chapter 2, type every line, run the tests, and commit. Read the theory halves alongside the code — they aren't decoration; they are the argument for why the code is shaped the way it is. By the end of each chapter, git log in your repo should tell the story of the chapter in five to fifteen commits. If something stops working, the companion repo has one annotated tag per chapter — git checkout ch03-transcript will put you on known-good ground to resync.
Reference. You already have a harness and you want to understand compaction, or sub-agents, or evals. Each chapter stands alone well enough to skim: the opening frames the problem in concrete terms, and the "what you now understand" close tells you what you should have gained so you can check yourself. Every concept has exactly one canonical home in the book, and other chapters point back rather than re-explaining.
Both modes assume Python 3.11+, comfort with type hints and async/await, and prior exposure to at least one LLM API. You don't need to have built an agent before — if you have, some of the early chapters will go fast, which is fine, because the vocabulary we establish here pays off in Chapters 7 through 11.
A word on the code. Every code block in the book runs when assembled in order. There are no # ... ellipses hiding the load-bearing parts; when I simplify for teaching purposes, I say so, and the companion repo has the fuller version. The code is not decorative — it is the argument made concrete.
A word on opinions. This book is opinionated. Where a defensible alternative exists I'll name it and explain when to take it; where I'm picking one path because the book needs a through-line, I'll say that too. My goal is not to convince you that my way is the only way. It is to give you enough understanding of the tradeoffs that you can make your own decisions, and, when a new framework or new model generation lands in six months, evaluate it on substance rather than marketing.
1.7 Setting Up the Repo
Enough framing. Let's make a place to put the code.
We need Python 3.11 or newer and uv. If you don't have uv: curl -LsSf https://astral.sh/uv/install.sh | sh (or brew install uv). Everything in this book runs under uv — it manages the Python toolchain, the venv, and dependencies in one binary, and it's ~10× faster than pip for the install loops we do across chapters.
Create the project directory and initialize:
mkdir agent-harness && cd agent-harness
uv init --package --python 3.11
uv init --package scaffolds a pyproject.toml with a src/ layout, which matches the layout we want. Replace the generated pyproject.toml with this — it declares the package, the Python version floor, and a few light dependencies:
# pyproject.toml
[project]
name = "harness"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "httpx>=0.27",
]

[project.optional-dependencies]
anthropic = ["anthropic>=0.40"]
openai = ["openai>=1.40"]

[dependency-groups]
dev = ["pytest>=8.0", "pytest-asyncio>=0.23"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/harness"]
Notice that anthropic and openai are optional extras. That's deliberate. The core harness should install without either. We'll exercise this in Chapter 3, when the mock provider does all the work while the real SDKs stay uninstalled. dev lives under [dependency-groups] — uv's native way to express dev-only dependencies without polluting the published extras.
Sync the environment. uv creates .venv/ automatically on first run:
uv sync
That's the whole install. You don't need to source .venv/bin/activate — we use uv run for every command, and uv picks up the project's venv automatically.
The project layout we will grow into. Create it now; most of the files are empty. We'll populate them as we go.
agent-harness/
├── pyproject.toml
├── README.md
├── src/
│   └── harness/
│       ├── __init__.py
│       ├── agent.py         # the loop (Chapter 2)
│       ├── messages.py      # typed transcripts (Chapter 3)
│       ├── providers/       # provider adapters (Chapter 3)
│       │   ├── __init__.py
│       │   ├── base.py      # the Provider protocol
│       │   └── mock.py      # in-memory fake
│       ├── tools/           # tool protocol + registry (Chapters 4-5)
│       │   └── __init__.py
│       └── context/         # accounting + compaction (Chapters 7-11)
│           └── __init__.py
├── tests/
│   └── __init__.py
└── examples/
One file worth writing this early. A smoke test that proves the package imports and the Python version is what we need.
# tests/test_smoke.py
import sys

import harness


def test_python_version() -> None:
    assert sys.version_info >= (3, 11), "This book assumes Python 3.11+"


def test_package_imports() -> None:
    assert harness is not None
Run it:
uv run pytest tests/test_smoke.py -q
You should see two passing tests. If you don't, fix the import path before going further. Every chapter in this book assumes uv run pytest runs clean at the start. Broken tests accumulate badly.
Commit:
git init
git add .
git commit -m "ch01: project skeleton"
git tag ch01-skeleton
That tag will matter in Chapter 2, when we want to show you exactly what changed. Every chapter ends with one tag of this shape.
1.8 Try It Yourself
- Map three systems. Pick three agent-shaped systems you've used — ChatGPT, Claude Code, Cursor, a customer support chatbot, whatever. For each, place it on the four axes from Section 1.2. Where is it high-autonomy? Where is its state? How many tools? What's its context strategy, as far as you can tell from the outside? Write it down. This is the last time in the book I'll ask you to think without writing code, but it's worth doing before Chapter 2.
- Workflow or agent? Take a task you or your team is about to automate. Apply the four-question diagnostic from Section 1.3. Where does the task land? If the honest answer is "workflow, actually," notice that and consider whether you still want to read this book. (You probably do — harness patterns show up in workflow design too — but the goal framing changes.)
- Read one harness's core loop. Open the source of one of the harnesses we surveyed — smolagents/agents.py is a good starting point at ~1,000 lines, and mini-swe-agent/src/minisweagent/agent.py is even shorter at ~100 lines. Read the main loop. Find the three moments where the loop decides: to call a tool, to stop, to produce final output. Don't try to understand everything; just find those three moments. You will write your own versions in the next few chapters, and it helps to have seen real examples first.