[{"data":1,"prerenderedAt":4873},["ShallowReactive",2],{"navigation":3,"page-\u002Fchapters\u002Fevals":102,"surround-\u002Fchapters\u002Fevals":4868},[4,8],{"title":5,"path":6,"stem":7},"Home","\u002F","index",{"title":9,"path":10,"stem":11,"children":12,"page":101},"Chapters","\u002Fchapters","2.chapters",[13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97],{"title":14,"path":15,"stem":16},"Chapter 1. What an Agent Actually Is","\u002Fchapters\u002Fwhat-an-agent-actually-is","2.chapters\u002F01.what-an-agent-actually-is",{"title":18,"path":19,"stem":20},"Chapter 2. The Minimum Viable Loop","\u002Fchapters\u002Fminimum-viable-loop","2.chapters\u002F02.minimum-viable-loop",{"title":22,"path":23,"stem":24},"Chapter 3. Messages, Turns, and the Transcript","\u002Fchapters\u002Fmessages-turns-transcript","2.chapters\u002F03.messages-turns-transcript",{"title":26,"path":27,"stem":28},"Chapter 4. The Tool Protocol","\u002Fchapters\u002Ftool-protocol","2.chapters\u002F04.tool-protocol",{"title":30,"path":31,"stem":32},"Chapter 5. Streaming, Interruption, and Error Handling","\u002Fchapters\u002Fstreaming-interruption-errors","2.chapters\u002F05.streaming-interruption-errors",{"title":34,"path":35,"stem":36},"Chapter 6. Safe Tool Execution","\u002Fchapters\u002Fsafe-tool-execution","2.chapters\u002F06.safe-tool-execution",{"title":38,"path":39,"stem":40},"Chapter 7. The Context Window Is a Resource","\u002Fchapters\u002Fcontext-window-as-resource","2.chapters\u002F07.context-window-as-resource",{"title":42,"path":43,"stem":44},"Chapter 8. Compaction","\u002Fchapters\u002Fcompaction","2.chapters\u002F08.compaction",{"title":46,"path":47,"stem":48},"Chapter 9. External State: The Scratchpad","\u002Fchapters\u002Fscratchpad","2.chapters\u002F09.scratchpad",{"title":50,"path":51,"stem":52},"Chapter 10. Retrieval","\u002Fchapters\u002Fretrieval","2.chapters\u002F10.retrieval",{"title":54,"path":55,"stem":56},"Chapter 11. 
Designing Tools Models Can Actually Use","\u002Fchapters\u002Fdesigning-tools","2.chapters\u002F11.designing-tools",{"title":58,"path":59,"stem":60},"Chapter 12. The Tool Cliff and Dynamic Loading","\u002Fchapters\u002Ftool-cliff","2.chapters\u002F12.tool-cliff",{"title":62,"path":63,"stem":64},"Chapter 13. MCP: Tools From the Outside World","\u002Fchapters\u002Fmcp","2.chapters\u002F13.mcp",{"title":66,"path":67,"stem":68},"Chapter 14. Sandboxing and Permissions","\u002Fchapters\u002Fsandboxing-permissions","2.chapters\u002F14.sandboxing-permissions",{"title":70,"path":71,"stem":72},"Chapter 15. Sub-agents","\u002Fchapters\u002Fsub-agents","2.chapters\u002F15.sub-agents",{"title":74,"path":75,"stem":76},"Chapter 16. Structured Plans and Verified Completion","\u002Fchapters\u002Fplans-verified-completion","2.chapters\u002F16.plans-verified-completion",{"title":78,"path":79,"stem":80},"Chapter 17. Parallelism and Shared State","\u002Fchapters\u002Fparallelism-shared-state","2.chapters\u002F17.parallelism-shared-state",{"title":82,"path":83,"stem":84},"Chapter 18. Observability","\u002Fchapters\u002Fobservability","2.chapters\u002F18.observability",{"title":86,"path":87,"stem":88},"Chapter 19. Evals","\u002Fchapters\u002Fevals","2.chapters\u002F19.evals",{"title":90,"path":91,"stem":92},"Chapter 20. Cost Control","\u002Fchapters\u002Fcost-control","2.chapters\u002F20.cost-control",{"title":94,"path":95,"stem":96},"Chapter 21. Resumability and Durable State","\u002Fchapters\u002Fresumability","2.chapters\u002F21.resumability",{"title":98,"path":99,"stem":100},"Chapter 22. 
What Transfers, Where to Go","\u002Fchapters\u002Fwhat-transfers","2.chapters\u002F22.what-transfers",false,{"id":103,"title":86,"body":104,"description":117,"extension":4863,"meta":4864,"navigation":4865,"path":87,"seo":4866,"stem":88,"__hash__":4867},"content\u002F2.chapters\u002F19.evals.md",{"type":105,"value":106,"toc":4852},"minimark",[107,111,118,121,124,148,151,214,217,222,225,228,247,253,259,265,268,270,274,774,777,779,783,2871,2882,2913,2923,2925,2929,3650,3653,3972,3989,3991,3995,4001,4399,4402,4408,4414,4420,4422,4426,4431,4705,4712,4715,4717,4721,4724,4746,4749,4751,4755,4802,4806,4837,4839,4848],[108,109,86],"h1",{"id":110},"chapter-19-evals",[112,113,114],"p",{},[115,116,117],"em",{},"Previously: observability — every operation in the harness emits a structured span, per-agent cost attribution works, dashboards show drift. Observability says what happened; it doesn't say whether what happened was right.",[112,119,120],{},"The difference matters. A zero-error run can produce a wrong answer. An agent that \"succeeds\" by the harness's internal definition can fail the user's actual need. Hamel Husain's working point, widely cited among practitioners, is worth stating again: agent complexity is only justified when you can define precise task-success criteria and build evaluations that measure them. Without evals, agent complexity is debt. On the research side, Liu et al.'s 2023 \"AgentBench: Evaluating LLMs as Agents\" made a parallel point by example — it proposed evaluating agents across eight distinct environments (operating systems, databases, knowledge graphs, web tasks) specifically because no single-task benchmark was capturing what real agent deployments required, and the substantial cross-environment variance their data showed is one reason you can't rely on a model's headline number when deciding whether it's right for your workload.",[112,122,123],{},"This chapter builds a minimal eval harness. 
Three pieces by the end:",[125,126,127,136,142],"ol",{},[128,129,130,131,135],"li",{},"A ",[132,133,134],"strong",{},"golden trajectory"," format: a task spec, expected outcomes, a way to score a run against it.",[128,137,130,138,141],{},[132,139,140],{},"regression runner"," that executes the harness against a suite of golden trajectories and reports pass\u002Ffail.",[128,143,130,144,147],{},[132,145,146],{},"production-to-eval pipeline",": when a real run fails, the trace becomes a new eval case automatically.",[112,149,150],{},"Chapter 22 runs the full harness against three providers using this machinery. For now, we build the machinery.",[152,153,157,206],"figure",{"className":154},[155,156],"not-prose","my-8",[158,159,167,179,184,188,191,195,198],"div",{"className":160},[161,162,163,164,165,166],"flex","flex-row","flex-wrap","items-center","justify-center","gap-2",[158,168,178],{"className":169},[170,171,172,173,174,175,176,177],"rounded-full","border","border-default","bg-elevated","px-3","py-2","text-xs","text-default","production trace",[158,180,183],{"className":181},[182],"text-muted","→",[158,185,187],{"className":186},[170,171,172,173,174,175,176,177],"failure triggers capture",[158,189,183],{"className":190},[182],[158,192,194],{"className":193},[170,171,172,173,174,175,176,177],"trace → eval case",[158,196,183],{"className":197},[182],[158,199,205],{"className":200},[170,201,202,173,174,175,176,203,204],"border-2","border-primary","text-primary","font-semibold","CI runs eval before merge",[207,208,213],"figcaption",{"className":209},[176,182,210,211,212],"mt-3","text-center","italic","Production-to-eval pipeline: real failures become regression tests; the CI gate blocks re-regression.",[215,216],"hr",{},[218,219,221],"h2",{"id":220},"_191-what-to-measure","19.1 What to Measure",[112,223,224],{},"Agent evals operate at the trajectory level, not the turn level. 
A single turn can look great in isolation and be wrong in context; a single turn can look ugly and be part of a correct recovery. The unit of evaluation is the full task from prompt to final output.",[112,226,227],{},"Four metric classes are worth tracking:",[112,229,230,233,234,238,239,242,243,246],{},[132,231,232],{},"Completion."," Did the agent finish? This is the coarsest signal: ",[235,236,237],"code",{},"True"," if it returned an answer; ",[235,240,241],{},"False"," if it crashed, hit ",[235,244,245],{},"MAX_ITERATIONS",", or exceeded a budget.",[112,248,249,252],{},[132,250,251],{},"Correctness."," Is the answer right? This needs task-specific logic. For a \"read file and return its size\" task, we can check the answer directly. For \"summarize this article\" there is no trivial check; we need either LLM-as-judge or a human.",[112,254,255,258],{},[132,256,257],{},"Process validity."," Did the agent do the right work on the way to the answer? Did it call the right tools in a reasonable order? Did it compact when expected? Did it use the plan structure? These are trajectory-level structural checks.",[112,260,261,264],{},[132,262,263],{},"Cost."," How many tokens did it take? How many turns? A correct answer that costs 50K tokens is worse than the same answer at 5K.",[112,266,267],{},"Different task types weight these differently. Debugging tasks care hugely about process validity. Question-answering tasks care about correctness and cost. Long-horizon tasks care about completion and cost. 
Your eval suite should reflect your workload.",[215,269],{},[218,271,273],{"id":272},"_192-the-eval-case-format","19.2 The Eval Case Format",[275,276,281],"pre",{"className":277,"code":278,"language":279,"meta":280,"style":280},"language-python shiki shiki-themes material-theme-lighter github-light github-dark","# src\u002Fharness\u002Fevals\u002Fcase.py\nfrom __future__ import annotations\n\nfrom dataclasses import dataclass, field\nfrom typing import Callable\n\n\n@dataclass\nclass EvalCase:\n    \"\"\"A single golden trajectory test.\"\"\"\n    id: str\n    description: str\n    user_message: str\n    system: str | None = None\n\n    # Optional: a list of tool names the agent must call (in any order).\n    # Any tool listed here must appear at least once in the run's spans.\n    required_tools: list[str] = field(default_factory=list)\n\n    # Optional: a list of tool names the agent must NOT call.\n    forbidden_tools: list[str] = field(default_factory=list)\n\n    # Optional: a callable that takes the final answer string and returns\n    # True\u002FFalse for correctness. For tasks with deterministic answers.\n    check_answer: Callable[[str], bool] | None = None\n\n    # Optional: ceiling on total tokens. 
Pass if under.\n    max_tokens: int | None = None\n\n    # Optional: ceiling on iterations.\n    max_iterations: int | None = None\n\n\n@dataclass\nclass EvalResult:\n    case_id: str\n    passed: bool\n    failures: list[str]\n    final_answer: str\n    tokens_used: int\n    iterations_used: int\n    duration_seconds: float\n","python","",[235,282,283,292,310,317,338,351,356,361,372,386,400,414,424,434,459,464,470,476,518,523,529,559,564,570,576,608,613,619,638,643,649,667,672,677,684,694,704,715,732,742,753,763],{"__ignoreMap":280},[284,285,288],"span",{"class":286,"line":287},"line",1,[284,289,291],{"class":290},"sutJx","# src\u002Fharness\u002Fevals\u002Fcase.py\n",[284,293,295,299,303,306],{"class":286,"line":294},2,[284,296,298],{"class":297},"sVHd0","from",[284,300,302],{"class":301},"s_hVV"," __future__",[284,304,305],{"class":297}," import",[284,307,309],{"class":308},"su5hD"," annotations\n",[284,311,313],{"class":286,"line":312},3,[284,314,316],{"emptyLinePlaceholder":315},true,"\n",[284,318,320,322,325,328,331,335],{"class":286,"line":319},4,[284,321,298],{"class":297},[284,323,324],{"class":308}," dataclasses ",[284,326,327],{"class":297},"import",[284,329,330],{"class":308}," dataclass",[284,332,334],{"class":333},"sP7_E",",",[284,336,337],{"class":308}," field\n",[284,339,341,343,346,348],{"class":286,"line":340},5,[284,342,298],{"class":297},[284,344,345],{"class":308}," typing ",[284,347,327],{"class":297},[284,349,350],{"class":308}," Callable\n",[284,352,354],{"class":286,"line":353},6,[284,355,316],{"emptyLinePlaceholder":315},[284,357,359],{"class":286,"line":358},7,[284,360,316],{"emptyLinePlaceholder":315},[284,362,364,368],{"class":286,"line":363},8,[284,365,367],{"class":366},"stp6e","@",[284,369,371],{"class":370},"sGLFI","dataclass\n",[284,373,375,379,383],{"class":286,"line":374},9,[284,376,378],{"class":377},"sbsja","class",[284,380,382],{"class":381},"sbgvK"," 
EvalCase",[284,384,385],{"class":333},":\n",[284,387,389,393,397],{"class":286,"line":388},10,[284,390,392],{"class":391},"s2W-s","    \"\"\"",[284,394,396],{"class":395},"sithA","A single golden trajectory test.",[284,398,399],{"class":391},"\"\"\"\n",[284,401,403,407,410],{"class":286,"line":402},11,[284,404,406],{"class":405},"sptTA","    id",[284,408,409],{"class":333},":",[284,411,413],{"class":412},"sZMiF"," str\n",[284,415,417,420,422],{"class":286,"line":416},12,[284,418,419],{"class":308},"    description",[284,421,409],{"class":333},[284,423,413],{"class":412},[284,425,427,430,432],{"class":286,"line":426},13,[284,428,429],{"class":308},"    user_message",[284,431,409],{"class":333},[284,433,413],{"class":412},[284,435,437,440,442,445,449,453,456],{"class":286,"line":436},14,[284,438,439],{"class":308},"    system",[284,441,409],{"class":333},[284,443,444],{"class":412}," str",[284,446,448],{"class":447},"smGrS"," |",[284,450,452],{"class":451},"s39Yj"," None",[284,454,455],{"class":447}," =",[284,457,458],{"class":451}," None\n",[284,460,462],{"class":286,"line":461},15,[284,463,316],{"emptyLinePlaceholder":315},[284,465,467],{"class":286,"line":466},16,[284,468,469],{"class":290},"    # Optional: a list of tool names the agent must call (in any order).\n",[284,471,473],{"class":286,"line":472},17,[284,474,475],{"class":290},"    # Any tool listed here must appear at least once in the run's spans.\n",[284,477,479,482,484,487,490,493,496,498,502,505,509,512,515],{"class":286,"line":478},18,[284,480,481],{"class":308},"    required_tools",[284,483,409],{"class":333},[284,485,486],{"class":308}," list",[284,488,489],{"class":333},"[",[284,491,492],{"class":412},"str",[284,494,495],{"class":333},"]",[284,497,455],{"class":447},[284,499,501],{"class":500},"slqww"," 
field",[284,503,504],{"class":333},"(",[284,506,508],{"class":507},"s99_P","default_factory",[284,510,511],{"class":447},"=",[284,513,514],{"class":412},"list",[284,516,517],{"class":333},")\n",[284,519,521],{"class":286,"line":520},19,[284,522,316],{"emptyLinePlaceholder":315},[284,524,526],{"class":286,"line":525},20,[284,527,528],{"class":290},"    # Optional: a list of tool names the agent must NOT call.\n",[284,530,532,535,537,539,541,543,545,547,549,551,553,555,557],{"class":286,"line":531},21,[284,533,534],{"class":308},"    forbidden_tools",[284,536,409],{"class":333},[284,538,486],{"class":308},[284,540,489],{"class":333},[284,542,492],{"class":412},[284,544,495],{"class":333},[284,546,455],{"class":447},[284,548,501],{"class":500},[284,550,504],{"class":333},[284,552,508],{"class":507},[284,554,511],{"class":447},[284,556,514],{"class":412},[284,558,517],{"class":333},[284,560,562],{"class":286,"line":561},22,[284,563,316],{"emptyLinePlaceholder":315},[284,565,567],{"class":286,"line":566},23,[284,568,569],{"class":290},"    # Optional: a callable that takes the final answer string and returns\n",[284,571,573],{"class":286,"line":572},24,[284,574,575],{"class":290},"    # True\u002FFalse for correctness. For tasks with deterministic answers.\n",[284,577,579,582,584,587,590,592,595,598,600,602,604,606],{"class":286,"line":578},25,[284,580,581],{"class":308},"    check_answer",[284,583,409],{"class":333},[284,585,586],{"class":308}," Callable",[284,588,589],{"class":333},"[[",[284,591,492],{"class":412},[284,593,594],{"class":333},"],",[284,596,597],{"class":412}," bool",[284,599,495],{"class":333},[284,601,448],{"class":447},[284,603,452],{"class":451},[284,605,455],{"class":447},[284,607,458],{"class":451},[284,609,611],{"class":286,"line":610},26,[284,612,316],{"emptyLinePlaceholder":315},[284,614,616],{"class":286,"line":615},27,[284,617,618],{"class":290},"    # Optional: ceiling on total tokens. 
Pass if under.\n",[284,620,622,625,627,630,632,634,636],{"class":286,"line":621},28,[284,623,624],{"class":308},"    max_tokens",[284,626,409],{"class":333},[284,628,629],{"class":412}," int",[284,631,448],{"class":447},[284,633,452],{"class":451},[284,635,455],{"class":447},[284,637,458],{"class":451},[284,639,641],{"class":286,"line":640},29,[284,642,316],{"emptyLinePlaceholder":315},[284,644,646],{"class":286,"line":645},30,[284,647,648],{"class":290},"    # Optional: ceiling on iterations.\n",[284,650,652,655,657,659,661,663,665],{"class":286,"line":651},31,[284,653,654],{"class":308},"    max_iterations",[284,656,409],{"class":333},[284,658,629],{"class":412},[284,660,448],{"class":447},[284,662,452],{"class":451},[284,664,455],{"class":447},[284,666,458],{"class":451},[284,668,670],{"class":286,"line":669},32,[284,671,316],{"emptyLinePlaceholder":315},[284,673,675],{"class":286,"line":674},33,[284,676,316],{"emptyLinePlaceholder":315},[284,678,680,682],{"class":286,"line":679},34,[284,681,367],{"class":366},[284,683,371],{"class":370},[284,685,687,689,692],{"class":286,"line":686},35,[284,688,378],{"class":377},[284,690,691],{"class":381}," EvalResult",[284,693,385],{"class":333},[284,695,697,700,702],{"class":286,"line":696},36,[284,698,699],{"class":308},"    case_id",[284,701,409],{"class":333},[284,703,413],{"class":412},[284,705,707,710,712],{"class":286,"line":706},37,[284,708,709],{"class":308},"    passed",[284,711,409],{"class":333},[284,713,714],{"class":412}," bool\n",[284,716,718,721,723,725,727,729],{"class":286,"line":717},38,[284,719,720],{"class":308},"    failures",[284,722,409],{"class":333},[284,724,486],{"class":308},[284,726,489],{"class":333},[284,728,492],{"class":412},[284,730,731],{"class":333},"]\n",[284,733,735,738,740],{"class":286,"line":734},39,[284,736,737],{"class":308},"    final_answer",[284,739,409],{"class":333},[284,741,413],{"class":412},[284,743,745,748,750],{"class":286,"line":744},40,[284,746,747],{"class":308},"    
tokens_used",[284,749,409],{"class":333},[284,751,752],{"class":412}," int\n",[284,754,756,759,761],{"class":286,"line":755},41,[284,757,758],{"class":308},"    iterations_used",[284,760,409],{"class":333},[284,762,752],{"class":412},[284,764,766,769,771],{"class":286,"line":765},42,[284,767,768],{"class":308},"    duration_seconds",[284,770,409],{"class":333},[284,772,773],{"class":412}," float\n",[112,775,776],{},"A deliberately simple shape. Real eval frameworks (Braintrust, LangSmith) have richer structures — scorer functions, dataset versioning, experiment tracking. We deliberately don't replicate those; the interface leaves room to integrate with them, and the book's goal is to establish the minimum honest eval harness.",[215,778],{},[218,780,782],{"id":781},"_193-the-runner","19.3 The Runner",[275,784,786],{"className":277,"code":785,"language":279,"meta":280,"style":280},"# src\u002Fharness\u002Fevals\u002Frunner.py\nfrom __future__ import annotations\n\nimport asyncio\nimport time\nfrom dataclasses import dataclass, field\n\nfrom ..agent import arun\nfrom ..providers.base import Provider\nfrom ..tools.selector import ToolCatalog\nfrom .case import EvalCase, EvalResult\n\n\n@dataclass\nclass EvalRunner:\n    provider: Provider\n    catalog: ToolCatalog\n\n    async def run_one(self, case: EvalCase) -> EvalResult:\n        start = time.time()\n        tool_calls_observed: list[str] = []\n\n        # Wrap the catalog in a recording proxy that appends each\n        # dispatched tool name to tool_calls_observed. 
A ToolCatalog\n        # with observed-dispatch is the simplest in-harness way to\n        # see what the model actually called.\n        recording_catalog = _RecordingCatalog(self.catalog, tool_calls_observed)\n\n        try:\n            result = await arun(\n                provider=self.provider,\n                catalog=recording_catalog,\n                system=case.system,\n                user_message=case.user_message,\n            )\n        except Exception as e:\n            return EvalResult(\n                case_id=case.id, passed=False,\n                failures=[f\"crashed: {type(e).__name__}: {e}\"],\n                final_answer=\"\", tokens_used=0, iterations_used=0,\n                duration_seconds=time.time() - start,\n            )\n\n        failures: list[str] = []\n\n        if case.check_answer is not None and not case.check_answer(result.summary):\n            failures.append(\"answer check failed\")\n\n        for required in case.required_tools:\n            if required not in tool_calls_observed:\n                failures.append(f\"required tool not called: {required}\")\n\n        for forbidden in case.forbidden_tools:\n            if forbidden in tool_calls_observed:\n                failures.append(f\"forbidden tool called: {forbidden}\")\n\n        if case.max_tokens is not None and result.tokens_used > case.max_tokens:\n            failures.append(f\"tokens_used {result.tokens_used} > {case.max_tokens}\")\n\n        if case.max_iterations is not None and result.iterations_used > case.max_iterations:\n            failures.append(f\"iterations_used {result.iterations_used} > {case.max_iterations}\")\n\n        return EvalResult(\n            case_id=case.id,\n            passed=len(failures) == 0,\n            failures=failures,\n            final_answer=result.summary,\n            tokens_used=result.tokens_used,\n            iterations_used=result.iterations_used,\n            duration_seconds=time.time() - start,\n        )\n\n 
   async def run_all(self, cases: list[EvalCase]) -> list[EvalResult]:\n        results: list[EvalResult] = []\n        for case in cases:\n            result = await self.run_one(case)\n            print(f\"{'✓' if result.passed else '✗'} {case.id}: \"\n                  f\"{case.description} \"\n                  f\"[{result.tokens_used} tok, {result.duration_seconds:.1f}s]\"\n                  + (f\" — {', '.join(result.failures)}\" if result.failures else \"\"))\n            results.append(result)\n        return results\n\n\nclass _RecordingCatalog:\n    \"\"\"A ToolCatalog wrapper that records every tool name dispatched.\n\n    The catalog interface is `select(query, k, must_include)` and `get(name)`.\n    Wrapping `select`'s returned tools is the clean interception point: each\n    returned Tool gets its `arun`\u002F`run` wrapped to record the name before\n    delegating.\n    \"\"\"\n\n    def __init__(self, inner, observed: list[str]) -> None:\n        self._inner = inner\n        self._observed = observed\n\n    def select(self, query, k=7, must_include=None):\n        from ..tools.base import Tool\n        tools = self._inner.select(query, k=k, must_include=must_include)\n        return [self._wrap(t) for t in tools]\n\n    def _wrap(self, tool):\n        from ..tools.base import Tool\n        observed = self._observed\n\n        if tool.arun is not None:\n            original_arun = tool.arun\n            async def arun(**kwargs):\n                observed.append(tool.name)\n                return await original_arun(**kwargs)\n            return Tool(\n                name=tool.name, description=tool.description,\n                input_schema=tool.input_schema,\n                arun=arun, side_effects=tool.side_effects,\n            )\n        else:\n            original_run = tool.run\n            def run(**kwargs):\n                observed.append(tool.name)\n                return original_run(**kwargs)\n            return Tool(\n                
name=tool.name, description=tool.description,\n                input_schema=tool.input_schema,\n                run=run, side_effects=tool.side_effects,\n            )\n",[235,787,788,793,803,807,814,821,835,839,854,874,893,912,916,920,926,935,944,953,957,994,1012,1032,1036,1041,1046,1051,1056,1083,1087,1094,1110,1127,1139,1156,1172,1177,1193,1202,1227,1279,1311,1335,1339,1344,1364,1369,1414,1436,1441,1462,1480,1507,1512,1531,1544,1571,1576,1615,1658,1663,1700,1742,1747,1757,1773,1799,1810,1826,1842,1858,1880,1886,1891,1934,1954,1968,1991,2050,2072,2110,2168,2184,2192,2197,2202,2211,2219,2224,2230,2236,2242,2248,2254,2259,2298,2314,2329,2334,2373,2392,2436,2471,2476,2495,2512,2527,2532,2552,2567,2587,2609,2628,2638,2667,2684,2710,2715,2723,2738,2755,2774,2790,2799,2826,2841,2866],{"__ignoreMap":280},[284,789,790],{"class":286,"line":287},[284,791,792],{"class":290},"# src\u002Fharness\u002Fevals\u002Frunner.py\n",[284,794,795,797,799,801],{"class":286,"line":294},[284,796,298],{"class":297},[284,798,302],{"class":301},[284,800,305],{"class":297},[284,802,309],{"class":308},[284,804,805],{"class":286,"line":312},[284,806,316],{"emptyLinePlaceholder":315},[284,808,809,811],{"class":286,"line":319},[284,810,327],{"class":297},[284,812,813],{"class":308}," asyncio\n",[284,815,816,818],{"class":286,"line":340},[284,817,327],{"class":297},[284,819,820],{"class":308}," time\n",[284,822,823,825,827,829,831,833],{"class":286,"line":353},[284,824,298],{"class":297},[284,826,324],{"class":308},[284,828,327],{"class":297},[284,830,330],{"class":308},[284,832,334],{"class":333},[284,834,337],{"class":308},[284,836,837],{"class":286,"line":358},[284,838,316],{"emptyLinePlaceholder":315},[284,840,841,843,846,849,851],{"class":286,"line":363},[284,842,298],{"class":297},[284,844,845],{"class":333}," ..",[284,847,848],{"class":308},"agent ",[284,850,327],{"class":297},[284,852,853],{"class":308}," 
arun\n",[284,855,856,858,860,863,866,869,871],{"class":286,"line":374},[284,857,298],{"class":297},[284,859,845],{"class":333},[284,861,862],{"class":308},"providers",[284,864,865],{"class":333},".",[284,867,868],{"class":308},"base ",[284,870,327],{"class":297},[284,872,873],{"class":308}," Provider\n",[284,875,876,878,880,883,885,888,890],{"class":286,"line":388},[284,877,298],{"class":297},[284,879,845],{"class":333},[284,881,882],{"class":308},"tools",[284,884,865],{"class":333},[284,886,887],{"class":308},"selector ",[284,889,327],{"class":297},[284,891,892],{"class":308}," ToolCatalog\n",[284,894,895,897,900,903,905,907,909],{"class":286,"line":402},[284,896,298],{"class":297},[284,898,899],{"class":333}," .",[284,901,902],{"class":308},"case ",[284,904,327],{"class":297},[284,906,382],{"class":308},[284,908,334],{"class":333},[284,910,911],{"class":308}," EvalResult\n",[284,913,914],{"class":286,"line":416},[284,915,316],{"emptyLinePlaceholder":315},[284,917,918],{"class":286,"line":426},[284,919,316],{"emptyLinePlaceholder":315},[284,921,922,924],{"class":286,"line":436},[284,923,367],{"class":366},[284,925,371],{"class":370},[284,927,928,930,933],{"class":286,"line":461},[284,929,378],{"class":377},[284,931,932],{"class":381}," EvalRunner",[284,934,385],{"class":333},[284,936,937,940,942],{"class":286,"line":466},[284,938,939],{"class":308},"    provider",[284,941,409],{"class":333},[284,943,873],{"class":308},[284,945,946,949,951],{"class":286,"line":472},[284,947,948],{"class":308},"    catalog",[284,950,409],{"class":333},[284,952,892],{"class":308},[284,954,955],{"class":286,"line":478},[284,956,316],{"emptyLinePlaceholder":315},[284,958,959,962,965,968,970,974,976,980,982,984,987,990,992],{"class":286,"line":520},[284,960,961],{"class":377},"    async",[284,963,964],{"class":377}," def",[284,966,967],{"class":370}," 
run_one",[284,969,504],{"class":333},[284,971,973],{"class":972},"smCYv","self",[284,975,334],{"class":333},[284,977,979],{"class":978},"sFwrP"," case",[284,981,409],{"class":333},[284,983,382],{"class":308},[284,985,986],{"class":333},")",[284,988,989],{"class":333}," ->",[284,991,691],{"class":308},[284,993,385],{"class":333},[284,995,996,999,1001,1004,1006,1009],{"class":286,"line":525},[284,997,998],{"class":308},"        start ",[284,1000,511],{"class":447},[284,1002,1003],{"class":308}," time",[284,1005,865],{"class":333},[284,1007,1008],{"class":500},"time",[284,1010,1011],{"class":333},"()\n",[284,1013,1014,1017,1019,1021,1023,1025,1027,1029],{"class":286,"line":531},[284,1015,1016],{"class":308},"        tool_calls_observed",[284,1018,409],{"class":333},[284,1020,486],{"class":308},[284,1022,489],{"class":333},[284,1024,492],{"class":412},[284,1026,495],{"class":333},[284,1028,455],{"class":447},[284,1030,1031],{"class":333}," []\n",[284,1033,1034],{"class":286,"line":561},[284,1035,316],{"emptyLinePlaceholder":315},[284,1037,1038],{"class":286,"line":566},[284,1039,1040],{"class":290},"        # Wrap the catalog in a recording proxy that appends each\n",[284,1042,1043],{"class":286,"line":572},[284,1044,1045],{"class":290},"        # dispatched tool name to tool_calls_observed. 
A ToolCatalog\n",[284,1047,1048],{"class":286,"line":578},[284,1049,1050],{"class":290},"        # with observed-dispatch is the simplest in-harness way to\n",[284,1052,1053],{"class":286,"line":610},[284,1054,1055],{"class":290},"        # see what the model actually called.\n",[284,1057,1058,1061,1063,1066,1068,1070,1072,1076,1078,1081],{"class":286,"line":615},[284,1059,1060],{"class":308},"        recording_catalog ",[284,1062,511],{"class":447},[284,1064,1065],{"class":500}," _RecordingCatalog",[284,1067,504],{"class":333},[284,1069,973],{"class":301},[284,1071,865],{"class":333},[284,1073,1075],{"class":1074},"skxfh","catalog",[284,1077,334],{"class":333},[284,1079,1080],{"class":500}," tool_calls_observed",[284,1082,517],{"class":333},[284,1084,1085],{"class":286,"line":621},[284,1086,316],{"emptyLinePlaceholder":315},[284,1088,1089,1092],{"class":286,"line":640},[284,1090,1091],{"class":297},"        try",[284,1093,385],{"class":333},[284,1095,1096,1099,1101,1104,1107],{"class":286,"line":645},[284,1097,1098],{"class":308},"            result ",[284,1100,511],{"class":447},[284,1102,1103],{"class":297}," await",[284,1105,1106],{"class":500}," arun",[284,1108,1109],{"class":333},"(\n",[284,1111,1112,1115,1117,1119,1121,1124],{"class":286,"line":651},[284,1113,1114],{"class":507},"                provider",[284,1116,511],{"class":447},[284,1118,973],{"class":301},[284,1120,865],{"class":333},[284,1122,1123],{"class":1074},"provider",[284,1125,1126],{"class":333},",\n",[284,1128,1129,1132,1134,1137],{"class":286,"line":669},[284,1130,1131],{"class":507},"                catalog",[284,1133,511],{"class":447},[284,1135,1136],{"class":500},"recording_catalog",[284,1138,1126],{"class":333},[284,1140,1141,1144,1146,1149,1151,1154],{"class":286,"line":674},[284,1142,1143],{"class":507},"                
system",[284,1145,511],{"class":447},[284,1147,1148],{"class":500},"case",[284,1150,865],{"class":333},[284,1152,1153],{"class":1074},"system",[284,1155,1126],{"class":333},[284,1157,1158,1161,1163,1165,1167,1170],{"class":286,"line":679},[284,1159,1160],{"class":507},"                user_message",[284,1162,511],{"class":447},[284,1164,1148],{"class":500},[284,1166,865],{"class":333},[284,1168,1169],{"class":1074},"user_message",[284,1171,1126],{"class":333},[284,1173,1174],{"class":286,"line":686},[284,1175,1176],{"class":333},"            )\n",[284,1178,1179,1182,1185,1188,1191],{"class":286,"line":696},[284,1180,1181],{"class":297},"        except",[284,1183,1184],{"class":412}," Exception",[284,1186,1187],{"class":297}," as",[284,1189,1190],{"class":308}," e",[284,1192,385],{"class":333},[284,1194,1195,1198,1200],{"class":286,"line":706},[284,1196,1197],{"class":297},"            return",[284,1199,691],{"class":500},[284,1201,1109],{"class":333},[284,1203,1204,1207,1209,1211,1213,1216,1218,1221,1223,1225],{"class":286,"line":717},[284,1205,1206],{"class":507},"                case_id",[284,1208,511],{"class":447},[284,1210,1148],{"class":500},[284,1212,865],{"class":333},[284,1214,1215],{"class":1074},"id",[284,1217,334],{"class":333},[284,1219,1220],{"class":507}," passed",[284,1222,511],{"class":447},[284,1224,241],{"class":451},[284,1226,1126],{"class":333},[284,1228,1229,1232,1234,1236,1239,1243,1247,1250,1252,1255,1258,1261,1264,1267,1269,1271,1273,1276],{"class":286,"line":734},[284,1230,1231],{"class":507},"                failures",[284,1233,511],{"class":447},[284,1235,489],{"class":333},[284,1237,1238],{"class":377},"f",[284,1240,1242],{"class":1241},"s_sjI","\"crashed: 
",[284,1244,1246],{"class":1245},"srdBf","{",[284,1248,1249],{"class":412},"type",[284,1251,504],{"class":333},[284,1253,1254],{"class":500},"e",[284,1256,1257],{"class":333},").",[284,1259,1260],{"class":301},"__name__",[284,1262,1263],{"class":1245},"}",[284,1265,1266],{"class":1241},": ",[284,1268,1246],{"class":1245},[284,1270,1254],{"class":500},[284,1272,1263],{"class":1245},[284,1274,1275],{"class":1241},"\"",[284,1277,1278],{"class":333},"],\n",[284,1280,1281,1284,1286,1290,1292,1295,1297,1300,1302,1305,1307,1309],{"class":286,"line":744},[284,1282,1283],{"class":507},"                final_answer",[284,1285,511],{"class":447},[284,1287,1289],{"class":1288},"sjJ54","\"\"",[284,1291,334],{"class":333},[284,1293,1294],{"class":507}," tokens_used",[284,1296,511],{"class":447},[284,1298,1299],{"class":1245},"0",[284,1301,334],{"class":333},[284,1303,1304],{"class":507}," iterations_used",[284,1306,511],{"class":447},[284,1308,1299],{"class":1245},[284,1310,1126],{"class":333},[284,1312,1313,1316,1318,1320,1322,1324,1327,1330,1333],{"class":286,"line":755},[284,1314,1315],{"class":507},"                duration_seconds",[284,1317,511],{"class":447},[284,1319,1008],{"class":500},[284,1321,865],{"class":333},[284,1323,1008],{"class":500},[284,1325,1326],{"class":333},"()",[284,1328,1329],{"class":447}," -",[284,1331,1332],{"class":500}," start",[284,1334,1126],{"class":333},[284,1336,1337],{"class":286,"line":765},[284,1338,1176],{"class":333},[284,1340,1342],{"class":286,"line":1341},43,[284,1343,316],{"emptyLinePlaceholder":315},[284,1345,1347,1350,1352,1354,1356,1358,1360,1362],{"class":286,"line":1346},44,[284,1348,1349],{"class":308},"        
failures",[284,1351,409],{"class":333},[284,1353,486],{"class":308},[284,1355,489],{"class":333},[284,1357,492],{"class":412},[284,1359,495],{"class":333},[284,1361,455],{"class":447},[284,1363,1031],{"class":333},[284,1365,1367],{"class":286,"line":1366},45,[284,1368,316],{"emptyLinePlaceholder":315},[284,1370,1372,1375,1377,1379,1382,1385,1388,1390,1393,1395,1397,1399,1401,1403,1406,1408,1411],{"class":286,"line":1371},46,[284,1373,1374],{"class":297},"        if",[284,1376,979],{"class":308},[284,1378,865],{"class":333},[284,1380,1381],{"class":1074},"check_answer",[284,1383,1384],{"class":447}," is",[284,1386,1387],{"class":447}," not",[284,1389,452],{"class":451},[284,1391,1392],{"class":447}," and",[284,1394,1387],{"class":447},[284,1396,979],{"class":308},[284,1398,865],{"class":333},[284,1400,1381],{"class":500},[284,1402,504],{"class":333},[284,1404,1405],{"class":500},"result",[284,1407,865],{"class":333},[284,1409,1410],{"class":1074},"summary",[284,1412,1413],{"class":333},"):\n",[284,1415,1417,1420,1422,1425,1427,1429,1432,1434],{"class":286,"line":1416},47,[284,1418,1419],{"class":308},"            failures",[284,1421,865],{"class":333},[284,1423,1424],{"class":500},"append",[284,1426,504],{"class":333},[284,1428,1275],{"class":1288},[284,1430,1431],{"class":1241},"answer check failed",[284,1433,1275],{"class":1288},[284,1435,517],{"class":333},[284,1437,1439],{"class":286,"line":1438},48,[284,1440,316],{"emptyLinePlaceholder":315},[284,1442,1444,1447,1450,1453,1455,1457,1460],{"class":286,"line":1443},49,[284,1445,1446],{"class":297},"        for",[284,1448,1449],{"class":308}," required ",[284,1451,1452],{"class":297},"in",[284,1454,979],{"class":308},[284,1456,865],{"class":333},[284,1458,1459],{"class":1074},"required_tools",[284,1461,385],{"class":333},[284,1463,1465,1468,1470,1473,1476,1478],{"class":286,"line":1464},50,[284,1466,1467],{"class":297},"            
if",[284,1469,1449],{"class":308},[284,1471,1472],{"class":447},"not",[284,1474,1475],{"class":447}," in",[284,1477,1080],{"class":308},[284,1479,385],{"class":333},[284,1481,1483,1485,1487,1489,1491,1493,1496,1498,1501,1503,1505],{"class":286,"line":1482},51,[284,1484,1231],{"class":308},[284,1486,865],{"class":333},[284,1488,1424],{"class":500},[284,1490,504],{"class":333},[284,1492,1238],{"class":377},[284,1494,1495],{"class":1241},"\"required tool not called: ",[284,1497,1246],{"class":1245},[284,1499,1500],{"class":500},"required",[284,1502,1263],{"class":1245},[284,1504,1275],{"class":1241},[284,1506,517],{"class":333},[284,1508,1510],{"class":286,"line":1509},52,[284,1511,316],{"emptyLinePlaceholder":315},[284,1513,1515,1517,1520,1522,1524,1526,1529],{"class":286,"line":1514},53,[284,1516,1446],{"class":297},[284,1518,1519],{"class":308}," forbidden ",[284,1521,1452],{"class":297},[284,1523,979],{"class":308},[284,1525,865],{"class":333},[284,1527,1528],{"class":1074},"forbidden_tools",[284,1530,385],{"class":333},[284,1532,1534,1536,1538,1540,1542],{"class":286,"line":1533},54,[284,1535,1467],{"class":297},[284,1537,1519],{"class":308},[284,1539,1452],{"class":447},[284,1541,1080],{"class":308},[284,1543,385],{"class":333},[284,1545,1547,1549,1551,1553,1555,1557,1560,1562,1565,1567,1569],{"class":286,"line":1546},55,[284,1548,1231],{"class":308},[284,1550,865],{"class":333},[284,1552,1424],{"class":500},[284,1554,504],{"class":333},[284,1556,1238],{"class":377},[284,1558,1559],{"class":1241},"\"forbidden tool called: 
",[284,1561,1246],{"class":1245},[284,1563,1564],{"class":500},"forbidden",[284,1566,1263],{"class":1245},[284,1568,1275],{"class":1241},[284,1570,517],{"class":333},[284,1572,1574],{"class":286,"line":1573},56,[284,1575,316],{"emptyLinePlaceholder":315},[284,1577,1579,1581,1583,1585,1588,1590,1592,1594,1596,1599,1601,1604,1607,1609,1611,1613],{"class":286,"line":1578},57,[284,1580,1374],{"class":297},[284,1582,979],{"class":308},[284,1584,865],{"class":333},[284,1586,1587],{"class":1074},"max_tokens",[284,1589,1384],{"class":447},[284,1591,1387],{"class":447},[284,1593,452],{"class":451},[284,1595,1392],{"class":447},[284,1597,1598],{"class":308}," result",[284,1600,865],{"class":333},[284,1602,1603],{"class":1074},"tokens_used",[284,1605,1606],{"class":447}," >",[284,1608,979],{"class":308},[284,1610,865],{"class":333},[284,1612,1587],{"class":1074},[284,1614,385],{"class":333},[284,1616,1618,1620,1622,1624,1626,1628,1631,1633,1635,1637,1639,1641,1644,1646,1648,1650,1652,1654,1656],{"class":286,"line":1617},58,[284,1619,1419],{"class":308},[284,1621,865],{"class":333},[284,1623,1424],{"class":500},[284,1625,504],{"class":333},[284,1627,1238],{"class":377},[284,1629,1630],{"class":1241},"\"tokens_used ",[284,1632,1246],{"class":1245},[284,1634,1405],{"class":500},[284,1636,865],{"class":333},[284,1638,1603],{"class":1074},[284,1640,1263],{"class":1245},[284,1642,1643],{"class":1241}," > 
",[284,1645,1246],{"class":1245},[284,1647,1148],{"class":500},[284,1649,865],{"class":333},[284,1651,1587],{"class":1074},[284,1653,1263],{"class":1245},[284,1655,1275],{"class":1241},[284,1657,517],{"class":333},[284,1659,1661],{"class":286,"line":1660},59,[284,1662,316],{"emptyLinePlaceholder":315},[284,1664,1666,1668,1670,1672,1675,1677,1679,1681,1683,1685,1687,1690,1692,1694,1696,1698],{"class":286,"line":1665},60,[284,1667,1374],{"class":297},[284,1669,979],{"class":308},[284,1671,865],{"class":333},[284,1673,1674],{"class":1074},"max_iterations",[284,1676,1384],{"class":447},[284,1678,1387],{"class":447},[284,1680,452],{"class":451},[284,1682,1392],{"class":447},[284,1684,1598],{"class":308},[284,1686,865],{"class":333},[284,1688,1689],{"class":1074},"iterations_used",[284,1691,1606],{"class":447},[284,1693,979],{"class":308},[284,1695,865],{"class":333},[284,1697,1674],{"class":1074},[284,1699,385],{"class":333},[284,1701,1703,1705,1707,1709,1711,1713,1716,1718,1720,1722,1724,1726,1728,1730,1732,1734,1736,1738,1740],{"class":286,"line":1702},61,[284,1704,1419],{"class":308},[284,1706,865],{"class":333},[284,1708,1424],{"class":500},[284,1710,504],{"class":333},[284,1712,1238],{"class":377},[284,1714,1715],{"class":1241},"\"iterations_used ",[284,1717,1246],{"class":1245},[284,1719,1405],{"class":500},[284,1721,865],{"class":333},[284,1723,1689],{"class":1074},[284,1725,1263],{"class":1245},[284,1727,1643],{"class":1241},[284,1729,1246],{"class":1245},[284,1731,1148],{"class":500},[284,1733,865],{"class":333},[284,1735,1674],{"class":1074},[284,1737,1263],{"class":1245},[284,1739,1275],{"class":1241},[284,1741,517],{"class":333},[284,1743,1745],{"class":286,"line":1744},62,[284,1746,316],{"emptyLinePlaceholder":315},[284,1748,1750,1753,1755],{"class":286,"line":1749},63,[284,1751,1752],{"class":297},"        
return",[284,1754,691],{"class":500},[284,1756,1109],{"class":333},[284,1758,1760,1763,1765,1767,1769,1771],{"class":286,"line":1759},64,[284,1761,1762],{"class":507},"            case_id",[284,1764,511],{"class":447},[284,1766,1148],{"class":500},[284,1768,865],{"class":333},[284,1770,1215],{"class":1074},[284,1772,1126],{"class":333},[284,1774,1776,1779,1781,1784,1786,1789,1791,1794,1797],{"class":286,"line":1775},65,[284,1777,1778],{"class":507},"            passed",[284,1780,511],{"class":447},[284,1782,1783],{"class":405},"len",[284,1785,504],{"class":333},[284,1787,1788],{"class":500},"failures",[284,1790,986],{"class":333},[284,1792,1793],{"class":447}," ==",[284,1795,1796],{"class":1245}," 0",[284,1798,1126],{"class":333},[284,1800,1802,1804,1806,1808],{"class":286,"line":1801},66,[284,1803,1419],{"class":507},[284,1805,511],{"class":447},[284,1807,1788],{"class":500},[284,1809,1126],{"class":333},[284,1811,1813,1816,1818,1820,1822,1824],{"class":286,"line":1812},67,[284,1814,1815],{"class":507},"            final_answer",[284,1817,511],{"class":447},[284,1819,1405],{"class":500},[284,1821,865],{"class":333},[284,1823,1410],{"class":1074},[284,1825,1126],{"class":333},[284,1827,1829,1832,1834,1836,1838,1840],{"class":286,"line":1828},68,[284,1830,1831],{"class":507},"            tokens_used",[284,1833,511],{"class":447},[284,1835,1405],{"class":500},[284,1837,865],{"class":333},[284,1839,1603],{"class":1074},[284,1841,1126],{"class":333},[284,1843,1845,1848,1850,1852,1854,1856],{"class":286,"line":1844},69,[284,1846,1847],{"class":507},"            iterations_used",[284,1849,511],{"class":447},[284,1851,1405],{"class":500},[284,1853,865],{"class":333},[284,1855,1689],{"class":1074},[284,1857,1126],{"class":333},[284,1859,1861,1864,1866,1868,1870,1872,1874,1876,1878],{"class":286,"line":1860},70,[284,1862,1863],{"class":507},"            
duration_seconds",[284,1865,511],{"class":447},[284,1867,1008],{"class":500},[284,1869,865],{"class":333},[284,1871,1008],{"class":500},[284,1873,1326],{"class":333},[284,1875,1329],{"class":447},[284,1877,1332],{"class":500},[284,1879,1126],{"class":333},[284,1881,1883],{"class":286,"line":1882},71,[284,1884,1885],{"class":333},"        )\n",[284,1887,1889],{"class":286,"line":1888},72,[284,1890,316],{"emptyLinePlaceholder":315},[284,1892,1894,1896,1898,1901,1903,1905,1907,1910,1912,1914,1916,1919,1922,1924,1926,1928,1931],{"class":286,"line":1893},73,[284,1895,961],{"class":377},[284,1897,964],{"class":377},[284,1899,1900],{"class":370}," run_all",[284,1902,504],{"class":333},[284,1904,973],{"class":972},[284,1906,334],{"class":333},[284,1908,1909],{"class":978}," cases",[284,1911,409],{"class":333},[284,1913,486],{"class":308},[284,1915,489],{"class":333},[284,1917,1918],{"class":308},"EvalCase",[284,1920,1921],{"class":333},"])",[284,1923,989],{"class":333},[284,1925,486],{"class":308},[284,1927,489],{"class":333},[284,1929,1930],{"class":308},"EvalResult",[284,1932,1933],{"class":333},"]:\n",[284,1935,1937,1940,1942,1944,1946,1948,1950,1952],{"class":286,"line":1936},74,[284,1938,1939],{"class":308},"        results",[284,1941,409],{"class":333},[284,1943,486],{"class":308},[284,1945,489],{"class":333},[284,1947,1930],{"class":308},[284,1949,495],{"class":333},[284,1951,455],{"class":447},[284,1953,1031],{"class":333},[284,1955,1957,1959,1962,1964,1966],{"class":286,"line":1956},75,[284,1958,1446],{"class":297},[284,1960,1961],{"class":308}," case ",[284,1963,1452],{"class":297},[284,1965,1909],{"class":308},[284,1967,385],{"class":333},[284,1969,1971,1973,1975,1977,1980,1982,1985,1987,1989],{"class":286,"line":1970},76,[284,1972,1098],{"class":308},[284,1974,511],{"class":447},[284,1976,1103],{"class":297},[284,1978,1979],{"class":301}," 
self",[284,1981,865],{"class":333},[284,1983,1984],{"class":500},"run_one",[284,1986,504],{"class":333},[284,1988,1148],{"class":500},[284,1990,517],{"class":333},[284,1992,1994,1997,1999,2001,2003,2005,2008,2011,2013,2016,2018,2020,2023,2026,2029,2032,2034,2036,2039,2041,2043,2045,2047],{"class":286,"line":1993},77,[284,1995,1996],{"class":405},"            print",[284,1998,504],{"class":333},[284,2000,1238],{"class":377},[284,2002,1275],{"class":1241},[284,2004,1246],{"class":1245},[284,2006,2007],{"class":1288},"'",[284,2009,2010],{"class":1241},"✓",[284,2012,2007],{"class":1288},[284,2014,2015],{"class":297}," if",[284,2017,1598],{"class":500},[284,2019,865],{"class":333},[284,2021,2022],{"class":1074},"passed",[284,2024,2025],{"class":297}," else",[284,2027,2028],{"class":1288}," '",[284,2030,2031],{"class":1241},"✗",[284,2033,2007],{"class":1288},[284,2035,1263],{"class":1245},[284,2037,2038],{"class":1245}," {",[284,2040,1148],{"class":500},[284,2042,865],{"class":333},[284,2044,1215],{"class":1074},[284,2046,1263],{"class":1245},[284,2048,2049],{"class":1241},": \"\n",[284,2051,2053,2056,2058,2060,2062,2064,2067,2069],{"class":286,"line":2052},78,[284,2054,2055],{"class":377},"                  f",[284,2057,1275],{"class":1241},[284,2059,1246],{"class":1245},[284,2061,1148],{"class":500},[284,2063,865],{"class":333},[284,2065,2066],{"class":1074},"description",[284,2068,1263],{"class":1245},[284,2070,2071],{"class":1241}," \"\n",[284,2073,2075,2077,2080,2082,2084,2086,2088,2090,2093,2095,2097,2099,2102,2105,2107],{"class":286,"line":2074},79,[284,2076,2055],{"class":377},[284,2078,2079],{"class":1241},"\"[",[284,2081,1246],{"class":1245},[284,2083,1405],{"class":500},[284,2085,865],{"class":333},[284,2087,1603],{"class":1074},[284,2089,1263],{"class":1245},[284,2091,2092],{"class":1241}," tok, 
",[284,2094,1246],{"class":1245},[284,2096,1405],{"class":500},[284,2098,865],{"class":333},[284,2100,2101],{"class":1074},"duration_seconds",[284,2103,2104],{"class":377},":.1f",[284,2106,1263],{"class":1245},[284,2108,2109],{"class":1241},"s]\"\n",[284,2111,2113,2116,2119,2121,2124,2126,2128,2131,2133,2135,2138,2140,2142,2144,2146,2148,2150,2152,2154,2156,2158,2160,2162,2165],{"class":286,"line":2112},80,[284,2114,2115],{"class":447},"                  +",[284,2117,2118],{"class":333}," (",[284,2120,1238],{"class":377},[284,2122,2123],{"class":1241},"\" — ",[284,2125,1246],{"class":1245},[284,2127,2007],{"class":1288},[284,2129,2130],{"class":1241},", ",[284,2132,2007],{"class":1288},[284,2134,865],{"class":333},[284,2136,2137],{"class":500},"join",[284,2139,504],{"class":333},[284,2141,1405],{"class":500},[284,2143,865],{"class":333},[284,2145,1788],{"class":1074},[284,2147,986],{"class":333},[284,2149,1263],{"class":1245},[284,2151,1275],{"class":1241},[284,2153,2015],{"class":297},[284,2155,1598],{"class":500},[284,2157,865],{"class":333},[284,2159,1788],{"class":1074},[284,2161,2025],{"class":297},[284,2163,2164],{"class":1288}," \"\"",[284,2166,2167],{"class":333},"))\n",[284,2169,2171,2174,2176,2178,2180,2182],{"class":286,"line":2170},81,[284,2172,2173],{"class":308},"            results",[284,2175,865],{"class":333},[284,2177,1424],{"class":500},[284,2179,504],{"class":333},[284,2181,1405],{"class":500},[284,2183,517],{"class":333},[284,2185,2187,2189],{"class":286,"line":2186},82,[284,2188,1752],{"class":297},[284,2190,2191],{"class":308}," 
results\n",[284,2193,2195],{"class":286,"line":2194},83,[284,2196,316],{"emptyLinePlaceholder":315},[284,2198,2200],{"class":286,"line":2199},84,[284,2201,316],{"emptyLinePlaceholder":315},[284,2203,2205,2207,2209],{"class":286,"line":2204},85,[284,2206,378],{"class":377},[284,2208,1065],{"class":381},[284,2210,385],{"class":333},[284,2212,2214,2216],{"class":286,"line":2213},86,[284,2215,392],{"class":391},[284,2217,2218],{"class":395},"A ToolCatalog wrapper that records every tool name dispatched.\n",[284,2220,2222],{"class":286,"line":2221},87,[284,2223,316],{"emptyLinePlaceholder":315},[284,2225,2227],{"class":286,"line":2226},88,[284,2228,2229],{"class":395},"    The catalog interface is `select(query, k, must_include)` and `get(name)`.\n",[284,2231,2233],{"class":286,"line":2232},89,[284,2234,2235],{"class":395},"    Wrapping `select`'s returned tools is the clean interception point: each\n",[284,2237,2239],{"class":286,"line":2238},90,[284,2240,2241],{"class":395},"    returned Tool gets its `arun`\u002F`run` wrapped to record the name before\n",[284,2243,2245],{"class":286,"line":2244},91,[284,2246,2247],{"class":395},"    delegating.\n",[284,2249,2251],{"class":286,"line":2250},92,[284,2252,2253],{"class":391},"    \"\"\"\n",[284,2255,2257],{"class":286,"line":2256},93,[284,2258,316],{"emptyLinePlaceholder":315},[284,2260,2262,2265,2268,2270,2272,2274,2277,2279,2282,2284,2286,2288,2290,2292,2294,2296],{"class":286,"line":2261},94,[284,2263,2264],{"class":377},"    def",[284,2266,2267],{"class":405}," __init__",[284,2269,504],{"class":333},[284,2271,973],{"class":972},[284,2273,334],{"class":333},[284,2275,2276],{"class":978}," inner",[284,2278,334],{"class":333},[284,2280,2281],{"class":978}," 
observed",[284,2283,409],{"class":333},[284,2285,486],{"class":308},[284,2287,489],{"class":333},[284,2289,492],{"class":412},[284,2291,1921],{"class":333},[284,2293,989],{"class":333},[284,2295,452],{"class":451},[284,2297,385],{"class":333},[284,2299,2301,2304,2306,2309,2311],{"class":286,"line":2300},95,[284,2302,2303],{"class":301},"        self",[284,2305,865],{"class":333},[284,2307,2308],{"class":1074},"_inner",[284,2310,455],{"class":447},[284,2312,2313],{"class":308}," inner\n",[284,2315,2317,2319,2321,2324,2326],{"class":286,"line":2316},96,[284,2318,2303],{"class":301},[284,2320,865],{"class":333},[284,2322,2323],{"class":1074},"_observed",[284,2325,455],{"class":447},[284,2327,2328],{"class":308}," observed\n",[284,2330,2332],{"class":286,"line":2331},97,[284,2333,316],{"emptyLinePlaceholder":315},[284,2335,2337,2339,2342,2344,2346,2348,2351,2353,2356,2358,2361,2363,2366,2368,2371],{"class":286,"line":2336},98,[284,2338,2264],{"class":377},[284,2340,2341],{"class":370}," select",[284,2343,504],{"class":333},[284,2345,973],{"class":972},[284,2347,334],{"class":333},[284,2349,2350],{"class":978}," query",[284,2352,334],{"class":333},[284,2354,2355],{"class":978}," k",[284,2357,511],{"class":447},[284,2359,2360],{"class":1245},"7",[284,2362,334],{"class":333},[284,2364,2365],{"class":978}," must_include",[284,2367,511],{"class":447},[284,2369,2370],{"class":451},"None",[284,2372,1413],{"class":333},[284,2374,2376,2379,2381,2383,2385,2387,2389],{"class":286,"line":2375},99,[284,2377,2378],{"class":297},"        from",[284,2380,845],{"class":333},[284,2382,882],{"class":308},[284,2384,865],{"class":333},[284,2386,868],{"class":308},[284,2388,327],{"class":297},[284,2390,2391],{"class":308}," Tool\n",[284,2393,2395,2398,2400,2402,2404,2406,2408,2411,2413,2416,2418,2420,2422,2425,2427,2429,2431,2434],{"class":286,"line":2394},100,[284,2396,2397],{"class":308},"        tools 
",[284,2399,511],{"class":447},[284,2401,1979],{"class":301},[284,2403,865],{"class":333},[284,2405,2308],{"class":1074},[284,2407,865],{"class":333},[284,2409,2410],{"class":500},"select",[284,2412,504],{"class":333},[284,2414,2415],{"class":500},"query",[284,2417,334],{"class":333},[284,2419,2355],{"class":507},[284,2421,511],{"class":447},[284,2423,2424],{"class":500},"k",[284,2426,334],{"class":333},[284,2428,2365],{"class":507},[284,2430,511],{"class":447},[284,2432,2433],{"class":500},"must_include",[284,2435,517],{"class":333},[284,2437,2439,2441,2444,2446,2448,2451,2453,2456,2458,2461,2464,2466,2469],{"class":286,"line":2438},101,[284,2440,1752],{"class":297},[284,2442,2443],{"class":333}," [",[284,2445,973],{"class":301},[284,2447,865],{"class":333},[284,2449,2450],{"class":500},"_wrap",[284,2452,504],{"class":333},[284,2454,2455],{"class":500},"t",[284,2457,986],{"class":333},[284,2459,2460],{"class":297}," for",[284,2462,2463],{"class":308}," t ",[284,2465,1452],{"class":297},[284,2467,2468],{"class":308}," tools",[284,2470,731],{"class":333},[284,2472,2474],{"class":286,"line":2473},102,[284,2475,316],{"emptyLinePlaceholder":315},[284,2477,2479,2481,2484,2486,2488,2490,2493],{"class":286,"line":2478},103,[284,2480,2264],{"class":377},[284,2482,2483],{"class":370}," _wrap",[284,2485,504],{"class":333},[284,2487,973],{"class":972},[284,2489,334],{"class":333},[284,2491,2492],{"class":978}," tool",[284,2494,1413],{"class":333},[284,2496,2498,2500,2502,2504,2506,2508,2510],{"class":286,"line":2497},104,[284,2499,2378],{"class":297},[284,2501,845],{"class":333},[284,2503,882],{"class":308},[284,2505,865],{"class":333},[284,2507,868],{"class":308},[284,2509,327],{"class":297},[284,2511,2391],{"class":308},[284,2513,2515,2518,2520,2522,2524],{"class":286,"line":2514},105,[284,2516,2517],{"class":308},"        observed 
",[284,2519,511],{"class":447},[284,2521,1979],{"class":301},[284,2523,865],{"class":333},[284,2525,2526],{"class":1074},"_observed\n",[284,2528,2530],{"class":286,"line":2529},106,[284,2531,316],{"emptyLinePlaceholder":315},[284,2533,2535,2537,2539,2541,2544,2546,2548,2550],{"class":286,"line":2534},107,[284,2536,1374],{"class":297},[284,2538,2492],{"class":308},[284,2540,865],{"class":333},[284,2542,2543],{"class":1074},"arun",[284,2545,1384],{"class":447},[284,2547,1387],{"class":447},[284,2549,452],{"class":451},[284,2551,385],{"class":333},[284,2553,2555,2558,2560,2562,2564],{"class":286,"line":2554},108,[284,2556,2557],{"class":308},"            original_arun ",[284,2559,511],{"class":447},[284,2561,2492],{"class":308},[284,2563,865],{"class":333},[284,2565,2566],{"class":1074},"arun\n",[284,2568,2570,2573,2575,2577,2579,2582,2585],{"class":286,"line":2569},109,[284,2571,2572],{"class":377},"            async",[284,2574,964],{"class":377},[284,2576,1106],{"class":370},[284,2578,504],{"class":333},[284,2580,2581],{"class":447},"**",[284,2583,2584],{"class":978},"kwargs",[284,2586,1413],{"class":333},[284,2588,2590,2593,2595,2597,2599,2602,2604,2607],{"class":286,"line":2589},110,[284,2591,2592],{"class":308},"                observed",[284,2594,865],{"class":333},[284,2596,1424],{"class":500},[284,2598,504],{"class":333},[284,2600,2601],{"class":500},"tool",[284,2603,865],{"class":333},[284,2605,2606],{"class":1074},"name",[284,2608,517],{"class":333},[284,2610,2612,2615,2617,2620,2622,2624,2626],{"class":286,"line":2611},111,[284,2613,2614],{"class":297},"                return",[284,2616,1103],{"class":297},[284,2618,2619],{"class":500}," original_arun",[284,2621,504],{"class":333},[284,2623,2581],{"class":447},[284,2625,2584],{"class":500},[284,2627,517],{"class":333},[284,2629,2631,2633,2636],{"class":286,"line":2630},112,[284,2632,1197],{"class":297},[284,2634,2635],{"class":500}," 
Tool",[284,2637,1109],{"class":333},[284,2639,2641,2644,2646,2648,2650,2652,2654,2657,2659,2661,2663,2665],{"class":286,"line":2640},113,[284,2642,2643],{"class":507},"                name",[284,2645,511],{"class":447},[284,2647,2601],{"class":500},[284,2649,865],{"class":333},[284,2651,2606],{"class":1074},[284,2653,334],{"class":333},[284,2655,2656],{"class":507}," description",[284,2658,511],{"class":447},[284,2660,2601],{"class":500},[284,2662,865],{"class":333},[284,2664,2066],{"class":1074},[284,2666,1126],{"class":333},[284,2668,2670,2673,2675,2677,2679,2682],{"class":286,"line":2669},114,[284,2671,2672],{"class":507},"                input_schema",[284,2674,511],{"class":447},[284,2676,2601],{"class":500},[284,2678,865],{"class":333},[284,2680,2681],{"class":1074},"input_schema",[284,2683,1126],{"class":333},[284,2685,2687,2690,2692,2694,2696,2699,2701,2703,2705,2708],{"class":286,"line":2686},115,[284,2688,2689],{"class":507},"                arun",[284,2691,511],{"class":447},[284,2693,2543],{"class":500},[284,2695,334],{"class":333},[284,2697,2698],{"class":507}," side_effects",[284,2700,511],{"class":447},[284,2702,2601],{"class":500},[284,2704,865],{"class":333},[284,2706,2707],{"class":1074},"side_effects",[284,2709,1126],{"class":333},[284,2711,2713],{"class":286,"line":2712},116,[284,2714,1176],{"class":333},[284,2716,2718,2721],{"class":286,"line":2717},117,[284,2719,2720],{"class":297},"        else",[284,2722,385],{"class":333},[284,2724,2726,2729,2731,2733,2735],{"class":286,"line":2725},118,[284,2727,2728],{"class":308},"            original_run ",[284,2730,511],{"class":447},[284,2732,2492],{"class":308},[284,2734,865],{"class":333},[284,2736,2737],{"class":1074},"run\n",[284,2739,2741,2744,2747,2749,2751,2753],{"class":286,"line":2740},119,[284,2742,2743],{"class":377},"            def",[284,2745,2746],{"class":370}," 
run",[284,2748,504],{"class":333},[284,2750,2581],{"class":447},[284,2752,2584],{"class":978},[284,2754,1413],{"class":333},[284,2756,2758,2760,2762,2764,2766,2768,2770,2772],{"class":286,"line":2757},120,[284,2759,2592],{"class":308},[284,2761,865],{"class":333},[284,2763,1424],{"class":500},[284,2765,504],{"class":333},[284,2767,2601],{"class":500},[284,2769,865],{"class":333},[284,2771,2606],{"class":1074},[284,2773,517],{"class":333},[284,2775,2777,2779,2782,2784,2786,2788],{"class":286,"line":2776},121,[284,2778,2614],{"class":297},[284,2780,2781],{"class":500}," original_run",[284,2783,504],{"class":333},[284,2785,2581],{"class":447},[284,2787,2584],{"class":500},[284,2789,517],{"class":333},[284,2791,2793,2795,2797],{"class":286,"line":2792},122,[284,2794,1197],{"class":297},[284,2796,2635],{"class":500},[284,2798,1109],{"class":333},[284,2800,2802,2804,2806,2808,2810,2812,2814,2816,2818,2820,2822,2824],{"class":286,"line":2801},123,[284,2803,2643],{"class":507},[284,2805,511],{"class":447},[284,2807,2601],{"class":500},[284,2809,865],{"class":333},[284,2811,2606],{"class":1074},[284,2813,334],{"class":333},[284,2815,2656],{"class":507},[284,2817,511],{"class":447},[284,2819,2601],{"class":500},[284,2821,865],{"class":333},[284,2823,2066],{"class":1074},[284,2825,1126],{"class":333},[284,2827,2829,2831,2833,2835,2837,2839],{"class":286,"line":2828},124,[284,2830,2672],{"class":507},[284,2832,511],{"class":447},[284,2834,2601],{"class":500},[284,2836,865],{"class":333},[284,2838,2681],{"class":1074},[284,2840,1126],{"class":333},[284,2842,2844,2847,2849,2852,2854,2856,2858,2860,2862,2864],{"class":286,"line":2843},125,[284,2845,2846],{"class":507},"                
run",[284,2848,511],{"class":447},[284,2850,2851],{"class":500},"run",[284,2853,334],{"class":333},[284,2855,2698],{"class":507},[284,2857,511],{"class":447},[284,2859,2601],{"class":500},[284,2861,865],{"class":333},[284,2863,2707],{"class":1074},[284,2865,1126],{"class":333},[284,2867,2869],{"class":286,"line":2868},126,[284,2870,1176],{"class":333},[112,2872,2873,2874,2877,2878,2881],{},"The runner is sequential. For a small suite (20–50 cases), that's fine. For larger suites, parallelize by running independent cases in separate async tasks, rate-limited to avoid overwhelming the provider. The interface doesn't need to change — ",[235,2875,2876],{},"run_all"," can ",[235,2879,2880],{},"asyncio.gather"," instead of looping.",[112,2883,2884,2885,2888,2889,2892,2893,2896,2897,2900,2901,2904,2905,2908,2909,2912],{},"For brevity the recording wrapper only proxies ",[235,2886,2887],{},"select()",". If you use the discovery tool from §12.5 (which reads ",[235,2890,2891],{},"catalog.tools"," directly) or otherwise access ",[235,2894,2895],{},"catalog.get(name)"," \u002F ",[235,2898,2899],{},"catalog.all_names()"," from anywhere in your harness, proxy those through ",[235,2902,2903],{},"self._inner"," too — each is a one-liner, and the companion repo's ",[235,2906,2907],{},"_RecordingCatalog"," does exactly that. Without them, the recording catalog works for §19.4's cases but will ",[235,2910,2911],{},"AttributeError"," the moment you drop a real catalog with helpers wired in.",[112,2914,2915,2918,2919,2922],{},[132,2916,2917],{},"Tokens and tool-call observation"," in the sketch are handwaves. A production eval runner pulls these from OTel spans directly — the tracing we built in Chapter 18 is the right substrate. 
A small span-reader that listens to a ",[235,2920,2921],{},"ConsoleSpanProcessor","-like collector and reports per-run metrics is ~50 lines, which the companion repo includes but the book omits for focus.",[215,2924],{},[218,2926,2928],{"id":2927},"_194-some-real-eval-cases","19.4 Some Real Eval Cases",[275,2930,2932],{"className":277,"code":2931,"language":279,"meta":280,"style":280},"# tests\u002Fevals\u002Fcases.py\nfrom harness.evals.case import EvalCase\n\n\nCASES = [\n    EvalCase(\n        id=\"arithmetic-simple\",\n        description=\"2+2 with calculator\",\n        user_message=\"What is 2 + 2?\",\n        required_tools=[\"calc\"],\n        check_answer=lambda ans: \"4\" in ans,\n        max_tokens=5_000,\n    ),\n\n    EvalCase(\n        id=\"file-viewport\",\n        description=\"Reads a known file via viewport, not full read\",\n        user_message=\"What is the first line of \u002Fetc\u002Fhostname?\",\n        required_tools=[\"read_file_viewport\"],\n        forbidden_tools=[\"read_file\"],  # old unbounded read\n        check_answer=lambda ans: len(ans) > 0,\n        max_tokens=8_000,\n    ),\n\n    EvalCase(\n        id=\"long-session-compaction\",\n        description=\"Task that triggers compaction; verifies survival\",\n        user_message=(\n            \"Read \u002Fproc\u002Fcpuinfo, \u002Fproc\u002Fmeminfo, \u002Fproc\u002Fversion, \"\n            \"\u002Fetc\u002Fos-release, and \u002Fetc\u002Fhostname. Summarize the system in \"\n            \"three bullet points.\"\n        ),\n        required_tools=[\"read_file_viewport\"],\n        max_tokens=50_000,\n        max_iterations=15,\n    ),\n\n    EvalCase(\n        id=\"premature-finalization-trap\",\n        description=\"Agent must process all 5 items; shortcut is possible\",\n        user_message=(\n            \"For each number in [1, 2, 3, 4, 5], compute its square \"\n            \"using the calculator. 
Then report all five squares in a list.\"\n        ),\n        required_tools=[\"calc\"],\n        check_answer=lambda ans: all(s in ans for s in [\"1\", \"4\", \"9\", \"16\", \"25\"]),\n    ),\n\n    EvalCase(\n        id=\"plan-required\",\n        description=\"Task complex enough that a plan should be created\",\n        user_message=(\n            \"Investigate and report: (1) the user running this, (2) the \"\n            \"working directory, (3) three most-recent files in it. \"\n            \"Structure your answer as a three-point summary.\"\n        ),\n        required_tools=[\"bash\", \"plan_create\", \"plan_show\"],\n    ),\n]\n",[235,2933,2934,2939,2960,2964,2968,2978,2985,3001,3017,3033,3051,3080,3092,3097,3101,3107,3122,3137,3152,3169,3190,3218,3229,3233,3237,3243,3258,3273,3281,3292,3301,3310,3315,3331,3342,3354,3358,3362,3368,3383,3398,3406,3415,3424,3428,3444,3524,3528,3532,3538,3553,3568,3576,3585,3594,3603,3607,3642,3646],{"__ignoreMap":280},[284,2935,2936],{"class":286,"line":287},[284,2937,2938],{"class":290},"# tests\u002Fevals\u002Fcases.py\n",[284,2940,2941,2943,2946,2948,2951,2953,2955,2957],{"class":286,"line":294},[284,2942,298],{"class":297},[284,2944,2945],{"class":308}," harness",[284,2947,865],{"class":333},[284,2949,2950],{"class":308},"evals",[284,2952,865],{"class":333},[284,2954,902],{"class":308},[284,2956,327],{"class":297},[284,2958,2959],{"class":308}," EvalCase\n",[284,2961,2962],{"class":286,"line":312},[284,2963,316],{"emptyLinePlaceholder":315},[284,2965,2966],{"class":286,"line":319},[284,2967,316],{"emptyLinePlaceholder":315},[284,2969,2970,2973,2975],{"class":286,"line":340},[284,2971,2972],{"class":301},"CASES",[284,2974,455],{"class":447},[284,2976,2977],{"class":333}," [\n",[284,2979,2980,2983],{"class":286,"line":353},[284,2981,2982],{"class":500},"    EvalCase",[284,2984,1109],{"class":333},[284,2986,2987,2990,2992,2994,2997,2999],{"class":286,"line":358},[284,2988,2989],{"class":507},"        
id",[284,2991,511],{"class":447},[284,2993,1275],{"class":1288},[284,2995,2996],{"class":1241},"arithmetic-simple",[284,2998,1275],{"class":1288},[284,3000,1126],{"class":333},[284,3002,3003,3006,3008,3010,3013,3015],{"class":286,"line":363},[284,3004,3005],{"class":507},"        description",[284,3007,511],{"class":447},[284,3009,1275],{"class":1288},[284,3011,3012],{"class":1241},"2+2 with calculator",[284,3014,1275],{"class":1288},[284,3016,1126],{"class":333},[284,3018,3019,3022,3024,3026,3029,3031],{"class":286,"line":374},[284,3020,3021],{"class":507},"        user_message",[284,3023,511],{"class":447},[284,3025,1275],{"class":1288},[284,3027,3028],{"class":1241},"What is 2 + 2?",[284,3030,1275],{"class":1288},[284,3032,1126],{"class":333},[284,3034,3035,3038,3040,3042,3044,3047,3049],{"class":286,"line":388},[284,3036,3037],{"class":507},"        required_tools",[284,3039,511],{"class":447},[284,3041,489],{"class":333},[284,3043,1275],{"class":1288},[284,3045,3046],{"class":1241},"calc",[284,3048,1275],{"class":1288},[284,3050,1278],{"class":333},[284,3052,3053,3056,3058,3061,3064,3066,3069,3072,3074,3076,3078],{"class":286,"line":402},[284,3054,3055],{"class":507},"        check_answer",[284,3057,511],{"class":447},[284,3059,3060],{"class":377},"lambda",[284,3062,3063],{"class":978}," ans",[284,3065,409],{"class":333},[284,3067,3068],{"class":1288}," \"",[284,3070,3071],{"class":1241},"4",[284,3073,1275],{"class":1288},[284,3075,1475],{"class":297},[284,3077,3063],{"class":500},[284,3079,1126],{"class":333},[284,3081,3082,3085,3087,3090],{"class":286,"line":416},[284,3083,3084],{"class":507},"        max_tokens",[284,3086,511],{"class":447},[284,3088,3089],{"class":1245},"5_000",[284,3091,1126],{"class":333},[284,3093,3094],{"class":286,"line":426},[284,3095,3096],{"class":333},"    
),\n",[284,3098,3099],{"class":286,"line":436},[284,3100,316],{"emptyLinePlaceholder":315},[284,3102,3103,3105],{"class":286,"line":461},[284,3104,2982],{"class":500},[284,3106,1109],{"class":333},[284,3108,3109,3111,3113,3115,3118,3120],{"class":286,"line":466},[284,3110,2989],{"class":507},[284,3112,511],{"class":447},[284,3114,1275],{"class":1288},[284,3116,3117],{"class":1241},"file-viewport",[284,3119,1275],{"class":1288},[284,3121,1126],{"class":333},[284,3123,3124,3126,3128,3130,3133,3135],{"class":286,"line":472},[284,3125,3005],{"class":507},[284,3127,511],{"class":447},[284,3129,1275],{"class":1288},[284,3131,3132],{"class":1241},"Reads a known file via viewport, not full read",[284,3134,1275],{"class":1288},[284,3136,1126],{"class":333},[284,3138,3139,3141,3143,3145,3148,3150],{"class":286,"line":478},[284,3140,3021],{"class":507},[284,3142,511],{"class":447},[284,3144,1275],{"class":1288},[284,3146,3147],{"class":1241},"What is the first line of \u002Fetc\u002Fhostname?",[284,3149,1275],{"class":1288},[284,3151,1126],{"class":333},[284,3153,3154,3156,3158,3160,3162,3165,3167],{"class":286,"line":520},[284,3155,3037],{"class":507},[284,3157,511],{"class":447},[284,3159,489],{"class":333},[284,3161,1275],{"class":1288},[284,3163,3164],{"class":1241},"read_file_viewport",[284,3166,1275],{"class":1288},[284,3168,1278],{"class":333},[284,3170,3171,3174,3176,3178,3180,3183,3185,3187],{"class":286,"line":525},[284,3172,3173],{"class":507},"        forbidden_tools",[284,3175,511],{"class":447},[284,3177,489],{"class":333},[284,3179,1275],{"class":1288},[284,3181,3182],{"class":1241},"read_file",[284,3184,1275],{"class":1288},[284,3186,594],{"class":333},[284,3188,3189],{"class":290},"  # old unbounded 
read\n",[284,3191,3192,3194,3196,3198,3200,3202,3205,3207,3210,3212,3214,3216],{"class":286,"line":531},[284,3193,3055],{"class":507},[284,3195,511],{"class":447},[284,3197,3060],{"class":377},[284,3199,3063],{"class":978},[284,3201,409],{"class":333},[284,3203,3204],{"class":405}," len",[284,3206,504],{"class":333},[284,3208,3209],{"class":500},"ans",[284,3211,986],{"class":333},[284,3213,1606],{"class":447},[284,3215,1796],{"class":1245},[284,3217,1126],{"class":333},[284,3219,3220,3222,3224,3227],{"class":286,"line":561},[284,3221,3084],{"class":507},[284,3223,511],{"class":447},[284,3225,3226],{"class":1245},"8_000",[284,3228,1126],{"class":333},[284,3230,3231],{"class":286,"line":566},[284,3232,3096],{"class":333},[284,3234,3235],{"class":286,"line":572},[284,3236,316],{"emptyLinePlaceholder":315},[284,3238,3239,3241],{"class":286,"line":578},[284,3240,2982],{"class":500},[284,3242,1109],{"class":333},[284,3244,3245,3247,3249,3251,3254,3256],{"class":286,"line":610},[284,3246,2989],{"class":507},[284,3248,511],{"class":447},[284,3250,1275],{"class":1288},[284,3252,3253],{"class":1241},"long-session-compaction",[284,3255,1275],{"class":1288},[284,3257,1126],{"class":333},[284,3259,3260,3262,3264,3266,3269,3271],{"class":286,"line":615},[284,3261,3005],{"class":507},[284,3263,511],{"class":447},[284,3265,1275],{"class":1288},[284,3267,3268],{"class":1241},"Task that triggers compaction; verifies survival",[284,3270,1275],{"class":1288},[284,3272,1126],{"class":333},[284,3274,3275,3277,3279],{"class":286,"line":621},[284,3276,3021],{"class":507},[284,3278,511],{"class":447},[284,3280,1109],{"class":333},[284,3282,3283,3286,3289],{"class":286,"line":640},[284,3284,3285],{"class":1288},"            \"",[284,3287,3288],{"class":1241},"Read \u002Fproc\u002Fcpuinfo, \u002Fproc\u002Fmeminfo, \u002Fproc\u002Fversion, 
",[284,3290,3291],{"class":1288},"\"\n",[284,3293,3294,3296,3299],{"class":286,"line":645},[284,3295,3285],{"class":1288},[284,3297,3298],{"class":1241},"\u002Fetc\u002Fos-release, and \u002Fetc\u002Fhostname. Summarize the system in ",[284,3300,3291],{"class":1288},[284,3302,3303,3305,3308],{"class":286,"line":651},[284,3304,3285],{"class":1288},[284,3306,3307],{"class":1241},"three bullet points.",[284,3309,3291],{"class":1288},[284,3311,3312],{"class":286,"line":669},[284,3313,3314],{"class":333},"        ),\n",[284,3316,3317,3319,3321,3323,3325,3327,3329],{"class":286,"line":674},[284,3318,3037],{"class":507},[284,3320,511],{"class":447},[284,3322,489],{"class":333},[284,3324,1275],{"class":1288},[284,3326,3164],{"class":1241},[284,3328,1275],{"class":1288},[284,3330,1278],{"class":333},[284,3332,3333,3335,3337,3340],{"class":286,"line":679},[284,3334,3084],{"class":507},[284,3336,511],{"class":447},[284,3338,3339],{"class":1245},"50_000",[284,3341,1126],{"class":333},[284,3343,3344,3347,3349,3352],{"class":286,"line":686},[284,3345,3346],{"class":507},"        max_iterations",[284,3348,511],{"class":447},[284,3350,3351],{"class":1245},"15",[284,3353,1126],{"class":333},[284,3355,3356],{"class":286,"line":696},[284,3357,3096],{"class":333},[284,3359,3360],{"class":286,"line":706},[284,3361,316],{"emptyLinePlaceholder":315},[284,3363,3364,3366],{"class":286,"line":717},[284,3365,2982],{"class":500},[284,3367,1109],{"class":333},[284,3369,3370,3372,3374,3376,3379,3381],{"class":286,"line":734},[284,3371,2989],{"class":507},[284,3373,511],{"class":447},[284,3375,1275],{"class":1288},[284,3377,3378],{"class":1241},"premature-finalization-trap",[284,3380,1275],{"class":1288},[284,3382,1126],{"class":333},[284,3384,3385,3387,3389,3391,3394,3396],{"class":286,"line":744},[284,3386,3005],{"class":507},[284,3388,511],{"class":447},[284,3390,1275],{"class":1288},[284,3392,3393],{"class":1241},"Agent must process all 5 items; shortcut is 
possible",[284,3395,1275],{"class":1288},[284,3397,1126],{"class":333},[284,3399,3400,3402,3404],{"class":286,"line":755},[284,3401,3021],{"class":507},[284,3403,511],{"class":447},[284,3405,1109],{"class":333},[284,3407,3408,3410,3413],{"class":286,"line":765},[284,3409,3285],{"class":1288},[284,3411,3412],{"class":1241},"For each number in [1, 2, 3, 4, 5], compute its square ",[284,3414,3291],{"class":1288},[284,3416,3417,3419,3422],{"class":286,"line":1341},[284,3418,3285],{"class":1288},[284,3420,3421],{"class":1241},"using the calculator. Then report all five squares in a list.",[284,3423,3291],{"class":1288},[284,3425,3426],{"class":286,"line":1346},[284,3427,3314],{"class":333},[284,3429,3430,3432,3434,3436,3438,3440,3442],{"class":286,"line":1366},[284,3431,3037],{"class":507},[284,3433,511],{"class":447},[284,3435,489],{"class":333},[284,3437,1275],{"class":1288},[284,3439,3046],{"class":1241},[284,3441,1275],{"class":1288},[284,3443,1278],{"class":333},[284,3445,3446,3448,3450,3452,3454,3456,3459,3461,3464,3466,3469,3472,3475,3477,3479,3481,3484,3486,3488,3490,3492,3494,3496,3498,3501,3503,3505,3507,3510,3512,3514,3516,3519,3521],{"class":286,"line":1371},[284,3447,3055],{"class":507},[284,3449,511],{"class":447},[284,3451,3060],{"class":377},[284,3453,3063],{"class":978},[284,3455,409],{"class":333},[284,3457,3458],{"class":405}," all",[284,3460,504],{"class":333},[284,3462,3463],{"class":500},"s ",[284,3465,1452],{"class":297},[284,3467,3468],{"class":500}," ans ",[284,3470,3471],{"class":297},"for",[284,3473,3474],{"class":500}," s 
",[284,3476,1452],{"class":297},[284,3478,2443],{"class":333},[284,3480,1275],{"class":1288},[284,3482,3483],{"class":1241},"1",[284,3485,1275],{"class":1288},[284,3487,334],{"class":333},[284,3489,3068],{"class":1288},[284,3491,3071],{"class":1241},[284,3493,1275],{"class":1288},[284,3495,334],{"class":333},[284,3497,3068],{"class":1288},[284,3499,3500],{"class":1241},"9",[284,3502,1275],{"class":1288},[284,3504,334],{"class":333},[284,3506,3068],{"class":1288},[284,3508,3509],{"class":1241},"16",[284,3511,1275],{"class":1288},[284,3513,334],{"class":333},[284,3515,3068],{"class":1288},[284,3517,3518],{"class":1241},"25",[284,3520,1275],{"class":1288},[284,3522,3523],{"class":333},"]),\n",[284,3525,3526],{"class":286,"line":1416},[284,3527,3096],{"class":333},[284,3529,3530],{"class":286,"line":1438},[284,3531,316],{"emptyLinePlaceholder":315},[284,3533,3534,3536],{"class":286,"line":1443},[284,3535,2982],{"class":500},[284,3537,1109],{"class":333},[284,3539,3540,3542,3544,3546,3549,3551],{"class":286,"line":1464},[284,3541,2989],{"class":507},[284,3543,511],{"class":447},[284,3545,1275],{"class":1288},[284,3547,3548],{"class":1241},"plan-required",[284,3550,1275],{"class":1288},[284,3552,1126],{"class":333},[284,3554,3555,3557,3559,3561,3564,3566],{"class":286,"line":1482},[284,3556,3005],{"class":507},[284,3558,511],{"class":447},[284,3560,1275],{"class":1288},[284,3562,3563],{"class":1241},"Task complex enough that a plan should be created",[284,3565,1275],{"class":1288},[284,3567,1126],{"class":333},[284,3569,3570,3572,3574],{"class":286,"line":1509},[284,3571,3021],{"class":507},[284,3573,511],{"class":447},[284,3575,1109],{"class":333},[284,3577,3578,3580,3583],{"class":286,"line":1514},[284,3579,3285],{"class":1288},[284,3581,3582],{"class":1241},"Investigate and report: (1) the user running this, (2) the 
",[284,3584,3291],{"class":1288},[284,3586,3587,3589,3592],{"class":286,"line":1533},[284,3588,3285],{"class":1288},[284,3590,3591],{"class":1241},"working directory, (3) three most-recent files in it. ",[284,3593,3291],{"class":1288},[284,3595,3596,3598,3601],{"class":286,"line":1546},[284,3597,3285],{"class":1288},[284,3599,3600],{"class":1241},"Structure your answer as a three-point summary.",[284,3602,3291],{"class":1288},[284,3604,3605],{"class":286,"line":1573},[284,3606,3314],{"class":333},[284,3608,3609,3611,3613,3615,3617,3620,3622,3624,3626,3629,3631,3633,3635,3638,3640],{"class":286,"line":1578},[284,3610,3037],{"class":507},[284,3612,511],{"class":447},[284,3614,489],{"class":333},[284,3616,1275],{"class":1288},[284,3618,3619],{"class":1241},"bash",[284,3621,1275],{"class":1288},[284,3623,334],{"class":333},[284,3625,3068],{"class":1288},[284,3627,3628],{"class":1241},"plan_create",[284,3630,1275],{"class":1288},[284,3632,334],{"class":333},[284,3634,3068],{"class":1288},[284,3636,3637],{"class":1241},"plan_show",[284,3639,1275],{"class":1288},[284,3641,1278],{"class":333},[284,3643,3644],{"class":286,"line":1617},[284,3645,3096],{"class":333},[284,3647,3648],{"class":286,"line":1660},[284,3649,731],{"class":333},[112,3651,3652],{},"Run them:",[275,3654,3656],{"className":277,"code":3655,"language":279,"meta":280,"style":280},"# examples\u002Fch19_evals.py\nimport asyncio\n\nfrom harness.evals.runner import EvalRunner\nfrom harness.providers.anthropic import AnthropicProvider\nfrom harness.tools.selector import ToolCatalog\nfrom harness.tools.std import STANDARD_TOOLS\nfrom tests.evals.cases import CASES\n\n\nasync def main() -> None:\n    runner = EvalRunner(\n        provider=AnthropicProvider(),\n        catalog=ToolCatalog(tools=STANDARD_TOOLS),\n    )\n    results = await runner.run_all(CASES)\n    passed = sum(1 for r in results if r.passed)\n    print(f\"\\n{passed}\u002F{len(results)} 
passed\")\n\n\nasyncio.run(main())\n",[235,3657,3658,3663,3669,3673,3693,3713,3731,3751,3772,3776,3780,3798,3809,3822,3844,3849,3871,3907,3947,3951,3955],{"__ignoreMap":280},[284,3659,3660],{"class":286,"line":287},[284,3661,3662],{"class":290},"# examples\u002Fch19_evals.py\n",[284,3664,3665,3667],{"class":286,"line":294},[284,3666,327],{"class":297},[284,3668,813],{"class":308},[284,3670,3671],{"class":286,"line":312},[284,3672,316],{"emptyLinePlaceholder":315},[284,3674,3675,3677,3679,3681,3683,3685,3688,3690],{"class":286,"line":319},[284,3676,298],{"class":297},[284,3678,2945],{"class":308},[284,3680,865],{"class":333},[284,3682,2950],{"class":308},[284,3684,865],{"class":333},[284,3686,3687],{"class":308},"runner ",[284,3689,327],{"class":297},[284,3691,3692],{"class":308}," EvalRunner\n",[284,3694,3695,3697,3699,3701,3703,3705,3708,3710],{"class":286,"line":340},[284,3696,298],{"class":297},[284,3698,2945],{"class":308},[284,3700,865],{"class":333},[284,3702,862],{"class":308},[284,3704,865],{"class":333},[284,3706,3707],{"class":308},"anthropic ",[284,3709,327],{"class":297},[284,3711,3712],{"class":308}," AnthropicProvider\n",[284,3714,3715,3717,3719,3721,3723,3725,3727,3729],{"class":286,"line":353},[284,3716,298],{"class":297},[284,3718,2945],{"class":308},[284,3720,865],{"class":333},[284,3722,882],{"class":308},[284,3724,865],{"class":333},[284,3726,887],{"class":308},[284,3728,327],{"class":297},[284,3730,892],{"class":308},[284,3732,3733,3735,3737,3739,3741,3743,3746,3748],{"class":286,"line":358},[284,3734,298],{"class":297},[284,3736,2945],{"class":308},[284,3738,865],{"class":333},[284,3740,882],{"class":308},[284,3742,865],{"class":333},[284,3744,3745],{"class":308},"std ",[284,3747,327],{"class":297},[284,3749,3750],{"class":301}," STANDARD_TOOLS\n",[284,3752,3753,3755,3758,3760,3762,3764,3767,3769],{"class":286,"line":363},[284,3754,298],{"class":297},[284,3756,3757],{"class":308}," 
tests",[284,3759,865],{"class":333},[284,3761,2950],{"class":308},[284,3763,865],{"class":333},[284,3765,3766],{"class":308},"cases ",[284,3768,327],{"class":297},[284,3770,3771],{"class":301}," CASES\n",[284,3773,3774],{"class":286,"line":374},[284,3775,316],{"emptyLinePlaceholder":315},[284,3777,3778],{"class":286,"line":388},[284,3779,316],{"emptyLinePlaceholder":315},[284,3781,3782,3785,3787,3790,3792,3794,3796],{"class":286,"line":402},[284,3783,3784],{"class":377},"async",[284,3786,964],{"class":377},[284,3788,3789],{"class":370}," main",[284,3791,1326],{"class":333},[284,3793,989],{"class":333},[284,3795,452],{"class":451},[284,3797,385],{"class":333},[284,3799,3800,3803,3805,3807],{"class":286,"line":416},[284,3801,3802],{"class":308},"    runner ",[284,3804,511],{"class":447},[284,3806,932],{"class":500},[284,3808,1109],{"class":333},[284,3810,3811,3814,3816,3819],{"class":286,"line":426},[284,3812,3813],{"class":507},"        provider",[284,3815,511],{"class":447},[284,3817,3818],{"class":500},"AnthropicProvider",[284,3820,3821],{"class":333},"(),\n",[284,3823,3824,3827,3829,3832,3834,3836,3838,3841],{"class":286,"line":436},[284,3825,3826],{"class":507},"        catalog",[284,3828,511],{"class":447},[284,3830,3831],{"class":500},"ToolCatalog",[284,3833,504],{"class":333},[284,3835,882],{"class":507},[284,3837,511],{"class":447},[284,3839,3840],{"class":405},"STANDARD_TOOLS",[284,3842,3843],{"class":333},"),\n",[284,3845,3846],{"class":286,"line":461},[284,3847,3848],{"class":333},"    )\n",[284,3850,3851,3854,3856,3858,3861,3863,3865,3867,3869],{"class":286,"line":466},[284,3852,3853],{"class":308},"    results ",[284,3855,511],{"class":447},[284,3857,1103],{"class":297},[284,3859,3860],{"class":308}," 
runner",[284,3862,865],{"class":333},[284,3864,2876],{"class":500},[284,3866,504],{"class":333},[284,3868,2972],{"class":405},[284,3870,517],{"class":333},[284,3872,3873,3876,3878,3881,3883,3885,3887,3890,3892,3895,3898,3901,3903,3905],{"class":286,"line":472},[284,3874,3875],{"class":308},"    passed ",[284,3877,511],{"class":447},[284,3879,3880],{"class":405}," sum",[284,3882,504],{"class":333},[284,3884,3483],{"class":1245},[284,3886,2460],{"class":297},[284,3888,3889],{"class":500}," r ",[284,3891,1452],{"class":297},[284,3893,3894],{"class":500}," results ",[284,3896,3897],{"class":297},"if",[284,3899,3900],{"class":500}," r",[284,3902,865],{"class":333},[284,3904,2022],{"class":1074},[284,3906,517],{"class":333},[284,3908,3909,3912,3914,3916,3918,3921,3923,3925,3927,3929,3931,3933,3935,3938,3940,3942,3945],{"class":286,"line":478},[284,3910,3911],{"class":405},"    print",[284,3913,504],{"class":333},[284,3915,1238],{"class":377},[284,3917,1275],{"class":1241},[284,3919,3920],{"class":301},"\\n",[284,3922,1246],{"class":1245},[284,3924,2022],{"class":500},[284,3926,1263],{"class":1245},[284,3928,6],{"class":1241},[284,3930,1246],{"class":1245},[284,3932,1783],{"class":405},[284,3934,504],{"class":333},[284,3936,3937],{"class":500},"results",[284,3939,986],{"class":333},[284,3941,1263],{"class":1245},[284,3943,3944],{"class":1241}," passed\"",[284,3946,517],{"class":333},[284,3948,3949],{"class":286,"line":520},[284,3950,316],{"emptyLinePlaceholder":315},[284,3952,3953],{"class":286,"line":525},[284,3954,316],{"emptyLinePlaceholder":315},[284,3956,3957,3960,3962,3964,3966,3969],{"class":286,"line":531},[284,3958,3959],{"class":308},"asyncio",[284,3961,865],{"class":333},[284,3963,2851],{"class":500},[284,3965,504],{"class":333},[284,3967,3968],{"class":500},"main",[284,3970,3971],{"class":333},"())\n",[112,3973,3974,3975,3982,3983,3988],{},"You now have a regression suite. Run it before any model upgrade, any prompt change, any harness refactor. 
The output (pass counts, failure reasons) is the signal the ",[3976,3977,3981],"a",{"href":3978,"rel":3979},"https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02185",[3980],"nofollow","POSIX prompt-sensitivity paper (arXiv 2410.02185, 2024)"," and ",[3976,3984,3987],{"href":3985,"rel":3986},"https:\u002F\u002Fwww.promptfoo.dev\u002Fblog\u002Fmodel-upgrades-break-agent-safety\u002F",[3980],"Promptfoo's 2025 \"Your model upgrade just broke your agent's safety\""," call for: before you ship a model version upgrade, you run this and verify nothing regresses.",[215,3990],{},[218,3992,3994],{"id":3993},"_195-llm-as-judge","19.5 LLM-as-Judge",[112,3996,3997,3998,4000],{},"For tasks where ",[235,3999,1381],{}," is subjective (\"summarize this article\", for example), we can use another LLM as the judge. The ingredients are a well-designed judge prompt and a judge model more capable than the one being tested:",[275,4002,4004],{"className":277,"code":4003,"language":279,"meta":280,"style":280},"# src\u002Fharness\u002Fevals\u002Fllm_judge.py\nasync def judge(\n    judge_provider: Provider,\n    question: str,\n    candidate_answer: str,\n    reference_answer: str | None = None,\n    criteria: str = \"accuracy, completeness, relevance\",\n) -> bool:\n    from ..messages import Message, Transcript\n\n    transcript = Transcript(system=(\n        \"You are a strict evaluator. Given a question and a candidate answer, \"\n        \"judge whether the answer is correct by the criteria provided. 
\"\n        \"Reply with only 'PASS' or 'FAIL' followed by a one-sentence reason.\"\n    ))\n    user = (f\"Question: {question}\\n\\n\"\n            f\"Candidate answer: {candidate_answer}\\n\\n\")\n    if reference_answer:\n        user += f\"Reference answer for comparison: {reference_answer}\\n\\n\"\n    user += f\"Criteria: {criteria}\"\n    transcript.append(Message.user_text(user))\n\n    response = await judge_provider.acomplete(transcript, tools=[])\n    text = response.text or \"\"\n    return text.strip().upper().startswith(\"PASS\")\n",[235,4005,4006,4011,4022,4034,4045,4056,4075,4095,4105,4125,4129,4147,4157,4166,4175,4180,4206,4227,4237,4262,4282,4308,4312,4343,4364],{"__ignoreMap":280},[284,4007,4008],{"class":286,"line":287},[284,4009,4010],{"class":290},"# src\u002Fharness\u002Fevals\u002Fllm_judge.py\n",[284,4012,4013,4015,4017,4020],{"class":286,"line":294},[284,4014,3784],{"class":377},[284,4016,964],{"class":377},[284,4018,4019],{"class":370}," judge",[284,4021,1109],{"class":333},[284,4023,4024,4027,4029,4032],{"class":286,"line":312},[284,4025,4026],{"class":978},"    judge_provider",[284,4028,409],{"class":333},[284,4030,4031],{"class":308}," Provider",[284,4033,1126],{"class":333},[284,4035,4036,4039,4041,4043],{"class":286,"line":319},[284,4037,4038],{"class":978},"    question",[284,4040,409],{"class":333},[284,4042,444],{"class":412},[284,4044,1126],{"class":333},[284,4046,4047,4050,4052,4054],{"class":286,"line":340},[284,4048,4049],{"class":978},"    candidate_answer",[284,4051,409],{"class":333},[284,4053,444],{"class":412},[284,4055,1126],{"class":333},[284,4057,4058,4061,4063,4065,4067,4069,4071,4073],{"class":286,"line":353},[284,4059,4060],{"class":978},"    
reference_answer",[284,4062,409],{"class":333},[284,4064,444],{"class":412},[284,4066,448],{"class":447},[284,4068,452],{"class":451},[284,4070,455],{"class":447},[284,4072,452],{"class":451},[284,4074,1126],{"class":333},[284,4076,4077,4080,4082,4084,4086,4088,4091,4093],{"class":286,"line":358},[284,4078,4079],{"class":978},"    criteria",[284,4081,409],{"class":333},[284,4083,444],{"class":412},[284,4085,455],{"class":447},[284,4087,3068],{"class":1288},[284,4089,4090],{"class":1241},"accuracy, completeness, relevance",[284,4092,1275],{"class":1288},[284,4094,1126],{"class":333},[284,4096,4097,4099,4101,4103],{"class":286,"line":363},[284,4098,986],{"class":333},[284,4100,989],{"class":333},[284,4102,597],{"class":412},[284,4104,385],{"class":333},[284,4106,4107,4110,4112,4115,4117,4120,4122],{"class":286,"line":374},[284,4108,4109],{"class":297},"    from",[284,4111,845],{"class":333},[284,4113,4114],{"class":308},"messages ",[284,4116,327],{"class":297},[284,4118,4119],{"class":308}," Message",[284,4121,334],{"class":333},[284,4123,4124],{"class":308}," Transcript\n",[284,4126,4127],{"class":286,"line":388},[284,4128,316],{"emptyLinePlaceholder":315},[284,4130,4131,4134,4136,4139,4141,4143,4145],{"class":286,"line":402},[284,4132,4133],{"class":308},"    transcript ",[284,4135,511],{"class":447},[284,4137,4138],{"class":500}," Transcript",[284,4140,504],{"class":333},[284,4142,1153],{"class":507},[284,4144,511],{"class":447},[284,4146,1109],{"class":333},[284,4148,4149,4152,4155],{"class":286,"line":416},[284,4150,4151],{"class":1288},"        \"",[284,4153,4154],{"class":1241},"You are a strict evaluator. Given a question and a candidate answer, ",[284,4156,3291],{"class":1288},[284,4158,4159,4161,4164],{"class":286,"line":426},[284,4160,4151],{"class":1288},[284,4162,4163],{"class":1241},"judge whether the answer is correct by the criteria provided. 
",[284,4165,3291],{"class":1288},[284,4167,4168,4170,4173],{"class":286,"line":436},[284,4169,4151],{"class":1288},[284,4171,4172],{"class":1241},"Reply with only 'PASS' or 'FAIL' followed by a one-sentence reason.",[284,4174,3291],{"class":1288},[284,4176,4177],{"class":286,"line":461},[284,4178,4179],{"class":333},"    ))\n",[284,4181,4182,4185,4187,4189,4191,4194,4196,4199,4201,4204],{"class":286,"line":466},[284,4183,4184],{"class":308},"    user ",[284,4186,511],{"class":447},[284,4188,2118],{"class":333},[284,4190,1238],{"class":377},[284,4192,4193],{"class":1241},"\"Question: ",[284,4195,1246],{"class":1245},[284,4197,4198],{"class":308},"question",[284,4200,1263],{"class":1245},[284,4202,4203],{"class":301},"\\n\\n",[284,4205,3291],{"class":1241},[284,4207,4208,4211,4214,4216,4219,4221,4223,4225],{"class":286,"line":472},[284,4209,4210],{"class":377},"            f",[284,4212,4213],{"class":1241},"\"Candidate answer: ",[284,4215,1246],{"class":1245},[284,4217,4218],{"class":308},"candidate_answer",[284,4220,1263],{"class":1245},[284,4222,4203],{"class":301},[284,4224,1275],{"class":1241},[284,4226,517],{"class":333},[284,4228,4229,4232,4235],{"class":286,"line":478},[284,4230,4231],{"class":297},"    if",[284,4233,4234],{"class":308}," reference_answer",[284,4236,385],{"class":333},[284,4238,4239,4242,4245,4248,4251,4253,4256,4258,4260],{"class":286,"line":520},[284,4240,4241],{"class":308},"        user ",[284,4243,4244],{"class":447},"+=",[284,4246,4247],{"class":377}," f",[284,4249,4250],{"class":1241},"\"Reference answer for comparison: ",[284,4252,1246],{"class":1245},[284,4254,4255],{"class":308},"reference_answer",[284,4257,1263],{"class":1245},[284,4259,4203],{"class":301},[284,4261,3291],{"class":1241},[284,4263,4264,4266,4268,4270,4273,4275,4278,4280],{"class":286,"line":525},[284,4265,4184],{"class":308},[284,4267,4244],{"class":447},[284,4269,4247],{"class":377},[284,4271,4272],{"class":1241},"\"Criteria: 
",[284,4274,1246],{"class":1245},[284,4276,4277],{"class":308},"criteria",[284,4279,1263],{"class":1245},[284,4281,3291],{"class":1241},[284,4283,4284,4287,4289,4291,4293,4296,4298,4301,4303,4306],{"class":286,"line":531},[284,4285,4286],{"class":308},"    transcript",[284,4288,865],{"class":333},[284,4290,1424],{"class":500},[284,4292,504],{"class":333},[284,4294,4295],{"class":500},"Message",[284,4297,865],{"class":333},[284,4299,4300],{"class":500},"user_text",[284,4302,504],{"class":333},[284,4304,4305],{"class":500},"user",[284,4307,2167],{"class":333},[284,4309,4310],{"class":286,"line":561},[284,4311,316],{"emptyLinePlaceholder":315},[284,4313,4314,4317,4319,4321,4324,4326,4329,4331,4334,4336,4338,4340],{"class":286,"line":566},[284,4315,4316],{"class":308},"    response ",[284,4318,511],{"class":447},[284,4320,1103],{"class":297},[284,4322,4323],{"class":308}," judge_provider",[284,4325,865],{"class":333},[284,4327,4328],{"class":500},"acomplete",[284,4330,504],{"class":333},[284,4332,4333],{"class":500},"transcript",[284,4335,334],{"class":333},[284,4337,2468],{"class":507},[284,4339,511],{"class":447},[284,4341,4342],{"class":333},"[])\n",[284,4344,4345,4348,4350,4353,4355,4358,4361],{"class":286,"line":572},[284,4346,4347],{"class":308},"    text ",[284,4349,511],{"class":447},[284,4351,4352],{"class":308}," response",[284,4354,865],{"class":333},[284,4356,4357],{"class":1074},"text",[284,4359,4360],{"class":447}," or",[284,4362,4363],{"class":1288}," \"\"\n",[284,4365,4366,4369,4372,4374,4377,4380,4383,4385,4388,4390,4392,4395,4397],{"class":286,"line":578},[284,4367,4368],{"class":297},"    return",[284,4370,4371],{"class":308}," 
text",[284,4373,865],{"class":333},[284,4375,4376],{"class":500},"strip",[284,4378,4379],{"class":333},"().",[284,4381,4382],{"class":500},"upper",[284,4384,4379],{"class":333},[284,4386,4387],{"class":500},"startswith",[284,4389,504],{"class":333},[284,4391,1275],{"class":1288},[284,4393,4394],{"class":1241},"PASS",[284,4396,1275],{"class":1288},[284,4398,517],{"class":333},[112,4400,4401],{},"Two caveats worth knowing.",[112,4403,4404,4407],{},[132,4405,4406],{},"Judge bias."," Using Claude to judge Claude's output correlates judge and candidate errors. If they share the same blind spot, the judge misses the failure. Best practice: use a different provider for the judge than for the candidate — Claude judging GPT, or vice versa.",[112,4409,4410,4413],{},[132,4411,4412],{},"Judge ceiling."," An LLM judge can't reliably exceed its own capability ceiling on the underlying task. A judge weaker than the candidate on a hard task will mis-score confidently.",[112,4415,4416,4417,4419],{},"For the book's scenarios, deterministic ",[235,4418,1381],{}," functions cover most cases. LLM-as-judge is a tool in the kit; don't reach for it when a function would do.",[215,4421],{},[218,4423,4425],{"id":4424},"_196-production-to-eval-pipeline","19.6 Production-to-Eval Pipeline",[112,4427,4428,4429,409],{},"The observability work from Chapter 18 gives us structured trace data. A production run that fails — crashed, timed out, produced a clearly-wrong output — is a potential eval case. A small script turns a failing trace into an ",[235,4430,1918],{},[275,4432,4434],{"className":277,"code":4433,"language":279,"meta":280,"style":280},"# src\u002Fharness\u002Fevals\u002Ffrom_trace.py\nfrom .case import EvalCase\n\n\ndef case_from_trace(trace_summary: dict) -> EvalCase:\n    \"\"\"Convert a production trace into a regression eval case.\n\n    trace_summary: a dict extracted from your tracing backend. 
Typical\n    fields: user_message, system, final_answer, failure_reason.\n    \"\"\"\n    return EvalCase(\n        id=f\"prod-regression-{trace_summary['trace_id'][:8]}\",\n        description=f\"regression from production: \"\n                    f\"{trace_summary.get('failure_reason', 'unknown')}\",\n        user_message=trace_summary[\"user_message\"],\n        system=trace_summary.get(\"system\"),\n        max_tokens=int(trace_summary.get(\"tokens_used\", 0) * 1.5),\n        # The check is often just \"doesn't repeat the same failure.\"\n        # More sophisticated: check the specific known-bad behavior.\n    )\n",[235,4435,4436,4441,4453,4457,4461,4487,4494,4498,4503,4508,4512,4520,4558,4569,4611,4629,4652,4691,4696,4701],{"__ignoreMap":280},[284,4437,4438],{"class":286,"line":287},[284,4439,4440],{"class":290},"# src\u002Fharness\u002Fevals\u002Ffrom_trace.py\n",[284,4442,4443,4445,4447,4449,4451],{"class":286,"line":294},[284,4444,298],{"class":297},[284,4446,899],{"class":333},[284,4448,902],{"class":308},[284,4450,327],{"class":297},[284,4452,2959],{"class":308},[284,4454,4455],{"class":286,"line":312},[284,4456,316],{"emptyLinePlaceholder":315},[284,4458,4459],{"class":286,"line":319},[284,4460,316],{"emptyLinePlaceholder":315},[284,4462,4463,4466,4469,4471,4474,4476,4479,4481,4483,4485],{"class":286,"line":340},[284,4464,4465],{"class":377},"def",[284,4467,4468],{"class":370}," case_from_trace",[284,4470,504],{"class":333},[284,4472,4473],{"class":978},"trace_summary",[284,4475,409],{"class":333},[284,4477,4478],{"class":412}," dict",[284,4480,986],{"class":333},[284,4482,989],{"class":333},[284,4484,382],{"class":308},[284,4486,385],{"class":333},[284,4488,4489,4491],{"class":286,"line":353},[284,4490,392],{"class":391},[284,4492,4493],{"class":395},"Convert a production trace into a regression eval 
case.\n",[284,4495,4496],{"class":286,"line":358},[284,4497,316],{"emptyLinePlaceholder":315},[284,4499,4500],{"class":286,"line":363},[284,4501,4502],{"class":395},"    trace_summary: a dict extracted from your tracing backend. Typical\n",[284,4504,4505],{"class":286,"line":374},[284,4506,4507],{"class":395},"    fields: user_message, system, final_answer, failure_reason.\n",[284,4509,4510],{"class":286,"line":388},[284,4511,2253],{"class":391},[284,4513,4514,4516,4518],{"class":286,"line":402},[284,4515,4368],{"class":297},[284,4517,382],{"class":500},[284,4519,1109],{"class":333},[284,4521,4522,4524,4526,4528,4531,4533,4535,4537,4539,4542,4544,4547,4550,4552,4554,4556],{"class":286,"line":416},[284,4523,2989],{"class":507},[284,4525,511],{"class":447},[284,4527,1238],{"class":377},[284,4529,4530],{"class":1241},"\"prod-regression-",[284,4532,1246],{"class":1245},[284,4534,4473],{"class":500},[284,4536,489],{"class":333},[284,4538,2007],{"class":1288},[284,4540,4541],{"class":1241},"trace_id",[284,4543,2007],{"class":1288},[284,4545,4546],{"class":333},"][:",[284,4548,4549],{"class":1245},"8",[284,4551,495],{"class":333},[284,4553,1263],{"class":1245},[284,4555,1275],{"class":1241},[284,4557,1126],{"class":333},[284,4559,4560,4562,4564,4566],{"class":286,"line":426},[284,4561,3005],{"class":507},[284,4563,511],{"class":447},[284,4565,1238],{"class":377},[284,4567,4568],{"class":1241},"\"regression from production: \"\n",[284,4570,4571,4574,4576,4578,4580,4582,4585,4587,4589,4592,4594,4596,4598,4601,4603,4605,4607,4609],{"class":286,"line":436},[284,4572,4573],{"class":377},"                    
f",[284,4575,1275],{"class":1241},[284,4577,1246],{"class":1245},[284,4579,4473],{"class":500},[284,4581,865],{"class":333},[284,4583,4584],{"class":500},"get",[284,4586,504],{"class":333},[284,4588,2007],{"class":1288},[284,4590,4591],{"class":1241},"failure_reason",[284,4593,2007],{"class":1288},[284,4595,334],{"class":333},[284,4597,2028],{"class":1288},[284,4599,4600],{"class":1241},"unknown",[284,4602,2007],{"class":1288},[284,4604,986],{"class":333},[284,4606,1263],{"class":1245},[284,4608,1275],{"class":1241},[284,4610,1126],{"class":333},[284,4612,4613,4615,4617,4619,4621,4623,4625,4627],{"class":286,"line":461},[284,4614,3021],{"class":507},[284,4616,511],{"class":447},[284,4618,4473],{"class":500},[284,4620,489],{"class":333},[284,4622,1275],{"class":1288},[284,4624,1169],{"class":1241},[284,4626,1275],{"class":1288},[284,4628,1278],{"class":333},[284,4630,4631,4634,4636,4638,4640,4642,4644,4646,4648,4650],{"class":286,"line":466},[284,4632,4633],{"class":507},"        system",[284,4635,511],{"class":447},[284,4637,4473],{"class":500},[284,4639,865],{"class":333},[284,4641,4584],{"class":500},[284,4643,504],{"class":333},[284,4645,1275],{"class":1288},[284,4647,1153],{"class":1241},[284,4649,1275],{"class":1288},[284,4651,3843],{"class":333},[284,4653,4654,4656,4658,4661,4663,4665,4667,4669,4671,4673,4675,4677,4679,4681,4683,4686,4689],{"class":286,"line":472},[284,4655,3084],{"class":507},[284,4657,511],{"class":447},[284,4659,4660],{"class":412},"int",[284,4662,504],{"class":333},[284,4664,4473],{"class":500},[284,4666,865],{"class":333},[284,4668,4584],{"class":500},[284,4670,504],{"class":333},[284,4672,1275],{"class":1288},[284,4674,1603],{"class":1241},[284,4676,1275],{"class":1288},[284,4678,334],{"class":333},[284,4680,1796],{"class":1245},[284,4682,986],{"class":333},[284,4684,4685],{"class":447}," *",[284,4687,4688],{"class":1245}," 1.5",[284,4690,3843],{"class":333},[284,4692,4693],{"class":286,"line":478},[284,4694,4695],{"class":290},"        
# The check is often just \"doesn't repeat the same failure.\"\n",[284,4697,4698],{"class":286,"line":520},[284,4699,4700],{"class":290},"        # More sophisticated: check the specific known-bad behavior.\n",[284,4702,4703],{"class":286,"line":525},[284,4704,3848],{"class":333},[112,4706,4707,4708,4711],{},"The workflow: monitoring flags a failed trace, an engineer reviews it, confirms it's a regression to prevent, runs ",[235,4709,4710],{},"case_from_trace",", reviews the generated case, tweaks it, commits it to the suite. Next CI run, the case runs; a future regression of the same issue fails CI before shipping.",[112,4713,4714],{},"This is how eval suites grow organically. Every real failure in production leaves a fossil in the suite. Over time, the suite encodes the specific failure modes your system has seen — the ones most likely to recur.",[215,4716],{},[218,4718,4720],{"id":4719},"_197-evals-are-not-tests","19.7 Evals Are Not Tests",[112,4722,4723],{},"A parting distinction worth naming. Unit tests verify deterministic code. Evals verify probabilistic systems. The differences:",[4725,4726,4727,4730,4733,4739],"ul",{},[128,4728,4729],{},"Unit tests pass or fail binarily; evals typically report a pass rate across runs (non-determinism is real).",[128,4731,4732],{},"Unit tests are cheap; evals cost real API money.",[128,4734,4735,4736,4738],{},"Unit tests run on every commit; evals might run on every merge to ",[235,4737,3968],{},", or nightly.",[128,4740,4741,4742,4745],{},"Unit tests protect correctness; evals protect ",[115,4743,4744],{},"behavior",", which includes correctness but also cost, latency, tool-use discipline.",[112,4747,4748],{},"Don't run evals on every commit — the cost and flakiness aren't worth it. Do run them as a merge gate and before any model upgrade. 
Treat a regression in the eval suite the same way you'd treat a regression in tests: a release blocker that requires root-causing.",[215,4750],{},[218,4752,4754],{"id":4753},"_198-commit","19.8 Commit",[275,4756,4759],{"className":4757,"code":4758,"language":3619,"meta":280,"style":280},"language-bash shiki shiki-themes material-theme-lighter github-light github-dark","git add -A && git commit -m \"ch19: minimal eval harness with regression cases\"\ngit tag ch19-evals\n",[235,4760,4761,4792],{"__ignoreMap":280},[284,4762,4763,4766,4769,4773,4776,4779,4782,4785,4787,4790],{"class":286,"line":287},[284,4764,4765],{"class":381},"git",[284,4767,4768],{"class":1241}," add",[284,4770,4772],{"class":4771},"stzsN"," -A",[284,4774,4775],{"class":333}," &&",[284,4777,4778],{"class":381}," git",[284,4780,4781],{"class":1241}," commit",[284,4783,4784],{"class":4771}," -m",[284,4786,3068],{"class":1288},[284,4788,4789],{"class":1241},"ch19: minimal eval harness with regression cases",[284,4791,3291],{"class":1288},[284,4793,4794,4796,4799],{"class":286,"line":294},[284,4795,4765],{"class":381},[284,4797,4798],{"class":1241}," tag",[284,4800,4801],{"class":1241}," ch19-evals\n",[218,4803,4805],{"id":4804},"_199-try-it-yourself","19.9 Try It Yourself",[125,4807,4808,4822,4828],{},[128,4809,4810,4813,4814,4816,4817,3982,4819,4821],{},[132,4811,4812],{},"Write five cases from your own use."," Pick five realistic tasks your harness should handle. Write ",[235,4815,1918],{},"s with ",[235,4818,1459],{},[235,4820,1381],{},". Run them. How many pass? For the failures, is the right fix in the harness or in the case?",[128,4823,4824,4827],{},[132,4825,4826],{},"Run the suite twice."," Non-determinism means the same case can pass once and fail the next. Measure the pass rate over 10 runs of the same case. Which cases are stable? Which aren't? 
A flaky case either has a real agent reliability problem or an over-strict check.",[128,4829,4830,4833,4834,4836],{},[132,4831,4832],{},"Swap the judge model."," Take a case that currently uses ",[235,4835,1381],{},"; replace it with an LLM judge. Does the judgment match? Where does it disagree? Judge-vs-function disagreements are informative.",[215,4838],{},[4840,4841,4842,4845],"what-you-understand",{},[112,4843,4844],{},"You can measure whether the harness is producing the right behavior, not just whether it runs. Golden trajectories encode task specs with structural and outcome checks. A regression runner turns the suite into a CI-gated signal. Production failures feed back into the suite, growing it organically. Evals are distinct from tests — probabilistic, expensive, run less often, but they gate behavior changes the way tests gate correctness changes.",[112,4846,4847],{},"What's still missing: cost. We track token counts; we don't bound them. A real deployment needs prompt caching, model routing by task complexity, and hard budget caps with auto-termination so one runaway agent doesn't cost $47K. 
Chapter 20 is cost control.",[4849,4850,4851],"style",{},"html pre.shiki code .sutJx, html code.shiki .sutJx{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#6A737D;--shiki-default-font-style:inherit;--shiki-dark:#6A737D;--shiki-dark-font-style:inherit}html pre.shiki code .sVHd0, html code.shiki .sVHd0{--shiki-light:#39ADB5;--shiki-light-font-style:italic;--shiki-default:#D73A49;--shiki-default-font-style:inherit;--shiki-dark:#F97583;--shiki-dark-font-style:inherit}html pre.shiki code .s_hVV, html code.shiki .s_hVV{--shiki-light:#90A4AE;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .su5hD, html code.shiki .su5hD{--shiki-light:#90A4AE;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .sP7_E, html code.shiki .sP7_E{--shiki-light:#39ADB5;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .stp6e, html code.shiki .stp6e{--shiki-light:#39ADB5;--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sGLFI, html code.shiki .sGLFI{--shiki-light:#6182B8;--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .sbsja, html code.shiki .sbsja{--shiki-light:#9C3EDA;--shiki-default:#D73A49;--shiki-dark:#F97583}html pre.shiki code .sbgvK, html code.shiki .sbgvK{--shiki-light:#E2931D;--shiki-default:#6F42C1;--shiki-dark:#B392F0}html pre.shiki code .s2W-s, html code.shiki .s2W-s{--shiki-light:#39ADB5;--shiki-light-font-style:italic;--shiki-default:#032F62;--shiki-default-font-style:inherit;--shiki-dark:#9ECBFF;--shiki-dark-font-style:inherit}html pre.shiki code .sithA, html code.shiki .sithA{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#032F62;--shiki-default-font-style:inherit;--shiki-dark:#9ECBFF;--shiki-dark-font-style:inherit}html pre.shiki code .sptTA, html code.shiki .sptTA{--shiki-light:#6182B8;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .sZMiF, html code.shiki .sZMiF{--shiki-light:#E2931D;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html 
pre.shiki code .smGrS, html code.shiki .smGrS{--shiki-light:#39ADB5;--shiki-default:#D73A49;--shiki-dark:#F97583}html pre.shiki code .s39Yj, html code.shiki .s39Yj{--shiki-light:#39ADB5;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .slqww, html code.shiki .slqww{--shiki-light:#6182B8;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .s99_P, html code.shiki .s99_P{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#E36209;--shiki-default-font-style:inherit;--shiki-dark:#FFAB70;--shiki-dark-font-style:inherit}html .light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html.light .shiki span {color: var(--shiki-light);background: var(--shiki-light-bg);font-style: var(--shiki-light-font-style);font-weight: var(--shiki-light-font-weight);text-decoration: var(--shiki-light-text-decoration);}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .smCYv, html code.shiki 
.smCYv{--shiki-light:#E53935;--shiki-light-font-style:italic;--shiki-default:#24292E;--shiki-default-font-style:inherit;--shiki-dark:#E1E4E8;--shiki-dark-font-style:inherit}html pre.shiki code .sFwrP, html code.shiki .sFwrP{--shiki-light:#90A4AE;--shiki-light-font-style:italic;--shiki-default:#24292E;--shiki-default-font-style:inherit;--shiki-dark:#E1E4E8;--shiki-dark-font-style:inherit}html pre.shiki code .skxfh, html code.shiki .skxfh{--shiki-light:#E53935;--shiki-default:#24292E;--shiki-dark:#E1E4E8}html pre.shiki code .s_sjI, html code.shiki .s_sjI{--shiki-light:#91B859;--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .srdBf, html code.shiki .srdBf{--shiki-light:#F76D47;--shiki-default:#005CC5;--shiki-dark:#79B8FF}html pre.shiki code .sjJ54, html code.shiki .sjJ54{--shiki-light:#39ADB5;--shiki-default:#032F62;--shiki-dark:#9ECBFF}html pre.shiki code .stzsN, html code.shiki .stzsN{--shiki-light:#91B859;--shiki-default:#005CC5;--shiki-dark:#79B8FF}",{"title":280,"searchDepth":294,"depth":294,"links":4853},[4854,4855,4856,4857,4858,4859,4860,4861,4862],{"id":220,"depth":294,"text":221},{"id":272,"depth":294,"text":273},{"id":781,"depth":294,"text":782},{"id":2927,"depth":294,"text":2928},{"id":3993,"depth":294,"text":3994},{"id":4424,"depth":294,"text":4425},{"id":4719,"depth":294,"text":4720},{"id":4753,"depth":294,"text":4754},{"id":4804,"depth":294,"text":4805},"md",{},null,{"title":86,"description":117},"M-Zohzx1Ap6oSt-KbFSWpGoCs3x9_5sbEXXnoKLglco",[4869,4871],{"title":82,"path":83,"stem":84,"description":4870,"children":-1},"Previously: parallel sub-agents, leases, grounded verification. The harness is capable but opaque. A failed run tells you the final error but nothing about which sub-agent burned tokens, which tool call took 12 seconds, which compaction event dropped what the final agent wanted.",{"title":90,"path":91,"stem":92,"description":4872,"children":-1},"Previously: evals measure correctness. Nothing in the harness caps spend. 
The $47K agent-loop incident (DEV Community, Nov 2025) was two agents ping-ponging requests for eleven days; alerts fired, no one stopped them. Alerts are not enforcement.",1776848986862]