Previously: parallel sub-agents, leases, grounded verification. The harness is capable but opaque. A failed run tells you the final error but nothing about which sub-agent burned tokens, which tool call took 12 seconds, or which compaction event dropped context the final answer needed.
Observability for agents is not the same shape as observability for typical web services. Request/response latency matters, but so does the entire trajectory: which tools fired in which order, how much each cost, when compaction ran, which sub-agents spawned, what the model's output was at each turn. A failed run needs a timeline, not a metric.
This chapter adds OpenTelemetry-based instrumentation to the harness. By the end, every operation emits spans carrying:

- the GenAI semantic-convention attributes: gen_ai.system, gen_ai.usage.input_tokens, etc.
- correlation attributes: session_id, task_id, and agent_id, so per-agent cost attribution is possible.

The GenAI semantic conventions are still marked experimental by OpenTelemetry as of April 2026, but they're stable enough to build on — the major observability platforms (Datadog, Langfuse, Braintrust) have adopted them.
Three reasons to pick OTel over ad-hoc logs, and one piece of foundational context before the reasons: distributed tracing as a discipline grew out of Sigelman et al.'s 2010 paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," which documented the internal Google system that correlated billions of RPCs per day across thousands of microservices by passing a trace context through every call. Every modern tracing system — Zipkin, Jaeger, OpenTelemetry itself — is a descendant of Dapper's design. The GenAI semantic conventions this chapter uses are the agent-specific attributes bolted on top of the same underlying trace-and-span model.
Traces, not log lines. A trace with spans preserves the parent-child relationship between operations. You see "this LLM call happened inside this turn, which happened inside this sub-agent, which happened inside the parent task." With flat logs you reconstruct this by grepping; with traces it's free.
Standardized attributes. When you tag gen_ai.usage.input_tokens consistently, every observability platform knows what it means. Your dashboards, alerts, and downstream cost analysis don't have to speak your local vocabulary.
Exporters are pluggable. Console for quick iteration, Jaeger for local dev, Langfuse/Datadog/Honeycomb for production. The harness code doesn't change; the exporter does.
Add the dependencies:
uv add 'opentelemetry-api>=1.27' 'opentelemetry-sdk>=1.27' 'opentelemetry-exporter-otlp>=1.27'
A thin wrapper that gives the rest of the harness a stable interface. We don't scatter tracer.start_as_current_span(...) through every module; we put it behind a small API.
# src/harness/observability/tracing.py
from __future__ import annotations

from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator
from uuid import uuid4

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import Status, StatusCode, Tracer

_provider: TracerProvider | None = None


def setup_tracing(
    service_name: str = "agent-harness",
    exporter: SpanProcessor | None = None,
) -> None:
    """Initialize OTel once per process."""
    global _provider
    if _provider is not None:
        return
    resource = Resource.create({"service.name": service_name})
    _provider = TracerProvider(resource=resource)
    _provider.add_span_processor(
        exporter or BatchSpanProcessor(ConsoleSpanExporter())
    )
    trace.set_tracer_provider(_provider)


@dataclass
class SessionContext:
    session_id: str = field(default_factory=lambda: str(uuid4()))
    task_id: str = field(default_factory=lambda: str(uuid4()))
    agent_id: str = "root"

    def subagent(self, agent_id: str) -> "SessionContext":
        return SessionContext(
            session_id=self.session_id,
            task_id=self.task_id,
            agent_id=agent_id,
        )


def tracer() -> Tracer:
    return trace.get_tracer("harness")


@contextmanager
def span(name: str, ctx: SessionContext, **attrs) -> Iterator[trace.Span]:
    t = tracer()
    with t.start_as_current_span(name) as s:
        s.set_attribute("harness.session_id", ctx.session_id)
        s.set_attribute("harness.task_id", ctx.task_id)
        s.set_attribute("harness.agent_id", ctx.agent_id)
        for k, v in attrs.items():
            s.set_attribute(k, v)
        try:
            yield s
        except Exception as e:
            s.set_status(Status(StatusCode.ERROR, str(e)))
            s.record_exception(e)
            raise
Three design points.
SessionContext as the correlation anchor. Every span in a session shares its session_id and task_id. Sub-agents have a different agent_id but inherit session and task. This is how you produce "per-agent cost attribution" — downstream, you group by agent_id over fixed session_id.
span() is a context manager. The pattern every instrumented operation follows: with span("name", ctx): .... It handles error propagation, attribute setting, and OTel lifecycle consistently.
One-time setup. setup_tracing() is idempotent; repeated calls are no-ops. This matters when a harness runs inside an already-instrumented process (a CI runner, a web service).
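The inheritance rule behind the correlation anchor is easy to check in isolation. The sketch below re-declares a minimal SessionContext (mirroring the one in tracing.py, so it runs without OpenTelemetry installed) and verifies that a sub-agent context shares the parent's session and task IDs:

```python
from dataclasses import dataclass, field
from uuid import uuid4

# Minimal stand-in for the harness SessionContext, re-declared here so the
# sketch is self-contained and needs no OTel dependency.
@dataclass
class SessionContext:
    session_id: str = field(default_factory=lambda: str(uuid4()))
    task_id: str = field(default_factory=lambda: str(uuid4()))
    agent_id: str = "root"

    def subagent(self, agent_id: str) -> "SessionContext":
        # Inherit session and task; only the agent identity changes.
        return SessionContext(self.session_id, self.task_id, agent_id)

root = SessionContext()
sub = root.subagent("sub-researcher")

assert sub.session_id == root.session_id  # shared correlation anchor
assert sub.task_id == root.task_id
assert sub.agent_id == "sub-researcher"   # distinct identity for attribution
```

Grouping spans by agent_id within one session_id is exactly the query the cost dashboard runs later.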
The loop gets three span types, roughly corresponding to the three things happening: the overall run, each turn, and each tool call.
# src/harness/agent.py (observability-aware version, sketch)
from .observability.tracing import SessionContext, span

async def arun(
    # ... existing parameters (including `transcript: Transcript | None = None`
    # from Chapter 5's chat-continuity upgrade)
    session_context: SessionContext | None = None,
) -> str:
    ctx = session_context or SessionContext()
    with span("agent.run", ctx,
              **{"harness.initial_user_message_len": len(user_message)}) as s:
        if transcript is None:
            transcript = Transcript(system=system)
        transcript.append(Message.user_text(user_message))
        # ... existing setup
        for iteration in range(MAX_ITERATIONS):
            with span("agent.turn", ctx,
                      **{"harness.iteration": iteration}) as turn_span:
                # compaction span
                snapshot = accountant.snapshot(transcript, tools=registry.schemas())
                turn_span.set_attribute("harness.context_utilization",
                                        snapshot.utilization)
                if snapshot.state == "red":
                    with span("agent.compact", ctx):
                        await compactor.compact_if_needed(transcript,
                                                          registry.schemas())
                # LLM call span
                with span("gen_ai.completion", ctx,
                          **{"gen_ai.system": provider.name}) as llm_span:
                    response = await _one_turn(provider, registry,
                                               transcript, on_event)
                    llm_span.set_attribute(
                        "gen_ai.usage.input_tokens", response.input_tokens)
                    llm_span.set_attribute(
                        "gen_ai.usage.output_tokens", response.output_tokens)
                if response.is_final:
                    s.set_attribute("harness.final_iteration", iteration)
                    transcript.append(Message.from_assistant_response(response))
                    return response.text or ""
                # Commit one assistant message with every ToolCall block.
                transcript.append(Message.from_assistant_response(response))
                # One tool.call span per dispatched call (batched turns get N spans).
                for ref in response.tool_calls:
                    with span("tool.call", ctx,
                              **{"tool.name": ref.name}) as tool_span:
                        result = await registry.adispatch(ref.name, ref.args, ref.id)
                        tool_span.set_attribute("tool.is_error", result.is_error)
                        tool_span.set_attribute("tool.result_chars", len(result.content))
                        transcript.append(Message.tool_result(result))
        # ...
The spans nest naturally: agent.run contains agent.turns which contain gen_ai.completion and tool.call spans. A trace visualizer shows this as a flame chart — you see at a glance which turn took the time, which tool call was slow, which iteration hit compaction.
Sub-agents get their own SessionContext but share the parent's session_id and task_id:
# src/harness/subagents/spawner.py (observability-aware)
async def spawn(self, spec: SubagentSpec, ...) -> SubagentResult:
    parent_ctx = get_current_session_context()  # from a context var
    sub_ctx = parent_ctx.subagent(agent_id=f"sub-{uuid4().hex[:8]}")
    with span("subagent.spawn", sub_ctx,
              **{"subagent.objective_preview": spec.objective[:200],
                 "subagent.tools_allowed": ",".join(spec.tools_allowed)}):
        # ... run sub-agent with sub_ctx propagated into arun
Under the hood, this uses Python's contextvars to propagate the context through the call stack — trace.get_current_span() would give us the OTel parent, but we want the harness-specific SessionContext too. We add a contextvar for it:
# src/harness/observability/context.py
from contextvars import ContextVar

from .tracing import SessionContext

_current: ContextVar[SessionContext | None] = ContextVar("session_ctx", default=None)


def set_current(ctx: SessionContext) -> None:
    _current.set(ctx)


def get_current_session_context() -> SessionContext:
    ctx = _current.get()
    if ctx is None:
        raise RuntimeError("no session context; call arun with one")
    return ctx
The loop sets _current at the start; sub-agent runs re-set it when they begin; OTel spans carry the context via the span() helper.
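A stdlib-only illustration of why contextvars is the right mechanism here: a value set in the parent coroutine is visible inside a task spawned afterwards, because asyncio snapshots the contextvars context when the task is created. The variable name and value below are illustrative:

```python
import asyncio
from contextvars import ContextVar

_current: ContextVar[str] = ContextVar("session_ctx", default="<unset>")

async def subagent_body() -> str:
    # Sees the parent's value: the task captured a copy of the context
    # at creation time, after the parent called set().
    return _current.get()

async def parent() -> str:
    _current.set("session-abc")  # like arun installing the SessionContext
    return await asyncio.create_task(subagent_body())

result = asyncio.run(parent())
print(result)  # session-abc
```

If the sub-agent then calls set() on its own copy, the parent's value is untouched, which is exactly the isolation sub-agent runs need.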
Run any example in the book with the console exporter enabled:
# prepend to any main() function
from harness.observability.tracing import setup_tracing
setup_tracing()
You get output like:
{
  "name": "agent.run",
  "trace_id": "...",
  "span_id": "a1b2c3d4",
  "attributes": {
    "service.name": "agent-harness",
    "harness.session_id": "s-xyz",
    "harness.task_id": "t-abc",
    "harness.agent_id": "root",
    "harness.final_iteration": 7
  },
  "duration_ms": 12340
}
{
  "name": "agent.turn",
  "parent_id": "a1b2c3d4",
  ...
}
{
  "name": "gen_ai.completion",
  "parent_id": "...",
  "attributes": {
    "gen_ai.system": "anthropic",
    "gen_ai.usage.input_tokens": 3421,
    "gen_ai.usage.output_tokens": 188
  },
  "duration_ms": 1230
}
...
Now point the exporter at a real backend:
# Langfuse, Braintrust, Datadog, Honeycomb, Jaeger all accept OTLP
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
setup_tracing(
    exporter=BatchSpanProcessor(OTLPSpanExporter(
        endpoint="https://your-backend/v1/traces",
        headers={"Authorization": "Bearer ..."},
    ))
)
Every run of the harness now produces structured traces you can visualize, filter, and compare. A regression investigation looks like: find the slow trace, expand the span tree, see the tool call that took 12 seconds, look at its arguments.
From the span attributes above, production dashboards usually show:
- Tokens per agent: gen_ai.usage.input_tokens + gen_ai.usage.output_tokens grouped by harness.agent_id. A spike in one sub-agent is visible.
- Tool error rate: tool.call spans with tool.is_error = true. A sustained rise is a regression signal.
- Compaction frequency: agent.compact spans per session. If compaction fires every turn, something is wrong with either your tool output sizes or your budget thresholds.
- Sub-agent outcomes: subagent.spawn spans with error vs. success. If a specific sub-agent objective starts failing more, you're seeing prompt or model drift.

These are queries over your tracing backend, not custom code in the harness. The harness emits; the platform aggregates.
The DEV Community 2025 post "Your AI Agent Spent $500 Overnight and Nobody Noticed" was about a team that got a billing alert with no diagnostic signal. No per-agent attribution; they couldn't tell which agent ran away.
Our instrumentation answers that. Every gen_ai.completion span carries harness.agent_id. A production dashboard query:
SELECT harness_agent_id, SUM(input_tokens + output_tokens) AS total_tokens
FROM spans
WHERE span_name = 'gen_ai.completion'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY harness_agent_id
ORDER BY total_tokens DESC
A runaway agent shows up immediately — one agent_id with 10× the tokens of any other. Alert on it; kill the session manually. Chapter 20 automates the kill.
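The same query in miniature, over spans represented as plain dicts. The field names and token counts are hypothetical, not a real backend schema; the shape of the aggregation is what matters:

```python
from collections import defaultdict

# Hypothetical exported spans; a real backend stores these server-side.
spans = [
    {"name": "gen_ai.completion", "agent_id": "root",         "input_tokens": 3000,  "output_tokens": 200},
    {"name": "gen_ai.completion", "agent_id": "sub-research", "input_tokens": 40000, "output_tokens": 1500},
    {"name": "gen_ai.completion", "agent_id": "sub-research", "input_tokens": 42000, "output_tokens": 1600},
    {"name": "tool.call",         "agent_id": "root"},  # ignored: not an LLM span
]

# SUM(input_tokens + output_tokens) ... GROUP BY agent_id
totals: dict[str, int] = defaultdict(int)
for s in spans:
    if s["name"] == "gen_ai.completion":
        totals[s["agent_id"]] += s["input_tokens"] + s["output_tokens"]

# ORDER BY total_tokens DESC: the runaway agent surfaces at the top.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # ('sub-research', 85100)
```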
git add -A && git commit -m "ch18: OpenTelemetry instrumentation with GenAI semantic conventions"
git tag ch18-observability
Exercise: run Jaeger locally (docker run -p 16686:16686 jaegertracing/all-in-one). Point the harness OTLP exporter at it. Run a multi-agent scenario. Open http://localhost:16686 and explore the trace tree.

Chapter 17. Parallelism and Shared State
Previously: structured plans with evidence-backed completion. Sub-agents still run sequentially. The payoff for sub-agents comes from running them in parallel — the Anthropic multi-agent finding of 90%+ improvement over single-agent baselines rests on parallelism plus independent context windows, not sub-agents in series.
Chapter 19. Evals
Previously: observability — every operation in the harness emits a structured span, per-agent cost attribution works, dashboards show drift. Observability says what happened; it doesn't say whether what happened was right.