Previously: parallel sub-agents, leases, grounded verification. The harness is capable but opaque. A failed run tells you the final error but nothing about which sub-agent burned tokens, which tool call took 12 seconds, or which compaction event dropped context the final answer needed.
Observability for agents is not the same shape as observability for typical web services. Request/response latency matters, but so does the entire trajectory: which tools fired in which order, how much each cost, when compaction ran, which sub-agents spawned, what the model's output was at each turn. A failed run needs a timeline, not a metric.
This chapter adds OpenTelemetry-based instrumentation to the harness. By the end, every operation emits spans carrying:

- the GenAI semantic-convention attributes: gen_ai.system, gen_ai.usage.input_tokens, etc.
- correlation attributes: session_id, task_id, and agent_id, so per-agent cost attribution is possible.

The GenAI semantic conventions are still marked experimental by OpenTelemetry as of April 2026, but they're stable enough to build on — the major observability platforms (Datadog, Langfuse, Braintrust) have adopted them.
Three reasons to pick OTel over ad-hoc logs, and one piece of foundational context before the reasons: distributed tracing as a discipline grew out of Sigelman et al.'s 2010 paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," which documented the internal Google system that correlated billions of RPCs per day across thousands of microservices by passing a trace context through every call. Every modern tracing system — Zipkin, Jaeger, OpenTelemetry itself — is a descendant of Dapper's design. The GenAI semantic conventions this chapter uses are the agent-specific attributes bolted on top of the same underlying trace-and-span model.
Traces, not log lines. A trace with spans preserves the parent-child relationship between operations. You see "this LLM call happened inside this turn, which happened inside this sub-agent, which happened inside the parent task." With flat logs you reconstruct this by grepping; with traces it's free.
Standardized attributes. When you tag gen_ai.usage.input_tokens consistently, every observability platform knows what it means. Your dashboards, alerts, and downstream cost analysis don't have to speak your local vocabulary.
Exporters are pluggable. Console for quick iteration, Jaeger for local dev, Langfuse/Datadog/Honeycomb for production. The harness code doesn't change; the exporter does.
Add the dependencies:
uv add 'opentelemetry-api>=1.27' 'opentelemetry-sdk>=1.27' 'opentelemetry-exporter-otlp>=1.27'
A thin wrapper that gives the rest of the harness a stable interface. We don't scatter tracer.start_as_current_span(...) through every module; we put it behind a small API.
# src/harness/observability/tracing.py
from __future__ import annotations

from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator
from uuid import uuid4

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import Status, StatusCode, Tracer

_provider: TracerProvider | None = None


def setup_tracing(
    service_name: str = "agent-harness",
    exporter: SpanProcessor | None = None,
) -> None:
    """Initialize OTel once per process."""
    global _provider
    if _provider is not None:
        return
    resource = Resource.create({"service.name": service_name})
    _provider = TracerProvider(resource=resource)
    _provider.add_span_processor(
        exporter or BatchSpanProcessor(ConsoleSpanExporter())
    )
    trace.set_tracer_provider(_provider)


@dataclass
class SessionContext:
    session_id: str = field(default_factory=lambda: str(uuid4()))
    task_id: str = field(default_factory=lambda: str(uuid4()))
    agent_id: str = "root"

    def subagent(self, agent_id: str) -> "SessionContext":
        return SessionContext(
            session_id=self.session_id,
            task_id=self.task_id,
            agent_id=agent_id,
        )


def tracer() -> Tracer:
    return trace.get_tracer("harness")


@contextmanager
def span(name: str, ctx: SessionContext, **attrs) -> Iterator[trace.Span]:
    t = tracer()
    with t.start_as_current_span(name) as s:
        s.set_attribute("harness.session_id", ctx.session_id)
        s.set_attribute("harness.task_id", ctx.task_id)
        s.set_attribute("harness.agent_id", ctx.agent_id)
        for k, v in attrs.items():
            s.set_attribute(k, v)
        try:
            yield s
        except Exception as e:
            s.set_status(Status(StatusCode.ERROR, str(e)))
            s.record_exception(e)
            raise
Three design points.
SessionContext as the correlation anchor. Every span in a session shares its session_id and task_id. Sub-agents have a different agent_id but inherit session and task. This is how you produce "per-agent cost attribution" — downstream, you group by agent_id over fixed session_id.
span() is a context manager. The pattern every instrumented operation follows: with span("name", ctx): .... It handles error propagation, attribute setting, and OTel lifecycle consistently.
One-time setup. setup_tracing() is idempotent; repeated calls are no-ops. This matters when a harness runs inside an already-instrumented process (a CI runner, a web service).
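The inheritance rule behind the correlation anchor is easy to check in isolation. The sketch below re-declares a minimal SessionContext (mirroring the one in tracing.py, so it runs without OpenTelemetry installed) and verifies that a sub-agent context shares the parent's session and task IDs:

```python
from dataclasses import dataclass, field
from uuid import uuid4

# Minimal stand-in for the harness SessionContext, re-declared here so the
# sketch is self-contained and needs no OTel dependency.
@dataclass
class SessionContext:
    session_id: str = field(default_factory=lambda: str(uuid4()))
    task_id: str = field(default_factory=lambda: str(uuid4()))
    agent_id: str = "root"

    def subagent(self, agent_id: str) -> "SessionContext":
        # Inherit session and task; only the agent identity changes.
        return SessionContext(self.session_id, self.task_id, agent_id)

root = SessionContext()
sub = root.subagent("sub-researcher")

assert sub.session_id == root.session_id  # shared correlation anchor
assert sub.task_id == root.task_id
assert sub.agent_id == "sub-researcher"   # distinct identity for attribution
```

Grouping spans by agent_id within one session_id is exactly the query the cost dashboard runs later.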
The loop gets three span types, roughly corresponding to the three things happening: the overall run, each turn, and each tool call.
# src/harness/agent.py (observability-aware version, sketch)
from .observability.tracing import SessionContext, span

async def arun(
    # ... existing parameters (including `transcript: Transcript | None = None`
    # from Chapter 5's chat-continuity upgrade)
    session_context: SessionContext | None = None,
) -> str:
    ctx = session_context or SessionContext()
    with span("agent.run", ctx,
              **{"harness.initial_user_message_len": len(user_message)}) as s:
        if transcript is None:
            transcript = Transcript(system=system)
        transcript.append(Message.user_text(user_message))
        # ... existing setup
        for iteration in range(MAX_ITERATIONS):
            with span("agent.turn", ctx,
                      **{"harness.iteration": iteration}) as turn_span:
                # compaction span
                snapshot = accountant.snapshot(transcript, tools=registry.schemas())
                turn_span.set_attribute("harness.context_utilization",
                                        snapshot.utilization)
                if snapshot.state == "red":
                    with span("agent.compact", ctx):
                        await compactor.compact_if_needed(transcript,
                                                          registry.schemas())
                # LLM call span
                with span("gen_ai.completion", ctx,
                          **{"gen_ai.system": provider.name}) as llm_span:
                    response = await _one_turn(provider, registry,
                                               transcript, on_event)
                    llm_span.set_attribute(
                        "gen_ai.usage.input_tokens", response.input_tokens)
                    llm_span.set_attribute(
                        "gen_ai.usage.output_tokens", response.output_tokens)
                if response.is_final:
                    s.set_attribute("harness.final_iteration", iteration)
                    transcript.append(Message.from_assistant_response(response))
                    return response.text or ""
                # Commit one assistant message with every ToolCall block.
                transcript.append(Message.from_assistant_response(response))
                # One tool.call span per dispatched call (batched turns get N spans).
                for ref in response.tool_calls:
                    with span("tool.call", ctx,
                              **{"tool.name": ref.name}) as tool_span:
                        result = await registry.adispatch(ref.name, ref.args, ref.id)
                        tool_span.set_attribute("tool.is_error", result.is_error)
                        tool_span.set_attribute("tool.result_chars", len(result.content))
                        transcript.append(Message.tool_result(result))
        # ...
The spans nest naturally: agent.run contains agent.turns which contain gen_ai.completion and tool.call spans. A trace visualizer shows this as a flame chart — you see at a glance which turn took the time, which tool call was slow, which iteration hit compaction.
Sub-agents get their own SessionContext but share the parent's session_id and task_id:
# src/harness/subagents/spawner.py (observability-aware)
async def spawn(self, spec: SubagentSpec, ...) -> SubagentResult:
    parent_ctx = get_current_session_context()  # from a context var
    sub_ctx = parent_ctx.subagent(agent_id=f"sub-{uuid4().hex[:8]}")
    with span("subagent.spawn", sub_ctx,
              **{"subagent.objective_preview": spec.objective[:200],
                 "subagent.tools_allowed": ",".join(spec.tools_allowed)}):
        # ... run sub-agent with sub_ctx propagated into arun
Under the hood, this uses Python's contextvars to propagate the context through the call stack — trace.get_current_span() would give us the OTel parent, but we want the harness-specific SessionContext too. We add a contextvar for it:
# src/harness/observability/context.py
from contextvars import ContextVar

from .tracing import SessionContext

_current: ContextVar[SessionContext | None] = ContextVar("session_ctx", default=None)


def set_current(ctx: SessionContext) -> None:
    _current.set(ctx)


def get_current_session_context() -> SessionContext:
    ctx = _current.get()
    if ctx is None:
        raise RuntimeError("no session context; call arun with one")
    return ctx
The loop sets _current at the start; sub-agent runs re-set it when they begin; OTel spans carry the context via the span() helper.
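A stdlib-only illustration of why contextvars is the right mechanism here: a value set in the parent coroutine is visible inside a task spawned afterwards, because asyncio snapshots the contextvars context when the task is created. The variable name and value below are illustrative:

```python
import asyncio
from contextvars import ContextVar

_current: ContextVar[str] = ContextVar("session_ctx", default="<unset>")

async def subagent_body() -> str:
    # Sees the parent's value: the task captured a copy of the context
    # at creation time, after the parent called set().
    return _current.get()

async def parent() -> str:
    _current.set("session-abc")  # like arun installing the SessionContext
    return await asyncio.create_task(subagent_body())

result = asyncio.run(parent())
print(result)  # session-abc
```

If the sub-agent then calls set() on its own copy, the parent's value is untouched, which is exactly the isolation sub-agent runs need.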
Run any example in the book with the console exporter enabled:
# prepend to any main() function
from harness.observability.tracing import setup_tracing
setup_tracing()
You get output like:
{
  "name": "agent.run",
  "trace_id": "...",
  "span_id": "a1b2c3d4",
  "attributes": {
    "service.name": "agent-harness",
    "harness.session_id": "s-xyz",
    "harness.task_id": "t-abc",
    "harness.agent_id": "root",
    "harness.final_iteration": 7
  },
  "duration_ms": 12340
}
{
  "name": "agent.turn",
  "parent_id": "a1b2c3d4",
  ...
}
{
  "name": "gen_ai.completion",
  "parent_id": "...",
  "attributes": {
    "gen_ai.system": "anthropic",
    "gen_ai.usage.input_tokens": 3421,
    "gen_ai.usage.output_tokens": 188
  },
  "duration_ms": 1230
}
...
Now point the exporter at a real backend:
# Langfuse, Braintrust, Datadog, Honeycomb, Jaeger all accept OTLP
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
setup_tracing(
    exporter=BatchSpanProcessor(OTLPSpanExporter(
        endpoint="https://your-backend/v1/traces",
        headers={"Authorization": "Bearer ..."},
    ))
)
Every run of the harness now produces structured traces you can visualize, filter, and compare. A regression investigation looks like: find the slow trace, expand the span tree, see the tool call that took 12 seconds, look at its arguments.
From the span attributes above, production dashboards usually show:
- Tokens per agent: gen_ai.usage.input_tokens + gen_ai.usage.output_tokens grouped by harness.agent_id. A spike in one sub-agent is visible.
- Tool error rate: tool.call spans with tool.is_error = true. A sustained rise is a regression signal.
- Compaction frequency: agent.compact spans per session. If compaction fires every turn, something is wrong with either your tool output sizes or your budget thresholds.
- Sub-agent outcomes: subagent.spawn spans with error vs. success. If a specific sub-agent objective starts failing more, you're seeing prompt or model drift.

These are queries over your tracing backend, not custom code in the harness. The harness emits; the platform aggregates.
The DEV Community 2025 post "Your AI Agent Spent $500 Overnight and Nobody Noticed" was about a team that got a billing alert with no diagnostic signal. No per-agent attribution; they couldn't tell which agent ran away.
Our instrumentation answers that. Every gen_ai.completion span carries harness.agent_id. A production dashboard query:
SELECT harness_agent_id, SUM(input_tokens + output_tokens) AS total_tokens
FROM spans
WHERE span_name = 'gen_ai.completion'
AND timestamp > NOW() - INTERVAL '1 hour'
GROUP BY harness_agent_id
ORDER BY total_tokens DESC
A runaway agent shows up immediately — one agent_id with 10× the tokens of any other. Alert on it; kill the session manually. Chapter 20 automates the kill.
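The same query in miniature, over spans represented as plain dicts. The field names and token counts are hypothetical, not a real backend schema; the shape of the aggregation is what matters:

```python
from collections import defaultdict

# Hypothetical exported spans; a real backend stores these server-side.
spans = [
    {"name": "gen_ai.completion", "agent_id": "root",         "input_tokens": 3000,  "output_tokens": 200},
    {"name": "gen_ai.completion", "agent_id": "sub-research", "input_tokens": 40000, "output_tokens": 1500},
    {"name": "gen_ai.completion", "agent_id": "sub-research", "input_tokens": 42000, "output_tokens": 1600},
    {"name": "tool.call",         "agent_id": "root"},  # ignored: not an LLM span
]

# SUM(input_tokens + output_tokens) ... GROUP BY agent_id
totals: dict[str, int] = defaultdict(int)
for s in spans:
    if s["name"] == "gen_ai.completion":
        totals[s["agent_id"]] += s["input_tokens"] + s["output_tokens"]

# ORDER BY total_tokens DESC: the runaway agent surfaces at the top.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # ('sub-research', 85100)
```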
git add -A && git commit -m "ch18: OpenTelemetry instrumentation with GenAI semantic conventions"
git tag ch18-observability
Exercise: run Jaeger locally (docker run -p 16686:16686 jaegertracing/all-in-one). Point the harness OTLP exporter at it. Run a multi-agent scenario. Open http://localhost:16686 and explore the trace tree.

Chapter 17. Parallelism and Shared State
Previously: structured plans with evidence-backed completion. Sub-agents still run sequentially. The payoff for sub-agents comes from running them in parallel — the Anthropic multi-agent finding of 90%+ improvement over single-agent baselines rests on parallelism plus independent context windows, not sub-agents in series.
Chapter 19. Evals
Previously: observability — every operation in the harness emits a structured span, per-agent cost attribution works, dashboards show drift. Observability says what happened; it doesn't say whether what happened was right.