An engineer spent half a day debugging a tool handler that wasn't broken. The model had emitted a tool name with no arguments. The handler never ran. Standard monitoring couldn't tell them that. From the request log it looked like the tool failed; from the handler's perspective it had never been called. The hours went into reading code that was working correctly.
That's the failure shape application monitoring tools weren't built for. Datadog and friends are tuned for the question "did the endpoint return 500," which is still useful for an agent runtime. But for an agent run, the interesting questions look different:
- Did the model decide to call a tool?
- Did it form the tool call correctly?
- Did the tool gate allow the call?
- Did the handler execute?
- Did the handler's output return to the model?
- Did the model recover, retry, or stop too early?
- Did a prompt, model, provider, tool schema, or context change cause the behavior?
Logs alone don't answer those. You need to preserve enough of the agent's decision surface that a human can reconstruct why it did what it did. The trace UI is built around exactly that, and the empty-args story above is what convinced us we had to build it ourselves.
Generation vs. tool observation: the distinction that matters
Most observability tools collapse "what the model said" and "what actually happened." They shouldn't. We split them.
A generation is the model-facing observation: prompt, messages, available tools, stream timing, token estimates, output, and any tool calls the model emitted. This is where we see the model's intent.
A tool observation is what happened after. Did the gate allow the call? Did the handler succeed? Did the tool return malformed output? Each is a different category, classified at observation time.
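In code, the split is roughly two observation shapes. This is a minimal sketch with illustrative field names, not the runtime's actual schema:

```typescript
// Sketch of the generation / tool split. Field names are illustrative assumptions.

// What the model said: prompt, the tools it could see, and the calls it emitted.
interface GenerationObservation {
  kind: "generation";
  promptHash: string;                               // content hash of the system prompt
  messages: { role: string; content: string }[];
  availableTools: string[];
  emittedToolCalls: { name: string; args: unknown }[];
  outputText: string;
  streamMs: number;
}

// What actually happened after: gate decision, handler result, output validity.
interface ToolObservation {
  kind: "tool";
  toolName: string;
  gateDecision: "allowed" | "denied";
  handlerResult: "ok" | "error" | "malformed_output" | "not_run";
  durationMs: number;
}

type Observation = GenerationObservation | ToolObservation;
```

Keeping them as separate types means a missing argument shows up on the generation side, never as a phantom handler failure.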
The reason this matters: if a model emits a tool name with no arguments, that's not a tool handler failure. It's a generation/tool-call-formation failure. Debug it as a handler failure and you'll spin for hours reading code that works. The trace UI surfaces the boundary explicitly.
The failure taxonomy
Failures in agent runtimes don't fit "200 / 500" cleanly. We classify into specific categories at observation time:
- Empty args: the model emitted a tool name but no arguments. Generation-side.
- Schema validation: the arguments didn't match the tool schema. Generation-side, but caught at the gate.
- Policy denial: the gate denied the call based on allowlist / budget / approval / scope / SSRF / injection / trust. Each denial type is its own category.
- Handler error: the tool ran and threw an exception. The kind of error standard monitoring is good at.
- Handler malformed output: the tool ran and returned, but the output didn't match the expected schema for the next step.
- Provider error: the model provider returned an error. Routes to the health monitor and retry / failover.
- Timeout: queue timeout, stream timeout, or stalled generation. Surfaced separately from provider error because the right next action differs: a timeout is usually about queue load or stream stability, not the provider.
Each of these has a different fix. Conflating them under "error" makes debugging into guesswork. Surfacing them explicitly makes the right next action obvious.
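One way to make that classification concrete, done once at observation time so the trace UI never has to infer a category later. The record shape and helper below are hypothetical; only the category names mirror the list above:

```typescript
type FailureCategory =
  | "empty_args"
  | "schema_validation"
  | "policy_denial"
  | "handler_error"
  | "handler_malformed_output"
  | "provider_error"
  | "timeout";

// Hypothetical record of what the runtime knows when it writes a tool observation.
interface ToolCallRecord {
  args: Record<string, unknown> | null;
  argsValid: boolean;      // did the arguments match the tool schema?
  gateAllowed: boolean;
  handlerThrew: boolean;
  outputValid: boolean;    // did the output match the schema the next step expects?
  providerError: boolean;
  timedOut: boolean;
}

// Classify once, at observation time, so "error" never has to be disambiguated later.
function classify(rec: ToolCallRecord): FailureCategory | "ok" {
  if (rec.timedOut) return "timeout";
  if (rec.providerError) return "provider_error";
  if (rec.args === null || Object.keys(rec.args).length === 0) {
    return "empty_args";                           // generation-side
  }
  if (!rec.argsValid) return "schema_validation";  // generation-side, caught at the gate
  if (!rec.gateAllowed) return "policy_denial";
  if (rec.handlerThrew) return "handler_error";
  if (!rec.outputValid) return "handler_malformed_output";
  return "ok";
}
```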
The flame graph
We borrowed the flame graph metaphor from CPU profiling, with one important difference: spans aren't just "function called other function." Spans represent agent decision boundaries: model generation, tool call, handler execution, gate decision, retry.
Each span has input, output, metadata, and a diagnosis tab. You click into a span and see exactly what flowed into it and what came out. For generation spans, that's the prompt + the model output. For tool spans, that's the call + the gate decision + the handler result. The diagnosis tab is for failure cases: it surfaces the category, the specific reason, and any retry chain that followed.
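A sketch of the span record behind that view; the field names and nesting are assumptions for illustration:

```typescript
type SpanKind = "generation" | "tool_call" | "handler" | "gate_decision" | "retry";

interface Diagnosis {
  category: string;        // one of the failure-taxonomy categories
  reason: string;          // the specific reason surfaced in the diagnosis tab
  retryChain: string[];    // ids of the retry spans that followed
}

interface Span {
  id: string;
  parentId: string | null; // null for the root; parent links give the flame-graph nesting
  kind: SpanKind;
  startMs: number;
  endMs: number;
  input: unknown;          // prompt for a generation span, the call for a tool span
  output: unknown;         // model output, or gate decision + handler result
  metadata: Record<string, string>;
  diagnosis?: Diagnosis;   // only present on failure spans
}
```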
Prompt versioning: the move that paid off most
Every system prompt gets a content hash. Two traces using the same prompt show the same hash. Two traces with different prompts can be diffed against each other.
This sounds boring until you've debugged the question "why did this agent's behavior change last Tuesday." With prompt versioning, the answer is two clicks: open Tuesday's trace, open Monday's trace, diff. The change in the prompt is visible and the resulting behavior change is obvious.
Without prompt versioning, that question takes hours and the answer is usually "we don't know." The trace data exists but you can't reproduce the comparison. Adding the hash to the storage schema took a day; the debugging time it saves is hard to overstate.
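The hashing step itself is a few lines. A sketch using Node's built-in crypto; the algorithm choice, normalization rule, and truncated digest are assumptions made for this sketch, the point is that identical prompts get identical version ids:

```typescript
import { createHash } from "node:crypto";

// Normalize then hash, so whitespace-only edits don't register as a new prompt version.
function promptVersion(systemPrompt: string): string {
  const normalized = systemPrompt.replace(/\r\n/g, "\n").trim();
  return createHash("sha256").update(normalized, "utf8").digest("hex").slice(0, 12);
}

// Same prompt content, same version id: Monday's and Tuesday's traces can be diffed by hash.
console.log(promptVersion("You are a careful assistant..."));
```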
URL-driven state
Every trace, span, filter, and time range encodes in the URL. Share a debugging session by sharing the link. No "click these five things to reproduce." This is a small UX choice that turned out to be the thing engineers like most about the trace UI.
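A sketch of the round trip, using standard URLSearchParams; the parameter names are illustrative, not the trace UI's actual URL scheme:

```typescript
// Illustrative view state for a debugging session.
interface TraceViewState {
  traceId: string;
  spanId?: string;
  filter?: string;  // e.g. a failure-taxonomy category
  from?: string;    // ISO timestamps for the time range
  to?: string;
}

// Encode the current debugging view so the link alone reproduces it.
function toUrl(base: string, s: TraceViewState): string {
  const q = new URLSearchParams({ trace: s.traceId });
  if (s.spanId) q.set("span", s.spanId);
  if (s.filter) q.set("filter", s.filter);
  if (s.from) q.set("from", s.from);
  if (s.to) q.set("to", s.to);
  return `${base}?${q.toString()}`;
}

// Decode it back on page load.
function fromUrl(url: string): TraceViewState {
  const q = new URL(url).searchParams;
  return {
    traceId: q.get("trace") ?? "",
    spanId: q.get("span") ?? undefined,
    filter: q.get("filter") ?? undefined,
    from: q.get("from") ?? undefined,
    to: q.get("to") ?? undefined,
  };
}
```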
What we punted on
Automated regression detection on prompts and alerting on failure-taxonomy categories are both on the roadmap. Today the trace UI is a debug surface, not an alerting surface; an engineer notices behavior changed and goes looking. The data exists for both layers; the workflow to make them automatic doesn't yet.
The takeaway
Generic app monitoring works for web services. It does not work for agent runtimes. Build observation around the model-vs-tool boundary. Classify failures into specific categories at observation time. Version every prompt. Put it all in URLs you can share. Then the trace stops being a place engineers visit when something breaks, and starts being how they understand what the pack just did.