Capability · Observability layer

A flame graph for agent execution. Failure taxonomy. Replay.

Application monitoring is built around requests, logs, errors, and latency. Agent monitoring needs more. PACKWOLF preserves the decision surface (model intent vs. tool execution vs. handler outcome) so a human can understand why an agent did what it did, not just what happened.

Per-span flame graph · Versioned, diffable prompts · Replayable run events · Keyboard nav (j/k/Enter)

[Screenshot: packwolf.app · Observability]
The activity trace. Every span, every input, every output, every failure category. The full execution graph of an agent run.
What it actually does

The parts that make this work.

Trace, not just log.

Each agent run is a trace: a coherent set of observations grouped by runId / conversationId / sessionId. You see the chain of decisions as one timeline.
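The grouping idea can be sketched in a few lines of Python. Field names like runId and startedAt are illustrative, not PACKWOLF's actual schema:

```python
from collections import defaultdict

def group_into_traces(observations):
    """Group flat observation records into per-run traces,
    ordered by start time, so one run reads as one timeline."""
    traces = defaultdict(list)
    for obs in observations:
        # runId is the grouping key here; conversationId or sessionId
        # could serve the same role at a coarser grain.
        traces[obs["runId"]].append(obs)
    for spans in traces.values():
        spans.sort(key=lambda o: o["startedAt"])
    return dict(traces)

runs = group_into_traces([
    {"runId": "r1", "startedAt": 2, "type": "tool:search"},
    {"runId": "r1", "startedAt": 1, "type": "model:generation"},
])
# runs["r1"] now reads in order: generation first, then the tool span
```

The point is that ordering happens at read time: the store stays flat and append-friendly, while the UI sees a coherent timeline.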

Generation vs. tool observation.

Generations record what the model intended (prompt, tools, output, tool calls). Tool observations record what happened after. Confusing the two ruins debugging; PACKWOLF separates them.

Failure taxonomy, not just errors.

Did the model fail to form the call? Did the gate deny it? Did the handler error? Did the tool return malformed output? Each is a different category, classified at observation time.
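A classifier over those boundaries might look like this sketch (flag and category names are invented for illustration):

```python
def classify_failure(obs):
    """Map an observation (illustrative dict shape) to a failure
    category, instead of collapsing everything into 'error'."""
    if obs.get("malformed_call"):
        return "model:malformed_tool_call"  # model failed to form the call
    if obs.get("gate_denied"):
        return "gate:denied"                # policy gate refused it
    if obs.get("handler_error"):
        return "handler:error"              # handler raised or timed out
    if obs.get("malformed_output"):
        return "tool:malformed_output"      # tool ran, output didn't parse
    return "ok"
```

Because the category is attached at observation time, the trace browser can filter on it later without re-deriving anything.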

Prompt versioning is built in.

Every system prompt is fingerprinted. You can diff the prompt between runs to find the change that broke (or fixed) behavior.
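One plausible way to fingerprint and diff prompts, using a content hash and a standard unified diff (a sketch, not PACKWOLF's actual mechanism):

```python
import difflib
import hashlib

def fingerprint_prompt(system_prompt: str) -> str:
    """Stable content hash: two runs with the same fingerprint
    ran the same prompt text."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:12]

def diff_prompts(a: str, b: str) -> str:
    """Line-level diff between two prompt versions."""
    return "\n".join(
        difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm=""))
```

Fingerprints make "did the prompt change between these two runs?" a single equality check; the diff answers "what changed?"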

URL-driven state.

Every trace, span, filter, and time range is encoded in the URL. Share a debugging session by sharing the link. No 'click these five things to reproduce.'
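The round-trip is conceptually simple; this sketch uses invented parameter names (trace, span, q, range) rather than PACKWOLF's real URL scheme:

```python
from urllib.parse import parse_qs, urlencode, urlparse

def encode_view(base, trace=None, span=None, q=None, time_range=None):
    """Serialize the current browser view into a shareable URL."""
    params = {k: v for k, v in
              {"trace": trace, "span": span,
               "q": q, "range": time_range}.items()
              if v is not None}
    return f"{base}?{urlencode(params)}"

def decode_view(url):
    """Recover the view state from a shared URL."""
    return {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
```

If every piece of view state survives the round-trip, a pasted link reproduces the exact debugging session.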

Master-detail with keyboard nav.

j/k to move through traces, Enter to open a span, / to filter, ? for help. Built for the engineer who lives in the trace browser, not just the operator who visits occasionally.

How it works

The path through observability.

  1. Pipeline emits run events.

    Every agent turn writes events to the run_events table: model:generation observations, tool:* observations, errors, denials. Append-only, ordered, replayable.

  2. Audit log captures policy.

    Tool calls also fire audit events: who, what, input, output, duration, approval state. The audit log is the source of truth for compliance.

  3. Cost events fire alongside.

    Each model call and each tool call fires a cost event with token estimates and provider billing. Rolls up per-agent and per-tool.

  4. Trace UI groups observations.

    The browser groups observations by trace ID into a flame graph timeline. Failed agent behavior becomes inspectable, not buried.

  5. Span detail shows the boundary.

    Click a span: input, output, metadata, diagnosis. Find the exact moment the model emitted a malformed tool call, or the gate denied a request, or a handler timed out.

  6. Diff prompts across runs.

    Two trace IDs, side by side. The prompt diff shows what changed: system prompt, tool schemas, memory block, conversation history. Find the regression.
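The emit-and-roll-up path above can be sketched end to end. Class, table, and payload names are illustrative, not PACKWOLF's actual schema:

```python
import itertools

class RunEventLog:
    """Append-only, ordered event log per run; replay is just
    re-reading the sequence in order."""
    def __init__(self):
        self._seq = itertools.count()
        self._events = []

    def append(self, run_id, kind, payload):
        self._events.append({"seq": next(self._seq), "runId": run_id,
                             "kind": kind, "payload": payload})

    def replay(self, run_id):
        """All events for one run, in emission order."""
        return [e for e in self._events if e["runId"] == run_id]

def cost_rollup(events):
    """Sum cost events per agent (illustrative payload shape)."""
    totals = {}
    for e in events:
        if e["kind"] == "cost":
            agent = e["payload"]["agent"]
            totals[agent] = totals.get(agent, 0.0) + e["payload"]["usd"]
    return totals
```

Append-only ordering is what makes replay trivial: nothing is mutated in place, so re-reading the log reconstructs the run exactly.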

Common questions

Things engineers actually ask.

How is this different from standard monitoring tools?

Standard tools center on requests and errors. Agent runtimes have a different failure shape: a model can emit a tool name but no arguments, a gate can deny a call before it runs, a handler can succeed but return malformed output. PACKWOLF observes those boundaries directly.

Source: docs/monitoring.md

See it in your workspace.

Closed-beta cohorts are small. Tell us what you'd want this capability to handle for your team.

Request beta access