Capability · Observability layer

A/B comparison for agents. Across skills, across models.

When agent behavior changes, you need to know whether it's better. PACKWOLF's eval framework runs the same task across two or more configurations (different skills, different models, different prompts) and grades each run: an LLM grader for qualitative scoring, deterministic mock tools for stable runs.

Multi-skill comparison · Multi-model comparison · LLM grader (qualitative) · Mock tools (deterministic)
The eval dashboard: run history, scores, regression heatmap, and regrade jobs, all in one place.
What it actually does

The parts that make this work.

Same task, different configs.

Run task T with skill A on Claude. Run task T with skill B on Claude. Run task T with skill A on OpenAI. Score every run. Find what works.
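
A minimal sketch of that task × config matrix, assuming the caller supplies the function that executes one run and returns a score. `EvalConfig`, `runMatrix`, and the field names are illustrative, not PACKWOLF's actual API.

```ts
type EvalConfig = { skill: string; model: string };
type Task = { id: string; input: string };

const task: Task = { id: "t-refund", input: "Customer asks for a refund on order #1234" };

const configs: EvalConfig[] = [
  { skill: "skill-a", model: "claude" },
  { skill: "skill-b", model: "claude" },
  { skill: "skill-a", model: "openai" },
];

// Run the same task under every config, score each run, rank the configs.
async function runMatrix(run: (t: Task, c: EvalConfig) => Promise<number>) {
  const results = await Promise.all(
    configs.map(async (config) => ({ config, score: await run(task, config) }))
  );
  return results.sort((a, b) => b.score - a.score); // best config first
}
```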

LLM grader scores quality.

A separate model evaluates outputs against rubric criteria. Useful for tasks where there's no single correct answer but there is a quality bar.
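
One way to picture it: a weighted rubric and a scorer that asks a grading model about each criterion. `gradeWithModel` is a stub here; a real implementation would call a separate model and parse a score from its reply.

```ts
type Rubric = Array<{ criterion: string; weight: number }>;

const rubric: Rubric = [
  { criterion: "Grounded in the provided context", weight: 0.5 },
  { criterion: "Matches the support persona's tone", weight: 0.3 },
  { criterion: "No unnecessary tool calls", weight: 0.2 },
];

// Stub: a real version sends the output plus one criterion to a grading
// model and parses a 0..1 score from the reply.
async function gradeWithModel(output: string, criterion: string): Promise<number> {
  return 0.8;
}

// Weighted sum across criteria gives one qualitative score per run.
async function score(output: string): Promise<number> {
  let total = 0;
  for (const { criterion, weight } of rubric) {
    total += weight * (await gradeWithModel(output, criterion));
  }
  return total;
}
```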

Mock tools make runs deterministic.

Replace live tool calls with recorded responses. The model's behavior becomes the only variable. Reproducible eval runs.
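
A record/replay sketch, assuming tool calls can be keyed by tool name plus serialized arguments. The `recordings` map and `mockTool` are hypothetical names for illustration.

```ts
type ToolCall = { name: string; args: unknown };

// Responses captured from a real run, keyed by the serialized call.
const recordings = new Map<string, unknown>([
  [JSON.stringify({ name: "lookup_order", args: { id: 1234 } }), { status: "shipped" }],
]);

// During an eval run, every tool call is answered from the recording, so the
// model's reasoning is the only variable left between runs.
function mockTool(call: ToolCall): unknown {
  const recorded = recordings.get(JSON.stringify(call));
  if (recorded === undefined) {
    // Fail loudly rather than silently hitting a live tool.
    throw new Error(`No recorded response for tool "${call.name}"`);
  }
  return recorded;
}
```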

Regression detection.

Today's run vs. last week's run, on the same task: did the score drop? The dashboard flags regressions before you ship them to production.
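
The core of that check fits in a few lines. This sketch assumes scores are stored per task per run; `findRegressions` and the threshold default are illustrative.

```ts
type RunScores = Record<string, number>; // taskId -> score

function findRegressions(baseline: RunScores, current: RunScores, threshold = 0.05) {
  return Object.keys(baseline)
    .filter((taskId) => taskId in current)
    .filter((taskId) => baseline[taskId] - current[taskId] > threshold)
    .map((taskId) => ({
      taskId,
      baseline: baseline[taskId],
      current: current[taskId],
      drop: baseline[taskId] - current[taskId],
    }));
}
```

Feed it last week's scores as baseline and today's as current; anything returned is a regression candidate to inspect before shipping.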

Re-grading jobs.

Change the rubric and re-grade past runs without re-executing them. Compare scores under the old vs. new rubric. Useful when the goalposts move.
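
The trick is that past runs keep their stored model output, so a new rubric only needs the grader, never re-execution. `StoredRun` and `regrade` are illustrative names.

```ts
type StoredRun = { id: string; output: string; score: number };

async function regrade(runs: StoredRun[], grade: (output: string) => Promise<number>) {
  const regraded: Array<{ id: string; oldScore: number; newScore: number }> = [];
  for (const run of runs) {
    // Re-score the stored output under the new rubric; never re-run the agent.
    regraded.push({ id: run.id, oldScore: run.score, newScore: await grade(run.output) });
  }
  return regraded; // old vs. new rubric, side by side
}
```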

Test suites are versioned.

Eval suites live as code. Add tasks, version suites, share across agents. Like unit tests for the agent layer.
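
What a suite-as-code file might look like. The file shape and field names are assumptions for illustration, not a documented schema.

```ts
export const refundSuite = {
  name: "refund-flows",
  version: 3, // bump when tasks or rubrics change
  tasks: [
    {
      id: "refund-basic",
      input: "I want a refund for order #1234",
      rubric: ["Identifies the order", "Offers the correct refund path"],
    },
    {
      id: "refund-out-of-window",
      input: "Refund my order from eight months ago",
      rubric: ["Explains the policy window", "Stays polite under pushback"],
    },
  ],
};
```

Because the suite is a module, it can be imported by any agent's eval run and diffed in code review like any other test file.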

How it works

The path through evals. A stitched-together code sketch follows the steps.

  1. Define the task.

    A task is a structured input (message + context) with a rubric. Could be one task or a hundred-task suite.

  2. Pick the configurations.

    Skill A vs. skill B. Claude vs. OpenAI vs. local. With memory vs. without. Whatever variables matter for what you're evaluating.

  3. Run the matrix.

    For each task × each config, the runner executes a real pipeline turn. Mock tools provide deterministic responses where applicable so the model's reasoning is the only variable.

  4. LLM grader scores.

    Each output passes through a grading model with the rubric. Outputs get scored on dimensions you defined.

  5. Dashboard surfaces results.

    Heatmap by config × task. Regression highlights where today's run is worse than baseline. Drill into a single run to see the trace.

  6. Re-grade if the rubric changes.

    Update the rubric. A re-grading job re-scores past runs without re-executing them. The new score is comparable to baseline because the model output is the same.
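
To make the sequence concrete, here is the promised sketch of all six steps in one place. Every name here (`runTurn`, `grade`, `rows`) is a stub standing in for the pieces described above, not PACKWOLF's API.

```ts
type Task = { id: string; input: string; rubric: string[] };
type Config = { skill: string; model: string };

// Stubs for the real pieces: a pipeline turn with mock tools (03) and an
// LLM grader call (04).
const runTurn = async (_t: Task, _c: Config) => "model output";
const grade = async (_o: string, _r: string[]) => 0.9;

// 05: the dashboard reads rows like these.
const rows: Array<{ taskId: string; config: Config; output: string; score: number }> = [];

async function evalPipeline() {
  // 01: define the task with its rubric.
  const task: Task = { id: "t1", input: "Refund order #1234", rubric: ["grounded", "polite"] };
  // 02: pick the configurations to compare.
  const configs: Config[] = [
    { skill: "skill-a", model: "claude" },
    { skill: "skill-b", model: "claude" },
  ];
  // 03 + 04: run the matrix and grade each output.
  for (const config of configs) {
    const output = await runTurn(task, config);
    rows.push({ taskId: task.id, config, output, score: await grade(output, task.rubric) });
  }
  // 06: if the rubric changes later, re-run grade() over the stored outputs
  // only, without re-executing runTurn.
}
```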

Common questions

Things engineers actually ask.

How is this different from observability?

Observability tells you what an agent did on a real run. Evals tell you whether that's good. Same trace data underneath, different question.

Source: docs/lib/eval-*.ts

See it in your workspace.

Closed-beta cohorts are small. Tell us what you'd want this capability to handle for your team.

Request beta access