Capability · Observability layer

A/B comparison for agents. Across skills, across models.

When agent behavior changes, you need to know whether it's better. PACKWOLF's eval framework runs the same task across two or more configurations (different skills, different models, different prompts) and grades each run: an LLM grader for qualitative scoring, deterministic mock tools for stable runs.

Multi-skill comparison · Multi-model comparison · LLM grader (qualitative) · Mock tools (deterministic)
The eval dashboard: run history, scores, regression heatmap, and regrade jobs, all in one place.
What it actually does

The parts that make this work.

Same task, different configs.

Run task T with skill A on Claude. Run task T with skill B on Claude. Run task T with skill A on OpenAI. Score every run. Find what works.
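
A minimal sketch of that task × config matrix, assuming the caller supplies the function that executes one run and returns a score. `EvalConfig`, `runMatrix`, and the field names are illustrative, not PACKWOLF's actual API.

```ts
type EvalConfig = { skill: string; model: string };
type Task = { id: string; input: string };

const task: Task = { id: "t-refund", input: "Customer asks for a refund on order #1234" };

const configs: EvalConfig[] = [
  { skill: "skill-a", model: "claude" },
  { skill: "skill-b", model: "claude" },
  { skill: "skill-a", model: "openai" },
];

// Run the same task under every config, score each run, rank the configs.
async function runMatrix(run: (t: Task, c: EvalConfig) => Promise<number>) {
  const results = await Promise.all(
    configs.map(async (config) => ({ config, score: await run(task, config) }))
  );
  return results.sort((a, b) => b.score - a.score); // best config first
}
```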

LLM grader scores quality.

A separate model evaluates outputs against rubric criteria. Useful for tasks where there's no single correct answer but there is a quality bar.
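
One way to picture it: a weighted rubric and a scorer that asks a grading model about each criterion. `gradeWithModel` is a stub here; a real implementation would call a separate model and parse a score from its reply.

```ts
type Rubric = Array<{ criterion: string; weight: number }>;

const rubric: Rubric = [
  { criterion: "Grounded in the provided context", weight: 0.5 },
  { criterion: "Matches the support persona's tone", weight: 0.3 },
  { criterion: "No unnecessary tool calls", weight: 0.2 },
];

// Stub: a real version sends the output plus one criterion to a grading
// model and parses a 0..1 score from the reply.
async function gradeWithModel(output: string, criterion: string): Promise<number> {
  return 0.8;
}

// Weighted sum across criteria gives one qualitative score per run.
async function score(output: string): Promise<number> {
  let total = 0;
  for (const { criterion, weight } of rubric) {
    total += weight * (await gradeWithModel(output, criterion));
  }
  return total;
}
```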

Mock tools make runs deterministic.

Replace live tool calls with recorded responses. The model's behavior becomes the only variable. Reproducible eval runs.
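
A record/replay sketch, assuming tool calls can be keyed by tool name plus serialized arguments. The `recordings` map and `mockTool` are hypothetical names for illustration.

```ts
type ToolCall = { name: string; args: unknown };

// Responses captured from a real run, keyed by the serialized call.
const recordings = new Map<string, unknown>([
  [JSON.stringify({ name: "lookup_order", args: { id: 1234 } }), { status: "shipped" }],
]);

// During an eval run, every tool call is answered from the recording, so the
// model's reasoning is the only variable left between runs.
function mockTool(call: ToolCall): unknown {
  const recorded = recordings.get(JSON.stringify(call));
  if (recorded === undefined) {
    // Fail loudly rather than silently hitting a live tool.
    throw new Error(`No recorded response for tool "${call.name}"`);
  }
  return recorded;
}
```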

Regression detection.

Today's run vs. last week's run, on the same task: did the score drop? The dashboard flags regressions before you ship them to production.
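
The core of that check fits in a few lines. This sketch assumes scores are stored per task per run; `findRegressions` and the threshold default are illustrative.

```ts
type RunScores = Record<string, number>; // taskId -> score

function findRegressions(baseline: RunScores, current: RunScores, threshold = 0.05) {
  return Object.keys(baseline)
    .filter((taskId) => taskId in current)
    .filter((taskId) => baseline[taskId] - current[taskId] > threshold)
    .map((taskId) => ({
      taskId,
      baseline: baseline[taskId],
      current: current[taskId],
      drop: baseline[taskId] - current[taskId],
    }));
}
```

Feed it last week's scores as baseline and today's as current; anything returned is a regression candidate to inspect before shipping.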

Re-grading jobs.

Change the rubric and re-grade past runs without re-executing them. Compare scores under the old vs. new rubric. Useful when the goalposts move.
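
The trick is that past runs keep their stored model output, so a new rubric only needs the grader, never re-execution. `StoredRun` and `regrade` are illustrative names.

```ts
type StoredRun = { id: string; output: string; score: number };

async function regrade(runs: StoredRun[], grade: (output: string) => Promise<number>) {
  const regraded: Array<{ id: string; oldScore: number; newScore: number }> = [];
  for (const run of runs) {
    // Re-score the stored output under the new rubric; never re-run the agent.
    regraded.push({ id: run.id, oldScore: run.score, newScore: await grade(run.output) });
  }
  return regraded; // old vs. new rubric, side by side
}
```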

Test suites are versioned.

Eval suites live as code. Add tasks, version suites, share across agents. Like unit tests for the agent layer.
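
What a suite-as-code file might look like. The file shape and field names are assumptions for illustration, not a documented schema.

```ts
export const refundSuite = {
  name: "refund-flows",
  version: 3, // bump when tasks or rubrics change
  tasks: [
    {
      id: "refund-basic",
      input: "I want a refund for order #1234",
      rubric: ["Identifies the order", "Offers the correct refund path"],
    },
    {
      id: "refund-out-of-window",
      input: "Refund my order from eight months ago",
      rubric: ["Explains the policy window", "Stays polite under pushback"],
    },
  ],
};
```

Because the suite is a module, it can be imported by any agent's eval run and diffed in code review like any other test file.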

How it works

The path through evals. A stitched-together code sketch follows the steps.

  1. Define the task.

    A task is a structured input (message + context) with a rubric. Could be one task or a hundred-task suite.

  2. Pick the configurations.

    Skill A vs. skill B. Claude vs. OpenAI vs. local. With memory vs. without. Whatever variables matter for what you're evaluating.

  3. Run the matrix.

    For each task × each config, the runner executes a real pipeline turn. Mock tools provide deterministic responses where applicable so the model's reasoning is the only variable.

  4. LLM grader scores.

    Each output passes through a grading model with the rubric. Outputs get scored on dimensions you defined.

  5. Dashboard surfaces results.

    Heatmap by config × task. Regression highlights where today's run is worse than baseline. Drill into a single run to see the trace.

  6. Re-grade if the rubric changes.

    Update the rubric. A re-grading job re-scores past runs without re-executing them. The new score is comparable to baseline because the model output is the same.
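
To make the sequence concrete, here is the promised sketch of all six steps in one place. Every name here (`runTurn`, `grade`, `rows`) is a stub standing in for the pieces described above, not PACKWOLF's API.

```ts
type Task = { id: string; input: string; rubric: string[] };
type Config = { skill: string; model: string };

// Stubs for the real pieces: a pipeline turn with mock tools (03) and an
// LLM grader call (04).
const runTurn = async (_t: Task, _c: Config) => "model output";
const grade = async (_o: string, _r: string[]) => 0.9;

// 05: the dashboard reads rows like these.
const rows: Array<{ taskId: string; config: Config; output: string; score: number }> = [];

async function evalPipeline() {
  // 01: define the task with its rubric.
  const task: Task = { id: "t1", input: "Refund order #1234", rubric: ["grounded", "polite"] };
  // 02: pick the configurations to compare.
  const configs: Config[] = [
    { skill: "skill-a", model: "claude" },
    { skill: "skill-b", model: "claude" },
  ];
  // 03 + 04: run the matrix and grade each output.
  for (const config of configs) {
    const output = await runTurn(task, config);
    rows.push({ taskId: task.id, config, output, score: await grade(output, task.rubric) });
  }
  // 06: if the rubric changes later, re-run grade() over the stored outputs
  // only, without re-executing runTurn.
}
```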

Common questions

Things engineers actually ask.

How is this different from observability?

Observability tells you what an agent did on a real run. Evals tell you whether that's good. Same trace data underneath, different question.

Source: docs/lib/eval-*.ts

See it in your workspace.

Closed-beta cohorts are small. Tell us what you'd want this capability to handle for your team.

Request beta access