Same task, different configs.
Run task T with skill A on Claude. Run task T with skill B on Claude. Run task T with skill A on OpenAI. Score every run. Find what works.
A task is a structured input (message plus context) paired with a rubric. Could be a single task or a hundred-task suite.
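A minimal sketch of what that might look like, assuming a Python harness (the `Task` and `Rubric` names are illustrative, not a real API):

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    # Dimensions the grading model will score later: name -> short description.
    dimensions: dict[str, str]

@dataclass
class Task:
    id: str
    message: str                                   # the user-facing input
    rubric: Rubric
    context: dict = field(default_factory=dict)    # extra structured context

suite = [
    Task(
        id="refund-policy",
        message="A customer asks for a refund after 45 days. What do we do?",
        rubric=Rubric(dimensions={
            "accuracy": "Answer matches the policy document.",
            "tone": "Response is polite and direct.",
        }),
        context={"policy_doc": "Refunds are allowed within 30 days of purchase."},
    ),
]
```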
Skill A vs. skill B. Claude vs. OpenAI vs. local. With memory vs. without. Whatever variables matter for what you're evaluating.
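A config could be as simple as a named bundle of those variables, and the cross product with the task suite is the run matrix. A hypothetical sketch, continuing the same assumptions:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Config:
    name: str
    skill: str        # e.g. "skill-a" vs. "skill-b"
    provider: str     # e.g. "claude", "openai", "local"
    memory: bool      # with memory vs. without

configs = [
    Config("a-claude", skill="skill-a", provider="claude", memory=True),
    Config("b-claude", skill="skill-b", provider="claude", memory=True),
    Config("a-openai", skill="skill-a", provider="openai", memory=True),
]

task_ids = ["refund-policy", "escalation", "billing-dispute"]

# Every (task, config) pair is one run in the matrix.
runs = list(product(task_ids, configs))
```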
For each task × each config, the runner executes a real pipeline turn. Mock tools provide deterministic responses where applicable so the model's reasoning is the only variable.
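A rough sketch of that loop, assuming a `call_model` wrapper per provider and illustrative tool names:

```python
# Deterministic canned responses so tool output never varies between runs.
MOCK_TOOLS = {
    "search_orders": lambda args: {"order_id": "A-1001", "age_days": 45},
    "get_policy": lambda args: {"refund_window_days": 30},
}

def run_turn(task, config, call_model):
    """Execute one pipeline turn for a (task, config) pair.

    `call_model` wraps whichever provider the config selects and returns
    either a tool call or a final answer.
    """
    messages = [{"role": "user", "content": task.message, "context": task.context}]
    while True:
        reply = call_model(config, messages)
        if reply.get("tool_call") is None:
            return reply["content"]            # final answer, stored for grading
        tool = MOCK_TOOLS[reply["tool_call"]["name"]]
        messages.append({"role": "tool", "content": tool(reply["tool_call"]["args"])})
```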
Each output passes through a grading model along with the rubric and gets scored on the dimensions you defined.
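One plausible shape for that grading step, assuming the grader returns JSON (the prompt and field names are illustrative):

```python
import json

GRADER_PROMPT = """You are grading a model response against a rubric.

Rubric dimensions:
{dimensions}

Response to grade:
{output}

Return a JSON object mapping each dimension to {{"score": 1-5, "reason": "..."}}."""

def grade(output, rubric, call_grader):
    """Score one stored output on every dimension the rubric defines."""
    prompt = GRADER_PROMPT.format(
        dimensions="\n".join(f"- {name}: {desc}" for name, desc in rubric.dimensions.items()),
        output=output,
    )
    return json.loads(call_grader(prompt))
```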
Heatmap by config × task. Regression highlighting flags where today's run is worse than the baseline. Drill into a single run to see the trace.
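A sketch of the reporting step, assuming each graded run is stored as a flat record with a config, task, and score:

```python
def heatmap(results):
    """Pivot run scores into a config × task grid.

    `results` is assumed to be a list of {"config": str, "task": str, "score": float}.
    """
    grid = {}
    for r in results:
        grid.setdefault(r["config"], {})[r["task"]] = r["score"]
    return grid

def regressions(today, baseline, threshold=0.5):
    """Flag cells where today's score dropped more than `threshold` below baseline."""
    base_grid = heatmap(baseline)
    flagged = []
    for config, row in heatmap(today).items():
        for task, score in row.items():
            base = base_grid.get(config, {}).get(task)
            if base is not None and base - score > threshold:
                flagged.append((config, task, base, score))
    return flagged
```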
Update the rubric. A re-grading job re-scores past runs without re-executing them. The new scores stay comparable to the baseline because the model outputs are unchanged.
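A sketch of that job, reusing the hypothetical `grade` helper above on stored outputs:

```python
def regrade(stored_runs, new_rubric, call_grader):
    """Re-score past runs against an updated rubric without re-executing them.

    `stored_runs` is assumed to be a list of {"run_id": str, "output": str}
    pulled from storage; only the grading step runs, the outputs are untouched.
    """
    return {
        run["run_id"]: grade(run["output"], new_rubric, call_grader)
        for run in stored_runs
    }
```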