LLM Evaluations & Testing
Pathrule · 1 Rule • 2 Memories • 2 Skills
The discipline that turns an LLM feature from a demo into something you can change with confidence. Hallucinations and inconsistent quality are among the most commonly reported problems with AI-generated work, and you cannot fix what you do not measure. This pattern builds a labelled evaluation set, scores outputs with deterministic checks plus a calibrated LLM-as-judge, and gates every prompt, model, or retrieval change on an eval run, so a tweak that helps one case but breaks ten is caught in CI, not in production.
Suggested path map
Pathrule places each piece on the matching path, so your assistant only sees it where it belongs. This is the scoping you get on import; you can adjust it in your workspace.
Rules (1)
Gate prompt, model, and retrieval changes on an eval run
/src/ai · high · advisory
No change to a prompt, model, or retrieval config ships without running the eval set and comparing scores against the current baseline.
A prompt is code: a small edit can improve one case and silently break ten others. Without an eval gate, you find out from users.

- Run the eval set on every change to a prompt, model id, temperature, tool definition, or retrieval config, and compare the scores to the committed baseline before merging.
- Treat a regression on the eval set like a failing test: it blocks the change. An improvement on your one hand-picked example is not evidence; the aggregate score on the dataset is.
- Pin the model version in the eval run. A provider silently changing a model under you is itself a regression you want the evals to catch.
- Keep eval runs in CI (or a pre-merge step) so the gate is enforced regardless of who makes the change. Record the score so the trend is visible over time.
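The gate described above can be sketched as a small comparison step. This is a minimal illustration, not a prescribed implementation: the function names `find_regressions` and `gate`, the `tolerance` parameter, and the assumption that every metric is a "higher is better" aggregate are all hypothetical choices made for the example.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return one message per metric that got worse than the baseline.

    Assumption: every score is a "higher is better" aggregate over the
    whole eval set. A metric missing from the current run is treated as
    a regression, because a silently dropped check should not pass.
    """
    problems = []
    for metric, base_score in sorted(baseline.items()):
        if metric not in current:
            problems.append(f"{metric}: missing from current run")
            continue
        drop = base_score - current[metric]
        if drop > tolerance:
            problems.append(f"{metric}: {base_score:.3f} -> {current[metric]:.3f}")
    return problems


def gate(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.01,
) -> int:
    """Return a process exit code: 0 allows the change, 1 blocks it."""
    problems = find_regressions(baseline, current, tolerance)
    for line in problems:
        print("REGRESSION", line)
    return 1 if problems else 0
```

In a CI step, the baseline would typically be read from a file committed next to the eval set, and the script would end with `sys.exit(gate(baseline, current))` so a regression fails the job the same way a failing test does. The tolerance exists only to absorb run-to-run noise; set it from the variance you actually observe rather than from this example.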
Memories (2)
Build a labelled eval set that mirrors real usage
/evals
Curate a versioned dataset of representative and adversarial inputs with expected outputs or acceptance criteria; grow it from real failures.
The eval set is the asset. The model and prompt will change; the dataset is what lets you tell whether a change is better.

- Curate inputs that mirror real traffic: common cases, important edge cases, and adversarial inputs (prompt injection, ambiguous or out-of-scope requests, inputs that should be refused). A dataset of only happy-path examples measures nothing useful.
- For each case, record either an expected output, a reference answer, or explicit acceptance criteria. Some tasks have one right answer; many have a rubric instead, and that is fine as long as it is written down.
- Version the dataset alongside the code and grow it from production failures: every real hallucination or bad answer becomes a new eval case so the same regression cannot return unnoticed.
- Keep the set balanced and labelled honestly; do not overfit prompts to a tiny set of examples you keep re-reading. Aim for coverage of the behaviours that matter.

See /evals for the scoring memory and /src/ai for the eval-gate rule.
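One way to keep such a dataset honest is to store it as JSON Lines and validate it on load. The sketch below is illustrative only: the `EvalCase` fields (`case_id`, `kind`, `input`, `expected`, `criteria`, `source`), the three `kind` values, and the JSONL layout are assumptions made for the example, not a format this pattern prescribes.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical case categories, mirroring "common, edge, adversarial" above.
ALLOWED_KINDS = {"common", "edge", "adversarial"}


@dataclass
class EvalCase:
    case_id: str
    kind: str                        # one of ALLOWED_KINDS
    input: str
    expected: Optional[str] = None   # expected output or reference answer
    criteria: List[str] = field(default_factory=list)  # written rubric items
    source: str = "curated"          # e.g. "production-failure"

    def validate(self) -> None:
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"{self.case_id}: unknown kind {self.kind!r}")
        # Every case needs something written down to score against.
        if self.expected is None and not self.criteria:
            raise ValueError(f"{self.case_id}: needs expected output or criteria")


def load_cases(jsonl_text: str) -> List[EvalCase]:
    """Parse one JSON object per line, rejecting invalid or duplicate cases."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        case = EvalCase(**json.loads(line))
        case.validate()
        cases.append(case)
    ids = [c.case_id for c in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case_id in dataset")
    return cases


def coverage(cases: List[EvalCase]) -> Dict[str, int]:
    """Count cases per kind so a happy-path-only dataset is easy to spot."""
    counts = {kind: 0 for kind in sorted(ALLOWED_KINDS)}
    for case in cases:
        counts[case.kind] += 1
    return counts
```

Because the file is plain text, it diffs cleanly in version control, and adding a case after a production failure is a one-line change with `source` set accordingly. A `coverage` result with zero adversarial or edge cases is a signal that the set does not yet mirror real usage.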
Score with deterministic checks first, then a calibrated judge
/evals
Use exact/programmatic checks where outputs are verifiable; use an LLM-as-judge with a clear rubric, calibrated against human labels, for open-ended quality.
Pick the cheapest scoring method that actually measures the thing, and only reach for an LLM judge when the output is genuinely open-ended.

- Score deterministically wherever you can: exact match, schema/JSON validity, regex, contains-required-facts, executes-without-error, latency, and cost. These are free, fast, and not themselves subject to model error.
- For open-ended quality (helpfulness, tone, faithfulness), use an LLM-as-judge: a separate model call that scores the output against a specific, written rubric, ideally returning a structured verdict with a reason, not a bare number.
- Calibrate the judge against human labels on a sample: if the judge does not agree with your team's judgments, fix the rubric before trusting it. An uncalibrated judge is just another opinion.
- For RAG and any grounded answer, score faithfulness explicitly: does the answer follow from the retrieved context, or did the model invent it? This is the direct measure of hallucination. (See the rag-embeddings pattern for retrieval quality.)

See /evals for the dataset memory and /src/ai for the eval-gate rule.
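The deterministic checks listed above need nothing beyond the standard library. The following is a minimal sketch of four of them plus an aggregate; the function names are hypothetical, and the substring-based fact check is a deliberately simple stand-in that will miss paraphrases.

```python
import json
import re
from typing import List


def exact_match(output: str, expected: str) -> bool:
    """Exact match after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def is_valid_json_with_keys(output: str, required_keys: List[str]) -> bool:
    """True if the output parses as a JSON object containing every key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def contains_required_facts(output: str, facts: List[str]) -> bool:
    """Case-insensitive substring check; simple, so it misses paraphrases."""
    lowered = output.lower()
    return all(fact.lower() in lowered for fact in facts)


def matches_pattern(output: str, pattern: str) -> bool:
    """True if the regular expression matches anywhere in the output."""
    return re.search(pattern, output) is not None


def pass_rate(results: List[bool]) -> float:
    """Aggregate per-case booleans into the score the gate compares."""
    return sum(results) / len(results) if results else 0.0
```

Each check returns a plain boolean per case, so the aggregate is easy to record and compare against a baseline. Anything these checks cannot express, such as tone or faithfulness to a retrieved passage, is where a calibrated judge takes over.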
Skills (2)
llm-eval-set-builder
/root
Checklist for building or extending an LLM evaluation set and wiring it into the change workflow.
---
name: llm-eval-set-builder
description: Checklist for building or extending an LLM evaluation set and gating changes on it. Run when adding an LLM feature or after a production quality failure.
---

# LLM eval set builder

## Dataset
- [ ] Inputs mirror real usage: common cases, important edge cases, and adversarial inputs (injection, out-of-scope, must-refuse).
- [ ] Each case has an expected output, reference answer, or written acceptance criteria/rubric.
- [ ] Dataset is versioned with the code and grows from real production failures.
- [ ] Coverage is balanced; prompts are not overfit to a handful of examples.

## Scoring
- [ ] Deterministic checks used where outputs are verifiable (exact match, schema validity, required facts, runs-clean, latency, cost).
- [ ] LLM-as-judge used only for open-ended quality, with a specific written rubric and a structured verdict + reason.
- [ ] Judge calibrated against human labels on a sample; rubric revised until the judge agrees with them.
- [ ] Grounded answers scored for faithfulness (does the answer follow from the source) to catch hallucination.

## Gate
- [ ] Eval run triggers on any prompt/model/temperature/tool/retrieval change; model version pinned.
- [ ] Scores compared to a committed baseline; a regression blocks the change like a failing test.
- [ ] Eval runs in CI / pre-merge; scores recorded so the quality trend is visible.
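One possible way to make "an eval run triggers on any prompt/model/temperature/tool/retrieval change" mechanical is to fingerprint the behaviour-affecting configuration and store that fingerprint with the baseline. This is a sketch under assumptions: the config keys, the `fingerprint` field on the baseline record, and the model id shown in the usage note are hypothetical placeholders.

```python
import hashlib
import json
from typing import Any, Dict


def config_fingerprint(config: Dict[str, Any]) -> str:
    """Hash every field that can change model behaviour.

    Keys are sorted before hashing, so two configs with the same content
    produce the same fingerprint regardless of key order.
    """
    canonical = json.dumps(
        config, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_eval_run(config: Dict[str, Any], baseline_record: Dict[str, Any]) -> bool:
    """True when the config differs from the one the baseline was scored on."""
    return baseline_record.get("fingerprint") != config_fingerprint(config)
```

A config here might look like `{"prompt": "...", "model": "example-model-2025-01-01", "temperature": 0.0, "tools": [], "retrieval": {"top_k": 5}}`, with the pinned model id included so that a version bump also changes the fingerprint. The check only detects that something changed; the eval run itself still decides whether the change is acceptable.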
llm-as-judge-rubric
/root
Template and guidance for writing a reliable LLM-as-judge scoring prompt and rubric.
---
name: llm-as-judge-rubric
description: Guidance for writing a reliable LLM-as-judge scoring prompt. Use when building automated scoring for open-ended LLM outputs.
---

# LLM-as-judge rubric

Use when an output is too open-ended for a deterministic check. A judge is only as good as its rubric.

## Writing the rubric
- [ ] Define each criterion concretely (e.g. faithfulness, relevance, completeness, tone) with what a pass and a fail look like, not just a label.
- [ ] Prefer a small discrete scale (e.g. 1-5 or pass/fail per criterion) over an unanchored 0-100; anchor each level with a description.
- [ ] Ask the judge to give its reasoning first, citing the part of the input/source that justifies it, and only then the score. Require a structured output (per-criterion verdict + reason).
- [ ] For faithfulness/grounding, give the judge the source context and ask explicitly whether each claim is supported by it.

## Making it reliable
- [ ] Use a capable model as the judge; do not have a weak model grade a strong one.
- [ ] Calibrate: score a sample the team has labelled and measure agreement; revise the rubric until the judge matches human judgment.
- [ ] Watch for known judge biases (position, length, self-preference) and control for them (e.g. randomize order in pairwise comparisons).
- [ ] Keep the judge prompt and model versioned with the eval set; a judge change is itself an eval change.
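Two parts of the checklist above lend themselves to code: validating the judge's structured verdict, and measuring agreement with human labels. The sketch below assumes the judge is asked to reply with a JSON object of the form `{criterion: {"verdict": "pass" | "fail", "reason": "..."}}`; that reply shape and the function names are hypothetical. Cohen's kappa is used here as one common agreement statistic, since raw agreement can look high by chance when most labels are the same.

```python
import json
from typing import Dict, List


def parse_verdict(raw: str, criteria: List[str]) -> Dict[str, dict]:
    """Validate a judge reply; reject anything that is not a full verdict."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    for name in criteria:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {name}")
        if entry.get("verdict") not in ("pass", "fail"):
            raise ValueError(f"{name}: verdict must be 'pass' or 'fail'")
        if not str(entry.get("reason", "")).strip():
            raise ValueError(f"{name}: a reason is required")
    return {name: data[name] for name in criteria}


def agreement_rate(judge: List[str], human: List[str]) -> float:
    """Fraction of cases where the judge and the human label match."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: List[str], human: List[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A reply that fails `parse_verdict` should be counted as a judge error rather than silently scored. What counts as "enough" agreement is a team decision; the point of computing it on a labelled sample is that rubric revisions can be compared with a number instead of an impression.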
Why this pattern
AI agents ship prompt and model changes on vibes, with no eval set, so hallucinations and quality regressions only surface once users hit them.
Built for: Teams shipping LLM features that need to measure and defend output quality.
Keeps your assistant from:
- Changing a prompt or model and shipping it because the one example you tried looked good
- Having no labelled dataset, so quality is a feeling rather than a number
- Using an LLM judge with a vague rubric that does not agree with humans
- Letting hallucinations through because nothing checks answers against their source
- License: Apache-2.0
- Version: 1.0.0
- Updated: 2026-06-09