Most AI projects don't fail because the model is wrong. They fail because nobody built the discipline to know whether the model is right.
You've probably seen this pattern: a vendor demos something impressive, your team builds a POC, the POC works in the demo environment, and then it hits production data and quietly stops getting talked about. Six months later, the budget is still there, the model is still there, and nothing has changed.
The thing that closes the gap between “model works on the demo” and “system works in production” is evals. This playbook walks through the exact methodology we use to keep AI builds from dying after week 4.
01 · The problem
Why most AI POCs die before production.
POCs die for one of three reasons. First: the team never built a way to test whether the output was right. The demo “looked good,” but nobody could answer the question “is this accurate at scale?”
Second: there was no regression test. The team prompted, the output looked good, they shipped. A week later somebody changed the prompt, retrieval, or model and broke the original use case without realizing it. The system silently degraded.
Third: there was no production drift detection. The model worked at launch. Six months in, the data distribution shifted, performance dropped 30%, and nobody noticed until customers complained.
All three failures share the same root cause: nobody built an eval gate. An eval gate is the thing that says “this version passes” or “this version fails” — automatically, repeatably, before deploy.
02 · The golden dataset
Build a golden dataset before you build the AI.
The single most important asset in any AI build is the golden dataset — a representative set of real inputs paired with the right outputs, scored by people who actually know the work.
We build this in week 1 of every engagement. 100–500 examples for most workflows. Drawn from real data, not synthesized. Labeled by your subject-matter experts, not by interns. Stored as a versioned artifact — not a Notion table that disappears when the project changes hands.
Three rules for golden datasets that work:
- Cover the edge cases, not just the happy path. If your model breaks on the 5% of inputs that are weird, you will find out in production unless your golden set includes them. Bias toward the gnarly ones.
- Score on outcomes, not vibes. “The response looked good” isn't a label. “The model correctly identified all three contract risks the senior associate found” is.
- Version it like code. Golden sets evolve. A v1 set may have missed a category that v2 covers. Track which set scored which version of the model.
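One way to make "version it like code" concrete is to fingerprint the golden set by its content and store that fingerprint next to every eval run. The sketch below is a minimal illustration, not a prescribed schema: the `GoldenExample` fields, the `dataset_fingerprint` helper, and the `prompt-v7` label are all hypothetical names chosen for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenExample:
    """One labeled row of the golden set (field names are illustrative)."""
    example_id: str
    input_text: str
    expected_output: str
    labeled_by: str      # the subject-matter expert who labeled this row
    is_edge_case: bool


def dataset_fingerprint(examples):
    """Short content hash that changes whenever any example changes."""
    # Sort by id so row order does not affect the fingerprint.
    ordered = sorted(examples, key=lambda e: e.example_id)
    payload = json.dumps([asdict(e) for e in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def edge_case_share(examples):
    """Fraction of the set that is flagged as an edge case."""
    return sum(e.is_edge_case for e in examples) / len(examples)


golden_v1 = [
    GoldenExample("c-001", "Standard mutual NDA", "no risks", "sme_a", False),
    GoldenExample("c-002", "Contract with no liability cap",
                  "flag: uncapped liability", "sme_b", True),
]

# Record which golden set scored which model version.
run_record = {
    "golden_set": dataset_fingerprint(golden_v1),
    "model_version": "prompt-v7",  # hypothetical version label
}
```

Storing the fingerprint with each run means a score can always be traced back to the exact set that produced it, even after the set grows.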
03 · Rubrics
Design rubrics that match the actual business decision.
A rubric is how a human grader (or LLM grader) decides whether a model output is good. Most teams use accuracy. Most rubrics should be richer than that.
For classification: precision and recall, but also cost of false positives vs false negatives in the actual business. An incident-classifier that under-flags critical incidents is a different failure than one that over-flags noise.
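To illustrate why precision and recall alone can hide the business impact, here is a small sketch that also prices each kind of error. The labels and the cost figures (1 per false alarm, 50 per missed critical incident) are made-up assumptions; real costs have to come from the business.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision and recall, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def expected_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of errors, weighting each error type separately."""
    return fp * cost_fp + fn * cost_fn


y_true = ["critical", "noise", "critical", "noise", "noise"]
y_pred = ["critical", "critical", "noise", "noise", "noise"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="critical")
precision, recall = precision_recall(tp, fp, fn)   # 0.5 and 0.5
# Assumed costs: a missed critical incident is 50x worse than a false alarm.
cost = expected_cost(fp, fn, cost_fp=1, cost_fn=50)  # 51
```

Here precision and recall are identical, yet almost all of the cost comes from the single missed critical incident.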
For generation: factuality (is it true?), relevance (does it answer the question?), grounding (does it cite real sources from your data?), tone (is it the right register for your audience?), and refusal behavior (does it say “I don't know” when it should?).
Don't roll up to a single score. Track each dimension separately. A 95% on factuality and 60% on grounding tells you exactly what to fix.
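As a rough sketch of what tracking each dimension separately can look like, the snippet below averages graded outputs per dimension and reports the weakest one. It assumes each graded output is a dict of dimension name to a 0/1 score; that shape and the function names are illustrative only.

```python
from collections import defaultdict


def mean_by_dimension(graded_outputs):
    """Average each rubric dimension on its own, never a single rollup."""
    scores = defaultdict(list)
    for graded in graded_outputs:
        for dimension, score in graded.items():
            scores[dimension].append(score)
    return {dim: sum(vals) / len(vals) for dim, vals in scores.items()}


def weakest_dimension(dimension_means):
    """Name of the dimension with the lowest average score."""
    return min(dimension_means, key=dimension_means.get)


# Assumed grading format: 1 = meets the rubric, 0 = does not.
graded_outputs = [
    {"factuality": 1, "grounding": 1, "tone": 1},
    {"factuality": 1, "grounding": 0, "tone": 1},
    {"factuality": 1, "grounding": 0, "tone": 0},
    {"factuality": 1, "grounding": 1, "tone": 1},
]

report = mean_by_dimension(graded_outputs)
fix_first = weakest_dimension(report)
```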
04 · LLM-as-judge
LLM-as-judge — only if you calibrate it.
For subjective evals (does this draft email feel professional? does this summary capture the key points?), human grading doesn't scale. So you use a stronger LLM as the grader.
Two failure modes to watch:
- The judge has its own biases. LLM judges tend to favor outputs written in their own style, so a GPT-class judge may rate Claude outputs lower than they deserve, and vice versa. Calibrate against human ratings on a sample before you trust the judge at scale.
- The judge prompt is the product. “Rate this 1–10” gets you noise. Specific rubrics with examples (“A 9 looks like this. A 5 looks like this.”) get you signal. Iterate on the judge prompt the same way you iterate on the production prompt.
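A judge prompt with anchored score descriptions might be assembled along these lines. This is only a sketch: `build_judge_prompt`, the criterion wording, and the anchor texts are hypothetical, and the call to the judge model itself is deliberately left out.

```python
def build_judge_prompt(criterion, anchors, candidate):
    """Assemble a grading prompt that spells out what each score means."""
    lines = [
        f"Grade the response below on: {criterion}.",
        "Use the anchored scale and reply with a single integer.",
        "",
    ]
    # List anchors from highest score to lowest so the scale reads top-down.
    for score in sorted(anchors, reverse=True):
        lines.append(f"Score {score}: {anchors[score]}")
    lines += ["", "Response to grade:", candidate]
    return "\n".join(lines)


# Illustrative anchors; real ones should come from human-graded examples.
anchors = {
    9: "Names every key point in the source and adds nothing unsupported.",
    5: "Covers the main point but omits at least one key detail.",
    1: "Misstates the source or invents facts.",
}

prompt = build_judge_prompt(
    criterion="does the summary capture the key points",
    anchors=anchors,
    candidate="The contract renews annually unless cancelled in writing.",
)
```

Keeping the prompt in a function makes it easy to version and to iterate on, the same way the production prompt is iterated on.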
Our default: human-grade a sample of 30–50 outputs, run the LLM judge on the same sample, compute chance-corrected agreement (Cohen's kappa, not just the raw agreement rate), and tune the judge prompt until kappa is >0.7. Then trust it at scale.
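For reference, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. The sketch below computes it by hand on a ten-item toy sample (a real calibration sample would be the 30–50 items described above); the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treating this
        # as perfect agreement is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)  # 0.8
kappa = cohens_kappa(human, judge)  # about 0.58
```

In this toy sample the judge matches the human 80% of the time, yet kappa is only about 0.58, which is below the 0.7 bar. That gap is why raw agreement alone is not enough.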
05 · Regression + drift
Run evals on every change. Watch for drift forever.
Pre-deploy: every prompt change, model change, retrieval change, or temperature change re-runs the golden-set eval before merge. If the eval regresses, the change doesn't ship. No exceptions.
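A pre-merge gate of this kind can be very small. The sketch below compares a candidate's per-dimension scores against the recorded baseline and fails if any dimension drops by more than a tolerance. The function name, the score dictionaries, and the 0.02 tolerance are assumptions for illustration; pick a tolerance that matches the noise in your own eval.

```python
def eval_gate(baseline, candidate, tolerance=0.02):
    """Return (passed, regressions) for a candidate versus the baseline.

    A dimension regresses if its score drops by more than `tolerance`
    or if the candidate run did not report it at all.
    """
    regressions = {}
    for dimension, base_score in baseline.items():
        new_score = candidate.get(dimension)
        if new_score is None or base_score - new_score > tolerance:
            regressions[dimension] = (base_score, new_score)
    return (not regressions), regressions


# Illustrative golden-set scores for the current and proposed versions.
baseline = {"factuality": 0.95, "grounding": 0.80}
candidate = {"factuality": 0.96, "grounding": 0.74}

passed, regressions = eval_gate(baseline, candidate)
# passed is False: grounding fell from 0.80 to 0.74.
```

In CI, a wrapper script would exit non-zero when `passed` is false so the merge is blocked automatically.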
Post-deploy: sample production inputs/outputs continuously. Track distribution shifts (are users asking different kinds of questions than the golden set covered?), eval score deltas, refusal rates, latency, token usage, cost-per-resolution.
When a metric drifts, the system tells you — before the business does. This is the difference between “the AI is broken and we don't know why” and “the AI started seeing 20% more questions about Topic X two weeks ago and accuracy on that topic dropped to 73%.”
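The kind of alert described above can be approximated by comparing the topic mix of sampled production traffic against the golden set and checking accuracy per topic. The sketch below assumes each sampled production row has already been tagged with a topic and graded correct or incorrect; the thresholds and topic names are placeholders.

```python
from collections import Counter


def topic_shares(topics):
    """Share of traffic per topic."""
    counts = Counter(topics)
    return {topic: count / len(topics) for topic, count in counts.items()}


def drift_alerts(reference_topics, production_rows,
                 max_share_increase=0.10, min_accuracy=0.80):
    """Flag topics whose share grew or whose accuracy fell past a threshold.

    production_rows is a list of (topic, is_correct) pairs from sampled,
    graded production traffic (an assumed format for this sketch).
    """
    reference = topic_shares(reference_topics)
    production = topic_shares([topic for topic, _ in production_rows])
    alerts = []
    for topic in sorted(production):
        graded = [ok for t, ok in production_rows if t == topic]
        accuracy = sum(graded) / len(graded)
        increase = production[topic] - reference.get(topic, 0.0)
        if increase > max_share_increase or accuracy < min_accuracy:
            alerts.append((topic, round(increase, 2), round(accuracy, 2)))
    return alerts


golden_topics = ["billing"] * 5 + ["shipping"] * 5
sampled = (
    [("billing", True)] * 3
    + [("shipping", True)] * 3
    + [("returns", True), ("returns", False), ("returns", False), ("returns", True)]
)

alerts = drift_alerts(golden_topics, sampled)
# "returns" is absent from the golden set, is 40% of traffic, and scores 50%.
```

A topic that shows up in alerts but not in the golden set is also a signal to add examples for it in the next golden-set version.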
06 · The eval-first deploy checklist
Before you ship anything, walk this list.
- Golden dataset exists. Versioned. Labeled by SMEs. Includes edge cases.
- Rubric is multi-dimensional. Each dimension is tracked separately.
- Baseline eval scores recorded — for the current production version AND for the prior version, if there was one.
- Regression suite runs on every change to prompt, model, retrieval, or config.
- Eval gate blocks deploys that regress on any dimension by more than your tolerance.
- LLM-judge (if used) calibrated against human ratings. Agreement (Cohen's kappa) >0.7.
- Production sampling pipeline live. Drift alerts wired up.
- Token cost and latency budgets defined. Dashboards live before launch, not after.