Every team I have advised in the last two years has the same chart: an offline eval score marching confidently upward, release over release. And every one of them, eventually, has the same second chart: a production-quality metric that refuses to move. The eval harness is not broken. It is measuring something real. It is just not measuring the thing you ship.
- 01 Static eval sets rot the moment your prompt or model changes the distribution of inputs.
- 02 Contamination is the silent killer: your “held-out” set leaked into pre-training.
- 03 LLM-as-judge inherits the judge’s biases — verbosity, position, and self-preference.
- 04 The fix is a living eval: sampled from production, refreshed weekly, and adversarially audited.
Three ways the gap opens
1. Distribution drift you caused yourself
You tuned the prompt to handle a class of tricky inputs. Users notice the product got better at those, so they send more of them. Your eval set, frozen in March, still reflects the March input mix. Your score goes up; your users’ experience is governed by a distribution your harness has never seen.
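One way to notice this drift, sketched under assumptions: label each input with a cluster (the labels below, `billing` and `refund`, are hypothetical), then compare the cluster mix of the frozen eval set against a recent production sample. The function uses total variation distance, which is 0 for identical mixes and 1 for completely disjoint ones. The alert threshold is a placeholder to tune, not a recommendation.

```python
from collections import Counter


def cluster_mix(labels):
    """Return each cluster's share of the inputs as a dict of fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


def total_variation(eval_labels, prod_labels):
    """Total variation distance between two cluster mixes (0 = same, 1 = disjoint)."""
    p = cluster_mix(eval_labels)
    q = cluster_mix(prod_labels)
    clusters = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in clusters)


# Hypothetical cluster labels, for illustration only.
eval_labels = ["billing"] * 50 + ["refund"] * 50   # the frozen March mix
prod_labels = ["billing"] * 20 + ["refund"] * 80   # what users send now

drift = total_variation(eval_labels, prod_labels)
DRIFT_THRESHOLD = 0.2  # assumed placeholder; tune against your own history
if drift > DRIFT_THRESHOLD:
    print(f"eval set is stale: drift={drift:.2f}")
```

A single number like this will not say which cluster moved, but it is cheap enough to compute on every release and plot next to the eval score.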
2. Contamination
If you are evaluating on a public benchmark, assume it is in the training data.
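A vendor's pre-training data is rarely inspectable, but for corpora you do control (fine-tuning data, retrieval indexes, few-shot pools) a crude overlap check is cheap. The sketch below flags eval items that share a long word n-gram with any corpus document. The 8-word window and whitespace tokenisation are assumptions chosen for illustration, and a clean result says nothing about what a model saw in pre-training.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (whitespace tokenisation)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def flag_contaminated(eval_items, corpus_docs, n=8):
    """Return indices of eval items sharing any n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [
        i for i, item in enumerate(eval_items)
        if ngrams(item, n) & corpus_grams
    ]


# Hypothetical data, for illustration only.
corpus = [
    "Support said: to reset your password open settings "
    "and choose the security tab today"
]
evals = [
    "To reset your password open settings and choose the security tab",
    "How do I export my invoices as a spreadsheet for the finance team",
]
print(flag_contaminated(evals, corpus))
```

Items shorter than the window produce no n-grams and are never flagged, so a smaller `n` may be needed for short prompts, at the cost of more false positives.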
3. The judge has opinions
LLM-as-judge is the only scalable option for open-ended quality, but a judge model is not neutral. The well-documented biases are real and they compound:
Verbosity bias (longer looks better), position bias (first/last looks better), and self-preference (a model rates its own family higher) will each quietly inflate your numbers if you do not control for them.

```python
def score(answer_a: str, answer_b: str, judge) -> str:
    # Mitigate POSITION bias: ask twice, swap order, require agreement.
    first = judge.compare(answer_a, answer_b)
    second = judge.compare(answer_b, answer_a)
    if first == "A" and second == "B":
        return "a_wins"
    if first == "B" and second == "A":
        return "b_wins"
    return "tie"  # disagreement under swap => no signal, don't pretend otherwise
```
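Swapping order addresses position bias only. For verbosity bias, one rough diagnostic is to count how often the verdict favours the longer answer among decided pairs. A rate far above one half is a signal to inspect, not proof of bias, since longer answers can also be genuinely better. The record format below (two answers plus an `a_wins` / `b_wins` / `tie` verdict, matching the swap test) is an assumption for illustration.

```python
def longer_win_rate(records):
    """Fraction of decided comparisons won by the longer answer.

    records: iterable of (answer_a, answer_b, verdict) tuples, where verdict
    is "a_wins", "b_wins", or "tie". Ties and equal-length pairs are skipped.
    Returns None when no comparison is usable.
    """
    longer_wins = 0
    decided = 0
    for answer_a, answer_b, verdict in records:
        if verdict == "tie" or len(answer_a) == len(answer_b):
            continue
        decided += 1
        a_is_longer = len(answer_a) > len(answer_b)
        # The longer answer won if the winner and the longer side coincide.
        if (verdict == "a_wins") == a_is_longer:
            longer_wins += 1
    return longer_wins / decided if decided else None


# Hypothetical judged pairs, for illustration only.
records = [
    ("short", "a much longer answer", "b_wins"),
    ("a much longer answer", "short", "a_wins"),
    ("short", "a much longer answer", "a_wins"),
    ("same", "size", "a_wins"),
    ("short", "a much longer answer", "tie"),
]
print(longer_win_rate(records))
```

The same counting pattern could be applied to self-preference by replacing "longer" with "from the judge's own model family", assuming each answer is tagged with the model that produced it.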
Building a living eval
The harness that actually tracks production has four properties:
- Sampled from production. Pull real, anonymised inputs weekly. Stratify by the input clusters that matter to the business.
- Refreshed, not frozen. A static set is a snapshot of a moving target.
- Adversarially audited. A second model, or a human, tries to refute each “pass”. Only passes that survive refutation are ones you trust.
- Tied to one north-star outcome. Resolution rate, escalation rate, thumbs-up — pick the metric the business already believes.
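The first property can be sketched as a stratified draw. This assumes each anonymised production record already carries a cluster label. The `cluster` field, the cluster names, and the quota numbers are all hypothetical. Fixing per-cluster quotas keeps rare but business-critical clusters in the weekly set even when they are a small share of traffic.

```python
import random
from collections import defaultdict


def stratified_sample(records, quotas, seed=None):
    """Draw up to quotas[cluster] records from each cluster.

    records: list of dicts with a "cluster" key (assumed schema).
    quotas: dict mapping cluster name -> number of records wanted.
    Clusters with fewer records than their quota contribute all they have.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for record in records:
        by_cluster[record["cluster"]].append(record)

    sample = []
    for cluster, wanted in quotas.items():
        pool = by_cluster.get(cluster, [])
        sample.extend(rng.sample(pool, min(wanted, len(pool))))
    return sample


# Hypothetical week of anonymised production inputs.
week = (
    [{"cluster": "billing", "text": f"billing question {i}"} for i in range(90)]
    + [{"cluster": "legal", "text": f"legal question {i}"} for i in range(10)]
)
weekly_eval = stratified_sample(week, {"billing": 20, "legal": 20}, seed=7)
print(len(weekly_eval))
```

Passing a fixed seed per week makes a given week's draw reproducible. A cluster that cannot fill its quota, like `legal` here, is worth logging, because it means that slice of the eval rests on very few examples.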
The uncomfortable truth is that a good eval harness is never “done”. It is a product surface with its own backlog. Staff it like one.