Public benchmarks tell you how a model does on someone else's task. They do not tell you whether it is good enough for yours. The only evaluation that matters is the one built from your own inputs and your own definition of correct. This is a repeatable process for deciding, with evidence, whether a model clears the bar before you build on top of it.
Step 1: Set the bar before you look
Decide what "good enough" means before you see any output, so results cannot move the goalposts. Be concrete: is 90 percent acceptable, or does one bad answer in a hundred cause real harm? Different tasks have very different bars.
Step 2: Build a test set from real inputs
Collect real examples, including the messy and ambiguous ones. A test set of only easy cases will pass any model and tell you nothing. Save them in a plain format you can re-run.
# One example per line: the input and the expected/acceptable answer
cat eval_set.jsonl
# {"input": "...", "expected": "..."}
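A minimal sketch of loading such a file in Python, assuming the `input` and `expected` field names shown above. `load_eval_set` is a hypothetical helper, not part of any library. It fails loudly on a malformed line so a broken example cannot silently shrink the test set.

```python
import json
from pathlib import Path


def load_eval_set(path):
    """Read one JSON object per line and check each has the fields we score on."""
    examples = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between examples
        record = json.loads(line)
        # Field names are an assumption taken from the sample line in the text.
        missing = {"input", "expected"} - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing field(s) {sorted(missing)}")
        examples.append(record)
    return examples
```

Calling `load_eval_set("eval_set.jsonl")` returns a list of dictionaries that later steps can iterate over.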
Step 4: Score against your definition, not your mood
Apply the bar from Step 1 to each output. For tasks with one right answer (extraction, classification, formatting), score exact or near-exact matches. For open-ended tasks, use a written rubric so two people would score the same output the same way.
# Reduce the results to the single metric you chose in advance
# (your-scorer is a placeholder for your own scoring script)
your-scorer --results results.jsonl --rubric rubric.txt
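For tasks with one right answer, the scorer can be very small. The sketch below is one possible shape for it, not a reference implementation. It assumes each result record carries an `output` field next to `expected` (the text does not specify the layout of `results.jsonl`), and the three sample records are invented for illustration.

```python
def normalize(text):
    """Collapse whitespace and case so trivial formatting differences are not errors."""
    return " ".join(text.split()).casefold()


def score_exact(results):
    """Return (accuracy, failures) using near-exact matching on normalized text."""
    failures = [
        r for r in results
        if normalize(r["output"]) != normalize(r["expected"])
    ]
    accuracy = (len(results) - len(failures)) / len(results)
    return accuracy, failures


# Invented sample records; the "output" field name is an assumption.
results = [
    {"input": "Invoice dated 3 March 2024", "expected": "2024-03-03", "output": "2024-03-03"},
    {"input": "Due: 1 Jan 2025", "expected": "2025-01-01", "output": " 2025-01-01\n"},
    {"input": "Signed last Friday", "expected": "unknown", "output": "2024-05-10"},
]

accuracy, failures = score_exact(results)
print(f"accuracy={accuracy:.2f}, failures={len(failures)}")
```

Keeping the failed records, not just the count, matters because the next step is to read them.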
Step 5: Read the failures, not just the score
A single accuracy number hides the pattern. Read every failed case. Failures that cluster (always wrong on dates, always wrong on negation) are often fixable with a prompt change. Failures with no discernible pattern suggest the model may not be suited to the task.
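One way to make clusters visible is to tag each failure while reading it and then count the tags. The sketch below assumes a hand-assigned `tag` field on each failed record. That field and the sample data are illustrative and do not come from any particular tool.

```python
from collections import Counter


def failure_clusters(failures):
    """Count failures per hand-assigned tag, most common first."""
    return Counter(f.get("tag", "untagged") for f in failures).most_common()


# Invented failures; "tag" is added by whoever reads the failed case.
failures = [
    {"id": 4, "tag": "dates"},
    {"id": 9, "tag": "negation"},
    {"id": 12, "tag": "dates"},
    {"id": 17, "tag": "dates"},
    {"id": 21},
]

for tag, count in failure_clusters(failures):
    print(f"{tag}: {count}")
```

A tag that accounts for most of the failures is a candidate for a targeted prompt change. A flat spread across many tags is the less encouraging case.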
Step 6: Decide and record
Compare the measured number to the bar you set in Step 1. Record the model, the date, the score, and the failure pattern. That record is what lets you re-run the same evaluation when you consider a different or newer model.
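The record can be as simple as one appended line per run. This is a sketch under assumptions: `record_run`, the log file name, and the field names are hypothetical and should be adapted to whatever you already track.

```python
import json
from datetime import date
from pathlib import Path


def record_run(log_path, model, score, bar, failure_pattern):
    """Append one evaluation run to a JSON Lines log and return the entry."""
    entry = {
        "model": model,
        "date": date.today().isoformat(),
        "score": score,
        "bar": bar,
        "passed": score >= bar,  # the bar was fixed before any output was seen
        "failure_pattern": failure_pattern,
    }
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

For example, `record_run("eval_log.jsonl", "model-a", 0.92, 0.90, "mostly date formats")` appends one line. Running it again for a newer model gives a direct comparison against the same bar.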
Testing on easy cases only. A gold set without hard and ambiguous examples passes everything and means nothing. Include the cases that actually break things.
Moving the bar after seeing results. Decide "good enough" before you look. Otherwise every model looks acceptable in hindsight.
Trusting one number. Aggregate accuracy hides clustered failures that a quick read of the errors would reveal in minutes.
A vague rubric. If you cannot score the same output the same way twice, your evaluation is measuring your mood, not the model.
A tiny sample. Five examples cannot distinguish a 70 percent model from a 90 percent one. Start around 30 to 50 and grow the hard cases.
Confusing fluent with correct. A confident, well-written wrong answer is still wrong. Score the content, not the prose.
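The sample-size point above can be checked with a little arithmetic. The sketch below uses the Wilson score interval, a standard approximation chosen here for illustration and not something the text prescribes, to show how much uncertainty surrounds an observed accuracy at different test-set sizes.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an observed accuracy."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - margin, centre + margin


# The same observed 80 percent accuracy at three test-set sizes.
for correct, total in [(4, 5), (40, 50), (160, 200)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: {low:.2f} to {high:.2f}")
```

With 4 of 5 correct, the interval runs from roughly 0.38 to 0.96, so the result is consistent with both a 70 percent and a 90 percent model. The interval narrows as the set grows, which is the case for starting with a few dozen examples and adding more over time.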
Evaluation is not a one-time gate. Keep the gold set, version it, and re-run it whenever the model, the prompt, or the task changes. The cost of building it once is small next to the cost of shipping on a model that quietly fails the cases you never tested.