the biggest mistake i see in ml projects is building without evals. people tweak prompts, swap models, change parameters — all based on vibes. "it feels better now." that's not engineering, that's guessing.
i've started treating evals the way i treat tests in software. before i change anything, i write down what good looks like: a set of inputs with expected outputs, scored automatically where possible.
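as a sketch of what that can look like in practice (the names and the exact-match scorer here are illustrative, not anything specific from this post):

```python
# a minimal eval set: real inputs paired with what "correct" means.
# exact match is the simplest scorer; swap in whatever fits your task
# (substring checks, regex, an llm-as-judge, etc.).
EVAL_CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of france", "expected": "paris"},
]

def score(output: str, expected: str) -> float:
    """return 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
```

the point isn't the scoring logic, it's that "good" is written down somewhere a script can check.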
the process is simple:
- collect 50-100 real examples from your use case
- define what "correct" means for each one
- run your current system against all of them
- make a change
- run again
- compare
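the whole loop above fits in a few lines. `call_system` below is a hypothetical stand-in for whatever you're evaluating (a prompt, a chain, an agent); this is a sketch, not a prescribed harness:

```python
def call_system(prompt: str) -> str:
    """placeholder for your llm pipeline; just echoes the prompt here."""
    return prompt

def run_evals(cases, system):
    """run every case through the system and return the mean score."""
    def score(output, expected):
        return 1.0 if output.strip() == expected.strip() else 0.0
    scores = [score(system(c["input"]), c["expected"]) for c in cases]
    return sum(scores) / len(scores)

cases = [
    {"input": "ping", "expected": "ping"},
    {"input": "hello", "expected": "world"},
]
baseline = run_evals(cases, call_system)  # 0.5: one of two cases passes
```

make a change, call `run_evals` again, and compare the two numbers instead of comparing vibes.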
it's not glamorous. most of the work is in the first two steps — collecting good examples and figuring out your scoring criteria. but once you have that, every decision becomes evidence-based.
the side benefit: evals make it trivial to test new models. when anthropic or openai drops something new, i can benchmark it against my current setup in minutes instead of days of manual testing.
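concretely, benchmarking a new model reduces to swapping one function. the two "models" below are trivial stand-ins (in practice they'd wrap real api calls), and `run_evals` is the same illustrative harness idea as above:

```python
def run_evals(cases, system):
    """mean exact-match score of a system over the eval cases."""
    scores = [
        1.0 if system(c["input"]).strip() == c["expected"].strip() else 0.0
        for c in cases
    ]
    return sum(scores) / len(scores)

# stand-in "models" — in reality, two wrappers around different llm apis
def current_model(prompt: str) -> str:
    return prompt.upper()

def candidate_model(prompt: str) -> str:
    return prompt

cases = [{"input": "abc", "expected": "abc"}]
old_score = run_evals(cases, current_model)    # 0.0
new_score = run_evals(cases, candidate_model)  # 1.0
```

same cases, same scorer, different model: the comparison takes as long as one run of the suite.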
if you're building with llms and you don't have evals, that's the single highest-leverage thing you can do today. not a new framework. not a new model. evals.