the biggest mistake i see in ml projects is building without evals. people tweak prompts, swap models, change parameters — all based on vibes. "it feels better now." that's not engineering, that's guessing.
i've started treating evals the way i treat tests in software. before i change anything, i write down what good looks like: a set of inputs with expected outputs, scored automatically where possible.
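as a sketch of what that can look like in practice (the names and the exact-match scorer here are illustrative, not anything specific from this post):

```python
# a minimal eval set: real inputs paired with what "correct" means.
# exact match is the simplest scorer; swap in whatever fits your task
# (substring checks, regex, an llm-as-judge, etc.).
EVAL_CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of france", "expected": "paris"},
]

def score(output: str, expected: str) -> float:
    """return 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
```

the point isn't the scoring logic, it's that "good" is written down somewhere a script can check.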
the process is simple:
- collect 50-100 real examples from your use case
- define what "correct" means for each one
- run your current system against all of them
- make a change
- run again
- compare
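the whole loop above fits in a few lines. `call_system` below is a hypothetical stand-in for whatever you're evaluating (a prompt, a chain, an agent); this is a sketch, not a prescribed harness:

```python
def call_system(prompt: str) -> str:
    """placeholder for your llm pipeline; just echoes the prompt here."""
    return prompt

def run_evals(cases, system):
    """run every case through the system and return the mean score."""
    def score(output, expected):
        return 1.0 if output.strip() == expected.strip() else 0.0
    scores = [score(system(c["input"]), c["expected"]) for c in cases]
    return sum(scores) / len(scores)

cases = [
    {"input": "ping", "expected": "ping"},
    {"input": "hello", "expected": "world"},
]
baseline = run_evals(cases, call_system)  # 0.5: one of two cases passes
```

make a change, call `run_evals` again, and compare the two numbers instead of comparing vibes.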
it's not glamorous. most of the work is in the first two steps — collecting good examples and figuring out your scoring criteria. but once you have that, every decision becomes evidence-based.
the side benefit: evals make it trivial to test new models. when anthropic or openai drops something new, i can benchmark it against my current setup in minutes instead of days of manual testing.
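concretely, benchmarking a new model reduces to swapping one function. the two "models" below are trivial stand-ins (in practice they'd wrap real api calls), and `run_evals` is the same illustrative harness idea as above:

```python
def run_evals(cases, system):
    """mean exact-match score of a system over the eval cases."""
    scores = [
        1.0 if system(c["input"]).strip() == c["expected"].strip() else 0.0
        for c in cases
    ]
    return sum(scores) / len(scores)

# stand-in "models" — in reality, two wrappers around different llm apis
def current_model(prompt: str) -> str:
    return prompt.upper()

def candidate_model(prompt: str) -> str:
    return prompt

cases = [{"input": "abc", "expected": "abc"}]
old_score = run_evals(cases, current_model)    # 0.0
new_score = run_evals(cases, candidate_model)  # 1.0
```

same cases, same scorer, different model: the comparison takes as long as one run of the suite.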
if you're building with llms and you don't have evals, that's the single highest-leverage thing you can do today. not a new framework. not a new model. evals.