eval-driven development

the biggest mistake i see in ml projects is building without evals. people tweak prompts, swap models, change parameters — all based on vibes. "it feels better now." that's not engineering, that's guessing.

i've started treating evals the way i treat tests in software. before i change anything, i write down what good looks like. a set of inputs with expected outputs, scored automatically where possible.

the process is simple:

  1. collect 50-100 real examples from your use case
  2. define what "correct" means for each one
  3. run your current system against all of them
  4. make a change
  5. run again
  6. compare

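the loop above can be sketched in a few lines of python. everything here is a placeholder: the examples are made up and `run_system` is a canned stub standing in for your real model call, but the shape of the harness is the point.

```python
# hypothetical eval set: each entry pairs a real input with what "correct" means.
# in practice you'd load 50-100 of these from a file of collected examples.
EXAMPLES = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]

def run_system(prompt: str) -> str:
    """stand-in for your actual model call; swap in your real client here."""
    canned = {"2+2": "4", "capital of France": "Paris", "3*3": "6"}
    return canned.get(prompt, "")

def run_eval(examples, system):
    """run every example through the system and score each one."""
    results = []
    for ex in examples:
        output = system(ex["input"])
        results.append({"input": ex["input"], "output": output,
                        "passed": output == ex["expected"]})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

pass_rate, results = run_eval(EXAMPLES, run_system)
for r in results:
    print(("PASS" if r["passed"] else "FAIL"), r["input"])
print(f"pass rate: {pass_rate:.0%}")  # → pass rate: 67%
```

run it before a change, run it after, diff the two pass rates. that's steps 3 through 6.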
it's not glamorous. most of the work is in steps 1 and 2: collecting good examples and figuring out your scoring criteria. but once you have the eval set, every decision becomes evidence-based.
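defining "correct" is where most of the judgment lives. sometimes strict string equality is right; often you want something looser that gives credit when the answer is buried in a longer response. two simple scorers as a starting point (these are generic sketches, not tied to any framework):

```python
def exact_match(output: str, expected: str) -> bool:
    """strict: normalized string equality."""
    return output.strip().lower() == expected.strip().lower()

def contains(output: str, expected: str) -> bool:
    """looser: credits answers that include the expected string anywhere."""
    return expected.strip().lower() in output.lower()

print(exact_match("Paris ", "paris"))                  # True
print(exact_match("The capital is Paris.", "Paris"))   # False
print(contains("The capital is Paris.", "Paris"))      # True
```

pick the scorer per example, not per eval set; some questions have one right string, others just need the right fact to show up.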

the side benefit: evals make it trivial to test new models. when anthropic or openai drops something new, i can benchmark it against my current setup in minutes instead of days of manual testing.
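a sketch of what that swap looks like: one eval function, every model behind the same prompt-in, answer-out interface. the two model functions here are canned stand-ins; in practice each would wrap a real api client.

```python
# tiny eval set, same idea as before
EXAMPLES = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def eval_model(generate) -> float:
    """score any callable mapping prompt -> answer against the same eval set."""
    passed = sum(generate(ex["input"]) == ex["expected"] for ex in EXAMPLES)
    return passed / len(EXAMPLES)

# canned stand-ins; in practice each wraps a real api client
def current_model(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "")

def candidate_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")

for name, fn in [("current", current_model), ("candidate", candidate_model)]:
    print(f"{name}: {eval_model(fn):.0%}")
# → current: 50%
# → candidate: 100%
```

the new model is just one more callable; the eval set and scorer never change, which is what makes the comparison fair.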

if you're building with llms and you don't have evals, that's the single highest-leverage thing you can do today. not a new framework. not a new model. evals.