spent two weeks trying to improve a classification model by experimenting with architectures. tried transformers, tried ensembles, tried every trick i know. accuracy went from 87% to 88.5%.
then i spent one afternoon cleaning the training data. removed duplicates, fixed mislabeled examples, balanced the classes. accuracy jumped to 93%.
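that afternoon of cleanup boils down to three mechanical passes: dedupe, relabel, rebalance. here's a minimal sketch of them on a toy list of (text, label) pairs — the rows, the `fixes` table, and the downsampling choice are all hypothetical stand-ins, not my actual dataset:

```python
import random
from collections import Counter

# toy dataset: (text, label) pairs — hypothetical examples
rows = [
    ("great product", "pos"),
    ("great product", "pos"),  # exact duplicate
    ("terrible", "neg"),
    ("terrible!", "pos"),      # mislabeled
    ("ok i guess", "neg"),
]

# 1. remove exact duplicates while preserving order
seen, deduped = set(), []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

# 2. fix known mislabeled examples (in practice this table
#    comes from a manual review pass, not from thin air)
fixes = {"terrible!": "neg"}
relabeled = [(x, fixes.get(x, y)) for x, y in deduped]

# 3. balance classes by downsampling the majority class
by_label = {}
for x, y in relabeled:
    by_label.setdefault(y, []).append(x)
n = min(len(xs) for xs in by_label.values())
random.seed(0)
balanced = [(x, y) for y, xs in by_label.items()
            for x in random.sample(xs, n)]

print(Counter(y for _, y in balanced))
```

downsampling is the bluntest way to balance — upsampling the minority class or weighting the loss are fine alternatives, but the point is that none of these steps touch the model.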
this keeps happening. not just to me — i see it in every ml team i've worked with. there's a bias toward model complexity because it's more interesting work. nobody writes blog posts about how they relabeled 200 training examples. but that boring work is usually what moves the needle.
a few rules i follow now:
- before touching the model, look at 100 random examples from your dataset
- if you find even 5 that are wrong, your data is the bottleneck
- the best feature engineering is often just fixing your labels
- a simple model on clean data beats a complex model on messy data, every time
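the first two rules together make a cheap audit loop. a sketch of what i mean, assuming your dataset is a list of (example, label) pairs — `looks_mislabeled` here is a placeholder for your own eyeballs, not a real function:

```python
import random

# hypothetical dataset: 1000 (example, label) pairs
dataset = [(f"example {i}", "a" if i % 2 else "b") for i in range(1000)]


def looks_mislabeled(example, label):
    # stand-in for a human judgment call while reading the row;
    # always False in this sketch
    return False


random.seed(42)
sample = random.sample(dataset, 100)  # rule 1: look at 100 random rows

wrong = sum(looks_mislabeled(x, y) for x, y in sample)
if wrong >= 5:  # rule 2: ~5%+ label noise means data is the bottleneck
    print("fix the labels before touching the model")
```

the 5-in-100 threshold isn't magic — it's just the point where a quick extrapolation (roughly 5% of the whole dataset is wrong) makes any architecture comparison noise.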
the unglamorous truth of ml: most of the alpha is in the data pipeline, not the model architecture. the sooner you accept that, the faster you ship things that actually work.