running models locally changed how i think about ml

when i first started with ml, everything was api calls. send data to openai, get a response back. it worked, but it always felt like renting someone else's brain.

then i started running models locally with ollama. nothing fancy — a quantized llama on a macbook. but something shifted in how i understood these systems.

when you run a model locally, you feel the weight of it. you see how much memory it eats, how long inference takes, how the quality degrades when you quantize too aggressively. you develop intuition for what's actually happening inside the model instead of treating it as a magic endpoint.

a few things i learned from running local models:

  • 7b parameter models are surprisingly capable for focused tasks
  • quantization to 4-bit is usually fine; 2-bit starts to hurt
  • the bottleneck is almost always memory, not compute
  • batching requests matters way more than i expected
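to make the memory point concrete, here's a rough back-of-envelope sketch (my own illustration, not from any particular tool): weight memory is roughly parameters × bits per weight ÷ 8 bytes. this counts weights only — kv cache and runtime overhead add more on top.

```python
def weight_memory_gb(params_billions: float, bits: int) -> float:
    """approximate weight memory in GB (weights only; ignores kv cache and overhead)."""
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9

# a 7b model at common quantization levels
for bits in (16, 8, 4, 2):
    print(f"7b at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} gb")
# 16-bit: ~14.0 gb, 8-bit: ~7.0 gb, 4-bit: ~3.5 gb, 2-bit: ~1.8 gb
```

this is why a 4-bit 7b model fits comfortably on a 16 gb macbook while the full-precision version doesn't.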

i still use apis for production workloads. but for prototyping, experimentation, and building intuition — local models are now my default. there's something about being able to poke at the thing directly that no api can replicate.