“Teaching by example, with millions of examples and a tireless intern.”
The three-step recipe
All modern AI training boils down to: (1) show the model an example, (2) measure how wrong it was, (3) nudge its knobs so it would be slightly less wrong next time. Then repeat that loop somewhere between a billion and a trillion times.
The "how wrong was it" measurement is called the loss. The "nudge the knobs" step is called gradient descent. Together they are the engine of machine learning. Most of the other vocabulary (transformer, attention, fine-tuning, RLHF) names either the kind of model being nudged or a variation on how the loop is run.
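To make the loop concrete, here is a minimal sketch in plain Python. It assumes a toy model with a single knob `w`, a squared-error loss, and made-up data that follows y = 3x; none of these choices come from a real system, they are just the smallest setup that shows all three steps.

```python
# Toy data (hypothetical): the correct answers follow y = 3 * x.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

def train(examples, learning_rate=0.01, epochs=200):
    w = 0.0  # the single "knob" this toy model is allowed to turn
    for _ in range(epochs):
        for x, y_true in examples:
            y_pred = w * x                      # (1) show the model an example
            loss = (y_pred - y_true) ** 2       # (2) measure how wrong it was
            grad = 2 * (y_pred - y_true) * x    # slope of the loss with respect to w
            w -= learning_rate * grad           # (3) nudge the knob downhill
    return w

w = train(examples)
print(round(w, 3))  # ends up very close to 3.0
```

A real model has billions of knobs instead of one, and the gradient is computed automatically rather than by hand, but the shape of the loop is the same.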
Three flavors of "example"
What counts as an example depends on what you want the model to learn. There are three main setups, and most large AI systems combine more than one of them.
Supervised learning. Every example is paired with the correct answer. "This photo → cat. This email → spam." The model learns the mapping. Most classical ML (fraud detection, medical imaging, recommendation) is supervised. It needs labels, which means humans, which means expensive.
Self-supervised learning. The model invents its own labels from the data itself. Given a sentence with a word missing, predict the word. Given the first half of a paragraph, predict the second. This is how every modern large language model is pretrained, and it's the single most important reason they scale. Labels are free because the internet writes them for you.
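A small sketch of where those free labels come from. The function name `next_word_examples` and the `context_size` parameter are hypothetical, and splitting on whitespace is a simplification; real language models work on sub-word tokens rather than whole words.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, target) pairs with no human labelling."""
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]  # the words just before position i
        pairs.append((context, words[i]))            # the "label" is simply the next word
    return pairs

pairs = next_word_examples("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

One six-word sentence yields five training examples, and nobody had to annotate anything. That is the whole trick.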
Reinforcement learning. The model takes actions in some environment and gets a reward signal such as a high score, win, click, or thumbs up. It tunes its behavior to chase reward. This is how AlphaGo learned to beat Go champions, and how chat models are polished after pretraining.
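The reward-chasing idea can be sketched with the simplest reinforcement learning setup, a multi-armed bandit: three possible actions, each with a fixed payoff the agent does not know in advance. This is an illustration only. The payoffs, the epsilon-greedy strategy, and the function name `run_bandit` are assumptions chosen for brevity, and the setup is far simpler than anything behind AlphaGo or chat-model tuning.

```python
import random

def run_bandit(true_rewards, steps=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(true_rewards)  # the agent's current guess for each action
    counts = [0] * len(true_rewards)       # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(true_rewards))
        else:
            # Exploit: otherwise pick the action that currently looks best.
            action = max(range(len(true_rewards)), key=lambda a: estimates[a])
        reward = true_rewards[action]      # the environment's feedback (noise-free here)
        counts[action] += 1
        # Running average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical payoffs: the third action is the best one.
estimates, counts = run_bandit([0.2, 0.5, 0.9])
print(counts)
```

Nobody tells the agent which action is correct. It only sees rewards, yet it ends up spending most of its time on the best action.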
Pretraining vs fine-tuning
Modern LLMs are built in two stages, and the distinction matters when you read announcements.
Pretraining is the giant, expensive, self-supervised pass: predict-the-next-word over trillions of words of internet text, books, and code. This is where the overwhelming majority of the compute goes. The output is a model that knows facts and language but has no manners; it will happily complete "How do I make a bo" as either "bookshelf" or something far worse.
Fine-tuning is a much shorter, cheaper, supervised or RL pass that shapes the pretrained model into something useful: a chat assistant, a code completer, a customer support agent. Fine-tuning teaches behavior, not knowledge. If the base model doesn't know who wrote *Anna Karenina*, fine-tuning won't fix it.
Overfitting: the one failure mode you need to know
The whole point of training is that the model should work on examples it has *never seen*. A model that memorises its training data perfectly but flunks new inputs is useless. This failure mode is called overfitting, and avoiding it is most of what separates a working ML system from a broken one.
Visualise it: you're fitting a curve through scattered points. A straight line might miss many points but capture the overall trend. A wildly wiggly curve might pass through every point exactly and predict nonsense between them. Real training data is noisy. Your job is to learn the signal, not the noise.
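The curve-fitting picture can be run directly. The sketch below uses made-up data: the true trend is y = 2x, and each training point is pushed up or down by 1 as "noise". The straight line is an ordinary least-squares fit; the wiggly curve is a Lagrange interpolating polynomial, chosen here only because it is guaranteed to pass through every training point exactly.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a predictor function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_wiggly(xs, ys):
    """Lagrange interpolating polynomial: hits every training point exactly."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a predictor on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: true trend y = 2x, training points carry +/-1 noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [2 * x for x in test_x]  # noise-free points, never used for fitting

line, wiggly = fit_line(train_x, train_y), fit_wiggly(train_x, train_y)
print("train error:", mse(line, train_x, train_y), mse(wiggly, train_x, train_y))
print("test error: ", mse(line, test_x, test_y), mse(wiggly, test_x, test_y))
```

On this data the wiggly curve scores a perfect zero on the training points, while the line is off by roughly 0.9. On the unseen points the ranking flips: roughly 0.06 for the line against roughly 2.7 for the wiggly curve. Perfect marks on the training set told you nothing useful.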
The standard defence is a held-out validation set: a slice of data the model is never trained on. You watch the validation loss as training progresses. When it starts rising steadily while training loss keeps falling, you stop, a practice known as early stopping. The model has begun memorising rather than generalising.
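In practice validation loss is noisy, so a common variant waits for a few bad checks in a row before stopping and then keeps the best checkpoint seen so far. Here is a minimal sketch of that rule; the function name `early_stopping`, the `patience` parameter, and the loss values are all hypothetical.

```python
def early_stopping(val_losses, patience=2):
    """Scan validation losses in order and stop once the loss has failed to
    improve for `patience` consecutive checks.

    Returns (best_step, stopped_at): the checkpoint worth keeping and the
    step at which scanning stopped.
    """
    best_loss = float("inf")
    best_step = None
    stopped_at = None
    bad_checks = 0
    for step, loss in enumerate(val_losses):
        stopped_at = step
        if loss < best_loss:
            best_loss, best_step, bad_checks = loss, step, 0  # new best: reset the counter
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break  # validation loss has stopped improving
    return best_step, stopped_at

# Hypothetical validation losses recorded after each pass over the data.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping(val_losses))  # → (3, 5)
```

Training stops at step 5, but the model you keep is the one from step 3, before it started memorising.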
Why training takes months and millions
A frontier LLM training run in 2025 costs tens to hundreds of millions of dollars and runs for weeks to months across tens of thousands of GPUs. The cost is dominated by one thing: the loop works through trillions of words, and every one of them touches every parameter.
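A back-of-envelope sketch shows how those numbers arise. It leans on a widely used rule of thumb that training costs roughly 6 floating-point operations per parameter per word (token) processed. Every input below (model size, data size, GPU speed, utilization, GPU count, hourly price) is a hypothetical round number picked for illustration, not a figure from any real training run.

```python
def training_estimate(params, tokens, flops_per_gpu_per_sec,
                      utilization, gpu_count, dollars_per_gpu_hour):
    """Rough training time (days) and cost (dollars) under the 6 * N * D rule of thumb."""
    total_flops = 6 * params * tokens  # approximate compute for the whole run
    cluster_flops_per_sec = flops_per_gpu_per_sec * utilization * gpu_count
    seconds = total_flops / cluster_flops_per_sec
    gpu_hours = seconds / 3600 * gpu_count
    return seconds / 86400, gpu_hours * dollars_per_gpu_hour

# All values are hypothetical round numbers.
days, dollars = training_estimate(
    params=4e11,                  # 400 billion parameters
    tokens=1.5e13,                # 15 trillion tokens of training text
    flops_per_gpu_per_sec=1e15,   # peak speed of one GPU
    utilization=0.4,              # fraction of peak actually achieved
    gpu_count=20_000,
    dollars_per_gpu_hour=2.0,
)
print(f"{days:.0f} days, ${dollars / 1e6:.0f} million")
```

With these made-up inputs the estimate comes out at about 52 days and about 50 million dollars, which sits inside the range quoted above. Double the model or the data and the bill doubles with it.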
Inference (actually using the trained model) is much cheaper per call but adds up at scale. The economics of AI are: training is a one-time capex; inference is the ongoing opex. Every product decision (model size, context length, batching) is downstream of that split.
In one line each
- Training = show example, measure error (loss), nudge parameters (gradient descent). Repeat a trillion times.
- Supervised needs labels; self-supervised invents them from data; reinforcement learns from reward.
- LLMs are built in two stages: massive self-supervised pretraining, then small fine-tuning that teaches behavior.
- The enemy of training is overfitting, which means memorising the data instead of learning the pattern.
Where to go next