A chain of voters

“A chain of voters, each weighing the evidence and passing a guess forward.”

One neuron, one decision

The atomic unit is the neuron. It takes a handful of numbers as input, multiplies each by a weight, adds a constant (the bias), and squashes the result through a non-linear function. Output: one number. That's it. There is no biology happening.

If you think of the inputs as evidence and the weights as how much each piece of evidence matters, a neuron is a tiny voting booth: "Does this photo show a whisker? A pointy ear? Fur? Yes, yes, yes. Output: a high score for cat."

Inputs travel along weighted wires into a sum-with-bias, then through a non-linear activation. One number out.
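
As a minimal sketch of that voting booth, here is one neuron in NumPy. The three evidence inputs, the weights, the bias, and the choice of a sigmoid as the squashing function are all made-up values for illustration, not taken from any trained network.

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weighted sum plus bias, squashed by a sigmoid into the range (0, 1)."""
    z = np.dot(inputs, weights) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical evidence: whisker, pointy ear, fur (1 = seen, 0 = not seen).
weights = np.array([2.0, 1.5, 1.0])   # how much each piece of evidence matters
bias = -2.0                           # illustrative baseline scepticism

cat_score = neuron(np.array([1.0, 1.0, 1.0]), weights, bias)     # all evidence present
no_cat_score = neuron(np.array([0.0, 0.0, 0.0]), weights, bias)  # no evidence present
print(cat_score, no_cat_score)
```

With all three pieces of evidence present the score lands above 0.9; with none it drops below 0.2. Training is the process of finding weights like these automatically instead of picking them by hand.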

Why layers, and why deep

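A single layer of neurons can only learn straight-line distinctions: "is this point above or below the line?" Add a hidden layer and you can learn curves. Formally, the "universal approximation theorem" says that even one hidden layer, given enough neurons, can approximate any continuous function on a bounded range as closely as you like. In practice, stacking more layers tends to get there with far fewer neurons, given enough data.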
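
A classic way to see the straight-line limit is XOR: no single line separates the inputs (0,1) and (1,0) from (0,0) and (1,1). The sketch below shows a two-layer network with a ReLU in the middle computing XOR exactly. The weights are hand-picked for illustration rather than learned.

```python
import numpy as np

# The four XOR inputs, one per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked (not trained) weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

hidden = np.maximum(0, X @ W1 + b1)   # ReLU hidden layer
xor_out = hidden @ W2 + b2            # linear output layer
print(xor_out)  # [0. 1. 1. 0.]
```

The second hidden unit only switches on when both inputs are 1, and the output layer uses it to cancel the first unit. That bend is something a single weighted sum cannot produce.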

The interesting trick is what happens inside deep networks: each layer learns a slightly more abstract feature than the one below. In vision networks, the first layers detect edges, the next ones corners and textures, the middle ones eyes and wheels, the deepest ones faces and cars. Nobody programmed this hierarchy; it emerges from the training objective.

Nobody tells the network to build a hierarchy. It emerges because that's the most compressed way to predict the answer.

The non-linear squish

Between every two layers there's a small non-linear function applied to each output. Common ones: ReLU ("if negative, output zero; else keep it"), GELU, sigmoid, tanh. Without it, stacking layers gains nothing: a linear layer feeding a linear layer is still just a linear layer, so the whole network would collapse mathematically into a single one.

The non-linearity is the whole reason deep networks can model curves, shapes, and decisions a single layer can't. It's a small detail that does enormous structural work.
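
The collapse is easy to check numerically. In this sketch (random weights and arbitrary layer sizes, chosen only for illustration), two stacked linear layers give exactly the same output as one merged layer, and putting a ReLU between them is what breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                   # arbitrary input
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)     # layer 1
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)     # layer 2

# Two linear layers with no non-linearity in between.
two_linear = (x @ W1 + b1) @ W2 + b2

# The same map written as a single layer.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_linear = x @ W_merged + b_merged
collapses = np.allclose(two_linear, one_linear)

# With a ReLU in between, the merged layer no longer reproduces the output in general.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(collapses, np.allclose(with_relu, one_linear))
```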

How a network actually computes

In practice, you never compute one neuron at a time. A whole layer is one big matrix multiplication: input vector times weight matrix, plus bias vector, passed through the non-linearity to give an output vector. The next layer takes that output and does the same. The whole network is just a chain of matrix multiplications interleaved with non-linearities.
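
To make the equivalence concrete, the sketch below computes one layer twice: neuron by neuron in a loop, and as a single matrix multiplication. The sizes (4 inputs, 3 neurons, a batch of 5) and the random values are arbitrary assumptions for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=4)         # one input vector
W = rng.normal(size=(4, 3))    # one column of weights per neuron
b = rng.normal(size=3)         # one bias per neuron

# One neuron at a time: a weighted sum, a bias, a squish.
loop_out = np.array([relu(np.dot(x, W[:, j]) + b[j]) for j in range(3)])

# The whole layer at once.
layer_out = relu(x @ W + b)
same = np.allclose(loop_out, layer_out)

# The same expression handles a batch of inputs, one per row.
X = rng.normal(size=(5, 4))
batch_out = relu(X @ W + b)    # shape (5, 3)
print(same, batch_out.shape)
```

The batched form is the one that matters in practice: one large multiplication replaces many small ones, which is the kind of work parallel hardware handles well.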

This is why GPUs matter so much. GPUs were built for graphics, and graphics is matrix multiplications. They turned out to be the perfect hardware for the thing nobody designed them for. The whole AI boom rides on this accident.

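# forward pass, two layers (NumPy; the layer sizes below are arbitrary examples)
import numpy as np

def relu(z):
    return np.maximum(0, z)   # max(0, ...) applied elementwise

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                      # x: input vector
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))   # W1, W2: weight matrices
b1, b2 = np.zeros(8), np.zeros(3)                           # b1, b2: bias vectors

h1 = relu(x @ W1 + b1)   # input -> hidden
y  = h1 @ W2 + b2        # hidden -> output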

Backpropagation, in one paragraph

Training a network means adjusting every weight matrix and bias vector so the loss gets smaller. To know which way to nudge them, you need the gradient of the loss with respect to each parameter. Computing those gradients naively, one parameter at a time, is hopeless: there can be hundreds of billions of parameters. Backpropagation is the trick that computes all of them in one backward sweep, by applying the chain rule layer by layer. It costs only a small constant multiple of the forward pass.
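
Here is a hedged sketch of that backward sweep for a two-layer ReLU network. The squared-error loss, the random target, and the layer sizes are assumptions made for the example. The last lines compare one backprop gradient against a slow finite-difference estimate as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                            # input vector
target = rng.normal(size=3)                       # made-up target output
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def loss_fn(W1, b1, W2, b2):
    """Forward pass followed by a squared-error loss (an illustrative choice)."""
    h1 = np.maximum(0, x @ W1 + b1)
    y = h1 @ W2 + b2
    return 0.5 * np.sum((y - target) ** 2)

# Forward pass, keeping the intermediate values the backward pass needs.
z1 = x @ W1 + b1
h1 = np.maximum(0, z1)
y = h1 @ W2 + b2

# Backward pass: chain rule, from the last layer to the first.
dy = y - target              # dLoss/dy
dW2 = np.outer(h1, dy)       # dLoss/dW2
db2 = dy                     # dLoss/db2
dh1 = W2 @ dy                # push the gradient back through layer 2
dz1 = dh1 * (z1 > 0)         # ReLU passes gradient only where it was active
dW1 = np.outer(x, dz1)       # dLoss/dW1
db1 = dz1                    # dLoss/db1

# Sanity check: nudge a single weight and measure how the loss changes.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(W1_plus, b1, W2, b2) - loss_fn(W1_minus, b1, W2, b2)) / (2 * eps)
print(dW1[0, 0], numeric)
```

The finite-difference check needs two extra forward passes for a single weight. Backprop produces every gradient in one sweep, which is the difference between hopeless and practical.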

Backprop was the unlock that made deep networks practical. The math has been the same since 1986; the reason it pays off today in a way it could not back then is that GPUs let us run it on networks roughly a million times larger.

Wait: where's the language part?

Everything so far is generic. A neural network for cat photos and a neural network for text both look the same at this level: matrix multiplications and non-linearities. What makes one good at images and another good at language is (a) what you feed in and (b) how the layers are wired. We meet the language-specific wiring, embeddings and attention, in chapters 4 and 5.

In one line each

A neuron is a weighted sum followed by a non-linear squish. That's the whole atom.
Layers stack neurons; depth lets the network learn an abstraction hierarchy without being told to.
All computation is matrix multiplications interleaved with non-linearities, which is why GPUs won.
Backpropagation is how you compute gradients efficiently across hundreds of billions of parameters.

Where to go next

Chapter 4: From words to numbers