Backpropagation — How Neural Networks Learn
The chain rule applied to a network of layers. Gradients flow backward, weights update, the network gets better. Understood once, never forgotten.
A network makes a prediction. It is wrong. Backpropagation answers one question: which weights caused the error, and by exactly how much should each one change?
Module 41 showed the forward pass — data flows left to right through the network, layer by layer, until a prediction emerges. The prediction is compared to the true label. The difference is the loss. Now what? The network has thousands of weights. Which ones made the prediction wrong? How wrong did each one make it? How much should each one move?
This is the credit assignment problem — the hardest problem in training neural networks. If the network predicts 32 minutes for a delivery that actually took 41 minutes, which of the 5,000 weights is responsible for the 9-minute underestimate? All of them contributed — but in different amounts, through different paths.
Backpropagation solves credit assignment using the chain rule from calculus. It starts at the loss and works backwards — computing how much the loss would change if each weight changed by a tiny amount. That quantity is the gradient. Once you have the gradient for every weight, gradient descent subtracts a small fraction of it from each weight. Repeat this millions of times and the network learns.
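The update loop described above can be sketched on a toy one-parameter loss. This is an illustrative example, not the network from this module: the loss L(w) = (w − 3)² has gradient 2(w − 3), and repeatedly subtracting a small fraction of it drives w toward the minimum at 3.

```python
# Minimal sketch of the gradient-descent loop on a toy loss L(w) = (w - 3)**2,
# whose gradient is dL/dw = 2 * (w - 3). Repeating the update drives w toward 3.
w = 0.0
lr = 0.1                 # the "small fraction" (learning rate)
for _ in range(100):
    grad = 2 * (w - 3)   # gradient of the loss at the current w
    w -= lr * grad       # subtract a small fraction of the gradient
# after 100 steps, w is very close to 3, the minimum of the loss
```

In a real network the gradient comes from backpropagation rather than a hand-derived formula, but the update step is exactly this subtraction, repeated for every weight.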
A manager wants to know why a project was delivered late. They start at the final delay (the loss) and trace backwards. The deployment was late because testing was late. Testing was late because development was late. Development was late because requirements were unclear. At each step they answer: how much did this step contribute to the final delay? That is the chain rule — each step's contribution multiplied together to reach the root cause.
Backpropagation traces the prediction error backwards through the network. At each layer it asks: how much did this layer's weights contribute to the final error? The answer — the gradient — tells each weight exactly how to change to reduce the error next time.
The chain rule — the only piece of calculus backprop needs
The chain rule says: if y depends on z which depends on x, then how y changes with x equals how y changes with z multiplied by how z changes with x. Written as: ∂y/∂x = (∂y/∂z) × (∂z/∂x).
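A quick numerical sanity check of that formula, using made-up functions: take y = z² with z = 3x, so ∂y/∂x = (∂y/∂z) × (∂z/∂x) = 2z × 3. A finite-difference estimate agrees with the chain-rule answer.

```python
# Toy check of the chain rule: y = z**2 with z = 3*x,
# so dy/dx = (dy/dz) * (dz/dx) = (2*z) * 3.
def z_of(x):
    return 3 * x

def y_of(x):
    return z_of(x) ** 2

x = 2.0
analytic = 2 * z_of(x) * 3                        # chain rule: (dy/dz) * (dz/dx)
h = 1e-6
numeric = (y_of(x + h) - y_of(x - h)) / (2 * h)   # finite-difference estimate
# both are 36.0 (up to floating-point error)
```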
A neural network is a chain of functions. The loss depends on the output, which depends on layer 3, which depends on layer 2, which depends on layer 1, which depends on the weights. Backpropagation applies the chain rule at each link in this chain — starting from the loss and multiplying derivatives backwards through every layer until reaching the weights.
Backprop in matrix form — the same chain rule, but for every weight simultaneously
The chain rule example above worked on one weight. A real network has thousands. The key insight: in matrix form, the chain rule applies to entire layers simultaneously — the same equations work regardless of how wide or deep the network is. Each layer produces two outputs during backprop: the gradient for its own weights (used to update them) and the gradient to pass further backward (used by the previous layer).
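A minimal NumPy sketch of one layer's backward step, with illustrative names (`A_prev`, `dZ`, etc.) and a ReLU activation assumed for concreteness. Note the two outputs: `dW` to update this layer, `dA_prev` to hand to the layer before it.

```python
import numpy as np

# One layer, forward: Z = A_prev @ W + b; A = relu(Z).
rng = np.random.default_rng(0)
batch, n_in, n_out = 4, 5, 3
A_prev = rng.normal(size=(batch, n_in))   # activations from the previous layer
W = rng.normal(size=(n_in, n_out))
b = np.zeros(n_out)

Z = A_prev @ W + b
A = np.maximum(Z, 0)                      # ReLU

# Backward: dA is the gradient arriving from the layer after this one.
dA = rng.normal(size=(batch, n_out))
dZ = dA * (Z > 0)                         # chain rule through the ReLU
dW = A_prev.T @ dZ                        # output 1: gradient for this layer's weights
db = dZ.sum(axis=0)                       # gradient for the bias
dA_prev = dZ @ W.T                        # output 2: gradient passed further backward
```

The same two matrix products repeat at every layer, whatever its width, which is why the equations do not change as the network grows.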
Vanishing and exploding gradients — why deep networks were hard before 2015
The chain rule multiplies gradients across layers. In a 10-layer network, the gradient for layer 1's weights is the product of 10 terms — one per layer. If each term is slightly less than 1 (like sigmoid derivatives, which top out at 0.25), the product shrinks exponentially. By layer 1, the gradient is essentially zero — weights never update. This is vanishing gradients.
The opposite happens if each term is greater than 1. The product grows exponentially — gradients explode to billions, weights update by enormous amounts, and training diverges. This is exploding gradients.
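Both failure modes fall out of simple arithmetic. A toy calculation with per-layer derivative factors (0.25 for sigmoid's maximum, 1.5 as an arbitrary above-one factor) shows how fast the product runs away in each direction.

```python
# Multiplying per-layer derivative factors across depth.
shrink = 0.25            # maximum of the sigmoid derivative
grow = 1.5               # an arbitrary factor slightly above 1

vanish = shrink ** 10    # ~9.5e-7: after 10 layers the gradient is nearly gone
explode = grow ** 50     # ~6.4e8: after 50 layers the gradient has blown up
```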
Gradient checking — the numerical test that proves backprop is correct
When you implement backprop manually, bugs are easy to introduce — a transposed matrix, a missing factor, a wrong sign. Gradient checking is the gold standard test: compare every analytical gradient (from backprop) to its numerical approximation (computed by slightly perturbing each weight and measuring the loss change). If they match to within 1e-5, backprop is correct.
- Two-sided difference is more accurate than one-sided: its approximation error is O(h²) vs O(h).
- Do this for every weight element — expensive but definitive.
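A sketch of the procedure on a toy linear model (all names illustrative): perturb each weight by ±h, measure the loss change, and compare against the analytical gradient via a relative error.

```python
import numpy as np

def loss_fn(W, X, y):
    pred = X @ W                          # toy linear model
    return 0.5 * np.mean((pred - y) ** 2)

def analytic_grad(W, X, y):
    pred = X @ W
    return X.T @ (pred - y) / len(y)      # dLoss/dW for the MSE above

def numeric_grad(W, X, y, h=1e-5):
    g = np.zeros_like(W)
    it = np.nditer(W, flags=["multi_index"])
    for _ in it:                          # perturb every weight element in turn
        i = it.multi_index
        orig = W[i]
        W[i] = orig + h; plus = loss_fn(W, X, y)
        W[i] = orig - h; minus = loss_fn(W, X, y)
        W[i] = orig                       # restore the weight
        g[i] = (plus - minus) / (2 * h)   # two-sided difference, O(h^2) error
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
W = rng.normal(size=(3, 1))
ga = analytic_grad(W, X, y)
gn = numeric_grad(W, X, y)
rel_err = np.abs(ga - gn).max() / (np.abs(ga).max() + np.abs(gn).max())
# rel_err should land well below 1e-5 if the analytical gradient is correct
```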
Autograd — PyTorch builds the computational graph and runs backprop automatically
Everything you coded by hand above — caching intermediate values, applying the chain rule at each layer, computing dW and passing dA backwards — PyTorch does automatically. When you call loss.backward(), PyTorch traces the computational graph it built during the forward pass and applies the chain rule to every operation. Every tensor with requires_grad=True gets its .grad populated.
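A minimal sketch, assuming PyTorch is installed: one tracked parameter, one backward call, and a manual check that the populated `.grad` matches the matrix formula from the backprop section.

```python
import torch

# Autograd in miniature: the forward pass is recorded, and
# loss.backward() applies the chain rule through the recorded graph.
torch.manual_seed(0)
W = torch.randn(3, 1, requires_grad=True)   # tracked parameter
X = torch.randn(8, 3)
y = torch.randn(8, 1)

loss = ((X @ W - y) ** 2).mean()
loss.backward()                             # backprop through the graph

# W.grad now holds dLoss/dW; compare against the manual matrix formula.
manual = 2 * X.T @ (X @ W.detach() - y) / y.numel()
# torch.allclose(W.grad, manual) holds
```

A second `loss.backward()` without `optimizer.zero_grad()` (or `W.grad.zero_()`) would add to `W.grad` rather than replace it — this is the accumulation behaviour that makes zeroing gradients mandatory in every training step.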
Every common backprop mistake — explained and fixed
You understand how networks learn. Next: what they learn through.
Backpropagation is the learning algorithm. But the network's ability to learn depends critically on two other choices: the activation function (what non-linearity to apply at each neuron) and the loss function (what the network is trying to minimise). Module 43 covers every major activation and loss function — what each one does, when to use it, and the numerical stability pitfalls that trip up every practitioner at least once.
ReLU, GELU, Swish, sigmoid, softmax — and cross-entropy, MSE, Huber, focal loss. When to use each and why numerical stability matters more than you think.
🎯 Key Takeaways
- ✓Backpropagation solves credit assignment — given a prediction error, how much is each weight responsible? It applies the chain rule backwards through the network: start at the loss, multiply derivatives layer by layer back to the weights.
- ✓Each layer in the backward pass produces two things: the gradient for its own weights (∂L/∂W = A_prev.T @ dZ) and the gradient to pass further backward (∂L/∂A_prev = dZ @ W.T). This pattern repeats identically for every layer.
- ✓Vanishing gradients happen when derivatives multiply to near-zero across many layers — sigmoid derivatives top out at 0.25, so 10 layers gives 0.25^10 ≈ 10^-6. The fix is ReLU activation, which has derivative 1 for positive inputs and preserves gradient magnitude.
- ✓Gradient checking is the definitive test for correct backprop: compare analytical gradients (from backprop) to numerical gradients (finite difference). Relative error below 1e-5 means your implementation is correct. Always gradient-check before training a new network architecture.
- ✓PyTorch autograd builds a computational graph during the forward pass and runs backprop automatically on loss.backward(). Every tensor with requires_grad=True gets its .grad populated. Call optimizer.zero_grad() before every backward() — PyTorch accumulates gradients by default.
- ✓Three practical rules: use BCEWithLogitsLoss instead of Sigmoid+BCELoss for numerical stability, always clip sigmoid inputs to avoid overflow, and call optimizer.zero_grad() at the start of every training step without exception.