Optimisers — SGD, Adam, AdamW
Momentum, adaptive learning rates, and weight decay done right. Why AdamW replaced Adam as the default and when SGD still wins.
Backpropagation tells you which direction to move each weight. The optimiser decides how far to move — and how to move smarter than just "subtract the gradient."
After backpropagation you have a gradient for every weight — the direction in which each weight should change to reduce the loss. The simplest possible update: subtract a small fraction of the gradient. That fraction is the learning rate. This is plain SGD. It works, but it has two major problems in practice.
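The naive update can be sketched in a few lines of plain Python (toy weights and gradients, purely illustrative):

```python
# Minimal sketch of a plain SGD update. The weights and gradients here are
# hypothetical toy values, not from a real network.
def sgd_step(weights, grads, lr=0.1):
    """Subtract learning_rate * gradient from every weight."""
    return [w - lr * g for w, g in zip(weights, grads)]

weights = [0.5, -0.3]
grads = [0.2, -0.4]          # pretend these came from backprop
weights = sgd_step(weights, grads)
print(weights)               # approximately [0.48, -0.26]
```

Note that the step taken is the same fraction of the gradient for every weight, regardless of that weight's history, which is exactly the first problem described below.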
First: the same learning rate for every weight. A weight that receives large, consistent gradient signals needs smaller steps to avoid overshooting. A weight that receives rare, tiny gradients needs larger steps to make any progress. Treating all weights the same wastes most of the gradient signal.
Second: gradient noise. Mini-batch gradients are noisy estimates of the true gradient, and a single step in a noisy direction can move weights the wrong way. Accumulating direction over many past steps — momentum — filters out the noise and accelerates progress. Modern optimisers (Adam, AdamW) solve both problems simultaneously.
SGD is like hiking downhill in thick fog with one step at a time — you only see the slope directly under your feet right now. SGD with momentum is like a ball rolling downhill — it accumulates speed in consistent directions and is slowed less by small bumps. Adam is a smart hiker with a map of the terrain history — they take big steps on flat ground and small careful steps on steep or unpredictable terrain.
AdamW is Adam who also carries a light backpack that gets heavier the further they walk — gently pulling them back toward the origin (weight decay) to prevent them from wandering too far.
SGD and momentum — from naive update to direction accumulation
Plain SGD is the simplest possible optimiser: subtract learning_rate × gradient from each weight every step. Momentum extends this by accumulating a velocity — a weighted average of all past gradients. Instead of updating directly from the current gradient, you update from the velocity, which smooths out noise and accelerates in consistent directions.
- Consistent gradients accumulate — speed builds up.
- Noisy gradients cancel — oscillations are dampened.
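Both behaviours can be seen in a minimal pure-Python sketch of the momentum update, assuming the standard β=0.9 and illustrative toy gradients:

```python
# Sketch of SGD with momentum for a single weight (beta=0.9, toy gradients).
def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    velocity = beta * velocity + grad    # accumulate direction over time
    w = w - lr * velocity                # step along the velocity, not the raw gradient
    return w, velocity

# Consistent gradients: velocity grows toward grad / (1 - beta) = 10x the gradient.
w, v = 0.0, 0.0
for _ in range(20):
    w, v = momentum_step(w, 1.0, v)
print(v)        # approaching 10

# Alternating (noisy) gradients largely cancel inside the velocity.
w2, v2 = 0.0, 0.0
for g in [1.0, -1.0] * 10:
    w2, v2 = momentum_step(w2, g, v2)
print(abs(v2))  # far smaller than 10
```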
Adam — per-weight adaptive learning rates via first and second moments
Adam (Adaptive Moment Estimation) maintains two running statistics per weight: the first moment (exponential moving average of gradients — like momentum) and the second moment (exponential moving average of squared gradients — measures how large gradients have been historically). Each weight's update is roughly lr × m̂ / (√v̂ + ε), so weights with large past gradients get a smaller effective step size automatically.
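A pure-Python sketch of the Adam update for a single weight, using the defaults from the original paper (β₁=0.9, β₂=0.999, ε=1e-8). The constant-gradient loop at the end is a toy illustration of the per-weight scale invariance, not a real training run:

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment: momentum-like average
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: gradient magnitude
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Two weights whose gradients differ by 100x end up taking nearly identical
# step sizes, because each step is normalised by that weight's own history.
wa, ma, va = 0.0, 0.0, 0.0
wb, mb, vb = 0.0, 0.0, 0.0
for t in range(1, 101):
    wa, ma, va = adam_step(wa, 0.01, ma, va, t)  # small, consistent gradient
    wb, mb, vb = adam_step(wb, 1.0, mb, vb, t)   # large, consistent gradient
print(wa, wb)  # both close to -0.1 (100 steps of ~lr each)
```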
Adam vs AdamW — why weight decay was broken in Adam
In standard SGD, L2 regularisation (adding λ||W||² to the loss) and weight decay (subtracting λW from the weight directly) are mathematically equivalent. In Adam they are not — and this caused Adam's weight decay to be effectively much weaker than intended for years before anyone noticed.
The problem: in Adam, the L2 gradient λW gets divided by √v̂ just like any other gradient — weights with large historical gradients get a smaller effective weight decay than weights with small gradients. The regularisation strength varies per weight in an uncontrolled way. AdamW (Loshchilov and Hutter, 2019) fixes this by decoupling weight decay from the gradient update — applying it directly to the weight before the adaptive gradient step.
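The difference can be sketched in pure Python for a single weight (toy values; `wd` plays the role of λ, and the demo deliberately uses a zero data gradient so that decay is the only force acting):

```python
import math

def adam_l2_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    grad = grad + wd * w                     # L2 folded into the gradient...
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v   # ...and rescaled by 1/sqrt(v_hat)

def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    w = w - lr * wd * w                      # decoupled decay, applied to the weight directly
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# With only decay acting, Adam-style L2 moves the weight by ~lr per step no
# matter how large wd is (the decay gradient is normalised by its own
# magnitude), while AdamW's shrinkage scales with wd as intended.
for wd in (0.01, 1.0):
    w1, m1, v1 = 10.0, 0.0, 0.0
    w2, m2, v2 = 10.0, 0.0, 0.0
    for t in range(1, 11):
        w1, m1, v1 = adam_l2_step(w1, 0.0, m1, v1, t, wd=wd)
        w2, m2, v2 = adamw_step(w2, 0.0, m2, v2, t, wd=wd)
    print(wd, 10.0 - w1, 10.0 - w2)
```

A 100x increase in `wd` leaves the Adam-style total decay essentially unchanged, while AdamW's scales proportionally — the uncontrolled behaviour described above, in miniature.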
Learning rate schedules — warmup, cosine decay, and ReduceLROnPlateau
The learning rate is the single most important hyperparameter. A fixed learning rate is always a compromise — too high early on causes divergence, too low late in training means slow progress. Schedules give you the best of both: a high rate for fast early exploration and a low rate for precise final convergence.
Linear warmup is especially important for Adam-based optimisers. In the first steps, the second-moment estimate v̂ is computed from only a handful of gradients, so it is a high-variance, unreliable estimate of the true gradient scale — and when the early gradients happen to be small, the effective step lr/√v̂ becomes very large. Warmup starts with a tiny learning rate and gradually increases it, preventing unstable large updates in the first steps.
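A sketch of linear warmup followed by cosine decay, a common combination; the step counts and rates here are hypothetical placeholders:

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000, min_lr=1e-5):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps       # linear ramp from ~0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))    # smoothly 1 -> 0
    return min_lr + (base_lr - min_lr) * cosine

print(lr_at(0))      # tiny rate at the very first step
print(lr_at(999))    # full base_lr at the end of warmup
print(lr_at(9999))   # close to min_lr at the end of training
```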
When SGD+momentum beats Adam — and why generalisation differs
Adam converges faster in almost every setting. But on large-scale image classification (ImageNet-scale CNNs) and some NLP tasks, SGD+momentum often achieves better final test accuracy despite slower convergence. This is a known phenomenon with a widely cited (though still debated) explanation: Adam tends to find sharp minima (narrow valleys in the loss landscape) while SGD tends to find flat minima. Flat minima generalise better because small perturbations to weights — which happen naturally when the data distribution shifts slightly — do not change the loss much. Sharp minima are sensitive to such perturbations.
The optimiser is chosen. Next: making deep networks stable and preventing overfitting.
You now have the complete training loop: forward pass, loss, backprop, optimiser step. Module 45 adds the two techniques that make deep networks stable and generalisable at scale — Batch Normalisation (stabilise activations between layers) and Dropout (prevent co-adaptation and overfitting). These are not optional extras — they are standard components of every production deep learning model.
Internal covariate shift, running statistics, and why model.eval() is not optional when BatchNorm is in your network.
🎯 Key Takeaways
- ✓SGD updates every weight by the same learning rate times the gradient. SGD with momentum accumulates a velocity — a weighted average of past gradients. Momentum smooths noisy gradient directions and accelerates in consistent directions. β=0.9 is the standard default.
- ✓Adam maintains per-weight adaptive learning rates using two moment estimates: the first moment (running mean of gradients — like momentum) and the second moment (running mean of squared gradients — measures gradient magnitude). Weights with large past gradients get smaller effective steps automatically.
- ✓Bias correction in Adam is essential in the first training steps. Without it, m and v start at zero and underestimate the true moments — producing unstable first updates. The correction terms 1/(1−β₁ᵗ) and 1/(1−β₂ᵗ) fix this and become negligible after ~100 steps.
- ✓AdamW decouples weight decay from the gradient update. In Adam, L2 regularisation is scaled by the adaptive learning rate — making it weaker for weights with large historical gradients. AdamW applies weight decay directly to the weight before the gradient step — uniform across all weights. Always prefer AdamW over Adam.
- ✓Default starting point for any new deep learning project: AdamW with lr=1e-3 and weight_decay=0.01. Pair with CosineAnnealingLR or ReduceLROnPlateau. Only switch to SGD+momentum when you have evidence it generalises better — primarily large-scale image classification.
- ✓The mandatory training step order: optimizer.zero_grad() → forward pass → loss → loss.backward() → optimizer.step(). Never rearrange these four lines. PyTorch accumulates gradients by default, so zero_grad() must run before backward(); calling it between backward() and step() would erase the freshly computed gradients before the optimiser uses them.
Discussion
Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.