Training Deep Networks — Adam, BatchNorm, Dropout
The four techniques that separate a network that trains from one that trains well. Used in every production deep learning system.
Module 41 built a network that works. This module makes it train 10× faster, generalise better, and stay stable on deep architectures.
The network from Module 41 used plain SGD — subtract a fixed fraction of the gradient from each weight every step. It works, but it has three serious problems in practice. First: the same learning rate for every weight regardless of how frequently that weight gets useful gradient signal. Second: as the network gets deeper, activations between layers drift to extreme values — making gradients vanish or explode. Third: the network memorises training data instead of learning generalisable patterns.
Four techniques solve these problems and together define how every modern neural network is trained in production: Adam (adaptive learning rates), Batch Normalisation (stabilise activations), Dropout (prevent overfitting), and Learning Rate Scheduling (decay the learning rate as training matures). You will use all four in every non-trivial deep learning project you build.
Learning to drive with plain SGD is like driving with one fixed foot pressure on the accelerator — too fast on empty roads, too slow in traffic. Adam is an experienced driver who automatically adjusts pressure based on the terrain — easing off on straight highways, pressing harder on uphill roads.
BatchNorm is the suspension system — absorbs shocks between layers so the car stays stable. Dropout is the practice of occasionally driving with one eye closed — forces the driver to not rely too heavily on any single cue, producing a more robust skill. LR scheduling is easing off the accelerator as you approach your destination.
Adam — the optimizer that replaced SGD for almost everything
Plain SGD uses the same learning rate for every weight at every step. This is suboptimal for two reasons. Some weights receive large, consistent gradient signals and should take smaller steps to avoid overshooting. Other weights receive rare, small gradient signals and should take larger steps to make progress at all. Adam maintains a separate effective learning rate for each weight, automatically adjusted based on the history of gradients for that weight.
Adam combines two ideas. The first moment (mean of past gradients) acts like momentum — it accumulates direction from past steps, smoothing out noise. The second moment (mean of squared past gradients) scales the step size — weights that have been receiving large gradients get a smaller effective learning rate. Together they produce an adaptive per-weight step size that makes training much faster and more robust to learning rate choice.
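The two moments can be sketched in a few lines of NumPy. This is a minimal version of the update rule from the Adam paper (including the bias correction that compensates for the moments starting at zero); the function name and the toy quadratic objective are illustrative, not from any library:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a weight array w given its gradient grad at step t."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction: moments start at zero
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight adaptive step
    return w, m, v

# Minimise f(w) = (w - 3)^2, whose gradient is 2(w - 3)
w = np.array([0.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
```

Note how the step size is `lr * m_hat / sqrt(v_hat)`: a weight whose gradients are large and consistent has `sqrt(v_hat)` roughly as large as `m_hat`, so its effective step stays near `lr` rather than growing with the gradient.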
Batch Normalisation — the technique that made very deep networks trainable
As data flows through a deep network, the distribution of activations at each layer shifts and grows — a problem called internal covariate shift. Layer 3 expects activations in a certain range, but layer 2's outputs drift as weights update. Layer 3 then has to continuously re-adapt to its changing inputs. This is one reason very deep networks (10+ layers) trained poorly before BatchNorm.
Batch Normalisation fixes this by normalising the activations at each layer before passing them to the next. For each mini-batch, it subtracts the batch mean and divides by the batch standard deviation — forcing activations to approximately zero mean and unit variance. Two learnable parameters (gamma and beta) then scale and shift the normalised values back to whatever distribution is optimal for that layer.
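The training-time computation is short enough to write out directly. A NumPy sketch of the forward pass (illustrative only; at inference, frameworks instead use running averages of the mean and variance collected during training, which is why `model.eval()` matters):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch dimension, then scale and shift.
    x: (batch, features); gamma, beta: learnable (features,) parameters."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # ~zero mean, unit variance
    return gamma * x_hat + beta               # restore whatever scale/shift suits this layer

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # activations that have drifted
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma=1` and `beta=0` the output is simply the normalised activations; during training the network learns whatever `gamma` and `beta` work best for the next layer.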
Dropout — randomly disable neurons during training
Dropout is the simplest and most effective regularisation technique for neural networks. During each training step, each neuron is randomly set to zero with probability p (the dropout rate). The neuron produces no output and receives no gradient for that step. At test time, all neurons are active; in the classic formulation their outputs are scaled by (1 − p) so each layer's expected activation matches what downstream layers saw during training. (Modern frameworks use the equivalent "inverted" formulation: scale the surviving activations by 1/(1 − p) during training, so no scaling is needed at test time.)
Why does randomly disabling neurons help? It prevents co-adaptation — neurons learning to rely on specific other neurons. When any neuron might be absent, every neuron is forced to learn features that are individually useful. The result is an ensemble effect: each forward pass trains a slightly different sub-network, and the full network is an implicit average of all these sub-networks.
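The mechanics fit in a few lines. A NumPy sketch of inverted dropout, the variant PyTorch's `nn.Dropout` implements (function and argument names here are illustrative):

```python
import numpy as np

def dropout_forward(x, p, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training,
    scaling survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the layer is the identity."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) >= p).astype(x.dtype)  # 0 with prob p, else 1
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((10000, 1))
out = dropout_forward(x, p=0.2, rng=rng)       # ~20% zeros, survivors scaled to 1.25
out_eval = dropout_forward(x, p=0.2, training=False)  # identity at test time
```

Each forward pass draws a fresh mask, so every training step really does train a different sub-network.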
A cricket team that always plays with the same 11 players develops tight co-dependences — the bowlers know exactly when the fielders will move. Now randomly sit out 3 players each practice session. Every player must become more versatile, able to cover gaps. The team becomes more robust. That is dropout.
In deep learning: each neuron must learn a feature that is useful regardless of which other neurons happen to be active. The network cannot rely on any specific path from input to output.
Learning rate schedules — start fast, finish precise
A high learning rate at the start of training is good — large steps explore the loss landscape quickly and escape bad initialisations. But a high learning rate near convergence is bad — the model bounces around the minimum without settling into it. Learning rate schedules reduce the learning rate as training progresses, combining fast early progress with precise final convergence.
- StepLR: multiply lr by gamma every step_size epochs. Simple and widely used. lr drops discretely — you see the loss decrease in jumps.
- CosineAnnealingLR: smoothly decrease lr following a cosine curve from the initial value to eta_min over T_max epochs. Smooth and reliable — very widely used.
- ReduceLROnPlateau: monitor a metric (usually validation loss). If it does not improve for "patience" epochs, reduce lr by a factor. Adaptive — reacts to actual training dynamics.
- OneCycleLR: increase lr from base to max over roughly the first 30% of training, then decrease to near zero. Based on the "super-convergence" phenomenon. Often fastest to converge.
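The decay schedules are small closed-form formulas. A plain-Python sketch of the step and cosine rules (parameter names mirror the `step_size`, `gamma`, `T_max`, and `eta_min` arguments of PyTorch's `StepLR` and `CosineAnnealingLR`; the function names themselves are illustrative):

```python
import math

def step_lr(lr0, epoch, step_size=30, gamma=0.1):
    """Discrete decay: multiply lr by gamma once every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)

def cosine_lr(lr0, epoch, T_max=100, eta_min=0.0):
    """Smooth decay from lr0 to eta_min along half a cosine period."""
    return eta_min + 0.5 * (lr0 - eta_min) * (1 + math.cos(math.pi * epoch / T_max))
```

In practice you would not hand-roll these: call `scheduler.step()` each epoch and let PyTorch update `optimizer.param_groups[0]["lr"]` for you. The formulas are worth knowing so you can sanity-check that the lr you log matches the schedule you intended.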
All four techniques together — the standard modern training loop
In production, all four techniques are used simultaneously. AdamW as the optimizer, BatchNorm between linear layers and activations, Dropout after activations in hidden layers, CosineAnnealingLR or ReduceLROnPlateau for scheduling, and early stopping to prevent overfitting when validation loss stops improving. This is the recipe used in virtually every production MLP today.
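Of these components, early stopping is the one you typically hand-roll rather than import. A minimal sketch of the usual pattern (class and attribute names are illustrative; in a real loop you would also snapshot the model's `state_dict` whenever `best` improves, and restore it when stopping):

```python
class EarlyStopping:
    """Stop when the monitored metric hasn't improved for `patience` epochs."""
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0
        self.should_stop = False

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:  # genuine improvement
            self.best = val_loss
            self.bad_epochs = 0
        else:                                      # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.should_stop = True
        return self.should_stop

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73]  # improvement stalls after the third epoch
stopped_at = next(i for i, loss in enumerate(losses) if stopper.step(loss))
```

In the full loop, `stopper.step(val_loss)` runs once per epoch right after validation, alongside `scheduler.step()`.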
Every common training technique mistake — explained and fixed
You can train deep MLPs reliably. Next: convolutional networks for images.
The MLP you have built is a general-purpose network — it works on tabular data, but it is not the right architecture for images. Images have spatial structure: nearby pixels are related, patterns appear at different positions in the image. A fully connected layer treats every pixel independently and ignores this structure. Convolutional Neural Networks (CNNs) are designed specifically to exploit spatial structure — they are the backbone of every image classification, object detection, and medical imaging system.
Filters, feature maps, pooling, and how CNNs learn to recognise objects at any position in an image.
🎯 Key Takeaways
- ✓Adam maintains a separate adaptive learning rate per weight based on the history of gradients. Weights that receive large consistent gradients get smaller steps. Weights with rare small gradients get larger steps. Use AdamW (Adam with correct weight decay) as the default optimizer for all deep learning.
- ✓Batch Normalisation normalises activations at each layer to zero mean and unit variance within each mini-batch, then applies learnable scale (gamma) and shift (beta) parameters. Prevents internal covariate shift, makes very deep networks trainable, and acts as mild regularisation. Always call model.eval() before inference — BatchNorm behaves differently in train vs eval mode.
- ✓Dropout randomly zeros a fraction p of neurons during each training step, forcing the network to not rely on any specific path. At inference, all neurons are active and outputs are scaled by (1-p). Place Dropout after activation functions, not before BatchNorm. Typical values: p=0.2 for hidden layers, p=0.5 for the final hidden layer.
- ✓Learning rate schedules start with a higher lr for fast early progress and reduce it as training matures for precise final convergence. ReduceLROnPlateau is the most robust — it only reduces lr when validation loss stops improving. OneCycleLR often converges fastest. Always monitor optimizer.param_groups[0]["lr"] to verify the schedule is working.
- ✓Early stopping is essential — monitor validation loss and stop training when it has not improved for "patience" epochs, then restore the best weights. This prevents overfitting without needing to guess the right number of epochs in advance.
- ✓The production training recipe: AdamW + BatchNorm + Dropout + ReduceLROnPlateau + early stopping. These five components are the standard in virtually every production MLP. Start with lr=0.001, dropout=0.2, weight_decay=0.01, and tune from there.