Activation Functions and Loss Functions
ReLU, GELU, Swish, sigmoid, softmax — and cross-entropy, MSE, Huber, focal loss. When to use each and why numerical stability matters more than you think.
The activation function decides what a neuron can express. The loss function decides what the network is trying to achieve. Both choices are made before training — and both can silently make a network fail.
You have a network architecture — layers, widths, connections. Two remaining decisions determine whether it trains successfully: what non-linearity to apply after each layer (activation function) and what quantity to minimise during training (loss function). Both are often treated as trivial defaults, but both have failure modes that are genuinely hard to debug.
The wrong activation function causes vanishing gradients (sigmoid in deep networks), dead neurons (ReLU with bad initialisation), or slow convergence (tanh). The wrong loss function causes the network to optimise for the wrong thing entirely — a sigmoid-output model trained with MSE on a classification problem receives vanishing gradients precisely when it is confidently wrong, so it converges slowly and separates classes poorly. Numerical instability in either can silently corrupt training with NaN losses.
A football player's training regime (the loss function) determines what they get better at. Train them to minimise goals conceded — they become a defender. Train them to maximise goals scored — they become a striker. The same player, the same training intensity, but the objective determines the skill.
The activation function is the player's physical capability — how much they can bend, how fast they can turn. A player with no flexibility (linear activation) cannot do anything a simple regression cannot. A player with full agility (ReLU, GELU) can learn arbitrarily complex patterns.
Six activation functions — what each one does and when to use it
An activation function is applied element-wise after the linear transformation of each layer. Without it, a 10-layer network would collapse to a single linear transformation — no more expressive than one layer. The activation function is what gives neural networks their ability to learn non-linear patterns.
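The collapse is easy to verify directly. A minimal numpy sketch (the function names here are mine, not a library API): two stacked linear layers with no activation are exactly one linear layer, while an element-wise non-linearity in between breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers": y = W2 @ (W1 @ x) with no activation in between.
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

deep = W2 @ (W1 @ x)        # two linear layers, no activation
collapsed = (W2 @ W1) @ x   # one equivalent linear layer
assert np.allclose(deep, collapsed)  # identical: depth bought nothing

# Element-wise non-linearities break the collapse and add expressive power.
def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):
    # tanh approximation used by many transformer implementations
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

nonlinear = W2 @ relu(W1 @ x)
# In general no single matrix W satisfies W @ x == nonlinear for all x.
```

The same check fails for any non-linear activation: there is no single weight matrix that reproduces `W2 @ relu(W1 @ x)` for every input.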
Six loss functions — match the loss to the task
The loss function is the quantity the network minimises during training. Choosing the wrong loss does not crash training — the network often trains fine while optimising for the wrong thing. MSE on a classification task treats class labels as arbitrary numbers: it penalises confident correct predictions and, with a sigmoid output, gives vanishing gradients on confident wrong ones. The outputs look reasonable but are poorly calibrated and slow to separate.
The right loss function is determined entirely by the output type and what "correct" means for your task. Regression → MSE or MAE or Huber. Binary classification → BCE. Multi-class → Cross-entropy. Imbalanced classes → Focal loss. These are not interchangeable.
- **MSE (L2 loss)**: predicting continuous values where large errors are costly. Delivery time, stock price, temperature.
- **MAE (L1 loss)**: regression with outliers. Treats all errors proportionally.
- **Huber loss**: best of both — MSE for small errors (smooth gradient), MAE for large errors (outlier robust).
- **Binary cross-entropy (BCE)**: two-class problems. Output layer must produce probabilities (0–1).
- **Cross-entropy**: three or more classes. Output layer produces one logit per class.
- **Focal loss**: severe class imbalance — fraud (1%), disease (0.1%). Downweights easy examples.
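The regression losses above differ only in how they weight large errors, which is easy to see by implementing the formulas directly. A numpy sketch (function names are mine, not a library API):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small,
                            0.5 * err**2,                          # quadratic near zero
                            delta * (np.abs(err) - 0.5 * delta)))  # linear in the tails

def cross_entropy(logits, label):
    # stable log-softmax: shift by the max before exponentiating
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

y_true = np.array([0.0, 1.0, 2.0, 100.0])   # last point is an outlier
y_pred = np.array([0.1, 0.9, 2.2, 3.0])

# MSE is dominated by the single outlier; MAE and Huber are not.
assert mse(y_true, y_pred) > 10 * mae(y_true, y_pred)
```

Squaring the 97-unit outlier error makes it dominate MSE almost entirely, while MAE and Huber grow only linearly with it — which is exactly why they are the outlier-robust choices.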
Numerical stability — why BCEWithLogitsLoss beats BCELoss every time
The most common source of NaN losses in production deep learning is not wrong architecture or bad data — it is numerical instability in loss functions. Understanding why BCEWithLogitsLoss exists and why CrossEntropyLoss takes raw logits (not softmax outputs) prevents hours of debugging.
Computing log(sigmoid(z)) naively is unstable: for large negative z, exp(−z) overflows, and once sigmoid(z) rounds to 0 the log returns −inf. The numerically stable version uses the log-sum-exp trick. For a logit z and label y ∈ {0, 1}:

loss(z, y) = max(z, 0) − z·y + log(1 + e^(−|z|))
PyTorch's BCEWithLogitsLoss and CrossEntropyLoss implement this trick internally. Using nn.Sigmoid() + nn.BCELoss() skips it — leading to NaN at training time when logits are large.
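The failure is easy to reproduce. A numpy sketch (the names `bce_naive` and `bce_stable` are mine) comparing the naive sigmoid-then-log computation against the log-sum-exp form on an extreme but perfectly legal logit:

```python
import numpy as np

def bce_naive(z, y):
    """Sigmoid followed by log — the pattern nn.Sigmoid() + nn.BCELoss() uses."""
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_stable(z, y):
    """Log-sum-exp form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z, y = -800.0, 1.0   # extreme logit, true label 1
with np.errstate(over="ignore", divide="ignore"):
    naive = bce_naive(z, y)   # exp(800) overflows -> sigmoid is 0 -> log(0) is -inf
stable = bce_stable(z, y)     # exactly 800, the true value of -log(sigmoid(-800))

assert np.isinf(naive)
assert abs(stable - 800.0) < 1e-9
```

In a real training loop the `inf` would come from float32 logits well before ±800, and one such value is enough to turn the whole batch loss — and every gradient behind it — into NaN.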
A complete decision guide — activation and loss for every task
Every common activation and loss mistake — explained and fixed
Activations and losses are chosen. Next: how to make the gradient descent step itself smarter.
You now know what a neuron computes (activation functions) and what the network minimises (loss functions). Module 44 covers the final missing piece of the training loop: optimisers. SGD takes the same step size for every weight. Adam adapts the step size per weight based on gradient history. AdamW adds proper weight decay. Momentum accumulates direction. The right optimiser makes training 5–10× faster and more stable.
Momentum, adaptive learning rates, weight decay done right. Why AdamW replaced Adam as the default and when SGD still wins.
🎯 Key Takeaways
- ✓Use ReLU as the default hidden layer activation — fast, sparse, no vanishing gradient for positive inputs. Switch to LeakyReLU if dying neurons are a problem. Use GELU for transformers and modern architectures — it is smooth everywhere and increasingly the default.
- ✓Sigmoid and tanh belong only in specific places: sigmoid at the output layer for binary classification, tanh inside RNNs and LSTMs. Never use sigmoid in hidden layers of deep networks — its maximum derivative of 0.25 causes vanishing gradients.
- ✓Match the loss function to the task exactly: BCEWithLogitsLoss for binary classification, CrossEntropyLoss for multi-class, MSELoss or L1Loss for regression, HuberLoss for regression with outliers. Using MSELoss for classification causes the network to predict class frequencies, not probabilities.
- ✓Never apply sigmoid before BCEWithLogitsLoss or softmax before CrossEntropyLoss. Both losses apply the stable version internally using the log-sum-exp trick. Adding the activation first causes numerical overflow for large logits and produces NaN losses.
- ✓At inference time after CrossEntropyLoss training: apply torch.softmax(logits, dim=1) to get probabilities. After BCEWithLogitsLoss training: apply torch.sigmoid(logits) to get probabilities. During training, pass raw logits to the loss function — never activated outputs.
- ✓For imbalanced classification, use the weight parameter in CrossEntropyLoss (minority class weight = n_majority/n_minority) or pos_weight in BCEWithLogitsLoss. Focal loss is stronger but requires an external library — start with weighted cross-entropy first.
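The inference-time step in the takeaways can be sketched in numpy (the `softmax` here is a stand-in for `torch.softmax(logits, dim=1)`; the max-shift is the same stabilisation trick the real implementation uses):

```python
import numpy as np

def softmax(logits, axis=-1):
    # Shift by the row max so exp() never overflows, then normalise.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Training: raw logits go straight into the loss function.
logits = np.array([[2.0, 0.5, -1.0]])   # one sample, three classes

# Inference: convert logits to probabilities explicitly.
probs = softmax(logits, axis=1)
assert np.allclose(probs.sum(axis=1), 1.0)   # rows are valid distributions
assert probs.argmax() == 0                   # largest logit wins
```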