Feature Scaling — Standardisation and Normalisation
Why scale matters, what StandardScaler and MinMaxScaler actually do under the hood, which algorithms break without scaling, and when to use each scaler.
Your model thinks ₹500 and 5km are the same magnitude. They are not.
A Swiggy delivery prediction model has two features: distance in kilometres (range 0.5–15) and order value in rupees (range 50–1200). To gradient descent, ₹1200 looks 80 times more important than 15km simply because the number is bigger — not because it actually is. The optimiser takes tiny steps in the distance direction and massive steps in the order-value direction, oscillating and converging slowly or not at all.
This is the scaling problem. It is not a subtle edge case. For gradient-based algorithms (linear regression, logistic regression, SVMs, neural networks, K-means) unscaled features produce models that are slower to train, less accurate, and sensitive to which units you happened to measure in. A model trained on distances in kilometres gives different results than one trained on the same distances in metres — even though the data contains identical information.
Feature scaling solves this by transforming all features to a common scale before training. This module shows you exactly what each scaler does mathematically, which algorithms need it, and how to apply it correctly inside a sklearn Pipeline without leaking test information.
What unscaled features do to gradient descent
Imagine the loss surface as a landscape with hills and valleys. With well-scaled features the loss surface looks like a round bowl — gradient descent rolls straight down to the minimum from any starting point. With badly scaled features the surface becomes an elongated narrow valley — gradient descent bounces left and right off the steep walls while crawling slowly toward the minimum. The same distance to the minimum, but zigzagging makes the journey 10× or 100× longer.
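The zigzag effect is easy to measure. The sketch below runs plain batch gradient descent on a least-squares fit, once on raw features and once on standardised ones. The data, learning rates, and the `gd_steps` helper are all hypothetical, chosen to mirror the delivery example above: with the raw rupee column, the learning rate must be tiny to avoid divergence, so the optimiser crawls; with standardised features, a large rate is stable and it converges quickly.

```python
import numpy as np

# Toy delivery data on the two scales from the example above (hypothetical)
rng = np.random.default_rng(0)
dist = rng.uniform(0.5, 15, 200)              # km
value = rng.uniform(50, 1200, 200)            # rupees
y = 5 + 2 * dist + 0.01 * value + rng.normal(0, 1, 200)

def gd_steps(X, y, lr, tol=1e-8, max_iter=50_000):
    """Batch gradient descent on MSE; returns steps until weight updates stall."""
    w = np.zeros(X.shape[1])
    for i in range(max_iter):
        grad = (2 / len(y)) * X.T @ (X @ w - y)
        step = lr * grad
        if np.max(np.abs(step)) < tol:
            return i
        w -= step
    return max_iter

X_raw = np.column_stack([np.ones(200), dist, value])
X_std = X_raw.copy()
X_std[:, 1:] = (X_raw[:, 1:] - X_raw[:, 1:].mean(0)) / X_raw[:, 1:].std(0)

# Unscaled: the rupee column forces a tiny lr, so progress is glacial
steps_raw = gd_steps(X_raw, y, lr=5e-7)
# Scaled: the loss bowl is round, a large lr is stable, convergence is fast
steps_std = gd_steps(X_std, y, lr=0.1)
```

On this data the scaled run stalls in a few hundred steps while the raw run exhausts its iteration budget, which is the valley-versus-bowl picture in numbers.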
StandardScaler — zero mean, unit variance
StandardScaler transforms each feature so it has mean 0 and standard deviation 1. Every value is expressed as "how many standard deviations from the mean is this?" A distance of 6km in a dataset with mean 4km and std 2km becomes (6 − 4) / 2 = 1.0 — one standard deviation above average. A distance of 2km becomes −1.0.
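The arithmetic above can be checked directly. This is a minimal sketch with a four-row hypothetical column chosen so the mean is 4 km and the std is exactly 2 km; `fit_transform` learns `mean_` and `scale_`, then applies (x − mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four hypothetical distances chosen so the column has mean 4 km, std 2 km
X = np.array([[2.0], [2.0], [6.0], [6.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learns mean_ and scale_, applies (x - mean) / std

print(scaler.mean_)      # [4.]
print(scaler.scale_)     # [2.]
print(X_scaled.ravel())  # [-1. -1.  1.  1.]
```

Note that StandardScaler uses the population standard deviation (ddof=0), which is why the std of [2, 2, 6, 6] comes out as exactly 2.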
When StandardScaler is the right choice
StandardScaler preserves the shape of the distribution — if a feature is roughly Gaussian to begin with, the scaled values are still Gaussian, just centred at 0 with std 1.
Linear/logistic regression, SVMs, neural networks, K-means, PCA — all assume features are on comparable scales.
Unlike RobustScaler, StandardScaler is affected by outliers. Use this when extreme values carry meaningful signal.
MinMaxScaler — compress every feature to [0, 1]
MinMaxScaler shifts and scales each feature so the minimum becomes 0 and the maximum becomes 1. All training values end up in [0, 1]. The shape of the distribution is preserved — the relative distances between values stay the same, just rescaled to fit the [0, 1] window. One caveat: a test-set value outside the training min–max range will map outside [0, 1].
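A minimal sketch, using three hypothetical order values spanning the 50–1200 rupee range from earlier; the midpoint of the range lands exactly at 0.5.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical order values (rupees) spanning the 50-1200 range
X = np.array([[50.0], [625.0], [1200.0]])

X_scaled = MinMaxScaler().fit_transform(X)  # (x - min) / (max - min)
print(X_scaled.ravel())  # [0.  0.5 1. ]
```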
RobustScaler — scale using median and IQR, not mean and std
StandardScaler uses the mean and standard deviation. Both are sensitive to outliers — one extreme value can shift the mean dramatically and inflate the standard deviation, causing all other values to be squashed into a tiny range after scaling. RobustScaler uses the median (Q2) and interquartile range (IQR = Q3 − Q1) instead. These are resistant to outliers by construction: no matter how extreme one value is, the median and IQR barely change.
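The squashing effect is visible on a small hypothetical sample: nine ordinary distances plus one extreme value. After StandardScaler the nine normal values are compressed into a narrow band; after RobustScaler they keep a usable spread.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Nine ordinary distances plus one extreme outlier (hypothetical)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [200]], dtype=float)

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# StandardScaler: the outlier inflates the std, squashing the nine normal values
spread_std = std[:9].max() - std[:9].min()
# RobustScaler: median and IQR barely move, so the normal values keep their spread
spread_rob = rob[:9].max() - rob[:9].min()
```

On this sample the post-scaling spread of the nine ordinary values is roughly ten times larger under RobustScaler than under StandardScaler.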
MaxAbsScaler and Normalizer — the two special-purpose scalers
Two more scalers cover specific situations that StandardScaler, MinMaxScaler, and RobustScaler don't handle well.
MaxAbsScaler — for sparse data
MaxAbsScaler divides each feature by its maximum absolute value, producing values in [−1, 1]. Crucially, it does not centre the data (no mean subtraction). This preserves sparsity — if a feature was 0, it stays 0. StandardScaler would subtract the mean and create non-zero values where there were zeros, destroying the sparsity that makes sparse matrix operations fast. Use MaxAbsScaler for TF-IDF vectors, one-hot encoded matrices, and any sparse input.
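Sparsity preservation can be verified by counting stored values before and after. A tiny hypothetical sparse matrix stands in for real TF-IDF output here.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A tiny sparse matrix standing in for TF-IDF output (hypothetical values)
X = csr_matrix(np.array([[0.0, 4.0],
                         [2.0, 0.0],
                         [0.0, 8.0]]))

X_scaled = MaxAbsScaler().fit_transform(X)  # divide each column by its max |x|

# Zeros stay zeros, so the sparse structure is untouched
print(X_scaled.nnz)  # 3 stored values, same as the input
```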
Normalizer — scale rows, not columns
Every scaler so far operates on columns — each feature is scaled independently. Normalizer is different: it scales each sample (row) so its length equals 1. This is used when the direction of a feature vector matters more than its magnitude — text classification with TF-IDF, recommendation systems, cosine similarity computations.
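A minimal sketch with two hypothetical rows: each row is divided by its own L2 length, so every sample ends up on the unit circle while keeping its direction.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical samples; Normalizer rescales each ROW to unit L2 length
X = np.array([[3.0, 4.0],
              [10.0, 0.0]])

X_norm = Normalizer(norm='l2').fit_transform(X)
print(X_norm)  # [[0.6 0.8] [1.  0. ]]
```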
Which algorithms need scaling — and which genuinely don't
Not every algorithm is sensitive to feature scale. Tree-based models split on threshold values — the scale of a feature does not change whether splitting at 3.5km vs 4.2km produces purer leaf nodes. But every algorithm that computes distances, dot products, or gradients is directly affected by scale. Knowing which is which prevents wasted preprocessing and wrong assumptions.
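The scale-invariance of trees can be demonstrated directly. In this sketch (all data hypothetical) a decision tree is trained on raw features and on standardised features; because splits are thresholds, rescaling moves the threshold value but produces the same partition and the same predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Two features on wildly different scales (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 1000.0 > 0).astype(int)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
X_std = StandardScaler().fit_transform(X)
tree_std = DecisionTreeClassifier(random_state=0).fit(X_std, y)

# Splits are thresholds; rescaling moves the threshold, not the partition
same = (tree_raw.predict(X) == tree_std.predict(X_std)).mean()
```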
Scalers inside a Pipeline — the only safe way
The most common scaling mistake is fitting the scaler on the entire dataset before the train/test split. This leaks test statistics — the test set's mean and standard deviation influence the scaler, which in turn influences what the model sees during training. Evaluation metrics look slightly better than they should, and the model is technically trained on information from the test set.
A sklearn Pipeline prevents this by construction. It fits the scaler only when pipe.fit(X_train) is called, and applies the stored statistics (never refitting) when pipe.predict(X_test) is called. As long as the split happens before fitting, there is no way to leak test statistics by accident.
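A sketch of the pattern, on hypothetical delivery-style data. The scaler's learned statistics come from X_train alone, and the same pipeline object can be handed to cross_val_score, which refits the scaler inside each fold.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Hypothetical delivery features: distance (km) and order value (rupees)
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(0.5, 15, 300), rng.uniform(50, 1200, 300)])
y = (X[:, 0] > 7.0).astype(int)  # toy target: "long delivery"

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)        # scaler statistics come from X_train only
acc = pipe.score(X_test, y_test)  # X_test is transformed, never fitted

# In cross-validation the scaler is refit inside each fold automatically
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
```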
Should you scale the target variable?
You almost never need to scale y for linear regression or tree models. The model adjusts its bias term to match the scale of y automatically. But for neural networks — especially deep ones — a target with a large range (like delivery times 10–120 minutes) can cause unstable training because the output layer needs large weights to produce large numbers. Scaling y to zero mean and unit variance stabilises training.
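When you do scale y, sklearn's TransformedTargetRegressor wires up the bookkeeping: it scales the target during fit and inverse-transforms predictions back to original units. The sketch below uses LinearRegression (where scaling y is unnecessary, as noted above) purely to keep the example runnable; the data is hypothetical delivery times in minutes, and the same wrapper works around any regressor, including a neural network.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical delivery times (minutes) driven mostly by distance (km)
rng = np.random.default_rng(1)
X = rng.uniform(0.5, 15, (100, 1))
y = 10 + 7 * X[:, 0] + rng.normal(0, 2, 100)

# Scales y during fit and inverse-transforms predictions automatically
model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   transformer=StandardScaler())
model.fit(X, y)
pred = model.predict(X)  # back in minutes, not z-scores
```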
Which scaler to use — decision guide
Every common scaling error — explained and fixed
Scaling is now a reflex. Every algorithm you build from here uses it correctly.
StandardScaler inside a Pipeline, fit on training data only. This is the pattern you will repeat in every module from here. It takes three lines and prevents a class of subtle bugs that trip up even experienced practitioners.
Module 18 builds your first complete ML model from scratch: linear regression. You'll see how the scaled features from this module feed directly into the gradient descent update from Module 05, and how regularisation (Ridge and Lasso) prevents overfitting — with the coefficients directly interpretable as feature importance.
OLS, gradient descent, Ridge, Lasso, ElasticNet — and how to diagnose every failure mode on real delivery data.
🎯 Key Takeaways
- ✓ Unscaled features distort gradient descent — features with large numerical ranges dominate weight updates. StandardScaler brings all features to mean=0, std=1, making gradient steps comparable in every direction.
- ✓ StandardScaler: x_scaled = (x − μ) / σ. The default for linear models, logistic regression, SVMs, K-means, PCA, and neural networks — but remember that outliers inflate μ and σ.
- ✓ MinMaxScaler: x_scaled = (x − min) / (max − min). Produces values in [0, 1]. Use when you need bounded output — neural network activations, image pixels.
- ✓ RobustScaler: x_scaled = (x − median) / IQR. Outliers barely affect its scaling statistics. Use when your data has significant outliers that should not distort the scale of the majority.
- ✓ MaxAbsScaler divides by the maximum absolute value — no mean subtraction. Use for sparse data (TF-IDF, one-hot) where zeros must stay zero. Normalizer scales each row (sample), not each column (feature) — use for cosine similarity.
- ✓ Tree-based algorithms (Decision Tree, Random Forest, XGBoost, LightGBM) do not need feature scaling — splits are threshold-based and scale-invariant. Scaling has no effect on their performance.
- ✓ The only safe pattern: fit the scaler on X_train only, then transform both X_train and X_test. Use a sklearn Pipeline to enforce this automatically in cross-validation. Never fit on the full dataset before splitting.