Regression Metrics — MAE, RMSE, R²
When your output is a number, not a class. MAE, RMSE, MAPE, R², and which metric to choose based on how you want to treat large errors.
A classification model is either right or wrong. A regression model is never exactly right — the question is how wrong, and in what direction does wrong hurt more?
Swiggy predicts delivery time as 32 minutes. The actual time is 41 minutes. The model was wrong by 9 minutes. Is that acceptable? That depends on what Swiggy promised the customer. If the app said "arrives in 32 minutes" and it took 41, the customer is angry. The cost of underestimating is higher than the cost of overestimating.
Now imagine one prediction was wrong by 9 minutes and another was wrong by 45 minutes. Are those two errors equally bad? For Swiggy, 45 minutes late might trigger a refund, damage the restaurant's rating, and lose the customer permanently. That one large error is catastrophically worse than five 9-minute errors. The metric you choose determines whether your model optimises to minimise all errors equally or to specifically avoid large ones.
This is the core decision in regression evaluation: how do you want to penalise large errors? MAE treats all errors proportionally. RMSE squares the errors — large errors get penalised much more heavily. MAPE expresses error as a percentage — useful when the scale of the target varies. R² tells you how much better the model is than a naive baseline.
A cricket commentator says "India needs 12 runs per over." The team scores 10, 11, 13, 9, 12, 8 — never exactly 12. MAE asks: how far off was each over on average? Answer: about 1.8 runs. RMSE asks the same but doubles down on the 8-run over (4 under): squared, that single 4-run miss costs 16, more than two 2-run misses combined (4 + 4 = 8). MAPE asks: what percentage of the target was each miss?
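The over-by-over arithmetic can be checked in a few lines of NumPy, using the scores and the 12-run target from the example above (here each miss is measured as a percentage of the 12-run target, matching the commentator's framing):

```python
import numpy as np

target = 12
scores = np.array([10, 11, 13, 9, 12, 8])
errors = np.abs(scores - target)           # [2, 1, 1, 3, 0, 4]

mae = errors.mean()                        # (2+1+1+3+0+4)/6 ≈ 1.83 runs
rmse = np.sqrt((errors ** 2).mean())       # sqrt((4+1+1+9+0+16)/6) ≈ 2.27 runs
mape = (errors / target).mean() * 100      # ≈ 15.3% of the 12-run target

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.1f}%")
```

Notice how RMSE (≈2.27) sits above MAE (≈1.83): the squaring inside RMSE lets the single 4-run miss pull the average up.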
Choose MAE when all errors cost equally — late by 5 minutes is 5× worse than late by 1 minute, nothing more. Choose RMSE when catastrophic errors cost disproportionately — one 45-minute delay is far worse than nine 5-minute delays.
Four metrics — formulas, intuitions, and when each is right
**MAE — Mean Absolute Error: (1/n) Σ |yᵢ − ŷᵢ|**
Units: Same units as target
Interpret: "On average the model is off by X minutes."
Penalises: All errors proportionally. A 10-min error is 2× worse than a 5-min error.
Use when: All error magnitudes cost equally. Easy to explain to stakeholders.
Avoid when: Large errors are disproportionately costly.

**RMSE — Root Mean Squared Error: √((1/n) Σ (yᵢ − ŷᵢ)²)**
Units: Same units as target
Interpret: "Typical error magnitude, with large errors weighted more heavily."
Penalises: Large errors quadratically. A 10-min error is 4× worse than a 5-min error.
Use when: Catastrophic errors must be avoided. Standard in competitions.
Avoid when: Outliers are present and acceptable — RMSE will be dominated by them.

**MAPE — Mean Absolute Percentage Error: (100/n) Σ |yᵢ − ŷᵢ| / |yᵢ|**
Units: Percentage — scale-independent
Interpret: "On average the model is off by X% of the actual value."
Penalises: Relative errors. Being off by 5 on a target of 10 is worse than being off by 5 on a target of 100.
Use when: Comparing models across targets of different scales. Demand forecasting.
Avoid when: True values are zero or near-zero — MAPE explodes. Not symmetric.

**R² — Coefficient of Determination: 1 − SS_res / SS_tot**
Units: Dimensionless (at most 1; can be negative)
Interpret: "The model explains X% of the variance in the target."
Penalises: Error relative to the baseline of predicting the mean.
Use when: Quick sanity check. Comparing models on the same dataset. R² = 0.87 = 87% variance explained.
Avoid when: Comparing across datasets with different target variance. Can be misleading.
R² — what it measures, why it can go negative, and when it misleads
R² measures how much better your model is than the simplest possible baseline: always predicting the mean. If someone asked you to predict Swiggy delivery times with no model at all, your best guess would be the historical mean — about 36 minutes for everything. R² = 0 means your model is exactly as good as that naive guess. R² = 0.87 means your model explains 87% of the variance that the mean baseline cannot explain. R² = 1 is a perfect model.
R² can go below zero. This happens when your model is worse than just predicting the mean — its predictions are so bad they increase the total squared error beyond what a constant prediction would give. A negative R² is a signal that something is severely wrong: wrong features, data leakage in reverse, or a completely broken pipeline.
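Negative R² is easy to demonstrate: a model whose predictions run against the true trend racks up more squared error than the constant-mean baseline (toy numbers, for illustration only):

```python
import numpy as np

y_true = np.array([10.0, 20, 30, 40, 50])
y_bad = np.array([50.0, 40, 30, 20, 10])   # predicts the trend backwards

ss_res = ((y_true - y_bad) ** 2).sum()               # 4000
ss_tot = ((y_true - y_true.mean()) ** 2).sum()       # 1000 (mean baseline)
r2 = 1 - ss_res / ss_tot

print(r2)   # -3.0 — four times the baseline's squared error
```

The anti-correlated model accumulates 4× the baseline's squared error, so R² = 1 − 4 = −3. Anything below zero means you would have been better off shipping a constant.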
Which metric to use — a decision framework
The right metric is determined by the business cost structure of your errors, not by convention. Before picking a metric, answer two questions: are large errors disproportionately costly? And does the scale of the target vary across predictions?
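The two questions can be encoded as a tiny decision helper — an illustrative sketch, not a library function; the parameter names here are made up:

```python
def pick_metric(large_errors_disproportionate: bool,
                target_scale_varies: bool,
                target_can_be_zero: bool = False) -> str:
    """Map the two business-cost questions above to a primary metric."""
    if large_errors_disproportionate:
        return "RMSE"   # squaring punishes catastrophic errors
    if target_scale_varies and not target_can_be_zero:
        return "MAPE"   # percentage error is scale-independent
    return "MAE"        # all errors cost proportionally

print(pick_metric(large_errors_disproportionate=True, target_scale_varies=False))
```

The `target_can_be_zero` guard encodes the MAPE caveat from the table above: near-zero true values make percentage error explode.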
Residual analysis — where is the model systematically wrong?
A single MAE number hides a lot. A model with MAE = 4.2 minutes might be consistently accurate for short deliveries but systematically wrong for long-distance orders. The aggregate metric looks fine while a whole segment of customers is getting bad predictions. Residual analysis reveals these systematic patterns.
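A minimal residual check along these lines, grouping errors by a segment such as delivery distance (all residuals and segment labels below are invented for illustration):

```python
import numpy as np

# invented data: residual = actual minus predicted delivery time, in minutes,
# tagged by distance band
residuals = np.array([1.2, -0.8, 0.5, -1.1, 9.3, 7.8, 8.6, -0.3])
segment = np.array(["short", "short", "short", "short",
                    "long", "long", "long", "short"])

print(f"overall mean residual: {residuals.mean():+.2f} min")
for band in ["short", "long"]:
    r = residuals[segment == band]
    print(f"{band:>5}: mean {r.mean():+.2f} min, MAE {np.abs(r).mean():.2f} min")
```

Here the short-distance band is nearly unbiased (mean residual ≈ −0.1 min) while the long-distance band is systematically under-predicted by ≈ 8.6 minutes — exactly the pattern a single aggregate MAE smooths over.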
Every common regression metric mistake — explained and fixed
The Evaluation section is complete. Section 7 — Deep Learning — begins next.
You have now completed every module in the Model Evaluation section: classification metrics, calibration, ROC curves, cross-validation, hyperparameter tuning, model interpretability, and regression metrics. You can honestly evaluate any model — classifier or regressor — and communicate its performance to any audience.
Section 7 — Deep Learning — begins with Module 41. Everything changes: instead of hand-crafted features, the model learns its own representations from raw data. Module 41 builds a neural network from scratch in NumPy — forward pass, backpropagation, gradient descent — before introducing PyTorch.
Forward pass, backpropagation, and gradient descent built in NumPy before touching PyTorch. The foundation every deep learning framework is built on.
🎯 Key Takeaways
- ✓ MAE treats all errors proportionally — a 10-minute error is exactly 2× worse than a 5-minute error. RMSE squares the errors first — a 10-minute error is 4× worse than a 5-minute error. Choose based on whether large errors in your domain are disproportionately costly.
- ✓ MAPE expresses error as a percentage of the actual value — scale-independent and useful when targets span different magnitudes. Never use MAPE when true values can be zero — division by zero makes it undefined.
- ✓ R² measures how much better the model is than predicting the mean. R²=0.87 means 87% of variance explained. R²=0 means no better than the mean. Negative R² means worse than the mean — a signal of a severely broken pipeline.
- ✓ Always compare your model against a naive baseline before reporting any metric. If the baseline (always predict mean) has MAE=12.4 and your model has MAE=11.9, the improvement is marginal despite the metric looking reasonable in isolation.
- ✓ The RMSE/MAE ratio reveals the outlier situation. Ratio near 1.0 means errors are uniform. Ratio above 2.0 means a few very large errors are dominating RMSE. Always inspect the error distribution — report percentile errors (50th, 90th, 95th) alongside summary metrics.
- ✓ Residual analysis exposes systematic bias that aggregate metrics hide. Always check: is the mean residual near zero (no bias)? Does error vary by prediction range or input feature? Are the largest errors concentrated in a specific segment? A model with good overall MAE can be systematically wrong for a specific customer group.