
Regression Metrics — MAE, RMSE, R²

When your output is a number not a class. MAE, RMSE, MAPE, R², and which metric to choose based on how you want to treat large errors.

25–30 min · March 2026
Before any formula — what makes regression evaluation different?

A classification model is either right or wrong. A regression model is never exactly right — the question is how wrong, and in what direction does wrong hurt more?

Swiggy predicts delivery time as 32 minutes. The actual time is 41 minutes. The model was wrong by 9 minutes. Is that acceptable? That depends on what Swiggy promised the customer. If the app said "arrives in 32 minutes" and it took 41, the customer is angry. The cost of underestimating is higher than the cost of overestimating.

Now imagine one prediction was wrong by 9 minutes and another was wrong by 45 minutes. Are those two errors equally bad? For Swiggy, 45 minutes late might trigger a refund, damage the restaurant's rating, and lose the customer permanently. That one large error is catastrophically worse than five 9-minute errors. The metric you choose determines whether your model optimises to minimise all errors equally or to specifically avoid large ones.

This is the core decision in regression evaluation: how do you want to penalise large errors? MAE treats all errors proportionally. RMSE squares the errors — large errors get penalised much more heavily. MAPE expresses error as a percentage — useful when the scale of the target varies. R² tells you how much better the model is than a naive baseline.

🧠 Analogy — read this first

A cricket commentator says "India needs 12 runs per over." The team scores 10, 11, 13, 9, 12, 8 — never exactly 12. MAE asks: how far off was each over on average? Answer: about 1.8 runs. RMSE asks the same but doubles down on the 8-run over (4 under) — squared, that miss contributes 16, more than two 2-run misses (4 each) combined. MAPE asks: what percentage of the target was each miss?

Choose MAE when all errors cost equally — late by 5 minutes is 5× worse than late by 1 minute, nothing more. Choose RMSE when catastrophic errors cost disproportionately — one 45-minute delay is far worse than nine 5-minute delays.
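The analogy's numbers check out in a few lines (same over scores and 12-run target as above):

```python
import numpy as np

target = 12.0
scores = np.array([10, 11, 13, 9, 12, 8], dtype=float)

errors = np.abs(scores - target)        # per-over misses: 2, 1, 1, 3, 0, 4
mae    = errors.mean()                  # ≈ 1.83 runs per over
rmse   = np.sqrt((errors ** 2).mean())  # ≈ 2.27 — pulled up by the 4-run miss

print(f"MAE:  {mae:.2f} runs")
print(f"RMSE: {rmse:.2f} runs")
```

RMSE sits above MAE precisely because of that one large miss — the gap between the two numbers is itself a signal, as the rest of this module shows.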

The complete metric toolkit

Four metrics — formulas, intuitions, and when each is right

Regression metrics — same predictions, different perspectives
Prediction   Actual   Error   |Error|   Error²   |Error|/Actual
        32       41      -9         9       81            22.0%
        45       42       3         3        9             7.1%
        28       29      -1         1        1             3.4%
        52       97     -45        45     2025            46.4%
        38       35       3         3        9             8.6%
Mean                          12.2→MAE  425→MSE       17.5%→MAPE

The 45-minute error (row 4) contributes 2025 to MSE — 25× more than the 9-minute error (81). In MAE it contributes 45 — only 5× more. RMSE = √MSE = √425 ≈ 20.6 min. The one outlier dramatically inflates RMSE while MAE stays at 12.2.
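You can reproduce the table's sensitivity to that single outlier directly — same five predictions, with and without row 4:

```python
import numpy as np

y_true = np.array([41, 42, 29, 97, 35], dtype=float)
y_pred = np.array([32, 45, 28, 52, 38], dtype=float)

def mae_rmse(y, yhat):
    err = y - yhat
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())

mae_all, rmse_all = mae_rmse(y_true, y_pred)
mae_no4, rmse_no4 = mae_rmse(np.delete(y_true, 3),   # drop the 45-min outlier
                             np.delete(y_pred, 3))

print(f"With outlier:    MAE={mae_all:.1f}  RMSE={rmse_all:.1f}")   # 12.2, 20.6
print(f"Without outlier: MAE={mae_no4:.1f}  RMSE={rmse_no4:.1f}")   # 4.0, 5.0
```

Dropping one row cuts RMSE by a factor of four — exactly the quadratic penalty at work.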

MAE — Mean Absolute Error — mean(|y − ŷ|)

Units: Same units as target

Interpret: "On average the model is off by X minutes."

Penalises: All errors proportionally. A 10-min error is 2× worse than a 5-min error.

Use when: When all error magnitudes cost equally. Easy to explain to stakeholders.

Avoid when: When large errors are disproportionately costly.

RMSE — Root Mean Squared Error — √mean((y − ŷ)²)

Units: Same units as target

Interpret: "Typical error magnitude, with large errors weighted more heavily."

Penalises: Large errors quadratically. A 10-min error is 4× worse than a 5-min error.

Use when: When catastrophic errors must be avoided. Standard in competitions.

Avoid when: When outliers are present and acceptable — RMSE will be dominated by them.

MAPE — Mean Absolute Percentage Error — mean(|y − ŷ| / |y|) × 100

Units: Percentage — scale-independent

Interpret: "On average the model is off by X% of the actual value."

Penalises: Relative errors. Being off by 5 on a target of 10 is worse than off by 5 on a target of 100.

Use when: Comparing models across targets of different scales. Demand forecasting.

Avoid when: When true values are zero or near-zero — MAPE explodes. Not symmetric.
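A tiny sketch with made-up numbers shows how one near-zero target wrecks MAPE:

```python
import numpy as np

# Three predictions, each off by exactly 5 (illustrative numbers)
y_true = np.array([100.0, 80.0, 0.1])
y_pred = np.array([ 95.0, 85.0, 5.1])

ape = np.abs(y_true - y_pred) / np.abs(y_true) * 100
print(ape)          # [5.0, 6.25, 5000.0] — the 0.1 target explodes
print(ape.mean())   # ≈ 1670% — meaningless as a summary
```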

R² — Coefficient of Determination — 1 − SS_res / SS_tot

Units: Dimensionless (at most 1; typically 0 to 1, can go negative)

Interpret: "The model explains X% of the variance in the target."

Penalises: Relative to the baseline of predicting the mean.

Use when: Quick sanity check. Comparing models on same dataset. R²=0.87 = 87% variance explained.

Avoid when: Comparing across datasets with different target variance. Can be misleading.
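A quick synthetic illustration of why R² does not transfer across datasets: identical absolute errors score very differently once the target variance changes.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

rng   = np.random.default_rng(42)
noise = rng.normal(0, 3, 1000)          # the same errors for both datasets

y_wide   = rng.uniform(10, 120, 1000)   # high-variance targets
y_narrow = rng.uniform(30, 40, 1000)    # low-variance targets

for name, y in [('wide-variance', y_wide), ('narrow-variance', y_narrow)]:
    pred = y + noise
    print(f"{name:>16}: MAE={mean_absolute_error(y, pred):.2f}"
          f"  R²={r2_score(y, pred):.3f}")
```

Same MAE in both cases, yet R² is near 1 on the wide dataset and close to zero (or negative) on the narrow one — because R² is measured relative to the target's own variance.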

python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                              mean_squared_error,
                              mean_absolute_percentage_error,
                              r2_score)

# ── Compute all four metrics from scratch ─────────────────────────────
y_true = np.array([41, 42, 29, 97, 35], dtype=float)
y_pred = np.array([32, 45, 28, 52, 38], dtype=float)

errors     = y_true - y_pred
abs_errors = np.abs(errors)

# MAE
mae_manual = abs_errors.mean()
mae_sklearn = mean_absolute_error(y_true, y_pred)

# MSE and RMSE
mse_manual  = (errors ** 2).mean()
rmse_manual = np.sqrt(mse_manual)
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# MAPE
mape_manual  = (abs_errors / np.abs(y_true)).mean() * 100
mape_sklearn = mean_absolute_percentage_error(y_true, y_pred) * 100

# R²
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2_manual  = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print("Manual vs sklearn verification:")
print(f"  MAE:   {mae_manual:.4f}  sklearn: {mae_sklearn:.4f}  match: {np.isclose(mae_manual, mae_sklearn)}")
print(f"  RMSE:  {rmse_manual:.4f}  sklearn: {rmse_sklearn:.4f}  match: {np.isclose(rmse_manual, rmse_sklearn)}")
print(f"  MAPE:  {mape_manual:.4f}%  sklearn: {mape_sklearn:.4f}%")
print(f"  R²:    {r2_manual:.4f}  sklearn: {r2_sklearn:.4f}  match: {np.isclose(r2_manual, r2_sklearn)}")

print("\nPer-sample contribution to each metric:")
print(f"{'Error':>8} {'|error|':>8} {'error²':>8} {'% error':>8}")
print("─" * 38)
for e, ae, y in zip(errors, abs_errors, y_true):
    pct = abs(e) / abs(y) * 100
    print(f"  {e:>6.0f}  {ae:>8.0f}  {ae**2:>8.0f}  {pct:>7.1f}%")
print(f"  {'Mean':>6}  {mae_manual:>8.1f}  {mse_manual:>8.0f}  {mape_manual:>7.1f}%")
print(f"  {'':>6}  {'↑MAE':>8}  {'√→RMSE':>8}")
Understanding R²

R² — what it measures, why it can go negative, and when it misleads

R² measures how much better your model is than the simplest possible baseline: always predicting the mean. If someone asked you to predict Swiggy delivery times with no model at all, your best guess would be the historical mean — about 36 minutes for everything. R² = 0 means your model is exactly as good as that naive guess. R² = 0.87 means your model explains 87% of the variance that the mean baseline cannot explain. R² = 1 is a perfect model.

R² can go below zero. This happens when your model is worse than just predicting the mean — its predictions are so bad they increase the total squared error beyond what a constant prediction would give. A negative R² is a signal that something is severely wrong: wrong features, data leakage in reverse, or a completely broken pipeline.
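A small hand-built example makes both boundary cases concrete:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([30.0, 35.0, 40.0, 45.0, 50.0])   # mean = 40

# Baseline: predict the mean for everything → R² = 0 by definition
print(r2_score(y_true, np.full(5, 40.0)))

# A broken model: predictions in the wrong range entirely → R² < 0
print(r2_score(y_true, np.array([80.0, 85.0, 90.0, 95.0, 100.0])))
```

The second model is off by 50 everywhere, so SS_res vastly exceeds SS_tot and R² goes deeply negative.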

R² decomposition — what SS_res and SS_tot represent
R² = 1 − SS_res / SS_tot
SS_tot = Σ(yᵢ − ȳ)² ← total variance in the data (baseline error)
SS_res = Σ(yᵢ − ŷᵢ)² ← residual variance after model (model error)
R² = 1: SS_res=0 → perfect predictions
R² = 0: SS_res=SS_tot → no better than the mean
< 0: SS_res > SS_tot → worse than the mean
python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 2000
distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
            + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep])
y = delivery
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ── R² for different models ───────────────────────────────────────────
models = {
    'Always predict mean (baseline)': None,
    'Linear Regression':              LinearRegression(),
    'Gradient Boosting':              GradientBoostingRegressor(
                                          n_estimators=200, random_state=42),
    'Shuffled labels (broken)':       'shuffled',
}

print(f"R² comparison — Swiggy delivery time:")
print(f"{'Model':<35} {'R²':>8} {'MAE':>8} {'RMSE':>8}")
print("─" * 64)

for name, model in models.items():
    if model is None:
        y_pred = np.full_like(y_te, y_tr.mean())
    elif model == 'shuffled':
        y_pred = np.random.permutation(y_te)   # shuffled true values — no relation to inputs
    else:
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)

    r2   = r2_score(y_te, y_pred)
    mae  = np.mean(np.abs(y_te - y_pred))
    rmse = np.sqrt(np.mean((y_te - y_pred)**2))
    if model is None:
        flag = '← always predict mean'
    elif r2 < 0:
        flag = '← worse than baseline!'
    else:
        flag = ''
    print(f"  {name:<33}  {r2:>8.4f}  {mae:>8.4f}  {rmse:>8.4f}  {flag}")

# ── Adjusted R² — penalises adding useless features ───────────────────
# On the TRAINING data, standard R² never decreases when you add
# features — even pure noise. Adjusted R² corrects for this by
# penalising each extra feature.
def adjusted_r2(r2, n, p):
    """r2: R², n: samples, p: number of features"""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

for n_noise in [0, 5, 10, 20]:
    X_noise = np.hstack([X_tr, np.random.randn(len(X_tr), n_noise)])
    lr = LinearRegression().fit(X_noise, y_tr)
    r2_train = r2_score(y_tr, lr.predict(X_noise))
    adj_r2   = adjusted_r2(r2_train, len(X_noise), X_noise.shape[1])
    n_feat   = 3 + n_noise
    print(f"  {n_feat} features ({n_noise} noise): train R²={r2_train:.4f}"
          f"  Adj R²={adj_r2:.4f}"
          f"  {'← noise inflated train R²' if n_noise > 0 else ''}")
Making the decision

Which metric to use — a decision framework

The right metric is determined by the business cost structure of your errors, not by convention. Before picking a metric, answer two questions: are large errors disproportionately costly? And does the scale of the target vary across predictions?

Metric selection decision tree
Q1: Are large errors disproportionately costly?
Yes → Use RMSE — it penalises large errors quadratically
No → Use MAE — it treats all errors proportionally

Example: Swiggy: one 45-min delay triggers a refund (costly) → RMSE. Stock price: all errors equally bad → MAE.

Q2: Do targets vary in scale across predictions?
Yes → Use MAPE — percentage error is scale-independent
No → Use MAE or RMSE — absolute errors are comparable

Example: Demand forecasting: selling 1000 units vs 10 units — 5-unit error means different things. Use MAPE.
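The scale argument from Q2 in plain arithmetic — the same 5-unit error at two demand levels:

```python
# A 5-unit error means very different things at different scales
for actual in [1000.0, 10.0]:
    error = 5.0
    print(f"actual={actual:>6.0f}  |error|={error}  "
          f"APE={error / actual * 100:.1f}%")
# 1000 units → 0.5% error; 10 units → 50% error
```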

Q3: Do you need a relative performance number?
Yes → Report R² alongside MAE/RMSE for context
No → Report MAE or RMSE in the target units

Example: Report to stakeholders: "MAE = 4.2 minutes (R² = 0.87)" gives both absolute and relative context.

python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                              mean_absolute_percentage_error, r2_score)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 3000
distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
            + np.random.normal(0, 4, n)).clip(10, 120)
X = np.column_stack([distance, traffic, prep])
y = delivery

pipeline = Pipeline([
    ('sc', StandardScaler()),
    ('m',  GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                      max_depth=3, random_state=42)),
])

# ── Always compare against a baseline ─────────────────────────────────
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_tr, y_tr)
y_pred = pipeline.predict(X_te)

# Baselines
y_mean   = np.full_like(y_te, y_tr.mean())
y_median = np.full_like(y_te, np.median(y_tr))

print("Model vs baselines — Swiggy delivery time:")
print(f"{'Model':<25} {'MAE':>8} {'RMSE':>8} {'MAPE%':>8} {'R²':>8}")
print("─" * 62)

for label, pred in [
    ('Always predict mean',   y_mean),
    ('Always predict median', y_median),
    ('Gradient Boosting',     y_pred),
]:
    mae  = mean_absolute_error(y_te, pred)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    mape = mean_absolute_percentage_error(y_te, pred) * 100
    r2   = r2_score(y_te, pred)
    print(f"  {label:<23}  {mae:>8.4f}  {rmse:>8.4f}  {mape:>8.2f}  {r2:>8.4f}")

# ── Cross-validate multiple metrics at once ────────────────────────────
from sklearn.metrics import make_scorer

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

cv_results = cross_validate(
    pipeline, X, y,
    cv=KFold(5, shuffle=True, random_state=42),
    scoring={
        'mae':  make_scorer(mean_absolute_error, greater_is_better=False),
        'rmse': make_scorer(rmse, greater_is_better=False),
        'r2':   'r2',
        'mape': make_scorer(mean_absolute_percentage_error,
                            greater_is_better=False),
    },
    return_train_score=False,
)

print("\n5-fold CV across all metrics:")
print(f"  MAE:   {-cv_results['test_mae'].mean():.4f} ± {cv_results['test_mae'].std():.4f} min")
print(f"  RMSE:  {-cv_results['test_rmse'].mean():.4f} ± {cv_results['test_rmse'].std():.4f} min")
print(f"  MAPE:  {-cv_results['test_mape'].mean()*100:.2f}% ± {cv_results['test_mape'].std()*100:.2f}%")
print(f"  R²:    {cv_results['test_r2'].mean():.4f} ± {cv_results['test_r2'].std():.4f}")

# ── RMSE vs MAE ratio reveals outlier presence ─────────────────────────
rmse_val = -cv_results['test_rmse'].mean()
mae_val  = -cv_results['test_mae'].mean()
ratio    = rmse_val / mae_val
print(f"\nRMSE / MAE ratio: {ratio:.2f}")
print("  Ratio ≈ 1.0: errors are uniform, no major outliers")
print("  Ratio > 2.0: large outlier errors present — investigate them")
Beyond the summary number

Residual analysis — where is the model systematically wrong?

A single MAE number hides a lot. A model with MAE = 4.2 minutes might be consistently accurate for short deliveries but systematically wrong for long-distance orders. The aggregate metric looks fine while a whole segment of customers is getting bad predictions. Residual analysis reveals these systematic patterns.

python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 3000
distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
            + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep])
y = delivery
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('sc', StandardScaler()),
    ('m',  GradientBoostingRegressor(n_estimators=200, random_state=42)),
])
pipeline.fit(X_tr, y_tr)
y_pred = pipeline.predict(X_te)

residuals = y_te - y_pred   # positive = underestimate, negative = overestimate

print(f"Overall MAE: {mean_absolute_error(y_te, y_pred):.4f} min")

# ── Are residuals biased? ──────────────────────────────────────────────
mean_residual = residuals.mean()
print(f"\nMean residual: {mean_residual:+.4f} min")
if mean_residual > 0.5:
    print("  ← model systematically underestimates")
elif mean_residual < -0.5:
    print("  ← model systematically overestimates")
else:
    print("  ← no systematic bias")

# ── Error by delivery time bucket ─────────────────────────────────────
print("\nMAE by delivery time bucket:")
for low, high in [(10,20),(20,30),(30,45),(45,60),(60,120)]:
    mask = (y_te >= low) & (y_te < high)
    if mask.sum() > 10:
        mae_bucket  = mean_absolute_error(y_te[mask], y_pred[mask])
        bias_bucket = residuals[mask].mean()
        print(f"  {low:2d}–{high:3d} min  n={mask.sum():4d}  "
              f"MAE={mae_bucket:.2f}  bias={bias_bucket:+.2f}")

# ── Error by distance bucket ──────────────────────────────────────────
dist_te = X_te[:, 0]   # distance feature in test set
print("\nMAE by delivery distance:")
for low, high in [(0,2),(2,4),(4,7),(7,10),(10,15)]:
    mask = (dist_te >= low) & (dist_te < high)
    if mask.sum() > 10:
        mae_d = mean_absolute_error(y_te[mask], y_pred[mask])
        print(f"  {low:.0f}–{high:.0f} km   n={mask.sum():4d}  MAE={mae_d:.2f} min")

# ── Largest errors — what went wrong? ─────────────────────────────────
worst_idx   = np.argsort(np.abs(residuals))[-5:][::-1]
print("\nTop 5 worst predictions:")
print(f"  {'Actual':>8} {'Predicted':>10} {'Error':>8} {'Dist':>6} {'Traffic':>8} {'Prep':>6}")
print("  " + "─" * 52)
for idx in worst_idx:
    print(f"  {y_te[idx]:>8.1f} {y_pred[idx]:>10.1f} {residuals[idx]:>+8.1f} "
          f"{X_te[idx,0]:>6.1f}  {X_te[idx,1]:>8.0f}  {X_te[idx,2]:>6.0f}")
Errors you will hit

Every common regression metric mistake — explained and fixed

R² is negative — model is worse than predicting the mean
Why it happens

Three common causes: the model was trained on different data than it is being evaluated on (wrong split, data leakage in reverse), the features have no relationship to the target on the test set, or the model was fit with the wrong target (e.g. predicting log(y) but evaluating on y). Negative R² means SS_res > SS_tot — the model's errors are larger than if you had just predicted the mean for everything.

Fix

Check that training and test data come from the same distribution. Print y_train.mean() and y_test.mean() — if very different, the split is wrong. Print model.predict(X_test)[:5] and y_test[:5] — if predictions are in a completely different range, the target was transformed during training but not reversed at evaluation. Always un-transform predictions before computing metrics.
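A minimal sketch of the transformed-target failure mode — here assuming the model was trained on log1p(y) (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 15, (500, 1))
y = 10 + 7 * X[:, 0] + rng.normal(0, 3, 500)

model = LinearRegression().fit(X, np.log1p(y))   # trained on log1p(y)

r2_wrong = r2_score(y, model.predict(X))             # forgot to un-transform
r2_right = r2_score(y, np.expm1(model.predict(X)))   # reversed before scoring

print(f"forgot expm1: R²={r2_wrong:.3f}")   # deeply negative
print(f"with expm1:   R²={r2_right:.3f}")   # sensible
```

The raw predictions live in log space (roughly 2.5–5), while the targets live in 10–120 — hence SS_res dwarfs SS_tot and R² collapses below zero until you apply the inverse transform.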

MAPE is infinity or extremely large (1000%+)
Why it happens

One or more true values in y_true are zero or very close to zero. MAPE divides by y_true — division by zero produces infinity which propagates through the mean. Even a single zero target makes MAPE meaningless for the entire evaluation.

Fix

Never use MAPE when targets can be zero. Use MAE or RMSE instead. If zero targets are rare edge cases, filter them: mask = y_true > 0; mape = mean_absolute_percentage_error(y_true[mask], y_pred[mask]). Consider symmetric MAPE (sMAPE) which divides by (|y_true| + |y_pred|)/2 — defined even when y_true=0 though it has other issues.
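A sketch of one common sMAPE definition (the formula above) — note several variants exist in practice:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE — finite even when y_true contains zeros."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.abs(y_true - y_pred) / denom) * 100

y_true = np.array([0.0, 10.0, 100.0])
y_pred = np.array([5.0, 12.0,  95.0])
print(f"sMAPE: {smape(y_true, y_pred):.1f}%")   # finite despite the zero target
```

The zero target still dominates (its term hits the 200% ceiling), which is one of the "other issues" mentioned above — sMAPE is bounded, not unbiased.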

RMSE looks terrible but the model is actually useful in production
Why it happens

RMSE is dominated by a small number of very large errors — outliers in the test set. Five predictions off by 1 minute each and one prediction off by 50 minutes gives RMSE ≈ 20 minutes, making the model look terrible even though 5 out of 6 predictions are nearly perfect. The summary metric hides the distribution of errors.

Fix

Always inspect the error distribution alongside RMSE. Plot a histogram of residuals. Compute the 50th, 90th, and 95th percentile of |residuals|: np.percentile(np.abs(residuals), [50, 90, 95]). Report 'MAE = 3.2 min, 90th percentile error = 8.1 min' — much more informative than a single RMSE of 12. Investigate the large errors separately — they often reveal a specific failure mode.
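The percentile report from the fix, on the exact scenario described (five 1-minute errors, one 50-minute error):

```python
import numpy as np

residuals = np.array([1.0, -1.0, 1.0, -1.0, 1.0, 50.0])

mae  = np.abs(residuals).mean()           # ≈ 9.2
rmse = np.sqrt((residuals ** 2).mean())   # ≈ 20.4 — dominated by one error
p50, p90, p95 = np.percentile(np.abs(residuals), [50, 90, 95])

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}")
print(f"|error| percentiles: p50={p50:.1f}  p90={p90:.1f}  p95={p95:.1f}")
```

The median error is 1 minute even though RMSE says 20 — the percentile view tells the true story.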

R² is high (0.92) but MAE is also high — stakeholders are confused
Why it happens

R² and MAE measure fundamentally different things. R²=0.92 means the model explains 92% of the variance — it is highly correlated with the target. But if the target has very high variance (delivery times ranging from 10 to 120 minutes), 92% explained variance still leaves 8% unexplained, which in absolute terms could be 8 minutes of MAE. High R² does not mean small absolute errors.

Fix

Always report metrics in the target's units (MAE, RMSE in minutes) alongside R². Stakeholders understand 'off by 4.2 minutes on average' better than 'R²=0.87'. Use R² for comparing models on the same dataset and for communicating relative improvement. Use MAE/RMSE for communicating operational accuracy.

What comes next

The Evaluation section is complete. Section 7 — Deep Learning — begins next.

You have now completed every module in the Model Evaluation section: classification metrics, calibration, ROC curves, cross-validation, hyperparameter tuning, model interpretability, and regression metrics. You can honestly evaluate any model — classifier or regressor — and communicate its performance to any audience.

Section 7 — Deep Learning — begins with Module 41. Everything changes: instead of hand-crafted features, the model learns its own representations from raw data. Module 41 builds a neural network from scratch in NumPy — forward pass, backpropagation, gradient descent — before introducing PyTorch.

Next — Module 41 · Deep Learning
Neural Networks from Scratch

Forward pass, backpropagation, and gradient descent built in NumPy before touching PyTorch. The foundation every deep learning framework is built on.

coming soon

🎯 Key Takeaways

  • MAE treats all errors proportionally — a 10-minute error is exactly 2× worse than a 5-minute error. RMSE squares the errors first — a 10-minute error is 4× worse than a 5-minute error. Choose based on whether large errors in your domain are disproportionately costly.
  • MAPE expresses error as a percentage of the actual value — scale-independent and useful when targets span different magnitudes. Never use MAPE when true values can be zero — division by zero makes it undefined.
  • R² measures how much better the model is than predicting the mean. R²=0.87 means 87% of variance explained. R²=0 means no better than the mean. Negative R² means worse than the mean — a signal of a severely broken pipeline.
  • Always compare your model against a naive baseline before reporting any metric. If the baseline (always predict mean) has MAE=12.4 and your model has MAE=11.9, the improvement is marginal despite the metric looking reasonable in isolation.
  • The RMSE/MAE ratio reveals the outlier situation. Ratio near 1.0 means errors are uniform. Ratio above 2.0 means a few very large errors are dominating RMSE. Always inspect the error distribution — report percentile errors (50th, 90th, 95th) alongside summary metrics.
  • Residual analysis exposes systematic bias that aggregate metrics hide. Always check: is the mean residual near zero (no bias)? Does error vary by prediction range or input feature? Are the largest errors concentrated in a specific segment? A model with good overall MAE can be systematically wrong for a specific customer group.