
Evaluation Metrics — Beyond Accuracy

Precision, recall, F1, ROC-AUC, PR-AUC, confusion matrices, and the business cost framing that turns metrics into decisions.

35–40 min · March 2026
Before any formula — why accuracy is almost always the wrong metric

Your fraud model has 98.5% accuracy. Your manager is thrilled. Then you check: it flags zero fraud cases. The entire 98.5% comes from predicting "not fraud" on every single transaction.

Razorpay processes 5 million transactions per day. Only 1.5% are fraudulent — 75,000 transactions. A model that predicts "legitimate" for every transaction achieves 98.5% accuracy without catching a single fraudulent rupee. This model is completely useless, yet the accuracy number looks spectacular in a presentation.

Accuracy is misleading whenever the classes are imbalanced — which is almost always the case in the problems that matter most. Fraud detection: 1–2% fraud. Disease diagnosis: 1–5% positive. Churn prediction: 3–8% churners. Spam detection: 5–20% spam. In all of these, a naive "always predict the majority" baseline achieves 92–99% accuracy while being completely worthless.
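This claim takes three lines to verify. A sketch on simulated data (the 1.5% fraud rate and transaction count are illustrative, not real Razorpay figures): a model that never flags anything still scores ~98.5% accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100k simulated transactions at a 1.5% fraud rate
y_true = (rng.random(100_000) < 0.015).astype(int)

# The "model": predict legitimate for every transaction
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
caught   = int((y_pred * y_true).sum())

print(f"Accuracy of the do-nothing model: {accuracy:.4f}")  # ~0.985
print(f"Fraud caught: {caught} of {int(y_true.sum())}")     # 0 of ~1500
```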

This module teaches the metrics that actually matter: the confusion matrix (what kind of errors is the model making?), precision and recall (the fundamental trade-off), F1 score (one number that balances both), ROC-AUC (threshold-independent performance), and PR-AUC (the right metric for severely imbalanced problems).

🧠 Analogy — read this first

A doctor is screening patients for a rare disease affecting 1 in 100 people. A doctor who says "healthy" to everyone achieves 99% accuracy. But they miss every sick patient. The medical community does not measure doctors by "how often are you right overall?" They measure: "of the people you said were sick, how many actually were?" (precision) and "of all the people who were actually sick, how many did you catch?" (recall).

These two questions — precision and recall — are the core of all classification evaluation. Every other metric (F1, ROC-AUC, PR-AUC) is built on top of them.

🎯 Pro Tip
The right metric depends entirely on the cost of each type of error in your business context. Missing a fraud case costs more than a false alarm? Optimise recall. False alarms cause customers to call support constantly? Optimise precision. This module shows you how to make that decision explicitly.
The foundation of all classification metrics

The confusion matrix — four outcomes, every metric derives from them

Each prediction from a binary classifier lands in one of four possible outcomes. The confusion matrix organises all four. Every metric — accuracy, precision, recall, F1 — is a formula combining these four numbers in different ways. Understanding the four cells first makes every metric obvious.

The confusion matrix — four cells, what each means
                     Predicted: Fraud       Predicted: Legit
Actual: Fraud        TP (True Positive)     FN (False Negative)
Actual: Legit        FP (False Positive)    TN (True Negative)

TP — True Positive:  fraud predicted, actually fraud. Caught it. Good.
TN — True Negative:  legit predicted, actually legit. Correct. Good.
FP — False Positive: fraud predicted, actually legit. False alarm — customer blocked unnecessarily.
FN — False Negative: legit predicted, actually fraud. Missed fraud — the most costly error.
python
import numpy as np
from sklearn.metrics import (confusion_matrix, classification_report,
                              accuracy_score, precision_score,
                              recall_score, f1_score)
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 10_000

# Simulate Razorpay fraud predictions
# 1.5% fraud rate
y_true  = (np.random.random(n) < 0.015).astype(int)
# Realistic model: catches 70% of fraud, 3% false alarm rate
y_pred  = np.zeros(n, dtype=int)
fraud_idx = np.where(y_true == 1)[0]
legit_idx = np.where(y_true == 0)[0]
# True positives: catch 70% of actual fraud
tp_idx = np.random.choice(fraud_idx, int(len(fraud_idx)*0.70), replace=False)
y_pred[tp_idx] = 1
# False positives: flag 3% of legitimate transactions
fp_idx = np.random.choice(legit_idx, int(len(legit_idx)*0.03), replace=False)
y_pred[fp_idx] = 1

# ── Confusion matrix ──────────────────────────────────────────────────
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print("Confusion matrix:")
print(f"  {'':15} {'Pred: Fraud':>14} {'Pred: Legit':>14}")
print(f"  {'Actual: Fraud':<15} {'TP = ' + str(tp):>14} {'FN = ' + str(fn):>14}")
print(f"  {'Actual: Legit':<15} {'FP = ' + str(fp):>14} {'TN = ' + str(tn):>14}")

# ── Derive every metric manually ─────────────────────────────────────
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of all flagged, how many were fraud?
recall    = tp / (tp + fn)          # of all fraud, how many did we catch?
f1        = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)        # of all legit, how many were correctly allowed?
fpr       = fp / (fp + tn)          # false positive rate = 1 - specificity

print("\nMetrics (computed manually):")
print(f"  Accuracy:    {accuracy:.4f}   ← misleading — 98.5% if we predict all legit")
print(f"  Precision:   {precision:.4f}   ← of flagged transactions, {precision*100:.1f}% are fraud")
print(f"  Recall:      {recall:.4f}   ← we catch {recall*100:.1f}% of all fraud")
print(f"  F1 score:    {f1:.4f}   ← harmonic mean of precision and recall")
print(f"  Specificity: {specificity:.4f}   ← {specificity*100:.1f}% of legit transactions pass through")
print(f"  FPR:         {fpr:.4f}   ← {fpr*100:.1f}% of legit flagged as fraud")

# Business impact translation
fraud_value_per_tx = 2500  # avg ₹2500 per fraudulent transaction
false_alarm_cost   = 50    # ₹50 cost per false alarm (support call, friction)

fraud_caught    = tp * fraud_value_per_tx
fraud_missed    = fn * fraud_value_per_tx
false_alarm_cost_total = fp * false_alarm_cost

print("\nBusiness impact:")
print(f"  Fraud caught:         ₹{fraud_caught:,.0f} protected")
print(f"  Fraud missed (FN):    ₹{fraud_missed:,.0f} lost")
print(f"  False alarm cost:     ₹{false_alarm_cost_total:,.0f} in friction")
print(f"  Net value of model:   ₹{fraud_caught - fraud_missed - false_alarm_cost_total:,.0f}")

# sklearn classification_report gives everything at once
print("\nsklearn classification_report:")
print(classification_report(y_true, y_pred, target_names=['Legit', 'Fraud']))
The fundamental trade-off

Precision vs recall — you cannot maximise both simultaneously

Precision and recall are in tension. To catch more fraud (increase recall) you need to lower the classification threshold — flag more transactions. But flagging more transactions means more false alarms (lower precision). To reduce false alarms (increase precision) you raise the threshold — but then you miss more actual fraud (lower recall). This trade-off is unavoidable and inherent to every binary classifier.

The right balance depends entirely on the business cost of each error type. Missing a fraud transaction at Razorpay costs ₹2,500 on average; a false alarm costs ₹50 in support friction. The cost ratio is 50:1, meaning you break even accepting up to 50 false alarms for each additional fraud case caught, so you should optimise heavily toward recall at the expense of precision.
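Under these assumed costs, the break-even decision threshold for a well-calibrated model follows directly from comparing expected costs. A minimal sketch using the illustrative ₹2,500 / ₹50 figures above:

```python
# Break-even threshold from the two error costs.
# Assumes the model's probabilities are calibrated; the cost figures
# are the illustrative ones from the text, not real Razorpay numbers.
fn_cost = 2500   # expected loss from letting one fraud through
fp_cost = 50     # friction cost of blocking one legit transaction

# Flag when the expected loss of allowing exceeds the cost of blocking:
#   p * fn_cost > (1 - p) * fp_cost
#   p > fp_cost / (fp_cost + fn_cost)
threshold = fp_cost / (fp_cost + fn_cost)
print(f"Break-even threshold: {threshold:.4f}")   # ≈ 0.0196, far below 0.5
```

At a 50:1 cost ratio the rational threshold sits near 0.02, which is why the default 0.5 leaves so much value on the table.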

Precision and recall — the complete definitions
Precision = TP / (TP + FP)
  Of all transactions I flagged as fraud — how many actually were?
  High precision = few false alarms. Optimise when false alarms are costly.
  Example: Precision = 0.85 → 85% of flagged transactions are genuine fraud.

Recall (Sensitivity) = TP / (TP + FN)
  Of all transactions that were actually fraud — how many did I catch?
  High recall = few missed positives. Optimise when missing positives is costly.
  Example: Recall = 0.90 → we catch 90% of all fraudulent transactions.

F1 Score = 2 × Precision × Recall / (Precision + Recall)
  A single score that balances both — the harmonic mean.
  Use when you need one number and both errors matter equally.
  Example: F1 = 0.87 for precision = 0.85 and recall = 0.90.
python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (precision_score, recall_score, f1_score,
                              average_precision_score)
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 8_000

# Razorpay fraud features
amount         = np.abs(np.random.normal(1200, 2000, n)).clip(10, 50_000)
merchant_risk  = np.random.uniform(0, 1, n)
n_tx_hour      = np.random.randint(0, 20, n).astype(float)
device_age     = np.abs(np.random.normal(200, 150, n)).clip(0, 1000)
is_new_device  = np.random.randint(0, 2, n).astype(float)

fraud_score = (
    (amount/50_000)*0.30 + merchant_risk*0.25
    + (n_tx_hour/20)*0.25 + is_new_device*0.15
    + np.random.randn(n)*0.05
)
y = (fraud_score > 0.55).astype(int)

X = np.column_stack([amount, merchant_risk, n_tx_hour, device_age, is_new_device])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=42,
)
model.fit(X_tr_sc, y_tr)
y_proba = model.predict_proba(X_te_sc)[:, 1]

# ── Precision-recall at different thresholds ──────────────────────────
print("Precision-recall trade-off at different thresholds:")
print(f"{'Threshold':<12} {'Precision':>11} {'Recall':>9} {'F1':>9} {'Flagged':>9}")
print("─" * 54)

for threshold in np.arange(0.1, 0.91, 0.1):
    y_pred_t  = (y_proba >= threshold).astype(int)
    prec      = precision_score(y_te, y_pred_t, zero_division=0)
    rec       = recall_score(y_te, y_pred_t, zero_division=0)
    f1        = f1_score(y_te, y_pred_t, zero_division=0)
    flagged   = y_pred_t.mean() * 100
    print(f"  t={threshold:.1f}      {prec:>9.4f}  {rec:>9.4f}  {f1:>9.4f}  {flagged:>7.1f}%")

# ── F-beta score — weight recall over precision ────────────────────────
from sklearn.metrics import fbeta_score

# At Razorpay: missing fraud costs 50× more than false alarm
# β=2 weights recall twice as heavily as precision
# β=0.5 weights precision twice as heavily as recall
print("\nF-beta score — adjusting the precision/recall balance:")
y_pred_default = (y_proba >= 0.5).astype(int)
for beta in [0.5, 1.0, 2.0, 3.0]:
    fb = fbeta_score(y_te, y_pred_default, beta=beta, zero_division=0)
    desc = '(precision weighted)' if beta < 1 else '(equal weight)' if beta == 1 else '(recall weighted)'
    print(f"  F{beta}: {fb:.4f}  {desc}")

# ── PR-AUC — area under the precision-recall curve ────────────────────
# Better than ROC-AUC for highly imbalanced problems
pr_auc = average_precision_score(y_te, y_proba)
print(f"\nPR-AUC (Average Precision): {pr_auc:.4f}")
print("PR-AUC = 1.0: perfect model")
print(f"PR-AUC = {y_te.mean():.4f}: random baseline (= fraud rate)")
Threshold-independent evaluation

ROC-AUC — how well the model ranks fraud above legitimate transactions

Precision and recall depend on the threshold you choose. Change the threshold, get different precision and recall. ROC-AUC (Receiver Operating Characteristic — Area Under Curve) is threshold-independent. It measures how well the model separates the two classes across all possible thresholds at once.

The ROC curve plots the true positive rate (recall) against the false positive rate at every possible threshold. A perfect model has a curve that goes straight up to (0, 1) — it achieves 100% recall with 0% false alarms. A random model produces a diagonal line — recall equals the false alarm rate. The AUC is the area under the curve: 1.0 is perfect, 0.5 is random.

🧠 Analogy — read this first

You have 100 fraud cases and 9,900 legit transactions — all shuffled randomly. You ask the model to score all 10,000 and sort them by fraud probability, highest first. How many of the actual 100 fraud cases appear in the top 100? Top 200? Top 500? If the model is perfect, all 100 fraud cases appear before any legitimate transaction. The ROC curve plots this across every possible cutpoint. AUC is the probability that a randomly chosen fraud transaction scores higher than a randomly chosen legit one.

AUC = 0.95 means: take one random fraud transaction and one random legit transaction. There is a 95% chance the model assigns a higher fraud score to the fraud transaction. This is the most intuitive interpretation of AUC.
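This pairwise interpretation can be checked numerically by comparing sklearn's AUC against the fraction of correctly ranked (fraud, legit) score pairs. The normal score distributions below are illustrative assumptions, not a claim about real fraud scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
# Synthetic scores: fraud tends to score higher than legit
legit_scores = rng.normal(0.0, 1.0, 2_000)
fraud_scores = rng.normal(1.5, 1.0, 200)

y_true  = np.concatenate([np.zeros(2_000), np.ones(200)])
y_score = np.concatenate([legit_scores, fraud_scores])

auc = roc_auc_score(y_true, y_score)

# Direct estimate: fraction of (fraud, legit) pairs where the fraud
# transaction scores higher, counting ties as half a correct ranking
diff     = fraud_scores[:, None] - legit_scores[None, :]
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(f"roc_auc_score:        {auc:.4f}")
print(f"pairwise probability: {pairwise:.4f}")   # matches the AUC
```

This identity (AUC = Mann–Whitney ranking probability) is exactly why AUC is threshold-independent: it only looks at how scores order the two classes.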

ROC curve — three models, same dataset
[Figure: ROC curves for three models on the same dataset (AUC = 0.95, 0.80, 0.65) plotted against the random diagonal; the ideal point is the top-left corner, FPR = 0 and TPR = 1.]
AUC = 1.0      Perfect — always ranks fraud above legit
AUC = 0.9+     Excellent — production quality
AUC = 0.8–0.9  Good — acceptable for most problems
AUC = 0.7–0.8  Fair — investigate features
AUC = 0.5–0.7  Poor — barely above random
AUC = 0.5      Random — model has no signal
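Since the original curve figure cannot be reproduced here, a minimal matplotlib sketch generates a comparable plot. The labels and scores are synthetic, and the noise levels were chosen only to spread the AUCs apart:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe; remove this line to display interactively
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(42)
y = (rng.random(5_000) < 0.10).astype(int)   # 10% positive class

aucs = []
for noise, label in [(0.5, "strong"), (1.5, "medium"), (3.0, "weak")]:
    score = y + rng.normal(0, noise, y.size)  # same signal, increasing noise
    fpr, tpr, _ = roc_curve(y, score)
    auc = roc_auc_score(y, score)
    aucs.append(auc)
    plt.plot(fpr, tpr, label=f"{label} model: AUC={auc:.2f}")

plt.plot([0, 1], [0, 1], "k--", label="random (AUC=0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curves — three models, same labels")
plt.legend()
plt.savefig("roc_curves.png", dpi=120)
```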
python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, average_precision_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 8_000

amount        = np.abs(np.random.normal(1200, 2000, n)).clip(10, 50_000)
merchant_risk = np.random.uniform(0, 1, n)
n_tx_hour     = np.random.randint(0, 20, n).astype(float)
device_age    = np.abs(np.random.normal(200, 150, n)).clip(0, 1000)
is_new_device = np.random.randint(0, 2, n).astype(float)

fraud_score = (
    (amount/50_000)*0.30 + merchant_risk*0.25
    + (n_tx_hour/20)*0.25 + is_new_device*0.15
    + np.random.randn(n)*0.05
)
y = (fraud_score > 0.55).astype(int)
X = np.column_stack([amount, merchant_risk, n_tx_hour, device_age, is_new_device])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)
sc      = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'RandomForest':       RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'GradientBoosting':   GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                                       max_depth=3, random_state=42),
}

print(f"{'Model':<22} {'ROC-AUC':>9} {'PR-AUC':>9} {'CV AUC (5-fold)':>16}")
print("─" * 60)

for name, model in models.items():
    model.fit(X_tr_sc, y_tr)
    y_prob  = model.predict_proba(X_te_sc)[:, 1]
    roc_auc = roc_auc_score(y_te, y_prob)
    pr_auc  = average_precision_score(y_te, y_prob)
    cv_auc  = cross_val_score(model, X_tr_sc, y_tr, cv=5,
                               scoring='roc_auc').mean()
    print(f"  {name:<20}  {roc_auc:>9.4f}  {pr_auc:>9.4f}  {cv_auc:>14.4f}")

# ── ROC curve — operating points ──────────────────────────────────────
best_model = models['GradientBoosting']
y_prob     = best_model.predict_proba(X_te_sc)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, y_prob)

print("\nROC curve operating points (selected thresholds):")
print(f"{'Threshold':<12} {'TPR (Recall)':>13} {'FPR':>8} {'Specificity':>13}")
print("─" * 50)

# Find specific operating points
for target_recall in [0.95, 0.90, 0.80, 0.70, 0.60]:
    idx  = np.argmin(np.abs(tpr - target_recall))
    spec = 1 - fpr[idx]
    print(f"  t={thresholds[idx]:.3f}     TPR={tpr[idx]:.3f}       FPR={fpr[idx]:.3f}    Spec={spec:.3f}")

# ── When to use ROC-AUC vs PR-AUC ─────────────────────────────────────
print("\nROC-AUC vs PR-AUC:")
print("  ROC-AUC: affected equally by both classes — good for balanced datasets")
print("  PR-AUC:  focuses on the positive (minority) class — use for fraud, churn")
print(f"  Fraud rate in this dataset: {y_te.mean()*100:.1f}%")
print(f"  → At {y_te.mean()*100:.1f}% positive rate, PR-AUC is the more informative metric")
When the output is continuous

Regression metrics — MAE, RMSE, MAPE, and R²

Regression problems have their own set of evaluation metrics. The right choice depends on how you want to treat large errors and whether the scale of the target matters for interpretation.

Four regression metrics — what each penalises
MAE — Mean Absolute Error
mean(|y − ŷ|)

Average absolute difference. Easy to interpret — same units as target. Treats all errors equally. Robust to outliers.

Delivery time prediction: "model is off by 4.2 minutes on average"
RMSE — Root Mean Squared Error
√mean((y − ŷ)²)

Square root of average squared error. Penalises large errors more than MAE. More sensitive to outliers. Same units as target.

When large errors are disproportionately costly. Stock prices, safety-critical predictions.
MAPE — Mean Absolute Percentage Error
mean(|y − ŷ| / |y|) × 100

Average % error relative to actual. Scale-independent — good for comparing across products. Breaks when y=0.

Demand forecasting: "model is off by 8.3% on average". Comparable across SKUs.
R² — Coefficient of Determination
1 − SS_res / SS_tot

Fraction of variance explained by the model. R²=1 is perfect, R²=0 is as good as predicting the mean, R²<0 is worse than mean. Scale-independent.

Quick sanity check: R²=0.87 means model explains 87% of target variance.
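The MAPE caveat (it breaks when actual values approach zero) is easy to demonstrate with three hand-picked points; the numbers are purely illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

y_true = np.array([100.0, 120.0, 0.01])   # one near-zero actual value
y_pred = np.array([ 98.0, 125.0, 1.00])   # tiny absolute error on it

mae  = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)

print(f"MAE:  {mae:.3f}")   # small and sensible
print(f"MAPE: {mape:.1%}")  # exploded — dominated by the near-zero target
```

One sample with a near-zero actual contributes a percentage error of 9,900%, swamping the other two. For targets that can be zero or near zero, prefer MAE or a symmetric/weighted variant.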
python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                              mean_absolute_percentage_error, r2_score)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 3000
distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
            + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep])
y = delivery
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                   max_depth=3, random_state=42)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# ── All four metrics ──────────────────────────────────────────────────
mae  = mean_absolute_error(y_te, y_pred)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
mape = mean_absolute_percentage_error(y_te, y_pred) * 100
r2   = r2_score(y_te, y_pred)

print(f"Swiggy delivery time model evaluation:")
print(f"  MAE:   {mae:.4f} min   ← average error in minutes")
print(f"  RMSE:  {rmse:.4f} min  ← penalises large errors more")
print(f"  MAPE:  {mape:.4f}%    ← percentage error relative to actual")
print(f"  R²:    {r2:.4f}       ← model explains {r2*100:.1f}% of variance")

# ── When RMSE >> MAE: outliers are present ────────────────────────────
print(f"\nRMSE / MAE ratio: {rmse/mae:.2f}")
print("  Ratio near 1.0: errors are uniform, no major outliers")
print("  Ratio > 2.0: large outlier errors are inflating RMSE")

# ── Baseline comparison — always compare against naive models ─────────
mean_pred    = np.full_like(y_te, y_tr.mean())
median_pred  = np.full_like(y_te, np.median(y_tr))

print("\nBaseline comparisons:")
print(f"  Always-predict-mean  MAE={mean_absolute_error(y_te, mean_pred):.4f}  R²={r2_score(y_te, mean_pred):.4f}")
print(f"  Always-predict-median MAE={mean_absolute_error(y_te, median_pred):.4f}  R²={r2_score(y_te, median_pred):.4f}")
print(f"  Our GBM model        MAE={mae:.4f}  R²={r2:.4f}")
print(f"  Improvement over mean: {(mean_absolute_error(y_te, mean_pred) - mae) / mean_absolute_error(y_te, mean_pred) * 100:.1f}%")
Turning probabilities into decisions

Threshold tuning — 0.5 is almost never the optimal threshold

sklearn's predict() uses 0.5 as the default threshold. A transaction with fraud probability 0.51 is flagged. One with 0.49 is not. This is almost never the right business decision. The optimal threshold should be derived from the relative cost of false positives and false negatives — which is a business decision, not a modelling decision.

python
import numpy as np
from sklearn.metrics import (precision_recall_curve, roc_curve,
                              f1_score, fbeta_score,
                              precision_score, recall_score)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 8_000
amount        = np.abs(np.random.normal(1200, 2000, n)).clip(10, 50_000)
merchant_risk = np.random.uniform(0, 1, n)
n_tx_hour     = np.random.randint(0, 20, n).astype(float)
device_age    = np.abs(np.random.normal(200, 150, n)).clip(0, 1000)
is_new_device = np.random.randint(0, 2, n).astype(float)
fraud_score = (
    (amount/50_000)*0.30 + merchant_risk*0.25
    + (n_tx_hour/20)*0.25 + is_new_device*0.15
    + np.random.randn(n)*0.05
)
y = (fraud_score > 0.55).astype(int)
X = np.column_stack([amount, merchant_risk, n_tx_hour, device_age, is_new_device])
X_tr,X_te,y_tr,y_te = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)
X_tv,X_val,y_tv,y_val = train_test_split(X_tr,y_tr,test_size=0.2,stratify=y_tr,random_state=42)
sc = StandardScaler()
X_tv_sc  = sc.fit_transform(X_tv)
X_val_sc = sc.transform(X_val)
X_te_sc  = sc.transform(X_te)

model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                    max_depth=3, random_state=42)
model.fit(X_tv_sc, y_tv)
val_proba = model.predict_proba(X_val_sc)[:, 1]

# ── Method 1: Maximise F1 ─────────────────────────────────────────────
thresholds   = np.arange(0.05, 0.96, 0.01)
f1_scores    = [f1_score(y_val, (val_proba >= t).astype(int), zero_division=0)
                for t in thresholds]
best_t_f1    = thresholds[np.argmax(f1_scores)]
print(f"Best threshold by F1:     {best_t_f1:.2f}  (F1={max(f1_scores):.4f})")

# ── Method 2: Maximise Fβ (recall-weighted for fraud) ─────────────────
fb_scores    = [fbeta_score(y_val, (val_proba >= t).astype(int),
                             beta=2, zero_division=0) for t in thresholds]
best_t_fb    = thresholds[np.argmax(fb_scores)]
print(f"Best threshold by F2:     {best_t_fb:.2f}  (F2={max(fb_scores):.4f})")

# ── Method 3: Business cost optimisation ─────────────────────────────
# Cost of false negative (missed fraud):  ₹2,500 avg transaction
# Cost of false positive (blocked legit): ₹50 friction
fn_cost = 2500
fp_cost = 50

costs = []
for t in thresholds:
    pred = (val_proba >= t).astype(int)
    fn   = ((pred == 0) & (y_val == 1)).sum()
    fp   = ((pred == 1) & (y_val == 0)).sum()
    costs.append(fn * fn_cost + fp * fp_cost)

best_t_biz = thresholds[np.argmin(costs)]
print(f"Best threshold by cost:   {best_t_biz:.2f}  (cost=₹{min(costs):,.0f})")

# ── Compare all thresholds on test set ────────────────────────────────
test_proba = model.predict_proba(X_te_sc)[:, 1]
print("\nTest set performance at different thresholds:")
print(f"{'Method':<22} {'Threshold':>10} {'Precision':>11} {'Recall':>9} {'F1':>9}")
print("─" * 65)

for label, t in [('Default (0.5)', 0.5),
                  ('Max F1', best_t_f1),
                  ('Max F2 (recall++)', best_t_fb),
                  ('Min business cost', best_t_biz)]:
    pred = (test_proba >= t).astype(int)
    p  = precision_score(y_te, pred, zero_division=0)
    r  = recall_score(y_te, pred, zero_division=0)
    f1 = f1_score(y_te, pred, zero_division=0)
    print(f"  {label:<20}  {t:>10.2f}  {p:>11.4f}  {r:>9.4f}  {f1:>9.4f}")
When there are more than two classes

Multi-class evaluation — macro, micro, and weighted averaging

Binary metrics extend naturally to multi-class problems. The question is how to aggregate per-class metrics into a single number. Three averaging strategies give different answers and are appropriate in different situations.

python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                              f1_score, precision_score, recall_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 5000

# Swiggy support ticket categories: 4 classes
# 0=delivery_issue (40%), 1=food_quality (25%), 2=payment_issue (20%), 3=general (15%)
# NOTE: features here are random noise, so the scores sit near chance level —
# the point is to illustrate the averaging strategies, not model quality.
X = np.random.randn(n, 10)
y = np.random.choice([0, 1, 2, 3], n, p=[0.40, 0.25, 0.20, 0.15])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)
sc = StandardScaler()
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(sc.fit_transform(X_tr), y_tr)
y_pred = model.predict(sc.transform(X_te))

class_names = ['delivery', 'food_quality', 'payment', 'general']

# ── Classification report ──────────────────────────────────────────────
print("Full classification report:")
print(classification_report(y_te, y_pred, target_names=class_names))

# ── Three averaging strategies explained ──────────────────────────────
print("Averaging strategies for multi-class F1:")
for avg in ['macro', 'weighted', 'micro']:
    f1 = f1_score(y_te, y_pred, average=avg)
    if avg == 'macro':
        desc = 'unweighted mean of per-class F1 — treats all classes equally'
    elif avg == 'weighted':
        desc = 'weighted by support (class size) — accounts for imbalance'
    else:
        desc = 'global TP/FP/FN — equivalent to accuracy for multi-class'
    print(f"  {avg:<10}: {f1:.4f}  ← {desc}")

print("\nWhen to use each:")
print("  macro:    when all classes matter equally (even rare ones)")
print("  weighted: when class frequency should influence the metric")
print("  micro:    rarely used for multi-class; equivalent to accuracy")

# ── Per-class metrics ──────────────────────────────────────────────────
print("\nPer-class F1 scores:")
per_class_f1 = f1_score(y_te, y_pred, average=None)
class_counts = np.bincount(y_te)
for name, f1, count in zip(class_names, per_class_f1, class_counts):
    bar = '█' * int(f1 * 25)
    print(f"  {name:<14}: {bar:<25} {f1:.4f}  (n={count})")
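The claim that micro-averaged F1 reduces to accuracy for single-label multi-class problems can be verified directly (random labels, purely illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, 1_000)
y_pred = rng.integers(0, 4, 1_000)

micro_f1 = f1_score(y_true, y_pred, average='micro')
acc      = accuracy_score(y_true, y_pred)

# Micro precision and micro recall both equal accuracy when every sample
# has exactly one label, so their harmonic mean does too.
print(np.isclose(micro_f1, acc))  # True
```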
Errors you will hit

Every common evaluation mistake — explained and fixed

Model reports 98% accuracy but the business is unhappy — model is useless in production
Why it happens

Accuracy on an imbalanced dataset is dominated by the majority class. A model that predicts 'no fraud' for every transaction achieves 98.5% accuracy on a dataset with 1.5% fraud — but catches zero fraud. The metric looks excellent while the model is completely worthless for its intended purpose.

Fix

For imbalanced classification problems, never report accuracy as the primary metric. Use ROC-AUC (threshold-independent ranking quality), PR-AUC (especially for severe imbalance), precision and recall at the operating threshold, or F1/F-beta score. Always check the confusion matrix before reporting any metric — it instantly exposes a model that is just predicting the majority class.

UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 — no predicted samples
Why it happens

Your model predicts zero positive cases — every sample is classified as negative. This happens when the classification threshold is too high, the model has very low confidence in all positive predictions, or the training data was too imbalanced and the model learned to always predict the majority class.

Fix

Check the predict_proba output: if every probability is below 0.5, the default threshold yields zero positive predictions. Set zero_division=0 in precision_score to suppress the warning while you investigate, and lower the classification threshold. For imbalanced training data, set class_weight='balanced' (sklearn), scale_pos_weight (XGBoost), or is_unbalance=True (LightGBM), then verify the model is actually learning a signal.

ROC-AUC is 0.97 on validation but drops to 0.71 in production
Why it happens

There are three common causes: data leakage during training (validation set was contaminated by training statistics — Module 20), temporal leakage (training on future data to predict the past — the fraud patterns changed), or distribution shift (production transactions have a different distribution than training data — different time period, different merchant mix, different fraud patterns).

Fix

Audit the training pipeline for leakage using the Module 20 checklist. For time-series data (transactions always are), verify you used chronological splits: train on January–October, validate on November, test on December. Monitor the model's AUC in production with a weekly shadow evaluation against labelled samples. When production AUC drops by 5+ points, trigger retraining.

F1 score of 0.0 despite model having reasonable ROC-AUC
Why it happens

F1 score uses the default 0.5 threshold. If your model's predicted probabilities are all below 0.5 (common when the positive class is rare and the model is well-calibrated), the model predicts all negatives at threshold=0.5 and F1 becomes undefined or 0. Meanwhile ROC-AUC correctly reflects that the model's ranking is good.

Fix

Tune the threshold before computing F1. Use the validation set to find the threshold that maximises F1: thresholds = np.arange(0.01, 0.5, 0.01); best_t = max(thresholds, key=lambda t: f1_score(y_val, (val_proba >= t).astype(int))). Apply this threshold when calling predict() and when computing F1 on the test set.

What comes next

You can now evaluate any model honestly. Next: are the probabilities themselves trustworthy?

ROC-AUC tells you whether the model ranks fraud above legitimate transactions. It does not tell you whether the probabilities are accurate. A model that says P(fraud) = 0.9 for a transaction — does that mean 90% of such transactions are actually fraud? Or is the model's confidence unreliable?

The next module — Calibration — answers this. Calibration curves, reliability diagrams, and the two most common miscalibration patterns in gradient boosting and neural networks. Well-calibrated probabilities are essential for fraud scoring, credit decisions, and medical diagnosis where the actual probability matters, not just the ranking.

Next — Module 35 · Model Evaluation
Calibration — Are Your Probabilities Trustworthy?

Reliability diagrams, Brier score, and Platt scaling vs isotonic regression — when your model says 80% fraud, does it mean 80%?

coming soon

🎯 Key Takeaways

  • Accuracy is misleading on imbalanced datasets. A model that predicts the majority class every time achieves 98.5% accuracy on a 1.5% fraud dataset while catching zero fraud. Always check the confusion matrix before reporting any metric.
  • The confusion matrix has four cells: TP (caught fraud), TN (correctly allowed), FP (false alarm — legit blocked), FN (missed fraud). Every classification metric is a formula combining these four numbers.
  • Precision = TP/(TP+FP): of all flagged transactions, what fraction were genuinely fraud? Recall = TP/(TP+FN): of all actual fraud, what fraction did we catch? They trade off — raising the threshold increases precision but decreases recall.
  • ROC-AUC is threshold-independent — it measures how well the model ranks fraud above legitimate across all possible thresholds. AUC = 0.95 means a random fraud transaction scores higher than a random legit transaction 95% of the time.
  • For severely imbalanced problems (fraud rate < 5%), PR-AUC (area under the precision-recall curve) is more informative than ROC-AUC. ROC-AUC can look excellent even when precision on the minority class is terrible.
  • The optimal threshold is almost never 0.5. Derive it from the relative business cost of false negatives vs false positives. At Razorpay, missing fraud (FN) costs ₹2,500 while a false alarm (FP) costs ₹50 — optimise heavily toward recall by lowering the threshold well below 0.5.