
Evaluation Metrics — Beyond Accuracy

Precision, recall, F1, ROC-AUC, PR-AUC, confusion matrices, and the business cost framing that turns metrics into decisions.

35–40 min · March 2026
Before any formula — why accuracy is almost always the wrong metric

Your fraud model has 98.5% accuracy. Your manager is thrilled. Then you check: it flags zero fraud cases. The entire 98.5% comes from predicting "not fraud" on every single transaction.

Razorpay processes 5 million transactions per day. Only 1.5% are fraudulent — 75,000 transactions. A model that predicts "legitimate" for every transaction achieves 98.5% accuracy without catching a single fraudulent rupee. This model is completely useless, yet the accuracy number looks spectacular in a presentation.

Accuracy is misleading whenever the classes are imbalanced — which is almost always the case in the problems that matter most. Fraud detection: 1–2% fraud. Disease diagnosis: 1–5% positive. Churn prediction: 3–8% churners. Spam detection: 5–20% spam. In all of these, a naive "always predict the majority" baseline achieves 92–99% accuracy while being completely worthless.
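This claim takes three lines to verify. A sketch on simulated data (the 1.5% fraud rate and transaction count are illustrative, not real Razorpay figures): a model that never flags anything still scores ~98.5% accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100k simulated transactions at a 1.5% fraud rate
y_true = (rng.random(100_000) < 0.015).astype(int)

# The "model": predict legitimate for every transaction
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
caught   = int((y_pred * y_true).sum())

print(f"Accuracy of the do-nothing model: {accuracy:.4f}")  # ~0.985
print(f"Fraud caught: {caught} of {int(y_true.sum())}")     # 0 of ~1500
```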

This module teaches the metrics that actually matter: the confusion matrix (what kind of errors is the model making?), precision and recall (the fundamental trade-off), F1 score (one number that balances both), ROC-AUC (threshold-independent performance), and PR-AUC (the right metric for severely imbalanced problems).

🧠 Analogy — read this first

A doctor is screening patients for a rare disease affecting 1 in 100 people. A doctor who says "healthy" to everyone achieves 99% accuracy. But they miss every sick patient. The medical community does not measure doctors by "how often are you right overall?" They measure: "of the people you said were sick, how many actually were?" (precision) and "of all the people who were actually sick, how many did you catch?" (recall).

These two questions — precision and recall — are the core of all classification evaluation. Every other metric (F1, ROC-AUC, PR-AUC) is built on top of them.

🎯 Pro Tip
The right metric depends entirely on the cost of each type of error in your business context. Missing a fraud case costs more than a false alarm? Optimise recall. False alarms cause customers to call support constantly? Optimise precision. This module shows you how to make that decision explicitly.
The foundation of all classification metrics

The confusion matrix — four outcomes, every metric derives from them

Each prediction from a binary classifier lands in one of four possible outcomes. The confusion matrix organises all four. Every metric — accuracy, precision, recall, F1 — is a formula combining these four numbers in different ways. Understanding the four cells first makes every metric obvious.

The confusion matrix — four cells, what each means
                     Predicted: Fraud       Predicted: Legit
Actual: Fraud        TP (True Positive)     FN (False Negative)
Actual: Legit        FP (False Positive)    TN (True Negative)

TP — True Positive:  fraud predicted, actually fraud. Caught it. Good.
TN — True Negative:  legit predicted, actually legit. Correct. Good.
FP — False Positive: fraud predicted, actually legit. False alarm — customer blocked unnecessarily.
FN — False Negative: legit predicted, actually fraud. Missed fraud — the most costly error.
python
import numpy as np
from sklearn.metrics import (confusion_matrix, classification_report,
                              accuracy_score, precision_score,
                              recall_score, f1_score)
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 10_000

# Simulate Razorpay fraud predictions
# 1.5% fraud rate
y_true  = (np.random.random(n) < 0.015).astype(int)
# Realistic model: catches 70% of fraud, 3% false alarm rate
y_pred  = np.zeros(n, dtype=int)
fraud_idx = np.where(y_true == 1)[0]
legit_idx = np.where(y_true == 0)[0]
# True positives: catch 70% of actual fraud
tp_idx = np.random.choice(fraud_idx, int(len(fraud_idx)*0.70), replace=False)
y_pred[tp_idx] = 1
# False positives: flag 3% of legitimate transactions
fp_idx = np.random.choice(legit_idx, int(len(legit_idx)*0.03), replace=False)
y_pred[fp_idx] = 1

# ── Confusion matrix ──────────────────────────────────────────────────
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print("Confusion matrix:")
print(f"  {'':15} {'Pred: Fraud':>14} {'Pred: Legit':>14}")
print(f"  {'Actual: Fraud':<15} {'TP = ' + str(tp):>14} {'FN = ' + str(fn):>14}")
print(f"  {'Actual: Legit':<15} {'FP = ' + str(fp):>14} {'TN = ' + str(tn):>14}")

# ── Derive every metric manually ─────────────────────────────────────
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of all flagged, how many were fraud?
recall    = tp / (tp + fn)          # of all fraud, how many did we catch?
f1        = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)        # of all legit, how many were correctly allowed?
fpr       = fp / (fp + tn)          # false positive rate = 1 - specificity

print("\nMetrics (computed manually):")
print(f"  Accuracy:    {accuracy:.4f}   ← misleading — 98.5% if we predict all legit")
print(f"  Precision:   {precision:.4f}   ← of flagged transactions, {precision*100:.1f}% are fraud")
print(f"  Recall:      {recall:.4f}   ← we catch {recall*100:.1f}% of all fraud")
print(f"  F1 score:    {f1:.4f}   ← harmonic mean of precision and recall")
print(f"  Specificity: {specificity:.4f}   ← {specificity*100:.1f}% of legit transactions pass through")
print(f"  FPR:         {fpr:.4f}   ← {fpr*100:.1f}% of legit flagged as fraud")

# Business impact translation
fraud_value_per_tx = 2500  # avg ₹2500 per fraudulent transaction
false_alarm_cost   = 50    # ₹50 cost per false alarm (support call, friction)

fraud_caught    = tp * fraud_value_per_tx
fraud_missed    = fn * fraud_value_per_tx
false_alarm_cost_total = fp * false_alarm_cost

print("\nBusiness impact:")
print(f"  Fraud caught:         ₹{fraud_caught:,.0f} protected")
print(f"  Fraud missed (FN):    ₹{fraud_missed:,.0f} lost")
print(f"  False alarm cost:     ₹{false_alarm_cost_total:,.0f} in friction")
print(f"  Net value of model:   ₹{fraud_caught - fraud_missed - false_alarm_cost_total:,.0f}")

# sklearn classification_report gives everything at once
print("\nsklearn classification_report:")
print(classification_report(y_true, y_pred, target_names=['Legit', 'Fraud']))
The fundamental trade-off

Precision vs recall — you cannot maximise both simultaneously

Precision and recall are in tension. To catch more fraud (increase recall) you need to lower the classification threshold — flag more transactions. But flagging more transactions means more false alarms (lower precision). To reduce false alarms (increase precision) you raise the threshold — but then you miss more actual fraud (lower recall). This trade-off is unavoidable and inherent to every binary classifier.

The right balance depends entirely on the business cost of each error type. Missing a fraud transaction at Razorpay costs ₹2,500 on average; a false alarm costs ₹50 in support friction. The cost ratio is 50:1, meaning you break even accepting up to 50 false alarms for each additional fraud case caught, so you should optimise heavily toward recall at the expense of precision.
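Under these assumed costs, the break-even decision threshold for a well-calibrated model follows directly from comparing expected costs. A minimal sketch using the illustrative ₹2,500 / ₹50 figures above:

```python
# Break-even threshold from the two error costs.
# Assumes the model's probabilities are calibrated; the cost figures
# are the illustrative ones from the text, not real Razorpay numbers.
fn_cost = 2500   # expected loss from letting one fraud through
fp_cost = 50     # friction cost of blocking one legit transaction

# Flag when the expected loss of allowing exceeds the cost of blocking:
#   p * fn_cost > (1 - p) * fp_cost
#   p > fp_cost / (fp_cost + fn_cost)
threshold = fp_cost / (fp_cost + fn_cost)
print(f"Break-even threshold: {threshold:.4f}")   # ≈ 0.0196, far below 0.5
```

At a 50:1 cost ratio the rational threshold sits near 0.02, which is why the default 0.5 leaves so much value on the table.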

Precision and recall — the complete definitions
Precision = TP / (TP + FP)
  Of all transactions I flagged as fraud — how many actually were?
  High precision = few false alarms. Optimise when false alarms are costly.
  Example: Precision = 0.85 → 85% of flagged transactions are genuine fraud.

Recall (Sensitivity) = TP / (TP + FN)
  Of all transactions that were actually fraud — how many did I catch?
  High recall = few missed positives. Optimise when missing positives is costly.
  Example: Recall = 0.90 → we catch 90% of all fraudulent transactions.

F1 Score = 2 × Precision × Recall / (Precision + Recall)
  A single score that balances both — the harmonic mean.
  Use when you need one number and both errors matter equally.
  Example: F1 = 0.87 for precision = 0.85 and recall = 0.90.
python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (precision_score, recall_score, f1_score,
                              average_precision_score)
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 8_000

# Razorpay fraud features
amount         = np.abs(np.random.normal(1200, 2000, n)).clip(10, 50_000)
merchant_risk  = np.random.uniform(0, 1, n)
n_tx_hour      = np.random.randint(0, 20, n).astype(float)
device_age     = np.abs(np.random.normal(200, 150, n)).clip(0, 1000)
is_new_device  = np.random.randint(0, 2, n).astype(float)

fraud_score = (
    (amount/50_000)*0.30 + merchant_risk*0.25
    + (n_tx_hour/20)*0.25 + is_new_device*0.15
    + np.random.randn(n)*0.05
)
y = (fraud_score > 0.55).astype(int)

X = np.column_stack([amount, merchant_risk, n_tx_hour, device_age, is_new_device])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3,
    subsample=0.8, random_state=42,
)
model.fit(X_tr_sc, y_tr)
y_proba = model.predict_proba(X_te_sc)[:, 1]

# ── Precision-recall at different thresholds ──────────────────────────
print("Precision-recall trade-off at different thresholds:")
print(f"{'Threshold':<12} {'Precision':>11} {'Recall':>9} {'F1':>9} {'Flagged':>9}")
print("─" * 54)

for threshold in np.arange(0.1, 0.91, 0.1):
    y_pred_t  = (y_proba >= threshold).astype(int)
    prec      = precision_score(y_te, y_pred_t, zero_division=0)
    rec       = recall_score(y_te, y_pred_t, zero_division=0)
    f1        = f1_score(y_te, y_pred_t, zero_division=0)
    flagged   = y_pred_t.mean() * 100
    print(f"  t={threshold:.1f}      {prec:>9.4f}  {rec:>9.4f}  {f1:>9.4f}  {flagged:>7.1f}%")

# ── F-beta score — weight recall over precision ────────────────────────
from sklearn.metrics import fbeta_score

# At Razorpay: missing fraud costs 50× more than false alarm
# β=2 weights recall twice as heavily as precision
# β=0.5 weights precision twice as heavily as recall
print("\nF-beta score — adjusting the precision/recall balance:")
y_pred_default = (y_proba >= 0.5).astype(int)
for beta in [0.5, 1.0, 2.0, 3.0]:
    fb = fbeta_score(y_te, y_pred_default, beta=beta, zero_division=0)
    desc = '(precision weighted)' if beta < 1 else '(equal weight)' if beta == 1 else '(recall weighted)'
    print(f"  F{beta}: {fb:.4f}  {desc}")

# ── PR-AUC — area under the precision-recall curve ────────────────────
# Better than ROC-AUC for highly imbalanced problems
pr_auc = average_precision_score(y_te, y_proba)
print(f"\nPR-AUC (Average Precision): {pr_auc:.4f}")
print("PR-AUC = 1.0: perfect model")
print(f"PR-AUC = {y_te.mean():.4f}: random baseline (= fraud rate)")
Threshold-independent evaluation

ROC-AUC — how well the model ranks fraud above legitimate transactions

Precision and recall depend on the threshold you choose. Change the threshold, get different precision and recall. ROC-AUC (Receiver Operating Characteristic — Area Under Curve) is threshold-independent. It measures how well the model separates the two classes across all possible thresholds at once.

The ROC curve plots the true positive rate (recall) against the false positive rate at every possible threshold. A perfect model has a curve that goes straight up to (0, 1) — it achieves 100% recall with 0% false alarms. A random model produces a diagonal line — recall equals the false alarm rate. The AUC is the area under the curve: 1.0 is perfect, 0.5 is random.

🧠 Analogy — read this first

You have 100 fraud cases and 9,900 legit transactions — all shuffled randomly. You ask the model to score all 10,000 and sort them by fraud probability, highest first. How many of the actual 100 fraud cases appear in the top 100? Top 200? Top 500? If the model is perfect, all 100 fraud cases appear before any legitimate transaction. The ROC curve plots this across every possible cutpoint. AUC is the probability that a randomly chosen fraud transaction scores higher than a randomly chosen legit one.

AUC = 0.95 means: take one random fraud transaction and one random legit transaction. There is a 95% chance the model assigns a higher fraud score to the fraud transaction. This is the most intuitive interpretation of AUC.
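This pairwise interpretation can be checked numerically by comparing sklearn's AUC against the fraction of correctly ranked (fraud, legit) score pairs. The normal score distributions below are illustrative assumptions, not a claim about real fraud scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
# Synthetic scores: fraud tends to score higher than legit
legit_scores = rng.normal(0.0, 1.0, 2_000)
fraud_scores = rng.normal(1.5, 1.0, 200)

y_true  = np.concatenate([np.zeros(2_000), np.ones(200)])
y_score = np.concatenate([legit_scores, fraud_scores])

auc = roc_auc_score(y_true, y_score)

# Direct estimate: fraction of (fraud, legit) pairs where the fraud
# transaction scores higher, counting ties as half a correct ranking
diff     = fraud_scores[:, None] - legit_scores[None, :]
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(f"roc_auc_score:        {auc:.4f}")
print(f"pairwise probability: {pairwise:.4f}")   # matches the AUC
```

This identity (AUC = Mann–Whitney ranking probability) is exactly why AUC is threshold-independent: it only looks at how scores order the two classes.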

ROC curve — three models, same dataset
[Figure: ROC curves for three models on the same dataset (AUC = 0.95, 0.80, 0.65) plotted against the random diagonal; the ideal point is the top-left corner, FPR = 0 and TPR = 1.]
AUC = 1.0      Perfect — always ranks fraud above legit
AUC = 0.9+     Excellent — production quality
AUC = 0.8–0.9  Good — acceptable for most problems
AUC = 0.7–0.8  Fair — investigate features
AUC = 0.5–0.7  Poor — barely above random
AUC = 0.5      Random — model has no signal
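Since the original curve figure cannot be reproduced here, a minimal matplotlib sketch generates a comparable plot. The labels and scores are synthetic, and the noise levels were chosen only to spread the AUCs apart:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe; remove this line to display interactively
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(42)
y = (rng.random(5_000) < 0.10).astype(int)   # 10% positive class

aucs = []
for noise, label in [(0.5, "strong"), (1.5, "medium"), (3.0, "weak")]:
    score = y + rng.normal(0, noise, y.size)  # same signal, increasing noise
    fpr, tpr, _ = roc_curve(y, score)
    auc = roc_auc_score(y, score)
    aucs.append(auc)
    plt.plot(fpr, tpr, label=f"{label} model: AUC={auc:.2f}")

plt.plot([0, 1], [0, 1], "k--", label="random (AUC=0.50)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curves — three models, same labels")
plt.legend()
plt.savefig("roc_curves.png", dpi=120)
```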
python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, average_precision_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 8_000

amount        = np.abs(np.random.normal(1200, 2000, n)).clip(10, 50_000)
merchant_risk = np.random.uniform(0, 1, n)
n_tx_hour     = np.random.randint(0, 20, n).astype(float)
device_age    = np.abs(np.random.normal(200, 150, n)).clip(0, 1000)
is_new_device = np.random.randint(0, 2, n).astype(float)

fraud_score = (
    (amount/50_000)*0.30 + merchant_risk*0.25
    + (n_tx_hour/20)*0.25 + is_new_device*0.15
    + np.random.randn(n)*0.05
)
y = (fraud_score > 0.55).astype(int)
X = np.column_stack([amount, merchant_risk, n_tx_hour, device_age, is_new_device])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)
sc      = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'RandomForest':       RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'GradientBoosting':   GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                                       max_depth=3, random_state=42),
}

print(f"{'Model':<22} {'ROC-AUC':>9} {'PR-AUC':>9} {'CV AUC (5-fold)':>16}")
print("─" * 60)

for name, model in models.items():
    model.fit(X_tr_sc, y_tr)
    y_prob  = model.predict_proba(X_te_sc)[:, 1]
    roc_auc = roc_auc_score(y_te, y_prob)
    pr_auc  = average_precision_score(y_te, y_prob)
    cv_auc  = cross_val_score(model, X_tr_sc, y_tr, cv=5,
                               scoring='roc_auc').mean()
    print(f"  {name:<20}  {roc_auc:>9.4f}  {pr_auc:>9.4f}  {cv_auc:>14.4f}")

# ── ROC curve — operating points ──────────────────────────────────────
best_model = models['GradientBoosting']
y_prob     = best_model.predict_proba(X_te_sc)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, y_prob)

print("\nROC curve operating points (selected thresholds):")
print(f"{'Threshold':<12} {'TPR (Recall)':>13} {'FPR':>8} {'Specificity':>13}")
print("─" * 50)

# Find specific operating points
for target_recall in [0.95, 0.90, 0.80, 0.70, 0.60]:
    idx  = np.argmin(np.abs(tpr - target_recall))
    spec = 1 - fpr[idx]
    print(f"  t={thresholds[idx]:.3f}     TPR={tpr[idx]:.3f}       FPR={fpr[idx]:.3f}    Spec={spec:.3f}")

# ── When to use ROC-AUC vs PR-AUC ─────────────────────────────────────
print("\nROC-AUC vs PR-AUC:")
print("  ROC-AUC: affected equally by both classes — good for balanced datasets")
print("  PR-AUC:  focuses on the positive (minority) class — use for fraud, churn")
print(f"  Fraud rate in this dataset: {y_te.mean()*100:.1f}%")
print(f"  → At {y_te.mean()*100:.1f}% positive rate, PR-AUC is the more informative metric")
When the output is continuous

Regression metrics — MAE, RMSE, MAPE, and R²

Regression problems have their own set of evaluation metrics. The right choice depends on how you want to treat large errors and whether the scale of the target matters for interpretation.

Four regression metrics — what each penalises
MAE — Mean Absolute Error
mean(|y − ŷ|)

Average absolute difference. Easy to interpret — same units as target. Treats all errors equally. Robust to outliers.

Delivery time prediction: "model is off by 4.2 minutes on average"
RMSE — Root Mean Squared Error
√mean((y − ŷ)²)

Square root of average squared error. Penalises large errors more than MAE. More sensitive to outliers. Same units as target.

When large errors are disproportionately costly. Stock prices, safety-critical predictions.
MAPE — Mean Absolute Percentage Error
mean(|y − ŷ| / |y|) × 100

Average % error relative to actual. Scale-independent — good for comparing across products. Breaks when y=0.

Demand forecasting: "model is off by 8.3% on average". Comparable across SKUs.
R² — Coefficient of Determination
1 − SS_res / SS_tot

Fraction of variance explained by the model. R²=1 is perfect, R²=0 is as good as predicting the mean, R²<0 is worse than mean. Scale-independent.

Quick sanity check: R²=0.87 means model explains 87% of target variance.
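The MAPE caveat (it breaks when actual values approach zero) is easy to demonstrate with three hand-picked points; the numbers are purely illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

y_true = np.array([100.0, 120.0, 0.01])   # one near-zero actual value
y_pred = np.array([ 98.0, 125.0, 1.00])   # tiny absolute error on it

mae  = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)

print(f"MAE:  {mae:.3f}")   # small and sensible
print(f"MAPE: {mape:.1%}")  # exploded — dominated by the near-zero target
```

One sample with a near-zero actual contributes a percentage error of 9,900%, swamping the other two. For targets that can be zero or near zero, prefer MAE or a symmetric/weighted variant.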
python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                              mean_absolute_percentage_error, r2_score)
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 3000
distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
            + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep])
y = delivery
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                   max_depth=3, random_state=42)
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# ── All four metrics ──────────────────────────────────────────────────
mae  = mean_absolute_error(y_te, y_pred)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
mape = mean_absolute_percentage_error(y_te, y_pred) * 100
r2   = r2_score(y_te, y_pred)

print(f"Swiggy delivery time model evaluation:")
print(f"  MAE:   {mae:.4f} min   ← average error in minutes")
print(f"  RMSE:  {rmse:.4f} min  ← penalises large errors more")
print(f"  MAPE:  {mape:.4f}%    ← percentage error relative to actual")
print(f"  R²:    {r2:.4f}       ← model explains {r2*100:.1f}% of variance")

# ── When RMSE >> MAE: outliers are present ────────────────────────────
print(f"\nRMSE / MAE ratio: {rmse/mae:.2f}")
print("  Ratio near 1.0: errors are uniform, no major outliers")
print("  Ratio > 2.0: large outlier errors are inflating RMSE")

# ── Baseline comparison — always compare against naive models ─────────
mean_pred    = np.full_like(y_te, y_tr.mean())
median_pred  = np.full_like(y_te, np.median(y_tr))

print("\nBaseline comparisons:")
print(f"  Always-predict-mean  MAE={mean_absolute_error(y_te, mean_pred):.4f}  R²={r2_score(y_te, mean_pred):.4f}")
print(f"  Always-predict-median MAE={mean_absolute_error(y_te, median_pred):.4f}  R²={r2_score(y_te, median_pred):.4f}")
print(f"  Our GBM model        MAE={mae:.4f}  R²={r2:.4f}")
print(f"  Improvement over mean: {(mean_absolute_error(y_te, mean_pred) - mae) / mean_absolute_error(y_te, mean_pred) * 100:.1f}%")
Turning probabilities into decisions

Threshold tuning — 0.5 is almost never the optimal threshold

sklearn's predict() uses 0.5 as the default threshold. A transaction with fraud probability 0.51 is flagged. One with 0.49 is not. This is almost never the right business decision. The optimal threshold should be derived from the relative cost of false positives and false negatives — which is a business decision, not a modelling decision.

python
import numpy as np
from sklearn.metrics import (precision_recall_curve, roc_curve,
                              f1_score, fbeta_score,
                              precision_score, recall_score)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 8_000
amount        = np.abs(np.random.normal(1200, 2000, n)).clip(10, 50_000)
merchant_risk = np.random.uniform(0, 1, n)
n_tx_hour     = np.random.randint(0, 20, n).astype(float)
device_age    = np.abs(np.random.normal(200, 150, n)).clip(0, 1000)
is_new_device = np.random.randint(0, 2, n).astype(float)
fraud_score = (
    (amount/50_000)*0.30 + merchant_risk*0.25
    + (n_tx_hour/20)*0.25 + is_new_device*0.15
    + np.random.randn(n)*0.05
)
y = (fraud_score > 0.55).astype(int)
X = np.column_stack([amount, merchant_risk, n_tx_hour, device_age, is_new_device])
X_tr,X_te,y_tr,y_te = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)
X_tv,X_val,y_tv,y_val = train_test_split(X_tr,y_tr,test_size=0.2,stratify=y_tr,random_state=42)
sc = StandardScaler()
X_tv_sc  = sc.fit_transform(X_tv)
X_val_sc = sc.transform(X_val)
X_te_sc  = sc.transform(X_te)

model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                    max_depth=3, random_state=42)
model.fit(X_tv_sc, y_tv)
val_proba = model.predict_proba(X_val_sc)[:, 1]

# ── Method 1: Maximise F1 ─────────────────────────────────────────────
thresholds   = np.arange(0.05, 0.96, 0.01)
f1_scores    = [f1_score(y_val, (val_proba >= t).astype(int), zero_division=0)
                for t in thresholds]
best_t_f1    = thresholds[np.argmax(f1_scores)]
print(f"Best threshold by F1:     {best_t_f1:.2f}  (F1={max(f1_scores):.4f})")

# ── Method 2: Maximise Fβ (recall-weighted for fraud) ─────────────────
fb_scores    = [fbeta_score(y_val, (val_proba >= t).astype(int),
                             beta=2, zero_division=0) for t in thresholds]
best_t_fb    = thresholds[np.argmax(fb_scores)]
print(f"Best threshold by F2:     {best_t_fb:.2f}  (F2={max(fb_scores):.4f})")

# ── Method 3: Business cost optimisation ─────────────────────────────
# Cost of false negative (missed fraud):  ₹2,500 avg transaction
# Cost of false positive (blocked legit): ₹50 friction
fn_cost = 2500
fp_cost = 50

costs = []
for t in thresholds:
    pred = (val_proba >= t).astype(int)
    fn   = ((pred == 0) & (y_val == 1)).sum()
    fp   = ((pred == 1) & (y_val == 0)).sum()
    costs.append(fn * fn_cost + fp * fp_cost)

best_t_biz = thresholds[np.argmin(costs)]
print(f"Best threshold by cost:   {best_t_biz:.2f}  (cost=₹{min(costs):,.0f})")

# ── Compare all thresholds on test set ────────────────────────────────
test_proba = model.predict_proba(X_te_sc)[:, 1]
print("\nTest set performance at different thresholds:")
print(f"{'Method':<22} {'Threshold':>10} {'Precision':>11} {'Recall':>9} {'F1':>9}")
print("─" * 65)

for label, t in [('Default (0.5)', 0.5),
                  ('Max F1', best_t_f1),
                  ('Max F2 (recall++)', best_t_fb),
                  ('Min business cost', best_t_biz)]:
    pred = (test_proba >= t).astype(int)
    p  = precision_score(y_te, pred, zero_division=0)
    r  = recall_score(y_te, pred, zero_division=0)
    f1 = f1_score(y_te, pred, zero_division=0)
    print(f"  {label:<20}  {t:>10.2f}  {p:>11.4f}  {r:>9.4f}  {f1:>9.4f}")
When there are more than two classes

Multi-class evaluation — macro, micro, and weighted averaging

Binary metrics extend naturally to multi-class problems. The question is how to aggregate per-class metrics into a single number. Three averaging strategies give different answers and are appropriate in different situations.

python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                              f1_score, precision_score, recall_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 5000

# Swiggy support ticket categories: 4 classes
# 0=delivery_issue (40%), 1=food_quality (25%), 2=payment_issue (20%), 3=general (15%)
# NOTE: features here are random noise, so the scores sit near chance level —
# the point is to illustrate the averaging strategies, not model quality.
X = np.random.randn(n, 10)
y = np.random.choice([0, 1, 2, 3], n, p=[0.40, 0.25, 0.20, 0.15])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=42)
sc = StandardScaler()
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(sc.fit_transform(X_tr), y_tr)
y_pred = model.predict(sc.transform(X_te))

class_names = ['delivery', 'food_quality', 'payment', 'general']

# ── Classification report ──────────────────────────────────────────────
print("Full classification report:")
print(classification_report(y_te, y_pred, target_names=class_names))

# ── Three averaging strategies explained ──────────────────────────────
print("Averaging strategies for multi-class F1:")
for avg in ['macro', 'weighted', 'micro']:
    f1 = f1_score(y_te, y_pred, average=avg)
    if avg == 'macro':
        desc = 'unweighted mean of per-class F1 — treats all classes equally'
    elif avg == 'weighted':
        desc = 'weighted by support (class size) — accounts for imbalance'
    else:
        desc = 'global TP/FP/FN — equivalent to accuracy for multi-class'
    print(f"  {avg:<10}: {f1:.4f}  ← {desc}")

print("\nWhen to use each:")
print("  macro:    when all classes matter equally (even rare ones)")
print("  weighted: when class frequency should influence the metric")
print("  micro:    rarely used for multi-class; equivalent to accuracy")

# ── Per-class metrics ──────────────────────────────────────────────────
print("\nPer-class F1 scores:")
per_class_f1 = f1_score(y_te, y_pred, average=None)
class_counts = np.bincount(y_te)
for name, f1, count in zip(class_names, per_class_f1, class_counts):
    bar = '█' * int(f1 * 25)
    print(f"  {name:<14}: {bar:<25} {f1:.4f}  (n={count})")
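The claim that micro-averaged F1 reduces to accuracy for single-label multi-class problems can be verified directly (random labels, purely illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, 1_000)
y_pred = rng.integers(0, 4, 1_000)

micro_f1 = f1_score(y_true, y_pred, average='micro')
acc      = accuracy_score(y_true, y_pred)

# Micro precision and micro recall both equal accuracy when every sample
# has exactly one label, so their harmonic mean does too.
print(np.isclose(micro_f1, acc))  # True
```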
Errors you will hit

Every common evaluation mistake — explained and fixed

Model reports 98% accuracy but the business is unhappy — model is useless in production
Why it happens

Accuracy on an imbalanced dataset is dominated by the majority class. A model that predicts 'no fraud' for every transaction achieves 98.5% accuracy on a dataset with 1.5% fraud — but catches zero fraud. The metric looks excellent while the model is completely worthless for its intended purpose.

Fix

For imbalanced classification problems, never report accuracy as the primary metric. Use ROC-AUC (threshold-independent ranking quality), PR-AUC (especially for severe imbalance), precision and recall at the operating threshold, or F1/F-beta score. Always check the confusion matrix before reporting any metric — it instantly exposes a model that is just predicting the majority class.

UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 — no predicted samples
Why it happens

Your model predicts zero positive cases — every sample is classified as negative. This happens when the classification threshold is too high, the model has very low confidence in all positive predictions, or the training data was too imbalanced and the model learned to always predict the majority class.

Fix

Check the predict_proba output: if every probability is below 0.5, the default threshold yields zero positive predictions. Set zero_division=0 in precision_score to suppress the warning while you investigate, and lower the classification threshold. For imbalanced training data, set class_weight='balanced' (sklearn), scale_pos_weight (XGBoost), or is_unbalance=True (LightGBM), then verify the model is actually learning a signal.

ROC-AUC is 0.97 on validation but drops to 0.71 in production
Why it happens

There are three common causes: data leakage during training (validation set was contaminated by training statistics — Module 20), temporal leakage (training on future data to predict the past — the fraud patterns changed), or distribution shift (production transactions have a different distribution than training data — different time period, different merchant mix, different fraud patterns).

Fix

Audit the training pipeline for leakage using the Module 20 checklist. For time-series data (transactions always are), verify you used chronological splits: train on January–October, validate on November, test on December. Monitor the model's AUC in production with a weekly shadow evaluation against labelled samples. When production AUC drops by 5+ points, trigger retraining.

F1 score of 0.0 despite model having reasonable ROC-AUC
Why it happens

F1 score uses the default 0.5 threshold. If your model's predicted probabilities are all below 0.5 (common when the positive class is rare and the model is well-calibrated), the model predicts all negatives at threshold=0.5 and F1 becomes undefined or 0. Meanwhile ROC-AUC correctly reflects that the model's ranking is good.

Fix

Tune the threshold before computing F1. Use the validation set to find the threshold that maximises F1: thresholds = np.arange(0.01, 0.5, 0.01); best_t = max(thresholds, key=lambda t: f1_score(y_val, (val_proba >= t).astype(int))). Apply this threshold when calling predict() and when computing F1 on the test set.

What comes next

You can now evaluate any model honestly. Next: are the probabilities themselves trustworthy?

ROC-AUC tells you whether the model ranks fraud above legitimate transactions. It does not tell you whether the probabilities are accurate. A model that says P(fraud) = 0.9 for a transaction — does that mean 90% of such transactions are actually fraud? Or is the model's confidence unreliable?

The next module — Calibration — answers this. Calibration curves, reliability diagrams, and the two most common miscalibration patterns in gradient boosting and neural networks. Well-calibrated probabilities are essential for fraud scoring, credit decisions, and medical diagnosis where the actual probability matters, not just the ranking.

Next — Module 35 · Model Evaluation
Calibration — Are Your Probabilities Trustworthy?

Reliability diagrams, Brier score, and Platt scaling vs isotonic regression — when your model says 80% fraud, does it mean 80%?

coming soon

🎯 Key Takeaways

  • Accuracy is misleading on imbalanced datasets. A model that predicts the majority class every time achieves 98.5% accuracy on a 1.5% fraud dataset while catching zero fraud. Always check the confusion matrix before reporting any metric.
  • The confusion matrix has four cells: TP (caught fraud), TN (correctly allowed), FP (false alarm — legit blocked), FN (missed fraud). Every classification metric is a formula combining these four numbers.
  • Precision = TP/(TP+FP): of all flagged transactions, what fraction were genuinely fraud? Recall = TP/(TP+FN): of all actual fraud, what fraction did we catch? They trade off — raising the threshold increases precision but decreases recall.
  • ROC-AUC is threshold-independent — it measures how well the model ranks fraud above legitimate across all possible thresholds. AUC = 0.95 means a random fraud transaction scores higher than a random legit transaction 95% of the time.
  • For severely imbalanced problems (fraud rate < 5%), PR-AUC (area under the precision-recall curve) is more informative than ROC-AUC. ROC-AUC can look excellent even when precision on the minority class is terrible.
  • The optimal threshold is almost never 0.5. Derive it from the relative business cost of false negatives vs false positives. At Razorpay, missing fraud (FN) costs ₹2,500 while a false alarm (FP) costs ₹50 — optimise heavily toward recall by lowering the threshold well below 0.5.