Evaluation Metrics — Beyond Accuracy
Precision, recall, F1, ROC-AUC, PR-AUC, confusion matrices, and the business cost framing that turns metrics into decisions.
Your fraud model has 98.5% accuracy. Your manager is thrilled. Then you check: it flags zero fraud cases. All 98.5% comes from predicting "not fraud" on every single transaction.
Razorpay processes 5 million transactions per day. Only 1.5% are fraudulent — 75,000 transactions. A model that predicts "legitimate" for every transaction achieves 98.5% accuracy without catching a single fraudulent rupee. This model is completely useless, yet the accuracy number looks spectacular in a presentation.
Accuracy is misleading whenever the classes are imbalanced — which is almost always the case in the problems that matter most. Fraud detection: 1–2% fraud. Disease diagnosis: 1–5% positive. Churn prediction: 3–8% churners. Spam detection: 5–20% spam. In all of these, a naive "always predict the majority" baseline achieves 80–99% accuracy while being completely worthless.
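The failure mode is easy to reproduce. A minimal sketch, using made-up labels at the article's 1.5% fraud rate:

```python
import numpy as np

# Synthetic labels at the article's 1.5% fraud rate: 10,000 transactions, 150 fraud.
rng = np.random.default_rng(0)
y_true = np.zeros(10_000, dtype=int)
y_true[:150] = 1
rng.shuffle(y_true)

# The naive baseline: predict "not fraud" for every transaction.
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
fraud_caught = int(((y_true == 1) & (y_pred == 1)).sum())

print(f"accuracy: {accuracy:.3f}")      # 0.985 — looks spectacular
print(f"fraud caught: {fraud_caught}")  # 0 — completely useless
```

The 98.5% accuracy comes entirely from the 9,850 legitimate transactions; the model contributes nothing.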
This module teaches the metrics that actually matter: the confusion matrix (what kind of errors is the model making?), precision and recall (the fundamental trade-off), F1 score (one number that balances both), ROC-AUC (threshold-independent performance), and PR-AUC (the right metric for severely imbalanced problems).
Consider a doctor screening patients for a rare disease affecting 1 in 100 people. Saying "healthy" to every patient achieves 99% accuracy — but misses every sick patient. The medical community does not measure doctors by "how often are you right overall?" It asks: "of the people you said were sick, how many actually were?" (precision) and "of all the people who were actually sick, how many did you catch?" (recall).
These two questions — precision and recall — are the core of all classification evaluation. Every other metric (F1, ROC-AUC, PR-AUC) is built on top of them.
The confusion matrix — four outcomes, every metric derives from them
A binary classifier produces one of four possible outcomes for each prediction. The confusion matrix organises all four. Every metric — accuracy, precision, recall, F1 — is a formula combining these four numbers in different ways. Understand the four cells first and every metric becomes obvious.
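A minimal sketch of the four cells, using made-up labels for ten transactions (1 = fraud). Note that sklearn orders the binary confusion matrix as `[[TN, FP], [FN, TP]]`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for 10 transactions (1 = fraud, 0 = legit).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

# For labels [0, 1], sklearn's matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TP (caught fraud):      {tp}")  # 2
print(f"TN (correctly allowed): {tn}")  # 6
print(f"FP (false alarm):       {fp}")  # 1
print(f"FN (missed fraud):      {fn}")  # 1
```

Every metric in this module is some ratio of these four counts.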
Precision vs recall — you cannot maximise both simultaneously
Precision and recall are in tension. To catch more fraud (increase recall) you need to lower the classification threshold — flag more transactions. But flagging more transactions means more false alarms (lower precision). To reduce false alarms (increase precision) you raise the threshold — but then you miss more actual fraud (lower recall). This trade-off is unavoidable and inherent to every binary classifier.
The right balance depends entirely on the business cost of each error type. Missing a fraud transaction at Razorpay costs ₹2,500 on average; a false alarm costs ₹50 in support friction. The cost ratio is 50:1, so you should accept up to 50 false alarms for every additional fraud case caught — meaning you should optimise heavily toward recall at the expense of precision.
- Precision — of all transactions I flagged as fraud, how many actually were?
- Recall — of all transactions that were actually fraud, how many did I catch?
- F1 — a single score that balances both: the harmonic mean of precision and recall.
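The three metrics fall straight out of the confusion-matrix counts. A sketch with illustrative numbers (tp=80, fp=120, fn=20 — made up for the example):

```python
# Illustrative confusion-matrix counts (made up): the model flagged 200
# transactions, of which 80 were genuinely fraud; 20 fraud cases slipped through.
tp, fp, fn = 80, 120, 20

precision = tp / (tp + fp)   # 0.4 — 40% of flagged transactions were real fraud
recall = tp / (tp + fn)      # 0.8 — caught 80% of all actual fraud
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.4 0.8 0.53
```

The harmonic mean punishes imbalance: a model with precision 0.9 and recall 0.1 gets F1 = 0.18, not the arithmetic mean of 0.5.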
ROC-AUC — how well the model ranks fraud above legitimate transactions
Precision and recall depend on the threshold you choose. Change the threshold, get different precision and recall. ROC-AUC (Receiver Operating Characteristic — Area Under Curve) is threshold-independent. It measures how well the model separates the two classes across all possible thresholds at once.
The ROC curve plots the true positive rate (recall) against the false positive rate at every possible threshold. A perfect model has a curve that goes straight up to (0, 1) — it achieves 100% recall with 0% false alarms. A random model produces a diagonal line — recall equals the false alarm rate. The AUC is the area under the curve: 1.0 is perfect, 0.5 is random.
You have 100 fraud cases and 9,900 legit transactions — all shuffled randomly. You ask the model to score all 10,000 and sort them by fraud probability, highest first. How many of the actual 100 fraud cases appear in the top 100? Top 200? Top 500? If the model is perfect, all 100 fraud cases appear before any legitimate transaction. The ROC curve plots this across every possible cutpoint. AUC is the probability that a randomly chosen fraud transaction scores higher than a randomly chosen legit one.
AUC = 0.95 means: take one random fraud transaction and one random legit transaction. There is a 95% chance the model assigns a higher fraud score to the fraud transaction. This is the most intuitive interpretation of AUC.
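The ranking interpretation can be verified directly. A sketch with made-up score distributions — fraud tends to score higher than legit, with some overlap:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up scores: legit clusters near 0.2, fraud near 0.6, with overlap.
rng = np.random.default_rng(42)
y_true = np.array([0] * 900 + [1] * 100)
scores = np.concatenate([rng.normal(0.2, 0.10, 900),   # legitimate
                         rng.normal(0.6, 0.15, 100)])  # fraud

auc = roc_auc_score(y_true, scores)

# AUC equals the probability that a randomly chosen fraud transaction
# outranks a randomly chosen legit one (ties count half) — check pair by pair:
fraud, legit = scores[y_true == 1], scores[y_true == 0]
rank_prob = ((fraud[:, None] > legit[None, :]) +
             0.5 * (fraud[:, None] == legit[None, :])).mean()

print(round(auc, 4), round(rank_prob, 4))  # the two numbers agree
```

This pairwise-comparison identity (the Mann–Whitney U statistic) is why AUC is unaffected by the threshold: it only cares about the ordering of scores.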
Regression metrics — MAE, RMSE, MAPE, and R²
Regression problems have their own set of evaluation metrics. The right choice depends on how you want to treat large errors and whether the scale of the target matters for interpretation.
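A sketch of all four metrics on made-up predictions for a rupee-denominated target:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actuals and predictions for an amount regression (₹).
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

mae = mean_absolute_error(y_true, y_pred)            # mean |error|, same units as target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # squares errors first — penalises big misses
mape = np.mean(np.abs((y_true - y_pred) / y_true))   # scale-free; undefined when y_true has zeros
r2 = r2_score(y_true, y_pred)                        # fraction of variance explained

print(f"MAE ₹{mae}, RMSE ₹{rmse:.2f}, MAPE {mape:.1%}, R² {r2}")
# MAE ₹17.5, RMSE ₹19.36, MAPE 7.5%, R² 0.97
```

Note RMSE (19.36) exceeds MAE (17.5) because the single ₹30 error is squared before averaging — the gap between the two tells you how much large errors dominate.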
Threshold tuning — 0.5 is almost never the optimal threshold
sklearn's predict() uses 0.5 as the default threshold. A transaction with fraud probability 0.51 is flagged. One with 0.49 is not. This is almost never the right business decision. The optimal threshold should be derived from the relative cost of false positives and false negatives — which is a business decision, not a modelling decision.
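One way to make that business decision explicit: sweep thresholds and pick the one that minimises total expected cost. A sketch using the article's illustrative costs (FN = ₹2,500, FP = ₹50) on made-up data:

```python
import numpy as np

# The article's illustrative error costs: missed fraud vs false alarm.
COST_FN, COST_FP = 2500, 50

def expected_cost(y_true, proba, threshold):
    """Total rupee cost of decisions at a given threshold."""
    flagged = proba >= threshold
    fn = int(((y_true == 1) & ~flagged).sum())  # fraud we let through
    fp = int(((y_true == 0) & flagged).sum())   # legit we blocked
    return fn * COST_FN + fp * COST_FP

# Made-up scores: 2 fraud cases (0.80 and a borderline 0.35) among 140 legit.
y_true = np.array([1, 1] + [0] * 140)
proba = np.array([0.80, 0.35] + [0.40] * 10 + [0.20] * 30 + [0.05] * 100)

for t in (0.5, 0.3, 0.1):
    print(f"threshold {t}: ₹{expected_cost(y_true, proba, t)}")
# threshold 0.5: ₹2500 (misses the 0.35 fraud)
# threshold 0.3: ₹500  (catches it, at the price of 10 false alarms)
# threshold 0.1: ₹2000 (40 false alarms — too aggressive)
```

With a 50:1 cost ratio, the default 0.5 is the worst of the three: catching the borderline fraud case is worth ten ₹50 false alarms.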
Multi-class evaluation — macro, micro, and weighted averaging
Binary metrics extend naturally to multi-class problems. The question is how to aggregate per-class metrics into a single number. Three averaging strategies give different answers and are appropriate in different situations.
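A sketch of the three strategies on a made-up three-class example, where class 0 dominates:

```python
from sklearn.metrics import f1_score

# Made-up three-class labels: class 0 is the majority (6 of 10 samples).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 2, 2]

# macro:    unweighted mean of per-class F1 — rare classes count equally
# micro:    pool TP/FP/FN globally — dominated by the majority class
# weighted: per-class F1 weighted by each class's support
for avg in ("macro", "micro", "weighted"):
    print(avg, round(f1_score(y_true, y_pred, average=avg), 3))
# macro 0.778, micro 0.8, weighted 0.8
```

Macro scores lowest here because the minority class 1 (F1 = 0.5) drags the unweighted mean down — exactly the behaviour you want when minority-class performance matters.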
Every common evaluation mistake — explained and fixed
You can now evaluate any model honestly. Next: are the probabilities themselves trustworthy?
ROC-AUC tells you whether the model ranks fraud above legitimate transactions. It does not tell you whether the probabilities are accurate. A model that says P(fraud) = 0.9 for a transaction — does that mean 90% of such transactions are actually fraud? Or is the model's confidence unreliable?
The next module — Calibration — answers this. Calibration curves, reliability diagrams, and the two most common miscalibration patterns in gradient boosting and neural networks. Well-calibrated probabilities are essential for fraud scoring, credit decisions, and medical diagnosis where the actual probability matters, not just the ranking.
Reliability diagrams, Brier score, and Platt scaling vs isotonic regression — when your model says 80% fraud, does it mean 80%?
🎯 Key Takeaways
- ✓ Accuracy is misleading on imbalanced datasets. A model that predicts the majority class every time achieves 98.5% accuracy on a 1.5% fraud dataset while catching zero fraud. Always check the confusion matrix before reporting any metric.
- ✓ The confusion matrix has four cells: TP (caught fraud), TN (correctly allowed), FP (false alarm — legit blocked), FN (missed fraud). Every classification metric is a formula combining these four numbers.
- ✓ Precision = TP/(TP+FP): of all flagged transactions, what fraction were genuinely fraud? Recall = TP/(TP+FN): of all actual fraud, what fraction did we catch? They trade off — raising the threshold increases precision but decreases recall.
- ✓ ROC-AUC is threshold-independent — it measures how well the model ranks fraud above legitimate across all possible thresholds. AUC = 0.95 means a random fraud transaction scores higher than a random legit transaction 95% of the time.
- ✓ For severely imbalanced problems (fraud rate < 5%), PR-AUC (area under the precision-recall curve) is more informative than ROC-AUC. ROC-AUC can look excellent even when precision on the minority class is terrible.
- ✓ The optimal threshold is almost never 0.5. Derive it from the relative business cost of false negatives vs false positives. At Razorpay, missing fraud (FN) costs ₹2,500 while a false alarm (FP) costs ₹50 — optimise heavily toward recall by lowering the threshold well below 0.5.