ROC Curve and AUC — Threshold-Independent Evaluation
What the ROC curve actually measures, why AUC equals a probability, and how to use operating points to choose a threshold for production.
Precision and recall change every time you move the threshold. ROC-AUC gives you one number that works across every threshold at once.
Module 34 showed the fundamental problem: precision and recall depend on the threshold you choose. Lower the threshold from 0.5 to 0.3 — you catch more fraud (higher recall) but generate more false alarms (lower precision). Every threshold gives a different precision/recall pair. Which one do you report? Which one do you optimise?
The deeper question is: before you even choose a threshold, how good is the model's underlying ability to separate fraud from legitimate transactions? If the model's scores completely overlap — fraud transactions score 0.4–0.6 and legitimate transactions also score 0.4–0.6 — no threshold will produce a useful classifier. If fraud scores 0.7–0.9 and legitimate scores 0.1–0.3, any reasonable threshold works perfectly.
The ROC curve answers this question. It plots how the true positive rate and false positive rate trade off as you sweep the threshold from 1.0 down to 0.0 — across every possible threshold simultaneously. The AUC (area under that curve) collapses this into one number that describes the model's separability regardless of any threshold choice.
Imagine 100 CRED loan applicants — 10 will default, 90 will not. You line them up ordered by your model's default score, highest first. The ROC curve asks: as you walk down the line and draw a threshold between each pair of adjacent applicants, what fraction of the 10 defaulters have you caught so far (TPR), and what fraction of the 90 non-defaulters have you incorrectly included (FPR)?
If your model is perfect, all 10 defaulters appear at the top of the list before any non-defaulter. TPR reaches 1.0 while FPR is still 0.0 — a curve that hugs the top-left corner. AUC = 1.0. If your model is random, defaulters and non-defaulters are scattered randomly — TPR and FPR increase at the same rate. AUC = 0.5.
Building the curve from scratch — every threshold, one point
The ROC curve is constructed by sweeping the classification threshold from 1.0 (predict everything negative) down to 0.0 (predict everything positive). At each threshold you compute TPR and FPR and plot one point. Connect all points and you have the ROC curve.
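This sweep can be sketched in a few lines of NumPy. The scores below are invented for illustration: a toy set of 10 applicants (4 defaulters) rather than the 100 from the example above.

```python
import numpy as np

# Hypothetical scores for 10 loan applicants; y_true = 1 marks a defaulter.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
scores = np.array([0.95, 0.88, 0.72, 0.70, 0.55, 0.41, 0.39, 0.30, 0.18, 0.05])

n_pos = (y_true == 1).sum()   # 4 defaulters in this toy set
n_neg = (y_true == 0).sum()   # 6 non-defaulters

# Sweep the threshold over every distinct score, highest first.
# Predicting positive when score >= threshold gives one (FPR, TPR) point each.
points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
for t in sorted(set(scores), reverse=True):
    pred = scores >= t
    tpr = (pred & (y_true == 1)).sum() / n_pos
    fpr = (pred & (y_true == 0)).sum() / n_neg
    points.append((fpr, tpr))
```

sklearn's `roc_curve` performs the same sweep (dropping collinear points by default via `drop_intermediate=True`), and `auc(fpr, tpr)` integrates the result with the trapezoidal rule.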
AUC = P(score of random positive > score of random negative)
The probabilistic interpretation of AUC is not just a nice fact — it is the most practically useful way to understand and communicate model quality. It requires no threshold, no class imbalance adjustment, and no domain knowledge to interpret.
It means you can directly answer the question: "if I show this model one fraud transaction and one legitimate transaction, what is the probability it will rank the fraud higher?" For Razorpay's fraud model with AUC = 0.94, the answer is 94%. This is the number you put in the model card, the slide deck, and the RBI audit report.
A concordant pair is a (positive, negative) pair where the model correctly scores the positive higher. A discordant pair is one where the negative scores higher. AUC equals the fraction of all possible positive-negative pairs that are concordant. This is the Mann-Whitney U statistic — a non-parametric test that predates ROC analysis by decades.
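The pairwise definition can be checked numerically against sklearn's `roc_auc_score`. The score distributions below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic scores: 10 positives around 0.7, 90 negatives around 0.3.
rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)
scores = np.concatenate([rng.normal(0.3, 0.1, 90), rng.normal(0.7, 0.1, 10)])

pos, neg = scores[y == 1], scores[y == 0]
# All 10 x 90 (positive, negative) score differences at once.
diffs = pos[:, None] - neg[None, :]
# Concordant pairs count 1; ties count 1/2 (none here with continuous scores).
auc_pairs = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

assert np.isclose(auc_pairs, roc_auc_score(y, scores))
```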
ROC-AUC vs PR-AUC — which to use and when
ROC-AUC has a critical weakness on severely imbalanced datasets. When the negative class is 99× larger than the positive class, a huge number of true negatives make FPR look small even when the model generates enormous absolute numbers of false positives. The ROC curve looks excellent while the precision is terrible.
The Precision-Recall curve is immune to this. It never looks at true negatives at all — it only measures how well the model finds the positive class. For fraud detection (1–2% fraud), disease diagnosis (1% positive), and any severely imbalanced problem, PR-AUC is the more honest metric.
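The gap between the two metrics is easy to demonstrate on synthetic data with roughly 1% positives; the score distributions here are illustrative, not from any real fraud feed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# ~1% positives, like a fraud feed; distributions are invented for illustration.
rng = np.random.default_rng(42)
y = np.array([0] * 9900 + [1] * 100)
scores = np.concatenate([rng.normal(0.0, 1.0, 9900), rng.normal(1.5, 1.0, 100)])

auc_roc = roc_auc_score(y, scores)       # flattered by 9,900 true negatives
ap = average_precision_score(y, scores)  # PR-AUC: ignores true negatives
print(f"ROC-AUC = {auc_roc:.3f}, PR-AUC = {ap:.3f}")
```

The ROC-AUC looks strong while the average precision, which tracks how polluted the flagged set actually is, comes out far lower on the same scores.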
Choosing an operating point — where on the ROC curve should you sit?
The ROC curve gives you all possible operating points. Choosing which point to operate at is a business decision, not a modelling decision. The right point depends on the cost ratio between false negatives and false positives, the operational capacity of your review team, and regulatory requirements.
Three systematic methods for choosing an operating point, each appropriate for different situations:
- Youden Index (maximise TPR − FPR): when you have no cost information. It picks the point with the greatest vertical distance above the random-guess diagonal, implicitly treating both error types as equally costly.
- Cost minimisation: when you know the relative cost of each error type. For Razorpay, cost_FN = ₹2500 (a missed fraud) against cost_FP = ₹50 (friction for a legitimate customer). This covers most real situations.
- Fixed recall constraint: when a regulator or the business sets a minimum recall requirement, e.g. "catch at least 90% of all fraud, no matter what."
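All three selection rules can be sketched against sklearn's `roc_curve` output. The data, the costs, and the 0.90 recall floor below are illustrative stand-ins, not real Razorpay numbers:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative held-out data: 100 frauds among 1,000 transactions.
rng = np.random.default_rng(7)
y_true = np.array([0] * 900 + [1] * 100)
y_score = np.concatenate([rng.normal(0.3, 0.15, 900), rng.normal(0.7, 0.15, 100)])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# 1. Youden Index: maximise TPR - FPR when no cost information exists.
t_youden = thresholds[np.argmax(tpr - fpr)]

# 2. Cost minimisation with illustrative costs
#    (Rs. 2500 per missed fraud, Rs. 50 per false alarm).
n_pos, n_neg = (y_true == 1).sum(), (y_true == 0).sum()
expected_cost = 2500 * (1 - tpr) * n_pos + 50 * fpr * n_neg
t_cost = thresholds[np.argmin(expected_cost)]

# 3. Fixed recall constraint: the lowest-FPR point with TPR >= 0.90.
#    roc_curve returns thresholds in decreasing order, so TPR is
#    non-decreasing and argmax finds the first index that qualifies.
t_recall = thresholds[np.argmax(tpr >= 0.90)]
```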
Multi-class AUC — OvR and OvO strategies
ROC-AUC extends to multi-class problems via two strategies. One-vs-Rest (OvR) computes one ROC curve per class treating it as the positive class against all others combined. One-vs-One (OvO) computes one ROC curve for every pair of classes. Both produce a single aggregate AUC via averaging.
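A minimal sketch of both strategies with `roc_auc_score` on the iris dataset; logistic regression is just a stand-in here, any model that emits per-class probabilities works:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Both strategies need per-class probabilities, not hard labels.
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
auc_ovo = roc_auc_score(y_te, proba, multi_class="ovo", average="macro")
```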
Every common ROC-AUC mistake — explained and fixed
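One recurring mistake, also flagged in the takeaways, is computing AUC from a hand-built `np.linspace` threshold grid instead of the thresholds the data actually contains. A small synthetic comparison of the correct `roc_curve` route against the grid approximation (data invented for illustration):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.default_rng(1)
y = np.array([0] * 500 + [1] * 500)
s = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])

# Correct: roc_curve evaluates every threshold the scores actually contain.
fpr, tpr, _ = roc_curve(y, s)
exact = auc(fpr, tpr)

# Mistake: an 11-point manual grid skips thresholds between grid points,
# so the trapezoidal area is only an approximation of the true AUC.
grid = np.linspace(s.min(), s.max(), 11)
tpr_g = np.array([np.mean(s[y == 1] >= t) for t in grid])
fpr_g = np.array([np.mean(s[y == 0] >= t) for t in grid])
approx = auc(fpr_g, tpr_g)  # fpr_g is monotone decreasing, which auc() accepts
```

The `exact` value matches `roc_auc_score(y, s)` to machine precision; the grid version does not, and the error grows as the grid gets coarser relative to the score distribution.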
You can evaluate any model at any threshold. Next: does your evaluation generalise — or did you get lucky on this particular test set?
ROC-AUC on a single test split gives one number. But how stable is it? A different random seed for the split might give AUC = 0.91 instead of 0.94. Cross-validation gives you a distribution of AUC scores across multiple non-overlapping test sets — mean and standard deviation — so you can report confidence intervals, not just point estimates. Module 37 covers cross-validation, the bias-variance tradeoff, and how to use them together to make model comparisons statistically rigorous.
From point estimates to confidence intervals. K-fold, stratified, and repeated CV — and when the bias-variance tradeoff determines which model to choose.
🎯 Key Takeaways
- ✓ The ROC curve plots TPR (recall) against FPR as the classification threshold sweeps from 1.0 to 0.0. Each threshold produces one point on the curve. AUC is the area under that curve — a single number summarising model quality across every possible threshold.
- ✓ AUC has a clean probabilistic interpretation: it equals the probability that the model assigns a higher score to a randomly chosen positive than to a randomly chosen negative. AUC = 0.94 means a random fraud transaction scores higher than a random legitimate one 94% of the time.
- ✓ ROC-AUC is optimistic on severely imbalanced datasets. A large pool of true negatives makes FPR look tiny even with many absolute false positives. For fraud rates below 5%, use PR-AUC (average precision) as the primary metric — it ignores true negatives entirely.
- ✓ Choosing an operating point on the ROC curve is a business decision, not a modelling decision. Three methods: Youden Index (max TPR − FPR, equal error cost), cost minimisation (explicit FN and FP costs), or fixed recall constraint (regulatory minimum catch rate).
- ✓ For multi-class problems use roc_auc_score with multi_class="ovr" (One-vs-Rest) or "ovo" (One-vs-One). Use average="macro" when all classes matter equally, average="weighted" for an overall performance summary weighted by class frequency.
- ✓ Never use a manually constructed threshold grid (np.linspace) to compute AUC — always use sklearn's roc_curve output directly with the auc() function. Manual grids miss critical threshold points and produce approximation errors.