Calibration — Are Your Probabilities Trustworthy?
Reliability diagrams, Brier score, and Platt scaling vs isotonic regression — when your model says 80% fraud probability, does it actually mean 80%?
Your fraud model says P(fraud) = 0.85 for a transaction. That should mean 85 out of 100 such transactions are genuinely fraudulent. Is that actually true? Almost certainly not — without calibration.
Module 34 taught you that ROC-AUC measures ranking quality — does the model score fraud higher than legitimate transactions? A model with AUC = 0.95 is excellent at ranking. But ranking quality says nothing about whether the actual probability values are meaningful.
A Razorpay credit risk model outputs a score of 0.85 for a loan application. The credit officer interprets this as "85% probability of default." They reject the loan. But if that model is poorly calibrated, 0.85 might actually correspond to a 40% default rate — meaning the officer rejected a loan that should have been approved. The model's score is directionally correct (high scores mean higher risk) but the actual probability value is wrong.
Calibration is the property that a predicted probability of 0.85 corresponds to an actual observed frequency of 85%. Among all predictions where the model said 0.85, approximately 85% of them should turn out to be positive. This property is called reliability — the predictions are reliable as probability estimates, not just rankings.
A weather forecaster who says "70% chance of rain" is well-calibrated if it actually rains on about 70% of the days they make that prediction. A poorly-calibrated forecaster might say "70%" but it only rains 30% of those days — their confidence is systematically too high. You still trust their ranking (70% means more likely than 40%) but you cannot trust the actual number.
ML models are forecasters. ROC-AUC measures whether their rankings are correct. Calibration measures whether their actual numbers are correct. Both matter — but only calibration lets you use the probabilities for quantitative business decisions.
The reliability diagram — one chart that shows everything
The reliability diagram (also called a calibration plot) is the standard tool for visualising calibration. You group predictions into bins by predicted probability (0–0.1, 0.1–0.2, etc.), then for each bin you compare the predicted probability to the actual fraction of positive cases observed. A perfectly calibrated model produces points along the diagonal. Deviations reveal systematic over- or under-confidence.
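A minimal sketch of how the numbers behind a reliability diagram are computed, using sklearn's `calibration_curve` on synthetic predictions (the data here is simulated to be perfectly calibrated, purely for illustration):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Simulate a perfectly calibrated forecaster: a prediction p turns out
# positive with probability exactly p.
y_prob = rng.uniform(0, 1, size=10_000)
y_true = (rng.uniform(0, 1, size=10_000) < y_prob).astype(int)

# Bin predictions (0-0.1, 0.1-0.2, ...) and compare the mean predicted
# probability per bin to the observed fraction of positives in that bin.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

for pred, true in zip(prob_pred, prob_true):
    print(f"predicted {pred:.2f} -> observed {true:.2f}")
# Points close to the diagonal (predicted ~ observed) = well-calibrated.
```

Plotting `prob_pred` against `prob_true` (with the diagonal for reference) gives the reliability diagram itself.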
Brier score — the single best metric for probability quality
The reliability diagram is visual. When you need a single number to compare models or track calibration over time, use the Brier score — the mean squared error between predicted probabilities and actual labels.
A Brier score of 0 is perfect. Predicting 0.5 for every case scores 0.25; predicting the base rate p for every case scores p(1 − p), and that is the baseline to beat. Anything below the baseline means the model adds information beyond the class frequencies. The lower the Brier score, the more accurate the probability estimates. Unlike ROC-AUC, the Brier score penalises overconfident predictions — a model that says 0.99 for a case that turns out negative incurs a squared error of 0.98 on that single prediction.
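A quick sketch of the Brier score, its baseline, and the Brier Skill Score on synthetic data (the ~20% base rate and the toy model's scores are made up for illustration):

```python
import numpy as np
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y_true = (rng.uniform(size=5_000) < 0.2).astype(int)   # ~20% base rate

# A toy model: reasonably confident and roughly calibrated.
y_prob_model = np.clip(0.2 + 0.6 * (y_true - 0.2)
                       + rng.normal(0, 0.05, 5_000), 0, 1)
base_rate = y_true.mean()

brier_model = brier_score_loss(y_true, y_prob_model)
brier_baseline = brier_score_loss(
    y_true, np.full_like(y_true, base_rate, dtype=float))

# Predicting the base rate every time scores base_rate * (1 - base_rate).
print(f"baseline: {brier_baseline:.3f} (= {base_rate * (1 - base_rate):.3f})")
print(f"model:    {brier_model:.3f}")

# Brier Skill Score: 1 = perfect, 0 = no better than the baseline.
bss = 1 - brier_model / brier_baseline
print(f"skill:    {bss:.3f}")
```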
The Brier score can be decomposed into three components: calibration, also called reliability (how far are the predicted probabilities from the observed rates?), resolution (how well does the model separate positives from negatives?), and uncertainty (the inherent noise in the problem). Good models minimise the calibration term while maximising resolution.
Three algorithms — three characteristic miscalibration patterns
Different algorithms have different systematic calibration failures. Knowing which algorithm tends to be miscalibrated in which direction tells you whether calibration is likely needed and which method to apply.
Random Forest tends to be underconfident. Its probabilities are the fraction of trees voting positive. With 100 trees, the maximum possible probability is 100/100 = 1.0 in theory, but in practice the averaging across diverse trees pulls extreme predictions toward the centre — few samples ever get 95+ out of 100 trees to agree.
Gradient boosting tends to be overconfident. Boosting with many trees on the log-loss objective aggressively separates classes: after hundreds of corrections the model pushes probabilities toward 0 and 1. The effect is especially severe when max_depth is large or n_estimators is high with a low learning rate.
Naive Bayes tends to be severely overconfident. The naive independence assumption causes probability products to push toward 0 and 1 very aggressively: with 20 correlated features, multiplying 20 individual likelihoods produces extreme products even for ambiguous cases. Module 27 covers the mechanics.
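A toy calculation (the per-feature likelihood ratio of 1.5 is a made-up number) showing how multiplying 20 mildly informative likelihoods produces an extreme posterior:

```python
# 20 features, each only mildly favouring the positive class
# (likelihood ratio 1.5 per feature). Naive Bayes multiplies them
# as if independent, so correlated copies inflate the evidence.
prior = 0.5
likelihood_ratio = 1.5 ** 20   # product over 20 "independent" features

# Bayes' rule with a 50/50 prior: posterior odds = prior odds * LR.
posterior = (prior * likelihood_ratio) / (
    prior * likelihood_ratio + (1 - prior))
print(f"posterior: {posterior:.6f}")   # pushed almost all the way to 1
```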
Logistic regression is well calibrated by design. It is trained to directly minimise log-loss, a proper scoring rule — minimising it forces the output probabilities to match the true class frequencies. It is the only common classifier with this property by design.
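An illustrative sketch (synthetic data, not a benchmark) comparing the four classifiers on Brier score and on how extreme the probabilities they emit are:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=20_000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0)

models = {
    "logistic":      LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbm":           GradientBoostingClassifier(random_state=0),
    "naive bayes":   GaussianNB(),
}

scores = {}
for name, model in models.items():
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    scores[name] = brier_score_loss(y_te, proba)
    # The min/max of the emitted probabilities hints at over/underconfidence.
    print(f"{name:14s} Brier={scores[name]:.4f}  "
          f"p range=[{proba.min():.3f}, {proba.max():.3f}]")
```

On data like this you would typically see Naive Bayes and GBM emitting probabilities hugging 0 and 1, while Random Forest's range is pulled toward the centre; exact numbers depend on the synthetic data.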
Two calibration methods — Platt scaling and isotonic regression
Once you have detected miscalibration (reliability diagram off-diagonal, high Brier score), you can fix it using a post-hoc calibration method. These methods do not retrain the model — they fit a small wrapper on top of the model's outputs that maps raw scores to calibrated probabilities. The original model is unchanged. Only the probability transformation changes.
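A sketch of both methods via sklearn's `CalibratedClassifierCV`, wrapping a Random Forest on synthetic data (the model choice, sample sizes, and `cv=5` setting are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

def rf():
    # Fresh copy of the base model for each wrapper.
    return RandomForestClassifier(n_estimators=100, random_state=0)

models = {
    "raw": rf().fit(X_tr, y_tr),
    # Platt scaling: a 2-parameter sigmoid fitted to the model's scores.
    "platt": CalibratedClassifierCV(rf(), method="sigmoid",
                                    cv=5).fit(X_tr, y_tr),
    # Isotonic regression: non-parametric step function, needs more data.
    "isotonic": CalibratedClassifierCV(rf(), method="isotonic",
                                       cv=5).fit(X_tr, y_tr),
}

scores = {}
for name, model in models.items():
    proba = model.predict_proba(X_te)[:, 1]
    scores[name] = brier_score_loss(y_te, proba)
    print(f"{name:9s} Brier={scores[name]:.4f}")
```

`cv=5` makes the wrapper fit the calibrator on out-of-fold predictions, so the calibration mapping is never learned on data the base model was trained on.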
Production credit scoring pipeline — calibrated end to end
At Razorpay's credit team, every loan application produces a calibrated default probability. The credit officer sees "this applicant has a 23% probability of default." They need to trust that number — it drives the interest rate, the loan amount, and the approval decision. An uncalibrated 0.23 is meaningless. A calibrated 0.23 means roughly 23 out of 100 such applicants historically defaulted. That is actionable.
Calibration drift — why a well-calibrated model degrades
A model calibrated in January on historical data may be poorly calibrated by June — not because the model changed, but because the world changed. Economic conditions shift, fraud patterns evolve, customer demographics change. The relationship between the model's scores and the actual default rate drifts over time. Monitoring calibration in production is as important as monitoring accuracy.
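One way to sketch drift monitoring (the `expected_calibration_error` helper and the simulated January/June outcomes below are hypothetical, not a standard API):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between mean predicted probability and
    observed positive rate, per probability bin (ECE)."""
    bins = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return ece

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=20_000)   # the model's (frozen) scores

# January: the world matches the model, so the scores are calibrated.
y_jan = (rng.uniform(size=20_000) < y_prob).astype(int)
# June: actual default rates rose ~30%, but the scores did not move.
y_jun = (rng.uniform(size=20_000)
         < np.clip(y_prob * 1.3, 0, 1)).astype(int)

print(f"January ECE: {expected_calibration_error(y_jan, y_prob):.3f}")
print(f"June ECE:    {expected_calibration_error(y_jun, y_prob):.3f}")
# A rising ECE signals calibration drift: refit only the calibrator.
```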
Every common calibration mistake — explained and fixed
You can evaluate and calibrate any model. Next: what happens inside the model — feature importance and SHAP across all algorithm types.
You have ROC-AUC, PR-AUC, precision, recall, F1, Brier score, and calibration curves. You can now honestly evaluate any model and trust its probability outputs. The next question stakeholders always ask: why did the model make this prediction? Which features matter most?
Module 36 — Feature Importance and Explainability — covers permutation importance and SHAP values (introduced in Module 30 for XGBoost) across all model types, including tree ensembles, linear models, and black-box models, plus the business problem of explaining an individual prediction to a customer who was rejected for credit.
Permutation importance, SHAP across all models, and explaining individual predictions to regulators, customers, and stakeholders.
🎯 Key Takeaways
- ✓ Calibration is distinct from ranking quality (ROC-AUC). A model with AUC=0.95 may still be poorly calibrated — its scores rank correctly but the actual probability values are wrong. Calibration means: when the model says 80%, roughly 80% of those cases are actually positive.
- ✓ The reliability diagram (calibration plot) is the standard visualisation. Group predictions by predicted probability, compare to the actual fraction of positives per bin. Points on the diagonal = well-calibrated. S-curve below the diagonal = overconfident (GBM). S-curve above the diagonal = underconfident (Random Forest).
- ✓ The Brier score is the single best number for probability quality — it is the mean squared error between predicted probabilities and actual labels. Lower is better. The baseline (predicting the class mean every time) is base_rate × (1 − base_rate). Use the Brier Skill Score (1 − Brier/Baseline) to compare models.
- ✓ Two calibration methods: Platt scaling (fits a sigmoid — 2 parameters, use when the calibration set is under 5,000 samples) and isotonic regression (non-parametric step function — more flexible but needs 5,000+ samples to avoid overfitting). Both are implemented via CalibratedClassifierCV in sklearn.
- ✓ Use cv=5 in CalibratedClassifierCV when possible — it calibrates inside cross-validation, preventing the common mistake of calibrating on data the base model was trained on. Using cv="prefit" requires a strictly held-out calibration set.
- ✓ Monitor calibration in production monthly. Calibration drifts as the data distribution changes — economic conditions, fraud patterns, user demographics all shift over time. When the maximum calibration error (MCE) exceeds a threshold, recalibrate by refitting only the calibrator on recent labelled data — the base model does not need retraining.