XGBoost in Practice — End to End
Train, tune, and interpret XGBoost on a real dataset. Regularisation parameters, early stopping, SHAP values, and production deployment — all in one module.
XGBoost dominated Kaggle's tabular competitions from roughly 2016 to 2019, and it remains one of the most widely deployed ML algorithms in Indian fintech today. Here is why.
Module 29 explained gradient boosting conceptually — sequential trees each correcting the previous ensemble's mistakes. XGBoost (eXtreme Gradient Boosting) is an engineering implementation of that idea that made it practical at scale. Chen and Guestrin (2016) published a paper at KDD that introduced three key improvements: second-order gradients for more accurate tree construction, a built-in regularisation term that penalises model complexity, and a column subsampling technique borrowed from Random Forest.
The result was an algorithm that was simultaneously faster, more accurate, and less prone to overfitting than the original gradient boosting. Within a year it dominated every tabular ML benchmark. In 2026 it remains what many Indian fintech companies — Razorpay, CRED, Zepto, PhonePe — use for credit scoring, fraud detection, and churn prediction in production.
Gradient boosting is like a team of students taking turns correcting each other's homework — each student fixes what the previous one got wrong. XGBoost is the same team, but now each student: looks at not just where they were wrong but how sharply wrong (second derivative), gets penalised for writing overly complex answers (regularisation), and only studies a random subset of topics each turn (column subsampling).
The result: faster convergence, better generalisation, and answers that are easier to explain to the teacher (interpretability via SHAP).
What XGBoost adds — and why each improvement matters
Understanding the three improvements XGBoost made over the original gradient boosting tells you directly which hyperparameters to tune — each improvement has a corresponding parameter.
Vanilla gradient boosting uses only the first derivative (gradient) to decide how to split. XGBoost also uses the second derivative (Hessian) — the curvature of the loss. This gives more accurate information about the optimal leaf values, leading to better trees with fewer iterations.
XGBoost adds a penalty to the loss function that discourages trees from having too many leaves or leaves with extreme values. This is controlled by alpha (L1), lambda (L2), and gamma (minimum gain to make a split). Gradient boosting had none of this.
For each tree (and, optionally, each level), XGBoost randomly selects a fraction of features to consider for splitting. This decorrelates the trees (the same insight as Random Forest) and reduces overfitting when many features are correlated.
Your first XGBoost model — Razorpay fraud detection
XGBoost's sklearn-compatible API means you already know how to use it. The only differences are the parameter names — which map directly to the three improvements described above.
Early stopping — automatically find the optimal number of trees
The most common XGBoost mistake is setting n_estimators to a fixed number and hoping it is right. Too few trees — underfits. Too many — overfits and wastes training time. Early stopping solves this automatically: train until the validation score stops improving, then stop. Use the number of trees that produced the best validation score.
Early stopping requires a separate validation set — a portion of the training data held back just for monitoring. XGBoost evaluates it after each tree and tracks the best score. After early_stopping_rounds consecutive rounds with no improvement it stops and restores the best model.
Typical early-stopping curve: train loss keeps falling, while validation loss bottoms out and then rises as overfitting begins. Early stopping fires after "patience" rounds of no improvement, and the best-scoring model is restored automatically.
XGBoost parameters — what each one does, in plain English
XGBoost has dozens of parameters. Most can be left at defaults. A handful matter significantly. Here is the complete practical reference — grouped by what aspect of training they control.
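A condensed sketch of that reference as a parameter dictionary, grouped by what each parameter controls. The values are common starting points, not universal defaults:

```python
params = {
    # Tree structure / capacity
    "max_depth": 4,            # 3-6; deeper trees overfit quickly
    "min_child_weight": 5,     # min Hessian sum per leaf; raise to regularise
    # Boosting schedule
    "n_estimators": 2000,      # set high, let early stopping pick the count
    "learning_rate": 0.05,     # smaller = more trees, better generalisation
    # Randomness (decorrelates trees)
    "subsample": 0.8,          # fraction of rows sampled per tree
    "colsample_bytree": 0.8,   # fraction of features sampled per tree
    # Explicit regularisation
    "gamma": 0.1,              # min loss reduction to make a split
    "reg_alpha": 0.0,          # L1 penalty on leaf weights
    "reg_lambda": 1.0,         # L2 penalty on leaf weights
    # Class imbalance
    "scale_pos_weight": 49,    # n_negative / n_positive (e.g. 2% positives)
}
print(sorted(params))
```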
SHAP values — explain any individual prediction in plain English
Fraud detection at Razorpay faces a hard business requirement: when a transaction is flagged, the system must be able to explain why. "The model said fraud" is not acceptable — not to the compliance team, not to the customer disputing the block, not to the RBI audit. SHAP (SHapley Additive exPlanations) solves this.
SHAP computes the contribution of each feature to a specific prediction. For a transaction flagged as fraud with probability 0.87, SHAP might say: "merchant_risk contributed +0.31 toward fraud, n_tx_last_hour contributed +0.25, user_tenure_days contributed −0.12 toward legitimate." These contributions sum to the final log-odds of the prediction. Every flagged transaction now has a human-readable explanation.
A bank decides to reject a loan application. Without SHAP: "The model rejected it." With SHAP: "Low credit score contributed ₹−8 LPA to the effective income estimate. High existing EMI burden contributed ₹−5 LPA. Short employment history contributed ₹−3 LPA. High income partially offset these: +₹12 LPA."
SHAP gives each feature a "blame or credit" score for each individual prediction. It is mathematically rigorous — the scores are derived from cooperative game theory and have provable fairness properties. This is why regulators accept them.
Complete production fraud detection pipeline — end to end
Every common XGBoost error — explained and fixed
XGBoost is mastered. LightGBM takes the same ideas and makes them faster.
XGBoost and LightGBM implement the same gradient boosting algorithm. The difference is in the engineering: LightGBM uses leaf-wise tree growth (instead of level-wise), Gradient-based One-Side Sampling (GOSS) to skip uninformative training samples, and Exclusive Feature Bundling (EFB) to compress sparse features. The result trains 10–20× faster on large datasets with equal or better accuracy. On datasets above 100,000 rows, LightGBM is almost always the right choice over XGBoost.
Leaf-wise growth, histogram-based splitting, and why LightGBM trains 10× faster than XGBoost on large datasets.
🎯 Key Takeaways
- ✓ XGBoost adds three improvements over vanilla gradient boosting: second-order gradients (Newton step) for better tree construction, L1/L2 regularisation on leaf weights (alpha, lambda, gamma), and column subsampling (colsample_bytree) for decorrelated trees.
- ✓ Always use early stopping. Set n_estimators high (1000–3000), pass a validation set via eval_set=, and set early_stopping_rounds=50. XGBoost stops when val AUC stops improving and restores the best model automatically.
- ✓ The key regularisation parameters in order of importance: max_depth (keep at 3–5), subsample + colsample_bytree (0.7–0.9 each), gamma (min split gain, try 0–0.5), min_child_weight (try 1–10), reg_alpha and reg_lambda. Tune with RandomizedSearchCV.
- ✓ scale_pos_weight = n_negative/n_positive handles class imbalance. For fraud detection where 2% of transactions are fraud, scale_pos_weight = 49 tells XGBoost to weight fraud examples 49× more.
- ✓ SHAP values explain any individual prediction by computing each feature's contribution to the log-odds. They are the industry standard for model explainability in regulated industries (banking, insurance, healthcare).
- ✓ The optimal classification threshold is almost never 0.5. For fraud detection, tune the threshold on a validation set to balance precision (false alarm rate) and recall (fraud catch rate) according to the business cost of each type of error.