Gradient Boosting — How XGBoost and LightGBM Work
Sequential weak learners, residuals, learning rate, and why gradient boosting wins almost every tabular ML competition — built from plain English first.
Random Forest trains 500 trees independently and averages them. Gradient Boosting trains 500 trees sequentially — each one fixing the mistakes of all the previous trees.
You trained a Random Forest on Swiggy delivery time data and got a mean absolute error of 4.2 minutes. Some orders are predicted well. Others are consistently wrong — long-distance orders during peak hours that the model always underestimates. The errors are not random noise. They have a pattern.
Random Forest ignores this. It trains every tree independently on a random sample of data. It has no mechanism to say "pay more attention to the orders we keep getting wrong."
Gradient Boosting does exactly this. After training the first tree, it looks at every prediction error. It trains the second tree specifically to predict those errors — not the original target, but the residuals (the mistakes). The third tree predicts the residuals of the first two combined. Each new tree corrects what all previous trees got wrong. After 500 trees, the accumulated corrections produce a model that consistently outperforms any single tree or Random Forest on almost every tabular dataset.
You are learning to throw darts. First throw: you miss the bullseye by 8cm to the right. A coach watches and says "next throw, aim 8cm to the left of wherever you aimed before." Second throw: miss by 3cm upward. Coach: "aim 3cm down from last time." Each throw corrects the accumulated error of all previous throws.
Gradient Boosting trains each new tree to hit where the previous ensemble missed. The final prediction is the sum of all trees — each one having corrected the previous collection's errors.
Residuals — what each tree actually learns to predict
A residual is simply the difference between the actual value and what the current ensemble predicts. If the true delivery time is 42 minutes and the current ensemble predicts 35 minutes, the residual is 42 − 35 = +7 minutes. The next tree tries to predict +7. After adding it, the ensemble now predicts 35 + 7 = 42. Exact.
Of course real data has noise — you cannot eliminate all error. The next tree predicts the residuals imperfectly. But each iteration reduces them further. After many iterations the residuals shrink to near-zero for most training points.
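The residual-fitting loop is simple enough to sketch by hand. Below is a minimal from-scratch version using sklearn decision trees on synthetic data (the dataset and hyperparameters are illustrative, not from the article):

```python
# Minimal gradient boosting for regression, built by hand:
# each tree is fit on the residuals of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))                # e.g. distance, hour, load
y = 5 * X[:, 0] + 2 * np.sin(X[:, 1]) + rng.normal(0, 1, 500)

prediction = np.full_like(y, y.mean())                # start from the mean
learning_rate, trees = 0.1, []

for _ in range(100):
    residuals = y - prediction                        # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)     # small corrective step

print(f"mean |residual| after boosting: {np.abs(y - prediction).mean():.3f}")
```

After 100 rounds the training residuals are a fraction of what a constant mean predictor leaves behind — each tree has eaten into the errors of the ensemble before it.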
Learning rate and n_estimators — always tune them together
The learning rate controls how much each tree contributes to the final prediction. A small learning rate (0.01) means each tree makes tiny corrections — you need many more trees to converge, but the final model generalises better because it took small careful steps. A large learning rate (0.5) means each tree makes large corrections — you converge faster but risk overshooting and overfitting.
This creates an important relationship: lower learning rate requires more trees, but generally produces a better model. The two hyperparameters must be tuned together. Halving the learning rate and doubling n_estimators often improves performance.
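A quick way to see the tradeoff is to sweep matched pairs of the two hyperparameters on held-out data. This sketch uses synthetic data, so the specific MAE numbers mean nothing — only the pattern of halving `learning_rate` while doubling `n_estimators` is the point:

```python
# Illustrative sweep: halve learning_rate, double n_estimators.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for lr, n in [(0.2, 100), (0.1, 200), (0.05, 400)]:
    model = GradientBoostingRegressor(learning_rate=lr, n_estimators=n,
                                      random_state=0).fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"lr={lr:<5} n_estimators={n:<4} test MAE={mae:.2f}")
```

Each pair does roughly the same total amount of correction (learning_rate × n_estimators is constant), which is why they are comparable configurations rather than three unrelated models.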
Four ways to regularise gradient boosting
Gradient boosting can overfit severely if unconstrained. With 1000 deep trees, it will eventually memorise the training data. Four parameters control overfitting — each from a different angle. Understanding all four lets you tune systematically rather than randomly.
- max_depth — the maximum depth of each tree. Shallower trees = simpler weak learners = less overfitting. Gradient boosting works best with shallow trees (depth 3–6), unlike Random Forest, which uses full-depth trees.
- subsample — the fraction of training data used for each tree. Like Random Forest's bootstrap, but sampled without replacement. Introduces randomness — each tree sees a different subset — which reduces variance and often improves generalisation.
- min_samples_leaf — the minimum number of samples required at a leaf. Only allows splits that leave at least this many samples in each child, preventing the tree from fitting single-sample noise.
- max_features — the number of features considered at each split. Like Random Forest's random feature selection, it introduces randomness and can improve generalisation, especially with many correlated features.
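Putting the four handles together in one model looks like this. The values below are common starting points for a tuning search, not recommendations derived from the article:

```python
# One GradientBoostingRegressor using all four regularisation handles at once.
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,           # shallow weak learners
    subsample=0.8,         # each tree sees 80% of rows, sampled without replacement
    min_samples_leaf=20,   # no split may create a leaf smaller than 20 samples
    max_features=0.7,      # each split considers 70% of the features
    random_state=0,
)
```

Because the four parameters attack overfitting from different angles (tree complexity, row sampling, leaf size, column sampling), loosening one can often be compensated by tightening another during tuning.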
The gradient connection — residuals are negative gradients of MSE
The word "gradient" in gradient boosting is not just marketing. It connects directly to gradient descent from Module 07. When the loss function is mean squared error, the residuals y − ŷ are exactly the negative gradient of the loss with respect to the predictions. So fitting a tree on residuals is the same as taking a gradient descent step in the space of functions.
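For squared-error loss, one line of calculus confirms the claim (using the conventional 1/2 factor so the gradient comes out clean):

```latex
L = \tfrac{1}{2}\sum_i (y_i - \hat{y}_i)^2
\qquad\Longrightarrow\qquad
-\frac{\partial L}{\partial \hat{y}_i} = y_i - \hat{y}_i = r_i
```

The negative gradient of the loss with respect to prediction i is exactly the residual r_i, so fitting a tree to the residuals is fitting it to the direction of steepest loss decrease.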
The power of the gradient framework is that it works for any differentiable loss function. For regression you use MSE residuals. For classification you use the gradient of the log-loss. For ranking problems you use custom ranking loss gradients. XGBoost extends this further by using both first and second derivatives (the Hessian) for more accurate tree fitting.
sklearn GB vs XGBoost vs LightGBM — what changed and why it matters
sklearn's GradientBoostingRegressor implements the original Friedman (2001) algorithm faithfully. XGBoost (2016) and LightGBM (2017) are engineering breakthroughs that made gradient boosting 10–100× faster while often improving accuracy. Understanding what they changed explains why they dominate every tabular ML benchmark today.
You understand gradient boosting. Now the production implementation.
Gradient boosting is the concept. XGBoost is the implementation that won every Kaggle competition from 2016–2019 and is still deployed at most Indian fintech companies today. Module 30 covers XGBoost in practice — regularisation parameters, early stopping with a validation set, SHAP values for explaining individual predictions, and a complete end-to-end workflow.
🎯 Key Takeaways
- ✓ Gradient Boosting trains trees sequentially. Each new tree learns to predict the residuals — the errors — of all previous trees combined. Final prediction = initial mean + learning_rate × sum of all trees.
- ✓ Residuals are the negative gradient of the MSE loss. This is why it is called gradient boosting — fitting trees on residuals is equivalent to gradient descent in function space. The framework generalises to any differentiable loss function.
- ✓ Learning rate and n_estimators must be tuned together. Lower learning rate requires more trees but generally produces better generalisation. Halving the learning rate and doubling n_estimators is a reliable improvement strategy.
- ✓ Four regularisation handles: max_depth (keep at 3–5), subsample (0.7–0.9 adds beneficial randomness), min_samples_leaf (prevents leaf overfitting), max_features (random feature selection). Use all four together for the most regularised model.
- ✓ For datasets above 50,000 rows, use HistGradientBoostingRegressor (sklearn), XGBoost, or LightGBM instead of the original GradientBoostingRegressor. Histogram-based splitting gives 10–50× speedup with equal or better accuracy.
- ✓ Enable early stopping for automatic n_estimators selection. It monitors a held-out validation set and stops training when performance stops improving — preventing overfitting and saving you from manually tuning n_estimators.