Random Forest — Zepto Stock Prediction
Bagging, random feature subsets, out-of-bag evaluation, and the feature importance that actually works. Why Random Forest beats a single tree on every real dataset.
A single decision tree is unstable. One noisy sample can change everything.
You trained a decision tree on HDFC loan data and got 88% accuracy. You add 50 new training samples — a routine monthly data refresh — and retrain. The tree looks completely different. Different root split, different branches, different feature importances. The accuracy barely changed but the structure changed dramatically. This is variance. The tree is too sensitive to the specific samples it saw.
The fix was published by Leo Breiman in 2001. His insight: if one tree is unstable and noisy, train 500 trees on slightly different versions of the data and average their predictions. Each individual tree is still noisy, but the noise is random and independent across trees. It cancels out in the average. What remains is the underlying signal.
That is Random Forest. It is still, in 2026, one of the first algorithms you should try on any tabular ML problem. It almost never catastrophically fails, requires minimal tuning, handles missing values gracefully, provides reliable feature importances, and trains in parallel across cores. The Zepto data science team uses it for demand forecasting, inventory reorder prediction, and fraud detection — often as a strong baseline before reaching for XGBoost.
Bagging — bootstrap aggregation
Bagging starts with a simple observation: if you had access to many independent training datasets, you could train one model per dataset and average their predictions. The average would be more stable and accurate than any single model.
You only have one training dataset. The trick: create many simulated datasets by sampling from it with replacement. This is called bootstrap sampling. Each bootstrap sample is the same size as the original but contains roughly 63% unique samples (some samples appear 2 or 3 times, about 37% never appear). Train one tree on each bootstrap sample. Average the predictions. That is bagging.
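The ~63% / ~37% split falls out of basic probability: each draw misses a given sample with probability (1 − 1/n), so after n draws a sample is never picked with probability (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.37. A minimal sketch, using only NumPy, that checks this empirically:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # pretend training-set size

# One bootstrap sample: draw n row indices with replacement.
idx = rng.integers(0, n, size=n)

# Fraction of original rows that appear at least once in the sample.
unique_frac = len(np.unique(idx)) / n
print(f"unique fraction: {unique_frac:.3f}")  # close to 1 - 1/e ≈ 0.632
```

The remaining ~37% of rows never drawn are exactly the out-of-bag samples discussed later in this module.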
Random feature subsets — why RF beats plain bagging
Plain bagging with decision trees works but has a problem. If one feature is very predictive of the target — say, days_of_stock for stock-out prediction — every tree will put it at the root. The 500 trees will all look similar in their top splits, making them highly correlated. Correlated trees cancel each other's errors poorly. The benefit of averaging is reduced.
Random Forest fixes this by constraining each split to consider only a random subset of features — typically sqrt(n_features) for classification and n_features/3 for regression. Now no single feature can dominate every tree. Different trees explore different feature combinations. The trees are decorrelated, and averaging them cancels much more error. This is the one addition that makes Random Forest beat plain bagging by a significant margin.
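In scikit-learn this knob is `max_features`. A short sketch on synthetic data (a stand-in, not the Zepto dataset) contrasting plain bagging with a proper Random Forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)

# Plain bagging: every split sees all 20 features, so a dominant
# feature ends up at the top of almost every tree.
bagging = RandomForestClassifier(n_estimators=200, max_features=None,
                                 random_state=0).fit(X, y)

# Random Forest: each split considers only sqrt(20) ≈ 4 random
# features, which decorrelates the trees.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)
```

Setting `max_features=None` turns `RandomForestClassifier` into plain bagged trees, which makes it a convenient one-line ablation for the decorrelation claim.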
Out-of-bag evaluation — cross-validation at no extra cost
Each bootstrap sample leaves out roughly 37% of the training data. Those left-out samples are called out-of-bag (OOB) samples. For any given training sample, there will be trees in the forest that never saw it during training — because it was OOB for those trees. We can evaluate each sample using only those trees, giving us an unbiased estimate of generalisation performance without any separate validation set or cross-validation loop.
Set oob_score=True and the OOB score is computed automatically. For large datasets, OOB evaluation is often preferred over k-fold CV because it comes from a single training pass rather than k separate fits, making it roughly k times cheaper.
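A minimal sketch of OOB evaluation on synthetic data (the dataset here is illustrative, not the loan data from the intro):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=15, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=0).fit(X, y)

# Each sample is scored only by the trees that never saw it,
# so this approximates held-out accuracy with no validation split.
print(f"OOB score: {rf.oob_score_:.3f}")
```

Note that `oob_score_` is only available after fitting with `oob_score=True`, and it needs enough trees that every sample is out-of-bag for at least a few of them.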
Key hyperparameters — what to tune and in what order
Random Forest is remarkably robust to hyperparameter choices compared to other algorithms. The defaults often work well. But three parameters consistently matter and are worth tuning in this order.
Number of trees. More is better: adding trees never hurts accuracy, it only reduces variance, at the cost of training and prediction time. Keep adding until OOB error stops improving. 100 is often enough; 500 for important production models.
Features considered at each split. sqrt(n_features) for classification, n_features/3 for regression are the theory-backed defaults. Smaller = more decorrelated trees = less variance. Larger = more powerful individual trees = less bias. Try "sqrt", "log2", 0.3, 0.5.
Minimum samples at a leaf. Controls tree depth indirectly. Higher = shallower trees = less overfitting but more bias. Try 1, 2, 5, 10, 20. For noisy datasets increase this.
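One way to find the n_estimators elbow is to refit at a few forest sizes and watch the OOB score plateau. A rough sketch, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=20,
                           n_informative=6, random_state=0)

# Grow progressively larger forests; the OOB score typically
# flattens out well before the largest size.
for n in [50, 100, 200, 400]:
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=0).fit(X, y)
    print(n, round(rf.oob_score_, 3))
```

In practice `warm_start=True` lets you add trees to an existing forest instead of refitting from scratch; the plain loop above is just the easiest version to read.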
Feature importance — MDI and permutation importance
Random Forest provides two types of feature importance. Mean Decrease in Impurity (MDI) is fast — it's computed during training as the total Gini reduction per feature. But MDI has a known bias: it overestimates the importance of high-cardinality features (features with many unique values like IDs or continuous floats). Permutation importance is slower but unbiased — it measures how much model performance degrades when a feature's values are randomly shuffled.
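Both importances are available in scikit-learn: MDI via the fitted model's `feature_importances_`, permutation importance via `sklearn.inspection.permutation_importance`. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# MDI: free at training time, sums to 1, biased toward
# high-cardinality features.
mdi = rf.feature_importances_

# Permutation importance: shuffle one feature at a time on held-out
# data and measure the drop in score. Slower but unbiased.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(perm.importances_mean)
```

Computing permutation importance on the test split (not the training data) matters: on training data even a noise feature the trees memorised can look important.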
Random Forest for regression — predicting demand quantity
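For regression the estimator is `RandomForestRegressor` and the prediction is the mean of the trees' outputs rather than a vote. A hedged sketch on synthetic data standing in for daily demand quantities (the real Zepto features are not shown here):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-SKU daily demand data.
X, y = make_regression(n_samples=2000, n_features=10,
                       noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features as a float is a fraction of features; 1/3 is the
# classic regression heuristic mentioned above.
rf = RandomForestRegressor(n_estimators=300, max_features=1/3,
                           random_state=0).fit(X_tr, y_tr)

print(f"MAE: {mean_absolute_error(y_te, rf.predict(X_te)):.2f}")
```

Everything said earlier carries over: OOB scoring, the same three hyperparameters, and both flavours of feature importance work identically for the regressor.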
Random Forest vs XGBoost — when to use which
Both algorithms are dominant in tabular ML. The choice between them depends on your priorities — not on a blanket "XGBoost is always better" rule that many tutorials incorrectly state.
The practical rule: start with Random Forest. It gives you a strong baseline in minutes with minimal tuning. If you need every last point of AUC and have time to tune properly, switch to XGBoost or LightGBM. At most Indian product companies, a well-tuned Random Forest is already good enough for production — and it deploys faster and is easier to maintain.
Day-one task at Zepto — stock-out predictor end to end
Every common Random Forest error — explained and fixed
Random Forest is parallel averaging. The next step is sequential correction.
Random Forest trains all trees independently and averages them. Gradient Boosting trains trees sequentially — each new tree is built specifically to correct the errors of all previous trees. This sequential error correction is why XGBoost and LightGBM consistently outperform Random Forest on most tabular benchmarks. Module 22 explains how it works from scratch.
Sequential weak learners, residuals, learning rate, and why gradient boosting wins almost every tabular ML competition.
🎯 Key Takeaways
- ✓ Random Forest = bootstrap sampling (bagging) + random feature subsets at each split. The random features are the key innovation — they decorrelate the trees so averaging them cancels much more error than plain bagging.
- ✓ Each bootstrap sample leaves out ~37% of training data as out-of-bag (OOB) samples. Setting oob_score=True gives a free, unbiased evaluation of generalisation performance without any separate validation set or cross-validation loop.
- ✓ The three parameters that matter most in order: n_estimators (more is always better, find the elbow), max_features (sqrt for classification, n_features/3 for regression — the most impactful param), min_samples_leaf (increase for noisy data).
- ✓ MDI feature importance is biased toward high-cardinality features. For feature selection decisions always use permutation_importance from sklearn.inspection — it is unbiased and directly measures impact on model performance.
- ✓ Random Forest needs no feature scaling — trees are threshold-based and scale-invariant. It also handles mixed feature types natively and is robust to outliers, making it one of the lowest-friction algorithms to deploy.
- ✓ On class-imbalanced datasets always set class_weight="balanced". Evaluate with ROC-AUC or average precision, not accuracy — accuracy is trivially gamed by predicting the majority class.
- ✓ Use Random Forest as your first strong baseline on any tabular problem. It gives production-quality results with minimal tuning. Switch to XGBoost/LightGBM only when you need maximum performance and can afford proper hyperparameter tuning.