Decision Trees — Loan Approval at HDFC
The algorithm that thinks in if-then questions. Gini impurity, information gain, pruning, and why decision trees are the foundation of every ensemble method.
It's 2024. You're a new ML engineer at HDFC Bank.
250,000 loan applications come in every month. Each one has an income figure, a credit score, an employment status, a debt-to-income ratio, and a dozen other fields. A team of credit analysts reviews each application and approves or rejects it. The process takes three days per application. The bank wants to automate the first pass — flag clear approvals and clear rejections automatically, and route only the borderline cases to humans.
You could use logistic regression. But your manager asks: "When the model rejects someone, can you explain exactly why?" Logistic regression gives you a probability — not an explanation a customer or a regulator can follow. You need something interpretable. Something that says: "rejected because monthly income < ₹35,000 AND credit score < 650 AND existing EMI > ₹12,000."
That is a decision tree. It learns a flowchart of if-then questions from your data. Every prediction comes with a traceable path through the tree. Every rejection has a human-readable reason. And it takes zero feature scaling, handles missing values gracefully, and works on both classification and regression problems without changing a single line of code.
What this module covers:
How a decision tree actually works
A decision tree asks a sequence of yes/no questions about the input features. Each question splits the data into two groups. The process repeats inside each group until the groups are pure enough — mostly one class — or a stopping condition is hit. The result is a tree of decisions that ends at leaf nodes containing a class label or a predicted value.
The key question is: which feature do you split on, and at what threshold? At every node, the tree tries every possible split on every feature and picks the one that makes the resulting groups most homogeneous — most "pure". Pure means one group is mostly approvals and the other is mostly rejections. An impure group is a mix of both.
Gini impurity — measuring how mixed a group is
Gini impurity measures how often a randomly chosen element from a group would be incorrectly labelled if it were randomly labelled according to the distribution of labels in that group. For class proportions pᵢ, Gini = 1 − Σpᵢ². A perfectly pure group (all one class) has Gini = 0. A perfectly mixed binary group (50% each class) has Gini = 0.5. The tree chooses the split that produces the lowest weighted average Gini impurity across the two resulting groups.
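A minimal sketch of both calculations, using illustrative loan labels (1 = approve, 0 = reject):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A pure group scores 0; a 50/50 mix scores 0.5
print(gini([1, 1, 1, 1]))  # 0.0
print(gini([1, 1, 0, 0]))  # 0.5

def weighted_gini(left, right):
    """Weighted average impurity of the two groups a split produces."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A clean split leaves purer children, so the tree prefers it
print(weighted_gini([1, 1, 1], [0, 0, 0]))  # 0.0
print(weighted_gini([1, 1, 0], [1, 0, 0]))  # ~0.444
```

At every node, the tree evaluates `weighted_gini` for every candidate split and keeps the lowest one.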
Information gain and entropy — the alternative criterion
Entropy (from information theory — Module 07) measures the same thing as Gini but using logarithms. Information gain is the reduction in entropy from a split. Both Gini and entropy produce very similar trees in practice. Gini is slightly faster to compute (no logarithm). Entropy can sometimes produce slightly more balanced trees. sklearn defaults to Gini. Use Gini unless you have a specific reason to switch.
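The entropy version looks like this (a sketch using log base 2, so entropy is measured in bits):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    child = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - child

parent = [1, 1, 1, 0, 0, 0]
# A clean split removes all uncertainty: gain equals the parent's 1.0 bit
print(information_gain(parent, [1, 1, 1], [0, 0, 0]))  # 1.0
# A split that leaves both children just as mixed gains nothing
print(information_gain(parent, [1, 0], [1, 1, 0, 0]))  # 0.0
```

The only computational difference from Gini is the logarithm, which is why Gini is marginally cheaper.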
Building a decision tree from scratch
Building a tree from scratch makes every part of the algorithm visible. The recursive structure — split a node, then recursively split each child — is why trees are called recursive partitioning algorithms. Once you see this implementation, sklearn's API will have no mysteries.
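A compact version of that recursion, assuming binary classification with numeric features (a teaching sketch, not sklearn's optimised CART; the toy columns and data are illustrative):

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Try every feature and every threshold; keep the lowest weighted Gini."""
    best = (None, None, gini(y))  # (feature, threshold, impurity)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            if mask.all() or (~mask).all():
                continue  # split must put samples on both sides
            w = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
            if w < best[2]:
                best = (f, t, w)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursive partitioning: split the node, then recurse into each child."""
    feature, threshold, _ = best_split(X, y)
    # Leaf if the node is pure, unsplittable, or the depth limit is reached
    if feature is None or depth == max_depth:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}  # majority class
    mask = X[:, feature] <= threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth),
    }

def predict_one(tree, x):
    """Follow the if-then path from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# Toy loan data: columns are [monthly_income_thousands, credit_score]
X = np.array([[20, 500], [30, 600], [60, 700], [80, 750]])
y = np.array([0, 0, 1, 1])  # 0 = reject, 1 = approve
tree = build_tree(X, y)
print(predict_one(tree, np.array([70, 720])))  # 1 (approve)
```

The traceable if-then path is exactly what makes the model explainable to a customer or regulator.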
Overfitting — why trees grow too deep
An unconstrained decision tree will grow until every leaf contains exactly one training sample — a perfectly pure leaf with zero training error. This sounds great until you realise the tree has memorised the training set completely. It has learned the noise, the one applicant who got approved despite a 400 credit score because the analyst was having a good day. That pattern will not generalise.
Pruning limits tree growth to prevent overfitting. There are three main strategies, and sklearn exposes all of them: capping depth with max_depth, requiring a minimum leaf population with min_samples_leaf, and post-pruning with cost-complexity pruning via ccp_alpha.
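A sketch of all three controls on synthetic data (the dataset is a stand-in, not real loan records):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: memorises the training set outright
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(full.score(X_tr, y_tr), full.score(X_te, y_te))  # 1.0 vs noticeably lower

# Pre-pruning: cap the depth and require a minimum leaf population
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=25,
                                random_state=0).fit(X_tr, y_tr)
print(pruned.score(X_te, y_te))

# Post-pruning: pick a ccp_alpha from the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # in practice, choose alpha by CV
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
print(post.get_n_leaves(), "leaves vs", full.get_n_leaves())
```

The middle-of-the-path alpha above is arbitrary; in a real pipeline you would cross-validate over `path.ccp_alphas`.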
Feature importance — which inputs drove the decisions
A decision tree assigns importance to each feature based on how much it reduced impurity across all splits where it was used, weighted by the number of samples at those nodes. Features that appear near the root (early splits) tend to have high importance because they affect more samples. Features that only appear in deep, small-population splits get low importance.
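In sklearn these impurity-based importances are exposed as `feature_importances_`, normalised to sum to 1. A sketch with synthetic data and illustrative feature names (the names are assumptions, not a real HDFC schema):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative loan-style feature names for a synthetic dataset
names = ["income", "credit_score", "dti_ratio", "existing_emi", "tenure"]
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Each value is that feature's share of the total impurity reduction
for name, imp in sorted(zip(names, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:>12}: {imp:.3f}")
print(clf.feature_importances_.sum())  # 1.0
```

Features that never appear in a split get an importance of exactly zero.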
Regression trees — the same algorithm, continuous output
Decision trees work identically for regression — the only difference is in the splitting criterion and the leaf output. Instead of Gini impurity, regression trees minimise the mean squared error of predictions within each split. Instead of a class label, each leaf outputs the mean target value of all training samples that reached it.
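Because each leaf outputs a single mean, a regression tree's prediction is a step function: a depth-3 tree can emit at most 2³ = 8 distinct values. A sketch on a noisy sine wave:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.1, 200)

# Same API as classification; the criterion defaults to squared error
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# At most 2**3 = 8 leaves, so at most 8 distinct predicted values
preds = reg.predict(X)
print(len(np.unique(preds)))
```

Deeper trees give finer steps, at the usual cost of overfitting the noise.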
Decision trees are the building block of Random Forest and XGBoost
A single decision tree overfits. It is also unstable — small changes in the training data produce dramatically different trees. Both problems were solved by two different ideas that are now the dominant algorithms in production tabular ML worldwide.
Random Forest: train 100–1000 trees, each on a random sample of the data and a random subset of features, then average their predictions. The averaging cancels out individual tree errors — the ensemble is far more accurate and stable than any single tree.
Trees: independent, full depth
Prediction: average (regression) / vote (classification)
Gradient boosting (XGBoost): train shallow trees sequentially, each one correcting the errors of the previous. The final prediction is the sum of all trees. Gradient boosting consistently wins tabular ML benchmarks and is the most widely deployed ML algorithm in Indian fintech.
Trees: shallow (depth 3–6), sequential
Prediction: sum of all tree outputs
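Both ensembles share sklearn's estimator API, so the contrast fits in a few lines (synthetic data; sklearn's GradientBoostingClassifier stands in here for XGBoost, which has the same tree-ensemble structure):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: 200 independent trees, combined by vote
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Boosting: 200 shallow trees trained sequentially, outputs summed
    "gradient boosting": GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                                    random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```

On most tabular datasets the two ensembles land well above the single tree's test accuracy.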
The key insight: once you understand how a single decision tree chooses splits, computes impurity, and makes predictions, Random Forest and XGBoost become straightforward — they are just collections of trees combined in different ways. The algorithm you built from scratch in this module IS the algorithm inside every XGBoost model at Razorpay, Zepto, and every Indian unicorn running tabular ML.
What this looks like at work — day one at HDFC
Every common decision tree error — explained and fixed
You understand trees. Now watch what happens when you build thousands of them.
A single tree overfits, is unstable, and has high variance. The fix — discovered in the 1990s — was to train many trees and combine their predictions. Module 28 covers Random Forest: 100–1000 trees, each trained on a random sample of data and a random subset of features, their predictions averaged into something far more powerful and stable than any individual tree.
Bagging, random feature subsets, out-of-bag evaluation, and the feature importance that actually works in production.
🎯 Key Takeaways
- ✓ A decision tree recursively partitions the feature space using if-then questions. At each node it tries every feature and every threshold, picking the split that produces the purest child nodes.
- ✓ Gini impurity = 1 − Σpᵢ². Zero means a node is completely pure (all one class). 0.5 is maximally mixed for binary classification. The tree greedily minimises weighted Gini impurity at every split.
- ✓ Information gain is the alternative criterion: it measures entropy reduction from a split. Gini and entropy produce nearly identical trees in practice. sklearn defaults to Gini because it is faster to compute (no logarithm).
- ✓ Unconstrained trees reach 100% training accuracy by memorising every sample. Control overfitting with max_depth (start at 3–5), min_samples_leaf (try 20–50), and ccp_alpha (post-pruning via the cost-complexity path).
- ✓ Decision trees need no feature scaling: splits are threshold-based and scale-invariant. They handle mixed numeric and categorical features (categoricals after encoding in sklearn) and produce interpretable if-then rules.
- ✓ Feature importance from a tree = total Gini reduction attributable to each feature across all splits, weighted by samples. Features near the root have high importance because they affect more samples.
- ✓ Decision trees are the foundation of Random Forest (bagging + random features) and XGBoost/LightGBM (sequential boosting). Understanding one tree completely means understanding the building block of the two most powerful tabular ML algorithms.
Discussion
Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.