Train / Validation / Test Split
Why three splits not two. Holdout sets, stratified splits, data leakage across splits, and the time-series exception where random splits break everything.
Your model scored 99% accuracy. Then you deployed it and it was wrong on half the real orders. What went wrong?
You trained a delivery time model on 10,000 Swiggy orders and measured its accuracy on the same 10,000 orders it trained on. It scored incredibly well. You deployed it. It was terrible on real incoming orders. The problem: the model had already seen every order it was evaluated on. It did not learn to predict — it learned to memorise.
This is why you always split your data before training. You hold back a portion of your data that the model never sees during training. You evaluate on this held-out portion only. If the model performs well on data it has never seen, you have evidence it has actually learned something generalisable — not just memorised the training set.
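The gap between train-set and held-out accuracy is easy to see directly. Here is a minimal sketch using a synthetic dataset and an unconstrained decision tree — both illustrative choices, not from the text — where the tree memorises its training rows:

```python
# Sketch: evaluating on training data vs held-out data.
# make_classification and DecisionTreeClassifier are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree can grow until it memorises the training set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # rows the model has already seen
test_acc = model.score(X_test, y_test)     # the honest number: unseen rows
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The train score is perfect because every evaluated row was memorised during fitting; only the held-out score says anything about generalisation.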
But a two-way split — train and test — has a subtle flaw. When you tune hyperparameters (how deep should the tree be? what is the best regularisation strength?) using the test set score to decide, you are indirectly letting the test set influence your training decisions. Over many experiments, you overfit to the test set without realising it. This is why you need three splits — not two.
Think of preparing for a competitive exam like GATE or CAT. Your textbook problems are the training set — you practise on these, make mistakes, learn from them. Practice mock tests are the validation set — you use your score to decide which topics to study more. The actual exam on exam day is the test set — you sit it exactly once, at the very end, to get your true performance.
If you kept using the exam paper to decide what to study, your exam score would look great — but you would have cheated yourself out of knowing your real ability. Same with the test set in ML.
Training, validation, and test — what each one is for
Each split has a specific job. Confusing the jobs leads to either overly optimistic performance estimates or models that do not generalise.
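A common way to produce all three splits is two calls to `train_test_split` — first carve off the test set, then divide the remainder into train and validation. A sketch with a synthetic dataset (the 70/15/15 ratio is one of the rules of thumb discussed later):

```python
# Two-stage split: 70% train / 15% validation / 15% test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the test set (15% = 150 rows)...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)
# ...then split the remaining 850 rows into train and validation.
# An integer test_size avoids floating-point rounding in the ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=150, random_state=0
)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```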
Stratified splits — preserve class balance across all three sets
A random split might put 90% of the rare class into training and leave only 10% in test — making evaluation noisy and unreliable. Stratified splitting ensures each split has the same class proportion as the original dataset. This is especially important for imbalanced classification problems — fraud detection, churn prediction, medical diagnosis — where the minority class is what you actually care about.
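In sklearn this is one argument: `stratify=y`. A sketch on a synthetic imbalanced dataset (roughly 5% positive class — the dataset itself is illustrative) showing that both splits keep the original class proportion:

```python
# stratify=y forces each split to mirror the full dataset's class balance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# weights=[0.95] makes the positive class rare (~5%); flip_y=0 disables label noise
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0, random_state=1)
print(f"overall positive rate: {y.mean():.3f}")

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
print(f"train positive rate: {y_tr.mean():.3f}, test positive rate: {y_te.mean():.3f}")
```

Without `stratify=y`, the test-set positive rate can drift well away from the true rate on a single unlucky split.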
Data leakage across splits — five ways your evaluation lies to you
Data leakage across splits is when information from the test or validation set influences the training process — even indirectly. The result is an evaluation metric that looks excellent in development but collapses in production. It is the most common and most costly mistake in applied ML.
1. Preprocessing before the split — fitting a scaler on the full dataset.
Why: the test set's mean and standard deviation contaminate the scaler used for training.
Fix: always split first, then fit preprocessors on X_train only. Use a Pipeline.
2. Target encoding computed on the full dataset.
Why: the target means include test set labels — each test row's target leaks into its own feature.
Fix: compute target encodings inside cross-validation folds, or use sklearn's TargetEncoder.
3. Feature selection on the full dataset.
Why: feature selection uses test set labels to choose features, overfitting to the test set.
Fix: perform feature selection inside the training fold only. Wrap it in a Pipeline.
4. Duplicate rows across splits.
Why: the model memorises training samples and scores their copies perfectly in test.
Fix: deduplicate BEFORE splitting. Check: assert len(set(train_ids) & set(test_ids)) == 0
5. Temporal leakage — random splits on time-ordered data.
Why: the model learns from the future to predict the past. Impossible in production.
Fix: always use a time-based split for time-series: train on the past, validate on the future.
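The first three fixes share one mechanism: put every fitted preprocessing step inside a `Pipeline`, so it can only ever be fit on training data. A minimal sketch with a scaler (the dataset and model are illustrative):

```python
# A Pipeline makes scaler leakage structurally impossible:
# fit() fits the scaler on X_train only, never on X_test.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),   # mean/std computed from X_train only
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)         # the scaler never sees test statistics
acc = pipe.score(X_test, y_test)   # transform-then-predict on X_test
print(f"test accuracy: {acc:.3f}")
```

Contrast this with `StandardScaler().fit(X)` on the full dataset before splitting — numerically similar in many cases, but the evaluation is no longer honest.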
Time-series splits — when random splitting destroys your model
For time-series data — stock prices, daily orders, sensor readings, anything measured over time — random splitting is not just suboptimal, it is fundamentally wrong. A random split puts future data in the training set and past data in the test set. The model learns from information that would not exist at prediction time. You are training on the future to predict the past — the opposite of what you need.
For any dataset ordered by time, your split must respect chronological order. Training data must come entirely before validation data. Validation data must come entirely before test data. There must be no temporal overlap between any two splits.
This simulates what actually happens in production: your model was trained on historical data and is now predicting future events it has never seen.
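sklearn's `TimeSeriesSplit` enforces exactly this: each training window ends before its test window begins. A sketch on a stand-in for time-ordered data:

```python
# Chronological cross-validation: every training index precedes every test index.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # stand-in for 100 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # The train window grows; the test window always lies strictly after it
    print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
    assert train_idx.max() < test_idx.min()  # no temporal overlap
```

For a single split rather than cross-validation, the same principle applies: sort by timestamp and slice, never shuffle.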
Holdout split vs cross-validation — when each is appropriate
A single holdout split is fast but noisy — performance depends on which samples ended up in test. Cross-validation runs multiple splits and averages the result, giving a more reliable estimate. But it is k times slower and requires that all preprocessing fits inside each fold (a Pipeline).
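Combining the two ideas — cross-validation for reliability, a Pipeline so preprocessing refits inside each fold — is a one-liner with `cross_val_score`. A sketch on synthetic data:

```python
# 5-fold cross-validation. Because the estimator is a Pipeline,
# the scaler is refit on each fold's training portion, so no leakage.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(pipe, X, y, cv=5)  # one accuracy per fold
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The fold-to-fold standard deviation is itself useful: it tells you how noisy a single holdout estimate would have been.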
How much data to put in each split — rules of thumb
There is no universally correct split ratio. The right ratio depends on how much data you have and what you need from each split. Here are the rules practitioners actually use:
- Under 1,000 rows: skip the holdout entirely — use cross-validation for every estimate.
- 1,000–10,000 rows: an 80/20 train/test split, with cross-validation on the training portion for tuning.
- 10,000–100,000 rows: a 70/15/15 three-way split.
- Over 100,000 rows: 80/10/10 — even 10% is enough rows for a reliable estimate.
Data is collected, cleaned, scaled, encoded, and split. You are ready to build models.
This completes Section 4 — Data Engineering for ML. You can now take any raw dataset, clean it, engineer features, encode categoricals, scale numerics, and split it correctly without leaking information. These five modules are the foundation every ML model you will ever build sits on.
Section 5 — Classical Machine Learning — begins next. Module 21 answers the question you have been building toward: what actually is machine learning? Not the Wikipedia definition, but the actual idea: what training means mechanically, the three types of ML, the seven-step workflow, and twelve key terms defined once and for all. Every algorithm in the section — linear regression, logistic regression, decision trees, random forests — builds on the data engineering foundation you have just completed.
🎯 Key Takeaways
- ✓You need three splits — not two — because using the test set to make tuning decisions turns it into a second validation set. Over many experiments you silently overfit to it. The test set must be touched exactly once, at the very end, to get an honest performance estimate.
- ✓Training set: the model fits on this. Validation set: you use the score to make tuning decisions — the model never trains on it. Test set: the final honest evaluation — touched once, never used for decisions.
- ✓Always split before any preprocessing. Fitting a scaler, encoder, or imputer on the full dataset (before splitting) leaks test set statistics into training. Use sklearn Pipeline to make this structurally impossible.
- ✓Use stratify=y for classification problems. Without stratification, a random split can put most of the minority class into one split — making evaluation unreliable and hyperparameter tuning misleading.
- ✓For time-series data, random splits are fundamentally wrong. They put future data in the training set and past data in test — leaking information that would not exist at prediction time. Always use chronological splits: train on past, evaluate on future. Use TimeSeriesSplit for cross-validation.
- ✓Split size depends on dataset size. Under 1,000 rows: use cross-validation, no holdout. 1k–10k: 80/20 with CV for tuning. 10k–100k: 70/15/15 three-way split. Over 100k: 80/10/10 is reliable.