Data Cleaning and Validation
Turn raw, messy data into reliable ML training sets. Schema validation, duplicate detection, type coercion, outlier handling, and automated rules that catch problems before they reach your model.
Garbage in, garbage out — and the garbage is invisible until your model ships.
A 2020 survey by Anaconda found data scientists spend 45% of their time cleaning data. That number hasn't changed much since. But the more dangerous problem isn't the time it takes — it's the errors that slip through uncleaned and silently corrupt a model that looks fine in evaluation but behaves wrong in production.
Consider what happens at Swiggy. The orders table has negative distances from data entry errors, delivery times of 0 minutes from cancelled orders that were never removed, and duplicate records from a retry bug in the mobile app. City names are spelled three different ways — "Bangalore", "Bengaluru", "bangalore". Star ratings of 6 persist from a frontend validation bug that was fixed three months ago. None of these cause your training script to crash. All of them silently degrade your model.
This module gives you a systematic process — not a one-time cleaning script, but a validation framework that runs automatically every time new data arrives and catches problems before they reach training.
What this module covers:
The messy Swiggy dataset used throughout this module
Run this block once to create a realistic messy dataset with deliberate data quality problems. All sections in this module clean and validate it.
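The module's actual seeding block isn't reproduced here; below is a minimal sketch of what such a generator might look like, with the problem types described above deliberately injected. The column names (`order_id`, `city`, `distance_km`, `delivery_time_min`, `rating`, `price`) and the output filename are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Column names are illustrative -- adapt to the module's actual dataset.
df = pd.DataFrame({
    "order_id": np.arange(1, n + 1),
    "city": rng.choice(["Bangalore", "Bengaluru", "bangalore", "Mumbai", "Delhi"], n),
    "distance_km": np.round(rng.uniform(0.5, 15, n), 2),
    "delivery_time_min": np.round(rng.uniform(10, 60, n), 1),
    "rating": rng.integers(1, 6, n),  # legal ratings: 1..5
    "price": np.round(rng.uniform(100, 800, n), 2).astype(str),
})

# Inject deliberate quality problems:
df.loc[rng.choice(n, 10, replace=False), "distance_km"] *= -1        # entry errors
df.loc[rng.choice(n, 15, replace=False), "delivery_time_min"] = 0.0  # cancelled orders
df.loc[rng.choice(n, 5, replace=False), "rating"] = 6                # frontend bug
df.loc[rng.choice(n, 8, replace=False), "price"] = "N/A"             # string poison
dupes = df.sample(25, random_state=0)                                # retry-bug duplicates
df = pd.concat([df, dupes], ignore_index=True)

df.to_csv("swiggy_messy.csv", index=False)
print(df.shape)
```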
Data quality audit — know exactly what you are dealing with
The first rule of data cleaning: audit before you touch anything. Running a comprehensive quality report on a new dataset takes five minutes and reveals every problem you'll spend hours debugging if you skip it. It also gives you a baseline so you can prove the data got better after cleaning.
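A five-minute audit might look like the sketch below — one row of the report per column, plus a duplicate count. The function name `quality_report` is illustrative, not the module's own.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column audit: dtype, null rate, unique count, numeric range."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "null_pct": round(s.isna().mean() * 100, 2),
            "n_unique": s.nunique(),
            "min": s.min() if pd.api.types.is_numeric_dtype(s) else None,
            "max": s.max() if pd.api.types.is_numeric_dtype(s) else None,
        })
    print(f"rows={len(df)}, exact duplicates={df.duplicated().sum()}")
    return pd.DataFrame(rows)
```

Save the returned report before cleaning — it is the baseline you compare against afterwards.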
Schema validation — define what valid data looks like
A schema is a contract: a precise description of what each column should contain. Validating against a schema answers: are the right columns present? Are they the right types? Are values in the expected ranges? Are required columns non-null? Schema validation is the first gate every new dataset should pass before any further processing.
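A hand-rolled schema contract can be as simple as a dict of rules checked against each batch. The columns and bounds below are assumptions drawn from the Swiggy example; the validation function is a sketch, not a library API.

```python
import pandas as pd

# Illustrative contract: type, nullability, and range per column.
SCHEMA = {
    "order_id":          {"dtype": "int",   "nullable": False},
    "distance_km":       {"dtype": "float", "nullable": False, "min": 0.0, "max": 50.0},
    "delivery_time_min": {"dtype": "float", "nullable": False, "min": 1.0, "max": 180.0},
    "rating":            {"dtype": "int",   "nullable": True,  "min": 1,   "max": 5},
}

_KIND = {"int": pd.api.types.is_integer_dtype, "float": pd.api.types.is_float_dtype}

def validate_schema(df: pd.DataFrame, schema: dict) -> list[str]:
    errors = []
    for col, rule in schema.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            continue
        s = df[col]
        if not _KIND[rule["dtype"]](s):
            errors.append(f"{col}: expected {rule['dtype']}, got {s.dtype}")
        if not rule["nullable"] and s.isna().any():
            errors.append(f"{col}: {s.isna().sum()} nulls in non-nullable column")
        if "min" in rule and (s.dropna() < rule["min"]).any():
            errors.append(f"{col}: values below {rule['min']}")
        if "max" in rule and (s.dropna() > rule["max"]).any():
            errors.append(f"{col}: values above {rule['max']}")
    return errors
```

An empty list means the batch passes the gate; anything else should stop the pipeline.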
Duplicate detection — exact and near-duplicate removal
Duplicates are more than a storage problem. In ML, duplicate training examples cause the model to overweight those records — whatever pattern they represent gets amplified. A duplicate rate of 5% can meaningfully skew a model trained on imbalanced data. There are two kinds of duplicates: exact copies and near-duplicates (same record, slightly different values from a retry or data merging issue).
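Both kinds can be handled in two passes, sketched below: exact duplicates share the business key, and near-duplicates (a retry may mint a fresh `order_id`) are caught by hashing fields a retry cannot change. The `stable_fields` names, including `placed_at`, are assumptions for illustration.

```python
import hashlib
import pandas as pd

def dedupe(df: pd.DataFrame, key: str = "order_id",
           stable_fields: tuple = ("city", "distance_km", "placed_at")) -> pd.DataFrame:
    """Two-pass deduplication: exact (business key), then near (stable-field hash)."""
    # Pass 1: rows sharing the business key are exact duplicates; keep the first.
    df = df.drop_duplicates(subset=key, keep="first")
    # Pass 2: a retry can generate a new order_id, so fingerprint the fields it
    # cannot change; rows with the same hash are the same logical order.
    fp = (df[list(stable_fields)].astype(str).agg("|".join, axis=1)
          .map(lambda s: hashlib.md5(s.encode()).hexdigest()))
    return df.loc[~fp.duplicated(keep="first")].reset_index(drop=True)
```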
Type coercion — columns stored as the wrong dtype
Type errors are the most common data quality problem after nulls. A price column stored as strings because someone entered "N/A" once. A boolean column stored as the integers 0 and 1 — but also containing 2. A date column stored as free text in three different formats. None of these fail at load time; they surface later, deep inside .values or .fit() — or not at all.
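All three cases coerce cleanly with pandas, as sketched below. The sample values mirror the problems described above; note that `format="mixed"` requires pandas 2.0 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["120.5", "N/A", "340.0"],
    "is_refunded": [0, 1, 2],                                  # boolean with a stray 2
    "ordered_at": ["2024-01-05", "05/01/2024", "Jan 5 2024"],  # three date formats
})

# Numeric: bad strings become NaN instead of crashing downstream.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Dates: format="mixed" (pandas >= 2.0) parses heterogeneous strings per element.
df["ordered_at"] = pd.to_datetime(df["ordered_at"], format="mixed", errors="coerce")

# Boolean-ish ints: map only the legal values; everything else becomes NaN.
df["is_refunded"] = df["is_refunded"].map({0: False, 1: True})

# Always count the damage after coercion -- a high NaN count means an
# upstream problem, not a few typos.
print(df.isna().sum())
```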
String cleaning — normalise free text and categoricals
String columns are the messiest part of any real dataset. "Bangalore", "bangalore", "Bengaluru", "BANGALORE", "bangalore " — these are five representations of the same city, and they will be treated as five separate categories by any ML model. String cleaning must be systematic, not case-by-case.
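A systematic approach combines normalisation (strip, case-fold) with an explicit alias map for known variants, as in this sketch. The alias map is illustrative — in practice you build it from the audit's unique-value list.

```python
import pandas as pd

# Alias map for known variants (illustrative; built from the audit's unique values).
CITY_CANON = {"bangalore": "Bengaluru", "bengaluru": "Bengaluru"}

def clean_city(s: pd.Series) -> pd.Series:
    s = s.str.strip()                 # "bangalore " -> "bangalore"
    key = s.str.lower()               # case-insensitive lookup key
    # Known aliases map to the canonical name; unknown cities get Title Case.
    return key.map(CITY_CANON).fillna(s.str.title())

cities = pd.Series(["Bangalore", "bangalore", "Bengaluru", "BANGALORE", "bangalore "])
print(clean_city(cities).unique())    # one category instead of five
```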
Outlier detection and treatment
Not all outliers are errors. Some are genuine extreme values — a restaurant that genuinely takes 90 minutes to prepare food, or an order delivered in 8 minutes because it was around the corner. The first question is always: is this an error or a real edge case? Only then do you decide what to do with it.
IQR method — the standard robust outlier detector
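The IQR rule flags any value outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. The sketch below flags rather than drops, so the error-or-edge-case decision stays with you; function names are illustrative.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Robust bounds: quartiles are barely affected by the outliers themselves."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    lo, hi = iqr_bounds(df[col], k)
    df = df.copy()
    df[f"{col}_is_outlier"] = ~df[col].between(lo, hi)   # flag, don't drop
    return df
```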
Consistency checks — rules that span multiple columns
Individual column validation catches per-column problems. But some data quality issues only appear when you look at two columns together. A delivery time of 10 minutes for a distance of 15km is physically impossible. A 5-star rating with a note "worst experience ever" is inconsistent. These cross-column rules are called consistency checks and they catch a class of errors that column-level validation misses entirely.
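Cross-column rules are just boolean masks over multiple columns, as in this sketch. The constants — a floor of one minute per km plus five minutes of handling, and a 45-minute lateness cutoff — come from the module's running example.

```python
import pandas as pd

def consistency_violations(df: pd.DataFrame) -> dict:
    """Count rows violating each cross-column rule."""
    rules = {
        # Faster than one minute per km plus 5 minutes of handling: impossible.
        "impossible_speed": df["delivery_time_min"] < df["distance_km"] + 5,
        # The label must agree with the column it was derived from.
        "label_mismatch": df["is_late"] != (df["delivery_time_min"] > 45),
    }
    return {name: int(mask.sum()) for name, mask in rules.items()}
```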
Schema drift — catch when upstream data changes silently
Schema drift is when the data coming from an upstream source changes in a way nobody told you about. A new column appears. An existing column is renamed. A categorical column gains a new value. A numeric column suddenly contains nulls. Any of these will silently break a downstream ML pipeline — the training runs, metrics look plausible, but the model has learned from corrupted data.
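Drift detection reduces to two steps: capture a reference fingerprint once, then diff every new batch against it. The sketch below covers columns, dtypes, null rates, and categorical values; the 5-point null-rate threshold is an arbitrary illustrative choice.

```python
import pandas as pd

def fingerprint(df: pd.DataFrame) -> dict:
    """Reference snapshot: columns, dtypes, null rates, allowed categories."""
    return {
        "columns": set(df.columns),
        "dtypes": {c: str(df[c].dtype) for c in df.columns},
        "null_rates": {c: float(df[c].isna().mean()) for c in df.columns},
        "categories": {c: set(df[c].dropna().unique())
                       for c in df.select_dtypes(include=["object", "string"]).columns},
    }

def detect_drift(ref: dict, new: pd.DataFrame) -> list[str]:
    cur = fingerprint(new)
    alerts = []
    for c in cur["columns"] - ref["columns"]:
        alerts.append(f"new column appeared: {c}")
    for c in ref["columns"] - cur["columns"]:
        alerts.append(f"column disappeared: {c}")
    for c in ref["columns"] & cur["columns"]:
        if ref["dtypes"][c] != cur["dtypes"][c]:
            alerts.append(f"{c}: dtype {ref['dtypes'][c]} -> {cur['dtypes'][c]}")
        if cur["null_rates"][c] > ref["null_rates"][c] + 0.05:  # illustrative threshold
            alerts.append(f"{c}: null rate jumped")
        if c in ref["categories"]:
            extra = cur["categories"].get(c, set()) - ref["categories"][c]
            if extra:
                alerts.append(f"{c}: new categories {sorted(extra)}")
    return alerts
```

Any non-empty alert list should halt the pipeline before the batch reaches training.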
Great Expectations — automated validation at scale
Great Expectations (GX) is the standard open-source library for data validation in production ML pipelines. Instead of writing custom validation code, you define "expectations" — declarative statements about what your data should look like. GX runs them against your data and produces a detailed HTML report. It integrates with Airflow, dbt, Spark, and every major data platform.
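GX stores a suite as declarative JSON. The dicts below mirror that shape (the 0.x suite format; field names have shifted across major versions) without importing the library — the point is what "expectations as data" look like. The expectation names shown (`expect_column_to_exist`, `expect_column_values_to_not_be_null`, `expect_column_values_to_be_between`, `expect_column_values_to_be_in_set`) are real GX expectation types; the suite name and columns are assumptions.

```python
# Shape of a GX expectation suite, shown as plain dicts (illustrative sketch;
# the real library wraps these in its own classes and a validation context).
suite = {
    "expectation_suite_name": "swiggy_orders",
    "expectations": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "order_id"}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "order_id"}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "rating", "min_value": 1, "max_value": 5}},
        {"expectation_type": "expect_column_values_to_be_in_set",
         "kwargs": {"column": "city", "value_set": ["Bengaluru", "Mumbai", "Delhi"]}},
    ],
}
```

In a pipeline, GX loads a suite like this, runs it against each new batch via a checkpoint, and renders the results as an HTML Data Docs report.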
The complete cleaning pipeline — one class, all steps
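The module's full pipeline class isn't reproduced here; this sketch shows the skeleton — ordered steps, with row counts logged per step so the cleaning is auditable. Step bodies are simplified versions of the sections above, and column names remain the illustrative Swiggy ones.

```python
import pandas as pd

class CleaningPipeline:
    """Chains cleaning steps in order; logs row counts per step for auditability."""

    def __init__(self):
        self.log: list[str] = []

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        for step in (self.coerce_types, self.drop_duplicates,
                     self.drop_impossible, self.flag_and_clip_outliers):
            before = len(df)
            df = step(df)
            self.log.append(f"{step.__name__}: {before} -> {len(df)} rows")
        return df

    def coerce_types(self, df):
        df = df.copy()
        df["distance_km"] = pd.to_numeric(df["distance_km"], errors="coerce")
        return df

    def drop_duplicates(self, df):
        # Deduplicate on the business key, not all columns.
        return df.drop_duplicates(subset="order_id", keep="first")

    def drop_impossible(self, df):
        # Provably impossible values: negative distance, zero delivery time.
        return df[(df["distance_km"] > 0) & (df["delivery_time_min"] > 0)]

    def flag_and_clip_outliers(self, df):
        df = df.copy()
        hi = df["delivery_time_min"].quantile(0.99)
        df["delivery_time_min_clipped"] = df["delivery_time_min"] > hi  # flag first
        df["delivery_time_min"] = df["delivery_time_min"].clip(upper=hi)
        return df
```

After `run()`, `pipe.log` shows exactly how many rows each step removed — the audit trail that makes the cleaning defensible.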
Every common data cleaning error — explained and fixed
You now have clean, validated data. It's time to build features from it.
Cleaning removes what is wrong. Validation catches new problems automatically. Together they ensure that the data reaching your model is trustworthy. The next module — Feature Engineering — takes clean data and transforms it into the representations that make ML models learn fastest and generalise best. Distance becomes log-distance. Timestamps become hour-of-day, day-of-week, and cyclical encodings. Categorical columns become embeddings or one-hot vectors. This transformation step consistently produces larger improvements than changing the model architecture.
Transform raw columns into powerful model inputs — log transforms, interaction features, target encoding, embeddings, and the feature engineering techniques that consistently outperform model tuning.
🎯 Key Takeaways
- ✓ Always audit before fixing. Run a comprehensive quality report first — count nulls, check ranges, list unique values, detect duplicates. You cannot clean systematically what you have not measured.
- ✓ Schema validation is a contract. Define every column's type, nullability, allowed values, and range as explicit code. Run validation on every new data batch — not just once during development.
- ✓ Deduplicate on the business key (order_id), not all columns. Full-row deduplication is slow and misses near-duplicates. Hash the fields a retry cannot change to catch near-duplicates from retry bugs and data merges.
- ✓ Type errors are silent killers. Use pd.to_numeric(errors="coerce") for numeric columns and pd.to_datetime(format="mixed", errors="coerce") for dates. Then count how many values became NaN after coercion — a large number signals a serious upstream problem.
- ✓ Three outlier strategies: remove provably impossible values (delivery_time=0), clip extreme-but-real values to the 99th percentile AND add a flag column, or keep outliers and let the model handle them. Never clip without flagging — you lose information silently.
- ✓ Consistency checks span multiple columns. A row where delivery_time < distance_km + 5 (less than one minute per km plus five minutes of handling) is physically impossible. A row where is_late != (delivery_time > 45) is a label error. These cross-column rules catch a class of errors that column-level validation misses entirely.
- ✓ Schema drift detection is not optional for production pipelines. Capture a reference schema fingerprint (dtypes, null rates, value ranges, allowed categories) and compare every new batch against it. A 35% shift in mean delivery time or a new city value should trigger an alert before the data reaches your model.