Data Quality in Production Pipelines — What to Check and When
Bad data flowing silently through a pipeline is worse than a broken pipeline: a broken pipeline stops, but bad data corrupts reports, misleads analysts, and drives business decisions based on wrong numbers. Data quality checks are not optional; they are the difference between a pipeline and a reliable pipeline.
The four categories of data quality issues
Completeness: required fields are null or missing. A sales record with no customer_id is unusable.
Accuracy: values are present but wrong. A revenue field showing negative values or dates in the future.
Consistency: the same entity has different representations. Customer ID 101 in one system, CUST-101 in another.
Timeliness: data is correct but arrived too late to be useful. Last week's inventory data used for today's reorder decision.
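The first two categories can be checked record by record. A minimal sketch in plain Python, assuming an illustrative sales-record schema (customer_id, revenue, order_date — none of these field names come from a specific system):

```python
from datetime import date

def check_record(rec: dict) -> list[str]:
    """Return the quality issues found in one record.
    Field names are illustrative assumptions, not a real schema."""
    issues = []
    # Completeness: required field is null, missing, or empty
    if not rec.get("customer_id"):
        issues.append("completeness: customer_id is missing")
    # Accuracy: value is present but implausible
    revenue = rec.get("revenue")
    if revenue is not None and revenue < 0:
        issues.append("accuracy: revenue is negative")
    order_date = rec.get("order_date")
    if order_date is not None and order_date > date.today():
        issues.append("accuracy: order_date is in the future")
    return issues
```

Consistency and timeliness are deliberately absent here: both require context beyond a single record (a second system to compare against, or an arrival deadline), so they belong in batch-level checks rather than per-row ones.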
Where to apply checks in the Medallion Architecture
Bronze layer: check row counts only. Did we receive data? Is the file size plausible? Never transform or filter at Bronze — just alert if something looks wrong.
Silver layer: this is where quality checks live. Check nulls on required columns, remove duplicates, validate ranges, cast types. Rows that fail checks go to a quarantine table, not the trash.
Gold layer: check aggregation logic. Does total revenue match the sum of line items? Are there unexpected zeros in KPI columns?
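The Bronze and Gold ends of this split can be sketched as two small gate functions. This is a pure-Python illustration, not a Spark or SQL implementation; the 50% plausibility threshold and the penny-level reconciliation tolerance are assumptions, not standards:

```python
def bronze_gate(row_count: int, file_bytes: int,
                expected_rows: int, expected_bytes: int,
                tolerance: float = 0.5) -> list[str]:
    """Bronze: alert only, never transform or filter.
    Thresholds are illustrative assumptions."""
    alerts = []
    if row_count == 0:
        alerts.append("no rows received")
    elif row_count < expected_rows * tolerance:
        alerts.append(f"row count {row_count} below {tolerance:.0%} of expected {expected_rows}")
    if file_bytes < expected_bytes * tolerance:
        alerts.append(f"file size {file_bytes} bytes implausibly small")
    return alerts

def gold_reconciliation(total_revenue: float, line_item_amounts: list[float],
                        tolerance: float = 0.01) -> list[str]:
    """Gold: the aggregate must reconcile with its inputs."""
    expected = sum(line_item_amounts)
    if abs(total_revenue - expected) > tolerance:
        return [f"total revenue {total_revenue} != sum of line items {expected}"]
    return []
```

Both functions return alert lists rather than raising, so the caller decides whether a finding is warn-and-continue (Bronze plausibility) or stop-the-pipeline (Gold reconciliation).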
Practical checks to implement first
Start with these five checks in every Silver notebook:
1. Row count — did we get any data? Is the count suspiciously low compared to yesterday?
2. Null check on primary keys — quarantine any row where the join key is null rather than dropping it
3. Duplicate check — deduplicate on natural key before writing
4. Range validation — is order_amount between 0 and 1,000,000? Flag extremes
5. Referential integrity — does every customer_id in orders exist in the customers table?
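The five checks above can be sketched in one pass over a batch. This is a plain-Python stand-in for what would be DataFrame or SQL logic in a real Silver notebook; the column names (order_id, customer_id, order_amount) and the 50% drop-alert threshold are assumptions:

```python
def silver_checks(orders: list[dict], customers: list[dict],
                  yesterday_count: int) -> tuple[list[dict], list[dict], list[str]]:
    """Apply the five starter checks to a batch of orders.
    Returns (clean_rows, rejected_rows, alerts).
    Schema and thresholds are illustrative assumptions."""
    alerts, rejected, clean, seen = [], [], [], set()

    # 1. Row count: any data at all? suspiciously low vs yesterday?
    if len(orders) == 0:
        alerts.append("no rows received")
    elif len(orders) < 0.5 * yesterday_count:
        alerts.append(f"row count {len(orders)} is <50% of yesterday ({yesterday_count})")

    known_customers = {c["customer_id"] for c in customers}
    for row in orders:
        # 2. Null check on the primary/join keys
        if row.get("order_id") is None or row.get("customer_id") is None:
            rejected.append({**row, "reject_reason": "null key"})
            continue
        # 3. Duplicate check on the natural key
        if row["order_id"] in seen:
            rejected.append({**row, "reject_reason": "duplicate order_id"})
            continue
        seen.add(row["order_id"])
        # 4. Range validation (a missing amount is treated as 0 here)
        if not (0 <= row.get("order_amount", 0) <= 1_000_000):
            rejected.append({**row, "reject_reason": "order_amount out of range"})
            continue
        # 5. Referential integrity against the customers dimension
        if row["customer_id"] not in known_customers:
            rejected.append({**row, "reject_reason": "unknown customer_id"})
            continue
        clean.append(row)
    return clean, rejected, alerts
```

Note that failing rows are tagged with a reject_reason and collected rather than discarded, which feeds directly into the quarantine pattern described next.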
What to do when checks fail
Never silently drop bad rows in production. Write them to a quarantine or rejected records table with the reason for rejection.
This lets you investigate root causes: is the source system sending bad data? Is the schema changing? Is a new upstream team sending test data into production?
For critical pipelines: raise an alert and stop the pipeline. A Gold table with missing data is better than a Gold table with wrong data that analytics teams trust.
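The alert-and-stop policy can be expressed as a fail-fast gate between the checks and the Gold write. A minimal sketch: quarantine is tolerated up to a threshold, beyond which the run halts. The 5% reject ratio is an illustrative assumption, not a recommended value; tune it per pipeline:

```python
class DataQualityError(Exception):
    """Raised to halt a critical pipeline when quality checks fail."""

def enforce(clean_rows: list, rejected_rows: list,
            max_reject_ratio: float = 0.05) -> None:
    """Stop the pipeline if too large a share of the batch was rejected.
    The 5% threshold is an illustrative assumption."""
    total = len(clean_rows) + len(rejected_rows)
    if total and len(rejected_rows) / total > max_reject_ratio:
        raise DataQualityError(
            f"{len(rejected_rows)}/{total} rows rejected; stopping before Gold write")
```

Because the exception is raised before the Gold write, a failing run leaves the Gold table stale rather than wrong, which is exactly the trade-off argued for above.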