Model Monitoring — Drift Detection and Retraining
How to know your model is degrading before users complain. Data drift, concept drift, Evidently AI, and automated retraining triggers.
A model trained in January works well in January. By June, the world has changed — new fraud patterns, different traffic, monsoon season — and the model is quietly producing wrong predictions that no one has noticed yet.
Every deployed model degrades over time. The question is not whether it will degrade but whether you will notice before your users do. Without monitoring, degradation is discovered via user complaints, lost revenue, or a business review revealing that a metric which looked fine six months ago is now at half its original value. With monitoring, you catch it in days, not months.
Two types of drift cause degradation. Data drift: the input feature distribution has shifted. Swiggy trained on pre-monsoon delivery patterns. During monsoon, distance_km distributions shift (longer routes around flooded roads), is_peak_hour patterns shift (orders cluster differently), and the model receives inputs far from what it was trained on. Concept drift: the relationship between features and the target has changed. A fraud model trained before a new fraud scheme emerged correctly identifies old patterns but the new scheme looks like legitimate traffic to it. The input distribution may look the same but the correct output for those inputs has changed.
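A tiny synthetic sketch (all numbers invented) makes the distinction concrete: data drift shifts the inputs themselves, while concept drift changes the input-to-target rule even when the inputs look identical.

```python
import random
import statistics

random.seed(42)

# Reference period: distance_km centred around 4 km.
reference = [random.gauss(4.0, 1.0) for _ in range(1000)]

# Data drift: monsoon reroutes push distances up -- P(X) changes.
monsoon = [random.gauss(5.5, 1.5) for _ in range(1000)]
# The feature means diverge, detectable without any labels.
mean_shift = statistics.mean(monsoon) - statistics.mean(reference)

# Concept drift: same inputs, but the input->target rule changes.
# Before: delivery_min = 8 * distance; after: flooding adds a fixed delay.
def target_before(distance_km):
    return 8.0 * distance_km

def target_after(distance_km):
    return 8.0 * distance_km + 15.0  # P(Y|X) changes, P(X) does not

# For the identical input 4 km, the correct answer has moved:
print(target_before(4.0), target_after(4.0))  # → 32.0 47.0
```

The first shift shows up in feature statistics immediately; the second is invisible until actual delivery times arrive and disagree with the predictions.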
Consider a doctor trained in the 1990s on the medical knowledge of that era. Data drift: the patient population has changed — more diabetes, more sedentary lifestyles, new drug interactions. Concept drift: the same symptoms now indicate different conditions due to new pathogens. A good doctor keeps learning. A monitoring system is the mechanism that tells the doctor which patients they are getting wrong, so they know what to study next.
Monitoring without ground truth labels is like checking a patient's vital signs without doing bloodwork. You can detect that something is wrong (features look unusual) but not what is wrong (model is wrong) without comparing predictions to actual outcomes. Both layers — leading indicators and lagging indicators — are needed.
Data drift, concept drift, and prediction drift — three failure modes
KS test, PSI, and chi-squared — detecting distribution shift statistically
Drift detection requires comparing two distributions: the reference distribution (features seen during training) and the current distribution (features seen this week). Three statistical tests are standard. The Kolmogorov-Smirnov test measures the maximum difference between two empirical CDFs — works for continuous features, no binning required. Population Stability Index (PSI) measures how much a distribution has shifted — widely used in credit risk and fraud at Indian banks. Chi-squared test compares observed vs expected frequencies — works for categorical features.
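The three tests can be sketched in plain Python. This is a simplified illustration — production code would typically use `scipy.stats.ks_2samp` and `scipy.stats.chisquare`, which also return p-values; the quantile-binned PSI below follows the common credit-risk convention, and the bin count is an assumption:

```python
import bisect
import math
from collections import Counter

def ks_statistic(ref, cur):
    """KS statistic: maximum vertical gap between the two empirical CDFs."""
    ref, cur = sorted(ref), sorted(cur)
    ecdf = lambda sample, x: bisect.bisect_right(sample, x) / len(sample)
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in set(ref) | set(cur))

def psi(ref, cur, bins=10):
    """Population Stability Index over quantile bins of the reference.

    Rule of thumb: < 0.10 stable, 0.10-0.20 investigate, > 0.20 retrain.
    """
    ref_sorted = sorted(ref)
    edges = [ref_sorted[len(ref_sorted) * i // bins] for i in range(1, bins)]

    def bin_fracs(sample):
        counts = [0] * bins
        for x in sample:
            counts[bisect.bisect_right(edges, x)] += 1
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)

    r, c = bin_fracs(ref), bin_fracs(cur)
    return sum((ci - ri) * math.log(ci / ri) for ri, ci in zip(r, c))

def chi2_statistic(ref_cats, cur_cats):
    """Chi-squared statistic for a categorical feature (statistic only;
    a p-value needs the chi-squared distribution, e.g. via scipy)."""
    ref_counts, cur_counts = Counter(ref_cats), Counter(cur_cats)
    stat = 0.0
    for cat in set(ref_counts) | set(cur_counts):
        expected = ref_counts.get(cat, 0) / len(ref_cats) * len(cur_cats)
        if expected > 0:  # categories unseen in reference are skipped here
            stat += (cur_counts.get(cat, 0) - expected) ** 2 / expected
    return stat
```

A weekly job would run these per feature against the training-time reference and flag any feature whose statistic crosses its threshold.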
Evidently AI — automated drift reports and monitoring dashboards
Evidently generates HTML drift reports comparing a reference dataset (training data) to a current dataset (recent production data). It runs the appropriate statistical test per feature type automatically, generates visual distribution comparisons, and produces a JSON summary that can be parsed to trigger retraining alerts. It is a widely used open-source monitoring tool at Indian startups.
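A minimal sketch of this flow, assuming an Evidently 0.4-style API (`Report` with `DataDriftPreset`); the result field names vary across Evidently versions, and the decision thresholds here are assumptions, not library defaults:

```python
def retraining_decision(drift_share, high_threshold=0.4):
    """Map the share of drifted features to an action (thresholds assumed)."""
    if drift_share > high_threshold:
        return "retrain"
    if drift_share > 0.1:
        return "investigate"
    return "ok"

def run_drift_report(reference, current, html_path="drift_report.html"):
    """Generate an Evidently drift report and return the drifted-feature share.

    `reference` and `current` are pandas DataFrames with the same columns.
    Imports are local so this sketch parses without Evidently installed.
    """
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    report.save_html(html_path)   # HTML dashboard for humans

    result = report.as_dict()     # JSON-style summary for machines
    # Field layout matches 0.4-style output; check your installed version.
    return result["metrics"][0]["result"]["share_of_drifted_columns"]
```

A daily Airflow task could call `run_drift_report` on the last seven days of features and route the returned share through `retraining_decision`.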
Performance monitoring — track actual model accuracy over time
Drift monitoring detects input distribution shifts without labels — it is a leading indicator. Performance monitoring requires ground truth labels and is the lagging indicator. For delivery time prediction: the actual delivery time is available 30-60 minutes after prediction. For fraud detection: chargebacks confirm fraud 7-30 days after the transaction. The monitoring system joins predictions with delayed labels and tracks accuracy metrics over rolling windows.
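The prediction-label join can be sketched as follows — a simplified in-memory version (in production the join would run in a warehouse or feature store, and the window and metric names here are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Prediction:
    prediction_id: str
    predicted_at: datetime
    predicted_minutes: float

def rolling_mae(predictions, labels, window=timedelta(days=1), now=None):
    """MAE over predictions inside the window whose ground truth has arrived.

    `labels` maps prediction_id -> actual value; predictions still waiting
    for a label (e.g. a chargeback not yet confirmed) are simply skipped.
    """
    now = now or datetime.utcnow()
    errors = [
        abs(p.predicted_minutes - labels[p.prediction_id])
        for p in predictions
        if p.prediction_id in labels and now - p.predicted_at <= window
    ]
    return sum(errors) / len(errors) if errors else None

# Example: two labeled predictions in the window, one stale prediction outside it.
now = datetime(2024, 6, 10, 12, 0)
preds = [
    Prediction("a", now - timedelta(hours=2), 30.0),
    Prediction("b", now - timedelta(hours=1), 25.0),
    Prediction("c", now - timedelta(days=3), 40.0),  # outside the 1-day window
]
labels = {"a": 35.0, "b": 24.0}
print(rolling_mae(preds, labels, now=now))  # → 3.0  ((5 + 1) / 2)
```

Tracking this value daily, alongside a signed-error (bias) version of the same loop, is what turns delayed labels into a degradation signal.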
Automated retraining — trigger, retrain, evaluate, promote
Manual retraining — a data scientist noticing a metric, running a notebook, and deploying — does not scale to dozens of models. Automated retraining monitors metrics and triggers the training pipeline (Module 69) when thresholds are breached. The trigger calls the Airflow DAG or Prefect flow with a flag indicating emergency retraining. The pipeline runs, evaluates the new model, and either promotes it automatically (if above a quality threshold) or sends a human alert for review.
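The trigger logic itself is small. A sketch under assumed thresholds (a 25% performance drop, 40% of features drifted, 10,000 new samples, 24-hour cooldown) — the pipeline-invocation call is shown only as a comment because its exact form depends on your orchestrator:

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=24)  # prevents retrain loops (duration assumed)

def should_trigger(perf_drop_pct, drifted_feature_share, new_samples,
                   last_retrain, now):
    """Return a severity level if retraining should start, else None."""
    if now - last_retrain < COOLDOWN:
        return None  # still cooling down from the previous retrain
    if perf_drop_pct > 0.25:
        return "critical"   # retrain immediately, page on-call
    if drifted_feature_share > 0.40:
        return "high"       # retrain, notify via Slack
    if new_samples > 10_000:
        return "medium"     # fold into the next scheduled retrain
    return None

severity = should_trigger(
    perf_drop_pct=0.30, drifted_feature_share=0.10, new_samples=500,
    last_retrain=datetime(2024, 6, 9, 0, 0), now=datetime(2024, 6, 10, 12, 0),
)
if severity == "critical":
    # Here you would invoke the training pipeline, e.g. trigger the Airflow
    # DAG or Prefect flow with an emergency-retrain flag in its run config.
    pass
```

Evaluation and promotion of the resulting model then happen inside the pipeline itself, with a human alert as the fallback when the new model misses the quality threshold.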
Every common monitoring mistake — explained and fixed
Your model is monitored. Next: act on what monitoring finds — and retrain safely.
Monitoring tells you when to retrain. But when you retrain, you need to know exactly what data produced each model — so you can reproduce results, audit decisions, and debug regressions. Module 73 covers retraining pipelines with champion-challenger evaluation, safe model promotion, and the rollback patterns that protect production when a new model unexpectedly underperforms after deployment.
🎯 Key Takeaways
- ✓ Two types of drift cause model degradation. Data drift: input feature distributions shift (P(X) changes) — detectable immediately without labels using statistical tests. Concept drift: the relationship between features and target changes (P(Y|X) changes) — requires ground truth labels and is invisible until labels arrive. Both need separate monitoring strategies.
- ✓ Three statistical tests cover all feature types: KS test for continuous features (compares empirical CDFs, p-value + statistic threshold), PSI for continuous features (industry standard in Indian finance: PSI < 0.10 safe, 0.10-0.20 investigate, > 0.20 retrain), chi-squared for categorical features (compares observed vs expected frequencies).
- ✓ Evidently AI automates drift reporting: run Report(metrics=[DataDriftPreset()]) with reference and current DataFrames. It selects the right test per feature type, generates HTML dashboards, and produces JSON results for programmatic alerting. Schedule as a daily Airflow task, then parse the JSON results to trigger retraining.
- ✓ Performance monitoring requires delayed ground truth labels. Log every prediction with a prediction_id. When labels arrive (actual delivery time, fraud confirmation), join them back to predictions. Compute rolling metrics (MAE, within-N-minutes rate, bias) over daily/weekly windows. A 25%+ MAE increase is a critical retraining trigger.
- ✓ Automated retraining trigger hierarchy: critical (performance drop > 25% → retrain immediately), high (> 40% features drifted or prediction distribution shifted 2σ+ → retrain if 2+ high triggers), medium (> 10,000 new samples accumulated → scheduled retrain). Add a 24-hour cooldown to prevent retrain loops.
- ✓ Alert fatigue is the primary failure mode of monitoring systems. Use effect size thresholds, not just p-values — require KS statistic > 0.10, not just p < 0.05. Require drift to persist 3 consecutive days before triggering. Route by severity: critical = page on-call, high = Slack, medium = weekly digest. A monitoring system that fires 20 false alarms per week is worse than no monitoring.