Retraining Pipelines — Keeping Models Fresh
Champion-challenger evaluation, safe model promotion, and rollback patterns that protect production when a new model underperforms after deployment.
Training a new model is easy. Safely replacing the production model with the new one without breaking anything — that is the hard part most teams get wrong.
Module 72 explained when to retrain. This module explains how. The naive approach is to train a new model, compare its offline metrics to the old model's offline metrics, and deploy if better. This fails in practice for two reasons. First, offline metrics measured on a held-out test set do not always predict online performance: a model with better MAE on the test set can perform worse on the real distribution of live traffic because of subtle differences in how the test set was constructed. Second, even if the new model is genuinely better, deploying it incorrectly (no rollback plan, no gradual traffic shifting, no real-time comparison against the incumbent) exposes all production traffic to an unproven model at once.
The production-safe retraining pipeline has five stages: automated training, offline evaluation with quality gates, shadow deployment for real-traffic validation, champion-challenger A/B testing for live comparison, and gradual promotion with automated rollback if the challenger underperforms. Each stage is a checkpoint that a bad model cannot pass silently.
Picture a Formula 1 pit crew replacing a tyre during a race. They do not stop the car entirely (that would lose the race). They do not just bolt the new tyre on without checking it first (that would crash the car). They have a rehearsed procedure: jack the car, change the tyre, check it is secure, lower the car, driver goes. If anything is wrong, they abort and diagnose. The whole process takes about 2 seconds because every step is practiced and every failure mode is handled. Retraining pipelines are the same: a procedure so well-engineered that updating a model in production takes minutes with zero downtime.
The champion model is the tyre that got the car this far. The challenger is the new tyre. You do not swap until you are certain the new tyre is at least as good. And you keep the old tyre nearby in case you need to switch back in a hurry.
Automated train → offline eval → shadow → A/B → promote
Offline evaluation — the champion baseline and quality gate logic
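The gate logic can be sketched as a single comparison function. This is a minimal sketch with illustrative thresholds (the 5% MAE tolerance, R² floor, and leakage floor are assumptions, not fixed values): the challenger passes only if its MAE is within tolerance of the champion's, its R² clears a minimum, and its MAE is not suspiciously low, which usually signals data leakage rather than a genuinely better model.

```python
def passes_quality_gate(challenger_mae, champion_mae, challenger_r2,
                        min_r2=0.5, mae_tolerance=1.05, leakage_floor=0.01):
    """Return (passed, reason). All thresholds are illustrative defaults.

    A challenger whose MAE is a tiny fraction of the champion's is more
    likely leaking labels into features than genuinely 100x better, so a
    suspiciously low MAE fails the gate too.
    """
    if challenger_mae < leakage_floor * champion_mae:
        return False, "MAE suspiciously low - possible data leakage"
    if challenger_mae > mae_tolerance * champion_mae:
        return False, "MAE worse than champion by more than tolerance"
    if challenger_r2 < min_r2:
        return False, "R^2 below minimum"
    return True, "passed"
```

A gate that returns a reason string, not just a boolean, makes pipeline logs auditable: when a retrain is rejected, the log says why.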
Shadow deployment — test on live traffic without user impact
Shadow deployment runs the challenger model on every real production request in parallel with the champion. Users receive the champion's prediction. The challenger's prediction is logged but never returned. After 24 hours you have the challenger's predictions on the actual live traffic distribution — not a held-out test set — and can compare both models' predictions when ground truth labels arrive. This catches distribution drift between the test set and live traffic that offline evaluation misses.
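A minimal sketch of the serving path, assuming model objects with a `predict` method (names hypothetical): the champion's prediction is returned synchronously, while the challenger is scored on a background thread so shadow scoring can never delay or fail the user-facing response.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("shadow")
_pool = ThreadPoolExecutor(max_workers=4)

def predict(request, champion, challenger):
    """Serve the champion; score the challenger off the request path."""
    champion_pred = champion.predict(request)
    # Fire-and-forget: the user never waits on the challenger.
    _pool.submit(_shadow_score, request, challenger, champion_pred)
    return champion_pred

def _shadow_score(request, challenger, champion_pred):
    try:
        challenger_pred = challenger.predict(request)
        # Log both predictions keyed by request id; join with ground
        # truth later to compare the two models on live traffic.
        logger.info("shadow request_id=%s champion=%s challenger=%s",
                    request["id"], champion_pred, challenger_pred)
    except Exception:
        # A crashing challenger must never take down serving.
        logger.exception("challenger failed in shadow")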
Champion-challenger A/B — statistically rigorous live comparison
Shadow deployment shows how the challenger would have performed on past requests. A/B testing sends some users to the challenger in real time and measures the actual business impact. The challenger must beat the champion with statistical significance — not just look marginally better due to random chance. A Welch's t-test or Mann-Whitney U test determines whether the difference in prediction error is significant given the sample size.
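Two pieces make this rigorous, sketched below under stated assumptions: consistent hash routing on `user_id` so the same user always sees the same model, and a Welch's t-test on per-request absolute errors (`scipy.stats.ttest_ind` with `equal_var=False`). The 5% significance level and 1% practical-improvement threshold are illustrative defaults, not prescriptions.

```python
import hashlib
import numpy as np
from scipy import stats

def route(user_id, challenger_pct=10):
    """Consistent hash routing: a given user always hits the same model."""
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

def should_promote(champion_errors, challenger_errors,
                   alpha=0.05, min_improvement=0.01):
    """Promote only on statistical AND practical significance."""
    # Welch's t-test: does not assume equal variances between arms.
    _, p_value = stats.ttest_ind(challenger_errors, champion_errors,
                                 equal_var=False)
    champion_mae = float(np.mean(champion_errors))
    challenger_mae = float(np.mean(challenger_errors))
    improvement = (champion_mae - challenger_mae) / champion_mae
    promote = bool(p_value < alpha and improvement > min_improvement)
    return promote, p_value, improvement
```

Hash routing also makes experiments reproducible: re-running the assignment for any user gives the same arm, so logs can be joined deterministically.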
Gradual promotion and automated rollback — the safety net
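The ramp-and-rollback loop can be sketched as follows. The `router` and `metrics` objects and their methods are hypothetical stand-ins for a real traffic router and metrics store; the step schedule and the 15% MAE / 50% latency rollback thresholds mirror the numbers used in this module but are configurable.

```python
import time

TRAFFIC_STEPS = [0.10, 0.25, 0.50, 1.00]  # illustrative ramp schedule

def promote_gradually(router, metrics, champion_stats,
                      mae_limit=1.15, latency_limit=1.50,
                      soak_seconds=3600):
    """Shift traffic step by step; roll back instantly on regression."""
    for fraction in TRAFFIC_STEPS:
        router.set_challenger_fraction(fraction)
        time.sleep(soak_seconds)  # let live metrics accumulate at this step
        live = metrics.window()   # hypothetical: recent MAE and p99 latency
        if (live.challenger_mae > mae_limit * champion_stats.mae or
                live.challenger_p99 > latency_limit * champion_stats.p99):
            router.set_challenger_fraction(0.0)  # all traffic back to champion
            return False
    return True  # challenger survived every step: it is the new champion
```

Because rollback is a single `set_challenger_fraction(0.0)` call, the champion must stay deployed and warm until the ramp completes; tearing it down early removes the safety net.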
Every common retraining pipeline mistake — explained and fixed
Models retrain safely. Next: version your data like you version code.
Safe retraining requires knowing exactly which data produced each model. Module 74 covers DVC (Data Version Control) — tracking datasets as first-class artifacts alongside code, so every model has a reproducible lineage: this model was trained on this exact dataset, with this exact feature pipeline, at this exact code commit. Reproduce any past experiment in one command.
Version datasets like code. DVC pipelines, remote storage, experiment tracking, and the full DVC + Git workflow for ML projects.
🎯 Key Takeaways
- ✓Safe retraining is a five-stage pipeline with gates at each stage: automated training → offline quality gate → shadow deployment → champion-challenger A/B → gradual promotion with auto-rollback. A bad model cannot silently pass all five stages. Each stage catches a different failure mode that the previous stages miss.
- ✓The offline quality gate compares challenger to champion on a fixed held-out test set using strict conditions: challenger MAE must be ≤ 105% of champion MAE, R² must be above minimum, and MAE must not be suspiciously low (which indicates data leakage). The test set must be updated monthly — stale test sets fail to catch distribution drift.
- ✓Shadow deployment runs the challenger on 100% of live traffic in parallel with the champion. Users receive only the champion prediction. After 24 hours with ground truth labels, compare both models on actual live distribution. This catches test-distribution mismatch that offline evaluation misses — it is the most important gate before A/B testing.
- ✓Champion-challenger A/B uses consistent hash routing on user_id so each user always goes to the same model — preventing mixed predictions for the same user. Require statistical significance (Welch t-test p < 0.05) and practical significance (improvement > 1%) before promoting. Set a maximum experiment duration — inconclusive experiments should favour the champion.
- ✓Gradual promotion shifts traffic in steps: 10% → 25% → 50% → 100% over several hours. Monitor MAE and p99 latency at each step. Auto-rollback if challenger MAE exceeds champion MAE by 15% or challenger p99 latency exceeds champion p99 by 50%. Use a blue-green Deployment, not a rolling update, for rollback: switching the Kubernetes Service selector is instant and atomic.
- ✓Four critical gotchas: test set staleness (update monthly with recent data), temporal leakage in retraining (use point-in-time correct feature retrieval), A/B test never concluding (pre-calculate required sample size with power analysis), and rollback leaving mixed state (always use blue-green, never rolling update for production model swaps).