MLOps on Cloud — CI/CD for ML
GitHub Actions triggering retraining, model quality gates in CI, automated deployment to staging and production across Azure ML, SageMaker, and Vertex AI.
Software CI/CD is: push code → tests run → deploy if green. ML CI/CD is: push code or data → train model → quality gates run → deploy to staging → shadow test → promote to production. The same idea, four more steps.
Software engineers take CI/CD for granted. A pull request opens, unit tests run, integration tests run, and the change deploys automatically if everything passes. ML teams almost never have this. Retraining is manual — someone runs a notebook when they remember. Deployment is manual — someone SSHes into a server and restarts a process. Quality gates are absent — a worse model can reach production because no one compared it to the incumbent. This is the gap that ML CI/CD closes.
The complete ML CI/CD pipeline has two orthogonal triggers. Code changes: a pull request modifying the training script or feature pipeline runs tests, trains a model on a small data sample, checks quality, and blocks the merge if it fails. Data changes or schedules: a weekly cron job or a drift alert triggers full retraining on production data, runs evaluation, compares to the champion model, and deploys if the challenger wins. Both flows use GitHub Actions — the same CI/CD tool your software team already uses.
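The two triggers can live in a single GitHub Actions workflow file. This is a minimal sketch, not a complete pipeline — the path filters, cron schedule, script names (`train.py`, `scripts/submit_training.py`), and timeout value are illustrative assumptions, not prescribed by the text:

```yaml
# Sketch: one workflow, two orthogonal triggers. Names and paths are examples.
name: ml-ci
on:
  pull_request:
    paths: ["src/**", "training/**"]   # code-change trigger
  schedule:
    - cron: "0 2 * * 1"                # weekly retraining trigger (Mon 02:00 UTC)
  workflow_dispatch:                    # manual retraining with custom inputs

jobs:
  pr-checks:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/                       # unit tests
      - run: python train.py --sample-rows 500   # smoke train on a small sample

  retrain:
    if: github.event_name != 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 120                # never let a stalled cloud job hang CI
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/submit_training.py   # full training on production data
```

The `if:` conditions on each job are what keep the flows separate: PR events never pay for full cloud training, and scheduled runs never waste time on merge checks.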
A car manufacturer runs two quality checks. The design check happens when engineers submit a blueprint change — does the new design meet safety standards on paper? The production check happens on the assembly line — does the manufactured car meet standards in reality? ML CI/CD is the same: code-change tests (does the training script work correctly on a small sample?) and data-change tests (does the model trained on new production data beat the current production model?). Neither check alone is sufficient. Together they ensure nothing bad reaches customers.
The key shift in mindset: in ML, the model is not just code — it is code plus data plus hyperparameters. CI/CD must test all three dimensions simultaneously. A code change that looks fine in unit tests might produce a degraded model when combined with production data. The full pipeline test is the only reliable check.
Four-stage ML CI/CD pipeline — what runs, when, and what gates block progress
The pull request workflow — test ML code like software code
The retraining workflow — weekly cron, full cloud training, quality gate, deploy
Five scripts every ML CI pipeline needs — platform-agnostic patterns
Platform-agnostic CI/CD — the same workflow adapted for SageMaker and Vertex AI
The GitHub Actions workflow structure is identical across all three cloud platforms. Only the submit and compare scripts change. The abstraction pattern: write a thin adapter for each platform behind a common interface. The CI workflow calls the interface — it does not care which cloud is underneath. This lets you migrate between platforms without rewriting the entire CI pipeline.
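A minimal sketch of that adapter pattern in Python. The interface methods match the ones listed later in the takeaways; the stub return values stand in for real SDK calls (azure-ai-ml, boto3, google-cloud-aiplatform), which are assumptions for illustration:

```python
from abc import ABC, abstractmethod
import os

class CloudMLAdapter(ABC):
    """Common interface the CI workflow calls; one subclass per cloud."""

    @abstractmethod
    def submit_training_job(self, config: dict) -> str: ...
    @abstractmethod
    def wait_for_job(self, job_name: str) -> str: ...
    @abstractmethod
    def register_model(self, job_name: str, model_name: str) -> str: ...
    @abstractmethod
    def deploy_to_endpoint(self, model_id: str, endpoint: str) -> None: ...

class AzureMLAdapter(CloudMLAdapter):
    # A real implementation would call the azure-ai-ml SDK in each method.
    def submit_training_job(self, config): return "azureml-job-001"
    def wait_for_job(self, job_name): return "Completed"
    def register_model(self, job_name, model_name): return f"{model_name}:1"
    def deploy_to_endpoint(self, model_id, endpoint): pass

class SageMakerAdapter(CloudMLAdapter):
    # A real implementation would call boto3's sagemaker client in each method.
    def submit_training_job(self, config): return "sm-job-001"
    def wait_for_job(self, job_name): return "Completed"
    def register_model(self, job_name, model_name): return f"{model_name}-pkg-1"
    def deploy_to_endpoint(self, model_id, endpoint): pass

_ADAPTERS = {"azureml": AzureMLAdapter, "sagemaker": SageMakerAdapter}

def get_adapter() -> CloudMLAdapter:
    """Pick the platform from the ML_PLATFORM env var — the only
    cloud-specific knob the CI workflow ever touches."""
    return _ADAPTERS[os.environ.get("ML_PLATFORM", "azureml")]()
```

The CI scripts import only `get_adapter()`; switching clouds is then a one-line change to the workflow's `env:` block rather than a rewrite of every step.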
Every common ML CI/CD mistake — explained and fixed
The Cloud ML Platforms section and the entire AI/ML track are complete. Module 80 is your interview preparation — 50 complete ML answers.
You have covered 79 modules across nine sections: Math and Statistics, Python for ML, Classical ML, Deep Learning, NLP, Computer Vision, Generative AI, MLOps, and Cloud ML Platforms. Every concept connects to the next. Every module includes working code and real Indian company examples. Module 80 is the capstone — 50 complete answers to the most common ML engineering interview questions asked at Swiggy, Razorpay, Flipkart, CRED, and every other major Indian tech company.
The 50 most-asked ML engineering questions across Swiggy, Razorpay, Flipkart, CRED, and Indian tech — with complete, ready-to-deliver answers.
🎯 Key Takeaways
- ✓ ML CI/CD has two orthogonal triggers: code changes (PR opened → unit tests + smoke test on 500 rows → block merge if any test fails) and data/schedule changes (weekly cron or drift alert → full cloud training → quality gate → staging deploy → production promote). Keep them separate — running full cloud training on every commit is wasteful and defeats the purpose of fast PR feedback.
- ✓ Four-stage pipeline with a gate at every stage: unit tests + smoke test (no cloud cost, fast feedback), cloud training + champion comparison (MAE within 5% tolerance), staging integration + load tests (p99 < 500ms, error rate < 1%), gradual production promotion (10% → 50% → 100% with auto-rollback). A model cannot reach production unless it passes all four gates.
- ✓ Store champion metrics explicitly when promoting a model — add val_mae as a tag on the registry entry. The compare_models.py script must retrieve the champion metric reliably. When no champion exists (first deployment), use a fallback of 999.0 so the first model always promotes. Log both metrics to Slack on every run so the team can visually sanity-check every comparison.
- ✓ Platform-agnostic adapter pattern: write a CloudMLAdapter abstract class with submit_training_job, wait_for_job, register_model, and deploy_to_endpoint methods. Implement AzureMLAdapter and SageMakerAdapter (and VertexAdapter). Select via the ML_PLATFORM environment variable. The GitHub Actions workflow calls the interface — never the platform SDK directly. Migrating clouds means swapping one env var.
- ✓ GitHub Actions mechanics for ML: use outputs to pass data between jobs (job_name, challenger_mae, should_promote), needs: to enforce job order, if: conditions to skip stages when challengers fail, environment: with manual approval gates for production, and workflow_dispatch with inputs for manual retraining with custom parameters. Always set timeout-minutes on jobs that call cloud training APIs.
- ✓ Four common CI/CD failures: a job stalls and times out without failing (set explicit timeouts at both the CI and cloud job level, and add a cleanup step to cancel the cloud job), comparison logic promotes a worse model (store and retrieve champion metrics explicitly, add sanity range checks), a race condition runs production before staging (verify the needs: chain with the GitHub workflow visualiser), training runs on every commit and costs spike (separate code-change CI from data-change retraining triggers).
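The champion-comparison gate described above can be sketched as one small, platform-neutral function. The tolerance interpretation (the challenger's MAE may be at most 5% above the champion's), the function name, and the exact sanity-range check are assumptions about how a compare_models.py might work, not a fixed specification:

```python
from typing import Optional

def should_promote(challenger_mae: float,
                   champion_mae: Optional[float],
                   tolerance: float = 0.05) -> bool:
    """Promote the challenger if its MAE is within tolerance of the champion's.
    champion_mae is None on the very first deployment (no champion yet)."""
    FALLBACK = 999.0  # no champion in the registry → any sane challenger wins
    champion = champion_mae if champion_mae is not None else FALLBACK
    # Sanity range check: a zero or negative MAE usually means a broken
    # evaluation, not a miracle model — fail the gate rather than promote.
    if not (0.0 < challenger_mae < FALLBACK):
        return False
    return challenger_mae <= champion * (1 + tolerance)
```

A few illustrative calls: `should_promote(1.2, None)` promotes (first deployment), `should_promote(10.0, 10.4)` promotes (better than champion), `should_promote(11.0, 10.0)` blocks (more than 5% worse), and `should_promote(0.0, 10.0)` blocks on the sanity check.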