Model Deployment — FastAPI, Docker, Kubernetes
Wrap your model in a FastAPI endpoint, containerise with Docker, scale with Kubernetes. Full working deployment of the Swiggy delivery time model.
A trained model is a pkl file sitting on your laptop. Deployment means turning that pkl file into an API that handles thousands of requests per minute, survives crashes, and can be updated without downtime.
The standard production ML deployment stack at Indian startups is three layers. FastAPI wraps the model in an HTTP endpoint — it receives a JSON request, extracts features, runs the model, and returns a JSON prediction. Docker packages the API and all its dependencies into a container that runs identically on any machine. Kubernetes runs many containers in parallel, restarts crashed ones, and distributes incoming traffic across all of them.
Swiggy's delivery time prediction API serves 200,000 requests per minute during dinner peak hours. A single Python process handles perhaps 50 requests per second. To handle 200,000 per minute (3,333 per second) you need roughly 70 parallel processes. Kubernetes manages those 70 containers automatically — scaling up during peak hours and down at 3 AM to save compute cost. This is the deployment stack this module teaches.
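The sizing arithmetic above is worth making explicit. The 50 requests per second per process is the assumption from the text; everything else follows:

```python
import math

peak_rpm = 200_000          # requests per minute at dinner peak
per_process_rps = 50        # assumed throughput of one Python process

peak_rps = peak_rpm / 60                            # ≈ 3,333 requests per second
processes = math.ceil(peak_rps / per_process_rps)   # 67 processes at minimum
# Provision roughly 70 replicas to leave headroom above the bare minimum.
```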
A chef (your model) → a restaurant (FastAPI) → a franchise kit (Docker) → a franchise management company (Kubernetes). One chef cooking in their own kitchen is a model in a notebook. Opening a restaurant adds a standardised environment, a menu (the API contract), and a way for customers to order. Packaging the restaurant as a franchise kit (Docker) means any city can run an identical restaurant with the same recipes, regardless of local conditions. The franchise management company (Kubernetes) opens more locations when demand spikes and closes them when demand drops.
Docker solves "works on my machine." Kubernetes solves "stays running at scale." FastAPI solves "speaks HTTP." Together they are how every production ML model at Indian tech companies is served.
FastAPI — production model serving with validation, health checks, and versioning
FastAPI is the standard for Python model serving — faster than Flask, automatic request validation via Pydantic, automatic OpenAPI docs, async support, and type hints throughout. A production model API needs more than just a predict endpoint: a health check endpoint that Kubernetes uses to restart crashed pods, a readiness endpoint that signals when the model is loaded and ready, request validation that rejects malformed inputs before they reach the model, and versioned endpoints so you can deploy a new model without breaking existing clients.
Docker — package everything so it runs identically everywhere
"It works on my machine" is not acceptable in production. Docker solves this by packaging the application, its dependencies, and its runtime environment into a single image that runs identically on your laptop, on the CI server, and in production. The image is built once and deployed everywhere. A production ML Docker image has one additional concern: keeping image size small — a 10GB image takes 5 minutes to pull on a new node, causing slow cold starts.
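A multi-stage Dockerfile is the standard way to keep the image small. This is a sketch with assumed file names (requirements.txt, app/main.py); the build stage carries the compiler, the runtime stage carries only the installed packages:

```dockerfile
# Build stage: gcc and build headers live here, not in the final image
FROM python:3.11-slim AS build
RUN apt-get update && apt-get install -y --no-install-recommends gcc \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: only installed packages plus application code
FROM python:3.11-slim
COPY --from=build /install /usr/local
COPY app/ /app/
WORKDIR /app
# Model is NOT baked into the image — loaded from object storage at startup
ENV MODEL_PATH=s3://models/delivery-time/latest.pkl
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Keeping the model out of the image means a new model version is a config change, not a rebuild, and the image stays small enough to pull quickly on a fresh node.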
Kubernetes — run, scale, and update containers in production
Kubernetes (K8s) manages containerised applications at scale. You tell it what you want (5 replicas of this container, restart if it crashes, distribute traffic across all replicas) and it makes it happen. Three Kubernetes objects matter most for ML serving: Deployment (define the container and how many replicas), Service (expose the deployment as a network endpoint), and HorizontalPodAutoscaler (automatically add replicas when CPU usage is high, remove them when it drops).
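A sketch of those three objects wired together for the delivery-time API. Names, the image registry, and resource numbers are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: delivery-time-api
spec:
  replicas: 5
  selector:
    matchLabels: {app: delivery-time-api}
  template:
    metadata:
      labels: {app: delivery-time-api}
    spec:
      containers:
      - name: api
        image: registry.example.com/delivery-time-api:1.0  # hypothetical
        ports: [{containerPort: 8000}]
        resources:
          requests: {cpu: 500m, memory: 512Mi}
          limits: {cpu: "1", memory: 1Gi}
        livenessProbe:
          httpGet: {path: /health, port: 8000}
          initialDelaySeconds: 60   # give the model time to load
        readinessProbe:
          httpGet: {path: /ready, port: 8000}
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: delivery-time-api
spec:
  selector: {app: delivery-time-api}
  ports: [{port: 80, targetPort: 8000}]
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: delivery-time-api
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: delivery-time-api}
  minReplicas: 5
  maxReplicas: 70
  metrics:
  - type: Resource
    resource:
      name: cpu
      target: {type: Utilization, averageUtilization: 70}
```

The readiness probe hits /ready, not /health: a pod whose model has not finished loading stays out of the Service's load-balancing pool without being restarted.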
Rolling updates, canary releases, and blue-green deployment
Swiggy cannot take the delivery time model offline to update it. Every second of downtime means delayed delivery estimates, poor user experience, and drivers idling without assignments. Production model updates must be zero-downtime. Three patterns handle this with increasing safety.
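The rolling-update pattern is a one-field change on the Deployment. A sketch of the relevant fragment:

```yaml
# Inside the Deployment's spec: never drop below full capacity during a rollout
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # old pods are only removed once new pods are ready
    maxSurge: 25%       # temporarily run up to 25% extra pods while updating
```

For a canary, one common pattern is a second Deployment (say, delivery-time-api-canary) running 1 replica alongside 19 stable ones under the same Service label selector, so roughly 5% of traffic hits the new model while you watch its metrics. Blue-green keeps two full Deployments labelled, for example, version: blue and version: green, and cuts over instantly by editing the Service's selector — rollback is the same edit in reverse.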
Load testing — verify your deployment handles production traffic before it sees it
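One concrete shape for the verification step: collect per-request latencies from your load-testing tool, then check them against the SLO targets this module uses (p50 < 50ms, p95 < 200ms, p99 < 500ms, error rate < 0.1%). The latencies below are synthetic; in a real run they come from the tool's output:

```python
import statistics

def slo_report(latencies_ms, errors, total):
    """Summarise a load-test run against the module's SLO targets."""
    q = statistics.quantiles(latencies_ms, n=100)  # q[i] = (i+1)th percentile
    report = {
        "p50_ms": q[49],
        "p95_ms": q[94],
        "p99_ms": q[98],
        "error_rate": errors / total,
    }
    report["slo_ok"] = (
        report["p50_ms"] < 50
        and report["p95_ms"] < 200
        and report["p99_ms"] < 500
        and report["error_rate"] < 0.001
    )
    return report

# Synthetic run: mostly fast responses with a slow tail
latencies = [10] * 900 + [100] * 90 + [400] * 10
print(slo_report(latencies, errors=0, total=1000))
```

Run this as a gate in CI or before promoting a canary: if slo_ok is false, the new version does not get production traffic.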
Every common deployment mistake — explained and fixed
Your model is live. Next: know when it starts degrading before your users do.
Deploying a model is not the end — it is the beginning of monitoring. Models degrade silently as the world changes around them. The fraud patterns Razorpay trained on in January look different by June. The delivery time patterns from pre-monsoon do not hold during monsoon season. Module 72 covers drift detection and monitoring — how to know your model is degrading before users notice, and how to trigger automatic retraining when it does.
How to know your model is degrading before users complain. Data drift, concept drift, Evidently AI, and automated retraining triggers.
🎯 Key Takeaways
- ✓ The production ML deployment stack is three layers: FastAPI (wrap model in HTTP endpoint with validation, health checks, and versioning), Docker (package everything into a reproducible container), Kubernetes (run, scale, and update containers without downtime). This is the standard at Swiggy, Flipkart, Razorpay, and every Indian unicorn.
- ✓ A production FastAPI model API needs four endpoints beyond /predict: /health (liveness probe — is the container alive), /ready (readiness probe — is the model loaded), /v1/predict (versioned, never break old clients), and /v1/predict/batch (batch endpoint for throughput). Always validate inputs with Pydantic before they reach the model.
- ✓ Use multi-stage Docker builds to keep images small: build stage installs gcc and dependencies, runtime stage copies only the installed packages. python:3.11-slim not python:3.11. Never bake model artifacts into the image — load from S3/GCS at startup via MODEL_PATH env var. Target: under 200MB for scikit-learn models.
- ✓ Kubernetes Deployment + Service + HPA is the standard serving setup. Key settings: maxUnavailable: 0 (never drop below desired replicas during update), livenessProbe initialDelaySeconds = model load time (60-120s), readinessProbe removes pod from load balancer if model is not ready, resource requests and limits prevent one pod from starving others.
- ✓ Three update strategies: Rolling Update (default, simple, zero downtime, brief mixed traffic), Canary (send 5% traffic to new model, monitor, then promote — safest for ML models), Blue-Green (instant cutover by switching Service selector, instant rollback, requires 2× resources briefly). Use canary for new model versions where quality change is uncertain.
- ✓ Always load-test before going live. SLO targets: p50 < 50ms, p95 < 200ms, p99 < 500ms, error rate < 0.1%, availability 99.9%. The most common deployment error is CrashLoopBackOff — check kubectl logs pod-name --previous immediately. The most dangerous is silent feature mismatch — add integration tests that compare API predictions to notebook predictions on the same input.