ML System Design — End to End
Design any ML system from scratch. The framework, tradeoffs, capacity estimation, and how to present it in a senior ML engineering interview.
Every ML system design problem has the same eight questions. Answer them in order and you will never miss a critical component.
ML system design interviews — and real ML architecture discussions — feel open-ended and overwhelming. You are handed a problem like "design Swiggy's delivery time prediction system" and expected to produce a coherent architecture in 45 minutes. Without a framework you will either forget something important or spend 30 minutes on model selection when the interviewer cares about serving infrastructure.
The framework below is not a rigid script — it is a checklist of the questions every ML system must answer. Work through them in order. Each answer constrains the next. The latency requirement determines whether you can use online or batch serving. The scale requirement determines whether you need a feature store. The feedback loop determines how you detect drift. By the time you have answered all eight you have a complete architecture.
An architect designing a building does not start by choosing the colour of the walls. They start with: who lives here, how many people, what activities happen inside, what is the budget, what are the structural constraints of the land. The colour comes last. ML system design is the same — the model choice (colour of the walls) comes after you understand the data availability, latency requirements, and scale constraints. Most candidates start with model selection and never get to the questions that actually determine system feasibility.
In ML system design interviews, an interviewer would rather see you ask the right clarifying questions than immediately jump to "I would use a Transformer." The right questions demonstrate systems thinking. The immediate model answer demonstrates pattern matching.
Design Swiggy's delivery time prediction system — full walkthrough
This is the most commonly asked ML design question in Indian interviews. Delivery time estimation appears at Swiggy, Zomato, Dunzo, Blinkit, and every quick-commerce startup. Walk through all eight questions.
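One part of the walkthrough that every candidate should be able to sketch is the fallback path (question eight, failure modes): never block an order because the model endpoint is down. Below is a minimal illustrative sketch of a rule-based ETA fallback — the names, average speed, and peak-hour padding are invented for the example, not Swiggy's actual values:

```python
# Hypothetical rule-based fallback for delivery time estimation.
# All constants are illustrative assumptions, not production values.

def fallback_eta_minutes(distance_km: float, prep_time_min: float,
                         peak_hour: bool = False) -> float:
    """Static fallback used when the ML model is unavailable:
    restaurant prep time plus travel time at an assumed average
    city riding speed, padded during peak hours."""
    avg_speed_kmph = 18.0                    # assumed average speed
    travel_min = distance_km / avg_speed_kmph * 60
    eta = prep_time_min + travel_min
    if peak_hour:
        eta *= 1.25                          # assumed peak-hour padding
    return round(eta, 1)

print(fallback_eta_minutes(3.0, 12.0))                  # → 22.0
print(fallback_eta_minutes(3.0, 12.0, peak_hour=True))  # → 27.5
```

The fallback is deliberately boring: it needs no features, no network calls, and no model artefact, so it cannot fail for the same reasons the model did.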
Design Razorpay's real-time fraud detection system
Fraud detection is fundamentally different from delivery time prediction. The class imbalance is extreme (0.1% fraud rate). The cost asymmetry is severe (false negative = fraud loss, false positive = legitimate transaction declined = customer anger + lost revenue). Latency is critical — the prediction must complete before the payment clears. And the adversary is adaptive — fraudsters study and evade every model.
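One way to make the cost asymmetry concrete in the interview is to pick the decision threshold that minimises expected cost rather than maximising accuracy. The costs, scores, and labels below are made-up assumptions for illustration:

```python
# Threshold selection by expected cost, not accuracy.
# fn_cost / fp_cost and the scored transactions are toy assumptions.

def expected_cost(threshold, scored_txns, fn_cost=5000.0, fp_cost=200.0):
    """scored_txns: list of (fraud_probability, is_fraud) pairs.
    A missed fraud (false negative) costs fn_cost (chargeback loss);
    declining a legitimate transaction (false positive) costs fp_cost
    (customer anger, lost revenue)."""
    cost = 0.0
    for score, is_fraud in scored_txns:
        flagged = score >= threshold
        if is_fraud and not flagged:
            cost += fn_cost          # false negative: fraud slips through
        elif flagged and not is_fraud:
            cost += fp_cost          # false positive: good txn declined
    return cost

txns = [(0.95, True), (0.40, True), (0.30, False), (0.05, False)]
best = min((expected_cost(t, txns), t) for t in (0.2, 0.5, 0.8))
print(best)   # the severe FN cost pushes the optimum toward a low threshold
```

Because a missed fraud costs 25x a declined legitimate transaction in this toy setup, the cost-optimal threshold is the lowest one considered — exactly the recall-leaning behaviour the asymmetry predicts.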
Design Meesho's product recommendation system — two-stage retrieval
Recommendation systems are the third most common ML design question after delivery time and fraud. The key insight almost every candidate misses: you cannot run a complex ranking model over 50 million products. The two-stage architecture — fast retrieval of 100-500 candidates, then expensive ranking of just those candidates — is how every production recommendation system works at scale.
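The two-stage pattern can be sketched on toy data — a cheap similarity scan to retrieve candidates, then an "expensive" ranker that only ever sees those candidates. All embeddings, item ids, and features below are invented; a production stage one uses an ANN index (e.g. FAISS or ScaNN) rather than a full scan:

```python
# Toy retrieve-then-rank sketch. Data and scoring are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_emb, item_embs, k):
    """Stage 1: cheap dot-product similarity over the whole catalogue,
    keep only the top-k candidate item ids."""
    scored = sorted(item_embs, key=lambda kv: -dot(user_emb, kv[1]))
    return [item_id for item_id, _ in scored[:k]]

def rank(user_emb, candidates, item_embs, features):
    """Stage 2: the 'expensive' ranker runs only on the k candidates.
    Faked here as similarity plus a popularity feature."""
    embs = dict(item_embs)
    return sorted(candidates,
                  key=lambda i: -(dot(user_emb, embs[i]) + features[i]))

items = [("saree_1", [0.9, 0.1]), ("kurta_2", [0.8, 0.3]),
         ("shoes_3", [0.1, 0.9]), ("bag_4", [0.2, 0.7])]
popularity = {"saree_1": 0.2, "kurta_2": 0.5, "shoes_3": 0.1, "bag_4": 0.4}
user = [1.0, 0.2]

candidates = retrieve(user, items, k=2)
print(rank(user, candidates, items, popularity))  # → ['kurta_2', 'saree_1']
```

Note that the ranker reorders the candidates: retrieval put `saree_1` first on raw similarity, but the richer stage-two score promotes `kurta_2`. That division of labour is the whole point — stage one only needs enough recall to get the right items into the candidate set.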
Six recurring tradeoffs — know these and you can handle any ML design question
**Online vs batch serving.**
- Online: real-time prediction at request time. Required when the prediction depends on request context (fraud amount, delivery distance). Latency-sensitive, higher cost.
- Batch: pre-compute predictions for all entities daily. Possible when context does not change per-request (user recommendations pre-computed by user_id). Lower cost, higher throughput.

**Precision vs recall.**
- High precision (high threshold): fewer false positives. For fraud: fewer declined legitimate transactions. Cost: miss more fraud.
- High recall (low threshold): catch more fraud. Cost: higher false positive rate, more customer complaints.

**Model complexity vs latency.**
- Simple model (LightGBM): ~1ms inference, interpretable, less accurate. Deployed as a single CPU endpoint.
- Complex model (deep learning): 100ms+ inference, better accuracy. Requires GPU serving, model quantisation, or batching.

**Feature freshness vs cost.**
- Real-time features: maximum freshness, maximum cost. Requires streaming infrastructure (Kafka, Flink). For fast-changing signals (fraud velocity, driver location).
- Batch features: stale but cheap. Daily or hourly batch job. For slowly-changing signals (user purchase history, restaurant prep-time baseline).

**Global vs per-segment models.**
- Global model: simpler, one deployment, data pooling. Worse for underrepresented segments (tier-2 cities with little data).
- Per-segment models: better accuracy per segment, higher maintenance. n models to retrain, monitor, and deploy.

**Full automation vs human-in-the-loop.**
- Full automation: fast, scalable, no human cost. Risk: a wrong automated decision at scale (e.g. a fraud model bug blocks all transactions).
- Human review for high-stakes decisions: slower, expensive, often required for regulatory compliance. Fraud above Rs 1L, medical diagnosis, loan decisions.
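The precision/recall tradeoff above can be seen directly by sweeping the decision threshold over a handful of scored examples (toy data, for illustration only):

```python
# Raising the threshold trades recall for precision. Toy data.

def precision_recall(threshold, scored):
    """scored: list of (score, is_positive) pairs."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scored = [(0.9, True), (0.8, True), (0.7, False),
          (0.4, True), (0.3, False), (0.2, False)]
for t in (0.25, 0.75):
    p, r = precision_recall(t, scored)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

At the low threshold everything positive is caught (recall 1.0) at the price of false positives; at the high threshold precision reaches 1.0 but a true positive is missed. Where on that curve to sit is a business decision about the relative cost of each error, not a modelling decision.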
Time allocation and what interviewers are actually scoring
The MLOps section is complete. Section 12 — Cloud ML Platforms — connects everything to Azure ML, SageMaker, and Vertex AI.
You have completed the full MLOps section across seven modules: ML pipelines and feature stores, experiment tracking, model deployment, monitoring, retraining pipelines, DVC, and ML system design. Section 12 shows how all of this maps onto the managed cloud platforms — Azure ML, AWS SageMaker, and GCP Vertex AI — that most Indian enterprise ML teams use. The concepts are identical; the platforms automate the infrastructure so you can focus on the ML.
Azure Machine Learning Studio, compute clusters, AML Pipelines, AutoML, model registry, and online endpoints.
🎯 Key Takeaways
- ✓Every ML system design problem has the same eight questions answered in order: problem framing → data → features → model → serving → scale → monitoring → failure modes. Answer them in this order — each answer constrains the next. Jumping to model selection first is the most common interview mistake.
- ✓Problem framing before everything: what is the ML task type, what is the business metric (separate from ML metric), how are labels obtained, and what is the latency budget. These four answers determine the entire architecture. Never start designing until you have them.
- ✓Two-stage architecture is the universal pattern for recommendation and search: fast retrieval of 100-500 candidates (ANN search on pre-computed embeddings), then expensive ranking of only those candidates. Running a neural ranker over 50M products is impossible at real-time serving latency — two-stage is not an optimisation, it is a requirement.
- ✓Capacity estimation is not optional. Give numbers: Swiggy 580 peak RPS → 30 replicas × 20 RPS each at 10ms model latency. Meesho 15M DAU × 3 sessions × 4 requests = 2,083 avg RPS. Fraud detection 1,157 peak TPS at < 10ms model budget → 12 replicas. Interviewers score "thinking in numbers" explicitly.
- ✓Six recurring tradeoffs to master: online vs batch serving (depends on whether real-time features are required), precision vs recall (set the threshold by the relative cost of false negatives vs false positives), model complexity vs latency (start simple, add complexity only when the simple model plateaus), freshness vs cost (match feature compute frequency to how fast the signal changes), global vs per-segment models (add segment models when the global model underperforms by more than 10% on a segment), full automation vs human-in-the-loop (automate reversible low-stakes decisions, route irreversible high-stakes ones to humans).
- ✓Always address: cold start problem (new users/items with no history — content-based or popularity fallback), label strategy (how and when ground truth is obtained — delivery time is immediate, fraud is delayed 30 days), and fallback when model is unavailable (rule-based or static fallback — never block the core user action due to ML unavailability).
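The capacity figures in the takeaways are back-of-envelope arithmetic you can reproduce live in the interview. In the sketch below, the ~100 TPS per replica for the fraud service is an assumption chosen to be consistent with the stated 12 replicas at 1,157 peak TPS:

```python
# Back-of-envelope capacity estimation, matching the takeaway numbers.
import math

def avg_rps(dau, sessions_per_day, requests_per_session):
    """Average requests per second over a 86,400-second day."""
    return dau * sessions_per_day * requests_per_session / 86_400

def replicas_needed(peak_rps, rps_per_replica):
    """Minimum replica count to absorb peak load (no headroom)."""
    return math.ceil(peak_rps / rps_per_replica)

# Meesho: 15M DAU x 3 sessions x 4 requests ≈ 2,083 avg RPS
print(round(avg_rps(15_000_000, 3, 4)))      # → 2083
# Fraud: 1,157 peak TPS at ~100 TPS per replica (assumed) → 12 replicas
print(replicas_needed(1157, 100))            # → 12
```

In practice you would add headroom on top of the minimum (for deploys, replica failures, and traffic spikes), which is why stated replica counts often round up past the raw ceiling.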