
ML System Design — End to End

Design any ML system from scratch. The framework, tradeoffs, capacity estimation, and how to present it in a senior ML engineering interview.

60–75 min · March 2026
Section 11 · MLOps and Production
MLOps · 7 topics
The universal structure

Every ML system design problem has the same eight questions. Answer them in order and you will never miss a critical component.

ML system design interviews — and real ML architecture discussions — feel open-ended and overwhelming. You are handed a problem like "design Swiggy's delivery time prediction system" and expected to produce a coherent architecture in 45 minutes. Without a framework you will either forget something important or spend 30 minutes on model selection when the interviewer cares about serving infrastructure.

The framework below is not a rigid script — it is a checklist of the questions every ML system must answer. Work through them in order. Each answer constrains the next. The latency requirement determines whether you can use online or batch serving. The scale requirement determines whether you need a feature store. The feedback loop determines how you detect drift. By the time you have answered all eight you have a complete architecture.

Eight questions — answer in this order every time
1. Problem framing: What ML task exactly? What is the business metric? What does success look like?
2. Data: What data exists? How much? How is it labelled? What are the data quality issues?
3. Features: What features matter? How are they computed? Is point-in-time correctness needed?
4. Model: Which model family? Why? What are the tradeoffs vs simpler baselines?
5. Serving: Online or batch? What latency? How many RPS? How do features reach the model at inference?
6. Scale: How many predictions per day? Peak load? Storage requirements? Cost per prediction?
7. Monitoring: What drifts? How is it detected? When is retraining triggered?
8. Failure modes: What breaks first? What is the fallback when the model is unavailable?
🧠 Analogy — read this first

An architect designing a building does not start by choosing the colour of the walls. They start with: who lives here, how many people, what activities happen inside, what is the budget, what are the structural constraints of the land. The colour comes last. ML system design is the same — the model choice (colour of the walls) comes after you understand the data availability, latency requirements, and scale constraints. Most candidates start with model selection and never get to the questions that actually determine system feasibility.

In ML system design interviews, an interviewer would rather see you ask the right clarifying questions than immediately jump to "I would use a Transformer." The right questions demonstrate systems thinking. The immediate model answer demonstrates pattern matching.

Case study 1 — Swiggy

Design Swiggy's delivery time prediction system — full walkthrough

This is the most commonly asked ML design question in Indian interviews. Delivery time estimation appears at Swiggy, Zomato, Dunzo, Blinkit, and every quick-commerce startup. Walk through all eight questions.

Question 1 — Problem framing
Clarifying questions to ask:
Q: Is this shown to customers pre-order, in-flight, or both?
A: Pre-order (ordering decision) AND in-flight (tracking page) — two separate models.
Q: What is the business metric?
A: Customer-visible accuracy: % of orders within ±5 min of prediction.
Q: What counts as success?
A: Currently showing static "30–40 min" ranges — the model should beat this.
Q: Is underestimating or overestimating worse?
A: Underestimating (promising 20 min, delivering in 40) is worse — customer anger.
ML task: Regression — predict delivery time in minutes. Target: MAE < 5 min, within-5-min rate > 70%.
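Both targets are cheap to compute once deliveries complete. A minimal sketch of the two evaluation metrics, on illustrative (made-up) prediction/actual pairs:

```python
import numpy as np

# Hypothetical predicted vs actual delivery times (minutes) — toy numbers
actual    = np.array([28, 35, 42, 19, 55, 31, 24, 38])
predicted = np.array([30, 33, 47, 21, 48, 30, 29, 36])

errors        = np.abs(predicted - actual)
mae           = errors.mean()          # mean absolute error in minutes
within_5_rate = (errors <= 5).mean()   # fraction of orders within ±5 min

print(f"MAE:           {mae:.2f} min   (target < 5)")
print(f"Within ±5 min: {within_5_rate:.0%}      (target > 70%)")
```

Note the two metrics can disagree: a few extreme outliers inflate MAE while barely moving the within-5-min rate, which is why the customer-facing metric is the rate.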
Question 2 — Data
Historical orders: 200M+ completed orders with actual delivery time; labels available immediately after delivery. Good — but outliers from cancelled/redelivered orders need filtering.
Real-time GPS: driver location updated every 30s; route data, current speed, traffic. High volume — requires streaming infrastructure.
Restaurant data: prep time per restaurant, kitchen capacity, average acceptance time. Partially available — new restaurants have no history.
External signals: weather API, traffic APIs, public holiday calendar, event data. Low latency required at serving time.
Question 3 — Features (what, and where computed)
Order features
distance_km (pickup→delivery)
order_value
n_items
has_special_instructions
Restaurant features (pre-computed)
avg_prep_time_15min
current_queue_length
acceptance_rate_1h
peak_hour_multiplier
Driver features (pre-computed)
driver_avg_speed_30min
driver_distance_to_restaurant
driver_active_orders
driver_historical_mae
Context features
is_peak_hour
day_of_week
weather_condition
nearby_events
Real-time signals
current_traffic_index
restaurant_wait_estimate
payment_processing_time
Cold start handling
new_restaurant: use city average
new_driver: use city median
missing weather: use seasonal average
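The cold-start fallbacks above can be sketched as a small lookup helper — the `CITY_DEFAULTS` values and function name here are illustrative, not Swiggy's actual configuration:

```python
# City-level defaults used when an entity has no history (hypothetical values)
CITY_DEFAULTS = {
    "restaurant_avg_prep_time": 18.0,    # city average, minutes
    "driver_avg_speed_30min":   22.0,    # city median, km/h
    "weather_condition":        "clear", # seasonal default
}

def resolve_feature(store, key):
    """Return the stored feature value, or the city-level default on cold start."""
    value = store.get(key)
    return value if value is not None else CITY_DEFAULTS[key]

# New restaurant with no history falls back to the city average;
# known driver uses the real pre-computed value.
features = {"driver_avg_speed_30min": 25.5}
print(resolve_feature(features, "restaurant_avg_prep_time"))  # 18.0 (default)
print(resolve_feature(features, "driver_avg_speed_30min"))    # 25.5 (stored)
```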
Questions 4-8 — Model, Serving, Scale, Monitoring, Failure modes
4. Model

LightGBM for the main model — fast inference (< 1ms), handles missing values, good on tabular data. Separate models per city initially, unified with city as a feature at scale. Simple rule-based fallback: (distance × 6) + restaurant_avg_prep_time.

Tradeoff: Deep learning (LSTM for sequence) would capture GPS trajectory better but 100× slower to train and deploy. LightGBM gets 90% of the quality at 1% of the complexity.
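The rule-based fallback from the model answer above is one line of arithmetic — a sketch, with the 15-minute default prep time as an assumed value for restaurants with no history:

```python
def fallback_eta_minutes(distance_km, restaurant_avg_prep_time=None):
    """Rule-based fallback: ~6 min per km of travel plus restaurant prep time.
    Assumes a 15-min default prep time when the restaurant has no history."""
    prep = restaurant_avg_prep_time if restaurant_avg_prep_time is not None else 15.0
    return distance_km * 6 + prep

print(fallback_eta_minutes(3.0, 12.0))   # 3 km × 6 + 12 min prep = 30.0
print(fallback_eta_minutes(2.5))         # 2.5 km × 6 + 15 min default = 30.0
```

The point is not accuracy — it is that this function has no dependencies, cannot time out, and always returns a plausible number.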
5. Serving

Online serving required — prediction must be made at order-placement time. Latency budget: 100ms total, model gets 20ms. Feature store (Redis) for pre-computed restaurant/driver features: 1ms lookup. Fresh features (GPS, traffic) fetched in parallel: 10ms.

Tradeoff: Batch pre-computation of predictions is not viable — too many (restaurant, driver, destination) combinations to pre-compute. Must be online.
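Fetching pre-computed and fresh features in parallel is what keeps the feature-fetch step near the slower of the two latencies rather than their sum. A minimal asyncio sketch — the fetcher names and simulated latencies are stand-ins for Redis and external API calls:

```python
import asyncio

async def fetch_precomputed(restaurant_id):
    await asyncio.sleep(0.001)   # simulates ~1ms feature-store (Redis) lookup
    return {"avg_prep_time_15min": 14.0}

async def fetch_realtime(order_id):
    await asyncio.sleep(0.010)   # simulates ~10ms GPS/traffic API fetch
    return {"current_traffic_index": 1.3}

async def assemble_features(order_id, restaurant_id):
    # Concurrent fetch: total wait ≈ max(1ms, 10ms), not 1ms + 10ms
    pre, live = await asyncio.gather(
        fetch_precomputed(restaurant_id), fetch_realtime(order_id))
    return {**pre, **live}

features = asyncio.run(assemble_features("o1", "r42"))
print(features)
```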
6. Scale

5M orders/day = 58 orders/second average. Peak (8 PM): 10× = 580 RPS. Each prediction: 20ms model + 30ms feature fetch = 50ms. 30 replicas × 20 RPS/replica handles peak. Feature store: 580 × 5 features = 2,900 Redis reads/second (trivial for Redis).

Tradeoff: At 580 RPS with 50ms latency, 30 pods × 0.5 CPU each = 15 CPU cores. Feature store cache hit rate must be >99% to meet latency SLO.
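The capacity arithmetic above as a quick sanity check (exact division gives 579 peak RPS rather than the rounded 580; the prose's 30 replicas includes headroom over the computed 29):

```python
import math

orders_per_day  = 5_000_000
avg_rps         = orders_per_day / 86_400      # ≈ 58 orders/second
peak_rps        = avg_rps * 10                 # 10× peak factor ≈ 579
rps_per_replica = 20
replicas        = math.ceil(peak_rps / rps_per_replica)   # 29; deploy 30 for headroom

print(f"Average RPS:        {avg_rps:.0f}")
print(f"Peak RPS:           {peak_rps:.0f}")
print(f"Replicas (minimum): {replicas}")
print(f"Redis reads/s peak: {peak_rps * 5:.0f}  (5 features per prediction)")
```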
7. Monitoring

Primary metric: within-5-min rate, computed hourly with 1h delivery delay. Feature drift: distance_km, peak_hour distribution monitored daily with PSI. Labels available 30-90 min after prediction. Retrain weekly or when within-5-min rate drops 5pp.

Tradeoff: Cannot use prediction accuracy as real-time signal — need to wait for delivery completion. Use prediction distribution as leading indicator.
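PSI, mentioned above for daily drift checks, is straightforward to implement. A numpy-only sketch with synthetic data standing in for the `distance_km` feature — the common rule of thumb is that PSI above ~0.2 signals meaningful drift:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a current one."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual,   bins=edges)[0] / len(actual)
    # Floor each bucket to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng      = np.random.default_rng(0)
baseline = rng.normal(4.0, 1.5, 10_000)   # training-time distance_km (toy data)
same     = rng.normal(4.0, 1.5, 10_000)   # same distribution
shifted  = rng.normal(6.0, 1.5, 10_000)   # drifted: longer deliveries

print(f"PSI (no drift): {psi(baseline, same):.3f}")
print(f"PSI (drifted):  {psi(baseline, shifted):.3f}")
```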
8. Failure modes

Feature store down → use request features only + restaurant/driver averages from config. Model serving down → rule-based fallback (distance × 6 + 15 min). Data pipeline stale → detect via feature staleness check at serving time, switch to fallback.

Tradeoff: Rule-based fallback showing "25-35 min" is always better than showing nothing or an error. Degrade gracefully — never block order placement due to ML unavailability.
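The degrade-gracefully rule above amounts to a wrapper that never raises — a sketch; the function names are illustrative and the fallback is the distance × 6 + 15 rule from the model section:

```python
def eta_with_fallback(features, model=None):
    """Return a prediction plus its source, degrading gracefully.
    Any model failure falls through to the rule-based estimate —
    order placement must never block on ML availability."""
    try:
        if model is None:
            raise RuntimeError("model serving unavailable")
        return {"eta_min": model(features), "source": "model"}
    except Exception:
        eta = features.get("distance_km", 3.0) * 6 + 15   # rule-based fallback
        return {"eta_min": eta, "source": "fallback"}

print(eta_with_fallback({"distance_km": 2.0}))                      # fallback path
print(eta_with_fallback({"distance_km": 2.0}, model=lambda f: 24))  # model path
```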
Case study 2 — Razorpay

Design Razorpay's real-time fraud detection system

Fraud detection is fundamentally different from delivery time prediction. The class imbalance is extreme (0.1% fraud rate). The cost asymmetry is severe (false negative = fraud loss, false positive = legitimate transaction declined = customer anger + lost revenue). Latency is critical — the prediction must complete before the payment clears. And the adversary is adaptive — fraudsters study and evade every model.

python
# ── Fraud detection system design — capacity and tradeoff analysis ───

# Question 1: Problem framing
print("PROBLEM FRAMING")
print("=" * 55)
print("""
ML task:     Binary classification — fraud vs legitimate
             Output: fraud probability score (0-1), not binary
             Thresholds set by risk team based on business cost

Business metric:
  Primary:   Transaction fraud rate (TFR) — fraud_amount / total_amount
  Secondary: False positive rate (FPR) — declined_legitimate / all_legitimate

Why score not binary:
  Different thresholds for different transaction types:
  - Low value UPI (< Rs 500): high threshold (0.95) — FP very costly
  - High value bank transfer (> Rs 1L): low threshold (0.30) — FN very costly
  - International: medium threshold (0.60)
""")

# Question 6: Scale (do this early for fraud — it's the key constraint)
print("SCALE ANALYSIS")
print("=" * 55)

transactions_per_day = 10_000_000        # 10M transactions/day
peak_tps             = transactions_per_day / 86400 * 10   # 10× peak = ~1,157 TPS
fraud_rate           = 0.001             # 0.1% = 10,000 fraud transactions/day

print(f"Transaction volume:  {transactions_per_day:,}/day = {transactions_per_day/86400:.0f} avg TPS")
print(f"Peak TPS:            {peak_tps:.0f} TPS  (10× peak factor)")
print(f"Expected fraud:      {int(transactions_per_day * fraud_rate):,} transactions/day")
print(f"Latency budget:      < 50ms total  (before payment clears)")
print(f"Model budget:        < 10ms  (50ms - feature fetch - network)")

# Capacity for 1,157 TPS at 10ms each:
import math
model_time_ms    = 10
replicas_needed  = math.ceil(peak_tps * model_time_ms / 1000)
print(f"\nReplicas needed:     {replicas_needed} pods × {1000 // model_time_ms} RPS each")

# Question 3: Features — three tiers by computation cost
print("\nFEATURE TIERS")
print("=" * 55)
feature_tiers = {
    'Tier 1 — Request (0ms)': [
        'transaction_amount', 'merchant_category', 'payment_method',
        'device_fingerprint', 'ip_address', 'billing_zip',
        'hour_of_day', 'is_international',
    ],
    'Tier 2 — Feature Store (1-3ms Redis)': [
        'user_7d_transaction_count', 'user_7d_total_spend',
        'user_30d_distinct_merchants', 'user_30d_avg_amount',
        'merchant_fraud_rate_30d', 'ip_country_mismatch_flag',
        'velocity_1h_amount', 'device_new_flag',
    ],
    'Tier 3 — Computed (5ms, parallel)': [
        'transaction_vs_user_avg_ratio',   # txn / user 30d avg
        'amount_round_number_flag',         # Rs 10000.00 exact
        'time_since_last_transaction_s',
        'distance_from_last_merchant_km',
    ],
}
for tier, features in feature_tiers.items():
    print(f"\n  {tier}:")
    for f in features:
        print(f"    → {f}")

# Question 4: Model — why an ensemble
print("
MODEL SELECTION")
print("=" * 55)
print("""
Primary:  LightGBM score (tabular features, fast, interpretable)
          Trained on 6-month rolling window
          Feature importance for regulatory compliance (RBI audit)

Secondary: Rule engine running in parallel (100% recall for known patterns)
           Hard rules: velocity limits, blacklisted IPs, blocked BINs
           Soft rules: unusual merchant for user, odd hours

Output: max(lgbm_score, rule_score) → final risk score

Why not deep learning:
  - Regulatory: RBI requires explainability for declined transactions
  - Latency: transformer inference > 50ms for tabular data
  - Data: 0.1% fraud rate with 10M daily → only 10k fraud labels/day
    LightGBM is far more data-efficient than deep learning for this

Class imbalance handling:
  - scale_pos_weight=999 in LightGBM (ratio of negatives to positives)
  - Focal loss as alternative: penalises easy negatives more
  - Evaluation: precision-recall AUC not ROC-AUC (imbalanced classes)
""")

# Question 7: Monitoring — adversarial drift is the hardest problem
print("MONITORING — ADVERSARIAL DRIFT")
print("=" * 55)
print("""
Normal drift: feature distributions shift gradually (monthly retraining)
Adversarial drift: fraudsters adapt to the model within DAYS

Detection:
  - Monitor fraud rate per merchant category hourly
  - Monitor false positive rate daily (declined legit transactions)
  - Monitor feature distribution of FRAUD transactions (not all)
    → New fraud pattern appears as a cluster in a previously clean region

Response:
  - New fraud pattern detected → add hard rule IMMEDIATELY (same day)
  - Retrain model weekly with new fraud samples
  - Deploy rule before model update — rules are instant, models take days

Feedback loop:
  - Chargebacks confirm fraud (7-30 day delay)
  - Merchant disputes confirm false positives (same day)
  - Manual review team labels 2% of flagged transactions same-day
    → Creates fast label feedback for concept drift detection
""")
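The `scale_pos_weight=999` figure and the warning against accuracy in the model-selection notes above both fall out of the 0.1% fraud rate. A numpy-only illustration with synthetic labels:

```python
import numpy as np

# Synthetic labels at the stated 0.1% fraud rate
n, fraud_rate = 1_000_000, 0.001
y = np.zeros(n, dtype=int)
y[:int(n * fraud_rate)] = 1                 # 1,000 fraud transactions

# A degenerate "never fraud" model looks excellent on accuracy
always_legit = np.zeros(n, dtype=int)
accuracy = (always_legit == y).mean()
recall   = always_legit[y == 1].mean()      # fraction of fraud caught

print(f"Accuracy of 'never fraud' model: {accuracy:.1%}")   # misleadingly high
print(f"Fraud recall:                    {recall:.0%}")      # catches nothing

# The LightGBM weight is just the negative:positive ratio
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(f"scale_pos_weight: {scale_pos_weight:.0f}")
```

This is why the evaluation metric must be precision-recall AUC and why the positive class needs explicit reweighting.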
Case study 3 — Meesho

Design Meesho's product recommendation system — two-stage retrieval

Recommendation systems are the third most common ML design question after delivery time and fraud. The key insight almost every candidate misses: you cannot run a complex ranking model over 50 million products. The two-stage architecture — fast retrieval of 100-500 candidates, then expensive ranking of just those candidates — is how every production recommendation system works at scale.

Two-stage recommendation architecture — retrieval then ranking
Stage 1A — Collaborative filtering (ALS / matrix factorisation)
In: user ID → Out: 500 products the user might buy
How: approximate nearest neighbour search on user embedding vs product embeddings. Faiss index.
Latency: < 5ms (pre-computed user embedding lookup)
Stage 1B — Content-based (CLIP embeddings)
In: user recent views + search query → Out: 500 visually/semantically similar products
How: CLIP product image embeddings indexed in Faiss. Query with viewed product embeddings.
Latency: < 10ms (ANN on pre-computed product embeddings)
Stage 1C — Trending / popularity
In: user category preferences + location → Out: 200 trending products in the user's categories
How: hourly batch job computes trending score per category. Redis stores top-200 per category.
Latency: < 2ms (Redis sorted set lookup)
Merge + deduplicate
In: 1,200 candidate products (with duplicates) → Out: 500 unique candidates
How: union of all candidate sets, deduplicate by product_id, score by source count.
Latency: < 1ms
Stage 2 — Neural ranking model
In: 500 candidates + user context + product features → Out: top 20 personalised products
How: two-tower model or gradient boosting on (user, product) pairs. Trained on click/purchase labels.
Latency: < 30ms
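The Stage 1A retrieval step can be sketched with a brute-force inner-product search standing in for the Faiss index — same idea, none of the speed. The embedding dimensions and data here are random toys:

```python
import numpy as np

# Toy embedding tables: 10k products × 64 dims, one user vector
rng         = np.random.default_rng(42)
product_emb = rng.normal(size=(10_000, 64)).astype(np.float32)
user_emb    = rng.normal(size=(64,)).astype(np.float32)

def retrieve(query, index, k=500):
    """Top-k product ids by inner product — what a flat IP index computes;
    a real system replaces this exhaustive scan with ANN search."""
    scores = index @ query
    return np.argsort(scores)[::-1][:k]

candidates = retrieve(user_emb, product_emb, k=500)
print(f"Retrieved {len(candidates)} candidates; "
      f"best score {float(product_emb[candidates[0]] @ user_emb):.2f}")
```

At 50M products the exhaustive scan above is exactly what ANN indexes (IVF, HNSW) exist to avoid, trading a small recall loss for sub-10ms lookups.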
python
# ── Recommendation system capacity analysis ───────────────────────────
print("MEESHO RECOMMENDATION SCALE ANALYSIS")
print("=" * 55)

# User and product scale
dau           = 15_000_000   # 15M daily active users
products      = 50_000_000   # 50M products in catalogue
sessions_per_user = 3        # avg 3 sessions per day per DAU
recs_per_session = 4         # recommendation requests per session
total_rps = dau * sessions_per_user * recs_per_session / 86400

print(f"DAU:                    {dau/1e6:.0f}M")
print(f"Product catalogue:      {products/1e6:.0f}M")
print(f"Recommendation RPS:     {total_rps:.0f} avg  (~{total_rps*5:.0f} peak)")

# Why two-stage is necessary
print("\nWHY TWO-STAGE IS NECESSARY:")
ranking_model_ms = 5      # ms per (user, product) pair, scored sequentially
full_catalogue_s = ranking_model_ms * products / 1000   # ms → seconds
print(f"  Ranking all {products/1e6:.0f}M products: {full_catalogue_s:,.0f} seconds (~{full_catalogue_s/3600:.0f} hours) — impossible")
print(f"  Ranking 500 candidates:              {ranking_model_ms * 500:,}ms sequential, ~30ms batched — feasible")

# Offline training cadence
print("\nTRAINING CADENCE:")
print("""
  Retrieval models (ALS, CLIP embeddings):  Weekly
    → Catalogue changes slowly, embeddings expensive to compute

  Ranking model:  Daily
    → Click patterns change quickly, model relatively cheap to retrain

  Popularity/trending features:  Hourly
    → Trending products change fast, must be fresh

  Cold start (new product):  Real-time
    → New product uploaded → compute CLIP embedding → add to index immediately
    → Use content-based retrieval until enough interactions for collaborative
""")

# Evaluation metrics — what actually matters
print("EVALUATION METRICS:")
metrics = [
    ('CTR (Click-Through Rate)',   'Online',  '% users who click a recommendation'),
    ('Conversion rate',            'Online',  '% recommendations that lead to purchase'),
    ('Diversity (ILD)',            'Offline', 'Intra-list diversity — are recommendations varied?'),
    ('Coverage',                   'Offline', '% of catalogue that appears in recommendations'),
    ('Novelty',                    'Offline', '% of recommendations user has never seen'),
    ('NDCG@10',                    'Offline', 'Normalised Discounted Cumulative Gain at position 10'),
]
for metric, where, desc in metrics:
    print(f"  {metric:<28}: {where:<8}  {desc}")
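Of the offline metrics listed, NDCG@10 is the one most often asked about in follow-ups. A minimal sketch using binary relevance labels (clicked = 1), which is one common simplification:

```python
import numpy as np

def ndcg_at_k(relevance, k=10):
    """NDCG@k: DCG of the ranked list divided by DCG of the ideal ordering.
    relevance[i] is the label of the item shown at rank i (0-indexed)."""
    rel = np.asarray(relevance[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))  # 1/log2(rank+1)
    dcg   = float((rel * discounts).sum())
    ideal = np.sort(rel)[::-1]                             # best possible ordering
    idcg  = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Clicked items ranked at the top score better than the same items buried
print(f"{ndcg_at_k([1, 1, 0, 0, 0]):.3f}")   # ideal ordering → 1.000
print(f"{ndcg_at_k([0, 0, 0, 1, 1]):.3f}")   # same clicks, worse ranking
```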
The decisions that define every system

Six recurring tradeoffs — know these and you can handle any ML design question

Online vs Batch serving
OPTION A

Real-time prediction at request time. Required when: prediction depends on request context (fraud amount, delivery distance). Latency-sensitive. Higher cost.

OPTION B

Pre-compute predictions for all entities daily. Possible when: context does not change per-request (user recommendations pre-computed by user_id). Lower cost, higher throughput.

Rule: If prediction requires real-time features → online. If top-k for a fixed entity → batch. Hybrid: batch compute candidates, online re-rank.
Precision vs Recall tradeoff (classification threshold)
OPTION A

High precision (high threshold): fewer false positives. For fraud: fewer declined legitimate transactions. Cost: miss more fraud.

OPTION B

High recall (low threshold): catch more fraud. For fraud: higher false positive rate. Cost: more customer complaints.

Rule: Set threshold based on business cost: cost_FN / cost_FP. If fraud loss >> complaint cost, lower threshold. Always expose the score, let the business set the threshold.
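The cost-ratio rule has a closed form: declining is worth it when p × cost_FN > (1 − p) × cost_FP, which gives the expected-cost-minimising threshold directly. A sketch with illustrative (not real) business costs:

```python
def optimal_threshold(cost_fp, cost_fn):
    """Expected-cost-minimising decision threshold: decline a transaction
    when p * cost_FN > (1 - p) * cost_FP, i.e. p > cost_FP / (cost_FP + cost_FN)."""
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical costs in rupees — the asymmetry drives the threshold:
print(f"Low-value UPI  (FP costly):   {optimal_threshold(500, 100):.2f}")
print(f"High-value txn (FN costly):   {optimal_threshold(200, 100_000):.3f}")
```

Equal costs give a threshold of 0.5; the more a missed fraud costs relative to a wrongly declined transaction, the lower the threshold falls, matching the per-transaction-type thresholds in the fraud case study.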
Model complexity vs latency
OPTION A

Simple model (LightGBM): 1ms inference, interpretable, less accurate. Deployed as single endpoint.

OPTION B

Complex model (deep learning): 100ms+ inference, better accuracy. Requires GPU serving, model quantisation, or batching.

Rule: Start simple. Add complexity only when simple model plateaus AND latency budget allows. 80% of production models are gradient boosting, not deep learning.
Freshness vs cost (feature computation frequency)
OPTION A

Real-time features: maximum freshness, maximum cost. Requires streaming infrastructure (Kafka, Flink). For fast-changing signals (fraud velocity, driver location).

OPTION B

Batch features: stale but cheap. Daily or hourly batch job. For slowly-changing signals (user purchase history, restaurant prep time baseline).

Rule: Use freshness of the underlying signal to decide: if signal changes significantly in 1 hour, compute hourly. If stable across days, compute daily.
Single global model vs per-segment models
OPTION A

Global model: simpler, one deployment, data pooling. Worse for underrepresented segments (tier-2 cities with little data).

OPTION B

Per-segment models: better for each segment, higher maintenance. n models to retrain, monitor, and deploy.

Rule: Start global. Add segment-specific models when: global model underperforms a segment by >10%, segment has >100k samples, and business impact justifies the maintenance overhead.
Human-in-the-loop vs full automation
OPTION A

Full automation: fast, scalable, no human cost. Risk: wrong automated decision at scale (e.g. fraud model blocks all transactions during a bug).

OPTION B

Human review for high-stakes decisions: slower, expensive, required for regulatory compliance. Fraud above Rs 1L, medical diagnosis, loan decisions.

Rule: Automate when: low cost of errors, high volume, reversible actions. Human review when: high cost of errors, regulatory requirement, irreversible actions (account ban).
How to present in 45 minutes

Time allocation and what interviewers are actually scoring

45-minute interview structure — time allocation
0–5 min · Clarify the problem: ask 3–4 clarifying questions. Confirm scale, latency, and business metric. Do not assume.
5–10 min · Problem framing: state the ML task, metric, and success criteria. Define the label and explain how it is obtained.
10–20 min · Data and features: what data exists, quality issues, feature engineering. Explicitly address cold start and missing features.
20–30 min · Model and serving: model choice with justification. Online vs batch. Feature store. Latency analysis. Draw the serving architecture.
30–38 min · Scale and monitoring: capacity estimates. Retraining trigger. Drift detection. Mention specific tools (Evidently, MLflow).
38–45 min · Tradeoffs and extensions: what you would do with more time. What the main risks are. What you would change at 10× scale.
python
# ── What interviewers score — the actual rubric ───────────────────────

SCORING_RUBRIC = {
    'Problem framing (10%)': [
        'Asks clarifying questions before designing',
        'Correctly identifies ML task type (regression/classification/ranking)',
        'Defines business metric separate from ML metric',
        'States success criteria quantitatively',
    ],
    'Data and features (25%)': [
        'Identifies relevant data sources including non-obvious ones',
        'Addresses data quality and bias',
        'Addresses cold start problem for new entities',
        'Explains feature computation (batch vs real-time)',
        'Mentions point-in-time correctness for training',
    ],
    'Model selection (20%)': [
        'Proposes a simple baseline before complex model',
        'Justifies model choice with explicit tradeoffs',
        'Addresses class imbalance if present',
        'Mentions explainability requirements if regulatory context',
    ],
    'System design (25%)': [
        'Correctly determines online vs batch serving',
        'Mentions feature store for pre-computed features',
        'Provides latency breakdown (feature fetch + model + network)',
        'Capacity estimates with numbers',
        'Fallback when model unavailable',
    ],
    'Monitoring (15%)': [
        'Identifies what drifts in this specific system',
        'Proposes concrete monitoring metrics',
        'States retraining trigger condition',
        'Addresses feedback loop and label delay',
    ],
    'Communication (5%)': [
        'Draws architecture diagram',
        'States assumptions explicitly',
        "Handles interviewer's probing questions without getting flustered",
    ],
}

for category, criteria in SCORING_RUBRIC.items():
    print(f"\n{category}:")
    for c in criteria:
        print(f"  ✓ {c}")

# ── Common mistakes that kill interview scores ────────────────────────
print("\n\nCOMMON MISTAKES:")
mistakes = [
    ('Jumping to model selection',    'First 5 minutes spent on LightGBM vs XGBoost. Interviewer wants system design.'),
    ('Forgetting cold start',         '"New restaurant has no history" — always asked as a follow-up. Address proactively.'),
    ('No fallback plan',              '"What if the model is down?" — must have a rule-based or static fallback.'),
    ('Missing label strategy',        '"How do you get training labels?" — must be answered for every system.'),
    ('Batch when online needed',      '"Pre-compute all (user, merchant) pairs" — combinatorial explosion.'),
    ('Ignoring data leakage',         '"Use future features in training" — always mention point-in-time correctness.'),
    ('No numbers',                    '"Large scale, many requests" — always give orders of magnitude.'),
]
print(f"  {'Mistake':<30} {'Why it hurts'}")
print("  " + "─" * 70)
for mistake, why in mistakes:
    print(f"  {mistake:<30} {why}")
Section 11 complete

The MLOps section is complete. Section 12 — Cloud ML Platforms — connects everything to Azure ML, SageMaker, and Vertex AI.

You have completed the full MLOps section across seven modules: ML pipelines and feature stores, experiment tracking, model deployment, monitoring, retraining pipelines, DVC, and ML system design. Section 12 shows how all of this maps onto the managed cloud platforms — Azure ML, AWS SageMaker, and GCP Vertex AI — that most Indian enterprise ML teams use. The concepts are identical; the platforms automate the infrastructure so you can focus on the ML.

Next — Section 12 · Cloud ML Platforms
Azure ML — Studio, Pipelines and AutoML

Azure Machine Learning Studio, compute clusters, AML Pipelines, AutoML, model registry, and online endpoints.

coming soon

🎯 Key Takeaways

  • Every ML system design problem has the same eight questions answered in order: problem framing → data → features → model → serving → scale → monitoring → failure modes. Answer them in this order — each answer constrains the next. Jumping to model selection first is the most common interview mistake.
  • Problem framing before everything: what is the ML task type, what is the business metric (separate from ML metric), how are labels obtained, and what is the latency budget. These four answers determine the entire architecture. Never start designing until you have them.
  • Two-stage architecture is the universal pattern for recommendation and search: fast retrieval of 500-1000 candidates (ANN search on pre-computed embeddings), then expensive ranking of only those candidates. Running a neural ranker over 50M products is impossible at real-time serving latency — two-stage is not an optimisation, it is a requirement.
  • Capacity estimation is not optional. Give numbers: Swiggy 580 peak RPS → 30 replicas × 20 RPS each at a 20ms model budget. Meesho 15M DAU × 3 sessions × 4 requests = 2,083 avg RPS. Fraud detection 1,157 peak TPS at < 10ms model budget → 12 replicas. Interviewers score "thinking in numbers" explicitly.
  • Six recurring tradeoffs to master: online vs batch serving (depends on whether real-time features are required), precision vs recall (depends on cost of FN vs FP), model complexity vs latency (start simple, add complexity when plateau), freshness vs cost (compute frequency = signal change rate), global vs per-segment models (add segments when global underperforms >10%), full automation vs human-in-the-loop (automate reversible low-stakes, humans for irreversible high-stakes).
  • Always address: cold start problem (new users/items with no history — content-based or popularity fallback), label strategy (how and when ground truth is obtained — delivery time is immediate, fraud is delayed 30 days), and fallback when model is unavailable (rule-based or static fallback — never block the core user action due to ML unavailability).