AI/ML · Module 22 · Intermediate

Linear Regression

The simplest ML algorithm — and the most important one to truly understand. Build a DoorDash delivery time predictor from scratch.

30–35 min · March 2026

Module 22 · Classical Machine Learning


The problem

DoorDash needs a number. You need to give them one.

You're a data scientist at DoorDash. Your lead drops a CSV on your desk: 10,000 completed orders, each with the delivery distance and the actual time it took. Your job is to build a model that predicts delivery time from distance.

You open the file and look at the first few rows.

order_id	distance_km	delivery_time_min
SW001	1.2	18
SW002	3.8	32
SW003	2.1	24
SW004	5.6	47
SW005	0.8	14
SW006	4.2	38
...	...	...

You notice something immediately: longer distances mean longer delivery times. SW001 at 1.2 km took 18 minutes. SW004 at 5.6 km took 47 minutes. There is a clear upward trend. If you could capture that relationship as a formula, you could predict delivery time for any new order.

Linear Regression is the algorithm that captures exactly this relationship. It draws the best possible straight line through your data points — and once you have that line, predicting delivery time for any distance is a matter of reading off the value.

The intuition

Drawing the best line through messy data

Imagine plotting all 10,000 orders on a graph. Distance on the x-axis. Delivery time on the y-axis. What you see is a cloud of dots drifting upward from left to right — longer distances, longer times, but with enough scatter that no perfect line could touch every point.

Your goal is to draw a line through the middle of that cloud. Once you have the line, predicting is trivial: find your distance on the x-axis, go straight up until you hit the line, read off the delivery time. Done.

The question is: which line is best? There are infinitely many lines you could draw. You need a way to measure how good a line is — and then find the line that scores best by that measure.

The measure is error. For any line, each data point sits some vertical distance above or below it. That distance is the error for that point — how wrong the line's prediction was. A good line keeps these errors small across all 10,000 points.

Linear Regression minimises the sum of squared errors, not the raw errors. Why squared? Two reasons: squaring makes every error positive (so a -5 error and a +5 error do not cancel out), and squaring penalises large errors much more than small ones (a 10-minute error counts 4× as much as a 5-minute error, not 2×). The flip side is that the fit becomes sensitive to outliers: a handful of extreme points can pull the line toward themselves, so check for them before trusting the result.

The core concept: Linear Regression finds the straight line that minimises the sum of squared vertical distances between the line and every data point. This is called Ordinary Least Squares (OLS). "Least squares" because you are minimising a sum of squared errors. "Ordinary" because it is the simplest version.
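To make the two reasons for squaring concrete, here is a minimal sketch using a handful of made-up errors (the values are illustrative, not taken from the order data):

```python
import numpy as np

# Hypothetical errors (actual minus predicted), in minutes
errors = np.array([-5.0, 5.0, 10.0, -10.0])

raw_sum = errors.sum()              # positive and negative errors cancel out
squared_sum = (errors ** 2).sum()   # every term is positive, nothing cancels

print(raw_sum)       # 0.0
print(squared_sum)   # 250.0

# A 10-minute error contributes 4x as much as a 5-minute error
ratio = (10.0 ** 2) / (5.0 ** 2)
print(ratio)         # 4.0
```

The raw sum suggests a perfect model even though every prediction is off by at least 5 minutes, which is why the squared version is the one being minimised.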

The line is defined by two numbers:

SLOPE · WEIGHT · COEFFICIENT

How much delivery time increases for each additional kilometre of distance. A slope of 7.3 means: add 1 km, add 7.3 minutes to the prediction.

INTERCEPT · BIAS

The baseline delivery time when distance is zero — roughly the time to accept the order, prepare it, and hand it to a rider before they move. Around 8–9 minutes.

delivery_time = slope × distance + intercept

slope=7.3, intercept=8.6, distance=4 km

7.3 × 4 + 8.6 = 37.8 minutes

🎯 Pro Tip

Real DoorDash models use dozens of features — time of day, restaurant load, weather, rider count. The same idea extends directly: each feature gets its own slope (coefficient), and you add them all up. That is Multiple Linear Regression, covered at the end of this page. The two-number version here is where every ML practitioner starts.

The math (optional)

How the algorithm actually finds the best line

Two methods. sklearn uses the first. Deep learning uses the second. Both find the same answer.

Approach 1 — Ordinary Least Squares (OLS)

optional deeper understanding

For simple linear regression (one feature), a closed-form solution exists. You can calculate the exact best slope and intercept directly from your data using these formulas:

slope (m) = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / Σ[(xᵢ − x̄)²]

intercept (b) = ȳ − m × x̄

In plain English: the slope is the covariance of x and y divided by the variance of x. The intercept is the mean of y minus the slope times the mean of x.

This is why sklearn fits a linear regression model on a million rows in under a second — it is not iterating; it is computing a formula directly. No looping, no guessing. One calculation.
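The two formulas translate almost line for line into NumPy. The sketch below simulates data with the same slope and intercept used on this page (the seed and variable names are arbitrary choices) and cross-checks the result against `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 8.0, 1000)                 # distance in km
y = 8.6 + 7.3 * x + rng.normal(0, 4, 1000)      # delivery time in minutes

x_mean, y_mean = x.mean(), y.mean()

# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# intercept = y_mean - slope * x_mean
intercept = y_mean - slope * x_mean

# Cross-check with NumPy's degree-1 least-squares fit (returns slope first)
slope_np, intercept_np = np.polyfit(x, y, 1)

print(f"formula: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"polyfit: slope={slope_np:.3f}, intercept={intercept_np:.3f}")
```

Both lines should print the same values, close to the true 7.3 and 8.6 used to generate the data.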

Approach 2 — Gradient Descent

optional deeper understanding

Gradient descent is the alternative when you have many features and the closed-form calculation becomes expensive. It is also the same mechanism that trains neural networks. You start with a guess and improve it step by step:

1. Start with initial parameters: m = 0, b = 0 (or small random values).

2. Make predictions for every training point using the current m and b.

3. Calculate MSE (Mean Squared Error): how wrong are all predictions on average?

4. Calculate the gradient: which direction and how much to move m and b to reduce MSE?

5. Take a small step in the opposite direction of the gradient. The size of the step is the learning rate hyperparameter.

6. Repeat steps 2–5 until the loss stops decreasing meaningfully. You have converged.

The learning rate is a hyperparameter you set. Too large: the steps overshoot and the loss bounces or diverges. Too small: training takes forever. Typical values: 0.01, 0.001, 0.0001.
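The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, assuming a learning rate of 0.01 and a fixed 5,000 iterations rather than a real convergence check; both values are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 8.0, 1000)                 # distance in km
y = 8.6 + 7.3 * x + rng.normal(0, 4, 1000)      # delivery time in minutes

m, b = 0.0, 0.0            # initial guess
learning_rate = 0.01       # assumed value; too large and the loss diverges
n = len(x)

for step in range(5000):
    y_hat = m * x + b                        # predictions with current m, b
    error = y_hat - y
    mse = np.mean(error ** 2)                # current loss
    grad_m = (2 / n) * np.sum(error * x)     # d(MSE)/dm
    grad_b = (2 / n) * np.sum(error)         # d(MSE)/db
    m -= learning_rate * grad_m              # step against the gradient
    b -= learning_rate * grad_b

print(f"gradient descent: m={m:.3f}, b={b:.3f}, final MSE={mse:.2f}")
```

On this data the result should match the closed-form OLS answer to several decimal places, which is the sense in which both methods find the same line.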

💡 Note

sklearn's LinearRegression() uses OLS by default via a matrix decomposition called SVD. It is exact, fast, and requires no learning rate. For very large datasets or when you want online learning, use SGDRegressor which uses stochastic gradient descent.
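For comparison, a minimal `SGDRegressor` sketch on simulated data follows. The scaling step is an addition of this example: stochastic gradient descent is sensitive to feature scale, so the features are standardised first. The hyperparameters shown are sklearn's documented defaults written out explicitly, not tuned values.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 8.0, (1000, 1))                # distance in km
y = 8.6 + 7.3 * X[:, 0] + rng.normal(0, 4, 1000)    # delivery time in minutes

# Standardise, then fit with stochastic gradient descent
sgd_model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
sgd_model.fit(X, y)

# Training R² as a quick sanity check only, not an honest evaluation
r2_sgd = sgd_model.score(X, y)
print(f"SGD training R²: {r2_sgd:.3f}")
```

Because the model sits behind a `StandardScaler`, its coefficient is expressed per standard deviation of distance rather than per kilometre.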

The code

Build the DoorDash delivery predictor — step by step

Eight steps. Every step has a purpose. Read the explanation before the code — the code will make more sense when you know why you are writing it.

STEP 1 — Create the dataset

In a real job you would pull this from BigQuery or a Postgres database. Here we simulate it with numpy so you can run it instantly with no setup.

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

np.random.seed(42)          # same random numbers every run — reproducibility

n_orders = 1000
distance_km = np.random.uniform(0.5, 8.0, n_orders)    # orders range 0.5–8 km
noise = np.random.normal(0, 4, n_orders)                # ±4 min real-world variability
delivery_time_min = 8.6 + 7.3 * distance_km + noise    # true relationship + noise

df = pd.DataFrame({
    'distance_km': distance_km,
    'delivery_time_min': delivery_time_min
})

print(df.describe())        # always run describe() first — know your data before touching it

STEP 2 — Look at the data before touching it

This is non-negotiable. A scatter plot in 30 seconds tells you whether Linear Regression is the right tool, before you write another line of code.

python

plt.figure(figsize=(8, 5))
plt.scatter(df['distance_km'], df['delivery_time_min'],
            alpha=0.3, s=15, color='#378ADD')
plt.xlabel('Distance (km)')
plt.ylabel('Delivery Time (min)')
plt.title('DoorDash Orders — Distance vs Delivery Time')
plt.tight_layout()
plt.show()

# What to look for:
#   Clear upward trend  → Linear Regression is appropriate
#   No trend / random   → LR will learn nothing useful
#   Curve (parabola)    → Wrong tool — try polynomial features or tree model
#   Tight cluster       → Great signal, model will be accurate

STEP 3 — Split into training and test sets

This is the single most important step beginners skip. If you evaluate on the same data you trained on, you are asking a student to grade their own exam using the answer sheet they already memorised. The score is meaningless.

python

X = df[['distance_km']]      # double brackets → 2D DataFrame (n_samples, n_features)
y = df['delivery_time_min']  # single brackets → 1D Series (n_samples,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 20% for testing = 200 orders
    random_state=42      # same split every run
)

print(f"Training set: {len(X_train)} orders")   # 800
print(f"Test set:     {len(X_test)} orders")    # 200

# The test set is the answer sheet.
# Treat it like one: don't look at it, don't tune on it, don't touch it
# until you are done building and ready for a final honest evaluation.

⚠️ Important

Never use your test data during development. Not for feature selection, not for hyperparameter tuning, not for model comparison. The moment you make any decision based on test set performance, it is no longer a valid estimate of production performance. Use a validation set or cross-validation for development decisions.
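One way to follow this advice is k-fold cross-validation on the development portion of the data, leaving the test set untouched. The sketch below uses simulated data and 5 folds (an arbitrary but common choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 8.0, (1000, 1))                # distance in km
y = 8.6 + 7.3 * X[:, 0] + rng.normal(0, 4, 1000)    # delivery time in minutes

# Hold the test set back; all development decisions use X_dev / y_dev only
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# sklearn returns negated MAE so that "higher is better" holds for every scorer
scores = cross_val_score(
    LinearRegression(), X_dev, y_dev, cv=5, scoring="neg_mean_absolute_error"
)
cv_mae = -scores

print("MAE per fold:", cv_mae.round(2))
print("Mean CV MAE: ", round(cv_mae.mean(), 2))
```

If the per-fold values are close to each other, the estimate is stable and does not depend on one lucky split.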

STEP 4 — Train the model

Three lines. That is all sklearn needs. The complexity is hidden inside fit().

python

model = LinearRegression()      # create the model object — no learning has happened yet

model.fit(X_train, y_train)    # THIS is where learning happens:
                                # - computes optimal slope and intercept via OLS
                                # - for 800 rows, this takes milliseconds
                                # - result: model.coef_ and model.intercept_ are set

print(f"Slope (coef):     {model.coef_[0]:.2f}")    # expect ~7.3 (min per km)
print(f"Intercept:        {model.intercept_:.2f}")  # expect ~8.6 (baseline minutes)

# These numbers ARE the model.
# coef_ = [7.28]    ← distance adds 7.28 min per km
# intercept_ = 8.71 ← base time before rider moves

STEP 5 — Make predictions

python

y_pred = model.predict(X_test)

# Compare predicted vs actual for the first 8 test orders
comparison = pd.DataFrame({
    'distance_km': X_test['distance_km'].values[:8],
    'actual_min':  y_test.values[:8].round(1),
    'predicted_min': y_pred[:8].round(1),
    'error_min':   (y_pred[:8] - y_test.values[:8]).round(1)
})
print(comparison.to_string(index=False))

# Expected output (approximate):
#  distance_km  actual_min  predicted_min  error_min
#          2.3        22.8           25.4        2.6
#          6.1        52.1           53.1        1.0
#          1.0        13.2           15.9        2.7
#          4.7        39.4           42.9        3.5
#          3.5        32.3           34.2        1.9
#          0.7        15.1           13.7       -1.4
#          5.8        48.8           51.0        2.2
#          2.9        28.5           29.8        1.3

STEP 6 — Evaluate properly

A number without context is useless. Always compare your model to the dumbest possible baseline: predict the mean for every order. If your model cannot beat that, it has learned nothing.

python

mae = mean_absolute_error(y_test, y_pred)
print(f"MAE:  {mae:.2f} min")      # ~3.1 min on average — a PM can understand this

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f} min")     # penalises large errors more heavily than MAE

r2 = r2_score(y_test, y_pred)
print(f"R²:   {r2:.3f}")           # roughly 0.94 on this simulated data: the 7.3 min/km trend over 0.5–8 km dwarfs the ±4 min noise

# --- Compare to baseline ---
baseline_pred = np.full(len(y_test), y_train.mean())   # always predict training mean
baseline_mae  = mean_absolute_error(y_test, baseline_pred)
print(f"\nBaseline MAE (always predict mean): {baseline_mae:.2f} min")
print(f"Model MAE:                           {mae:.2f} min")
print(f"Improvement over baseline:           {((baseline_mae - mae) / baseline_mae * 100):.1f}%")
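
To see what the three metric functions compute, here is a hand-rolled version in NumPy. It uses the six sample orders from the table at the top of the page and the illustrative line from the intuition section (slope 7.3, intercept 8.6), not a fitted model.

```python
import numpy as np

# The six sample orders shown at the top of the page
distance = np.array([1.2, 3.8, 2.1, 5.6, 0.8, 4.2])
y_true = np.array([18.0, 32.0, 24.0, 47.0, 14.0, 38.0])

# Predictions from the illustrative line: 7.3 * distance + 8.6
y_hat = 7.3 * distance + 8.6
errors = y_true - y_hat

mae = np.mean(np.abs(errors))               # average absolute error, in minutes
rmse = np.sqrt(np.mean(errors ** 2))        # squares first, so large errors weigh more

ss_res = np.sum(errors ** 2)                        # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # error of "always predict the mean"
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f} min")
print(f"RMSE: {rmse:.2f} min")
print(f"R²:   {r2:.3f}")
```

The R² formula makes the baseline comparison explicit: it is 0 when the model is no better than predicting the mean, and negative when it is worse.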

STEP 7 — Visualise predictions vs actual

Numbers tell you how wrong you are. Charts tell you where and why. The residual plot is the most important diagnostic chart in Linear Regression.

python

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left: scatter + fitted line
ax1.scatter(X_train['distance_km'], y_train, alpha=0.2, s=12, color='#378ADD', label='Training data')
x_line = pd.DataFrame({'distance_km': np.linspace(0.5, 8.0, 100)})  # same column name as training data, avoids a feature-name warning
ax1.plot(x_line['distance_km'], model.predict(x_line), color='#ff6b6b', linewidth=2, label='Fitted line')
ax1.set_xlabel('Distance (km)')
ax1.set_ylabel('Delivery Time (min)')
ax1.set_title('Fitted Line')
ax1.legend()

# Right: residual plot — actual minus predicted vs predicted
residuals = y_test - y_pred
ax2.scatter(y_pred, residuals, alpha=0.3, s=12, color='#7F77DD')
ax2.axhline(0, color='#ff6b6b', linewidth=1.5, linestyle='--')
ax2.set_xlabel('Predicted Time (min)')
ax2.set_ylabel('Residual (actual − predicted)')
ax2.set_title('Residuals')

# Interpreting the residual plot:
#   GOOD → random cloud centred on 0, no pattern
#   BAD  → fan shape (wider spread at higher predictions) → heteroscedasticity
#   BAD  → curve → the relationship is not actually linear

plt.tight_layout()
plt.show()

STEP 8 — Use the model for a real prediction

python

# Single prediction
new_order = pd.DataFrame({'distance_km': [3.2]})
predicted_time = model.predict(new_order)[0]
print(f"Predicted delivery time: {predicted_time:.0f} minutes")

# In production you would:
#   1. Serialise the model to disk with joblib
#   2. Load it in a FastAPI service
#   3. Monitor for drift (performance degrading over time)

import joblib

joblib.dump(model, 'doordash_eta_model.pkl')         # save model
loaded_model = joblib.load('doordash_eta_model.pkl') # load it later

# FastAPI endpoint (production pattern):
# @app.post("/predict-eta")
# def predict_eta(distance_km: float):
#     X = pd.DataFrame({'distance_km': [distance_km]})
#     return {"eta_minutes": int(loaded_model.predict(X)[0])}

Going further

Simple vs Multiple Linear Regression

Simple Linear Regression

y = m·x + b

One input feature. The model is a 2D line. Two parameters: one slope, one intercept. Good for understanding the algorithm. Rarely sufficient for production.

Multiple Linear Regression

y = m₁x₁ + m₂x₂ + … + b

Multiple input features. The model is a hyperplane in n-dimensional space. One coefficient per feature, one intercept. This is what you use in practice.
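In code, a multiple-regression prediction is a dot product of the coefficients and the feature values, plus the intercept. The numbers below are illustrative values chosen for this sketch, not the output of a trained model:

```python
import numpy as np

# Illustrative coefficients: minutes per km, per traffic point, per prep minute
coefficients = np.array([7.3, 0.9, 1.0])
intercept = 8.6

# One hypothetical order: 4 km, traffic score 5, 12 minutes of prep
features = np.array([4.0, 5.0, 12.0])

# y = m1*x1 + m2*x2 + m3*x3 + b
prediction = coefficients @ features + intercept
print(f"Predicted delivery time: {prediction:.1f} min")   # 54.3 min
```

This is the whole model at inference time: no matter how many features there are, prediction stays a single weighted sum.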

python

# Add more features to the dataset
df['traffic_score']    = np.random.uniform(1, 10, n_orders)    # 1=clear 10=gridlock
df['restaurant_prep']  = np.random.uniform(5, 25, n_orders)   # minutes to prepare

# The simulated target must depend on the new features, otherwise their
# coefficients would come out near zero. Illustrative effects:
# 0.9 min per traffic point, 1.0 min per prep minute.
df['delivery_time_multi'] = (df['delivery_time_min']
                             + 0.9 * df['traffic_score']
                             + 1.0 * df['restaurant_prep'])

# Re-split with all features
X_multi = df[['distance_km', 'traffic_score', 'restaurant_prep']]
y_multi = df['delivery_time_multi']
X_tr, X_te, y_tr, y_te = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)

# Training is identical: sklearn handles multiple features automatically
model_multi = LinearRegression()
model_multi.fit(X_tr, y_tr)

# Inspect coefficients, one per feature
for feature, coef in zip(X_multi.columns, model_multi.coef_):
    print(f"{feature:22s}: {coef:.3f}")
# Expect roughly: distance_km ≈ 7.3, traffic_score ≈ 0.9, restaurant_prep ≈ 1.0

# Fair comparison: a distance-only model on the same target and the same split
model_dist = LinearRegression().fit(X_tr[['distance_km']], y_tr)
print(f"\nSimple LR R²:   {r2_score(y_te, model_dist.predict(X_te[['distance_km']])):.3f}")
print(f"Multiple LR R²: {r2_score(y_te, model_multi.predict(X_te)):.3f}")
# The multi-feature model should score noticeably higher (roughly 0.8 vs 0.95 here)

When to use it and when not to

Linear Regression assumptions — the honest version

Every statistics textbook lists Linear Regression assumptions in a way designed to make you feel like you need a PhD to check them. You do not. Here is each assumption in plain English, how to check it in 5 minutes, and what happens if it is violated.

Linearity

What it means: The relationship between your features and your target is approximately a straight line. If the true relationship is a curve, forcing a straight line through it produces systematic errors.

How to check: Plot each feature vs the target. If you see a clear curve, run a residual plot — if residuals curve instead of scatter randomly, linearity is violated.

VIOLATED

Predicting app revenue vs user count: early users grow revenue linearly but later users contribute less (saturation). A straight line overestimates revenue at high user counts.

HOLDS

Distance vs delivery time — adding 1 km consistently adds ~7 minutes regardless of starting distance. The relationship is genuinely linear.

No extreme outliers

What it means: Because errors are squared, a single extreme point can drag the line significantly toward itself. LR is not robust to outliers.

How to check: Box plots of each feature. Check for values more than 3 standard deviations from the mean. Plot residuals — outliers appear as isolated points far from zero.

VIOLATED

A 90-minute delivery (driver had an accident) treated as normal training data. The line tilts toward that point, making predictions slightly worse for all other orders.

HOLDS

After removing the 0.3% of orders with delivery_time > 90 minutes, the line fits the remaining 99.7% much more cleanly.
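The 3-standard-deviation check takes a few lines. This sketch uses simulated delivery times with three extreme values injected by hand; the threshold of 3 is the rule of thumb from the text, not a universal constant.

```python
import numpy as np

rng = np.random.default_rng(0)
times = rng.normal(30, 8, 1000)           # simulated delivery times in minutes
times[:3] = [95.0, 110.0, 120.0]          # three injected extreme deliveries

# z-score: how many standard deviations each value sits from the mean
z = (times - times.mean()) / times.std()
keep = np.abs(z) <= 3

print(f"Flagged as outliers: {(~keep).sum()} of {len(times)}")
clean_times = times[keep]                 # filter the rows before training
```

In a real dataset, look at the flagged rows before dropping them: an accident is noise, but a consistently slow restaurant is signal.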

No multicollinearity

What it means: Your features should not be highly correlated with each other. If distance_km and distance_miles are both in your model, the algorithm cannot separate their individual contributions.

How to check: Compute a correlation matrix: df.corr(numeric_only=True). As a rule of thumb, features correlated above about 0.85 with each other are a problem. Use VIF (Variance Inflation Factor) for a precise check.

VIOLATED

Including both distance_km and an estimated_travel_time_sec feature — they measure the same underlying thing. Coefficients become unstable and uninterpretable.

HOLDS

distance_km, restaurant_prep_time, and weather_severity are genuinely independent. Each measures something different. Coefficients are stable and interpretable.
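VIF can be computed from its definition: regress each feature on all the others and take 1 / (1 − R²). The helper below is a hand-written sketch in NumPy (the function name `vif` and the simulated columns are choices made for this example), so it does not depend on any particular statistics package.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of a 2D array."""
    X = np.asarray(X, dtype=float)
    result = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])    # add intercept column
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - resid.var() / target.var()               # R² of feature j on the rest
        result.append(1 / (1 - r2))
    return np.array(result)

rng = np.random.default_rng(0)
distance_km = rng.uniform(0.5, 8.0, 1000)
distance_miles = distance_km * 0.621371 + rng.normal(0, 0.05, 1000)  # near-duplicate
restaurant_prep = rng.uniform(5, 25, 1000)                            # unrelated feature

vifs = vif(np.column_stack([distance_km, distance_miles, restaurant_prep]))
for name, value in zip(["distance_km", "distance_miles", "restaurant_prep"], vifs):
    print(f"{name:16s} VIF = {value:.1f}")
```

A VIF near 1 means the feature is not explained by the others; values above roughly 5 to 10 are commonly treated as a warning sign.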

Independence of errors

What it means: Errors for one prediction should not predict errors for another. If your model is always wrong at 7pm, those errors are correlated with time — and your model has missed a systematic pattern.

How to check: Plot residuals against time or any variable not in your model. A pattern means you are missing a feature. Random scatter means errors are independent.

VIOLATED

Errors are consistently positive (under-predicting) on Friday evenings. time_of_week is not in the model. Residuals correlate with hour_of_day.

HOLDS

After adding is_peak_hour, the Friday evening systematic error disappears. Residuals scatter randomly across all hours.
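A quick numeric version of this check is to average the residuals per hour. The sketch below simulates residuals with a deliberate 6-minute under-prediction between 19:00 and 21:59; the size of the bias and the hour range are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hour = rng.integers(8, 23, 5000)            # order hour, 8 to 22 inclusive
residual = rng.normal(0, 4, 5000)           # actual minus predicted, in minutes

# Simulate a model that misses the evening peak: it under-predicts by ~6 min
peak = (hour >= 19) & (hour <= 21)
residual[peak] += 6.0

# Mean residual per hour: near 0 everywhere if errors are independent of time
mean_by_hour = {h: residual[hour == h].mean() for h in range(8, 23)}
for h, value in mean_by_hour.items():
    flag = "  <-- systematic error" if abs(value) > 2 else ""
    print(f"{h:02d}:00  mean residual = {value:+.2f}{flag}")
```

Hours whose mean residual sits far from zero point to a missing feature, which is exactly what adding is_peak_hour fixes.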

Errors you'll hit

Every error, explained and fixed

These are the errors you will encounter in your first few weeks. Every one of them is fixable in under five minutes once you know what caused it.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

WHY IT HAPPENS

You have missing values (NaN) or infinite values in your feature matrix. sklearn's LinearRegression does not handle missing values — it expects clean numerical input.

FIX

Run df.isnull().sum() to find which columns have missing values. Then handle them: df.fillna(df.median()) for numerical columns, df.dropna() if the fraction is tiny, or use sklearn's SimpleImputer in a Pipeline for production code.
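A minimal sketch of the Pipeline option, on simulated data with 18% of one column blanked out. The column names and the missing fraction are borrowed from examples elsewhere on this page; the data itself is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "distance_km": rng.uniform(0.5, 8.0, 200),
    "restaurant_prep": rng.uniform(5, 25, 200),
})
y = 8.6 + 7.3 * X["distance_km"] + X["restaurant_prep"] + rng.normal(0, 4, 200)

# Blank out 18% of prep times to mimic missing logs
missing_idx = X.sample(frac=0.18, random_state=0).index
X.loc[missing_idx, "restaurant_prep"] = np.nan
print(X.isnull().sum())

# The imputer learns the median during fit and reuses it at predict time
pipe = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
pipe.fit(X, y)
preds = pipe.predict(X)
```

Keeping the imputer inside the pipeline means the median is computed from training data only and applied identically at inference.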

ValueError: could not convert string to float: 'restaurant_name'

WHY IT HAPPENS

You included a categorical (text) column in your feature matrix X. LinearRegression can only work with numbers. 'restaurant_name' is still a string.

FIX

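Encode categorical features before training. For low-cardinality categoricals: pd.get_dummies(df, columns=['restaurant_name']). For high-cardinality: use TargetEncoder from sklearn.preprocessing (scikit-learn 1.3 or later). OrdinalEncoder is best kept for categories with a natural order, because a linear model treats its integer codes as magnitudes.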
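Here is what one-hot encoding looks like on a tiny made-up frame (the restaurant names are hypothetical). `drop_first=True` is an optional extra in this sketch: it drops one dummy column so the remaining ones are not perfectly collinear with the intercept.

```python
import pandas as pd

orders = pd.DataFrame({
    "distance_km": [1.2, 3.8, 2.1, 5.6],
    "restaurant_name": ["Pizza Place", "Sushi Spot", "Pizza Place", "Burger Barn"],
})

# One 0/1 column per category; the alphabetically first category is dropped
encoded = pd.get_dummies(
    orders, columns=["restaurant_name"], drop_first=True, dtype=float
)
print(encoded)
```

The dropped category ("Burger Barn" here) becomes the reference level: the other coefficients are read as differences from it.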

ValueError: X has 1 feature, but LinearRegression is expecting 3 features as input

WHY IT HAPPENS

The model was trained on 3 features but you are passing 1 feature at inference time. Feature names or counts do not match between training and prediction.

FIX

Use an sklearn Pipeline that includes your preprocessing steps. This guarantees the same transformations run at training and inference time. Alternatively: always build X for inference the same way you built X_train — same column order, same column names.

R² score is negative

WHY IT HAPPENS

Negative R² means your model is worse than just predicting the mean for every order. Four possible causes: (1) the evaluation data does not match what the model was trained on, for example a mismatched split or preprocessing applied to only one side, (2) features and target are not aligned, typically an index mismatch after filtering, (3) severe outliers pulling the line to a useless position, (4) the relationship is genuinely non-linear.

FIX

Check each cause in order: verify that X_test and y_test come from the same split and went through the same preprocessing as X_train and y_train. Reset dataframe indexes after filtering with .reset_index(drop=True). Check for and remove extreme outliers. Plot the data: if it curves, try polynomial features or switch to a tree-based model.

MAE on test set is 5× higher than on training set

WHY IT HAPPENS

Classic overfitting. Your model has memorised the training data instead of learning generalizable patterns. Linear Regression rarely overfits severely, but it can happen with many engineered features that are specific to the training period.

FIX

Add regularisation: switch to Ridge (L2) or Lasso (L1) regression. Ridge penalises large coefficients; Lasso can zero them out entirely. Both are in sklearn.linear_model. Start with Ridge(alpha=1.0) and tune alpha with cross-validation.
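As a sketch of tuning alpha with cross-validation, `RidgeCV` tries a list of candidate values and keeps the best one. The data is simulated with ten features of which only the first matters, and the alpha grid is an arbitrary starting point.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                   # ten features, mostly noise
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, 200)      # only the first one matters

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]           # candidate penalty strengths
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print(f"Chosen alpha: {ridge.alpha_}")
print("Coefficients:", ridge.coef_.round(2))
```

Larger alpha values shrink all coefficients harder toward zero; if the chosen alpha lands on the edge of the grid, widen the grid and rerun.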

Predictions are systematically too high for short distances and too low for long distances

WHY IT HAPPENS

The relationship between distance and delivery time is not linear. Short-distance orders have a fixed cost (restaurant prep, handover) that dominates. Long-distance orders may benefit from faster roads. A straight line cannot capture this.

FIX

Add polynomial features: from sklearn.preprocessing import PolynomialFeatures, then pipe it before LinearRegression. Or switch to a tree-based model (Decision Tree, XGBoost) that naturally captures non-linear patterns without feature engineering.
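A minimal sketch of the polynomial option, on simulated data with a deliberately curved relationship (the quadratic coefficients are made up for illustration). Degree 2 is a reasonable first try; higher degrees overfit quickly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 8.0, (500, 1))
# Curved relationship: each extra km adds less time than the one before
y = 10 + 9 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(0, 1.5, 500)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

linear = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),   # adds a distance² column
    LinearRegression(),
).fit(X_train, y_train)

r2_linear = linear.score(X_test, y_test)
r2_poly = poly.score(X_test, y_test)
print(f"Straight line R²: {r2_linear:.3f}")
print(f"Degree-2 R²:      {r2_poly:.3f}")
```

The model is still linear regression under the hood: it is linear in the coefficients, just fitted on distance and distance² instead of distance alone.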

What this looks like at work

Day one. You've just joined DoorDash's data team.

Your manager shares a Notion doc: "Current ETA accuracy is ±12 minutes. We need ±5 minutes within Q2. You have access to 6 months of BigQuery order data. Go."

Here is what the actual week looks like — not the sanitised tutorial version.

Mon

Data pull and exploration

Write BigQuery SQL to pull 6 months of completed orders with distance, time, restaurant_id, weather, hour_of_day

Export to pandas. df.shape returns (1,847,332, 23). Run df.describe() and df.isnull().sum()

Find: restaurant_prep_time is 18% null (never logged for partner restaurants). weather_code missing before March

Slack your lead: "prep_time has 18% nulls — impute with restaurant median or drop the feature?" Decision needed before Tuesday

Tue

Feature analysis and first baseline

Decision back: impute with restaurant median. Write imputation code with a comment explaining why

Baseline: always predict median delivery time (31 min). MAE = 11.2 minutes. This is the bar to beat

Simple LR on distance_km only. MAE = 6.8 minutes. Already a 39% improvement over baseline

Plot residuals. Find: model over-predicts for short distances (< 1.5 km). Something non-linear at the short end

Wed

Multiple LR and the real breakthrough

Add 5 more features: restaurant_prep_time, hour_of_day, day_of_week, weather_severity, rider_count_nearby

Multiple LR: MAE = 4.9 minutes. Just below the ±5 minute target. Check correlation matrix — no multicollinearity issues

Residual plot reveals: model still systematically under-predicts 7–9pm. is_peak_hour not in model

Add is_peak_hour binary feature. MAE drops to 4.1 minutes. R² = 0.81. Target met with room to spare

Thu

Cross-validation and documentation

Run 5-fold cross-validation to confirm 4.1 MAE is stable, not lucky on one split. All folds: 3.9–4.3 MAE. Stable

Check assumptions: linearity (residual plot looks good), no major outliers (removed top 0.1%), no multicollinearity confirmed

Write up findings: baseline → simple LR → multiple LR → peak hour feature. Each step with the MAE improvement

Build a single presentation slide: before/after, feature importances, what happens if model degrades

Fri

Present and get sign-off

Present to lead and product manager. Walk through the four-step improvement story. Show the residual plots

Likely questions: "What happens at 3 months when weather patterns change?" → plan automated monthly retraining

"Can we interpret why the model predicts high for some restaurants?" → show per-restaurant coefficient analysis

Get sign-off. Hand off to ML engineer for production deployment. Your job: monitor MAE weekly and flag if it drifts above 5.5

The thing no tutorial tells you: in a real job the hardest part was Monday — getting the data, understanding what each column actually means, finding undocumented data quality issues, waiting for a decision on the 18% null problem before you can move forward. The modelling on Wednesday took 4 hours. The data work took 2 days. This is always the ratio. Data engineering is not a prerequisite you get past — it is half the job, every week.

🎯 Key Takeaways

✓ Linear Regression finds the line that minimises the sum of squared errors (OLS). "Least squares" is the name of that objective.
✓ The trained model is just two attributes: model.coef_ (one slope per feature) and model.intercept_. Prediction = dot product of weights and features + intercept.
✓ Always split 80/20 before touching the model. Never evaluate on training data. Never make decisions based on test set performance; use validation or cross-validation instead.
✓ Report three metrics: MAE (interpretable, same units as target), RMSE (penalises large errors), R² (fraction of variance explained). Always compare to a naive baseline.
✓ Four assumptions to check: linearity (scatter plot), no extreme outliers (box plot), no multicollinearity (correlation matrix), independence of errors (residual plot vs time).
✓ The sklearn interface is always the same: instantiate → fit(X_train, y_train) → predict(X_test). Every algorithm in this section follows this pattern.
