
Logistic Regression

The foundation of all classification. Sigmoid, decision boundaries, cross-entropy, regularisation, and multi-class extension — built from scratch then in sklearn on real data.

55–65 min · March 2026
The wrong name, the right algorithm

Logistic regression is not regression. It is the foundation of all classification.

The name is misleading. Logistic regression predicts probabilities — "what is the probability that this Swiggy order will be late?" — and converts those probabilities into class labels. It is a classification algorithm, not a regression one. The "regression" refers to the linear equation inside it, not to what it predicts.

Despite being over 60 years old, logistic regression is still the first algorithm deployed at many companies for binary classification. At Razorpay it predicts fraud. At Swiggy it predicts late deliveries. At every bank in India it predicts loan defaults. It is fast, interpretable, probabilistically calibrated, and works well with good features. Every ML engineer should understand it completely.

This module builds logistic regression from scratch — sigmoid function, cross-entropy loss, gradient descent — so every piece is visible. Then shows you the sklearn implementation, all regularisation options, the multi-class extension, and every evaluation metric that matters for classification problems.

What this module covers:

The core idea — why linear regression fails for classification
The sigmoid function — squash any number into [0, 1]
Decision boundaries — when does the model say yes?
Cross-entropy loss — why not MSE for classification
Gradient descent for logistic regression — from scratch
sklearn LogisticRegression — all options explained
L1 and L2 regularisation — prevent overfitting
Multi-class: One-vs-Rest and Softmax (Multinomial)
Probability calibration — are the probabilities trustworthy?
Evaluation — accuracy, precision, recall, F1, ROC-AUC
Threshold tuning — 0.5 is rarely optimal
Interpreting coefficients — what the model learned
🎯 Pro Tip
The problem throughout this module: predict whether a Swiggy delivery will be late (delivery_time > 45 minutes). This is a binary classification problem — the kind logistic regression was designed for. Every concept is demonstrated on this real business question.
The problem with the obvious approach

Why linear regression breaks for classification

The obvious approach to binary classification: train a linear regression, predict a number, and if the number is above 0.5 call it class 1. This actually works for some problems. But it has three fundamental flaws that make it unreliable in general.

Predictions go outside [0, 1]

Linear regression predicts any real number. For a classification problem, a prediction of 1.7 or -0.3 is meaningless as a probability. The further a point is from the decision boundary, the more absurd the prediction becomes.

Sensitive to outliers far from the boundary

Add a single extreme point far into the positive class region. The regression line tilts toward it, moving the decision boundary and misclassifying many correctly-labelled points. Classification should not care about how far positive examples are from the boundary — only that they are on the right side.
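This sensitivity is easy to demonstrate on synthetic 1-D data. The sketch below (helper functions and numbers are illustrative, not from the module's dataset) fits both models, adds one extreme but correctly-labelled positive point, and compares how far each decision boundary moves:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
# 1-D toy data: class 0 around x=3, class 1 around x=6
X = np.concatenate([rng.normal(3, 0.7, 50), rng.normal(6, 0.7, 50)]).reshape(-1, 1)
y = np.concatenate([np.zeros(50), np.ones(50)])

def linreg_threshold(X, y):
    """x where the linear-regression prediction crosses 0.5."""
    m = LinearRegression().fit(X, y)
    return float((0.5 - m.intercept_) / m.coef_[0])

def logreg_threshold(X, y):
    """x where the logistic model is exactly 50% confident (z = 0)."""
    m = LogisticRegression().fit(X, y)
    return float(-m.intercept_[0] / m.coef_[0, 0])

t_lin, t_log = linreg_threshold(X, y), logreg_threshold(X, y)

# Add ONE extreme positive example far into the positive region
X_out = np.vstack([X, [[40.0]]])
y_out = np.append(y, 1.0)
t_lin_out, t_log_out = linreg_threshold(X_out, y_out), logreg_threshold(X_out, y_out)

print(f"Linear regression boundary:   {t_lin:.2f} -> {t_lin_out:.2f}")
print(f"Logistic regression boundary: {t_log:.2f} -> {t_log_out:.2f}")
# The linear boundary shifts noticeably; the logistic one barely moves,
# because a correctly-classified far-away point contributes almost no gradient.
```

The outlier is on the correct side of the boundary, so logistic regression's loss for it is near zero and the fit is essentially unchanged; linear regression's squared error forces the whole line to tilt toward it.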

Not probabilistically calibrated

For risk-sensitive decisions (fraud, loan default, medical diagnosis), you need a calibrated probability: "this transaction has a 3.2% chance of being fraud." Linear regression gives you a raw number with no probabilistic interpretation.

Logistic regression solves all three by applying one function to the linear prediction before outputting it: the sigmoid.

python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')   # non-interactive backend for scripts

# Demonstrate why linear regression fails for classification
np.random.seed(42)
n = 200

# Generate binary classification data: late vs on-time deliveries
distance_ontime = np.random.normal(3, 1, n//2)
distance_late   = np.random.normal(6, 1, n//2)

X = np.concatenate([distance_ontime, distance_late]).reshape(-1, 1)
y = np.concatenate([np.zeros(n//2), np.ones(n//2)])   # 0=on-time, 1=late

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_sc = sc.fit_transform(X)

# Linear regression on binary labels
lin_reg = LinearRegression()
lin_reg.fit(X_sc, y)
y_pred_lin = lin_reg.predict(X_sc)

print("Linear regression predictions (should be in [0,1] but are not):")
print(f"  Min prediction: {y_pred_lin.min():.3f}")
print(f"  Max prediction: {y_pred_lin.max():.3f}")
print(f"  Predictions < 0: {(y_pred_lin < 0).sum()}")
print(f"  Predictions > 1: {(y_pred_lin > 1).sum()}")

# Logistic regression
log_reg = LogisticRegression()
log_reg.fit(X_sc, y)
y_pred_proba = log_reg.predict_proba(X_sc)[:, 1]

print("\nLogistic regression predictions (always in [0,1]):")
print(f"  Min prediction: {y_pred_proba.min():.3f}")
print(f"  Max prediction: {y_pred_proba.max():.3f}")
print(f"  Predictions < 0: {(y_pred_proba < 0).sum()}")
print(f"  Predictions > 1: {(y_pred_proba > 1).sum()}")
The key function

The sigmoid — squash any number into a probability

The sigmoid function takes any real number — large positive, large negative, anything in between — and maps it to a number strictly between 0 and 1. This is exactly the range of probabilities. As the input grows toward +∞, the output approaches 1. As it shrinks toward −∞, the output approaches 0. At input 0, the output is exactly 0.5.

Sigmoid function — the S-curve

σ(z) = 1 / (1 + e⁻ᶻ)

z → −∞    σ → 0.0
z = −6    σ = 0.002
z = −2    σ = 0.119
z =  0    σ = 0.500
z = +2    σ = 0.881
z = +6    σ = 0.998
z → +∞    σ → 1.0

The full logistic regression model chains two steps: first a linear combination of the features (the same as linear regression), then the sigmoid applied to the result. The linear part (z = w·x + b) can produce any number. The sigmoid converts it into a probability.

Logistic regression — the full model

1. Linear combination:  z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b  — same as linear regression, gives any real number
2. Sigmoid activation:  p = σ(z) = 1 / (1 + e⁻ᶻ)  — squashes z into a probability in [0, 1]
3. Decision rule:       ŷ = 1 if p ≥ threshold else 0  — threshold = 0.5 by default, tunable
python
import numpy as np

# ── The sigmoid function ──────────────────────────────────────────────
def sigmoid(z: np.ndarray) -> np.ndarray:
    """σ(z) = 1 / (1 + e^(-z)) — numerically stable implementation."""
    # Clip to prevent overflow in exp for very large negative values
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Properties of sigmoid
z_values = np.array([-10, -5, -2, -1, 0, 1, 2, 5, 10])
print("Sigmoid values:")
for z, s in zip(z_values, sigmoid(z_values)):
    print(f"  σ({z:+3d}) = {s:.6f}")

# Sigmoid derivative: σ'(z) = σ(z) * (1 - σ(z))
# Maximum at z=0: σ'(0) = 0.5 * 0.5 = 0.25
# This is why sigmoid causes vanishing gradients in deep networks

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

print(f"\nSigmoid derivative at z=0: {sigmoid_derivative(0):.4f}")  # 0.25 (max)
print(f"Sigmoid derivative at z=5: {sigmoid_derivative(5):.6f}")   # near 0

# ── Logistic regression prediction ───────────────────────────────────
def logistic_predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """
    Forward pass: compute probability for each sample.
    X: (n_samples, n_features)
    w: (n_features,)
    b: scalar
    Returns: (n_samples,) — probability of class 1
    """
    z = X @ w + b         # linear combination: shape (n_samples,)
    return sigmoid(z)     # squash to [0,1]

# Test with one sample: distance=6km, traffic=8, prep=20min
# High values → should predict high probability of being late
X_sample = np.array([[6.0, 8.0, 20.0]])   # shape (1, 3)
w_random  = np.array([0.5, 0.3, 0.2])     # random weights
b_random  = -5.0

p_late = logistic_predict(X_sample, w_random, b_random)
print(f"\nP(late | distance=6, traffic=8, prep=20) = {p_late[0]:.4f}")
print(f"Prediction: {'LATE' if p_late[0] >= 0.5 else 'ON-TIME'}")
The loss function

Cross-entropy loss — why not MSE for classification

We need a loss function that tells the model how wrong its probability prediction was. Why not use MSE — (p − y)² — the same loss as regression? Two reasons. First, MSE composed with the sigmoid produces a non-convex loss surface with local minima where gradient descent can get stuck. Second, MSE penalises a confident wrong prediction (p=0.99, y=0) by only (0.99)² ≈ 0.98 — not harshly enough to teach the model to be certain only when correct.

Cross-entropy loss penalises a confident wrong prediction with −log(0.01) = 4.6 — much harsher. And it produces a perfectly convex loss surface, meaning gradient descent always finds the global minimum.

L = −[y · log(p) + (1 − y) · log(1 − p)]
When y=1 (truly late): L = −log(p). Low loss if p≈1, huge loss if p≈0.
When y=0 (truly on-time): L = −log(1−p). Low loss if p≈0, huge loss if p≈1.
Confident and correct → loss near 0. Confident and wrong → loss → ∞.
The total loss is the mean over all training examples.
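The penalty gap between the two losses is easy to verify directly — a quick sketch (helper names are my own, not sklearn's):

```python
import numpy as np

def mse_loss(y, p):
    """Squared-error penalty for a single prediction."""
    return (p - y) ** 2

def cross_entropy_loss(y, p, eps=1e-10):
    """Binary cross-entropy penalty for a single prediction."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident wrong prediction: truly on-time (y=0), model says 0.99
y, p = 0, 0.99
print(f"MSE penalty:           {mse_loss(y, p):.2f}")            # 0.98 — mild
print(f"Cross-entropy penalty: {cross_entropy_loss(y, p):.2f}")  # 4.61 — harsh

# As confidence in the wrong answer grows, CE blows up while MSE saturates at 1
for p in [0.9, 0.99, 0.999, 0.9999]:
    print(f"p={p}:  MSE={mse_loss(0, p):.4f}  CE={cross_entropy_loss(0, p):.4f}")
```

MSE can never charge more than 1 for a wrong answer, no matter how confident; cross-entropy grows without bound, which is exactly the pressure that keeps the model honest about its certainty.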
python
import numpy as np

def binary_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray,
                          eps: float = 1e-10) -> float:
    """
    Binary cross-entropy loss.
    eps prevents log(0) = -inf.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)   # prevent log(0)
    return -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )

# Loss for different prediction / label combinations
scenarios = [
    (1, 0.99,  'Truly late,   predicts 0.99 (confident & correct)'),
    (1, 0.50,  'Truly late,   predicts 0.50 (uncertain)'),
    (1, 0.01,  'Truly late,   predicts 0.01 (confident & WRONG)'),
    (0, 0.01,  'Truly on-time, predicts 0.01 (confident & correct)'),
    (0, 0.50,  'Truly on-time, predicts 0.50 (uncertain)'),
    (0, 0.99,  'Truly on-time, predicts 0.99 (confident & WRONG)'),
]

print("Cross-entropy loss per scenario:")
for y, p, desc in scenarios:
    loss = binary_cross_entropy(np.array([y]), np.array([p]))
    bar  = '█' * int(loss * 5)
    print(f"  {desc}")
    print(f"    Loss = {loss:.4f}  {bar}")
    print()

# Key insight: loss is SYMMETRIC
# L(y=1, p=0.01) ≈ L(y=0, p=0.99) ≈ 4.6
# Both confident wrong predictions are equally penalised
Under the hood

Logistic regression from scratch — gradient descent on cross-entropy

To train logistic regression we need the gradient of the cross-entropy loss with respect to the weights. The chain rule through sigmoid produces a beautifully simple result: the gradient is just the prediction error times the input feature — the same form as linear regression.

The gradient — derived via chain rule (optional — read when ready)

∂L/∂w = (1/n) · Xᵀ(p − y)
∂L/∂b = (1/n) · Σ(p − y)

where p = σ(Xw + b) are the predicted probabilities, y are the true labels (0 or 1), and (p − y) are the prediction errors. The form is identical to the linear regression gradient — the sigmoid derivative cancels out perfectly against the cross-entropy derivative.
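If the cancellation sounds too convenient, it can be checked numerically. This finite-difference sketch on toy data (all names below are illustrative) compares the analytic gradient against a numerical one:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce_loss(w, b, X, y, eps=1e-12):
    """Mean binary cross-entropy for weights w, bias b."""
    p = np.clip(sigmoid(X @ w + b), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
w, b = np.array([0.3, -0.1, 0.2]), 0.1

# Analytic gradient: (1/n) Xᵀ(p − y)
p = sigmoid(X @ w + b)
dw_analytic = X.T @ (p - y) / len(y)
db_analytic = (p - y).mean()

# Central finite differences: (L(w+h) − L(w−h)) / 2h
h = 1e-6
dw_numeric = np.array([
    (bce_loss(w + h * np.eye(3)[j], b, X, y)
     - bce_loss(w - h * np.eye(3)[j], b, X, y)) / (2 * h)
    for j in range(3)
])
db_numeric = (bce_loss(w, b + h, X, y) - bce_loss(w, b - h, X, y)) / (2 * h)

print("max |dw_analytic - dw_numeric| =", np.abs(dw_analytic - dw_numeric).max())
print("|db_analytic - db_numeric|     =", abs(db_analytic - db_numeric))
# Both differences should be tiny (around 1e-8 or smaller) — the formulas agree.
```

This gradient-check pattern is worth keeping around: it catches sign errors and missing 1/n factors in any hand-derived gradient.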
python
import numpy as np

np.random.seed(42)

# ── Generate Swiggy dataset ───────────────────────────────────────────
n = 2000
distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery = 8.6 + 7.3*distance + 0.8*prep + 1.5*traffic + np.random.normal(0, 4, n)
y = (delivery > 45).astype(float)   # 1 = late, 0 = on-time

X = np.column_stack([distance, traffic, prep])

# Scale features (critical for gradient descent convergence)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc  = sc.transform(X_test)

# ── Logistic regression from scratch ─────────────────────────────────
class LogisticRegressionScratch:
    """
    Binary logistic regression trained with mini-batch gradient descent.
    All the math visible, nothing hidden.
    """

    def __init__(self, lr: float = 0.1, n_epochs: int = 200,
                 batch_size: int = 64, l2: float = 0.01):
        self.lr          = lr
        self.n_epochs    = n_epochs
        self.batch_size  = batch_size
        self.l2          = l2   # L2 regularisation strength
        self.w           = None
        self.b           = 0.0
        self.loss_history = []

    @staticmethod
    def sigmoid(z): return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, X: np.ndarray, y: np.ndarray):
        n, d = X.shape
        self.w = np.zeros(d)   # initialise weights at 0
        rng    = np.random.default_rng(42)

        for epoch in range(self.n_epochs):
            # Shuffle training data each epoch
            idx = rng.permutation(n)
            X_s, y_s = X[idx], y[idx]
            epoch_loss = 0.0

            for start in range(0, n, self.batch_size):
                Xb = X_s[start:start + self.batch_size]
                yb = y_s[start:start + self.batch_size]
                nb = len(Xb)

                # ── Forward pass ──────────────────────────────────
                z = Xb @ self.w + self.b        # linear: (nb,)
                p = self.sigmoid(z)              # probability: (nb,)

                # ── Loss (with L2 regularisation) ─────────────────
                eps   = 1e-10
                bce   = -np.mean(yb*np.log(p+eps) + (1-yb)*np.log(1-p+eps))
                reg   = 0.5 * self.l2 * np.dot(self.w, self.w)
                loss  = bce + reg
                epoch_loss += loss

                # ── Backward pass (gradients) ─────────────────────
                error  = p - yb                        # (nb,) — prediction error
                dw     = (Xb.T @ error) / nb + self.l2 * self.w  # (d,)
                db     = error.mean()                  # scalar

                # ── Weight update ──────────────────────────────────
                self.w -= self.lr * dw
                self.b -= self.lr * db

            self.loss_history.append(epoch_loss / (n // self.batch_size))

            if epoch % 40 == 0:
                print(f"  Epoch {epoch:3d}: loss = {self.loss_history[-1]:.4f}")

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        return self.sigmoid(X @ self.w + self.b)

    def predict(self, X: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        return (self.predict_proba(X) >= threshold).astype(int)

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        return (self.predict(X) == y).mean()

# ── Train ──────────────────────────────────────────────────────────────
model_scratch = LogisticRegressionScratch(lr=0.1, n_epochs=200, l2=0.01)
model_scratch.fit(X_train_sc, y_train)

acc_train = model_scratch.score(X_train_sc, y_train)
acc_test  = model_scratch.score(X_test_sc, y_test)
print("\nFrom-scratch model:")
print(f"  Train accuracy: {acc_train:.4f}")
print(f"  Test accuracy:  {acc_test:.4f}")

# Verify against sklearn
from sklearn.linear_model import LogisticRegression
# Note: sklearn penalises 0.5·||w||² against the SUM of per-sample losses,
# so C = 1/l2 only approximately matches the scratch model's per-mean lambda.
sk_model = LogisticRegression(C=1/0.01, max_iter=500, random_state=42)
sk_model.fit(X_train_sc, y_train)
print(f"\nsklearn LogisticRegression test accuracy: {sk_model.score(X_test_sc, y_test):.4f}")
# Should match closely ↑
The production way

sklearn LogisticRegression — every option explained

sklearn's LogisticRegression has many parameters. Most tutorials use the defaults without explaining what they do. This section explains every important parameter so you can make principled choices rather than accepting defaults blindly.

python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np

# ── The C parameter — inverse of regularisation strength ──────────────
# C = 1/lambda  where lambda is the regularisation strength
# Large C  → weak regularisation → model can fit training data more freely
# Small C  → strong regularisation → simpler model, less overfitting
# Default C=1.0 is a reasonable starting point
# ALWAYS tune C — it is the most important hyperparameter

for C in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    scores = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='roc_auc')
    print(f"  C={C:<8}: ROC-AUC = {scores.mean():.4f} ± {scores.std():.4f}")

# ── penalty — L1 vs L2 vs ElasticNet ──────────────────────────────────
# penalty='l2' (default): weight² penalty — drives weights toward 0, keeps all features
# penalty='l1':           |weight| penalty — drives some weights to exactly 0 (feature selection)
# penalty='elasticnet':   mix of L1 and L2 — l1_ratio controls the mix

penalties = [
    ('l2',         dict(C=1.0, solver='lbfgs')),
    ('l1',         dict(C=1.0, solver='liblinear')),
    ('elasticnet', dict(C=1.0, solver='saga', l1_ratio=0.5, max_iter=2000)),
    (None,         dict(solver='lbfgs')),   # no regularisation (penalty=None in sklearn >= 1.2)
]

print("\nRegularisation comparison (5-fold CV ROC-AUC):")
for penalty, kwargs in penalties:
    model = LogisticRegression(penalty=penalty, random_state=42, **kwargs)
    scores = cross_val_score(model, X_train_sc, y_train, cv=5, scoring='roc_auc')
    print(f"  penalty={penalty!r:<13}: {scores.mean():.4f} ± {scores.std():.4f}")

# ── solver — which optimisation algorithm ─────────────────────────────
# 'lbfgs':     default, L-BFGS quasi-Newton. Fast for small-medium data.
# 'liblinear': fast for small datasets, supports L1 and L2
# 'saga':      faster for large datasets, supports L1, L2, ElasticNet
# 'sag':       stochastic average gradient, only L2 (SAGA is its improved variant)
# 'newton-cg': Newton's method, only L2

# ── class_weight — handle imbalanced datasets ─────────────────────────
# If 90% of orders are on-time and 10% are late, a model that always
# predicts on-time gets 90% accuracy but catches 0 late orders.
# class_weight='balanced' weights the loss by inverse class frequency.

model_balanced = LogisticRegression(
    C=1.0,
    class_weight='balanced',   # auto-weight: rare class gets higher penalty for errors
    max_iter=1000,
    random_state=42,
)
model_balanced.fit(X_train_sc, y_train)

from sklearn.metrics import classification_report
y_pred_balanced = model_balanced.predict(X_test_sc)
print("\nWith class_weight='balanced':")
print(classification_report(y_test, y_pred_balanced, target_names=['on-time','late']))

# ── Full pipeline — the production pattern ─────────────────────────────
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  LogisticRegression(
        C=1.0,
        penalty='l2',
        solver='lbfgs',
        max_iter=1000,
        class_weight='balanced',
        random_state=42,
    )),
])
pipe.fit(X_train, y_train)
print(f"\nPipeline test accuracy: {pipe.score(X_test, y_test):.4f}")
What the model learned

Decision boundary and coefficient interpretation

The decision boundary is the set of points where the model is exactly 50% confident — the line (in 2D) or hyperplane (in n dimensions) that separates the two classes. Every point on one side gets predicted as class 1, every point on the other side as class 0.

Unlike neural network weights, logistic regression coefficients are directly interpretable. Each coefficient tells you: holding all other features fixed, how much does a one-unit increase in that feature change the log-odds of the positive class? With standardised features, one unit means one standard deviation.

python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Train on full feature set including engineered features
feature_names = ['distance_km','traffic_score','restaurant_prep']
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc  = sc.transform(X_test)

model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_sc, y_train)

# ── Coefficient interpretation ────────────────────────────────────────
# Coefficients are in terms of STANDARDISED features
# A coefficient of 1.5 means: increasing this feature by 1 std
# increases the log-odds of being late by 1.5
# log-odds = log(p / (1-p))

coef_df = pd.DataFrame({
    'feature':     feature_names,
    'coefficient': model.coef_[0],
    'odds_ratio':  np.exp(model.coef_[0]),  # e^coef = how odds multiply per std
}).sort_values('coefficient', ascending=False)

print("Logistic regression coefficients (standardised features):")
print(coef_df.to_string(index=False))
print(f"\nIntercept (bias): {model.intercept_[0]:.4f}")

print("\nInterpretation (per 1 standard deviation increase):")
for _, row in coef_df.iterrows():
    direction = 'INCREASES' if row['coefficient'] > 0 else 'DECREASES'
    print(f"  {row['feature']:<20}: {direction} P(late) | "
          f"coef={row['coefficient']:.3f} | odds_ratio={row['odds_ratio']:.3f}")

# ── Decision boundary in 2D (distance vs traffic) ─────────────────────
# Boundary: w₁x₁ + w₂x₂ + b = 0  →  x₂ = -(w₁x₁ + b) / w₂
# For a 2-feature model:
model_2d = LogisticRegression(C=1.0, random_state=42)
X_2d_train = X_train_sc[:, :2]   # only distance and traffic (standardised)
model_2d.fit(X_2d_train, y_train)

w1, w2 = model_2d.coef_[0]
b      = model_2d.intercept_[0]

# x_traffic = -(w1*x_distance + b) / w2
print(f"
Decision boundary equation:")
print(f"  {w2:.3f} × traffic + {w1:.3f} × distance + {b:.3f} = 0")
print(f"  traffic = ({-w1:.3f} × distance + {-b:.3f}) / {w2:.3f}")

# At distance = 0 (standardised): traffic threshold
traffic_at_mean_dist = (- w1 * 0 - b) / w2
print(f"  At mean distance: classify as late if traffic > {traffic_at_mean_dist:.2f} std")
Did it actually work?

Classification evaluation — beyond accuracy

Accuracy is the wrong metric for almost every real classification problem. If 85% of deliveries are on-time, a model that always predicts on-time gets 85% accuracy while being completely useless. You need metrics that capture how well the model finds the minority class.
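Before trusting any accuracy number, it is worth computing that majority-class baseline explicitly. A sketch on an imbalanced synthetic dataset (the feature and cutoff below are made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(42)
n = 4000
distance = np.abs(rng.normal(4.0, 2.0, n))
# Imbalanced labels: only a minority of orders are late, driven by distance
y = (distance + rng.normal(0, 1.5, n) > 7).astype(int)
X = distance.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

# Baseline: always predict the majority class ("on-time")
dummy = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = LogisticRegression(class_weight='balanced').fit(X_tr, y_tr)

print(f"Late-order rate:    {y.mean():.1%}")
print(f"Dummy accuracy:     {dummy.score(X_te, y_te):.3f}   "
      f"recall: {recall_score(y_te, dummy.predict(X_te)):.3f}")
print(f"Logistic accuracy:  {model.score(X_te, y_te):.3f}   "
      f"recall: {recall_score(y_te, model.predict(X_te)):.3f}")
# The dummy model scores high on accuracy with ZERO recall —
# it never catches a single late delivery.
```

The dummy model's accuracy looks respectable while its recall is exactly zero; that contrast is the whole argument for the metrics below.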

The four cells of a confusion matrix — what everything is built from

              Pred 0                 Pred 1
True 0   TN (True Negative)    FP (False Positive)
True 1   FN (False Negative)   TP (True Positive)

Accuracy    (TP+TN) / Total        Only when classes are balanced
Precision   TP / (TP+FP)           When FP is costly (spam filter)
Recall      TP / (TP+FN)           When FN is costly (fraud, medical)
F1          2 × P×R / (P+R)        Balanced harmonic mean of P and R
ROC-AUC     Area under ROC curve   Threshold-independent, most complete
python
from sklearn.metrics import (
    classification_report, confusion_matrix,
    roc_auc_score, average_precision_score,
    precision_recall_curve, roc_curve,
)
import numpy as np

# Train final model
model = LogisticRegression(C=1.0, class_weight='balanced',
                            max_iter=1000, random_state=42)
model.fit(X_train_sc, y_train)

y_pred       = model.predict(X_test_sc)
y_proba      = model.predict_proba(X_test_sc)[:, 1]

# ── Classification report ──────────────────────────────────────────────
print("Classification report:")
print(classification_report(y_test, y_pred, target_names=['on-time','late']))

# ── Confusion matrix ───────────────────────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"Confusion matrix:")
print(f"  TN={tn}  FP={fp}")
print(f"  FN={fn}  TP={tp}")
print(f"\n  Precision (of predicted-late, how many truly late): {tp/(tp+fp):.3f}")
print(f"  Recall    (of truly-late, how many we caught):       {tp/(tp+fn):.3f}")

# ── ROC-AUC ───────────────────────────────────────────────────────────
roc_auc = roc_auc_score(y_test, y_proba)
avg_prec = average_precision_score(y_test, y_proba)
print(f"\n  ROC-AUC:              {roc_auc:.4f}")
print(f"  Avg Precision (AP):   {avg_prec:.4f}")
print(f"  Accuracy:             {(y_pred == y_test).mean():.4f}")

# ── Threshold tuning — 0.5 is not always optimal ──────────────────────
# The business question determines the right threshold.
# At Swiggy: missing a late delivery (FN) costs customer experience.
# A lower threshold catches more late deliveries but flags more on-time ones.

print("\nThreshold analysis:")
print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<10} {'F1':<10} {'Flagged %'}")
print("─" * 58)
for threshold in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred_t  = (y_proba >= threshold).astype(int)
    tp_t      = ((y_pred_t == 1) & (y_test == 1)).sum()
    fp_t      = ((y_pred_t == 1) & (y_test == 0)).sum()
    fn_t      = ((y_pred_t == 0) & (y_test == 1)).sum()
    prec_t    = tp_t / max(tp_t + fp_t, 1)
    rec_t     = tp_t / max(tp_t + fn_t, 1)
    f1_t      = 2 * prec_t * rec_t / max(prec_t + rec_t, 1e-10)
    flagged_t = y_pred_t.mean() * 100
    print(f"{threshold:<12.1f} {prec_t:<12.3f} {rec_t:<10.3f} {f1_t:<10.3f} {flagged_t:.1f}%")

# For Swiggy: a threshold of 0.3 catches more late deliveries
# but flags more on-time ones for proactive ETA warnings — good tradeoff
Preventing overfitting

L1 and L2 regularisation — what they do and when to use each

Regularisation adds a penalty term to the loss function that discourages large weight values. Without it, logistic regression can memorise the training data (especially when features are many or highly correlated), producing large weights that don't generalise.

python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Generate a high-dimensional problem: 20 features, many irrelevant
np.random.seed(42)
n_train = 500

# 3 truly useful features + 17 noise features
X_useful = X_train[:n_train, :]                            # (n, 3) useful
X_noise  = np.random.randn(n_train, 17)                   # (n, 17) pure noise
X_high   = np.column_stack([X_useful, X_noise])
y_high   = y_train[:n_train]

X_useful_te = X_test[:100, :]
X_noise_te  = np.random.randn(100, 17)
X_high_te   = np.column_stack([X_useful_te, X_noise_te])
y_high_te   = y_test[:100]

sc20 = StandardScaler()
X_high_sc    = sc20.fit_transform(X_high)
X_high_te_sc = sc20.transform(X_high_te)

feature_names_20 = (
    ['distance_km','traffic_score','restaurant_prep'] +
    [f'noise_{i:02d}' for i in range(17)]
)

print("Effect of regularisation on 20-feature problem (3 useful, 17 noise):")
print(f"{'Model':<35} {'Train acc':<12} {'Test acc':<12} {'Non-zero coefs'}")
print("─" * 72)

for name, model in [
    ('No regularisation (penalty=None)', LogisticRegression(penalty=None, solver='lbfgs', max_iter=2000)),
    ('L2 regularisation C=1.0',          LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=2000)),
    ('L2 regularisation C=0.1',          LogisticRegression(penalty='l2', C=0.1, solver='lbfgs', max_iter=2000)),
    ('L1 regularisation C=1.0',          LogisticRegression(penalty='l1', C=1.0, solver='liblinear', max_iter=2000)),
    ('L1 regularisation C=0.1',          LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=2000)),
]:
    model.fit(X_high_sc, y_high)
    tr_acc = model.score(X_high_sc, y_high)
    te_acc = model.score(X_high_te_sc, y_high_te)
    n_nz   = (np.abs(model.coef_[0]) > 1e-6).sum()
    print(f"{name:<35} {tr_acc:<12.4f} {te_acc:<12.4f} {n_nz}/20")

# L1 observations:
# - Drives noise feature coefficients to exactly 0 (true sparse solution)
# - Keeps only the 3 useful features (if C is small enough)
# - Acts as automatic feature selection

# L2 observations:
# - Shrinks all coefficients toward 0 but never exactly 0
# - All 20 features stay in the model with small weights
# - Better when all features are somewhat relevant

# Check which features L1 kept
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=2000)
l1_model.fit(X_high_sc, y_high)
kept = [name for name, coef in zip(feature_names_20, l1_model.coef_[0]) if abs(coef) > 1e-6]
print(f"\nFeatures kept by L1 (C=0.1): {kept}")
Beyond binary

Multi-class logistic regression — OvR and Softmax

Binary logistic regression predicts two classes. For three or more classes, there are two strategies. One-vs-Rest (OvR) trains one binary classifier per class — "is this class 1 or not?", "is this class 2 or not?" — and picks the class with highest confidence. Multinomial (Softmax) extends the model directly to output a proper probability distribution over all classes simultaneously.

OvR vs Multinomial (Softmax) — when to use each
One-vs-Rest (OvR)

Trains K binary classifiers. Probabilities don't sum to 1 across classes — they are independent binary probabilities normalised afterward.

multi_class='ovr'
Works with any solver
Faster for large K
Best when classes are highly imbalanced
Multinomial (Softmax)

Trains one model for all K classes simultaneously. Probabilities always sum to exactly 1 across classes — a proper probability distribution.

multi_class='multinomial'
Needs solver='lbfgs', 'saga', or 'newton-cg'
Better calibrated probabilities
Preferred when K is small and balanced
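The softmax step itself is only a few lines. A from-scratch sketch of how K raw linear scores become a proper distribution, contrasted with independent sigmoids (the OvR view); the class names and logit values are illustrative:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Convert K raw scores (logits) into probabilities that sum to 1.
    Subtracting the max is the standard numerical-stability trick —
    it leaves the result unchanged but prevents overflow in exp."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Raw linear scores for one order across 3 classes:
# z_k = w_k · x + b_k  for k in {express, normal, delayed}
logits = np.array([1.2, 0.4, -0.8])
probs = softmax(logits)

for cls, p in zip(['express', 'normal', 'delayed'], probs):
    print(f"  {cls:<8}: {p:.4f}")
print(f"  sum = {probs.sum():.6f}")   # exactly 1 by construction

# Contrast: three independent sigmoids (the OvR view) need not sum to 1,
# which is why sklearn normalises them afterward for predict_proba
sigmoids = 1 / (1 + np.exp(-logits))
print(f"  independent sigmoids sum to {sigmoids.sum():.4f} (not a distribution)")
```

Softmax couples the K scores: raising one class's logit necessarily lowers every other class's probability, which is what makes the multinomial probabilities better calibrated than normalised OvR outputs.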
python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd

np.random.seed(42)
n = 3000

# Multi-class problem: predict delivery speed category
# Classes: 'express' (<25 min), 'normal' (25-45 min), 'delayed' (>45 min)
distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery = 8.6 + 7.3*distance + 0.8*prep + 1.5*traffic + np.random.normal(0, 4, n)

y_multi = pd.cut(
    delivery,
    bins=[-np.inf, 25, 45, np.inf],
    labels=['express','normal','delayed'],
).astype(str)

le = LabelEncoder()
y_enc = le.fit_transform(y_multi)
print(f"Classes: {le.classes_}")
print(f"Class distribution: {dict(zip(le.classes_, np.bincount(y_enc)))}")

X_multi = np.column_stack([distance, traffic, prep])
from sklearn.model_selection import train_test_split
Xm_tr, Xm_te, ym_tr, ym_te = train_test_split(X_multi, y_enc,
                                                test_size=0.2, random_state=42)
sc = StandardScaler()
Xm_tr_sc = sc.fit_transform(Xm_tr)
Xm_te_sc = sc.transform(Xm_te)

# ── OvR ────────────────────────────────────────────────────────────────
# Note: the multi_class parameter is deprecated from sklearn 1.5; newer
# versions default to multinomial. For explicit OvR, wrap the estimator
# in sklearn.multiclass.OneVsRestClassifier instead.
model_ovr = LogisticRegression(
    multi_class='ovr',
    C=1.0, solver='lbfgs', max_iter=1000, random_state=42
)
model_ovr.fit(Xm_tr_sc, ym_tr)

# ── Multinomial (Softmax) ──────────────────────────────────────────────
model_softmax = LogisticRegression(
    multi_class='multinomial',
    C=1.0, solver='lbfgs', max_iter=1000, random_state=42
)
model_softmax.fit(Xm_tr_sc, ym_tr)

print(f"\nOvR accuracy:       {model_ovr.score(Xm_te_sc, ym_te):.4f}")
print(f"Softmax accuracy:   {model_softmax.score(Xm_te_sc, ym_te):.4f}")

print("\nSoftmax probabilities for one sample:")
sample = Xm_te_sc[:1]
proba  = model_softmax.predict_proba(sample)[0]
for cls, p in zip(le.classes_, proba):
    bar = '█' * int(p * 30)
    print(f"  {cls:<10}: {bar} {p:.4f}")
print(f"  Sum: {proba.sum():.6f}  ← always exactly 1.0")

print("\nClassification report (Softmax):")
print(classification_report(
    ym_te, model_softmax.predict(Xm_te_sc),
    target_names=le.classes_,
))
What this looks like at work

Production late-delivery predictor — end to end

This is what the actual day-one task looks like when you join a data team and are asked to build a late-delivery classifier. Feature engineering, cross-validation, threshold selection, and model persistence — all in one pipeline.

python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import (roc_auc_score, average_precision_score,
                              classification_report)
import joblib

np.random.seed(42)
n = 8000
restaurants = ['Pizza Hut','Biryani Blues',"McDonald's","Haldiram's",
               'Dominos','KFC','Subway','Burger King']
cities = ['Bangalore','Mumbai','Delhi','Hyderabad','Pune','Chennai']
slots  = ['breakfast','lunch','evening','dinner']

distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
value    = np.abs(np.random.normal(350, 150, n)).clip(50, 1200)
delivery = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
            + np.random.normal(0, 4, n)).clip(10, 120)

df_prod = pd.DataFrame({
    'restaurant':     np.random.choice(restaurants, n),
    'city':           np.random.choice(cities, n),
    'time_slot':      np.random.choice(slots, n),
    'distance_km':    distance,
    'traffic_score':  traffic,
    'restaurant_prep':prep,
    'order_value':    value,
})
y_prod = (delivery > 45).astype(int)

# ── Feature engineering ────────────────────────────────────────────────
df_prod['log_distance']  = np.log1p(df_prod['distance_km'])
df_prod['dist_x_traffic']= df_prod['distance_km'] * df_prod['traffic_score']
df_prod['log_value']     = np.log1p(df_prod['order_value'])

NUM_FEATURES = ['log_distance','traffic_score','restaurant_prep',
                'dist_x_traffic','log_value','distance_km']
CAT_FEATURES = ['city','time_slot']

# ── Column transformer ─────────────────────────────────────────────────
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler',  StandardScaler()),
    ]), NUM_FEATURES),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot',  OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first')),
    ]), CAT_FEATURES),
])

# ── Final pipeline ─────────────────────────────────────────────────────
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier',   LogisticRegression(
        C=0.5,
        penalty='l2',
        class_weight='balanced',
        solver='lbfgs',
        max_iter=2000,
        random_state=42,
    )),
])

# ── Stratified cross-validation ────────────────────────────────────────
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    pipeline, df_prod, y_prod,
    cv=cv,
    scoring=['roc_auc','average_precision','accuracy'],
    return_train_score=True,
)

print("5-Fold Cross-Validation Results:")
for metric in ['roc_auc','average_precision','accuracy']:
    train_mean = cv_results[f'train_{metric}'].mean()
    val_mean   = cv_results[f'test_{metric}'].mean()
    val_std    = cv_results[f'test_{metric}'].std()
    print(f"  {metric:<22}: train={train_mean:.4f}  val={val_mean:.4f} ± {val_std:.4f}")

# ── Find optimal threshold on a held-out validation split ──────────────
from sklearn.model_selection import train_test_split
df_tr, df_val, y_tr, y_val = train_test_split(df_prod, y_prod,
                                               test_size=0.2, stratify=y_prod, random_state=99)
pipeline.fit(df_tr, y_tr)
val_proba = pipeline.predict_proba(df_val)[:, 1]
val_proba = pipeline.predict_proba(df_val)[:, 1]

# Choose threshold that maximises F1 for the late class
from sklearn.metrics import f1_score
best_threshold, best_f1 = 0.5, 0.0
for t in np.arange(0.2, 0.8, 0.02):
    f1 = f1_score(y_val, (val_proba >= t).astype(int), pos_label=1)
    if f1 > best_f1:
        best_f1, best_threshold = f1, t

print(f"\nOptimal threshold: {best_threshold:.2f} (F1={best_f1:.4f})")

# ── Retrain the final model on the full data ───────────────────────────
pipeline.fit(df_prod, y_prod)

# ── Save the pipeline ──────────────────────────────────────────────────
joblib.dump({
    'pipeline':  pipeline,
    'threshold': best_threshold,
    'features':  NUM_FEATURES + CAT_FEATURES,
    'version':   'v1.0',
}, '/tmp/late_delivery_model.pkl')
print("Model saved to /tmp/late_delivery_model.pkl")

# ── Load and score a new order ─────────────────────────────────────────
saved = joblib.load('/tmp/late_delivery_model.pkl')
new_order = pd.DataFrame([{
    'restaurant': 'Pizza Hut', 'city': 'Bangalore', 'time_slot': 'dinner',
    'distance_km': 7.5, 'traffic_score': 9, 'restaurant_prep': 22,
    'order_value': 480,
}])
new_order['log_distance']   = np.log1p(new_order['distance_km'])
new_order['dist_x_traffic'] = new_order['distance_km'] * new_order['traffic_score']
new_order['log_value']      = np.log1p(new_order['order_value'])

p_late = saved['pipeline'].predict_proba(new_order)[0, 1]
pred   = 'LATE' if p_late >= saved['threshold'] else 'ON-TIME'
print(f"\nNew order prediction: {pred} (P(late)={p_late:.3f})")
Errors you will hit

Every common logistic regression error — explained and fixed

ConvergenceWarning: Logistic Regression failed to converge
Why it happens

The optimiser ran out of iterations before finding the minimum. Common causes: features not scaled (gradient steps are vastly different magnitudes per feature), very large or very small feature values, or insufficient max_iter.

Fix

Always scale features before LogisticRegression: put StandardScaler() inside a Pipeline. Increase max_iter: LogisticRegression(max_iter=2000). Try solver='saga', which often converges faster on large or sparse datasets. Check for extreme feature values: df.describe() should show roughly comparable ranges after scaling.
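A minimal sketch of that pattern on synthetic data with deliberately unscaled features (the dataset and parameter values are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(0, 1000, (200, 3))           # wildly unscaled features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scaling inside the pipeline keeps gradient steps comparable per feature,
# and the raised max_iter gives the solver room to converge
clf = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=2000, solver='lbfgs')),
])
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.3f}")
```

Fitting LogisticRegression directly on this X would typically raise the ConvergenceWarning; the pipelined version converges cleanly.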

Model predicts only one class — all predictions are 0 or all are 1
Why it happens

Severe class imbalance — one class dominates so heavily that predicting it always achieves high accuracy. The model learns that the intercept alone gives a good enough loss and ignores all features.

Fix

Use class_weight='balanced', which re-weights the loss so each class contributes equally. Or weight the minority class manually with class_weight={0: 1, 1: 10}. Or oversample the minority class with SMOTE (from the imbalanced-learn package) before training. Always check the class distribution first: y.value_counts(normalize=True).
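A small illustration on synthetic imbalanced data (the 95/5 split and feature means are made up) showing how class_weight='balanced' recovers minority-class recall:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n_maj, n_min = 950, 50                       # 95% class 0, 5% class 1
X = np.vstack([rng.normal(0.0, 1, (n_maj, 2)),
               rng.normal(1.5, 1, (n_min, 2))])
y = np.array([0] * n_maj + [1] * n_min)

plain    = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

# Recall on the minority (late) class — the balanced model finds far more of it
plain_rec = (plain.predict(X)[y == 1] == 1).mean()
bal_rec   = (balanced.predict(X)[y == 1] == 1).mean()
print(f"plain    late-recall: {plain_rec:.2f}")
print(f"balanced late-recall: {bal_rec:.2f}")
```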

Coefficients are very large (in the hundreds or thousands)
Why it happens

Perfectly separable classes: a hyperplane exists that perfectly separates the two classes in the training data. With perfect separation, unregularised logistic regression keeps increasing the coefficient magnitudes to sharpen the boundary (sigmoid → step function), so the weights grow without bound and the optimiser never truly converges.

Fix

Add regularisation: reduce C (e.g. C=0.01). Perfect separation often means a feature is a direct proxy for the label (leakage) or your dataset is too small. Check which feature has a perfect or near-perfect split: df.groupby(y)[feature].describe().
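A tiny illustration on perfectly separable synthetic data: with near-zero regularisation the coefficient grows large, while a small C keeps it bounded (the specific values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data — negatives left of zero, positives right
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

weak   = LogisticRegression(C=1e6, max_iter=10000).fit(X, y)  # ≈ no regularisation
strong = LogisticRegression(C=0.01).fit(X, y)                 # strong regularisation

print(f"weak-reg coefficient:   {weak.coef_[0, 0]:.2f}")
print(f"strong-reg coefficient: {strong.coef_[0, 0]:.4f}")
```

Both models classify this toy set perfectly; only the regularised one keeps the coefficient at a sane magnitude.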

ValueError: Unknown label type — continuous target passed to classifier
Why it happens

You passed a continuous float array as y to LogisticRegression. For example, y = df['delivery_time'] instead of y = (df['delivery_time'] > 45).astype(int). LogisticRegression expects integer class labels, not continuous values.

Fix

Convert the continuous target to class labels before passing to the classifier. For binary: y = (df['delivery_time'] > 45).astype(int). For multi-class: y = pd.cut(df['delivery_time'], bins=[...], labels=[0,1,2]).astype(int). For regression tasks use LinearRegression, Ridge, or Lasso instead.
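A minimal sketch of both conversions (the toy DataFrame and the 30/45-minute bin edges are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({'delivery_time': [22.0, 38.0, 51.0, 47.0, 29.0, 63.0],
                   'distance_km':   [2.1, 4.5, 7.8, 6.0, 3.2, 9.4]})

# WRONG: y = df['delivery_time']  → ValueError: Unknown label type
# RIGHT (binary): binarise the continuous target first
y_binary = (df['delivery_time'] > 45).astype(int)

# RIGHT (multi-class): bucket into ordered classes
y_multi = pd.cut(df['delivery_time'], bins=[0, 30, 45, np.inf],
                 labels=[0, 1, 2]).astype(int)

clf = LogisticRegression().fit(df[['distance_km']], y_binary)  # now valid
print(y_binary.tolist())   # [0, 0, 1, 1, 0, 1]
print(y_multi.tolist())    # [0, 1, 2, 2, 0, 2]
```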

predict_proba gives overconfident probabilities — 0.999 for most samples
Why it happens

Model is overfitting or features are too informative (possibly leakage). Also common when using OvR multi_class and normalising independent binary probabilities — they don't form a proper calibrated distribution.

Fix

Add regularisation (reduce C). Use multi_class='multinomial' for calibrated multi-class probabilities. For binary: use CalibratedClassifierCV with method='isotonic' or method='sigmoid' to post-process probabilities. Check for leakage: a feature should never have correlation > 0.9 with the label.
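A sketch of post-hoc calibration with CalibratedClassifierCV on synthetic data (the data-generating process and parameter values are illustrative):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(0, 1, (2000, 4))
y = (X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 1, 2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = LogisticRegression(C=100.0)   # weak regularisation → sharper probabilities
# Platt scaling ('sigmoid'); use method='isotonic' with plenty of data
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5).fit(X_tr, y_tr)

proba = calibrated.predict_proba(X_te)[:, 1]
print(f"mean predicted P(1): {proba.mean():.3f}   actual positive rate: {y_te.mean():.3f}")
```

A well-calibrated model's mean predicted probability should track the actual positive rate; a reliability curve (sklearn.calibration.calibration_curve) makes this visible per probability bucket.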

What comes next

You now have the foundation of classification. Every classifier builds on this.

Sigmoid. Cross-entropy. Gradient descent. Decision boundary. Regularisation. Threshold tuning. These are not logistic regression concepts — they are classification concepts. Neural networks use the same sigmoid (and its variants). The same cross-entropy loss. The same gradient descent. Deep learning is logistic regression applied many times with non-linear layers in between.

Module 21 covers Decision Trees — the algorithm that grows a flowchart from your data. Trees are the conceptual foundation of Random Forests and Gradient Boosting (XGBoost, LightGBM) — the algorithms that win most tabular ML competitions and power most production ML systems at Indian tech companies today.

Next — Module 21 · Classical ML
Decision Trees — Learning a Flowchart from Data

How trees split features to minimise impurity, how to control overfitting with depth and pruning, and how trees become the building blocks of Random Forests and XGBoost.

coming soon

🎯 Key Takeaways

  • Logistic regression is not regression — it is a classification algorithm. The "regression" refers to the linear equation inside it. It outputs a probability between 0 and 1, converted to a class label by a threshold.
  • The sigmoid σ(z) = 1/(1+e⁻ᶻ) maps any real number to (0,1). It is the entire mechanism that makes logistic regression a probability model rather than an unbounded linear predictor.
  • Cross-entropy loss − [y·log(p) + (1−y)·log(1−p)] penalises confident wrong predictions far more harshly than MSE. It produces a convex loss surface — gradient descent always finds the global minimum.
  • The gradient of cross-entropy with respect to the weights is (1/n) × Xᵀ(p−y) — identical in form to the linear regression gradient. The sigmoid derivative cancels out perfectly, giving this clean result.
  • C is the inverse of regularisation strength. Large C = weak regularisation = risk of overfitting. Small C = strong regularisation = simpler model. Always tune C. L1 regularisation drives some coefficients to exactly zero (feature selection). L2 shrinks all coefficients toward zero.
  • Accuracy is the wrong metric for imbalanced classes. Use ROC-AUC (threshold-independent), Precision-Recall curve, and F1 score. The optimal threshold is rarely 0.5 — tune it to match the business cost of false positives vs false negatives.
  • Coefficients in logistic regression are directly interpretable: with standardised features, a coefficient of 1.5 for distance_km means a one-standard-deviation increase in distance multiplies the odds of being late by e^1.5 ≈ 4.5. This interpretability is why logistic regression remains widely used in production despite its simplicity.
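Those core pieces fit in a short from-scratch sketch on synthetic separable data: the sigmoid, the (1/n)·Xᵀ(p−y) gradient step, and the cross-entropy loss (learning rate and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (500, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)       # linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    p = sigmoid(X @ w + b)                     # predicted probabilities
    w -= lr * (X.T @ (p - y)) / len(y)         # (1/n) Xᵀ(p − y)
    b -= lr * (p - y).mean()

p = sigmoid(X @ w + b)
eps = 1e-12                                    # guard against log(0)
loss = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean()
acc = ((p >= 0.5).astype(int) == y).mean()
print(f"loss={loss:.4f}  accuracy={acc:.3f}  w={np.round(w, 2)}")
```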