Logistic Regression
The foundation of all classification. Sigmoid, decision boundaries, cross-entropy, regularisation, and multi-class extension — built from scratch then in sklearn on real data.
Logistic regression is not regression. It is the foundation of all classification.
The name is misleading. Logistic regression predicts probabilities — "what is the probability that this Swiggy order will be late?" — and converts those probabilities into class labels. It is a classification algorithm, not a regression one. The "regression" refers to the linear equation inside it, not to what it predicts.
Despite being over 60 years old, logistic regression is still the first algorithm deployed at many companies for binary classification. At Razorpay it predicts fraud. At Swiggy it predicts late deliveries. At every bank in India it predicts loan defaults. It is fast, interpretable, probabilistically calibrated, and works well with good features. Every ML engineer should understand it completely.
This module builds logistic regression from scratch — sigmoid function, cross-entropy loss, gradient descent — so every piece is visible. Then shows you the sklearn implementation, all regularisation options, the multi-class extension, and every evaluation metric that matters for classification problems.
What this module covers:
Why linear regression breaks for classification
The obvious approach to binary classification: train a linear regression, predict a number, and if the number is above 0.5 call it class 1. This actually works for some problems. But it has three fundamental flaws that make it unreliable in general.
Linear regression predicts any real number. For a classification problem, a prediction of 1.7 or -0.3 is meaningless as a probability. The further a point is from the decision boundary, the more absurd the prediction becomes.
Add a single extreme point far into the positive class region. The regression line tilts toward it, moving the decision boundary and misclassifying many correctly-labelled points. Classification should not care about how far positive examples are from the boundary — only that they are on the right side.
For risk-sensitive decisions (fraud, loan default, medical diagnosis), you need a calibrated probability: "this transaction has a 3.2% chance of being fraud." Linear regression gives you a raw number with no probabilistic interpretation.
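The first flaw is easy to see in code. A minimal sketch on a toy 1-D dataset (illustrative values, not real data): fit plain linear regression on 0/1 labels and ask for a prediction far from the boundary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy 1-D dataset: small x -> class 0, large x -> class 1
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)

# A point far into the positive region gets a prediction well above 1,
# which is meaningless as a probability
pred_far = lin.predict([[20.0]])[0]
print(pred_far)
```

The further the query point sits from the boundary, the more the raw linear output drifts outside [0, 1].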
Logistic regression solves all three by applying one function to the linear prediction before outputting it: the sigmoid.
The sigmoid — squash any number into a probability
The sigmoid function takes any real number — large positive, large negative, anything in between — and maps it to a number strictly between 0 and 1. This is exactly the range of probabilities. As the input grows toward +∞, the output approaches 1. As it shrinks toward −∞, the output approaches 0. At input 0, the output is exactly 0.5.
The full logistic regression model chains two steps: first a linear combination of the features (the same as linear regression), then the sigmoid applied to the result. The linear part (z = w·x + b) can produce any number. The sigmoid converts it into a probability.
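Both steps fit in a few lines. A minimal sketch (the weights below are illustrative, not learned):

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The three landmark behaviours
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0

# Full model: linear combination first, then sigmoid
w = np.array([0.8, -0.4])   # illustrative weights
b = 0.1                     # illustrative bias
x = np.array([2.0, 1.0])    # one input example
p = sigmoid(w @ x + b)      # probability of class 1
```

Whatever value the linear part `w @ x + b` produces, `p` is always a valid probability.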
Cross-entropy loss — why not MSE for classification
We need a loss function that tells the model how wrong its probability prediction was. Why not use MSE, (p − y)², the same loss as regression? Two reasons. First, MSE composed with the sigmoid produces a non-convex loss surface full of local minima that gradient descent can get stuck in. Second, MSE penalises a confident wrong prediction (p = 0.99, y = 0) by only (0.99 − 0)² ≈ 0.98, which is not harsh enough to teach the model to be certain only when it is correct.
Cross-entropy loss penalises that same confident wrong prediction with −log(0.01) ≈ 4.6, far harsher. And it produces a convex loss surface, so gradient descent reliably converges to the global minimum.
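You can verify the two penalties directly. A quick sketch of the numbers quoted above:

```python
import numpy as np

# Confident wrong prediction: model says p = 0.99, true label y = 0
p, y = 0.99, 0

# MSE penalty: mild, even though the model was maximally wrong
mse_penalty = (p - y) ** 2

# Cross-entropy penalty: -[y*log(p) + (1-y)*log(1-p)] = -log(0.01)
ce_penalty = -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(mse_penalty, ce_penalty)  # roughly 0.98 vs roughly 4.6
```

The cross-entropy penalty grows without bound as the confident prediction approaches the wrong extreme, which is exactly the pressure that teaches the model calibrated confidence.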
Logistic regression from scratch — gradient descent on cross-entropy
To train logistic regression we need the gradient of the cross-entropy loss with respect to the weights. The chain rule through sigmoid produces a beautifully simple result: the gradient is just the prediction error times the input feature — the same form as linear regression.
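A minimal from-scratch trainer built on that result, using a tiny synthetic dataset for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Gradient descent on cross-entropy.

    The weight gradient is (1/n) * X^T (p - y): prediction error
    times input feature, the same form as linear regression.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)       # current probability predictions
        error = p - y                # prediction error
        w -= lr * (X.T @ error) / n  # gradient step on weights
        b -= lr * error.mean()       # gradient step on bias
    return w, b

# Tiny separable dataset: boundary should land near x = 2.5
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
```

After training, thresholding the sigmoid output at 0.5 recovers the correct labels on this separable data.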
sklearn LogisticRegression — every option explained
sklearn's LogisticRegression has many parameters. Most tutorials use the defaults without explaining what they do. This section explains every important parameter so you can make principled choices rather than accepting defaults blindly.
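A sketch of the parameters that matter most in practice, annotated inline (the dataset is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

clf = LogisticRegression(
    penalty="l2",        # regularisation type; "l1" needs liblinear or saga
    C=1.0,               # INVERSE regularisation strength: small C = strong penalty
    solver="lbfgs",      # the optimiser; a sensible default for l2
    class_weight=None,   # set "balanced" to reweight imbalanced classes
    max_iter=1000,       # raise this if you see convergence warnings
)
clf.fit(X, y)
acc = clf.score(X, y)
```

The most common mistakes are leaving `C` untuned and ignoring convergence warnings instead of raising `max_iter` or scaling the features.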
Decision boundary and coefficient interpretation
The decision boundary is the set of points where the model is exactly 50% confident — the line (in 2D) or hyperplane (in n dimensions) that separates the two classes. Every point on one side gets predicted as class 1, every point on the other side as class 0.
Unlike neural network weights, logistic regression coefficients are directly interpretable. Each coefficient answers: holding all other features fixed, how does a one-unit increase in this feature (one standard deviation, if the features are standardised) change the log-odds of the positive class?
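A sketch of that interpretation on synthetic data (the true effect sizes below are made up for illustration): standardise the features, fit, then exponentiate the coefficients to read them as odds ratios.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# Synthetic truth: feature 0 strongly raises the log-odds, feature 1 barely does
logits = 1.5 * X[:, 0] + 0.1 * X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_std, y)

# exp(coefficient) = multiplicative change in the odds per one-std-dev increase
odds_ratios = np.exp(clf.coef_[0])
print(odds_ratios)
```

The fitted odds ratio for feature 0 comes out much larger than for feature 1, matching the synthetic truth.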
Classification evaluation — beyond accuracy
Accuracy is the wrong metric for almost every real classification problem. If 85% of deliveries are on-time, a model that always predicts on-time gets 85% accuracy while being completely useless. You need metrics that capture how well the model finds the minority class.
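The 85% trap from the paragraph above, reproduced in a short sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced labels: 85 on-time orders (0), 15 late orders (1)
y_true = np.array([0] * 85 + [1] * 15)

# "Always predict on-time" baseline: no ability to find late orders at all
y_baseline = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_baseline)                # 0.85: looks good
f1 = f1_score(y_true, y_baseline, zero_division=0)      # 0.0: reveals the truth
print(acc, f1)
```

F1 (and precision/recall generally) collapses to zero for the useless baseline, which is exactly why these metrics, plus threshold-independent ROC-AUC, belong in every classification evaluation.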
L1 and L2 regularisation — what they do and when to use each
Regularisation adds a penalty term to the loss function that discourages large weight values. Without it, logistic regression can memorise the training data (especially when features are many or highly correlated), producing large weights that don't generalise.
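The practical difference between the two penalties shows up in the fitted coefficients. A sketch on synthetic data where most features are pure noise:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, only 5 informative; the other 15 are noise
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Same regularisation strength, different penalty types
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

# L1 drives many coefficients to exactly zero (built-in feature selection);
# L2 only shrinks them toward zero
l1_zeros = int((l1.coef_ == 0).sum())
l2_zeros = int((l2.coef_ == 0).sum())
print(l1_zeros, l2_zeros)
```

Use L1 when you suspect many features are irrelevant and want automatic feature selection; use L2 (the default) when most features carry some signal, especially correlated ones.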
Multi-class logistic regression — OvR and Softmax
Binary logistic regression predicts two classes. For three or more classes, there are two strategies. One-vs-Rest (OvR) trains one binary classifier per class — "is this class 1 or not?", "is this class 2 or not?" — and picks the class with highest confidence. Multinomial (Softmax) extends the model directly to output a proper probability distribution over all classes simultaneously.
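The multinomial version in sklearn, sketched on the three-class iris dataset (with the lbfgs solver, sklearn fits the softmax model by default):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 3 classes

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Softmax outputs a proper probability distribution over all 3 classes:
# one probability per class, summing to 1
probs = clf.predict_proba(X[:1])
print(probs, probs.sum())
```

This is the key advantage of the softmax formulation over OvR: the per-class probabilities are jointly normalised, rather than coming from independent binary models whose confidences need not sum to anything meaningful.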
Production late-delivery predictor — end to end
This is what the actual day-one task looks like when you join a data team and are asked to build a late-delivery classifier. Feature engineering, cross-validation, threshold selection, and model persistence — all in one pipeline.
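A condensed sketch of that pipeline on synthetic stand-in data (the feature names and effect sizes are hypothetical; a real version would load delivery records and persist the model with joblib):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for delivery features such as distance and prep time
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
logits = 1.2 * X[:, 0] + 0.8 * X[:, 1] - 0.3   # hypothetical true relationship
y = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

# Scaling and model in one pipeline, so cross-validation never leaks
# scaler statistics from validation folds into training
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(class_weight="balanced"))
auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

# Threshold selection: score probabilities, then pick the cut-off that
# matches the business cost of missed late orders, not the default 0.5
pipe.fit(X, y)
probs = pipe.predict_proba(X)[:, 1]
flagged_at_03 = int((probs >= 0.3).sum())
print(round(auc, 3), flagged_at_03)
```

Cross-validated ROC-AUC gives a threshold-independent quality number first; the operating threshold is then chosen as a separate business decision.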
Every common logistic regression error — explained and fixed
You now have the foundation of classification. Every classifier builds on this.
Sigmoid. Cross-entropy. Gradient descent. Decision boundary. Regularisation. Threshold tuning. These are not logistic regression concepts — they are classification concepts. Neural networks use the same sigmoid (and its variants). The same cross-entropy loss. The same gradient descent. Deep learning is logistic regression applied many times with non-linear layers in between.
Module 21 covers Decision Trees — the algorithm that grows a flowchart from your data. Trees are the conceptual foundation of Random Forests and Gradient Boosting (XGBoost, LightGBM) — the algorithms that win most tabular ML competitions and power most production ML systems at Indian tech companies today.
It covers how trees split features to minimise impurity, how to control overfitting with depth and pruning, and how trees become the building blocks of Random Forests and XGBoost.
🎯 Key Takeaways
- ✓ Logistic regression is not regression: it is a classification algorithm. The "regression" refers to the linear equation inside it. It outputs a probability between 0 and 1, converted to a class label by a threshold.
- ✓ The sigmoid σ(z) = 1/(1+e⁻ᶻ) maps any real number to (0, 1). It is the entire mechanism that makes logistic regression a probability model rather than an unbounded linear predictor.
- ✓ Cross-entropy loss −[y·log(p) + (1−y)·log(1−p)] penalises confident wrong predictions far more harshly than MSE. It produces a convex loss surface, so gradient descent reliably finds the global minimum.
- ✓ The gradient of cross-entropy with respect to the weights is (1/n) × Xᵀ(p−y), identical in form to the linear regression gradient. The sigmoid derivative cancels out perfectly, giving this clean result.
- ✓ C is the inverse of regularisation strength. Large C = weak regularisation = risk of overfitting. Small C = strong regularisation = simpler model. Always tune C. L1 regularisation drives some coefficients to exactly zero (feature selection); L2 shrinks all coefficients toward zero.
- ✓ Accuracy is the wrong metric for imbalanced classes. Use ROC-AUC (threshold-independent), the precision-recall curve, and the F1 score. The optimal threshold is rarely 0.5: tune it to match the business cost of false positives vs false negatives.
- ✓ Coefficients in logistic regression are directly interpretable: with standardised features, a coefficient of 1.5 for distance_km means a one standard deviation increase in distance multiplies the odds of being late by e^1.5 ≈ 4.5. This interpretability is why logistic regression remains widely used in production despite its simplicity.