Support Vector Machines
The algorithm that finds the widest possible boundary between classes. Margins, support vectors, the kernel trick, and when SVMs still beat neural networks.
Logistic regression draws any boundary that separates the classes. SVM draws the best boundary — the one with the maximum safety margin.
Imagine Razorpay's fraud detection system. You have thousands of transactions — some fraudulent, some legitimate. You train a logistic regression. It draws a line that separates them correctly on the training data. But there are infinitely many lines that separate them correctly. Which one should you choose?
Logistic regression picks whichever line happens to minimise the loss. It could be a line that sits dangerously close to some legitimate transactions — technically correct, but fragile. A new transaction that is only slightly different from the training data might fall on the wrong side.
Support Vector Machines take a different approach. Instead of just finding any separating line, they find the line (or hyperplane in higher dimensions) that maximises the distance to the nearest points of both classes. This maximum distance is called the margin. A wider margin means the boundary is more robust — new points have to be much further off before they get misclassified.
Imagine drawing a road between two rows of houses. You could draw the road anywhere between them — but the safest road is the one exactly in the middle, with equal distance to both rows. Any car staying on the road has the maximum buffer before hitting a house.
SVM finds that middle road — the decision boundary equidistant from both classes, giving the maximum safety margin to new data points. The houses closest to the road are the support vectors — they are the only training points that actually determine where the road goes.
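This "only the nearest houses matter" property is visible directly in code. A minimal sketch using scikit-learn's `SVC` with a linear kernel on made-up toy data (the cluster coordinates are hypothetical; the API calls are standard sklearn):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two separable clusters (hypothetical values)
X = np.array([[1, 1], [2, 1], [1, 2],    # class -1
              [5, 5], [6, 5], [5, 6]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only the points nearest the boundary become support vectors;
# the rest of the training set could be deleted without moving it.
print(clf.support_vectors_)
print(clf.predict([[1, 1], [6, 6]]))
```

Inspecting `clf.support_vectors_` shows a subset of the training points — the "houses closest to the road".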
The margin — what SVM maximises
The margin is the total width of the gap between the two classes at the decision boundary. It is measured as twice the distance from the boundary to the nearest training point — at the optimum, the nearest points of both classes sit at equal distance. SVM finds the boundary that makes this margin as wide as possible.
- **Hyperplane:** the decision boundary that separates the two classes. A line in 2D, a plane in 3D, a hyperplane in higher dimensions. All points on one side are predicted as class +1, all points on the other as class -1.
- **Support vectors:** the training points closest to the decision boundary. These are the only points that determine where the boundary is. Remove any other training point — the boundary stays the same. Remove a support vector — the boundary moves.
- **Margin:** the total width of the gap between the two classes at the boundary. Equal to 2 / ||w|| where w is the weight vector of the boundary. SVM maximises this margin — a wider margin means a more robust classifier.
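The 2 / ||w|| formula can be checked numerically on a fitted linear SVM. A sketch using scikit-learn (the toy coordinates are made up; a very large C approximates a hard margin):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (hypothetical toy data)
X = np.array([[0, 0], [1, 0], [0, 1],
              [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # large C ≈ hard margin
clf.fit(X, y)

w = clf.coef_[0]                  # weight vector of the boundary
margin = 2.0 / np.linalg.norm(w)  # total gap width = 2 / ||w||
print(f"margin width ≈ {margin:.3f}")
```

For this data the gap between the two convex hulls is 7/√2 ≈ 4.95, and the computed margin matches it.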
Hard margin vs soft margin — handling overlapping classes
The margin explained above — where no training point is allowed inside the margin gap — is called a hard margin. It only works when the two classes are perfectly separable with a straight line. Real data almost never is. Some fraudulent transactions look exactly like legitimate ones. Some legitimate transactions look suspicious.
Soft margin SVM allows some training points to fall inside the margin or even on the wrong side of the boundary — but penalises them. The parameter C controls this trade-off: high C means "penalise violations heavily, keep the margin tight" (closer to hard margin). Low C means "allow more violations, keep the margin wide" (more regularisation, better generalisation).
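One way to see the trade-off is to sweep C on overlapping data and watch the margin width and support-vector count change. A sketch on synthetic data from `make_blobs` (the cluster parameters are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so a hard margin is impossible
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin={margin:.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```

Low C produces the widest margin (and the most margin violations, hence more support vectors); high C tightens the margin toward the hard-margin solution.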
The kernel trick — separate non-linear data without computing high dimensions
What if the two classes cannot be separated by any straight line? In 2D, circles around the origin versus points outside the circle cannot be split with a line — no matter how you draw it. SVM's solution: project the data into a higher-dimensional space where a linear separator does exist.
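That failure mode is easy to reproduce with scikit-learn's `make_circles`, which generates exactly this concentric layout (the `factor` and `noise` values below are arbitrary choices):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Inner circle vs outer ring: no straight line can split them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf    = SVC(kernel="rbf").fit(X, y)

print(f"linear kernel accuracy: {linear.score(X, y):.2f}")  # near chance (~0.5)
print(f"RBF kernel accuracy:    {rbf.score(X, y):.2f}")     # near 1.0
```

The linear kernel cannot do better than roughly guessing, while the RBF kernel — which implicitly works in a higher-dimensional space — separates the rings almost perfectly.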
The problem with projecting to higher dimensions is that it becomes computationally very expensive — projecting to 1,000 dimensions means working with 1,000-dimensional vectors. The kernel trick solves this beautifully: it computes the dot product in the high-dimensional space without ever explicitly going there. It uses a kernel function that takes two original vectors and returns the same number as if you had projected them first and then taken the dot product. All the power of high-dimensional separation, none of the cost.
Imagine two groups of ants on a table — one group in the centre, one group around the edges. You cannot draw a straight line between them. But if you lift the table into the air and fold it into a bowl shape, suddenly the centre ants are at the bottom and the edge ants are up high — and you can cut them apart with a flat knife.
The kernel function is like the bowl shape — it transforms the space so a linear separator works. The kernel trick means you never actually have to fold the table — you just compute as if you did.
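The "compute as if you folded the table" claim can be verified directly for a degree-2 polynomial kernel K(x, z) = (x·z)², whose explicit feature map for 2D input is φ(x) = (x₁², √2·x₁x₂, x₂²). A sketch in plain NumPy:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: project 2D input into 3D."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    """Kernel computed entirely in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # project first, then dot product
trick    = poly_kernel(x, z)       # never leave 2D

print(explicit, trick)  # identical: 121.0 121.0
```

Both routes give the same number, but the kernel route never constructs the higher-dimensional vectors — with real kernels like RBF, the implicit space is infinite-dimensional, so this shortcut is the only feasible route.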
SVR — Support Vector Regression
SVM has a regression variant called SVR (Support Vector Regression). Instead of maximising the margin between classes, SVR fits a tube around the data — predictions within the tube incur no penalty. Only points outside the tube (the support vectors for regression) contribute to the loss. The width of the tube is controlled by the parameter epsilon.
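The effect of epsilon shows up in the support-vector count: a wider tube swallows more points, leaving fewer outside to act as support vectors. A sketch using scikit-learn's `SVR` on noisy synthetic data (the sine curve and noise level are arbitrary choices):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)  # noisy sine curve

# Wider tube -> more points inside it -> fewer support vectors
for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: support vectors={len(svr.support_)}")
```

With a tiny epsilon nearly every point lies outside the tube and becomes a support vector; at epsilon=0.5 most of the noise fits inside and only a handful remain.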
When SVMs win — and when to use something else
SVMs were the dominant algorithm in ML from the late 1990s until around 2012 when deep learning took over. They are no longer the default choice for large-scale problems, but they still genuinely win in specific situations that come up regularly in production.
SVMs find the best boundary. The next algorithm finds the nearest neighbours.
SVM is a global algorithm — it uses the entire training set to find the optimal boundary, then only remembers the support vectors. K-Nearest Neighbours (KNN) is the opposite — it is a local algorithm that remembers every single training point and makes predictions purely based on what the closest neighbours look like. No training phase. No boundary. Just: "what do the k points nearest to this new point look like?"
The simplest possible ML algorithm — predict based on what your neighbours look like. Distance metrics, the curse of dimensionality, and when KNN actually works in production.
🎯 Key Takeaways
- ✓ SVM does not just find any separating boundary — it finds the boundary with the maximum margin: the widest possible gap between the two classes. A wider margin means more robust predictions on new data.
- ✓ Support vectors are the training points closest to the boundary. They are the only points that determine where the boundary is. All other training points can be removed without changing the boundary at all.
- ✓ C is the most important hyperparameter. High C = narrow margin, few violations (risks overfitting). Low C = wide margin, more violations allowed (more regularisation). Start with C=1.0 and tune with cross-validation.
- ✓ The kernel trick projects data into higher dimensions where a linear separator exists — without the computational cost of actually working in those dimensions. RBF (Gaussian) kernel is the default and works well on most non-linear problems.
- ✓ ALWAYS scale features before SVM. It is one of the most scaling-sensitive algorithms in all of sklearn. An unscaled feature with large values completely dominates the distance calculations and makes the model ignore all other features.
- ✓ SVMs do not scale to large datasets — training complexity is O(n²) to O(n³). For datasets above ~50k rows, use LinearSVC, XGBoost, or a neural network. SVMs genuinely win on small high-dimensional datasets like text classification and biological data.
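The scaling rule is easiest to get right with a `Pipeline`, so the scaler is re-fitted inside each cross-validation fold and never leaks test data. A sketch using standard scikit-learn components on the built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaler + SVM as one estimator: scaling happens inside each CV fold
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Swapping out the `StandardScaler` on this dataset drops accuracy sharply, because a few features with large raw ranges dominate the RBF distance computation.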