
Support Vector Machines

The algorithm that finds the widest possible boundary between classes. Margins, support vectors, the kernel trick, and when SVMs still beat neural networks.

30–35 min · March 2026
Before any formula — what problem does this solve?

Logistic regression draws any boundary that separates the classes. SVM draws the best boundary — the one with the maximum safety margin.

Imagine Razorpay's fraud detection system. You have thousands of transactions — some fraudulent, some legitimate. You train a logistic regression. It draws a line that separates them correctly on the training data. But there are infinitely many lines that separate them correctly. Which one should you choose?

Logistic regression picks whichever line happens to minimise the loss. It could be a line that sits dangerously close to some legitimate transactions — technically correct, but fragile. A new transaction that is only slightly different from the training data might fall on the wrong side.

Support Vector Machines take a different approach. Instead of just finding any separating line, they find the line (or hyperplane in higher dimensions) that maximises the distance to the nearest points of both classes. This maximum distance is called the margin. A wider margin means the boundary is more robust — new points have to be much further off before they get misclassified.

🧠 Analogy — read this first

Imagine drawing a road between two rows of houses. You could draw the road anywhere between them — but the safest road is the one exactly in the middle, with equal distance to both rows. Any car staying on the road has the maximum buffer before hitting a house.

SVM finds that middle road — the decision boundary equidistant from both classes, giving the maximum safety margin to new data points. The houses closest to the road are the support vectors — they are the only training points that actually determine where the road goes.

🎯 Pro Tip
The most important insight about SVMs: only the training points closest to the boundary matter. Remove all other training points and the boundary stays exactly the same. Those closest points are the support vectors — they literally "support" (hold up) the boundary.
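You can check this property directly. The sketch below (synthetic blob data — the cluster positions and C value are illustrative choices) fits a linear SVC, throws away every training point that is not a support vector, refits, and compares the two boundaries:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two clearly separated clusters (synthetic, for illustration)
X, y = make_blobs(n_samples=200, centers=[(-3, -3), (3, 3)],
                  cluster_std=0.8, random_state=42)

svm = SVC(kernel='linear', C=1000).fit(X, y)   # large C ≈ hard margin
print(f"support vectors: {len(svm.support_)} of {len(X)} points")

# Keep ONLY the support vectors and refit — the boundary should not move
sv_idx = svm.support_
svm2 = SVC(kernel='linear', C=1000).fit(X[sv_idx], y[sv_idx])

print("w before:", np.round(svm.coef_[0], 4), " after:", np.round(svm2.coef_[0], 4))
print("b before:", round(float(svm.intercept_[0]), 4),
      " after:", round(float(svm2.intercept_[0]), 4))
```

The weight vector and intercept come out essentially identical — the other ~195 points contributed nothing to the boundary.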
The core concept

The margin — what SVM maximises

The margin is the total width of the gap between the two classes at the decision boundary — equal to twice the distance from the boundary to the nearest training point, since the boundary sits equidistant from both classes. SVM finds the boundary that makes this margin as wide as possible.

Hard margin SVM — three candidate boundaries, one optimal
[Diagram: two "poor" candidate boundaries vs. the SVM boundary with its margin — Fraud (class -1) vs. Legit (class +1), ○ = support vector]

Dashed circles/squares = support vectors. Only these points determine the boundary. The margin is the gap between the two dashed lines. SVM maximises this gap.

Key vocabulary — three terms that define SVM
Decision boundary

The hyperplane that separates the two classes. A line in 2D, a plane in 3D, a hyperplane in higher dimensions. All points on one side are predicted as class +1, all points on the other as class -1.

Support vectors

The training points closest to the decision boundary. These are the only points that determine where the boundary is. Remove any other training point — the boundary stays the same. Remove a support vector — the boundary moves.

Margin

The total width of the gap between the two classes at the boundary. Equal to 2 / ||w|| where w is the weight vector of the boundary. SVM maximises this margin — a wider margin means a more robust classifier.
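The 2 / ||w|| relationship can be verified on a fitted linear SVC. A small sketch (toy data — the cluster positions are made up for illustration), which also confirms that every support vector sits exactly half a margin from the boundary:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Two well-separated 2-D clusters: class +1 around (2, 2), class -1 around (-2, -2)
X = np.vstack([rng.normal([2, 2], 0.5, (50, 2)),
               rng.normal([-2, -2], 0.5, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

svm = SVC(kernel='linear', C=1000).fit(X, y)   # large C ≈ hard margin

w = svm.coef_[0]
margin_width = 2 / np.linalg.norm(w)           # total gap = 2 / ||w||
print(f"||w|| = {np.linalg.norm(w):.4f}  →  margin width = {margin_width:.4f}")

# Each support vector's distance to the boundary = half the margin
sv = svm.support_vectors_
dist = np.abs(sv @ w + svm.intercept_[0]) / np.linalg.norm(w)
print("support-vector distances:", np.round(dist, 4))
```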

Real data is never perfectly separable

Hard margin vs soft margin — handling overlapping classes

The margin explained above — where no training point is allowed inside the margin gap — is called a hard margin. It only works when the two classes are perfectly separable with a straight line. Real data almost never is. Some fraudulent transactions look exactly like legitimate ones. Some legitimate transactions look suspicious.

Soft margin SVM allows some training points to fall inside the margin or even on the wrong side of the boundary — but penalises them. The parameter C controls this trade-off: high C means "penalise violations heavily, keep the margin tight" (closer to hard margin). Low C means "allow more violations, keep the margin wide" (more regularisation, better generalisation).

C parameter effect — the single most important SVM hyperparameter
C = 0.01
Very wide margin

Many training points allowed inside margin or misclassified. Very regularised. Underfits if C is too small.

Margin violations: ~8
C = 1.0
Balanced (default)

Moderate trade-off. Usually a good starting point. Tune from here.

Margin violations: ~2
C = 1000
Narrow margin

Almost no violations allowed. Boundary hugs training data. Risk of overfitting.

Margin violations: ~0

Rule: Start with C=1.0. If training accuracy is much higher than validation accuracy → decrease C (more regularisation). If both are low → increase C (less regularisation) or try a different kernel. Always tune C with cross-validation.

python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 2000

# Razorpay transaction features
transaction_amount = np.abs(np.random.normal(500, 400, n)).clip(10, 5000)
time_since_last    = np.abs(np.random.normal(24, 20, n)).clip(0.1, 200)
merchant_risk      = np.random.uniform(0, 1, n)
device_age_days    = np.abs(np.random.normal(180, 100, n)).clip(0, 730)
n_transactions_24h = np.random.randint(0, 20, n).astype(float)

# Fraud signal: high amount + high merchant risk + many transactions in 24h
fraud_score = (
    (transaction_amount / 5000) * 0.35
    + merchant_risk * 0.30
    + (n_transactions_24h / 20) * 0.20
    + np.random.randn(n) * 0.15
)
y = (fraud_score > 0.55).astype(int)

X = np.column_stack([transaction_amount, time_since_last, merchant_risk,
                      device_age_days, n_transactions_24h])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ── ALWAYS scale before SVM ───────────────────────────────────────────
# SVM is one of the most scaling-sensitive algorithms
# Unscaled: dominated by transaction_amount (up to 5000)
# Scaled: all features contribute equally
scaler     = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

# ── Effect of C parameter ─────────────────────────────────────────────
print(f"{'C value':<12} {'Train acc':<12} {'Test acc':<12} {'CV acc (5-fold)'}")
print("─" * 52)
for C in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    model = SVC(C=C, kernel='rbf', random_state=42)
    model.fit(X_train_sc, y_train)
    tr_acc = model.score(X_train_sc, y_train)
    te_acc = model.score(X_test_sc, y_test)
    cv     = cross_val_score(model, X_train_sc, y_train, cv=5).mean()
    gap    = ' ← overfit' if tr_acc - te_acc > 0.05 else ''
    print(f"  {C:<10}  {tr_acc:<12.4f} {te_acc:<12.4f} {cv:.4f}{gap}")

# ── Number of support vectors ─────────────────────────────────────────
for C in [0.01, 1.0, 100.0]:
    m = SVC(C=C, kernel='rbf', random_state=42).fit(X_train_sc, y_train)
    print(f"  C={C}: {m.n_support_} support vectors per class  "
          f"(total {sum(m.n_support_)}/{len(X_train)})")
The most powerful idea in SVM

The kernel trick — separate non-linear data without computing high dimensions

What if the two classes cannot be separated by any straight line? In 2D, points clustered around the origin versus points on a ring surrounding them cannot be split with a line — no matter how you draw it. SVM's solution: project the data into a higher-dimensional space where a linear separator does exist.

The problem with projecting to higher dimensions is that it becomes computationally very expensive — projecting to 1,000 dimensions means working with 1,000-dimensional vectors. The kernel trick solves this beautifully: it computes the dot product in the high-dimensional space without ever explicitly going there. It uses a kernel function that takes two original vectors and returns the same number as if you had projected them first and then taken the dot product. All the power of high-dimensional separation, none of the cost.
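You can verify this identity numerically for the degree-2 polynomial kernel K(x, z) = (x·z)², whose explicit feature map for a 2-D vector is φ(x) = (x₁², √2·x₁x₂, x₂²) — a standard textbook example, sketched here:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map: projects a 2-D vector into 3-D."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x = rng.normal(size=2)
z = rng.normal(size=2)

explicit = phi(x) @ phi(z)   # project to 3-D first, then take the dot product
kernel   = (x @ z) ** 2      # kernel trick: computed entirely in 2-D

print(f"explicit: {explicit:.10f}")
print(f"kernel:   {kernel:.10f}")   # same number — we never left 2-D
```

Both lines print the same value: (x₁z₁ + x₂z₂)² expands to exactly x₁²z₁² + 2x₁x₂z₁z₂ + x₂²z₂², which is φ(x)·φ(z). For the RBF kernel the implicit feature space is infinite-dimensional, so the trick is not just a shortcut — it is the only way to use it at all.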

🧠 Analogy — read this first

Imagine two groups of ants on a table — one group in the centre, one group around the edges. You cannot draw a straight line between them. But if you lift the table into the air and fold it into a bowl shape, suddenly the centre ants are at the bottom and the edge ants are up high — and you can cut them apart with a flat knife.

The kernel function is like the bowl shape — it transforms the space so a linear separator works. The kernel trick means you never actually have to fold the table — you just compute as if you did.

The four main kernels — when to use each
linear
K(x, z) = x · z
SVC(kernel='linear')

Data is linearly separable, or you have many features (text classification). Fastest kernel. Interpretable — has feature weights.

rbf (Gaussian)
K(x, z) = exp(−γ||x−z||²)
SVC(kernel='rbf', gamma='scale')

Default choice. Works well on most non-linear problems. Controlled by γ (gamma) — how tightly the kernel wraps around each training point.

poly
K(x, z) = (γx·z + r)^d
SVC(kernel='poly', degree=3)

When features have clear polynomial relationships. Image classification historically. Rarely better than rbf in practice.

sigmoid
K(x, z) = tanh(γx·z + r)
SVC(kernel='sigmoid')

Mimics neural network activation. Rarely used — rbf almost always outperforms it.

python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_circles, make_moons

np.random.seed(42)

# ── Demonstrate kernel effect on non-linearly separable data ──────────

# Dataset 1: circles — inner cluster vs outer ring
X_circ, y_circ = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=42)
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(X_circ, y_circ, test_size=0.2, random_state=42)

sc = StandardScaler()
X_tr_cs = sc.fit_transform(X_tr_c)
X_te_cs = sc.transform(X_te_c)

print("Circles dataset (inner vs outer ring):")
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    m   = SVC(kernel=kernel, C=1.0, random_state=42).fit(X_tr_cs, y_tr_c)
    acc = accuracy_score(y_te_c, m.predict(X_te_cs))
    bar = '█' * int(acc * 30)
    print(f"  {kernel:<10}: {bar} {acc:.4f}")

# RBF and poly should dominate — linear cannot separate circles

# Dataset 2: moons — two interleaved crescent shapes
X_moon, y_moon = make_moons(n_samples=500, noise=0.15, random_state=42)
X_tr_m, X_te_m, y_tr_m, y_te_m = train_test_split(X_moon, y_moon, test_size=0.2, random_state=42)
X_tr_ms = sc.fit_transform(X_tr_m)
X_te_ms = sc.transform(X_te_m)

print("\nMoons dataset (two crescents):")
for kernel in ['linear', 'poly', 'rbf']:
    m   = SVC(kernel=kernel, C=1.0, random_state=42).fit(X_tr_ms, y_tr_m)
    acc = accuracy_score(y_te_m, m.predict(X_te_ms))
    bar = '█' * int(acc * 30)
    print(f"  {kernel:<10}: {bar} {acc:.4f}")

# ── The gamma parameter — RBF kernel ─────────────────────────────────
# gamma = how much influence each training point has
# Small gamma: smooth, wide decision boundary (underfits if too small)
# Large gamma: tight, complex boundary that wraps around training points (overfits)
print("\nRBF kernel — effect of gamma:")
for gamma in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    m    = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_tr_cs, y_tr_c)
    tr_a = m.score(X_tr_cs, y_tr_c)
    te_a = m.score(X_te_cs, y_te_c)
    flag = ' ← overfit' if tr_a - te_a > 0.05 else ''
    print(f"  gamma={gamma:<8}: train={tr_a:.4f}  test={te_a:.4f}{flag}")

# 'scale' (default) = 1/(n_features * X.var()) — usually a good starting point
m_scale = SVC(kernel='rbf', gamma='scale').fit(X_tr_cs, y_tr_c)
print(f"\n  gamma='scale': test={m_scale.score(X_te_cs, y_te_c):.4f}")
Not just classification

SVR — Support Vector Regression

SVM has a regression variant called SVR (Support Vector Regression). Instead of maximising the margin between classes, SVR fits a tube around the data — predictions within the tube incur no penalty. Only points outside the tube (the support vectors for regression) contribute to the loss. The width of the tube is controlled by the parameter epsilon.

python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

np.random.seed(42)
n = 1000

distance = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic  = np.random.randint(1, 11, n).astype(float)
prep     = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
            + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep])
y = delivery

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

print(f"{'Model':<35} {'Test MAE':>10} {'Support Vectors':>16}")
print("─" * 64)

for kernel in ['linear', 'rbf', 'poly']:
    svr = SVR(kernel=kernel, C=1.0, epsilon=0.5)
    svr.fit(X_tr_sc, y_tr)
    mae = mean_absolute_error(y_te, svr.predict(X_te_sc))
    n_sv = len(svr.support_)   # indices of the support vectors
    print(f"  SVR(kernel='{kernel}'){'':<15} {mae:>10.4f} {n_sv:>16}")

# epsilon parameter: the tube width
# Points within epsilon of the prediction contribute zero loss
# Larger epsilon = wider tube = fewer support vectors = smoother model
print("\nEpsilon effect (SVR with rbf kernel):")
for eps in [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]:
    svr = SVR(kernel='rbf', C=1.0, epsilon=eps)
    svr.fit(X_tr_sc, y_tr)
    mae = mean_absolute_error(y_te, svr.predict(X_te_sc))
    print(f"  epsilon={eps:<6}: MAE={mae:.4f}  support_vectors={len(svr.support_)}")
What this looks like at work

When SVMs win — and when to use something else

SVMs were the dominant algorithm in ML from the late 1990s until around 2012 when deep learning took over. They are no longer the default choice for large-scale problems, but they still genuinely win in specific situations that come up regularly in production.

SVM wins when ✓
Small to medium dataset (< 100k samples)
High-dimensional features (text, gene expression)
Clear margin of separation exists in the data
You need a kernel trick for non-linear boundaries
Training data is limited — SVM generalises well with few samples
Memory-efficient needed — only support vectors stored
Use something else when ✗
Large dataset (> 100k samples) — SVMs scale as O(n²) to O(n³)
You need probability estimates — SVC requires expensive calibration
Many noisy features — SVMs sensitive to irrelevant features
Need feature importance — SVMs do not provide it directly
Need fast retraining on new data — SVMs are slow to retrain
Tabular data with mixed types — XGBoost/RF almost always win
python
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, roc_auc_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 3000

# Razorpay fraud detection — production pipeline
transaction_amount = np.abs(np.random.normal(500, 400, n)).clip(10, 5000)
time_since_last    = np.abs(np.random.normal(24, 20, n)).clip(0.1, 200)
merchant_risk      = np.random.uniform(0, 1, n)
device_age_days    = np.abs(np.random.normal(180, 100, n)).clip(0, 730)
n_tx_24h           = np.random.randint(0, 20, n).astype(float)
velocity_score     = (n_tx_24h / 20) * 0.5 + (transaction_amount / 5000) * 0.5

fraud_score = (
    (transaction_amount / 5000) * 0.35 + merchant_risk * 0.30
    + (n_tx_24h / 20) * 0.20 + np.random.randn(n) * 0.15
)
y = (fraud_score > 0.55).astype(int)

X = pd.DataFrame({
    'transaction_amount': transaction_amount,
    'time_since_last':    time_since_last,
    'merchant_risk':      merchant_risk,
    'device_age_days':    device_age_days,
    'n_tx_24h':           n_tx_24h,
    'velocity_score':     velocity_score,
})

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ── Production SVM pipeline ───────────────────────────────────────────
# ALWAYS scale — SVM is one of the most scaling-sensitive algorithms
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  SVC(kernel='rbf', probability=True, random_state=42)),
])

# ── Hyperparameter search ─────────────────────────────────────────────
# C and gamma are the two most important parameters for RBF SVM
param_grid = {
    'model__C':     [0.1, 1.0, 10.0, 100.0],
    'model__gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    pipeline, param_grid,
    cv=cv, scoring='roc_auc',
    n_jobs=-1, verbose=0,
)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV AUC: {grid.best_score_:.4f}")

# ── Final evaluation ──────────────────────────────────────────────────
best_model = grid.best_estimator_
y_pred     = best_model.predict(X_test)
y_proba    = best_model.predict_proba(X_test)[:, 1]

print(f"\nTest ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))

# ── Calibration — SVM probabilities need calibration ──────────────────
# SVC(probability=True) uses Platt scaling internally
# For better calibration use CalibratedClassifierCV
from sklearn.calibration import CalibratedClassifierCV
base_svm   = Pipeline([
    ('scaler', StandardScaler()),
    ('model',  SVC(kernel='rbf', C=10.0, gamma='scale', random_state=42)),
])
calibrated = CalibratedClassifierCV(base_svm, cv=5, method='isotonic')
calibrated.fit(X_train, y_train)
y_calib    = calibrated.predict_proba(X_test)[:, 1]
print(f"\nCalibrated SVM ROC-AUC: {roc_auc_score(y_test, y_calib):.4f}")
Errors you will hit

Every common SVM error — explained and fixed

SVM training takes hours or runs out of memory on a dataset with 50,000+ rows
Why it happens

SVM's training complexity is O(n²) to O(n³) in the number of samples. With 50,000 samples this means computing a 50,000 × 50,000 kernel matrix — 2.5 billion entries. This is both slow and memory-intensive. The rbf kernel makes this worse because every pair of training points must be evaluated.

Fix

For large datasets, use LinearSVC (uses liblinear, scales as O(n)) or SGDClassifier(loss='hinge') which approximates SVM with stochastic gradient descent. Alternatively use a subset: fit SVM on a representative 10,000-sample subset. For production at scale, switch to XGBoost or a neural network — SVMs genuinely do not scale to millions of samples.
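A rough sense of the speed difference on a mid-sized synthetic dataset (exact timings depend on your machine; the dataset shape here is an arbitrary choice):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
X = StandardScaler().fit_transform(X)

results = {}
for name, model in [
    ('SVC (rbf)',             SVC(kernel='rbf')),                            # full kernel matrix
    ('LinearSVC (liblinear)', LinearSVC(dual=False)),                        # linear-time solver
    ('SGDClassifier (hinge)', SGDClassifier(loss='hinge', random_state=42)), # SGD approximation
]:
    t0 = time.perf_counter()
    model.fit(X, y)
    results[name] = (time.perf_counter() - t0, model.score(X, y))
    secs, acc = results[name]
    print(f"{name:<24}: {secs:6.2f}s  train acc={acc:.3f}")
```

The kernel SVC is typically orders of magnitude slower than the two linear alternatives at this size, and the gap widens quadratically as n grows.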

SVC predict_proba() is very slow even after training is complete
Why it happens

SVC(probability=True) uses Platt scaling — it runs an additional 5-fold cross-validation internally during fit() to calibrate probabilities. This doubles or triples training time. More critically, predict_proba() is slower than predict() because it must apply the Platt calibration to every prediction.

Fix

If you only need class labels (not probabilities), use predict() instead of predict_proba() — much faster. If you need calibrated probabilities, use CalibratedClassifierCV(SVC(), cv=5) instead of SVC(probability=True) — it gives better calibration. For production serving where latency matters, consider RandomForest or XGBoost whose predict_proba() is much faster.

SVM gives poor results — accuracy barely above baseline
Why it happens

Almost always caused by forgetting to scale features. SVM uses distance calculations — a feature with values 0–5000 (like transaction amount) completely dominates a feature with values 0–1 (like merchant_risk). The decision boundary is almost entirely determined by the large-scale feature, ignoring all others.

Fix

Always put StandardScaler() inside a Pipeline before SVC. Check: after scaling, every feature should have mean≈0 and std≈1. Verify with scaler.mean_ and scaler.scale_. SVM is one of the most scaling-sensitive algorithms — this is the first thing to check when results are poor.
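A minimal synthetic demonstration of the failure mode (the two features and the ×1000 scale factor are invented — the point is the scale mismatch, not the data):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Two EQUALLY informative features — but one lives on a ×1000 larger scale
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
y  = (f1 + f2 > 0).astype(int)          # the label depends on BOTH features
X  = np.column_stack([f1 * 1000, f2])   # f1 blown up to roughly ±3000

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

acc_raw    = SVC(kernel='rbf').fit(X_tr, y_tr).score(X_te, y_te)
acc_scaled = make_pipeline(StandardScaler(), SVC(kernel='rbf')).fit(X_tr, y_tr).score(X_te, y_te)

print(f"unscaled: {acc_raw:.3f}")    # distances dominated by f1 — f2 is ignored
print(f"scaled:   {acc_scaled:.3f}") # both features contribute
```

The unscaled model effectively sees only the large-scale feature and tops out well below the scaled pipeline, which recovers the signal from both.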

ConvergenceWarning: Solver terminated early — set max_iter higher
Why it happens

The SVM optimiser (libsvm) did not converge within the default number of iterations. This happens on difficult problems, very small C values, or data that is not well-scaled. The model was returned before finding the optimal boundary.

Fix

Scale your features first — this usually resolves convergence issues. If scaling does not help, increase max_iter: SVC(max_iter=5000). Also try adjusting tol (tolerance): SVC(tol=1e-4). If the warning persists after scaling, the data may have fundamental separability issues — try a different kernel or a different algorithm entirely.

What comes next

SVMs find the best boundary. The next algorithm finds the nearest neighbours.

SVM is a global algorithm — it uses the entire training set to find the optimal boundary, then only remembers the support vectors. K-Nearest Neighbours (KNN) is the opposite — it is a local algorithm that remembers every single training point and makes predictions purely based on what the closest neighbours look like. No training phase. No boundary. Just: "what do the k points nearest to this new point look like?"

Next — Module 26 · Classical ML
K-Nearest Neighbours — Similarity-Based Prediction

The simplest possible ML algorithm — predict based on what your neighbours look like. Distance metrics, the curse of dimensionality, and when KNN actually works in production.

coming soon

🎯 Key Takeaways

  • SVM does not just find any separating boundary — it finds the boundary with the maximum margin: the widest possible gap between the two classes. A wider margin means more robust predictions on new data.
  • Support vectors are the training points closest to the boundary. They are the only points that determine where the boundary is. All other training points can be removed without changing the boundary at all.
  • C is the most important hyperparameter. High C = narrow margin, few violations (risks overfitting). Low C = wide margin, more violations allowed (more regularisation). Start with C=1.0 and tune with cross-validation.
  • The kernel trick projects data into higher dimensions where a linear separator exists — without the computational cost of actually working in those dimensions. RBF (Gaussian) kernel is the default and works well on most non-linear problems.
  • ALWAYS scale features before SVM. It is one of the most scaling-sensitive algorithms in all of sklearn. An unscaled feature with large values completely dominates the distance calculations and makes the model ignore all other features.
  • SVMs do not scale to large datasets — training complexity is O(n²) to O(n³). For datasets above ~50k rows, use LinearSVC, XGBoost, or a neural network. SVMs genuinely win on small high-dimensional datasets like text classification and biological data.