
K-Nearest Neighbours

The simplest possible ML algorithm — predict based on what your neighbours look like. Distance metrics, the curse of dimensionality, and when KNN actually works.

25–30 min · March 2026
Before any formula — what problem does this solve?

KNN has no training phase. No weights. No boundary. Just one idea: similar inputs should produce similar outputs.

A new customer joins Flipkart. They are 24 years old, live in Bangalore, buy mostly electronics, and spend ₹3,000 per order on average. What products should you recommend to them?

The simplest possible answer: find the 5 existing customers who are most similar to this new customer. Look at what they bought and liked. Recommend those. You do not need a trained model, learned weights, or a decision boundary. You just need a way to measure similarity and enough historical data to look up neighbours from.

That is the entire KNN algorithm. For a new query point, find the k training points nearest to it (by some distance measure), and predict based on what those neighbours say. For classification: majority vote among the k neighbours. For regression: average of the k neighbours' values. No training. No parameters learned from data. Every prediction looks up the training set from scratch.

🧠 Analogy — read this first

You move to a new city and want to know if a neighbourhood is safe. You do not build a statistical model of crime rates. You ask the 5 people who live closest to that neighbourhood what they think. If 4 out of 5 say it is safe, you conclude it is safe. That is KNN — ask your nearest neighbours and take a vote.

The key questions are: how do you measure "closeness"? How many neighbours k should you ask? And what happens when the neighbourhood is crowded in some dimensions but empty in others — the curse of dimensionality. This module answers all three.

🎯 Pro Tip
KNN is called a lazy learner — it does no work during training and all the work during prediction. Contrast with a neural network which does all the work during training and almost no work during prediction. Both extremes have real trade-offs in production systems.
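The lazy-learner trade-off is easy to see in timings. A minimal sketch on synthetic data (sizes are illustrative, not from this module): fitting stores the data and is near-instant, while brute-force prediction scans every training point per query.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20_000, 10)), rng.normal(size=20_000)

knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

t0 = time.perf_counter()
knn.fit(X, y)                       # "training" = storing the data
fit_s = time.perf_counter() - t0

t0 = time.perf_counter()
knn.predict(X[:500])                # each query scans all 20,000 points
pred_s = time.perf_counter() - t0

print(f"fit: {fit_s:.4f}s   predict (500 queries): {pred_s:.4f}s")
```

On most machines the predict call takes far longer than the fit, the exact opposite of a neural network's profile.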
The algorithm in detail

How KNN makes a prediction — four steps, no hidden magic

KNN's prediction process is completely transparent. Every step can be inspected and understood. There are no learned parameters — the algorithm memorises the entire training set and consults it at prediction time.

KNN prediction — step by step on one new Swiggy order
Step 1: Store the entire training set

"Training" KNN means nothing more than storing X_train and y_train in memory. No computation happens. No weights are adjusted. This is why KNN training is instantaneous.

Step 2: Receive a new query point

New Swiggy order: distance=5.2km, traffic=8, prep=22min. We want to predict delivery time.

Step 3: Compute distance to every training point

Calculate the distance between the new point and every single training example. With 10,000 training points, that is 10,000 distance calculations per prediction.

Step 4: Find the k nearest and predict

Sort all distances. Take the k smallest. For regression: return their average y value. For classification: return the majority class. With k=5 and delivery times [34, 38, 41, 35, 39], prediction = (34+38+41+35+39)/5 = 37.4 min.

python
import numpy as np

# ── KNN from scratch — every step visible ─────────────────────────────
class KNNScratch:
    """
    K-Nearest Neighbours from scratch.
    Training = just storing the data.
    Prediction = find k nearest, aggregate their labels.
    """

    def __init__(self, k: int = 5, task: str = 'regression'):
        self.k      = k
        self.task   = task   # 'regression' or 'classification'
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        'Training' KNN = just storing the data.
        Nothing is computed. Nothing is learned.
        """
        self.X_train = X.copy()
        self.y_train = y.copy()
        return self

    def _euclidean_distance(self, a: np.ndarray, b: np.ndarray) -> float:
        """Distance between two points: sqrt(sum of squared differences)."""
        return np.sqrt(np.sum((a - b) ** 2))

    def _predict_one(self, x: np.ndarray):
        """Predict for a single query point."""
        # Step 1: compute distance to every training point
        distances = np.array([
            self._euclidean_distance(x, x_train)
            for x_train in self.X_train
        ])

        # Step 2: find indices of k nearest neighbours
        k_nearest_indices = np.argsort(distances)[:self.k]

        # Step 3: get their labels
        k_nearest_labels = self.y_train[k_nearest_indices]

        # Step 4: aggregate
        if self.task == 'regression':
            return k_nearest_labels.mean()
        else:
            # majority vote
            values, counts = np.unique(k_nearest_labels, return_counts=True)
            return values[np.argmax(counts)]

    def predict(self, X: np.ndarray) -> np.ndarray:
        return np.array([self._predict_one(x) for x in X])


# ── Test on Swiggy delivery data ───────────────────────────────────────
np.random.seed(42)
n = 500
distance  = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic   = np.random.randint(1, 11, n).astype(float)
prep      = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery  = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
             + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep])
y = delivery

# Scale — CRITICAL for KNN
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc  = sc.transform(X_test)

# From-scratch KNN
knn_scratch = KNNScratch(k=5, task='regression')
knn_scratch.fit(X_train_sc, y_train)
y_pred_scratch = knn_scratch.predict(X_test_sc[:10])  # only 10 for speed

# Verify with sklearn
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error
knn_sk = KNeighborsRegressor(n_neighbors=5)
knn_sk.fit(X_train_sc, y_train)
y_pred_sk = knn_sk.predict(X_test_sc[:10])

print("From-scratch vs sklearn predictions (first 10):")
for s, sk, true in zip(y_pred_scratch, y_pred_sk, y_test[:10]):
    print(f"  scratch={s:.1f}  sklearn={sk:.1f}  true={true:.1f}")

# Full test MAE
y_pred_full = knn_sk.predict(X_test_sc)
print(f"\nFull test MAE (k=5): {mean_absolute_error(y_test, y_pred_full):.4f} min")
Measuring closeness

Distance metrics — which one to use and why it matters

KNN's entire behaviour depends on how you measure distance. The same dataset can produce completely different predictions depending on which distance metric you choose. The default — Euclidean distance — works well in most cases. But understanding the alternatives lets you make better choices for specific problem types.

The three main distance metrics — compared on the same two points
Euclidean (L2): √Σ(aᵢ − bᵢ)²

Straight-line distance. The default. Treats all directions equally. Best when features are continuous and on the same scale.

Example: Two Swiggy orders: A=(5km, traffic=8) B=(3km, traffic=6). Distance = √((5-3)²+(8-6)²) = √8 = 2.83
Use when: Default for most problems. Always scale first.
Manhattan (L1): Σ|aᵢ − bᵢ|

Sum of absolute differences. Like navigating a grid — you can only move horizontally or vertically. Less sensitive to outliers than Euclidean.

Example: Same orders: |5-3| + |8-6| = 2 + 2 = 4. Larger than Euclidean.
Use when: High-dimensional data, sparse features, when outliers are a concern.
Minkowski (general): (Σ|aᵢ − bᵢ|^p)^(1/p)

General case: p=1 is Manhattan, p=2 is Euclidean. Higher p emphasises the largest difference between dimensions.

Example: sklearn uses Minkowski with p=2 (Euclidean) by default. Set p=1 for Manhattan.
Use when: Tune p as a hyperparameter if the default is not working.
python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

np.random.seed(42)
n = 2000
distance  = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic   = np.random.randint(1, 11, n).astype(float)
prep      = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery  = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
             + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep])
y = delivery

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

# ── Compare distance metrics ───────────────────────────────────────────
print(f"{'Metric':<20} {'p':<5} {'Test MAE'}")
print("─" * 35)
for metric, p in [
    ('euclidean (L2)', 2),
    ('manhattan (L1)', 1),
    ('minkowski p=3',  3),
    ('minkowski p=4',  4),
]:
    knn = KNeighborsRegressor(n_neighbors=5, metric='minkowski', p=p)
    knn.fit(X_tr_sc, y_tr)
    mae = mean_absolute_error(y_te, knn.predict(X_te_sc))
    print(f"  {metric:<18} {p:<5} {mae:.4f}")

# ── Effect of not scaling — the most common mistake ───────────────────
print("\nEffect of not scaling on distance-based algorithms:")
knn_unscaled = KNeighborsRegressor(n_neighbors=5)
knn_unscaled.fit(X_tr, y_tr)   # no scaling
mae_unscaled = mean_absolute_error(y_te, knn_unscaled.predict(X_te))

knn_scaled = KNeighborsRegressor(n_neighbors=5)
knn_scaled.fit(X_tr_sc, y_tr)
mae_scaled = mean_absolute_error(y_te, knn_scaled.predict(X_te_sc))

print(f"  Without scaling: MAE = {mae_unscaled:.4f}  ← dominated by large-scale features")
print(f"  With scaling:    MAE = {mae_scaled:.4f}  ← all features contribute equally")

# ── Weighted KNN — closer neighbours vote more ────────────────────────
# weights='uniform': all k neighbours vote equally
# weights='distance': closer neighbours have more influence (1/distance weighting)
print("\nUniform vs distance-weighted KNN:")
for weights in ['uniform', 'distance']:
    knn = KNeighborsRegressor(n_neighbors=5, weights=weights)
    knn.fit(X_tr_sc, y_tr)
    mae = mean_absolute_error(y_te, knn.predict(X_te_sc))
    print(f"  weights='{weights}': MAE = {mae:.4f}")
The most important hyperparameter

Choosing k — the bias-variance trade-off made visual

The value of k directly controls how much the model generalises versus memorises. Small k means the prediction is based on very few neighbours — highly sensitive to noise. Large k means the prediction is based on many neighbours — smoother but potentially missing local structure. This is the classic bias-variance trade-off, and KNN makes it unusually visible.

Effect of k on decision boundary — from jagged to smooth
k = 1
Bias: Very low
Variance: Very high
Train acc: 100%
Test acc: Poor — memorised noise

Jagged, fits every training point exactly

k = 7
Bias: Balanced
Variance: Balanced
Train acc: Good
Test acc: Good — generalises well

Smooth enough to generalise, flexible enough to fit patterns

k = n
Bias: Very high
Variance: Very low
Train acc: Poor
Test acc: Poor — too simple

Always predicts the same thing (global majority/mean)

python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 2000
distance  = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic   = np.random.randint(1, 11, n).astype(float)
prep      = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
delivery  = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
             + np.random.normal(0, 4, n)).clip(10, 120)

X = np.column_stack([distance, traffic, prep])
y = delivery

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

# ── k sweep — find optimal k ──────────────────────────────────────────
print(f"{'k':<6} {'Train MAE':<12} {'CV MAE (5-fold)':<18} {'Test MAE'}")
print("─" * 54)

best_k, best_cv = None, float('inf')
for k in [1, 3, 5, 7, 10, 15, 20, 30, 50, 100]:
    knn = KNeighborsRegressor(n_neighbors=k, weights='distance')
    knn.fit(X_tr_sc, y_tr)

    train_mae = mean_absolute_error(y_tr, knn.predict(X_tr_sc))
    cv_scores = cross_val_score(knn, X_tr_sc, y_tr,
                                 cv=5, scoring='neg_mean_absolute_error')
    cv_mae    = -cv_scores.mean()
    test_mae  = mean_absolute_error(y_te, knn.predict(X_te_sc))

    flag = ' ← best CV' if cv_mae < best_cv else ''
    if cv_mae < best_cv:
        best_cv, best_k = cv_mae, k

    overfit = ' ← overfit' if train_mae < cv_mae * 0.7 else ''
    print(f"  {k:<4}  {train_mae:<12.4f} {cv_mae:<16.4f}  {test_mae:.4f}{flag}{overfit}")

print(f"\nBest k by CV: {best_k}  (CV MAE = {best_cv:.4f})")

# ── Rule of thumb: k = sqrt(n_train) as starting point ────────────────
k_rule_of_thumb = int(np.sqrt(len(X_tr)))
print(f"\nRule of thumb: k = sqrt(n_train) = sqrt({len(X_tr)}) = {k_rule_of_thumb}")
print("Always use an odd k for binary classification to avoid ties.")
print("Tune k with cross-validation — do not rely on the rule of thumb alone.")
Why KNN breaks in high dimensions

The curse of dimensionality — KNN's fundamental limitation

KNN works beautifully in 2 or 3 dimensions. It falls apart in 100 dimensions. The reason is counterintuitive but important — in high-dimensional space, distances lose their meaning. All points become approximately equidistant from each other. When every point is roughly the same distance from every other point, "nearest neighbours" is a meaningless concept.

🧠 Analogy — read this first

Imagine searching for the nearest person to you on a street (1D). Easy — look left and right. Now on a field (2D). Harder, but doable. Now in a building (3D). Now in a 100-dimensional space where each dimension is one feature of a customer profile.

In 100 dimensions, the "nearest" customer might actually be almost as far away as the furthest customer. The ratio of nearest-to-furthest distance approaches 1 as dimensions grow. All distances become equally large and equally meaningless.

The curse — what happens to distances as dimensions increase

In d dimensions, maintaining the same neighbourhood density requires exponentially more data. If 10 neighbours fall within a window covering 10% of the range in 1D, you need about 100 points; keeping 10 neighbours in a window covering 10% of every axis in 10D needs on the order of 10¹¹ points. In practice, your data becomes hopelessly sparse.
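This back-of-envelope arithmetic can be sketched directly. Under the simple volume model assumed here — a neighbourhood spanning a fraction f of each axis covers f^d of a d-dimensional unit cube — keeping k points inside it requires roughly k / f^d training points:

```python
# Points needed to keep k neighbours in a window spanning a fraction f
# of each axis: the window covers f**d of the volume, so n ≈ k / f**d.
k, f = 10, 0.1
for d in [1, 2, 3, 10]:
    needed = k / f**d
    print(f"{d:>3}D: need ~{needed:,.0f} points for {k} neighbours")
```

The count grows by a factor of 10 with every added dimension, which is exactly why KNN's data requirements explode in high dimensions.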

2D: Neighbourhoods are meaningful — 5 neighbours is genuinely close
10D: Neighbourhoods start thinning — need more data to work well
50D: Distances nearly uniform — KNN accuracy degrades significantly
100D+: All points approximately equidistant — KNN essentially random
python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# ── Demonstrate curse of dimensionality ───────────────────────────────
# Add progressively more noise dimensions — watch KNN performance degrade

n = 1000
# Core signal: 3 features with real predictive power
X_core = np.column_stack([
    np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15),
    np.random.randint(1, 11, n).astype(float),
    np.abs(np.random.normal(15, 5, n)).clip(5, 35),
])
y = (8.6 + 7.3*X_core[:,0] + 0.8*X_core[:,2] + 1.5*X_core[:,1]
     + np.random.normal(0, 4, n)).clip(10, 120)

print("KNN performance as dimensions increase (3 real + noise):")
print(f"{'Dimensions':<14} {'CV MAE':>10} {'Distance ratio':>16}")
print("─" * 44)

for n_noise in [0, 2, 5, 10, 20, 50, 100]:
    X_noise = np.random.randn(n, n_noise) if n_noise > 0 else np.empty((n, 0))
    X_full  = np.hstack([X_core, X_noise])
    total_d = X_full.shape[1]

    sc   = StandardScaler()
    X_sc = sc.fit_transform(X_full)

    # KNN CV MAE
    knn    = KNeighborsRegressor(n_neighbors=10)
    scores = cross_val_score(knn, X_sc, y, cv=5, scoring='neg_mean_absolute_error')
    cv_mae = -scores.mean()

    # Distance ratio: nearest / farthest — approaches 1 as dims increase
    from sklearn.metrics.pairwise import euclidean_distances
    sample_dists = euclidean_distances(X_sc[:100])
    np.fill_diagonal(sample_dists, np.inf)
    nearest  = sample_dists.min(axis=1).mean()
    np.fill_diagonal(sample_dists, 0)
    farthest = sample_dists.max(axis=1).mean()
    ratio    = nearest / farthest

    print(f"  {total_d:<12}d  {cv_mae:>10.4f} {ratio:>14.4f}")

print("\nAs dimensions grow: MAE degrades, distance ratio → 1 (all equidistant)")

# ── Fix: reduce dimensions before KNN ─────────────────────────────────
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

X_high = np.hstack([X_core, np.random.randn(n, 47)])  # 50 total dimensions
sc = StandardScaler()
X_high_sc = sc.fit_transform(X_high)

print("\nPCA dimensionality reduction before KNN:")
for n_components in [50, 20, 10, 5, 3]:
    pca  = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_high_sc)
    knn   = KNeighborsRegressor(n_neighbors=10)
    score = cross_val_score(knn, X_pca, y, cv=5,
                             scoring='neg_mean_absolute_error')
    print(f"  PCA to {n_components:2d} dims: CV MAE = {-score.mean():.4f}  "
          f"({pca.explained_variance_ratio_.sum()*100:.0f}% variance retained)")
Classification variant

KNN for classification — majority vote with probabilities

KNN classification works identically to regression — find k nearest neighbours and aggregate. For classification the aggregation is a majority vote: whichever class appears most among the k neighbours wins. Probabilities come naturally — the fraction of neighbours belonging to each class.

python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n = 3000

# Late delivery classification: is_late = delivery > 45 min
distance  = np.abs(np.random.normal(4.0, 2.0, n)).clip(0.5, 15)
traffic   = np.random.randint(1, 11, n).astype(float)
prep      = np.abs(np.random.normal(15, 5, n)).clip(5, 35)
value     = np.abs(np.random.normal(350, 150, n)).clip(50, 1200)
delivery  = (8.6 + 7.3*distance + 0.8*prep + 1.5*traffic
             + np.random.normal(0, 4, n)).clip(10, 120)
y_cls     = (delivery > 45).astype(int)

X = np.column_stack([distance, traffic, prep, value])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_cls, test_size=0.2, stratify=y_cls, random_state=42
)

# ── Pipeline: scale → KNN ─────────────────────────────────────────────
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn',    KNeighborsClassifier(n_neighbors=11, weights='distance')),
])
pipe.fit(X_tr, y_tr)

y_pred  = pipe.predict(X_te)
y_proba = pipe.predict_proba(X_te)[:, 1]

print(f"KNN Classifier (k=11):")
print(f"  Accuracy: {pipe.score(X_te, y_te):.4f}")
print(f"  ROC-AUC:  {roc_auc_score(y_te, y_proba):.4f}")
print(classification_report(y_te, y_pred, target_names=['On-time', 'Late']))

# ── Probability = fraction of neighbours in each class ────────────────
# Inspect probabilities for a few examples
print("Sample predictions with neighbourhood breakdown:")
knn_fitted = pipe.named_steps['knn']
sc_fitted  = pipe.named_steps['scaler']
X_te_sc    = sc_fitted.transform(X_te)

# Get the actual k nearest neighbours for first 3 test points
distances_arr, indices = knn_fitted.kneighbors(X_te_sc[:3])
for i in range(3):
    neighbour_labels = y_tr[indices[i]]
    late_count  = neighbour_labels.sum()
    total       = len(neighbour_labels)
    p_late      = y_proba[i]
    actual      = 'LATE' if y_te[i] == 1 else 'ON-TIME'
    predicted   = 'LATE' if y_pred[i] == 1 else 'ON-TIME'
    print(f"  Sample {i+1}: {late_count}/{total} neighbours are late "
          f"→ P(late)={p_late:.3f}  predicted={predicted}  actual={actual}")

# ── GridSearch over k and weights ─────────────────────────────────────
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 11, 15, 21, 31],
    'knn__weights':     ['uniform', 'distance'],
    'knn__metric':      ['euclidean', 'manhattan'],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid.fit(X_tr, y_tr)
print(f"\nBest params: {grid.best_params_}")
print(f"Best CV AUC: {grid.best_score_:.4f}")
Practical guidance

When KNN wins — and when it does not

KNN is not a general-purpose algorithm in modern ML. It is slow at prediction time, sensitive to irrelevant features, and breaks down in high dimensions. But it has genuine use cases where it outperforms more complex algorithms.

KNN genuinely wins ✓
Recommendation systems — nearest-neighbour collaborative filtering
Anomaly detection — points far from all neighbours are outliers
Low-dimensional data (< 20 features) with non-linear patterns
When you need a fast baseline to compare against
Image similarity search (with learned embeddings, not raw pixels)
Small datasets where tree/ensemble overhead is not worth it
Use something else ✗
Large datasets — prediction is O(n) per query, gets slow fast
High dimensions (> 20 features) — curse of dimensionality
Many irrelevant features — noise kills distance meaning
Need feature importance — KNN provides none
Need fast predictions in production — every query scans all data
Tabular ML competitions — XGBoost/RF almost always wins

Speed up KNN for production: approximate nearest neighbours

python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
import time

np.random.seed(42)
n_train = 10_000
n_query = 1000

X_train = np.random.randn(n_train, 5)
y_train = np.random.randn(n_train)
X_query = np.random.randn(n_query, 5)

sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_train)
X_qu_sc = sc.transform(X_query)

# ── algorithm parameter controls the search structure ─────────────────
# 'brute':    compute all distances — O(n) per query, exact, slow for large n
# 'ball_tree': partition space into nested hyperspheres — faster for low-dim
# 'kd_tree':  partition space into axis-aligned boxes — fastest for very low-dim
# 'auto':     sklearn picks the best based on data characteristics

print("KNN algorithm comparison (10,000 training points, 1,000 queries):")
print(f"{'Algorithm':<15} {'Fit time':>12} {'Query time':>12} {'Speed vs brute':>16}")
print("─" * 58)

brute_time = None
for algo in ['brute', 'kd_tree', 'ball_tree', 'auto']:
    knn = KNeighborsRegressor(n_neighbors=10, algorithm=algo)

    t0 = time.time()
    knn.fit(X_tr_sc, y_train)
    fit_time = time.time() - t0

    t0 = time.time()
    knn.predict(X_qu_sc)
    query_time = time.time() - t0

    if algo == 'brute':
        brute_time = query_time
    speedup = brute_time / query_time if brute_time else 1.0
    print(f"  {algo:<13}  {fit_time:>10.4f}s  {query_time:>10.4f}s  {speedup:>13.1f}×")

# ── leaf_size: BallTree/KDTree trade-off ─────────────────────────────
# Smaller leaf_size → deeper tree → faster query but more memory
# Larger leaf_size → shallower tree → less memory but more brute-force at leaves
print("\nleaf_size effect on BallTree query time:")
for leaf in [5, 10, 30, 50, 100]:
    knn = KNeighborsRegressor(n_neighbors=10, algorithm='ball_tree', leaf_size=leaf)
    knn.fit(X_tr_sc, y_train)
    t0 = time.time()
    knn.predict(X_qu_sc)
    qt = time.time() - t0
    print(f"  leaf_size={leaf:<5}: {qt:.4f}s")
Errors you will hit

Every common KNN error — explained and fixed

KNN accuracy is terrible — barely better than random guessing
Why it happens

Almost always caused by not scaling features. KNN computes Euclidean distance. A feature with values 0–5000 (like price in rupees) contributes 5000² = 25 million to the squared distance. A feature with values 0–1 contributes at most 1. The large-scale feature completely dominates distance calculations and the model ignores everything else.

Fix

Always put StandardScaler inside a Pipeline before KNeighborsClassifier or KNeighborsRegressor. After scaling, every feature has mean=0 and std=1 — they all contribute equally to distances. This single fix often improves KNN accuracy by 20–40% on mixed-scale datasets.
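A minimal sketch of the fix on synthetic mixed-scale data (feature names are illustrative): the signal lives entirely in a 0–1 feature, while an irrelevant rupee-scale feature dominates unscaled distances.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
n = 1000
price_rupees = rng.uniform(50, 5000, n)       # scale: thousands
rating_0_1   = rng.uniform(0, 1, n)           # scale: ~1
y = 10 * rating_0_1 + rng.normal(0, 0.5, n)   # signal is in the SMALL feature
X = np.column_stack([price_rupees, rating_0_1])

# Without scaling: distances are dominated by price, so neighbours are
# random with respect to the signal
unscaled = KNeighborsRegressor(n_neighbors=5).fit(X, y)

# With scaling inside a Pipeline: both features contribute equally
scaled = Pipeline([
    ('sc',  StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=5)),
]).fit(X, y)

print(f"unscaled R²: {unscaled.score(X, y):.3f}")
print(f"scaled   R²: {scaled.score(X, y):.3f}")
```

The Pipeline also guarantees the scaler is fit only on training folds during cross-validation, avoiding leakage.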

KNN prediction is extremely slow in production — taking seconds per request
Why it happens

Default KNN (algorithm='brute') computes the distance from every query point to every training point. With 100,000 training points and 50 features, each prediction requires 100,000 distance calculations. At 1,000 predictions per second this becomes the bottleneck.

Fix

Switch to algorithm='ball_tree' or algorithm='kd_tree' which partition the training space into a tree structure, reducing average query complexity from O(n) to O(log n). For production at very large scale, use approximate nearest neighbour libraries like FAISS or Annoy which trade small accuracy losses for 100× speed improvements.

KNN with k=1 has 100% training accuracy but poor test accuracy
Why it happens

With k=1, every training point's nearest neighbour is itself — so the model always predicts the exact correct label for training data. This is perfect overfitting. The model has memorised every training example instead of learning a general pattern.

Fix

Never use k=1 for evaluation — it is meaningless. Use cross-validation to find the optimal k: cross_val_score(KNeighborsRegressor(n_neighbors=k), ...). Start with k=sqrt(n_train) and tune around that value. Use odd k for binary classification to avoid tie votes.
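A minimal sketch of both the diagnosis and the fix on synthetic data: with k=1 the training error is exactly zero because every training point is its own nearest neighbour, while cross-validation reveals the true generalisation error.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + rng.normal(0, 0.5, 500)

# k=1: each training point's nearest neighbour is itself → train MAE = 0
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)
train_mae = np.abs(knn1.predict(X) - y).mean()
print(f"k=1 train MAE: {train_mae:.4f}  (pure memorisation)")

# Cross-validation exposes the overfitting and lets you pick a better k
for k in [1, 5, 15]:
    cv_mae = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                              cv=5, scoring='neg_mean_absolute_error').mean()
    print(f"k={k:<3} CV MAE: {cv_mae:.4f}")
```

The CV MAE for k=1 is bounded below by the noise in the target, no matter how good the training score looks.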

ValueError: Expected 2D array, got 1D array instead when calling predict
Why it happens

You passed a single sample as a 1D array (shape (n_features,)) instead of a 2D array (shape (1, n_features)). sklearn's predict() always expects a 2D array — even for a single sample.

Fix

Reshape a single sample before predicting: model.predict(x.reshape(1, -1)). Or use np.array([x]) which creates a (1, n_features) array. For pandas: model.predict(df.iloc[[0]]) — double brackets keep it as a DataFrame with shape (1, n_features).
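A minimal sketch of the error and the fix on synthetic data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X.sum(axis=1)
model = KNeighborsRegressor(n_neighbors=3).fit(X, y)

x_single = np.array([0.1, -0.2, 0.3])       # shape (3,) — 1D
# model.predict(x_single)                   # ValueError: Expected 2D array...

pred = model.predict(x_single.reshape(1, -1))   # shape (1, 3) — works
# np.array([x_single]) produces the same (1, 3) shape
print(pred.shape)                               # (1,)
```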

What comes next

KNN asks its neighbours. Naive Bayes asks Bayes' theorem.

KNN is a distance-based algorithm — it makes no assumptions about the distribution of data, it just measures proximity. Naive Bayes is the opposite — it is a probabilistic algorithm that makes explicit assumptions about how features are distributed, then uses Bayes' theorem to compute the probability of each class. Despite the "naive" independence assumption that is almost always wrong, it works surprisingly well for text classification and spam detection.

Next — Module 27 · Classical ML
Naive Bayes — Probabilistic Text Classification

Bayes theorem applied to classification. Why the naive independence assumption works surprisingly well for spam and document classification.

coming soon

🎯 Key Takeaways

  • KNN has no training phase — "training" means storing the data. All computation happens at prediction time: find k nearest training points, aggregate their labels. This makes training instant but prediction slow on large datasets.
  • KNN is one of the most scaling-sensitive algorithms in all of sklearn. Unscaled features completely break distance calculations. Always put StandardScaler in a Pipeline before any KNN model — this is the single most impactful thing you can do.
  • k controls the bias-variance trade-off directly. k=1 memorises every training point (zero bias, maximum variance, 100% training accuracy). k=n predicts the global average (maximum bias, zero variance). Start with k=sqrt(n_train) and tune with cross-validation.
  • Use weights="distance" instead of the default weights="uniform". Closer neighbours are more informative than distant ones — distance weighting consistently improves KNN performance at almost no cost.
  • The curse of dimensionality is KNN's fundamental limit. In high dimensions all points become approximately equidistant, making nearest neighbours meaningless. For datasets with more than ~20 features, apply PCA to reduce dimensions before KNN.
  • KNN genuinely wins at recommendation (collaborative filtering), anomaly detection (far from all neighbours = anomaly), and low-dimensional non-linear problems. For large datasets, tabular ML, or high-dimensional data, XGBoost or Random Forest will almost always outperform it.