Python · SQL · Web Dev · Java · AI/ML tracks launching soon — your one platform for all of IT
Math Foundations
Information Theory — Entropy, Cross-Entropy and KL Divergence
The information-theoretic foundations of ML. Why surprise is measurable, what entropy really means, and how every neural network loss function connects to information theory.
Claude Shannon asked: how do you measure information?
It's 1948. Claude Shannon is working at Bell Labs trying to figure out how to send messages over telephone wires as efficiently as possible. He asks a question nobody had asked precisely before: what is the minimum number of bits required to transmit a message? How much information does a message actually contain?
His answer — published in "A Mathematical Theory of Communication" — created the entire field of information theory. And it turns out that the same mathematics that describes efficient telephone communication is also the mathematics behind why neural networks use cross-entropy loss, how language models are evaluated, and what makes a feature useful for classification.
Module 06 introduced entropy, cross-entropy, and KL divergence briefly as they connect to probability and loss functions. This module goes much deeper — building from the very first principle (what is information?) all the way to mutual information, Jensen-Shannon divergence, and how information theory shapes model evaluation and feature selection.
What this module adds beyond Module 06:
Self-information — what does it mean for one specific event to carry information?
Entropy from first principles — derived, not just defined
Joint entropy and conditional entropy — information across multiple variables
Mutual information — the most important concept in feature selection
Jensen-Shannon divergence — a symmetric, bounded version of KL divergence
Perplexity — how language models are actually evaluated in practice
Information bottleneck — why deep learning compresses information
Practical applications: feature selection, data compression, model evaluation
🎯 Pro Tip
This module is more conceptual than the previous ones. The goal is not to memorise formulas but to develop an intuition for what information means quantitatively. That intuition will make every loss function and every evaluation metric click into place.
Building from first principles
Self-information — how surprising is one event?
Before defining entropy for a whole distribution, we need to define information for a single event. Shannon's insight was this: the information content of an event is related to how surprising it is. An event that happens all the time carries no information when it occurs. An event that almost never happens carries a lot of information when it does.
Think about it concretely. You open Swiggy and see "Your order is being prepared." This happens every single time — completely expected, no new information. But if you open Swiggy and see "Your restaurant just won a Michelin star" — extremely rare, extremely surprising, and therefore carries a lot of information.
The self-information formula:
I(x) = −log₂ P(x) bits
Certain event P(x)=1: I(x) = −log₂(1) = 0 bits → no information
Coin flip P(x)=0.5: I(x) = −log₂(0.5) = 1 bit → one bit of information
Why the negative logarithm? Three reasons that Shannon showed are the only consistent choice: information should be zero for certain events, should decrease as probability increases, and should add when two independent events occur simultaneously. The negative log is the unique function satisfying all three.
The base of the logarithm determines the unit. Base 2 → bits (binary digits). Base e → nats. Base 10 → hartleys. ML uses both bits (log₂) and nats (ln) depending on context. Cross-entropy loss in PyTorch uses natural log → units are nats.
python
import numpy as np
def self_information_bits(p):
"""I(x) = -log2(p) — result in bits"""
return -np.log2(p)
def self_information_nats(p):
"""I(x) = -ln(p) — result in nats (used in PyTorch)"""
return -np.log(p)
# Real examples
events = [
('Swiggy order confirmed (always happens)', 1.00),
('Order arrives in under 20 min', 0.30),
('Coin flip lands heads', 0.50),
('Order delayed by more than 1 hour', 0.05),
('Restaurant partner wins Michelin star', 0.001),
]
print(f"{'Event':<50} {'P(x)':<8} {'Bits':<10} {'Nats'}")
print('─' * 85)
for event, p in events:
bits = self_information_bits(p)
nats = self_information_nats(p)
print(f"{event:<50} {p:<8.3f} {bits:<10.3f} {nats:.3f}")
# Key property: information is ADDITIVE for independent events
# I(A and B) = I(A) + I(B) when A and B are independent
p_heads1 = 0.5
p_heads2 = 0.5
p_both = p_heads1 * p_heads2 # independent → multiply probabilities
i_heads1 = self_information_bits(p_heads1)
i_heads2 = self_information_bits(p_heads2)
i_both = self_information_bits(p_both)
print(f"
Two coin flips (independent):")
print(f" I(heads1) = {i_heads1:.1f} bits")
print(f" I(heads2) = {i_heads2:.1f} bits")
print(f" I(both) = {i_both:.1f} bits")
print(f" Sum: {i_heads1 + i_heads2:.1f} bits ← additive ✓")
Why log₂ and why "bits"?
A bit is the answer to a yes/no question. If I need to tell you which of 2 equally likely outcomes occurred, I need 1 bit. Which of 4 equally likely outcomes? 2 bits (two yes/no questions). Which of 8? 3 bits. Which of n? log₂(n) bits.
When an event has probability 1/8 (one of 8 equally likely outcomes), I(x) = −log₂(1/8) = log₂(8) = 3 bits. Exactly as many bits as needed to encode which outcome it was. This is not a coincidence — it's the definition working exactly as intended.
python
# Demonstrating: self-information = bits needed to encode the outcome
# If there are n equally likely outcomes, each has probability 1/n
# Self-information = -log2(1/n) = log2(n) = bits needed to identify one outcome
for n_outcomes in [2, 4, 8, 16, 64, 256, 1024]:
p = 1 / n_outcomes
bits_needed = self_information_bits(p)
# This equals log2(n_outcomes) exactly
print(f" {n_outcomes:4d} equally likely outcomes: {bits_needed:.1f} bits per outcome")
# Output:
# 2: 1.0 bits ← one yes/no question
# 4: 2.0 bits ← two yes/no questions
# 8: 3.0 bits
# 16: 4.0 bits
# 64: 6.0 bits
# 256: 8.0 bits ← one byte!
# 1024: 10.0 bits
# Practical: a byte (8 bits) can encode 256 equally likely symbols.
# ASCII has 128 characters: each one needs log2(128) = 7 bits.
# This is why ASCII is a 7-bit encoding. Information theory designed it.
Shannon entropy
Entropy — the average self-information of a distribution
We defined self-information for one specific event. Entropy extends this to an entire distribution. It asks: on average, how much information do I get when I observe an outcome drawn from this distribution? Or equivalently: on average, how many bits do I need to communicate which outcome occurred?
The entropy H(X) is the expected value of self-information over the entire distribution — the probability-weighted average of how surprising each outcome is.
Entropy — the derivationoptional — read when ready
Expected value of self-information:
H(X) = E[I(X)] = E[−log₂ P(X)]
H(X) = −Σ P(xᵢ) log₂ P(xᵢ) (discrete)
H(X) = −∫ p(x) log₂ p(x) dx (continuous)
Each term P(xᵢ) × I(xᵢ) = P(xᵢ) × (−log P(xᵢ)) is: how likely this outcome is × how surprising it would be. The sum gives the average surprise across all possible outcomes.
The two extreme cases — and why they matter in ML
Entropy extremes — from deterministic to maximally uncertain
Deterministic — H = 0 bitsminimum entropy
p=1.0
One outcome has probability 1, all others 0. You already know what will happen — no information gained from observation. In ML: a perfectly confident model on its training data. This is why H=0 doesn't mean good model — it can mean overfitting.
Uniform — H = log₂(n) bitsmaximum entropy
0.25
0.25
0.25
0.25
All outcomes equally likely. Maximum uncertainty — maximum information per observation. For n=4 outcomes: H = log₂(4) = 2 bits. In ML: an untrained model outputting uniform probabilities has maximum entropy. A well-trained model should have low entropy on confident predictions.
python
import numpy as np
def entropy_bits(probs):
"""H(X) = -sum(p * log2(p))"""
probs = np.array(probs, dtype=float)
probs = probs[probs > 0] # 0*log(0) is defined as 0 by convention
return -np.sum(probs * np.log2(probs))
def entropy_nats(probs):
"""H(X) = -sum(p * ln(p)) — used in PyTorch"""
probs = np.array(probs, dtype=float)
probs = probs[probs > 0]
return -np.sum(probs * np.log(probs))
# ── Four scenarios from a Swiggy star rating predictor ───────────────
# Scenario A: model is very confident (good!)
model_confident = [0.01, 0.02, 0.04, 0.13, 0.80] # almost certain it's 5 stars
# Scenario B: model is moderately confident
model_moderate = [0.05, 0.10, 0.20, 0.40, 0.25]
# Scenario C: model is totally confused (bad — high entropy)
model_confused = [0.20, 0.20, 0.20, 0.20, 0.20] # uniform
# Scenario D: model is confident but WRONG (also bad)
model_wrong = [0.80, 0.13, 0.04, 0.02, 0.01] # says 1 star, but true label is 5
scenarios = [
('Confident (correct)', model_confident),
('Moderate', model_moderate),
('Confused (uniform)', model_confused),
('Confident (wrong)', model_wrong),
]
max_entropy = np.log2(5) # maximum entropy for 5 classes = log2(5) ≈ 2.32 bits
print(f"Max possible entropy (5 classes): {max_entropy:.4f} bits
")
print(f"{'Scenario':<28} {'H (bits)':<12} {'% of max':<12} {'Interpretation'}")
print('─' * 75)
for name, probs in scenarios:
h = entropy_bits(probs)
pct = h / max_entropy * 100
interp = "confident" if pct < 30 else "uncertain" if pct < 70 else "very uncertain"
print(f"{name:<28} {h:<12.4f} {pct:<12.1f} {interp}")
# Entropy as an uncertainty metric:
# High entropy → model is uncertain → maybe need more training or more data
# Low entropy → model is confident → check if it's correct too!
# Entropy alone doesn't tell you if the model is right, only if it's confident
Entropy of continuous distributions
For continuous distributions we use differential entropy — the continuous analogue. Unlike discrete entropy it can be negative (a very narrow, concentrated distribution can have negative entropy) and its value depends on the units. In ML, the relative differences in entropy matter more than absolute values.
python
from scipy import stats
import numpy as np
# Differential entropy for common distributions
# Gaussian: H = 0.5 * ln(2πeσ²)
def gaussian_entropy(sigma):
return 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print("Gaussian entropy as σ grows:")
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
h = gaussian_entropy(sigma)
print(f" σ={sigma:4.1f} → H = {h:.4f} nats")
# Narrower distribution = lower entropy = more certain = less information per sample
# This is why batch normalisation stabilises training — it controls the entropy
# of activations, preventing them from becoming too concentrated or too spread.
# Comparing entropy of two delivery time models:
model_precise = stats.norm(loc=35, scale=3) # σ=3 min — precise
model_imprecise = stats.norm(loc=35, scale=12) # σ=12 min — imprecise
print(f"
Precise model (σ=3): H = {gaussian_entropy(3):.4f} nats")
print(f"Imprecise model (σ=12): H = {gaussian_entropy(12):.4f} nats")
print(f"Difference: {gaussian_entropy(12) - gaussian_entropy(3):.4f} nats")
# Higher entropy = more uncertainty in predictions = worse model (here)
Two variables together
Joint entropy and conditional entropy
So far we've measured information in one variable at a time. But ML problems almost always involve multiple variables — features and labels, inputs and outputs, current state and next state. Joint entropy and conditional entropy extend Shannon's framework to pairs of variables.
Joint entropy — total uncertainty in two variables together
The joint entropy H(X, Y) measures the total uncertainty when you consider X and Y together. If X and Y are completely independent, H(X, Y) = H(X) + H(Y) — their combined uncertainty is just the sum. If they are correlated, H(X, Y) < H(X) + H(Y) — knowing about one reduces uncertainty about the other, so together they contain less combined uncertainty than separately.
python
import numpy as np
# Joint probability table: P(X=x, Y=y)
# X = time of day (morning/evening)
# Y = delivery fast or slow
# joint_probs[i][j] = P(time_i, speed_j)
# Columns: fast slow
joint = np.array([
[0.35, 0.15], # morning: mostly fast
[0.10, 0.40], # evening: mostly slow (rush hour)
])
# Each cell: P(time=row, speed=col)
# Row sums = P(time): [0.50, 0.50]
# Col sums = P(speed): [0.45, 0.55]
def joint_entropy(joint_probs):
"""H(X,Y) = -sum_xy P(x,y) log2 P(x,y)"""
probs = joint_probs.flatten()
probs = probs[probs > 0]
return -np.sum(probs * np.log2(probs))
def marginal_entropy(probs_1d):
"""H(X) for a 1D marginal distribution"""
probs = np.array(probs_1d)
probs = probs[probs > 0]
return -np.sum(probs * np.log2(probs))
p_time = joint.sum(axis=1) # [0.50, 0.50] — marginal over time
p_speed = joint.sum(axis=0) # [0.45, 0.55] — marginal over speed
H_time = marginal_entropy(p_time)
H_speed = marginal_entropy(p_speed)
H_joint = joint_entropy(joint)
H_sum = H_time + H_speed
print(f"H(time) = {H_time:.4f} bits")
print(f"H(speed) = {H_speed:.4f} bits")
print(f"H(time) + H(speed) = {H_sum:.4f} bits (if independent)")
print(f"H(time, speed) = {H_joint:.4f} bits (actual joint)")
print(f"Reduction: {H_sum - H_joint:.4f} bits (← shared information)")
# Joint < sum → they are correlated (time of day tells you something about speed)
Conditional entropy — uncertainty remaining after knowing something
Conditional entropy H(Y|X) answers: if I already know X, how much uncertainty remains about Y? It's the average entropy of Y given each possible value of X, weighted by how likely each value of X is. If X and Y are independent, H(Y|X) = H(Y) — knowing X tells you nothing. If X perfectly predicts Y, H(Y|X) = 0 — no uncertainty remains.
python
# Conditional entropy H(Y|X)
# H(Y|X) = sum_x P(x) * H(Y|X=x)
def conditional_entropy(joint_probs):
"""H(Y|X) from a joint probability table (rows=X, cols=Y)"""
p_x = joint_probs.sum(axis=1, keepdims=True) # marginal P(x)
# Conditional distribution P(Y|X=x) for each row
p_y_given_x = joint_probs / np.where(p_x > 0, p_x, 1)
total = 0.0
for i in range(joint_probs.shape[0]):
row = p_y_given_x[i]
row = row[row > 0]
h_row = -np.sum(row * np.log2(row)) # H(Y | X=x_i)
total += p_x[i, 0] * h_row # weight by P(X=x_i)
return total
H_speed_given_time = conditional_entropy(joint)
H_time_given_speed = conditional_entropy(joint.T)
print(f"H(speed | time) = {H_speed_given_time:.4f} bits")
print(f"H(time | speed) = {H_time_given_speed:.4f} bits")
print(f"H(speed) = {H_speed:.4f} bits")
print(f"Reduction from knowing time: {H_speed - H_speed_given_time:.4f} bits")
# Chain rule of entropy:
# H(X, Y) = H(X) + H(Y|X) always true
print(f"
Chain rule check:")
print(f"H(time) + H(speed|time) = {H_time + H_speed_given_time:.4f}")
print(f"H(time, speed) = {H_joint:.4f}")
print(f"Equal: {np.isclose(H_time + H_speed_given_time, H_joint)}") # True ✓
The most important concept for feature selection
Mutual information — how much does knowing X tell you about Y?
Mutual information I(X; Y) measures how much information X and Y share — how much knowing one reduces uncertainty about the other. It's symmetric: I(X; Y) = I(Y; X). And it's zero if and only if X and Y are completely independent.
For ML, mutual information between a feature and the target label tells you how much that feature contributes to predicting the target. A feature with high mutual information with the label is a valuable feature. A feature with zero mutual information is useless for prediction — no matter how sophisticated your model.
I(X; Y) = H(X) + H(Y) − H(X, Y)
I(X; Y) = 0 → X and Y are independent → feature X tells you nothing about Y
I(X; Y) = H(Y) → X perfectly predicts Y → feature X is the perfect predictor
I(X; Y) = I(Y; X) always → mutual information is symmetric
I(X; Y) ≥ 0 always → knowing a variable never increases uncertainty about another
Mutual information — the Venn diagram of entropy
I(X;Y) is the overlap. I(X;Y) = H(X) + H(Y) − H(X,Y)
python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
np.random.seed(42)
# ── Manual mutual information ─────────────────────────────────────────
def mutual_information(joint_probs):
"""I(X;Y) = H(X) + H(Y) - H(X,Y)"""
p_x = joint_probs.sum(axis=1)
p_y = joint_probs.sum(axis=0)
H_X = marginal_entropy(p_x)
H_Y = marginal_entropy(p_y)
H_XY = joint_entropy(joint_probs)
return H_X + H_Y - H_XY
# Using the delivery time/speed example
mi = mutual_information(joint)
print(f"I(time; speed) = {mi:.4f} bits")
print(f"H(speed) = {H_speed:.4f} bits")
print(f"MI / H(speed) = {mi/H_speed:.4f} ({mi/H_speed*100:.1f}% of label entropy explained)")
# ── Sklearn mutual information for feature selection ─────────────────
# Generate classification dataset: 5 features, 2 informative, 3 noise
X, y = make_classification(
n_samples=1000, n_features=5,
n_informative=2, n_redundant=0,
n_repeated=0, n_classes=2,
random_state=42
)
# mutual_info_classif estimates I(feature; label) for each feature
mi_scores = mutual_info_classif(X, y, random_state=42)
print("
Mutual information: feature → label")
for i, score in enumerate(mi_scores):
bar = '█' * int(score * 50)
label = "informative" if score > 0.05 else "noise"
print(f" Feature {i}: {bar} {score:.4f} [{label}]")
# Features 0 and 1 should have the highest MI — they were informative.
# Features 2, 3, 4 are noise — their MI with y ≈ 0.
# ── MI as a correlation alternative ──────────────────────────────────
# Pearson correlation only captures LINEAR relationships.
# Mutual information captures ANY relationship — linear or not.
x_linear = np.random.randn(500)
y_linear = x_linear + np.random.randn(500) * 0.5 # linear relationship
y_nonlin = x_linear ** 2 + np.random.randn(500) * 0.3 # non-linear (quadratic)
corr_linear = np.corrcoef(x_linear, y_linear)[0, 1]
corr_nonlin = np.corrcoef(x_linear, y_nonlin)[0, 1]
print(f"
Linear relationship: Pearson r = {corr_linear:.3f} (high → detected)")
print(f"Non-linear (quadratic): Pearson r = {corr_nonlin:.3f} (near 0 → missed!)")
# Pearson missed the quadratic relationship. MI would not.
Normalised mutual information — comparing across different scales
python
# Normalised Mutual Information (NMI)
# Scales MI to [0, 1] so you can compare features with different entropies
def normalised_mutual_information(joint_probs):
"""NMI = I(X;Y) / sqrt(H(X) * H(Y))"""
mi = mutual_information(joint_probs)
p_x = joint_probs.sum(axis=1)
p_y = joint_probs.sum(axis=0)
H_X = marginal_entropy(p_x)
H_Y = marginal_entropy(p_y)
if H_X == 0 or H_Y == 0:
return 0.0
return mi / np.sqrt(H_X * H_Y)
nmi = normalised_mutual_information(joint)
print(f"NMI(time, speed) = {nmi:.4f}")
# 0 = completely independent, 1 = perfectly correlated
# NMI is also used to evaluate clustering quality:
# Compare predicted cluster labels with true labels
# High NMI → predicted clusters align with true structure
from sklearn.metrics import normalized_mutual_info_score
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
pred_labels = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0]) # some errors
nmi_clustering = normalized_mutual_info_score(true_labels, pred_labels)
print(f"NMI (clustering evaluation): {nmi_clustering:.4f}")
print("(1.0 = perfect, 0.0 = random)")
Comparing distributions
KL divergence — a deep dive into asymmetry and meaning
Module 06 introduced KL divergence as a measure of how different two distributions are. This section goes deeper — particularly on the asymmetry property, which has practical consequences in how you use it in ML.
D_KL(P ∥ Q) ≠ D_KL(Q ∥ P) in general. The two directions have different meanings and different failure modes. Getting this wrong causes subtle bugs in VAEs, knowledge distillation, and reinforcement learning.
KL divergence asymmetry — forward vs reverse
Forward KL: D_KL(P ∥ Q)
"Make Q cover everywhere P has mass." If P is non-zero somewhere but Q is zero there, the KL is infinite. Q is penalised heavily for missing regions where the true distribution has probability.
Result: Q tends to spread out to cover P → "mean-seeking" behaviour
Reverse KL: D_KL(Q ∥ P)
"Make Q avoid everywhere P has no mass." If Q is non-zero somewhere but P is zero there, the KL is infinite. Q is penalised for placing probability where the true distribution has none.
Result: Q focuses on one mode of P → "mode-seeking" behaviour
python
import numpy as np
def kl_divergence(p, q, eps=1e-10):
"""D_KL(P || Q) = sum(P * log(P/Q))"""
p = np.array(p, dtype=float)
q = np.array(q, dtype=float)
q = np.clip(q, eps, 1) # avoid division by zero
mask = p > 0
return np.sum(p[mask] * np.log(p[mask] / q[mask]))
# True distribution P: bimodal (two peaks)
P = np.array([0.30, 0.05, 0.05, 0.30, 0.05, 0.05, 0.20, 0.00])
# Q_mean: tries to cover both peaks → spreads out
Q_mean = np.array([0.13, 0.12, 0.12, 0.13, 0.12, 0.12, 0.13, 0.13])
# Q_mode1: focuses on first peak only → ignores second
Q_mode1 = np.array([0.45, 0.20, 0.15, 0.10, 0.05, 0.03, 0.02, 0.00])
# Q_mode2: focuses on second peak only
Q_mode2 = np.array([0.05, 0.03, 0.05, 0.45, 0.20, 0.10, 0.10, 0.02])
print("Forward KL D_KL(P||Q) — 'zero-avoiding' — penalises missing P's modes:")
print(f" Q_mean (covers both): {kl_divergence(P, Q_mean):.4f}")
print(f" Q_mode1 (first peak): {kl_divergence(P, Q_mode1):.4f}")
print(f" Q_mode2 (second peak): {kl_divergence(P, Q_mode2):.4f}")
print("
Reverse KL D_KL(Q||P) — 'zero-forcing' — penalises adding to P's empty spots:")
print(f" Q_mean (covers both): {kl_divergence(Q_mean, P):.4f}")
print(f" Q_mode1 (first peak): {kl_divergence(Q_mode1, P):.4f}")
print(f" Q_mode2 (second peak): {kl_divergence(Q_mode2, P):.4f}")
# Key insight:
# Cross-entropy loss uses forward KL (implicitly)
# → neural networks try to cover all modes of the true distribution
#
# Some generative models (flows, early energy models) use reverse KL
# → they pick one mode and fit it perfectly (mode collapse in GANs!)
# This is the information-theoretic explanation for why GANs suffer from
# mode collapse: they minimise something related to reverse KL
When KL = infinity — and why this matters in practice
python
# KL divergence can be infinite — a practical gotcha
P_has_zero = np.array([0.5, 0.5, 0.0]) # P gives zero probability to class 3
Q_nonzero = np.array([0.4, 0.4, 0.2]) # Q gives 0.2 probability to class 3
# Forward KL D_KL(P||Q): P has 0 probability where Q has 0.2
# The 0 * log(0/0.2) term = 0 (0 * anything = 0) — finite!
print(f"D_KL(P || Q) = {kl_divergence(P_has_zero, Q_nonzero):.4f}") # finite — OK
# Reverse KL D_KL(Q||P): Q has 0.2 where P has 0
# The 0.2 * log(0.2/0) = infinity!
print(f"D_KL(Q || P) = {kl_divergence(Q_nonzero, P_has_zero):.4f}") # very large!
# This is why in variational inference and VAEs:
# - The KL term in the ELBO uses reverse KL D_KL(Q||P)
# - Q (the approximate posterior) is forced to put zero mass
# wherever the prior P puts zero mass
# - This prevents Q from "inventing" probability mass where P has none
# - The result: the learned latent space stays well-structured
# Practical fix: always add epsilon before computing KL in code
def safe_kl(p, q, eps=1e-10):
p = np.clip(p, eps, 1)
q = np.clip(q, eps, 1)
# Renormalise after clipping
p = p / p.sum()
q = q / q.sum()
return np.sum(p * np.log(p / q))
print(f"Safe KL with clipping: {safe_kl(P_has_zero, Q_nonzero):.4f}")
KL divergence has two problems: it's asymmetric and it can be infinite. The Jensen-Shannon divergence (JSD) fixes both. It's symmetric (JSD(P, Q) = JSD(Q, P)), always finite (even when P and Q have non-overlapping support), and bounded between 0 and 1 (when using log₂) or 0 and ln(2) (when using natural log).
JSD also has a beautiful interpretation: it's the average KL divergence from each distribution to their mixture M = (P + Q) / 2. You create a midpoint distribution and measure how far each original distribution is from that midpoint.
M = (P + Q) / 2
JSD(P, Q) = (D_KL(P ∥ M) + D_KL(Q ∥ M)) / 2
The square root √JSD is actually a proper metric — it satisfies the triangle inequality. This makes it useful for comparing text distributions in NLP and comparing model outputs to true labels.
python
import numpy as np
from scipy.spatial.distance import jensenshannon
def jsd_manual(p, q):
"""Jensen-Shannon Divergence — symmetric and bounded"""
p = np.array(p, dtype=float)
q = np.array(q, dtype=float)
M = (p + q) / 2
return (kl_divergence(p, M) + kl_divergence(q, M)) / 2
# Properties demonstration
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5]) # P reversed
print("JSD properties:")
print(f" JSD(P, Q) = {jsd_manual(P, Q):.6f}")
print(f" JSD(Q, P) = {jsd_manual(Q, P):.6f}")
print(f" Symmetric: {np.isclose(jsd_manual(P, Q), jsd_manual(Q, P))}")
# Bounds
identical = np.array([0.5, 0.3, 0.2])
opposite = np.array([1.0, 0.0, 0.0])
print(f"
JSD(P, P) = {jsd_manual(P, identical):.6f} (identical → 0)")
print(f" JSD(P, Q_extreme): {jsd_manual(P, opposite):.6f} (very different)")
print(f" Max possible JSD = {np.log(2):.6f} (= ln(2) with natural log)")
# Comparison: KL can be infinite where JSD is bounded
P_sparse = np.array([0.9, 0.1, 0.0])
Q_sparse = np.array([0.0, 0.5, 0.5]) # non-overlapping support
print(f"
Non-overlapping distributions:")
print(f" KL(P||Q) = {kl_divergence(P_sparse, Q_sparse):.4f}") # huge
print(f" KL(Q||P) = {kl_divergence(Q_sparse, P_sparse):.4f}") # huge
print(f" JSD(P,Q) = {jsd_manual(P_sparse, Q_sparse):.6f}") # bounded ✓
# ── JSD in ML: GAN training ───────────────────────────────────────────
# The original GAN paper (Goodfellow 2014) proved that the optimal GAN
# generator minimises the JSD between the generated distribution
# and the true data distribution.
# When JSD = 0, the generator has perfectly learned the true distribution.
# When JSD = ln(2), the distributions are completely different.
# This is why GAN training is unstable: gradient of JSD vanishes
# when distributions have no overlap (early training).
# Solution: Wasserstein GAN uses Earth Mover Distance instead of JSD.
# scipy implementation (uses log base 2 by default)
jsd_scipy = jensenshannon(P, Q) ** 2 # scipy returns sqrt(JSD)
print(f"
scipy JSD (squared): {jsd_scipy:.6f}")
The ML loss function
Cross-entropy — every detail the tutorials skip
Cross-entropy is the most commonly used loss function in all of deep learning. You've seen it before, but this section gives you the full picture — including the numerical stability issues that cause silent bugs in production, and why PyTorch's implementation differs from the mathematical definition.
The full derivation — from information theory to code
python
import numpy as np
# Cross-entropy H(P, Q) = -sum(P * log(Q))
# Where P = true distribution, Q = predicted distribution
# ── Multi-class classification example ───────────────────────────────
# True label: class 1 (one-hot encoded)
# Predicted probabilities from softmax output
true_label = np.array([0.0, 1.0, 0.0, 0.0]) # class 1
pred_good = np.array([0.05, 0.85, 0.07, 0.03]) # confident and correct
pred_unsure = np.array([0.25, 0.30, 0.25, 0.20]) # uncertain
pred_wrong = np.array([0.80, 0.07, 0.08, 0.05]) # confident and WRONG
def cross_entropy(p_true, q_pred, eps=1e-10):
"""H(P, Q) = -sum(P * log(Q))"""
q_pred = np.clip(q_pred, eps, 1)
return -np.sum(p_true * np.log(q_pred))
print("Cross-entropy losses:")
print(f" Good prediction: {cross_entropy(true_label, pred_good):.4f}") # low
print(f" Uncertain: {cross_entropy(true_label, pred_unsure):.4f}") # medium
print(f" Wrong (confident): {cross_entropy(true_label, pred_wrong):.4f}") # high
# For one-hot labels, cross-entropy simplifies beautifully:
# H(P, Q) = -log(Q[correct_class])
# We only care about the probability assigned to the true class!
correct_class = 1
print(f"
Simplified (one-hot): -log(Q[{correct_class}])")
print(f" Good: -log({pred_good[correct_class]:.2f}) = {-np.log(pred_good[correct_class]):.4f}")
print(f" Unsure: -log({pred_unsure[correct_class]:.2f}) = {-np.log(pred_unsure[correct_class]):.4f}")
print(f" Wrong: -log({pred_wrong[correct_class]:.2f}) = {-np.log(pred_wrong[correct_class]):.4f}")
The numerical stability problem — why PyTorch does it differently
Computing softmax then cross-entropy in two steps causes numerical instability. Softmax involves exponentials that overflow for large logits. Then taking the log undoes the exponential — so why exponentiate in the first place? PyTorch's nn.CrossEntropyLoss combines softmax and log in a single numerically stable operation called "log-softmax." This is one of the most common sources of NaN errors in deep learning when people implement it manually.
python
import numpy as np
# The numerical stability problem
logits = np.array([2.0, 8.0, 1.0, 0.5]) # raw neural network outputs
true_class = 1
# ── Naive approach: softmax then cross-entropy ────────────────────────
def softmax_naive(x):
exp_x = np.exp(x) # OVERFLOW risk for large x!
return exp_x / exp_x.sum()
def cross_entropy_naive(logits, true_class):
probs = softmax_naive(logits)
return -np.log(probs[true_class])
print(f"Naive CE loss: {cross_entropy_naive(logits, true_class):.6f}")
# Now with large logits (common in early training)
logits_large = np.array([200.0, 800.0, 100.0, 50.0])
try:
loss = cross_entropy_naive(logits_large, true_class)
print(f"Large logits (naive): {loss}")
except Exception as e:
print(f"Large logits (naive): OVERFLOW → {e}")
# np.exp(800) = inf → inf/inf = NaN → log(NaN) = NaN
# ── Numerically stable: log-softmax ──────────────────────────────────
def log_softmax_stable(x):
"""log(softmax(x)) computed stably via the log-sum-exp trick"""
c = x.max() # subtract max for numerical stability
return x - c - np.log(np.sum(np.exp(x - c)))
def cross_entropy_stable(logits, true_class):
log_probs = log_softmax_stable(logits)
return -log_probs[true_class]
print(f"Stable CE loss: {cross_entropy_stable(logits, true_class):.6f}")
print(f"Large logits (stable): {cross_entropy_stable(logits_large, true_class):.6f}")
# No overflow! log-sum-exp trick makes it safe.
# ── Why they are equal mathematically ────────────────────────────────
# log(softmax(x_i)) = x_i - log(sum(exp(x_j)))
# = x_i - c - log(sum(exp(x_j - c))) subtract/add c
# The c cancels. Numerically: exp(x_j - c) is safe because x_j - c ≤ 0.
# ── PyTorch equivalent (for reference) ────────────────────────────────
# import torch
# import torch.nn as nn
# loss_fn = nn.CrossEntropyLoss() # takes logits directly, not probabilities!
# logits_t = torch.tensor([[2.0, 8.0, 1.0, 0.5]])
# label_t = torch.tensor([1]) # class index
# loss = loss_fn(logits_t, label_t)
# print(loss) # same result, implemented with log-sum-exp internally
Temperature scaling — controlling prediction confidence
Temperature is a parameter that controls how "sharp" or "soft" the predicted distribution is. It divides the logits before softmax. Temperature < 1 makes predictions more confident (sharper). Temperature > 1 makes predictions more uncertain (flatter). This appears in language model sampling, knowledge distillation, and model calibration.
python
def softmax_with_temperature(logits, temperature=1.0):
"""Softmax with temperature scaling."""
scaled = logits / temperature
# Stable softmax
scaled -= scaled.max()
exp_scaled = np.exp(scaled)
return exp_scaled / exp_scaled.sum()
logits = np.array([3.0, 1.5, 0.5, -0.5])
print("Effect of temperature on predictions:")
print(f"{'T':<8} {'Probs':<50} {'Entropy'}")
print('─' * 70)
for T in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
probs = softmax_with_temperature(logits, temperature=T)
h = entropy_bits(probs)
probs_str = ' '.join([f'{p:.3f}' for p in probs])
print(f"{T:<8.1f} {probs_str:<50} {h:.3f}")
# T=0.1: very sharp → [0.999, 0.001, 0.000, 0.000] → H≈0 (overconfident)
# T=1.0: standard softmax
# T=10 : very flat → [0.262, 0.248, 0.242, 0.248] → H≈2 (underconfident)
# Use cases:
# T < 1: make language model outputs more deterministic (less random)
# T > 1: make language model outputs more creative (more random)
# T → ∞: uniform distribution (completely random)
# T → 0: argmax (completely deterministic — picks the top class only)
#
# In knowledge distillation: teacher model uses high T to produce "soft labels"
# that reveal the model's uncertainty about borderline cases.
# The student model trains on these soft labels — learning more than hard 0/1.
How language models are evaluated
Perplexity — the standard evaluation metric for language models
When you read "GPT-4 achieves a perplexity of 5.4 on this benchmark," what does that mean? Perplexity is a direct application of entropy to language modelling — and once you know information theory, it's completely transparent.
A language model assigns a probability to each possible next word given the context. Perplexity measures how surprised the model is, on average, by each word in a test dataset. Low perplexity means the model predicted the actual words well — it was rarely surprised. High perplexity means it was frequently surprised — bad model.
Perplexity = 2^H(P,Q) = exp(cross-entropy)
Perplexity of K means the model is as confused as if it had to choose uniformly from K equally likely options at each step. Perplexity of 5 means the model is effectively choosing among 5 equally probable words at each position. Lower is always better.
python
import numpy as np
def perplexity(text_probs):
"""
Compute perplexity given the model's assigned probability for each word.
text_probs: list of probabilities P(word_i | context) for each word in the text.
"""
n = len(text_probs)
# Cross-entropy = average negative log probability
cross_ent = -np.mean(np.log(text_probs))
return np.exp(cross_ent)
# Simulating model evaluation on a sample sentence:
# "The delivery arrived on time today"
# Each value: probability model assigned to the actual word
# A well-trained model (high probabilities → low perplexity)
good_model_probs = [0.12, 0.45, 0.60, 0.35, 0.52, 0.48, 0.70]
# A poorly trained model (low probabilities → high perplexity)
poor_model_probs = [0.02, 0.08, 0.12, 0.05, 0.09, 0.07, 0.15]
# A perfect model (always assigns probability 1 to correct word)
perfect_model = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# A random model (vocabulary of 10,000 words, each equally likely)
random_model = [1/10000] * 7
print("Perplexity comparison:")
print(f" Perfect model: {perplexity(perfect_model):.2f}") # 1.0
print(f" Good model: {perplexity(good_model_probs):.2f}") # low
print(f" Poor model: {perplexity(poor_model_probs):.2f}") # high
print(f" Random model: {perplexity(random_model):.2f}") # ~10,000
# Real-world benchmarks (approximate 2024 values):
real_world = {
'GPT-2 (117M)': 29.4,
'GPT-2 (1.5B)': 17.5,
'GPT-3 (175B)': 8.6,
'GPT-4': 4.2, # approximate
}
print("
Real-world language model perplexities (Penn Treebank):")
for model, ppl in real_world.items():
bar = '█' * min(int(ppl), 30)
print(f" {model:<20}: {bar} {ppl}")
# The connection to compression:
# A language model with perplexity K can compress text to ~log2(K) bits/word
# Perplexity 4.2 → ~2.07 bits/word → very efficient compression
# Random (perplexity 10000) → 13.3 bits/word → terrible compression
print(f"
Bits per word for each model:")
for model, ppl in real_world.items():
bits_per_word = np.log2(ppl)
print(f" {model:<20}: {bits_per_word:.2f} bits/word")
Why deep learning works
The information bottleneck — why deep networks compress and generalise
The information bottleneck principle offers a compelling information-theoretic explanation for why deep learning works. Proposed by Tishby and Schwartz-Ziv in 2017, it says: a neural network learns by compressing the input X into a compact representation T that retains only the information relevant to predicting the output Y.
During training, two things happen simultaneously. The network first increases I(T; Y) — it captures more information about the labels from the representation. Then it decreases I(T; X) — it compresses away irrelevant information from the input. The compression phase is what enables generalisation: a representation that ignores irrelevant details of X will work on new, unseen examples.
The information bottleneck objective:
max I(T; Y) − β × I(T; X)
Maximise information about the label (I(T; Y)) while minimising information retained about the raw input (I(T; X)). β controls the tradeoff — higher β = more compression = more generalisation but potentially less accuracy. Lower β = retain more input info = higher accuracy but potential overfitting.
python
import numpy as np
# Demonstrating compression vs accuracy tradeoff
# Simulated: how much information about input X vs label Y
# is retained in the representation T at each layer
np.random.seed(42)
n_layers = 8
# Simulated mutual information values during training
# In practice you'd measure these from activations
# Early training: both I(T;X) and I(T;Y) increase
# Late training: I(T;X) decreases (compression), I(T;Y) stays high
I_T_Y = np.array([0.05, 0.25, 0.55, 0.75, 0.85, 0.87, 0.88, 0.88]) # label info
I_T_X = np.array([5.00, 4.80, 4.50, 4.00, 3.20, 2.50, 2.10, 2.00]) # input info
print("Information plane: training a deep network")
print(f"{'Layer/Step':<15} {'I(T;Y) ↑':<15} {'I(T;X) ↓':<15} {'Phase'}")
print('─' * 55)
for i, (iy, ix) in enumerate(zip(I_T_Y, I_T_X)):
phase = "fitting" if i < 4 else "compressing"
direction = "→" if i < 4 else "←"
print(f"Step {i+1:<10} {iy:<15.2f} {ix:<15.2f} {direction} {phase}")
# Interpretation:
# Steps 1-4: network learns to extract signal from input (I(T;Y) rises)
# both types of information increase
# Steps 5-8: network compresses irrelevant input info (I(T;X) falls)
# only label-relevant information is kept
# THIS is when generalisation improves
# Practical implications:
print("
Practical implications of information bottleneck:")
implications = [
"Dropout forces compression — randomly removing neurons prevents memorisation",
"Data augmentation reduces I(T;X) — model can't rely on specific pixel values",
"Weight decay compresses representations — penalises large weights → simpler T",
"Early stopping catches the compression phase before overfitting",
"Deeper networks = more layers to compress = potentially better generalisation",
]
for imp in implications:
print(f" • {imp}")
Practical applications
Information theory in everyday ML work
Information theory isn't just theoretical — it shows up in practical ML engineering constantly. Here are the five most common situations where knowing this material directly changes how you solve a problem.
1. Choosing the right loss function
MSE for regression (Gaussian noise assumption). Binary cross-entropy for binary classification (Bernoulli). Categorical cross-entropy for multi-class (Categorical). Focal loss for severe class imbalance. The right choice isn't convention — it's choosing the correct distributional model for your target variable.
2. Feature selection with mutual information
sklearn's mutual_info_classif and mutual_info_regression compute I(feature; label) for each feature. Features with MI near zero add no predictive value. This is model-agnostic — it works even if no linear correlation exists. Use it as a first-pass filter before expensive model training.
3. Model evaluation beyond accuracy
Cross-entropy loss is the correct evaluation metric when your model outputs probabilities — not accuracy. Accuracy ignores confidence. A model that says "60% fraud" on every fraud case and "40% fraud" on every legit case has 100% accuracy but terrible cross-entropy. Use cross-entropy when you care about calibration.
4. Detecting data drift
KL divergence and JSD measure whether a new batch of data has the same distribution as training data. High KL between training and production distributions signals distribution shift — your model may have degraded. This is used in every modern ML monitoring system (Evidently, WhyLabs, Arize).
5. Knowledge distillation
Train a small student model to match the output distribution of a large teacher model — not just the hard labels. The student minimises KL(teacher_output || student_output). High-temperature softmax from the teacher reveals uncertainty about borderline cases, providing richer training signal than one-hot labels alone.
💡 Note
The math foundations section (Modules 03–07) is now complete. Vectors and matrices gave you the data structures. Matrix multiplication gave you the core operation. Derivatives and gradients gave you the learning mechanism. Probability gave you the uncertainty framework. Information theory gave you the language for measuring knowledge. The next section — Programming Ecosystem — turns all of this into working Python code using NumPy, Pandas, and sklearn.
What comes next
Math foundations complete. Time to write code.
You have now studied the full mathematical foundation of ML: linear algebra (vectors, matrices, SVD), calculus (derivatives, gradients, chain rule), probability (distributions, Bayes, MLE), and information theory (entropy, KL divergence, mutual information, perplexity). Every algorithm in this track builds on these foundations.
Module 08 starts the Programming Ecosystem section. It covers Python specifically for ML — not syntax basics, but the patterns professional ML engineers actually use: NumPy vectorisation, Pandas for data manipulation, Matplotlib for exploration, and the sklearn interface that all classical ML algorithms share. Every concept from the math section gets implemented in working code.
Next — Module 08 · Programming Ecosystem
Python for Machine Learning
Not Python 101. Python the way ML engineers actually write it — vectorised, readable, and production-ready from day one.
coming soon
🎯 Key Takeaways
✓Self-information I(x) = −log₂ P(x) bits. Certain events carry 0 bits. Rare events carry many bits. The negative log is the unique function that is zero for certain events, decreasing with probability, and additive for independent events.
✓Entropy H(X) = expected self-information = −Σ P(x) log₂ P(x). It measures the average uncertainty across a distribution. Zero for deterministic distributions, maximum (log₂ n) for uniform over n outcomes.
✓Joint entropy H(X,Y) ≤ H(X) + H(Y). Equality holds iff X and Y are independent. Conditional entropy H(Y|X) = H(X,Y) − H(X). Chain rule: H(X,Y) = H(X) + H(Y|X).
✓Mutual information I(X;Y) = H(X) + H(Y) − H(X,Y). It measures shared information between X and Y. Zero iff independent. Equals H(Y) if X perfectly predicts Y. Use mutual_info_classif for model-agnostic feature selection.
✓KL divergence D_KL(P∥Q) is asymmetric. Forward KL is zero-avoiding (Q spreads to cover P). Reverse KL is zero-forcing (Q focuses on one mode). JSD is symmetric, bounded between 0 and ln(2), and always finite — even when distributions do not overlap.
✓Perplexity = exp(cross-entropy). It measures how surprised a language model is per word. Perplexity of K means the model effectively chooses among K equally likely words at each step. Modern LLMs achieve perplexity below 10 on standard benchmarks.
✓The information bottleneck principle: deep networks learn by first fitting the label (maximising I(T;Y)) then compressing the input (minimising I(T;X)). The compression phase is what produces generalisation. Dropout, weight decay, and data augmentation all promote compression.
Share
Discussion
0
Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.