
Transformers and Self-Attention

Queries, keys, values, and why attention is all you need. Build a self-attention layer from scratch, then see how GPT and BERT use it.

45–50 min · March 2026
Before any formula — what problem does attention solve?

LSTMs process tokens one at a time — step 1, then step 2, then step 3. A 512-token sequence takes 512 sequential steps. Self-attention processes every token in relation to every other token simultaneously. That is the entire revolution.

Module 47 showed the LSTM's fundamental limitation: sequential processing. To understand token 512 in a document you must first process tokens 1 through 511. You cannot parallelise across the sequence. Modern GPUs have thousands of cores that sit idle while the LSTM plods forward one step at a time. Training a large LSTM on a billion tokens takes weeks.

Self-attention removes sequential dependency entirely. For each token it asks: which other tokens in this sequence are relevant to understanding me? It computes a relevance score between every pair of tokens simultaneously — all in one matrix multiplication. The entire sequence is processed in parallel. GPT-3 was trained on 300 billion tokens in a few weeks. An LSTM of equivalent capacity would have taken years.

Beyond speed, attention solves the core weakness of LSTMs — long-range dependencies. "The bank on the river bank was steep." An LSTM processing this sentence might forget "river" by the time it reaches the second "bank." Self-attention directly connects "bank" to "river" regardless of distance. Every token directly attends to every other token in a single layer.

🧠 Analogy — read this first

Imagine a meeting with 10 people. An LSTM-style meeting: person 1 speaks, whispers to person 2, person 2 whispers to person 3 — by person 10, the original message is distorted. A self-attention meeting: every person simultaneously reads every other person's written statement and decides how much to pay attention to each one when forming their own response. No information degrades. No sequential bottleneck.

The attention score between two tokens is their relevance — how much should token A look at token B when computing its contextual meaning? "Bank" should look at "river" with high attention weight. "Bank" should look at "steep" with lower weight. These weights are learned during training.

🎯 Pro Tip
This module builds self-attention from scratch in NumPy, then scales to a full Transformer encoder block in PyTorch. Understanding the QKV attention mechanism completely is more valuable than memorising the full Transformer architecture — everything else (multi-head, positional encoding, feedforward) is built on top of this one operation.
The core operation

Scaled dot-product attention — queries, keys, and values

Self-attention projects each token into three vectors: a Query (Q), a Key (K), and a Value (V). Think of it like a library system. The query is your search request. The keys are the index cards for every book. The values are the actual book contents. Attention computes how well your query matches each key, converts those match scores to weights (softmax), and returns a weighted sum of values.

Scaled dot-product attention — the formula in full
Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) × V
Q = X Wq ← queries: (seq_len, d_k)
K = X Wk ← keys: (seq_len, d_k)
V = X Wv ← values: (seq_len, d_v)
Q Kᵀ ← attention scores: (seq_len, seq_len)
/ √dₖ ← scale to prevent softmax saturation
softmax(...) ← attention weights: each row sums to 1
× V ← weighted sum of values: (seq_len, d_v)
Attention weights — 'bank' attends to context words in a sentence
Sentence: "The river bank was steep" — attention weights from token "bank":
The: 0.04 · river: 0.61 · bank: 0.20 · was: 0.06 · steep: 0.09

"bank" attends most strongly to "river" (0.61) — learning the disambiguation. Weights sum to 1.0. These are learned, not hand-crafted.

python
import numpy as np

# ── Self-attention from scratch — every operation visible ─────────────

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (seq_len, d_k)
    K: (seq_len, d_k)
    V: (seq_len, d_v)
    Returns: output (seq_len, d_v), attention_weights (seq_len, seq_len)
    """
    d_k = Q.shape[-1]

    # Step 1: Compute raw attention scores — how much does each Q match each K?
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len)

    # Step 2: Apply mask (optional — for causal/decoder attention)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)

    # Step 3: Softmax — convert scores to weights (each row sums to 1)
    weights = softmax(scores, axis=-1)  # (seq_len, seq_len)

    # Step 4: Weighted sum of values
    output = weights @ V                # (seq_len, d_v)

    return output, weights

# ── Concrete example: 5-token sentence ────────────────────────────────
np.random.seed(42)
seq_len, d_model, d_k = 5, 8, 4

# Token embeddings (in practice these come from an embedding table)
X = np.random.randn(seq_len, d_model)

# Projection matrices — learned during training
Wq = np.random.randn(d_model, d_k) * 0.1
Wk = np.random.randn(d_model, d_k) * 0.1
Wv = np.random.randn(d_model, d_k) * 0.1

# Project to Q, K, V
Q = X @ Wq   # (5, 4)
K = X @ Wk   # (5, 4)
V = X @ Wv   # (5, 4)

output, attn_weights = scaled_dot_product_attention(Q, K, V)

print(f"Input X shape:    {X.shape}")
print(f"Q, K, V shapes:   {Q.shape}")
print(f"Output shape:     {output.shape}")
print(f"Weights shape:    {attn_weights.shape}")
print("\nAttention weights (each row sums to 1.0):")
print(attn_weights.round(3))
print(f"\nRow sums: {attn_weights.sum(axis=1).round(6)}  ← all 1.0")

# ── Why √dₖ scaling matters ───────────────────────────────────────────
print("\nEffect of scaling on softmax:")
raw_scores   = np.array([1.0, 2.0, 3.0, 4.0])
unscaled     = softmax(raw_scores)
scaled_4     = softmax(raw_scores / np.sqrt(4))
scaled_64    = softmax(raw_scores / np.sqrt(64))

print(f"No scaling  (d_k=1):  {unscaled.round(4)}  max={unscaled.max():.4f}")
print(f"Scale √4:             {scaled_4.round(4)}  max={scaled_4.max():.4f}")
print(f"Scale √64:            {scaled_64.round(4)}  max={scaled_64.max():.4f}")
print("Large d_k → large dot products → softmax saturates → vanishing gradients")
print("Dividing by √dₖ keeps variance stable regardless of d_k")
Multiple attention patterns simultaneously

Multi-head attention — h parallel attention heads, concatenated

A single attention head learns one type of relationship between tokens. But a sentence has many simultaneous relationships — syntactic dependencies, coreference, semantic similarity, positional proximity. Multi-head attention runs h attention heads in parallel, each with its own Q, K, V projection matrices. Each head can specialise in a different relationship type. The outputs are concatenated and projected back to d_model.

Multi-head attention — h heads in parallel
For each head i in 1..h:
headᵢ = Attention(Q Wqᵢ, K Wkᵢ, V Wvᵢ)
MultiHead(Q,K,V) = Concat(head₁,...,headₕ) Wo
d_model = 512 (full embedding dimension)
h = 8 (number of heads)
d_k = d_model/h = 64 (dimension per head)
Total parameters = same as one big head — just split differently
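The "same total parameters" claim is easy to verify with quick arithmetic, using the d_model = 512, h = 8 figures from the card above:

```python
# Check: h small heads use exactly the same parameter budget as one big head.
d_model, h = 512, 8
d_k = d_model // h                      # 64 dims per head

per_head   = 3 * d_model * d_k          # Wq + Wk + Wv for one head
multi_head = h * per_head               # all 8 heads together
one_big    = 3 * d_model * d_model      # one head projecting to full d_model

print(multi_head, one_big)              # 786432 786432 — identical
```

Splitting changes what the parameters can express (h independent attention patterns), not how many there are.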
python
import numpy as np
import torch
import torch.nn as nn

# ── Multi-head attention from scratch ─────────────────────────────────
class MultiHeadAttentionScratch:
    def __init__(self, d_model, n_heads):
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k     = d_model // n_heads

        np.random.seed(42)
        scale = np.sqrt(1.0 / d_model)
        # One projection matrix per head per role
        self.Wq = [np.random.randn(d_model, self.d_k) * scale for _ in range(n_heads)]
        self.Wk = [np.random.randn(d_model, self.d_k) * scale for _ in range(n_heads)]
        self.Wv = [np.random.randn(d_model, self.d_k) * scale for _ in range(n_heads)]
        self.Wo = np.random.randn(d_model, d_model) * scale   # output projection

    def softmax(self, x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(self, Q, K, V):
        scores  = Q @ K.T / np.sqrt(self.d_k)
        weights = self.softmax(scores)
        return weights @ V, weights

    def forward(self, X):
        heads = []
        all_weights = []
        for i in range(self.n_heads):
            Q = X @ self.Wq[i]
            K = X @ self.Wk[i]
            V = X @ self.Wv[i]
            head_out, weights = self.attention(Q, K, V)
            heads.append(head_out)
            all_weights.append(weights)

        # Concatenate all heads → (seq_len, d_model)
        concat = np.concatenate(heads, axis=-1)
        # Final linear projection
        output = concat @ self.Wo
        return output, all_weights

np.random.seed(42)
d_model, n_heads, seq_len = 32, 4, 6
X = np.random.randn(seq_len, d_model)

mha    = MultiHeadAttentionScratch(d_model, n_heads)
out, weights = mha.forward(X)

print(f"Multi-head attention (d_model={d_model}, heads={n_heads}):")
print(f"  Input shape:    {X.shape}")
print(f"  Output shape:   {out.shape}  ← same as input")
print(f"  d_k per head:   {d_model // n_heads}")
print(f"  Weight shapes:  {weights[0].shape} × {n_heads} heads")

print("\nAttention weight patterns per head (row 0 = token 0 attending to all):")
for i, w in enumerate(weights):
    print(f"  Head {i+1}: {w[0].round(3)}")
print("Each head learns a different attention pattern")

# ── PyTorch nn.MultiheadAttention ─────────────────────────────────────
mha_pt = nn.MultiheadAttention(
    embed_dim=32,
    num_heads=4,
    dropout=0.0,
    batch_first=True,
)
X_pt   = torch.FloatTensor(X).unsqueeze(0)   # (1, seq, d_model)
out_pt, attn_pt = mha_pt(X_pt, X_pt, X_pt)  # Q=K=V=X for self-attention
print(f"\nPyTorch MultiheadAttention output: {tuple(out_pt.shape)}")
The full building block

Transformer encoder block — attention + feedforward + residual + LayerNorm

A single Transformer encoder block combines four components. Multi-head self-attention computes contextual representations. A position-wise feedforward network applies the same two-layer MLP to each token independently — adding non-linearity and capacity. Residual connections add the input to the output of each sub-layer — preventing vanishing gradients and enabling very deep stacking. Layer Normalisation stabilises training — applied before each sub-layer in the modern "Pre-LN" variant used by GPT.

Transformer encoder block — data flow
Input X (batch, seq, d_model)
LayerNorm (Pre-LN style)
Multi-Head Self-Attention (QKV projection + attention)
↓ + residual
LayerNorm
Feed-Forward Network: Linear(d_model→4d) → GELU → Linear(4d→d_model)
↓ + residual
Output (batch, seq, d_model) ← same shape as input
python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ── Positional encoding — inject position info since attention is order-agnostic ──
class PositionalEncoding(nn.Module):
    """
    Self-attention has no notion of token order — "cat sat mat" and
    "mat sat cat" produce the same attention scores without positional encoding.
    Sinusoidal PE adds a unique position signal to each token embedding.
    """
    def __init__(self, d_model, max_seq_len=512, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Compute sinusoidal encoding once
        pe   = torch.zeros(max_seq_len, d_model)
        pos  = torch.arange(max_seq_len).unsqueeze(1).float()
        div  = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

# ── Single Transformer encoder block ──────────────────────────────────
class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=512, dropout=0.1):
        super().__init__()
        # Self-attention sub-layer
        self.attention  = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True,
        )
        # Feed-forward sub-layer
        self.ff         = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),              # GELU is standard in modern transformers
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        # Layer norms — Pre-LN style (norm before sub-layer, not after)
        self.norm1      = nn.LayerNorm(d_model)
        self.norm2      = nn.LayerNorm(d_model)
        self.dropout    = nn.Dropout(dropout)

    def forward(self, x, src_key_padding_mask=None):
        # ── Self-attention with residual ──────────────────────────────
        # Pre-LN: normalise first, then attend, then add residual
        normed   = self.norm1(x)
        attn_out, _ = self.attention(
            normed, normed, normed,
            key_padding_mask=src_key_padding_mask,
        )
        x = x + self.dropout(attn_out)   # residual connection

        # ── Feed-forward with residual ────────────────────────────────
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x

# ── Stack multiple blocks — a full Transformer encoder ────────────────
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4,
                 n_layers=4, d_ff=512, max_seq=512, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_enc   = PositionalEncoding(d_model, max_seq, dropout)
        self.layers    = nn.ModuleList([
            TransformerEncoderBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        self.norm      = nn.LayerNorm(d_model)

    def forward(self, x, padding_mask=None):
        # x: (batch, seq_len) — token indices
        x = self.embedding(x)    # (batch, seq, d_model)
        x = self.pos_enc(x)
        for layer in self.layers:
            x = layer(x, src_key_padding_mask=padding_mask)
        return self.norm(x)      # (batch, seq, d_model)

# ── Shape check ───────────────────────────────────────────────────────
torch.manual_seed(42)
model    = TransformerEncoder(vocab_size=5000, d_model=128, n_heads=4, n_layers=2)
x_tokens = torch.randint(1, 5000, (8, 32))   # batch=8, seq_len=32
out      = model(x_tokens)

total = sum(p.numel() for p in model.parameters())
print(f"Transformer Encoder (d=128, heads=4, layers=2):")
print(f"  Input:  {tuple(x_tokens.shape)}")
print(f"  Output: {tuple(out.shape)}  ← contextual token representations")
print(f"  Params: {total:,}")
How production LLMs use this

BERT vs GPT — encoder vs decoder, bidirectional vs causal

The original Transformer had both an encoder and a decoder. Modern LLMs use just one half. BERT uses encoder-only — every token can attend to every other token (bidirectional). This makes it excellent for understanding tasks: classification, NER, question answering. GPT uses decoder-only — each token can only attend to previous tokens (causal masking). This makes it excellent for generation: complete this sentence, write this email.

BERT (Encoder-only)
Attention: Bidirectional — every token sees all tokens
Mask: No causal mask
Best for: Understanding: classification, NER, Q&A
Pre-training: Masked Language Modelling — predict masked tokens
Fine-tuning: Add classification head, fine-tune on labelled data
Examples: Sentiment analysis, spam detection, search ranking
GPT (Decoder-only)
Attention: Causal — each token only sees past tokens
Mask: Upper triangular causal mask
Best for: Generation: text completion, chat, code
Pre-training: Next token prediction — predict token t+1 from 1..t
Fine-tuning: RLHF or instruction fine-tuning
Examples: ChatGPT, Claude, Copilot, Gemini
python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ── Causal mask — the key difference between BERT and GPT ─────────────
def make_causal_mask(seq_len):
    """
    Upper triangular mask: token i cannot attend to token j > i.
    PyTorch MultiheadAttention uses True = IGNORE (confusing but correct).
    """
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    return mask

seq = 6
mask = make_causal_mask(seq)
print("Causal mask (True = blocked):")
print(mask.int().numpy())
print("Token 0 can only attend to: token 0")
print("Token 3 can attend to: tokens 0, 1, 2, 3")
print("Token 5 can attend to: all tokens 0-5")

# ── GPT-style decoder block ────────────────────────────────────────────
class GPTBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=512, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True,
        )
        self.ff   = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model),
        )
        self.norm1   = nn.LayerNorm(d_model)
        self.norm2   = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        seq_len   = x.size(1)
        # Causal mask — prevents attending to future tokens
        causal    = make_causal_mask(seq_len).to(x.device)
        normed    = self.norm1(x)
        attn_out, _ = self.attention(normed, normed, normed, attn_mask=causal)
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4,
                 n_layers=4, d_ff=512, max_seq=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb   = nn.Embedding(max_seq, d_model)   # learnable PE (GPT style)
        self.blocks    = nn.ModuleList([
            GPTBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        ])
        self.norm      = nn.LayerNorm(d_model)
        self.lm_head   = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: token embedding and lm_head share weights
        self.lm_head.weight = self.token_emb.weight

    def forward(self, idx):
        # idx: (batch, seq_len) token indices
        B, T     = idx.shape
        tok_emb  = self.token_emb(idx)
        pos_emb  = self.pos_emb(torch.arange(T, device=idx.device))
        x        = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        x        = self.norm(x)
        logits   = self.lm_head(x)   # (B, T, vocab_size)
        return logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0):
        """Autoregressive generation — one token at a time."""
        for _ in range(max_new_tokens):
            logits  = self(idx)[:, -1, :]   # last token's logits
            logits  = logits / temperature
            probs   = F.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            idx     = torch.cat([idx, next_id], dim=1)
        return idx

torch.manual_seed(42)
gpt   = MiniGPT(vocab_size=1000, d_model=64, n_heads=4, n_layers=2)
x     = torch.randint(1, 1000, (2, 16))
logits = gpt(x)
total  = sum(p.numel() for p in gpt.parameters())

print(f"\nMiniGPT (d=64, heads=4, layers=2, vocab=1000):")
print(f"  Input:   {tuple(x.shape)}")
print(f"  Logits:  {tuple(logits.shape)}  ← (batch, seq, vocab)")
print(f"  Params:  {total:,}")

# Generate 5 tokens from a seed
seed     = torch.randint(1, 1000, (1, 4))
generated = gpt.generate(seed, max_new_tokens=5)
print(f"\nGeneration: {seed[0].tolist()} → {generated[0].tolist()}")
Production NLP — the real workflow

Fine-tuning a pretrained Transformer — Razorpay payment dispute classification

In production, nobody trains a Transformer from scratch for NLP tasks. You take a pretrained model (BERT, RoBERTa, DistilBERT) that has already learned language from billions of tokens, add a small task-specific head, and fine-tune on your labelled data. Razorpay classifies payment dispute reasons — fraudulent charge, service not received, wrong amount — from customer-submitted text. A fine-tuned DistilBERT achieves near-human accuracy with 1,000 labelled examples in minutes of fine-tuning.

python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import Dataset, DataLoader
import warnings
warnings.filterwarnings('ignore')

torch.manual_seed(42)
np.random.seed(42)

# ── Simulate Razorpay dispute classification ───────────────────────────
# In production: use HuggingFace transformers
# pip install transformers
# from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Here: demonstrate the fine-tuning pattern with a minimal Transformer
DISPUTE_CLASSES = [
    'fraudulent_charge',
    'service_not_received',
    'wrong_amount',
    'duplicate_charge',
]
N_CLASSES = len(DISPUTE_CLASSES)
VOCAB_SIZE = 500
MAX_LEN    = 32

class DisputeDataset(Dataset):
    def __init__(self, n=800, seed=42):
        np.random.seed(seed)
        self.labels   = np.random.randint(0, N_CLASSES, n)
        # Each class: different vocabulary distribution (simulates real text)
        self.sequences = []
        for label in self.labels:
            seq = np.random.randint(1, VOCAB_SIZE, MAX_LEN)
            # Add class-specific signal to first 5 tokens
            seq[:5] = label * 50 + np.random.randint(1, 50, 5)
            self.sequences.append(seq)

    def __len__(self): return len(self.labels)

    def __getitem__(self, i):
        return (
            torch.LongTensor(self.sequences[i]),
            self.labels[i],
        )

# Different seeds so the validation set does not duplicate training examples
train_ds = DisputeDataset(640, seed=42)
val_ds   = DisputeDataset(160, seed=7)
train_ld = DataLoader(train_ds, batch_size=32, shuffle=True)
val_ld   = DataLoader(val_ds,   batch_size=32)

# ── BERT-style classifier — encoder + [CLS] token classification head ──
class BERTClassifier(nn.Module):
    """
    BERT uses a special [CLS] token prepended to every input.
    The final hidden state of [CLS] is used for classification.
    It aggregates information from the entire sequence via attention.
    """
    def __init__(self, vocab_size=VOCAB_SIZE, d_model=64, n_heads=4,
                 n_layers=2, n_classes=N_CLASSES, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size + 1, d_model, padding_idx=0)
        self.pos_emb   = nn.Embedding(MAX_LEN + 1, d_model)
        self.layers    = nn.ModuleList([
            self._make_block(d_model, n_heads, dropout)
            for _ in range(n_layers)
        ])
        self.norm      = nn.LayerNorm(d_model)
        # Classification head — applied to [CLS] token representation
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, n_classes),
        )
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))

    def _make_block(self, d_model, n_heads, dropout):
        return nn.ModuleDict({
            'attn':  nn.MultiheadAttention(d_model, n_heads,
                         dropout=dropout, batch_first=True),
            'ff':    nn.Sequential(
                         nn.Linear(d_model, d_model*4), nn.GELU(),
                         nn.Dropout(dropout), nn.Linear(d_model*4, d_model)),
            'norm1': nn.LayerNorm(d_model),
            'norm2': nn.LayerNorm(d_model),
            'drop':  nn.Dropout(dropout),
        })

    def forward(self, x):
        B, T  = x.shape
        tok   = self.embedding(x)
        pos   = self.pos_emb(torch.arange(T, device=x.device))
        x_emb = tok + pos

        # Prepend [CLS] token
        cls   = self.cls_token.expand(B, -1, -1)
        x_emb = torch.cat([cls, x_emb], dim=1)   # (B, T+1, d)

        h = x_emb
        for block in self.layers:
            n   = block['norm1'](h)
            a, _ = block['attn'](n, n, n)
            h   = h + block['drop'](a)
            h   = h + block['drop'](block['ff'](block['norm2'](h)))
        h = self.norm(h)

        # Use [CLS] token (position 0) for classification
        cls_repr = h[:, 0, :]
        return self.classifier(cls_repr)

model     = BERTClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

print("Fine-tuning BERT-style classifier on Razorpay disputes:")
print(f"{'Epoch':>6} {'Loss':>10} {'Val acc':>10}")
print("─" * 30)

for epoch in range(1, 21):
    model.train()
    total_loss = 0
    for Xb, yb in train_ld:
        optimizer.zero_grad()
        loss = criterion(model(Xb), yb)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()
    scheduler.step()

    if epoch % 4 == 0:
        model.eval()
        correct = 0
        with torch.no_grad():
            for Xb, yb in val_ld:
                correct += (model(Xb).argmax(1) == yb).sum().item()
        acc = correct / len(val_ds)
        print(f"  {epoch:>4}  {total_loss/len(train_ld):>10.4f}  {acc:>10.4f}")
Errors you will hit

Every common Transformer mistake — explained and fixed

Attention weights are all equal (uniform) — model learns nothing from attention
Why it happens

The Q and K projections are initialised identically or the dot products are all the same value. Happens when all token embeddings are initialised to the same value (e.g. all zeros or all the same random seed), making Q @ K.T a matrix of identical values. Softmax of identical values is uniform — 1/seq_len for every position. The attention mechanism provides no signal.

Fix

Ensure token embeddings are randomly initialised with different values per token. Check nn.Embedding is initialised with the default normal distribution, not zeros. Verify that Wq and Wk are initialised differently — use separate parameter tensors. Add a small amount of noise to break symmetry if needed. Also check that you are not accidentally tying Q and K weights.
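The symptom is easy to reproduce in NumPy. A minimal sketch (toy sizes, not taken from the module's code) showing identical embeddings collapsing attention to uniform weights, and random embeddings restoring signal:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
Wq = np.random.randn(8, 4) * 0.1
Wk = np.random.randn(8, 4) * 0.1

# Symptom: every token embedded identically → identical scores → uniform softmax
X_bad  = np.ones((4, 8))
scores = (X_bad @ Wq) @ (X_bad @ Wk).T / np.sqrt(4)
print(softmax(scores)[0])    # [0.25 0.25 0.25 0.25] — attention carries no signal

# Fix: distinct random embeddings break the symmetry
X_good  = np.random.randn(4, 8)
scores2 = (X_good @ Wq) @ (X_good @ Wk).T / np.sqrt(4)
print(softmax(scores2)[0])   # non-uniform weights
```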

RuntimeError: The shape of the attn_mask is wrong — expected (seq, seq) got (batch, seq, seq)
Why it happens

PyTorch's nn.MultiheadAttention expects attn_mask of shape (seq_len, seq_len) for the causal mask, not (batch, seq_len, seq_len). The mask is broadcast across the batch. Passing a batch-dimension mask causes a shape mismatch. Also: the key_padding_mask (for padding tokens) is (batch, seq_len) — different from attn_mask and used differently.

Fix

For causal masking pass attn_mask of shape (seq_len, seq_len): causal = torch.triu(torch.ones(T, T), diagonal=1).bool(). For padding masking pass key_padding_mask of shape (batch, seq_len) with True where tokens are padding. Never pass a 3D tensor to attn_mask — it will be interpreted as a per-head mask which requires shape (batch×heads, seq, seq).
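A shape-correct sketch of both mask types together (toy dimensions, assumed for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)
x = torch.randn(3, 5, 16)                    # (batch=3, seq=5, d_model=16)

# attn_mask: (seq, seq), True = blocked — broadcast across the batch
causal = torch.triu(torch.ones(5, 5), diagonal=1).bool()

# key_padding_mask: (batch, seq), True = this key position is padding
padding = torch.zeros(3, 5, dtype=torch.bool)
padding[:, -1] = True                        # pretend the last token is padding

out, weights = mha(x, x, x, attn_mask=causal, key_padding_mask=padding)
print(out.shape, weights.shape)  # torch.Size([3, 5, 16]) torch.Size([3, 5, 5])
```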

Loss is stuck at log(vocab_size) — model predicts uniform distribution over vocabulary
Why it happens

Weight tying between the embedding and the lm_head (output projection) is initialised in a way that causes the logits to be all-zero, producing uniform softmax. Also happens when learning rate is too high — the Transformer diverges immediately and collapses to predicting the uniform distribution. Or positional encodings are missing — without position information, every token looks identical and the model cannot learn sequential patterns.

Fix

Use a warmup learning rate schedule — start at 0 and linearly increase to target lr over the first 100–1000 steps. Verify positional encodings are being added: print (x + pos_enc(x)).std() — should be larger than x.std() alone. Check weight tying is correct: lm_head.weight = token_embedding.weight (share the same tensor). Reduce learning rate to 1e-4 or lower.
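A minimal linear-warmup sketch with LambdaLR (the 100-step warmup and 1e-4 target are assumed illustration values, not universal settings):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(8, 8)                    # stand-in for a Transformer
opt   = optim.AdamW(model.parameters(), lr=1e-4)
warmup_steps = 100

# lr = target_lr × min(1, step/warmup): ramps linearly from ~0 up to 1e-4
sched = optim.lr_scheduler.LambdaLR(
    opt, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

lrs = []
for step in range(120):
    opt.step()                             # (after loss.backward() in real training)
    sched.step()
    lrs.append(opt.param_groups[0]['lr'])

print(f"first step lr: {lrs[0]:.1e}")      # tiny — gentle start
print(f"step 100 lr:   {lrs[99]:.1e}")     # 1.0e-04 — full target reached
```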

CUDA out of memory with long sequences — quadratic memory in seq_len
Why it happens

Self-attention computes a (batch, heads, seq_len, seq_len) attention matrix. Memory scales as O(seq_len²). For seq_len=4096 with batch=8 and 8 heads in float32: 8 × 8 × 4096 × 4096 × 4 bytes = 4GB just for the attention matrix — before activations, weights, or gradients.

Fix

Reduce batch size or sequence length. Use gradient checkpointing: torch.utils.checkpoint.checkpoint(block, x) — recomputes activations during backward instead of storing them, trading compute for memory. Use Flash Attention (pip install flash-attn) — an exact attention algorithm that is O(seq_len) in memory by tiling. For inference only, use torch.backends.cuda.enable_flash_sdp(True) in PyTorch 2.0+.
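The memory arithmetic from the paragraph above, as a quick sketch:

```python
# O(seq_len²) attention memory: a float32 score matrix of shape (B, H, T, T)
def attn_matrix_gb(batch, heads, seq_len, bytes_per_el=4):
    return batch * heads * seq_len * seq_len * bytes_per_el / 2**30

for seq in (512, 1024, 2048, 4096, 8192):
    print(f"seq_len={seq:>5}: {attn_matrix_gb(8, 8, seq):6.2f} GB")
# seq_len=4096 with batch=8, heads=8 gives 4.00 GB — doubling seq_len quadruples it
```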

What comes next

The Deep Learning section is complete. Section 8 — NLP — begins next.

You have now completed the full Deep Learning section: neural networks from scratch, backpropagation, activation and loss functions, optimisers, batch normalisation and dropout, CNNs, RNNs and LSTMs, and Transformers. You can build, train, and debug any standard deep learning architecture from first principles.

Section 8 — NLP — goes deeper into language-specific techniques: tokenisation, embeddings, fine-tuning large pretrained models with HuggingFace, retrieval-augmented generation, and building production NLP pipelines. Everything builds on the Transformer architecture you just learned.

Next — Section 8 · NLP
Tokenisation and Word Embeddings

BPE, WordPiece, SentencePiece — how text becomes numbers. Word2Vec, GloVe, and contextual embeddings from BERT.

coming soon

🎯 Key Takeaways

  • Self-attention processes every token in relation to every other token simultaneously — no sequential bottleneck. For each token it computes Q (what am I looking for?), K (what do I contain?), and V (what do I provide?). Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. The result is a weighted sum of values where weights reflect token relevance.
  • Scaling by √dₖ is essential. Without it, large d_k produces large dot products that push softmax into saturation — attention weights become one-hot and gradients vanish. Dividing by √dₖ keeps variance stable regardless of d_k.
  • Multi-head attention runs h attention heads in parallel, each with separate Wq, Wk, Wv projections. Each head specialises in a different relationship type — syntactic, semantic, positional. Outputs are concatenated and projected back to d_model. Total parameters are the same as one large head.
  • A Transformer encoder block: LayerNorm → Multi-head self-attention → residual → LayerNorm → Feed-forward (Linear→GELU→Linear) → residual. Residual connections allow gradients to flow through very deep stacks. Pre-LN (normalise before sub-layer) is more stable than the original Post-LN.
  • BERT (encoder-only): bidirectional attention, pretrained with masked language modelling, fine-tuned for understanding tasks. GPT (decoder-only): causal attention mask prevents attending to future tokens, pretrained with next-token prediction, used for generation. The causal mask is the only architectural difference.
  • In production never train a Transformer from scratch for NLP. Use HuggingFace pretrained models (DistilBERT, RoBERTa, LLaMA). Add a task-specific head, fine-tune with AdamW at lr=2e-5, warmup for 6% of steps. Self-attention memory scales as O(seq_len²) — use gradient checkpointing or Flash Attention for long sequences.