Transformers and Self-Attention
Queries, keys, values, and why attention is all you need. Build a self-attention layer from scratch, then see how GPT and BERT use it.
LSTMs process tokens one at a time — step 1, then step 2, then step 3. A 512-token sequence takes 512 sequential steps. Self-attention processes every token in relation to every other token simultaneously. That is the entire revolution.
Module 47 showed the LSTM's fundamental limitation: sequential processing. To understand token 512 in a document you must first process tokens 1 through 511. You cannot parallelise across the sequence. Modern GPUs have thousands of cores that sit idle while the LSTM plods forward one step at a time. Training a large LSTM on a billion tokens takes weeks.
Self-attention removes sequential dependency entirely. For each token it asks: which other tokens in this sequence are relevant to understanding me? It computes a relevance score between every pair of tokens simultaneously — all in one matrix multiplication. The entire sequence is processed in parallel. GPT-3 was trained on 300 billion tokens in a few weeks. An LSTM of equivalent capacity would have taken years.
Beyond speed, attention solves the core weakness of LSTMs — long-range dependencies. "The bank on the river bank was steep." An LSTM processing this sentence might forget "river" by the time it reaches the second "bank." Self-attention directly connects "bank" to "river" regardless of distance. Every token directly attends to every other token in a single layer.
Imagine a meeting with 10 people. An LSTM-style meeting: person 1 speaks, whispers to person 2, person 2 whispers to person 3 — by person 10, the original message is distorted. A self-attention meeting: every person simultaneously reads every other person's written statement and decides how much to pay attention to each one when forming their own response. No information degrades. No sequential bottleneck.
The attention score between two tokens is their relevance — how much should token A look at token B when computing its contextual meaning? "Bank" should look at "river" with high attention weight. "Bank" should look at "steep" with lower weight. These weights are learned during training.
Scaled dot-product attention — queries, keys, and values
Self-attention projects each token into three vectors: a Query (Q), a Key (K), and a Value (V). Think of it like a library system. The query is your search request. The keys are the index cards for every book. The values are the actual book contents. Attention computes how well your query matches each key, converts those match scores to weights (softmax), and returns a weighted sum of values.
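The library analogy maps directly to a few lines of NumPy. This is a minimal sketch for a single sequence; a real layer learns separate projection matrices Wq, Wk, Wv to produce Q, K, V from the input, which are omitted here so the mechanics stay visible:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every pairwise relevance in one matmul
    weights = softmax(scores, axis=-1)   # each row of weights sums to 1
    return weights @ V, weights          # weighted sum of values, plus the weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
x = rng.normal(size=(seq_len, d_k))
# illustration only: Q = K = V = x (real layers apply learned projections first)
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```

Row i of `w` is token i's attention distribution over the whole sequence: the "how much should I look at each key" weights from the library analogy.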
Multi-head attention — h parallel attention heads, concatenated
A single attention head learns one type of relationship between tokens. But a sentence has many simultaneous relationships — syntactic dependencies, coreference, semantic similarity, positional proximity. Multi-head attention runs h attention heads in parallel, each with its own Q, K, V projection matrices. Each head can specialise in a different relationship type. The outputs are concatenated and projected back to d_model.
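The split-attend-concatenate-project pipeline can be sketched as follows (NumPy, single sequence; the projection matrices here are random stand-ins for learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """x: (seq_len, d_model); h heads of size d_model // h each."""
    seq_len, d_model = x.shape
    d_head = d_model // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                       # (seq_len, d_model)
    # split the model dimension into h heads -> (h, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, h, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # attention per head
    heads = softmax(scores) @ Vh                           # (h, seq_len, d_head)
    # concatenate the heads and project back to d_model
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (5, 16)
```

Note the parameter count: four d_model × d_model matrices regardless of h, which is why adding heads costs nothing extra — each head just works in a d_model/h slice.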
Transformer encoder block — attention + feedforward + residual + LayerNorm
A single Transformer encoder block combines four components. Multi-head self-attention computes contextual representations. A position-wise feedforward network applies the same two-layer MLP to each token independently — adding non-linearity and capacity. Residual connections add the input to the output of each sub-layer — preventing vanishing gradients and enabling very deep stacking. Layer Normalisation stabilises training — applied before each sub-layer in the modern "Pre-LN" variant used by GPT.
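The four components compose as below. This is a deliberately stripped-down sketch: Q/K/V projections and LayerNorm's learnable scale/shift are omitted, and the FFN uses ReLU where GPT uses GELU:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalise each token's features to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x):
    # projections omitted for brevity: Q = K = V = x
    return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x

def encoder_block(x, W1, b1, W2, b2):
    # Pre-LN ordering: normalise -> sub-layer -> residual add
    x = x + self_attention(layer_norm(x))        # attention sub-layer + residual
    h = np.maximum(0, layer_norm(x) @ W1 + b1)   # position-wise FFN (per-token MLP)
    return x + h @ W2 + b2                       # second residual connection

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
out = encoder_block(x, W1, b1, W2, b2)
print(out.shape)  # (6, 8)
```

Because each sub-layer output is *added* to its input, the block's input and output shapes must match — which is exactly what lets you stack dozens of identical blocks.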
BERT vs GPT — encoder vs decoder, bidirectional vs causal
The original Transformer had both an encoder and a decoder. Modern LLMs use just one half. BERT uses encoder-only — every token can attend to every other token (bidirectional). This makes it excellent for understanding tasks: classification, NER, question answering. GPT uses decoder-only — each token can only attend to previous tokens (causal masking). This makes it excellent for generation: complete this sentence, write this email.
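The causal mask is mechanically simple: before the softmax, set every score where a token would look at a future position to -inf, so those positions receive exactly zero weight. A small NumPy demonstration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
# causal mask: token i may attend only to tokens j <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf        # -inf becomes weight 0 after softmax
weights = softmax(scores)
print(np.round(weights, 2))
# row 0 attends only to token 0; each later row only to its prefix
```

Delete the two mask lines and you have BERT-style bidirectional attention; keep them and you have GPT-style causal attention. That is the entire architectural difference.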
Fine-tuning a pretrained Transformer — Razorpay payment dispute classification
In production, nobody trains a Transformer from scratch for NLP tasks. You take a pretrained model (BERT, RoBERTa, DistilBERT) that has already learned language from billions of tokens, add a small task-specific head, and fine-tune on your labelled data. Razorpay classifies payment dispute reasons — fraudulent charge, service not received, wrong amount — from customer-submitted text. A fine-tuned DistilBERT achieves near-human accuracy with 1,000 labelled examples in minutes of fine-tuning.
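In practice this would be a few lines of HuggingFace `transformers` (load a pretrained DistilBERT, add a sequence-classification head, fine-tune). As a library-free sketch of the core idea only, here synthetic random vectors stand in for frozen pretrained sentence embeddings, and the labels and class count are invented for illustration — only a small linear head is trained:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_classes = 1000, 32, 3   # ~1,000 labelled disputes, 3 reason classes

# stand-ins for frozen pretrained embeddings of the dispute texts
X = rng.normal(size=(n, d))
true_W = rng.normal(size=(d, n_classes))
y = (X @ true_W).argmax(-1)     # synthetic labels (hypothetical data)

# task-specific head: one linear layer trained with softmax cross-entropy
W, b = np.zeros((d, n_classes)), np.zeros(n_classes)
for _ in range(200):
    logits = X @ W + b
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    p[np.arange(n), y] -= 1          # gradient of cross-entropy w.r.t. logits
    W -= 0.1 * X.T @ p / n           # full-batch gradient descent step
    b -= 0.1 * p.mean(0)

acc = ((X @ W + b).argmax(-1) == y).mean()
print(f"head accuracy: {acc:.2f}")
```

The point of the sketch: with good pretrained features, the task-specific part is tiny and trains in seconds — which is why a thousand labelled examples can be enough.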
Every common Transformer mistake — explained and fixed
The Deep Learning section is complete. Section 8 — NLP — begins next.
You have now completed the full Deep Learning section: neural networks from scratch, backpropagation, activation and loss functions, optimisers, batch normalisation and dropout, CNNs, RNNs and LSTMs, and Transformers. You can build, train, and debug any standard deep learning architecture from first principles.
Section 8 — NLP — goes deeper into language-specific techniques: tokenisation, embeddings, fine-tuning large pretrained models with HuggingFace, retrieval-augmented generation, and building production NLP pipelines. Everything builds on the Transformer architecture you just learned.
BPE, WordPiece, SentencePiece — how text becomes numbers. Word2Vec, GloVe, and contextual embeddings from BERT.
🎯 Key Takeaways
- ✓Self-attention processes every token in relation to every other token simultaneously — no sequential bottleneck. For each token it computes Q (what am I looking for?), K (what do I contain?), and V (what do I provide?). Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. The result is a weighted sum of values where weights reflect token relevance.
- ✓Scaling by √dₖ is essential. Without it, large d_k produces large dot products that push softmax into saturation — attention weights become one-hot and gradients vanish. Dividing by √dₖ keeps variance stable regardless of d_k.
- ✓Multi-head attention runs h attention heads in parallel, each with separate Wq, Wk, Wv projections. Each head specialises in a different relationship type — syntactic, semantic, positional. Outputs are concatenated and projected back to d_model. Total parameters are the same as one large head.
- ✓A Transformer encoder block: LayerNorm → Multi-head self-attention → residual → LayerNorm → Feed-forward (Linear→GELU→Linear) → residual. Residual connections allow gradients to flow through very deep stacks. Pre-LN (normalise before sub-layer) is more stable than the original Post-LN.
- ✓BERT (encoder-only): bidirectional attention, pretrained with masked language modelling, fine-tuned for understanding tasks. GPT (decoder-only): causal attention mask prevents attending to future tokens, pretrained with next-token prediction, used for generation. The causal mask is the only architectural difference.
- ✓In production, never train a Transformer from scratch for NLP. Use HuggingFace pretrained models (DistilBERT, RoBERTa, LLaMA): add a task-specific head and fine-tune with AdamW at lr=2e-5, warming up for ~6% of steps. Self-attention memory scales as O(seq_len²) — use gradient checkpointing or Flash Attention for long sequences.