LLMs — Pretraining, RLHF, and Scaling Laws
How GPT, Claude, and Gemini are built. Next-token prediction at scale, RLHF alignment, DPO, instruction tuning, and the laws that predict capability from compute.
An LLM is a Transformer trained on hundreds of billions of tokens to predict the next token. That one objective — predict what comes next — turns out to be sufficient to learn reasoning, coding, translation, and a remarkably broad range of other language tasks.
Module 48 covered the Transformer architecture — attention, positional encoding, encoder-decoder. LLMs use only the decoder half (BERT, by contrast, uses only the encoder). GPT, LLaMA, Mistral, and Gemini are all decoder-only Transformers. The key difference from what you built in Module 48: scale. GPT-3 has 175 billion parameters trained on 300 billion tokens using thousands of A100 GPUs over months. LLaMA-3-70B has 70 billion parameters trained on 15 trillion tokens. Scale changes everything — capabilities emerge that were completely absent at smaller scales and were never explicitly trained for.
But pretraining alone produces a model that completes text in the style of its training data — helpful for some tasks, dangerous for others. A pretrained GPT asked "how do I make a bomb?" will helpfully complete the sentence if such text appeared in its training data. Alignment — the process of making LLMs helpful, harmless, and honest — requires three additional stages: supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and, increasingly, direct preference optimisation (DPO).
Pretraining is like a person reading every book, article, and website ever written. They become extraordinarily knowledgeable about language and the world. But they have no manners, no values, and no sense of what is helpful vs harmful — they just know what typically follows what in text. Alignment is giving them social training, teaching them to be genuinely helpful, and giving them the judgment to refuse harmful requests.
The insight from OpenAI's InstructGPT paper (2022): a 1.3B parameter model fine-tuned with RLHF was preferred by human raters over a raw 175B GPT-3. Alignment is more important than raw scale for user-facing applications.
Next-token prediction at scale — the only pretraining objective
The pretraining objective is next-token prediction (causal language modelling). Given a sequence of tokens [t₁, t₂, …, t_n], the model predicts t_{i+1} given [t₁, …, t_i] for every position simultaneously. The loss is cross-entropy averaged over all token predictions. This objective is self-supervised — no human labels required. Any text on the internet is valid training data.
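The objective above can be sketched in a few lines of NumPy. The logits here are random stand-ins for the output of a decoder-only Transformer; everything else (the shift by one position, the cross-entropy average over all positions) is exactly the pretraining loss described above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 50, 6
tokens = rng.integers(0, vocab_size, seq_len)

# Stand-in for decoder output: per-position logits over the vocabulary.
logits = rng.normal(size=(seq_len, vocab_size))

# Shift: position i predicts token i+1, so the final position has no target.
shifted_logits = logits[:-1]
targets = tokens[1:]

# Numerically stable log-softmax, then cross-entropy averaged over
# every predicted position -- the entire pretraining objective.
shifted_logits = shifted_logits - shifted_logits.max(axis=1, keepdims=True)
log_probs = shifted_logits - np.log(np.exp(shifted_logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(len(targets)), targets].mean()
```

With random logits the loss sits near ln(vocab_size); training drives it down by making the correct next token more probable at every position.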
How much compute, how many parameters, how much data — the laws that answer all three
Kaplan et al. (2020) discovered that LLM loss follows power laws with respect to model size N, dataset size D, and compute budget C. These scaling laws make LLM development predictable — you can forecast the loss of a model before training it. Hoffmann et al. (2022) refined these laws with Chinchilla: for a fixed compute budget, the optimal strategy is to train a smaller model on more tokens, not a larger model on fewer tokens. The rule: N_opt ≈ D_opt / 20 — use 20 tokens per parameter.
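The Chinchilla rule can be turned into a small calculator. This sketch uses the common approximation C ≈ 6·N·D training FLOPs; substituting D = 20·N gives C ≈ 120·N², hence N_opt = √(C/120). The function name is my own, and the GPT-3 compute figure (~3.14e23 FLOPs) is the commonly cited estimate.

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal (params, tokens) under the Chinchilla rule.

    Assumes C ~= 6*N*D FLOPs and the 20-tokens-per-parameter rule,
    so C ~= 120*N^2 and N_opt = sqrt(C/120), D_opt = 20*N_opt."""
    n_opt = math.sqrt(compute_flops / 120)
    d_opt = 20 * n_opt
    return n_opt, d_opt

# Roughly the GPT-3 training budget (~3.14e23 FLOPs, a common estimate):
n, d = chinchilla_optimal(3.14e23)
print(f"optimal params: {n/1e9:.0f}B, optimal tokens: {d/1e9:.0f}B")
```

For the GPT-3 budget this suggests a ~50B-parameter model on ~1T tokens — far more data per parameter than GPT-3's actual 175B/300B split, which is why GPT-3 counts as undertrained under this law.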
SFT then RLHF — turning a text predictor into a helpful assistant
After pretraining, the model completes text but does not follow instructions. Supervised Fine-Tuning (SFT) is the first alignment step: fine-tune the pretrained model on a dataset of high-quality (prompt, response) pairs written or curated by humans. Typically 10,000–100,000 examples. This teaches the model to respond to instructions rather than just complete text. But SFT only teaches the model to imitate — it cannot teach the nuanced human preferences about what makes a response helpful, honest, and harmless.
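The one mechanical detail worth seeing in code is the loss mask: SFT uses the same next-token cross-entropy as pretraining, but computes it only over response tokens, so the model is trained to produce answers rather than to reproduce prompts. A minimal sketch, with made-up token ids and random logits standing in for a real model:

```python
import numpy as np

# Hypothetical tokenised SFT example: prompt tokens then response tokens.
prompt = [5, 12, 7]        # made-up ids for the instruction
response = [9, 3, 41, 2]   # made-up ids for the target answer
tokens = np.array(prompt + response)

# Loss mask: 0 over the prompt, 1 over the response.
mask = np.array([0] * len(prompt) + [1] * len(response))

vocab = 64
rng = np.random.default_rng(1)
logits = rng.normal(size=(len(tokens), vocab))  # stand-in model output

# Shifted next-token cross-entropy, averaged over response positions only.
logits = logits - logits.max(axis=1, keepdims=True)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
token_loss = -log_probs[np.arange(len(tokens) - 1), tokens[1:]]
sft_loss = (token_loss * mask[1:]).sum() / mask[1:].sum()
```

Note the shift interacts with the mask: the last prompt position does contribute loss, because its target is the first response token.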
RLHF (Reinforcement Learning from Human Feedback) goes further. Humans compare pairs of model responses and indicate which is better. A reward model is trained to predict human preference scores. The LLM is then fine-tuned with PPO (Proximal Policy Optimisation) to maximise the reward model's score. This is how ChatGPT, Claude, and Gemini are aligned — RLHF is what makes them feel like helpful assistants rather than text completion engines.
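The reward model at the heart of RLHF is trained with a Bradley-Terry pairwise loss: given scalar rewards for a chosen and a rejected response to the same prompt, minimise the negative log-probability that the chosen one scores higher. A sketch (function name mine; the scalar rewards are hypothetical reward-model outputs):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Pushes the reward model to score the human-preferred response above
    the rejected one. r_* are scalar reward-model outputs."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# If the reward model already ranks the pair correctly, the loss is small;
# if the ranking is inverted, the loss is large.
loss_correct = reward_model_loss(2.0, -1.0)
loss_inverted = reward_model_loss(-1.0, 2.0)
```

The PPO stage then maximises this learned reward minus a KL penalty that keeps the policy close to the SFT reference, preventing it from drifting into degenerate reward-hacking outputs.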
DPO — Direct Preference Optimisation — RLHF without the RL
RLHF requires training three models simultaneously — the LLM policy, the reward model, and the reference policy — and running PPO, a notoriously finicky RL algorithm. The engineering complexity is enormous. DPO (Rafailov et al., 2023) derives a closed-form loss that achieves the same objective as RLHF without training a reward model or running RL. The insight: the optimal RLHF policy has a closed form that can be directly optimised with a simple binary cross-entropy loss on preference pairs. Most open-source models (LLaMA, Mistral, Phi) are now aligned with DPO rather than RLHF because it is far simpler.
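The DPO loss itself is short enough to write out. It is binary cross-entropy on an implicit reward margin: β times the difference between the policy's and the reference model's log-ratios of chosen vs rejected responses. A sketch (function name mine; the inputs are summed log-probabilities of whole responses under each model):

```python
import numpy as np

def dpo_loss(logp_pol_chosen, logp_pol_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO loss (Rafailov et al., 2023) for one preference pair.

    margin = beta * [(policy log-ratio) - (reference log-ratio)];
    the loss is -log sigmoid(margin), plain binary cross-entropy."""
    margin = beta * ((logp_pol_chosen - logp_ref_chosen)
                     - (logp_pol_rejected - logp_ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# A policy that already prefers the chosen response gets a lower loss
# than one that prefers the rejected response.
loss_good = dpo_loss(-10.0, -20.0, -15.0, -15.0)
loss_bad = dpo_loss(-20.0, -10.0, -15.0, -15.0)
```

No reward model, no PPO rollouts: two forward passes (policy and frozen reference) per preference pair, then an ordinary gradient step.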
Temperature, sampling strategies, and quantisation for deployment
Every common LLM mistake — explained and fixed
You understand how LLMs are built and aligned. Next: fine-tune one yourself for a specific task.
Module 64 covered the architecture and training pipeline of LLMs at a conceptual and code level. Module 65 makes it practical: full LoRA fine-tuning walkthrough on a real dataset using HuggingFace Transformers and PEFT, including when to fine-tune vs use RAG vs prompt engineer, and how to evaluate the result.
🎯 Key Takeaways
- ✓ LLMs are decoder-only Transformers trained with next-token prediction (causal language modelling). The loss is cross-entropy over every token position. The causal mask ensures position i only attends to positions ≤ i. Weight tying shares the input embedding and output projection matrices — saving parameters.
- ✓ Chinchilla scaling law: for a fixed compute budget C, the optimal model has N_opt ≈ √(C/120) parameters trained on D_opt ≈ 20×N_opt tokens. GPT-3 was undertrained by this law. LLaMA-3-8B is intentionally over-trained (15T tokens on 8B params) to produce a small model with high inference efficiency.
- ✓ Three-stage alignment pipeline: pretraining (next-token prediction on trillions of tokens, 99% of compute), SFT (fine-tune on 10k–100k prompt-response pairs, compute loss on response tokens only), RLHF or DPO (align to human preferences using comparison data).
- ✓ RLHF requires training a reward model on human preference pairs then using PPO to maximise expected reward minus KL penalty from the SFT reference. DPO achieves the same objective with a closed-form loss directly on preference pairs — no reward model, no RL. DPO is now the standard for open-source alignment.
- ✓ Sampling strategies: greedy (deterministic, repetitive), temperature (scale logits — lower = conservative, higher = creative), top-k (only consider k most likely tokens), top-p/nucleus (keep smallest set summing to probability p). Production default: top-p=0.9 + temperature=0.7.
- ✓ Quantisation makes large models deployable: fp16 halves memory vs fp32 with identical quality. int8 (bitsandbytes) halves again with <0.5% degradation. int4 (GPTQ/AWQ) halves again with 1-2% degradation — a 70B model fits in 35GB VRAM. For CPU inference: llama.cpp with GGUF format runs 7B models on laptops.
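The sampling strategies in the takeaways combine naturally into one decoding step: scale logits by temperature, then restrict to the nucleus of tokens whose cumulative probability reaches top-p. A minimal sketch (function name mine) of the top-p=0.9, temperature=0.7 production default:

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=None):
    """Temperature + nucleus (top-p) sampling for one decoding step."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature: scale logits before softmax. Lower = more conservative.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    # Nucleus: keep the smallest set of tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    # Renormalise over the kept tokens and sample.
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

token_id = sample_next_token(np.array([2.0, 1.0, 0.1, -3.0]))
```

With temperature near 0 or a small top-p this degenerates to greedy decoding; raising either widens the set of tokens the model will actually emit.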