BERT and the Encoder-Only Family
Masked language modelling, next sentence prediction, fine-tuning on downstream tasks. The model that changed NLP — still powering classification and NER.
GPT reads left to right. BERT reads the entire sentence at once — forward and backward simultaneously. That one change made it the best model for understanding tasks for three years running.
Before BERT (2018), language models were unidirectional. GPT reads token 1, then token 2, then token 3 — each token only sees what came before it. This is necessary for generation (you cannot read the future when writing) but it is a handicap for understanding. To classify whether a sentence is positive or negative, every word should inform the meaning of every other word — bidirectionally.
BERT (Bidirectional Encoder Representations from Transformers) uses a Transformer encoder — the left half of the original Transformer architecture from Module 48. Every token attends to every other token with no causal mask. To pretrain this bidirectional model without the ability to simply predict the next token (which would leak the answer), BERT uses two novel pretraining objectives: Masked Language Modelling and Next Sentence Prediction.
The result: BERT representations capture deep bidirectional context. Fine-tune BERT on 1,000 labelled examples and it outperforms models trained from scratch on 100,000. Flipkart's review classifier, Swiggy's complaint tagger, Razorpay's intent detector — all fine-tuned BERT variants.
Reading comprehension in school: you read the full passage, then answer questions about it. You read forwards and backwards, checking context in both directions. A student who only reads left to right and never re-reads misses nuance. BERT is the student who reads the full passage before answering. GPT is the student writing an essay — they cannot read what they have not written yet.
This is why BERT dominates understanding tasks (classification, NER, Q&A) while GPT dominates generation tasks (completion, summarisation, chat). Same Transformer architecture, different direction of attention, completely different use cases.
Masked Language Modelling and Next Sentence Prediction — the two pretraining tasks
BERT cannot use next-token prediction as its pretraining objective — with bidirectional attention the model could simply look ahead and read the answer, and restoring a causal mask would make it unidirectional again. Instead it uses two self-supervised objectives that can be computed from raw unlabelled text with no human annotation. Masked Language Modelling (MLM) randomly masks 15% of input tokens and trains the model to predict them from the surrounding context on both sides. Next Sentence Prediction (NSP) feeds the model two sentences and trains it to classify whether sentence B actually followed sentence A in the source text or was sampled at random.
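The MLM corruption step can be sketched without any libraries. This is a minimal illustration (function name and seed are mine, not from any library); the 80/10/10 split — 80% of selected tokens become [MASK], 10% become a random token, 10% stay unchanged — is the recipe from the original BERT paper, and -100 is the conventional "ignore this position in the loss" label:

```python
import random

def mask_for_mlm(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for masked language modelling.

    Select ~15% of positions; of those, 80% become [MASK], 10% a random
    token, 10% stay unchanged. Labels hold the original id at selected
    positions and -100 everywhere else (ignored by the loss).
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                            # model must recover the original
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```

The 10% unchanged / 10% random cases exist because [MASK] never appears at fine-tuning time — the model must learn useful representations for real tokens too, not only for the mask symbol.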
Note: Later research (RoBERTa, 2019) showed NSP does not help and may hurt. RoBERTa removed it entirely and achieved better results training with MLM only on more data for longer.
BERT's three embeddings — token, segment, and position
BERT's input is the sum of three embedding types. The token embedding is the standard lookup table for each WordPiece token. The segment embedding distinguishes sentence A (all zeros) from sentence B (all ones) — needed for the NSP task and any two-sentence input like Q&A. The position embedding is learned (unlike the original Transformer's sinusoidal encoding) — one vector per position 0 to 511.
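The three-way sum can be shown with a toy NumPy sketch. Dimensions are shrunk for illustration (real BERT-base: ~30k-token vocabulary, 768 dimensions, 512 positions), and the token/segment ids below are made up:

```python
import numpy as np

vocab_size, max_len, dim = 100, 16, 8
rng = np.random.default_rng(0)
token_emb    = rng.normal(size=(vocab_size, dim))  # one vector per WordPiece id
segment_emb  = rng.normal(size=(2, dim))           # sentence A (0) vs sentence B (1)
position_emb = rng.normal(size=(max_len, dim))     # learned, one vector per position

token_ids   = np.array([2, 7, 7, 3, 9, 4])         # hypothetical ids for [CLS] A... [SEP] B...
segment_ids = np.array([0, 0, 0, 0, 1, 1])         # 0 through the first [SEP], then 1
positions   = np.arange(len(token_ids))

# The input to the first encoder layer is the element-wise sum of all three.
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)  # (6, 8) — one summed vector per input token
```

Because the three embeddings are simply added, each of the three tables is trained jointly with the rest of the model during pretraining.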
Fine-tuning BERT — add a task head, update all weights end-to-end
BERT fine-tuning is simple: add one task-specific layer on top of the pretrained encoder and train the entire model end-to-end on your labelled data for 2–4 epochs. For classification, use the [CLS] token's final hidden state (a 768-dim vector) as input to a linear classifier. For NER, use every token's final hidden state. For Q&A, predict start and end positions of the answer span.
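The classification head is nothing more than one linear layer on the [CLS] vector. A NumPy sketch, assuming the encoder has already produced final hidden states (the random array below stands in for real encoder output; W and b are the only parameters the head adds):

```python
import numpy as np

hidden_dim, n_classes = 768, 2
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(12, hidden_dim))    # stand-in encoder output, one row per token

W = rng.normal(size=(hidden_dim, n_classes)) * 0.02  # the new task-specific parameters
b = np.zeros(n_classes)

cls_vector = hidden_states[0]                        # [CLS] is always the first token
logits = cls_vector @ W + b                          # Linear(768, n_classes)
probs = np.exp(logits) / np.exp(logits).sum()        # softmax over the classes
```

During fine-tuning, gradients from this head flow back through all encoder layers — that is what "end-to-end" means here; the head is new, but every pretrained weight also moves.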
RoBERTa, DistilBERT, ALBERT, DeBERTa — what each one improved
BERT spawned an entire family of encoder-only models. Each one identified a specific weakness in the original BERT and fixed it — more data, better training recipe, smaller model, better attention mechanism. Understanding what each model improved helps you choose the right one for your task.
Named Entity Recognition — labelling every token in a sequence
Classification uses only the [CLS] token. NER uses every token's output — one label per token. Useful at Razorpay to extract merchant names, amounts, and dates from unstructured dispute text. The label format is BIO: B-entity (beginning), I-entity (inside), O (outside/no entity).
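To make the BIO format concrete, here is a small decoder that groups per-token tags back into entity spans. The sentence, entity types, and tags below are invented for illustration:

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, text) pairs.

    B-X starts an entity, I-X of the same type continues it,
    O (or a type mismatch) closes the current entity.
    """
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])          # start a new entity
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)              # continue the open entity
        else:
            if current:
                spans.append(current)             # O closes the open entity
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["Refund", "of", "Rs", "500", "to", "Acme", "Stores", "on", "3", "March"]
tags   = ["O", "O", "B-AMOUNT", "I-AMOUNT", "O",
          "B-MERCHANT", "I-MERCHANT", "O", "B-DATE", "I-DATE"]
print(bio_to_spans(tokens, tags))
# [('AMOUNT', 'Rs 500'), ('MERCHANT', 'Acme Stores'), ('DATE', '3 March')]
```

The B/I distinction is what lets the format represent two adjacent entities of the same type — "B-MERCHANT B-MERCHANT" is two merchants, "B-MERCHANT I-MERCHANT" is one.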
Every common BERT mistake — explained and fixed
You can fine-tune BERT for any classification or NER task. Next: fine-tune with less than 1% of the parameters.
Full fine-tuning updates all 110 million parameters of BERT. For large models (7B, 13B, 70B parameters) this requires enormous GPU memory and storage. PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA and adapters fine-tune less than 1% of parameters while achieving 95% of full fine-tuning performance. Module 51 covers LoRA, adapters, and prefix tuning — how to fine-tune a 7B parameter model on a single GPU.
Tune less than 1% of a model's parameters and get 95% of the performance. LoRA, adapters, and prefix tuning — when and how to use each.
🎯 Key Takeaways
- ✓BERT uses a Transformer encoder with bidirectional attention — every token attends to every other token with no causal mask. This makes it ideal for understanding tasks (classification, NER, Q&A) where context from both directions matters.
- ✓BERT is pretrained with two objectives: Masked Language Modelling (predict 15% of randomly masked tokens using surrounding context) and Next Sentence Prediction (predict whether sentence B follows sentence A). RoBERTa later showed NSP hurts — train MLM only on more data.
- ✓BERT input is the sum of three embeddings: token (WordPiece lookup), segment (sentence A vs B), and position (learned, 0–511). Special tokens [CLS] (start) and [SEP] (sentence separator) are always added by the tokeniser automatically.
- ✓Fine-tuning pattern: for classification use the [CLS] token final hidden state → Linear(768, n_classes). For NER use every token final hidden state → Linear(768, n_labels). For Q&A predict start and end positions. All use the same pretrained backbone, different task heads.
- ✓The encoder family: RoBERTa (better training recipe, no NSP) is the default when accuracy matters most. DistilBERT (40% smaller, 60% faster, 97% quality) is the default for production serving. DeBERTa achieves state of the art on NLU benchmarks. IndicBERT/MuRIL for Indian language tasks.
- ✓For NER, use word_ids() to align labels to tokenised subwords. First subword of each word gets the real label. Continuation subwords (## prefix) get label -100 to be ignored in loss. Never align labels by position index — WordPiece splits change the count of tokens per word unpredictably.
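The `word_ids()` alignment rule in the last takeaway can be sketched without downloading a tokeniser. In real code, `word_ids` comes from a Hugging Face fast tokeniser's encoding — one entry per subword, `None` for special tokens, otherwise the index of the source word; here it is hard-coded with a plausible value:

```python
def align_labels(word_ids, word_labels):
    """Align word-level NER labels to subword tokens.

    The first subword of each word keeps the real label; continuation
    subwords and special tokens get -100 so the loss ignores them.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                  # [CLS] / [SEP] / padding
        elif wid != previous:
            aligned.append(word_labels[wid])      # first subword of the word
        else:
            aligned.append(-100)                  # ## continuation subword
        previous = wid
    return aligned

# Hypothetical tokenisation of "Razorpay refunded":
# [CLS] Razor ##pay refunded [SEP]  ->  word_ids [None, 0, 0, 1, None]
word_ids = [None, 0, 0, 1, None]
word_labels = [1, 0]   # e.g. 1 = B-ORG for "Razorpay", 0 = O for "refunded"
print(align_labels(word_ids, word_labels))  # [-100, 1, -100, 0, -100]
```

This is why positional alignment fails: "Razorpay" became two subwords, so label index 1 no longer points at "refunded" in the token sequence — only `word_ids()` knows which subwords belong to which word.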