Tokenisation and Word Embeddings
BPE, WordPiece, SentencePiece — how text becomes numbers. Word2Vec, GloVe, and contextual embeddings from BERT. The foundation of every NLP system.
Neural networks only understand numbers. "Razorpay declined my payment" is text. Before any model can process it, every character, subword, or word must become an integer. That conversion is tokenisation — and the choice of how to split text changes everything.
Tokenisation is the first step in every NLP pipeline. It converts raw text into a sequence of integer IDs that the model's embedding layer can look up. The tokenisation strategy determines the vocabulary size, how unknown words are handled, and how efficiently rare or domain-specific terms are represented.
A naive approach: split on spaces. "running" and "runs" and "ran" become three separate vocabulary entries with no shared representation. The model must learn from scratch that they are related. A character-level approach: every letter is a token. "running" becomes 7 tokens. Sequences become very long and the model struggles to learn word-level patterns. Subword tokenisation — used by every modern LLM — splits "running" into "run" and "##ning", sharing the "run" representation across all its inflections while handling unknown words gracefully by falling back to subword pieces.
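The three strategies can be sketched in a few lines of plain Python. The subword tokeniser below uses greedy longest-match-first — the core idea behind WordPiece — against a tiny hand-made vocabulary; the "##" pieces are illustrative, not drawn from any real model's vocabulary.

```python
def word_tokenise(text):
    """Word-level: split on whitespace — related forms share nothing."""
    return text.split()

def char_tokenise(text):
    """Character-level: every letter is a token — very long sequences."""
    return list(text)

def subword_tokenise(word, vocab):
    """Greedy longest-match-first, the core idea behind WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matches at this position
        start = end
    return pieces

vocab = {"run", "##ning", "##s", "ran"}
print(word_tokenise("running runs ran"))   # ['running', 'runs', 'ran']
print(len(char_tokenise("running")))       # 7
print(subword_tokenise("running", vocab))  # ['run', '##ning']
print(subword_tokenise("runs", vocab))     # ['run', '##s']
```

Note how "running" and "runs" now share the "run" piece — the model learns one representation for the root and reuses it across inflections.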
Imagine a dictionary for a new language. You could have one entry per full word — huge dictionary, every form of every verb listed separately. Or one entry per letter — tiny dictionary, but reading requires spelling out every word. Subword tokenisation is like a dictionary of common syllables and word roots — compact vocabulary, still captures meaning, handles new words by combining known pieces.
GPT-4 uses ~100,000 BPE tokens. BERT uses ~30,000 WordPiece tokens. Both can represent any text — known words as single tokens, unknown words as sequences of subword pieces. Nothing is truly "out of vocabulary."
Four tokenisation strategies — word, character, subword, and byte-level
Byte Pair Encoding — the algorithm that powers GPT
BPE starts with a character-level vocabulary and iteratively merges the most frequent pair of adjacent tokens into a new token. After k merges you have a vocabulary of approximately k + n_chars tokens. The merge rules learned during training are applied at inference time to tokenise any new text — including text with words never seen before.
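A minimal sketch of BPE training on a toy corpus — the three words and their frequencies are invented for illustration, and production implementations add end-of-word markers and byte-level handling that are omitted here:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent token pairs, weighted by word frequency."""
    pairs = Counter()
    for tokens, freq in corpus:
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for tokens, freq in corpus:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                out.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged.append((out, freq))
    return merged

# Toy corpus: (character-level tokens, word frequency)
corpus = [(list("lower"), 5), (list("lowest"), 2), (list("newer"), 6)]

merges = []
for _ in range(3):  # k = 3 merges
    pair = get_pair_counts(corpus).most_common(1)[0][0]
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # [('w', 'e'), ('we', 'r'), ('l', 'o')]
```

The learned `merges` list is exactly what gets saved with the tokeniser: at inference time, the same merges are replayed in order on new text, which is why BPE tokenisation is deterministic.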
Word embeddings — dense vectors that capture semantic relationships
After tokenisation, each token ID is looked up in an embedding table — a matrix of shape (vocab_size, d_model). Each row is a dense vector representing that token. Tokens with similar meanings end up with similar vectors — "payment" and "transaction" are close in embedding space. "payment" and "bicycle" are far apart.
Word2Vec (2013) was the first widely-used word embedding method. It trains a shallow neural network to predict a word from its context (CBOW) or predict context from a word (Skip-gram). The learned weight matrix becomes the embedding table. The famous result: king − man + woman ≈ queen — arithmetic in embedding space reflects semantic relationships.
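Why can vector arithmetic capture analogies? A hand-built toy makes it visible. The 2-d vectors below are designed, not learned — real Word2Vec embeddings are 100-300 dimensional and trained from data — but they show the mechanism: if one axis roughly encodes "royalty" and another "gender", then king − man + woman lands on queen.

```python
import numpy as np

# Toy embeddings: dimension 0 ≈ royalty, dimension 1 ≈ gender.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(vec, exclude):
    """Nearest word by cosine similarity, excluding the query words
    (standard practice — gensim's most_similar does the same)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(vec, emb[w]))

result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```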
The fundamental limitation of Word2Vec and GloVe: each word has exactly one vector regardless of context. "Bank" has the same embedding whether it means a river bank or a financial institution. BERT-style contextual embeddings fix this — the embedding for each token depends on all surrounding tokens.
Contextual embeddings — the same word, different vectors based on context
Static embeddings (Word2Vec, GloVe) assign one fixed vector per word. BERT and its successors produce contextual embeddings — the vector for each token depends on the full surrounding context. "Bank" in "river bank" and "bank" in "bank transfer" get different vectors because BERT processes the entire sentence at once via self-attention before producing the embedding.
In practice, contextual embeddings from pretrained models are used in two ways. As features: run BERT, extract the [CLS] token or averaged token embeddings, use them as input to a classifier. As fine-tuned representations: run BERT, add a task head, fine-tune all parameters end-to-end on your labelled data. Fine-tuning almost always outperforms feature extraction but requires more compute.
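The two patterns differ only in which parameters receive gradients. A minimal PyTorch sketch, using a small stand-in encoder in place of a real pretrained BERT (the layer sizes and learning rates here are illustrative assumptions, not prescriptions):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder; in practice you would load BERT
# or RoBERTa from HuggingFace instead.
encoder = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64), nn.Tanh())
classifier = nn.Linear(64, 2)  # fresh task head

# Pattern 1 — feature extraction: freeze the encoder, train only the head.
for p in encoder.parameters():
    p.requires_grad = False
feature_params = [p for p in classifier.parameters() if p.requires_grad]

# Pattern 2 — fine-tuning: unfreeze everything and train end-to-end,
# typically with a much smaller learning rate for the pretrained weights.
for p in encoder.parameters():
    p.requires_grad = True

optimiser = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},    # small LR: pretrained
    {"params": classifier.parameters(), "lr": 1e-3}, # larger LR: fresh head
])
print(len(optimiser.param_groups))  # 2
```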
nn.Embedding — the lookup table that connects tokenisation to neural networks
In PyTorch, the embedding table is nn.Embedding(vocab_size, d_model). It is a matrix of shape (vocab_size, d_model) — one row per token. A forward pass takes integer token IDs and returns the corresponding rows. It is mathematically equivalent to a one-hot encoding multiplied by a weight matrix — but implemented as a simple lookup for efficiency.
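A minimal lookup in PyTorch, with BERT-base-like dimensions; `padding_idx=0` zeroes the padding row and stops it from receiving gradients:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30_000, 768  # BERT-base-like dimensions
emb = nn.Embedding(vocab_size, d_model, padding_idx=0)

# A batch of 2 tokenised sequences, right-padded with ID 0.
token_ids = torch.tensor([[101, 2342,  102,   0, 0],
                          [101, 7592, 2088, 102, 0]])

vectors = emb(token_ids)
print(vectors.shape)                   # torch.Size([2, 5, 768])
print(emb.weight[0].abs().sum().item())  # 0.0 — the padding row is all zeros
```

The output shape is (batch, sequence_length, d_model): the integer grid has simply been replaced by the corresponding rows of the embedding matrix.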
Every common tokenisation and embedding mistake — explained and fixed
Text is now numbers. Next: take a pretrained model and make it do your specific task.
You know how text becomes tokens and how tokens become dense vectors. The next step is using pretrained language models in production — loading BERT or RoBERTa from HuggingFace, fine-tuning on a labelled dataset, and deploying for inference. Module 50 covers the complete HuggingFace fine-tuning workflow — the Trainer API, evaluation, saving checkpoints, and serving predictions in production.
🎯 Key Takeaways
- ✓Tokenisation converts raw text to integer token IDs before any model processing. Word-level tokenisation suffers from OOV problems. Character-level produces very long sequences. Subword tokenisation (BPE, WordPiece, SentencePiece) is the standard — it never produces UNK by splitting unknown words into known subword pieces.
- ✓BPE starts with a character vocabulary and iteratively merges the most frequent adjacent pair into a new token. After k merges the vocabulary has approximately k + n_chars tokens. The same merge rules are applied deterministically at inference time — making BPE reproducible and fast.
- ✓GPT-family models use BPE (byte-level). BERT uses WordPiece — similar to BPE but merges are chosen to maximise the likelihood of the training corpus rather than frequency. Both produce ~30-100k token vocabularies. Always use the tokeniser that was trained with the model — never mix tokenisers.
- ✓Static word embeddings (Word2Vec, GloVe) assign one fixed vector per word regardless of context. Contextual embeddings (BERT, RoBERTa) produce different vectors for the same word in different contexts — "bank" near "river" gets a different vector than "bank" near "transfer". Contextual embeddings almost always produce better downstream task performance.
- ✓In PyTorch, nn.Embedding(vocab_size, d_model) is a lookup table of shape (vocab_size, d_model). padding_idx=0 ensures the padding token always returns a zero vector and receives no gradient. Always ensure the tokeniser vocab_size exactly matches the nn.Embedding num_embeddings parameter.
- ✓For production NLP: never train embeddings from scratch unless you have 100M+ tokens. Load pretrained embeddings (GloVe for static, BERT/RoBERTa for contextual). Fine-tune on your domain data. Always save the tokeniser alongside the model — a different tokeniser will produce different token IDs and break the model entirely.