RNNs and LSTMs — Sequence Modelling
Hidden states, vanishing gradients across time, and how LSTMs use gates to selectively remember and forget. Built from scratch before PyTorch.
A CNN sees one image independently. An MLP sees one row independently. But a sentence, a stock price, a user session — each step depends on what came before. RNNs process sequences by carrying memory forward.
Flipkart wants to predict whether a user will make a purchase in the next 10 minutes based on their browsing session: home page → search "running shoes" → product page → add to cart → remove from cart. An MLP treats each action independently — it sees five inputs with no concept of order or context. The sequence matters enormously. "Add to cart then remove" signals hesitation. "Search then product page" signals intent. The temporal pattern is the signal.
RNNs (Recurrent Neural Networks) process sequences one step at a time, maintaining a hidden state — a vector that summarises everything seen so far. At each step the hidden state is updated using the current input and the previous hidden state. After processing the full sequence, the final hidden state is a compressed representation of the entire sequence — used for classification, regression, or generation.
Reading this sentence word by word — your understanding of each new word depends on everything you have read before. "The bank was steep" versus "The bank was closed" — the word "bank" means something different based on prior context. You carry a mental model forward as you read. That mental model is the hidden state.
An RNN does exactly this — it maintains a hidden state vector that gets updated at every word (or time step). The hidden state at the end of the sequence encodes the full context. The problem: RNNs forget things from 20+ steps ago. LSTMs fix this with explicit memory management using gates.
The RNN cell — one equation, applied at every time step
An RNN cell has one equation. At each time step t it takes the current input xₜ and the previous hidden state hₜ₋₁, combines them linearly, and applies tanh to produce the new hidden state: hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b). The same weights Wₓ, Wₕ, and bias b are reused at every time step — weight sharing across time, just as CNNs share weights across space.
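That single equation can be rolled out over a sequence in a few lines of PyTorch. A minimal sketch — the sizes here are illustrative assumptions, not from any real model:

```python
import torch

# The RNN cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b).
# input_size and hidden_size are illustrative assumptions.
input_size, hidden_size = 4, 8
W_x = torch.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_h = torch.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: combine input and previous hidden state, squash with tanh."""
    return torch.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence = torch.randn(6, input_size)  # 6 time steps
h = torch.zeros(hidden_size)           # initial hidden state
for x_t in sequence:                   # the SAME weights are reused at every step
    h = rnn_step(x_t, h)
print(h.shape)  # the final hidden state summarises the whole sequence
```

Note that the loop body never changes: one set of weights, applied at every step. That is the weight sharing across time.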
LSTM — three gates that control what to remember, forget, and output
The LSTM (Long Short-Term Memory) was designed specifically to fix the vanishing gradient problem. It maintains two states: the hidden state hₜ (same as RNN) and a new cell state Cₜ — a separate memory lane that runs through the sequence with only additive interactions. Because the cell state is modified additively (not multiplicatively), gradients flow backward through it without shrinking exponentially.
Three gates control the cell state. The forget gate decides what to erase from the previous cell state. The input gate decides what new information to write to the cell state. The output gate decides what part of the cell state to expose as the hidden state. All gates output values between 0 and 1 (sigmoid) — 0 means "block completely," 1 means "pass through completely."
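Written out gate by gate, one LSTM step looks like the following sketch (sizes are illustrative assumptions, and biases are omitted for brevity):

```python
import torch

# One LSTM step, gate by gate. Sizes are illustrative; biases omitted for brevity.
input_size, hidden_size = 4, 8
# Each gate has its own weights over the concatenated [h_{t-1}, x_t].
W_f = torch.randn(hidden_size, hidden_size + input_size) * 0.1  # forget gate
W_i = torch.randn(hidden_size, hidden_size + input_size) * 0.1  # input gate
W_g = torch.randn(hidden_size, hidden_size + input_size) * 0.1  # candidate values
W_o = torch.randn(hidden_size, hidden_size + input_size) * 0.1  # output gate

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([h_prev, x_t])
    f = torch.sigmoid(W_f @ z)      # 0..1: what to erase from memory
    i = torch.sigmoid(W_i @ z)      # 0..1: how much new information to write
    g = torch.tanh(W_g @ z)         # candidate values to write
    o = torch.sigmoid(W_o @ z)      # 0..1: what part of memory to expose
    c_t = f * c_prev + i * g        # ADDITIVE cell-state update
    h_t = o * torch.tanh(c_t)       # hidden state is a gated view of memory
    return h_t, c_t

h = torch.zeros(hidden_size)
c = torch.zeros(hidden_size)
x = torch.randn(input_size)
h, c = lstm_step(x, h, c)
```

The line to stare at is the cell-state update: c_prev is multiplied by a gate and then added to, never pushed through a squashing nonlinearity, which is what keeps gradients alive across many steps.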
PyTorch nn.LSTM — shapes, directions, and layers
PyTorch's nn.LSTM processes an entire sequence in one call. The most important thing to understand is the input and output shapes — they are not intuitive and cause the majority of LSTM bugs. Input is (seq_len, batch, input_size) by default — note seq_len comes first, not batch. Output is the hidden state at every time step plus the final hidden and cell states separately.
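A quick shape check makes the layout concrete (the sizes below are hypothetical):

```python
import torch
import torch.nn as nn

# Default layout: input is (seq_len, batch, input_size) — seq_len first.
lstm = nn.LSTM(input_size=10, hidden_size=32, num_layers=2)
x = torch.randn(512, 8, 10)              # 512 time steps, batch of 8
output, (h_n, c_n) = lstm(x)
print(output.shape)  # (512, 8, 32): hidden state at EVERY time step
print(h_n.shape)     # (2, 8, 32): final hidden state, one per layer
print(c_n.shape)     # (2, 8, 32): final cell state, one per layer

# With batch_first=True the batch dimension leads instead.
lstm_bf = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
out_bf, _ = lstm_bf(torch.randn(8, 512, 10))
print(out_bf.shape)  # (8, 512, 32)
```

Most shape bugs come from mixing these two conventions, so it is worth printing shapes once before training anything.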
LSTM for Flipkart session classification — will this user buy?
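This is a many-to-one classification problem: a sequence of actions in, one purchase probability out. A minimal sketch of such a classifier, where the action vocabulary size, embedding size, and hidden size are all assumptions for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: classify a browsing session (sequence of action IDs)
# as buy / no-buy. Vocabulary and layer sizes are assumptions.
class SessionClassifier(nn.Module):
    def __init__(self, num_actions=50, embed_dim=16, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(num_actions, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, actions):               # actions: (batch, seq_len) of IDs
        x = self.embed(actions)               # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)            # h_n: (1, batch, hidden_size)
        return self.head(h_n[-1])             # one purchase logit per session

model = SessionClassifier()
# Hypothetical IDs for: home → search → product → add-to-cart → remove
session = torch.tensor([[3, 7, 12, 21, 22]])
logit = model(session)
prob = torch.sigmoid(logit)                   # P(purchase in next 10 minutes)
```

The final hidden state h_n is the compressed summary of the whole session, which is exactly what the classifier head needs.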
LSTM for time series — Zepto demand forecasting
Beyond classification, LSTMs are widely used for sequence-to-value regression: given the last N time steps, predict the next value. Zepto predicts hourly demand for each SKU at each dark store — the last 24 hours of sales predict the next hour. This is a many-to-one sequence regression problem.
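A sketch of that many-to-one regression setup, with the window length taken from the text and the layer sizes as illustrative assumptions:

```python
import torch
import torch.nn as nn

# Many-to-one regression sketch: last 24 hourly sales → next hour's demand.
# Hidden size is an illustrative assumption.
class DemandForecaster(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, window):                # window: (batch, 24, 1)
        _, (h_n, _) = self.lstm(window)       # summary of the 24-hour window
        return self.head(h_n[-1])             # predicted demand for next hour

model = DemandForecaster()
sales = torch.rand(16, 24, 1)                 # batch of 16 SKU/store windows
pred = model(sales)                           # (16, 1)
```

In practice the input at each hour would carry more than one feature (price, promotions, day of week), which only changes input_size.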
Every common RNN/LSTM mistake — explained and fixed
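Two fixes that recur throughout this list, packing padded sequences and clipping gradients, can be sketched as follows (sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Fix 1: pack variable-length sequences so padding never enters the hidden state.
batch = torch.randn(3, 10, 8)                 # 3 sequences, padded to length 10
lengths = torch.tensor([10, 7, 4])            # their true lengths
packed = pack_padded_sequence(batch, lengths, batch_first=True,
                              enforce_sorted=False)
_, (h_n, _) = lstm(packed)                    # h_n holds each sequence's TRUE last step

# Fix 2: clip the global gradient norm after backward(), before optimizer.step().
loss = h_n.sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=1.0)
```

Without packing, h_n for the length-4 sequence would be computed from six steps of padding; without clipping, a single exploding batch can destroy an otherwise healthy training run.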
You can model sequences. Next: the architecture that replaced RNNs for almost everything.
LSTMs process sequences step by step — they cannot parallelise across time steps during training. A sequence of 512 tokens requires 512 sequential LSTM steps. Transformers replaced this with self-attention — every token attends to every other token simultaneously. Training is fully parallelisable, long-range dependencies are captured in a single layer, and the results are dramatically better. Every modern LLM — GPT, Gemini, Claude — is a Transformer. Module 48 builds self-attention from scratch.
Queries, keys, values, and why attention is all you need. Build a self-attention layer from scratch, then see how GPT and BERT use it.
🎯 Key Takeaways
- ✓ RNNs process sequences by maintaining a hidden state — a vector summarising everything seen so far. At each step: hₜ = tanh(Wₓxₜ + Wₕhₜ₋₁ + b). The same weights are reused at every step (weight sharing across time). The final hidden state represents the entire sequence.
- ✓ The vanishing gradient problem: tanh derivatives are at most 1 and typically well below it. With a representative per-step factor of 0.7, gradients over 50 time steps shrink by 0.7⁵⁰ ≈ 2×10⁻⁸. Early time steps receive essentially zero gradient — the network cannot learn long-range dependencies.
- ✓ LSTMs add a cell state Cₜ alongside the hidden state hₜ. The cell state is updated additively: Cₜ = fₜ × Cₜ₋₁ + iₜ × gₜ. Additive updates allow gradients to flow backward without shrinking — this is why LSTMs can learn dependencies 100+ steps apart.
- ✓ Three gates control the cell state: the forget gate f (what to erase from memory), the input gate i with candidate g (what new information to write), and the output gate o (what part of memory to expose as the hidden state). All gates use sigmoid — values between 0 and 1 act as soft on/off switches.
- ✓ PyTorch LSTM shapes: input is (batch, seq_len, input_size) with batch_first=True. output is (batch, seq_len, hidden_size) — the hidden state at every step. h_n is (num_layers, batch, hidden_size) — the final hidden state. Always use pack_padded_sequence for variable-length sequences, and always clip gradients: nn.utils.clip_grad_norm_(model.parameters(), 1.0).
- ✓ Use LSTMs for: time series forecasting (demand, sensor readings), sequence classification (session prediction, sentiment), and anomaly detection in sequential data. For new NLP projects use Transformers (Module 48) — LSTMs remain the standard choice only for time series and very long sequences where attention would be prohibitively expensive.
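The compounding in the vanishing-gradient takeaway can be checked directly:

```python
# A per-step gradient factor of 0.7, compounded over 50 time steps.
factor = 0.7 ** 50
print(f"{factor:.2e}")  # ≈ 1.80e-08: effectively zero gradient for early steps
```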