We have established that sequential data requires modeling dependencies across time. We have the autoregressive framework for training. Now comes the architectural question: what neural network should compute $P(x_t | \mathbf{x}_{<t})$?
The naive answer—use a standard feedforward network—immediately runs into fundamental obstacles. Feedforward networks expect fixed-size inputs, but sequences have variable length. They treat each input position independently, but sequence positions are interdependent. They have no notion of time or order, but order is definitional for sequences.
This page develops the case for recurrent neural networks (RNNs)—architectures specifically designed to process sequential data. We will see that the recurrent formulation elegantly addresses the limitations of feedforward networks while introducing its own challenges and tradeoffs.
By the end of this page, you will: (1) Understand why feedforward networks are unsuitable for sequences, (2) Appreciate the key properties RNNs must satisfy, (3) See how hidden states solve the variable-length context problem, (4) Understand parameter sharing across time, and (5) Recognize the tradeoffs RNNs introduce.
Consider the simplest approach: use a standard feedforward neural network for sequence modeling. How might we attempt this?
Attempt 1: Fixed-Size Sliding Window
Concatenate the last $k$ tokens as input to a feedforward network:
$$h = \text{MLP}([x_{t-k}, x_{t-k+1}, \ldots, x_{t-1}])$$ $$P(x_t | \mathbf{x}_{<t}) \approx P(x_t | x_{t-k:t-1}) = \text{softmax}(W h + b)$$
This is essentially a neural n-gram model. It works for some applications (CNNs for text classification use a similar idea), but it has fundamental issues: any dependency longer than $k$ tokens is invisible, the first layer's parameter count grows with $k$, and each window slot gets its own weights, so nothing learned at one offset transfers to another.
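A minimal sketch of this windowed approach, with hypothetical sizes (`WindowedLM` is an illustrative name, not a standard module):

```python
import torch
import torch.nn as nn

class WindowedLM(nn.Module):
    """Neural n-gram: predict x_t from only the previous k tokens."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, k):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(k * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),   # logits over the next token
        )

    def forward(self, window):                   # window: (batch, k) token ids
        e = self.embed(window)                   # (batch, k, embed_dim)
        h = e.flatten(start_dim=1)               # concatenate the k embeddings
        return self.mlp(h)                       # (batch, vocab_size) logits

model = WindowedLM(vocab_size=10_000, embed_dim=64, hidden_dim=256, k=4)
logits = model(torch.randint(0, 10_000, (8, 4)))  # batch of 8 windows
print(logits.shape)  # torch.Size([8, 10000]); anything before the window is invisible
```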
Attempt 2: Pad to Maximum Length
Pad all sequences to the maximum length $T_{\max}$ and use a single large network:
$$h = \text{MLP}(\text{pad}(\mathbf{x}, T_{\max}))$$
This is worse: the input dimensionality and parameter count scale with $T_{\max}$, sequences longer than $T_{\max}$ cannot be handled at all, most computation is wasted on padding, and there is still no parameter sharing across positions.
Attempt 3: Process Each Token Independently
Treat each token independently:
$$P(x_t | \mathbf{x}_{<t}) = P(x_t)$$
This ignores all sequential structure—the unigram model. Performance is terrible for any structured sequence.
Feedforward networks fail for sequences because they: (1) Cannot handle variable-length inputs without padding/truncation, (2) Have no mechanism for information to flow across positions, (3) Do not share parameters across positions, preventing generalization, and (4) Have no notion of order—shuffling input positions changes nothing except through explicit positional encoding.
| Requirement | Feedforward | What's Needed |
|---|---|---|
| Variable-length input | Fixed input size | Process sequences of any length |
| Cross-position dependencies | No information flow | Early tokens inform later predictions |
| Parameter efficiency | Parameters per position | Parameters shared across time |
| Order sensitivity | Permutation invariant | Order changes meaning |
| Unbounded history | Fixed context window | Potentially infinite memory |
The key insight behind RNNs is to introduce a hidden state that evolves as the sequence is processed. Instead of mapping directly from input to output, we maintain a running summary of everything seen so far:
$$h_t = f(h_{t-1}, x_t; \theta)$$
where $h_t$ is the hidden state after seeing the first $t$ inputs, $h_{t-1}$ is the previous hidden state, $x_t$ is the current input, and $\theta$ are learned parameters shared across all timesteps.
This single equation encapsulates the essence of recurrent processing. Let's unpack its implications:
Think of $h_t$ as a 'summary' or 'memory' of everything the network has seen up to time $t$. At each step, the network reads the new input $x_t$ and its current summary $h_{t-1}$, then produces an updated summary $h_t$. This updated summary can be used for prediction or passed to the next step.
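As a minimal sketch of that update rule (a plain tanh recurrence with arbitrary toy dimensions; `f`, `W_ih`, and `W_hh` are illustrative names, not a trained model):

```python
import torch

d_in, d_hidden = 32, 128
W_ih = 0.1 * torch.randn(d_hidden, d_in)       # part of theta: input -> hidden
W_hh = 0.1 * torch.randn(d_hidden, d_hidden)   # part of theta: hidden -> hidden

def f(h_prev, x):
    """One recurrent step: fold the new input x into the running summary h."""
    return torch.tanh(W_hh @ h_prev + W_ih @ x)

h = torch.zeros(d_hidden)                  # h_0: the "empty" summary
for x_t in torch.randn(10, d_in):          # a toy sequence of 10 input vectors
    h = f(h, x_t)                          # same f, same theta, at every step
print(h.shape)                             # torch.Size([128]) -- fixed size, any length
```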
The Computational Graph View:
We can visualize the recurrence as a computational graph:
```
h_0 ──→ [f] ──→ h_1 ──→ [f] ──→ h_2 ──→ [f] ──→ h_3 ──→ ...
         ↑               ↑               ↑
        x_1             x_2             x_3
```
Each application of $f$ takes the previous hidden state and current input, producing the next hidden state. Unrolling this graph across time reveals an equivalence to a very deep feedforward network—but with shared weights across all layers. This weight sharing is both the blessing (generalization, efficiency) and the curse (vanishing gradients) of RNNs.
Recurrent neural networks have a rich history spanning four decades. Understanding this history provides perspective on why RNNs developed as they did and what problems they were designed to solve.
1980s: Origins
Early recurrent ideas appear in Hopfield networks (1982) and Jordan networks (1986), which feed a network's own past activity back as input; backpropagation through time was worked out in the same period.
1990s: Theoretical Foundations and Challenges
Elman networks (1990) popularized the simple recurrent architecture, the vanishing and exploding gradient problems were analyzed (Hochreiter, 1991; Bengio et al., 1994), and the LSTM was proposed as a remedy (Hochreiter & Schmidhuber, 1997).
2000s-2010s: Practical Breakthroughs
GPU training, better initialization, and gated cells (LSTM, later the GRU in 2014) made RNNs the workhorse of speech recognition, machine translation, and sequence-to-sequence learning.
2017-Present: Transformers and Beyond
The Transformer (2017) replaced recurrence with attention for most NLP workloads, while recent state-space models revive recurrence-style computation at scale.
Key Insight: RNNs solved the fundamental problem of sequence modeling that feedforward networks couldn't address. While Transformers have since superseded RNNs for many applications (particularly NLP), understanding RNNs remains essential—they illuminate core concepts, remain efficient for long sequences, and underlie many production systems.
Even as Transformers dominate NLP benchmarks, RNN concepts persist: the hidden state idea appears in state-space models, the recurrence appears in linear attention, and LSTM/GRU remain preferred for streaming applications where memory efficiency matters. Understanding RNNs provides deep insight into sequence modeling generally.
Before examining RNN details, let's articulate what properties an ideal sequence model should possess. This provides criteria for evaluating RNNs and understanding their design choices.
| Property | RNN | Transformer | State-Space |
|---|---|---|---|
| Variable length | ✓ | ✓ (with limits) | ✓ |
| Long-range deps | Challenging | ✓✓ | ✓✓ |
| Parameter efficiency | ✓✓ | ✓ | ✓✓ |
| Compute (training) | O(T) | O(T²) | O(T log T) |
| Compute (inference) | O(1)/step | O(T)/step | O(1)/step |
| Order sensitivity | ✓ | Needs pos enc | ✓ |
| Causal capable | ✓ | Needs masking | ✓ |
| Parallel training | Limited† | ✓✓ | ✓ |
| Stable training | Needs tricks | More stable | Stable |
| Streaming inference | ✓✓ | ✗ (needs cache) | ✓✓ |
†The hidden-state recurrence is inherently sequential: teacher forcing removes the need to generate tokens one at a time, but each $h_t$ still depends on $h_{t-1}$, so training cannot be parallelized across time the way Transformer training can. Parallelism is limited to the batch dimension and to the computation within each step.
No architecture is perfect. RNNs excel at parameter efficiency, streaming inference, and natural causality, but struggle with long-range dependencies and parallel training. Understanding these tradeoffs helps select the right model for each application.
The central concept in RNNs is the hidden state $h_t$—a fixed-size vector that summarizes the sequence history $(x_1, \ldots, x_t)$. This compression is both the key feature and fundamental limitation of RNNs.
The Compression View:
The hidden state $h_t \in \mathbb{R}^d$ must encode all information from the history that is relevant for: (1) predicting the next token $x_{t+1}$, and (2) every prediction the model will need to make later in the sequence.
With $d$ dimensions (typically 256-2048), the network must compress histories of potentially millions of tokens. This is an aggressive compression ratio, and something must give: the compression is lossy, so the network must learn to keep what the task needs and discard the rest, and information that is not refreshed tends to fade as new inputs overwrite it.
From an information theory standpoint: $d$ real-valued numbers can in principle store unlimited information (via arbitrary precision). In practice, precision is limited and gradients must flow—so effective capacity is far less. The hidden state capacity is roughly O(d log(1/ε)) bits where ε is precision. For d=512 and reasonable precision, this is thousands of bits—enough for substantial summaries but far short of raw storage.
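A quick back-of-the-envelope check of that estimate, assuming an effective precision of $\varepsilon = 10^{-3}$ per dimension (the exact value is an assumption; the point is the order of magnitude):

```python
import math

d = 512
eps = 1e-3                                   # assumed usable precision per dimension
bits_per_dim = math.log2(1.0 / eps)          # ~10 bits per dimension
print(f"~{d * bits_per_dim:.0f} bits")       # ~5103 bits: thousands, not millions
```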
What Does the Hidden State Encode?
Empirical studies of trained RNNs reveal that hidden states encode recent token identities, syntactic state (such as open brackets or unfinished clauses), coarse topic and sentiment, and approximate counts or positions.
Different dimensions specialize for different aspects. Some dimensions track syntax; others track semantics; still others count or track specific patterns.
The Compression Metaphor:
Consider reading a long novel. You don't remember every word—your mind maintains a compressed summary: the main characters and their relationships, the plot so far, the current scene, and a handful of vivid details.
This summary evolves as you read: new information updates it, irrelevant details fade. The RNN hidden state works similarly—it's a learned, differentiable compression scheme that retains task-relevant information while discarding noise.
The Hierarchy of Preservation:
| Information Type | Preservation | Example |
|---|---|---|
| Last 1-3 tokens | Excellent | Immediate context for prediction |
| Recent syntax | Good | Open parentheses, clause markers |
| Topic/theme | Moderate | General subject matter |
| Specific distant facts | Poor | Unless repeatedly referenced |
One of the most important properties of RNNs is parameter sharing across time: the same function $f(h_{t-1}, x_t; \theta)$ with the same parameters $\theta$ is applied at every timestep. This design choice has profound implications.
Statistical Efficiency:
Without parameter sharing, a model for sequences of length 100 would need 100× the parameters of a model for length 1—each position would have its own weights. This is wasteful: every position's weights must be learned from scratch, a pattern observed at position 10 teaches the model nothing about position 200, and the model cannot process positions beyond those seen in training.
With parameter sharing, observations at any position contribute to learning a single function. This dramatically improves statistical efficiency—we can learn from shorter sequences and generalize to longer ones.
Parameter sharing in RNNs is analogous to parameter sharing in CNNs. A convolutional kernel is applied at every spatial position with the same weights. An RNN cell is applied at every temporal position with the same weights. Both exploit translation invariance—patterns that occur at one position are the same as patterns at another position.
Generalization Across Positions:
Parameter sharing enables powerful generalization: a pattern learned at one position applies at every position, the model can be trained on short sequences and applied to longer ones, and rare patterns benefit from every occurrence regardless of where they appear.
The Tradeoff:
However, parameter sharing assumes that what happens at time $t$ is similar to what happens at time $t'$. This stationarity assumption may be violated: the beginning of a document behaves differently from its middle (a headline is not body text), and some processes genuinely change their dynamics over time.
RNNs address this through the hidden state: even with shared parameters, different histories produce different behaviors because $h_t$ carries positional information implicitly. The function is the same, but it's applied to different hidden states.
```python
import torch
import torch.nn as nn


class RNNCell(nn.Module):
    """
    Single RNN cell demonstrating parameter sharing.
    The SAME weights are used at every timestep.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # These weights are shared across ALL timesteps
        self.W_ih = nn.Linear(input_size, hidden_size)    # Input -> Hidden
        self.W_hh = nn.Linear(hidden_size, hidden_size)   # Hidden -> Hidden

    def forward(self, x, h_prev):
        """
        Single step: h_t = tanh(W_ih @ x_t + W_hh @ h_{t-1})
        """
        return torch.tanh(self.W_ih(x) + self.W_hh(h_prev))


def process_sequence(cell, x_sequence):
    """
    Process full sequence using SAME cell at every position.
    This is parameter sharing in action.
    """
    batch_size = x_sequence.size(0)
    seq_length = x_sequence.size(1)

    # Initialize hidden state
    h = torch.zeros(batch_size, cell.hidden_size)

    hidden_states = []
    for t in range(seq_length):
        # SAME cell applied at every timestep
        # Parameters are identical; only h and x change
        h = cell(x_sequence[:, t], h)
        hidden_states.append(h)

    return torch.stack(hidden_states, dim=1)


# Parameter count comparison
input_size, hidden_size, seq_length = 100, 256, 1000

# With parameter sharing (RNN): same parameters for all timesteps
rnn_cell = RNNCell(input_size, hidden_size)
shared_params = sum(p.numel() for p in rnn_cell.parameters())
print(f"RNN (shared): {shared_params:,} parameters")

# Without sharing: separate parameters per timestep
unshared_params = shared_params * seq_length
print(f"No sharing: {unshared_params:,} parameters")
print(f"Reduction: {unshared_params / shared_params:.0f}x")
```

The recurrent formulation introduces a fundamental training challenge that will be explored in depth later: vanishing and exploding gradients. Understanding this challenge is crucial for appreciating why architectures like LSTM and GRU were developed.
The Problem in Brief:
To learn from long-range dependencies, gradients must flow backward through many timesteps during backpropagation. Consider the gradient of the loss at time $T$ with respect to the hidden state at time $1$:
$$\frac{\partial \mathcal{L}_T}{\partial h_1} = \frac{\partial \mathcal{L}_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_{T-1}} \cdot \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_2}{\partial h_1}$$
This is a product of $T-1$ Jacobian matrices. Each Jacobian $\frac{\partial h_{t+1}}{\partial h_t}$ depends on the hidden-to-hidden weight matrix and the activation function's derivative.
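For a tanh cell like the one implemented above, $h_{t+1} = \tanh(W_{ih} x_{t+1} + W_{hh} h_t)$, this Jacobian has a simple closed form (the squared term comes from the tanh derivative and is taken element-wise):

$$\frac{\partial h_{t+1}}{\partial h_t} = \operatorname{diag}\!\left(1 - h_{t+1}^{2}\right) W_{hh}$$

Saturated units (where $|h| \approx 1$) and small singular values of $W_{hh}$ both shrink this factor, and the shrinkage compounds across timesteps.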
If the typical eigenvalue of each Jacobian is $< 1$, the product shrinks exponentially: $(0.9)^{100} \approx 0.000027$. Gradients vanish—distant timesteps provide no learning signal. If eigenvalues are $> 1$, the product grows exponentially: $(1.1)^{100} \approx 13{,}781$. Gradients explode—training becomes unstable. Staying exactly at 1 requires perfect balance—a knife-edge condition.
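A quick numerical check of this knife-edge behavior (a toy simulation, not a real backward pass; each per-step Jacobian is modeled as a scaled orthogonal matrix so that every singular value equals the chosen scale):

```python
import torch

torch.manual_seed(0)
d, T = 64, 100

for scale, label in [(0.9, "contracting"), (1.1, "expanding")]:
    # Scaled orthogonal matrix: a clean stand-in for a per-step Jacobian
    # whose singular values all equal `scale`.
    Q, _ = torch.linalg.qr(torch.randn(d, d))
    W = scale * Q

    v = torch.randn(d)
    v = v / v.norm()                # unit-length "gradient" vector
    for _ in range(T):
        v = W.T @ v                 # one backward step through time
    print(f"{label}: gradient norm after {T} steps = {v.norm().item():.3e}")
    # contracting -> ~2.7e-05 (vanishing), expanding -> ~1.4e+04 (exploding)
```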
Why This Matters:
Vanishing gradients: The network cannot learn long-range dependencies. Recent positions dominate learning; distant positions are ignored.
Exploding gradients: Gradient values overflow, causing NaN losses and unstable training.
Solutions Preview:
Gradient clipping: Cap gradient magnitude to prevent explosion (but doesn't help vanishing); see the sketch after this list
Careful initialization: Initialize weights to preserve gradient magnitude initially
Gated architectures: LSTM and GRU use gates to create 'gradient highways' that allow gradients to flow without attenuation
Residual connections: Add skip connections that bypass the recurrence
Different architectures: Transformers avoid recurrence entirely, using attention for long-range connections
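Of these, gradient clipping is the simplest to apply in practice. A sketch of a single training step with toy data and a toy model; the one load-bearing line is the `torch.nn.utils.clip_grad_norm_` call:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=32, hidden_size=128, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 50, 32)        # (batch, time, features), toy inputs
target = torch.randn(8, 50, 128)  # toy regression targets

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0
# (prevents explosion, does nothing for vanishing).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```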
We will explore these solutions in detail when covering LSTM and GRU architectures.
This page has established the motivation for recurrent neural networks—why they exist and what problems they solve. The core insights are fundamental to understanding not just RNNs but sequence modeling generally.
What's Next:
The next page introduces Unfolded Computation—visualizing the RNN as an unrolled computational graph. This perspective is essential for understanding how RNNs are trained via backpropagation through time (BPTT) and why the gradient challenges arise.
You now understand WHY recurrent neural networks were developed and what fundamental problems they address. The hidden state + shared parameters formula is elegant and powerful. In the next page, we'll see exactly how this recurrence unfolds computationally and how gradients propagate through it.