We have established that sequential data requires modeling dependencies across time. We have the autoregressive framework for training. Now comes the architectural question: what neural network should compute $P(x_t | \mathbf{x}_{<t})$?
The naive answer—use a standard feedforward network—immediately runs into fundamental obstacles. Feedforward networks expect fixed-size inputs, but sequences have variable length. They treat each input position independently, but sequence positions are interdependent. They have no notion of time or order, but order is definitional for sequences.
This page develops the case for recurrent neural networks (RNNs)—architectures specifically designed to process sequential data. We will see that the recurrent formulation elegantly addresses the limitations of feedforward networks while introducing its own challenges and tradeoffs.
By the end of this page, you will: (1) Understand why feedforward networks are unsuitable for sequences, (2) Appreciate the key properties RNNs must satisfy, (3) See how hidden states solve the variable-length context problem, (4) Understand parameter sharing across time, and (5) Recognize the tradeoffs RNNs introduce.
Consider the simplest approach: use a standard feedforward neural network for sequence modeling. How might we attempt this?
Attempt 1: Fixed-Size Sliding Window
Concatenate the last $k$ tokens as input to a feedforward network:
$$h = \text{MLP}([x_{t-k}, x_{t-k+1}, \ldots, x_{t-1}])$$ $$P(x_t | \mathbf{x}_{<t}) \approx P(x_t | x_{t-k:t-1}) = \text{softmax}(W h + b)$$
This is essentially a neural n-gram model. It works for some applications (CNNs for text classification use a similar idea), but it has fundamental issues: any dependency longer than $k$ tokens is invisible, the first layer's parameter count grows with $k$, and each window slot gets its own weights, so nothing learned at one offset transfers to another.
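A minimal sketch of this windowed approach, with hypothetical sizes (`WindowedLM` is an illustrative name, not a standard module):

```python
import torch
import torch.nn as nn

class WindowedLM(nn.Module):
    """Neural n-gram: predict x_t from only the previous k tokens."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, k):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(k * embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),   # logits over the next token
        )

    def forward(self, window):                   # window: (batch, k) token ids
        e = self.embed(window)                   # (batch, k, embed_dim)
        h = e.flatten(start_dim=1)               # concatenate the k embeddings
        return self.mlp(h)                       # (batch, vocab_size) logits

model = WindowedLM(vocab_size=10_000, embed_dim=64, hidden_dim=256, k=4)
logits = model(torch.randint(0, 10_000, (8, 4)))  # batch of 8 windows
print(logits.shape)  # torch.Size([8, 10000]); anything before the window is invisible
```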
Attempt 2: Pad to Maximum Length
Pad all sequences to the maximum length $T_{\max}$ and use a single large network:
$$h = \text{MLP}(\text{pad}(\mathbf{x}, T_{\max}))$$
This is worse: the input dimensionality and parameter count scale with $T_{\max}$, sequences longer than $T_{\max}$ cannot be handled at all, most computation is wasted on padding, and there is still no parameter sharing across positions.
Attempt 3: Process Each Token Independently
Treat each token independently:
$$P(x_t | \mathbf{x}_{<t}) = P(x_t)$$
This ignores all sequential structure—the unigram model. Performance is terrible for any structured sequence.
Feedforward networks fail for sequences because they: (1) Cannot handle variable-length inputs without padding/truncation, (2) Have no mechanism for information to flow across positions, (3) Do not share parameters across positions, preventing generalization, and (4) Have no notion of order—shuffling input positions changes nothing except through explicit positional encoding.
| Requirement | Feedforward | What's Needed |
|---|---|---|
| Variable-length input | Fixed input size | Process sequences of any length |
| Cross-position dependencies | No information flow | Early tokens inform later predictions |
| Parameter efficiency | Parameters per position | Parameters shared across time |
| Order sensitivity | Permutation invariant | Order changes meaning |
| Unbounded history | Fixed context window | Potentially infinite memory |
The key insight behind RNNs is to introduce a hidden state that evolves as the sequence is processed. Instead of mapping directly from input to output, we maintain a running summary of everything seen so far:
$$h_t = f(h_{t-1}, x_t; \theta)$$
where $h_t$ is the hidden state after seeing the first $t$ inputs, $h_{t-1}$ is the previous hidden state, $x_t$ is the current input, and $\theta$ are learned parameters shared across all timesteps.
This single equation encapsulates the essence of recurrent processing. Let's unpack its implications:
Think of $h_t$ as a 'summary' or 'memory' of everything the network has seen up to time $t$. At each step, the network reads the new input $x_t$ and its current summary $h_{t-1}$, then produces an updated summary $h_t$. This updated summary can be used for prediction or passed to the next step.
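As a minimal sketch of that update rule (a plain tanh recurrence with arbitrary toy dimensions; `f`, `W_ih`, and `W_hh` are illustrative names, not a trained model):

```python
import torch

d_in, d_hidden = 32, 128
W_ih = 0.1 * torch.randn(d_hidden, d_in)       # part of theta: input -> hidden
W_hh = 0.1 * torch.randn(d_hidden, d_hidden)   # part of theta: hidden -> hidden

def f(h_prev, x):
    """One recurrent step: fold the new input x into the running summary h."""
    return torch.tanh(W_hh @ h_prev + W_ih @ x)

h = torch.zeros(d_hidden)                  # h_0: the "empty" summary
for x_t in torch.randn(10, d_in):          # a toy sequence of 10 input vectors
    h = f(h, x_t)                          # same f, same theta, at every step
print(h.shape)                             # torch.Size([128]) -- fixed size, any length
```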
The Computational Graph View:
We can visualize the recurrence as a computational graph:
```
h_0 ──→ [f] ──→ h_1 ──→ [f] ──→ h_2 ──→ [f] ──→ h_3 ──→ ...
         ↑               ↑               ↑
        x_1             x_2             x_3
```
Each application of $f$ takes the previous hidden state and current input, producing the next hidden state. Unrolling this graph across time reveals an equivalence to a very deep feedforward network—but with shared weights across all layers. This weight sharing is both the blessing (generalization, efficiency) and the curse (vanishing gradients) of RNNs.
Recurrent neural networks have a rich history spanning four decades. Understanding this history provides perspective on why RNNs developed as they did and what problems they were designed to solve.
1980s: Origins
Early recurrent ideas appear in Hopfield networks (1982) and Jordan networks (1986), which feed a network's own past activity back as input; backpropagation through time was worked out in the same period.
1990s: Theoretical Foundations and Challenges
Elman networks (1990) popularized the simple recurrent architecture, the vanishing and exploding gradient problems were analyzed (Hochreiter, 1991; Bengio et al., 1994), and the LSTM was proposed as a remedy (Hochreiter & Schmidhuber, 1997).
2000s-2010s: Practical Breakthroughs
GPU training, better initialization, and gated cells (LSTM, later the GRU in 2014) made RNNs the workhorse of speech recognition, machine translation, and sequence-to-sequence learning.
2017-Present: Transformers and Beyond
The Transformer (2017) replaced recurrence with attention for most NLP workloads, while recent state-space models revive recurrence-style computation at scale.
Key Insight: RNNs solved the fundamental problem of sequence modeling that feedforward networks couldn't address. While Transformers have since superseded RNNs for many applications (particularly NLP), understanding RNNs remains essential—they illuminate core concepts, remain efficient for long sequences, and underlie many production systems.
Even as Transformers dominate NLP benchmarks, RNN concepts persist: the hidden state idea appears in state-space models, the recurrence appears in linear attention, and LSTM/GRU remain preferred for streaming applications where memory efficiency matters. Understanding RNNs provides deep insight into sequence modeling generally.
Before examining RNN details, let's articulate what properties an ideal sequence model should possess. This provides criteria for evaluating RNNs and understanding their design choices.
| Property | RNN | Transformer | State-Space |
|---|---|---|---|
| Variable length | ✓ | ✓ (with limits) | ✓ |
| Long-range deps | Challenging | ✓✓ | ✓✓ |
| Parameter efficiency | ✓✓ | ✓ | ✓✓ |
| Compute (training) | O(T) | O(T²) | O(T log T) |
| Compute (inference) | O(1)/step | O(T)/step | O(1)/step |
| Order sensitivity | ✓ | Needs pos enc | ✓ |
| Causal capable | ✓ | Needs masking | ✓ |
| Parallel training | Limited† | ✓✓ | ✓ |
| Stable training | Needs tricks | More stable | Stable |
| Streaming inference | ✓✓ | ✗ (needs cache) | ✓✓ |
†The hidden-state recurrence is inherently sequential: teacher forcing removes the need to generate tokens one at a time, but each $h_t$ still depends on $h_{t-1}$, so training cannot be parallelized across time the way Transformer training can. Parallelism is limited to the batch dimension and to the computation within each step.
No architecture is perfect. RNNs excel at parameter efficiency, streaming inference, and natural causality, but struggle with long-range dependencies and parallel training. Understanding these tradeoffs helps select the right model for each application.
The central concept in RNNs is the hidden state $h_t$—a fixed-size vector that summarizes the sequence history $(x_1, \ldots, x_t)$. This compression is both the key feature and fundamental limitation of RNNs.
The Compression View:
The hidden state $h_t \in \mathbb{R}^d$ must encode all information from the history that is relevant for: (1) predicting the next token $x_{t+1}$, and (2) every prediction the model will need to make later in the sequence.
With $d$ dimensions (typically 256-2048), the network must compress histories of potentially millions of tokens. This is an aggressive compression ratio, and something must give: the compression is lossy, so the network must learn to keep what the task needs and discard the rest, and information that is not refreshed tends to fade as new inputs overwrite it.
From an information theory standpoint: $d$ real-valued numbers can in principle store unlimited information (via arbitrary precision). In practice, precision is limited and gradients must flow—so effective capacity is far less. The hidden state capacity is roughly O(d log(1/ε)) bits where ε is precision. For d=512 and reasonable precision, this is thousands of bits—enough for substantial summaries but far short of raw storage.
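A quick back-of-the-envelope check of that estimate, assuming an effective precision of $\varepsilon = 10^{-3}$ per dimension (the exact value is an assumption; the point is the order of magnitude):

```python
import math

d = 512
eps = 1e-3                                   # assumed usable precision per dimension
bits_per_dim = math.log2(1.0 / eps)          # ~10 bits per dimension
print(f"~{d * bits_per_dim:.0f} bits")       # ~5103 bits: thousands, not millions
```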
What Does the Hidden State Encode?
Empirical studies of trained RNNs reveal that hidden states encode recent token identities, syntactic state (such as open brackets or unfinished clauses), coarse topic and sentiment, and approximate counts or positions.
Different dimensions specialize for different aspects. Some dimensions track syntax; others track semantics; still others count or track specific patterns.
The Compression Metaphor:
Consider reading a long novel. You don't remember every word—your mind maintains a compressed summary: the main characters and their relationships, the plot so far, the current scene, and a handful of vivid details.
This summary evolves as you read: new information updates it, irrelevant details fade. The RNN hidden state works similarly—it's a learned, differentiable compression scheme that retains task-relevant information while discarding noise.
The Hierarchy of Preservation:
| Information Type | Preservation | Example |
|---|---|---|
| Last 1-3 tokens | Excellent | Immediate context for prediction |
| Recent syntax | Good | Open parentheses, clause markers |
| Topic/theme | Moderate | General subject matter |
| Specific distant facts | Poor | Unless repeatedly referenced |
One of the most important properties of RNNs is parameter sharing across time: the same function $f(h_{t-1}, x_t; \theta)$ with the same parameters $\theta$ is applied at every timestep. This design choice has profound implications.
Statistical Efficiency:
Without parameter sharing, a model for sequences of length 100 would need 100× the parameters of a model for length 1—each position would have its own weights. This is wasteful: every position's weights must be learned from scratch, a pattern observed at position 10 teaches the model nothing about position 200, and the model cannot process positions beyond those seen in training.
With parameter sharing, observations at any position contribute to learning a single function. This dramatically improves statistical efficiency—we can learn from shorter sequences and generalize to longer ones.
Parameter sharing in RNNs is analogous to parameter sharing in CNNs. A convolutional kernel is applied at every spatial position with the same weights. An RNN cell is applied at every temporal position with the same weights. Both exploit translation invariance—patterns that occur at one position are the same as patterns at another position.
Generalization Across Positions:
Parameter sharing enables powerful generalization: a pattern learned at one position applies at every position, the model can be trained on short sequences and applied to longer ones, and rare patterns benefit from every occurrence regardless of where they appear.
The Tradeoff:
However, parameter sharing assumes that what happens at time $t$ is similar to what happens at time $t'$. This stationarity assumption may be violated: the beginning of a document behaves differently from its middle (a headline is not body text), and some processes genuinely change their dynamics over time.
RNNs address this through the hidden state: even with shared parameters, different histories produce different behaviors because $h_t$ carries positional information implicitly. The function is the same, but it's applied to different hidden states.
```python
import torch
import torch.nn as nn


class RNNCell(nn.Module):
    """
    Single RNN cell demonstrating parameter sharing.
    The SAME weights are used at every timestep.
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # These weights are shared across ALL timesteps
        self.W_ih = nn.Linear(input_size, hidden_size)    # Input -> Hidden
        self.W_hh = nn.Linear(hidden_size, hidden_size)   # Hidden -> Hidden

    def forward(self, x, h_prev):
        """
        Single step: h_t = tanh(W_ih @ x_t + W_hh @ h_{t-1})
        """
        return torch.tanh(self.W_ih(x) + self.W_hh(h_prev))


def process_sequence(cell, x_sequence):
    """
    Process full sequence using SAME cell at every position.
    This is parameter sharing in action.
    """
    batch_size = x_sequence.size(0)
    seq_length = x_sequence.size(1)

    # Initialize hidden state
    h = torch.zeros(batch_size, cell.hidden_size)

    hidden_states = []
    for t in range(seq_length):
        # SAME cell applied at every timestep
        # Parameters are identical; only h and x change
        h = cell(x_sequence[:, t], h)
        hidden_states.append(h)

    return torch.stack(hidden_states, dim=1)


# Parameter count comparison
input_size, hidden_size, seq_length = 100, 256, 1000

# With parameter sharing (RNN): same parameters for all timesteps
rnn_cell = RNNCell(input_size, hidden_size)
shared_params = sum(p.numel() for p in rnn_cell.parameters())
print(f"RNN (shared): {shared_params:,} parameters")

# Without sharing: separate parameters per timestep
unshared_params = shared_params * seq_length
print(f"No sharing: {unshared_params:,} parameters")
print(f"Reduction: {unshared_params / shared_params:.0f}x")
```

The recurrent formulation introduces a fundamental training challenge that will be explored in depth later: vanishing and exploding gradients. Understanding this challenge is crucial for appreciating why architectures like LSTM and GRU were developed.
The Problem in Brief:
To learn from long-range dependencies, gradients must flow backward through many timesteps during backpropagation. Consider the gradient of the loss at time $T$ with respect to the hidden state at time $1$:
$$\frac{\partial \mathcal{L}_T}{\partial h_1} = \frac{\partial \mathcal{L}_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_{T-1}} \cdot \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_2}{\partial h_1}$$
This is a product of $T-1$ Jacobian matrices. Each Jacobian $\frac{\partial h_{t+1}}{\partial h_t}$ depends on the hidden-to-hidden weight matrix and the activation function's derivative.
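For a tanh cell like the one implemented above, $h_{t+1} = \tanh(W_{ih} x_{t+1} + W_{hh} h_t)$, this Jacobian has a simple closed form (the squared term comes from the tanh derivative and is taken element-wise):

$$\frac{\partial h_{t+1}}{\partial h_t} = \operatorname{diag}\!\left(1 - h_{t+1}^{2}\right) W_{hh}$$

Saturated units (where $|h| \approx 1$) and small singular values of $W_{hh}$ both shrink this factor, and the shrinkage compounds across timesteps.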
If the typical eigenvalue of each Jacobian is $< 1$, the product shrinks exponentially: $(0.9)^{100} \approx 0.000027$. Gradients vanish—distant timesteps provide no learning signal. If eigenvalues are $> 1$, the product grows exponentially: $(1.1)^{100} \approx 13{,}781$. Gradients explode—training becomes unstable. Staying exactly at 1 requires perfect balance—a knife-edge condition.
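A quick numerical check of this knife-edge behavior (a toy simulation, not a real backward pass; each per-step Jacobian is modeled as a scaled orthogonal matrix so that every singular value equals the chosen scale):

```python
import torch

torch.manual_seed(0)
d, T = 64, 100

for scale, label in [(0.9, "contracting"), (1.1, "expanding")]:
    # Scaled orthogonal matrix: a clean stand-in for a per-step Jacobian
    # whose singular values all equal `scale`.
    Q, _ = torch.linalg.qr(torch.randn(d, d))
    W = scale * Q

    v = torch.randn(d)
    v = v / v.norm()                # unit-length "gradient" vector
    for _ in range(T):
        v = W.T @ v                 # one backward step through time
    print(f"{label}: gradient norm after {T} steps = {v.norm().item():.3e}")
    # contracting -> ~2.7e-05 (vanishing), expanding -> ~1.4e+04 (exploding)
```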
Why This Matters:
Vanishing gradients: The network cannot learn long-range dependencies. Recent positions dominate learning; distant positions are ignored.
Exploding gradients: Gradient values overflow, causing NaN losses and unstable training.
Solutions Preview:
Gradient clipping: Cap gradient magnitude to prevent explosion (but doesn't help vanishing); see the sketch after this list
Careful initialization: Initialize weights to preserve gradient magnitude initially
Gated architectures: LSTM and GRU use gates to create 'gradient highways' that allow gradients to flow without attenuation
Residual connections: Add skip connections that bypass the recurrence
Different architectures: Transformers avoid recurrence entirely, using attention for long-range connections
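Of these, gradient clipping is the simplest to apply in practice. A sketch of a single training step with toy data and a toy model; the one load-bearing line is the `torch.nn.utils.clip_grad_norm_` call:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=32, hidden_size=128, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 50, 32)        # (batch, time, features), toy inputs
target = torch.randn(8, 50, 128)  # toy regression targets

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0
# (prevents explosion, does nothing for vanishing).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```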
We will explore these solutions in detail when covering LSTM and GRU architectures.
This page has established the motivation for recurrent neural networks—why they exist and what problems they solve. The core insights are fundamental to understanding not just RNNs but sequence modeling generally.
What's Next:
The next page introduces Unfolded Computation—visualizing the RNN as an unrolled computational graph. This perspective is essential for understanding how RNNs are trained via backpropagation through time (BPTT) and why the gradient challenges arise.
You now understand WHY recurrent neural networks were developed and what fundamental problems they address. The hidden state + shared parameters formula is elegant and powerful. In the next page, we'll see exactly how this recurrence unfolds computationally and how gradients propagate through it.