Traditional feedforward neural networks suffer from a fundamental limitation: they have no memory. Each input is processed independently, with no awareness of what came before or after. For many tasks—image classification, tabular data prediction—this statelessness is perfectly acceptable. But for sequential data—language, speech, music, time series, video—this amnesia is catastrophic.
Consider reading a sentence: 'The cat sat on the mat.' Your understanding of 'sat' depends on knowing that 'cat' came before it. Your comprehension of 'mat' builds upon the entire preceding context. Without memory, each word would be processed as if it appeared in isolation, destroying the very structure that gives language meaning.
Hidden state dynamics are the mechanism by which Recurrent Neural Networks (RNNs) solve this problem. The hidden state acts as a form of working memory—a compressed representation of everything the network has seen so far, updated at each timestep as new information arrives. Understanding these dynamics is foundational to understanding how RNNs model sequences.
By the end of this page, you will deeply understand: (1) the mathematical formulation of hidden state updates, (2) how hidden states encode temporal information, (3) the geometric interpretation of hidden state trajectories, (4) initialization strategies and their impact, and (5) the capacity and limitations of hidden state memory.
At the heart of every RNN lies a deceptively simple equation that defines how hidden states evolve over time. This equation is the difference equation of the recurrent world, governing how information flows through time.
The canonical hidden state update:
$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
Where:

- $h_t$ is the hidden state at timestep $t$ (a vector in $\mathbb{R}^d$)
- $h_{t-1}$ is the hidden state from the previous timestep
- $x_t$ is the input at timestep $t$ (a vector in $\mathbb{R}^n$)
- $W_{hh}$ is the hidden-to-hidden (recurrent) weight matrix
- $W_{xh}$ is the input-to-hidden weight matrix
- $b_h$ is the bias vector
- $f$ is a nonlinear activation function, typically $\tanh$
This equation captures the essence of recurrence: the new hidden state is a function of both the previous hidden state and the current input. The network doesn't just react to the present—it integrates the present with a summary of the past.
Think of the hidden state update as two parallel streams being merged: W_{hh}h_{t-1} represents 'what I remember from before' while W_{xh}x_t represents 'what I'm seeing right now'. The activation function f blends these streams into a new memory representation, constrained to a bounded range (e.g., [-1, 1] for tanh).
```python
import numpy as np


class VanillaRNNCell:
    """
    A single RNN cell implementing the canonical hidden state dynamics.

    This implementation exposes the mathematical structure explicitly,
    prioritizing clarity over computational efficiency.
    """

    def __init__(self, input_dim: int, hidden_dim: int):
        """
        Initialize RNN cell with Xavier initialization for weights.

        Args:
            input_dim: Dimensionality of input vectors (n)
            hidden_dim: Dimensionality of hidden state vectors (d)
        """
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim

        # Xavier initialization helps maintain gradient flow
        # Scale for input-to-hidden: sqrt(2 / (input_dim + hidden_dim))
        self.W_xh = np.random.randn(hidden_dim, input_dim) * np.sqrt(2.0 / (input_dim + hidden_dim))

        # Scale for hidden-to-hidden: sqrt(2 / (2 * hidden_dim)) = sqrt(1 / hidden_dim)
        self.W_hh = np.random.randn(hidden_dim, hidden_dim) * np.sqrt(1.0 / hidden_dim)

        # Bias initialized to zero
        self.b_h = np.zeros((hidden_dim, 1))

    def forward(self, x_t: np.ndarray, h_prev: np.ndarray) -> np.ndarray:
        """
        Compute the hidden state update for a single timestep.

        Args:
            x_t: Input vector at time t, shape (input_dim, 1)
            h_prev: Hidden state from time t-1, shape (hidden_dim, 1)

        Returns:
            h_t: New hidden state at time t, shape (hidden_dim, 1)
        """
        # Compute the two contributions separately for clarity
        input_contribution = self.W_xh @ x_t      # What the current input contributes
        memory_contribution = self.W_hh @ h_prev  # What the previous state contributes

        # Combine and apply nonlinearity
        pre_activation = input_contribution + memory_contribution + self.b_h
        h_t = np.tanh(pre_activation)
        return h_t

    def forward_sequence(self, x_sequence: np.ndarray, h_0: np.ndarray = None) -> list:
        """
        Process an entire sequence, returning all hidden states.

        Args:
            x_sequence: Input sequence, shape (seq_len, input_dim, 1)
            h_0: Initial hidden state (optional, defaults to zeros)

        Returns:
            List of hidden states for each timestep
        """
        seq_len = x_sequence.shape[0]

        # Initialize hidden state to zeros if not provided
        if h_0 is None:
            h_0 = np.zeros((self.hidden_dim, 1))

        hidden_states = [h_0]
        h_t = h_0
        for t in range(seq_len):
            h_t = self.forward(x_sequence[t], h_t)
            hidden_states.append(h_t)

        return hidden_states  # Length: seq_len + 1 (includes h_0)


# Demonstration: Processing a simple sequence
if __name__ == "__main__":
    # Create a small RNN cell
    rnn = VanillaRNNCell(input_dim=4, hidden_dim=8)

    # Create a random sequence of 5 timesteps
    sequence = np.random.randn(5, 4, 1)

    # Process the sequence
    hidden_states = rnn.forward_sequence(sequence)

    print(f"Processed sequence of length {len(sequence)}")
    print(f"Got {len(hidden_states)} hidden states (including h_0)")
    print(f"Each hidden state has shape: {hidden_states[0].shape}")

    # Observe how hidden states evolve
    for t, h in enumerate(hidden_states):
        print(f"h_{t}: norm = {np.linalg.norm(h):.4f}, range = [{h.min():.3f}, {h.max():.3f}]")
```

To truly understand hidden state dynamics, we must develop a geometric intuition. Think of the hidden state as a point in a $d$-dimensional space. As the RNN processes a sequence, this point traces a trajectory—a path through hidden state space that encodes the sequence's meaning.
The hidden state space:
Trajectory dynamics:
When we apply the tanh activation, hidden states are constrained to the hypercube $[-1, 1]^d$. The dynamics of the system determine how trajectories move within this bounded space:
The geometric perspective reveals what mathematically trained practitioners see: RNNs are dynamical systems. Understanding their behavior requires the same tools used to analyze differential equations, chaos theory, and nonlinear dynamics. This perspective predicts phenomena like vanishing gradients before we even compute derivatives.
The role of the weight matrices:
The hidden-to-hidden matrix $W_{hh}$ fundamentally shapes the geometry of hidden state dynamics:
Eigenvalue analysis:
Consider the Jacobian of the hidden state update (ignoring the input for a moment): $$\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(f'(z_t)) \cdot W_{hh}$$
where $z_t = W_{hh}h_{t-1} + W_{xh}x_t + b_h$ and $f'$ is the derivative of the activation.
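This formula is easy to verify numerically. The sketch below (self-contained, with small made-up dimensions and random weights) compares the analytic Jacobian $\text{diag}(1 - h_t^2)\,W_{hh}$ against a central finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3  # small illustrative sizes (assumptions)

W_hh = rng.normal(size=(d, d)) * 0.5
W_xh = rng.normal(size=(d, n)) * 0.5
b_h = np.zeros(d)

h_prev = rng.normal(size=d)
x_t = rng.normal(size=n)

def step(h):
    """One hidden state update: h_t = tanh(W_hh h + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h + W_xh @ x_t + b_h)

# Analytic Jacobian, using tanh'(z) = 1 - tanh(z)^2 = 1 - h_t^2
h_t = step(h_prev)
J_analytic = np.diag(1.0 - h_t**2) @ W_hh

# Central finite differences, one column per hidden dimension
eps = 1e-6
J_numeric = np.zeros((d, d))
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    J_numeric[:, j] = (step(h_prev + e) - step(h_prev - e)) / (2 * eps)

print("max |analytic - numeric|:", np.max(np.abs(J_analytic - J_numeric)))
```

The maximum discrepancy comes out around floating-point noise, confirming that the per-step sensitivity of $h_t$ to $h_{t-1}$ is exactly the activation derivative modulating $W_{hh}$.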
The eigenvalues of this Jacobian determine local stability:
The challenge of training RNNs is finding weight configurations where eigenvalues stay close to 1, enabling the network to maintain relevant information across many timesteps.
| Eigenvalue Regime | Gradient Behavior | Memory Behavior | Training Implications |
|---|---|---|---|
| All $\lvert\lambda\rvert < 1$ | Vanishing gradients | Short-term memory only | Cannot learn long-range dependencies |
| Some $\lvert\lambda\rvert > 1$ | Exploding gradients | Chaotic, unstable | Training diverges without clipping |
| $\lvert\lambda\rvert \approx 1$ | Stable gradient flow | Preserves information | Ideal but hard to maintain |
| Mixed spectrum | Varying by direction | Selective memory | Most realistic trained regime |
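The contracting and expanding regimes from the table can be reproduced in a toy experiment (a sketch under assumed sizes, not from the text): normalize a random recurrent matrix to spectral radius 1, rescale it, and iterate the input-free update $h_t = \tanh(\alpha W h_{t-1})$:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 16  # illustrative hidden size (assumption)

W = rng.normal(size=(d, d))
W /= np.max(np.abs(np.linalg.eigvals(W)))  # normalize to spectral radius 1

def final_norm(scale, steps=50):
    """Iterate h <- tanh(scale * W @ h) from a random start; return ||h|| after `steps`."""
    h = rng.normal(size=d) * 0.5
    for _ in range(steps):
        h = np.tanh(scale * W @ h)
    return np.linalg.norm(h)

for scale in (0.5, 1.0, 1.5):
    print(f"spectral radius {scale}: ||h_50|| = {final_norm(scale):.4f}")
```

With radius 0.5 the state collapses to the origin (memory vanishes); with radius 1.5 the origin is unstable, but tanh keeps the trajectory bounded, so it settles on a nonzero attractor rather than diverging.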
The hidden state serves as a lossy compression of the entire sequence history. Understanding what it can and cannot represent is crucial for understanding RNN capabilities and limitations.
Information-theoretic perspective:
At each timestep, the RNN faces an information bottleneck:
If the input sequence contains $t$ tokens, each with $n$ dimensions of information, the network has seen $t \times n$ values but must represent everything in just $d$ dimensions. This forces aggressive compression.
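To make the bottleneck concrete, here is the arithmetic for one illustrative setting (the specific numbers are assumptions, not from the text):

```python
# Illustrative numbers: a 100-token sequence of 50-dim inputs
# compressed into a 128-dim hidden state.
seq_len, input_dim, hidden_dim = 100, 50, 128

values_seen = seq_len * input_dim  # scalar values observed over the sequence
values_kept = hidden_dim           # scalars available at any single timestep
compression_ratio = values_seen / values_kept

print(f"observed: {values_seen}, stored: {values_kept}, "
      f"compression ratio: {compression_ratio:.0f}x")
```

A roughly 39-fold compression in this setting, and the ratio grows linearly with sequence length while the hidden state stays fixed.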
What gets preserved:
Through training, RNNs learn to preserve task-relevant information and discard irrelevant details:
What gets lost:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assumes VanillaRNNCell from the previous listing is in scope.


def visualize_hidden_trajectories(rnn_cell, sequences, labels, figsize=(12, 8)):
    """
    Visualize how different sequences trace different trajectories
    through hidden state space.

    This demonstrates how the hidden state encodes sequence identity
    and how similar sequences have similar trajectories.
    """
    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # Collect all hidden states for PCA fitting
    all_hidden_states = []
    trajectories = {}

    for seq_name, sequence in sequences.items():
        hidden_states = rnn_cell.forward_sequence(sequence)
        # Convert to 2D array: (seq_len+1, hidden_dim)
        hs_array = np.array([h.squeeze() for h in hidden_states])
        trajectories[seq_name] = hs_array
        all_hidden_states.append(hs_array)

    # Stack all states for PCA
    all_states = np.vstack(all_hidden_states)

    # Fit PCA to reduce to 2D for visualization
    pca = PCA(n_components=2)
    pca.fit(all_states)

    # Plot 1: Trajectories in 2D PCA space
    ax1 = axes[0, 0]
    for seq_name, hs_array in trajectories.items():
        projected = pca.transform(hs_array)
        ax1.plot(projected[:, 0], projected[:, 1], marker='o', markersize=4, label=seq_name)
        # Mark start and end
        ax1.scatter(projected[0, 0], projected[0, 1], marker='s', s=100, edgecolors='black', zorder=5)
        ax1.scatter(projected[-1, 0], projected[-1, 1], marker='*', s=150, edgecolors='black', zorder=5)
    ax1.set_xlabel('PC1')
    ax1.set_ylabel('PC2')
    ax1.set_title('Hidden State Trajectories (PCA projection)')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    # Plot 2: Hidden state norms over time
    ax2 = axes[0, 1]
    for seq_name, hs_array in trajectories.items():
        norms = np.linalg.norm(hs_array, axis=1)
        ax2.plot(norms, marker='o', markersize=4, label=seq_name)
    ax2.set_xlabel('Timestep')
    ax2.set_ylabel('||h_t||_2')
    ax2.set_title('Hidden State Magnitude Over Time')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    # Plot 3: Individual dimension activations
    ax3 = axes[1, 0]
    seq_name = list(trajectories.keys())[0]
    hs_array = trajectories[seq_name]
    im = ax3.imshow(hs_array.T, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
    ax3.set_xlabel('Timestep')
    ax3.set_ylabel('Hidden Dimension')
    ax3.set_title(f'Hidden State Activations: {seq_name}')
    plt.colorbar(im, ax=ax3)

    # Plot 4: Distance between consecutive hidden states
    ax4 = axes[1, 1]
    for seq_name, hs_array in trajectories.items():
        distances = np.linalg.norm(np.diff(hs_array, axis=0), axis=1)
        ax4.plot(range(1, len(distances) + 1), distances, marker='o', markersize=4, label=seq_name)
    ax4.set_xlabel('Timestep')
    ax4.set_ylabel('||h_t - h_{t-1}||_2')
    ax4.set_title('Hidden State Change Rate')
    ax4.legend()
    ax4.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create RNN cell
    rnn = VanillaRNNCell(input_dim=3, hidden_dim=16)

    # Create different types of sequences
    sequences = {
        'constant': np.ones((10, 3, 1)) * 0.5,
        'increasing': np.linspace(0, 1, 10).reshape(10, 1, 1) * np.ones((1, 3, 1)),
        'random': np.random.randn(10, 3, 1),
        'oscillating': np.sin(np.linspace(0, 4 * np.pi, 10)).reshape(10, 1, 1) * np.ones((1, 3, 1)),
    }

    fig = visualize_hidden_trajectories(rnn, sequences, labels=None)
    plt.savefig('hidden_state_trajectories.png', dpi=150, bbox_inches='tight')
    plt.show()
```

Vanilla RNNs exhibit strong recency bias: recent inputs have far more influence on the current hidden state than distant inputs. This is mathematically inevitable because each hidden state update dilutes past information. After $k$ timesteps, the contribution of an early input has been multiplied by $k$ Jacobian factors $\text{diag}(f'(z))\,W_{hh}$; when the norms of these factors are below 1, the contribution decays exponentially in $k$. This is why LSTM and GRU architectures were developed.
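The recency bias described above is directly measurable. In this self-contained sketch (arbitrary illustrative sizes; the recurrent weights are deliberately scaled to a contracting regime), we perturb only the first input of a sequence and track how far the perturbed hidden states drift from the unperturbed ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 4, 16, 30  # illustrative sizes (assumptions)

W_xh = rng.normal(size=(d, n)) * np.sqrt(2.0 / (n + d))
W_hh = rng.normal(size=(d, d)) * (0.8 / np.sqrt(d))  # spectral radius ~ 0.8: contracting

def run(xs):
    """Run the vanilla update h = tanh(W_hh h + W_xh x), returning all states."""
    h = np.zeros(d)
    states = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
        states.append(h)
    return states

xs = [rng.normal(size=n) for _ in range(T)]
xs_pert = [xs[0] + 1.0] + xs[1:]  # perturb only the very first input

base, pert = run(xs), run(xs_pert)
for t in (0, 5, 10, 20, T - 1):
    print(f"t={t:2d}  ||dh_t|| = {np.linalg.norm(base[t] - pert[t]):.6f}")
```

The printed differences shrink roughly geometrically with $t$: by the end of the sequence, the hidden state has nearly forgotten that the first input was ever different.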
The choice of initial hidden state $h_0$ is often overlooked but can significantly impact both training dynamics and inference quality. Different strategies suit different applications.
Common initialization approaches:
```python
import torch
import torch.nn as nn


class RNNWithInitialization(nn.Module):
    """
    RNN that supports multiple hidden state initialization strategies.
    """

    def __init__(
        self,
        input_dim: int,
        hidden_dim: int,
        init_strategy: str = 'zero',
        context_dim: int = None
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.init_strategy = init_strategy

        # Core RNN cell
        self.rnn_cell = nn.RNNCell(input_dim, hidden_dim, nonlinearity='tanh')

        # Initialization depends on strategy
        if init_strategy == 'learned':
            # Learned initial state (trainable parameter)
            self.h0 = nn.Parameter(torch.zeros(1, hidden_dim))
        elif init_strategy == 'context':
            # Context-conditional initialization
            if context_dim is None:
                raise ValueError("context_dim required for 'context' strategy")
            self.context_projection = nn.Linear(context_dim, hidden_dim)
            self.h0 = None
        else:
            # Zero initialization (default)
            self.h0 = None

    def get_initial_state(self, batch_size: int, context=None):
        """Get initial hidden state based on strategy."""
        if self.init_strategy == 'zero':
            return torch.zeros(batch_size, self.hidden_dim)
        elif self.init_strategy == 'learned':
            # Expand learned h0 to batch size
            return self.h0.expand(batch_size, -1)
        elif self.init_strategy == 'context':
            if context is None:
                raise ValueError("Context required for context initialization")
            return torch.tanh(self.context_projection(context))
        elif self.init_strategy == 'random':
            return torch.randn(batch_size, self.hidden_dim) * 0.01
        else:
            raise ValueError(f"Unknown init strategy: {self.init_strategy}")

    def forward(self, x, context=None):
        """
        Process a sequence.

        Args:
            x: Input tensor of shape (batch, seq_len, input_dim)
            context: Optional context for initialization (batch, context_dim)

        Returns:
            hidden_states: All hidden states (batch, seq_len, hidden_dim)
            final_state: Last hidden state (batch, hidden_dim)
        """
        batch_size, seq_len, _ = x.shape

        # Get initial state and move it to the input's device
        h_t = self.get_initial_state(batch_size, context).to(x.device)

        # Collect hidden states
        hidden_states = []

        # Process sequence
        for t in range(seq_len):
            h_t = self.rnn_cell(x[:, t, :], h_t)
            hidden_states.append(h_t)

        # Stack along the time axis: (batch, seq_len, hidden)
        hidden_states = torch.stack(hidden_states, dim=1)
        return hidden_states, h_t


# Demonstration: Compare initialization strategies
def compare_initializations():
    """Compare how different initializations affect hidden state trajectories."""
    torch.manual_seed(42)

    input_dim, hidden_dim = 4, 8
    batch_size, seq_len = 2, 20

    # Create models with different strategies
    models = {
        'zero': RNNWithInitialization(input_dim, hidden_dim, 'zero'),
        'learned': RNNWithInitialization(input_dim, hidden_dim, 'learned'),
        'context': RNNWithInitialization(input_dim, hidden_dim, 'context', context_dim=3),
    }

    # Same input sequence for all models
    x = torch.randn(batch_size, seq_len, input_dim)
    context = torch.randn(batch_size, 3)  # For context-based init

    results = {}
    for name, model in models.items():
        if name == 'context':
            hidden_states, final = model(x, context)
        else:
            hidden_states, final = model(x)

        # Compute statistics
        mean_norm = hidden_states.norm(dim=-1).mean().item()
        final_norm = final.norm(dim=-1).mean().item()
        results[name] = {
            'mean_hidden_norm': mean_norm,
            'final_state_norm': final_norm,
            'init_norm': hidden_states[:, 0, :].norm(dim=-1).mean().item()
        }

        print(f"\n{name.upper()} initialization:")
        print(f"  Initial state norm: {results[name]['init_norm']:.4f}")
        print(f"  Mean hidden norm:   {mean_norm:.4f}")
        print(f"  Final state norm:   {final_norm:.4f}")


if __name__ == "__main__":
    compare_initializations()
```

For most applications, start with zero initialization. It's simple, stable, and works well for sequences of moderate length (20+ tokens). Switch to learned initialization if you notice that short sequences underperform, or if the task has a natural 'starting context' that the model should learn. Use context-conditional initialization for encoder-decoder architectures.
The hidden state's finite dimensionality creates fundamental limits on what an RNN can represent. Understanding these limits helps predict when RNNs will succeed and when they'll fail.
Theoretical capacity:
A hidden state of dimension $d$ using tanh activation can represent approximately $2^d$ distinct stable patterns (considering each dimension as a binary gate at ±1). However, practical capacity is much lower due to:
Empirical capacity rules of thumb:
| Task Type | Recommended d | Reasoning |
|---|---|---|
| Binary classification | 64-128 | Two states, but need margin for uncertainty |
| Sentiment analysis | 128-256 | Few classes, moderate sequence complexity |
| Named entity recognition | 256-512 | Many entity types, needs position awareness |
| Language modeling (small) | 256-512 | Vocabulary ~10K, limited context |
| Language modeling (large) | 512-2048 | Vocabulary ~50K+, long contexts |
| Machine translation | 512-1024 | Two languages, complex alignments |
Known limitations of vanilla RNN hidden states:
Finite memory horizon: Due to gradient vanishing, information from more than ~10-20 timesteps ago is effectively lost
Catastrophic interference: New information can overwrite old information unpredictably. There's no mechanism to 'protect' important memories
No explicit addressing: The hidden state is a single vector—there's no way to 'look up' specific past information
Sequential bottleneck: All information must flow through a single fixed-size vector, regardless of sequence length or complexity
No uncertainty quantification: The hidden state represents a point estimate, not a distribution over possible states
These limitations motivated the development of LSTM, GRU, and attention mechanisms—each addressing some subset of these issues.
Increasing hidden dimension d improves capacity but also increases computational cost (O(d²) for matrix multiplications) and parameter count (O(d²) weights). Moreover, larger hidden states don't solve the fundamental gradient flow problem—they just delay it. LSTM/GRU architectures address gradient flow directly, making hidden size increases more effective.
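The quadratic growth in parameters is easy to tabulate (a quick sketch; the input size of 64 is an arbitrary assumption):

```python
def rnn_cell_params(input_dim: int, hidden_dim: int) -> int:
    """Parameter count of a vanilla RNN cell: W_hh (d*d), W_xh (d*n), and b_h (d)."""
    return hidden_dim * hidden_dim + hidden_dim * input_dim + hidden_dim

for d in (128, 256, 512, 1024):
    print(f"d={d:4d}: {rnn_cell_params(64, d):,} parameters")
```

Doubling the hidden dimension roughly quadruples the parameter count once the $d^2$ term dominates, which is why simply scaling up vanilla RNNs is an expensive and ultimately ineffective remedy.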
Armed with theoretical understanding, let's examine how hidden state dynamics manifest in real-world applications and what practitioners must consider.
Observing hidden states during training:
Monitoring hidden state statistics provides valuable diagnostic information:
Saturation: If hidden states are frequently at ±1 (tanh saturation), gradients will be small. This indicates the network is 'maximally confident' at intermediate layers—often a sign of learning rate being too high or initialization being poor.
Collapse: If hidden states cluster near zero across different inputs, the network isn't discriminating between inputs. This suggests vanishing gradients or insufficient model capacity.
Explosion: Growing hidden state norms indicate potential instability. If norms grow without bound during forward passes, something is wrong with initialization or architecture.
Healthy dynamics: Hidden states should span a reasonable range (say, [-0.5, 0.5] on average with occasional extremes) and should differ meaningfully for different input sequences.
```python
import torch
import torch.nn as nn
from collections import defaultdict


class RNNWithDiagnostics(nn.Module):
    """
    RNN that tracks hidden state statistics for debugging and monitoring.
    """

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.hidden_dim = hidden_dim

        # Diagnostics storage
        self.diagnostics = defaultdict(list)
        self.track_diagnostics = False

    def compute_diagnostics(self, hidden_states, name_prefix=""):
        """Compute and store diagnostic statistics."""
        # hidden_states: (batch, seq_len, hidden_dim)

        # Saturation: fraction of values near ±1
        abs_values = hidden_states.abs()
        saturation = (abs_values > 0.9).float().mean().item()

        # Per-timestep norms
        norms = hidden_states.norm(dim=-1)  # (batch, seq_len)
        mean_norm = norms.mean().item()
        max_norm = norms.max().item()

        # Value distribution
        mean_val = hidden_states.mean().item()
        std_val = hidden_states.std().item()

        # Temporal gradient: how much states change between timesteps
        if hidden_states.shape[1] > 1:
            diffs = (hidden_states[:, 1:, :] - hidden_states[:, :-1, :]).norm(dim=-1)
            mean_change = diffs.mean().item()
        else:
            mean_change = 0.0

        stats = {
            f'{name_prefix}saturation': saturation,
            f'{name_prefix}mean_norm': mean_norm,
            f'{name_prefix}max_norm': max_norm,
            f'{name_prefix}mean_value': mean_val,
            f'{name_prefix}std_value': std_val,
            f'{name_prefix}mean_change': mean_change,
        }
        for key, value in stats.items():
            self.diagnostics[key].append(value)
        return stats

    def forward(self, x, h_0=None):
        hidden_states, h_n = self.rnn(x, h_0)
        if self.track_diagnostics:
            self.compute_diagnostics(hidden_states, name_prefix="hidden_")
        return hidden_states, h_n

    def get_diagnostic_summary(self):
        """Return summary statistics of tracked diagnostics."""
        summary = {}
        for key, values in self.diagnostics.items():
            if len(values) > 0:
                values_tensor = torch.tensor(values)
                summary[key] = {
                    'mean': values_tensor.mean().item(),
                    'std': values_tensor.std().item(),
                    'min': values_tensor.min().item(),
                    'max': values_tensor.max().item(),
                }
        return summary

    def check_health(self):
        """Print diagnostic health check."""
        summary = self.get_diagnostic_summary()
        print("\n=== Hidden State Health Check ===")

        if 'hidden_saturation' in summary:
            sat = summary['hidden_saturation']['mean']
            if sat > 0.5:
                print(f"⚠️ High saturation: {sat:.2%} of values near ±1")
                print("   → Consider reducing learning rate or checking initialization")
            else:
                print(f"✓ Saturation OK: {sat:.2%}")

        if 'hidden_mean_norm' in summary:
            norm = summary['hidden_mean_norm']['mean']
            if norm < 0.1:
                print(f"⚠️ Low hidden norms: {norm:.4f}")
                print("   → Possible vanishing activations")
            elif norm > 0.9 * (self.hidden_dim ** 0.5):
                print(f"⚠️ High hidden norms: {norm:.4f}")
                print("   → Possible saturation or explosion")
            else:
                print(f"✓ Hidden norms OK: {norm:.4f}")

        if 'hidden_std_value' in summary:
            std = summary['hidden_std_value']['mean']
            if std < 0.1:
                print(f"⚠️ Low variance: {std:.4f}")
                print("   → Hidden states may be collapsing")
            else:
                print(f"✓ Variance OK: {std:.4f}")

        if 'hidden_mean_change' in summary:
            change = summary['hidden_mean_change']['mean']
            print(f"  Temporal dynamics: avg change = {change:.4f}")


# Example: Training loop with diagnostics
def training_with_diagnostics():
    """Example training loop with hidden state monitoring."""
    model = RNNWithDiagnostics(input_dim=10, hidden_dim=32)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    # Enable diagnostics
    model.track_diagnostics = True
    model.diagnostics.clear()

    # Simulated training
    for epoch in range(5):
        # Fake batch
        x = torch.randn(16, 20, 10)        # (batch, seq, features)
        targets = torch.randn(16, 20, 32)  # Fake targets

        optimizer.zero_grad()
        hidden_states, _ = model(x)
        loss = ((hidden_states - targets) ** 2).mean()
        loss.backward()
        optimizer.step()

        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

    # Check health after training
    model.check_health()


if __name__ == "__main__":
    training_with_diagnostics()
```

We've explored the fundamental mechanism that gives RNNs their power: the hidden state. Let's consolidate the key insights:

- The hidden state evolves by the update $h_t = f(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$, blending a summary of the past with the current input.
- Geometrically, processing a sequence traces a trajectory through a bounded hidden state space, and the eigenvalues of the update's Jacobian govern whether information decays, explodes, or persists.
- The hidden state is a lossy compression of the sequence history, with strong recency bias and a finite memory horizon.
- The choice of initial state $h_0$ (zero, learned, context-conditional, or random) affects training dynamics, especially for short sequences.
- Monitoring saturation, norms, and variance of hidden states provides practical diagnostics during training.
What's next:
Now that we understand how hidden states encode temporal information, the next page explores parameter sharing in time—the architectural decision that makes RNNs fundamentally different from other neural networks. We'll see how sharing weights across timesteps enables RNNs to generalize to variable-length sequences while dramatically reducing parameter count.
You now have a deep understanding of hidden state dynamics in RNNs. You understand the mathematical formulation, geometric interpretation, capacity limits, and practical monitoring strategies. This foundation will support your understanding of all subsequent RNN concepts.