Recurrent Neural Networks possess a remarkable theoretical capability: the ability to model arbitrarily long dependencies in sequential data. In principle, an RNN can remember information from the very first element of a sequence and use it to influence predictions at the very last. This is what makes RNNs fundamentally different from feedforward networks—they have memory.
However, between theoretical capability and practical reality lies a chasm. Training RNNs on long sequences is notoriously difficult, and the root cause traces directly to how gradients flow backward through time. When we apply backpropagation through time (BPTT), gradients must traverse the same recurrent connections repeatedly—once for each timestep in the sequence. What happens to these gradients during this journey determines whether our RNN can actually learn long-range dependencies or whether it effectively becomes blind to distant past information.
This page provides a rigorous mathematical analysis of gradient flow in RNNs. You will understand exactly how gradients propagate through time, why the recurrent structure creates multiplicative dynamics that can be unstable, and how to analyze the spectral properties of weight matrices that determine gradient behavior. This foundation is essential for understanding the vanishing and exploding gradient problems covered in subsequent pages.
To understand gradient flow, we must first derive the gradients produced by Backpropagation Through Time (BPTT). Consider a simple (Elman) RNN with the following recurrence:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
where $h_t$ is the hidden state at time $t$, $x_t$ is the input, $y_t$ is the output, $W_{hh}$, $W_{xh}$, and $W_{hy}$ are the recurrent, input-to-hidden, and hidden-to-output weight matrices, and $b_h$, $b_y$ are bias vectors.
Suppose we have a sequence of length $T$ and a loss $\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t$ where $\mathcal{L}_t$ depends on $y_t$. We want to compute the gradient of $\mathcal{L}$ with respect to all parameters.
While all parameter gradients are important, the gradient with respect to the recurrent weight matrix W_hh is particularly revealing. This gradient shows how information from future losses propagates back through the recurrent connections to earlier timesteps—the essence of learning temporal dependencies.
Deriving the gradient for $W_{hh}$:
Using the chain rule, the gradient of $\mathcal{L}$ with respect to $W_{hh}$ involves contributions from all timesteps:
$$\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W_{hh}}$$
For each timestep $t$, the loss $\mathcal{L}_t$ depends on $h_t$, which in turn depends on $W_{hh}$ both directly (via the computation of $h_t$) and indirectly (via all previous hidden states $h_1, h_2, \ldots, h_{t-1}$). This gives us:
$$\frac{\partial \mathcal{L}_t}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_{hh}}$$
The term $\frac{\partial h_t}{\partial h_k}$ is where things get interesting. This is a Jacobian chain—we must propagate the gradient from $h_t$ back through all intermediate hidden states to $h_k$.
```python
import numpy as np

def compute_bptt_gradients(xs, hs, ys, targets, Whh, Wxh, Why, bh, by):
    """
    Compute BPTT gradients for a simple RNN.

    Args:
        xs: List of input vectors [x_1, x_2, ..., x_T]
        hs: List of hidden states [h_0, h_1, ..., h_T] (h_0 is initial state)
        ys: List of output vectors [y_1, y_2, ..., y_T]
        targets: List of target vectors [t_1, t_2, ..., t_T]
        Whh, Wxh, Why, bh, by: Model parameters

    Returns:
        Gradients for all parameters
    """
    T = len(xs)
    n = Whh.shape[0]  # hidden size

    # Initialize gradients
    dWhh = np.zeros_like(Whh)
    dWxh = np.zeros_like(Wxh)
    dWhy = np.zeros_like(Why)
    dbh = np.zeros_like(bh)
    dby = np.zeros_like(by)

    # Backward pass through time
    dh_next = np.zeros((n, 1))  # Gradient flowing from future

    for t in reversed(range(T)):
        # Gradient of loss at time t w.r.t. output
        dy = ys[t] - targets[t]  # Assuming MSE loss

        # Gradient w.r.t. output weights
        dWhy += np.dot(dy, hs[t+1].T)
        dby += dy

        # Gradient w.r.t. hidden state (from output + from future)
        dh = np.dot(Why.T, dy) + dh_next

        # Gradient through tanh nonlinearity
        # h_t = tanh(pre_activation), so d(pre_act) = dh * (1 - h_t^2)
        dh_raw = dh * (1 - hs[t+1] ** 2)

        # Gradients w.r.t. parameters at this timestep
        dWhh += np.dot(dh_raw, hs[t].T)
        dWxh += np.dot(dh_raw, xs[t].T)
        dbh += dh_raw

        # Gradient to propagate to previous timestep
        # This is the KEY: gradient flows through Whh.T
        dh_next = np.dot(Whh.T, dh_raw)

    return dWhh, dWxh, dWhy, dbh, dby


def visualize_gradient_flow(Whh, T, h_size=10):
    """
    Visualize how gradient magnitudes change as we backpropagate through time.
    This demonstrates the multiplicative nature of gradient flow.
    """
    # Simulate gradient flow backward through T timesteps
    gradient_norms = []

    # Start with unit gradient at time T
    grad = np.ones((h_size, 1))
    grad = grad / np.linalg.norm(grad)

    # Assume average gradient through tanh is about 0.8
    # (tanh derivative ranges from 0 to 1, typically ~0.5-0.9 in practice)
    tanh_deriv = 0.8

    for t in range(T):
        # Each backward step multiplies by Whh.T and the tanh derivative
        grad = Whh.T @ grad * tanh_deriv
        gradient_norms.append(np.linalg.norm(grad))

    return gradient_norms


# Example: Compare gradient flow with different weight initializations
np.random.seed(42)
h_size = 100
T = 100

# Case 1: Small weights (will vanish)
Whh_small = np.random.randn(h_size, h_size) * 0.01
norms_small = visualize_gradient_flow(Whh_small, T, h_size)

# Case 2: Orthogonal weights (more stable)
Q, _ = np.linalg.qr(np.random.randn(h_size, h_size))
norms_ortho = visualize_gradient_flow(Q, T, h_size)

# Case 3: Large weights (will explode)
Whh_large = np.random.randn(h_size, h_size) * 1.5
norms_large = visualize_gradient_flow(Whh_large, T, h_size)

print("Gradient norm after 100 timesteps:")
print(f"  Small weights (σ=0.01): {norms_small[-1]:.2e}")
print(f"  Orthogonal weights:     {norms_ortho[-1]:.2e}")
print(f"  Large weights (σ=1.5):  {norms_large[-1]:.2e}")
```

The heart of gradient flow analysis lies in understanding the Jacobian chain $\frac{\partial h_t}{\partial h_k}$. Let's derive this precisely.
Single-step Jacobian:
For the transition from $h_{t-1}$ to $h_t$:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
The Jacobian is:
$$\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh'(z_t)) \cdot W_{hh}$$
where $z_t = W_{hh} h_{t-1} + W_{xh} x_t + b_h$ is the pre-activation, and $\tanh'(z) = 1 - \tanh^2(z)$ is the derivative of tanh (the "sech squared" function).
Note that $\text{diag}(\tanh'(z_t))$ is a diagonal matrix with values in $(0, 1]$—it can only shrink gradient magnitudes, never amplify them.
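As a quick numerical check of this claim, here is a minimal NumPy sketch (the values of $z$ are illustrative) showing how fast $\tanh'$ falls off once pre-activations move away from zero, so saturated units pass almost no gradient backward:

```python
import numpy as np

# tanh'(z) = 1 - tanh(z)^2 equals 1 only at z = 0 and decays quickly as |z| grows
z = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
tanh_deriv = 1 - np.tanh(z) ** 2

for zi, di in zip(z, tanh_deriv):
    print(f"z = {zi:4.1f}  ->  tanh'(z) = {di:.4f}")
```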
Multi-step Jacobian:
For the gradient to flow from timestep $t$ back to timestep $k$ (where $k < t$), we chain these Jacobians:
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} \text{diag}(\tanh'(z_i)) \cdot W_{hh}$$
This product of $(t-k)$ matrices is where instability arises. Each factor includes W_hh, so we're effectively computing something like (D·W)^(t-k) where D is a shrinking diagonal matrix. If the spectral radius of W_hh is less than 1, this product shrinks exponentially. If greater than 1, it grows exponentially. Achieving stable gradients requires precise balancing.
Matrix power interpretation:
To gain intuition, consider a simplified case where the activation derivative is constant (say, $\tanh'(z) \approx \gamma$ for some $\gamma \in (0, 1]$). Then:
$$\frac{\partial h_t}{\partial h_k} \approx (\gamma W_{hh})^{t-k}$$
This is a matrix power. The behavior of $A^n$ as $n \to \infty$ is determined by the eigenvalue decomposition of $A$. If $A = V \Lambda V^{-1}$ where $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$, then:
$$A^n = V \Lambda^n V^{-1} = V \cdot \text{diag}(\lambda_1^n, \ldots, \lambda_n^n) \cdot V^{-1}$$
The behavior is dominated by the largest eigenvalue (in absolute value), called the spectral radius $\rho(A) = \max_i |\lambda_i|$:
| Spectral Radius ρ(γW) | Gradient Behavior | Learning Implication | Typical Cause |
|---|---|---|---|
| ρ < 0.9 | Rapid exponential decay | Cannot learn long-range dependencies | Small weight initialization |
| 0.9 ≤ ρ < 1.0 | Slow exponential decay | Limited but possible long-range learning | Conservative initialization |
| ρ ≈ 1.0 | Bounded, stable flow | Optimal for long-range dependencies | Careful initialization (orthogonal) |
| 1.0 < ρ ≤ 1.1 | Slow exponential growth | Possible instability, gradient clipping needed | Aggressive initialization |
| ρ > 1.1 | Rapid exponential explosion | Training failure (NaN), optimization instability | Large weights, poor initialization |
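To make these regimes concrete, apply the scalar approximation above to a dependency spanning 50 timesteps (values rounded):

$$0.9^{50} \approx 5.2 \times 10^{-3}, \qquad 0.99^{50} \approx 0.61, \qquad 1.0^{50} = 1, \qquad 1.1^{50} \approx 117$$

A gradient scaled down by three orders of magnitude over 50 steps is effectively invisible next to short-range contributions, while a factor of roughly 100 in the other direction is enough to destabilize a parameter update.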
```python
import numpy as np
import matplotlib.pyplot as plt

def analyze_jacobian_chain(Whh, T, gamma=0.8):
    """
    Analyze the Jacobian chain for gradient flow.

    Args:
        Whh: Recurrent weight matrix
        T: Number of timesteps to analyze
        gamma: Average tanh derivative (approximation)

    Returns:
        Analysis results including spectral properties
    """
    n = Whh.shape[0]
    effective_matrix = gamma * Whh

    # Compute eigenvalues
    eigenvalues = np.linalg.eigvals(effective_matrix)
    spectral_radius = np.max(np.abs(eigenvalues))

    # Compute gradient norms at each timestep
    gradient_norms = []
    jacobian_product = np.eye(n)

    for t in range(T):
        jacobian_product = effective_matrix @ jacobian_product
        gradient_norms.append(np.linalg.norm(jacobian_product, ord=2))

    # Theoretical prediction based on spectral radius
    theoretical_norms = [spectral_radius ** t for t in range(T)]

    return {
        'spectral_radius': spectral_radius,
        'eigenvalues': eigenvalues,
        'gradient_norms': gradient_norms,
        'theoretical_norms': theoretical_norms,
        'condition_number': np.linalg.cond(Whh)
    }


def eigenvalue_distribution_plot(eigenvalues, title="Eigenvalue Distribution"):
    """
    Plot eigenvalues in the complex plane with unit circle reference.
    """
    fig, ax = plt.subplots(figsize=(8, 8))

    # Unit circle
    theta = np.linspace(0, 2*np.pi, 100)
    ax.plot(np.cos(theta), np.sin(theta), 'b--', label='Unit Circle', alpha=0.5)

    # Eigenvalues
    ax.scatter(eigenvalues.real, eigenvalues.imag, c='red', s=50, label='Eigenvalues')

    ax.set_xlim(-2, 2)
    ax.set_ylim(-2, 2)
    ax.set_aspect('equal')
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.set_xlabel('Real Part')
    ax.set_ylabel('Imaginary Part')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

    return fig


# Demonstrate with different initializations
np.random.seed(42)
n = 50  # Hidden size
T = 50  # Timesteps

print("=" * 60)
print("JACOBIAN CHAIN ANALYSIS FOR RNN GRADIENT FLOW")
print("=" * 60)

# 1. Xavier/Glorot initialization (common default)
Whh_xavier = np.random.randn(n, n) / np.sqrt(n)
results_xavier = analyze_jacobian_chain(Whh_xavier, T)
print(f"\nXavier Initialization:")
print(f"  Spectral Radius: {results_xavier['spectral_radius']:.4f}")
print(f"  Gradient norm at t=50: {results_xavier['gradient_norms'][-1]:.2e}")

# 2. Orthogonal initialization (recommended for RNNs)
Q, _ = np.linalg.qr(np.random.randn(n, n))
Whh_ortho = Q
results_ortho = analyze_jacobian_chain(Whh_ortho, T)
print(f"\nOrthogonal Initialization:")
print(f"  Spectral Radius: {results_ortho['spectral_radius']:.4f}")
print(f"  Gradient norm at t=50: {results_ortho['gradient_norms'][-1]:.2e}")

# 3. Identity + small noise (similar to IRNN)
Whh_irnn = np.eye(n) + np.random.randn(n, n) * 0.01
results_irnn = analyze_jacobian_chain(Whh_irnn, T)
print(f"\nIdentity + Noise (IRNN-style):")
print(f"  Spectral Radius: {results_irnn['spectral_radius']:.4f}")
print(f"  Gradient norm at t=50: {results_irnn['gradient_norms'][-1]:.2e}")

# 4. Random initialization with larger variance (unstable)
Whh_large = np.random.randn(n, n) * 1.5 / np.sqrt(n)
results_large = analyze_jacobian_chain(Whh_large, T)
print(f"\nLarge Variance Initialization:")
print(f"  Spectral Radius: {results_large['spectral_radius']:.4f}")
print(f"  Gradient norm at t=50: {results_large['gradient_norms'][-1]:.2e}")
```

The spectral properties of the recurrent weight matrix $W_{hh}$ are central to understanding gradient dynamics. Let's develop a deeper understanding of how eigenvalue structure affects learning.
Eigenvalue decomposition and gradient flow:
For a diagonalizable matrix $W_{hh} = V \Lambda V^{-1}$, the matrix power becomes:
$$(\gamma W_{hh})^k = V (\gamma \Lambda)^k V^{-1}$$
If we express the gradient at the final time as $g_T$, then the gradient at time $t$ is approximately:
$$g_t \approx V (\gamma \Lambda)^{T-t} V^{-1} g_T$$
This reveals several important properties:
1. Direction selectivity:
Not all directions in gradient space decay equally. Components aligned with eigenvectors of large eigenvalues persist longer, while components aligned with eigenvectors of small eigenvalues vanish quickly. In effect, the network preserves gradient information only along a few dominant eigendirections and rapidly loses whatever is carried in the remaining directions.
2. Non-normal dynamics:
If $W_{hh}$ is non-normal (i.e., $W_{hh}^T W_{hh} \neq W_{hh} W_{hh}^T$), transient behavior can differ dramatically from asymptotic behavior. The matrix can amplify gradients significantly before eventually causing decay. This leads to unpredictable training dynamics.
Orthogonal matrices are normal matrices with all eigenvalues on the unit circle in the complex plane. This means: (1) All eigenvalues have magnitude exactly 1, so spectral radius is 1. (2) There's no transient amplification from non-normality. (3) All gradient directions are treated equally. This makes orthogonal initialization ideal for RNNs, though the tanh derivative still causes shrinkage.
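The contrast between normal and non-normal dynamics is easy to see numerically. The sketch below uses two small toy matrices (illustrative constructions, not drawn from the analysis above) with the same spectral radius; the non-normal one transiently amplifies a backpropagated vector before the asymptotic decay set by its eigenvalues takes over:

```python
import numpy as np

rho = 0.8  # shared spectral radius

# Normal matrix: diagonal, eigenvalues +/- 0.8, no transient growth
A_normal = np.diag([rho, -rho])

# Non-normal matrix: same spectral radius (eigenvalues both 0.8),
# but a large off-diagonal coupling term
A_nonnormal = np.array([[rho, 5.0],
                        [0.0, rho]])

for t in [1, 5, 10, 30, 60]:
    n_norm = np.linalg.norm(np.linalg.matrix_power(A_normal, t), 2)
    nn_norm = np.linalg.norm(np.linalg.matrix_power(A_nonnormal, t), 2)
    print(f"t={t:3d}   ||A_normal^t|| = {n_norm:10.2e}   ||A_nonnormal^t|| = {nn_norm:10.2e}")
```

The normal matrix decays monotonically like $0.8^t$, whereas the non-normal matrix first grows by an order of magnitude before eventually decaying. Orthogonal recurrent weights avoid exactly this kind of hidden transient growth, which is part of why they give more predictable gradient behavior.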
Singular value perspective:
An alternative analysis uses singular values rather than eigenvalues. The singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_n$ of $W_{hh}$ bound the operator norm:
$$\|W_{hh}\|_2 = \sigma_1$$
For the Jacobian at each step $J_t = D_t W_{hh}$ (where $D_t = \text{diag}(\tanh'(z_t))$):
$$\|J_t\|_2 \leq \|D_t\|_2 \cdot \|W_{hh}\|_2 = \max_i |\tanh'(z_{t,i})| \cdot \sigma_1(W_{hh})$$
Since $\tanh'(z) \leq 1$, we have $\|J_t\|_2 \leq \sigma_1(W_{hh})$.
For the full Jacobian chain:
$$\left\| \prod_{i=k+1}^{t} J_i \right\|_2 \leq \prod_{i=k+1}^{t} \|J_i\|_2 \leq \sigma_1(W_{hh})^{t-k}$$
This gives us a bound (though not tight) on gradient growth/decay.
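The following short sketch (illustrative, with randomly drawn pre-activations standing in for real ones) checks the bound numerically: the norm of an actual Jacobian chain sits below $\sigma_1(W_{hh})^{t-k}$, usually by a wide margin, because the $\tanh$ derivatives are strictly less than 1 on most units.

```python
import numpy as np

np.random.seed(1)
n, steps = 50, 20

# Scaled orthogonal matrix: every singular value equals 1.05
Whh = np.linalg.qr(np.random.randn(n, n))[0] * 1.05
sigma1 = np.linalg.norm(Whh, 2)

# Build a Jacobian chain with randomly drawn (hypothetical) pre-activations z_i
chain = np.eye(n)
for _ in range(steps):
    z = np.random.randn(n) * 1.5
    D = np.diag(1 - np.tanh(z) ** 2)   # diag(tanh'(z_i))
    chain = D @ Whh @ chain

print(f"upper bound sigma_1^k : {sigma1 ** steps:.3e}")
print(f"actual ||chain||_2    : {np.linalg.norm(chain, 2):.3e}")
```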
```python
import numpy as np
from scipy.linalg import svd, schur

def comprehensive_spectral_analysis(Whh):
    """
    Perform comprehensive spectral analysis of a recurrent weight matrix.
    Returns metrics relevant to gradient flow in RNNs.
    """
    n = Whh.shape[0]

    # Eigenvalue analysis
    eigenvalues = np.linalg.eigvals(Whh)
    spectral_radius = np.max(np.abs(eigenvalues))

    # Singular value analysis
    U, singular_values, Vh = svd(Whh)
    operator_norm = singular_values[0]  # Largest singular value
    condition_number = singular_values[0] / singular_values[-1]

    # Non-normality measure
    # A matrix is normal if ||AA* - A*A||_F = 0
    AAt = Whh @ Whh.T
    AtA = Whh.T @ Whh
    non_normality = np.linalg.norm(AAt - AtA, 'fro')

    # Schur decomposition for numerical stability analysis
    T_schur, Z = schur(Whh)

    # Effective spectral radius with tanh derivative
    gamma = 0.8  # Typical average tanh derivative
    effective_spectral_radius = gamma * spectral_radius

    return {
        'eigenvalues': eigenvalues,
        'spectral_radius': spectral_radius,
        'singular_values': singular_values,
        'operator_norm': operator_norm,
        'condition_number': condition_number,
        'non_normality': non_normality,
        'effective_spectral_radius': effective_spectral_radius,
        'stable_gradient_flow': effective_spectral_radius <= 1.0,
        'vanishing_risk': effective_spectral_radius < 0.95,
        'exploding_risk': effective_spectral_radius > 1.05
    }


def predict_trainability(analysis_results, sequence_length):
    """
    Predict trainability based on spectral analysis.
    """
    rho = analysis_results['effective_spectral_radius']

    # Gradient magnitude ratio at the start of the sequence
    gradient_at_start = rho ** sequence_length

    predictions = {
        'estimated_gradient_ratio': gradient_at_start,
        'effective_sequence_length': int(np.log(1e-6) / np.log(rho)) if rho < 1 else float('inf')
    }

    if gradient_at_start < 1e-6:
        predictions['trainability'] = 'POOR: Gradients likely vanished'
        predictions['recommendation'] = 'Use orthogonal init, LSTM/GRU, or truncated BPTT'
    elif gradient_at_start > 1e6:
        predictions['trainability'] = 'UNSTABLE: Gradients likely exploded'
        predictions['recommendation'] = 'Use gradient clipping, smaller learning rate'
    elif gradient_at_start < 0.01:
        predictions['trainability'] = 'MARGINAL: Long-range gradients very weak'
        predictions['recommendation'] = 'Consider gated architectures for long sequences'
    else:
        predictions['trainability'] = 'GOOD: Gradients should propagate effectively'
        predictions['recommendation'] = 'Current configuration is suitable'

    return predictions


# Analyze different initialization strategies
np.random.seed(42)
n = 64  # Hidden size

initializations = {
    'Standard Normal': np.random.randn(n, n),
    'Xavier': np.random.randn(n, n) / np.sqrt(n),
    'He': np.random.randn(n, n) * np.sqrt(2/n),
    'Orthogonal': np.linalg.qr(np.random.randn(n, n))[0],
    'Identity + Noise': np.eye(n) + 0.01 * np.random.randn(n, n),
    'Scaled Orthogonal (1.1x)': 1.1 * np.linalg.qr(np.random.randn(n, n))[0],
}

print("=" * 70)
print("RECURRENT WEIGHT MATRIX SPECTRAL ANALYSIS")
print("=" * 70)

for name, Whh in initializations.items():
    analysis = comprehensive_spectral_analysis(Whh)
    trainability = predict_trainability(analysis, sequence_length=100)

    print(f"\n{name}:")
    print(f"  Spectral radius: {analysis['spectral_radius']:.4f}")
    print(f"  Effective ρ (with γ=0.8): {analysis['effective_spectral_radius']:.4f}")
    print(f"  Condition number: {analysis['condition_number']:.2f}")
    print(f"  Non-normality: {analysis['non_normality']:.4f}")
    print(f"  Vanishing risk: {analysis['vanishing_risk']}")
    print(f"  Exploding risk: {analysis['exploding_risk']}")
    print(f"  Trainability: {trainability['trainability']}")
```

Understanding how gradients decompose across time provides insight into what an RNN can and cannot learn. Let's analyze the total gradient as a sum of contributions from different temporal distances.
Gradient decomposition:
The total gradient of the loss $\mathcal{L}$ with respect to $W_{hh}$ can be written as:
$$\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \underbrace{\frac{\partial \mathcal{L}_t}{\partial h_t}}_{\text{Error at } t} \cdot \underbrace{\frac{\partial h_t}{\partial h_k}}_{\text{Jacobian chain}} \cdot \underbrace{\frac{\partial h_k}{\partial W_{hh}}}_{\text{Local gradient}}$$
We can reorganize this by the temporal distance $\tau = t - k$:
$$\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{\tau=0}^{T-1} G_\tau$$
where $G_\tau$ represents the gradient contribution from temporal distance $\tau$:
$$G_\tau = \sum_{t=\tau+1}^{T} \frac{\partial \mathcal{L}_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_{t-\tau}} \cdot \frac{\partial h_{t-\tau}}{\partial W_{hh}}$$
G₀ represents gradients from immediate (same-timestep) effects. G₁ represents gradients from one-step-back dependencies. G_τ for large τ represents gradients from long-range dependencies. If |G_τ| → 0 rapidly as τ increases, the network cannot learn long-range dependencies.
Relative contribution analysis:
Define the relative contribution of distance $\tau$ as:
$$\rho_\tau = \frac{|G_\tau|}{\sum_{\tau'=0}^{T-1} |G_{\tau'}|}$$
In a well-functioning RNN, the relative contributions $\rho_\tau$ decay only gradually with $\tau$, so losses can still shape the weights through dependencies that span many timesteps.

In practice, due to vanishing gradients, $\rho_\tau$ is concentrated almost entirely at small $\tau$: the few most recent timesteps dominate the total gradient, and long-range contributions are negligible, as the short sketch below illustrates.
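The sketch assumes, purely for illustration, that $|G_\tau|$ decays geometrically with a factor equal to the effective spectral radius; real contributions also depend on the data and the loss.

```python
import numpy as np

# Idealized model: |G_tau| decays geometrically with the effective spectral radius
T = 50
effective_rho = 0.7
G_norms = effective_rho ** np.arange(T)

rel = G_norms / G_norms.sum()   # rho_tau, the relative contributions
cum = np.cumsum(rel)            # fraction of the total gradient explained up to tau

for tau in [0, 1, 2, 5, 10, 20]:
    print(f"tau={tau:2d}   rho_tau={rel[tau]:.4f}   cumulative={cum[tau]:.4f}")
```

With an effective factor of 0.7, roughly 90% of the total gradient comes from the six most recent timesteps, leaving almost nothing for longer-range dependencies.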
The effective horizon:
We can define an effective horizon $H$ as the maximum temporal distance at which gradients remain above a threshold:
$$H = \max\{\tau : |G_\tau| > \epsilon\}$$
For vanilla RNNs with tanh activation, $H$ is typically 10-20 timesteps. Beyond this horizon, the network is essentially blind to past information—it cannot learn dependencies spanning more than $H$ steps.
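Under the geometric-decay approximation $|G_\tau| \approx (\gamma\rho)^\tau$ used above (an idealization), this definition yields a quick estimate:

$$H \approx \frac{\ln \epsilon}{\ln(\gamma \rho)}$$

For example, with $\epsilon = 10^{-6}$, an effective factor $\gamma\rho = 0.5$ gives $H \approx 20$, while $\gamma\rho = 0.3$ gives $H \approx 11$, consistent with the 10-20 step horizons quoted above.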
```python
import numpy as np

class RNNGradientAnalyzer:
    """
    Analyze temporal gradient contributions in an RNN.

    This class tracks how gradients from different temporal distances
    contribute to the total gradient, revealing the effective learning
    horizon of the network.
    """

    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Initialize with orthogonal recurrent weights
        self.Whh = np.linalg.qr(np.random.randn(hidden_size, hidden_size))[0]
        self.Wxh = np.random.randn(hidden_size, input_size) * 0.01
        self.Why = np.random.randn(output_size, hidden_size) * 0.01

    def forward(self, xs):
        """Forward pass, storing activations for gradient analysis."""
        T = len(xs)
        hs = [np.zeros((self.hidden_size, 1))]  # h_0
        pre_acts = []  # Store pre-activations for gradient computation

        for t in range(T):
            z = self.Whh @ hs[-1] + self.Wxh @ xs[t]
            h = np.tanh(z)
            hs.append(h)
            pre_acts.append(z)

        # Output at final timestep (for simplicity)
        y = self.Why @ hs[-1]

        return hs, pre_acts, y

    def compute_temporal_gradients(self, xs, target):
        """
        Compute gradients decomposed by temporal distance.

        Since the loss depends only on the final output y_T, the contribution
        at temporal distance tau = T - k is the part of dL/dWhh that flows
        through the Jacobian chain from h_T back to h_k.

        Returns a dictionary mapping tau to a gradient matrix shaped like Whh.
        """
        T = len(xs)
        hs, pre_acts, y = self.forward(xs)

        # Loss gradient at output (MSE loss assumed)
        dy = y - target
        dh_T = self.Why.T @ dy  # Gradient of the loss w.r.t. h_T

        temporal_gradients = {}

        # delta is (∂h_T/∂h_k)^T · dL/dh_T, built incrementally from k = T down to 1
        delta = dh_T
        for k in range(T, 0, -1):
            tau = T - k

            # Local gradient of h_k w.r.t. Whh: outer(delta ⊙ tanh'(z_k), h_{k-1})
            dh_raw = (1 - hs[k] ** 2) * delta
            temporal_gradients[tau] = np.outer(dh_raw.flatten(), hs[k-1].flatten())

            # Propagate one more step back through the recurrent weights
            delta = self.Whh.T @ dh_raw

        return temporal_gradients

    def analyze_effective_horizon(self, xs, target, threshold=1e-6):
        """
        Determine the effective learning horizon.
        """
        temporal_grads = self.compute_temporal_gradients(xs, target)

        # Compute norms
        grad_norms = {tau: np.linalg.norm(g) for tau, g in temporal_grads.items()}

        # Find maximum τ with significant gradient
        if not grad_norms:
            return 0, grad_norms

        max_norm = max(grad_norms.values())
        normalized_norms = {tau: norm / max_norm for tau, norm in grad_norms.items()}

        effective_horizon = max(
            [tau for tau, norm in normalized_norms.items() if norm > threshold],
            default=0
        )

        return effective_horizon, grad_norms


# Demonstration
np.random.seed(42)
analyzer = RNNGradientAnalyzer(input_size=10, hidden_size=50, output_size=1)

# Generate a sequence
T = 50
xs = [np.random.randn(10, 1) for _ in range(T)]
target = np.random.randn(1, 1)

# Analyze temporal gradient contributions
horizon, grad_norms = analyzer.analyze_effective_horizon(xs, target)

print("=" * 60)
print("TEMPORAL GRADIENT DECOMPOSITION ANALYSIS")
print("=" * 60)
print(f"\nSequence length: {T}")
print(f"Effective horizon: {horizon} timesteps")
print("\nGradient contribution by temporal distance:")
print("-" * 40)

# Sort and display
if grad_norms:
    max_norm = max(grad_norms.values())
    for tau in sorted(grad_norms.keys())[:20]:  # Show first 20
        relative = grad_norms[tau] / max_norm if max_norm > 0 else 0
        bar = "█" * int(relative * 30)
        print(f"τ={tau:2d}: {grad_norms[tau]:.2e} | {bar}")
```

Visualization provides crucial intuition about gradient flow. Let's explore several ways to visualize and understand gradient dynamics in RNNs.
1. Gradient norm over time:
The most direct visualization shows how gradient magnitude changes as we backpropagate through time. Starting from the loss at the final timestep, we track the gradient norm at each preceding timestep.
2. Gradient direction stability:
Beyond magnitude, we can track whether gradients maintain consistent direction. If gradients rotate chaotically, learning becomes difficult even if magnitudes are stable.
3. Per-hidden-unit gradient analysis:
Different hidden units may have very different gradient dynamics. Some units may carry gradients effectively while others contribute little.
4. Activation and gradient correlation:
Gradients depend on activations through the tanh derivative. Understanding this correlation helps explain gradient behavior.
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

class RNNGradientVisualizer:
    """
    Comprehensive visualization tools for RNN gradient flow.
    """

    def __init__(self, hidden_size=50):
        self.hidden_size = hidden_size

    def visualize_gradient_norm_decay(self, Whh_variants, T=100, gamma=0.8):
        """
        Compare gradient norm decay across different weight matrices.
        """
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        for name, Whh in Whh_variants.items():
            norms = []
            grad = np.ones((self.hidden_size, 1))
            grad = grad / np.linalg.norm(grad)

            for t in range(T):
                grad = gamma * Whh.T @ grad
                norms.append(np.linalg.norm(grad))

            axes[0].plot(norms, label=name)  # Linear scale
            axes[1].plot(norms, label=name)  # Log scale

        axes[0].set_xlabel('Timesteps backward')
        axes[0].set_ylabel('Gradient Norm')
        axes[0].set_title('Gradient Magnitude vs Temporal Distance (Linear)')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)

        axes[1].set_xlabel('Timesteps backward')
        axes[1].set_ylabel('Gradient Norm (log scale)')
        axes[1].set_yscale('log')
        axes[1].set_title('Gradient Magnitude vs Temporal Distance (Log)')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)

        plt.tight_layout()
        return fig

    def visualize_gradient_directions(self, Whh, T=50, gamma=0.8, num_init=5):
        """
        Visualize how gradient directions evolve over time.
        Uses PCA to project to 2D.
        """
        from sklearn.decomposition import PCA

        # Track gradients from multiple random initializations
        all_grads = []
        for i in range(num_init):
            grad = np.random.randn(self.hidden_size, 1)
            grad = grad / np.linalg.norm(grad)

            grads = [grad.flatten()]
            for t in range(T):
                grad = gamma * Whh.T @ grad
                if np.linalg.norm(grad) > 1e-10:
                    grad = grad / np.linalg.norm(grad)  # Normalize
                grads.append(grad.flatten())

            all_grads.append(np.array(grads))

        # Fit PCA on all gradient directions
        all_flat = np.vstack(all_grads)
        pca = PCA(n_components=2)
        pca.fit(all_flat)

        # Project each trajectory and plot
        fig, ax = plt.subplots(figsize=(10, 10))
        colors = plt.cm.viridis(np.linspace(0, 1, num_init))

        for i, grads in enumerate(all_grads):
            proj = pca.transform(grads)
            ax.scatter(proj[:, 0], proj[:, 1], c=np.arange(len(proj)),
                       cmap='plasma', s=20, alpha=0.7)
            ax.plot(proj[:, 0], proj[:, 1], alpha=0.3, color=colors[i])

        ax.set_xlabel('PC1')
        ax.set_ylabel('PC2')
        ax.set_title('Gradient Direction Evolution (PCA Projection)')
        ax.grid(True, alpha=0.3)

        return fig

    def visualize_per_unit_gradients(self, Whh, T=50, gamma=0.8):
        """
        Show gradient magnitude for each hidden unit over time.
        """
        # Start with uniform gradient
        grad = np.ones((self.hidden_size, 1))

        # Track per-unit gradients
        unit_grads = np.zeros((self.hidden_size, T))

        for t in range(T):
            grad = gamma * Whh.T @ grad
            unit_grads[:, t] = np.abs(grad.flatten())

        fig, ax = plt.subplots(figsize=(12, 8))

        # Heatmap
        im = ax.imshow(unit_grads, aspect='auto', cmap='viridis',
                       norm=LogNorm(vmin=1e-10, vmax=1))
        ax.set_xlabel('Timesteps Backward')
        ax.set_ylabel('Hidden Unit Index')
        ax.set_title('Per-Unit Gradient Magnitude Over Time')

        cbar = plt.colorbar(im, ax=ax)
        cbar.set_label('Gradient Magnitude (log scale)')

        return fig


# Create visualizations
np.random.seed(42)
n = 50

visualizer = RNNGradientVisualizer(hidden_size=n)

# Define weight matrices to compare
Whh_variants = {
    'Small (σ=0.5)': np.random.randn(n, n) * 0.5 / np.sqrt(n),
    'Xavier': np.random.randn(n, n) / np.sqrt(n),
    'Orthogonal': np.linalg.qr(np.random.randn(n, n))[0],
    'Identity + Noise': np.eye(n) + 0.01 * np.random.randn(n, n),
}

# Generate visualizations
fig1 = visualizer.visualize_gradient_norm_decay(Whh_variants, T=100)
fig2 = visualizer.visualize_gradient_directions(np.linalg.qr(np.random.randn(n, n))[0])
fig3 = visualizer.visualize_per_unit_gradients(np.random.randn(n, n) / np.sqrt(n))

plt.show()
```

When analyzing gradient visualizations: (1) Look for exponential decay patterns in gradient norms—steeper decay means shorter effective memory. (2) Check if gradients concentrate in certain hidden units—this indicates the network may only use a subset of its capacity. (3) Observe direction stability—random rotation suggests the network isn't learning coherent temporal patterns. (4) Compare across initializations to understand which strategies are more robust.
This page has established the mathematical foundation for understanding gradient flow in recurrent neural networks. The key insights will be essential for understanding the vanishing and exploding gradient problems in the following pages.
What's Next:
With this foundation in gradient flow analysis, we're now equipped to understand exactly why the vanishing gradient problem makes long-range learning so difficult, and what mathematical conditions cause it. The next page provides a deep dive into this phenomenon.
You now have a rigorous understanding of how gradients propagate through RNNs. This mathematical framework—Jacobian chains, spectral properties, temporal decomposition—provides the lens through which we'll analyze the vanishing gradient problem, the exploding gradient problem, gradient clipping, and architectural solutions like LSTM and GRU.