While vanishing gradients cause RNNs to forget the distant past, exploding gradients represent the opposite failure mode: gradients that grow so large that they destroy the optimization process entirely. Where vanishing gradients cause slow, ineffective learning, exploding gradients cause catastrophic failure—NaN values, wildly oscillating losses, and parameters that fly off to infinity.
Exploding gradients are in some ways more dramatic and easier to notice than vanishing gradients. Training suddenly diverges, loss spikes to enormous values, and the model becomes unusable. But this visibility is deceptive: the conditions that cause explosion are often present even when training appears stable, lurking as a source of future instability.
This page provides a rigorous analysis of the exploding gradient problem: when and why it occurs, how to detect it before catastrophe, and a preview of the gradient clipping solution that makes RNN training viable in practice.
This page covers: (1) the mathematical conditions for gradient explosion—why spectral radius > 1 causes exponential growth, (2) the asymmetry between vanishing and exploding—why they're not symmetric problems, (3) practical manifestations including NaN values and loss spikes, (4) detection strategies to anticipate explosion, and (5) how exploding gradients interact with the optimization landscape.
Gradient explosion is the mathematical dual of gradient vanishing. Recall the Jacobian chain:
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \text{diag}(\phi'(z_i)) \cdot W_{hh}$$
Condition for explosion:
If the spectral radius of the effective transition matrix exceeds 1, gradients explode:
$$\rho(\gamma W_{hh}) > 1 \implies \left\|\frac{\partial h_t}{\partial h_k}\right\| \to \infty \text{ as } (t-k) \to \infty$$
where $\gamma$ represents the "effective" activation derivative (though the analysis is more nuanced than for vanishing).
The subtlety of explosion:
Unlike vanishing, explosion doesn't have a simple sufficient condition based only on the spectral radius. Here's why:
Vanishing is an upper bound problem: If all singular values are < 1, the product must shrink.
Explosion is a lower bound problem: If any singular value is > 1, the product can grow—but whether it actually grows depends on the specific gradient direction.
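To see why, note that the operator norm is submultiplicative, which yields a genuine sufficient condition for vanishing but only a one-sided bound for explosion:

$$\left\| \prod_{i=k+1}^{t} \text{diag}(\phi'(z_i)) \, W_{hh} \right\| \le \prod_{i=k+1}^{t} \left\| \text{diag}(\phi'(z_i)) \, W_{hh} \right\| \le \left( \gamma \, \sigma_{\max}(W_{hh}) \right)^{t-k}$$

If $\gamma \, \sigma_{\max}(W_{hh}) < 1$, the right-hand side forces the product to zero. No analogous submultiplicative lower bound exists, so a singular value above 1 permits growth without guaranteeing it for every gradient direction.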
However, in practice, random gradient directions will eventually align with explosive eigenvector directions, so explosion typically occurs when the spectral radius exceeds 1.
For vanishing gradients, the activation derivative (≤1 for tanh/sigmoid) always contributes to shrinkage. For explosion, the situation is different: when activations are in the linear regime (near z=0), their derivative is close to 1, providing minimal damping. If W_hh has spectral radius > 1 and activations are frequently in the linear regime, explosion occurs. Saturation (large |z|) can actually help prevent explosion by shrinking gradients—but this creates vanishing!
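A minimal numpy sketch of this trade-off. The spectral radius of 1.2 and the pre-activation scales are illustrative assumptions, and the pre-activations are prescribed directly rather than simulated, to isolate the effect of the activation regime:

```python
import numpy as np

np.random.seed(0)
n = 20
# Random W_hh rescaled to spectral radius 1.2 (illustrative choice)
W = np.random.randn(n, n)
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))

for z_scale, regime in [(0.1, "linear regime"), (2.5, "saturated regime")]:
    J = np.eye(n)  # running Jacobian product
    for _ in range(50):
        z = z_scale * np.random.randn(n)          # prescribed pre-activations
        J = np.diag(1 - np.tanh(z) ** 2) @ W @ J  # chain rule: diag(phi'(z)) @ W_hh
    print(f"{regime}: Jacobian product norm after 50 steps = {np.linalg.norm(J, 2):.2e}")
```

With near-linear activations the product explodes despite the mild per-step tanh damping; with saturated activations the same $W_{hh}$ produces a vanishing product.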
Eigenvalue analysis:
Let $W_{hh} = V \Lambda V^{-1}$ where $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$. For the simplified case of constant activation derivatives:
$$(\gamma W_{hh})^k = V (\gamma \Lambda)^k V^{-1} = V \cdot \text{diag}((\gamma \lambda_1)^k, \ldots, (\gamma \lambda_n)^k) \cdot V^{-1}$$
If $|\gamma \lambda_i| > 1$ for any eigenvalue $\lambda_i$, the component of the gradient along the corresponding eigenvector grows as $|\gamma \lambda_i|^k$, and this direction eventually dominates the entire product.
Rate of explosion:
The explosion rate is $\rho(\gamma W_{hh})^k = (\gamma \cdot \rho(W_{hh}))^k$.
Examples (growth factor $\rho^T$ applied to the gradient):

| $\rho$ | Factor at $T = 50$ | Factor at $T = 100$ |
|---|---|---|
| 1.01 | ≈ 1.6 | ≈ 2.7 |
| 1.05 | ≈ 11.5 | ≈ 131 |
| 1.1 | ≈ 117 | ≈ 1.4 × 10^4 |
| 1.2 | ≈ 9.1 × 10^3 | ≈ 8.3 × 10^7 |
| 1.5 | ≈ 6.4 × 10^8 | ≈ 4.1 × 10^17 |

Even modest spectral radii above 1 cause rapid explosion for long sequences: at $\rho = 1.2$, a 100-step gradient is already amplified by a factor of roughly $10^8$.
```python
import numpy as np

def analyze_explosion_conditions(hidden_sizes, spectral_radii, T_max=100):
    """
    Analyze gradient explosion for different spectral radii.
    """
    print("=" * 70)
    print("GRADIENT EXPLOSION ANALYSIS")
    print("=" * 70)

    results = {}
    for rho in spectral_radii:
        # Gradient growth over time
        timesteps = np.arange(0, T_max + 1)
        gradient_growth = rho ** timesteps

        # Find when gradient exceeds thresholds
        threshold_1e6 = np.argmax(gradient_growth > 1e6) if np.any(gradient_growth > 1e6) else T_max
        threshold_1e10 = np.argmax(gradient_growth > 1e10) if np.any(gradient_growth > 1e10) else T_max
        threshold_inf = np.argmax(gradient_growth > 1e38) if np.any(gradient_growth > 1e38) else T_max  # float32 overflow

        results[rho] = {
            'timesteps': timesteps,
            'gradient_growth': gradient_growth,
            't_1e6': threshold_1e6,
            't_1e10': threshold_1e10,
            't_overflow': threshold_inf
        }

        print(f"\nSpectral radius ρ = {rho:.2f}:")
        print(f"  Gradient exceeds 10^6 at T = {threshold_1e6}")
        print(f"  Gradient exceeds 10^10 at T = {threshold_1e10}")
        print(f"  Float32 overflow at T ≈ {threshold_inf}")
        print(f"  Gradient at T=50: {rho**50:.2e}")
        print(f"  Gradient at T=100: {rho**100:.2e}")

    return results

def critical_spectral_radius_analysis():
    """
    Analyze the critical spectral radius for different sequence lengths.
    """
    print("\n" + "=" * 70)
    print("CRITICAL SPECTRAL RADIUS ANALYSIS")
    print("=" * 70)
    print("\n(Maximum ρ to avoid gradient > 10^6)")
    print("-" * 50)

    # For gradient < threshold: rho^T < threshold, i.e. rho < threshold^(1/T)
    threshold = 1e6
    sequence_lengths = [10, 25, 50, 100, 200, 500, 1000]
    for T in sequence_lengths:
        critical_rho = threshold ** (1.0 / T)
        print(f"  T = {T:4d}: max ρ = {critical_rho:.4f}")

    print("\nImplication: For long sequences, spectral radius must be VERY close to 1")
    print("Even ρ = 1.01 causes overflow for T = 1000")

def visualize_explosion_dynamics():
    """
    Visualize how gradients explode in the eigenspace.
    """
    np.random.seed(42)
    n = 50  # Hidden size

    # Create weight matrix with controlled spectral properties:
    # use an eigendecomposition to set specific eigenvalues
    V = np.linalg.qr(np.random.randn(n, n))[0]  # Random orthogonal basis

    # Create eigenvalues with one explosive direction
    eigenvalues = np.random.uniform(0.8, 0.95, n)  # Most eigenvalues < 1
    eigenvalues[0] = 1.2  # One explosive eigenvalue
    eigenvalues[1] = 1.1  # Another mildly explosive
    Lambda = np.diag(eigenvalues)
    Whh = V @ Lambda @ V.T

    print("\n" + "=" * 70)
    print("EIGENSPACE DYNAMICS VISUALIZATION")
    print("=" * 70)
    print("\nWeight matrix with mixed eigenvalues:")
    print(f"  Range: [{eigenvalues.min():.2f}, {eigenvalues.max():.2f}]")
    print(f"  Explosive eigenvalues (>1): {np.sum(eigenvalues > 1)}")
    print(f"  Spectral radius: {np.max(np.abs(eigenvalues)):.2f}")

    # Track gradient evolution over T backward steps
    T = 50
    grad = np.random.randn(n)
    grad = grad / np.linalg.norm(grad)  # Start with a random unit gradient

    v0 = V[:, 0]  # Most explosive direction
    v1 = V[:, 1]  # Second explosive direction

    projections_v0 = []
    projections_v1 = []
    other_projections = []
    total_norms = []

    current_grad = grad.copy()
    for t in range(T):
        current_grad = Whh.T @ current_grad  # Backward step
        # Signed components along the explosive eigenvectors
        c0 = np.dot(current_grad, v0)
        c1 = np.dot(current_grad, v1)
        other = np.linalg.norm(current_grad - c0 * v0 - c1 * v1)
        projections_v0.append(abs(c0))
        projections_v1.append(abs(c1))
        other_projections.append(other)
        total_norms.append(np.linalg.norm(current_grad))

    print(f"\nGradient evolution over {T} timesteps:")
    print("  Initial norm: 1.0")
    print(f"  Final norm: {total_norms[-1]:.2e}")
    print(f"  Final projection on explosive v0 (λ=1.2): {projections_v0[-1]:.2e}")
    print(f"  Final projection on explosive v1 (λ=1.1): {projections_v1[-1]:.2e}")
    print(f"  Final projection on stable subspace: {other_projections[-1]:.2e}")
    print("\n  → Explosive directions dominate as expected")

    return total_norms, projections_v0, projections_v1, other_projections

# Run analyses
spectral_radii = [0.99, 1.0, 1.01, 1.05, 1.1, 1.2, 1.5]
results = analyze_explosion_conditions([50], spectral_radii)
critical_spectral_radius_analysis()
norms_data = visualize_explosion_dynamics()
```

Although vanishing and exploding gradients are often discussed as symmetric problems—two sides of the same coin—they are fundamentally asymmetric in several important ways.
Asymmetry 1: Detection and visibility
Vanishing gradients are silent killers. Training proceeds, losses decrease (on short-range patterns), and nothing obviously breaks. You only discover the problem when evaluating on tasks requiring long-range dependencies.
Exploding gradients are loud failures. Loss spikes to enormous values, NaN appears in parameters or gradients, and training visibly breaks. This makes explosion easier to detect but also means it must be handled immediately.
Asymmetry 2: Frequency of occurrence
Vanishing is the default behavior. With standard initialization and saturating activations (tanh, sigmoid), vanilla RNNs almost always suffer from vanishing gradients. You have to work to avoid it.
Explosion requires spectral radius > 1/γ, which is less common with typical initialization. However, explosion can emerge during training as weights evolve, even if initialization was safe.
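Since explosion can develop mid-training, it is worth tracking the spectral radius of the recurrent weights directly. A minimal sketch for a single-layer `nn.RNN`, whose hidden-to-hidden matrix PyTorch exposes as `weight_hh_l0`:

```python
import torch
import torch.nn as nn

def recurrent_spectral_radius(rnn: nn.RNN) -> float:
    """Largest |eigenvalue| of the hidden-to-hidden weight matrix."""
    with torch.no_grad():
        eigvals = torch.linalg.eigvals(rnn.weight_hh_l0)  # complex-valued
        return eigvals.abs().max().item()

rnn = nn.RNN(input_size=10, hidden_size=50)
# Log this once per epoch: a value drifting above ~1 warns of future explosion.
print(f"spectral radius: {recurrent_spectral_radius(rnn):.3f}")
```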
Asymmetry 3: Solutions available
This is the crucial asymmetry. Exploding gradients can be directly addressed during training through gradient clipping. When gradients exceed a threshold, we simply rescale them. This doesn't prevent learning—it just prevents catastrophe.
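Concretely, norm clipping rescales the gradient $g$ so its norm never exceeds a threshold $\tau$, while preserving its direction:

$$g \leftarrow g \cdot \min\left(1, \frac{\tau}{\|g\|}\right)$$

If $\|g\| \le \tau$, the gradient passes through unchanged; otherwise it is scaled down to norm exactly $\tau$. The next page develops this rule in detail.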
Vanishing gradients cannot be "boosted" symmetrically. If we tried to scale up small gradients, we'd amplify noise along with signal and learn spurious correlations. The solution requires architectural changes (LSTM, GRU) that prevent vanishing from occurring in the first place.
Asymmetry 4: Impact on optimization landscape
Vanishing gradients create flat regions in the loss landscape—plateaus where the optimizer makes no progress regardless of learning rate.
Exploding gradients create cliffs—sudden steep drops where a single step can jump far away from a good solution. Gradient clipping essentially "smooths" these cliffs.
Asymmetry 5: Relationship to training dynamics
Exploding gradients often occur transiently. Training might oscillate between explosion and stability as weights change. This makes the dynamics chaotic and hard to control.
Vanishing is more stable but permanently crippling. Once established, the network consistently fails to propagate gradients, and this doesn't improve over training.
Because of this asymmetry, practical RNN training typically: (1) Uses LSTM/GRU to address vanishing (architectural solution), (2) Uses gradient clipping to address any remaining explosion (runtime solution), (3) Uses careful initialization to start in a stable regime. Gradient clipping is cheap and necessary; architectural changes are fundamental.
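A minimal sketch of how these pieces fit together in a PyTorch training step; the model sizes and the clipping threshold of 1.0 are illustrative choices, not recommendations:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=50)  # architectural fix for vanishing
head = nn.Linear(50, 1)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.MSELoss()

x, target = torch.randn(100, 1, 10), torch.randn(1, 1)  # (seq, batch, features)

output, _ = model(x)
loss = criterion(head(output[-1]), target)

optimizer.zero_grad()
loss.backward()
# Runtime fix for explosion: cap the global gradient norm before the update.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```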
Understanding how gradient explosion manifests in practice helps with early detection and debugging. Here are the common symptoms:
1. NaN (Not a Number) values
The most obvious sign. When gradients exceed the floating-point range, they overflow to infinity; subsequent operations such as `inf - inf` or `0 * inf` produce NaN, which then propagates through every parameter the optimizer touches.
NaN typically indicates explosion has already occurred and propagated. By this point, the model is unrecoverable without loading from a checkpoint.
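A two-line illustration of the mechanics, using numpy's float32 (whose maximum finite value is about $3.4 \times 10^{38}$):

```python
import numpy as np

with np.errstate(over='ignore', invalid='ignore'):
    overflowed = np.float32(3e38) * np.float32(2.0)  # exceeds float32 range -> inf
    poisoned = overflowed - overflowed               # inf - inf -> nan
print(overflowed, poisoned)  # inf nan
```

Once a single NaN enters a parameter update, it spreads to every value it touches on the next forward pass.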
2. Loss spikes
Before complete divergence, you may see sudden, large spikes in training loss—jumping from reasonable values (e.g., 2.0) to enormous values (e.g., 10^6) in a single step. This indicates one of: a parameter update that stepped across a cliff in the loss landscape, a batch whose long-range structure triggered explosive backpropagation, or weights whose spectral radius has drifted into the unstable regime.
3. Parameter instability
Monitor parameter norms over training. Healthy training shows gradual, smooth changes. Explosion shows sudden jumps in parameter norm, unbounded growth over successive steps, or oscillations of increasing amplitude.
```python
import numpy as np
import torch
import torch.nn as nn
from typing import Dict, List
from dataclasses import dataclass
from collections import deque

@dataclass
class ExplosionWarning:
    """Warning about potential gradient explosion."""
    timestep: int
    severity: str  # 'low', 'medium', 'high', 'critical'
    indicator: str
    value: float
    threshold: float
    message: str

class GradientExplosionMonitor:
    """
    Real-time monitoring for gradient explosion in RNN training.

    Implements multiple detection strategies:
    1. Gradient norm monitoring
    2. Loss spike detection
    3. Parameter norm tracking
    4. NaN detection
    5. Rate of change analysis
    """

    def __init__(
        self,
        model: nn.Module,
        max_grad_norm: float = 100.0,
        max_loss: float = 1e6,
        loss_spike_factor: float = 10.0,
        history_size: int = 100
    ):
        self.model = model
        self.max_grad_norm = max_grad_norm
        self.max_loss = max_loss
        self.loss_spike_factor = loss_spike_factor

        # History tracking
        self.loss_history = deque(maxlen=history_size)
        self.grad_norm_history = deque(maxlen=history_size)
        self.param_norm_history = deque(maxlen=history_size)
        self.warnings: List[ExplosionWarning] = []
        self.step_count = 0

    def check_gradients(self) -> List[ExplosionWarning]:
        """Check current gradients for explosion indicators."""
        warnings = []
        total_grad_norm = 0.0
        max_single_grad = 0.0
        has_nan = False
        has_inf = False

        for name, param in self.model.named_parameters():
            if param.grad is None:
                continue
            grad = param.grad

            # Check for NaN/Inf
            if torch.isnan(grad).any():
                has_nan = True
                warnings.append(ExplosionWarning(
                    timestep=self.step_count, severity='critical',
                    indicator='nan_gradient', value=float('nan'), threshold=0,
                    message=f"NaN detected in gradients of {name}"
                ))
            if torch.isinf(grad).any():
                has_inf = True
                warnings.append(ExplosionWarning(
                    timestep=self.step_count, severity='critical',
                    indicator='inf_gradient', value=float('inf'), threshold=0,
                    message=f"Inf detected in gradients of {name}"
                ))

            # Compute norms
            grad_norm = grad.norm().item()
            total_grad_norm += grad_norm ** 2
            max_single_grad = max(max_single_grad, grad_norm)

        total_grad_norm = np.sqrt(total_grad_norm)

        # Check total gradient norm
        if total_grad_norm > self.max_grad_norm:
            severity = 'high' if total_grad_norm > 10 * self.max_grad_norm else 'medium'
            warnings.append(ExplosionWarning(
                timestep=self.step_count, severity=severity,
                indicator='large_gradient_norm', value=total_grad_norm,
                threshold=self.max_grad_norm,
                message=f"Gradient norm {total_grad_norm:.2e} exceeds threshold {self.max_grad_norm}"
            ))

        # Track history (skip corrupted values)
        if not (has_nan or has_inf):
            self.grad_norm_history.append(total_grad_norm)

        # Check for rapid increase
        if len(self.grad_norm_history) >= 10:
            recent_avg = np.mean(list(self.grad_norm_history)[-5:])
            older_avg = np.mean(list(self.grad_norm_history)[-10:-5])
            if older_avg > 0 and recent_avg / older_avg > 5:
                warnings.append(ExplosionWarning(
                    timestep=self.step_count, severity='medium',
                    indicator='gradient_acceleration', value=recent_avg / older_avg,
                    threshold=5.0,
                    message=f"Gradient norms increasing rapidly: {older_avg:.2e} -> {recent_avg:.2e}"
                ))

        return warnings

    def check_loss(self, loss: float) -> List[ExplosionWarning]:
        """Check loss value for explosion indicators."""
        warnings = []

        # Check for NaN/Inf
        if np.isnan(loss):
            warnings.append(ExplosionWarning(
                timestep=self.step_count, severity='critical',
                indicator='nan_loss', value=float('nan'), threshold=0,
                message="Loss is NaN - training has diverged"
            ))
            return warnings
        if np.isinf(loss):
            warnings.append(ExplosionWarning(
                timestep=self.step_count, severity='critical',
                indicator='inf_loss', value=float('inf'), threshold=0,
                message="Loss is Inf - training has diverged"
            ))
            return warnings

        # Check absolute threshold
        if loss > self.max_loss:
            warnings.append(ExplosionWarning(
                timestep=self.step_count, severity='high',
                indicator='large_loss', value=loss, threshold=self.max_loss,
                message=f"Loss {loss:.2e} exceeds maximum threshold {self.max_loss}"
            ))

        # Check for spike relative to history
        if len(self.loss_history) >= 5:
            recent_avg = np.mean(list(self.loss_history)[-5:])
            if loss > self.loss_spike_factor * recent_avg:
                warnings.append(ExplosionWarning(
                    timestep=self.step_count, severity='medium',
                    indicator='loss_spike', value=loss / recent_avg,
                    threshold=self.loss_spike_factor,
                    message=f"Loss spiked to {loss:.4f} from average {recent_avg:.4f}"
                ))

        self.loss_history.append(loss)
        return warnings

    def step(self, loss: float) -> List[ExplosionWarning]:
        """Perform all checks for current training step."""
        self.step_count += 1
        warnings = []
        warnings.extend(self.check_gradients())
        warnings.extend(self.check_loss(loss))

        # Track parameter norms
        total_param_norm = sum(
            p.norm().item() ** 2 for p in self.model.parameters()
        ) ** 0.5
        self.param_norm_history.append(total_param_norm)

        self.warnings.extend(warnings)
        return warnings

    def summarize(self) -> Dict:
        """Summarize explosion risk over training."""
        return {
            'total_warnings': len(self.warnings),
            'critical_warnings': sum(1 for w in self.warnings if w.severity == 'critical'),
            'high_warnings': sum(1 for w in self.warnings if w.severity == 'high'),
            'max_gradient_norm': max(self.grad_norm_history) if self.grad_norm_history else 0,
            'max_loss': max(self.loss_history) if self.loss_history else 0,
            'explosion_detected': any(w.severity == 'critical' for w in self.warnings)
        }

# Demo: Simulate training with explosion
def demonstrate_explosion_detection():
    """Demonstrate explosion detection by simulating problematic training."""
    print("=" * 60)
    print("GRADIENT EXPLOSION DETECTION DEMO")
    print("=" * 60)

    # Create RNN with large weight initialization (will explode)
    rnn = nn.RNN(input_size=10, hidden_size=50, num_layers=1)
    with torch.no_grad():
        rnn.weight_hh_l0.mul_(3.0)  # Make recurrent weights large

    linear = nn.Linear(50, 1)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(list(rnn.parameters()) + list(linear.parameters()), lr=0.01)
    monitor = GradientExplosionMonitor(rnn, max_grad_norm=10.0)

    # Training loop
    for epoch in range(20):
        x = torch.randn(100, 1, 10)  # Long sequence
        target = torch.randn(1, 1)

        output, _ = rnn(x)
        pred = linear(output[-1])
        loss = criterion(pred, target)

        optimizer.zero_grad()
        loss.backward()

        # Check for explosion
        warnings = monitor.step(loss.item())
        if warnings:
            print(f"\nEpoch {epoch}:")
            for w in warnings:
                print(f"  [{w.severity.upper()}] {w.message}")

        # Would normally apply gradient clipping here
        optimizer.step()

        # Stop if exploded
        if any(w.severity == 'critical' for w in warnings):
            print("\nTraining halted due to explosion!")
            break

    summary = monitor.summarize()
    print(f"\nSummary: {summary}")

demonstrate_explosion_detection()
```

Don't wait for NaN!
Key early indicators: (1) Gradient norm increasing over epochs, (2) Occasional loss spikes even if training recovers, (3) Parameter norms growing unboundedly, (4) Training becoming more volatile. Implement monitoring from the start and add gradient clipping preemptively.
Exploding gradients fundamentally alter the optimization landscape that the training algorithm navigates. Understanding this geometric perspective provides insight into why explosion is so destructive.
The cliff metaphor:
Pascanu et al. (2013) introduced the influential "cliff" metaphor for understanding RNN loss landscapes. In regions prone to explosion, the loss surface is mostly flat or gently sloped but punctuated by steep walls, where the loss rises by orders of magnitude over a tiny range of parameters. A gradient computed at the base of a wall is enormous, and a single descent step can catapult the parameters far from the region where learning was progressing.
Why cliffs exist:
Cliffs arise from the multiplicative nature of gradient computation in RNNs. Small changes in $W_{hh}$ that slightly increase its spectral radius can cause gradients that previously decayed with sequence length to instead grow exponentially, so both the loss and its derivatives change by orders of magnitude across a tiny neighborhood of parameter space.
The cliff is literally the boundary between stable and explosive regimes.
The curvature problem:
In regions near cliffs, the Hessian (second derivative) of the loss has extremely large eigenvalues in cliff directions. This means first-order gradient information is trustworthy only in a vanishingly small neighborhood: a step size that is safe on the surrounding plateau is catastrophic at the cliff, and no single learning rate works well everywhere.
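A one-parameter toy makes this concrete. Here the "network" is just a scalar linear recurrence $h_T = w^T h_0$ (an illustrative assumption, not a full RNN), so the loss is $w^{2T}$ and the cliff sits at $w = 1$:

```python
import numpy as np

T = 50
loss = lambda w: (w ** T) ** 2  # h_T = w^T * h_0 with h_0 = 1, target = 0

eps = 1e-4
for w in [0.9, 1.0, 1.05, 1.1]:
    # Finite-difference estimate of the second derivative (curvature)
    curv = (loss(w + eps) - 2 * loss(w) + loss(w - eps)) / eps ** 2
    print(f"w = {w:.2f}: loss = {loss(w):.3e}, curvature = {curv:.3e}")
```

Curvature grows by more than eight orders of magnitude between $w = 0.9$ and $w = 1.1$: a step size tuned for the flat side of the boundary is catastrophic on the steep side.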
| Region | Loss Behavior | Gradient Behavior | Optimization Challenge |
|---|---|---|---|
| Stable (ρ < 1) | Smooth, well-behaved | Predictable magnitude | May vanish for long T |
| Near-critical (ρ ≈ 1) | Steep in some directions | Variable, sometimes large | Sensitive to small changes |
| Unstable (ρ > 1) | Cliff-like structures | Huge, explosive | Single step causes divergence |
| Saddle points | Flat with escape directions | Small near saddle | Slow progress, wrong directions |
```python
import numpy as np

def analyze_loss_landscape():
    """
    Analyze the loss landscape of a simple RNN to visualize cliff structures.
    """
    np.random.seed(42)

    # Simple 2D hidden state for visualization
    hidden_size = 2
    seq_length = 20

    def rnn_loss(w_scale, input_scale, x_seq, target):
        """
        Compute loss for a simple RNN with given weight scale.
        We parameterize W_hh via w_scale to visualize the landscape
        as a function of this single parameter.
        """
        Whh = np.array([[w_scale, 0.1], [0.1, w_scale * 0.9]])
        Wxh = np.eye(hidden_size) * input_scale
        h = np.zeros(hidden_size)
        for x in x_seq:
            h = np.tanh(Whh @ h + Wxh @ x)
        # Simple loss: squared distance from target
        loss = np.sum((h - target) ** 2)
        return loss, h

    # Generate test data
    x_seq = [np.random.randn(hidden_size) * 0.1 for _ in range(seq_length)]
    target = np.array([0.5, 0.5])

    # Scan over weight scales
    w_scales = np.linspace(0.5, 2.0, 100)
    losses = []
    final_h_norms = []
    for w in w_scales:
        try:
            loss, h = rnn_loss(w, 0.1, x_seq, target)
            losses.append(loss if loss < 1e10 else 1e10)
            final_h_norms.append(np.linalg.norm(h))
        except FloatingPointError:
            losses.append(1e10)
            final_h_norms.append(np.nan)

    # Find cliff location (largest jump in loss between adjacent scales)
    loss_gradient = np.diff(losses)
    cliff_idx = np.argmax(np.abs(loss_gradient))
    cliff_w = w_scales[cliff_idx]

    print("=" * 60)
    print("LOSS LANDSCAPE ANALYSIS")
    print("=" * 60)
    print(f"\nSequence length: {seq_length}")
    print(f"Cliff detected at weight scale ≈ {cliff_w:.2f}")
    print(f"Loss before cliff: {losses[cliff_idx-1]:.4f}")
    print(f"Loss after cliff: {losses[cliff_idx+1]:.2e}")
    print(f"Loss ratio: {losses[cliff_idx+1] / losses[cliff_idx-1]:.2e}x")

    return w_scales, losses, cliff_w

def gradient_direction_vs_cliff():
    """
    Demonstrate how the gradient can point toward a cliff.
    """
    print("\n" + "=" * 60)
    print("GRADIENT DIRECTION ANALYSIS")
    print("=" * 60)

    # At a point near a cliff, compute:
    # 1. The gradient direction
    # 2. The actual loss along that direction
    # 3. Show that following the gradient leads to the cliff
    np.random.seed(42)
    hidden_size = 10
    seq_length = 30

    def compute_loss(Whh, Wxh, x_seq, target):
        h = np.zeros(hidden_size)
        for x in x_seq:
            h = np.tanh(Whh @ h + Wxh @ x)
        return np.sum((h - target) ** 2), h

    # Initialize near the critical point
    spectral_target = 1.05
    Whh = np.eye(hidden_size) * spectral_target / hidden_size + np.random.randn(hidden_size, hidden_size) * 0.01
    # Rescale to have spectral norm close to target
    Whh = Whh / np.linalg.norm(Whh, 2) * spectral_target
    Wxh = np.eye(hidden_size) * 0.1
    x_seq = [np.random.randn(hidden_size) * 0.1 for _ in range(seq_length)]
    target = np.random.randn(hidden_size) * 0.3

    # Compute the gradient numerically (forward differences)
    eps = 1e-5
    base_loss, _ = compute_loss(Whh, Wxh, x_seq, target)
    grad_Whh = np.zeros_like(Whh)
    for i in range(hidden_size):
        for j in range(hidden_size):
            Whh_plus = Whh.copy()
            Whh_plus[i, j] += eps
            loss_plus, _ = compute_loss(Whh_plus, Wxh, x_seq, target)
            grad_Whh[i, j] = (loss_plus - base_loss) / eps

    # Loss along the (negative) gradient direction
    step_sizes = np.linspace(-0.1, 0.5, 50)
    losses_along_gradient = []
    for alpha in step_sizes:
        # Move in the negative gradient direction (gradient descent) by alpha
        Whh_new = Whh - alpha * grad_Whh
        try:
            loss, h = compute_loss(Whh_new, Wxh, x_seq, target)
            if loss < 1e10 and not np.isnan(loss):
                losses_along_gradient.append(loss)
            else:
                losses_along_gradient.append(1e10)
        except FloatingPointError:
            losses_along_gradient.append(1e10)

    # Check whether there is a cliff along this direction
    loss_gradient_along = np.diff(losses_along_gradient)
    cliff_detected = np.any(np.abs(loss_gradient_along) > 100)

    print(f"\nSpectral radius of Whh: {np.max(np.abs(np.linalg.eigvals(Whh))):.3f}")
    print(f"Gradient norm: {np.linalg.norm(grad_Whh):.4f}")
    print(f"Base loss: {base_loss:.4f}")
    print("\nLoss along gradient descent direction:")
    for alpha in [0.0, 0.2, 0.5]:
        idx = int(np.argmin(np.abs(step_sizes - alpha)))  # nearest sampled step size
        print(f"  At α≈{alpha:.1f}: {losses_along_gradient[idx]:.4e}")
    print(f"\nCliff detected along gradient direction: {cliff_detected}")
    if cliff_detected:
        print("→ Standard gradient descent would step off this cliff!")

    return step_sizes, losses_along_gradient

# Run analyses
w_scales, losses, cliff_w = analyze_loss_landscape()
step_sizes, gradient_losses = gradient_direction_vs_cliff()
```

A naive fix for explosion is reducing the learning rate. But this doesn't solve the problem: (1) Very small learning rates slow training unacceptably. (2) The cliff is still there—you're just approaching it more slowly. (3) Eventually, accumulating updates still cross the cliff. (4) Different regions need different learning rates. Gradient clipping directly addresses the symptom (large gradients) without sacrificing learning speed.
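A back-of-the-envelope comparison makes the point; the gradient norms and hyperparameters below are illustrative assumptions, not measurements:

```python
healthy, cliff = 1.0, 1e6  # assumed gradient norms on a plateau vs. at a cliff
tau = 5.0                  # illustrative clipping threshold

# Option 1: shrink the learning rate until the cliff step seems "safe".
lr_small = 1e-7
print(f"lr reduction: healthy step {lr_small * healthy:.1e}, cliff step {lr_small * cliff:.1e}")

# Option 2: keep a normal learning rate and clip the gradient norm.
lr = 1e-2
print(f"clipping:     healthy step {lr * min(healthy, tau):.1e}, cliff step {lr * min(cliff, tau):.1e}")
```

With the tiny learning rate, ordinary progress shrinks to $10^{-7}$-sized steps while a cliff step remains a million times larger; with clipping, ordinary steps are untouched and the cliff step is bounded.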
The exploding gradient problem is the volatile counterpart to vanishing gradients. While less common, its effects are dramatic and must be addressed for stable RNN training.
What's Next:
We've characterized both vanishing and exploding gradients. Now we turn to the first practical solution: gradient clipping. This technique allows training to continue even when gradients would otherwise explode, and is now standard practice in RNN training.
You now understand the conditions that cause gradient explosion, how to detect it in practice, and why it creates "cliffs" in the optimization landscape. Combined with your understanding of vanishing gradients, you're ready to learn about practical solutions, starting with gradient clipping in the next page.