While vanishing gradients cause RNNs to forget the distant past, exploding gradients represent the opposite failure mode: gradients that grow so large that they destroy the optimization process entirely. Where vanishing gradients cause slow, ineffective learning, exploding gradients cause catastrophic failure—NaN values, wildly oscillating losses, and parameters that fly off to infinity.
Exploding gradients are in some ways more dramatic and easier to notice than vanishing gradients. Training suddenly diverges, loss spikes to enormous values, and the model becomes unusable. But this visibility is deceptive: the conditions that cause explosion are often present even when training appears stable, lurking as a source of future instability.
This page provides a rigorous analysis of the exploding gradient problem: when and why it occurs, how to detect it before catastrophe, and a preview of the gradient clipping solution that makes RNN training viable in practice.
This page covers: (1) the mathematical conditions for gradient explosion—why spectral radius > 1 causes exponential growth, (2) the asymmetry between vanishing and exploding—why they're not symmetric problems, (3) practical manifestations including NaN values and loss spikes, (4) detection strategies to anticipate explosion, and (5) how exploding gradients interact with the optimization landscape.
Gradient explosion is the mathematical dual of gradient vanishing. Recall the Jacobian chain:
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \text{diag}(\phi'(z_i)) \cdot W_{hh}$$
Condition for explosion:
If the spectral radius of the effective transition matrix exceeds 1, gradients explode:
$$\rho(\gamma W_{hh}) > 1 \implies \left\|\frac{\partial h_t}{\partial h_k}\right\| \to \infty \text{ as } (t-k) \to \infty$$
where $\gamma$ represents the "effective" activation derivative (though the analysis is more nuanced than for vanishing).
The subtlety of explosion:
Unlike vanishing, explosion doesn't have a simple sufficient condition based only on the spectral radius. Here's why:
Vanishing is an upper bound problem: If all singular values are < 1, the product must shrink.
Explosion is a lower bound problem: If any singular value is > 1, the product can grow—but whether it actually grows depends on the specific gradient direction.
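To see why, note that the operator norm is submultiplicative, which yields a genuine sufficient condition for vanishing but only a one-sided bound for explosion:

$$\left\| \prod_{i=k+1}^{t} \text{diag}(\phi'(z_i)) \, W_{hh} \right\| \le \prod_{i=k+1}^{t} \left\| \text{diag}(\phi'(z_i)) \, W_{hh} \right\| \le \left( \gamma \, \sigma_{\max}(W_{hh}) \right)^{t-k}$$

If $\gamma \, \sigma_{\max}(W_{hh}) < 1$, the right-hand side forces the product to zero. No analogous submultiplicative lower bound exists, so a singular value above 1 permits growth without guaranteeing it for every gradient direction.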
However, in practice, random gradient directions will eventually align with explosive eigenvector directions, so explosion typically occurs when the spectral radius exceeds 1.
For vanishing gradients, the activation derivative (≤1 for tanh/sigmoid) always contributes to shrinkage. For explosion, the situation is different: when activations are in the linear regime (near z=0), their derivative is close to 1, providing minimal damping. If W_hh has spectral radius > 1 and activations are frequently in the linear regime, explosion occurs. Saturation (large |z|) can actually help prevent explosion by shrinking gradients—but this creates vanishing!
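A minimal numpy sketch of this trade-off. The spectral radius of 1.2 and the pre-activation scales are illustrative assumptions, and the pre-activations are prescribed directly rather than simulated, to isolate the effect of the activation regime:

```python
import numpy as np

np.random.seed(0)
n = 20
# Random W_hh rescaled to spectral radius 1.2 (illustrative choice)
W = np.random.randn(n, n)
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))

for z_scale, regime in [(0.1, "linear regime"), (2.5, "saturated regime")]:
    J = np.eye(n)  # running Jacobian product
    for _ in range(50):
        z = z_scale * np.random.randn(n)          # prescribed pre-activations
        J = np.diag(1 - np.tanh(z) ** 2) @ W @ J  # chain rule: diag(phi'(z)) @ W_hh
    print(f"{regime}: Jacobian product norm after 50 steps = {np.linalg.norm(J, 2):.2e}")
```

With near-linear activations the product explodes despite the mild per-step tanh damping; with saturated activations the same $W_{hh}$ produces a vanishing product.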
Eigenvalue analysis:
Let $W_{hh} = V \Lambda V^{-1}$ where $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$. For the simplified case of constant activation derivatives:
$$(\gamma W_{hh})^k = V (\gamma \Lambda)^k V^{-1} = V \cdot \text{diag}((\gamma \lambda_1)^k, \ldots, (\gamma \lambda_n)^k) \cdot V^{-1}$$
If $|\gamma \lambda_i| > 1$ for any eigenvalue $\lambda_i$, the component of the gradient along the corresponding eigenvector grows as $|\gamma \lambda_i|^k$, and this direction eventually dominates the entire product.
Rate of explosion:
The explosion rate is $\rho(\gamma W_{hh})^k = (\gamma \cdot \rho(W_{hh}))^k$.
Examples (growth factor $\rho^T$ applied to the gradient):

| $\rho$ | Factor at $T = 50$ | Factor at $T = 100$ |
|---|---|---|
| 1.01 | ≈ 1.6 | ≈ 2.7 |
| 1.05 | ≈ 11.5 | ≈ 131 |
| 1.1 | ≈ 117 | ≈ 1.4 × 10^4 |
| 1.2 | ≈ 9.1 × 10^3 | ≈ 8.3 × 10^7 |
| 1.5 | ≈ 6.4 × 10^8 | ≈ 4.1 × 10^17 |

Even modest spectral radii above 1 cause rapid explosion for long sequences: at $\rho = 1.2$, a 100-step gradient is already amplified by a factor of roughly $10^8$.
```python
import numpy as np

def analyze_explosion_conditions(hidden_sizes, spectral_radii, T_max=100):
    """
    Analyze gradient explosion for different spectral radii.
    """
    print("=" * 70)
    print("GRADIENT EXPLOSION ANALYSIS")
    print("=" * 70)

    results = {}
    for rho in spectral_radii:
        # Gradient growth over time
        timesteps = np.arange(0, T_max + 1)
        gradient_growth = rho ** timesteps

        # Find when gradient exceeds thresholds
        threshold_1e6 = np.argmax(gradient_growth > 1e6) if np.any(gradient_growth > 1e6) else T_max
        threshold_1e10 = np.argmax(gradient_growth > 1e10) if np.any(gradient_growth > 1e10) else T_max
        threshold_inf = np.argmax(gradient_growth > 1e38) if np.any(gradient_growth > 1e38) else T_max  # float32 overflow

        results[rho] = {
            'timesteps': timesteps,
            'gradient_growth': gradient_growth,
            't_1e6': threshold_1e6,
            't_1e10': threshold_1e10,
            't_overflow': threshold_inf
        }

        print(f"\nSpectral radius ρ = {rho:.2f}:")
        print(f"  Gradient exceeds 10^6 at T = {threshold_1e6}")
        print(f"  Gradient exceeds 10^10 at T = {threshold_1e10}")
        print(f"  Float32 overflow at T ≈ {threshold_inf}")
        print(f"  Gradient at T=50: {rho**50:.2e}")
        print(f"  Gradient at T=100: {rho**100:.2e}")

    return results

def critical_spectral_radius_analysis():
    """
    Analyze the critical spectral radius for different sequence lengths.
    """
    print("\n" + "=" * 70)
    print("CRITICAL SPECTRAL RADIUS ANALYSIS")
    print("=" * 70)
    print("\n(Maximum ρ to avoid gradient > 10^6)")
    print("-" * 50)

    # For gradient < threshold: rho^T < threshold, i.e. rho < threshold^(1/T)
    threshold = 1e6
    sequence_lengths = [10, 25, 50, 100, 200, 500, 1000]
    for T in sequence_lengths:
        critical_rho = threshold ** (1.0 / T)
        print(f"  T = {T:4d}: max ρ = {critical_rho:.4f}")

    print("\nImplication: For long sequences, spectral radius must be VERY close to 1")
    print("Even ρ = 1.01 causes overflow for T = 1000")

def visualize_explosion_dynamics():
    """
    Visualize how gradients explode in the eigenspace.
    """
    np.random.seed(42)
    n = 50  # Hidden size

    # Create weight matrix with controlled spectral properties:
    # use an eigendecomposition to set specific eigenvalues
    V = np.linalg.qr(np.random.randn(n, n))[0]  # Random orthogonal basis

    # Create eigenvalues with one explosive direction
    eigenvalues = np.random.uniform(0.8, 0.95, n)  # Most eigenvalues < 1
    eigenvalues[0] = 1.2  # One explosive eigenvalue
    eigenvalues[1] = 1.1  # Another mildly explosive
    Lambda = np.diag(eigenvalues)
    Whh = V @ Lambda @ V.T

    print("\n" + "=" * 70)
    print("EIGENSPACE DYNAMICS VISUALIZATION")
    print("=" * 70)
    print("\nWeight matrix with mixed eigenvalues:")
    print(f"  Range: [{eigenvalues.min():.2f}, {eigenvalues.max():.2f}]")
    print(f"  Explosive eigenvalues (>1): {np.sum(eigenvalues > 1)}")
    print(f"  Spectral radius: {np.max(np.abs(eigenvalues)):.2f}")

    # Track gradient evolution over T backward steps
    T = 50
    grad = np.random.randn(n)
    grad = grad / np.linalg.norm(grad)  # Start with a random unit gradient

    v0 = V[:, 0]  # Most explosive direction
    v1 = V[:, 1]  # Second explosive direction

    projections_v0 = []
    projections_v1 = []
    other_projections = []
    total_norms = []

    current_grad = grad.copy()
    for t in range(T):
        current_grad = Whh.T @ current_grad  # Backward step
        # Signed components along the explosive eigenvectors
        c0 = np.dot(current_grad, v0)
        c1 = np.dot(current_grad, v1)
        other = np.linalg.norm(current_grad - c0 * v0 - c1 * v1)
        projections_v0.append(abs(c0))
        projections_v1.append(abs(c1))
        other_projections.append(other)
        total_norms.append(np.linalg.norm(current_grad))

    print(f"\nGradient evolution over {T} timesteps:")
    print("  Initial norm: 1.0")
    print(f"  Final norm: {total_norms[-1]:.2e}")
    print(f"  Final projection on explosive v0 (λ=1.2): {projections_v0[-1]:.2e}")
    print(f"  Final projection on explosive v1 (λ=1.1): {projections_v1[-1]:.2e}")
    print(f"  Final projection on stable subspace: {other_projections[-1]:.2e}")
    print("\n  → Explosive directions dominate as expected")

    return total_norms, projections_v0, projections_v1, other_projections

# Run analyses
spectral_radii = [0.99, 1.0, 1.01, 1.05, 1.1, 1.2, 1.5]
results = analyze_explosion_conditions([50], spectral_radii)
critical_spectral_radius_analysis()
norms_data = visualize_explosion_dynamics()
```

Although vanishing and exploding gradients are often discussed as symmetric problems—two sides of the same coin—they are fundamentally asymmetric in several important ways.
Asymmetry 1: Detection and visibility
Vanishing gradients are silent killers. Training proceeds, losses decrease (on short-range patterns), and nothing obviously breaks. You only discover the problem when evaluating on tasks requiring long-range dependencies.
Exploding gradients are loud failures. Loss spikes to enormous values, NaN appears in parameters or gradients, and training visibly breaks. This makes explosion easier to detect but also means it must be handled immediately.
Asymmetry 2: Frequency of occurrence
Vanishing is the default behavior. With standard initialization and saturating activations (tanh, sigmoid), vanilla RNNs almost always suffer from vanishing gradients. You have to work to avoid it.
Explosion requires spectral radius > 1/γ, which is less common with typical initialization. However, explosion can emerge during training as weights evolve, even if initialization was safe.
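Since explosion can develop mid-training, it is worth tracking the spectral radius of the recurrent weights directly. A minimal sketch for a single-layer `nn.RNN`, whose hidden-to-hidden matrix PyTorch exposes as `weight_hh_l0`:

```python
import torch
import torch.nn as nn

def recurrent_spectral_radius(rnn: nn.RNN) -> float:
    """Largest |eigenvalue| of the hidden-to-hidden weight matrix."""
    with torch.no_grad():
        eigvals = torch.linalg.eigvals(rnn.weight_hh_l0)  # complex-valued
        return eigvals.abs().max().item()

rnn = nn.RNN(input_size=10, hidden_size=50)
# Log this once per epoch: a value drifting above ~1 warns of future explosion.
print(f"spectral radius: {recurrent_spectral_radius(rnn):.3f}")
```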
Asymmetry 3: Solutions available
This is the crucial asymmetry. Exploding gradients can be directly addressed during training through gradient clipping. When gradients exceed a threshold, we simply rescale them. This doesn't prevent learning—it just prevents catastrophe.
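Concretely, norm clipping rescales the gradient $g$ so its norm never exceeds a threshold $\tau$, while preserving its direction:

$$g \leftarrow g \cdot \min\left(1, \frac{\tau}{\|g\|}\right)$$

If $\|g\| \le \tau$, the gradient passes through unchanged; otherwise it is scaled down to norm exactly $\tau$. The next page develops this rule in detail.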
Vanishing gradients cannot be "boosted" symmetrically. If we tried to scale up small gradients, we'd amplify noise along with signal and learn spurious correlations. The solution requires architectural changes (LSTM, GRU) that prevent vanishing from occurring in the first place.
Asymmetry 4: Impact on optimization landscape
Vanishing gradients create flat regions in the loss landscape—plateaus where the optimizer makes no progress regardless of learning rate.
Exploding gradients create cliffs—sudden steep drops where a single step can jump far away from a good solution. Gradient clipping essentially "smooths" these cliffs.
Asymmetry 5: Relationship to training dynamics
Exploding gradients often occur transiently. Training might oscillate between explosion and stability as weights change. This makes the dynamics chaotic and hard to control.
Vanishing is more stable but permanently crippling. Once established, the network consistently fails to propagate gradients, and this doesn't improve over training.
Because of this asymmetry, practical RNN training typically: (1) Uses LSTM/GRU to address vanishing (architectural solution), (2) Uses gradient clipping to address any remaining explosion (runtime solution), (3) Uses careful initialization to start in a stable regime. Gradient clipping is cheap and necessary; architectural changes are fundamental.
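A minimal sketch of how these pieces fit together in a PyTorch training step; the model sizes and the clipping threshold of 1.0 are illustrative choices, not recommendations:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=50)  # architectural fix for vanishing
head = nn.Linear(50, 1)
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.MSELoss()

x, target = torch.randn(100, 1, 10), torch.randn(1, 1)  # (seq, batch, features)

output, _ = model(x)
loss = criterion(head(output[-1]), target)

optimizer.zero_grad()
loss.backward()
# Runtime fix for explosion: cap the global gradient norm before the update.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```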
Understanding how gradient explosion manifests in practice helps with early detection and debugging. Here are the common symptoms:
1. NaN (Not a Number) values
The most obvious sign. When gradients exceed the floating-point range, they overflow to infinity; subsequent operations such as `inf - inf` or `0 * inf` produce NaN, which then propagates through every parameter the optimizer touches.
NaN typically indicates explosion has already occurred and propagated. By this point, the model is unrecoverable without loading from a checkpoint.
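A two-line illustration of the mechanics, using numpy's float32 (whose maximum finite value is about $3.4 \times 10^{38}$):

```python
import numpy as np

with np.errstate(over='ignore', invalid='ignore'):
    overflowed = np.float32(3e38) * np.float32(2.0)  # exceeds float32 range -> inf
    poisoned = overflowed - overflowed               # inf - inf -> nan
print(overflowed, poisoned)  # inf nan
```

Once a single NaN enters a parameter update, it spreads to every value it touches on the next forward pass.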
2. Loss spikes
Before complete divergence, you may see sudden, large spikes in training loss—jumping from reasonable values (e.g., 2.0) to enormous values (e.g., 10^6) in a single step. This indicates one of: a parameter update that stepped across a cliff in the loss landscape, a batch whose long-range structure triggered explosive backpropagation, or weights whose spectral radius has drifted into the unstable regime.
3. Parameter instability
Monitor parameter norms over training. Healthy training shows gradual, smooth changes. Explosion shows sudden jumps in parameter norm, unbounded growth over successive steps, or oscillations of increasing amplitude.
```python
import numpy as np
import torch
import torch.nn as nn
from typing import Dict, List
from dataclasses import dataclass
from collections import deque

@dataclass
class ExplosionWarning:
    """Warning about potential gradient explosion."""
    timestep: int
    severity: str  # 'low', 'medium', 'high', 'critical'
    indicator: str
    value: float
    threshold: float
    message: str

class GradientExplosionMonitor:
    """
    Real-time monitoring for gradient explosion in RNN training.

    Implements multiple detection strategies:
    1. Gradient norm monitoring
    2. Loss spike detection
    3. Parameter norm tracking
    4. NaN detection
    5. Rate of change analysis
    """

    def __init__(
        self,
        model: nn.Module,
        max_grad_norm: float = 100.0,
        max_loss: float = 1e6,
        loss_spike_factor: float = 10.0,
        history_size: int = 100
    ):
        self.model = model
        self.max_grad_norm = max_grad_norm
        self.max_loss = max_loss
        self.loss_spike_factor = loss_spike_factor

        # History tracking
        self.loss_history = deque(maxlen=history_size)
        self.grad_norm_history = deque(maxlen=history_size)
        self.param_norm_history = deque(maxlen=history_size)
        self.warnings: List[ExplosionWarning] = []
        self.step_count = 0

    def check_gradients(self) -> List[ExplosionWarning]:
        """Check current gradients for explosion indicators."""
        warnings = []
        total_grad_norm = 0.0
        max_single_grad = 0.0
        has_nan = False
        has_inf = False

        for name, param in self.model.named_parameters():
            if param.grad is None:
                continue
            grad = param.grad

            # Check for NaN/Inf
            if torch.isnan(grad).any():
                has_nan = True
                warnings.append(ExplosionWarning(
                    timestep=self.step_count, severity='critical',
                    indicator='nan_gradient', value=float('nan'), threshold=0,
                    message=f"NaN detected in gradients of {name}"
                ))
            if torch.isinf(grad).any():
                has_inf = True
                warnings.append(ExplosionWarning(
                    timestep=self.step_count, severity='critical',
                    indicator='inf_gradient', value=float('inf'), threshold=0,
                    message=f"Inf detected in gradients of {name}"
                ))

            # Compute norms
            grad_norm = grad.norm().item()
            total_grad_norm += grad_norm ** 2
            max_single_grad = max(max_single_grad, grad_norm)

        total_grad_norm = np.sqrt(total_grad_norm)

        # Check total gradient norm
        if total_grad_norm > self.max_grad_norm:
            severity = 'high' if total_grad_norm > 10 * self.max_grad_norm else 'medium'
            warnings.append(ExplosionWarning(
                timestep=self.step_count, severity=severity,
                indicator='large_gradient_norm', value=total_grad_norm,
                threshold=self.max_grad_norm,
                message=f"Gradient norm {total_grad_norm:.2e} exceeds threshold {self.max_grad_norm}"
            ))

        # Track history (skip corrupted values)
        if not (has_nan or has_inf):
            self.grad_norm_history.append(total_grad_norm)

        # Check for rapid increase
        if len(self.grad_norm_history) >= 10:
            recent_avg = np.mean(list(self.grad_norm_history)[-5:])
            older_avg = np.mean(list(self.grad_norm_history)[-10:-5])
            if older_avg > 0 and recent_avg / older_avg > 5:
                warnings.append(ExplosionWarning(
                    timestep=self.step_count, severity='medium',
                    indicator='gradient_acceleration', value=recent_avg / older_avg,
                    threshold=5.0,
                    message=f"Gradient norms increasing rapidly: {older_avg:.2e} -> {recent_avg:.2e}"
                ))

        return warnings

    def check_loss(self, loss: float) -> List[ExplosionWarning]:
        """Check loss value for explosion indicators."""
        warnings = []

        # Check for NaN/Inf
        if np.isnan(loss):
            warnings.append(ExplosionWarning(
                timestep=self.step_count, severity='critical',
                indicator='nan_loss', value=float('nan'), threshold=0,
                message="Loss is NaN - training has diverged"
            ))
            return warnings
        if np.isinf(loss):
            warnings.append(ExplosionWarning(
                timestep=self.step_count, severity='critical',
                indicator='inf_loss', value=float('inf'), threshold=0,
                message="Loss is Inf - training has diverged"
            ))
            return warnings

        # Check absolute threshold
        if loss > self.max_loss:
            warnings.append(ExplosionWarning(
                timestep=self.step_count, severity='high',
                indicator='large_loss', value=loss, threshold=self.max_loss,
                message=f"Loss {loss:.2e} exceeds maximum threshold {self.max_loss}"
            ))

        # Check for spike relative to history
        if len(self.loss_history) >= 5:
            recent_avg = np.mean(list(self.loss_history)[-5:])
            if loss > self.loss_spike_factor * recent_avg:
                warnings.append(ExplosionWarning(
                    timestep=self.step_count, severity='medium',
                    indicator='loss_spike', value=loss / recent_avg,
                    threshold=self.loss_spike_factor,
                    message=f"Loss spiked to {loss:.4f} from average {recent_avg:.4f}"
                ))

        self.loss_history.append(loss)
        return warnings

    def step(self, loss: float) -> List[ExplosionWarning]:
        """Perform all checks for current training step."""
        self.step_count += 1
        warnings = []
        warnings.extend(self.check_gradients())
        warnings.extend(self.check_loss(loss))

        # Track parameter norms
        total_param_norm = sum(
            p.norm().item() ** 2 for p in self.model.parameters()
        ) ** 0.5
        self.param_norm_history.append(total_param_norm)

        self.warnings.extend(warnings)
        return warnings

    def summarize(self) -> Dict:
        """Summarize explosion risk over training."""
        return {
            'total_warnings': len(self.warnings),
            'critical_warnings': sum(1 for w in self.warnings if w.severity == 'critical'),
            'high_warnings': sum(1 for w in self.warnings if w.severity == 'high'),
            'max_gradient_norm': max(self.grad_norm_history) if self.grad_norm_history else 0,
            'max_loss': max(self.loss_history) if self.loss_history else 0,
            'explosion_detected': any(w.severity == 'critical' for w in self.warnings)
        }

# Demo: Simulate training with explosion
def demonstrate_explosion_detection():
    """Demonstrate explosion detection by simulating problematic training."""
    print("=" * 60)
    print("GRADIENT EXPLOSION DETECTION DEMO")
    print("=" * 60)

    # Create RNN with large weight initialization (will explode)
    rnn = nn.RNN(input_size=10, hidden_size=50, num_layers=1)
    with torch.no_grad():
        rnn.weight_hh_l0.mul_(3.0)  # Make recurrent weights large

    linear = nn.Linear(50, 1)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(list(rnn.parameters()) + list(linear.parameters()), lr=0.01)
    monitor = GradientExplosionMonitor(rnn, max_grad_norm=10.0)

    # Training loop
    for epoch in range(20):
        x = torch.randn(100, 1, 10)  # Long sequence
        target = torch.randn(1, 1)

        output, _ = rnn(x)
        pred = linear(output[-1])
        loss = criterion(pred, target)

        optimizer.zero_grad()
        loss.backward()

        # Check for explosion
        warnings = monitor.step(loss.item())
        if warnings:
            print(f"\nEpoch {epoch}:")
            for w in warnings:
                print(f"  [{w.severity.upper()}] {w.message}")

        # Would normally apply gradient clipping here
        optimizer.step()

        # Stop if exploded
        if any(w.severity == 'critical' for w in warnings):
            print("\nTraining halted due to explosion!")
            break

    summary = monitor.summarize()
    print(f"\nSummary: {summary}")

demonstrate_explosion_detection()
```

Don't wait for NaN!
Key early indicators: (1) Gradient norm increasing over epochs, (2) Occasional loss spikes even if training recovers, (3) Parameter norms growing unboundedly, (4) Training becoming more volatile. Implement monitoring from the start and add gradient clipping preemptively.
Exploding gradients fundamentally alter the optimization landscape that the training algorithm navigates. Understanding this geometric perspective provides insight into why explosion is so destructive.
The cliff metaphor:
Pascanu et al. (2013) introduced the influential "cliff" metaphor for understanding RNN loss landscapes. In regions prone to explosion, the loss surface is mostly flat or gently sloped but punctuated by steep walls, where the loss rises by orders of magnitude over a tiny range of parameters. A gradient computed at the base of a wall is enormous, and a single descent step can catapult the parameters far from the region where learning was progressing.
Why cliffs exist:
Cliffs arise from the multiplicative nature of gradient computation in RNNs. Small changes in $W_{hh}$ that slightly increase its spectral radius can cause gradients that previously decayed with sequence length to instead grow exponentially, so both the loss and its derivatives change by orders of magnitude across a tiny neighborhood of parameter space.
The cliff is literally the boundary between stable and explosive regimes.
The curvature problem:
In regions near cliffs, the Hessian (second derivative) of the loss has extremely large eigenvalues in cliff directions. This means first-order gradient information is trustworthy only in a vanishingly small neighborhood: a step size that is safe on the surrounding plateau is catastrophic at the cliff, and no single learning rate works well everywhere.
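A one-parameter toy makes this concrete. Here the "network" is just a scalar linear recurrence $h_T = w^T h_0$ (an illustrative assumption, not a full RNN), so the loss is $w^{2T}$ and the cliff sits at $w = 1$:

```python
import numpy as np

T = 50
loss = lambda w: (w ** T) ** 2  # h_T = w^T * h_0 with h_0 = 1, target = 0

eps = 1e-4
for w in [0.9, 1.0, 1.05, 1.1]:
    # Finite-difference estimate of the second derivative (curvature)
    curv = (loss(w + eps) - 2 * loss(w) + loss(w - eps)) / eps ** 2
    print(f"w = {w:.2f}: loss = {loss(w):.3e}, curvature = {curv:.3e}")
```

Curvature grows by more than eight orders of magnitude between $w = 0.9$ and $w = 1.1$: a step size tuned for the flat side of the boundary is catastrophic on the steep side.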
| Region | Loss Behavior | Gradient Behavior | Optimization Challenge |
|---|---|---|---|
| Stable (ρ < 1) | Smooth, well-behaved | Predictable magnitude | May vanish for long T |
| Near-critical (ρ ≈ 1) | Steep in some directions | Variable, sometimes large | Sensitive to small changes |
| Unstable (ρ > 1) | Cliff-like structures | Huge, explosive | Single step causes divergence |
| Saddle points | Flat with escape directions | Small near saddle | Slow progress, wrong directions |
```python
import numpy as np

def analyze_loss_landscape():
    """
    Analyze the loss landscape of a simple RNN to visualize cliff structures.
    """
    np.random.seed(42)

    # Simple 2D hidden state for visualization
    hidden_size = 2
    seq_length = 20

    def rnn_loss(w_scale, input_scale, x_seq, target):
        """
        Compute loss for a simple RNN with given weight scale.
        We parameterize W_hh via w_scale to visualize the landscape
        as a function of this single parameter.
        """
        Whh = np.array([[w_scale, 0.1], [0.1, w_scale * 0.9]])
        Wxh = np.eye(hidden_size) * input_scale
        h = np.zeros(hidden_size)
        for x in x_seq:
            h = np.tanh(Whh @ h + Wxh @ x)
        # Simple loss: squared distance from target
        loss = np.sum((h - target) ** 2)
        return loss, h

    # Generate test data
    x_seq = [np.random.randn(hidden_size) * 0.1 for _ in range(seq_length)]
    target = np.array([0.5, 0.5])

    # Scan over weight scales
    w_scales = np.linspace(0.5, 2.0, 100)
    losses = []
    final_h_norms = []
    for w in w_scales:
        try:
            loss, h = rnn_loss(w, 0.1, x_seq, target)
            losses.append(loss if loss < 1e10 else 1e10)
            final_h_norms.append(np.linalg.norm(h))
        except FloatingPointError:
            losses.append(1e10)
            final_h_norms.append(np.nan)

    # Find cliff location (largest jump in loss between adjacent scales)
    loss_gradient = np.diff(losses)
    cliff_idx = np.argmax(np.abs(loss_gradient))
    cliff_w = w_scales[cliff_idx]

    print("=" * 60)
    print("LOSS LANDSCAPE ANALYSIS")
    print("=" * 60)
    print(f"\nSequence length: {seq_length}")
    print(f"Cliff detected at weight scale ≈ {cliff_w:.2f}")
    print(f"Loss before cliff: {losses[cliff_idx-1]:.4f}")
    print(f"Loss after cliff: {losses[cliff_idx+1]:.2e}")
    print(f"Loss ratio: {losses[cliff_idx+1] / losses[cliff_idx-1]:.2e}x")

    return w_scales, losses, cliff_w

def gradient_direction_vs_cliff():
    """
    Demonstrate how the gradient can point toward a cliff.
    """
    print("\n" + "=" * 60)
    print("GRADIENT DIRECTION ANALYSIS")
    print("=" * 60)

    # At a point near a cliff, compute:
    # 1. The gradient direction
    # 2. The actual loss along that direction
    # 3. Show that following the gradient leads to the cliff
    np.random.seed(42)
    hidden_size = 10
    seq_length = 30

    def compute_loss(Whh, Wxh, x_seq, target):
        h = np.zeros(hidden_size)
        for x in x_seq:
            h = np.tanh(Whh @ h + Wxh @ x)
        return np.sum((h - target) ** 2), h

    # Initialize near the critical point
    spectral_target = 1.05
    Whh = np.eye(hidden_size) * spectral_target / hidden_size + np.random.randn(hidden_size, hidden_size) * 0.01
    # Rescale to have spectral norm close to target
    Whh = Whh / np.linalg.norm(Whh, 2) * spectral_target
    Wxh = np.eye(hidden_size) * 0.1
    x_seq = [np.random.randn(hidden_size) * 0.1 for _ in range(seq_length)]
    target = np.random.randn(hidden_size) * 0.3

    # Compute the gradient numerically (forward differences)
    eps = 1e-5
    base_loss, _ = compute_loss(Whh, Wxh, x_seq, target)
    grad_Whh = np.zeros_like(Whh)
    for i in range(hidden_size):
        for j in range(hidden_size):
            Whh_plus = Whh.copy()
            Whh_plus[i, j] += eps
            loss_plus, _ = compute_loss(Whh_plus, Wxh, x_seq, target)
            grad_Whh[i, j] = (loss_plus - base_loss) / eps

    # Loss along the (negative) gradient direction
    step_sizes = np.linspace(-0.1, 0.5, 50)
    losses_along_gradient = []
    for alpha in step_sizes:
        # Move in the negative gradient direction (gradient descent) by alpha
        Whh_new = Whh - alpha * grad_Whh
        try:
            loss, h = compute_loss(Whh_new, Wxh, x_seq, target)
            if loss < 1e10 and not np.isnan(loss):
                losses_along_gradient.append(loss)
            else:
                losses_along_gradient.append(1e10)
        except FloatingPointError:
            losses_along_gradient.append(1e10)

    # Check whether there is a cliff along this direction
    loss_gradient_along = np.diff(losses_along_gradient)
    cliff_detected = np.any(np.abs(loss_gradient_along) > 100)

    print(f"\nSpectral radius of Whh: {np.max(np.abs(np.linalg.eigvals(Whh))):.3f}")
    print(f"Gradient norm: {np.linalg.norm(grad_Whh):.4f}")
    print(f"Base loss: {base_loss:.4f}")
    print("\nLoss along gradient descent direction:")
    for alpha in [0.0, 0.2, 0.5]:
        idx = int(np.argmin(np.abs(step_sizes - alpha)))  # nearest sampled step size
        print(f"  At α≈{alpha:.1f}: {losses_along_gradient[idx]:.4e}")
    print(f"\nCliff detected along gradient direction: {cliff_detected}")
    if cliff_detected:
        print("→ Standard gradient descent would step off this cliff!")

    return step_sizes, losses_along_gradient

# Run analyses
w_scales, losses, cliff_w = analyze_loss_landscape()
step_sizes, gradient_losses = gradient_direction_vs_cliff()
```

A naive fix for explosion is reducing the learning rate. But this doesn't solve the problem: (1) Very small learning rates slow training unacceptably. (2) The cliff is still there—you're just approaching it more slowly. (3) Eventually, accumulating updates still cross the cliff. (4) Different regions need different learning rates. Gradient clipping directly addresses the symptom (large gradients) without sacrificing learning speed.
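A back-of-the-envelope comparison makes the point; the gradient norms and hyperparameters below are illustrative assumptions, not measurements:

```python
healthy, cliff = 1.0, 1e6  # assumed gradient norms on a plateau vs. at a cliff
tau = 5.0                  # illustrative clipping threshold

# Option 1: shrink the learning rate until the cliff step seems "safe".
lr_small = 1e-7
print(f"lr reduction: healthy step {lr_small * healthy:.1e}, cliff step {lr_small * cliff:.1e}")

# Option 2: keep a normal learning rate and clip the gradient norm.
lr = 1e-2
print(f"clipping:     healthy step {lr * min(healthy, tau):.1e}, cliff step {lr * min(cliff, tau):.1e}")
```

With the tiny learning rate, ordinary progress shrinks to $10^{-7}$-sized steps while a cliff step remains a million times larger; with clipping, ordinary steps are untouched and the cliff step is bounded.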
The exploding gradient problem is the volatile counterpart to vanishing gradients. While less common, its effects are dramatic and must be addressed for stable RNN training.
What's Next:
We've characterized both vanishing and exploding gradients. Now we turn to the first practical solution: gradient clipping. This technique allows training to continue even when gradients would otherwise explode, and is now standard practice in RNN training.
You now understand the conditions that cause gradient explosion, how to detect it in practice, and why it creates "cliffs" in the optimization landscape. Combined with your understanding of vanishing gradients, you're ready to learn about practical solutions, starting with gradient clipping in the next page.