Gradient clipping is the primary defense against exploding gradients in RNN training. Rather than attempting to prevent explosion through careful initialization alone—which is often insufficient—gradient clipping intervenes directly during the optimization process, constraining gradient magnitudes to reasonable values.
The elegance of gradient clipping lies in its simplicity: when gradients exceed a threshold, we rescale them. This prevents catastrophic parameter updates without fundamentally changing the optimization algorithm. It's now standard practice in virtually all RNN training pipelines, and understanding its mechanics is essential for any practitioner.
This page covers: (1) the two main clipping strategies—value clipping and norm clipping, (2) mathematical guarantees each provides, (3) how to choose clipping thresholds, (4) implementation in major frameworks, and (5) interaction with adaptive optimizers.
There are two fundamental approaches to gradient clipping, each with distinct properties:
1. Value Clipping (Element-wise Clipping)
Clip each gradient element independently to lie within $[-\tau, \tau]$:
$$\tilde{g}_i = \begin{cases} \tau & \text{if } g_i > \tau \\ -\tau & \text{if } g_i < -\tau \\ g_i & \text{otherwise} \end{cases}$$
Properties:
- Cheapest option: an element-wise clamp, no norm computation required
- Gives per-parameter control over the maximum gradient contribution
- Does not preserve the gradient direction (see below)
- Rarely the default choice in practice
2. Norm Clipping (Gradient Rescaling)
Rescale the entire gradient vector if its norm exceeds threshold $\tau$:
$$\tilde{g} = \begin{cases} \frac{\tau}{\|g\|} g & \text{if } \|g\| > \tau \\ g & \text{otherwise} \end{cases}$$
Properties:
- Preserves the gradient direction: only the overall magnitude is rescaled
- Requires an extra pass to compute the norm, but remains O(n)
- Applies a single global threshold rather than per-parameter control
- The standard choice in modern RNN training, with stronger theoretical support
Gradient direction encodes which parameters need adjustment and in what direction. Value clipping distorts this information—a gradient of (100, 1) becomes (τ, 1), completely changing the direction. Norm clipping preserves direction: (100, 1) becomes (τ·100/√10001, τ·1/√10001), pointing the same way but with bounded magnitude.
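A quick NumPy check of that $(100, 1)$ example, using an illustrative threshold of $\tau = 1$ (the variable names here are ours, not from any library):

```python
import numpy as np

g = np.array([100.0, 1.0])
tau = 1.0

# Value clipping: each component is clamped independently, so the direction changes
value_clipped = np.clip(g, -tau, tau)            # -> [1.0, 1.0]

# Norm clipping: rescale the whole vector, so the direction is preserved
norm = np.linalg.norm(g)                         # sqrt(10001)
norm_clipped = g * min(1.0, tau / norm)          # -> [~0.99995, ~0.0099995]

print(value_clipped, norm_clipped)
# Unit vectors match, confirming the direction is unchanged by norm clipping
print(np.allclose(norm_clipped / np.linalg.norm(norm_clipped), g / norm))  # True
```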
| Property | Value Clipping | Norm Clipping |
|---|---|---|
| Preserves direction | No | Yes |
| Computation | O(n), simple element-wise clamp | O(n), plus a global norm computation |
| Per-parameter control | Yes | No (global) |
| Common usage | Rare | Standard |
| Theoretical support | Limited | Strong |
Let's formalize norm clipping and understand its properties.
Global Norm Clipping
Given gradients $g_1, g_2, \ldots, g_m$ for $m$ parameter tensors, compute the global norm:
$$\|g\|_{\text{global}} = \sqrt{\sum_{i=1}^{m} \|g_i\|_2^2}$$
Then clip:
$$\tilde{g}_i = \begin{cases} \frac{\tau}{\|g\|_{\text{global}}} g_i & \text{if } \|g\|_{\text{global}} > \tau \\ g_i & \text{otherwise} \end{cases}$$
Properties:
- Treats all parameter tensors as one concatenated vector, so a single threshold $\tau$ governs the entire model
- Preserves the overall update direction and the relative gradient scale between layers
- This is what `clip_grad_norm_` (PyTorch) and `clip_by_global_norm` (TensorFlow) implement
Layer-wise Clipping (Alternative)
Clip each layer independently: $\tilde{g}_i = \min(1, \tau / \|g_i\|_2) \cdot g_i$
This is less common but can help when different layers have very different gradient scales.
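A minimal NumPy sketch of this layer-wise rule (the function name and the small epsilon guard are our own additions):

```python
import numpy as np

def clip_gradients_layerwise(gradients, max_norm, eps=1e-12):
    """Clip each parameter tensor's gradient to its own norm budget."""
    clipped = []
    for g in gradients:
        norm = np.sqrt(np.sum(g ** 2))
        scale = min(1.0, max_norm / (norm + eps))   # min(1, tau / ||g_i||)
        clipped.append(g * scale)
    return clipped

# Example: two "layers" with very different gradient scales
grads = [np.array([50.0, 10.0]), np.array([0.1, 0.2])]
print(clip_gradients_layerwise(grads, max_norm=1.0))
# First gradient is rescaled to norm 1; the second is small enough to pass through
```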
```python
import numpy as np
import torch
import torch.nn as nn

def clip_gradient_value(gradients, threshold):
    """Element-wise value clipping."""
    return [np.clip(g, -threshold, threshold) for g in gradients]

def clip_gradient_norm(gradients, max_norm):
    """Global norm clipping (preserves direction)."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in gradients], total_norm, True
    return gradients, total_norm, False

# PyTorch implementation
def pytorch_gradient_clipping_demo():
    """Demonstrate gradient clipping in PyTorch."""
    model = nn.RNN(input_size=10, hidden_size=50, num_layers=2)

    # Forward pass
    x = torch.randn(20, 1, 10)
    output, _ = model(x)
    loss = output.sum()  # Dummy loss
    loss.backward()

    # Method 1: clip_grad_norm_ (recommended); returns the pre-clipping norm
    total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    print(f"Gradient norm before clipping: {total_norm:.4f}")

    # Method 2: clip_grad_value_ (element-wise)
    # nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

pytorch_gradient_clipping_demo()
```

Selecting the right clipping threshold $\tau$ is crucial. Too high, and clipping never activates (explosion is still possible). Too low, and you're constantly clipping, potentially harming convergence.
Empirical Approach (Most Common)
- Train briefly with a loose threshold (or none) while logging gradient norms
- Set $\tau$ somewhat above the typical norm, e.g. around a high percentile of the observed values
- The adaptive clipper shown below automates exactly this procedure
Theoretical Guidance
For a model with $n$ parameters and learning rate $\eta$:
Common Practices by Domain
Aggressive clipping (small τ) provides stability but may slow convergence or cause the optimizer to behave differently than intended. Conservative clipping (large τ) allows more natural gradient flow but risks occasional explosions. Start conservative (τ = 5-10) and reduce if instability persists.
```python
import numpy as np
import torch
import torch.nn as nn
from collections import deque

class AdaptiveGradientClipper:
    """
    Adaptive gradient clipping based on gradient norm history.
    """
    def __init__(self, model, percentile=95, multiplier=2.0, history_size=1000):
        self.model = model
        self.percentile = percentile
        self.multiplier = multiplier
        self.norm_history = deque(maxlen=history_size)
        self.current_threshold = 10.0  # Initial conservative value

    def step(self):
        """Compute gradient norm, update history, and clip."""
        # Compute current gradient norm
        total_norm = 0.0
        for p in self.model.parameters():
            if p.grad is not None:
                total_norm += p.grad.data.norm(2).item() ** 2
        total_norm = np.sqrt(total_norm)

        # Update history
        self.norm_history.append(total_norm)

        # Update adaptive threshold once enough history has accumulated
        if len(self.norm_history) >= 100:
            self.current_threshold = np.percentile(
                self.norm_history, self.percentile
            ) * self.multiplier

        # Clip using current threshold
        nn.utils.clip_grad_norm_(self.model.parameters(), self.current_threshold)

        return total_norm, self.current_threshold

# Usage example
model = nn.RNN(10, 50)
clipper = AdaptiveGradientClipper(model)

for epoch in range(100):
    # Dummy training step: forward and backward pass on random data
    x = torch.randn(20, 1, 10)
    output, _ = model(x)
    loss = output.sum()
    loss.backward()

    norm, threshold = clipper.step()
    model.zero_grad()
    print(f"Epoch {epoch}: norm={norm:.4f}, threshold={threshold:.4f}")
```

Modern training typically uses adaptive optimizers like Adam. Understanding how gradient clipping interacts with these optimizers is important.
Adam's internal scaling:
Adam already performs per-parameter scaling based on gradient history:
$$\theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $\hat{m}_t$ is the bias-corrected first moment (gradient) and $\hat{v}_t$ is the bias-corrected second moment (squared gradient).
The interaction:
Clipping is applied to the raw gradients before Adam sees them, so the clipped values are what feed the moment estimates $\hat{m}_t$ and $\hat{v}_t$. Adam's division by $\sqrt{\hat{v}_t}$ already damps large gradients on its own, but a single spike still contaminates the running moments for many subsequent steps; clipping prevents that contamination. Note that clipping bounds the gradient norm, not the final update, since Adam rescales per parameter afterwards, as the sketch below illustrates.
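A minimal, hand-rolled first Adam step following the update rule above (default Adam hyperparameters, an illustrative gradient, and an assumed threshold of $\tau = 1$): with zero initial moments, the raw and the norm-clipped gradient produce nearly identical first updates, because Adam's per-parameter normalization already bounds the step size; what clipping changes is the statistics that accumulate in $m_t$ and $v_t$.

```python
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update following the formula above; returns (update, m, v)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

g = np.array([100.0, 1.0])                          # large, "exploding" gradient
g_clipped = g * min(1.0, 1.0 / np.linalg.norm(g))   # norm-clipped to tau = 1

update_raw, *_ = adam_step(g, np.zeros(2), np.zeros(2), t=1)
update_clipped, *_ = adam_step(g_clipped, np.zeros(2), np.zeros(2), t=1)
print(update_raw)      # approx [-0.001, -0.001]
print(update_clipped)  # approx [-0.001, -0.001]
```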
Best Practices:
The standard order is: (1) Compute gradients via backward(), (2) Apply gradient clipping, (3) Call optimizer.step(). For AMP (automatic mixed precision), gradient unscaling happens before clipping. Always clip before the optimizer step, never after.
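With PyTorch's automatic mixed precision, for instance, that ordering looks like the sketch below (`compute_loss` and `dataloader` are placeholders, as in the framework examples that follow):

```python
import torch
import torch.nn as nn

# Sketch of AMP + clipping ordering (assumes a CUDA device; `dataloader` and
# `compute_loss` are placeholder names)
model = nn.RNN(10, 50).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()

    # Unscale first so the threshold applies to the true gradient values
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)   # skips the step if inf/nan gradients are found
    scaler.update()
```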
```python
# ============================================
# PyTorch Implementation
# ============================================
import torch
import torch.nn as nn

model = nn.RNN(10, 50)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()

    # Gradient clipping (standard approach)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()

# ============================================
# TensorFlow/Keras Implementation
# ============================================
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
# OR: clipvalue=0.5 for element-wise clipping

# Manual approach:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
gradients = tape.gradient(loss, model.trainable_variables)
gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=1.0)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

# ============================================
# JAX Implementation
# ============================================
import jax
import jax.numpy as jnp
import optax

# Optax provides gradient clipping as a transformation
optimizer = optax.chain(
    optax.clip_by_global_norm(max_norm=1.0),
    optax.adam(learning_rate=0.001)
)

# In training loop:
grads = jax.grad(loss_fn)(params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```

Gradient clipping gives you the primary defense against exploding gradients. Combined with proper initialization, it makes RNN training stable. But vanishing gradients remain unsolved—that requires architectural changes, which we cover in the next page on better architectures.