Training machine learning models is fundamentally different from traditional software development. In conventional programming, bugs manifest as crashes, exceptions, or obviously incorrect outputs. In ML, failure modes are often silent and insidious—a model might train without errors yet learn nothing useful, or appear to learn brilliantly on training data while failing catastrophically in production.
Training debugging requires a unique mental model. You're not debugging code in the traditional sense; you're debugging an optimization process operating over a complex, high-dimensional loss landscape. The symptoms you observe (loss values, gradient magnitudes, weight distributions) are indirect signals of underlying issues that may be mathematical, architectural, or data-related.
By completing this page, you will be able to systematically diagnose and resolve the most common training failures: vanishing and exploding gradients, loss plateaus, non-convergence, numerical instabilities, and optimization pathologies. You'll develop intuition for reading training dynamics and intervening effectively.
Before debugging training issues, you must understand what healthy training looks like. Training dynamics describe how loss, gradients, and model parameters evolve over time. Experienced practitioners develop an intuition for recognizing abnormal patterns—but this intuition is grounded in understanding the mathematical machinery of optimization.
The optimization landscape:
Deep learning training is fundamentally about navigating a loss landscape—a surface defined by the loss function over all possible parameter values. For a neural network with millions of parameters, this landscape exists in million-dimensional space. While we can't visualize it directly, its properties determine training behavior:
| Metric | Healthy Range | Warning Signs | Indicates |
|---|---|---|---|
| Training Loss | Steady decrease, then plateau | Immediate plateau, oscillation, NaN/Inf | Optimization progress |
| Gradient Norm | Stable, moderate magnitude | Shrinking to 0, exploding to Inf | Signal propagation health |
| Weight Norm | Gradual, bounded growth | Rapid explosion or collapse | Model capacity usage |
| Learning Rate | Appropriate for loss scale | Too high (divergence), too low (stagnation) | Step size appropriateness |
| Batch Loss Variance | Decreasing over time | High variance late in training | Optimization stability |
Healthy training requires balance between three forces: (1) Learning rate - determines step size, (2) Batch size - affects gradient estimate quality, (3) Model capacity - defines expressiveness. Imbalance in any causes training pathologies.
Vanishing gradients occur when gradients become exponentially small as they propagate backward through the network. This is perhaps the most historically significant training pathology, as it limited deep network training for decades before modern solutions emerged.
The mathematical root cause:
During backpropagation, gradients are computed via the chain rule:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdot ... \cdot \frac{\partial a_2}{\partial w_1}$$
For a network with $n$ layers, if each Jacobian term $\frac{\partial a_i}{\partial a_{i-1}}$ has norm less than 1, the product shrinks exponentially with depth. With the traditional sigmoid activation, the derivative never exceeds 0.25, so the activation alone attenuates the gradient by at least 75% per layer.
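The compounding effect is easy to see numerically. This minimal sketch (plain Python, no framework needed) multiplies the worst-case sigmoid derivative of 0.25 through successive layers:

```python
# Worst-case gradient attenuation through stacked sigmoid layers:
# each layer contributes at most a factor of 0.25 (the sigmoid
# derivative's maximum), so the surviving fraction is 0.25 ** depth.
# This ignores the weight matrices, which can offset or worsen the decay.
def max_sigmoid_gradient_fraction(depth: int) -> float:
    return 0.25 ** depth

for depth in (2, 5, 10, 20):
    print(f"{depth:2d} layers: at most {max_sigmoid_gradient_fraction(depth):.2e} "
          "of the output gradient reaches the first layer")
```

After just 10 layers, less than one millionth of the gradient signal survives, which is why deep sigmoid networks were so hard to train before ReLU, careful initialization, and residual connections.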
Symptoms of vanishing gradients:

- Training loss decreases very slowly, or plateaus early while the model is still clearly underfitting
- Gradient magnitudes in early layers are orders of magnitude smaller than in later layers
- Weights in the first layers barely change between checkpoints
- Deeper variants of the same architecture perform no better than shallow ones
```python
import torch
import torch.nn.functional as F

def diagnose_vanishing_gradients(model, sample_batch):
    """
    Diagnostic tool for detecting vanishing gradients.
    Computes gradient statistics per parameter during a forward-backward pass.
    """
    model.train()

    # Hooks capture each parameter's gradient as backprop computes it
    gradient_stats = {}

    def hook_fn(name):
        def hook(grad):
            gradient_stats[name] = {
                'mean': grad.abs().mean().item(),
                'std': grad.std().item(),
                'max': grad.abs().max().item(),
                'min': grad.abs().min().item(),
                'zero_fraction': (grad.abs() < 1e-7).float().mean().item(),
            }
        return hook

    # Register hooks on all trainable parameters
    handles = []
    for name, param in model.named_parameters():
        if param.requires_grad:
            handles.append(param.register_hook(hook_fn(name)))

    # Forward-backward pass
    inputs, targets = sample_batch
    outputs = model(inputs)
    loss = F.cross_entropy(outputs, targets)
    loss.backward()

    # Clean up hooks so they don't fire on later backward passes
    for handle in handles:
        handle.remove()

    # Report per-parameter statistics
    print("=== Gradient Analysis ===")
    for name, stats in sorted(gradient_stats.items()):
        status = "⚠️ VANISHING" if stats['mean'] < 1e-6 else "✓ OK"
        print(f"{name}: mean={stats['mean']:.2e}, "
              f"zeros={stats['zero_fraction']:.1%} {status}")

    return gradient_stats
```

Exploding gradients are the opposite pathology—gradients grow exponentially during backpropagation, causing weight updates so large they destabilize training. This typically manifests as NaN or Inf values in your loss or parameters.
The mathematical root cause:
Using the same chain rule formulation, if the Jacobian terms have norm greater than 1, their product grows exponentially with depth. This can happen with:

- Weight initialization with too-large variance
- Very deep networks, or recurrent networks unrolled over long sequences (the same weight matrix is multiplied repeatedly)
- Learning rates high enough that weights grow between steps, compounding the problem
- Absence of normalization layers to keep activations bounded
Symptoms of exploding gradients:

- Loss suddenly spikes, oscillates wildly, or becomes NaN/Inf
- Weight norms grow rapidly from one checkpoint to the next
- Gradient norms orders of magnitude larger than earlier in training
```python
import math

import torch

def apply_gradient_clipping(model, optimizer, clip_value=1.0, clip_type='norm'):
    """
    Apply gradient clipping to prevent exploding gradients.

    Args:
        clip_type: 'norm' for gradient norm clipping (recommended)
                   'value' for per-element clipping
    """
    if clip_type == 'norm':
        # Clips if the total gradient norm exceeds the threshold.
        # Preserves gradient direction, only scales magnitude.
        total_norm = torch.nn.utils.clip_grad_norm_(
            model.parameters(), max_norm=clip_value
        )
        return total_norm
    elif clip_type == 'value':
        # Clips each gradient element independently.
        # Can change gradient direction - use cautiously.
        torch.nn.utils.clip_grad_value_(
            model.parameters(), clip_value=clip_value
        )
        return None

class GradientMonitor:
    """Monitors gradient health throughout training."""

    def __init__(self, model, alert_threshold=100.0):
        self.model = model
        self.alert_threshold = alert_threshold
        self.history = []

    def check_gradients(self):
        total_norm = 0.0
        for p in self.model.parameters():
            if p.grad is not None:
                param_norm = p.grad.data.norm(2)
                total_norm += param_norm.item() ** 2
        total_norm = total_norm ** 0.5
        self.history.append(total_norm)

        if total_norm > self.alert_threshold:
            print(f"⚠️ EXPLODING GRADIENT: norm={total_norm:.2e}")
        if math.isnan(total_norm):
            print("🚨 NaN DETECTED in gradients!")

        return total_norm
```

Always use gradient norm clipping rather than value clipping for neural networks. Value clipping can alter gradient direction, leading to suboptimal update directions. Norm clipping preserves direction while bounding magnitude. Start with clip_value=1.0 and adjust based on observed gradient norms.
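In a training loop, clipping slots in between the backward pass and the optimizer step. A minimal self-contained sketch (the tiny linear model and synthetic batch are placeholders, not part of any real pipeline):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and synthetic batch, just to show where clipping goes.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(8, 4), torch.randint(0, 2, (8,))

for step in range(3):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    # Clip AFTER backward() and BEFORE step(): rescales all gradients
    # in place so their total L2 norm is at most max_norm.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    print(f"step {step}: pre-clip grad norm {total_norm:.3f}, loss {loss.item():.3f}")
```

Logging the returned pre-clip norm over time is cheap and tells you whether the threshold is actually binding or could be relaxed.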
Beyond gradient magnitude issues, training can fail due to pathological loss landscape geometry. Understanding these failure modes requires intuition about optimization dynamics.
Loss Plateaus:
Plateaus are flat regions where gradients approach zero despite the loss being far from optimal. Unlike saddle points or local minima, plateaus can span large regions of parameter space. Training can spend enormous time traversing plateaus before escaping.
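One common intervention is a learning-rate schedule that periodically raises the step size so the optimizer can traverse flat regions faster. A sketch using PyTorch's built-in `CosineAnnealingWarmRestarts` (the model, optimizer, and `T_0=5` cycle length are illustrative placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Restart the cosine cycle every T_0=5 steps: the LR decays toward
# eta_min, then jumps back to the base rate, giving the optimizer
# larger steps with which to cross plateaus.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=5, eta_min=1e-4
)

lrs = []
for step in range(10):
    # ... forward / backward / optimizer.step() would go here ...
    scheduler.step()
    lrs.append(optimizer.param_groups[0]['lr'])
print([f"{lr:.4f}" for lr in lrs])
```

The printed schedule shows the sawtooth pattern: a smooth decay followed by an abrupt return to the base learning rate at each restart.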
Saddle Points:
In high-dimensional spaces, saddle points vastly outnumber local minima. At a saddle point, the gradient is zero, but the point is a minimum along some dimensions and maximum along others. Modern understanding suggests saddle points, not local minima, are the primary obstacle in deep learning optimization.
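The canonical low-dimensional example is $f(x, y) = x^2 - y^2$: the gradient vanishes at the origin, yet the origin is a minimum along $x$ and a maximum along $y$. A plain-Python sketch:

```python
# Canonical 2-D saddle: f(x, y) = x^2 - y^2.
def f(x, y):
    return x**2 - y**2

def grad_f(x, y):
    return (2 * x, -2 * y)  # analytic gradient

# The gradient vanishes at the origin...
assert grad_f(0.0, 0.0) == (0.0, 0.0)

# ...yet the origin is not a minimum: f rises along x but falls along y,
# which is the "downhill" direction an optimizer must find to escape.
eps = 0.1
print(f(eps, 0.0) > f(0.0, 0.0))  # True: minimum along x
print(f(0.0, eps) < f(0.0, 0.0))  # True: maximum along y
```

In millions of dimensions the same structure recurs with far more "down" directions available, which is partly why stochastic gradient noise helps: it perturbs the iterate off the exact saddle so a descent direction can be found.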
Sharp vs Flat Minima:
Research suggests that the geometry of minima affects generalization. Sharp minima (surrounded by regions of rapidly increasing loss) tend to generalize poorly, while flat minima generalize better. This has implications for optimizer choice and learning rate schedules.
Training failures often stem from numerical precision limitations rather than algorithmic bugs. Deep learning operations push floating-point arithmetic to its limits, especially with mixed-precision training.
Common numerical instabilities:

- Overflow: exp() of large logits produces Inf (e.g., in a naive softmax)
- Log of zero or near-zero probabilities produces -Inf
- Division by values near zero amplifies noise or produces Inf
- FP16 underflow: small gradients or losses round to zero in mixed-precision training
```python
import torch
import torch.nn.functional as F

# ❌ UNSTABLE: Naive softmax implementation
def unstable_softmax(x):
    exp_x = torch.exp(x)  # Can overflow for large x
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

# ✓ STABLE: Subtract max before exponentiating
def stable_softmax(x):
    x_max = x.max(dim=-1, keepdim=True).values
    exp_x = torch.exp(x - x_max)  # Prevents overflow
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

# ❌ UNSTABLE: Naive log-softmax
def unstable_log_softmax(x):
    return torch.log(stable_softmax(x))  # log(small number) → -inf

# ✓ STABLE: Use the log-sum-exp trick
def stable_log_softmax(x):
    x_max = x.max(dim=-1, keepdim=True).values
    return x - x_max - torch.log(torch.exp(x - x_max).sum(dim=-1, keepdim=True))

# ❌ UNSTABLE: Cross-entropy with manual log
def unstable_cross_entropy(pred, target):
    # If pred contains 0, log(0) = -inf
    return -torch.log(pred[range(len(target)), target]).mean()

# ✓ STABLE: Use logits directly with the built-in function
def stable_cross_entropy(logits, target):
    # Numerically stable - applies the log-sum-exp trick internally
    return F.cross_entropy(logits, target)
```

When using FP16/mixed-precision for speed, numerical issues become more common. Always use loss scaling (automatic with torch.cuda.amp) and keep batch normalization, softmax, and loss computation in FP32. Most frameworks handle this automatically, but verify when debugging NaN issues.
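To make the overflow concrete, compare a naive softmax against the library's stable one on large logits (the `naive_softmax` helper here is illustrative; `torch.softmax` is the built-in, max-shifted implementation):

```python
import torch

def naive_softmax(x):
    exp_x = torch.exp(x)  # overflows once x exceeds ~88 in float32
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])

print(naive_softmax(logits))          # inf / inf -> tensor of nan
print(torch.softmax(logits, dim=-1))  # finite, ≈ [0.0900, 0.2447, 0.6652]
```

The inputs differ only by a constant shift from [-2, -1, 0], so the true softmax is perfectly well-defined; only the naive computation breaks.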
Effective training debugging requires a systematic approach rather than random experimentation. When training fails, work through a structured checklist:

1. Verify the data pipeline: inspect raw batches, check label alignment and preprocessing.
2. Overfit a tiny subset (a few dozen samples); failure here points to a code or data bug, not hyperparameters.
3. Inspect gradient and weight statistics layer by layer for vanishing, exploding, or dead units.
4. Simplify the setup (smaller model, no augmentation or regularization), then reintroduce components one at a time.
5. Only then tune hyperparameters, sweeping the learning rate on a log scale.
Training debugging is about reading signals: loss curves, gradient statistics, and weight distributions tell a story. Learn to read that story. Always verify you can overfit a tiny dataset first—if you can't, the problem is in your pipeline, not your hyperparameters. Document what you try and observe; ML debugging is empirical science.
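The "overfit a tiny dataset" check takes only a few lines to script. A minimal sketch with a placeholder MLP and a synthetic eight-sample batch (substitute a handful of real samples from your loader); a healthy pipeline should drive the loss close to zero:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic 8-sample batch; swap in a few real samples from your loader.
inputs = torch.randn(8, 16)
targets = torch.randint(0, 4, (8,))

# Placeholder MLP with ample capacity to memorize 8 samples.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(300):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"initial loss {losses[0]:.3f} -> final loss {losses[-1]:.4f}")
# If the final loss is not near zero, suspect the pipeline (data,
# labels, loss wiring), not the hyperparameters.
```

If this sanity check fails, no amount of hyperparameter tuning will help; if it passes, you can move on to gradient diagnostics and capacity questions with confidence in the plumbing.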