After ensuring your training process is stable and your data is clean, model-level issues become the focus. Model debugging addresses problems with the architecture, capacity, and learning behavior of the model itself.
Model issues typically manifest as persistent underfitting or overfitting, systematic errors on particular subgroups, or unexpected behavior that aggregate metrics hide.
This page covers diagnosing underfitting vs. overfitting, understanding the bias-variance tradeoff in practice, debugging model capacity issues, analyzing prediction errors, and validating that models exhibit expected behavior. You'll learn to systematically isolate whether poor performance stems from the model or elsewhere.
The bias-variance tradeoff is the fundamental lens for understanding model behavior. Every prediction error decomposes into bias (error from overly simple assumptions), variance (error from sensitivity to the particular training sample), and irreducible noise.
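Written out for squared error, the standard decomposition of the expected error at a point $x$ (with true function $f$, fitted model $\hat{f}$, and noise variance $\sigma^2$) is:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is over training sets: bias measures systematic error of the average model, variance measures how much the fitted model wobbles across training sets.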
Diagnosing which problem you have:
The relationship between training error and validation error reveals the issue:
| Training Error | Validation Error | Diagnosis |
|---|---|---|
| High | High (similar) | High bias (underfitting) |
| Low | High | High variance (overfitting) |
| Low | Low | Good fit |
| High | Low | Data leakage or bug |
```python
import numpy as np
from sklearn.model_selection import learning_curve


def diagnose_bias_variance(model, X, y, cv=5):
    """
    Generate learning curves to diagnose bias vs variance.

    Learning curves show how training and validation performance
    evolve as training set size increases.
    """
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=cv, scoring='accuracy', n_jobs=-1
    )

    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    # Diagnosis logic
    final_train = train_mean[-1]
    final_val = val_mean[-1]
    gap = final_train - final_val

    if final_train < 0.7:  # Arbitrary threshold
        diagnosis = "HIGH BIAS: Model underfits. Increase complexity."
    elif gap > 0.15:
        diagnosis = "HIGH VARIANCE: Model overfits. Regularize or get more data."
    elif final_val < 0.85:
        diagnosis = "BOTH: High bias and variance. Consider different architecture."
    else:
        diagnosis = "GOOD FIT: Model generalizes well."

    print(f"Train accuracy: {final_train:.3f}")
    print(f"Val accuracy: {final_val:.3f}")
    print(f"Gap: {gap:.3f}")
    print(f"Diagnosis: {diagnosis}")

    return {
        'train_sizes': train_sizes,
        'train_scores': train_mean,
        'val_scores': val_mean,
        'diagnosis': diagnosis
    }
```

Model capacity refers to the range of functions a model can represent; capacity must match task complexity.
Practical capacity indicators:
| Dataset Size | Task Complexity | Recommended Capacity | Regularization Needs |
|---|---|---|---|
| < 1,000 | Simple | Linear/shallow models | Strong regularization |
| 1,000 - 10,000 | Moderate | Small neural nets, ensembles | Moderate regularization |
| 10,000 - 100,000 | Complex | Deep networks | Standard regularization |
| 100,000+ | Very complex | Large models | Light regularization, more data helps |
Start with a model large enough to overfit your training data. If you can't overfit, you have a capacity problem or a bug. Once you can overfit, add regularization to generalize. This approach ensures you're not debugging regularization when capacity is the issue.
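The overfit-first check can be sketched quickly. The snippet below (scikit-learn on synthetic data, purely for illustration) contrasts a model with enough capacity to memorize the training set against one deliberately too small:

```python
# Sketch of the overfit-first check on synthetic data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A high-capacity model should reach ~100% training accuracy.
big = DecisionTreeClassifier(random_state=0).fit(X, y)
print("unconstrained tree, train acc:", big.score(X, y))  # 1.0

# If even training accuracy is low, suspect capacity (or a bug).
small = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print("depth-1 stump, train acc:", round(small.score(X, y), 2))
```

The unconstrained tree hitting 100% training accuracy confirms the pipeline can learn the data at all; only then is regularization worth tuning.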
Error analysis goes beyond aggregate metrics to understand where and why the model fails. Aggregate accuracy can hide systematic problems—a model with 95% accuracy might completely fail on a critical subgroup.
A typical error analysis workflow: collect the misclassified examples, stratify them by prediction confidence, identify the most-confused class pairs, and measure performance on critical data slices.
```python
import numpy as np
import pandas as pd
from collections import defaultdict


class ErrorAnalyzer:
    """Systematic error analysis for classification models."""

    def __init__(self, model, X_test, y_test, feature_names=None):
        self.model = model
        self.X_test = X_test
        self.y_test = y_test
        self.feature_names = feature_names
        self.predictions = model.predict(X_test)
        self.probabilities = model.predict_proba(X_test)

    def get_errors(self):
        """Get all misclassified examples."""
        error_mask = self.predictions != self.y_test
        return {
            'indices': np.where(error_mask)[0],
            'X': self.X_test[error_mask],
            'y_true': self.y_test[error_mask],
            'y_pred': self.predictions[error_mask],
            'confidence': self.probabilities[error_mask].max(axis=1)
        }

    def analyze_by_confidence(self):
        """Stratify errors by prediction confidence."""
        errors = self.get_errors()
        confidence_buckets = pd.cut(
            errors['confidence'],
            bins=[0, 0.5, 0.7, 0.9, 1.0],
            labels=['very_low', 'low', 'medium', 'high']
        )
        return pd.Series(confidence_buckets).value_counts()

    def analyze_confusion_patterns(self):
        """Find which class pairs are most confused."""
        errors = self.get_errors()
        confusion_pairs = defaultdict(int)
        for true, pred in zip(errors['y_true'], errors['y_pred']):
            confusion_pairs[(true, pred)] += 1
        return dict(sorted(
            confusion_pairs.items(),
            key=lambda x: -x[1]
        )[:10])

    def slice_analysis(self, slice_fn, slice_name):
        """Analyze performance on a specific data slice."""
        slice_mask = slice_fn(self.X_test)
        slice_preds = self.predictions[slice_mask]
        slice_true = self.y_test[slice_mask]

        accuracy = (slice_preds == slice_true).mean()
        error_rate = 1 - accuracy

        return {
            'slice_name': slice_name,
            'n_samples': slice_mask.sum(),
            'accuracy': accuracy,
            'error_rate': error_rate,
            'contribution_to_total_error': (
                (slice_preds != slice_true).sum()
                / (self.predictions != self.y_test).sum()
            )
        }
```

Errors where the model is highly confident are the most concerning: they indicate the model has confidently learned the wrong pattern.
Prioritize understanding these cases—they often reveal labeling errors, data leakage, or fundamental model assumptions that are wrong.
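A minimal sketch of surfacing those high-confidence errors, assuming a scikit-learn-style classifier with `predict` and `predict_proba` (the synthetic data and `LogisticRegression` here are illustrative):

```python
# Surface the most-confident misclassifications for manual review.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

preds = model.predict(X_te)
conf = model.predict_proba(X_te).max(axis=1)

error_idx = np.where(preds != y_te)[0]
# Sort errors by confidence, highest first: these deserve inspection first.
worst_first = error_idx[np.argsort(-conf[error_idx])]
for i in worst_first[:5]:
    print(f"idx={i} true={y_te[i]} pred={preds[i]} conf={conf[i]:.2f}")
```

Reviewing the top of this list by hand is where labeling errors and leakage tend to show up first.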
Neural network architectures can have subtle bugs that don't cause crashes but silently degrade performance. These are among the hardest bugs to find.
Common architecture bugs include ReLU neurons that have gone dead (they never activate, so no gradient flows through them) and parameters that never receive gradients at all, for example a layer defined in `__init__` but skipped in `forward`:
```python
import torch
import torch.nn as nn


def check_dead_relu(model, sample_input):
    """Detect ReLU layers with mostly-dead activations."""
    dead_neuron_report = {}

    def hook_fn(name):
        def hook(module, input, output):
            if isinstance(output, torch.Tensor):
                # Fraction of activations that are zero for this input
                dead_fraction = (output <= 0).float().mean().item()
                if dead_fraction > 0.5:
                    dead_neuron_report[name] = dead_fraction
        return hook

    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            hooks.append(module.register_forward_hook(hook_fn(name)))

    model(sample_input)
    for hook in hooks:
        hook.remove()

    if dead_neuron_report:
        print("⚠️ DEAD ReLU NEURONS DETECTED:")
        for name, frac in dead_neuron_report.items():
            print(f"  {name}: {frac:.1%} dead")
    else:
        print("✓ No dead neuron issues detected")

    return dead_neuron_report


def verify_gradient_flow(model, sample_input, sample_target, loss_fn):
    """Verify gradients flow to all parameters."""
    model.zero_grad()
    output = model(sample_input)
    loss = loss_fn(output, sample_target)
    loss.backward()

    no_grad_params = []
    for name, param in model.named_parameters():
        if param.grad is None or param.grad.abs().sum() == 0:
            no_grad_params.append(name)

    if no_grad_params:
        print("⚠️ PARAMETERS WITH NO GRADIENT:")
        for name in no_grad_params:
            print(f"  {name}")
    else:
        print("✓ All parameters receive gradients")

    return no_grad_params
```

Behavioral tests verify that a model exhibits expected behavior beyond just aggregate metrics. They catch subtle issues that slip past standard evaluation.
Types of behavioral tests:
| Domain | Test Type | Example Test | Expected Behavior |
|---|---|---|---|
| Sentiment | Invariance | Add neutral filler words | Sentiment unchanged |
| Sentiment | Directional | Add strong positive words | Sentiment increases |
| Object Detection | Invariance | Slight brightness change | Same detections |
| Object Detection | Minimum Functionality | Clear centered object | Must detect |
| Regression | Directional | Increase predictor X (positive coef) | Output increases |
| Translation | Consistency | Paraphrased input | Semantically equivalent output |
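As a concrete instance of the regression row above, the sketch below fits a linear model on synthetic data in which feature 0 has a positive true coefficient, then checks that bumping that feature raises predictions (the data, sizes, and helper name are illustrative assumptions):

```python
# Directional behavioral test for a regression model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Feature 0 has a positive true coefficient by construction.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
model = LinearRegression().fit(X, y)

def directional_test(model, X, feature, delta=1.0):
    """Fraction of rows whose prediction increases when `feature` is bumped."""
    X_bumped = X.copy()
    X_bumped[:, feature] += delta
    return (model.predict(X_bumped) > model.predict(X)).mean()

pass_rate = directional_test(model, X, feature=0)
print(f"Directional test pass rate: {pass_rate:.1%}")  # 100.0%
```

A pass rate below 100% on a test like this flags rows where the model has learned a relationship that contradicts domain knowledge.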
Model debugging requires moving beyond aggregate metrics to understand failure modes. Use learning curves for bias-variance diagnosis; start with a model that can overfit, then regularize. Analyze errors systematically to prioritize fixes, verify that every parameter receives gradients, and add behavioral tests to catch issues that metrics miss.