The difference between junior and senior ML practitioners often lies not in what they know, but in how they approach problems. Ad-hoc debugging—randomly changing hyperparameters, adding layers, or trying different optimizers—is slow, frustrating, and often fails. Systematic debugging is methodical, reproducible, and effective.
This page synthesizes everything you've learned into a complete debugging methodology—a structured approach to diagnosing and resolving any ML problem you encounter.
By completing this page, you'll have a complete mental framework for ML debugging: structured diagnosis processes, hypothesis-driven experimentation, reproducibility practices, and documentation standards. You'll debug faster, communicate better, and build a knowledge base that compounds over your career.
ML debugging should follow a hierarchy of causes, checking simpler explanations before complex ones. This prevents wasted effort chasing sophisticated problems when the root cause is mundane.
The debugging pyramid (check in order):
1. Pipeline bugs (most common, simplest to check)
2. Data quality
3. Training dynamics
4. Optimizer pathologies
5. Hyperparameter sensitivity
6. Architectural issues (rarest, most complex)
Most ML debugging situations are caused by issues at the bottom of the pyramid—data problems and pipeline bugs. Yet most engineers start at the top, tuning architectures and hyperparameters.
| Level | Example Issues | How to Check | Typical Fix Time |
|---|---|---|---|
| 1. Pipeline bugs | Wrong data path, shuffled labels, preprocessing error | Assert shapes, print samples, sanity checks | Minutes to hours |
| 2. Data quality | Label noise, leakage, distribution shift | Data analysis, visualization, validation sets | Hours to days |
| 3. Training dynamics | Wrong loss, vanishing gradients, wrong LR | Monitor metrics, gradient histograms | Hours |
| 4. Model & architecture | Wrong capacity, architecture bug, dead neurons | Overfit test, ablation, activation analysis | Hours to days |
| 5. Deep issues | Subtle optimization issues, hardware bugs | Deep profiling, research literature review | Days to weeks |
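The cheapest checks in the hierarchy (shapes, labels, finite values) can be run as a handful of assertions on a single batch before any tuning. A minimal numpy sketch; the function name and thresholds are illustrative, not from any particular library:

```python
import numpy as np

def sanity_check_batch(inputs, labels, num_classes):
    """Cheap pipeline checks: run once on a real batch before any tuning."""
    # Batch dimensions must agree between inputs and labels.
    assert inputs.shape[0] == labels.shape[0], "batch size mismatch"
    # Labels must be valid class indices.
    assert labels.min() >= 0 and labels.max() < num_classes, "label out of range"
    # Inputs should be finite and not constant (a constant batch usually
    # means a preprocessing or data-path bug).
    assert np.isfinite(inputs).all(), "non-finite values in inputs"
    assert inputs.std() > 0, "inputs are constant -- check preprocessing"
    # Return per-class counts so you can eyeball label balance.
    return np.bincount(labels, minlength=num_classes)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3, 8, 8)).astype(np.float32)
y = rng.integers(0, 10, size=32)
print(sanity_check_batch(x, y, num_classes=10))
```

A failed assertion here points you at a level-one fix measured in minutes, not an architecture search measured in days.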
Before anything else, try to overfit a single batch. If your model can't memorize 10-100 examples, you have a bug in your pipeline—not a tuning problem. This simple test eliminates levels 1-4 of the hierarchy in minutes.
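The single-batch overfit test fits in a few lines of PyTorch. This is a sketch with illustrative names and thresholds (300 Adam steps, >99% accuracy as the pass criterion); adapt the loss and accuracy check to your task:

```python
import torch
import torch.nn as nn

def can_overfit_one_batch(model, inputs, targets, steps=300, lr=1e-2):
    """Return True if the model memorizes a single batch.

    If this fails, suspect a pipeline bug -- not a tuning problem.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()
    acc = (model(inputs).argmax(dim=1) == targets).float().mean().item()
    return acc > 0.99

torch.manual_seed(0)
x = torch.randn(16, 20)                      # one small batch
y = torch.randint(0, 4, (16,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 4))
print(can_overfit_one_batch(model, x, y))
```

Even a tiny MLP should memorize 16 random examples; if yours cannot, inspect the data pipeline and loss wiring before touching hyperparameters.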
Effective debugging treats experiments as scientific hypotheses. Each change should test a specific belief about what's wrong. This prevents random exploration and builds understanding even when experiments fail.
The hypothesis-driven process:
1. State the problem precisely: symptom, expected behavior, and when it occurs.
2. List hypotheses about the cause, ranked by likelihood.
3. Design one experiment per hypothesis, changing a single variable.
4. Record the result and conclude whether it supports or refutes the hypothesis.
5. Iterate until the root cause is found, then summarize the resolution.
```markdown
# ML Debugging Log

## Problem Statement
- **Symptom**: Validation accuracy stuck at 65%, training accuracy reaches 99%
- **Expected**: Validation accuracy should reach ~85% (baseline from paper)
- **When it occurs**: After epoch 5, validation accuracy plateaus

## Hypotheses (prioritized)
1. [HIGH] Overfitting due to insufficient regularization
2. [MEDIUM] Data augmentation is too weak
3. [LOW] Learning rate too high for fine-grained features

## Experiment Log

### Experiment 1: Add Dropout (Testing Hypothesis 1)
- **Change**: Added dropout=0.5 after each dense layer
- **Hypothesis**: If overfitting, regularization should close train-val gap
- **Result**: Val accuracy improved to 72%, train-val gap reduced from 34% to 27%
- **Conclusion**: PARTIAL SUPPORT - overfitting is a factor but not the only issue

### Experiment 2: Stronger Data Augmentation (Testing Hypothesis 2)
- **Change**: Added RandomRotation(15), ColorJitter, horizontal flip
- **Hypothesis**: More augmentation should improve generalization
- **Result**: Val accuracy improved to 78%
- **Conclusion**: SUPPORTED - data augmentation was insufficient

### Experiment 3: Combine Dropout + Augmentation
- **Result**: Val accuracy 83%
- **Conclusion**: Combined effect addresses most of the gap

## Resolution Summary
- Root causes: Insufficient regularization + weak data augmentation
- Solution: Dropout=0.3 + standard augmentation suite
- Remaining gap: 2% below paper (acceptable, likely due to implementation details)
```

Change only one thing per experiment. If you change the learning rate, regularization, and architecture simultaneously, you can't learn which change mattered. This feels slow but is faster in the long run because you build understanding.
Reproducibility is not just an academic concern—it's a debugging necessity. If you can't reproduce a result, you can't debug it. If you can't reproduce a fix, you don't understand it.
Sources of non-reproducibility include unseeded random number generators (Python, NumPy, framework), nondeterministic GPU kernels (cuDNN autotuning), mismatched library or driver versions, and untracked code changes. The helpers below address each in turn:
```python
import os
import random
import numpy as np
import torch

def set_all_seeds(seed: int = 42):
    """Set all seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # For hash-based operations
    os.environ['PYTHONHASHSEED'] = str(seed)
    # Force deterministic algorithms (may impact performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

def log_environment():
    """Log environment info for reproducibility."""
    import platform
    import subprocess
    import sys

    info = {
        'python_version': sys.version,
        'pytorch_version': torch.__version__,
        'cuda_version': torch.version.cuda,
        'cudnn_version': torch.backends.cudnn.version(),
        'platform': platform.platform(),
        'gpu': torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU',
        'num_gpus': torch.cuda.device_count(),
    }
    # Also log installed package versions
    info['pip_freeze'] = subprocess.run(
        ['pip', 'freeze'], capture_output=True, text=True
    ).stdout
    return info

def save_experiment_config(config, checkpoint_path):
    """Save full experiment configuration."""
    import json
    from datetime import datetime

    full_config = {
        'config': config,
        'environment': log_environment(),
        'timestamp': datetime.now().isoformat(),
        'git_hash': get_git_hash(),  # Track code version
    }
    with open(f'{checkpoint_path}/config.json', 'w') as f:
        json.dump(full_config, f, indent=2, default=str)

def get_git_hash():
    """Get current git commit hash."""
    import subprocess
    try:
        return subprocess.run(
            ['git', 'rev-parse', 'HEAD'],
            capture_output=True, text=True
        ).stdout.strip()
    except Exception:
        return 'unknown'
```

When facing a complex bug with many possible causes, binary search debugging efficiently isolates the issue. The idea: systematically eliminate half the possible causes with each experiment.
Binary search strategies: bisect over time (which commit broke it), over data (which samples break training), over pipeline stages (disable augmentation, preprocessing, or regularization one half at a time), or over the model (swap halves with known-good components).
If your model worked last week but not today, use 'git bisect' to binary search through commits. Mark the last known-good commit as 'good' and current as 'bad'; Git automatically finds the breaking change in O(log n) checkouts.
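The same idea applies to data: if one corrupt sample breaks training (say, a NaN loss on a particular batch), you can bisect the dataset to find it. A minimal sketch, assuming exactly one bad sample and a `passes_check` callback you supply (names are illustrative):

```python
def find_bad_sample(dataset, passes_check):
    """Binary search for the index of the single sample that breaks training.

    `passes_check(subset)` returns True if the subset trains cleanly
    (e.g., a few steps complete without a NaN loss).
    Assumes exactly one bad sample is present in `dataset`.
    """
    lo, hi = 0, len(dataset)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_check(dataset[lo:mid]):
            lo = mid   # lower half is clean, so the bad sample is above mid
        else:
            hi = mid   # lower half fails, so the bad sample is below mid
    return lo

# Demo: plant a NaN at index 37 and locate it in ~7 checks instead of 100.
data = [1.0] * 100
data[37] = float('nan')
is_clean = lambda chunk: all(x == x for x in chunk)  # NaN != NaN
print(find_bad_sample(data, is_clean))  # → 37
```

Seven subset checks replace a hundred single-sample runs; in real use, `passes_check` would run a few training steps on the subset and report whether the loss stayed finite.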
Experienced ML engineers recognize patterns—recurring problem types with known solutions. Learning these patterns accelerates debugging dramatically.
| Symptom | Likely Cause | Diagnostic | Solution |
|---|---|---|---|
| Loss = NaN immediately | Learning rate too high or numerical instability | Reduce LR by 10x, check for log(0) | Lower LR, add epsilon to divisions |
| Loss stuck at random baseline | Labels shuffled, wrong loss function | Print predictions, verify labels match inputs | Fix data pipeline, check loss |
| Train accuracy 100%, val ~random | Severe overfitting or data leakage | Check train-test overlap, feature audit | Add regularization, fix leakage |
| Validation better than train | Bug, dropout misconfigured, or data issue | Check model.eval() called, verify datasets | Fix eval mode, audit data splits |
| Loss decreases then spikes | LR too high after warmup, data anomaly | Check LR schedule, inspect batch at spike | Tune LR schedule, clean data |
| Model outputs constant value | Dead network, wrong initialization | Check activations histogram | Fix initialization, check architecture |
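The "model outputs constant value" row can be checked numerically rather than by eyeballing histograms: measure the fraction of ReLU units that never fire across a batch. A numpy sketch; the function name and the simulated data are illustrative:

```python
import numpy as np

def dead_unit_fraction(activations):
    """Fraction of units that never activate across the batch.

    `activations` is a (batch, units) array of post-ReLU outputs.
    A large fraction suggests dead neurons or bad initialization.
    """
    return float((activations.max(axis=0) <= 0).mean())

# Simulate a layer where 30 of 100 units are stuck in the dead zone.
rng = np.random.default_rng(1)
pre_activations = rng.normal(size=(256, 100))
pre_activations[:, :30] -= 10.0           # push these units far below zero
post_activations = np.maximum(pre_activations, 0.0)
print(dead_unit_fraction(post_activations))  # → 0.3
```

A fraction near 1.0 means the network is effectively dead (matching the "constant output" symptom); even 0.3 is worth investigating with a different initialization or a lower learning rate.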
Debugging expertise compounds over a career. Every bug you solve adds to your pattern library. But only if you learn deliberately—documenting insights, building reusable tools, and reflecting on what worked.
You've now learned to debug ML systems systematically: training issues (gradients, loss, optimization), data problems (quality, labeling, leakage, shift), model behavior (capacity, errors, architecture), performance (memory, speed, distributed), and the meta-skill of structured debugging itself. The key is systematic process over random experimentation. Document, measure, hypothesize, and iterate. The debugging skills you build will differentiate you throughout your ML career.