The difference between junior and senior ML practitioners often lies not in what they know, but in how they approach problems. Ad-hoc debugging—randomly changing hyperparameters, adding layers, or trying different optimizers—is slow, frustrating, and often fails. Systematic debugging is methodical, reproducible, and effective.
This page synthesizes everything you've learned into a complete debugging methodology—a structured approach to diagnosing and resolving any ML problem you encounter.
By completing this page, you'll have a complete mental framework for ML debugging: structured diagnosis processes, hypothesis-driven experimentation, reproducibility practices, and documentation standards. You'll debug faster, communicate better, and build a knowledge base that compounds over your career.
ML debugging should follow a hierarchy of causes, checking simpler explanations before complex ones. This prevents wasted effort chasing sophisticated problems when the root cause is mundane.
The debugging pyramid (check in order):
1. Pipeline bugs (most common, simplest to check)
2. Data quality
3. Training dynamics
4. Optimizer pathologies
5. Hyperparameter sensitivity
6. Architectural issues (rarest, most complex)
Most ML debugging situations are caused by issues at the bottom of the pyramid—data problems and pipeline bugs. Yet most engineers start at the top, tuning architectures and hyperparameters.
| Level | Example Issues | How to Check | Typical Fix Time |
|---|---|---|---|
| 1. Pipeline bugs | Wrong data path, shuffled labels, preprocessing error | Assert shapes, print samples, sanity checks | Minutes to hours |
| 2. Data quality | Label noise, leakage, distribution shift | Data analysis, visualization, validation sets | Hours to days |
| 3. Training dynamics | Wrong loss, vanishing gradients, wrong LR | Monitor metrics, gradient histograms | Hours |
| 4. Model & architecture | Wrong capacity, architecture bug, dead neurons | Overfit test, ablation, activation analysis | Hours to days |
| 5. Deep issues | Subtle optimization issues, hardware bugs | Deep profiling, research literature review | Days to weeks |
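The cheapest checks in the hierarchy (shapes, labels, finite values) can be run as a handful of assertions on a single batch before any tuning. A minimal numpy sketch; the function name and thresholds are illustrative, not from any particular library:

```python
import numpy as np

def sanity_check_batch(inputs, labels, num_classes):
    """Cheap pipeline checks: run once on a real batch before any tuning."""
    # Batch dimensions must agree between inputs and labels.
    assert inputs.shape[0] == labels.shape[0], "batch size mismatch"
    # Labels must be valid class indices.
    assert labels.min() >= 0 and labels.max() < num_classes, "label out of range"
    # Inputs should be finite and not constant (a constant batch usually
    # means a preprocessing or data-path bug).
    assert np.isfinite(inputs).all(), "non-finite values in inputs"
    assert inputs.std() > 0, "inputs are constant -- check preprocessing"
    # Return per-class counts so you can eyeball label balance.
    return np.bincount(labels, minlength=num_classes)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 3, 8, 8)).astype(np.float32)
y = rng.integers(0, 10, size=32)
print(sanity_check_batch(x, y, num_classes=10))
```

A failed assertion here points you at a level-one fix measured in minutes, not an architecture search measured in days.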
Before anything else, try to overfit a single batch. If your model can't memorize 10-100 examples, you have a bug in your pipeline—not a tuning problem. This simple test eliminates levels 1-4 of the hierarchy in minutes.
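The single-batch overfit test fits in a few lines of PyTorch. This is a sketch with illustrative names and thresholds (300 Adam steps, >99% accuracy as the pass criterion); adapt the loss and accuracy check to your task:

```python
import torch
import torch.nn as nn

def can_overfit_one_batch(model, inputs, targets, steps=300, lr=1e-2):
    """Return True if the model memorizes a single batch.

    If this fails, suspect a pipeline bug -- not a tuning problem.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()
    acc = (model(inputs).argmax(dim=1) == targets).float().mean().item()
    return acc > 0.99

torch.manual_seed(0)
x = torch.randn(16, 20)                      # one small batch
y = torch.randint(0, 4, (16,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 4))
print(can_overfit_one_batch(model, x, y))
```

Even a tiny MLP should memorize 16 random examples; if yours cannot, inspect the data pipeline and loss wiring before touching hyperparameters.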
Effective debugging treats experiments as scientific hypotheses. Each change should test a specific belief about what's wrong. This prevents random exploration and builds understanding even when experiments fail.
The hypothesis-driven process:
1. State the problem precisely: symptom, expected behavior, and when it occurs.
2. List hypotheses about the cause, ranked by likelihood.
3. Design one experiment per hypothesis, changing a single variable.
4. Record the result and conclude whether it supports or refutes the hypothesis.
5. Iterate until the root cause is found, then summarize the resolution.
```markdown
# ML Debugging Log

## Problem Statement
- **Symptom**: Validation accuracy stuck at 65%, training accuracy reaches 99%
- **Expected**: Validation accuracy should reach ~85% (baseline from paper)
- **When it occurs**: After epoch 5, validation accuracy plateaus

## Hypotheses (prioritized)
1. [HIGH] Overfitting due to insufficient regularization
2. [MEDIUM] Data augmentation is too weak
3. [LOW] Learning rate too high for fine-grained features

## Experiment Log

### Experiment 1: Add Dropout (Testing Hypothesis 1)
- **Change**: Added dropout=0.5 after each dense layer
- **Hypothesis**: If overfitting, regularization should close train-val gap
- **Result**: Val accuracy improved to 72%, train-val gap reduced from 34% to 27%
- **Conclusion**: PARTIAL SUPPORT - overfitting is a factor but not the only issue

### Experiment 2: Stronger Data Augmentation (Testing Hypothesis 2)
- **Change**: Added RandomRotation(15), ColorJitter, horizontal flip
- **Hypothesis**: More augmentation should improve generalization
- **Result**: Val accuracy improved to 78%
- **Conclusion**: SUPPORTED - data augmentation was insufficient

### Experiment 3: Combine Dropout + Augmentation
- **Result**: Val accuracy 83%
- **Conclusion**: Combined effect addresses most of the gap

## Resolution Summary
- Root causes: Insufficient regularization + weak data augmentation
- Solution: Dropout=0.3 + standard augmentation suite
- Remaining gap: 2% below paper (acceptable, likely due to implementation details)
```

Change only one thing per experiment. If you change the learning rate, regularization, and architecture simultaneously, you can't learn which change mattered. This feels slow but is faster in the long run because you build understanding.
Reproducibility is not just an academic concern—it's a debugging necessity. If you can't reproduce a result, you can't debug it. If you can't reproduce a fix, you don't understand it.
Sources of non-reproducibility include unseeded random number generators (Python, NumPy, framework), nondeterministic GPU kernels (cuDNN autotuning), mismatched library or driver versions, and untracked code changes. The helpers below address each in turn:
```python
import os
import random
import numpy as np
import torch

def set_all_seeds(seed: int = 42):
    """Set all seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # For hash-based operations
    os.environ['PYTHONHASHSEED'] = str(seed)
    # Force deterministic algorithms (may impact performance)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

def log_environment():
    """Log environment info for reproducibility."""
    import platform
    import subprocess
    import sys

    info = {
        'python_version': sys.version,
        'pytorch_version': torch.__version__,
        'cuda_version': torch.version.cuda,
        'cudnn_version': torch.backends.cudnn.version(),
        'platform': platform.platform(),
        'gpu': torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU',
        'num_gpus': torch.cuda.device_count(),
    }
    # Also log installed package versions
    info['pip_freeze'] = subprocess.run(
        ['pip', 'freeze'], capture_output=True, text=True
    ).stdout
    return info

def save_experiment_config(config, checkpoint_path):
    """Save full experiment configuration."""
    import json
    from datetime import datetime

    full_config = {
        'config': config,
        'environment': log_environment(),
        'timestamp': datetime.now().isoformat(),
        'git_hash': get_git_hash(),  # Track code version
    }
    with open(f'{checkpoint_path}/config.json', 'w') as f:
        json.dump(full_config, f, indent=2, default=str)

def get_git_hash():
    """Get current git commit hash."""
    import subprocess
    try:
        return subprocess.run(
            ['git', 'rev-parse', 'HEAD'],
            capture_output=True, text=True
        ).stdout.strip()
    except Exception:
        return 'unknown'
```

When facing a complex bug with many possible causes, binary search debugging efficiently isolates the issue. The idea: systematically eliminate half the possible causes with each experiment.
Binary search strategies: bisect over time (which commit broke it), over data (which samples break training), over pipeline stages (disable augmentation, preprocessing, or regularization one half at a time), or over the model (swap halves with known-good components).
If your model worked last week but not today, use 'git bisect' to binary search through commits. Mark the last known-good commit as 'good' and current as 'bad'; Git automatically finds the breaking change in O(log n) checkouts.
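The same idea applies to data: if one corrupt sample breaks training (say, a NaN loss on a particular batch), you can bisect the dataset to find it. A minimal sketch, assuming exactly one bad sample and a `passes_check` callback you supply (names are illustrative):

```python
def find_bad_sample(dataset, passes_check):
    """Binary search for the index of the single sample that breaks training.

    `passes_check(subset)` returns True if the subset trains cleanly
    (e.g., a few steps complete without a NaN loss).
    Assumes exactly one bad sample is present in `dataset`.
    """
    lo, hi = 0, len(dataset)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_check(dataset[lo:mid]):
            lo = mid   # lower half is clean, so the bad sample is above mid
        else:
            hi = mid   # lower half fails, so the bad sample is below mid
    return lo

# Demo: plant a NaN at index 37 and locate it in ~7 checks instead of 100.
data = [1.0] * 100
data[37] = float('nan')
is_clean = lambda chunk: all(x == x for x in chunk)  # NaN != NaN
print(find_bad_sample(data, is_clean))  # → 37
```

Seven subset checks replace a hundred single-sample runs; in real use, `passes_check` would run a few training steps on the subset and report whether the loss stayed finite.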
Experienced ML engineers recognize patterns—recurring problem types with known solutions. Learning these patterns accelerates debugging dramatically.
| Symptom | Likely Cause | Diagnostic | Solution |
|---|---|---|---|
| Loss = NaN immediately | Learning rate too high or numerical instability | Reduce LR by 10x, check for log(0) | Lower LR, add epsilon to divisions |
| Loss stuck at random baseline | Labels shuffled, wrong loss function | Print predictions, verify labels match inputs | Fix data pipeline, check loss |
| Train accuracy 100%, val ~random | Severe overfitting or data leakage | Check train-test overlap, feature audit | Add regularization, fix leakage |
| Validation better than train | Bug, dropout misconfigured, or data issue | Check model.eval() called, verify datasets | Fix eval mode, audit data splits |
| Loss decreases then spikes | LR too high after warmup, data anomaly | Check LR schedule, inspect batch at spike | Tune LR schedule, clean data |
| Model outputs constant value | Dead network, wrong initialization | Check activations histogram | Fix initialization, check architecture |
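The "model outputs constant value" row can be checked numerically rather than by eyeballing histograms: measure the fraction of ReLU units that never fire across a batch. A numpy sketch; the function name and the simulated data are illustrative:

```python
import numpy as np

def dead_unit_fraction(activations):
    """Fraction of units that never activate across the batch.

    `activations` is a (batch, units) array of post-ReLU outputs.
    A large fraction suggests dead neurons or bad initialization.
    """
    return float((activations.max(axis=0) <= 0).mean())

# Simulate a layer where 30 of 100 units are stuck in the dead zone.
rng = np.random.default_rng(1)
pre_activations = rng.normal(size=(256, 100))
pre_activations[:, :30] -= 10.0           # push these units far below zero
post_activations = np.maximum(pre_activations, 0.0)
print(dead_unit_fraction(post_activations))  # → 0.3
```

A fraction near 1.0 means the network is effectively dead (matching the "constant output" symptom); even 0.3 is worth investigating with a different initialization or a lower learning rate.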
Debugging expertise compounds over a career. Every bug you solve adds to your pattern library. But only if you learn deliberately—documenting insights, building reusable tools, and reflecting on what worked.
You've now learned to debug ML systems systematically: training issues (gradients, loss, optimization), data problems (quality, labeling, leakage, shift), model behavior (capacity, errors, architecture), performance (memory, speed, distributed), and the meta-skill of structured debugging itself. The key is systematic process over random experimentation. Document, measure, hypothesize, and iterate. The debugging skills you build will differentiate you throughout your ML career.