After ensuring your training process is stable and your data is clean, model-level issues become the focus. Model debugging addresses problems with the architecture, capacity, and learning behavior of the model itself.
Model issues typically manifest as persistent underfitting or overfitting, systematic errors on particular subgroups, or unexpected behavior that aggregate metrics hide.
This page covers diagnosing underfitting vs. overfitting, understanding the bias-variance tradeoff in practice, debugging model capacity issues, analyzing prediction errors, and validating that models exhibit expected behavior. You'll learn to systematically isolate whether poor performance stems from the model or elsewhere.
The bias-variance tradeoff is the fundamental lens for understanding model behavior. Every prediction error decomposes into bias (error from overly simple assumptions), variance (error from sensitivity to the particular training sample), and irreducible noise.
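Written out for squared error, the standard decomposition of the expected error at a point $x$ (with true function $f$, fitted model $\hat{f}$, and noise variance $\sigma^2$) is:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is over training sets: bias measures systematic error of the average model, variance measures how much the fitted model wobbles across training sets.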
Diagnosing which problem you have:
The relationship between training error and validation error reveals the issue:
| Training Error | Validation Error | Diagnosis |
|---|---|---|
| High | High (similar) | High bias (underfitting) |
| Low | High | High variance (overfitting) |
| Low | Low | Good fit |
| High | Low | Data leakage or bug |
```python
import numpy as np
from sklearn.model_selection import learning_curve


def diagnose_bias_variance(model, X, y, cv=5):
    """
    Generate learning curves to diagnose bias vs variance.

    Learning curves show how training and validation performance
    evolve as training set size increases.
    """
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=cv, scoring='accuracy', n_jobs=-1
    )

    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    # Diagnosis logic
    final_train = train_mean[-1]
    final_val = val_mean[-1]
    gap = final_train - final_val

    if final_train < 0.7:  # Arbitrary threshold
        diagnosis = "HIGH BIAS: Model underfits. Increase complexity."
    elif gap > 0.15:
        diagnosis = "HIGH VARIANCE: Model overfits. Regularize or get more data."
    elif final_val < 0.85:
        diagnosis = "BOTH: High bias and variance. Consider different architecture."
    else:
        diagnosis = "GOOD FIT: Model generalizes well."

    print(f"Train accuracy: {final_train:.3f}")
    print(f"Val accuracy: {final_val:.3f}")
    print(f"Gap: {gap:.3f}")
    print(f"Diagnosis: {diagnosis}")

    return {
        'train_sizes': train_sizes,
        'train_scores': train_mean,
        'val_scores': val_mean,
        'diagnosis': diagnosis
    }
```

Model capacity refers to the range of functions a model can represent; capacity must match task complexity.
Practical capacity indicators:
| Dataset Size | Task Complexity | Recommended Capacity | Regularization Needs |
|---|---|---|---|
| < 1,000 | Simple | Linear/shallow models | Strong regularization |
| 1,000 - 10,000 | Moderate | Small neural nets, ensembles | Moderate regularization |
| 10,000 - 100,000 | Complex | Deep networks | Standard regularization |
| 100,000+ | Very complex | Large models | Light regularization, more data helps |
Start with a model large enough to overfit your training data. If you can't overfit, you have a capacity problem or a bug. Once you can overfit, add regularization to generalize. This approach ensures you're not debugging regularization when capacity is the issue.
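The overfit-first check can be sketched quickly. The snippet below (scikit-learn on synthetic data, purely for illustration) contrasts a model with enough capacity to memorize the training set against one deliberately too small:

```python
# Sketch of the overfit-first check on synthetic data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A high-capacity model should reach ~100% training accuracy.
big = DecisionTreeClassifier(random_state=0).fit(X, y)
print("unconstrained tree, train acc:", big.score(X, y))  # 1.0

# If even training accuracy is low, suspect capacity (or a bug).
small = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print("depth-1 stump, train acc:", round(small.score(X, y), 2))
```

The unconstrained tree hitting 100% training accuracy confirms the pipeline can learn the data at all; only then is regularization worth tuning.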
Error analysis goes beyond aggregate metrics to understand where and why the model fails. Aggregate accuracy can hide systematic problems—a model with 95% accuracy might completely fail on a critical subgroup.
A typical error analysis workflow: collect the misclassified examples, stratify them by prediction confidence, identify the most-confused class pairs, and measure performance on critical data slices.
```python
import numpy as np
import pandas as pd
from collections import defaultdict


class ErrorAnalyzer:
    """Systematic error analysis for classification models."""

    def __init__(self, model, X_test, y_test, feature_names=None):
        self.model = model
        self.X_test = X_test
        self.y_test = y_test
        self.feature_names = feature_names
        self.predictions = model.predict(X_test)
        self.probabilities = model.predict_proba(X_test)

    def get_errors(self):
        """Get all misclassified examples."""
        error_mask = self.predictions != self.y_test
        return {
            'indices': np.where(error_mask)[0],
            'X': self.X_test[error_mask],
            'y_true': self.y_test[error_mask],
            'y_pred': self.predictions[error_mask],
            'confidence': self.probabilities[error_mask].max(axis=1)
        }

    def analyze_by_confidence(self):
        """Stratify errors by prediction confidence."""
        errors = self.get_errors()
        confidence_buckets = pd.cut(
            errors['confidence'],
            bins=[0, 0.5, 0.7, 0.9, 1.0],
            labels=['very_low', 'low', 'medium', 'high']
        )
        return pd.Series(confidence_buckets).value_counts()

    def analyze_confusion_patterns(self):
        """Find which class pairs are most confused."""
        errors = self.get_errors()
        confusion_pairs = defaultdict(int)
        for true, pred in zip(errors['y_true'], errors['y_pred']):
            confusion_pairs[(true, pred)] += 1
        return dict(sorted(
            confusion_pairs.items(),
            key=lambda x: -x[1]
        )[:10])

    def slice_analysis(self, slice_fn, slice_name):
        """Analyze performance on a specific data slice."""
        slice_mask = slice_fn(self.X_test)
        slice_preds = self.predictions[slice_mask]
        slice_true = self.y_test[slice_mask]

        accuracy = (slice_preds == slice_true).mean()
        error_rate = 1 - accuracy

        return {
            'slice_name': slice_name,
            'n_samples': slice_mask.sum(),
            'accuracy': accuracy,
            'error_rate': error_rate,
            'contribution_to_total_error': (
                (slice_preds != slice_true).sum()
                / (self.predictions != self.y_test).sum()
            )
        }
```

Errors where the model is highly confident are the most concerning: they indicate the model has confidently learned the wrong pattern.
Prioritize understanding these cases—they often reveal labeling errors, data leakage, or fundamental model assumptions that are wrong.
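A minimal sketch of surfacing those high-confidence errors, assuming a scikit-learn-style classifier with `predict` and `predict_proba` (the synthetic data and `LogisticRegression` here are illustrative):

```python
# Surface the most-confident misclassifications for manual review.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

preds = model.predict(X_te)
conf = model.predict_proba(X_te).max(axis=1)

error_idx = np.where(preds != y_te)[0]
# Sort errors by confidence, highest first: these deserve inspection first.
worst_first = error_idx[np.argsort(-conf[error_idx])]
for i in worst_first[:5]:
    print(f"idx={i} true={y_te[i]} pred={preds[i]} conf={conf[i]:.2f}")
```

Reviewing the top of this list by hand is where labeling errors and leakage tend to show up first.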
Neural network architectures can have subtle bugs that don't cause crashes but silently degrade performance. These are among the hardest bugs to find.
Common architecture bugs include ReLU neurons that have gone dead (they never activate, so no gradient flows through them) and parameters that never receive gradients at all, for example a layer defined in `__init__` but skipped in `forward`:
```python
import torch
import torch.nn as nn


def check_dead_relu(model, sample_input):
    """Detect ReLU layers with mostly-dead activations."""
    dead_neuron_report = {}

    def hook_fn(name):
        def hook(module, input, output):
            if isinstance(output, torch.Tensor):
                # Fraction of activations that are zero for this input
                dead_fraction = (output <= 0).float().mean().item()
                if dead_fraction > 0.5:
                    dead_neuron_report[name] = dead_fraction
        return hook

    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            hooks.append(module.register_forward_hook(hook_fn(name)))

    model(sample_input)
    for hook in hooks:
        hook.remove()

    if dead_neuron_report:
        print("⚠️ DEAD ReLU NEURONS DETECTED:")
        for name, frac in dead_neuron_report.items():
            print(f"  {name}: {frac:.1%} dead")
    else:
        print("✓ No dead neuron issues detected")

    return dead_neuron_report


def verify_gradient_flow(model, sample_input, sample_target, loss_fn):
    """Verify gradients flow to all parameters."""
    model.zero_grad()
    output = model(sample_input)
    loss = loss_fn(output, sample_target)
    loss.backward()

    no_grad_params = []
    for name, param in model.named_parameters():
        if param.grad is None or param.grad.abs().sum() == 0:
            no_grad_params.append(name)

    if no_grad_params:
        print("⚠️ PARAMETERS WITH NO GRADIENT:")
        for name in no_grad_params:
            print(f"  {name}")
    else:
        print("✓ All parameters receive gradients")

    return no_grad_params
```

Behavioral tests verify that a model exhibits expected behavior beyond just aggregate metrics. They catch subtle issues that slip past standard evaluation.
Types of behavioral tests:
| Domain | Test Type | Example Test | Expected Behavior |
|---|---|---|---|
| Sentiment | Invariance | Add neutral filler words | Sentiment unchanged |
| Sentiment | Directional | Add strong positive words | Sentiment increases |
| Object Detection | Invariance | Slight brightness change | Same detections |
| Object Detection | Minimum Functionality | Clear centered object | Must detect |
| Regression | Directional | Increase predictor X (positive coef) | Output increases |
| Translation | Consistency | Paraphrased input | Semantically equivalent output |
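As a concrete instance of the regression row above, the sketch below fits a linear model on synthetic data in which feature 0 has a positive true coefficient, then checks that bumping that feature raises predictions (the data, sizes, and helper name are illustrative assumptions):

```python
# Directional behavioral test for a regression model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Feature 0 has a positive true coefficient by construction.
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
model = LinearRegression().fit(X, y)

def directional_test(model, X, feature, delta=1.0):
    """Fraction of rows whose prediction increases when `feature` is bumped."""
    X_bumped = X.copy()
    X_bumped[:, feature] += delta
    return (model.predict(X_bumped) > model.predict(X)).mean()

pass_rate = directional_test(model, X, feature=0)
print(f"Directional test pass rate: {pass_rate:.1%}")  # 100.0%
```

A pass rate below 100% on a test like this flags rows where the model has learned a relationship that contradicts domain knowledge.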
Model debugging requires moving beyond aggregate metrics to understand failure modes. Use learning curves for bias-variance diagnosis; start with a model that can overfit, then regularize. Analyze errors systematically to prioritize fixes, verify that every parameter receives gradients, and add behavioral tests to catch issues that metrics miss.