A model that works correctly but runs too slowly or consumes too much memory is effectively broken for production use. Performance debugging addresses computational efficiency—making training faster, reducing memory usage, and optimizing inference latency.
Performance issues manifest as slow training steps, out-of-memory errors, underutilized GPUs, and inference latency that exceeds production budgets.
This page teaches you to profile ML workloads, identify bottlenecks in training and inference, optimize memory usage, leverage hardware efficiently, and make principled tradeoffs between speed, memory, and accuracy. You'll learn to think about performance from first principles.
Premature optimization is the root of all evil, but optimization without measurement is pure guesswork. Profiling reveals where time and memory actually go.
What to profile: wall-clock time per training phase (data transfer, forward, backward, optimizer step), GPU memory allocation, and operator-level CPU/CUDA time.
```python
import time
from contextlib import contextmanager

import torch
from torch.profiler import profile, record_function, ProfilerActivity


@contextmanager
def timer(name):
    """Simple timing context manager."""
    torch.cuda.synchronize()  # CUDA ops are asynchronous; sync for accurate timing
    start = time.perf_counter()
    yield
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed*1000:.2f}ms")


def profile_training_step(model, batch, loss_fn, optimizer):
    """Profile a single training iteration in detail."""
    with timer("Total Step"):
        with timer("  Data to GPU"):
            inputs, targets = batch
            inputs = inputs.cuda()
            targets = targets.cuda()

        with timer("  Forward"):
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)

        with timer("  Backward"):
            optimizer.zero_grad()
            loss.backward()

        with timer("  Optimizer"):
            optimizer.step()

    # Also report memory
    print(f"GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB / "
          f"{torch.cuda.max_memory_allocated()/1e9:.2f}GB peak")


def pytorch_profiler_analysis(model, dataloader, num_steps=20):
    """Deep profiling with PyTorch Profiler."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
        schedule=torch.profiler.schedule(
            wait=2, warmup=3, active=10, repeat=1
        )
    ) as prof:
        for step, batch in enumerate(dataloader):
            if step >= num_steps:
                break
            with record_function("forward"):
                outputs = model(batch[0].cuda())
            with record_function("backward"):
                outputs.sum().backward()
            prof.step()

    # Print summary sorted by CUDA time
    print(prof.key_averages().table(
        sort_by="cuda_time_total", row_limit=20
    ))

    # Export for visualization in chrome://tracing or Perfetto
    prof.export_chrome_trace("trace.json")
    return prof
```

Engineers often optimize the wrong thing. Profile first to find actual bottlenecks. A common surprise: data loading, not GPU computation, is the bottleneck. Fixing the wrong bottleneck wastes effort and complicates code.
GPU memory is typically the binding constraint for deep learning. Understanding memory usage is essential for training larger models or using larger batch sizes.
What consumes GPU memory (example: a 1B-parameter model trained in FP32 with Adam):
| Component | Memory (GB) | Notes |
|---|---|---|
| Model Parameters | 4 GB | 1B × 4 bytes per float |
| Gradients | 4 GB | Same as parameters |
| Adam Optimizer States | 8 GB | 2× parameters (momentum + variance) |
| Activations | Variable | Depends on batch size, sequence length |
| Total Training | 16+ GB | Before activations |
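The table's arithmetic can be checked with a short helper. This is a sketch: it assumes FP32 (4 bytes per value) for parameters, gradients, and Adam's two optimizer states, and deliberately excludes activations, which vary with batch size and sequence length.

```python
def training_memory_gb(num_params: int, bytes_per_param: int = 4) -> dict:
    """Estimate fixed training memory in GB (excluding activations)."""
    params = num_params * bytes_per_param  # model weights
    grads = params                         # one gradient per parameter
    adam = 2 * params                      # momentum + variance states
    return {
        "parameters_gb": params / 1e9,
        "gradients_gb": grads / 1e9,
        "adam_states_gb": adam / 1e9,
        "total_gb": (params + grads + adam) / 1e9,
    }

# A 1B-parameter model in FP32:
est = training_memory_gb(1_000_000_000)
print(est["total_gb"])  # 16.0, matching the table above
```

Swapping `bytes_per_param` to 2 shows why mixed precision halves the weight and gradient footprint.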
```python
import torch
from torch.cuda.amp import autocast, GradScaler
from torch.utils.checkpoint import checkpoint

# Mixed precision training example
scaler = GradScaler()


def train_step_mixed_precision(model, batch, loss_fn, optimizer):
    inputs, targets = batch[0].cuda(), batch[1].cuda()
    optimizer.zero_grad()

    # Forward pass in FP16
    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Backward with scaled gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


# Gradient accumulation for large effective batch sizes
def train_with_accumulation(model, dataloader, loss_fn, optimizer,
                            accumulation_steps=4):
    """Accumulate gradients to simulate a larger batch size."""
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        inputs, targets = batch[0].cuda(), batch[1].cuda()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets) / accumulation_steps
        loss.backward()

        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()


# Gradient checkpointing to reduce activation memory
class CheckpointedModel(torch.nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i % 4 == 0:  # Checkpoint every 4th layer
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x
```

Faster training means faster iteration, more experiments, and better final models. Training speed is determined by the slowest component in your pipeline.
Common bottlenecks:
```python
import subprocess

import torch
from torch.utils.data import DataLoader


def optimized_dataloader(dataset, batch_size=32):
    """Create a well-optimized DataLoader."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,            # Parallel data loading
        pin_memory=True,          # Faster CPU-GPU transfer
        prefetch_factor=2,        # Batches to prefetch per worker
        persistent_workers=True,  # Keep workers alive between epochs
        drop_last=True            # Consistent batch sizes
    )


def measure_gpu_utilization():
    """Check if training is compute-bound or memory-bound."""
    result = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=utilization.gpu,utilization.memory',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    lines = result.stdout.strip().split('\n')
    for i, line in enumerate(lines):
        gpu_util, mem_util = line.split(', ')
        print(f"GPU {i}: Compute={gpu_util}%, Memory={mem_util}%")

        # Diagnosis
        if int(gpu_util) < 80:
            print(f"  ⚠️ GPU {i} underutilized - likely data loading bottleneck")


# Compile model for faster execution (PyTorch 2.0+)
model = torch.compile(model, mode="default")
```

Production inference has different requirements than training. Training happens offline, where speed is nice-to-have; inference often has hard latency requirements.
Inference-specific concerns include tail latency (p99, not just the mean), throughput under load, and model size. Common optimization techniques:
| Technique | Latency Impact | Accuracy Impact | Implementation Effort |
|---|---|---|---|
| TorchScript/ONNX export | 10-30% faster | None | Low |
| FP16 inference | 2× faster | Minimal | Low |
| INT8 quantization | 2-4× faster | 0-2% degradation | Medium |
| Pruning | Variable | 0-5% degradation | High |
| Distillation | Depends on student model size | 1-5% degradation | High |
| Dynamic batching | Throughput boost | None | Medium |
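Dynamic batching, the last row in the table, trades a small bounded wait for much higher throughput: requests are collected until the batch fills or a deadline passes, then served in one forward pass. Here is a minimal framework-agnostic sketch; `run_batch`, the queue protocol, and the `None` shutdown sentinel are illustrative choices, not a standard API, and production servers (e.g. Triton, TorchServe) implement this for you.

```python
import queue
import time


def dynamic_batcher(request_q, run_batch, max_batch=8, max_wait_ms=5.0):
    """Collect requests until the batch fills or a deadline passes, then run once."""
    while True:
        first = request_q.get()
        if first is None:                      # shutdown sentinel
            return
        batch = [first]
        deadline = time.perf_counter() + max_wait_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.perf_counter()
            if remaining <= 0:
                break                          # deadline passed: serve a partial batch
            try:
                item = request_q.get(timeout=remaining)
            except queue.Empty:
                break
            if item is None:                   # flush pending requests, then shut down
                run_batch(batch)
                return
            batch.append(item)
        run_batch(batch)                       # one model call serves the whole batch
```

`max_wait_ms` is the knob: it bounds the latency each request can pay in exchange for riding along in a bigger batch.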
```python
import time

import numpy as np
import torch


def optimize_for_inference(model):
    """Apply standard inference optimizations."""
    model.eval()

    # 1. Fuse common operation patterns
    # (the names below must match the model's actual submodule names)
    model = torch.quantization.fuse_modules(model, [
        ['conv', 'bn', 'relu']  # Fuse conv-batchnorm-relu
    ], inplace=True)

    return model


def quantize_dynamic(model):
    """Apply dynamic quantization (easiest optimization)."""
    return torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},  # Quantize linear layers
        dtype=torch.qint8
    )


def export_onnx(model, sample_input, output_path):
    """Export to ONNX for deployment."""
    torch.onnx.export(
        model, sample_input, output_path,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'output': {0: 'batch_size'}
        }
    )


def measure_inference_latency(model, sample_input, warmup=10, runs=100):
    """Accurate latency measurement."""
    model.eval()

    # Warmup (important for accurate timing)
    for _ in range(warmup):
        with torch.no_grad():
            model(sample_input)
    torch.cuda.synchronize()  # Wait for all ops to complete

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        with torch.no_grad():
            model(sample_input)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)

    print(f"Latency: {np.mean(latencies)*1000:.2f}ms "
          f"(p99: {np.percentile(latencies, 99)*1000:.2f}ms)")
```

Distributed training introduces new performance challenges: communication overhead, synchronization delays, and load imbalance. Scaling efficiency rarely reaches 100%.
Distributed training bottlenecks include gradient communication (all-reduce) overhead, stragglers forcing synchronization waits at each step, and uneven load across workers.
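The first diagnostic is scaling efficiency: measure throughput on one GPU, then on N, and compare against ideal linear speedup. A minimal sketch (the throughput numbers in the example are hypothetical):

```python
def scaling_efficiency(single_gpu_throughput: float,
                       multi_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    ideal = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal

# Example: 1 GPU sustains 100 samples/s, 8 GPUs sustain 640 samples/s
eff = scaling_efficiency(100, 640, 8)
print(f"{eff:.0%}")  # 80% -- the remaining 20% is communication and sync overhead
```

Efficiency well below ~90% usually points at gradient communication; increasing per-GPU batch size or overlapping communication with backward computation (as DDP does by default) narrows the gap.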
Performance debugging starts with profiling—don't guess. Memory is usually the binding constraint; use mixed precision and checkpointing. Training speed depends on finding the bottleneck (data, compute, or transfer). Inference optimization differs from training; latency and throughput require different techniques. Distributed training adds communication overhead—measure scaling efficiency.