A model that works correctly but runs too slowly or consumes too much memory is effectively broken for production use. Performance debugging addresses computational efficiency—making training faster, reducing memory usage, and optimizing inference latency.
Performance issues manifest as slow training steps, out-of-memory errors, underutilized GPUs, and inference latency that exceeds production budgets.
This page teaches you to profile ML workloads, identify bottlenecks in training and inference, optimize memory usage, leverage hardware efficiently, and make principled tradeoffs between speed, memory, and accuracy. You'll learn to think about performance from first principles.
Premature optimization is the root of all evil, but optimization without measurement is pure guesswork. Profiling reveals where time and memory actually go.
What to profile: wall-clock time per training phase (data transfer, forward, backward, optimizer step), GPU memory allocation, and operator-level CPU/CUDA time.
```python
import time
from contextlib import contextmanager

import torch
from torch.profiler import profile, record_function, ProfilerActivity


@contextmanager
def timer(name):
    """Simple timing context manager."""
    torch.cuda.synchronize()  # CUDA ops are asynchronous; sync for accurate timing
    start = time.perf_counter()
    yield
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed*1000:.2f}ms")


def profile_training_step(model, batch, loss_fn, optimizer):
    """Profile a single training iteration in detail."""
    with timer("Total Step"):
        with timer("  Data to GPU"):
            inputs, targets = batch
            inputs = inputs.cuda()
            targets = targets.cuda()

        with timer("  Forward"):
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)

        with timer("  Backward"):
            optimizer.zero_grad()
            loss.backward()

        with timer("  Optimizer"):
            optimizer.step()

    # Also report memory
    print(f"GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB / "
          f"{torch.cuda.max_memory_allocated()/1e9:.2f}GB peak")


def pytorch_profiler_analysis(model, dataloader, num_steps=20):
    """Deep profiling with PyTorch Profiler."""
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        profile_memory=True,
        with_stack=True,
        schedule=torch.profiler.schedule(
            wait=2, warmup=3, active=10, repeat=1
        )
    ) as prof:
        for step, batch in enumerate(dataloader):
            if step >= num_steps:
                break
            with record_function("forward"):
                outputs = model(batch[0].cuda())
            with record_function("backward"):
                outputs.sum().backward()
            prof.step()

    # Print summary sorted by CUDA time
    print(prof.key_averages().table(
        sort_by="cuda_time_total", row_limit=20
    ))

    # Export for visualization in chrome://tracing or Perfetto
    prof.export_chrome_trace("trace.json")
    return prof
```

Engineers often optimize the wrong thing. Profile first to find actual bottlenecks. A common surprise: data loading, not GPU computation, is the bottleneck. Fixing the wrong bottleneck wastes effort and complicates code.
GPU memory is typically the binding constraint for deep learning. Understanding memory usage is essential for training larger models or using larger batch sizes.
What consumes GPU memory (example: a 1B-parameter model trained in FP32 with Adam):
| Component | Memory (GB) | Notes |
|---|---|---|
| Model Parameters | 4 GB | 1B × 4 bytes per float |
| Gradients | 4 GB | Same as parameters |
| Adam Optimizer States | 8 GB | 2× parameters (momentum + variance) |
| Activations | Variable | Depends on batch size, sequence length |
| Total Training | 16+ GB | Before activations |
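The table's arithmetic can be checked with a short helper. This is a sketch: it assumes FP32 (4 bytes per value) for parameters, gradients, and Adam's two optimizer states, and deliberately excludes activations, which vary with batch size and sequence length.

```python
def training_memory_gb(num_params: int, bytes_per_param: int = 4) -> dict:
    """Estimate fixed training memory in GB (excluding activations)."""
    params = num_params * bytes_per_param  # model weights
    grads = params                         # one gradient per parameter
    adam = 2 * params                      # momentum + variance states
    return {
        "parameters_gb": params / 1e9,
        "gradients_gb": grads / 1e9,
        "adam_states_gb": adam / 1e9,
        "total_gb": (params + grads + adam) / 1e9,
    }

# A 1B-parameter model in FP32:
est = training_memory_gb(1_000_000_000)
print(est["total_gb"])  # 16.0, matching the table above
```

Swapping `bytes_per_param` to 2 shows why mixed precision halves the weight and gradient footprint.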
```python
import torch
from torch.cuda.amp import autocast, GradScaler
from torch.utils.checkpoint import checkpoint

# Mixed precision training example
scaler = GradScaler()


def train_step_mixed_precision(model, batch, loss_fn, optimizer):
    inputs, targets = batch[0].cuda(), batch[1].cuda()
    optimizer.zero_grad()

    # Forward pass in FP16
    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Backward with scaled gradients
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


# Gradient accumulation for large effective batch sizes
def train_with_accumulation(model, dataloader, loss_fn, optimizer,
                            accumulation_steps=4):
    """Accumulate gradients to simulate a larger batch size."""
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        inputs, targets = batch[0].cuda(), batch[1].cuda()
        outputs = model(inputs)
        loss = loss_fn(outputs, targets) / accumulation_steps
        loss.backward()

        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()


# Gradient checkpointing to reduce activation memory
class CheckpointedModel(torch.nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i % 4 == 0:  # Checkpoint every 4th layer
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x
```

Faster training means faster iteration, more experiments, and better final models. Training speed is determined by the slowest component in your pipeline.
Common bottlenecks:
```python
import subprocess

import torch
from torch.utils.data import DataLoader


def optimized_dataloader(dataset, batch_size=32):
    """Create a well-optimized DataLoader."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,            # Parallel data loading
        pin_memory=True,          # Faster CPU-GPU transfer
        prefetch_factor=2,        # Batches to prefetch per worker
        persistent_workers=True,  # Keep workers alive between epochs
        drop_last=True            # Consistent batch sizes
    )


def measure_gpu_utilization():
    """Check if training is compute-bound or memory-bound."""
    result = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=utilization.gpu,utilization.memory',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True
    )
    lines = result.stdout.strip().split('\n')
    for i, line in enumerate(lines):
        gpu_util, mem_util = line.split(', ')
        print(f"GPU {i}: Compute={gpu_util}%, Memory={mem_util}%")

        # Diagnosis
        if int(gpu_util) < 80:
            print(f"  ⚠️ GPU {i} underutilized - likely data loading bottleneck")


# Compile model for faster execution (PyTorch 2.0+)
model = torch.compile(model, mode="default")
```

Production inference has different requirements than training. Training happens offline, where speed is nice-to-have; inference often has hard latency requirements.
Inference-specific concerns include tail latency (p99, not just the mean), throughput under load, and model size. Common optimization techniques:
| Technique | Latency Impact | Accuracy Impact | Implementation Effort |
|---|---|---|---|
| TorchScript/ONNX export | 10-30% faster | None | Low |
| FP16 inference | 2× faster | Minimal | Low |
| INT8 quantization | 2-4× faster | 0-2% degradation | Medium |
| Pruning | Variable | 0-5% degradation | High |
| Distillation | Depends on student model size | 1-5% degradation | High |
| Dynamic batching | Throughput boost | None | Medium |
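Dynamic batching, the last row in the table, trades a small bounded wait for much higher throughput: requests are collected until the batch fills or a deadline passes, then served in one forward pass. Here is a minimal framework-agnostic sketch; `run_batch`, the queue protocol, and the `None` shutdown sentinel are illustrative choices, not a standard API, and production servers (e.g. Triton, TorchServe) implement this for you.

```python
import queue
import time


def dynamic_batcher(request_q, run_batch, max_batch=8, max_wait_ms=5.0):
    """Collect requests until the batch fills or a deadline passes, then run once."""
    while True:
        first = request_q.get()
        if first is None:                      # shutdown sentinel
            return
        batch = [first]
        deadline = time.perf_counter() + max_wait_ms / 1000.0
        while len(batch) < max_batch:
            remaining = deadline - time.perf_counter()
            if remaining <= 0:
                break                          # deadline passed: serve a partial batch
            try:
                item = request_q.get(timeout=remaining)
            except queue.Empty:
                break
            if item is None:                   # flush pending requests, then shut down
                run_batch(batch)
                return
            batch.append(item)
        run_batch(batch)                       # one model call serves the whole batch
```

`max_wait_ms` is the knob: it bounds the latency each request can pay in exchange for riding along in a bigger batch.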
```python
import time

import numpy as np
import torch


def optimize_for_inference(model):
    """Apply standard inference optimizations."""
    model.eval()

    # 1. Fuse common operation patterns
    # (the names below must match the model's actual submodule names)
    model = torch.quantization.fuse_modules(model, [
        ['conv', 'bn', 'relu']  # Fuse conv-batchnorm-relu
    ], inplace=True)

    return model


def quantize_dynamic(model):
    """Apply dynamic quantization (easiest optimization)."""
    return torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},  # Quantize linear layers
        dtype=torch.qint8
    )


def export_onnx(model, sample_input, output_path):
    """Export to ONNX for deployment."""
    torch.onnx.export(
        model, sample_input, output_path,
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={
            'input': {0: 'batch_size'},
            'output': {0: 'batch_size'}
        }
    )


def measure_inference_latency(model, sample_input, warmup=10, runs=100):
    """Accurate latency measurement."""
    model.eval()

    # Warmup (important for accurate timing)
    for _ in range(warmup):
        with torch.no_grad():
            model(sample_input)
    torch.cuda.synchronize()  # Wait for all ops to complete

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        with torch.no_grad():
            model(sample_input)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)

    print(f"Latency: {np.mean(latencies)*1000:.2f}ms "
          f"(p99: {np.percentile(latencies, 99)*1000:.2f}ms)")
```

Distributed training introduces new performance challenges: communication overhead, synchronization delays, and load imbalance. Scaling efficiency rarely reaches 100%.
Distributed training bottlenecks include gradient communication (all-reduce) overhead, stragglers forcing synchronization waits at each step, and uneven load across workers.
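The first diagnostic is scaling efficiency: measure throughput on one GPU, then on N, and compare against ideal linear speedup. A minimal sketch (the throughput numbers in the example are hypothetical):

```python
def scaling_efficiency(single_gpu_throughput: float,
                       multi_gpu_throughput: float,
                       num_gpus: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    ideal = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal

# Example: 1 GPU sustains 100 samples/s, 8 GPUs sustain 640 samples/s
eff = scaling_efficiency(100, 640, 8)
print(f"{eff:.0%}")  # 80% -- the remaining 20% is communication and sync overhead
```

Efficiency well below ~90% usually points at gradient communication; increasing per-GPU batch size or overlapping communication with backward computation (as DDP does by default) narrows the gap.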
Performance debugging starts with profiling—don't guess. Memory is usually the binding constraint; use mixed precision and checkpointing. Training speed depends on finding the bottleneck (data, compute, or transfer). Inference optimization differs from training; latency and throughput require different techniques. Distributed training adds communication overhead—measure scaling efficiency.