Batch Normalization has a unique property among neural network layers: it behaves differently during training and inference. This dual-mode behavior is essential for BatchNorm to work correctly, but it's also a common source of bugs and confusion.
During training, BatchNorm uses statistics computed from the current mini-batch. During inference, it uses pre-computed "running" statistics accumulated during training. Understanding this distinction deeply—why it exists, how it's implemented, and what can go wrong—is essential for deploying BatchNorm models correctly.
This page provides a complete treatment of training vs. inference behavior, the running statistics mechanism, and the many ways things can go wrong if this distinction is mishandled.
By the end of this page, you will understand: (1) why BatchNorm needs different behavior in training vs. inference, (2) how running statistics are computed and updated, (3) the momentum parameter and its effect on statistics, (4) common bugs and how to diagnose them, and (5) best practices for production deployment.
At first glance, the dual-mode behavior of BatchNorm seems like unnecessary complexity. Why not just use batch statistics all the time, or running statistics all the time? The answer involves several interconnected requirements.
Problem 1: Single-Sample Inference
During inference, we often process one sample at a time. With a batch size of 1, batch statistics are meaningless:
```python
import numpy as np

def demonstrate_single_sample_problem():
    """
    Show why batch statistics fail for single-sample inference.
    """
    # Single sample input
    x_single = np.array([[2.5, -1.2, 0.8]])  # Shape: (1, 3)

    # Batch mean and variance
    batch_mean = np.mean(x_single, axis=0)  # Just the sample values!
    batch_var = np.var(x_single, axis=0)    # Zero!

    print("Single sample BatchNorm attempt:")
    print(f"  Input: {x_single}")
    print(f"  Batch mean: {batch_mean}")
    print(f"  Batch variance: {batch_var}")

    # Normalization attempt
    eps = 1e-5
    try:
        x_norm = (x_single - batch_mean) / np.sqrt(batch_var + eps)
        print(f"  'Normalized' output: {x_norm}")
        print("  Result: All zeros (or near-zero) - information destroyed!")
    except Exception as e:
        print(f"  Error: {e}")

    # Contrast with running statistics
    running_mean = np.array([0.1, -0.3, 0.5])  # From training
    running_var = np.array([1.2, 0.8, 1.5])

    x_norm_running = (x_single - running_mean) / np.sqrt(running_var + eps)
    print("\nWith running statistics:")
    print(f"  Normalized output: {x_norm_running}")
    print("  Result: Meaningful normalized values!")

demonstrate_single_sample_problem()
```

Problem 2: Deterministic Inference
Production systems require deterministic behavior: the same input should always produce the same output. If we used batch statistics during inference, a sample's output would change depending on which other samples happened to share its batch. The table below contrasts batch statistics with running statistics:
| Aspect | Batch Statistics | Running Statistics |
|---|---|---|
| Deterministic? | No (depends on batch) | Yes (fixed values) |
| Single-sample support | No (undefined) | Yes |
| Reflects training data | Current batch only | Entire training distribution |
| Gradient information | Yes (trainable) | No (fixed at inference) |
| Regularization effect | Yes (batch noise) | No |
| Used during | Training | Inference |
Problem 3: Gradient Flow During Training
BatchNorm's training benefits come partly from gradients flowing through the batch statistics. The normalization creates interdependence between samples in a batch, which acts as regularization.
Using fixed running statistics during training would:
- Remove the regularization that comes from batch-to-batch noise
- Stop gradients from flowing through the normalization statistics, eliminating the sample interdependence described above
- Rely on running estimates that are poor early in training, when activation distributions are still shifting
The Solution: Mode Switching
BatchNorm maintains two modes:
- Training mode: normalize with the current mini-batch's mean and variance, and update the running statistics after each batch.
- Evaluation (inference) mode: normalize with the fixed running statistics; nothing is updated.
Forgetting to switch modes is one of the most common deep learning bugs. A model in training mode during evaluation will use incorrect batch statistics and produce variable, degraded results. A model in evaluation mode during training will not get BatchNorm's regularization benefits and won't update running statistics.
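As a quick illustration, here is a minimal sketch of the canonical pattern (the `run_epoch` helper and batch list are hypothetical): call `model.train()` before training work and `model.eval()` before any evaluation or inference.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.ReLU(), nn.Linear(20, 5))

def run_epoch(model, batches, train=True):
    # Explicitly pick the mode: batch statistics (train) vs. running statistics (eval)
    model.train() if train else model.eval()
    outputs = []
    with torch.set_grad_enabled(train):
        for x in batches:
            outputs.append(model(x))
    return outputs

batches = [torch.randn(32, 10) for _ in range(3)]
run_epoch(model, batches, train=True)   # uses batch stats, updates running statistics
run_epoch(model, batches, train=False)  # uses fixed running statistics, deterministic
```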
Running statistics are exponential moving averages of batch statistics, accumulated during training. They approximate the mean and variance of the entire training data distribution.
Exponential Moving Average Update:
After each training batch, the running statistics are updated:
$$\mu_{\text{running}} \leftarrow (1 - \alpha) \cdot \mu_{\text{running}} + \alpha \cdot \mu_{\text{batch}}$$
$$\sigma^2_{\text{running}} \leftarrow (1 - \alpha) \cdot \sigma^2_{\text{running}} + \alpha \cdot \sigma^2_{\text{batch}}$$
where α is the momentum parameter (typically 0.1).
```python
import numpy as np
import matplotlib.pyplot as plt

def simulate_running_statistics(true_mean, true_var, n_batches, batch_size, momentum=0.1):
    """
    Simulate how running statistics converge to true values.
    """
    # Initialize running statistics
    running_mean = 0.0
    running_var = 1.0

    # Track history
    running_mean_history = [running_mean]
    running_var_history = [running_var]
    batch_mean_history = []
    batch_var_history = []

    for batch in range(n_batches):
        # Sample a batch from the true distribution
        batch_data = np.random.normal(true_mean, np.sqrt(true_var), batch_size)

        # Compute batch statistics
        batch_mean = np.mean(batch_data)
        batch_var = np.var(batch_data)

        batch_mean_history.append(batch_mean)
        batch_var_history.append(batch_var)

        # Update running statistics with exponential moving average
        running_mean = (1 - momentum) * running_mean + momentum * batch_mean
        running_var = (1 - momentum) * running_var + momentum * batch_var

        running_mean_history.append(running_mean)
        running_var_history.append(running_var)

    return {
        'running_mean_history': running_mean_history,
        'running_var_history': running_var_history,
        'batch_mean_history': batch_mean_history,
        'batch_var_history': batch_var_history,
        'true_mean': true_mean,
        'true_var': true_var
    }

# Simulate with different momentum values
np.random.seed(42)
true_mean, true_var = 2.5, 4.0
n_batches = 100
batch_size = 32

results = {}
for momentum in [0.01, 0.1, 0.5]:
    results[momentum] = simulate_running_statistics(
        true_mean, true_var, n_batches, batch_size, momentum
    )
    final_mean = results[momentum]['running_mean_history'][-1]
    final_var = results[momentum]['running_var_history'][-1]
    print(f"Momentum {momentum}: Final mean = {final_mean:.3f} (true: {true_mean}), "
          f"Final var = {final_var:.3f} (true: {true_var})")

# Key observations:
# - Higher momentum: Faster adaptation, more noise in estimate
# - Lower momentum: Slower adaptation, smoother estimate
# - All converge to true values given enough iterations
```

Understanding the Momentum Parameter:
The momentum α controls the trade-off between:
- Adaptation speed: how quickly the running statistics track changes in the activation distribution
- Estimate stability: how much batch-to-batch noise leaks into the running statistics
The effective "memory" of the exponential moving average is approximately 1/α batches. With momentum=0.1, the running statistics remember roughly the last 10 batches.
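To make the "roughly 1/α batches" claim concrete, the short calculation below (a sketch; the `recent_weight` helper is illustrative) sums the EMA weights assigned to the most recent batches: with momentum=0.1, about 65% of the running estimate comes from the last 10 batches.

```python
# The EMA assigns weight alpha * (1 - alpha)**k to the batch seen k steps ago.
# Summing the weights of the most recent n batches shows the "effective memory".
def recent_weight(alpha, n):
    return 1 - (1 - alpha) ** n  # total weight carried by the last n batches

for alpha in [0.01, 0.1, 0.5]:
    n = int(round(1 / alpha))
    print(f"momentum={alpha}: last {n} batches carry "
          f"{recent_weight(alpha, n):.1%} of the estimate")
# momentum=0.1: last 10 batches carry 65.1% of the estimate
```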
Common Momentum Values:
| Momentum (α) | Memory (~batches) | Convergence Speed | Noise Sensitivity | Use Case |
|---|---|---|---|---|
| 0.001 | ~1000 | Very slow | Very low | Tiny batches, high noise |
| 0.01 | ~100 | Slow | Low | Conservative training |
| 0.1 | ~10 | Medium | Medium | Standard training |
| 0.2 | ~5 | Fast | High | Short training, stable data |
| 0.5 | ~2 | Very fast | Very high | Non-stationary data |
Different frameworks use different conventions! PyTorch uses momentum α where running = (1-α)·old + α·new. TensorFlow uses momentum β where running = β·old + (1-β)·new. The relationship is α = 1 - β. PyTorch momentum=0.1 equals TensorFlow momentum=0.9. Always check the documentation!
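The sketch below checks the PyTorch convention against a hand-written EMA using the TensorFlow-style decay of 0.9 (the comparison itself is illustrative, not framework code). One detail to note: PyTorch updates the running variance with the unbiased batch variance, even though normalization uses the biased one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, momentum=0.1)  # PyTorch convention: alpha = 0.1

x = torch.randn(32, 3)
bn.train()
bn(x)  # one training step updates the running statistics

# Same update written in the TensorFlow convention: running = beta*old + (1-beta)*new, beta = 0.9
beta = 0.9
manual_mean = beta * torch.zeros(3) + (1 - beta) * x.mean(dim=0)
print(torch.allclose(bn.running_mean, manual_mean, atol=1e-6))  # True

# Running variance is updated with the *unbiased* batch variance in PyTorch
manual_var = beta * torch.ones(3) + (1 - beta) * x.var(dim=0, unbiased=True)
print(torch.allclose(bn.running_var, manual_var, atol=1e-6))  # True
```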
Understanding exactly what happens during training mode is essential for debugging and optimization. Let's trace through a complete training forward pass.
Training Mode Operations:
```python
import torch
import torch.nn as nn

def analyze_training_mode():
    """
    Detailed analysis of BatchNorm behavior in training mode.
    """
    torch.manual_seed(42)

    # Create BatchNorm layer
    bn = nn.BatchNorm1d(4, momentum=0.1)
    bn.train()  # Explicitly set training mode

    print("=== Initial State ===")
    print(f"Running mean: {bn.running_mean.numpy()}")
    print(f"Running var: {bn.running_var.numpy()}")
    print(f"Gamma (weight): {bn.weight.data.numpy()}")
    print(f"Beta (bias): {bn.bias.data.numpy()}")

    # First batch
    x1 = torch.randn(8, 4) * 2 + 1  # Mean ~1, Std ~2
    print("\n=== Batch 1 ===")
    print(f"Input mean per feature: {x1.mean(dim=0).numpy()}")
    print(f"Input var per feature: {x1.var(dim=0, unbiased=False).numpy()}")

    y1 = bn(x1)
    print(f"\nOutput mean per feature: {y1.mean(dim=0).detach().numpy()}")
    print(f"Output var per feature: {y1.var(dim=0, unbiased=False).detach().numpy()}")

    print(f"\nRunning mean after batch 1: {bn.running_mean.numpy()}")
    print(f"Running var after batch 1: {bn.running_var.numpy()}")

    # Second batch with different statistics
    x2 = torch.randn(8, 4) * 0.5 - 2  # Mean ~-2, Std ~0.5
    print("\n=== Batch 2 ===")
    print(f"Input mean per feature: {x2.mean(dim=0).numpy()}")
    print(f"Input var per feature: {x2.var(dim=0, unbiased=False).numpy()}")

    y2 = bn(x2)
    print(f"\nOutput mean per feature: {y2.mean(dim=0).detach().numpy()}")
    print(f"Output var per feature: {y2.var(dim=0, unbiased=False).detach().numpy()}")

    print(f"\nRunning mean after batch 2: {bn.running_mean.numpy()}")
    print(f"Running var after batch 2: {bn.running_var.numpy()}")

    # Observation: Note how outputs are normalized to ~0 mean, ~1 var
    # but running statistics are gradually updating

    return bn

bn = analyze_training_mode()

# Demonstrate gradient flow through batch statistics
print("\n=== Gradient Flow ===")
bn.train()
x = torch.randn(8, 4, requires_grad=True)
y = bn(x)
loss = y.sum()
loss.backward()

print(f"Gradient on input x: shape={x.grad.shape}")
print(f"Gradient on gamma: {bn.weight.grad.numpy()}")
print(f"Gradient on beta: {bn.bias.grad.numpy()}")
print("Note: Gradients flow through the normalization, including batch stats")
```

The Regularization Effect:
During training, the batch statistics introduce stochasticity: each sample's normalized value depends on which other samples landed in its mini-batch, so the same input is transformed slightly differently from batch to batch. This batch-dependent noise acts as a mild regularizer.
Gradient Through Batch Statistics:
The gradients ∂L/∂x flow through the batch statistics computation, creating sample interdependence:
$$\frac{\partial L}{\partial x_i} = f\left(\frac{\partial L}{\partial y_1}, ..., \frac{\partial L}{\partial y_m}\right)$$
This means the gradient for sample i depends on the gradients of all samples in the batch—different from standard layers where samples are independent.
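A small experiment (a sketch, not part of the original example) makes this interdependence tangible: perturbing one sample changes another sample's output in training mode, but not in evaluation mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

x = torch.randn(8, 4)
x_perturbed = x.clone()
x_perturbed[7] += 5.0  # modify only the LAST sample

bn.train()
y, y_p = bn(x), bn(x_perturbed)
print("train mode, sample 0 affected:", not torch.allclose(y[0], y_p[0]))  # True

bn.eval()
y, y_p = bn(x), bn(x_perturbed)
print("eval mode, sample 0 affected:", not torch.allclose(y[0], y_p[0]))   # False
```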
When training with BatchNorm: (1) Use appropriate batch sizes (≥32 recommended), (2) Ensure batches are randomly sampled to avoid bias in statistics, (3) Use model.train() before training loops, (4) Monitor running statistics to verify convergence, (5) Consider batch composition for distributed training.
Evaluation mode uses fixed running statistics accumulated during training. This provides deterministic, consistent behavior for inference.
Evaluation Mode Operations:
```python
import torch
import torch.nn as nn

def analyze_evaluation_mode():
    """
    Detailed analysis of BatchNorm behavior in evaluation mode.
    """
    torch.manual_seed(42)

    # Create and train a BatchNorm layer
    bn = nn.BatchNorm1d(4, momentum=0.1)

    # Simulate training to build up running statistics
    bn.train()
    for _ in range(100):
        x = torch.randn(32, 4) * 2 + 1  # Training distribution
        _ = bn(x)

    print("=== After Training ===")
    print(f"Running mean: {bn.running_mean.numpy()}")
    print(f"Running var: {bn.running_var.numpy()}")

    saved_running_mean = bn.running_mean.clone()
    saved_running_var = bn.running_var.clone()

    # Switch to evaluation mode
    bn.eval()

    print("\n=== Evaluation Mode ===")

    # Process single samples
    x_single = torch.randn(1, 4) * 3 - 0.5  # Different distribution!
    y_single = bn(x_single)
    print(f"Single sample input: {x_single.numpy()}")
    print(f"Single sample output: {y_single.detach().numpy()}")

    # Verify: same sample always produces same output
    y_repeat = bn(x_single)
    print(f"\nSame input again: {y_repeat.detach().numpy()}")
    print(f"Outputs identical: {torch.allclose(y_single, y_repeat)}")

    # Verify: running stats unchanged
    print(f"\nRunning mean unchanged: {torch.allclose(bn.running_mean, saved_running_mean)}")
    print(f"Running var unchanged: {torch.allclose(bn.running_var, saved_running_var)}")

    # Demonstrate independence from batch composition
    x_batch = torch.randn(8, 4)

    # Process as batch
    y_batch = bn(x_batch)

    # Process individually
    y_individual = torch.stack([bn(x_batch[i:i+1]).squeeze() for i in range(8)])

    print(f"\nBatch vs individual processing identical: {torch.allclose(y_batch, y_individual)}")

    return bn

bn = analyze_evaluation_mode()

# Key insight: In eval mode, output for sample i depends ONLY on:
# 1. The sample x_i
# 2. The fixed running statistics (μ_running, σ²_running)
# 3. The learned parameters (γ, β)
#
# Unlike training mode, there's no dependency on other samples in the batch
```

The Evaluation Mode Formula:
In evaluation mode, BatchNorm computes:
$$y = \gamma \cdot \frac{x - \mu_{\text{running}}}{\sqrt{\sigma^2_{\text{running}} + \epsilon}} + \beta$$
This can be algebraically simplified to:
$$y = \gamma' \cdot x + \beta'$$
where:

$$\gamma' = \frac{\gamma}{\sqrt{\sigma^2_{\text{running}} + \epsilon}}, \qquad \beta' = \beta - \frac{\gamma \cdot \mu_{\text{running}}}{\sqrt{\sigma^2_{\text{running}} + \epsilon}}$$
Implication: BatchNorm becomes a linear transform during inference!
This means BatchNorm layers can be fused with adjacent linear layers for efficiency during deployment.
For production deployment, BatchNorm can be 'folded' into adjacent convolution or linear layers. This eliminates the BatchNorm layer entirely, replacing it with modified weights and biases. This reduces model size, eliminates a layer of computation, and simplifies the model graph.
```python
import torch
import torch.nn as nn

def fold_batchnorm_into_linear(linear, bn):
    """
    Fold BatchNorm parameters into a preceding Linear layer.

    Returns a new Linear layer that computes equivalent output
    without needing the BatchNorm.
    """
    # Get BatchNorm parameters
    gamma = bn.weight.data
    beta = bn.bias.data
    mean = bn.running_mean
    var = bn.running_var
    eps = bn.eps

    # Get Linear parameters
    W = linear.weight.data  # Shape: (out_features, in_features)
    b = linear.bias.data if linear.bias is not None else torch.zeros(W.shape[0])

    # Compute folded parameters
    std = torch.sqrt(var + eps)
    scale = gamma / std

    # New weights and bias
    W_folded = W * scale.unsqueeze(1)  # Scale each output row
    b_folded = scale * (b - mean) + beta

    # Create new Linear layer
    folded_linear = nn.Linear(linear.in_features, linear.out_features)
    folded_linear.weight.data = W_folded
    folded_linear.bias.data = b_folded

    return folded_linear

# Verify correctness
torch.manual_seed(42)

# Original layers
linear = nn.Linear(64, 128)
bn = nn.BatchNorm1d(128)

# Simulate training
model = nn.Sequential(linear, bn)
model.train()
for _ in range(100):
    x = torch.randn(32, 64)
    _ = model(x)

model.eval()

# Fold BatchNorm
folded = fold_batchnorm_into_linear(linear, bn)

# Compare outputs
x_test = torch.randn(16, 64)
y_original = model(x_test)
y_folded = folded(x_test)

print(f"Max difference: {(y_original - y_folded).abs().max().item():.2e}")
print("Outputs are equivalent! BatchNorm successfully folded.")
```

BatchNorm's dual-mode behavior leads to numerous potential bugs. Being aware of these common pitfalls can save hours of debugging.
Pitfall 1: Forgetting to Switch Modes
The most common bug: evaluating a model that's still in training mode.
```python
import torch
import torch.nn as nn

def demonstrate_mode_bug():
    """
    Show the consequences of incorrect mode switching.
    """
    torch.manual_seed(42)

    # Create model with BatchNorm
    model = nn.Sequential(
        nn.Linear(10, 20),
        nn.BatchNorm1d(20),
        nn.ReLU(),
        nn.Linear(20, 5)
    )

    # Simulate training
    model.train()
    for _ in range(100):
        x = torch.randn(32, 10)
        _ = model(x)

    # BUG: Forgot to call model.eval()!

    # Test with different batch sizes
    x_test = torch.randn(1, 10)  # Single sample

    # Multiple evaluations of same input in training mode
    results_training_mode = []
    for _ in range(5):
        # In training mode, adding other samples changes batch stats!
        batch = torch.cat([x_test, torch.randn(31, 10)], dim=0)
        y = model(batch)[0:1]  # Take only first sample output
        results_training_mode.append(y.detach())

    print("=== BUG: Model in training mode during evaluation ===")
    print("Same input, different batch compositions:")
    for i, r in enumerate(results_training_mode):
        print(f"  Attempt {i+1}: {r.numpy().flatten()[:3]}...")

    # Correct: switch to eval mode
    model.eval()

    results_eval_mode = []
    for _ in range(5):
        batch = torch.cat([x_test, torch.randn(31, 10)], dim=0)
        y = model(batch)[0:1]
        results_eval_mode.append(y.detach())

    print("\n=== CORRECT: Model in eval mode ===")
    print("Same input, different batch compositions:")
    for i, r in enumerate(results_eval_mode):
        print(f"  Attempt {i+1}: {r.numpy().flatten()[:3]}...")
    print("Outputs are now identical!")

demonstrate_mode_bug()
```

Pitfall 2: Corrupted Running Statistics
Running statistics can become corrupted in several ways: numerical instabilities that propagate NaN or Inf values, exploding activations that inflate the variance estimates, dead features that drive variances toward zero, unrepresentative or tiny batches late in training, and stale statistics inherited from pre-training on a different data distribution. The diagnostic below flags these conditions:
```python
import torch
import torch.nn as nn
import numpy as np

def diagnose_running_statistics(model):
    """
    Diagnostic function to analyze BatchNorm running statistics.
    """
    print("=== BatchNorm Running Statistics Diagnosis ===\n")

    for name, module in model.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            mean = module.running_mean.numpy()
            var = module.running_var.numpy()

            print(f"Layer: {name}")
            print(f"  Running mean - min: {mean.min():.4f}, max: {mean.max():.4f}, "
                  f"mean: {mean.mean():.4f}")
            print(f"  Running var - min: {var.min():.4f}, max: {var.max():.4f}, "
                  f"mean: {var.mean():.4f}")

            # Red flags
            issues = []
            if np.any(var < 1e-6):
                issues.append("⚠️ Near-zero variance detected (possible dead features)")
            if np.any(var > 1e6):
                issues.append("⚠️ Extremely high variance (possible exploding activations)")
            if np.any(np.abs(mean) > 100):
                issues.append("⚠️ Large mean values (possible activation drift)")
            if np.any(np.isnan(mean)) or np.any(np.isnan(var)):
                issues.append("🚨 NaN values detected!")
            if np.any(np.isinf(var)):
                issues.append("🚨 Inf values detected!")

            if issues:
                for issue in issues:
                    print(f"  {issue}")
            else:
                print("  ✓ Statistics look healthy")
            print()

# Usage example
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# After training, diagnose
diagnose_running_statistics(model)

# If issues found, consider:
# 1. Re-running training with larger batch size
# 2. Computing statistics from a representative data sample
# 3. Using Batch Renormalization for small batches
```

When fine-tuning pre-trained models, BatchNorm layers may have running statistics from the pre-training data. If your fine-tuning data has different statistics, using these stale values can hurt performance. Options: (1) Re-compute running stats on your data before fine-tuning, (2) Train BatchNorm layers normally (don't freeze), (3) Use a calibration pass to update running stats with frozen weights.
Deploying models with BatchNorm requires attention to several details. These best practices help ensure reliable production behavior.
Best Practice 1: Explicit Mode Setting
Always explicitly set the mode before training or evaluation, even if you think it's already correct:
```python
import torch
import torch.nn as nn

def train_epoch(model, dataloader, optimizer, criterion):
    """Training epoch with explicit mode setting."""
    model.train()  # ALWAYS set explicitly at start of training

    for batch in dataloader:
        # Training code...
        pass

def evaluate(model, dataloader, criterion):
    """Evaluation with explicit mode setting."""
    model.eval()  # ALWAYS set explicitly at start of evaluation

    with torch.no_grad():  # Also disable gradients for inference
        for batch in dataloader:
            # Evaluation code...
            pass

def inference(model, x):
    """Single-sample inference with safety checks."""
    was_training = model.training
    model.eval()

    with torch.no_grad():
        result = model(x)

    # Optionally restore original mode
    if was_training:
        model.train()

    return result

# Context manager approach (very clean)
class EvalMode:
    """Context manager for temporary eval mode."""
    def __init__(self, model):
        self.model = model
        self.was_training = model.training

    def __enter__(self):
        self.model.eval()
        return self.model

    def __exit__(self, *args):
        if self.was_training:
            self.model.train()

# Usage
model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10))
model.train()

with EvalMode(model):
    # Model is in eval mode here
    output = model(torch.randn(1, 10))

# Model is back in train mode here
```

Best Practice 2: Statistics Verification
Before deploying, verify that running statistics are reasonable. Check for:
| Check | Expected | Action if Failed |
|---|---|---|
| Running mean range | Within expected input range | Investigate training data or bugs |
| Running var > 0 | All positive values | Check for dead features or bugs |
| No NaN/Inf | All finite values | Check for numerical issues in training |
| eval() produces consistent output | Same input → same output | Ensure model.eval() is called |
| train() after training | Running stats match data | Verify final training batches are valid |
| Gamma not near zero | Most values > 0.1 | May have dead normalized features |
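A compact way to automate these checks is sketched below; the helper name `check_batchnorm_ready` and the thresholds are illustrative, not a standard API.

```python
import torch
import torch.nn as nn

def check_batchnorm_ready(model, sample_input):
    """Illustrative pre-deployment sanity checks for BatchNorm layers."""
    for name, m in model.named_modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            assert torch.isfinite(m.running_mean).all(), f"{name}: non-finite running mean"
            assert torch.isfinite(m.running_var).all(), f"{name}: non-finite running var"
            assert (m.running_var > 0).all(), f"{name}: non-positive running variance"
            if (m.weight.abs() < 0.1).float().mean() > 0.5:
                print(f"warning: {name}: many gamma values near zero")

    # eval() must give identical outputs for repeated identical inputs
    model.eval()
    with torch.no_grad():
        assert torch.allclose(model(sample_input), model(sample_input)), \
            "non-deterministic eval output"
    print("BatchNorm deployment checks passed")

model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.ReLU(), nn.Linear(20, 5))
check_batchnorm_ready(model, torch.randn(4, 10))
```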
When exporting models (ONNX, TorchScript, TensorRT), BatchNorm layers are typically exported in eval mode with fixed running statistics. Ensure your running statistics are finalized before export. Some export formats allow BatchNorm folding automatically—check your framework's documentation.
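A minimal sketch of the export flow (the file name and opset version are illustrative): finalize the running statistics, switch to eval mode, then export so the fixed statistics are baked into the graph.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.ReLU(), nn.Linear(20, 5))
# ... train the model so the running statistics converge ...

model.eval()  # export must happen in eval mode so fixed running stats are used
dummy_input = torch.randn(1, 10)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=13)
```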
Some scenarios require special handling of BatchNorm's training/evaluation behavior. Understanding these edge cases prevents subtle bugs.
Case 1: Training with eval() for Specific Layers
Sometimes you want most of the network in training mode but specific BatchNorm layers in eval mode (e.g., frozen pre-trained layers):
```python
import torch
import torch.nn as nn

def freeze_batchnorm_layers(model):
    """
    Freeze BatchNorm layers: use eval mode + no parameter updates.
    """
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()  # Use running statistics
            # Also freeze parameters
            for param in module.parameters():
                param.requires_grad = False

def unfreeze_batchnorm_layers(model):
    """
    Unfreeze BatchNorm layers.
    """
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.train()  # Use batch statistics
            for param in module.parameters():
                param.requires_grad = True

# Usage in fine-tuning
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
head = nn.Sequential(
    nn.Linear(64 * 30 * 30, 10)
)

# Freeze backbone BatchNorm when fine-tuning
freeze_batchnorm_layers(backbone)
backbone.eval()  # Keep in eval mode

# Head layers train normally
head.train()

# NOTE: After calling model.train(), individual modules reset!
# Need to re-freeze after any global mode change
```

Case 2: Recalculating Running Statistics
Sometimes running statistics become stale or corrupted. You can recalculate them from a data sample:
```python
import torch
import torch.nn as nn

def recalculate_running_stats(model, dataloader, num_batches=100):
    """
    Recalculate BatchNorm running statistics from data.

    This is useful when:
    1. Running stats were corrupted during training
    2. Fine-tuning on new data with different statistics
    3. Loading a model with mismatched running stats
    """
    # Reset running stats
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.reset_running_stats()
            # Set momentum to None for cumulative average instead of EMA
            module.momentum = None

    # Set to training mode to accumulate statistics
    model.train()

    with torch.no_grad():
        for i, (x, _) in enumerate(dataloader):
            if i >= num_batches:
                break
            # Forward pass accumulates statistics
            _ = model(x)

            if (i + 1) % 10 == 0:
                print(f"Processed {i + 1} batches")

    # Restore momentum for future training
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.momentum = 0.1  # Default value

    # Set back to eval mode
    model.eval()
    print("Running statistics recalculated.")

# Alternative: Use exponential moving average with specified momentum
def recalculate_with_ema(model, dataloader, momentum=0.1, num_batches=200):
    """Recalculate using EMA (more stable for streaming data)."""
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.reset_running_stats()
            module.momentum = momentum

    model.train()
    with torch.no_grad():
        for i, (x, _) in enumerate(dataloader):
            if i >= num_batches:
                break
            _ = model(x)

    model.eval()
```

If your test data comes from a different distribution than training data (domain shift), the running statistics from training may be suboptimal. Test-time adaptation techniques update BatchNorm statistics using test batches. This requires batch-mode inference but can significantly improve performance under distribution shift.
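One simple form of test-time adaptation is sketched below, under the assumption that test data arrives in batches; the helper `adapt_bn_statistics` is illustrative. Only the BatchNorm running buffers are refreshed on the test distribution, while every weight stays frozen.

```python
import torch
import torch.nn as nn

def adapt_bn_statistics(model, test_batches, momentum=0.1):
    """Sketch: refresh BatchNorm running statistics on test-distribution batches."""
    model.eval()
    # Put only the BatchNorm layers back into training mode so their statistics update
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()
            m.momentum = momentum

    with torch.no_grad():  # no weights are updated, only the running buffers
        for x in test_batches:
            model(x)

    model.eval()  # back to fully deterministic inference
    return model

model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.ReLU(), nn.Linear(20, 5))
test_batches = [torch.randn(32, 10) * 2 + 3 for _ in range(10)]  # shifted "test" distribution
adapt_bn_statistics(model, test_batches)
```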
We've comprehensively covered BatchNorm's dual-mode behavior. Here are the essential takeaways:
- Training mode normalizes with batch statistics and updates running statistics; evaluation mode normalizes with fixed running statistics.
- Running statistics are exponential moving averages controlled by the momentum parameter, with an effective memory of roughly 1/α batches.
- Always set model.train() and model.eval() explicitly; forgetting to switch modes is the most common BatchNorm bug.
- In evaluation mode BatchNorm is a linear transform and can be folded into adjacent layers for deployment.
- Before exporting or deploying, verify that running statistics are finite, positive, and in a sensible range, and recalculate them if they are stale or corrupted.
What's Next:
While Batch Normalization revolutionized deep learning, it has limitations—particularly for sequence models and small batches. The next page introduces Layer Normalization, which normalizes across features rather than across the batch, enabling normalization in architectures where BatchNorm struggles.
You now have a deep understanding of BatchNorm's dual-mode behavior, common pitfalls, and best practices for production deployment. This knowledge will help you avoid subtle bugs and build reliable deep learning systems.