Batch Normalization has a unique property among neural network layers: it behaves differently during training and inference. This dual-mode behavior is essential for BatchNorm to work correctly, but it's also a common source of bugs and confusion.
During training, BatchNorm uses statistics computed from the current mini-batch. During inference, it uses pre-computed "running" statistics accumulated during training. Understanding this distinction deeply—why it exists, how it's implemented, and what can go wrong—is essential for deploying BatchNorm models correctly.
This page provides a complete treatment of training vs. inference behavior, the running statistics mechanism, and the many ways things can go wrong if this distinction is mishandled.
By the end of this page, you will understand: (1) why BatchNorm needs different behavior in training vs. inference, (2) how running statistics are computed and updated, (3) the momentum parameter and its effect on statistics, (4) common bugs and how to diagnose them, and (5) best practices for production deployment.
At first glance, the dual-mode behavior of BatchNorm seems like unnecessary complexity. Why not just use batch statistics all the time, or running statistics all the time? The answer involves several interconnected requirements.
Problem 1: Single-Sample Inference
During inference, we often process one sample at a time. With a batch size of 1, batch statistics are meaningless:
```python
import numpy as np

def demonstrate_single_sample_problem():
    """
    Show why batch statistics fail for single-sample inference.
    """
    # Single sample input
    x_single = np.array([[2.5, -1.2, 0.8]])  # Shape: (1, 3)

    # Batch mean and variance
    batch_mean = np.mean(x_single, axis=0)  # Just the sample values!
    batch_var = np.var(x_single, axis=0)    # Zero!

    print("Single sample BatchNorm attempt:")
    print(f"  Input: {x_single}")
    print(f"  Batch mean: {batch_mean}")
    print(f"  Batch variance: {batch_var}")

    # Normalization attempt
    eps = 1e-5
    try:
        x_norm = (x_single - batch_mean) / np.sqrt(batch_var + eps)
        print(f"  'Normalized' output: {x_norm}")
        print("  Result: All zeros (or near-zero) - information destroyed!")
    except Exception as e:
        print(f"  Error: {e}")

    # Contrast with running statistics
    running_mean = np.array([0.1, -0.3, 0.5])  # From training
    running_var = np.array([1.2, 0.8, 1.5])

    x_norm_running = (x_single - running_mean) / np.sqrt(running_var + eps)
    print("\nWith running statistics:")
    print(f"  Normalized output: {x_norm_running}")
    print("  Result: Meaningful normalized values!")

demonstrate_single_sample_problem()
```

Problem 2: Deterministic Inference
Production systems require deterministic behavior: the same input should always produce the same output. If we used batch statistics during inference, a sample's output would change depending on which other samples happened to share its batch. The table below contrasts batch statistics with running statistics:
| Aspect | Batch Statistics | Running Statistics |
|---|---|---|
| Deterministic? | No (depends on batch) | Yes (fixed values) |
| Single-sample support | No (undefined) | Yes |
| Reflects training data | Current batch only | Entire training distribution |
| Gradient information | Yes (trainable) | No (fixed at inference) |
| Regularization effect | Yes (batch noise) | No |
| Used during | Training | Inference |
Problem 3: Gradient Flow During Training
BatchNorm's training benefits come partly from gradients flowing through the batch statistics. The normalization creates interdependence between samples in a batch, which acts as regularization.
Using fixed running statistics during training would:
- Remove the regularization that comes from batch-to-batch noise
- Stop gradients from flowing through the normalization statistics, eliminating the sample interdependence described above
- Rely on running estimates that are poor early in training, when activation distributions are still shifting
The Solution: Mode Switching
BatchNorm maintains two modes:
- Training mode: normalize with the current mini-batch's mean and variance, and update the running statistics after each batch.
- Evaluation (inference) mode: normalize with the fixed running statistics; nothing is updated.
Forgetting to switch modes is one of the most common deep learning bugs. A model in training mode during evaluation will use incorrect batch statistics and produce variable, degraded results. A model in evaluation mode during training will not get BatchNorm's regularization benefits and won't update running statistics.
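As a quick illustration, here is a minimal sketch of the canonical pattern (the `run_epoch` helper and batch list are hypothetical): call `model.train()` before training work and `model.eval()` before any evaluation or inference.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.ReLU(), nn.Linear(20, 5))

def run_epoch(model, batches, train=True):
    # Explicitly pick the mode: batch statistics (train) vs. running statistics (eval)
    model.train() if train else model.eval()
    outputs = []
    with torch.set_grad_enabled(train):
        for x in batches:
            outputs.append(model(x))
    return outputs

batches = [torch.randn(32, 10) for _ in range(3)]
run_epoch(model, batches, train=True)   # uses batch stats, updates running statistics
run_epoch(model, batches, train=False)  # uses fixed running statistics, deterministic
```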
Running statistics are exponential moving averages of batch statistics, accumulated during training. They approximate the mean and variance of the entire training data distribution.
Exponential Moving Average Update:
After each training batch, the running statistics are updated:
$$\mu_{\text{running}} \leftarrow (1 - \alpha) \cdot \mu_{\text{running}} + \alpha \cdot \mu_{\text{batch}}$$
$$\sigma^2_{\text{running}} \leftarrow (1 - \alpha) \cdot \sigma^2_{\text{running}} + \alpha \cdot \sigma^2_{\text{batch}}$$
where α is the momentum parameter (typically 0.1).
```python
import numpy as np
import matplotlib.pyplot as plt

def simulate_running_statistics(true_mean, true_var, n_batches, batch_size, momentum=0.1):
    """
    Simulate how running statistics converge to true values.
    """
    # Initialize running statistics
    running_mean = 0.0
    running_var = 1.0

    # Track history
    running_mean_history = [running_mean]
    running_var_history = [running_var]
    batch_mean_history = []
    batch_var_history = []

    for batch in range(n_batches):
        # Sample a batch from the true distribution
        batch_data = np.random.normal(true_mean, np.sqrt(true_var), batch_size)

        # Compute batch statistics
        batch_mean = np.mean(batch_data)
        batch_var = np.var(batch_data)

        batch_mean_history.append(batch_mean)
        batch_var_history.append(batch_var)

        # Update running statistics with exponential moving average
        running_mean = (1 - momentum) * running_mean + momentum * batch_mean
        running_var = (1 - momentum) * running_var + momentum * batch_var

        running_mean_history.append(running_mean)
        running_var_history.append(running_var)

    return {
        'running_mean_history': running_mean_history,
        'running_var_history': running_var_history,
        'batch_mean_history': batch_mean_history,
        'batch_var_history': batch_var_history,
        'true_mean': true_mean,
        'true_var': true_var
    }

# Simulate with different momentum values
np.random.seed(42)
true_mean, true_var = 2.5, 4.0
n_batches = 100
batch_size = 32

results = {}
for momentum in [0.01, 0.1, 0.5]:
    results[momentum] = simulate_running_statistics(
        true_mean, true_var, n_batches, batch_size, momentum
    )
    final_mean = results[momentum]['running_mean_history'][-1]
    final_var = results[momentum]['running_var_history'][-1]
    print(f"Momentum {momentum}: Final mean = {final_mean:.3f} (true: {true_mean}), "
          f"Final var = {final_var:.3f} (true: {true_var})")

# Key observations:
# - Higher momentum: Faster adaptation, more noise in estimate
# - Lower momentum: Slower adaptation, smoother estimate
# - All converge to true values given enough iterations
```

Understanding the Momentum Parameter:
The momentum α controls the trade-off between:
- Adaptation speed: how quickly the running statistics track changes in the activation distribution
- Estimate stability: how much batch-to-batch noise leaks into the running statistics
The effective "memory" of the exponential moving average is approximately 1/α batches. With momentum=0.1, the running statistics remember roughly the last 10 batches.
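To make the "roughly 1/α batches" claim concrete, the short calculation below (a sketch; the `recent_weight` helper is illustrative) sums the EMA weights assigned to the most recent batches: with momentum=0.1, about 65% of the running estimate comes from the last 10 batches.

```python
# The EMA assigns weight alpha * (1 - alpha)**k to the batch seen k steps ago.
# Summing the weights of the most recent n batches shows the "effective memory".
def recent_weight(alpha, n):
    return 1 - (1 - alpha) ** n  # total weight carried by the last n batches

for alpha in [0.01, 0.1, 0.5]:
    n = int(round(1 / alpha))
    print(f"momentum={alpha}: last {n} batches carry "
          f"{recent_weight(alpha, n):.1%} of the estimate")
# momentum=0.1: last 10 batches carry 65.1% of the estimate
```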
Common Momentum Values:
| Momentum (α) | Memory (~batches) | Convergence Speed | Noise Sensitivity | Use Case |
|---|---|---|---|---|
| 0.001 | ~1000 | Very slow | Very low | Tiny batches, high noise |
| 0.01 | ~100 | Slow | Low | Conservative training |
| 0.1 | ~10 | Medium | Medium | Standard training |
| 0.2 | ~5 | Fast | High | Short training, stable data |
| 0.5 | ~2 | Very fast | Very high | Non-stationary data |
Different frameworks use different conventions! PyTorch uses momentum α where running = (1-α)·old + α·new. TensorFlow uses momentum β where running = β·old + (1-β)·new. The relationship is α = 1 - β. PyTorch momentum=0.1 equals TensorFlow momentum=0.9. Always check the documentation!
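The sketch below checks the PyTorch convention against a hand-written EMA using the TensorFlow-style decay of 0.9 (the comparison itself is illustrative, not framework code). One detail to note: PyTorch updates the running variance with the unbiased batch variance, even though normalization uses the biased one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, momentum=0.1)  # PyTorch convention: alpha = 0.1

x = torch.randn(32, 3)
bn.train()
bn(x)  # one training step updates the running statistics

# Same update written in the TensorFlow convention: running = beta*old + (1-beta)*new, beta = 0.9
beta = 0.9
manual_mean = beta * torch.zeros(3) + (1 - beta) * x.mean(dim=0)
print(torch.allclose(bn.running_mean, manual_mean, atol=1e-6))  # True

# Running variance is updated with the *unbiased* batch variance in PyTorch
manual_var = beta * torch.ones(3) + (1 - beta) * x.var(dim=0, unbiased=True)
print(torch.allclose(bn.running_var, manual_var, atol=1e-6))  # True
```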
Understanding exactly what happens during training mode is essential for debugging and optimization. Let's trace through a complete training forward pass.
Training Mode Operations:
```python
import torch
import torch.nn as nn

def analyze_training_mode():
    """
    Detailed analysis of BatchNorm behavior in training mode.
    """
    torch.manual_seed(42)

    # Create BatchNorm layer
    bn = nn.BatchNorm1d(4, momentum=0.1)
    bn.train()  # Explicitly set training mode

    print("=== Initial State ===")
    print(f"Running mean: {bn.running_mean.numpy()}")
    print(f"Running var: {bn.running_var.numpy()}")
    print(f"Gamma (weight): {bn.weight.data.numpy()}")
    print(f"Beta (bias): {bn.bias.data.numpy()}")

    # First batch
    x1 = torch.randn(8, 4) * 2 + 1  # Mean ~1, Std ~2
    print("\n=== Batch 1 ===")
    print(f"Input mean per feature: {x1.mean(dim=0).numpy()}")
    print(f"Input var per feature: {x1.var(dim=0, unbiased=False).numpy()}")

    y1 = bn(x1)
    print(f"\nOutput mean per feature: {y1.mean(dim=0).detach().numpy()}")
    print(f"Output var per feature: {y1.var(dim=0, unbiased=False).detach().numpy()}")

    print(f"\nRunning mean after batch 1: {bn.running_mean.numpy()}")
    print(f"Running var after batch 1: {bn.running_var.numpy()}")

    # Second batch with different statistics
    x2 = torch.randn(8, 4) * 0.5 - 2  # Mean ~-2, Std ~0.5
    print("\n=== Batch 2 ===")
    print(f"Input mean per feature: {x2.mean(dim=0).numpy()}")
    print(f"Input var per feature: {x2.var(dim=0, unbiased=False).numpy()}")

    y2 = bn(x2)
    print(f"\nOutput mean per feature: {y2.mean(dim=0).detach().numpy()}")
    print(f"Output var per feature: {y2.var(dim=0, unbiased=False).detach().numpy()}")

    print(f"\nRunning mean after batch 2: {bn.running_mean.numpy()}")
    print(f"Running var after batch 2: {bn.running_var.numpy()}")

    # Observation: Note how outputs are normalized to ~0 mean, ~1 var
    # but running statistics are gradually updating

    return bn

bn = analyze_training_mode()

# Demonstrate gradient flow through batch statistics
print("\n=== Gradient Flow ===")
bn.train()
x = torch.randn(8, 4, requires_grad=True)
y = bn(x)
loss = y.sum()
loss.backward()

print(f"Gradient on input x: shape={x.grad.shape}")
print(f"Gradient on gamma: {bn.weight.grad.numpy()}")
print(f"Gradient on beta: {bn.bias.grad.numpy()}")
print("Note: Gradients flow through the normalization, including batch stats")
```

The Regularization Effect:
During training, the batch statistics introduce stochasticity: each sample's normalized value depends on which other samples landed in its mini-batch, so the same input is transformed slightly differently from batch to batch. This batch-dependent noise acts as a mild regularizer.
Gradient Through Batch Statistics:
The gradients ∂L/∂x flow through the batch statistics computation, creating sample interdependence:
$$\frac{\partial L}{\partial x_i} = f\left(\frac{\partial L}{\partial y_1}, ..., \frac{\partial L}{\partial y_m}\right)$$
This means the gradient for sample i depends on the gradients of all samples in the batch—different from standard layers where samples are independent.
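A small experiment (a sketch, not part of the original example) makes this interdependence tangible: perturbing one sample changes another sample's output in training mode, but not in evaluation mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

x = torch.randn(8, 4)
x_perturbed = x.clone()
x_perturbed[7] += 5.0  # modify only the LAST sample

bn.train()
y, y_p = bn(x), bn(x_perturbed)
print("train mode, sample 0 affected:", not torch.allclose(y[0], y_p[0]))  # True

bn.eval()
y, y_p = bn(x), bn(x_perturbed)
print("eval mode, sample 0 affected:", not torch.allclose(y[0], y_p[0]))   # False
```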
When training with BatchNorm: (1) Use appropriate batch sizes (≥32 recommended), (2) Ensure batches are randomly sampled to avoid bias in statistics, (3) Use model.train() before training loops, (4) Monitor running statistics to verify convergence, (5) Consider batch composition for distributed training.
Evaluation mode uses fixed running statistics accumulated during training. This provides deterministic, consistent behavior for inference.
Evaluation Mode Operations:
```python
import torch
import torch.nn as nn

def analyze_evaluation_mode():
    """
    Detailed analysis of BatchNorm behavior in evaluation mode.
    """
    torch.manual_seed(42)

    # Create and train a BatchNorm layer
    bn = nn.BatchNorm1d(4, momentum=0.1)

    # Simulate training to build up running statistics
    bn.train()
    for _ in range(100):
        x = torch.randn(32, 4) * 2 + 1  # Training distribution
        _ = bn(x)

    print("=== After Training ===")
    print(f"Running mean: {bn.running_mean.numpy()}")
    print(f"Running var: {bn.running_var.numpy()}")

    saved_running_mean = bn.running_mean.clone()
    saved_running_var = bn.running_var.clone()

    # Switch to evaluation mode
    bn.eval()

    print("\n=== Evaluation Mode ===")

    # Process single samples
    x_single = torch.randn(1, 4) * 3 - 0.5  # Different distribution!
    y_single = bn(x_single)
    print(f"Single sample input: {x_single.numpy()}")
    print(f"Single sample output: {y_single.detach().numpy()}")

    # Verify: same sample always produces same output
    y_repeat = bn(x_single)
    print(f"\nSame input again: {y_repeat.detach().numpy()}")
    print(f"Outputs identical: {torch.allclose(y_single, y_repeat)}")

    # Verify: running stats unchanged
    print(f"\nRunning mean unchanged: {torch.allclose(bn.running_mean, saved_running_mean)}")
    print(f"Running var unchanged: {torch.allclose(bn.running_var, saved_running_var)}")

    # Demonstrate independence from batch composition
    x_batch = torch.randn(8, 4)

    # Process as batch
    y_batch = bn(x_batch)

    # Process individually
    y_individual = torch.stack([bn(x_batch[i:i+1]).squeeze() for i in range(8)])

    print(f"\nBatch vs individual processing identical: {torch.allclose(y_batch, y_individual)}")

    return bn

bn = analyze_evaluation_mode()

# Key insight: In eval mode, output for sample i depends ONLY on:
# 1. The sample x_i
# 2. The fixed running statistics (μ_running, σ²_running)
# 3. The learned parameters (γ, β)
#
# Unlike training mode, there's no dependency on other samples in the batch
```

The Evaluation Mode Formula:
In evaluation mode, BatchNorm computes:
$$y = \gamma \cdot \frac{x - \mu_{\text{running}}}{\sqrt{\sigma^2_{\text{running}} + \epsilon}} + \beta$$
This can be algebraically simplified to:
$$y = \gamma' \cdot x + \beta'$$
where:

$$\gamma' = \frac{\gamma}{\sqrt{\sigma^2_{\text{running}} + \epsilon}}, \qquad \beta' = \beta - \frac{\gamma \cdot \mu_{\text{running}}}{\sqrt{\sigma^2_{\text{running}} + \epsilon}}$$
Implication: BatchNorm becomes a linear transform during inference!
This means BatchNorm layers can be fused with adjacent linear layers for efficiency during deployment.
For production deployment, BatchNorm can be 'folded' into adjacent convolution or linear layers. This eliminates the BatchNorm layer entirely, replacing it with modified weights and biases. This reduces model size, eliminates a layer of computation, and simplifies the model graph.
```python
import torch
import torch.nn as nn

def fold_batchnorm_into_linear(linear, bn):
    """
    Fold BatchNorm parameters into a preceding Linear layer.

    Returns a new Linear layer that computes equivalent output
    without needing the BatchNorm.
    """
    # Get BatchNorm parameters
    gamma = bn.weight.data
    beta = bn.bias.data
    mean = bn.running_mean
    var = bn.running_var
    eps = bn.eps

    # Get Linear parameters
    W = linear.weight.data  # Shape: (out_features, in_features)
    b = linear.bias.data if linear.bias is not None else torch.zeros(W.shape[0])

    # Compute folded parameters
    std = torch.sqrt(var + eps)
    scale = gamma / std

    # New weights and bias
    W_folded = W * scale.unsqueeze(1)  # Scale each output row
    b_folded = scale * (b - mean) + beta

    # Create new Linear layer
    folded_linear = nn.Linear(linear.in_features, linear.out_features)
    folded_linear.weight.data = W_folded
    folded_linear.bias.data = b_folded

    return folded_linear

# Verify correctness
torch.manual_seed(42)

# Original layers
linear = nn.Linear(64, 128)
bn = nn.BatchNorm1d(128)

# Simulate training
model = nn.Sequential(linear, bn)
model.train()
for _ in range(100):
    x = torch.randn(32, 64)
    _ = model(x)

model.eval()

# Fold BatchNorm
folded = fold_batchnorm_into_linear(linear, bn)

# Compare outputs
x_test = torch.randn(16, 64)
y_original = model(x_test)
y_folded = folded(x_test)

print(f"Max difference: {(y_original - y_folded).abs().max().item():.2e}")
print("Outputs are equivalent! BatchNorm successfully folded.")
```

BatchNorm's dual-mode behavior leads to numerous potential bugs. Being aware of these common pitfalls can save hours of debugging.
Pitfall 1: Forgetting to Switch Modes
The most common bug: evaluating a model that's still in training mode.
```python
import torch
import torch.nn as nn

def demonstrate_mode_bug():
    """
    Show the consequences of incorrect mode switching.
    """
    torch.manual_seed(42)

    # Create model with BatchNorm
    model = nn.Sequential(
        nn.Linear(10, 20),
        nn.BatchNorm1d(20),
        nn.ReLU(),
        nn.Linear(20, 5)
    )

    # Simulate training
    model.train()
    for _ in range(100):
        x = torch.randn(32, 10)
        _ = model(x)

    # BUG: Forgot to call model.eval()!

    # Test with different batch sizes
    x_test = torch.randn(1, 10)  # Single sample

    # Multiple evaluations of same input in training mode
    results_training_mode = []
    for _ in range(5):
        # In training mode, adding other samples changes batch stats!
        batch = torch.cat([x_test, torch.randn(31, 10)], dim=0)
        y = model(batch)[0:1]  # Take only first sample output
        results_training_mode.append(y.detach())

    print("=== BUG: Model in training mode during evaluation ===")
    print("Same input, different batch compositions:")
    for i, r in enumerate(results_training_mode):
        print(f"  Attempt {i+1}: {r.numpy().flatten()[:3]}...")

    # Correct: switch to eval mode
    model.eval()

    results_eval_mode = []
    for _ in range(5):
        batch = torch.cat([x_test, torch.randn(31, 10)], dim=0)
        y = model(batch)[0:1]
        results_eval_mode.append(y.detach())

    print("\n=== CORRECT: Model in eval mode ===")
    print("Same input, different batch compositions:")
    for i, r in enumerate(results_eval_mode):
        print(f"  Attempt {i+1}: {r.numpy().flatten()[:3]}...")
    print("Outputs are now identical!")

demonstrate_mode_bug()
```

Pitfall 2: Corrupted Running Statistics
Running statistics can become corrupted in several ways: numerical instabilities that propagate NaN or Inf values, exploding activations that inflate the variance estimates, dead features that drive variances toward zero, unrepresentative or tiny batches late in training, and stale statistics inherited from pre-training on a different data distribution. The diagnostic below flags these conditions:
```python
import torch
import torch.nn as nn
import numpy as np

def diagnose_running_statistics(model):
    """
    Diagnostic function to analyze BatchNorm running statistics.
    """
    print("=== BatchNorm Running Statistics Diagnosis ===\n")

    for name, module in model.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            mean = module.running_mean.numpy()
            var = module.running_var.numpy()

            print(f"Layer: {name}")
            print(f"  Running mean - min: {mean.min():.4f}, max: {mean.max():.4f}, "
                  f"mean: {mean.mean():.4f}")
            print(f"  Running var - min: {var.min():.4f}, max: {var.max():.4f}, "
                  f"mean: {var.mean():.4f}")

            # Red flags
            issues = []
            if np.any(var < 1e-6):
                issues.append("⚠️ Near-zero variance detected (possible dead features)")
            if np.any(var > 1e6):
                issues.append("⚠️ Extremely high variance (possible exploding activations)")
            if np.any(np.abs(mean) > 100):
                issues.append("⚠️ Large mean values (possible activation drift)")
            if np.any(np.isnan(mean)) or np.any(np.isnan(var)):
                issues.append("🚨 NaN values detected!")
            if np.any(np.isinf(var)):
                issues.append("🚨 Inf values detected!")

            if issues:
                for issue in issues:
                    print(f"  {issue}")
            else:
                print("  ✓ Statistics look healthy")
            print()

# Usage example
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# After training, diagnose
diagnose_running_statistics(model)

# If issues found, consider:
# 1. Re-running training with larger batch size
# 2. Computing statistics from a representative data sample
# 3. Using Batch Renormalization for small batches
```

When fine-tuning pre-trained models, BatchNorm layers may have running statistics from the pre-training data. If your fine-tuning data has different statistics, using these stale values can hurt performance. Options: (1) Re-compute running stats on your data before fine-tuning, (2) Train BatchNorm layers normally (don't freeze), (3) Use a calibration pass to update running stats with frozen weights.
Deploying models with BatchNorm requires attention to several details. These best practices help ensure reliable production behavior.
Best Practice 1: Explicit Mode Setting
Always explicitly set the mode before training or evaluation, even if you think it's already correct:
```python
import torch
import torch.nn as nn

def train_epoch(model, dataloader, optimizer, criterion):
    """Training epoch with explicit mode setting."""
    model.train()  # ALWAYS set explicitly at start of training

    for batch in dataloader:
        # Training code...
        pass

def evaluate(model, dataloader, criterion):
    """Evaluation with explicit mode setting."""
    model.eval()  # ALWAYS set explicitly at start of evaluation

    with torch.no_grad():  # Also disable gradients for inference
        for batch in dataloader:
            # Evaluation code...
            pass

def inference(model, x):
    """Single-sample inference with safety checks."""
    was_training = model.training
    model.eval()

    with torch.no_grad():
        result = model(x)

    # Optionally restore original mode
    if was_training:
        model.train()

    return result

# Context manager approach (very clean)
class EvalMode:
    """Context manager for temporary eval mode."""
    def __init__(self, model):
        self.model = model
        self.was_training = model.training

    def __enter__(self):
        self.model.eval()
        return self.model

    def __exit__(self, *args):
        if self.was_training:
            self.model.train()

# Usage
model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10))
model.train()

with EvalMode(model):
    # Model is in eval mode here
    output = model(torch.randn(1, 10))

# Model is back in train mode here
```

Best Practice 2: Statistics Verification
Before deploying, verify that running statistics are reasonable. Check for:
| Check | Expected | Action if Failed |
|---|---|---|
| Running mean range | Within expected input range | Investigate training data or bugs |
| Running var > 0 | All positive values | Check for dead features or bugs |
| No NaN/Inf | All finite values | Check for numerical issues in training |
| eval() produces consistent output | Same input → same output | Ensure model.eval() is called |
| train() after training | Running stats match data | Verify final training batches are valid |
| Gamma not near zero | Most values > 0.1 | May have dead normalized features |
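A compact way to automate these checks is sketched below; the helper name `check_batchnorm_ready` and the thresholds are illustrative, not a standard API.

```python
import torch
import torch.nn as nn

def check_batchnorm_ready(model, sample_input):
    """Illustrative pre-deployment sanity checks for BatchNorm layers."""
    for name, m in model.named_modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            assert torch.isfinite(m.running_mean).all(), f"{name}: non-finite running mean"
            assert torch.isfinite(m.running_var).all(), f"{name}: non-finite running var"
            assert (m.running_var > 0).all(), f"{name}: non-positive running variance"
            if (m.weight.abs() < 0.1).float().mean() > 0.5:
                print(f"warning: {name}: many gamma values near zero")

    # eval() must give identical outputs for repeated identical inputs
    model.eval()
    with torch.no_grad():
        assert torch.allclose(model(sample_input), model(sample_input)), \
            "non-deterministic eval output"
    print("BatchNorm deployment checks passed")

model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.ReLU(), nn.Linear(20, 5))
check_batchnorm_ready(model, torch.randn(4, 10))
```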
When exporting models (ONNX, TorchScript, TensorRT), BatchNorm layers are typically exported in eval mode with fixed running statistics. Ensure your running statistics are finalized before export. Some export formats allow BatchNorm folding automatically—check your framework's documentation.
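A minimal sketch of the export flow (the file name and opset version are illustrative): finalize the running statistics, switch to eval mode, then export so the fixed statistics are baked into the graph.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.ReLU(), nn.Linear(20, 5))
# ... train the model so the running statistics converge ...

model.eval()  # export must happen in eval mode so fixed running stats are used
dummy_input = torch.randn(1, 10)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=13)
```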
Some scenarios require special handling of BatchNorm's training/evaluation behavior. Understanding these edge cases prevents subtle bugs.
Case 1: Training with eval() for Specific Layers
Sometimes you want most of the network in training mode but specific BatchNorm layers in eval mode (e.g., frozen pre-trained layers):
```python
import torch
import torch.nn as nn

def freeze_batchnorm_layers(model):
    """
    Freeze BatchNorm layers: use eval mode + no parameter updates.
    """
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()  # Use running statistics
            # Also freeze parameters
            for param in module.parameters():
                param.requires_grad = False

def unfreeze_batchnorm_layers(model):
    """
    Unfreeze BatchNorm layers.
    """
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.train()  # Use batch statistics
            for param in module.parameters():
                param.requires_grad = True

# Usage in fine-tuning
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
head = nn.Sequential(
    nn.Linear(64 * 30 * 30, 10)
)

# Freeze backbone BatchNorm when fine-tuning
freeze_batchnorm_layers(backbone)
backbone.eval()  # Keep in eval mode

# Head layers train normally
head.train()

# NOTE: After calling model.train(), individual modules reset!
# Need to re-freeze after any global mode change
```

Case 2: Recalculating Running Statistics
Sometimes running statistics become stale or corrupted. You can recalculate them from a data sample:
```python
import torch
import torch.nn as nn

def recalculate_running_stats(model, dataloader, num_batches=100):
    """
    Recalculate BatchNorm running statistics from data.

    This is useful when:
    1. Running stats were corrupted during training
    2. Fine-tuning on new data with different statistics
    3. Loading a model with mismatched running stats
    """
    # Reset running stats
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.reset_running_stats()
            # Set momentum to None for cumulative average instead of EMA
            module.momentum = None

    # Set to training mode to accumulate statistics
    model.train()

    with torch.no_grad():
        for i, (x, _) in enumerate(dataloader):
            if i >= num_batches:
                break
            # Forward pass accumulates statistics
            _ = model(x)

            if (i + 1) % 10 == 0:
                print(f"Processed {i + 1} batches")

    # Restore momentum for future training
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.momentum = 0.1  # Default value

    # Set back to eval mode
    model.eval()
    print("Running statistics recalculated.")

# Alternative: Use exponential moving average with specified momentum
def recalculate_with_ema(model, dataloader, momentum=0.1, num_batches=200):
    """Recalculate using EMA (more stable for streaming data)."""
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.reset_running_stats()
            module.momentum = momentum

    model.train()
    with torch.no_grad():
        for i, (x, _) in enumerate(dataloader):
            if i >= num_batches:
                break
            _ = model(x)

    model.eval()
```

If your test data comes from a different distribution than training data (domain shift), the running statistics from training may be suboptimal. Test-time adaptation techniques update BatchNorm statistics using test batches. This requires batch-mode inference but can significantly improve performance under distribution shift.
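One simple form of test-time adaptation is sketched below, under the assumption that test data arrives in batches; the helper `adapt_bn_statistics` is illustrative. Only the BatchNorm running buffers are refreshed on the test distribution, while every weight stays frozen.

```python
import torch
import torch.nn as nn

def adapt_bn_statistics(model, test_batches, momentum=0.1):
    """Sketch: refresh BatchNorm running statistics on test-distribution batches."""
    model.eval()
    # Put only the BatchNorm layers back into training mode so their statistics update
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()
            m.momentum = momentum

    with torch.no_grad():  # no weights are updated, only the running buffers
        for x in test_batches:
            model(x)

    model.eval()  # back to fully deterministic inference
    return model

model = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.ReLU(), nn.Linear(20, 5))
test_batches = [torch.randn(32, 10) * 2 + 3 for _ in range(10)]  # shifted "test" distribution
adapt_bn_statistics(model, test_batches)
```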
We've comprehensively covered BatchNorm's dual-mode behavior. Here are the essential takeaways:
- Training mode normalizes with batch statistics and updates running statistics; evaluation mode normalizes with fixed running statistics.
- Running statistics are exponential moving averages controlled by the momentum parameter, with an effective memory of roughly 1/α batches.
- Always set model.train() and model.eval() explicitly; forgetting to switch modes is the most common BatchNorm bug.
- In evaluation mode BatchNorm is a linear transform and can be folded into adjacent layers for deployment.
- Before exporting or deploying, verify that running statistics are finite, positive, and in a sensible range, and recalculate them if they are stale or corrupted.
What's Next:
While Batch Normalization revolutionized deep learning, it has limitations—particularly for sequence models and small batches. The next page introduces Layer Normalization, which normalizes across features rather than across the batch, enabling normalization in architectures where BatchNorm struggles.
You now have a deep understanding of BatchNorm's dual-mode behavior, common pitfalls, and best practices for production deployment. This knowledge will help you avoid subtle bugs and build reliable deep learning systems.