Stochastic variational inference is a powerful framework, but translating the elegant mathematics into robust, production-ready code requires navigating a landscape of practical challenges that textbooks rarely discuss.
This page distills years of collective experience into actionable guidance for practitioners. We address the questions that arise when implementing SVI: How do I initialize? Why is my ELBO oscillating? When should I use SVI versus alternatives? How do I debug a model that isn't learning?
The goal is to transform you from someone who understands SVI to someone who can deploy it confidently on real problems.
By the end of this page, you will know how to initialize SVI for stable convergence, diagnose and fix common training pathologies, decide when SVI is (and isn't) the right approach, and implement production-grade variational inference systems.
Proper initialization is critical for successful SVI. Poor initialization can lead to slow convergence, convergence to suboptimal local optima, or outright divergence.
General principles:
Start near the prior: Initialize variational parameters so \(q(z; \phi_0) \approx p(z)\). This ensures valid probability distributions and often stable initial gradients.
Use small variances carefully: Very small initial variances can cause gradient explosion in the likelihood term; very large variances dilute the signal.
Leverage problem structure: Use domain knowledge to initialize near plausible posterior modes.
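To make the first two principles concrete, here is a minimal sketch (assuming a standard normal prior, a mean-field Gaussian variational family, and a hypothetical `latent_dim`) that initializes \(q\) at the prior and checks the initial KL:

```python
import torch
import torch.nn as nn

latent_dim = 32  # hypothetical latent dimensionality

# Mean-field Gaussian q(z) = N(mu, diag(sigma^2)), initialized at the prior N(0, I)
mu = nn.Parameter(torch.zeros(latent_dim))         # mu = 0 matches the prior mean
log_sigma = nn.Parameter(torch.zeros(latent_dim))  # sigma = 1 matches the prior variance
# (use a small negative log_sigma instead if the likelihood term explodes early)

q = torch.distributions.Normal(mu, log_sigma.exp())
prior = torch.distributions.Normal(torch.zeros(latent_dim), torch.ones(latent_dim))
print("Initial KL to prior:", torch.distributions.kl_divergence(q, prior).sum().item())  # ≈ 0
```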
| Model | Variational Family | Recommended Initialization |
|---|---|---|
| Gaussian posterior | Diagonal Gaussian | μ = 0, log σ = 0 (unit variance) |
| Topic models (LDA) | Dirichlet | α = 1 + small noise (near uniform) |
| VAE encoder | Gaussian | Xavier/He init for network; μ→0, σ→1 |
| Bayesian neural net | Gaussian weights | μ = pretrained or Xavier; σ = small (0.01) |
| Mixture models | Categorical + Gaussian | K-means clustering for means; uniform mixing |
Initialization for VAEs:
Variational Autoencoders require careful initialization of both encoder and decoder networks:
```python
import torch.nn as nn

# Encoder initialization
# Mean projection: Xavier weights and zero bias → z_mean ≈ 0 initially
nn.init.xavier_uniform_(encoder_mean.weight)
nn.init.zeros_(encoder_mean.bias)

# Log-variance projection: negative bias → small initial variance
nn.init.xavier_uniform_(encoder_logvar.weight)
nn.init.constant_(encoder_logvar.bias, -2)  # var = exp(-2) ≈ 0.14, std ≈ 0.37: conservative

# Decoder: standard initialization for the linear layers
for layer in decoder:
    if isinstance(layer, nn.Linear):
        nn.init.xavier_uniform_(layer.weight)
        nn.init.zeros_(layer.bias)
```
Why this works: with z_mean ≈ 0 and a modest initial standard deviation, the approximate posterior starts near the standard normal prior, so the KL term begins at a moderate, well-behaved value and the early reconstruction gradients are not swamped by extreme samples.

Warm-starting from simpler models:

For complex models, initialize from solutions to simpler problems:
• For hierarchical VAEs: train bottom-up, one level at a time
• For Bayesian neural networks: initialize from maximum likelihood (point-estimate) weights
• For deep topic models: initialize from shallow LDA
This 'curriculum' of increasing complexity dramatically improves convergence.
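For instance, a Bayesian neural network can be warm-started by copying maximum-likelihood weights into the variational means and giving every weight a small initial standard deviation. The sketch below is one way to do this under a mean-field Gaussian posterior with a softplus parameterization; the `pretrained` module and `init_std` value are assumptions:

```python
import math
import torch
import torch.nn as nn

def init_from_pretrained(pretrained: nn.Module, init_std: float = 0.01):
    """Build mean-field variational parameters (mu, rho), warm-started from a
    maximum-likelihood point estimate, where sigma = softplus(rho)."""
    rho_init = math.log(math.expm1(init_std))  # inverse softplus, so softplus(rho) == init_std
    variational_params = {}
    for name, w in pretrained.named_parameters():
        mu = nn.Parameter(w.detach().clone())             # mean at the point estimate
        rho = nn.Parameter(torch.full_like(w, rho_init))  # small initial standard deviation
        variational_params[name] = (mu, rho)
    return variational_params
```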
When SVI fails to converge or produces poor results, systematic debugging is essential. Here we catalog common failure modes and their remedies.
Problem 1: ELBO is NaN or -Inf
Causes:
• Taking the log of zero (or numerically zero) probabilities
• Overflow in exp/softmax computations
• Variances collapsing to zero, causing division by zero or log(0)

Solutions:
```python
import torch

# Add numerical stability to log computations
log_prob = torch.log(prob + 1e-10)

# Use the log-sum-exp trick for a numerically stable softmax
def stable_softmax(logits):
    max_logits = logits.max(dim=-1, keepdim=True).values
    exp_logits = torch.exp(logits - max_logits)
    return exp_logits / exp_logits.sum(dim=-1, keepdim=True)

# Clamp variance to prevent division by zero
var = torch.clamp(var, min=1e-6)
```
Problem 3: Posterior Collapse (VAEs)
Symptom: KL divergence → 0, decoder ignores latent code, all samples look similar.
Diagnosis:
```python
# Check per-dimension KL between q(z|x) = N(mu, var) and the N(0, I) prior
kl_per_dim = 0.5 * (mu**2 + var - 1 - torch.log(var))
print("KL per dimension:", kl_per_dim.mean(dim=0))
# If all dimensions are near zero → the posterior has collapsed
```
Solutions:

KL annealing (warm-up): ramp the KL weight from 0 to 1 over the first epochs so the decoder learns to use the latent code before the KL term is fully enforced.

```python
# Anneal the KL weight linearly over warmup_epochs
beta = min(1.0, epoch / warmup_epochs)
loss = -reconstruction + beta * kl
```

Free bits: give each latent dimension a small KL budget that is not penalized, so dimensions are not pushed all the way back to the prior.

```python
# Clamp per-dimension KL at a 'free bits' floor before summing
free_bits = 0.1
kl_per_dim = torch.clamp(kl_per_dim, min=free_bits)
kl = kl_per_dim.sum(dim=-1)
```
Weaker decoder: Use simpler decoder (fewer layers, less capacity) so latent code is necessary.
Input dropout: Drop out input features to force reliance on the latent code (a minimal sketch follows).
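A minimal sketch of input dropout; the corruption probability and where it is applied (typically the features the decoder conditions on directly, e.g. autoregressive inputs) are assumptions to be tuned:

```python
import torch

def drop_inputs(x, p=0.25, training=True):
    """Randomly zero out input features during training so reconstruction
    cannot bypass the latent code. No rescaling is applied."""
    if not training:
        return x
    keep = (torch.rand_like(x) > p).to(x.dtype)  # Bernoulli keep-mask
    return x * keep
```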
Problem 4: Exploding Gradients

Symptom: NaN gradients or the ELBO suddenly drops to -Inf.
Solutions:
• Gradient clipping: torch.nn.utils.clip_grad_norm_(params, max_norm=5.0)
• Reduce learning rate by 10×
• Check for numerical instability in log/exp operations
• Use float64 for debugging, then switch back to float32
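The last two bullets can be combined into a short debugging harness. This sketch assumes a model exposing a `compute_elbo(x)` method (as in the trainer later on this page); PyTorch's anomaly detection and float64 are enabled only temporarily because both are slow:

```python
import torch

def debug_step(model, x):
    """Run one step in float64 with anomaly detection to localize NaN/Inf sources."""
    model = model.double()
    x = x.double()
    with torch.autograd.set_detect_anomaly(True):
        elbo, _ = model.compute_elbo(x)  # hypothetical model API
        loss = -elbo.mean()
        loss.backward()                  # anomaly mode reports the op that produced NaN
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"Non-finite gradient in parameter: {name}")
```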
SVI has several hyperparameters that significantly affect performance. Here we provide guidance for setting them without exhaustive grid search.
Batch size:
The optimal batch size depends on computational resources and data characteristics.

Rule of thumb: Start with 128, increase the batch size if training is unstable (gradient noise too high), decrease it if memory-limited.
| Hyperparameter | Starting Value | Tuning Strategy | Signs of Mistuning |
|---|---|---|---|
| Learning rate | 1e-3 (Adam), 0.01 (SGD) | LR range test | Oscillation (too high), no progress (too low) |
| Batch size | 128 | Double until memory limit | Noisy training (too small) |
| MC samples | 1 | Increase if variance high | Slow convergence with high variance |
| Latent dimension | Model-dependent | Cross-validation on held-out likelihood | Underfitting (too small), overfitting (too large) |
| KL weight (β) | 1.0 | Anneal from 0 | Posterior collapse (β too high early) |
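The "LR range test" mentioned in the table can be done in a few lines: sweep the learning rate geometrically over a short run, record the loss, and pick a value somewhat below the point where the loss starts to diverge. A sketch assuming the `compute_elbo` model interface used elsewhere on this page:

```python
import torch

def lr_range_test(model, loader, lr_min=1e-6, lr_max=1.0, num_steps=100):
    """Sweep the learning rate geometrically and record (lr, loss) pairs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / num_steps)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    history, data_iter = [], iter(loader)
    for _ in range(num_steps):
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            batch = next(data_iter)
        optimizer.zero_grad()
        elbo, _ = model.compute_elbo(batch[0])  # hypothetical model API
        loss = -elbo.mean()
        loss.backward()
        optimizer.step()
        history.append((optimizer.param_groups[0]['lr'], loss.item()))
        scheduler.step()
    return history  # plot loss vs. lr and choose an lr below the divergence point
```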
Monte Carlo samples for gradient estimation:
The number of samples \(S\) used to estimate \(\mathbb{E}_{q}[\cdot]\) trades variance against computation.
Pro tip: Use S = 1 during training (for speed) but S = 100+ for final evaluation (for accurate ELBO estimates).
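A sketch of the S-sample estimator, reusing the hypothetical model interface from the IWAE example later on this page; averaging S reparameterized samples reduces gradient variance at S times the cost:

```python
import torch

def elbo_estimate(model, x, S=1):
    """Monte Carlo ELBO estimate averaged over S reparameterized samples.
    S = 1 is usually enough for training; use larger S only for evaluation."""
    estimates = []
    for _ in range(S):
        z, log_q = model.sample_and_log_prob(x)  # rsample + log q(z|x), hypothetical API
        log_p = model.joint_log_prob(x, z)       # log p(x, z), hypothetical API
        estimates.append(log_p - log_q)
    return torch.stack(estimates, dim=0).mean(dim=0)
```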
Number of latent dimensions:
For VAEs and similar models, treat the latent dimension as a model-dependent hyperparameter and select it by cross-validation on held-out likelihood (see the table above).
Automatic approach: Use automatic relevance determination (ARD) priors that prune unused dimensions.
SVI is a powerful tool, but it's not always the right choice. Understanding its strengths and limitations helps select the best inference method for each problem.
Comparison with alternatives:
| Method | Scalability | Posterior Quality | Flexibility | Complexity |
|---|---|---|---|---|
| SVI | Excellent | Approximate | High | Medium |
| MCMC | Poor | Asymptotically exact | High | Low |
| Expectation Propagation | Moderate | Often better than VI | Low | High |
| Laplace Approximation | Excellent | Gaussian only | Low | Low |
| Maximum Likelihood | Excellent | Point estimate | High | Low |
Decision flowchart intuition: if the dataset is small and you need asymptotically exact posteriors, prefer MCMC; if the dataset is large or the model contains neural networks, SVI is usually the practical choice; if a Gaussian approximation around a single mode suffices, the Laplace approximation is cheapest; and if you only need a point estimate, maximum likelihood is simplest.
Deploying SVI in production requires attention to reliability, monitoring, and operational concerns beyond pure algorithmic performance.
```python
import torch
import numpy as np
from typing import Dict, Optional, Callable
from dataclasses import dataclass
import logging
import json
from pathlib import Path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


@dataclass
class SVIConfig:
    """Configuration for production SVI training."""
    batch_size: int = 128
    learning_rate: float = 1e-3
    max_epochs: int = 100
    patience: int = 10
    min_delta: float = 1e-4
    gradient_clip: float = 5.0
    checkpoint_every: int = 10
    validate_every: int = 1
    seed: int = 42

    def to_dict(self) -> dict:
        return {k: getattr(self, k) for k in self.__dataclass_fields__}


class ProductionSVITrainer:
    """
    Production-grade SVI trainer with:
    - Early stopping
    - Checkpointing
    - Logging and monitoring
    - Reproducibility
    - Error handling
    """

    def __init__(
        self,
        model: torch.nn.Module,
        train_loader: torch.utils.data.DataLoader,
        val_loader: torch.utils.data.DataLoader,
        config: SVIConfig,
        output_dir: Path,
        device: str = "cuda"
    ):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.config = config
        self.output_dir = Path(output_dir)
        self.device = device

        # Create output directory
        self.output_dir.mkdir(parents=True, exist_ok=True)

        # Save config
        with open(self.output_dir / "config.json", "w") as f:
            json.dump(config.to_dict(), f, indent=2)

        # Set seeds for reproducibility
        torch.manual_seed(config.seed)
        np.random.seed(config.seed)

        # Optimizer
        self.optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=config.learning_rate,
            weight_decay=1e-5
        )

        # Learning rate scheduler
        self.scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            self.optimizer,
            mode='max',  # Maximizing ELBO
            factor=0.5,
            patience=5,
            min_lr=1e-6
        )

        # Tracking
        self.best_val_elbo = float('-inf')
        self.epochs_without_improvement = 0
        self.history = {
            'train_elbo': [],
            'val_elbo': [],
            'learning_rate': [],
            'gradient_norm': []
        }

    def train_epoch(self) -> Dict[str, float]:
        """Train for one epoch."""
        self.model.train()
        total_elbo = 0.0
        total_grad_norm = 0.0
        num_batches = 0

        for batch in self.train_loader:
            x = batch[0].to(self.device)

            self.optimizer.zero_grad()

            # Forward pass
            try:
                elbo, metrics = self.model.compute_elbo(x)
            except RuntimeError as e:
                logger.error(f"Forward pass failed: {e}")
                raise

            # Backward pass
            loss = -elbo.mean()  # Minimize negative ELBO
            loss.backward()

            # Gradient clipping
            grad_norm = torch.nn.utils.clip_grad_norm_(
                self.model.parameters(),
                self.config.gradient_clip
            )

            # Check for NaN gradients
            if torch.isnan(grad_norm):
                logger.warning("NaN gradient detected, skipping batch")
                self.optimizer.zero_grad()
                continue

            self.optimizer.step()

            total_elbo += elbo.mean().item()
            total_grad_norm += grad_norm.item()
            num_batches += 1

        return {
            'train_elbo': total_elbo / num_batches,
            'gradient_norm': total_grad_norm / num_batches
        }

    @torch.no_grad()
    def validate(self) -> float:
        """Compute validation ELBO."""
        self.model.eval()
        total_elbo = 0.0
        num_batches = 0

        for batch in self.val_loader:
            x = batch[0].to(self.device)
            elbo, _ = self.model.compute_elbo(x)
            total_elbo += elbo.mean().item()
            num_batches += 1

        return total_elbo / num_batches

    def save_checkpoint(self, epoch: int, is_best: bool = False):
        """Save model checkpoint."""
        checkpoint = {
            'epoch': epoch,
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'scheduler_state_dict': self.scheduler.state_dict(),
            'best_val_elbo': self.best_val_elbo,
            'history': self.history,
            'config': self.config.to_dict()
        }

        path = self.output_dir / f"checkpoint_epoch_{epoch}.pt"
        torch.save(checkpoint, path)

        if is_best:
            best_path = self.output_dir / "best_model.pt"
            torch.save(checkpoint, best_path)
            logger.info(f"Saved best model with val_elbo={self.best_val_elbo:.4f}")

    def load_checkpoint(self, path: Path):
        """Load from checkpoint."""
        checkpoint = torch.load(path, map_location=self.device)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        self.scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
        self.best_val_elbo = checkpoint['best_val_elbo']
        self.history = checkpoint['history']
        return checkpoint['epoch']

    def train(self) -> Dict[str, float]:
        """Full training loop with early stopping."""
        logger.info(f"Starting training for up to {self.config.max_epochs} epochs")

        for epoch in range(self.config.max_epochs):
            # Training
            train_metrics = self.train_epoch()
            self.history['train_elbo'].append(train_metrics['train_elbo'])
            self.history['gradient_norm'].append(train_metrics['gradient_norm'])
            self.history['learning_rate'].append(
                self.optimizer.param_groups[0]['lr']
            )

            # Validation
            if epoch % self.config.validate_every == 0:
                val_elbo = self.validate()
                self.history['val_elbo'].append(val_elbo)

                # Learning rate scheduling
                self.scheduler.step(val_elbo)

                # Check for improvement
                if val_elbo > self.best_val_elbo + self.config.min_delta:
                    self.best_val_elbo = val_elbo
                    self.epochs_without_improvement = 0
                    self.save_checkpoint(epoch, is_best=True)
                else:
                    self.epochs_without_improvement += 1

                logger.info(
                    f"Epoch {epoch}: train_elbo={train_metrics['train_elbo']:.4f}, "
                    f"val_elbo={val_elbo:.4f}, lr={self.optimizer.param_groups[0]['lr']:.2e}"
                )

            # Periodic checkpointing
            if epoch % self.config.checkpoint_every == 0:
                self.save_checkpoint(epoch)

            # Early stopping
            if self.epochs_without_improvement >= self.config.patience:
                logger.info(f"Early stopping at epoch {epoch}")
                break

        # Save final history
        with open(self.output_dir / "training_history.json", "w") as f:
            json.dump(self.history, f, indent=2)

        return {
            'best_val_elbo': self.best_val_elbo,
            'final_epoch': epoch,
            'final_train_elbo': self.history['train_elbo'][-1]
        }
```

Evaluating variational inference requires care: the ELBO is a lower bound on the log-likelihood, not the log-likelihood itself. Several metrics and techniques provide better insight into model quality.
Importance-weighted ELBO:
The IWAE bound uses \(K\) samples to provide a tighter lower bound:
$$\log p(x) \geq \mathcal{L}_K = \mathbb{E}_{z_1, \ldots, z_K \sim q}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k)}{q(z_k)}\right]$$
As \(K \to \infty\), \(\mathcal{L}_K \to \log p(x)\). For evaluation:
```python
import numpy as np
import torch

def compute_iwae(model, x, K=100):
    """Compute the importance-weighted ELBO (IWAE bound) with K samples."""
    log_weights = []
    for _ in range(K):
        z, log_q = model.sample_and_log_prob(x)
        log_p = model.joint_log_prob(x, z)
        log_weights.append(log_p - log_q)
    # Log-sum-exp for numerical stability
    log_weights = torch.stack(log_weights, dim=0)
    iwae = torch.logsumexp(log_weights, dim=0) - np.log(K)
    return iwae.mean()
```
Model selection:
For selecting among models (different architectures, hyperparameters), compare IWAE or ELBO estimates computed on held-out validation data.
Important: Don't use training ELBO for model selection—it rewards overfitting.
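In practice this reduces to computing the importance-weighted bound on held-out data for every candidate and keeping the best one. A minimal sketch reusing `compute_iwae` from above (the candidate list and loader are assumptions):

```python
import torch

@torch.no_grad()
def validation_iwae(model, val_loader, K=100):
    """Average IWAE bound over a held-out set (higher is better)."""
    scores = [compute_iwae(model, batch[0], K=K).item() for batch in val_loader]
    return sum(scores) / len(scores)

# best_model = max(candidate_models, key=lambda m: validation_iwae(m, val_loader))
```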
Even experienced practitioners encounter pitfalls when implementing SVI. Here we catalog the most common mistakes and their solutions.
| Pitfall | Symptom | Solution |
|---|---|---|
| Forgetting the N/M scaling | ELBO is wrong by factor of N/batch_size | Always scale likelihood term by N/M in mini-batch ELBO |
| Using non-reparameterized gradients | Very high variance, slow convergence | Use rsample() not sample() for continuous latents |
| KL divergence computed incorrectly | Negative KL, training instability | Use library functions; verify on known distributions |
| Variance parameterization issues | NaN or negative variance | Parameterize as log(σ²) or use softplus(·) |
| Not using validation set | Overfitting undetected | Always hold out data for monitoring |
| Ignoring prior mismatch | Poor posterior approximation | Ensure prior matches problem structure |
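To guard against the "KL divergence computed incorrectly" pitfall, compare a hand-written KL against the library implementation on distributions where the answer is known in closed form. A quick sanity check for the diagonal-Gaussian-to-standard-normal case:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.tensor([0.5, -1.0])
log_var = torch.tensor([0.2, -0.3])
var = log_var.exp()

# Hand-written KL[ N(mu, var) || N(0, 1) ] per dimension
kl_manual = 0.5 * (mu**2 + var - 1.0 - log_var)

# Library reference
kl_library = kl_divergence(Normal(mu, var.sqrt()), Normal(torch.zeros(2), torch.ones(2)))

assert torch.allclose(kl_manual, kl_library, atol=1e-6), "KL implementation is wrong"
```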
Pitfall deep-dive: The N/M scaling factor
The most common bug in SVI implementations is incorrect scaling. The stochastic ELBO is:
$$\hat{\mathcal{L}} = \underbrace{-\text{KL}[q \,\|\, p]}_{\text{Not scaled}} + \underbrace{\frac{N}{M} \sum_{j \in \text{batch}} \log p(x_j \mid z)}_{\text{Scaled by } N/M}$$
Wrong (common mistake):
```python
# WRONG: treats the mini-batch as if it were the entire dataset
loss = kl_divergence + reconstruction_loss.mean()
```
Correct:
```python
# CORRECT: scales the reconstruction term up to the full dataset size
N = len(full_dataset)
M = batch_size
loss = kl_divergence + (N / M) * reconstruction_loss.sum()

# Or equivalently, as a per-datapoint loss:
loss = (kl_divergence / N) + reconstruction_loss.mean()
```
The scaling matters because the stochastic ELBO must be an unbiased estimate of the full-dataset ELBO: the KL term for global latent variables appears once regardless of batch size, while the likelihood sum is estimated from only M of the N datapoints and must be scaled up by N/M. Omitting the factor effectively over-weights the prior and over-regularizes the posterior toward it.
In PyTorch, sample() and rsample() are different!
• sample(): Non-differentiable sampling (blocks gradients)
• rsample(): Reparameterized sampling (gradients flow through)
Always use z = dist.rsample() for variational inference with continuous latents. Using sample() will silently produce zero gradients for the encoder.
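A quick demonstration of the difference; the gradient with respect to the variational mean only exists with rsample():

```python
import torch
from torch.distributions import Normal

mu = torch.zeros(3, requires_grad=True)
dist = Normal(mu, torch.ones(3))

z = dist.sample()
print(z.requires_grad)  # False: sample() detaches, so encoder gradients are silently zero

z = dist.rsample()      # reparameterized: z = mu + sigma * eps
z.sum().backward()
print(mu.grad)          # tensor([1., 1., 1.]): gradients flow through the sample
```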
This page has equipped you with the practical knowledge to implement, debug, and deploy stochastic variational inference in real-world applications.
Module complete:
You have now completed the module on Stochastic Variational Inference. You understand how the ELBO is constructed and optimized with stochastic, mini-batch gradients, how to initialize and tune a model, how to diagnose the common failure modes, how to evaluate the resulting approximation, and how to deploy SVI in production.
With this knowledge, you are prepared to apply SVI to real problems—from training VAEs on million-image datasets to fitting Bayesian neural networks for uncertainty-aware predictions to scaling topic models across document corpora.
Congratulations! You have mastered Stochastic Variational Inference. You now possess both the theoretical understanding and practical skills to implement scalable Bayesian inference systems. The techniques in this module form the foundation of modern probabilistic deep learning, from variational autoencoders to Bayesian neural networks to large-scale generative models.