Stochastic variational inference works remarkably well in practice, but why does it work? When can we guarantee convergence? How fast should we expect progress? These questions are not merely academic—they inform crucial practical decisions about learning rate schedules, stopping criteria, and hyperparameter selection.
The theoretical analysis of SVI draws on decades of work in stochastic optimization, adapting classical results to the probabilistic inference setting. Understanding this theory provides principled guidance for choosing learning rate schedules, criteria for deciding when to stop, and realistic expectations for how much computation a given accuracy requires.
This page develops the convergence theory for stochastic variational inference, from classical Robbins-Monro conditions to modern non-convex convergence guarantees.
By the end of this page, you will understand the mathematical conditions that guarantee SVI convergence, convergence rates for convex and non-convex ELBOs, the role of learning rate schedules in balancing bias and variance, and practical diagnostics for monitoring convergence.
Stochastic variational inference is an instance of stochastic gradient ascent (or descent, depending on sign conventions). The foundational theory dates to Robbins and Monro (1951), who established conditions for convergence of stochastic approximation algorithms.
Recall the SVI parameter update:
$$\phi_{t+1} = \phi_t + \rho_t \hat{g}_t$$
where \(\phi_t\) denotes the variational parameters at iteration \(t\), \(\rho_t > 0\) is the learning rate (step size), and \(\hat{g}_t\) is a stochastic (typically mini-batch) estimate of the ELBO gradient \(\nabla_\phi \mathcal{L}(\phi_t)\).
The fundamental challenge:
Unlike deterministic gradient descent, stochastic updates have noise that doesn't vanish. Each gradient estimate has variance \(\sigma^2 > 0\). If we use constant step size \(\rho\), updates will oscillate around the optimum rather than converging to it.
Decreasing step sizes are necessary for convergence but must decrease slowly enough that we still make progress.
For stochastic gradient methods to converge to a stationary point, the learning rate sequence \(\{\rho_t\}\) must satisfy the Robbins-Monro conditions:

$$\sum_{t=1}^{\infty} \rho_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty$$
Common choices: \(\rho_t = c/(t+\tau)\) or \(\rho_t = c/t^\alpha\) for \(\alpha \in (0.5, 1]\).
Intuition for Robbins-Monro:
Consider what happens with different learning rate behaviors:
Too fast decay (e.g., \(\rho_t = 1/t^2\)): the step sizes are summable, so the total distance the iterates can travel is finite. Starting far from the optimum, the algorithm stalls before reaching it; the first Robbins-Monro condition fails.
Too slow decay (e.g., \(\rho_t = 1\), constant): the gradient noise is never averaged away. The iterates quickly reach a neighborhood of the optimum but then oscillate within it indefinitely, with a radius that grows with \(\rho\) and the gradient variance; the second Robbins-Monro condition fails.
Just right (e.g., \(\rho_t = 1/t\)): the steps sum to infinity, so the iterates can travel arbitrarily far from a poor initialization, while the summable squared steps ensure the accumulated noise stays bounded and the iterates settle at the optimum.
The \(1/t\) learning rate:
The canonical choice \(\rho_t = c/(t + \tau)\) satisfies both conditions. The offset \(\tau\) controls initial step size: larger \(\tau\) means smaller initial steps (more conservative start).
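To make this concrete, the following minimal sketch (a made-up one-dimensional quadratic objective with Gaussian gradient noise, not from the original text) compares a constant step size, an overly aggressive \(1/t^2\)-style decay, and a Robbins-Monro-compliant \(1/t\)-style schedule:

```python
import numpy as np

def run_sga(schedule, steps=5000, noise_std=1.0, seed=0):
    """Stochastic gradient ascent on the toy objective f(x) = -0.5 * x^2.

    The true gradient is -x; Gaussian noise mimics mini-batch gradient
    estimates. The optimum is x* = 0.
    """
    rng = np.random.default_rng(seed)
    x = 5.0  # deliberately poor initialization
    for t in range(1, steps + 1):
        noisy_grad = -x + noise_std * rng.standard_normal()
        x = x + schedule(t) * noisy_grad
    return x

schedules = {
    "constant  rho = 0.1      ": lambda t: 0.1,
    "too fast  rho = 1/(t+1)^2": lambda t: 1.0 / (t + 1) ** 2,
    "1/t decay rho = 1/(t+1)  ": lambda t: 1.0 / (t + 1),
}

for name, sched in schedules.items():
    finals = [run_sga(sched, seed=s) for s in range(20)]
    print(f"{name}: mean |x_T| = {np.mean(np.abs(finals)):.4f}")
```

With these particular constants, the over-aggressive decay typically stalls well away from the optimum, the constant rate settles into a noise ball around it, and the \(1/t\)-style schedule gets closest.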
When the ELBO is a concave function of the variational parameters (equivalent to convex minimization of the negative ELBO), strong convergence guarantees exist.
Convexity arises mainly in special cases: for conditionally conjugate exponential family models with mean-field variational families, the ELBO is concave in the natural parameters of each factor taken individually (though generally not jointly in all factors at once).
Key convex convergence theorem:
Suppose \(\mathcal{L}(\phi)\) is concave, \(L\)-smooth (gradient Lipschitz), and the stochastic gradient has bounded variance \(\mathbb{E}[|\hat{g}_t - \nabla\mathcal{L}(\phi_t)|^2] \leq \sigma^2\). Then with learning rate \(\rho_t = c/\sqrt{t}\):
$$\mathcal{L}(\phi^*) - \mathbb{E}[\mathcal{L}(\bar{\phi}_T)] \leq O\left(\frac{1}{\sqrt{T}}\right)$$
where \(\bar{\phi}_T = \frac{1}{T}\sum_{t=1}^T \phi_t\) is the average iterate.
Interpretation: To achieve \(\epsilon\)-suboptimality, we need \(O(1/\epsilon^2)\) iterations.
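Note that the bound applies to the averaged iterate rather than the final one. A minimal sketch of this Polyak-Ruppert averaging, assuming an `elbo_gradient` function you supply that returns an unbiased stochastic gradient:

```python
import numpy as np

def sga_with_averaging(elbo_gradient, phi0, steps, c=1.0):
    """Stochastic gradient ascent returning both the last and the averaged iterate.

    elbo_gradient(phi) should return a stochastic estimate of grad ELBO(phi).
    Uses the rho_t = c / sqrt(t) schedule from the convex convergence theorem.
    """
    phi = np.asarray(phi0, dtype=float).copy()
    phi_avg = np.zeros_like(phi)
    for t in range(1, steps + 1):
        phi = phi + (c / np.sqrt(t)) * elbo_gradient(phi)
        # Running mean of all iterates seen so far: phi_avg = (1/t) * sum_{s<=t} phi_s
        phi_avg += (phi - phi_avg) / t
    return phi, phi_avg
```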
| Setting | Rate | Iterations for ε-optimal | Notes |
|---|---|---|---|
| Convex, stochastic gradient | O(1/√T) | O(1/ε²) | Standard SVI rate |
| Strongly convex, stochastic | O(1/T) | O(1/ε) | Faster with strong convexity |
| Convex, batch gradient | O(1/T) | O(1/ε) | Deterministic GD |
| Strongly convex, batch | O(exp(-cT)) | O(log(1/ε)) | Linear convergence |
Strong convexity and faster rates:
If the ELBO is additionally \(\mu\)-strongly concave:
$$\mathcal{L}(\phi) \leq \mathcal{L}(\phi') + \nabla\mathcal{L}(\phi')^T(\phi - \phi') - \frac{\mu}{2}|\phi - \phi'|^2$$
then convergence accelerates to \(O(1/T)\) with appropriate learning rate \(\rho_t = 2/(\mu(t+1))\).
Practical implication:
Strong convexity arises from regularization. Adding an \(L_2\) penalty on variational parameters (or equivalently, a prior on the variational posterior) induces strong convexity and accelerates convergence:
$$\mathcal{L}_{\text{reg}}(\phi) = \mathcal{L}(\phi) - \frac{\lambda}{2}|\phi|^2$$
This trades bias (the regularized optimum differs from the true optimum) for faster convergence.
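A minimal sketch of the regularized gradient and the matching strongly-concave learning rate, assuming an unbiased `elbo_grad` estimator you supply; the penalty weight `lam` is an illustrative value:

```python
import numpy as np

def regularized_elbo_grad(elbo_grad, phi, lam=1e-2):
    """Gradient of L_reg(phi) = L(phi) - (lam/2) * ||phi||^2.

    The quadratic penalty makes the objective lam-strongly concave
    (when L itself is concave), at the cost of biasing the optimum.
    """
    return elbo_grad(phi) - lam * phi

def strongly_concave_lr(t, lam=1e-2):
    """The rho_t = 2 / (mu * (t + 1)) schedule, with mu = lam from the penalty."""
    return 2.0 / (lam * (t + 1))
```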
Many practical variational inference problems involve non-convex ELBOs: non-conjugate models, mixture models (with their label-switching symmetries), Bayesian neural networks, and deep generative models such as VAEs all yield objectives with multiple local optima.
For non-convex objectives, we cannot guarantee convergence to global optima—the landscape may have many local maxima, saddle points, and plateaus.
For non-convex problems, standard SVI guarantees convergence to a stationary point (zero gradient), not a global optimum. A stationary point may be a local maximum, saddle point, or even a local minimum (if the ELBO has negative curvature regions).
Non-convex convergence theorem:
Suppose \(\mathcal{L}(\phi)\) is \(L\)-smooth (not necessarily concave), gradients have bounded variance \(\sigma^2\), and learning rate \(\rho_t = c/\sqrt{T}\) (constant over the run, tuned to horizon \(T\)). Then:
$$\min_{t \leq T} \mathbb{E}[|\nabla\mathcal{L}(\phi_t)|^2] \leq O\left(\frac{1}{\sqrt{T}}\right)$$
Interpretation: After \(T\) iterations, the smallest gradient norm observed is \(O(1/\sqrt{T})\). To find a point with gradient norm \(\leq \epsilon\), we need \(O(1/\epsilon^2)\) iterations.
The gap with convex analysis:
Notice we measure convergence by gradient norm, not function value. In non-convex settings, small gradients don't imply closeness to optimal value. A saddle point has zero gradient but may be far from any local maximum.
Escaping saddle points:
Modern theory shows that stochastic gradient noise actually helps escape saddle points. The randomness perturbs the trajectory, making it unlikely to get trapped at unstable equilibria. This is one advantage of stochastic over deterministic optimization for non-convex problems.
```python
import numpy as np
from typing import List, Tuple
from dataclasses import dataclass


@dataclass
class ConvergenceMetrics:
    """Metrics for monitoring SVI convergence."""
    iteration: int
    elbo: float
    gradient_norm: float
    parameter_change: float
    learning_rate: float


class ConvergenceMonitor:
    """
    Monitor convergence of stochastic variational inference.

    Tracks ELBO, gradient norms, and parameter changes to detect
    convergence, divergence, or pathological behavior.
    """

    def __init__(
        self,
        window_size: int = 100,
        elbo_tolerance: float = 1e-4,
        gradient_tolerance: float = 1e-5,
        patience: int = 10
    ):
        """
        Args:
            window_size: Number of iterations for moving average
            elbo_tolerance: Relative improvement threshold for convergence
            gradient_tolerance: Gradient norm threshold for convergence
            patience: Consecutive windows without improvement before stopping
        """
        self.window_size = window_size
        self.elbo_tolerance = elbo_tolerance
        self.gradient_tolerance = gradient_tolerance
        self.patience = patience

        self.history: List[ConvergenceMetrics] = []
        self.no_improvement_count = 0
        self.best_elbo = float('-inf')

    def update(
        self,
        iteration: int,
        elbo: float,
        gradient: np.ndarray,
        params_prev: np.ndarray,
        params_curr: np.ndarray,
        learning_rate: float
    ) -> Tuple[bool, str]:
        """
        Record metrics and check convergence.

        Returns:
            converged: Whether convergence criteria are met
            status: Descriptive status message
        """
        gradient_norm = np.linalg.norm(gradient)
        param_change = np.linalg.norm(params_curr - params_prev)

        metrics = ConvergenceMetrics(
            iteration=iteration,
            elbo=elbo,
            gradient_norm=gradient_norm,
            parameter_change=param_change,
            learning_rate=learning_rate
        )
        self.history.append(metrics)

        # Check for divergence (NaN or extreme values)
        if np.isnan(elbo) or np.isnan(gradient_norm):
            return True, "DIVERGED: NaN detected"
        if np.abs(elbo) > 1e10:
            return True, "DIVERGED: ELBO explosion"

        # Not enough history yet
        if len(self.history) < 2 * self.window_size:
            return False, "WARMING_UP"

        # Check gradient norm convergence
        recent_grads = [m.gradient_norm for m in self.history[-self.window_size:]]
        avg_gradient = np.mean(recent_grads)
        if avg_gradient < self.gradient_tolerance:
            return True, f"CONVERGED: Avg gradient norm {avg_gradient:.2e} < {self.gradient_tolerance}"

        # Check ELBO improvement
        recent_elbos = [m.elbo for m in self.history[-self.window_size:]]
        older_elbos = [m.elbo for m in self.history[-2 * self.window_size:-self.window_size]]
        recent_avg = np.mean(recent_elbos)
        older_avg = np.mean(older_elbos)
        relative_improvement = (recent_avg - older_avg) / (np.abs(older_avg) + 1e-10)

        if relative_improvement > self.elbo_tolerance:
            self.no_improvement_count = 0
            if recent_avg > self.best_elbo:
                self.best_elbo = recent_avg
        else:
            self.no_improvement_count += 1
            if self.no_improvement_count >= self.patience:
                return True, f"CONVERGED: No improvement for {self.patience} windows"

        return False, f"RUNNING: Rel improvement {relative_improvement:.2e}"

    def get_summary(self) -> dict:
        """Get summary statistics of optimization run."""
        if not self.history:
            return {}

        elbos = [m.elbo for m in self.history]
        grads = [m.gradient_norm for m in self.history]

        return {
            "iterations": len(self.history),
            "final_elbo": elbos[-1],
            "best_elbo": max(elbos),
            "elbo_std": np.std(elbos[-self.window_size:]),
            "final_gradient_norm": grads[-1],
            "avg_gradient_norm": np.mean(grads[-self.window_size:]),
        }


def diagnose_convergence_issues(history: List[ConvergenceMetrics]) -> List[str]:
    """Analyze optimization history to diagnose common issues."""
    issues = []

    elbos = np.array([m.elbo for m in history])
    grads = np.array([m.gradient_norm for m in history])
    lrs = np.array([m.learning_rate for m in history])

    # Check for oscillation
    if len(elbos) > 100:
        recent_elbos = elbos[-100:]
        if np.std(recent_elbos) > 0.1 * np.abs(np.mean(recent_elbos)):
            issues.append("HIGH_VARIANCE: ELBO oscillating significantly. Consider reducing learning rate.")

    # Check for plateau
    if len(elbos) > 200:
        recent = elbos[-100:]
        older = elbos[-200:-100]
        if np.abs(np.mean(recent) - np.mean(older)) < 1e-6:
            issues.append("PLATEAU: ELBO not improving. May be converged or stuck at saddle point.")

    # Check gradient explosion
    if np.any(grads > 1e6):
        issues.append("GRADIENT_EXPLOSION: Very large gradients detected. Use gradient clipping.")

    # Check gradient vanishing
    if len(grads) > 50 and np.mean(grads[-50:]) < 1e-10:
        issues.append("GRADIENT_VANISHING: Gradients near zero. Check for numerical issues.")

    # Check learning rate
    if len(lrs) > 100 and lrs[-1] < 1e-8:
        issues.append("LR_TOO_SMALL: Learning rate may have decayed too aggressively.")

    return issues
```

The \(O(1/\sqrt{T})\) convergence rate of stochastic gradient methods is fundamentally limited by gradient variance. Variance reduction techniques can achieve faster rates by constructing lower-variance gradient estimators.
Why variance matters:
The convergence rate depends on the signal-to-noise ratio of gradients:
$$\text{Rate} \propto \frac{|\nabla\mathcal{L}|^2}{\sigma^2}$$
Reducing \(\sigma^2\) directly accelerates convergence.
SVRG for Variational Inference:
Stochastic Variance Reduced Gradient (SVRG) periodically computes a snapshot gradient \(\tilde{g} = \nabla\mathcal{L}(\tilde{\phi})\) at a reference point \(\tilde{\phi}\), then corrects mini-batch gradients:
$$\hat{g}_{\text{SVRG}} = \hat{g}_t - \hat{g}_t^{(ref)} + \tilde{g}$$
where \(\hat{g}_t^{(ref)}\) is the mini-batch gradient at the reference point.
Why this works: because \(\mathbb{E}[\hat{g}_t^{(ref)}] = \tilde{g}\), the correction term \(\tilde{g} - \hat{g}_t^{(ref)}\) has zero mean, so the estimator stays unbiased. And since \(\hat{g}_t\) and \(\hat{g}_t^{(ref)}\) are computed on the same mini-batch, they are strongly correlated; as long as \(\phi_t\) stays close to the reference point \(\tilde{\phi}\), they nearly cancel, leaving an estimator with much lower variance.
Convergence improvement: SVRG achieves \(O(1/T)\) convergence for convex problems (matching deterministic GD) while using only mini-batches between snapshots.
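A minimal sketch of an SVRG-style SVI loop under these assumptions; `elbo_grad_minibatch` and `data` are hypothetical stand-ins for your model-specific gradient estimator and dataset:

```python
import numpy as np

def svrg_svi(elbo_grad_minibatch, data, phi0, n_epochs=10, inner_steps=100,
             batch_size=32, rho=0.01, seed=0):
    """SVRG-style variance-reduced stochastic variational inference (sketch).

    elbo_grad_minibatch(phi, batch) must return an unbiased estimate of
    grad ELBO(phi) when `batch` is a uniform mini-batch of `data`.
    """
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi0, dtype=float).copy()
    n = len(data)
    for epoch in range(n_epochs):
        # Snapshot: full-data gradient at the reference point phi_ref
        phi_ref = phi.copy()
        g_snapshot = elbo_grad_minibatch(phi_ref, data)
        for _ in range(inner_steps):
            batch = [data[i] for i in rng.choice(n, size=batch_size, replace=False)]
            # Corrected gradient: same mini-batch evaluated at phi and at the reference
            g = (elbo_grad_minibatch(phi, batch)
                 - elbo_grad_minibatch(phi_ref, batch)
                 + g_snapshot)
            phi = phi + rho * g
    return phi
```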
Variance reduction adds complexity and occasional full-batch computation. It is most beneficial when the dataset is moderately sized (so an occasional full-batch pass is feasible), when gradient variance rather than per-iteration computation is the bottleneck, and when high precision is required. For very large datasets where even one full pass is expensive, standard SVI may be preferable.
The choice of learning rate schedule profoundly affects SVI performance. While any schedule satisfying Robbins-Monro conditions converges asymptotically, practical performance varies dramatically.
Common schedules and their properties:
| Schedule | Formula | Properties | Best For |
|---|---|---|---|
| Inverse time | ρ_t = c/(t + τ) | Classical, proven convergence | Theoretical guarantees |
| Inverse sqrt | ρ_t = c/√t | Slower decay, more exploration | Non-convex, early training |
| Step decay | ρ_t = c × γ^{floor(t/s)} | Piecewise constant | Fine-tuning after warmup |
| Cosine annealing | ρ_t = ρ_min + ½(ρ_max-ρ_min)(1+cos(πt/T)) | Smooth, restarts possible | Modern deep learning |
| Warmup + decay | Linear increase then inverse | Stable start, convergent end | Large models, unstable starts |
The warmup phase:
Many modern systems use a warmup period where learning rate increases from near-zero to the target value:
$$\rho_t = \begin{cases} \rho_{\max} \cdot t / T_{\text{warmup}} & t < T_{\text{warmup}} \\ \rho_{\max} \cdot \text{decay}(t - T_{\text{warmup}}) & t \geq T_{\text{warmup}} \end{cases}$$
Warmup helps with: stabilizing the earliest updates, when the variational parameters are far from any reasonable region and gradient estimates are large and noisy; giving adaptive optimizers time to accumulate reliable moment estimates before large steps are taken; and avoiding the divergence that a full-size initial learning rate can trigger.
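As a minimal sketch (with illustrative constants, not from the original text), the piecewise warmup-then-decay schedule above can be implemented as:

```python
def warmup_then_decay(t, rho_max=0.05, warmup_steps=500, tau=100.0):
    """Linear warmup to rho_max, then inverse-time decay."""
    if t < warmup_steps:
        return rho_max * (t + 1) / warmup_steps
    return rho_max * tau / (tau + (t - warmup_steps))
```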
Practical learning rate tuning usually proceeds by grid-searching the initial rate over a coarse logarithmic scale, keeping the largest value that yields a stable ELBO trajectory, and only then fixing the decay schedule. The implementations below illustrate several common schedules.
```python
import numpy as np
from abc import ABC, abstractmethod


class LRSchedule(ABC):
    """Abstract base class for learning rate schedules."""

    @abstractmethod
    def get_lr(self, step: int) -> float:
        pass


class InverseTimeSchedule(LRSchedule):
    """ρ_t = initial_lr / (1 + decay_rate * t)"""

    def __init__(self, initial_lr: float = 0.1, decay_rate: float = 0.01):
        self.initial_lr = initial_lr
        self.decay_rate = decay_rate

    def get_lr(self, step: int) -> float:
        return self.initial_lr / (1 + self.decay_rate * step)


class RobbinsMonroSchedule(LRSchedule):
    """ρ_t = c / (t + τ)^α, satisfying Robbins-Monro conditions."""

    def __init__(self, c: float = 1.0, tau: float = 10.0, alpha: float = 0.75):
        """
        Args:
            c: Scale factor
            tau: Offset (larger = more conservative start)
            alpha: Decay exponent (must be in (0.5, 1] for convergence)
        """
        assert 0.5 < alpha <= 1, "Alpha must be in (0.5, 1] for Robbins-Monro"
        self.c = c
        self.tau = tau
        self.alpha = alpha

    def get_lr(self, step: int) -> float:
        return self.c / (step + self.tau) ** self.alpha


class CosineAnnealingSchedule(LRSchedule):
    """Cosine annealing with optional warm restarts."""

    def __init__(
        self,
        lr_max: float = 0.1,
        lr_min: float = 1e-6,
        period: int = 1000,
        warmup_steps: int = 100
    ):
        self.lr_max = lr_max
        self.lr_min = lr_min
        self.period = period
        self.warmup_steps = warmup_steps

    def get_lr(self, step: int) -> float:
        if step < self.warmup_steps:
            # Linear warmup
            return self.lr_max * (step + 1) / self.warmup_steps

        # Cosine annealing
        post_warmup_step = step - self.warmup_steps
        progress = (post_warmup_step % self.period) / self.period
        return self.lr_min + 0.5 * (self.lr_max - self.lr_min) * (1 + np.cos(np.pi * progress))


class AdaptiveLRSchedule(LRSchedule):
    """
    Adaptive learning rate based on optimization progress.

    Reduces the learning rate when improvement stalls.
    """

    def __init__(
        self,
        initial_lr: float = 0.01,
        min_lr: float = 1e-6,
        max_lr: float = 1.0,
        patience: int = 50,
        factor: float = 0.5
    ):
        self.current_lr = initial_lr
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.patience = patience
        self.factor = factor

        self.best_value = float('-inf')
        self.steps_without_improvement = 0

    def step(self, current_value: float):
        """Update schedule based on current objective value."""
        if current_value > self.best_value * 1.001:  # Small threshold for improvement
            self.best_value = current_value
            self.steps_without_improvement = 0
        else:
            self.steps_without_improvement += 1
            if self.steps_without_improvement >= self.patience:
                self.current_lr = max(self.min_lr, self.current_lr * self.factor)
                self.steps_without_improvement = 0

    def get_lr(self, step: int) -> float:
        return self.current_lr


def robbins_monro_verify(schedule: LRSchedule, check_steps: int = 100000) -> bool:
    """
    Verify that a schedule satisfies Robbins-Monro conditions (approximately).
    """
    lrs = [schedule.get_lr(t) for t in range(check_steps)]

    # Sum should be large (diverging)
    lr_sum = sum(lrs)

    # Sum of squares should be converging
    lr_sq_sum = sum(lr ** 2 for lr in lrs)

    # Heuristic checks
    sum_diverges = lr_sum > check_steps * 0.01  # Growing significantly
    sq_converges = lr_sq_sum < check_steps * lrs[0] ** 2  # Growing slower than linear

    return sum_diverges and sq_converges
```

Natural gradients often converge faster than Euclidean gradients in practice. The theory explains this through the condition number of the optimization problem.
Condition number and convergence:
The condition number \(\kappa\) measures how "stretched" the optimization landscape is:
$$\kappa = \frac{L}{\mu}$$
where \(L\) is the smoothness constant and \(\mu\) is the strong convexity constant.
Natural gradients improve conditioning:
The Fisher Information Matrix \(\mathbf{F}\) acts as a preconditioner. Multiplying by \(\mathbf{F}^{-1}\) transforms the problem to have near-identity curvature in the natural geometry:
$$\kappa_{\text{natural}} \approx 1 \quad \text{vs} \quad \kappa_{\text{Euclidean}} \gg 1$$
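As a concrete sketch, assume a fully factorized Gaussian \(q(\theta) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\) parameterized by \(\mu\) and \(\log\sigma\); under that assumption the Fisher information is diagonal, so the natural gradient reduces to elementwise rescaling:

```python
import numpy as np

def natural_gradient_step(mu, log_sigma, grad_mu, grad_log_sigma, rho=0.1):
    """One natural-gradient step for a mean-field Gaussian q = N(mu, diag(sigma^2)).

    With the (mu, log sigma) parameterization the Fisher information is diagonal:
        F_mu        = 1 / sigma^2   (per coordinate)
        F_log_sigma = 2             (per coordinate)
    so the natural gradient is an elementwise rescaling of the Euclidean
    gradient and no matrix inversion is required.
    """
    sigma2 = np.exp(2.0 * log_sigma)
    nat_grad_mu = sigma2 * grad_mu              # F_mu^{-1} * grad
    nat_grad_log_sigma = grad_log_sigma / 2.0   # F_log_sigma^{-1} * grad
    return mu + rho * nat_grad_mu, log_sigma + rho * nat_grad_log_sigma
```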
For conditionally conjugate exponential family models, a natural gradient step with step size ρ_t = 1 on a single variational factor reproduces the exact coordinate ascent (CAVI) update for that factor: the step lands directly on that factor's coordinate-wise optimum, with no step-size tuning required.
Quantifying the speedup:
For a Gaussian variational distribution with \(d\) dimensions, the curvature of the ELBO along different directions is governed by the posterior covariance, whose eigenvalues can span many orders of magnitude. The resulting condition number \(\kappa\) can reach \(10^3\)–\(10^6\) for ill-conditioned posteriors, so Fisher preconditioning can translate into speedups of several orders of magnitude.
Natural gradients for neural network VI:
For Bayesian neural networks, exact natural gradients are impractical (inverting the full Fisher is \(O(d^3)\) for millions of parameters). Approximations like K-FAC retain much of the benefit by treating the Fisher as block-diagonal across layers and approximating each layer's block as a Kronecker product of two small matrices, which are cheap to invert.
The tradeoff is computational: K-FAC adds ~20-50% overhead per iteration but converges in many fewer iterations.
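A minimal sketch of the core K-FAC computation for a single fully connected layer, assuming you can collect the layer's input activations and backpropagated output gradients over a mini-batch; the damping constant is an illustrative stabilizer:

```python
import numpy as np

def kfac_precondition(grad_W, activations, output_grads, damping=1e-3):
    """Approximate natural gradient for one dense layer's weight matrix.

    K-FAC approximates the layer's Fisher block as a Kronecker product
    F ≈ A ⊗ G with A = E[a a^T] (input second moment) and G = E[g g^T]
    (output-gradient second moment). Using the identity
    (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}), two small inversions
    replace inverting the full per-layer Fisher block.

    Args:
        grad_W: (d_out, d_in) Euclidean gradient of the ELBO w.r.t. weights
        activations: (batch, d_in) layer inputs a
        output_grads: (batch, d_out) gradients g w.r.t. the layer's outputs
    """
    n = activations.shape[0]
    A = activations.T @ activations / n + damping * np.eye(activations.shape[1])
    G = output_grads.T @ output_grads / n + damping * np.eye(output_grads.shape[1])
    # G^{-1} * grad_W * A^{-1}
    return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
```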
Theoretical convergence guarantees are asymptotic—they tell us what happens as \(T \to \infty\). In practice, we need diagnostics to determine when to stop and whether optimization is proceeding normally.
Detecting common pathologies:
1. Posterior collapse (VAEs): the approximate posterior collapses onto the prior, the KL term drops to near zero, and the latent variables stop carrying information about the data. Monitoring the per-dimension KL between \(q(z|x)\) and the prior catches this early (a detection sketch follows this list).
2. Oscillation: the ELBO fluctuates widely around a level without a clear upward trend, usually because the learning rate is too large or the mini-batches are too small. Reducing the rate or increasing the batch size typically restores steady progress.
3. Premature convergence: the ELBO plateaus early at a poor value, often because the learning rate decayed too aggressively or the optimizer settled near a poor local optimum or saddle point. Slower decay, warm restarts, or better initialization can help.
4. Slow convergence: the ELBO improves steadily but very slowly, typically a sign of ill-conditioning or an overly conservative learning rate. Natural gradients, preconditioning, or a larger initial rate are the usual remedies.
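For the first pathology, a common check is the average per-dimension KL between a diagonal-Gaussian encoder and the standard normal prior; a minimal sketch, where the 0.01-nat threshold is an illustrative cutoff:

```python
import numpy as np

def count_collapsed_dims(mu, log_var, threshold=0.01):
    """Flag latent dimensions with near-zero KL(q(z|x) || N(0, I)).

    Args:
        mu: (batch, latent_dim) posterior means
        log_var: (batch, latent_dim) posterior log-variances
        threshold: average KL (in nats) below which a dimension counts as collapsed

    Per-dimension KL for a diagonal Gaussian against a standard normal prior:
        0.5 * (mu^2 + sigma^2 - log sigma^2 - 1)
    """
    kl_per_dim = 0.5 * (mu ** 2 + np.exp(log_var) - log_var - 1.0)
    avg_kl = kl_per_dim.mean(axis=0)          # average over the batch
    collapsed = np.where(avg_kl < threshold)[0]
    return collapsed, avg_kl
```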
For practical applications, perfect convergence often isn't necessary. Early stopping when validation metrics plateau can actually improve generalization by acting as regularization. The key is ensuring the variational approximation is accurate enough for downstream tasks—sometimes a rough posterior is sufficient.
Understanding convergence theory transforms SVI from a heuristic into a principled algorithm with predictable behavior.
What's next:
With the theoretical foundations complete, we conclude this module with practical considerations—a synthesis of implementation wisdom covering initialization strategies, debugging techniques, and guidelines for when to use (and not use) stochastic variational inference.
You now understand the convergence theory underlying stochastic variational inference. This knowledge enables you to set hyperparameters with confidence, diagnose optimization issues systematically, and reason about the computational requirements of probabilistic inference at scale.