We've derived the mean-field update equations and implemented CAVI. But critical questions remain: Does the algorithm always converge? How fast? What does it converge to? When can we trust the result?
These questions are not merely theoretical; they directly affect how we use variational inference in practice. Understanding convergence helps us choose sensible stopping criteria, catch implementation bugs early, and judge how much to trust the resulting approximation.
By the end of this page, you will understand the theoretical convergence guarantees of mean-field VI, the factors that affect convergence speed, the nature of local optima in the ELBO landscape, and practical strategies for achieving reliable convergence.
Mean-field variational inference with coordinate ascent (CAVI) enjoys strong convergence guarantees that stem from the structure of the optimization problem.
Theorem (CAVI Convergence):
Let $\{q^{(t)}\}_{t=1}^{\infty}$ be the sequence of variational distributions produced by CAVI. Then:

1. The ELBO never decreases: $\text{ELBO}(q^{(t+1)}) \geq \text{ELBO}(q^{(t)})$ for all $t$.
2. The sequence of ELBO values converges to a finite limit.
3. Any limit point of the sequence is a stationary point of the ELBO under the mean-field factorization.
Proof Sketch:
Monotonicity: Each coordinate update maximizes the ELBO with respect to one factor while holding others fixed. Since we're choosing the optimal update, the ELBO cannot decrease.
Convergence: The ELBO is bounded above by $\log p(\mathbf{x})$ (since $\text{KL}(q || p) \geq 0$). A bounded, monotonically increasing sequence must converge.
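In symbols, the bound used here is just the evidence decomposition:

$$\log p(\mathbf{x}) = \text{ELBO}(q) + \text{KL}(q \,\|\, p) \geq \text{ELBO}(q),$$

since the KL term is non-negative.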
Stationarity: If the ELBO stops changing, no factor can improve—this is the definition of a stationary point for coordinate optimization.
These guarantees do NOT say that CAVI finds the global maximum of the ELBO. The ELBO landscape can have many local optima, saddle points, and plateaus. CAVI converges to a local maximum (or saddle point), which may be far from optimal. This is similar to gradient descent on non-convex functions.
Conditions for Stronger Convergence:
In special cases, we can prove stronger results:
Convex ELBO: If the ELBO is convex in all factors jointly (rare), CAVI converges to the global optimum.
Unique Stationary Point: If the ELBO has a unique stationary point (common in well-conditioned small models), CAVI converges to it.
Exponential Family with Convex Sufficient Statistics: For certain exponential family models, the ELBO can be shown to have favorable structure.
In practice, most models have multiple local optima, and we use multiple restarts to find good solutions.
| Property | Guaranteed? | Condition | Implication |
|---|---|---|---|
| ELBO non-decreasing | Yes | Optimal factor updates | Can monitor for bugs |
| ELBO converges | Yes | Bounded above by log p(x) | Algorithm terminates |
| Converges to stationary point | Yes | Coordinate-wise optimality | No single-factor improvement possible |
| Converges to global optimum | No | Only if ELBO convex | May find suboptimal solution |
| Unique limit point | No | Depends on initialization | Different runs may differ |
| Fast convergence | No | Depends on model structure | May need many iterations |
How quickly does CAVI converge? The answer depends on the model structure and can range from just a few iterations to thousands.
Factors Affecting Convergence Rate:
Coupling Strength: Strong dependencies between latent variables slow convergence. If $z_i$'s optimal value depends heavily on $z_j$, and vice versa, many iterations may be needed to equilibrate.
Condition Number: The 'shape' of the ELBO landscape matters. Highly elongated or ill-conditioned regions lead to slow progress.
Dimensionality: More latent variables generally means more iterations, though not always linearly.
Initialization Quality: Starting close to the optimum dramatically reduces iterations needed.
Under favorable conditions (smooth, strongly convex objective), coordinate ascent exhibits LINEAR convergence: the error decreases by a constant factor each iteration. If ε_t ≤ ρᵗ ε₀ for some ρ < 1, then ε_t < ε requires t > log(ε₀/ε) / log(1/ρ) iterations. For ρ = 0.9, halving error takes ~7 iterations; for ρ = 0.99, it takes ~69 iterations.
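As a quick numerical check of those figures, here is a small standalone sketch (a hypothetical helper, not part of the lesson code) that evaluates the iteration-count formula:

```python
import numpy as np

def iterations_needed(rho: float, eps0: float, eps: float) -> float:
    """Iterations t required so that rho**t * eps0 < eps under linear convergence."""
    return np.log(eps0 / eps) / np.log(1.0 / rho)

# Iterations needed just to halve the error for different rates rho
for rho in [0.5, 0.9, 0.99]:
    print(f"rho = {rho:.2f}: ~{iterations_needed(rho, 1.0, 0.5):.1f} iterations per halving")
```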
```python
import numpy as np
from typing import List


def analyze_convergence_rate(elbo_history: List[float]) -> dict:
    """
    Analyze convergence rate from ELBO history.

    Estimates the convergence rate by fitting an exponential decay
    to the ELBO improvements (distance from final value).

    Returns:
        Dictionary with convergence statistics
    """
    elbo = np.array(elbo_history)
    final_elbo = elbo[-1]

    # Distance from the final value (decays exponentially under linear convergence)
    distance = np.abs(elbo - final_elbo)
    distance = np.maximum(distance, 1e-15)  # Avoid log(0)

    # Estimate convergence rate from a log-linear fit:
    # log(distance) ≈ log(ε₀) + t × log(ρ)
    t = np.arange(len(distance))
    valid = distance > 1e-10  # Only use points not yet at convergence

    if np.sum(valid) > 2:
        log_dist = np.log(distance[valid])
        t_valid = t[valid]
        coeffs = np.polyfit(t_valid, log_dist, 1)  # Linear regression
        log_rho = coeffs[0]                        # Slope
        rho = np.exp(log_rho)                      # Convergence rate
    else:
        rho = None

    # Per-iteration relative improvements
    improvements = np.diff(elbo)
    relative_improvements = improvements[:-1] / np.maximum(np.abs(elbo[:-2]), 1e-10)

    return {
        'n_iterations': len(elbo),
        'initial_elbo': elbo[0],
        'final_elbo': final_elbo,
        'total_improvement': final_elbo - elbo[0],
        'estimated_rho': rho,
        'estimated_half_life': -np.log(2) / np.log(rho) if rho and 0 < rho < 1 else None,
        'mean_relative_improvement': np.mean(np.abs(relative_improvements)),
    }


def demonstrate_coupling_effects():
    """
    Show how coupling between variables affects convergence.

    Simulates CAVI for a bivariate Gaussian with varying correlation.
    """
    print("Effect of Variable Coupling on Convergence")
    print("=" * 60)
    print()
    print("Consider a bivariate Gaussian posterior with correlation ρ.")
    print("Mean-field approximates it with independent q(z₁)q(z₂).")
    print()
    print("The CAVI updates for the means are:")
    print("  μ₁ ← (data term) + ρ × E[z₂]")
    print("  μ₂ ← (data term) + ρ × E[z₁]")
    print()
    print("Strong correlation ρ slows how quickly the two updates equilibrate.")
    print()

    def simulate_cavi(rho: float, n_iter: int = 50) -> List[float]:
        """Simulate CAVI for a 2D Gaussian with correlation rho."""
        # True posterior: N([0, 0], [[1, ρ], [ρ, 1]]); target means are 0,
        # but we start away from them.
        mu1, mu2 = 5.0, 5.0
        errors = []
        for _ in range(n_iter):
            # Damped coordinate updates (simplified for illustration):
            # each mean is pulled toward the other's current value, scaled by ρ
            mu1 = 0.5 * (mu1 + rho * mu2)
            mu2 = 0.5 * (mu2 + rho * mu1)
            errors.append(np.sqrt(mu1**2 + mu2**2))
        return errors

    correlations = [0.0, 0.3, 0.6, 0.9]
    print("Correlation | Iterations to ε<0.1 | Final Error")
    print("-" * 50)
    for rho in correlations:
        errors = simulate_cavi(rho, n_iter=100)
        iters_to_conv = next((i for i, e in enumerate(errors) if e < 0.1), 100)
        print(f"    {rho:.1f}     |         {iters_to_conv:3d}         | {errors[-1]:.4f}")
    print()
    print("Higher correlation → slower convergence.")
    print("This is why strong dependencies hurt mean-field performance.")


def analyze_elbo_trajectory():
    """Summarize typical ELBO trajectory shapes."""
    print("\n" + "=" * 60)
    print("Typical ELBO Trajectories")
    print("=" * 60)

    n_iter = 100

    # Scenario 1: Fast convergence (well-conditioned)
    fast = [-1000]
    for i in range(n_iter - 1):
        fast.append(fast[-1] + 200 * np.exp(-0.2 * i))

    # Scenario 2: Slow convergence (ill-conditioned)
    slow = [-1000]
    for i in range(n_iter - 1):
        slow.append(slow[-1] + 50 * np.exp(-0.05 * i))

    # Scenario 3: Plateau, then progress
    plateau = [-1000]
    for i in range(n_iter - 1):
        improvement = 5 if i < 30 else 100 * np.exp(-0.1 * (i - 30))
        plateau.append(plateau[-1] + improvement)

    print()
    print("Scenario 1 (Fast): Well-conditioned, weak coupling")
    stats_fast = analyze_convergence_rate(fast)
    print(f"  Converged in ~{stats_fast['n_iterations']} iterations")
    print(f"  Estimated ρ = {stats_fast['estimated_rho']:.3f}")
    print()
    print("Scenario 2 (Slow): Ill-conditioned, strong coupling")
    stats_slow = analyze_convergence_rate(slow)
    print(f"  Still improving at {stats_slow['n_iterations']} iterations")
    print(f"  Estimated ρ = {stats_slow['estimated_rho']:.3f}")
    print()
    print("Scenario 3 (Plateau): Gets stuck, then escapes")
    stats_plateau = analyze_convergence_rate(plateau[:60])
    print("  Shows plateau behavior - may indicate a local optimum")


if __name__ == "__main__":
    demonstrate_coupling_effects()
    analyze_elbo_trajectory()
```

Understanding the ELBO's local optima structure is crucial for using mean-field VI effectively. Different initial conditions can lead to dramatically different solutions.
Why Multiple Local Optima Exist:
Symmetry: Many models have inherent symmetries. In mixture models, relabeling clusters gives equivalent solutions. In factor models, rotating factors gives equivalent fits. The ELBO has multiple equivalent optima.
Multi-modality: Some posteriors are genuinely multi-modal—multiple explanations fit the data. Mean-field can only approximate one mode at a time.
Factorization Artifacts: The mean-field constraint creates artificial local optima that wouldn't exist if we optimized over all distributions.
Non-convexity: The ELBO is generally non-convex in the variational parameters, admitting multiple stationary points.
Consider a Gaussian mixture with 3 clusters. The ELBO has at least 3! = 6 equivalent global optima (cluster relabelings). But it also has many suboptimal local optima: two true clusters might be merged and one split, or data might be assigned incorrectly. Random initialization often finds these suboptimal solutions.
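To make the relabeling symmetry concrete, the sketch below (a hypothetical 1D mixture, not the lesson's model) evaluates the same mixture log-likelihood under all 3! = 6 permutations of the component labels; every permutation gives an identical value, and the ELBO inherits the same symmetry.

```python
import numpy as np
from itertools import permutations

# Hypothetical 1D mixture with K=3 components (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=100)

weights = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([1.0, 0.5, 1.5])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def gmm_loglik(x, w, mu, sigma):
    """sum_n log sum_k w_k N(x_n | mu_k, sigma_k)."""
    comp = w * normal_pdf(x[:, None], mu[None, :], sigma[None, :])
    return np.sum(np.log(comp.sum(axis=1)))

# Every relabeling of the components gives exactly the same log-likelihood
for perm in permutations(range(3)):
    idx = list(perm)
    print(perm, round(gmm_loglik(x, weights[idx], means[idx], stds[idx]), 6))
```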
Strategies for Handling Local Optima: the most common are multiple random restarts (keep the run with the highest ELBO), deterministic annealing (optimize a smoothed objective at high "temperature" and gradually sharpen it), and careful initialization. The code below sketches the first two, plus a helper for analyzing the distribution of optima found across restarts.
```python
import numpy as np
from typing import List, Tuple, Callable, Optional
from dataclasses import dataclass


@dataclass
class CAVIResult:
    """Result from a single CAVI run."""
    final_elbo: float
    elbo_history: List[float]
    parameters: dict
    n_iterations: int


def multiple_restarts(
    run_cavi: Callable[[], CAVIResult],
    n_restarts: int = 10,
    verbose: bool = True
) -> CAVIResult:
    """
    Run CAVI multiple times and return the best result.

    Args:
        run_cavi: Function that runs CAVI with random initialization
        n_restarts: Number of random restarts
        verbose: Whether to print progress

    Returns:
        Best result (highest ELBO)
    """
    best_result = None
    best_elbo = float('-inf')
    all_elbos = []

    for i in range(n_restarts):
        result = run_cavi()
        all_elbos.append(result.final_elbo)

        if result.final_elbo > best_elbo:
            best_elbo = result.final_elbo
            best_result = result

        if verbose:
            print(f"Restart {i+1}/{n_restarts}: ELBO = {result.final_elbo:.2f}")

    if verbose:
        print(f"\nBest ELBO: {best_elbo:.2f}")
        print(f"ELBO range: [{min(all_elbos):.2f}, {max(all_elbos):.2f}]")
        print(f"ELBO std: {np.std(all_elbos):.2f}")

        # Analyze how many restarts found similar solutions
        threshold = 0.01 * abs(best_elbo)  # Within 1%
        n_good = sum(1 for e in all_elbos if abs(e - best_elbo) < threshold)
        print(f"Restarts finding best (within 1%): {n_good}/{n_restarts}")

    return best_result


def deterministic_annealing(
    run_cavi_with_temp: Callable[[float], CAVIResult],
    temperatures: List[float] = None,
    verbose: bool = True
) -> CAVIResult:
    """
    Use deterministic annealing to escape local optima.

    Start with high temperature (smoothed ELBO), gradually decrease.
    Uses the solution at each temperature to initialize the next.

    Args:
        run_cavi_with_temp: Function (temperature) -> CAVIResult
        temperatures: Temperature schedule (default: geometric from 10 to 1)

    Returns:
        Final result at temperature 1.0
    """
    if temperatures is None:
        # Geometric schedule from 10 down to 1
        temperatures = list(np.geomspace(10, 1, num=10))

    current_result = None
    for temp in temperatures:
        if verbose:
            print(f"Temperature {temp:.2f}:", end=" ")
        result = run_cavi_with_temp(temp)
        if verbose:
            print(f"ELBO = {result.final_elbo:.2f}")
        current_result = result

    return current_result


def analyze_optima_distribution(
    run_cavi: Callable[[], CAVIResult],
    n_runs: int = 50
) -> dict:
    """
    Analyze the distribution of local optima found by random restarts.
    """
    elbos = []
    for _ in range(n_runs):
        result = run_cavi()
        elbos.append(result.final_elbo)

    elbos = np.array(elbos)

    # Cluster similar ELBO values (likely the same local optimum)
    sorted_elbos = np.sort(elbos)[::-1]

    # Find distinct optima (gaps > 1% of range)
    range_elbo = sorted_elbos[0] - sorted_elbos[-1]
    threshold = 0.01 * range_elbo if range_elbo > 0 else 1.0

    n_distinct = 1
    for i in range(1, len(sorted_elbos)):
        if sorted_elbos[i-1] - sorted_elbos[i] > threshold:
            n_distinct += 1

    return {
        'n_runs': n_runs,
        'best_elbo': np.max(elbos),
        'worst_elbo': np.min(elbos),
        'mean_elbo': np.mean(elbos),
        'std_elbo': np.std(elbos),
        'n_distinct_optima': n_distinct,
        'elbo_range': np.max(elbos) - np.min(elbos)
    }


def demonstrate_local_optima():
    """Demonstrate the local optima phenomenon."""
    print("Local Optima in Mean-Field VI")
    print("=" * 60)
    print()

    # Simulated scenario: Gaussian mixture with K=3 clusters
    print("Scenario: Gaussian Mixture Model with K=3 clusters")
    print()
    print("True clusters: well-separated, roughly equal size")
    print()

    # Simulate different local optima
    optima = [
        ("Global optimum (correct clustering)", -1500.2),
        ("Local optimum: clusters 1&2 merged, cluster 3 split", -1523.7),
        ("Local optimum: cluster 1 empty", -1548.1),
        ("Local optimum: poor centroid placement", -1561.9),
    ]

    print("Possible optima found by random restarts:")
    print("-" * 60)
    for description, elbo in optima:
        print(f"  ELBO = {elbo:.1f}: {description}")
    print()
    print("Key observations:")
    print("  • ELBO difference between best and worst: "
          f"{optima[0][1] - optima[-1][1]:.1f}")
    print("  • Random initialization finds suboptimal solutions frequently")
    print("  • Multiple restarts are essential for good solutions")
    print()

    # Simulate restart statistics
    print("Simulated restart statistics (50 runs):")
    print("-" * 60)
    # Pretend we ran 50 restarts
    print("  Found global optimum:  15 times (30%)")
    print("  Found second-best:     18 times (36%)")
    print("  Found poor solutions:  17 times (34%)")
    print()
    print("Recommendation: Run at least 10-20 restarts for mixture models")


if __name__ == "__main__":
    demonstrate_local_optima()
```

How do we know when CAVI has converged? And how do we distinguish between convergence to a good solution versus getting stuck? Here are essential diagnostic tools.
Primary Diagnostic: ELBO Monitoring
The ELBO should be computed and tracked at every iteration (or every few iterations for large models). Key things to watch for:
Watch for: (1) ELBO decreasing — indicates implementation bug; (2) ELBO going to ±∞ — numerical instability; (3) Very slow progress — may need different parameterization or initialization; (4) Immediate plateau — degenerate initialization or model misspecification.
Secondary Diagnostics:
Parameter Stability: Track changes in variational parameters. Convergence means parameters stop changing.
Responsibility Entropy: For mixture models, monitor the entropy of cluster assignments. Very low entropy (hard assignments) early in optimization may indicate premature convergence; a small sketch for computing this follows the list.
Predictive Performance: If possible, monitor held-out likelihood or other external metrics.
Component Usage: For mixture/factor models, check that all components are being used. Empty components suggest identifiability issues.
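As a concrete helper for the responsibility-entropy check above, here is a minimal sketch; it assumes responsibilities are stored as a row-normalized array of shape (n_points, n_clusters), and the names are illustrative rather than part of the lesson code.

```python
import numpy as np

def mean_responsibility_entropy(resp: np.ndarray) -> float:
    """Average entropy (in nats) of the per-point cluster-assignment distributions.

    resp has shape (n_points, n_clusters); each row sums to 1.
    Near-zero entropy early in optimization suggests assignments have
    hardened prematurely.
    """
    resp = np.clip(resp, 1e-12, 1.0)
    return float(np.mean(-np.sum(resp * np.log(resp), axis=1)))

# Example: soft vs. hard responsibilities for 4 points and 3 clusters
soft = np.full((4, 3), 1.0 / 3.0)
hard = np.eye(3)[[0, 1, 2, 0]]
print(mean_responsibility_entropy(soft))  # ≈ log(3) ≈ 1.10
print(mean_responsibility_entropy(hard))  # ≈ 0
```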
```python
import numpy as np
from typing import List, Dict, Optional
from dataclasses import dataclass, field
import warnings


@dataclass
class ConvergenceDiagnostics:
    """
    Comprehensive diagnostics for CAVI convergence.

    Tracks ELBO, parameters, and derived quantities to detect
    convergence and diagnose issues.
    """
    # Settings
    elbo_rtol: float = 1e-6   # Relative tolerance for ELBO
    elbo_atol: float = 1e-8   # Absolute tolerance for ELBO
    param_tol: float = 1e-5   # Tolerance for parameter changes

    # History
    elbo_history: List[float] = field(default_factory=list)
    param_history: List[Dict] = field(default_factory=list)

    # Diagnostics
    n_decreases: int = 0
    decrease_magnitudes: List[float] = field(default_factory=list)

    def record_iteration(
        self,
        elbo: float,
        params: Optional[Dict] = None
    ) -> Dict:
        """
        Record one iteration and return diagnostics.

        Args:
            elbo: Current ELBO value
            params: Optional dictionary of current parameters

        Returns:
            Dictionary with diagnostic information
        """
        diag = {
            'iteration': len(self.elbo_history),
            'elbo': elbo,
            'converged': False,
            'issues': []
        }

        # Check for invalid values
        if np.isnan(elbo):
            diag['issues'].append('ELBO is NaN')
            warnings.warn("ELBO is NaN - numerical instability")
        elif np.isinf(elbo):
            diag['issues'].append('ELBO is infinite')
            warnings.warn("ELBO is infinite - check for overflow/underflow")

        # Check monotonicity
        if len(self.elbo_history) > 0:
            prev = self.elbo_history[-1]
            change = elbo - prev
            diag['elbo_change'] = change
            diag['elbo_rel_change'] = change / abs(prev) if prev != 0 else 0

            if change < -1e-10:  # Allow tiny numerical errors
                self.n_decreases += 1
                self.decrease_magnitudes.append(-change)
                diag['issues'].append(f'ELBO decreased by {-change:.2e}')
                warnings.warn(
                    f"ELBO decreased at iteration {diag['iteration']}: "
                    f"{prev:.4f} -> {elbo:.4f}"
                )

            # Check convergence (ELBO criterion)
            rel_change = abs(change) / max(abs(prev), 1e-10)
            if rel_change < self.elbo_rtol and abs(change) < self.elbo_atol:
                diag['converged'] = True

        self.elbo_history.append(elbo)

        if params is not None:
            self.param_history.append(params.copy())

            # Check parameter convergence
            if len(self.param_history) > 1:
                max_param_change = self._max_param_change(
                    self.param_history[-2], self.param_history[-1]
                )
                diag['max_param_change'] = max_param_change
                if max_param_change < self.param_tol:
                    diag['params_converged'] = True

        return diag

    def _max_param_change(self, old: Dict, new: Dict) -> float:
        """Compute maximum absolute change across all parameters."""
        max_change = 0
        for key in old:
            if key in new:
                old_val = np.asarray(old[key])
                new_val = np.asarray(new[key])
                change = np.max(np.abs(old_val - new_val))
                max_change = max(max_change, change)
        return max_change

    def summary_report(self) -> str:
        """Generate a summary report of the optimization."""
        lines = [
            "=" * 60,
            "CAVI Convergence Report",
            "=" * 60,
            f"Total iterations: {len(self.elbo_history)}",
        ]

        if len(self.elbo_history) > 0:
            lines.extend([
                f"Initial ELBO: {self.elbo_history[0]:.4f}",
                f"Final ELBO: {self.elbo_history[-1]:.4f}",
                f"Total improvement: {self.elbo_history[-1] - self.elbo_history[0]:.4f}",
            ])

        lines.append("")
        lines.append("Monotonicity Check:")
        if self.n_decreases == 0:
            lines.append("  ✓ ELBO never decreased (correct behavior)")
        else:
            lines.append(f"  ✗ ELBO decreased {self.n_decreases} times (BUG!)")
            lines.append(f"    Max decrease: {max(self.decrease_magnitudes):.2e}")

        if len(self.elbo_history) > 1:
            lines.append("")
            lines.append("Convergence Assessment:")
            final_change = abs(self.elbo_history[-1] - self.elbo_history[-2])
            final_rel = final_change / max(abs(self.elbo_history[-2]), 1e-10)
            lines.append(f"  Final ELBO change: {final_change:.2e}")
            lines.append(f"  Final relative change: {final_rel:.2e}")
            if final_rel < self.elbo_rtol:
                lines.append("  ✓ Converged by ELBO criterion")
            else:
                lines.append("  ✗ May not have fully converged")

        return "\n".join(lines)

    def detect_plateau(self, window: int = 10, threshold: float = 0.01) -> bool:
        """
        Detect if optimization is stuck on a plateau.

        A plateau is detected if the ELBO improvement over the last
        'window' iterations is less than 'threshold' fraction of the
        total improvement so far.
        """
        if len(self.elbo_history) < window + 1:
            return False

        recent_improvement = self.elbo_history[-1] - self.elbo_history[-window]
        total_improvement = self.elbo_history[-1] - self.elbo_history[0]

        if total_improvement <= 0:
            return True  # No improvement at all

        return recent_improvement / total_improvement < threshold


def run_with_diagnostics():
    """Example of running CAVI with full diagnostics."""
    print("Running CAVI with Convergence Diagnostics")
    print("=" * 60)

    # Simulate a CAVI run
    diag = ConvergenceDiagnostics()
    elbo = -1000.0
    np.random.seed(42)

    for i in range(100):
        # Simulate ELBO improvement
        improvement = 50 * np.exp(-0.1 * i) + np.random.randn() * 0.1
        elbo += max(improvement, 0)  # Ensure non-decreasing

        result = diag.record_iteration(elbo)
        if result['converged']:
            print(f"Converged at iteration {i}")
            break

    print()
    print(diag.summary_report())


if __name__ == "__main__":
    run_with_diagnostics()
```

Mean-field VI isn't always the right choice. Understanding when it works well—and when it fails—helps you decide whether to use it for your problem.
Conditions Favoring Mean-Field: mean-field works best when posterior dependencies between latent variables are weak, so the factorized approximation loses little, and when point estimates or marginal summaries matter more than accurate joint uncertainty. The table below summarizes suitability for common model families.
Mean-field systematically underestimates posterior variance whenever variables are correlated. Intuitively: if z₁ being high makes z₂ likely to be high, but we model them as independent, we miss that they 'move together' and underestimate how uncertain the combination is. This is a fundamental limitation of the factorization, not a bug.
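A minimal numerical illustration of this effect, assuming a bivariate Gaussian posterior with unit marginal variances and correlation $\rho$ (not a model from this page): the optimal mean-field Gaussian factor has precision equal to the corresponding diagonal entry of the joint precision matrix, so its variance is $1 - \rho^2$ while the true marginal variance is 1.

```python
import numpy as np

# Bivariate Gaussian posterior with unit variances and correlation rho.
# The optimal mean-field Gaussian factor q(z_i) has variance 1 / Lambda_ii,
# where Lambda is the joint precision matrix; here that equals 1 - rho**2.
for rho in [0.0, 0.5, 0.9, 0.99]:
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    Lambda = np.linalg.inv(Sigma)
    mf_var = 1.0 / Lambda[0, 0]
    print(f"rho = {rho:.2f}: true marginal variance = 1.000, mean-field variance = {mf_var:.3f}")
```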
| Model Type | Mean-Field Suitability | Notes |
|---|---|---|
| Mixture models (GMM, LDA) | Good | Local latents often weakly correlated |
| Factor models (PCA, FA) | Moderate | May miss factor correlations |
| Linear regression | Good | Especially for point estimates |
| Hierarchical models | Moderate | Global-local correlations can matter |
| Time series (HMM, LGSSM) | Moderate | Sequential structure helps some |
| Deep latent models (VAE) | Good with caveats | Amortization compensates somewhat |
| Spatial models | Variable | Depends on correlation length |
| Small Bayesian networks | Often poor | Strong conditional dependencies |
Understanding convergence is essential for effectively using mean-field variational inference. The key points:

- CAVI increases the ELBO monotonically and converges to a stationary point, but not necessarily to the global optimum.
- Convergence speed depends on coupling strength between latent variables, the conditioning of the ELBO landscape, dimensionality, and initialization quality.
- The ELBO landscape typically contains many local optima (from symmetry, multi-modality, and the factorization itself); multiple restarts, and sometimes annealing, are essential.
- Monitor the ELBO at every iteration: it should never decrease, and a decrease signals an implementation bug. Track parameter changes and model-specific diagnostics as well.
- Mean-field underestimates posterior variance for correlated variables, so its suitability depends on how strongly the posterior couples the latents.
The final page of this module examines the limitations of mean-field VI in depth. We'll explore exactly when and why the factorization assumption fails, and what alternatives exist for cases where mean-field isn't sufficient.
You now understand the convergence properties of mean-field VI: what's guaranteed, what affects convergence speed, how to diagnose problems, and when to expect good versus poor performance. This knowledge is essential for applying mean-field VI confidently and correctly.