The EM algorithm's practical success hinges on understanding its convergence properties. We've established that EM monotonically increases the log-likelihood—but what does this guarantee? Does EM always converge? How fast? To what kind of solution?
These questions have profound practical implications. If EM converges too slowly, we may need acceleration techniques. If EM finds only local optima, we need multiple restarts. If EM can get stuck at saddle points, we need detection mechanisms. This page provides the mathematical framework to answer these questions rigorously.
By the end of this page, you will understand: (1) the rigorous proof of EM's monotonicity property, (2) conditions under which EM converges, (3) the rate of convergence and why EM can be slow, (4) when EM finds global vs local optima, and (5) practical implications for implementation.
We now prove rigorously that each EM iteration increases (or maintains) the observed-data log-likelihood. This is the foundational guarantee that makes EM a reliable optimization algorithm.
Theorem (EM Monotonicity): For any EM iteration from $\boldsymbol{\theta}^{(t)}$ to $\boldsymbol{\theta}^{(t+1)}$:
$$\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) \geq \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
with equality if and only if the M-step fails to improve the Q-function and the posterior distribution over the latent variables is unchanged at $\boldsymbol{\theta}^{(t+1)}$.
Proof Setup: The Likelihood Decomposition
The key insight is to decompose the observed-data log-likelihood using the posterior distribution of latent variables. For any distribution $q(\mathbf{Z})$ over latent variables:
$$\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) = \log \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$
Multiplying and dividing by $q(\mathbf{Z})$:
$$\mathcal{L}(\boldsymbol{\theta}) = \log \sum_{\mathbf{Z}} q(\mathbf{Z}) \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}$$
Applying Jensen's Inequality
Since $\log$ is concave and we're taking a weighted average (with weights $q(\mathbf{Z})$), Jensen's inequality gives:
$$\mathcal{L}(\boldsymbol{\theta}) \geq \sum_{\mathbf{Z}} q(\mathbf{Z}) \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}$$
Define the Evidence Lower Bound (ELBO):
$$\text{ELBO}(q, \boldsymbol{\theta}) \triangleq \sum_{\mathbf{Z}} q(\mathbf{Z}) \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} = \mathbb{E}_{q(\mathbf{Z})} \left[ \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \right]$$
Thus: $\mathcal{L}(\boldsymbol{\theta}) \geq \text{ELBO}(q, \boldsymbol{\theta})$ for all $q$.
The gap between the log-likelihood and the ELBO is precisely the KL divergence between q(Z) and the true posterior: L(θ) - ELBO(q, θ) = KL(q(Z) || p(Z | X, θ)) ≥ 0. The bound is tight when q(Z) = p(Z | X, θ), i.e., when q matches the true posterior.
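As a sanity check on this decomposition, the identity $\mathcal{L}(\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}) + \text{KL}(q \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}))$ can be verified numerically on a tiny model. The sketch below assumes a two-component 1-D Gaussian mixture with a single observation; all variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Toy model: two-component 1-D Gaussian mixture, one observation x.
x = 1.3
pi = np.array([0.4, 0.6])      # mixing weights
mu = np.array([0.0, 2.0])      # component means
sigma = np.array([1.0, 0.5])   # component standard deviations

# Joint p(x, z = k) for each latent value, and the marginal log-likelihood.
joint = pi * norm.pdf(x, mu, sigma)
log_lik = np.log(joint.sum())            # log p(x)

# Exact posterior p(z | x) and an arbitrary distribution q(z).
posterior = joint / joint.sum()
q = np.array([0.7, 0.3])

# ELBO(q, theta) and KL(q || posterior).
elbo = np.sum(q * (np.log(joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

print(f"log-likelihood: {log_lik:.6f}")
print(f"ELBO + KL     : {elbo + kl:.6f}")   # matches the log-likelihood
print(f"gap (KL)      : {kl:.6f}")          # >= 0, zero iff q equals the posterior
```

Setting `q = posterior` drives the KL term to zero and makes the ELBO equal the log-likelihood, which is exactly what the E-step exploits.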
E-Step: Making the Bound Tight
In the E-step, we set $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})$. This choice makes the KL divergence zero (when evaluated at $\boldsymbol{\theta}^{(t)}$), so:
$$\text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta}^{(t)}) = \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
The ELBO equals the log-likelihood at the current parameters.
M-Step: Improving the Bound
In the M-step, we maximize the ELBO with respect to $\boldsymbol{\theta}$, keeping $q$ fixed:
$$\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta})$$
Since we're maximizing, we have:
$$\text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta}^{(t+1)}) \geq \text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta}^{(t)}) = \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
Completing the Proof
The log-likelihood at the new parameters is:
$$\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) \geq \text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta}^{(t+1)}) \geq \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
The first inequality is because ELBO is always a lower bound. The second inequality is from the M-step improvement.
Q.E.D. ∎
Each EM iteration is guaranteed to not decrease the log-likelihood. This means EM is a stable optimization algorithm—it never gets worse, only better (or stays the same at convergence). This property is not shared by gradient-based methods, which can oscillate or diverge without careful step-size tuning.
Given monotonicity, the sequence of log-likelihood values is non-decreasing; whenever it is bounded above (as it is for discrete data, and for continuous models once degenerate solutions are ruled out by regularization), the sequence must converge to some limit. But we need practical criteria to detect convergence.
Common Stopping Criteria
Likelihood-based: Stop when $|\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) - \mathcal{L}(\boldsymbol{\theta}^{(t)})| < \epsilon$
Parameter-based: Stop when $\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\| < \delta$
Relative improvement: Stop when $\frac{\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) - \mathcal{L}(\boldsymbol{\theta}^{(t)})}{|\mathcal{L}(\boldsymbol{\theta}^{(t)})|} < \epsilon_{\text{rel}}$
Gradient-based: Stop when $\|\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})\| < \gamma$
| Criterion | Advantages | Disadvantages | Typical Threshold |
|---|---|---|---|
| Likelihood change | Direct measure of convergence goal; scale-invariant | Requires computing likelihood; can miss parameter convergence | $\epsilon = 10^{-6}$ to $10^{-9}$ |
| Parameter change | Simple; detects oscillation | Scale-dependent; different parameters have different scales | $\delta = 10^{-4}$ to $10^{-6}$ |
| Relative improvement | Adaptive to likelihood scale | Fails if likelihood is near zero | $\epsilon_{\text{rel}} = 10^{-6}$ |
| Gradient norm | Theory-grounded (stationary point) | Expensive to compute; not always available | $\gamma = 10^{-5}$ |
Use a combination: stop when EITHER likelihood improvement drops below threshold OR maximum iterations reached. Set a generous iteration limit (e.g., 500-1000) as a safety net. For production systems, also monitor parameter stability—if parameters are still changing significantly but likelihood isn't, you may have numerical issues.
```python
import numpy as np


class EMConvergenceMonitor:
    """
    Monitor EM convergence with multiple criteria.
    """

    def __init__(
        self,
        ll_tol: float = 1e-6,
        param_tol: float = 1e-5,
        max_iters: int = 500,
        patience: int = 5
    ):
        self.ll_tol = ll_tol
        self.param_tol = param_tol
        self.max_iters = max_iters
        self.patience = patience
        self.ll_history = []
        self.no_improvement_count = 0

    def check_convergence(
        self,
        iteration: int,
        current_ll: float,
        theta_old: dict,
        theta_new: dict
    ) -> tuple[bool, str]:
        """
        Check if EM has converged.

        Returns:
            (converged, reason): Whether converged and why
        """
        self.ll_history.append(current_ll)

        # Check iteration limit
        if iteration >= self.max_iters:
            return True, "max_iterations_reached"

        # Need at least 2 iterations to check convergence
        if iteration < 2:
            return False, "warming_up"

        # Check likelihood improvement
        ll_improvement = current_ll - self.ll_history[-2]

        if ll_improvement < 0:
            # This should never happen - indicates a bug!
            return True, f"ERROR_likelihood_decreased_{ll_improvement:.2e}"

        if ll_improvement < self.ll_tol:
            self.no_improvement_count += 1
            if self.no_improvement_count >= self.patience:
                return True, "likelihood_converged"
        else:
            self.no_improvement_count = 0

        # Check parameter stability (relative change per parameter block)
        param_changes = []
        for key in theta_old:
            old_val = theta_old[key]
            new_val = theta_new[key]
            change = np.linalg.norm(new_val - old_val) / (np.linalg.norm(old_val) + 1e-10)
            param_changes.append(change)

        max_param_change = max(param_changes)
        if max_param_change < self.param_tol:
            return True, "parameters_converged"

        return False, "still_improving"

    def get_convergence_summary(self) -> dict:
        """Return summary statistics of the EM run."""
        return {
            "total_iterations": len(self.ll_history),
            "final_ll": self.ll_history[-1] if self.ll_history else None,
            "ll_improvement": (
                self.ll_history[-1] - self.ll_history[0]
                if len(self.ll_history) > 1 else 0
            ),
            "ll_history": self.ll_history,
        }
```

While EM is guaranteed to converge, its rate of convergence can range from very fast to painfully slow. Understanding this behavior is crucial for practical applications.
Linear Convergence
EM typically exhibits linear (first-order) convergence. Near a fixed point $\boldsymbol{\theta}^*$:
$$\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^*\| \approx r \, \|\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*\|$$
where $r \in [0, 1)$ is the convergence rate. Each iteration reduces the error by a constant factor.
This is slower than superlinear or quadratic convergence (like Newton's method), where the number of correct digits doubles each iteration.
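One practical way to see this linear behavior is to estimate the rate from the iterates themselves: for a linearly converging sequence, the ratio of successive parameter changes approaches $r$. The sketch below is illustrative, not part of any particular library; `theta_history` is assumed to be a list of flattened parameter vectors recorded during an EM run.

```python
import numpy as np


def estimate_linear_rate(theta_history):
    """
    Estimate the linear convergence rate r from recorded EM iterates.

    For linear convergence, ||theta^(t+1) - theta^(t)|| shrinks by roughly
    a constant factor r per iteration, so the ratio of successive step
    sizes is a simple estimate of r.

    theta_history: list of 1-D parameter vectors (flattened), one per iteration
    """
    steps = [np.linalg.norm(theta_history[t + 1] - theta_history[t])
             for t in range(len(theta_history) - 1)]
    ratios = [steps[t + 1] / steps[t]
              for t in range(len(steps) - 1) if steps[t] > 0]
    # Average the last few ratios, where the linear regime has typically set in.
    return float(np.mean(ratios[-5:])) if ratios else np.nan
```

An estimate close to 1 signals the slow regime analyzed next.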
The Rate of Convergence Matrix
The convergence rate $r$ is determined by the spectrum of the rate matrix:
$$\mathbf{M} = \mathbf{I} - \mathbf{I}_{\text{comp}}(\boldsymbol{\theta}^*)^{-1} \mathbf{I}_{\text{obs}}(\boldsymbol{\theta}^*)$$
where $\mathbf{I}_{\text{obs}}(\boldsymbol{\theta}^*)$ is the observed-data information matrix and $\mathbf{I}_{\text{comp}}(\boldsymbol{\theta}^*)$ is the expected complete-data information matrix, both evaluated at the fixed point $\boldsymbol{\theta}^*$.
The convergence rate is $r = \lambda_{\max}(\mathbf{M})$, the largest eigenvalue of $\mathbf{M}$.
The eigenvalues of M represent the 'fraction of missing information' due to latent variables. When little information is missing (well-separated clusters), r is small and EM converges quickly. When much information is missing (overlapping clusters), r approaches 1 and EM becomes very slow. Intuitively: the more ambiguous the cluster assignments, the slower the convergence.
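In the scalar-parameter case the rate matrix reduces to a single number, which makes the "fraction of missing information" reading explicit (using the missing-information identity $I_{\text{comp}} = I_{\text{obs}} + I_{\text{mis}}$):

$$r = 1 - \frac{I_{\text{obs}}(\theta^*)}{I_{\text{comp}}(\theta^*)} = \frac{I_{\text{mis}}(\theta^*)}{I_{\text{comp}}(\theta^*)}$$

For example, if the latent variables account for 90% of the complete-data information ($I_{\text{obs}} = 0.1\, I_{\text{comp}}$), then $r = 0.9$: each iteration removes only about 10% of the remaining error, and roughly 22 iterations are needed per additional decimal digit of accuracy (since $0.9^{22} \approx 0.1$).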
Factors Affecting Convergence Speed
Cluster separation: Well-separated clusters → fast convergence ($r \ll 1$)
Cluster overlap: Highly overlapping clusters → slow convergence ($r \approx 1$); a quick demonstration follows this list
Dimensionality: Higher dimensions can slow convergence
Initialization quality: Poor initialization requires more iterations
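The effect of separation versus overlap can be checked directly with an off-the-shelf implementation. The sketch below assumes scikit-learn is installed and uses its `GaussianMixture` fitter purely for illustration; the helper name `iterations_to_fit` is made up, and the exact iteration counts will vary with the data and seed.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)


def iterations_to_fit(separation, n=2000):
    """Fit a 2-component GMM to a 1-D mixture with the given mean separation."""
    X = np.concatenate([
        rng.normal(0.0, 1.0, n // 2),
        rng.normal(separation, 1.0, n // 2),
    ]).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, tol=1e-6, max_iter=1000, random_state=0)
    gmm.fit(X)
    return gmm.n_iter_   # number of EM iterations actually run


print("well separated (means 8 sd apart):", iterations_to_fit(8.0), "iterations")
print("overlapping    (means 1 sd apart):", iterations_to_fit(1.0), "iterations")
```

The overlapping case typically needs many more iterations to reach the same tolerance, matching the fraction-of-missing-information argument above.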
EM's monotonicity guarantee has a critical limitation: it only ensures convergence to a local maximum, not necessarily the global maximum. The log-likelihood surface for mixture models is typically non-concave and multimodal, with many local maxima, saddle points, and flat regions induced by label-permutation symmetry.
The optimum that EM finds depends entirely on initialization.
This is a fundamental limitation, not a bug. The GMM log-likelihood surface has multiple local maxima. EM is a hill-climbing algorithm—it goes uphill from wherever it starts. If you start near a poor local maximum, EM will converge there, missing potentially much better solutions elsewhere.
Sources of Multiple Local Maxima
Label permutations: For $K$ components, there are $K!$ equivalent labelings (same likelihood, different parameter orderings). These are not distinct optima in a statistical sense, just symmetries.
Genuine local maxima: Different component configurations that explain the data reasonably well but with different likelihoods.
Degenerate solutions: Components that collapse onto individual points or small clusters, creating singular covariances with infinite likelihood (addressed by regularization).
Theoretical Results
For specific cases, theoretical guarantees exist: for example, when the mixture components are sufficiently well separated and the initialization lies in a suitable basin, EM can be shown to converge to (a neighborhood of) the global maximum. In general, however, no such guarantee is available, and local optima must be handled in practice.
The standard practical approach to handle local optima is multiple random restarts: run EM multiple times from different initializations and keep the best result.
The Multiple Restarts Algorithm
1. Draw $R$ different random initializations (for example, random responsibilities or K-means-based seeds).
2. Run EM to convergence from each initialization.
3. Record the final log-likelihood of each run.
4. Keep the parameters from the run with the highest final log-likelihood.
How Many Restarts?
The required number of restarts depends on the ruggedness of the likelihood surface: the degree of cluster overlap, the dimensionality of the data, the number of components, and the quality of the initialization scheme. Harder, more multimodal problems require more restarts.
```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor


def _single_run(args):
    """One EM run from a random initialization (module-level so it can be pickled)."""
    X, K, seed = args
    np.random.seed(seed)
    # initialize_gmm_random and run_em are assumed to be defined elsewhere.
    params = initialize_gmm_random(X, K)
    final_params, final_ll = run_em(X, params)
    return final_params, final_ll


def em_with_restarts(X, K, n_restarts=10, n_jobs=4, verbose=True):
    """
    Run EM with multiple random restarts in parallel.

    Parameters:
        X: (N, D) data matrix
        K: number of components
        n_restarts: number of random restarts
        n_jobs: number of parallel processes
        verbose: print progress

    Returns:
        best_params: Parameters from best run
        best_ll: Best log-likelihood achieved
        all_results: List of (params, ll) for all runs
    """
    # Run in parallel for efficiency
    seeds = np.random.randint(0, 2**31, size=n_restarts)
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        results = list(executor.map(_single_run, [(X, K, s) for s in seeds]))

    # Find best result
    best_idx = np.argmax([r[1] for r in results])
    best_params, best_ll = results[best_idx]

    if verbose:
        lls = [r[1] for r in results]
        print(f"Log-likelihood statistics across {n_restarts} restarts:")
        print(f"  Best:  {max(lls):.4f}")
        print(f"  Worst: {min(lls):.4f}")
        print(f"  Mean:  {np.mean(lls):.4f}")
        print(f"  Std:   {np.std(lls):.4f}")

    return best_params, best_ll, results


def analyze_restart_stability(results, rel_tol=1e-3):
    """
    Analyze stability across restarts: summarize how consistently
    the runs reach similar final log-likelihoods.
    """
    lls = np.array([r[1] for r in results])
    best_ll = lls.max()

    # Count how many runs are close to best
    close_to_best = np.sum(np.abs(lls - best_ll) / np.abs(best_ll) < rel_tol)
    stability = close_to_best / len(results)

    return {
        'stability': stability,  # Fraction of runs near best
        'n_distinct_optima': len(set(np.round(lls, 2))),
        'll_range': best_ll - lls.min(),
        'recommendation': 'stable' if stability > 0.8 else 'increase_restarts'
    }
```

Start with 10-20 restarts. If results are inconsistent (high variance in final likelihoods), increase to 50-100. For critical applications, use 100+ restarts. Monitor the distribution of final likelihoods—if many restarts achieve similar values, you've likely found the global optimum. If widely scattered, either increase restarts or improve initialization strategy.
EM's monotonicity guarantee proves convergence to a stationary point—where the gradient is zero. But stationary points include not just local maxima, but also saddle points and, in principle, other stationary configurations of the likelihood surface.
Saddle Points in GMMs
Saddle points can occur, for example, when two components collapse onto (nearly) identical parameters, so that pulling them apart increases the likelihood while other perturbations decrease it, or when symmetric configurations of the components balance each other.
In practice, saddle points are unstable—small perturbations typically escape them. But EM can spend many iterations near a saddle point before escaping.
The most dangerous degenerate solution occurs when a Gaussian component 'collapses' onto a single data point. As σₖ → 0, the density at that point, N(x | x, σₖ²), grows without bound, and the likelihood diverges to infinity. This is a fundamental pathology of GMMs, not an EM bug. Without regularization, the likelihood is unbounded: it can be made arbitrarily large by shrinking σₖ toward 0 for some component!
Preventing Degenerate Solutions
Regularization: Add a small constant to diagonal of covariances: $\boldsymbol{\Sigma}_k \leftarrow \boldsymbol{\Sigma}_k + \epsilon \mathbf{I}$
Prior/MAP estimation: Use a Wishart prior on covariances (Bayesian approach)
Minimum variance constraint: Force $\lambda_{\min}(\boldsymbol{\Sigma}_k) \geq \sigma^2_{\min}$
Tied covariances: Force all components to share the same covariance
Detection and restart: Monitor for components with suspiciously small variances
```python
import numpy as np


def prevent_singularities(sigma, min_eigenvalue=1e-3):
    """
    Regularize covariance matrices to prevent singularities.

    Parameters:
        sigma: (K, D, D) array of covariance matrices
        min_eigenvalue: minimum allowed eigenvalue

    Returns:
        sigma_reg: regularized covariance matrices
    """
    K, D, _ = sigma.shape
    sigma_reg = sigma.copy()

    for k in range(K):
        # Eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eigh(sigma[k])

        # Clip minimum eigenvalue
        eigenvalues_clipped = np.maximum(eigenvalues, min_eigenvalue)

        # Reconstruct covariance
        sigma_reg[k] = eigenvectors @ np.diag(eigenvalues_clipped) @ eigenvectors.T

        # Ensure symmetry (numerical errors can break this)
        sigma_reg[k] = (sigma_reg[k] + sigma_reg[k].T) / 2

    return sigma_reg


def detect_degenerate_components(sigma, N_k, min_effective_points=5):
    """
    Detect components that may be degenerating.

    Returns:
        list of (component index, reason) pairs flagging potential issues
    """
    K = len(N_k)
    degenerate_components = []

    for k in range(K):
        # Check effective sample size
        if N_k[k] < min_effective_points:
            degenerate_components.append((k, 'low_effective_count'))
            continue

        # Check condition number (ratio of max to min eigenvalue)
        eigenvalues = np.linalg.eigvalsh(sigma[k])
        condition_number = eigenvalues.max() / (eigenvalues.min() + 1e-10)

        if condition_number > 1e6:
            degenerate_components.append((k, 'ill_conditioned'))

        if eigenvalues.min() < 1e-6:
            degenerate_components.append((k, 'near_singular'))

    return degenerate_components
```

When standard EM converges too slowly, several acceleration techniques can dramatically speed up convergence while maintaining the monotonicity property.
1. Aitken's Δ² Acceleration
Aitken's method accelerates linearly converging sequences. For a sequence of likelihood values $\{\mathcal{L}^{(t)}\}$:
$$\mathcal{L}^{\infty} \approx \mathcal{L}^{(t)} - \frac{(\mathcal{L}^{(t+1)} - \mathcal{L}^{(t)})^2}{\mathcal{L}^{(t+2)} - 2\mathcal{L}^{(t+1)} + \mathcal{L}^{(t)}}$$
This extrapolates to estimate the converged likelihood, useful for early stopping.
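A minimal sketch of this extrapolation is shown below; it assumes `ll_history` is the list of log-likelihood values recorded during an EM run (for example by the convergence monitor above), and the function names are illustrative.

```python
def aitken_extrapolate(ll_history):
    """
    Aitken's delta-squared estimate of the converged log-likelihood,
    using the last three recorded values. Returns None if the estimate
    is not defined (e.g. the second difference is ~0 at convergence).
    """
    if len(ll_history) < 3:
        return None
    l0, l1, l2 = ll_history[-3], ll_history[-2], ll_history[-1]
    denom = l2 - 2 * l1 + l0          # second difference
    if abs(denom) < 1e-12:
        return None                   # sequence already (numerically) converged
    return l0 - (l1 - l0) ** 2 / denom


def aitken_converged(ll_history, tol=1e-8):
    """Early-stopping test: current value is within tol of the extrapolated limit."""
    est = aitken_extrapolate(ll_history)
    return est is not None and abs(est - ll_history[-1]) < tol
```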
2. Parameter Expansion EM (PX-EM)
PX-EM introduces auxiliary parameters that are later integrated out, effectively increasing the step size while maintaining monotonicity. For GMMs, this involves expanding the covariance parameterization.
3. Squared Iterative Methods (SQUAREM)
SQUAREM applies a vector extrapolation to the EM update:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - 2\alpha \mathbf{r}_1 + \alpha^2 \mathbf{r}_2$$
where $\mathbf{r}_1 = \boldsymbol{\theta}_1 - \boldsymbol{\theta}^{(t)}$ and $\mathbf{r}_2 = (\boldsymbol{\theta}_2 - \boldsymbol{\theta}_1) - \mathbf{r}_1$ are computed from two consecutive EM updates $\boldsymbol{\theta}_1 = F(\boldsymbol{\theta}^{(t)})$, $\boldsymbol{\theta}_2 = F(\boldsymbol{\theta}_1)$, with $F$ denoting one EM iteration. The step length $\alpha$ is chosen (typically by a simple norm-ratio formula combined with a safeguard) so that the likelihood does not decrease.
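The sketch below shows one SQUAREM-style step under stated assumptions: `em_update` is a function performing a single EM iteration on a flattened parameter vector, `log_likelihood` evaluates the observed-data log-likelihood, and the step length uses the common $\alpha = -\|\mathbf{r}_1\|/\|\mathbf{r}_2\|$ choice with a fallback to the plain EM result; it is not the full published algorithm.

```python
import numpy as np


def squarem_step(theta, em_update, log_likelihood):
    """
    One SQUAREM-style accelerated step built on top of a plain EM update.

    theta:          flattened parameter vector, shape (P,)
    em_update:      function mapping a parameter vector to one EM update of it
    log_likelihood: function returning the observed-data log-likelihood
    """
    theta1 = em_update(theta)            # first EM step
    theta2 = em_update(theta1)           # second EM step

    r1 = theta1 - theta                  # first difference
    r2 = (theta2 - theta1) - r1          # second difference

    norm_r2 = np.linalg.norm(r2)
    if norm_r2 < 1e-12:
        return theta2                    # already (numerically) at a fixed point

    alpha = -np.linalg.norm(r1) / norm_r2
    theta_acc = theta - 2 * alpha * r1 + alpha ** 2 * r2

    # Safeguard: preserve monotonicity by falling back to the plain EM result
    # whenever the extrapolated point does not improve the likelihood.
    if log_likelihood(theta_acc) >= log_likelihood(theta2):
        return theta_acc
    return theta2
```

Note that $\alpha = -1$ recovers the plain double EM step $\boldsymbol{\theta}_2$, so the extrapolation can only help relative to standard EM when the safeguard is in place.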
4. Annealing EM
Introduce a temperature parameter $T$ that "flattens" the posterior:
$$\gamma_{nk} \propto \left( \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)^{1/T}$$
Start with high $T$ (soft assignments, broad search), gradually decrease to $T = 1$ (standard EM). This helps escape local optima.
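The only change relative to the standard E-step is the exponent $1/T$, which is most safely applied in log space. A minimal sketch, assuming SciPy is available (the function name and parameter layout are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp


def tempered_responsibilities(X, pi, mu, sigma, T=1.0):
    """
    E-step with a temperature T >= 1. T = 1 recovers the standard E-step;
    larger T flattens the posterior (softer assignments).

    X:     (N, D) data
    pi:    (K,) mixing weights
    mu:    (K, D) component means
    sigma: (K, D, D) component covariances
    """
    N, K = X.shape[0], len(pi)
    log_joint = np.empty((N, K))
    for k in range(K):
        log_joint[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], sigma[k])

    # Raising pi_k * N(x | mu_k, Sigma_k) to the power 1/T means dividing the
    # log-joint by T; then renormalize across components.
    tempered = log_joint / T
    log_gamma = tempered - logsumexp(tempered, axis=1, keepdims=True)
    return np.exp(log_gamma)
```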
| Method | Speedup Factor | Maintains Monotonicity | Implementation Complexity |
|---|---|---|---|
| Standard EM | 1× (baseline) | Yes | Low |
| Aitken's Δ² | Better convergence detection | Yes (just prediction) | Low |
| PX-EM | 2-10× | Yes | Medium |
| SQUAREM | 5-50× | Yes (with line search) | Medium |
| Annealing EM | Better optima, similar speed | No (during annealing) | Low |
The next page examines initialization strategies—arguably the most important practical consideration for EM. Good initialization reduces the number of required restarts, speeds up convergence, and increases the probability of finding good local optima. We'll cover random initialization, K-means++ seeding, and more sophisticated approaches.