The EM algorithm's practical success hinges on understanding its convergence properties. We've established that EM monotonically increases the log-likelihood—but what does this guarantee? Does EM always converge? How fast? To what kind of solution?
These questions have profound practical implications. If EM converges too slowly, we may need acceleration techniques. If EM finds only local optima, we need multiple restarts. If EM can get stuck at saddle points, we need detection mechanisms. This page provides the mathematical framework to answer these questions rigorously.
By the end of this page, you will understand: (1) the rigorous proof of EM's monotonicity property, (2) conditions under which EM converges, (3) the rate of convergence and why EM can be slow, (4) when EM finds global vs local optima, and (5) practical implications for implementation.
We now prove rigorously that each EM iteration increases (or maintains) the observed-data log-likelihood. This is the foundational guarantee that makes EM a reliable optimization algorithm.
Theorem (EM Monotonicity): For any EM iteration from $\boldsymbol{\theta}^{(t)}$ to $\boldsymbol{\theta}^{(t+1)}$:
$$\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) \geq \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
with equality if and only if the M-step fails to improve the Q-function and the posterior distribution over the latent variables is unchanged at $\boldsymbol{\theta}^{(t+1)}$.
Proof Setup: The Likelihood Decomposition
The key insight is to decompose the observed-data log-likelihood using the posterior distribution of latent variables. For any distribution $q(\mathbf{Z})$ over latent variables:
$$\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) = \log \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$
Multiplying and dividing by $q(\mathbf{Z})$:
$$\mathcal{L}(\boldsymbol{\theta}) = \log \sum_{\mathbf{Z}} q(\mathbf{Z}) \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}$$
Applying Jensen's Inequality
Since $\log$ is concave and we're taking a weighted average (with weights $q(\mathbf{Z})$), Jensen's inequality gives:
$$\mathcal{L}(\boldsymbol{\theta}) \geq \sum_{\mathbf{Z}} q(\mathbf{Z}) \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}$$
Define the Evidence Lower Bound (ELBO):
$$\text{ELBO}(q, \boldsymbol{\theta}) \triangleq \sum_{\mathbf{Z}} q(\mathbf{Z}) \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} = \mathbb{E}_{q(\mathbf{Z})} \left[ \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \right]$$
Thus: $\mathcal{L}(\boldsymbol{\theta}) \geq \text{ELBO}(q, \boldsymbol{\theta})$ for all $q$.
The gap between the log-likelihood and the ELBO is precisely the KL divergence between q(Z) and the true posterior: L(θ) - ELBO(q, θ) = KL(q(Z) || p(Z | X, θ)) ≥ 0. The bound is tight when q(Z) = p(Z | X, θ), i.e., when q matches the true posterior.
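As a sanity check on this decomposition, the identity $\mathcal{L}(\boldsymbol{\theta}) = \text{ELBO}(q, \boldsymbol{\theta}) + \text{KL}(q \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}))$ can be verified numerically on a tiny model. The sketch below assumes a two-component 1-D Gaussian mixture with a single observation; all variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Toy model: two-component 1-D Gaussian mixture, one observation x.
x = 1.3
pi = np.array([0.4, 0.6])      # mixing weights
mu = np.array([0.0, 2.0])      # component means
sigma = np.array([1.0, 0.5])   # component standard deviations

# Joint p(x, z = k) for each latent value, and the marginal log-likelihood.
joint = pi * norm.pdf(x, mu, sigma)
log_lik = np.log(joint.sum())            # log p(x)

# Exact posterior p(z | x) and an arbitrary distribution q(z).
posterior = joint / joint.sum()
q = np.array([0.7, 0.3])

# ELBO(q, theta) and KL(q || posterior).
elbo = np.sum(q * (np.log(joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

print(f"log-likelihood: {log_lik:.6f}")
print(f"ELBO + KL     : {elbo + kl:.6f}")   # matches the log-likelihood
print(f"gap (KL)      : {kl:.6f}")          # >= 0, zero iff q equals the posterior
```

Setting `q = posterior` drives the KL term to zero and makes the ELBO equal the log-likelihood, which is exactly what the E-step exploits.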
E-Step: Making the Bound Tight
In the E-step, we set $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})$. This choice makes the KL divergence zero (when evaluated at $\boldsymbol{\theta}^{(t)}$), so:
$$\text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta}^{(t)}) = \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
The ELBO equals the log-likelihood at the current parameters.
M-Step: Improving the Bound
In the M-step, we maximize the ELBO with respect to $\boldsymbol{\theta}$, keeping $q$ fixed:
$$\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} \text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta})$$
Since we're maximizing, we have:
$$\text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta}^{(t+1)}) \geq \text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta}^{(t)}) = \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
Completing the Proof
The log-likelihood at the new parameters is:
$$\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) \geq \text{ELBO}(p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}), \boldsymbol{\theta}^{(t+1)}) \geq \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
The first inequality is because ELBO is always a lower bound. The second inequality is from the M-step improvement.
Q.E.D. ∎
Each EM iteration is guaranteed to not decrease the log-likelihood. This means EM is a stable optimization algorithm—it never gets worse, only better (or stays the same at convergence). This property is not shared by gradient-based methods, which can oscillate or diverge without careful step-size tuning.
Given monotonicity, the sequence of log-likelihood values is non-decreasing; whenever it is bounded above (as it is for discrete data, and for continuous models once degenerate solutions are ruled out by regularization), the sequence must converge to some limit. But we need practical criteria to detect convergence.
Common Stopping Criteria
Likelihood-based: Stop when $|\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) - \mathcal{L}(\boldsymbol{\theta}^{(t)})| < \epsilon$
Parameter-based: Stop when $\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\| < \delta$
Relative improvement: Stop when $\frac{\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) - \mathcal{L}(\boldsymbol{\theta}^{(t)})}{|\mathcal{L}(\boldsymbol{\theta}^{(t)})|} < \epsilon_{\text{rel}}$
Gradient-based: Stop when $\|\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})\| < \gamma$
| Criterion | Advantages | Disadvantages | Typical Threshold |
|---|---|---|---|
| Likelihood change | Direct measure of convergence goal; scale-invariant | Requires computing likelihood; can miss parameter convergence | $\epsilon = 10^{-6}$ to $10^{-9}$ |
| Parameter change | Simple; detects oscillation | Scale-dependent; different parameters have different scales | $\delta = 10^{-4}$ to $10^{-6}$ |
| Relative improvement | Adaptive to likelihood scale | Fails if likelihood is near zero | $\epsilon_{\text{rel}} = 10^{-6}$ |
| Gradient norm | Theory-grounded (stationary point) | Expensive to compute; not always available | $\gamma = 10^{-5}$ |
Use a combination: stop when EITHER likelihood improvement drops below threshold OR maximum iterations reached. Set a generous iteration limit (e.g., 500-1000) as a safety net. For production systems, also monitor parameter stability—if parameters are still changing significantly but likelihood isn't, you may have numerical issues.
```python
import numpy as np


class EMConvergenceMonitor:
    """
    Monitor EM convergence with multiple criteria.
    """

    def __init__(
        self,
        ll_tol: float = 1e-6,
        param_tol: float = 1e-5,
        max_iters: int = 500,
        patience: int = 5
    ):
        self.ll_tol = ll_tol
        self.param_tol = param_tol
        self.max_iters = max_iters
        self.patience = patience
        self.ll_history = []
        self.no_improvement_count = 0

    def check_convergence(
        self,
        iteration: int,
        current_ll: float,
        theta_old: dict,
        theta_new: dict
    ) -> tuple[bool, str]:
        """
        Check if EM has converged.

        Returns:
            (converged, reason): Whether converged and why
        """
        self.ll_history.append(current_ll)

        # Check iteration limit
        if iteration >= self.max_iters:
            return True, "max_iterations_reached"

        # Need at least 2 iterations to check convergence
        if iteration < 2:
            return False, "warming_up"

        # Check likelihood improvement
        ll_improvement = current_ll - self.ll_history[-2]

        if ll_improvement < 0:
            # This should never happen - indicates a bug!
            return True, f"ERROR_likelihood_decreased_{ll_improvement:.2e}"

        if ll_improvement < self.ll_tol:
            self.no_improvement_count += 1
            if self.no_improvement_count >= self.patience:
                return True, "likelihood_converged"
        else:
            self.no_improvement_count = 0

        # Check parameter stability (relative change per parameter block)
        param_changes = []
        for key in theta_old:
            old_val = theta_old[key]
            new_val = theta_new[key]
            change = np.linalg.norm(new_val - old_val) / (np.linalg.norm(old_val) + 1e-10)
            param_changes.append(change)

        max_param_change = max(param_changes)
        if max_param_change < self.param_tol:
            return True, "parameters_converged"

        return False, "still_improving"

    def get_convergence_summary(self) -> dict:
        """Return summary statistics of the EM run."""
        return {
            "total_iterations": len(self.ll_history),
            "final_ll": self.ll_history[-1] if self.ll_history else None,
            "ll_improvement": (
                self.ll_history[-1] - self.ll_history[0]
                if len(self.ll_history) > 1 else 0
            ),
            "ll_history": self.ll_history,
        }
```

While EM is guaranteed to converge, its rate of convergence can range from very fast to painfully slow. Understanding this behavior is crucial for practical applications.
Linear Convergence
EM typically exhibits linear (first-order) convergence. Near a fixed point $\boldsymbol{\theta}^*$:
$$\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^*\| \approx r \, \|\boldsymbol{\theta}^{(t)} - \boldsymbol{\theta}^*\|$$
where $r \in [0, 1)$ is the convergence rate. Each iteration reduces the error by a constant factor.
This is slower than superlinear or quadratic convergence (like Newton's method), where the number of correct digits doubles each iteration.
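One practical way to see this linear behavior is to estimate the rate from the iterates themselves: for a linearly converging sequence, the ratio of successive parameter changes approaches $r$. The sketch below is illustrative, not part of any particular library; `theta_history` is assumed to be a list of flattened parameter vectors recorded during an EM run.

```python
import numpy as np


def estimate_linear_rate(theta_history):
    """
    Estimate the linear convergence rate r from recorded EM iterates.

    For linear convergence, ||theta^(t+1) - theta^(t)|| shrinks by roughly
    a constant factor r per iteration, so the ratio of successive step
    sizes is a simple estimate of r.

    theta_history: list of 1-D parameter vectors (flattened), one per iteration
    """
    steps = [np.linalg.norm(theta_history[t + 1] - theta_history[t])
             for t in range(len(theta_history) - 1)]
    ratios = [steps[t + 1] / steps[t]
              for t in range(len(steps) - 1) if steps[t] > 0]
    # Average the last few ratios, where the linear regime has typically set in.
    return float(np.mean(ratios[-5:])) if ratios else np.nan
```

An estimate close to 1 signals the slow regime analyzed next.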
The Rate of Convergence Matrix
The convergence rate $r$ is determined by the spectrum of the rate matrix:
$$\mathbf{M} = \mathbf{I} - \mathbf{I}_{\text{comp}}(\boldsymbol{\theta}^*)^{-1} \mathbf{I}_{\text{obs}}(\boldsymbol{\theta}^*)$$
where $\mathbf{I}_{\text{obs}}(\boldsymbol{\theta}^*)$ is the observed-data information matrix and $\mathbf{I}_{\text{comp}}(\boldsymbol{\theta}^*)$ is the expected complete-data information matrix, both evaluated at the fixed point $\boldsymbol{\theta}^*$.
The convergence rate is $r = \lambda_{\max}(\mathbf{M})$, the largest eigenvalue of $\mathbf{M}$.
The eigenvalues of M represent the 'fraction of missing information' due to latent variables. When little information is missing (well-separated clusters), r is small and EM converges quickly. When much information is missing (overlapping clusters), r approaches 1 and EM becomes very slow. Intuitively: the more ambiguous the cluster assignments, the slower the convergence.
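In the scalar-parameter case the rate matrix reduces to a single number, which makes the "fraction of missing information" reading explicit (using the missing-information identity $I_{\text{comp}} = I_{\text{obs}} + I_{\text{mis}}$):

$$r = 1 - \frac{I_{\text{obs}}(\theta^*)}{I_{\text{comp}}(\theta^*)} = \frac{I_{\text{mis}}(\theta^*)}{I_{\text{comp}}(\theta^*)}$$

For example, if the latent variables account for 90% of the complete-data information ($I_{\text{obs}} = 0.1\, I_{\text{comp}}$), then $r = 0.9$: each iteration removes only about 10% of the remaining error, and roughly 22 iterations are needed per additional decimal digit of accuracy (since $0.9^{22} \approx 0.1$).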
Factors Affecting Convergence Speed
Cluster separation: Well-separated clusters → fast convergence ($r \ll 1$)
Cluster overlap: Highly overlapping clusters → slow convergence ($r \approx 1$); a quick demonstration follows this list
Dimensionality: Higher dimensions can slow convergence
Initialization quality: Poor initialization requires more iterations
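The effect of separation versus overlap can be checked directly with an off-the-shelf implementation. The sketch below assumes scikit-learn is installed and uses its `GaussianMixture` fitter purely for illustration; the helper name `iterations_to_fit` is made up, and the exact iteration counts will vary with the data and seed.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)


def iterations_to_fit(separation, n=2000):
    """Fit a 2-component GMM to a 1-D mixture with the given mean separation."""
    X = np.concatenate([
        rng.normal(0.0, 1.0, n // 2),
        rng.normal(separation, 1.0, n // 2),
    ]).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, tol=1e-6, max_iter=1000, random_state=0)
    gmm.fit(X)
    return gmm.n_iter_   # number of EM iterations actually run


print("well separated (means 8 sd apart):", iterations_to_fit(8.0), "iterations")
print("overlapping    (means 1 sd apart):", iterations_to_fit(1.0), "iterations")
```

The overlapping case typically needs many more iterations to reach the same tolerance, matching the fraction-of-missing-information argument above.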
EM's monotonicity guarantee has a critical limitation: it only ensures convergence to a local maximum, not necessarily the global maximum. The log-likelihood surface for mixture models is typically non-concave and multimodal, with many local maxima, saddle points, and flat regions induced by label-permutation symmetry.
The optimum that EM finds depends entirely on initialization.
This is a fundamental limitation, not a bug. The GMM log-likelihood surface has multiple local maxima. EM is a hill-climbing algorithm—it goes uphill from wherever it starts. If you start near a poor local maximum, EM will converge there, missing potentially much better solutions elsewhere.
Sources of Multiple Local Maxima
Label permutations: For $K$ components, there are $K!$ equivalent labelings (same likelihood, different parameter orderings). These are not distinct optima in a statistical sense, just symmetries.
Genuine local maxima: Different component configurations that explain the data reasonably well but with different likelihoods.
Degenerate solutions: Components that collapse onto individual points or small clusters, creating singular covariances with infinite likelihood (addressed by regularization).
Theoretical Results
For specific cases, theoretical guarantees exist: for example, when the mixture components are sufficiently well separated and the initialization lies in a suitable basin, EM can be shown to converge to (a neighborhood of) the global maximum. In general, however, no such guarantee is available, and local optima must be handled in practice.
The standard practical approach to handle local optima is multiple random restarts: run EM multiple times from different initializations and keep the best result.
The Multiple Restarts Algorithm
1. Draw $R$ different random initializations (for example, random responsibilities or K-means-based seeds).
2. Run EM to convergence from each initialization.
3. Record the final log-likelihood of each run.
4. Keep the parameters from the run with the highest final log-likelihood.
How Many Restarts?
The required number of restarts depends on the ruggedness of the likelihood surface: the degree of cluster overlap, the dimensionality of the data, the number of components, and the quality of the initialization scheme. Harder, more multimodal problems require more restarts.
```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor


def _single_run(args):
    """One EM run from a random initialization (module-level so it can be pickled)."""
    X, K, seed = args
    np.random.seed(seed)
    # initialize_gmm_random and run_em are assumed to be defined elsewhere.
    params = initialize_gmm_random(X, K)
    final_params, final_ll = run_em(X, params)
    return final_params, final_ll


def em_with_restarts(X, K, n_restarts=10, n_jobs=4, verbose=True):
    """
    Run EM with multiple random restarts in parallel.

    Parameters:
        X: (N, D) data matrix
        K: number of components
        n_restarts: number of random restarts
        n_jobs: number of parallel processes
        verbose: print progress

    Returns:
        best_params: Parameters from best run
        best_ll: Best log-likelihood achieved
        all_results: List of (params, ll) for all runs
    """
    # Run in parallel for efficiency
    seeds = np.random.randint(0, 2**31, size=n_restarts)
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        results = list(executor.map(_single_run, [(X, K, s) for s in seeds]))

    # Find best result
    best_idx = np.argmax([r[1] for r in results])
    best_params, best_ll = results[best_idx]

    if verbose:
        lls = [r[1] for r in results]
        print(f"Log-likelihood statistics across {n_restarts} restarts:")
        print(f"  Best:  {max(lls):.4f}")
        print(f"  Worst: {min(lls):.4f}")
        print(f"  Mean:  {np.mean(lls):.4f}")
        print(f"  Std:   {np.std(lls):.4f}")

    return best_params, best_ll, results


def analyze_restart_stability(results, rel_tol=1e-3):
    """
    Analyze stability across restarts: summarize how consistently
    the runs reach similar final log-likelihoods.
    """
    lls = np.array([r[1] for r in results])
    best_ll = lls.max()

    # Count how many runs are close to best
    close_to_best = np.sum(np.abs(lls - best_ll) / np.abs(best_ll) < rel_tol)
    stability = close_to_best / len(results)

    return {
        'stability': stability,  # Fraction of runs near best
        'n_distinct_optima': len(set(np.round(lls, 2))),
        'll_range': best_ll - lls.min(),
        'recommendation': 'stable' if stability > 0.8 else 'increase_restarts'
    }
```

Start with 10-20 restarts. If results are inconsistent (high variance in final likelihoods), increase to 50-100. For critical applications, use 100+ restarts. Monitor the distribution of final likelihoods—if many restarts achieve similar values, you've likely found the global optimum. If widely scattered, either increase restarts or improve initialization strategy.
EM's monotonicity guarantee proves convergence to a stationary point—where the gradient is zero. But stationary points include not just local maxima, but also saddle points and, in principle, other stationary configurations of the likelihood surface.
Saddle Points in GMMs
Saddle points can occur, for example, when two components collapse onto (nearly) identical parameters, so that pulling them apart increases the likelihood while other perturbations decrease it, or when symmetric configurations of the components balance each other.
In practice, saddle points are unstable—small perturbations typically escape them. But EM can spend many iterations near a saddle point before escaping.
The most dangerous degenerate solution occurs when a Gaussian component 'collapses' onto a single data point. As σₖ → 0, the density at that point, N(x | x, σₖ²), grows without bound, and the likelihood diverges to infinity. This is a fundamental pathology of GMMs, not an EM bug. Without regularization, the likelihood is unbounded: it can be made arbitrarily large by shrinking σₖ toward 0 for some component!
Preventing Degenerate Solutions
Regularization: Add a small constant to diagonal of covariances: $\boldsymbol{\Sigma}_k \leftarrow \boldsymbol{\Sigma}_k + \epsilon \mathbf{I}$
Prior/MAP estimation: Use a Wishart prior on covariances (Bayesian approach)
Minimum variance constraint: Force $\lambda_{\min}(\boldsymbol{\Sigma}_k) \geq \sigma^2_{\min}$
Tied covariances: Force all components to share the same covariance
Detection and restart: Monitor for components with suspiciously small variances
```python
import numpy as np


def prevent_singularities(sigma, min_eigenvalue=1e-3):
    """
    Regularize covariance matrices to prevent singularities.

    Parameters:
        sigma: (K, D, D) array of covariance matrices
        min_eigenvalue: minimum allowed eigenvalue

    Returns:
        sigma_reg: regularized covariance matrices
    """
    K, D, _ = sigma.shape
    sigma_reg = sigma.copy()

    for k in range(K):
        # Eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eigh(sigma[k])

        # Clip minimum eigenvalue
        eigenvalues_clipped = np.maximum(eigenvalues, min_eigenvalue)

        # Reconstruct covariance
        sigma_reg[k] = eigenvectors @ np.diag(eigenvalues_clipped) @ eigenvectors.T

        # Ensure symmetry (numerical errors can break this)
        sigma_reg[k] = (sigma_reg[k] + sigma_reg[k].T) / 2

    return sigma_reg


def detect_degenerate_components(sigma, N_k, min_effective_points=5):
    """
    Detect components that may be degenerating.

    Returns:
        list of (component index, reason) pairs flagging potential issues
    """
    K = len(N_k)
    degenerate_components = []

    for k in range(K):
        # Check effective sample size
        if N_k[k] < min_effective_points:
            degenerate_components.append((k, 'low_effective_count'))
            continue

        # Check condition number (ratio of max to min eigenvalue)
        eigenvalues = np.linalg.eigvalsh(sigma[k])
        condition_number = eigenvalues.max() / (eigenvalues.min() + 1e-10)

        if condition_number > 1e6:
            degenerate_components.append((k, 'ill_conditioned'))

        if eigenvalues.min() < 1e-6:
            degenerate_components.append((k, 'near_singular'))

    return degenerate_components
```

When standard EM converges too slowly, several acceleration techniques can dramatically speed up convergence while maintaining the monotonicity property.
1. Aitken's Δ² Acceleration
Aitken's method accelerates linearly converging sequences. For a sequence of likelihood values $\{\mathcal{L}^{(t)}\}$:
$$\mathcal{L}^{\infty} \approx \mathcal{L}^{(t)} - \frac{(\mathcal{L}^{(t+1)} - \mathcal{L}^{(t)})^2}{\mathcal{L}^{(t+2)} - 2\mathcal{L}^{(t+1)} + \mathcal{L}^{(t)}}$$
This extrapolates to estimate the converged likelihood, useful for early stopping.
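A minimal sketch of this extrapolation is shown below; it assumes `ll_history` is the list of log-likelihood values recorded during an EM run (for example by the convergence monitor above), and the function names are illustrative.

```python
def aitken_extrapolate(ll_history):
    """
    Aitken's delta-squared estimate of the converged log-likelihood,
    using the last three recorded values. Returns None if the estimate
    is not defined (e.g. the second difference is ~0 at convergence).
    """
    if len(ll_history) < 3:
        return None
    l0, l1, l2 = ll_history[-3], ll_history[-2], ll_history[-1]
    denom = l2 - 2 * l1 + l0          # second difference
    if abs(denom) < 1e-12:
        return None                   # sequence already (numerically) converged
    return l0 - (l1 - l0) ** 2 / denom


def aitken_converged(ll_history, tol=1e-8):
    """Early-stopping test: current value is within tol of the extrapolated limit."""
    est = aitken_extrapolate(ll_history)
    return est is not None and abs(est - ll_history[-1]) < tol
```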
2. Parameter Expansion EM (PX-EM)
PX-EM introduces auxiliary parameters that are later integrated out, effectively increasing the step size while maintaining monotonicity. For GMMs, this involves expanding the covariance parameterization.
3. Squared Iterative Methods (SQUAREM)
SQUAREM applies a vector extrapolation to the EM update:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - 2\alpha \mathbf{r}_1 + \alpha^2 \mathbf{r}_2$$
where $\mathbf{r}_1 = \boldsymbol{\theta}_1 - \boldsymbol{\theta}^{(t)}$ and $\mathbf{r}_2 = (\boldsymbol{\theta}_2 - \boldsymbol{\theta}_1) - \mathbf{r}_1$ are computed from two consecutive EM updates $\boldsymbol{\theta}_1 = F(\boldsymbol{\theta}^{(t)})$, $\boldsymbol{\theta}_2 = F(\boldsymbol{\theta}_1)$, with $F$ denoting one EM iteration. The step length $\alpha$ is chosen (typically by a simple norm-ratio formula combined with a safeguard) so that the likelihood does not decrease.
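The sketch below shows one SQUAREM-style step under stated assumptions: `em_update` is a function performing a single EM iteration on a flattened parameter vector, `log_likelihood` evaluates the observed-data log-likelihood, and the step length uses the common $\alpha = -\|\mathbf{r}_1\|/\|\mathbf{r}_2\|$ choice with a fallback to the plain EM result; it is not the full published algorithm.

```python
import numpy as np


def squarem_step(theta, em_update, log_likelihood):
    """
    One SQUAREM-style accelerated step built on top of a plain EM update.

    theta:          flattened parameter vector, shape (P,)
    em_update:      function mapping a parameter vector to one EM update of it
    log_likelihood: function returning the observed-data log-likelihood
    """
    theta1 = em_update(theta)            # first EM step
    theta2 = em_update(theta1)           # second EM step

    r1 = theta1 - theta                  # first difference
    r2 = (theta2 - theta1) - r1          # second difference

    norm_r2 = np.linalg.norm(r2)
    if norm_r2 < 1e-12:
        return theta2                    # already (numerically) at a fixed point

    alpha = -np.linalg.norm(r1) / norm_r2
    theta_acc = theta - 2 * alpha * r1 + alpha ** 2 * r2

    # Safeguard: preserve monotonicity by falling back to the plain EM result
    # whenever the extrapolated point does not improve the likelihood.
    if log_likelihood(theta_acc) >= log_likelihood(theta2):
        return theta_acc
    return theta2
```

Note that $\alpha = -1$ recovers the plain double EM step $\boldsymbol{\theta}_2$, so the extrapolation can only help relative to standard EM when the safeguard is in place.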
4. Annealing EM
Introduce a temperature parameter $T$ that "flattens" the posterior:
$$\gamma_{nk} \propto \left( \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)^{1/T}$$
Start with high $T$ (soft assignments, broad search), gradually decrease to $T = 1$ (standard EM). This helps escape local optima.
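The only change relative to the standard E-step is the exponent $1/T$, which is most safely applied in log space. A minimal sketch, assuming SciPy is available (the function name and parameter layout are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp


def tempered_responsibilities(X, pi, mu, sigma, T=1.0):
    """
    E-step with a temperature T >= 1. T = 1 recovers the standard E-step;
    larger T flattens the posterior (softer assignments).

    X:     (N, D) data
    pi:    (K,) mixing weights
    mu:    (K, D) component means
    sigma: (K, D, D) component covariances
    """
    N, K = X.shape[0], len(pi)
    log_joint = np.empty((N, K))
    for k in range(K):
        log_joint[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], sigma[k])

    # Raising pi_k * N(x | mu_k, Sigma_k) to the power 1/T means dividing the
    # log-joint by T; then renormalize across components.
    tempered = log_joint / T
    log_gamma = tempered - logsumexp(tempered, axis=1, keepdims=True)
    return np.exp(log_gamma)
```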
| Method | Speedup Factor | Maintains Monotonicity | Implementation Complexity |
|---|---|---|---|
| Standard EM | 1× (baseline) | Yes | Low |
| Aitken's Δ² | Better convergence detection | Yes (just prediction) | Low |
| PX-EM | 2-10× | Yes | Medium |
| SQUAREM | 5-50× | Yes (with line search) | Medium |
| Annealing EM | Better optima, similar speed | No (during annealing) | Low |
The next page examines initialization strategies—arguably the most important practical consideration for EM. Good initialization reduces the number of required restarts, speeds up convergence, and increases the probability of finding good local optima. We'll cover random initialization, K-means++ seeding, and more sophisticated approaches.