Imagine you're analyzing customer behavior data and discover that your data appears to come from multiple distinct groups—but you don't know which customer belongs to which group. If you knew the group memberships, estimating the parameters of each group would be straightforward. Conversely, if you knew the parameters, inferring the group memberships would be easy. This is the classic chicken-and-egg problem of latent variable models.
The Expectation-Maximization (EM) algorithm elegantly solves this circular dependency by alternating between inferring latent variables given current parameters (the E-step) and updating parameters given inferred latent variables (the M-step). This simple yet profound idea forms the backbone of countless machine learning algorithms—from Gaussian Mixture Models to Hidden Markov Models, from topic models to missing data imputation.
By the end of this page, you will understand: (1) why standard maximum likelihood fails for mixture models, (2) how the EM algorithm introduces latent variables as a computational device, (3) the complete mathematical framework connecting EM to GMMs, and (4) why EM is guaranteed to never decrease the likelihood—providing a foundation for understanding its convergence properties in subsequent pages.
To appreciate EM's elegance, we must first understand why standard maximum likelihood estimation (MLE) fails for mixture models. Consider a Gaussian Mixture Model with $K$ components. For a single observation $\mathbf{x}$, the probability density is:
$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
where $\boldsymbol{\theta} = \{\pi_1, \ldots, \pi_K, \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$ contains all parameters. The mixing coefficients $\pi_k$ satisfy $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \geq 0$.
The Log-Likelihood Function
For $N$ i.i.d. observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the log-likelihood is:
$$\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
The critical observation is that the logarithm of a sum does not simplify nicely. Unlike single-component models where $\log p(\mathbf{x} \mid \boldsymbol{\theta})$ separates cleanly over parameters, here we cannot decompose the optimization problem.
Taking derivatives of the log-likelihood and setting them to zero yields coupled, nonlinear equations with no closed-form solution. The sum inside the logarithm creates a complex interdependence: each $\boldsymbol{\mu}_k$ affects the denominator of the responsibility (posterior probability) for every data point, which in turn affects the optimal values of all other $\boldsymbol{\mu}_j$. Gradient descent can theoretically work, but it's prone to slow convergence, saddle points, and numerical instabilities due to the non-convex landscape with multiple local optima.
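To make the structure of this objective concrete, here is a minimal sketch that evaluates the mixture log-likelihood directly; the function name `gmm_log_likelihood` and the toy parameter values are illustrative, not from the text. Note that the sum over components happens before the logarithm, which is exactly what blocks a per-component decomposition.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, sigma):
    """Evaluate sum_n log( sum_k pi_k * N(x_n | mu_k, Sigma_k) )."""
    N, K = X.shape[0], len(pi)
    weighted = np.zeros((N, K))
    for k in range(K):
        weighted[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
    # The inner sum over k is taken BEFORE the log: no closed-form MLE.
    return np.sum(np.log(weighted.sum(axis=1)))

# Illustrative toy data: three 2D points, two unit-covariance components
X = np.array([[0.1, 0.2], [2.9, 3.1], [0.0, -0.1]])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma = np.array([np.eye(2), np.eye(2)])
print(gmm_log_likelihood(X, pi, mu, sigma))
```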
The Pathological Derivative Structure
To see this concretely, consider the gradient with respect to $\boldsymbol{\mu}_k$:
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k)$$
Notice that the fraction—which we'll soon call the responsibility $\gamma_{nk}$—depends on all component parameters. Setting this gradient to zero gives:
$$\boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n}{\sum_{n=1}^{N} \gamma_{nk}}$$
This looks like a weighted mean, but $\gamma_{nk}$ itself depends on $\boldsymbol{\mu}_k$ and all other parameters! We have an implicit equation, not an explicit solution.
The key insight of EM is to introduce latent (hidden) variables that, if observed, would make the optimization tractable. For GMMs, we introduce a latent variable $\mathbf{z}_n$ for each observation $\mathbf{x}_n$, indicating which component generated that observation.
We encode $\mathbf{z}_n$ as a one-hot vector: $\mathbf{z}_n = (z_{n1}, \ldots, z_{nK})^\top$ where $z_{nk} \in \{0, 1\}$ and $\sum_{k=1}^{K} z_{nk} = 1$. The value $z_{nk} = 1$ indicates that observation $n$ was generated by component $k$.
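For intuition, the generative story behind these latent variables can be simulated directly. The sketch below (the `sample_gmm` helper and its parameter values are illustrative) draws each $\mathbf{z}_n$ first and then draws $\mathbf{x}_n$ from the selected component; this is exactly the process the model assumes, even though in real data the $\mathbf{z}_n$ are never observed.

```python
import numpy as np

def sample_gmm(N, pi, mu, sigma, rng=None):
    """Ancestral sampling: z_n ~ Categorical(pi), then x_n ~ N(mu_{z_n}, Sigma_{z_n})."""
    rng = np.random.default_rng(rng)
    K, D = mu.shape
    z = rng.choice(K, size=N, p=pi)  # component indices (argmax of the one-hot z_n)
    X = np.stack([rng.multivariate_normal(mu[k], sigma[k]) for k in z])
    return X, z  # in practice, z is hidden and only X is observed

# Illustrative usage: 500 points from a two-component 2D mixture
X, z = sample_gmm(500, np.array([0.3, 0.7]),
                  np.array([[0.0, 0.0], [3.0, 3.0]]),
                  np.array([np.eye(2), np.eye(2)]), rng=0)
```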
The Complete-Data Distribution
With latent variables included, we define the complete-data likelihood. The joint distribution of $(\mathbf{x}_n, \mathbf{z}_n)$ is:
$$p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}$$
Because $z_{nk}$ is either 0 or 1, this product selects exactly one component—the one that generated $\mathbf{x}_n$.
The complete-data log-likelihood has a beautiful property: the logarithm moves inside the product! For all $N$ observations, $\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n} \sum_{k} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$. This separates cleanly over components—if we knew $\mathbf{Z}$, standard MLE formulas would apply directly to each component independently.
Why Complete-Data Makes MLE Easy
The complete-data log-likelihood is:
$$\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Expanding the Gaussian:
$$= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k - \frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$$
Now, maximizing with respect to $\boldsymbol{\mu}_k$ involves only terms where $z_{nk} = 1$—effectively the data points assigned to component $k$. The optimal $\boldsymbol{\mu}_k$ is simply the sample mean of assigned points:
$$\boldsymbol{\mu}_k^{\text{ML}} = \frac{\sum_{n : z_{nk}=1} \mathbf{x}_n}{\sum_{n=1}^{N} z_{nk}} = \frac{\sum_{n : z_{nk}=1} \mathbf{x}_n}{N_k}$$
where $N_k = |\{n : z_{nk} = 1\}|$ is the count of points in component $k$.
| Parameter | Complete-Data MLE Formula | Interpretation |
|---|---|---|
| $\pi_k$ | $\frac{N_k}{N}$ | Proportion of points assigned to component $k$ |
| $\boldsymbol{\mu}_k$ | $\frac{1}{N_k} \sum_{n:z_{nk}=1} \mathbf{x}_n$ | Sample mean of assigned points |
| $\boldsymbol{\Sigma}_k$ | $\frac{1}{N_k} \sum_{n:z_{nk}=1} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top$ | Sample covariance of assigned points |
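If the assignments really were observed, the three table entries above reduce to a few lines of code. The sketch below is illustrative only: the `complete_data_mle` helper is not from the text, it encodes assignments as integer labels rather than one-hot vectors, and it assumes every component receives at least one point.

```python
import numpy as np

def complete_data_mle(X, z, K):
    """MLE for a GMM when the component labels z (shape (N,)) are fully observed."""
    N, D = X.shape
    pi = np.zeros(K)
    mu = np.zeros((K, D))
    sigma = np.zeros((K, D, D))
    for k in range(K):
        X_k = X[z == k]                  # points assigned to component k
        N_k = len(X_k)
        pi[k] = N_k / N                  # proportion of assigned points
        mu[k] = X_k.mean(axis=0)         # sample mean of assigned points
        diff = X_k - mu[k]
        sigma[k] = diff.T @ diff / N_k   # sample covariance of assigned points
    return pi, mu, sigma
```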
The latent variables $\mathbf{Z}$ are not observed—that's why they're called latent. The brilliant insight of EM is: instead of treating $z_{nk}$ as binary unknowns, we replace them with their expected values given current parameter estimates.
Let $\boldsymbol{\theta}^{(t)}$ denote our current parameter estimates at iteration $t$. The expected value of $z_{nk}$ given the observed data and current parameters is:
$$\mathbb{E}[z_{nk} \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}] = p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}) \triangleq \gamma_{nk}^{(t)}$$
Computing Responsibilities via Bayes' Theorem
The quantity $\gamma_{nk}$ is called the responsibility of component $k$ for observation $n$. Using Bayes' theorem:
$$\gamma_{nk}^{(t)} = \frac{p(z_{nk} = 1 \mid \boldsymbol{\theta}^{(t)}) \, p(\mathbf{x}_n \mid z_{nk} = 1, \boldsymbol{\theta}^{(t)})}{\sum_{j=1}^{K} p(z_{nj} = 1 \mid \boldsymbol{\theta}^{(t)}) \, p(\mathbf{x}_n \mid z_{nj} = 1, \boldsymbol{\theta}^{(t)})}$$
Substituting the GMM components:
$$\gamma_{nk}^{(t)} = \frac{\pi_k^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$$
This is a soft assignment: each data point has a responsibility value for each component, with $\sum_{k=1}^{K} \gamma_{nk} = 1$. Compare this to K-means, which uses hard assignment where each point belongs to exactly one cluster.
The responsibility $\gamma_{nk}$ is precisely the posterior probability that observation $\mathbf{x}_n$ was generated by component $k$, given the observed data and current parameters. It quantifies our uncertainty about the latent assignments—a principled Bayesian approach that naturally handles ambiguous data points near cluster boundaries.
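As a tiny worked example of this soft assignment (all numbers here are illustrative), consider a 1D point sitting between two unit-variance components: each component receives part of the responsibility rather than an all-or-nothing label.

```python
import numpy as np
from scipy.stats import norm

# One ambiguous point x = 1.5 between two 1D components
x = 1.5
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 3.0])
std = np.array([1.0, 1.0])

unnormalized = pi * norm.pdf(x, loc=mu, scale=std)  # pi_k * N(x | mu_k, sigma_k^2)
gamma = unnormalized / unnormalized.sum()           # responsibilities, sum to 1
print(gamma)  # both components share responsibility for this boundary point
```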
The Expected Complete-Data Log-Likelihood
With responsibilities in hand, we define the Q-function—the expected complete-data log-likelihood under the posterior distribution of latent variables:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]$$
Because $z_{nk}$ appears linearly in the complete-data log-likelihood, the expectation simply replaces $z_{nk}$ with $\gamma_{nk}^{(t)}$:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
The Q-function is now a function of the new parameters $\boldsymbol{\theta}$, with the responsibilities $\gamma_{nk}^{(t)}$ treated as fixed constants computed from the old parameters.
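Written as code, the Q-function is just a responsibility-weighted sum of per-component log densities. The sketch below is illustrative (the `q_function` helper and its `theta = (pi, mu, sigma)` packing are assumptions, not from the text); it makes explicit that $\gamma_{nk}^{(t)}$ enters only as fixed weights.

```python
import numpy as np
from scipy.stats import multivariate_normal

def q_function(theta, gamma, X):
    """Expected complete-data log-likelihood Q(theta, theta_t), with gamma held fixed."""
    pi, mu, sigma = theta
    N, K = gamma.shape
    Q = 0.0
    for k in range(K):
        # log pi_k + log N(x_n | mu_k, Sigma_k) for all n, weighted by gamma_nk
        log_comp = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], sigma[k])
        Q += np.sum(gamma[:, k] * log_comp)
    return Q
```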
The EM algorithm alternates between two steps until convergence:
E-Step (Expectation): Compute responsibilities $\gamma_{nk}^{(t)}$ using current parameters $\boldsymbol{\theta}^{(t)}$.
M-Step (Maximization): Update parameters by maximizing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$ with respect to $\boldsymbol{\theta}$.
Let's derive the complete update equations for GMMs.
```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iters=100, tol=1e-6):
    """
    EM algorithm for Gaussian Mixture Models.

    Parameters:
        X: (N, D) array of observations
        K: number of mixture components
        max_iters: maximum iterations
        tol: convergence tolerance on log-likelihood

    Returns:
        pi: (K,) mixing coefficients
        mu: (K, D) component means
        sigma: (K, D, D) component covariances
        log_likelihoods: history of log-likelihood values
    """
    N, D = X.shape

    # Initialize parameters (see initialization strategies later)
    pi = np.ones(K) / K
    mu = X[np.random.choice(N, K, replace=False)]
    sigma = np.array([np.eye(D) for _ in range(K)])

    log_likelihoods = []

    for iteration in range(max_iters):
        # ============ E-STEP ============
        # Compute responsibilities γ_nk = P(z_nk=1 | x_n, θ)
        gamma = np.zeros((N, K))
        for k in range(K):
            gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])

        # Normalize (γ_nk / Σ_j γ_nj)
        gamma_sum = gamma.sum(axis=1, keepdims=True)
        gamma = gamma / gamma_sum

        # Compute log-likelihood for convergence check
        log_likelihood = np.sum(np.log(gamma_sum))
        log_likelihoods.append(log_likelihood)

        if iteration > 0 and abs(log_likelihood - log_likelihoods[-2]) < tol:
            print(f"Converged at iteration {iteration}")
            break

        # ============ M-STEP ============
        # Update parameters using closed-form solutions

        # Effective number of points per component
        N_k = gamma.sum(axis=0)  # (K,)

        # Update mixing coefficients
        pi = N_k / N

        # Update means (responsibility-weighted average)
        mu = (gamma.T @ X) / N_k[:, np.newaxis]  # (K, D)

        # Update covariances
        for k in range(K):
            diff = X - mu[k]  # (N, D)
            # Weighted outer products
            sigma[k] = (gamma[:, k:k+1] * diff).T @ diff / N_k[k]
            # Add small regularization for numerical stability
            sigma[k] += 1e-6 * np.eye(D)

    return pi, mu, sigma, log_likelihoods
```

M-Step Closed-Form Solutions
Maximizing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$ with respect to $\boldsymbol{\theta}$ yields:
$$\pi_k^{(t+1)} = \frac{N_k}{N} \quad \text{where } N_k = \sum_{n=1}^{N} \gamma_{nk}^{(t)}$$
$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n$$
$$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^\top$$
Notice the elegant correspondence with complete-data MLE: instead of counting points assigned to component $k$ (hard assignment), we sum the responsibilities (soft assignment). The quantity $N_k$ is the effective number of points assigned to component $k$.
The EM update equations for GMMs are simply the MLE formulas with hard counts replaced by soft responsibilities. This reveals EM's core principle: when you don't know the latent variables, use their posterior expectations instead. The mathematical elegance—closed-form M-step updates resembling familiar statistics—makes GMM-EM one of the most intuitive examples of the EM algorithm.
Understanding EM geometrically provides valuable intuition for its behavior. Consider a 2D dataset that appears to come from two overlapping Gaussian clusters.
E-Step Geometry:
The E-step computes responsibilities based on the current Gaussians. For each data point, we ask: "Given the current component locations and shapes, which component was most likely responsible for generating this point?"
M-Step Geometry:
The M-step moves each component toward its responsible data points:
Mean update: The new mean is the responsibility-weighted centroid of all data points. Points with high responsibility pull the mean more strongly.
Covariance update: The new covariance captures the responsibility-weighted dispersion around the new mean. Points with high responsibility contribute more to the shape.
Mixing coefficient update: The new $\pi_k$ is the average responsibility—the fraction of total responsibility assigned to component $k$.
The Iterative Dance:
As EM iterates, the components "chase" the data: the E-step reassigns responsibilities based on the current Gaussians, and the M-step then shifts each Gaussian toward the points it is responsible for, repeating until the assignments and parameters stabilize.
Unlike K-means, which uses hard assignment (each point belongs to exactly one cluster), EM uses soft assignment via responsibilities. This makes EM robust to overlapping clusters and provides principled uncertainty quantification. However, it also means EM requires more computation per iteration than K-means. In fact, if you force responsibilities to be binary (0 or 1), EM reduces to a variant of K-means!
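To see that connection concretely, one can "harden" the responsibilities before the M-step. The sketch below is illustrative (the `harden` helper is not part of the implementation above); replacing the normalized responsibilities with their one-hot argmax turns the soft updates into a hard-assignment, K-means-like variant.

```python
import numpy as np

def harden(gamma):
    """Replace soft responsibilities with binary one-hot assignments (hard-EM)."""
    N, K = gamma.shape
    hard = np.zeros_like(gamma)
    hard[np.arange(N), np.argmax(gamma, axis=1)] = 1.0
    return hard

# Inside the E-step of the em_gmm sketch above, replacing
#     gamma = gamma / gamma_sum
# with
#     gamma = harden(gamma / gamma_sum)
# yields hard-assignment updates instead of soft ones.
```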
A natural question arises: if we're maximizing the Q-function (expected complete-data log-likelihood), how does this relate to maximizing the actual log-likelihood $\mathcal{L}(\boldsymbol{\theta})$?
The answer lies in a beautiful decomposition. The observed log-likelihood can be written as:
$$\mathcal{L}(\boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$$
where $H(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}} \left[ \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) \right]$ is an entropy-like term.
The Key Inequality
A crucial result from information theory states that $H(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) \leq H(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)})$ for all $\boldsymbol{\theta}$. This follows from the non-negativity of KL divergence.
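Spelling out that step, the gap between the two entropy-like terms is exactly a KL divergence between posteriors, which is non-negative:

$$H(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}}\left[ \log \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})} \right] = \mathrm{KL}\!\left( p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) \right) \geq 0$$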
Now, in the M-step, we choose $\boldsymbol{\theta}^{(t+1)}$ to maximize $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$, so:
$$Q(\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}) \geq Q(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)})$$
Combining with the inequality on $H$:
$$\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) = Q(\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}) \geq Q(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)}) = \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
Every iteration of EM is guaranteed to not decrease the log-likelihood: $\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) \geq \mathcal{L}(\boldsymbol{\theta}^{(t)})$. Combined with the fact that the sequence of log-likelihood values is bounded above (provided the covariances are kept away from degenerate, zero-volume solutions), this guarantees that the likelihood values converge, typically to a local maximum or saddle point. This monotonicity makes EM remarkably stable compared to gradient-based methods.
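This guarantee is easy to check empirically. The sketch below assumes the `em_gmm` implementation above and the illustrative `sample_gmm` helper from the latent-variable section are both in scope; it verifies that the recorded log-likelihood history never decreases beyond floating-point noise.

```python
import numpy as np

# Assumes em_gmm (above) and the illustrative sample_gmm helper are defined.
X, _ = sample_gmm(500, np.array([0.4, 0.6]),
                  np.array([[0.0, 0.0], [4.0, 4.0]]),
                  np.array([np.eye(2), np.eye(2)]), rng=0)

pi, mu, sigma, lls = em_gmm(X, K=2)
diffs = np.diff(lls)
assert np.all(diffs >= -1e-9), "log-likelihood should never decrease (up to float error)"
print(f"min per-iteration change: {diffs.min():.3e}")
```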
The ELBO Perspective
An alternative and increasingly popular view interprets EM through the Evidence Lower Bound (ELBO). For any distribution $q(\mathbf{Z})$ over latent variables:
$$\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) \geq \mathbb{E}_{q(\mathbf{Z})} \left[ \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \right] = \text{ELBO}$$
The E-step sets $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})$, making the bound tight. The M-step maximizes the ELBO with respect to $\boldsymbol{\theta}$.
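The tightness claim follows from the exact decomposition of the evidence into the ELBO plus a KL term; the KL term vanishes precisely when $q(\mathbf{Z})$ equals the posterior:

$$\log p(\mathbf{X} \mid \boldsymbol{\theta}) = \underbrace{\mathbb{E}_{q(\mathbf{Z})}\!\left[ \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \right]}_{\text{ELBO}} + \mathrm{KL}\!\left( q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) \right)$$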
This variational perspective connects EM to modern variational inference methods like VAEs and is foundational for understanding approximate EM variants.
While the theory of EM is elegant, practical implementations require careful attention to numerical issues and edge cases.
```python
import numpy as np
from scipy.linalg import solve_triangular

def stable_log_responsibilities(X, pi, mu, sigma):
    """
    Compute log-responsibilities using the log-sum-exp trick
    for numerical stability in high dimensions.

    Returns:
        log_gamma: (N, K) array of log-responsibilities
        log_likelihood: scalar log-likelihood
    """
    N, D = X.shape
    K = len(pi)

    # Compute log of unnormalized responsibilities
    log_weights = np.zeros((N, K))
    for k in range(K):
        # Log of mixing coefficient
        log_pi_k = np.log(pi[k] + 1e-300)

        # Log of Gaussian density (avoiding explicit density computation)
        diff = X - mu[k]
        L = np.linalg.cholesky(sigma[k])  # Σ = LL^T
        log_det = 2 * np.sum(np.log(np.diag(L)))

        # Solve L y = (x - μ) for y = L^{-1}(x - μ) via a triangular solve
        solved = solve_triangular(L, diff.T, lower=True).T  # (N, D)
        mahal_sq = np.sum(solved**2, axis=1)  # squared Mahalanobis distance

        log_gauss = -0.5 * (D * np.log(2 * np.pi) + log_det + mahal_sq)
        log_weights[:, k] = log_pi_k + log_gauss

    # Log-sum-exp for normalization (stable computation of log(Σ exp(x)))
    max_w = np.max(log_weights, axis=1, keepdims=True)
    log_sum = max_w + np.log(np.sum(np.exp(log_weights - max_w), axis=1, keepdims=True))

    log_gamma = log_weights - log_sum
    log_likelihood = np.sum(log_sum)

    return log_gamma, log_likelihood
```

We've established the mathematical and conceptual foundation for the EM algorithm in the context of Gaussian Mixture Models. The key insights are: (1) the mixture log-likelihood places a sum inside the logarithm, so direct MLE has no closed-form solution; (2) introducing latent assignment variables yields a complete-data log-likelihood that decomposes cleanly over components; (3) the E-step replaces the unknown assignments with their posterior expectations, the responsibilities; (4) the M-step applies the familiar MLE formulas with soft counts in place of hard ones; and (5) each iteration is guaranteed to not decrease the observed-data log-likelihood.
The next page provides a rigorous, step-by-step derivation of the E-step and M-step, examining the mathematical justifications in detail. We'll derive the update equations from first principles using calculus and Lagrange multipliers, providing the complete mathematical toolkit for understanding and extending EM to other models.