Imagine you're analyzing customer behavior data and discover that your data appears to come from multiple distinct groups—but you don't know which customer belongs to which group. If you knew the group memberships, estimating the parameters of each group would be straightforward. Conversely, if you knew the parameters, inferring the group memberships would be easy. This is the classic chicken-and-egg problem of latent variable models.
The Expectation-Maximization (EM) algorithm elegantly solves this circular dependency by alternating between inferring latent variables given current parameters (the E-step) and updating parameters given inferred latent variables (the M-step). This simple yet profound idea forms the backbone of countless machine learning algorithms—from Gaussian Mixture Models to Hidden Markov Models, from topic models to missing data imputation.
By the end of this page, you will understand: (1) why standard maximum likelihood fails for mixture models, (2) how the EM algorithm introduces latent variables as a computational device, (3) the complete mathematical framework connecting EM to GMMs, and (4) why EM is guaranteed to never decrease the likelihood—providing a foundation for understanding its convergence properties in subsequent pages.
To appreciate EM's elegance, we must first understand why standard maximum likelihood estimation (MLE) fails for mixture models. Consider a Gaussian Mixture Model with $K$ components. For a single observation $\mathbf{x}$, the probability density is:
$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
where $\boldsymbol{\theta} = \{\pi_1, \ldots, \pi_K, \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$ contains all parameters. The mixing coefficients $\pi_k$ satisfy $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \geq 0$.
The Log-Likelihood Function
For $N$ i.i.d. observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the log-likelihood is:
$$\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
The critical observation is that the logarithm of a sum does not simplify nicely. Unlike single-component models where $\log p(\mathbf{x} \mid \boldsymbol{\theta})$ separates cleanly over parameters, here we cannot decompose the optimization problem.
Taking derivatives of the log-likelihood and setting them to zero yields coupled, nonlinear equations with no closed-form solution. The sum inside the logarithm creates a complex interdependence: each $\boldsymbol{\mu}_k$ affects the denominator of the responsibility (posterior probability) for every data point, which in turn affects the optimal values of all other $\boldsymbol{\mu}_j$. Gradient descent can theoretically work, but it's prone to slow convergence, saddle points, and numerical instabilities due to the non-convex landscape with multiple local optima.
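To make the structure of this objective concrete, here is a minimal sketch that evaluates the mixture log-likelihood directly; the function name `gmm_log_likelihood` and the toy parameter values are illustrative, not from the text. Note that the sum over components happens before the logarithm, which is exactly what blocks a per-component decomposition.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, sigma):
    """Evaluate sum_n log( sum_k pi_k * N(x_n | mu_k, Sigma_k) )."""
    N, K = X.shape[0], len(pi)
    weighted = np.zeros((N, K))
    for k in range(K):
        weighted[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
    # The inner sum over k is taken BEFORE the log: no closed-form MLE.
    return np.sum(np.log(weighted.sum(axis=1)))

# Illustrative toy data: three 2D points, two unit-covariance components
X = np.array([[0.1, 0.2], [2.9, 3.1], [0.0, -0.1]])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma = np.array([np.eye(2), np.eye(2)])
print(gmm_log_likelihood(X, pi, mu, sigma))
```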
The Pathological Derivative Structure
To see this concretely, consider the gradient with respect to $\boldsymbol{\mu}_k$:
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k)$$
Notice that the fraction—which we'll soon call the responsibility $\gamma_{nk}$—depends on all component parameters. Setting this gradient to zero gives:
$$\boldsymbol{\mu}_k = \frac{\sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n}{\sum_{n=1}^{N} \gamma_{nk}}$$
This looks like a weighted mean, but $\gamma_{nk}$ itself depends on $\boldsymbol{\mu}_k$ and all other parameters! We have an implicit equation, not an explicit solution.
The key insight of EM is to introduce latent (hidden) variables that, if observed, would make the optimization tractable. For GMMs, we introduce a latent variable $\mathbf{z}_n$ for each observation $\mathbf{x}_n$, indicating which component generated that observation.
We encode $\mathbf{z}_n$ as a one-hot vector: $\mathbf{z}_n = (z_{n1}, \ldots, z_{nK})^\top$ where $z_{nk} \in \{0, 1\}$ and $\sum_{k=1}^{K} z_{nk} = 1$. The value $z_{nk} = 1$ indicates that observation $n$ was generated by component $k$.
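For intuition, the generative story behind these latent variables can be simulated directly. The sketch below (the `sample_gmm` helper and its parameter values are illustrative) draws each $\mathbf{z}_n$ first and then draws $\mathbf{x}_n$ from the selected component; this is exactly the process the model assumes, even though in real data the $\mathbf{z}_n$ are never observed.

```python
import numpy as np

def sample_gmm(N, pi, mu, sigma, rng=None):
    """Ancestral sampling: z_n ~ Categorical(pi), then x_n ~ N(mu_{z_n}, Sigma_{z_n})."""
    rng = np.random.default_rng(rng)
    K, D = mu.shape
    z = rng.choice(K, size=N, p=pi)  # component indices (argmax of the one-hot z_n)
    X = np.stack([rng.multivariate_normal(mu[k], sigma[k]) for k in z])
    return X, z  # in practice, z is hidden and only X is observed

# Illustrative usage: 500 points from a two-component 2D mixture
X, z = sample_gmm(500, np.array([0.3, 0.7]),
                  np.array([[0.0, 0.0], [3.0, 3.0]]),
                  np.array([np.eye(2), np.eye(2)]), rng=0)
```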
The Complete-Data Distribution
With latent variables included, we define the complete-data likelihood. The joint distribution of $(\mathbf{x}_n, \mathbf{z}_n)$ is:
$$p(\mathbf{x}_n, \mathbf{z}_n \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}$$
Because $z_{nk}$ is either 0 or 1, this product selects exactly one component—the one that generated $\mathbf{x}_n$.
The complete-data log-likelihood has a beautiful property: the logarithm moves inside the product! For all $N$ observations, $\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n} \sum_{k} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$. This separates cleanly over components—if we knew $\mathbf{Z}$, standard MLE formulas would apply directly to each component independently.
Why Complete-Data Makes MLE Easy
The complete-data log-likelihood is:
$$\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Expanding the Gaussian:
$$= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k - \frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$$
Now, maximizing with respect to $\boldsymbol{\mu}_k$ involves only terms where $z_{nk} = 1$—effectively the data points assigned to component $k$. The optimal $\boldsymbol{\mu}_k$ is simply the sample mean of assigned points:
$$\boldsymbol{\mu}_k^{\text{ML}} = \frac{\sum_{n : z_{nk}=1} \mathbf{x}_n}{\sum_{n=1}^{N} z_{nk}} = \frac{\sum_{n : z_{nk}=1} \mathbf{x}_n}{N_k}$$
where $N_k = |\{n : z_{nk} = 1\}|$ is the count of points in component $k$.
| Parameter | Complete-Data MLE Formula | Interpretation |
|---|---|---|
| $\pi_k$ | $\frac{N_k}{N}$ | Proportion of points assigned to component $k$ |
| $\boldsymbol{\mu}_k$ | $\frac{1}{N_k} \sum_{n:z_{nk}=1} \mathbf{x}_n$ | Sample mean of assigned points |
| $\boldsymbol{\Sigma}_k$ | $\frac{1}{N_k} \sum_{n:z_{nk}=1} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top$ | Sample covariance of assigned points |
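If the assignments really were observed, the three table entries above reduce to a few lines of code. The sketch below is illustrative only: the `complete_data_mle` helper is not from the text, it encodes assignments as integer labels rather than one-hot vectors, and it assumes every component receives at least one point.

```python
import numpy as np

def complete_data_mle(X, z, K):
    """MLE for a GMM when the component labels z (shape (N,)) are fully observed."""
    N, D = X.shape
    pi = np.zeros(K)
    mu = np.zeros((K, D))
    sigma = np.zeros((K, D, D))
    for k in range(K):
        X_k = X[z == k]                  # points assigned to component k
        N_k = len(X_k)
        pi[k] = N_k / N                  # proportion of assigned points
        mu[k] = X_k.mean(axis=0)         # sample mean of assigned points
        diff = X_k - mu[k]
        sigma[k] = diff.T @ diff / N_k   # sample covariance of assigned points
    return pi, mu, sigma
```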
The latent variables $\mathbf{Z}$ are not observed—that's why they're called latent. The brilliant insight of EM is: instead of treating $z_{nk}$ as binary unknowns, we replace them with their expected values given current parameter estimates.
Let $\boldsymbol{\theta}^{(t)}$ denote our current parameter estimates at iteration $t$. The expected value of $z_{nk}$ given the observed data and current parameters is:
$$\mathbb{E}[z_{nk} \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}] = p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}) \triangleq \gamma_{nk}^{(t)}$$
Computing Responsibilities via Bayes' Theorem
The quantity $\gamma_{nk}$ is called the responsibility of component $k$ for observation $n$. Using Bayes' theorem:
$$\gamma_{nk}^{(t)} = \frac{p(z_{nk} = 1 \mid \boldsymbol{\theta}^{(t)}) \, p(\mathbf{x}_n \mid z_{nk} = 1, \boldsymbol{\theta}^{(t)})}{\sum_{j=1}^{K} p(z_{nj} = 1 \mid \boldsymbol{\theta}^{(t)}) \, p(\mathbf{x}_n \mid z_{nj} = 1, \boldsymbol{\theta}^{(t)})}$$
Substituting the GMM components:
$$\gamma_{nk}^{(t)} = \frac{\pi_k^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$$
This is a soft assignment: each data point has a responsibility value for each component, with $\sum_{k=1}^{K} \gamma_{nk} = 1$. Compare this to K-means, which uses hard assignment where each point belongs to exactly one cluster.
The responsibility $\gamma_{nk}$ is precisely the posterior probability that observation $\mathbf{x}_n$ was generated by component $k$, given the observed data and current parameters. It quantifies our uncertainty about the latent assignments—a principled Bayesian approach that naturally handles ambiguous data points near cluster boundaries.
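As a tiny worked example of this soft assignment (all numbers here are illustrative), consider a 1D point sitting between two unit-variance components: each component receives part of the responsibility rather than an all-or-nothing label.

```python
import numpy as np
from scipy.stats import norm

# One ambiguous point x = 1.5 between two 1D components
x = 1.5
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 3.0])
std = np.array([1.0, 1.0])

unnormalized = pi * norm.pdf(x, loc=mu, scale=std)  # pi_k * N(x | mu_k, sigma_k^2)
gamma = unnormalized / unnormalized.sum()           # responsibilities, sum to 1
print(gamma)  # both components share responsibility for this boundary point
```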
The Expected Complete-Data Log-Likelihood
With responsibilities in hand, we define the Q-function—the expected complete-data log-likelihood under the posterior distribution of latent variables:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]$$
Because $z_{nk}$ appears linearly in the complete-data log-likelihood, the expectation simply replaces $z_{nk}$ with $\gamma_{nk}^{(t)}$:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
The Q-function is now a function of the new parameters $\boldsymbol{\theta}$, with the responsibilities $\gamma_{nk}^{(t)}$ treated as fixed constants computed from the old parameters.
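Written as code, the Q-function is just a responsibility-weighted sum of per-component log densities. The sketch below is illustrative (the `q_function` helper and its `theta = (pi, mu, sigma)` packing are assumptions, not from the text); it makes explicit that $\gamma_{nk}^{(t)}$ enters only as fixed weights.

```python
import numpy as np
from scipy.stats import multivariate_normal

def q_function(theta, gamma, X):
    """Expected complete-data log-likelihood Q(theta, theta_t), with gamma held fixed."""
    pi, mu, sigma = theta
    N, K = gamma.shape
    Q = 0.0
    for k in range(K):
        # log pi_k + log N(x_n | mu_k, Sigma_k) for all n, weighted by gamma_nk
        log_comp = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], sigma[k])
        Q += np.sum(gamma[:, k] * log_comp)
    return Q
```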
The EM algorithm alternates between two steps until convergence:
E-Step (Expectation): Compute responsibilities $\gamma_{nk}^{(t)}$ using current parameters $\boldsymbol{\theta}^{(t)}$.
M-Step (Maximization): Update parameters by maximizing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$ with respect to $\boldsymbol{\theta}$.
Let's derive the complete update equations for GMMs.
```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iters=100, tol=1e-6):
    """
    EM algorithm for Gaussian Mixture Models.

    Parameters:
        X: (N, D) array of observations
        K: number of mixture components
        max_iters: maximum iterations
        tol: convergence tolerance on log-likelihood

    Returns:
        pi: (K,) mixing coefficients
        mu: (K, D) component means
        sigma: (K, D, D) component covariances
        log_likelihoods: history of log-likelihood values
    """
    N, D = X.shape

    # Initialize parameters (see initialization strategies later)
    pi = np.ones(K) / K
    mu = X[np.random.choice(N, K, replace=False)]
    sigma = np.array([np.eye(D) for _ in range(K)])

    log_likelihoods = []

    for iteration in range(max_iters):
        # ============ E-STEP ============
        # Compute responsibilities γ_nk = P(z_nk=1 | x_n, θ)
        gamma = np.zeros((N, K))
        for k in range(K):
            gamma[:, k] = pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])

        # Normalize (γ_nk / Σ_j γ_nj)
        gamma_sum = gamma.sum(axis=1, keepdims=True)
        gamma = gamma / gamma_sum

        # Compute log-likelihood for convergence check
        log_likelihood = np.sum(np.log(gamma_sum))
        log_likelihoods.append(log_likelihood)

        if iteration > 0 and abs(log_likelihood - log_likelihoods[-2]) < tol:
            print(f"Converged at iteration {iteration}")
            break

        # ============ M-STEP ============
        # Update parameters using closed-form solutions

        # Effective number of points per component
        N_k = gamma.sum(axis=0)  # (K,)

        # Update mixing coefficients
        pi = N_k / N

        # Update means (responsibility-weighted average)
        mu = (gamma.T @ X) / N_k[:, np.newaxis]  # (K, D)

        # Update covariances
        for k in range(K):
            diff = X - mu[k]  # (N, D)
            # Weighted outer products
            sigma[k] = (gamma[:, k:k+1] * diff).T @ diff / N_k[k]
            # Add small regularization for numerical stability
            sigma[k] += 1e-6 * np.eye(D)

    return pi, mu, sigma, log_likelihoods
```

M-Step Closed-Form Solutions
Maximizing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$ with respect to $\boldsymbol{\theta}$ yields:
$$\pi_k^{(t+1)} = \frac{N_k}{N} \quad \text{where } N_k = \sum_{n=1}^{N} \gamma_{nk}^{(t)}$$
$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n$$
$$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^\top$$
Notice the elegant correspondence with complete-data MLE: instead of counting points assigned to component $k$ (hard assignment), we sum the responsibilities (soft assignment). The quantity $N_k$ is the effective number of points assigned to component $k$.
The EM update equations for GMMs are simply the MLE formulas with hard counts replaced by soft responsibilities. This reveals EM's core principle: when you don't know the latent variables, use their posterior expectations instead. The mathematical elegance—closed-form M-step updates resembling familiar statistics—makes GMM-EM one of the most intuitive examples of the EM algorithm.
Understanding EM geometrically provides valuable intuition for its behavior. Consider a 2D dataset that appears to come from two overlapping Gaussian clusters.
E-Step Geometry:
The E-step computes responsibilities based on the current Gaussians. For each data point, we ask: "Given the current component locations and shapes, which component was most likely responsible for generating this point?"
M-Step Geometry:
The M-step moves each component toward its responsible data points:
Mean update: The new mean is the responsibility-weighted centroid of all data points. Points with high responsibility pull the mean more strongly.
Covariance update: The new covariance captures the responsibility-weighted dispersion around the new mean. Points with high responsibility contribute more to the shape.
Mixing coefficient update: The new $\pi_k$ is the average responsibility—the fraction of total responsibility assigned to component $k$.
The Iterative Dance:
As EM iterates, the components "chase" the data: the E-step reassigns responsibilities based on the current Gaussians, and the M-step then shifts each Gaussian toward the points it is responsible for, repeating until the assignments and parameters stabilize.
Unlike K-means, which uses hard assignment (each point belongs to exactly one cluster), EM uses soft assignment via responsibilities. This makes EM robust to overlapping clusters and provides principled uncertainty quantification. However, it also means EM requires more computation per iteration than K-means. In fact, if you force responsibilities to be binary (0 or 1), EM reduces to a variant of K-means!
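To see that connection concretely, one can "harden" the responsibilities before the M-step. The sketch below is illustrative (the `harden` helper is not part of the implementation above); replacing the normalized responsibilities with their one-hot argmax turns the soft updates into a hard-assignment, K-means-like variant.

```python
import numpy as np

def harden(gamma):
    """Replace soft responsibilities with binary one-hot assignments (hard-EM)."""
    N, K = gamma.shape
    hard = np.zeros_like(gamma)
    hard[np.arange(N), np.argmax(gamma, axis=1)] = 1.0
    return hard

# Inside the E-step of the em_gmm sketch above, replacing
#     gamma = gamma / gamma_sum
# with
#     gamma = harden(gamma / gamma_sum)
# yields hard-assignment updates instead of soft ones.
```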
A natural question arises: if we're maximizing the Q-function (expected complete-data log-likelihood), how does this relate to maximizing the actual log-likelihood $\mathcal{L}(\boldsymbol{\theta})$?
The answer lies in a beautiful decomposition. The observed log-likelihood can be written as:
$$\mathcal{L}(\boldsymbol{\theta}) = Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$$
where $H(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}} \left[ \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) \right]$ is an entropy-like term.
The Key Inequality
A crucial result from information theory states that $H(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) \leq H(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)})$ for all $\boldsymbol{\theta}$. This follows from the non-negativity of KL divergence.
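Spelling out that step, the gap between the two entropy-like terms is exactly a KL divergence between posteriors, which is non-negative:

$$H(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}}\left[ \log \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})} \right] = \mathrm{KL}\!\left( p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) \right) \geq 0$$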
Now, in the M-step, we choose $\boldsymbol{\theta}^{(t+1)}$ to maximize $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)})$, so:
$$Q(\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}) \geq Q(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)})$$
Combining with the inequality on $H$:
$$\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) = Q(\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t+1)}, \boldsymbol{\theta}^{(t)}) \geq Q(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)}) - H(\boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}^{(t)}) = \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
Every iteration of EM is guaranteed to not decrease the log-likelihood: $\mathcal{L}(\boldsymbol{\theta}^{(t+1)}) \geq \mathcal{L}(\boldsymbol{\theta}^{(t)})$. Combined with the fact that the sequence of log-likelihood values is bounded above (provided the covariances are kept away from degenerate, zero-volume solutions), this guarantees that the likelihood values converge, typically to a local maximum or saddle point. This monotonicity makes EM remarkably stable compared to gradient-based methods.
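This guarantee is easy to check empirically. The sketch below assumes the `em_gmm` implementation above and the illustrative `sample_gmm` helper from the latent-variable section are both in scope; it verifies that the recorded log-likelihood history never decreases beyond floating-point noise.

```python
import numpy as np

# Assumes em_gmm (above) and the illustrative sample_gmm helper are defined.
X, _ = sample_gmm(500, np.array([0.4, 0.6]),
                  np.array([[0.0, 0.0], [4.0, 4.0]]),
                  np.array([np.eye(2), np.eye(2)]), rng=0)

pi, mu, sigma, lls = em_gmm(X, K=2)
diffs = np.diff(lls)
assert np.all(diffs >= -1e-9), "log-likelihood should never decrease (up to float error)"
print(f"min per-iteration change: {diffs.min():.3e}")
```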
The ELBO Perspective
An alternative and increasingly popular view interprets EM through the Evidence Lower Bound (ELBO). For any distribution $q(\mathbf{Z})$ over latent variables:
$$\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) \geq \mathbb{E}_{q(\mathbf{Z})} \left[ \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \right] = \text{ELBO}$$
The E-step sets $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})$, making the bound tight. The M-step maximizes the ELBO with respect to $\boldsymbol{\theta}$.
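The tightness claim follows from the exact decomposition of the evidence into the ELBO plus a KL term; the KL term vanishes precisely when $q(\mathbf{Z})$ equals the posterior:

$$\log p(\mathbf{X} \mid \boldsymbol{\theta}) = \underbrace{\mathbb{E}_{q(\mathbf{Z})}\!\left[ \log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})} \right]}_{\text{ELBO}} + \mathrm{KL}\!\left( q(\mathbf{Z}) \,\|\, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) \right)$$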
This variational perspective connects EM to modern variational inference methods like VAEs and is foundational for understanding approximate EM variants.
While the theory of EM is elegant, practical implementations require careful attention to numerical issues and edge cases.
```python
import numpy as np
from scipy.linalg import solve_triangular

def stable_log_responsibilities(X, pi, mu, sigma):
    """
    Compute log-responsibilities using the log-sum-exp trick
    for numerical stability in high dimensions.

    Returns:
        log_gamma: (N, K) array of log-responsibilities
        log_likelihood: scalar log-likelihood
    """
    N, D = X.shape
    K = len(pi)

    # Compute log of unnormalized responsibilities
    log_weights = np.zeros((N, K))
    for k in range(K):
        # Log of mixing coefficient
        log_pi_k = np.log(pi[k] + 1e-300)

        # Log of Gaussian density (avoiding explicit density computation)
        diff = X - mu[k]
        L = np.linalg.cholesky(sigma[k])  # Σ = LL^T
        log_det = 2 * np.sum(np.log(np.diag(L)))

        # Solve L y = (x - μ) for y = L^{-1}(x - μ) via a triangular solve
        solved = solve_triangular(L, diff.T, lower=True).T  # (N, D)
        mahal_sq = np.sum(solved**2, axis=1)  # squared Mahalanobis distance

        log_gauss = -0.5 * (D * np.log(2 * np.pi) + log_det + mahal_sq)
        log_weights[:, k] = log_pi_k + log_gauss

    # Log-sum-exp for normalization (stable computation of log(Σ exp(x)))
    max_w = np.max(log_weights, axis=1, keepdims=True)
    log_sum = max_w + np.log(np.sum(np.exp(log_weights - max_w), axis=1, keepdims=True))

    log_gamma = log_weights - log_sum
    log_likelihood = np.sum(log_sum)

    return log_gamma, log_likelihood
```

We've established the mathematical and conceptual foundation for the EM algorithm in the context of Gaussian Mixture Models. The key insights are: (1) the mixture log-likelihood places a sum inside the logarithm, so direct MLE has no closed-form solution; (2) introducing latent assignment variables yields a complete-data log-likelihood that decomposes cleanly over components; (3) the E-step replaces the unknown assignments with their posterior expectations, the responsibilities; (4) the M-step applies the familiar MLE formulas with soft counts in place of hard ones; and (5) each iteration is guaranteed to not decrease the observed-data log-likelihood.
The next page provides a rigorous, step-by-step derivation of the E-step and M-step, examining the mathematical justifications in detail. We'll derive the update equations from first principles using calculus and Lagrange multipliers, providing the complete mathematical toolkit for understanding and extending EM to other models.