The Expectation-Maximization algorithm derives its name from its two alternating steps: the Expectation step (E-step) and the Maximization step (M-step). While the previous page introduced these concepts intuitively, this page provides complete, rigorous mathematical derivations from first principles.
Understanding these derivations is essential for implementing EM correctly and for extending it to other latent variable models.
By the end of this page, you will be able to: (1) derive the E-step responsibility formula from Bayes' theorem, (2) derive all M-step update equations using calculus and Lagrange multipliers, (3) understand why these specific formulas emerge from the optimization, and (4) recognize the pattern that generalizes to other latent variable models.
The E-step computes the posterior distribution of latent variables given the observed data and current parameter estimates. For GMMs, this means computing the probability that each data point was generated by each component.
Setup and Notation
Let $\mathbf{x}_n \in \mathbb{R}^D$ be the $n$-th observation ($n = 1, \ldots, N$). Let $z_{nk} \in \{0, 1\}$ indicate whether observation $n$ was generated by component $k$ ($k = 1, \ldots, K$), with $\sum_{k=1}^{K} z_{nk} = 1$.
The current parameters are $\boldsymbol{\theta}^{(t)} = \{\pi_k^{(t)}, \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\}_{k=1}^{K}$.
Deriving Responsibilities via Bayes' Theorem
We want to compute $p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)})$, the posterior probability that observation $n$ came from component $k$.
By Bayes' theorem:
$$p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}) = \frac{p(\mathbf{x}_n \mid z_{nk} = 1, \boldsymbol{\theta}^{(t)}) \cdot p(z_{nk} = 1 \mid \boldsymbol{\theta}^{(t)})}{p(\mathbf{x}_n \mid \boldsymbol{\theta}^{(t)})}$$
Let's identify each term:
Prior: $p(z_{nk} = 1 \mid \boldsymbol{\theta}^{(t)}) = \pi_k^{(t)}$ (the mixing coefficient)
Likelihood: $p(\mathbf{x}_n \mid z_{nk} = 1, \boldsymbol{\theta}^{(t)}) = \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})$ (component Gaussian density)
Evidence (marginal likelihood): $p(\mathbf{x}_n \mid \boldsymbol{\theta}^{(t)}) = \sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})$
Substituting into Bayes' theorem yields the responsibility: $$\gamma_{nk}^{(t)} \triangleq p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}) = \frac{\pi_k^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$$ This is a softmax-like normalization of weighted Gaussian densities.
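To make the formula concrete, here is a toy calculation with assumed numbers (not tied to any dataset): suppose $K = 2$, $\pi_1^{(t)} = \pi_2^{(t)} = 0.5$, and the component densities at $\mathbf{x}_n$ evaluate to $0.20$ and $0.05$. Then

$$\gamma_{n1}^{(t)} = \frac{0.5 \times 0.20}{0.5 \times 0.20 + 0.5 \times 0.05} = \frac{0.10}{0.125} = 0.8, \qquad \gamma_{n2}^{(t)} = \frac{0.025}{0.125} = 0.2,$$

so component 1 is four times as responsible for this point as component 2.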
Properties of Responsibilities
Normalization: $\sum_{k=1}^{K} \gamma_{nk}^{(t)} = 1$ for all $n$ (probabilities sum to one)
Non-negativity: $\gamma_{nk}^{(t)} \geq 0$ for all $n, k$
Interpretation: $\gamma_{nk}^{(t)}$ represents our belief that observation $n$ came from component $k$, given current parameters
Expectation property: $\gamma_{nk}^{(t)} = \mathbb{E}[z_{nk} \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}]$, since $z_{nk} \in \{0, 1\}$
Expanded Gaussian Density
For completeness, the multivariate Gaussian density is:
$$\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right)$$
where $|\boldsymbol{\Sigma}_k|$ is the determinant and $\boldsymbol{\Sigma}_k^{-1}$ is the inverse (precision matrix).
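In practice, the product $\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ can underflow for high-dimensional data, so responsibilities are usually computed in log space. Below is a minimal sketch of that idea; the function name is illustrative and not part of the reference implementation later on this page.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_space_responsibilities(X, pi, mu, sigma):
    """Numerically stable E-step sketch: work with log pi_k + log N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pi)
    log_weighted = np.empty((N, K))
    for k in range(K):
        log_weighted[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(
            X, mean=mu[k], cov=sigma[k]
        )
    # log gamma_{nk} = log weighted density minus log-sum-exp over components
    log_gamma = log_weighted - logsumexp(log_weighted, axis=1, keepdims=True)
    return np.exp(log_gamma)
```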
The M-step maximizes the Q-function, which is the expected value of the complete-data log-likelihood with respect to the posterior distribution of latent variables.
Defining the Q-Function
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]$$
The first argument $\boldsymbol{\theta}$ is what we optimize over. The second argument $\boldsymbol{\theta}^{(t)}$ specifies the distribution used for the expectation (the current parameters).
Deriving the Q-Function for GMMs
The complete-data log-likelihood is:
$$\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \log \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Since $z_{nk}$ appears linearly and $\mathbb{E}[z_{nk} \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}] = \gamma_{nk}^{(t)}$:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \log \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Expanding the logarithm:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Expanding the Gaussian Log-Density
The log of the Gaussian density is:
$$\log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = -\frac{D}{2} \log(2\pi) - \frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$
The full Q-function becomes:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \left[ \log \pi_k - \frac{D}{2} \log(2\pi) - \frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$$
Notice that the Q-function separates cleanly over components $k$: the terms involving $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$ can be grouped separately for each $k$. This separability is what enables closed-form M-step solutions: we can optimize each component's parameters independently (except for the constraint $\sum_{k=1}^{K} \pi_k = 1$).
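As a debugging aid, the Q-function can be evaluated directly from the responsibilities and any candidate parameters. The sketch below is one possible helper; its name is an assumption, not part of the reference implementation later on this page.

```python
import numpy as np
from scipy.stats import multivariate_normal

def q_function(X, gamma, pi, mu, sigma):
    """Evaluate Q(theta, theta^(t)) = sum_n sum_k gamma_{nk} [log pi_k + log N(x_n | mu_k, Sigma_k)]."""
    K = gamma.shape[1]
    q = 0.0
    for k in range(K):
        log_term = np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=sigma[k])
        q += np.sum(gamma[:, k] * log_term)  # component k's separable contribution
    return q
```

Because the Q-function separates over components, each term of this loop can be inspected on its own when a single component misbehaves.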
The mixing coefficients $\{\pi_1, \ldots, \pi_K\}$ must satisfy the constraint $\sum_{k=1}^{K} \pi_k = 1$ with $\pi_k \geq 0$. We use Lagrange multipliers to handle this constraint.
Constrained Optimization Setup
We maximize:
$$\mathcal{L}_{\pi} = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \log \pi_k + \lambda \left( 1 - \sum_{k=1}^{K} \pi_k \right)$$
where $\lambda$ is the Lagrange multiplier. (Note: we only include terms involving $\pi_k$ from the Q-function.)
Taking Derivatives
Differentiate with respect to $\pi_k$:
$$\frac{\partial \mathcal{L}_{\pi}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma_{nk}^{(t)}}{\pi_k} - \lambda = 0$$
Solving for $\pi_k$:
$$\pi_k = \frac{1}{\lambda} \sum_{n=1}^{N} \gamma_{nk}^{(t)} = \frac{N_k}{\lambda}$$
where we define $N_k \triangleq \sum_{n=1}^{N} \gamma_{nk}^{(t)}$ as the effective number of points assigned to component $k$.
Determining the Lagrange Multiplier
Apply the constraint $\sum_{k=1}^{K} \pi_k = 1$:
$$\sum_{k=1}^{K} \frac{N_k}{\lambda} = 1 \implies \lambda = \sum_{k=1}^{K} N_k = \sum_{k=1}^{K} \sum_{n=1}^{N} \gamma_{nk}^{(t)} = \sum_{n=1}^{N} \underbrace{\sum_{k=1}^{K} \gamma_{nk}^{(t)}}_{= 1} = N$$
Therefore $\lambda = N$, and the update is:
$$\pi_k^{(t+1)} = \frac{N_k}{N} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}^{(t)}$$ The new mixing coefficient equals the average responsibility—the fraction of total responsibility assigned to component k across all data points.
Interpretation
This update has a natural interpretation: $N_k$ acts as the effective number of points claimed by component $k$, so $\pi_k^{(t+1)} = N_k / N$ is simply the fraction of the data softly assigned to that component.
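As a toy illustration with assumed counts: if $N = 100$ and the effective counts are $N_1 = 60$, $N_2 = 30$, $N_3 = 10$, then

$$\pi_1^{(t+1)} = 0.6, \qquad \pi_2^{(t+1)} = 0.3, \qquad \pi_3^{(t+1)} = 0.1,$$

which sum to one automatically because $\sum_{k} N_k = N$.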
The mean $\boldsymbol{\mu}_k$ is unconstrained, so we simply differentiate the relevant terms of the Q-function and set the gradient to zero.
Relevant Terms
From the Q-function, the terms involving $\boldsymbol{\mu}_k$ are:
$$Q_{\mu_k} = -\frac{1}{2} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) + \text{const}$$
where "const" denotes terms not depending on $\boldsymbol{\mu}_k$.
Computing the Gradient
We need $\frac{\partial}{\partial \boldsymbol{\mu}_k} \left[ (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$.
Using the identity $\frac{\partial}{\partial \mathbf{a}} (\mathbf{x} - \mathbf{a})^\top \mathbf{A} (\mathbf{x} - \mathbf{a}) = -2 \mathbf{A} (\mathbf{x} - \mathbf{a})$ for symmetric $\mathbf{A}$:
$$\frac{\partial Q_{\mu_k}}{\partial \boldsymbol{\mu}_k} = -\frac{1}{2} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \cdot (-2) \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$
$$= \sum_{n=1}^{N} \gamma_{nk}^{(t)} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$
Setting the Gradient to Zero
Setting this equal to zero:
$$\boldsymbol{\Sigma}_k^{-1} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k) = \mathbf{0}$$
Since $\boldsymbol{\Sigma}_k^{-1}$ is positive definite (hence invertible), we can multiply both sides by $\boldsymbol{\Sigma}_k$:
$$\sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k) = \mathbf{0}$$
Expanding:
$$\sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n - \boldsymbol{\mu}_k \sum_{n=1}^{N} \gamma_{nk}^{(t)} = \mathbf{0}$$
$$\sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n = N_k \boldsymbol{\mu}_k$$
$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n = \frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n}{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}$$ The new mean is the responsibility-weighted average of all data points: a soft version of the sample mean.
Interpretation
Each data point contributes to every component's mean, weighted by that component's responsibility for the point; points the component is confident about pull its mean most strongly. With hard assignments ($\gamma_{nk} \in \{0, 1\}$), the update reduces to the ordinary sample mean of the points assigned to component $k$, exactly as in K-means.
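Concretely, under a hard assignment, writing $\mathcal{C}_k = \{n : \gamma_{nk} = 1\}$ (notation introduced here just for illustration), the update becomes

$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{|\mathcal{C}_k|} \sum_{n \in \mathcal{C}_k} \mathbf{x}_n,$$

the centroid of the points assigned to component $k$.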
The covariance update is the most mathematically involved. We need matrix calculus to differentiate with respect to $\boldsymbol{\Sigma}_k$.
Relevant Terms
From the Q-function, terms involving $\boldsymbol{\Sigma}_k$ are:
$$Q_{\Sigma_k} = \sum_{n=1}^{N} \gamma_{nk}^{(t)} \left[ -\frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$$
It's more convenient to work with the precision matrix $\boldsymbol{\Lambda}_k = \boldsymbol{\Sigma}_k^{-1}$:
$$Q_{\Lambda_k} = \sum_{n=1}^{N} \gamma_{nk}^{(t)} \left[ \frac{1}{2} \log |\boldsymbol{\Lambda}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Lambda}_k (\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$$
(using $\log |\boldsymbol{\Sigma}_k| = -\log |\boldsymbol{\Lambda}_k|$)
Matrix Calculus Identities
We need two key identities:
$\frac{\partial}{\partial \mathbf{A}} \log |\mathbf{A}| = (\mathbf{A}^{-1})^\top = \mathbf{A}^{-1}$ (for symmetric $\mathbf{A}$)
$\frac{\partial}{\partial \mathbf{A}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = \mathbf{x} \mathbf{x}^\top$ (for symmetric $\mathbf{A}$)
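These identities can be sanity-checked numerically with finite differences. The sketch below is an optional verification aid under the stated symmetry assumption; the helper name, step size, and tolerances are illustrative.

```python
import numpy as np

def check_matrix_identities(D=3, eps=1e-6, seed=0):
    """Finite-difference check of d/dA log|A| = A^{-1} and d/dA x^T A x = x x^T (symmetric A)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((D, D))
    A = B @ B.T + D * np.eye(D)          # symmetric positive definite test matrix
    x = rng.standard_normal(D)

    grad_logdet = np.zeros((D, D))
    grad_quad = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            E = np.zeros((D, D))
            E[i, j] = eps
            # central differences with respect to entry (i, j)
            grad_logdet[i, j] = (np.linalg.slogdet(A + E)[1] - np.linalg.slogdet(A - E)[1]) / (2 * eps)
            grad_quad[i, j] = (x @ (A + E) @ x - x @ (A - E) @ x) / (2 * eps)

    assert np.allclose(grad_logdet, np.linalg.inv(A), atol=1e-4)
    assert np.allclose(grad_quad, np.outer(x, x), atol=1e-4)
    return True
```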
Computing the Gradient
$$\frac{\partial Q_{\Lambda_k}}{\partial \boldsymbol{\Lambda}_k} = \sum_{n=1}^{N} \gamma_{nk}^{(t)} \left[ \frac{1}{2} \boldsymbol{\Lambda}_k^{-1} - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \right]$$
$$= \frac{1}{2} N_k \boldsymbol{\Sigma}_k - \frac{1}{2} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top$$
Setting the Gradient to Zero
$$N_k \boldsymbol{\Sigma}_k = \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top$$
$$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^\top$$ The new covariance is the responsibility-weighted outer product of deviations from the new mean.
Important Note: The covariance update uses the new mean $\boldsymbol{\mu}_k^{(t+1)}$, not the old mean. In practice, compute all new means first, then compute new covariances.
Interpretation
The covariance update is a soft version of the sample covariance: each outer product $(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^\top$ is weighted by how responsible component $k$ is for point $n$, and the sum is normalized by the effective count $N_k$ rather than by $N$.
Let's consolidate the complete EM algorithm for GMMs, including both steps in their final form.
| Step | Computation | Formula |
|---|---|---|
| E-Step | Compute responsibilities | $\gamma_{nk}^{(t)} = \frac{\pi_k^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$ |
| E-Step | Compute effective counts | $N_k = \sum_{n=1}^{N} \gamma_{nk}^{(t)}$ |
| M-Step | Update mixing coefficients | $\pi_k^{(t+1)} = \frac{N_k}{N}$ |
| M-Step | Update means | $\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n$ |
| M-Step | Update covariances | $\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^\top$ |
```python
import numpy as np
from scipy.stats import multivariate_normal


def e_step(X, pi, mu, sigma):
    """
    E-Step: Compute responsibilities.

    Parameters:
        X: (N, D) data matrix
        pi: (K,) mixing coefficients
        mu: (K, D) component means
        sigma: (K, D, D) component covariances

    Returns:
        gamma: (N, K) responsibility matrix
        N_k: (K,) effective counts per component
    """
    N, D = X.shape
    K = len(pi)

    # Compute weighted densities: π_k * N(x_n | μ_k, Σ_k)
    weighted_densities = np.zeros((N, K))
    for k in range(K):
        weighted_densities[:, k] = pi[k] * multivariate_normal.pdf(
            X, mean=mu[k], cov=sigma[k]
        )

    # Normalize to get responsibilities
    gamma = weighted_densities / weighted_densities.sum(axis=1, keepdims=True)

    # Compute effective counts
    N_k = gamma.sum(axis=0)  # (K,)

    return gamma, N_k


def m_step(X, gamma, N_k):
    """
    M-Step: Update parameters.

    Parameters:
        X: (N, D) data matrix
        gamma: (N, K) responsibility matrix
        N_k: (K,) effective counts per component

    Returns:
        pi: (K,) new mixing coefficients
        mu: (K, D) new component means
        sigma: (K, D, D) new component covariances
    """
    N, D = X.shape
    K = len(N_k)

    # Update mixing coefficients
    pi = N_k / N

    # Update means: μ_k = (1/N_k) Σ_n γ_{nk} x_n
    mu = (gamma.T @ X) / N_k[:, np.newaxis]  # (K, D)

    # Update covariances: Σ_k = (1/N_k) Σ_n γ_{nk} (x_n - μ_k)(x_n - μ_k)^T
    sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]  # (N, D)
        # Weighted outer product sum
        weighted_diff = gamma[:, k:k+1] * diff  # (N, D) with weights
        sigma[k] = (weighted_diff.T @ diff) / N_k[k]
        # Add regularization for numerical stability
        sigma[k] += 1e-6 * np.eye(D)

    return pi, mu, sigma


def compute_log_likelihood(X, pi, mu, sigma):
    """
    Compute observed-data log-likelihood.
    """
    N, D = X.shape
    K = len(pi)

    weighted_densities = np.zeros((N, K))
    for k in range(K):
        weighted_densities[:, k] = pi[k] * multivariate_normal.pdf(
            X, mean=mu[k], cov=sigma[k]
        )

    return np.sum(np.log(weighted_densities.sum(axis=1)))
```

Update Ordering
The M-step updates should be computed in a specific order: first the mixing coefficients and means, which depend only on the responsibilities, and then the covariances, which depend on the newly computed means. Using old means in the covariance update would violate the M-step's guarantee of maximizing the Q-function.
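Putting the pieces together, a minimal driver loop might look like the sketch below, which assumes the e_step, m_step, and compute_log_likelihood functions defined above; the initialization scheme and stopping tolerance are illustrative choices, not prescribed here.

```python
import numpy as np

def run_em(X, K, max_iters=100, tol=1e-6, seed=0):
    """Minimal EM driver: alternate E-step and M-step until the log-likelihood stabilizes."""
    N, D = X.shape
    rng = np.random.default_rng(seed)

    # Simple initialization: random data points as means, shared data covariance, uniform weights
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)

    log_likelihoods = [compute_log_likelihood(X, pi, mu, sigma)]
    for _ in range(max_iters):
        gamma, N_k = e_step(X, pi, mu, sigma)   # E-step: responsibilities
        pi, mu, sigma = m_step(X, gamma, N_k)   # M-step: closed-form updates (means before covariances)
        log_likelihoods.append(compute_log_likelihood(X, pi, mu, sigma))
        if log_likelihoods[-1] - log_likelihoods[-2] < tol:
            break
    return pi, mu, sigma, log_likelihoods
```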
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Compute $K$ Gaussian densities for $N$ points | $O(NKD^2)$ | $O(NK)$ |
| Normalize responsibilities | $O(NK)$ | $O(1)$ |
| Compute $N_k$ | $O(NK)$ | $O(K)$ |
| Update means | $O(NKD)$ | $O(KD)$ |
| Update covariances | $O(NKD^2)$ | $O(KD^2)$ |
| Total per iteration | $O(NKD^2)$ | $O(NK + KD^2)$ |
Complexity Analysis
Comparison to K-means: each EM iteration costs $O(NKD^2)$, dominated by the Gaussian density evaluations and the covariance updates, whereas a K-means iteration costs $O(NKD)$ because it only computes Euclidean distances and centroid averages. EM's richer probabilistic model comes at roughly a factor-of-$D$ higher per-iteration cost.
Optimizations: the main practical lever is vectorization, discussed below; restricting components to diagonal covariances also reduces the $D^2$ terms to $D$.
The update equations are highly amenable to vectorization. The mean update mu = (gamma.T @ X) / N_k is a single matrix multiply. For covariances, broadcasting and einsum can avoid explicit loops over components. This is critical for practical performance—vectorized NumPy/PyTorch implementations are 10-100× faster than explicit loops.
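As one possible realization of the einsum approach mentioned above, the sketch below computes all $K$ covariances at once; it is intended to be equivalent to the per-component loop in m_step, with the same small regularization term.

```python
import numpy as np

def covariance_update_vectorized(X, gamma, mu, N_k, reg=1e-6):
    """Vectorized covariance update: Σ_k = (1/N_k) Σ_n γ_{nk} (x_n - μ_k)(x_n - μ_k)^T for all k."""
    N, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]               # (N, K, D) deviations from each mean
    weighted = gamma[:, :, None] * diff                 # (N, K, D) responsibility-weighted deviations
    sigma = np.einsum('nkd,nke->kde', weighted, diff)   # (K, D, D) weighted outer-product sums
    sigma /= N_k[:, None, None]
    return sigma + reg * np.eye(D)[None, :, :]          # regularize for numerical stability
```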
Implementing EM correctly requires careful verification. Here are essential checks to ensure your implementation is correct:
```python
import numpy as np

def verify_em_iteration(gamma, pi, sigma, prev_ll, curr_ll, tol=1e-10):
    """
    Verify EM iteration correctness.

    Returns:
        dict: Verification results with pass/fail status
    """
    N, K = gamma.shape
    results = {}

    # Check 1: Responsibility normalization
    resp_sums = gamma.sum(axis=1)
    results['responsibilities_normalized'] = np.allclose(resp_sums, 1.0, atol=tol)

    # Check 2: Responsibility bounds
    results['responsibilities_bounded'] = (gamma >= 0).all() and (gamma <= 1).all()

    # Check 3: Mixing coefficients sum to 1
    results['mixing_coeffs_normalized'] = np.isclose(pi.sum(), 1.0, atol=tol)

    # Check 4: Log-likelihood monotonicity (CRITICAL)
    results['likelihood_monotonic'] = curr_ll >= prev_ll - tol
    if not results['likelihood_monotonic']:
        print(f"WARNING: Likelihood decreased from {prev_ll:.6f} to {curr_ll:.6f}")

    # Check 5: Covariance positive definiteness
    try:
        for s in sigma:
            np.linalg.cholesky(s)  # Fails if not positive definite
        results['covariances_positive_definite'] = True
    except np.linalg.LinAlgError:
        results['covariances_positive_definite'] = False

    # Overall pass
    results['all_passed'] = all(results.values())

    return results
```

If log-likelihood ever decreases, you have a bug. Period. Common causes: (1) using old means in the covariance update, (2) numerical instability in log-sum-exp, (3) incorrect Gaussian density computation, (4) forgetting to add regularization, causing singular covariances. Always check monotonicity during development.
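One possible way to wire these checks into training, assuming the variable names from the driver sketch earlier on this page (illustrative, not prescriptive):

```python
# Inside the EM loop, after each M-step:
prev_ll, curr_ll = log_likelihoods[-2], log_likelihoods[-1]
checks = verify_em_iteration(gamma, pi, sigma, prev_ll, curr_ll)
if not checks['all_passed']:
    failed = [name for name, ok in checks.items() if not ok and name != 'all_passed']
    raise RuntimeError(f"EM sanity checks failed: {failed}")
```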
We've now completely derived both steps of the EM algorithm for GMMs from first principles. The key mathematical techniques were Bayes' theorem (the E-step responsibilities), Lagrange multipliers (the constrained mixing coefficients), and vector and matrix calculus (the mean and covariance updates).
The General Pattern
The derivation pattern extends to other latent variable models: (1) write down the complete-data log-likelihood, (2) take its expectation under the posterior of the latent variables given the current parameters (the E-step), and (3) maximize the resulting Q-function with respect to the parameters (the M-step).
For exponential family models, step 2 reduces to computing expected sufficient statistics, and step 3 typically has closed-form solutions. This is why EM is so widely applicable.
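To make the pattern concrete, here is a generic skeleton; it is an abstract sketch, and the method names are placeholders rather than an established API.

```python
def em(model, X, theta, max_iters=100):
    """Generic EM skeleton: any model providing these two operations fits the pattern."""
    for _ in range(max_iters):
        # E-step: posterior over latent variables given current parameters
        posterior = model.latent_posterior(X, theta)            # e.g. responsibilities for a GMM
        # M-step: maximize the expected complete-data log-likelihood
        theta = model.maximize_expected_complete_ll(X, posterior)
    return theta
```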
The next page examines the convergence properties of EM: why does monotonicity guarantee convergence? How fast does EM converge? Under what conditions does it find the global optimum? These theoretical questions have practical implications for when to use EM versus alternatives.