The Expectation-Maximization algorithm derives its name from its two alternating steps: the Expectation step (E-step) and the Maximization step (M-step). While the previous page introduced these concepts intuitively, this page provides complete, rigorous mathematical derivations from first principles.
Understanding these derivations is essential for implementing EM correctly and for extending it to other latent variable models.
By the end of this page, you will be able to: (1) derive the E-step responsibility formula from Bayes' theorem, (2) derive all M-step update equations using calculus and Lagrange multipliers, (3) understand why these specific formulas emerge from the optimization, and (4) recognize the pattern that generalizes to other latent variable models.
The E-step computes the posterior distribution of latent variables given the observed data and current parameter estimates. For GMMs, this means computing the probability that each data point was generated by each component.
Setup and Notation
Let $\mathbf{x}_n \in \mathbb{R}^D$ be the $n$-th observation ($n = 1, \ldots, N$). Let $z_{nk} \in \{0, 1\}$ indicate whether observation $n$ was generated by component $k$ ($k = 1, \ldots, K$), with $\sum_{k=1}^{K} z_{nk} = 1$.
The current parameters are $\boldsymbol{\theta}^{(t)} = \{\pi_k^{(t)}, \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)}\}_{k=1}^{K}$.
Deriving Responsibilities via Bayes' Theorem
We want to compute $p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)})$, the posterior probability that observation $n$ came from component $k$.
By Bayes' theorem:
$$p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}) = \frac{p(\mathbf{x}_n \mid z_{nk} = 1, \boldsymbol{\theta}^{(t)}) \cdot p(z_{nk} = 1 \mid \boldsymbol{\theta}^{(t)})}{p(\mathbf{x}_n \mid \boldsymbol{\theta}^{(t)})}$$
Let's identify each term:
Prior: $p(z_{nk} = 1 \mid \boldsymbol{\theta}^{(t)}) = \pi_k^{(t)}$ (the mixing coefficient)
Likelihood: $p(\mathbf{x}_n \mid z_{nk} = 1, \boldsymbol{\theta}^{(t)}) = \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})$ (component Gaussian density)
Evidence (marginal likelihood): $p(\mathbf{x}_n \mid \boldsymbol{\theta}^{(t)}) = \sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})$
Substituting into Bayes' theorem yields the responsibility: $$\gamma_{nk}^{(t)} \triangleq p(z_{nk} = 1 \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}) = \frac{\pi_k^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$$ This is a softmax-like normalization of weighted Gaussian densities.
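To make the formula concrete, here is a toy calculation with assumed numbers (not tied to any dataset): suppose $K = 2$, $\pi_1^{(t)} = \pi_2^{(t)} = 0.5$, and the component densities at $\mathbf{x}_n$ evaluate to $0.20$ and $0.05$. Then

$$\gamma_{n1}^{(t)} = \frac{0.5 \times 0.20}{0.5 \times 0.20 + 0.5 \times 0.05} = \frac{0.10}{0.125} = 0.8, \qquad \gamma_{n2}^{(t)} = \frac{0.025}{0.125} = 0.2,$$

so component 1 is four times as responsible for this point as component 2.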
Properties of Responsibilities
Normalization: $\sum_{k=1}^{K} \gamma_{nk}^{(t)} = 1$ for all $n$ (probabilities sum to one)
Non-negativity: $\gamma_{nk}^{(t)} \geq 0$ for all $n, k$
Interpretation: $\gamma_{nk}^{(t)}$ represents our belief that observation $n$ came from component $k$, given current parameters
Expectation property: $\gamma_{nk}^{(t)} = \mathbb{E}[z_{nk} \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}]$, since $z_{nk} \in \{0, 1\}$
Expanded Gaussian Density
For completeness, the multivariate Gaussian density is:
$$\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right)$$
where $|\boldsymbol{\Sigma}_k|$ is the determinant and $\boldsymbol{\Sigma}_k^{-1}$ is the inverse (precision matrix).
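In practice, the product $\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ can underflow for high-dimensional data, so responsibilities are usually computed in log space. Below is a minimal sketch of that idea; the function name is illustrative and not part of the reference implementation later on this page.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_space_responsibilities(X, pi, mu, sigma):
    """Numerically stable E-step sketch: work with log pi_k + log N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pi)
    log_weighted = np.empty((N, K))
    for k in range(K):
        log_weighted[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(
            X, mean=mu[k], cov=sigma[k]
        )
    # log gamma_{nk} = log weighted density minus log-sum-exp over components
    log_gamma = log_weighted - logsumexp(log_weighted, axis=1, keepdims=True)
    return np.exp(log_gamma)
```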
The M-step maximizes the Q-function, which is the expected value of the complete-data log-likelihood with respect to the posterior distribution of latent variables.
Defining the Q-Function
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(t)})} \left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]$$
The first argument $\boldsymbol{\theta}$ is what we optimize over. The second argument $\boldsymbol{\theta}^{(t)}$ specifies the distribution used for the expectation (the current parameters).
Deriving the Q-Function for GMMs
The complete-data log-likelihood is:
$$\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \log \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Since $z_{nk}$ appears linearly and $\mathbb{E}[z_{nk} \mid \mathbf{x}_n, \boldsymbol{\theta}^{(t)}] = \gamma_{nk}^{(t)}$:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \log \left[ \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Expanding the logarithm:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Expanding the Gaussian Log-Density
The log of the Gaussian density is:
$$\log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = -\frac{D}{2} \log(2\pi) - \frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$
The full Q-function becomes:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(t)}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \left[ \log \pi_k - \frac{D}{2} \log(2\pi) - \frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$$
Notice that the Q-function separates cleanly over components $k$: the terms involving $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$ can be grouped separately for each $k$. This separability is what enables closed-form M-step solutions: we can optimize each component's parameters independently (except for the constraint $\sum_{k=1}^{K} \pi_k = 1$).
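As a debugging aid, the Q-function can be evaluated directly from the responsibilities and any candidate parameters. The sketch below is one possible helper; its name is an assumption, not part of the reference implementation later on this page.

```python
import numpy as np
from scipy.stats import multivariate_normal

def q_function(X, gamma, pi, mu, sigma):
    """Evaluate Q(theta, theta^(t)) = sum_n sum_k gamma_{nk} [log pi_k + log N(x_n | mu_k, Sigma_k)]."""
    K = gamma.shape[1]
    q = 0.0
    for k in range(K):
        log_term = np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=sigma[k])
        q += np.sum(gamma[:, k] * log_term)  # component k's separable contribution
    return q
```

Because the Q-function separates over components, each term of this loop can be inspected on its own when a single component misbehaves.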
The mixing coefficients $\{\pi_1, \ldots, \pi_K\}$ must satisfy the constraint $\sum_{k=1}^{K} \pi_k = 1$ with $\pi_k \geq 0$. We use Lagrange multipliers to handle this constraint.
Constrained Optimization Setup
We maximize:
$$\mathcal{L}_{\pi} = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{(t)} \log \pi_k + \lambda \left( 1 - \sum_{k=1}^{K} \pi_k \right)$$
where $\lambda$ is the Lagrange multiplier. (Note: we only include terms involving $\pi_k$ from the Q-function.)
Taking Derivatives
Differentiate with respect to $\pi_k$:
$$\frac{\partial \mathcal{L}_{\pi}}{\partial \pi_k} = \sum_{n=1}^{N} \frac{\gamma_{nk}^{(t)}}{\pi_k} - \lambda = 0$$
Solving for $\pi_k$:
$$\pi_k = \frac{1}{\lambda} \sum_{n=1}^{N} \gamma_{nk}^{(t)} = \frac{N_k}{\lambda}$$
where we define $N_k \triangleq \sum_{n=1}^{N} \gamma_{nk}^{(t)}$ as the effective number of points assigned to component $k$.
Determining the Lagrange Multiplier
Apply the constraint $\sum_{k=1}^{K} \pi_k = 1$:
$$\sum_{k=1}^{K} \frac{N_k}{\lambda} = 1 \implies \lambda = \sum_{k=1}^{K} N_k = \sum_{k=1}^{K} \sum_{n=1}^{N} \gamma_{nk}^{(t)} = \sum_{n=1}^{N} \underbrace{\sum_{k=1}^{K} \gamma_{nk}^{(t)}}_{= 1} = N$$
Therefore $\lambda = N$, and the update is:
$$\pi_k^{(t+1)} = \frac{N_k}{N} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}^{(t)}$$ The new mixing coefficient equals the average responsibility—the fraction of total responsibility assigned to component k across all data points.
Interpretation
This update has a natural interpretation: $N_k$ acts as the effective number of points claimed by component $k$, so $\pi_k^{(t+1)} = N_k / N$ is simply the fraction of the data softly assigned to that component.
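As a toy illustration with assumed counts: if $N = 100$ and the effective counts are $N_1 = 60$, $N_2 = 30$, $N_3 = 10$, then

$$\pi_1^{(t+1)} = 0.6, \qquad \pi_2^{(t+1)} = 0.3, \qquad \pi_3^{(t+1)} = 0.1,$$

which sum to one automatically because $\sum_{k} N_k = N$.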
The mean $\boldsymbol{\mu}_k$ is unconstrained, so we simply differentiate the relevant terms of the Q-function and set the gradient to zero.
Relevant Terms
From the Q-function, the terms involving $\boldsymbol{\mu}_k$ are:
$$Q_{\mu_k} = -\frac{1}{2} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) + \text{const}$$
where "const" denotes terms not depending on $\boldsymbol{\mu}_k$.
Computing the Gradient
We need $\frac{\partial}{\partial \boldsymbol{\mu}_k} \left[ (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$.
Using the identity $\frac{\partial}{\partial \mathbf{a}} (\mathbf{x} - \mathbf{a})^\top \mathbf{A} (\mathbf{x} - \mathbf{a}) = -2 \mathbf{A} (\mathbf{x} - \mathbf{a})$ for symmetric $\mathbf{A}$:
$$\frac{\partial Q_{\mu_k}}{\partial \boldsymbol{\mu}_k} = -\frac{1}{2} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \cdot (-2) \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$
$$= \sum_{n=1}^{N} \gamma_{nk}^{(t)} \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k)$$
Setting the Gradient to Zero
Setting this equal to zero:
$$\boldsymbol{\Sigma}_k^{-1} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k) = \mathbf{0}$$
Since $\boldsymbol{\Sigma}_k^{-1}$ is positive definite (hence invertible), we can multiply both sides by $\boldsymbol{\Sigma}_k$:
$$\sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k) = \mathbf{0}$$
Expanding:
$$\sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n - \boldsymbol{\mu}_k \sum_{n=1}^{N} \gamma_{nk}^{(t)} = \mathbf{0}$$
$$\sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n = N_k \boldsymbol{\mu}_k$$
$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n = \frac{\sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n}{\sum_{n=1}^{N} \gamma_{nk}^{(t)}}$$ The new mean is the responsibility-weighted average of all data points: a soft version of the sample mean.
Interpretation
Each data point contributes to every component's mean, weighted by that component's responsibility for the point; points the component is confident about pull its mean most strongly. With hard assignments ($\gamma_{nk} \in \{0, 1\}$), the update reduces to the ordinary sample mean of the points assigned to component $k$, exactly as in K-means.
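Concretely, under a hard assignment, writing $\mathcal{C}_k = \{n : \gamma_{nk} = 1\}$ (notation introduced here just for illustration), the update becomes

$$\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{|\mathcal{C}_k|} \sum_{n \in \mathcal{C}_k} \mathbf{x}_n,$$

the centroid of the points assigned to component $k$.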
The covariance update is the most mathematically involved. We need matrix calculus to differentiate with respect to $\boldsymbol{\Sigma}_k$.
Relevant Terms
From the Q-function, terms involving $\boldsymbol{\Sigma}_k$ are:
$$Q_{\Sigma_k} = \sum_{n=1}^{N} \gamma_{nk}^{(t)} \left[ -\frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$$
It's more convenient to work with the precision matrix $\boldsymbol{\Lambda}_k = \boldsymbol{\Sigma}_k^{-1}$:
$$Q_{\Lambda_k} = \sum_{n=1}^{N} \gamma_{nk}^{(t)} \left[ \frac{1}{2} \log |\boldsymbol{\Lambda}_k| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \boldsymbol{\Lambda}_k (\mathbf{x}_n - \boldsymbol{\mu}_k) \right]$$
(using $\log |\boldsymbol{\Sigma}_k| = -\log |\boldsymbol{\Lambda}_k|$)
Matrix Calculus Identities
We need two key identities:
$\frac{\partial}{\partial \mathbf{A}} \log |\mathbf{A}| = (\mathbf{A}^{-1})^\top = \mathbf{A}^{-1}$ (for symmetric $\mathbf{A}$)
$\frac{\partial}{\partial \mathbf{A}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = \mathbf{x} \mathbf{x}^\top$ (for symmetric $\mathbf{A}$)
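These identities can be sanity-checked numerically with finite differences. The sketch below is an optional verification aid under the stated symmetry assumption; the helper name, step size, and tolerances are illustrative.

```python
import numpy as np

def check_matrix_identities(D=3, eps=1e-6, seed=0):
    """Finite-difference check of d/dA log|A| = A^{-1} and d/dA x^T A x = x x^T (symmetric A)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((D, D))
    A = B @ B.T + D * np.eye(D)          # symmetric positive definite test matrix
    x = rng.standard_normal(D)

    grad_logdet = np.zeros((D, D))
    grad_quad = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            E = np.zeros((D, D))
            E[i, j] = eps
            # central differences with respect to entry (i, j)
            grad_logdet[i, j] = (np.linalg.slogdet(A + E)[1] - np.linalg.slogdet(A - E)[1]) / (2 * eps)
            grad_quad[i, j] = (x @ (A + E) @ x - x @ (A - E) @ x) / (2 * eps)

    assert np.allclose(grad_logdet, np.linalg.inv(A), atol=1e-4)
    assert np.allclose(grad_quad, np.outer(x, x), atol=1e-4)
    return True
```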
Computing the Gradient
$$\frac{\partial Q_{\Lambda_k}}{\partial \boldsymbol{\Lambda}_k} = \sum_{n=1}^{N} \gamma_{nk}^{(t)} \left[ \frac{1}{2} \boldsymbol{\Lambda}_k^{-1} - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top \right]$$
$$= \frac{1}{2} N_k \boldsymbol{\Sigma}_k - \frac{1}{2} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top$$
Setting the Gradient to Zero
$$N_k \boldsymbol{\Sigma}_k = \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top$$
$$\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^\top$$ The new covariance is the responsibility-weighted outer product of deviations from the new mean.
Important Note: The covariance update uses the new mean $\boldsymbol{\mu}_k^{(t+1)}$, not the old mean. In practice, compute all new means first, then compute new covariances.
Interpretation
The covariance update is a soft version of the sample covariance: each outer product $(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^\top$ is weighted by how responsible component $k$ is for point $n$, and the sum is normalized by the effective count $N_k$ rather than by $N$.
Let's consolidate the complete EM algorithm for GMMs, including both steps in their final form.
| Step | Computation | Formula |
|---|---|---|
| E-Step | Compute responsibilities | $\gamma_{nk}^{(t)} = \frac{\pi_k^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j^{(t)}, \boldsymbol{\Sigma}_j^{(t)})}$ |
| E-Step | Compute effective counts | $N_k = \sum_{n=1}^{N} \gamma_{nk}^{(t)}$ |
| M-Step | Update mixing coefficients | $\pi_k^{(t+1)} = \frac{N_k}{N}$ |
| M-Step | Update means | $\boldsymbol{\mu}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} \mathbf{x}_n$ |
| M-Step | Update covariances | $\boldsymbol{\Sigma}_k^{(t+1)} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}^{(t)} (\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(t+1)})^\top$ |
```python
import numpy as np
from scipy.stats import multivariate_normal


def e_step(X, pi, mu, sigma):
    """
    E-Step: Compute responsibilities.

    Parameters:
        X: (N, D) data matrix
        pi: (K,) mixing coefficients
        mu: (K, D) component means
        sigma: (K, D, D) component covariances

    Returns:
        gamma: (N, K) responsibility matrix
        N_k: (K,) effective counts per component
    """
    N, D = X.shape
    K = len(pi)

    # Compute weighted densities: π_k * N(x_n | μ_k, Σ_k)
    weighted_densities = np.zeros((N, K))
    for k in range(K):
        weighted_densities[:, k] = pi[k] * multivariate_normal.pdf(
            X, mean=mu[k], cov=sigma[k]
        )

    # Normalize to get responsibilities
    gamma = weighted_densities / weighted_densities.sum(axis=1, keepdims=True)

    # Compute effective counts
    N_k = gamma.sum(axis=0)  # (K,)

    return gamma, N_k


def m_step(X, gamma, N_k):
    """
    M-Step: Update parameters.

    Parameters:
        X: (N, D) data matrix
        gamma: (N, K) responsibility matrix
        N_k: (K,) effective counts per component

    Returns:
        pi: (K,) new mixing coefficients
        mu: (K, D) new component means
        sigma: (K, D, D) new component covariances
    """
    N, D = X.shape
    K = len(N_k)

    # Update mixing coefficients
    pi = N_k / N

    # Update means: μ_k = (1/N_k) Σ_n γ_{nk} x_n
    mu = (gamma.T @ X) / N_k[:, np.newaxis]  # (K, D)

    # Update covariances: Σ_k = (1/N_k) Σ_n γ_{nk} (x_n - μ_k)(x_n - μ_k)^T
    sigma = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]  # (N, D)
        # Weighted outer product sum
        weighted_diff = gamma[:, k:k+1] * diff  # (N, D) with weights
        sigma[k] = (weighted_diff.T @ diff) / N_k[k]
        # Add regularization for numerical stability
        sigma[k] += 1e-6 * np.eye(D)

    return pi, mu, sigma


def compute_log_likelihood(X, pi, mu, sigma):
    """
    Compute observed-data log-likelihood.
    """
    N, D = X.shape
    K = len(pi)

    weighted_densities = np.zeros((N, K))
    for k in range(K):
        weighted_densities[:, k] = pi[k] * multivariate_normal.pdf(
            X, mean=mu[k], cov=sigma[k]
        )

    return np.sum(np.log(weighted_densities.sum(axis=1)))
```

Update Ordering
The M-step updates should be computed in a specific order: first the mixing coefficients and means, which depend only on the responsibilities, and then the covariances, which depend on the newly computed means. Using old means in the covariance update would violate the M-step's guarantee of maximizing the Q-function.
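Putting the pieces together, a minimal driver loop might look like the sketch below, which assumes the e_step, m_step, and compute_log_likelihood functions defined above; the initialization scheme and stopping tolerance are illustrative choices, not prescribed here.

```python
import numpy as np

def run_em(X, K, max_iters=100, tol=1e-6, seed=0):
    """Minimal EM driver: alternate E-step and M-step until the log-likelihood stabilizes."""
    N, D = X.shape
    rng = np.random.default_rng(seed)

    # Simple initialization: random data points as means, shared data covariance, uniform weights
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(D)] * K)

    log_likelihoods = [compute_log_likelihood(X, pi, mu, sigma)]
    for _ in range(max_iters):
        gamma, N_k = e_step(X, pi, mu, sigma)   # E-step: responsibilities
        pi, mu, sigma = m_step(X, gamma, N_k)   # M-step: closed-form updates (means before covariances)
        log_likelihoods.append(compute_log_likelihood(X, pi, mu, sigma))
        if log_likelihoods[-1] - log_likelihoods[-2] < tol:
            break
    return pi, mu, sigma, log_likelihoods
```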
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Compute $K$ Gaussian densities for $N$ points | $O(NKD^2)$ | $O(NK)$ |
| Normalize responsibilities | $O(NK)$ | $O(1)$ |
| Compute $N_k$ | $O(NK)$ | $O(K)$ |
| Update means | $O(NKD)$ | $O(KD)$ |
| Update covariances | $O(NKD^2)$ | $O(KD^2)$ |
| Total per iteration | $O(NKD^2)$ | $O(NK + KD^2)$ |
Complexity Analysis
Comparison to K-means: each EM iteration costs $O(NKD^2)$, dominated by the Gaussian density evaluations and the covariance updates, whereas a K-means iteration costs $O(NKD)$ because it only computes Euclidean distances and centroid averages. EM's richer probabilistic model comes at roughly a factor-of-$D$ higher per-iteration cost.
Optimizations: the main practical lever is vectorization, discussed below; restricting components to diagonal covariances also reduces the $D^2$ terms to $D$.
The update equations are highly amenable to vectorization. The mean update mu = (gamma.T @ X) / N_k is a single matrix multiply. For covariances, broadcasting and einsum can avoid explicit loops over components. This is critical for practical performance—vectorized NumPy/PyTorch implementations are 10-100× faster than explicit loops.
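As one possible realization of the einsum approach mentioned above, the sketch below computes all $K$ covariances at once; it is intended to be equivalent to the per-component loop in m_step, with the same small regularization term.

```python
import numpy as np

def covariance_update_vectorized(X, gamma, mu, N_k, reg=1e-6):
    """Vectorized covariance update: Σ_k = (1/N_k) Σ_n γ_{nk} (x_n - μ_k)(x_n - μ_k)^T for all k."""
    N, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]               # (N, K, D) deviations from each mean
    weighted = gamma[:, :, None] * diff                 # (N, K, D) responsibility-weighted deviations
    sigma = np.einsum('nkd,nke->kde', weighted, diff)   # (K, D, D) weighted outer-product sums
    sigma /= N_k[:, None, None]
    return sigma + reg * np.eye(D)[None, :, :]          # regularize for numerical stability
```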
Implementing EM correctly requires careful verification. Here are essential checks to ensure your implementation is correct:
```python
import numpy as np

def verify_em_iteration(gamma, pi, sigma, prev_ll, curr_ll, tol=1e-10):
    """
    Verify EM iteration correctness.

    Returns:
        dict: Verification results with pass/fail status
    """
    N, K = gamma.shape
    results = {}

    # Check 1: Responsibility normalization
    resp_sums = gamma.sum(axis=1)
    results['responsibilities_normalized'] = np.allclose(resp_sums, 1.0, atol=tol)

    # Check 2: Responsibility bounds
    results['responsibilities_bounded'] = (gamma >= 0).all() and (gamma <= 1).all()

    # Check 3: Mixing coefficients sum to 1
    results['mixing_coeffs_normalized'] = np.isclose(pi.sum(), 1.0, atol=tol)

    # Check 4: Log-likelihood monotonicity (CRITICAL)
    results['likelihood_monotonic'] = curr_ll >= prev_ll - tol
    if not results['likelihood_monotonic']:
        print(f"WARNING: Likelihood decreased from {prev_ll:.6f} to {curr_ll:.6f}")

    # Check 5: Covariance positive definiteness
    try:
        for s in sigma:
            np.linalg.cholesky(s)  # Fails if not positive definite
        results['covariances_positive_definite'] = True
    except np.linalg.LinAlgError:
        results['covariances_positive_definite'] = False

    # Overall pass
    results['all_passed'] = all(results.values())

    return results
```

If log-likelihood ever decreases, you have a bug. Period. Common causes: (1) using old means in the covariance update, (2) numerical instability in log-sum-exp, (3) incorrect Gaussian density computation, (4) forgetting to add regularization, causing singular covariances. Always check monotonicity during development.
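One possible way to wire these checks into training, assuming the variable names from the driver sketch earlier on this page (illustrative, not prescriptive):

```python
# Inside the EM loop, after each M-step:
prev_ll, curr_ll = log_likelihoods[-2], log_likelihoods[-1]
checks = verify_em_iteration(gamma, pi, sigma, prev_ll, curr_ll)
if not checks['all_passed']:
    failed = [name for name, ok in checks.items() if not ok and name != 'all_passed']
    raise RuntimeError(f"EM sanity checks failed: {failed}")
```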
We've now completely derived both steps of the EM algorithm for GMMs from first principles. The key mathematical techniques were Bayes' theorem (the E-step responsibilities), Lagrange multipliers (the constrained mixing coefficients), and vector and matrix calculus (the mean and covariance updates).
The General Pattern
The derivation pattern extends to other latent variable models: (1) write down the complete-data log-likelihood, (2) take its expectation under the posterior of the latent variables given the current parameters (the E-step), and (3) maximize the resulting Q-function with respect to the parameters (the M-step).
For exponential family models, step 2 reduces to computing expected sufficient statistics, and step 3 typically has closed-form solutions. This is why EM is so widely applicable.
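To make the pattern concrete, here is a generic skeleton; it is an abstract sketch, and the method names are placeholders rather than an established API.

```python
def em(model, X, theta, max_iters=100):
    """Generic EM skeleton: any model providing these two operations fits the pattern."""
    for _ in range(max_iters):
        # E-step: posterior over latent variables given current parameters
        posterior = model.latent_posterior(X, theta)            # e.g. responsibilities for a GMM
        # M-step: maximize the expected complete-data log-likelihood
        theta = model.maximize_expected_complete_ll(X, posterior)
    return theta
```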
The next page examines the convergence properties of EM: why does monotonicity guarantee convergence? How fast does EM converge? Under what conditions does it find the global optimum? These theoretical questions have practical implications for when to use EM versus alternatives.