In the previous page, we developed the mathematical formulation of Gaussian Mixture Models and introduced the generative story—sample a component, then sample from that component's Gaussian. This two-stage process involves a critical quantity that we observe only indirectly: the component assignment.
This hidden component assignment is an example of a latent variable—an unobserved random variable that influences the observed data but is never directly measured. Latent variables are fundamental to probabilistic modeling, appearing in mixture models, hidden Markov models, factor analysis, variational autoencoders, and countless other frameworks.
This page develops a deep understanding of latent variables in the GMM context: what they represent, how they connect observed and hidden quantities, why they make inference challenging, and how they enable the powerful EM algorithm.
By the end of this page, you will:

1. Understand the precise role and representation of latent variables in GMMs
2. Derive the complete-data likelihood and contrast it with the incomplete-data likelihood
3. Compute posterior responsibilities using Bayes' theorem
4. Explain why latent variables create optimization difficulties
5. Set the foundation for the EM algorithm by understanding the missing data perspective
A latent variable (from Latin latēre, "to lie hidden") is a random variable that is not directly observed but is inferred from the observed data and the model structure. Latent variables serve multiple purposes in statistical modeling:
1. Representation of hidden structure: Data often arises from processes with unobserved components. Customer segments, disease subtypes, speaker identity in audio—these underlying factors generate the data but aren't directly labeled.
2. Dimensionality reduction: Latent variables can capture lower-dimensional structure in high-dimensional data. Factor analysis and PCA (as a limiting case) use latent variables to explain correlations.
3. Modeling dependencies: Latent variables can induce dependencies between observed variables. Two observations might be conditionally independent given a shared latent state but marginally dependent.
4. Enabling tractable computation: Introducing latent variables can make otherwise intractable models computationally feasible, as we'll see with EM.
It's crucial to distinguish latent variables from parameters. Parameters (like $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, $\pi_k$) are fixed but unknown quantities that we estimate. Latent variables (like component assignments $z_n$) are random variables with probability distributions—we infer their posterior distributions given data, not point estimates (in the Bayesian perspective). Frequentist treatments sometimes blur this distinction by treating latent variables as parameters to estimate.
In Gaussian Mixture Models, the latent variable $z$ represents the component assignment—which of the $K$ Gaussian components generated a given observation.
Formal definition:
For each observation $\mathbf{x}_n$, we have an associated latent variable $z_n \in \{1, 2, \ldots, K\}$ indicating the component from which $\mathbf{x}_n$ was generated.
Distribution of $z_n$: $$P(z_n = k) = \pi_k \quad \text{for } k = 1, \ldots, K$$
The latent variables $z_1, z_2, \ldots, z_N$ are independent and identically distributed according to this categorical distribution.
It's often convenient to represent $z_n$ as a one-hot (1-of-K) vector $\mathbf{z}_n \in \{0,1\}^K$ where exactly one element is 1 and all others are 0:
$$z_{nk} = \begin{cases} 1 & \text{if observation } n \text{ belongs to component } k \\ 0 & \text{otherwise} \end{cases}$$
With this encoding: $\sum_{k=1}^K z_{nk} = 1$ and $P(\mathbf{z}_n) = \prod_{k=1}^K \pi_k^{z_{nk}}$
The conditional distribution of $\mathbf{x}$ given $z$:
Given the component assignment, the observation is drawn from the corresponding Gaussian: $$p(\mathbf{x}_n \mid z_n = k) = \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Using the one-hot encoding: $$p(\mathbf{x}_n \mid \mathbf{z}_n) = \prod_{k=1}^K \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_{nk}}$$
Since exactly one $z_{nk} = 1$, only the corresponding Gaussian contributes.
The complete probabilistic model:
The joint distribution of observed and latent variables factorizes as: $$p(\mathbf{x}_n, \mathbf{z}_n) = P(\mathbf{z}_n) \cdot p(\mathbf{x}_n \mid \mathbf{z}_n) = \prod_{k=1}^K \left[ \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}$$
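The two-stage generative story behind this factorization can be sketched directly. The parameters below are purely illustrative (a hypothetical 3-component mixture in 2-D), not taken from any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: K = 3 components in D = 2 dimensions
pi = np.array([0.3, 0.5, 0.2])
means = np.array([[0.0, 0.0], [5.0, 0.0], [2.5, 4.0]])
covs = np.array([np.eye(2) * 0.5 for _ in range(3)])

def sample_gmm(n_samples):
    """Sample (x, z) pairs following the two-stage generative story."""
    # Stage 1: sample component assignments z_n ~ Categorical(pi)
    z = rng.choice(len(pi), size=n_samples, p=pi)
    # Stage 2: sample x_n ~ N(mu_{z_n}, Sigma_{z_n})
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

X, Z = sample_gmm(1000)
# Empirical component frequencies approach the mixing weights pi
print(np.bincount(Z, minlength=3) / len(Z))
```

In real inference the second return value `Z` would be discarded: it is exactly the latent variable we never observe.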
| Aspect | Scalar $z_n \in \{1,\ldots,K\}$ | One-Hot $\mathbf{z}_n \in \{0,1\}^K$ |
|---|---|---|
| Space | Single integer | K-dimensional binary vector |
| Prior | $P(z_n = k) = \pi_k$ | $P(\mathbf{z}_n) = \prod_k \pi_k^{z_{nk}}$ |
| Constraint | — | $\sum_k z_{nk} = 1$ |
| Likelihood | $p(\mathbf{x} \mid z=k)$ | $\prod_k p_k(\mathbf{x})^{z_{nk}}$ |
| Derivatives | Discrete, no gradients | Still discrete, but product form is useful |
| In code | z = 2 | z = [0, 1, 0, 0] |
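Converting between the two encodings is a one-liner in each direction. A quick sketch (note that code uses 0-indexed labels, whereas the math above is 1-indexed):

```python
import numpy as np

K = 4
z_scalar = np.array([2, 0, 3, 2])  # hypothetical scalar assignments, 0-indexed

# Scalar -> one-hot: row n has a single 1 in column z_n
Z_onehot = np.eye(K, dtype=int)[z_scalar]

# One-hot -> scalar: the position of the single 1 in each row
z_back = Z_onehot.argmax(axis=1)

assert (z_back == z_scalar).all()
assert (Z_onehot.sum(axis=1) == 1).all()  # the 1-of-K constraint
print(Z_onehot)
```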
A fundamental concept in latent variable models is the distinction between the complete-data and incomplete-data likelihoods.
Complete data: Both observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and latent variables $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ are known.
Incomplete data: Only observations $\mathbf{X}$ are known; latent variables $\mathbf{Z}$ are hidden.
In practice, we only have incomplete data. But analyzing the complete-data case provides crucial insights.
Assuming observations are i.i.d., the complete-data log-likelihood is:
$$\ell_c(\boldsymbol{\theta}) = \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Note how the logarithm is now inside the summation over components, making each term tractable.
Deriving the complete-data log-likelihood:
Starting from the joint probability: $$p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\mathbf{x}_n, \mathbf{z}_n)$$ $$= \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}$$
Taking logarithms: $$\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \log\left[ \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$ $$= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Why this form is tractable:
The indicator $z_{nk}$ acts as a selector. For each observation, only the term corresponding to its true component contributes. The complete-data log-likelihood decomposes into separate terms for each component, and optimizing for one component's parameters doesn't directly involve other components' parameters.
```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_likelihood(X, Z, means, covs, pi):
    """
    Compute complete-data log-likelihood when assignments are known.

    Parameters:
    -----------
    X : array, shape (N, D)
        Observed data
    Z : array, shape (N, K)
        One-hot encoded component assignments
    means : list of arrays, shape (K, D)
        Component means
    covs : list of arrays, shape (K, D, D)
        Component covariances
    pi : array, shape (K,)
        Mixing coefficients

    Returns:
    --------
    log_likelihood : float
        Complete-data log-likelihood
    """
    N, D = X.shape
    K = len(means)
    log_likelihood = 0.0

    for n in range(N):
        for k in range(K):
            if Z[n, k] == 1:
                # Only the assigned component contributes
                # Log prior for component k
                log_prior = np.log(pi[k])
                # Log likelihood under component k's Gaussian
                log_pdf = multivariate_normal.logpdf(X[n], means[k], covs[k])
                log_likelihood += log_prior + log_pdf

    return log_likelihood


# Example with known assignments
N, D, K = 100, 2, 3

# Generate synthetic data with known labels
np.random.seed(42)
true_labels = np.random.choice(K, size=N, p=[0.3, 0.5, 0.2])
X = np.zeros((N, D))
means = [np.array([0, 0]), np.array([5, 0]), np.array([2.5, 4])]
covs = [np.eye(D) * 0.5 for _ in range(K)]

for n in range(N):
    k = true_labels[n]
    X[n] = np.random.multivariate_normal(means[k], covs[k])

# Create one-hot encoding
Z = np.zeros((N, K))
for n in range(N):
    Z[n, true_labels[n]] = 1

# Compute complete-data log-likelihood
pi = np.array([0.3, 0.5, 0.2])
ll = complete_data_log_likelihood(X, Z, means, covs, pi)
print(f"Complete-data log-likelihood: {ll:.2f}")
print(f"Per-sample average: {ll/N:.2f}")
```

If we knew the latent assignments $\mathbf{Z}$, maximizing the complete-data log-likelihood would be easy—it reduces to $K$ separate Gaussian MLE problems. The EM algorithm exploits this by iteratively: (1) estimating the "effective" assignments (E-step) and (2) maximizing as if these estimates were true assignments (M-step).
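To make the "$K$ separate MLE problems" point concrete, here is a minimal sketch (a hypothetical `mle_per_component` helper, assuming hard labels are handed to us) of the closed-form per-component estimates:

```python
import numpy as np

def mle_per_component(X, z, K):
    """Closed-form MLE for each component, using only its assigned points."""
    N = len(X)
    pi_hat, means_hat, covs_hat = [], [], []
    for k in range(K):
        Xk = X[z == k]  # points assigned to component k
        pi_hat.append(len(Xk) / N)                # mixing weight = label fraction
        means_hat.append(Xk.mean(axis=0))         # sample mean of the cluster
        covs_hat.append(np.cov(Xk.T, bias=True))  # MLE covariance (divide by N_k)
    return np.array(pi_hat), np.array(means_hat), np.array(covs_hat)

# Example: two well-separated clusters with known labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
z = np.repeat([0, 1], 50)
pi_hat, mu_hat, _ = mle_per_component(X, z, K=2)
print(pi_hat)  # [0.5 0.5]
print(mu_hat)  # means near [0, 0] and [6, 6]
```

No component's estimate depends on any other component's points: this decomposition is exactly what the incomplete-data likelihood destroys.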
The incomplete-data (marginal) likelihood is what we actually work with, since latent variables are unobserved. It's obtained by marginalizing the complete-data likelihood over all possible latent configurations:
$$p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$
For GMMs, this sum is over all possible assignment vectors $\mathbf{Z}$—there are $K^N$ such configurations for $N$ observations and $K$ components.
The incomplete-data log-likelihood is:
$$\ell(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
Note: the log is outside the sum over $k$, which is the source of computational difficulty.
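A direct implementation of this quantity might look as follows. This is a sketch (the `incomplete_data_log_likelihood` name is ours), using SciPy's `logsumexp` to evaluate the log-of-a-sum stably:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def incomplete_data_log_likelihood(X, means, covs, pi):
    """log p(X | theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pi)
    log_terms = np.zeros((N, K))
    for k in range(K):
        # log pi_k + log N(x_n | mu_k, Sigma_k) for every n at once
        log_terms[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
    # The log sits OUTSIDE the sum over k: logsumexp handles it without underflow
    return logsumexp(log_terms, axis=1).sum()

# Illustrative evaluation on a toy two-component mixture
X = np.array([[0.0, 0.0], [4.0, 0.0]])
means = [np.zeros(2), np.array([4.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
pi = np.array([0.5, 0.5])
print(incomplete_data_log_likelihood(X, means, covs, pi))
```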
Comparing complete vs. incomplete:
| Aspect | Complete-Data | Incomplete-Data |
|---|---|---|
| Latent variables | Known | Hidden |
| Log-likelihood form | $\sum_n \sum_k z_{nk} \log[\cdot]$ | $\sum_n \log(\sum_k [\cdot])$ |
| Optimization | Decomposes into $K$ subproblems | Coupled, non-convex |
| Closed-form MLE | Yes, for each component | No |
| Computation | $O(NK)$ terms | Same, but harder structure |
Why marginalization creates difficulty:
The marginalization step—summing over latent configurations—is what couples the parameters together. In the incomplete-data likelihood:
$$\log \sum_k \pi_k \mathcal{N}_k(\mathbf{x})$$
Each observation contributes a log-sum-exp term. The gradient with respect to any parameter involves all components (through the normalizing sum in the denominator). This interdependence prevents decomposition into separate optimization problems.
For $N$ observations and $K$ components, there are $K^N$ possible complete-data configurations. With $N = 100$ and $K = 3$, that's $3^{100} \approx 5 \times 10^{47}$ configurations—an astronomically large number. We cannot enumerate all configurations; we need smarter approaches.
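The count is easy to verify with Python's exact integer arithmetic:

```python
K, N = 3, 100
n_configs = K ** N  # number of complete-data assignment configurations
print(f"{n_configs:.3e}")  # on the order of 1e47
```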
Given observed data and model parameters, we can compute the posterior distribution over latent variables—the probability distribution of $z_n$ given observation $\mathbf{x}_n$.
Applying Bayes' theorem: $$P(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}) = \frac{P(z_n = k) \cdot p(\mathbf{x}_n \mid z_n = k)}{p(\mathbf{x}_n)}$$ $$= \frac{\pi_k \cdot \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \cdot \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
We define the responsibility of component $k$ for observation $n$ as:
$$\gamma_{nk} \equiv P(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}) = \frac{\pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
These responsibilities satisfy $\gamma_{nk} \in [0,1]$ and $\sum_{k=1}^K \gamma_{nk} = 1$ for each $n$.
Interpreting responsibilities:
The responsibility $\gamma_{nk}$ quantifies how much component $k$ "claims" observation $\mathbf{x}_n$. It depends on:
Prior probability $\pi_k$: Components with larger mixing weights have higher responsibilities, all else equal.
Likelihood $\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$: Components whose Gaussian assigns high density to $\mathbf{x}_n$ have higher responsibilities.
Competition: The denominator normalizes across all components. A component can have high responsibility only if it outperforms (or matches) other components for this observation.
Soft vs. hard assignments:
Soft assignment: Use the full responsibility distribution. Point $\mathbf{x}_n$ belongs to component 1 with probability 0.7, component 2 with probability 0.2, etc.
Hard assignment: Assign each point to its maximum-responsibility component: $\hat{z}_n = \arg\max_k \gamma_{nk}$. This is analogous to k-means-style clustering.
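Collapsing soft responsibilities to hard assignments is a single `argmax`. A quick sketch with a hypothetical responsibility matrix:

```python
import numpy as np

# Hypothetical responsibilities for 3 points under K = 3 components
gamma = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
    [0.40, 0.35, 0.25],
])

# Hard assignment: maximum-responsibility component per point
z_hard = gamma.argmax(axis=1)
print(z_hard)  # [0 1 0]

# Soft assignment keeps the full rows of gamma; note the third point is
# genuinely ambiguous (0.40 vs 0.35), information the hard labels discard.
```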
```python
import numpy as np
from scipy.stats import multivariate_normal

def compute_responsibilities(X, means, covs, pi, eps=1e-10):
    """
    Compute posterior responsibilities for all observations.

    Parameters:
    -----------
    X : array, shape (N, D)
        Observed data
    means : list of arrays, shape (K, D)
        Component means
    covs : list of arrays, shape (K, D, D)
        Component covariances
    pi : array, shape (K,)
        Mixing coefficients
    eps : float
        Small constant for numerical stability

    Returns:
    --------
    gamma : array, shape (N, K)
        Responsibility matrix where gamma[n,k] = P(z_n=k|x_n)
    log_likelihood : float
        Log-likelihood of the data under the model
    """
    N = X.shape[0]
    K = len(means)

    # Compute log-likelihoods for numerical stability
    log_probs = np.zeros((N, K))
    for k in range(K):
        rv = multivariate_normal(mean=means[k], cov=covs[k])
        log_probs[:, k] = np.log(pi[k] + eps) + rv.logpdf(X)

    # Log-sum-exp trick for numerical stability
    max_log_probs = np.max(log_probs, axis=1, keepdims=True)
    log_sum = max_log_probs + np.log(
        np.sum(np.exp(log_probs - max_log_probs), axis=1, keepdims=True)
    )

    # Compute responsibilities
    log_gamma = log_probs - log_sum
    gamma = np.exp(log_gamma)

    # Compute log-likelihood
    log_likelihood = np.sum(log_sum)

    return gamma, log_likelihood


# Example usage
np.random.seed(42)
N, D, K = 200, 2, 3

# Define GMM parameters
means = [
    np.array([0.0, 0.0]),
    np.array([4.0, 0.0]),
    np.array([2.0, 3.5])
]
covs = [
    np.array([[1.0, 0.2], [0.2, 0.8]]),
    np.array([[0.8, -0.1], [-0.1, 1.0]]),
    np.array([[0.6, 0.0], [0.0, 0.6]])
]
pi = np.array([0.3, 0.45, 0.25])

# Generate test data
X = np.vstack([
    np.random.multivariate_normal(means[k], covs[k], size=int(N * pi[k]))
    for k in range(K)
])

# Compute responsibilities
gamma, ll = compute_responsibilities(X, means, covs, pi)

print(f"Responsibility matrix shape: {gamma.shape}")
print(f"Log-likelihood: {ll:.2f}")
print(f"Example responsibilities for first 5 points:")
for n in range(5):
    print(f"  Point {n}: {gamma[n]} -> Assigned to component {np.argmax(gamma[n])}")

# Verify responsibilities sum to 1
print(f"Verification: All row sums equal 1? {np.allclose(gamma.sum(axis=1), 1)}")

# Effective sample counts per component
N_k = gamma.sum(axis=0)
print(f"Effective samples per component: {N_k}")
print(f"Proportions: {N_k / N} (true: {pi})")
```

Since we don't know the true latent assignments, we cannot compute the complete-data log-likelihood directly. However, we can compute its expected value with respect to the posterior distribution of latent variables.
This expectation is central to the EM algorithm and provides a tractable objective function.
Given current parameter estimates $\boldsymbol{\theta}^{\text{old}}$, the expected complete-data log-likelihood is:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}}\left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]$$
$$= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{\text{old}} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
where $\gamma_{nk}^{\text{old}} = P(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}^{\text{old}})$.
Deriving the Q-function:
The complete-data log-likelihood is: $$\ell_c(\boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Taking the expectation with respect to $P(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$: $$Q = \mathbb{E}\left[ \sum_{n,k} z_{nk} \left[ \log \pi_k + \log \mathcal{N}_k(\mathbf{x}_n) \right] \right]$$
Since the $\mathbf{x}_n$ terms are not random (they're observed) and the $z_{nk}$ are the only random variables: $$Q = \sum_{n,k} \mathbb{E}[z_{nk}] \left[ \log \pi_k + \log \mathcal{N}_k(\mathbf{x}_n) \right]$$
But $\mathbb{E}[z_{nk}] = P(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}^{\text{old}}) = \gamma_{nk}^{\text{old}}$, giving us the result.
The key insight:
The Q-function has the same tractable structure as the complete-data log-likelihood, but with the unknown binary indicators $z_{nk}$ replaced by their expected values $\gamma_{nk}$. This makes optimization feasible!
| Quantity | Formula | Tractable? | Computable? |
|---|---|---|---|
| Complete-data LL | $\sum_{n,k} z_{nk} \log[\pi_k \mathcal{N}_k]$ | Yes | No ($z_{nk}$ unknown) |
| Incomplete-data LL | $\sum_n \log \sum_k \pi_k \mathcal{N}_k$ | No (log-sum) | Yes |
| Expected complete-data (Q) | $\sum_{n,k} \gamma_{nk} \log[\pi_k \mathcal{N}_k]$ | Yes | Yes |
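The Q-function translates almost line for line into code. A minimal sketch (the `q_function` name is ours; responsibilities are assumed precomputed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def q_function(X, gamma, means, covs, pi):
    """Expected complete-data log-likelihood:
    Q = sum_{n,k} gamma_nk * [log pi_k + log N(x_n | mu_k, Sigma_k)]."""
    K = len(pi)
    Q = 0.0
    for k in range(K):
        log_joint_k = np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
        # gamma[:, k] are fixed weights: the log stays INSIDE, so Q is tractable
        Q += np.sum(gamma[:, k] * log_joint_k)
    return Q

# Toy example: two points, two components, soft responsibilities
X = np.array([[0.0, 0.0], [5.0, 5.0]])
means = [np.zeros(2), np.full(2, 5.0)]
covs = [np.eye(2), np.eye(2)]
pi = np.array([0.5, 0.5])
gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
print(q_function(X, gamma, means, covs, pi))
```

Note that if `gamma` were one-hot, this would reduce exactly to the complete-data log-likelihood above.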
The EM algorithm alternates between: (1) E-step: Compute responsibilities $\gamma_{nk}$ using current parameters, and (2) M-step: Maximize the Q-function with respect to parameters, treating $\gamma_{nk}$ as fixed weights. This iterative process is guaranteed to increase the incomplete-data log-likelihood at each step.
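A compact sketch of one such iteration (an illustration of the alternation, not a reference implementation: no covariance regularization, no convergence check):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_step(X, means, covs, pi):
    """One EM iteration for a GMM."""
    N, D = X.shape
    K = len(pi)
    # E-step: responsibilities via Bayes' theorem, computed in log-space
    log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
                      for k in range(K)], axis=1)
    gamma = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
    # M-step: weighted MLE, treating gamma as fixed weights
    N_k = gamma.sum(axis=0)                    # effective counts
    pi_new = N_k / N
    means_new = (gamma.T @ X) / N_k[:, None]   # responsibility-weighted means
    covs_new = []
    for k in range(K):
        diff = X - means_new[k]
        covs_new.append((gamma[:, k, None] * diff).T @ diff / N_k[k])
    return means_new, np.array(covs_new), pi_new

# Usage: iterate on two well-separated synthetic clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
means = np.array([[1.0, 1.0], [7.0, 7.0]])
covs = np.array([np.eye(2), np.eye(2)])
pi = np.array([0.5, 0.5])
for _ in range(10):
    means, covs, pi = em_step(X, means, covs, pi)
print(means)  # one mean near [0, 0], the other near [8, 8]
```

In practice one would repeat until the incomplete-data log-likelihood stops improving; the monotonicity guarantee is derived in Module 4.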
Latent variable models like GMMs present fundamental inference challenges that don't arise in simpler models without hidden variables. Understanding these challenges clarifies why specialized algorithms like EM are necessary.
The missing data perspective:
Latent variables can be viewed as "missing data"—information that would simplify inference if observed. The EM algorithm was originally developed in this framework (Dempster, Laird, and Rubin, 1977).
Consider: if someone handed you the true cluster labels ${z_1, \ldots, z_N}$, fitting the GMM would be trivial—just compute the MLE for each cluster separately. The difficulty is that these labels are missing.
EM handles this by treating the latent assignments as missing data: the E-step computes their posterior expectations (the responsibilities) under the current parameters, and the M-step re-estimates the parameters as if those expectations were the true assignments.
Chicken-and-egg problem: computing the responsibilities requires knowing the parameters, while estimating the parameters requires knowing the assignments—neither can be obtained first.
EM breaks this circular dependency by alternating optimization: fixing one set while optimizing the other.
The EM algorithm was formally introduced in the seminal 1977 paper by Dempster, Laird, and Rubin: "Maximum Likelihood from Incomplete Data via the EM Algorithm." The paper unified many special-case algorithms that had been developed independently for specific problems (including GMMs, which had been solved by this approach earlier). The paper has over 70,000 citations and remains one of the most influential in statistics.
This page has developed a deep understanding of latent variables in Gaussian Mixture Models. Let's consolidate the key insights:
| Quantity | Symbol | Formula |
|---|---|---|
| Prior (component prob) | $\pi_k$ | $P(z_n = k)$ |
| Likelihood (conditional) | $p(\mathbf{x} \mid z=k)$ | $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ |
| Responsibility | $\gamma_{nk}$ | $\frac{\pi_k \mathcal{N}_k(\mathbf{x}_n)}{\sum_j \pi_j \mathcal{N}_j(\mathbf{x}_n)}$ |
| Q-function | $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ | $\sum_{n,k} \gamma_{nk} [\log \pi_k + \log \mathcal{N}_k(\mathbf{x}_n)]$ |
| Effective count | $N_k$ | $\sum_n \gamma_{nk}$ |
Connections to subsequent pages:
Page 2: Marginal Likelihood — Extends the incomplete-data likelihood perspective to model comparison and Bayesian model selection.
Page 3: Identifiability — Addresses when GMM parameters are uniquely determined, directly related to the symmetries in latent variable assignment.
Page 4: Model Selection — Uses latent variable concepts to determine the optimal number of components.
Module 4: EM Algorithm — Provides the complete algorithmic treatment of EM for GMMs, building on the Q-function and responsibilities developed here.
You now have a thorough understanding of latent variables in GMMs: their definition, representation, role in complete vs. incomplete likelihood, posterior inference via responsibilities, and why they create fundamental computational challenges. This foundation is essential for understanding the EM algorithm and broader latent variable modeling.