In the previous page, we developed the mathematical formulation of Gaussian Mixture Models and introduced the generative story—sample a component, then sample from that component's Gaussian. This two-stage process involves a critical quantity that we observe only indirectly: the component assignment.
This hidden component assignment is an example of a latent variable—an unobserved random variable that influences the observed data but is never directly measured. Latent variables are fundamental to probabilistic modeling, appearing in mixture models, hidden Markov models, factor analysis, variational autoencoders, and countless other frameworks.
This page develops a deep understanding of latent variables in the GMM context: what they represent, how they connect observed and hidden quantities, why they make inference challenging, and how they enable the powerful EM algorithm.
By the end of this page, you will:

1. Understand the precise role and representation of latent variables in GMMs
2. Derive the complete-data likelihood and contrast it with the incomplete-data likelihood
3. Compute posterior responsibilities using Bayes' theorem
4. Explain why latent variables create optimization difficulties
5. Set the foundation for the EM algorithm by understanding the missing data perspective
A latent variable (from Latin latēre, "to lie hidden") is a random variable that is not directly observed but is inferred from the observed data and the model structure. Latent variables serve multiple purposes in statistical modeling:
1. Representation of hidden structure: Data often arises from processes with unobserved components. Customer segments, disease subtypes, speaker identity in audio—these underlying factors generate the data but aren't directly labeled.
2. Dimensionality reduction: Latent variables can capture lower-dimensional structure in high-dimensional data. Factor analysis and PCA (as a limiting case) use latent variables to explain correlations.
3. Modeling dependencies: Latent variables can induce dependencies between observed variables. Two observations might be conditionally independent given a shared latent state but marginally dependent.
4. Enabling tractable computation: Introducing latent variables can make otherwise intractable models computationally feasible, as we'll see with EM.
It's crucial to distinguish latent variables from parameters. Parameters (like $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, $\pi_k$) are fixed but unknown quantities that we estimate. Latent variables (like component assignments $z_n$) are random variables with probability distributions—we infer their posterior distributions given data, not point estimates (in the Bayesian perspective). Frequentist treatments sometimes blur this distinction by treating latent variables as parameters to estimate.
In Gaussian Mixture Models, the latent variable $z$ represents the component assignment—which of the $K$ Gaussian components generated a given observation.
Formal definition:
For each observation $\mathbf{x}_n$, we have an associated latent variable $z_n \in \{1, 2, \ldots, K\}$ indicating the component from which $\mathbf{x}_n$ was generated.
Distribution of $z_n$: $$P(z_n = k) = \pi_k \quad \text{for } k = 1, \ldots, K$$
The latent variables $z_1, z_2, \ldots, z_N$ are independent and identically distributed according to this categorical distribution.
It's often convenient to represent $z_n$ as a one-hot (1-of-K) vector $\mathbf{z}_n \in \{0,1\}^K$ where exactly one element is 1 and all others are 0:
$$z_{nk} = \begin{cases} 1 & \text{if observation } n \text{ belongs to component } k \\ 0 & \text{otherwise} \end{cases}$$
With this encoding: $\sum_{k=1}^K z_{nk} = 1$ and $P(\mathbf{z}_n) = \prod_{k=1}^K \pi_k^{z_{nk}}$
The conditional distribution of $\mathbf{x}$ given $z$:
Given the component assignment, the observation is drawn from the corresponding Gaussian: $$p(\mathbf{x}_n \mid z_n = k) = \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Using the one-hot encoding: $$p(\mathbf{x}_n \mid \mathbf{z}_n) = \prod_{k=1}^K \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_{nk}}$$
Since exactly one $z_{nk} = 1$, only the corresponding Gaussian contributes.
The complete probabilistic model:
The joint distribution of observed and latent variables factorizes as: $$p(\mathbf{x}_n, \mathbf{z}_n) = P(\mathbf{z}_n) \cdot p(\mathbf{x}_n \mid \mathbf{z}_n) = \prod_{k=1}^K \left[ \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}$$
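The two-stage generative story behind this factorization can be sketched directly. The parameters below are purely illustrative (a hypothetical 3-component mixture in 2-D), not taken from any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: K = 3 components in D = 2 dimensions
pi = np.array([0.3, 0.5, 0.2])
means = np.array([[0.0, 0.0], [5.0, 0.0], [2.5, 4.0]])
covs = np.array([np.eye(2) * 0.5 for _ in range(3)])

def sample_gmm(n_samples):
    """Sample (x, z) pairs following the two-stage generative story."""
    # Stage 1: sample component assignments z_n ~ Categorical(pi)
    z = rng.choice(len(pi), size=n_samples, p=pi)
    # Stage 2: sample x_n ~ N(mu_{z_n}, Sigma_{z_n})
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z

X, Z = sample_gmm(1000)
# Empirical component frequencies approach the mixing weights pi
print(np.bincount(Z, minlength=3) / len(Z))
```

In real inference the second return value `Z` would be discarded: it is exactly the latent variable we never observe.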
| Aspect | Scalar $z_n \in \{1,\ldots,K\}$ | One-Hot $\mathbf{z}_n \in \{0,1\}^K$ |
|---|---|---|
| Space | Single integer | K-dimensional binary vector |
| Prior | $P(z_n = k) = \pi_k$ | $P(\mathbf{z}_n) = \prod_k \pi_k^{z_{nk}}$ |
| Constraint | — | $\sum_k z_{nk} = 1$ |
| Likelihood | $p(\mathbf{x} \mid z=k)$ | $\prod_k p_k(\mathbf{x})^{z_{nk}}$ |
| Derivatives | Discrete, no gradients | Still discrete, but product form is useful |
| In code | z = 2 | z = [0, 1, 0, 0] |
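Converting between the two encodings is a one-liner in each direction. A quick sketch (note that code uses 0-indexed labels, whereas the math above is 1-indexed):

```python
import numpy as np

K = 4
z_scalar = np.array([2, 0, 3, 2])  # hypothetical scalar assignments, 0-indexed

# Scalar -> one-hot: row n has a single 1 in column z_n
Z_onehot = np.eye(K, dtype=int)[z_scalar]

# One-hot -> scalar: the position of the single 1 in each row
z_back = Z_onehot.argmax(axis=1)

assert (z_back == z_scalar).all()
assert (Z_onehot.sum(axis=1) == 1).all()  # the 1-of-K constraint
print(Z_onehot)
```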
A fundamental concept in latent variable models is the distinction between the complete-data and incomplete-data likelihoods.
Complete data: Both observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and latent variables $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ are known.
Incomplete data: Only observations $\mathbf{X}$ are known; latent variables $\mathbf{Z}$ are hidden.
In practice, we only have incomplete data. But analyzing the complete-data case provides crucial insights.
Assuming observations are i.i.d., the complete-data log-likelihood is:
$$\ell_c(\boldsymbol{\theta}) = \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Note how the logarithm is now inside the summation over components, making each term tractable.
Deriving the complete-data log-likelihood:
Starting from the joint probability: $$p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\mathbf{x}_n, \mathbf{z}_n)$$ $$= \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]^{z_{nk}}$$
Taking logarithms: $$\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \log\left[ \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$ $$= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Why this form is tractable:
The indicator $z_{nk}$ acts as a selector. For each observation, only the term corresponding to its true component contributes. The complete-data log-likelihood decomposes into separate terms for each component, and optimizing for one component's parameters doesn't directly involve other components' parameters.
```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_likelihood(X, Z, means, covs, pi):
    """
    Compute complete-data log-likelihood when assignments are known.

    Parameters:
    -----------
    X : array, shape (N, D)
        Observed data
    Z : array, shape (N, K)
        One-hot encoded component assignments
    means : list of arrays, shape (K, D)
        Component means
    covs : list of arrays, shape (K, D, D)
        Component covariances
    pi : array, shape (K,)
        Mixing coefficients

    Returns:
    --------
    log_likelihood : float
        Complete-data log-likelihood
    """
    N, D = X.shape
    K = len(means)
    log_likelihood = 0.0

    for n in range(N):
        for k in range(K):
            if Z[n, k] == 1:
                # Only the assigned component contributes
                # Log prior for component k
                log_prior = np.log(pi[k])
                # Log likelihood under component k's Gaussian
                log_pdf = multivariate_normal.logpdf(X[n], means[k], covs[k])
                log_likelihood += log_prior + log_pdf

    return log_likelihood


# Example with known assignments
N, D, K = 100, 2, 3

# Generate synthetic data with known labels
np.random.seed(42)
true_labels = np.random.choice(K, size=N, p=[0.3, 0.5, 0.2])
X = np.zeros((N, D))
means = [np.array([0, 0]), np.array([5, 0]), np.array([2.5, 4])]
covs = [np.eye(D) * 0.5 for _ in range(K)]

for n in range(N):
    k = true_labels[n]
    X[n] = np.random.multivariate_normal(means[k], covs[k])

# Create one-hot encoding
Z = np.zeros((N, K))
for n in range(N):
    Z[n, true_labels[n]] = 1

# Compute complete-data log-likelihood
pi = np.array([0.3, 0.5, 0.2])
ll = complete_data_log_likelihood(X, Z, means, covs, pi)
print(f"Complete-data log-likelihood: {ll:.2f}")
print(f"Per-sample average: {ll/N:.2f}")
```

If we knew the latent assignments $\mathbf{Z}$, maximizing the complete-data log-likelihood would be easy—it reduces to $K$ separate Gaussian MLE problems. The EM algorithm exploits this by iteratively: (1) estimating the "effective" assignments (E-step) and (2) maximizing as if these estimates were true assignments (M-step).
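To make the "$K$ separate MLE problems" point concrete, here is a minimal sketch (a hypothetical `mle_per_component` helper, assuming hard labels are handed to us) of the closed-form per-component estimates:

```python
import numpy as np

def mle_per_component(X, z, K):
    """Closed-form MLE for each component, using only its assigned points."""
    N = len(X)
    pi_hat, means_hat, covs_hat = [], [], []
    for k in range(K):
        Xk = X[z == k]  # points assigned to component k
        pi_hat.append(len(Xk) / N)                # mixing weight = label fraction
        means_hat.append(Xk.mean(axis=0))         # sample mean of the cluster
        covs_hat.append(np.cov(Xk.T, bias=True))  # MLE covariance (divide by N_k)
    return np.array(pi_hat), np.array(means_hat), np.array(covs_hat)

# Example: two well-separated clusters with known labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
z = np.repeat([0, 1], 50)
pi_hat, mu_hat, _ = mle_per_component(X, z, K=2)
print(pi_hat)  # [0.5 0.5]
print(mu_hat)  # means near [0, 0] and [6, 6]
```

No component's estimate depends on any other component's points: this decomposition is exactly what the incomplete-data likelihood destroys.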
The incomplete-data (marginal) likelihood is what we actually work with, since latent variables are unobserved. It's obtained by marginalizing the complete-data likelihood over all possible latent configurations:
$$p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$
For GMMs, this sum is over all possible assignment vectors $\mathbf{Z}$—there are $K^N$ such configurations for $N$ observations and $K$ components.
The incomplete-data log-likelihood is:
$$\ell(\boldsymbol{\theta}) = \log p(\mathbf{X} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
Note: the log is outside the sum over $k$, which is the source of computational difficulty.
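A direct implementation of this quantity might look as follows. This is a sketch (the `incomplete_data_log_likelihood` name is ours), using SciPy's `logsumexp` to evaluate the log-of-a-sum stably:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def incomplete_data_log_likelihood(X, means, covs, pi):
    """log p(X | theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    N, K = X.shape[0], len(pi)
    log_terms = np.zeros((N, K))
    for k in range(K):
        # log pi_k + log N(x_n | mu_k, Sigma_k) for every n at once
        log_terms[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
    # The log sits OUTSIDE the sum over k: logsumexp handles it without underflow
    return logsumexp(log_terms, axis=1).sum()

# Illustrative evaluation on a toy two-component mixture
X = np.array([[0.0, 0.0], [4.0, 0.0]])
means = [np.zeros(2), np.array([4.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
pi = np.array([0.5, 0.5])
print(incomplete_data_log_likelihood(X, means, covs, pi))
```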
Comparing complete vs. incomplete:
| Aspect | Complete-Data | Incomplete-Data |
|---|---|---|
| Latent variables | Known | Hidden |
| Log-likelihood form | $\sum_n \sum_k z_{nk} \log[\cdot]$ | $\sum_n \log(\sum_k [\cdot])$ |
| Optimization | Decomposes into $K$ subproblems | Coupled, non-convex |
| Closed-form MLE | Yes, for each component | No |
| Computation | $O(NK)$ terms | Same, but harder structure |
Why marginalization creates difficulty:
The marginalization step—summing over latent configurations—is what couples the parameters together. In the incomplete-data likelihood:
$$\log \sum_k \pi_k \mathcal{N}_k(\mathbf{x})$$
Each observation contributes a log-sum-exp term. The gradient with respect to any parameter involves all components (through the normalizing sum in the denominator). This interdependence prevents decomposition into separate optimization problems.
For $N$ observations and $K$ components, there are $K^N$ possible complete-data configurations. With $N = 100$ and $K = 3$, that's $3^{100} \approx 5 \times 10^{47}$ configurations—an astronomically large number. We cannot enumerate all configurations; we need smarter approaches.
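The count is easy to verify with Python's exact integer arithmetic:

```python
K, N = 3, 100
n_configs = K ** N  # number of complete-data assignment configurations
print(f"{n_configs:.3e}")  # on the order of 1e47
```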
Given observed data and model parameters, we can compute the posterior distribution over latent variables—the probability distribution of $z_n$ given observation $\mathbf{x}_n$.
Applying Bayes' theorem: $$P(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}) = \frac{P(z_n = k) \cdot p(\mathbf{x}_n \mid z_n = k)}{p(\mathbf{x}_n)}$$ $$= \frac{\pi_k \cdot \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \cdot \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
We define the responsibility of component $k$ for observation $n$ as:
$$\gamma_{nk} \equiv P(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}) = \frac{\pi_k \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
These responsibilities satisfy $\gamma_{nk} \in [0,1]$ and $\sum_{k=1}^K \gamma_{nk} = 1$ for each $n$.
Interpreting responsibilities:
The responsibility $\gamma_{nk}$ quantifies how much component $k$ "claims" observation $\mathbf{x}_n$. It depends on:
Prior probability $\pi_k$: Components with larger mixing weights have higher responsibilities, all else equal.
Likelihood $\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$: Components whose Gaussian assigns high density to $\mathbf{x}_n$ have higher responsibilities.
Competition: The denominator normalizes across all components. A component can have high responsibility only if it outperforms (or matches) other components for this observation.
Soft vs. hard assignments:
Soft assignment: Use the full responsibility distribution. Point $\mathbf{x}_n$ belongs to component 1 with probability 0.7, component 2 with probability 0.2, etc.
Hard assignment: Assign each point to its maximum-responsibility component: $\hat{z}_n = \arg\max_k \gamma_{nk}$. This is analogous to k-means-style clustering.
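Collapsing soft responsibilities to hard assignments is a single `argmax`. A quick sketch with a hypothetical responsibility matrix:

```python
import numpy as np

# Hypothetical responsibilities for 3 points under K = 3 components
gamma = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
    [0.40, 0.35, 0.25],
])

# Hard assignment: maximum-responsibility component per point
z_hard = gamma.argmax(axis=1)
print(z_hard)  # [0 1 0]

# Soft assignment keeps the full rows of gamma; note the third point is
# genuinely ambiguous (0.40 vs 0.35), information the hard labels discard.
```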
```python
import numpy as np
from scipy.stats import multivariate_normal

def compute_responsibilities(X, means, covs, pi, eps=1e-10):
    """
    Compute posterior responsibilities for all observations.

    Parameters:
    -----------
    X : array, shape (N, D)
        Observed data
    means : list of arrays, shape (K, D)
        Component means
    covs : list of arrays, shape (K, D, D)
        Component covariances
    pi : array, shape (K,)
        Mixing coefficients
    eps : float
        Small constant for numerical stability

    Returns:
    --------
    gamma : array, shape (N, K)
        Responsibility matrix where gamma[n,k] = P(z_n=k|x_n)
    log_likelihood : float
        Log-likelihood of the data under the model
    """
    N = X.shape[0]
    K = len(means)

    # Compute log-likelihoods for numerical stability
    log_probs = np.zeros((N, K))
    for k in range(K):
        rv = multivariate_normal(mean=means[k], cov=covs[k])
        log_probs[:, k] = np.log(pi[k] + eps) + rv.logpdf(X)

    # Log-sum-exp trick for numerical stability
    max_log_probs = np.max(log_probs, axis=1, keepdims=True)
    log_sum = max_log_probs + np.log(
        np.sum(np.exp(log_probs - max_log_probs), axis=1, keepdims=True)
    )

    # Compute responsibilities
    log_gamma = log_probs - log_sum
    gamma = np.exp(log_gamma)

    # Compute log-likelihood
    log_likelihood = np.sum(log_sum)

    return gamma, log_likelihood


# Example usage
np.random.seed(42)
N, D, K = 200, 2, 3

# Define GMM parameters
means = [
    np.array([0.0, 0.0]),
    np.array([4.0, 0.0]),
    np.array([2.0, 3.5])
]
covs = [
    np.array([[1.0, 0.2], [0.2, 0.8]]),
    np.array([[0.8, -0.1], [-0.1, 1.0]]),
    np.array([[0.6, 0.0], [0.0, 0.6]])
]
pi = np.array([0.3, 0.45, 0.25])

# Generate test data
X = np.vstack([
    np.random.multivariate_normal(means[k], covs[k], size=int(N * pi[k]))
    for k in range(K)
])

# Compute responsibilities
gamma, ll = compute_responsibilities(X, means, covs, pi)

print(f"Responsibility matrix shape: {gamma.shape}")
print(f"Log-likelihood: {ll:.2f}")
print(f"Example responsibilities for first 5 points:")
for n in range(5):
    print(f"  Point {n}: {gamma[n]} -> Assigned to component {np.argmax(gamma[n])}")

# Verify responsibilities sum to 1
print(f"Verification: All row sums equal 1? {np.allclose(gamma.sum(axis=1), 1)}")

# Effective sample counts per component
N_k = gamma.sum(axis=0)
print(f"Effective samples per component: {N_k}")
print(f"Proportions: {N_k / N} (true: {pi})")
```

Since we don't know the true latent assignments, we cannot compute the complete-data log-likelihood directly. However, we can compute its expected value with respect to the posterior distribution of latent variables.
This expectation is central to the EM algorithm and provides a tractable objective function.
Given current parameter estimates $\boldsymbol{\theta}^{\text{old}}$, the expected complete-data log-likelihood is:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}}\left[ \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) \right]$$
$$= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk}^{\text{old}} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
where $\gamma_{nk}^{\text{old}} = P(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}^{\text{old}})$.
Deriving the Q-function:
The complete-data log-likelihood is: $$\ell_c(\boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right]$$
Taking the expectation with respect to $P(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$: $$Q = \mathbb{E}\left[ \sum_{n,k} z_{nk} \left[ \log \pi_k + \log \mathcal{N}_k(\mathbf{x}_n) \right] \right]$$
Since the $\mathbf{x}_n$ terms are not random (they're observed) and the $z_{nk}$ are the only random variables: $$Q = \sum_{n,k} \mathbb{E}[z_{nk}] \left[ \log \pi_k + \log \mathcal{N}_k(\mathbf{x}_n) \right]$$
But $\mathbb{E}[z_{nk}] = P(z_n = k \mid \mathbf{x}_n, \boldsymbol{\theta}^{\text{old}}) = \gamma_{nk}^{\text{old}}$, giving us the result.
The key insight:
The Q-function has the same tractable structure as the complete-data log-likelihood, but with the unknown binary indicators $z_{nk}$ replaced by their expected values $\gamma_{nk}$. This makes optimization feasible!
| Quantity | Formula | Tractable? | Computable? |
|---|---|---|---|
| Complete-data LL | $\sum_{n,k} z_{nk} \log[\pi_k \mathcal{N}_k]$ | Yes | No ($z_{nk}$ unknown) |
| Incomplete-data LL | $\sum_n \log \sum_k \pi_k \mathcal{N}_k$ | No (log-sum) | Yes |
| Expected complete-data (Q) | $\sum_{n,k} \gamma_{nk} \log[\pi_k \mathcal{N}_k]$ | Yes | Yes |
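The Q-function translates almost line for line into code. A minimal sketch (the `q_function` name is ours; responsibilities are assumed precomputed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def q_function(X, gamma, means, covs, pi):
    """Expected complete-data log-likelihood:
    Q = sum_{n,k} gamma_nk * [log pi_k + log N(x_n | mu_k, Sigma_k)]."""
    K = len(pi)
    Q = 0.0
    for k in range(K):
        log_joint_k = np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
        # gamma[:, k] are fixed weights: the log stays INSIDE, so Q is tractable
        Q += np.sum(gamma[:, k] * log_joint_k)
    return Q

# Toy example: two points, two components, soft responsibilities
X = np.array([[0.0, 0.0], [5.0, 5.0]])
means = [np.zeros(2), np.full(2, 5.0)]
covs = [np.eye(2), np.eye(2)]
pi = np.array([0.5, 0.5])
gamma = np.array([[0.9, 0.1], [0.2, 0.8]])
print(q_function(X, gamma, means, covs, pi))
```

Note that if `gamma` were one-hot, this would reduce exactly to the complete-data log-likelihood above.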
The EM algorithm alternates between: (1) E-step: Compute responsibilities $\gamma_{nk}$ using current parameters, and (2) M-step: Maximize the Q-function with respect to parameters, treating $\gamma_{nk}$ as fixed weights. This iterative process is guaranteed to increase the incomplete-data log-likelihood at each step.
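A compact sketch of one such iteration (an illustration of the alternation, not a reference implementation: no covariance regularization, no convergence check):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_step(X, means, covs, pi):
    """One EM iteration for a GMM."""
    N, D = X.shape
    K = len(pi)
    # E-step: responsibilities via Bayes' theorem, computed in log-space
    log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
                      for k in range(K)], axis=1)
    gamma = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
    # M-step: weighted MLE, treating gamma as fixed weights
    N_k = gamma.sum(axis=0)                    # effective counts
    pi_new = N_k / N
    means_new = (gamma.T @ X) / N_k[:, None]   # responsibility-weighted means
    covs_new = []
    for k in range(K):
        diff = X - means_new[k]
        covs_new.append((gamma[:, k, None] * diff).T @ diff / N_k[k])
    return means_new, np.array(covs_new), pi_new

# Usage: iterate on two well-separated synthetic clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
means = np.array([[1.0, 1.0], [7.0, 7.0]])
covs = np.array([np.eye(2), np.eye(2)])
pi = np.array([0.5, 0.5])
for _ in range(10):
    means, covs, pi = em_step(X, means, covs, pi)
print(means)  # one mean near [0, 0], the other near [8, 8]
```

In practice one would repeat until the incomplete-data log-likelihood stops improving; the monotonicity guarantee is derived in Module 4.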
Latent variable models like GMMs present fundamental inference challenges that don't arise in simpler models without hidden variables. Understanding these challenges clarifies why specialized algorithms like EM are necessary.
The missing data perspective:
Latent variables can be viewed as "missing data"—information that would simplify inference if observed. The EM algorithm was originally developed in this framework (Dempster, Laird, and Rubin, 1977).
Consider: if someone handed you the true cluster labels ${z_1, \ldots, z_N}$, fitting the GMM would be trivial—just compute the MLE for each cluster separately. The difficulty is that these labels are missing.
EM handles this by treating the latent assignments as missing data: the E-step computes their posterior expectations (the responsibilities) under the current parameters, and the M-step re-estimates the parameters as if those expectations were the true assignments.
Chicken-and-egg problem: computing the responsibilities requires knowing the parameters, while estimating the parameters requires knowing the assignments—neither can be obtained first.
EM breaks this circular dependency by alternating optimization: fixing one set while optimizing the other.
The EM algorithm was formally introduced in the seminal 1977 paper by Dempster, Laird, and Rubin: "Maximum Likelihood from Incomplete Data via the EM Algorithm." The paper unified many special-case algorithms that had been developed independently for specific problems (including GMMs, which had been solved by this approach earlier). The paper has over 70,000 citations and remains one of the most influential in statistics.
This page has developed a deep understanding of latent variables in Gaussian Mixture Models. Let's consolidate the key insights:
| Quantity | Symbol | Formula |
|---|---|---|
| Prior (component prob) | $\pi_k$ | $P(z_n = k)$ |
| Likelihood (conditional) | $p(\mathbf{x} \mid z=k)$ | $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ |
| Responsibility | $\gamma_{nk}$ | $\frac{\pi_k \mathcal{N}_k(\mathbf{x}_n)}{\sum_j \pi_j \mathcal{N}_j(\mathbf{x}_n)}$ |
| Q-function | $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ | $\sum_{n,k} \gamma_{nk} [\log \pi_k + \log \mathcal{N}_k(\mathbf{x}_n)]$ |
| Effective count | $N_k$ | $\sum_n \gamma_{nk}$ |
Connections to subsequent pages:
Page 2: Marginal Likelihood — Extends the incomplete-data likelihood perspective to model comparison and Bayesian model selection.
Page 3: Identifiability — Addresses when GMM parameters are uniquely determined, directly related to the symmetries in latent variable assignment.
Page 4: Model Selection — Uses latent variable concepts to determine the optimal number of components.
Module 4: EM Algorithm — Provides the complete algorithmic treatment of EM for GMMs, building on the Q-function and responsibilities developed here.
You now have a thorough understanding of latent variables in GMMs: their definition, representation, role in complete vs. incomplete likelihood, posterior inference via responsibilities, and why they create fundamental computational challenges. This foundation is essential for understanding the EM algorithm and broader latent variable modeling.