In statistical modeling and machine learning, we frequently encounter data that cannot be adequately described by a single parametric distribution. Consider a dataset of human heights that includes both adult males and adult females—a single Gaussian distribution would fail to capture the bimodal nature of the data. Or imagine customer purchase amounts in a retail dataset, where different customer segments exhibit fundamentally different spending behaviors.
Mixture models provide an elegant and principled solution to this challenge. Rather than forcing all data through a single distributional lens, mixture models posit that the observed data arises from multiple underlying subpopulations, each characterized by its own parametric distribution. The overall data distribution is then a weighted combination—a mixture—of these component distributions.
This page develops the complete mathematical formulation of Gaussian Mixture Models (GMMs), the most widely used instance of mixture modeling. We will derive every component of the model from first principles, ensuring you understand not just what the equations say, but why they take the form they do.
By the end of this page, you will be able to: (1) Write down the complete probability density function for a Gaussian mixture model, (2) Explain the role and interpretation of mixing coefficients, (3) Describe the generative process that produces mixture data, (4) Derive the relationship between component densities and the overall mixture density, and (5) Formulate the log-likelihood function for parameter estimation.
Before constructing mixture models, we must establish a rigorous understanding of the Gaussian distribution itself. The univariate Gaussian (or Normal) distribution is parameterized by two quantities: the mean $\mu$ (location parameter) and the variance $\sigma^2$ (scale parameter). Its probability density function (pdf) is:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
This function describes the probability density at point $x$ given parameters $\mu$ and $\sigma^2$.
Dissecting the components:
The normalization constant $\frac{1}{\sqrt{2\pi\sigma^2}}$ ensures that the pdf integrates to 1 over the entire real line. This is a requirement for any valid probability density function. Notice that larger variance $\sigma^2$ leads to a smaller normalization constant—because the probability mass is spread over a wider range, the peak height must decrease.
The exponential term $\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$ captures the shape of the distribution. The squared term $(x - \mu)^2$ measures the distance from the mean, ensuring symmetry around $\mu$. The division by $2\sigma^2$ scales this distance relative to the spread—points many standard deviations from the mean receive exponentially lower density.
The maximum density occurs at $x = \mu$, where the exponent is zero and the exponential term equals 1. As $|x - \mu|$ increases, the density decreases rapidly due to the negative quadratic in the exponent.
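A quick numerical sketch confirms both properties above: the peak height equals the normalization constant, and the density integrates to one. (The helper name `gaussian_pdf` is illustrative, not a library function.)

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.0, 4.0

# Peak density at x = mu: exponent is zero, leaving 1/sqrt(2*pi*sigma^2)
peak = gaussian_pdf(mu, mu, sigma2)
print(f"Peak density at x = mu: {peak:.4f}")  # 1/sqrt(8*pi) ≈ 0.1995

# Riemann-sum check that the density integrates to 1 over (effectively) the real line
xs = np.linspace(mu - 10 * np.sqrt(sigma2), mu + 10 * np.sqrt(sigma2), 100001)
dx = xs[1] - xs[0]
area = np.sum(gaussian_pdf(xs, mu, sigma2)) * dx
print(f"Integral over the real line: {area:.6f}")
```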
The Gaussian's unimodality is precisely why we need mixtures. A single Gaussian cannot represent data with multiple clusters or modes. By combining multiple Gaussians—each representing a different subpopulation—we can model complex, multimodal distributions while retaining the mathematical tractability that makes Gaussians so useful.
Real-world data typically involves multiple dimensions. Customer features might include age, income, and purchase frequency. Image data involves thousands of pixel intensities. To handle such data, we extend the Gaussian to multiple dimensions.
For a $D$-dimensional random vector $\mathbf{x} = (x_1, x_2, \ldots, x_D)^\top$, the multivariate Gaussian distribution is parameterized by a mean vector $\boldsymbol{\mu} \in \mathbb{R}^D$ and a covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{D \times D}$. Its probability density function is:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where $|\boldsymbol{\Sigma}|$ denotes the determinant of the covariance matrix.
Understanding each component:
The normalization constant $\frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}$ generalizes the univariate case. The $D/2$ power accounts for $D$ independent dimensions, while the determinant $|\boldsymbol{\Sigma}|$ measures the "volume" of the distribution's spread. A larger determinant means more spread, requiring a smaller peak height.
The Mahalanobis distance $(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is a generalized squared distance that accounts for both the different scales of variation in each dimension and the correlations between dimensions.
This quadratic form defines elliptical contours of constant density. Points on the same ellipse have equal probability density.
The precision matrix $\boldsymbol{\Sigma}^{-1}$, inverse of the covariance, appears rather than $\boldsymbol{\Sigma}$ itself. This is computationally important—we often work with precision directly in optimization.
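The pieces above can be assembled directly: compute the squared Mahalanobis distance through the precision matrix, then scale by the normalization constant. This is a sketch (the function name is illustrative; production code would use a Cholesky factorization rather than an explicit inverse):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density via the Mahalanobis distance."""
    D = len(mu)
    diff = x - mu
    precision = np.linalg.inv(Sigma)           # Sigma^{-1}, the precision matrix
    mahal_sq = diff @ precision @ diff         # squared Mahalanobis distance
    norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * mahal_sq)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([1.0, -0.5])
print(f"Density at x: {multivariate_gaussian_pdf(x, mu, Sigma):.6f}")
```

At $\mathbf{x} = \boldsymbol{\mu}$ the Mahalanobis distance is zero, so the density equals the normalization constant, mirroring the univariate peak.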
| Covariance Form | Structure | Geometric Shape | Parameters |
|---|---|---|---|
| Spherical | $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$ | Hyperspheres | $1$ parameter |
| Diagonal | $\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_D^2)$ | Axis-aligned ellipsoids | $D$ parameters |
| Full | Arbitrary positive definite | Arbitrarily oriented ellipsoids | $D(D+1)/2$ parameters |
The parameter count matters for mixture models. A full covariance matrix in $D$ dimensions has $D(D+1)/2$ free parameters due to symmetry. For a mixture with $K$ components, this means $K \cdot D(D+1)/2$ covariance parameters, which can quickly become prohibitive. This motivates restricted forms:
Spherical covariance (isotropic): All dimensions have equal variance, no correlation. This produces spherical clusters.
Diagonal covariance: Each dimension has its own variance, but no correlation between dimensions. This produces axis-aligned elliptical clusters.
Full covariance: Arbitrary orientation and scaling. Most flexible but most parameters.
Tied covariance: All components share the same covariance matrix, differing only in means. Reduces parameters significantly.
The covariance matrix must be positive definite for the density to be well-defined. This means all eigenvalues must be strictly positive. Numerical issues during estimation can produce near-singular or non-positive-definite estimates, causing algorithms to fail. Regularization (adding a small value to the diagonal) is a common remedy.
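The regularization remedy can be sketched in a few lines: a Cholesky factorization succeeds exactly when the matrix is positive definite, so a failed factorization triggers the diagonal fix. (The helper name `regularize_covariance` is illustrative.)

```python
import numpy as np

def regularize_covariance(Sigma, eps=1e-6):
    """Ensure a covariance estimate is positive definite by adding eps * I
    to the diagonal when a Cholesky factorization fails."""
    try:
        np.linalg.cholesky(Sigma)  # succeeds iff Sigma is positive definite
        return Sigma
    except np.linalg.LinAlgError:
        return Sigma + eps * np.eye(Sigma.shape[0])

# A rank-deficient covariance estimate (eigenvalues 2 and 0)
singular = np.array([[1.0, 1.0], [1.0, 1.0]])
fixed = regularize_covariance(singular)
print("Eigenvalues after regularization:", np.linalg.eigvalsh(fixed))
```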
Now we construct the Gaussian Mixture Model by combining multiple Gaussian components. The key insight is that complex, multimodal distributions can often be approximated as weighted sums of simpler unimodal distributions.
A Gaussian Mixture Model with $K$ components is defined by:
$K$ component densities: Each component $k \in \{1, 2, \ldots, K\}$ is a multivariate Gaussian with its own parameters $(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
$K$ mixing coefficients: Non-negative weights $\pi_1, \pi_2, \ldots, \pi_K$ that sum to one: $\sum_{k=1}^{K} \pi_k = 1$
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
The probability density at any point $\mathbf{x}$ is a weighted sum of the densities assigned by each Gaussian component.
Why this is a valid probability density:
For $p(\mathbf{x})$ to be a valid pdf, it must satisfy two requirements: it must be non-negative everywhere, and it must integrate to one.
Non-negativity is immediate: each $\pi_k \geq 0$ and each $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \geq 0$, so their product and sum are non-negative.
Integration to one follows from: $$\int p(\mathbf{x})\, d\mathbf{x} = \int \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, d\mathbf{x}$$ $$= \sum_{k=1}^{K} \pi_k \int \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, d\mathbf{x} = \sum_{k=1}^{K} \pi_k \cdot 1 = 1$$
The integral and sum exchange by linearity, and each Gaussian integrates to 1.
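The same argument can be checked numerically in one dimension: a weighted sum of Gaussian densities still integrates to one. (The helper `gmm_density_1d` is illustrative.)

```python
import numpy as np

def gmm_density_1d(x, means, variances, weights):
    """Density of a 1D Gaussian mixture: p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    x = np.asarray(x)[..., None]  # broadcast x against the K components
    comps = np.exp(-(x - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    return comps @ weights        # weighted sum over components

means = np.array([-2.0, 1.0, 4.0])
variances = np.array([0.5, 1.0, 0.8])
weights = np.array([0.3, 0.5, 0.2])  # non-negative, sum to 1

# Riemann-sum check that the mixture density integrates to 1
xs = np.linspace(-15, 15, 200001)
dx = xs[1] - xs[0]
total = np.sum(gmm_density_1d(xs, means, variances, weights)) * dx
print(f"Mixture integrates to: {total:.6f}")
```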
The complete parameter set:
Collecting all parameters, a GMM with $K$ components in $D$ dimensions has:
$$\boldsymbol{\theta} = \{\pi_1, \ldots, \pi_K, \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$$
Parameter count for full covariance:

- Mixing coefficients: $K - 1$ free parameters (the constraint $\sum_k \pi_k = 1$ removes one degree of freedom)
- Means: $K \cdot D$ parameters
- Covariances: $K \cdot D(D+1)/2$ parameters (by symmetry)

Total: $K - 1 + KD + K \cdot D(D+1)/2 = K(1 + D + D(D+1)/2) - 1$ parameters
For example, with $K = 5$ components in $D = 10$ dimensions with full covariance, the total is $4 + 50 + 275 = 329$ free parameters.
This illustrates why simpler covariance structures are often necessary in practice.
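The counts for the different covariance structures are easy to tabulate programmatically. This small sketch (the function name is illustrative) reproduces the $K = 5$, $D = 10$ example:

```python
def gmm_param_count(K, D, covariance="full"):
    """Number of free parameters in a K-component, D-dimensional GMM."""
    mixing = K - 1  # weights are constrained to sum to 1
    means = K * D
    per_cov = {"spherical": 1, "diag": D, "full": D * (D + 1) // 2}[covariance]
    return mixing + means + K * per_cov

for form in ("spherical", "diag", "full"):
    print(f"K=5, D=10, {form:9s}: {gmm_param_count(5, 10, form)} parameters")
```

Moving from full to diagonal covariance cuts the count from 329 to 104, which is why restricted forms are common in high dimensions.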
One of the most powerful ways to understand probabilistic models is through their generative story—a step-by-step description of how the observed data could have been produced.
For a Gaussian Mixture Model, the generative process for a single observation is:
Step 1: Sample the component assignment Draw a component index $z$ from a Categorical distribution with probabilities $(\pi_1, \ldots, \pi_K)$: $$z \sim \text{Categorical}(\pi_1, \ldots, \pi_K)$$
Step 2: Sample the observation from the chosen component Given $z = k$, draw the observation $\mathbf{x}$ from the $k$-th Gaussian: $$\mathbf{x} \mid z = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Think of it like this: Imagine a factory with $K$ manufacturing machines. Each machine produces products with slightly different characteristics (mean and variance). To produce one item, we first randomly select a machine (with probability $\pi_k$ for machine $k$), then that machine produces an item according to its own distribution. The final product shows which machine made it... but imagine we lost those labels. That's exactly the GMM setup!
```python
import numpy as np

def sample_from_gmm(n_samples, means, covariances, mixing_coeffs):
    """
    Generate samples from a Gaussian Mixture Model.

    Parameters:
    -----------
    n_samples : int
        Number of samples to generate
    means : list of arrays, shape (K, D)
        Mean vectors for each component
    covariances : list of arrays, shape (K, D, D)
        Covariance matrices for each component
    mixing_coeffs : array, shape (K,)
        Mixing coefficients (must sum to 1)

    Returns:
    --------
    samples : array, shape (n_samples, D)
        Generated samples
    component_assignments : array, shape (n_samples,)
        True component assignment for each sample
    """
    K = len(means)
    D = len(means[0])
    samples = np.zeros((n_samples, D))
    component_assignments = np.zeros(n_samples, dtype=int)

    for i in range(n_samples):
        # Step 1: Sample component assignment
        z = np.random.choice(K, p=mixing_coeffs)
        component_assignments[i] = z

        # Step 2: Sample from chosen component
        samples[i] = np.random.multivariate_normal(
            mean=means[z],
            cov=covariances[z]
        )

    return samples, component_assignments

# Example: 3-component GMM in 2D
means = [
    np.array([0.0, 0.0]),
    np.array([5.0, 0.0]),
    np.array([2.5, 4.0])
]
covariances = [
    np.array([[1.0, 0.3], [0.3, 1.0]]),
    np.array([[0.8, -0.2], [-0.2, 0.6]]),
    np.array([[1.2, 0.0], [0.0, 0.5]])
]
mixing_coeffs = np.array([0.3, 0.5, 0.2])

# Generate 1000 samples
samples, true_labels = sample_from_gmm(
    n_samples=1000,
    means=means,
    covariances=covariances,
    mixing_coeffs=mixing_coeffs
)

print(f"Generated {len(samples)} samples from 3-component GMM")
print(f"Component proportions: {np.bincount(true_labels) / len(true_labels)}")
```

To generate a dataset of $N$ observations:
Repeat the two-step process $N$ times independently. Each observation $\mathbf{x}_n$ is associated with an unobserved (latent) component assignment $z_n$. The complete generative process thus produces both the observations $\{\mathbf{x}_n\}_{n=1}^{N}$ and the hidden assignments $\{z_n\}_{n=1}^{N}$.
The fundamental challenge:
In practice, we observe only the $\mathbf{x}_n$ values—the component assignments $z_n$ are latent (hidden). We cannot directly count which observations belong to which component. This is precisely what makes mixture model inference challenging and leads to the Expectation-Maximization algorithm (covered in Module 4).
Crucially, given the component assignment $z_n = k$, each observation $\mathbf{x}_n$ is independent of all other observations and their component assignments. The observations are conditionally independent given the latent assignments. This conditional independence structure is what makes inference tractable.
We now show that the generative story rigorously implies the mixture density formula. This derivation uses fundamental probability rules and deepens understanding of the model structure.
Starting point: Joint distribution of $\mathbf{x}$ and $z$
The generative story specifies two ingredients: the prior over components, $P(z = k) = \pi_k$, and the conditional density of the observation given the component, $p(\mathbf{x} \mid z = k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.
By the chain rule of probability: $$p(\mathbf{x}, z = k) = P(z = k) \cdot p(\mathbf{x} \mid z = k) = \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
To obtain the marginal density of $\mathbf{x}$, we marginalize (sum out) the latent variable $z$:
$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x}, z = k) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
This is exactly the GMM density! The mixture formula arises naturally from marginalizing over the latent component assignment.
The law of total probability in action:
The marginalization step is an application of the law of total probability: $$p(\mathbf{x}) = \sum_{k} P(z = k) \cdot p(\mathbf{x} \mid z = k)$$
This says: the probability of observing $\mathbf{x}$ is the sum over all ways $\mathbf{x}$ could have been generated—weighting each component's contribution by the probability of selecting that component.
Posterior responsibility:
Conversely, given an observation $\mathbf{x}$, we can compute the posterior probability that it came from component $k$ using Bayes' theorem: $$P(z = k \mid \mathbf{x}) = \frac{P(z = k) \cdot p(\mathbf{x} \mid z = k)}{p(\mathbf{x})} = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
This quantity, often denoted $\gamma_{nk}$ for observation $n$ and component $k$, is called the responsibility—how responsible component $k$ is for generating observation $\mathbf{x}_n$. It plays a central role in the EM algorithm.
| Quantity | Formula | Interpretation |
|---|---|---|
| Prior $P(z = k)$ | $\pi_k$ | Probability of selecting component $k$ before seeing data |
| Likelihood $p(\mathbf{x} \mid z = k)$ | $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Probability density of $\mathbf{x}$ under component $k$ |
| Joint $p(\mathbf{x}, z = k)$ | $\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Joint probability of $\mathbf{x}$ and assignment to $k$ |
| Marginal $p(\mathbf{x})$ | $\sum_k \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Overall density of $\mathbf{x}$ (mixture density) |
| Posterior $P(z = k \mid \mathbf{x})$ | $\gamma_k = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{p(\mathbf{x})}$ | Responsibility of component $k$ for $\mathbf{x}$ |
Unlike hard clustering (e.g., k-means), GMMs provide soft assignments via posterior responsibilities. Each point has a probability distribution over components rather than a single assignment. This is useful for uncertainty quantification—we know not just the most likely cluster, but how confident that assignment is.
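The soft assignments can be computed directly from Bayes' theorem as derived above. A minimal sketch (the function name `responsibilities` is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, means, covs, weights):
    """Posterior P(z = k | x) for each component k, via Bayes' theorem."""
    joint = np.array([w * multivariate_normal(m, c).pdf(x)
                      for w, m, c in zip(weights, means, covs)])
    return joint / joint.sum()  # normalize by the marginal p(x)

means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
weights = [0.5, 0.5]

# A point midway between the two means gets split responsibility
print(responsibilities(np.array([2.0, 0.0]), means, covs, weights))
# A point at the first mean is assigned almost entirely to component 0
print(responsibilities(np.array([0.0, 0.0]), means, covs, weights))
```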
Given observed data $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, we want to estimate the GMM parameters $\boldsymbol{\theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$. The maximum likelihood approach seeks parameters that maximize the probability of observing the data.
Assuming independence:
If observations are independent and identically distributed (i.i.d.) given the parameters, the joint likelihood is: $$L(\boldsymbol{\theta}) = p(\mathbf{X} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Taking the logarithm (which doesn't change the location of the maximum but makes computation more stable):
$$\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
This is the log-likelihood for a Gaussian Mixture Model.
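In practice the inner sum is evaluated in log space with the log-sum-exp trick, since products of tiny densities underflow. A sketch under that convention (the function name `gmm_log_likelihood` is illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, means, covs, weights):
    """Log-likelihood sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k),
    computed with log-sum-exp for numerical stability."""
    # log_comp[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_comp = np.column_stack([
        np.log(w) + multivariate_normal(m, c).logpdf(X)
        for w, m, c in zip(weights, means, covs)
    ])
    return np.sum(logsumexp(log_comp, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
weights = [0.7, 0.3]
print(f"Log-likelihood: {gmm_log_likelihood(X, means, covs, weights):.2f}")
```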
Why the GMM log-likelihood is difficult to optimize:
Note the structure of the log-likelihood: the logarithm sits outside the summation over components, so it cannot be distributed across the Gaussian terms. This log-of-sum structure is the source of all the optimization difficulty:
$$\ell(\boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
Comparison with single Gaussian:
For a single Gaussian, the log-likelihood is: $$\ell(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$ $$= -\frac{N}{2} \log|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}) + \text{const}$$
This is a concave function of the parameters (after suitable reparameterization), and the maximum has a closed-form solution: $\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_n \mathbf{x}_n$ and $\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_n (\mathbf{x}_n - \hat{\boldsymbol{\mu}})(\mathbf{x}_n - \hat{\boldsymbol{\mu}})^\top$.
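The closed-form estimators can be verified with a quick numerical check: on a large synthetic sample, the sample mean and the $1/N$-scaled scatter matrix recover the true parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=50_000)

# Closed-form maximum-likelihood estimates for a single Gaussian
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)  # note 1/N, not the unbiased 1/(N-1)

print("mu_hat    :", np.round(mu_hat, 3))
print("Sigma_hat :\n", np.round(Sigma_hat, 3))
```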
The mixture complication:
For the GMM, the log-of-sum structure prevents closed-form solutions. Setting derivatives to zero yields equations where each parameter appears on both sides, entangled with all other parameters. No algebraic rearrangement isolates them.
This is why we need iterative algorithms like Expectation-Maximization (EM)—the subject of Module 4.
Consider what happens if $\boldsymbol{\mu}_k \to \mathbf{x}_n$ and $\boldsymbol{\Sigma}_k \to 0$ for any data point $\mathbf{x}_n$. The Gaussian density at that point goes to infinity, and so does the log-likelihood. This is a pathological solution where one component "collapses" onto a single point. Prevention strategies include regularization (adding a small constant to diagonal) or discarding components with very small variance.
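The divergence is easy to see numerically: with $\mu_k$ fixed at the data point, the log-density at that point grows without bound as $\sigma$ shrinks.

```python
import numpy as np

x_n = 0.5   # a single data point
mu = x_n    # component mean collapsed exactly onto the data point

log_densities = []
for sigma in [1.0, 0.1, 0.01, 1e-4]:
    # With mu = x_n the exponent vanishes, so the density at x_n is
    # 1 / sqrt(2*pi*sigma^2), which diverges as sigma -> 0.
    log_density = -0.5 * np.log(2 * np.pi * sigma ** 2)
    log_densities.append(log_density)
    print(f"sigma = {sigma:8.4f}  ->  log-density at x_n = {log_density:8.3f}")
```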
Understanding mixture densities geometrically builds intuition for how component parameters affect the overall distribution. Let's examine how different configurations produce different shapes.
One-dimensional mixtures:
In 1D, mixture densities appear as weighted sums of bell curves. The resulting shape exhibits up to $K$ modes (fewer when components overlap heavily or a weight is very small), peaks of different heights when weights or variances differ, and smooth valleys rather than gaps between the modes.
Two-dimensional mixtures:
In 2D, we can visualize with contour plots or 3D surface plots. Each Gaussian component produces elliptical contours (circles for spherical covariance). The mixture density shows multiple peaks near the component means, saddle regions between neighboring components, and outer contours that can deform into non-elliptical, even non-convex, shapes.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

def gmm_density(x, y, means, covs, weights):
    """Compute GMM density on a 2D grid."""
    pos = np.dstack((x, y))
    density = np.zeros_like(x)
    for mean, cov, weight in zip(means, covs, weights):
        rv = multivariate_normal(mean, cov)
        density += weight * rv.pdf(pos)
    return density

# Define a 3-component GMM
means = [
    np.array([-2.0, 0.0]),
    np.array([2.0, 0.0]),
    np.array([0.0, 2.5])
]
covs = [
    np.array([[0.8, 0.3], [0.3, 0.5]]),
    np.array([[0.6, -0.2], [-0.2, 0.8]]),
    np.array([[0.5, 0.0], [0.0, 0.5]])
]
weights = [0.35, 0.40, 0.25]

# Create grid for visualization
x = np.linspace(-5, 5, 200)
y = np.linspace(-3, 5, 200)
X, Y = np.meshgrid(x, y)

# Compute density
Z = gmm_density(X, Y, means, covs, weights)

# Plot contours
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Contour plot
ax1 = axes[0]
contour = ax1.contour(X, Y, Z, levels=15, cmap='viridis')
ax1.clabel(contour, inline=True, fontsize=8)
ax1.scatter([m[0] for m in means], [m[1] for m in means],
            c='red', s=100, marker='x', linewidths=3, label='Component means')
ax1.set_xlabel('$x_1$')
ax1.set_ylabel('$x_2$')
ax1.set_title('GMM Density Contours')
ax1.legend()
ax1.set_aspect('equal')

# Filled contour
ax2 = axes[1]
filled = ax2.contourf(X, Y, Z, levels=30, cmap='magma')
plt.colorbar(filled, ax=ax2, label='Density')
ax2.scatter([m[0] for m in means], [m[1] for m in means],
            c='white', s=100, marker='x', linewidths=3)
ax2.set_xlabel('$x_1$')
ax2.set_ylabel('$x_2$')
ax2.set_title('GMM Density Heat Map')
ax2.set_aspect('equal')

plt.tight_layout()
plt.savefig('gmm_visualization.png', dpi=150)
plt.show()

print("Visualization complete. Notice how:")
print("1. Contours are not simple ellipses - they're complex curves")
print("2. The density between modes doesn't go to zero - there's overlap")
print("3. Each component contributes to the density everywhere")
```

Key observations from visualization:
1. Superposition principle: The mixture density at any point is the sum of individual Gaussian densities. Where components overlap, densities add, creating ridges or elevated regions between modes.
2. Non-elliptical contours: Unlike a single Gaussian, mixture contours can have complex, non-convex shapes. Contours at lower density levels may encircle multiple modes.
3. Weight effects: Components with larger $\pi_k$ contribute more to the overall density. A component with small weight barely affects the density shape, even if its Gaussian is well-separated.
4. Covariance effects: The orientation and eccentricity of each component's ellipse affects how components "blend" together. Aligned components merge smoothly; perpendicular ones create more complex intersections.
5. Separation effects: Well-separated components produce clearly multimodal distributions. Overlapping components create ambiguity in component assignment.
When we later study the EM algorithm, remember this geometry. The E-step computes responsibilities—asking "how much of this point's density comes from each component?" The M-step updates parameters so each component better fits the points it's responsible for. Visualizing how the density surface changes with parameters helps understand EM's behavior.
We have developed the complete mathematical formulation of Gaussian Mixture Models from first principles. Let's consolidate the key concepts:
| Parameter Type | Symbol | Count | Constraints |
|---|---|---|---|
| Mixing coefficients | $\pi_k$ | $K - 1$ (free) | $\pi_k \geq 0$, $\sum_k \pi_k = 1$ |
| Component means | $\boldsymbol{\mu}_k$ | $K \times D$ | None |
| Component covariances | $\boldsymbol{\Sigma}_k$ | $K \times D(D+1)/2$ | Positive definite |
What comes next:
This formulation page establishes the probabilistic framework. The subsequent pages in this module build upon this foundation:
Page 1: Latent Variables — Deepens the understanding of $z$ as a latent variable, exploring the complete-data likelihood and why latent variables make GMMs both powerful and challenging.
Page 2: Marginal Likelihood — Develops the marginal likelihood perspective for model comparison and its connection to Bayesian inference.
Page 3: Identifiability — Addresses when GMM parameters are uniquely determined and the label-switching symmetry.
Page 4: Model Selection — Covers methods for determining the optimal number of components $K$.
The EM algorithm for fitting GMMs is covered in detail in Module 4: Expectation-Maximization Algorithm.
You now understand the complete mathematical formulation of Gaussian Mixture Models: the multivariate Gaussian foundation, the mixture density formula, the generative story, and the log-likelihood optimization challenge. This foundation enables rigorous understanding of latent variable inference, model selection, and the EM algorithm in subsequent pages.