In statistical modeling and machine learning, we frequently encounter data that cannot be adequately described by a single parametric distribution. Consider a dataset of human heights that includes both adult males and adult females—a single Gaussian distribution would fail to capture the bimodal nature of the data. Or imagine customer purchase amounts in a retail dataset, where different customer segments exhibit fundamentally different spending behaviors.
Mixture models provide an elegant and principled solution to this challenge. Rather than forcing all data through a single distributional lens, mixture models posit that the observed data arises from multiple underlying subpopulations, each characterized by its own parametric distribution. The overall data distribution is then a weighted combination—a mixture—of these component distributions.
This page develops the complete mathematical formulation of Gaussian Mixture Models (GMMs), the most widely used instance of mixture modeling. We will derive every component of the model from first principles, ensuring you understand not just what the equations say, but why they take the form they do.
By the end of this page, you will be able to: (1) Write down the complete probability density function for a Gaussian mixture model, (2) Explain the role and interpretation of mixing coefficients, (3) Describe the generative process that produces mixture data, (4) Derive the relationship between component densities and the overall mixture density, and (5) Formulate the log-likelihood function for parameter estimation.
Before constructing mixture models, we must establish a rigorous understanding of the Gaussian distribution itself. The univariate Gaussian (or Normal) distribution is parameterized by two quantities: the mean $\mu$ (location parameter) and the variance $\sigma^2$ (scale parameter). Its probability density function (pdf) is:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
This function describes the probability density at point $x$ given parameters $\mu$ and $\sigma^2$.
Dissecting the components:
The normalization constant $\frac{1}{\sqrt{2\pi\sigma^2}}$ ensures that the pdf integrates to 1 over the entire real line. This is a requirement for any valid probability density function. Notice that larger variance $\sigma^2$ leads to a smaller normalization constant—because the probability mass is spread over a wider range, the peak height must decrease.
The exponential term $\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$ captures the shape of the distribution. The squared term $(x - \mu)^2$ measures the distance from the mean, ensuring symmetry around $\mu$. The division by $2\sigma^2$ scales this distance relative to the spread—points many standard deviations from the mean receive exponentially lower density.
The maximum density occurs at $x = \mu$, where the exponent is zero and the exponential term equals 1. As $|x - \mu|$ increases, the density decreases rapidly due to the negative quadratic in the exponent.
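A quick numerical sketch confirms both properties above: the peak height equals the normalization constant, and the density integrates to one. (The helper name `gaussian_pdf` is illustrative, not a library function.)

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.0, 4.0

# Peak density at x = mu: exponent is zero, leaving 1/sqrt(2*pi*sigma^2)
peak = gaussian_pdf(mu, mu, sigma2)
print(f"Peak density at x = mu: {peak:.4f}")  # 1/sqrt(8*pi) ≈ 0.1995

# Riemann-sum check that the density integrates to 1 over (effectively) the real line
xs = np.linspace(mu - 10 * np.sqrt(sigma2), mu + 10 * np.sqrt(sigma2), 100001)
dx = xs[1] - xs[0]
area = np.sum(gaussian_pdf(xs, mu, sigma2)) * dx
print(f"Integral over the real line: {area:.6f}")
```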
The Gaussian's unimodality is precisely why we need mixtures. A single Gaussian cannot represent data with multiple clusters or modes. By combining multiple Gaussians—each representing a different subpopulation—we can model complex, multimodal distributions while retaining the mathematical tractability that makes Gaussians so useful.
Real-world data typically involves multiple dimensions. Customer features might include age, income, and purchase frequency. Image data involves thousands of pixel intensities. To handle such data, we extend the Gaussian to multiple dimensions.
For a $D$-dimensional random vector $\mathbf{x} = (x_1, x_2, \ldots, x_D)^\top$, the multivariate Gaussian distribution is parameterized by a mean vector $\boldsymbol{\mu} \in \mathbb{R}^D$ and a covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{D \times D}$. Its probability density function is:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where $|\boldsymbol{\Sigma}|$ denotes the determinant of the covariance matrix.
Understanding each component:
The normalization constant $\frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}$ generalizes the univariate case. The $D/2$ power accounts for $D$ independent dimensions, while the determinant $|\boldsymbol{\Sigma}|$ measures the "volume" of the distribution's spread. A larger determinant means more spread, requiring a smaller peak height.
The Mahalanobis distance $(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is a generalized squared distance that accounts for both the different scales of variation in each dimension and the correlations between dimensions.
This quadratic form defines elliptical contours of constant density. Points on the same ellipse have equal probability density.
The precision matrix $\boldsymbol{\Sigma}^{-1}$, inverse of the covariance, appears rather than $\boldsymbol{\Sigma}$ itself. This is computationally important—we often work with precision directly in optimization.
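The pieces above can be assembled directly: compute the squared Mahalanobis distance through the precision matrix, then scale by the normalization constant. This is a sketch (the function name is illustrative; production code would use a Cholesky factorization rather than an explicit inverse):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density via the Mahalanobis distance."""
    D = len(mu)
    diff = x - mu
    precision = np.linalg.inv(Sigma)           # Sigma^{-1}, the precision matrix
    mahal_sq = diff @ precision @ diff         # squared Mahalanobis distance
    norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * mahal_sq)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = np.array([1.0, -0.5])
print(f"Density at x: {multivariate_gaussian_pdf(x, mu, Sigma):.6f}")
```

At $\mathbf{x} = \boldsymbol{\mu}$ the Mahalanobis distance is zero, so the density equals the normalization constant, mirroring the univariate peak.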
| Covariance Form | Structure | Geometric Shape | Parameters |
|---|---|---|---|
| Spherical | $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$ | Hyperspheres | $1$ parameter |
| Diagonal | $\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \ldots, \sigma_D^2)$ | Axis-aligned ellipsoids | $D$ parameters |
| Full | Arbitrary positive definite | Arbitrarily oriented ellipsoids | $D(D+1)/2$ parameters |
The parameter count matters for mixture models. A full covariance matrix in $D$ dimensions has $D(D+1)/2$ free parameters due to symmetry. For a mixture with $K$ components, this means $K \cdot D(D+1)/2$ covariance parameters, which can quickly become prohibitive. This motivates restricted forms:
Spherical covariance (isotropic): All dimensions have equal variance, no correlation. This produces spherical clusters.
Diagonal covariance: Each dimension has its own variance, but no correlation between dimensions. This produces axis-aligned elliptical clusters.
Full covariance: Arbitrary orientation and scaling. Most flexible but most parameters.
Tied covariance: All components share the same covariance matrix, differing only in means. Reduces parameters significantly.
The covariance matrix must be positive definite for the density to be well-defined. This means all eigenvalues must be strictly positive. Numerical issues during estimation can produce near-singular or non-positive-definite estimates, causing algorithms to fail. Regularization (adding a small value to the diagonal) is a common remedy.
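The regularization remedy can be sketched in a few lines: a Cholesky factorization succeeds exactly when the matrix is positive definite, so a failed factorization triggers the diagonal fix. (The helper name `regularize_covariance` is illustrative.)

```python
import numpy as np

def regularize_covariance(Sigma, eps=1e-6):
    """Ensure a covariance estimate is positive definite by adding eps * I
    to the diagonal when a Cholesky factorization fails."""
    try:
        np.linalg.cholesky(Sigma)  # succeeds iff Sigma is positive definite
        return Sigma
    except np.linalg.LinAlgError:
        return Sigma + eps * np.eye(Sigma.shape[0])

# A rank-deficient covariance estimate (eigenvalues 2 and 0)
singular = np.array([[1.0, 1.0], [1.0, 1.0]])
fixed = regularize_covariance(singular)
print("Eigenvalues after regularization:", np.linalg.eigvalsh(fixed))
```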
Now we construct the Gaussian Mixture Model by combining multiple Gaussian components. The key insight is that complex, multimodal distributions can often be approximated as weighted sums of simpler unimodal distributions.
A Gaussian Mixture Model with $K$ components is defined by:
$K$ component densities: Each component $k \in \{1, 2, \ldots, K\}$ is a multivariate Gaussian with its own parameters $(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
$K$ mixing coefficients: Non-negative weights $\pi_1, \pi_2, \ldots, \pi_K$ that sum to one: $\sum_{k=1}^{K} \pi_k = 1$
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
The probability density at any point $\mathbf{x}$ is a weighted sum of the densities assigned by each Gaussian component.
Why this is a valid probability density:
For $p(\mathbf{x})$ to be a valid pdf, it must satisfy two requirements: it must be non-negative everywhere, and it must integrate to one.
Non-negativity is immediate: each $\pi_k \geq 0$ and each $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \geq 0$, so their product and sum are non-negative.
Integration to one follows from: $$\int p(\mathbf{x})\, d\mathbf{x} = \int \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, d\mathbf{x}$$ $$= \sum_{k=1}^{K} \pi_k \int \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, d\mathbf{x} = \sum_{k=1}^{K} \pi_k \cdot 1 = 1$$
The integral and sum exchange by linearity, and each Gaussian integrates to 1.
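The same argument can be checked numerically in one dimension: a weighted sum of Gaussian densities still integrates to one. (The helper `gmm_density_1d` is illustrative.)

```python
import numpy as np

def gmm_density_1d(x, means, variances, weights):
    """Density of a 1D Gaussian mixture: p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)."""
    x = np.asarray(x)[..., None]  # broadcast x against the K components
    comps = np.exp(-(x - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    return comps @ weights        # weighted sum over components

means = np.array([-2.0, 1.0, 4.0])
variances = np.array([0.5, 1.0, 0.8])
weights = np.array([0.3, 0.5, 0.2])  # non-negative, sum to 1

# Riemann-sum check that the mixture density integrates to 1
xs = np.linspace(-15, 15, 200001)
dx = xs[1] - xs[0]
total = np.sum(gmm_density_1d(xs, means, variances, weights)) * dx
print(f"Mixture integrates to: {total:.6f}")
```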
The complete parameter set:
Collecting all parameters, a GMM with $K$ components in $D$ dimensions has:
$$\boldsymbol{\theta} = \{\pi_1, \ldots, \pi_K, \boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K, \boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_K\}$$
Parameter count for full covariance:

- Mixing coefficients: $K - 1$ free parameters (the constraint $\sum_k \pi_k = 1$ removes one degree of freedom)
- Means: $K \cdot D$ parameters
- Covariances: $K \cdot D(D+1)/2$ parameters (by symmetry)

Total: $K - 1 + KD + K \cdot D(D+1)/2 = K(1 + D + D(D+1)/2) - 1$ parameters
For example, with $K = 5$ components in $D = 10$ dimensions with full covariance, the total is $4 + 50 + 275 = 329$ free parameters.
This illustrates why simpler covariance structures are often necessary in practice.
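The counts for the different covariance structures are easy to tabulate programmatically. This small sketch (the function name is illustrative) reproduces the $K = 5$, $D = 10$ example:

```python
def gmm_param_count(K, D, covariance="full"):
    """Number of free parameters in a K-component, D-dimensional GMM."""
    mixing = K - 1  # weights are constrained to sum to 1
    means = K * D
    per_cov = {"spherical": 1, "diag": D, "full": D * (D + 1) // 2}[covariance]
    return mixing + means + K * per_cov

for form in ("spherical", "diag", "full"):
    print(f"K=5, D=10, {form:9s}: {gmm_param_count(5, 10, form)} parameters")
```

Moving from full to diagonal covariance cuts the count from 329 to 104, which is why restricted forms are common in high dimensions.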
One of the most powerful ways to understand probabilistic models is through their generative story—a step-by-step description of how the observed data could have been produced.
For a Gaussian Mixture Model, the generative process for a single observation is:
Step 1: Sample the component assignment Draw a component index $z$ from a Categorical distribution with probabilities $(\pi_1, \ldots, \pi_K)$: $$z \sim \text{Categorical}(\pi_1, \ldots, \pi_K)$$
Step 2: Sample the observation from the chosen component Given $z = k$, draw the observation $\mathbf{x}$ from the $k$-th Gaussian: $$\mathbf{x} \mid z = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Think of it like this: Imagine a factory with $K$ manufacturing machines. Each machine produces products with slightly different characteristics (mean and variance). To produce one item, we first randomly select a machine (with probability $\pi_k$ for machine $k$), then that machine produces an item according to its own distribution. The final product shows which machine made it... but imagine we lost those labels. That's exactly the GMM setup!
```python
import numpy as np

def sample_from_gmm(n_samples, means, covariances, mixing_coeffs):
    """
    Generate samples from a Gaussian Mixture Model.

    Parameters:
    -----------
    n_samples : int
        Number of samples to generate
    means : list of arrays, shape (K, D)
        Mean vectors for each component
    covariances : list of arrays, shape (K, D, D)
        Covariance matrices for each component
    mixing_coeffs : array, shape (K,)
        Mixing coefficients (must sum to 1)

    Returns:
    --------
    samples : array, shape (n_samples, D)
        Generated samples
    component_assignments : array, shape (n_samples,)
        True component assignment for each sample
    """
    K = len(means)
    D = len(means[0])
    samples = np.zeros((n_samples, D))
    component_assignments = np.zeros(n_samples, dtype=int)

    for i in range(n_samples):
        # Step 1: Sample component assignment
        z = np.random.choice(K, p=mixing_coeffs)
        component_assignments[i] = z

        # Step 2: Sample from chosen component
        samples[i] = np.random.multivariate_normal(
            mean=means[z],
            cov=covariances[z]
        )

    return samples, component_assignments

# Example: 3-component GMM in 2D
means = [
    np.array([0.0, 0.0]),
    np.array([5.0, 0.0]),
    np.array([2.5, 4.0])
]
covariances = [
    np.array([[1.0, 0.3], [0.3, 1.0]]),
    np.array([[0.8, -0.2], [-0.2, 0.6]]),
    np.array([[1.2, 0.0], [0.0, 0.5]])
]
mixing_coeffs = np.array([0.3, 0.5, 0.2])

# Generate 1000 samples
samples, true_labels = sample_from_gmm(
    n_samples=1000,
    means=means,
    covariances=covariances,
    mixing_coeffs=mixing_coeffs
)

print(f"Generated {len(samples)} samples from 3-component GMM")
print(f"Component proportions: {np.bincount(true_labels) / len(true_labels)}")
```

To generate a dataset of $N$ observations:
Repeat the two-step process $N$ times independently. Each observation $\mathbf{x}_n$ is associated with an unobserved (latent) component assignment $z_n$. The complete generative process thus produces both the observations $\{\mathbf{x}_n\}_{n=1}^{N}$ and the hidden assignments $\{z_n\}_{n=1}^{N}$.
The fundamental challenge:
In practice, we observe only the $\mathbf{x}_n$ values—the component assignments $z_n$ are latent (hidden). We cannot directly count which observations belong to which component. This is precisely what makes mixture model inference challenging and leads to the Expectation-Maximization algorithm (covered in Module 4).
Crucially, given the component assignment $z_n = k$, each observation $\mathbf{x}_n$ is independent of all other observations and their component assignments. The observations are conditionally independent given the latent assignments. This conditional independence structure is what makes inference tractable.
We now show that the generative story rigorously implies the mixture density formula. This derivation uses fundamental probability rules and deepens understanding of the model structure.
Starting point: Joint distribution of $\mathbf{x}$ and $z$
The generative story specifies two ingredients: the prior over components, $P(z = k) = \pi_k$, and the conditional density of the observation given the component, $p(\mathbf{x} \mid z = k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.
By the chain rule of probability: $$p(\mathbf{x}, z = k) = P(z = k) \cdot p(\mathbf{x} \mid z = k) = \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
To obtain the marginal density of $\mathbf{x}$, we marginalize (sum out) the latent variable $z$:
$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x}, z = k) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
This is exactly the GMM density! The mixture formula arises naturally from marginalizing over the latent component assignment.
The law of total probability in action:
The marginalization step is an application of the law of total probability: $$p(\mathbf{x}) = \sum_{k} P(z = k) \cdot p(\mathbf{x} \mid z = k)$$
This says: the probability of observing $\mathbf{x}$ is the sum over all ways $\mathbf{x}$ could have been generated—weighting each component's contribution by the probability of selecting that component.
Posterior responsibility:
Conversely, given an observation $\mathbf{x}$, we can compute the posterior probability that it came from component $k$ using Bayes' theorem: $$P(z = k \mid \mathbf{x}) = \frac{P(z = k) \cdot p(\mathbf{x} \mid z = k)}{p(\mathbf{x})} = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
This quantity, often denoted $\gamma_{nk}$ for observation $n$ and component $k$, is called the responsibility—how responsible component $k$ is for generating observation $\mathbf{x}_n$. It plays a central role in the EM algorithm.
| Quantity | Formula | Interpretation |
|---|---|---|
| Prior $P(z = k)$ | $\pi_k$ | Probability of selecting component $k$ before seeing data |
| Likelihood $p(\mathbf{x} \mid z = k)$ | $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Probability density of $\mathbf{x}$ under component $k$ |
| Joint $p(\mathbf{x}, z = k)$ | $\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Joint probability of $\mathbf{x}$ and assignment to $k$ |
| Marginal $p(\mathbf{x})$ | $\sum_k \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ | Overall density of $\mathbf{x}$ (mixture density) |
| Posterior $P(z = k \mid \mathbf{x})$ | $\gamma_k = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{p(\mathbf{x})}$ | Responsibility of component $k$ for $\mathbf{x}$ |
Unlike hard clustering (e.g., k-means), GMMs provide soft assignments via posterior responsibilities. Each point has a probability distribution over components rather than a single assignment. This is useful for uncertainty quantification—we know not just the most likely cluster, but how confident that assignment is.
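The soft assignments can be computed directly from Bayes' theorem as derived above. A minimal sketch (the function name `responsibilities` is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, means, covs, weights):
    """Posterior P(z = k | x) for each component k, via Bayes' theorem."""
    joint = np.array([w * multivariate_normal(m, c).pdf(x)
                      for w, m, c in zip(weights, means, covs)])
    return joint / joint.sum()  # normalize by the marginal p(x)

means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
weights = [0.5, 0.5]

# A point midway between the two means gets split responsibility
print(responsibilities(np.array([2.0, 0.0]), means, covs, weights))
# A point at the first mean is assigned almost entirely to component 0
print(responsibilities(np.array([0.0, 0.0]), means, covs, weights))
```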
Given observed data $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, we want to estimate the GMM parameters $\boldsymbol{\theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$. The maximum likelihood approach seeks parameters that maximize the probability of observing the data.
Assuming independence:
If observations are independent and identically distributed (i.i.d.) given the parameters, the joint likelihood is: $$L(\boldsymbol{\theta}) = p(\mathbf{X} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Taking the logarithm (which doesn't change the location of the maximum but makes computation more stable):
$$\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
This is the log-likelihood for a Gaussian Mixture Model.
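In practice the inner sum is evaluated in log space with the log-sum-exp trick, since products of tiny densities underflow. A sketch under that convention (the function name `gmm_log_likelihood` is illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, means, covs, weights):
    """Log-likelihood sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k),
    computed with log-sum-exp for numerical stability."""
    # log_comp[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_comp = np.column_stack([
        np.log(w) + multivariate_normal(m, c).logpdf(X)
        for w, m, c in zip(weights, means, covs)
    ])
    return np.sum(logsumexp(log_comp, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
weights = [0.7, 0.3]
print(f"Log-likelihood: {gmm_log_likelihood(X, means, covs, weights):.2f}")
```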
Why the GMM log-likelihood is difficult to optimize:
Note the structure of the log-likelihood: the logarithm sits outside the summation over components, so it cannot be distributed across the Gaussian terms. This log-of-sum structure is the source of all the optimization difficulty:
$$\ell(\boldsymbol{\theta}) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right)$$
Comparison with single Gaussian:
For a single Gaussian, the log-likelihood is: $$\ell(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \log \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$ $$= -\frac{N}{2} \log|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}) + \text{const}$$
This is a concave function of the parameters (after suitable reparameterization), and the maximum has a closed-form solution: $\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_n \mathbf{x}_n$ and $\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_n (\mathbf{x}_n - \hat{\boldsymbol{\mu}})(\mathbf{x}_n - \hat{\boldsymbol{\mu}})^\top$.
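The closed-form estimators can be verified with a quick numerical check: on a large synthetic sample, the sample mean and the $1/N$-scaled scatter matrix recover the true parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=50_000)

# Closed-form maximum-likelihood estimates for a single Gaussian
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)  # note 1/N, not the unbiased 1/(N-1)

print("mu_hat    :", np.round(mu_hat, 3))
print("Sigma_hat :\n", np.round(Sigma_hat, 3))
```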
The mixture complication:
For the GMM, the log-of-sum structure prevents closed-form solutions. Setting derivatives to zero yields equations where each parameter appears on both sides, entangled with all other parameters. No algebraic rearrangement isolates them.
This is why we need iterative algorithms like Expectation-Maximization (EM)—the subject of Module 4.
Consider what happens if $\boldsymbol{\mu}_k \to \mathbf{x}_n$ and $\boldsymbol{\Sigma}_k \to 0$ for any data point $\mathbf{x}_n$. The Gaussian density at that point goes to infinity, and so does the log-likelihood. This is a pathological solution where one component "collapses" onto a single point. Prevention strategies include regularization (adding a small constant to diagonal) or discarding components with very small variance.
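The divergence is easy to see numerically: with $\mu_k$ fixed at the data point, the log-density at that point grows without bound as $\sigma$ shrinks.

```python
import numpy as np

x_n = 0.5   # a single data point
mu = x_n    # component mean collapsed exactly onto the data point

log_densities = []
for sigma in [1.0, 0.1, 0.01, 1e-4]:
    # With mu = x_n the exponent vanishes, so the density at x_n is
    # 1 / sqrt(2*pi*sigma^2), which diverges as sigma -> 0.
    log_density = -0.5 * np.log(2 * np.pi * sigma ** 2)
    log_densities.append(log_density)
    print(f"sigma = {sigma:8.4f}  ->  log-density at x_n = {log_density:8.3f}")
```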
Understanding mixture densities geometrically builds intuition for how component parameters affect the overall distribution. Let's examine how different configurations produce different shapes.
One-dimensional mixtures:
In 1D, mixture densities appear as weighted sums of bell curves. The resulting shape exhibits up to $K$ modes (fewer when components overlap heavily or a weight is very small), peaks of different heights when weights or variances differ, and smooth valleys rather than gaps between the modes.
Two-dimensional mixtures:
In 2D, we can visualize with contour plots or 3D surface plots. Each Gaussian component produces elliptical contours (circles for spherical covariance). The mixture density shows multiple peaks near the component means, saddle regions between neighboring components, and outer contours that can deform into non-elliptical, even non-convex, shapes.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

def gmm_density(x, y, means, covs, weights):
    """Compute GMM density on a 2D grid."""
    pos = np.dstack((x, y))
    density = np.zeros_like(x)
    for mean, cov, weight in zip(means, covs, weights):
        rv = multivariate_normal(mean, cov)
        density += weight * rv.pdf(pos)
    return density

# Define a 3-component GMM
means = [
    np.array([-2.0, 0.0]),
    np.array([2.0, 0.0]),
    np.array([0.0, 2.5])
]
covs = [
    np.array([[0.8, 0.3], [0.3, 0.5]]),
    np.array([[0.6, -0.2], [-0.2, 0.8]]),
    np.array([[0.5, 0.0], [0.0, 0.5]])
]
weights = [0.35, 0.40, 0.25]

# Create grid for visualization
x = np.linspace(-5, 5, 200)
y = np.linspace(-3, 5, 200)
X, Y = np.meshgrid(x, y)

# Compute density
Z = gmm_density(X, Y, means, covs, weights)

# Plot contours
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Contour plot
ax1 = axes[0]
contour = ax1.contour(X, Y, Z, levels=15, cmap='viridis')
ax1.clabel(contour, inline=True, fontsize=8)
ax1.scatter([m[0] for m in means], [m[1] for m in means],
            c='red', s=100, marker='x', linewidths=3, label='Component means')
ax1.set_xlabel('$x_1$')
ax1.set_ylabel('$x_2$')
ax1.set_title('GMM Density Contours')
ax1.legend()
ax1.set_aspect('equal')

# Filled contour
ax2 = axes[1]
filled = ax2.contourf(X, Y, Z, levels=30, cmap='magma')
plt.colorbar(filled, ax=ax2, label='Density')
ax2.scatter([m[0] for m in means], [m[1] for m in means],
            c='white', s=100, marker='x', linewidths=3)
ax2.set_xlabel('$x_1$')
ax2.set_ylabel('$x_2$')
ax2.set_title('GMM Density Heat Map')
ax2.set_aspect('equal')

plt.tight_layout()
plt.savefig('gmm_visualization.png', dpi=150)
plt.show()

print("Visualization complete. Notice how:")
print("1. Contours are not simple ellipses - they're complex curves")
print("2. The density between modes doesn't go to zero - there's overlap")
print("3. Each component contributes to the density everywhere")
```

Key observations from visualization:
1. Superposition principle: The mixture density at any point is the sum of individual Gaussian densities. Where components overlap, densities add, creating ridges or elevated regions between modes.
2. Non-elliptical contours: Unlike a single Gaussian, mixture contours can have complex, non-convex shapes. Contours at lower density levels may encircle multiple modes.
3. Weight effects: Components with larger $\pi_k$ contribute more to the overall density. A component with small weight barely affects the density shape, even if its Gaussian is well-separated.
4. Covariance effects: The orientation and eccentricity of each component's ellipse affects how components "blend" together. Aligned components merge smoothly; perpendicular ones create more complex intersections.
5. Separation effects: Well-separated components produce clearly multimodal distributions. Overlapping components create ambiguity in component assignment.
When we later study the EM algorithm, remember this geometry. The E-step computes responsibilities—asking "how much of this point's density comes from each component?" The M-step updates parameters so each component better fits the points it's responsible for. Visualizing how the density surface changes with parameters helps understand EM's behavior.
We have developed the complete mathematical formulation of Gaussian Mixture Models from first principles. Let's consolidate the key concepts:
| Parameter Type | Symbol | Count | Constraints |
|---|---|---|---|
| Mixing coefficients | $\pi_k$ | $K - 1$ (free) | $\pi_k \geq 0$, $\sum_k \pi_k = 1$ |
| Component means | $\boldsymbol{\mu}_k$ | $K \times D$ | None |
| Component covariances | $\boldsymbol{\Sigma}_k$ | $K \times D(D+1)/2$ | Positive definite |
What comes next:
This formulation page establishes the probabilistic framework. The subsequent pages in this module build upon this foundation:
Page 1: Latent Variables — Deepens the understanding of $z$ as a latent variable, exploring the complete-data likelihood and why latent variables make GMMs both powerful and challenging.
Page 2: Marginal Likelihood — Develops the marginal likelihood perspective for model comparison and its connection to Bayesian inference.
Page 3: Identifiability — Addresses when GMM parameters are uniquely determined and the label-switching symmetry.
Page 4: Model Selection — Covers methods for determining the optimal number of components $K$.
The EM algorithm for fitting GMMs is covered in detail in Module 4: Expectation-Maximization Algorithm.
You now understand the complete mathematical formulation of Gaussian Mixture Models: the multivariate Gaussian foundation, the mixture density formula, the generative story, and the log-likelihood optimization challenge. This foundation enables rigorous understanding of latent variable inference, model selection, and the EM algorithm in subsequent pages.