When we move from single variables to vectors—from one measurement to many—we enter the realm of multivariate probability. And just as the univariate Gaussian dominates single-variable modeling, the Multivariate Gaussian (multivariate normal, MVN) dominates high-dimensional probabilistic modeling.
The Multivariate Gaussian is not merely the univariate Gaussian applied to each dimension independently. It models the joint distribution of multiple random variables, capturing both individual variability and the correlations between variables. This ability to model dependencies is what makes the MVN so powerful.
In machine learning, the MVN appears everywhere: Gaussian mixture models, Gaussian processes, variational autoencoders, Kalman filters, linear and quadratic discriminant analysis, and PCA all build directly on it.
Mastering the Multivariate Gaussian is essential for understanding and developing modern probabilistic machine learning methods.
By the end of this page, you will understand the Multivariate Gaussian: its mathematical definition, the role of the mean vector and covariance matrix, geometric interpretation, conditional and marginal distributions, parameter estimation, and its pervasive applications in machine learning.
A random vector X = [X₁, X₂, ..., X_d]ᵀ follows a Multivariate Gaussian (Normal) distribution with mean vector μ and covariance matrix Σ, written X ~ N(μ, Σ), if its probability density function (PDF) is:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where:
x ∈ ℝᵈ is the point at which the density is evaluated
μ ∈ ℝᵈ is the mean vector
Σ is the d × d covariance matrix, with determinant |Σ| and inverse Σ⁻¹
The MVN PDF has two parts: (1) the normalizing constant 1/((2π)^{d/2}|Σ|^{1/2}), which ensures the density integrates to 1, and (2) the squared Mahalanobis distance (x-μ)ᵀΣ⁻¹(x-μ) in the exponent, which measures 'standardized' distance from the mean, accounting for correlations and different scales.
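As a sanity check on the formula, the two parts can be computed directly and compared against `scipy.stats.multivariate_normal` (the numbers below are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Evaluate the MVN density directly from the formula above."""
    d = len(mu)
    diff = x - mu
    # Squared Mahalanobis distance: (x - μ)ᵀ Σ⁻¹ (x - μ)
    mahal_sq = diff @ np.linalg.inv(Sigma) @ diff
    # Normalizing constant: 1 / ((2π)^{d/2} |Σ|^{1/2})
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * mahal_sq)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, 0.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))  # agrees with the manual computation
```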
The covariance matrix Σ is symmetric positive semi-definite (SPD) and encodes:
Diagonal entries: Σᵢᵢ = Var(Xᵢ) — the variance of each variable
Off-diagonal entries: Σᵢⱼ = Cov(Xᵢ, Xⱼ) = E[(Xᵢ - μᵢ)(Xⱼ - μⱼ)] — covariance between pairs
The correlation matrix is:
$$\rho_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii} \Sigma_{jj}}}$$
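Converting a covariance matrix to a correlation matrix is a one-liner; the matrix below is an illustrative example:

```python
import numpy as np

Sigma = np.array([[4.0, 1.5],
                  [1.5, 2.0]])

stds = np.sqrt(np.diag(Sigma))        # standard deviations σ_i
corr = Sigma / np.outer(stds, stds)   # ρ_ij = Σ_ij / (σ_i σ_j)

print(corr)  # the diagonal is exactly 1
```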
The inverse covariance matrix Λ = Σ⁻¹ is called the precision matrix. It has important properties:
Λᵢⱼ = 0 if and only if Xᵢ and Xⱼ are conditionally independent given all other variables (the basis of Gaussian graphical models)
The diagonal entries satisfy Λᵢᵢ = 1/Var(Xᵢ | all other variables), the inverse of the conditional variance
| Property | Formula | Interpretation |
|---|---|---|
| Mean | E[X] = μ | Center of the distribution |
| Covariance | Cov(X) = Σ | Spread and correlations |
| Marginals | Xᵢ ~ N(μᵢ, Σᵢᵢ) | Each component is univariate Gaussian |
| Mode | μ | Peak of the density |
| Symmetry | About μ | Ellipsoidal symmetry |
| Entropy | ½ log((2πe)ᵈ|Σ|) | Information content |
| KL Divergence | Closed form | See the formula below |
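The entropy row of the table can be verified numerically; the covariance below is an illustrative example:

```python
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[4.0, 1.5],
                  [1.5, 2.0]])
d = Sigma.shape[0]

# H = ½ ln((2πe)^d |Σ|)
entropy = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

print(entropy)
print(multivariate_normal(np.zeros(d), Sigma).entropy())  # same value
```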
The geometry of the Multivariate Gaussian reveals deep insights about its structure.
Points of equal probability density satisfy:
$$(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = c^2$$
This is the equation of an ellipsoid centered at μ. The quantity on the left is the squared Mahalanobis distance from x to μ.
The covariance matrix can be decomposed as:
$$\mathbf{\Sigma} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$$
where:
U is an orthogonal matrix whose columns are the eigenvectors of Σ
Λ is a diagonal matrix of the (non-negative) eigenvalues λ₁, ..., λ_d
Geometric interpretation:
The eigenvectors give the directions of the ellipsoid's principal axes
The eigenvalues give the variances along those axes, so the semi-axis lengths are proportional to √λᵢ
The Mahalanobis distance generalizes Euclidean distance to account for correlations:
$$d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Properties:
Reduces to Euclidean distance when Σ = I
Accounts for different scales and for correlations between variables
Measures distance in units of standard deviations along each principal axis
For X ~ N(μ, Σ), the squared Mahalanobis distance follows:
$$(\mathbf{X} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \sim \chi^2(d)$$
This allows computing the probability that X falls within a given ellipsoid. For example, the 95% confidence ellipsoid satisfies dₘ² ≤ χ²₀.₉₅(d).
```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gamma

def mahalanobis_distance(x: np.ndarray, mu: np.ndarray, Sigma: np.ndarray) -> float:
    """
    Compute the Mahalanobis distance from x to mu under covariance Sigma.

    d_M = sqrt((x - μ)ᵀ Σ⁻¹ (x - μ))
    """
    diff = x - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return np.sqrt(diff @ Sigma_inv @ diff)

def probability_ellipsoid_volume(Sigma: np.ndarray, confidence: float = 0.95) -> dict:
    """
    Compute the confidence ellipsoid parameters.

    For X ~ N(μ, Σ), the squared Mahalanobis distance follows χ²(d).
    """
    d = Sigma.shape[0]

    # Chi-squared critical value
    chi2_critical = chi2.ppf(confidence, df=d)

    # Eigendecomposition for the ellipsoid axes
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

    # Sort by decreasing eigenvalue
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Ellipsoid semi-axis lengths (at the given confidence level)
    semi_axes = np.sqrt(chi2_critical * eigenvalues)

    # Volume of the ellipsoid: V = (π^(d/2) / Γ(d/2 + 1)) · Π(semi_axes)
    volume = (np.pi ** (d / 2) / gamma(d / 2 + 1)) * np.prod(semi_axes)

    return {
        'dimension': d,
        'confidence': confidence,
        'chi2_critical': chi2_critical,
        'eigenvalues': eigenvalues,
        'eigenvectors': eigenvectors,
        'semi_axes': semi_axes,
        'volume': volume,
    }

# Example: 2D Gaussian
print("Multivariate Gaussian Geometry")
print("=" * 60)

mu = np.array([2.0, 3.0])
Sigma = np.array([
    [4.0, 1.5],
    [1.5, 2.0]
])

print(f"Mean: μ = {mu}")
print("Covariance matrix Σ:")
print(Sigma)

# Eigendecomposition and 95% confidence ellipsoid
result = probability_ellipsoid_volume(Sigma, confidence=0.95)

print(f"Eigenvalues: {result['eigenvalues']}")
print("Principal axes (eigenvectors):")
print(result['eigenvectors'])
print("95% confidence ellipsoid:")
print(f"  χ² critical value: {result['chi2_critical']:.4f}")
print(f"  Semi-axis lengths: {result['semi_axes']}")
print(f"  Volume: {result['volume']:.4f}")

# Mahalanobis distances
print("Mahalanobis distances from μ:")
test_points = [
    np.array([2.0, 3.0]),  # at the mean
    np.array([4.0, 4.0]),  # offset from the mean
    np.array([5.0, 5.0]),  # farther offset
]
for pt in test_points:
    d_m = mahalanobis_distance(pt, mu, Sigma)
    prob_outside = 1 - chi2.cdf(d_m ** 2, df=2)
    print(f"  x = {pt}: d_M = {d_m:.4f}, P(further) = {prob_outside:.4f}")
```

While we can visualize 2D and 3D Gaussians as ellipses and ellipsoids, the geometry extends to any dimension. The key insight is that the covariance matrix defines a metric (the Mahalanobis distance) that transforms the ellipsoid into a sphere when we 'whiten' the data: X → Σ^{-1/2}(X - μ).
A remarkable property of the MVN is its closure under conditioning and marginalization—both operations yield Gaussian distributions with analytically tractable parameters.
Partition the random vector and parameters:
$$\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad \mathbf{\Sigma} = \begin{pmatrix} \mathbf{\Sigma}_{11} & \mathbf{\Sigma}_{12} \\ \mathbf{\Sigma}_{21} & \mathbf{\Sigma}_{22} \end{pmatrix}$$
where X₁ has dimension d₁ and X₂ has dimension d₂.
The marginal distribution of any subset of variables is Gaussian:
$$\mathbf{X}_1 \sim N(\boldsymbol{\mu}_1, \mathbf{\Sigma}_{11}), \quad \mathbf{X}_2 \sim N(\boldsymbol{\mu}_2, \mathbf{\Sigma}_{22})$$
To marginalize: simply extract the corresponding subvector/submatrix!
This is remarkably simple—no integration required (unlike most distributions).
The conditional distribution of X₁ given X₂ = x₂ is also Gaussian:
$$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim N(\boldsymbol{\mu}_{1|2}, \mathbf{\Sigma}_{1|2})$$
Conditional Mean: $$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \mathbf{\Sigma}_{12} \mathbf{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2)$$
Conditional Covariance: $$\mathbf{\Sigma}_{1|2} = \mathbf{\Sigma}_{11} - \mathbf{\Sigma}_{12} \mathbf{\Sigma}_{22}^{-1} \mathbf{\Sigma}_{21}$$
Key insights:
The conditional mean is a linear function of the observed value x₂
The conditional covariance Σ₁|₂ does not depend on x₂, and it is never "larger" than Σ₁₁: observing X₂ can only reduce (or leave unchanged) our uncertainty about X₁
For jointly Gaussian variables, uncorrelated implies independent:
$$\mathbf{\Sigma}_{12} = \mathbf{0} \iff \mathbf{X}_1 \perp\!\!\!\perp \mathbf{X}_2$$
This equivalence is special to the jointly Gaussian case—for general distributions, zero correlation does not imply independence.
```python
import numpy as np

def conditional_gaussian(mu: np.ndarray, Sigma: np.ndarray,
                         idx_1: list, idx_2: list,
                         x_2: np.ndarray) -> tuple:
    """
    Compute the conditional distribution p(X_1 | X_2 = x_2)
    for a multivariate Gaussian.

    Args:
        mu: Mean vector of the joint distribution
        Sigma: Covariance matrix of the joint distribution
        idx_1: Indices of X_1 (the variables of interest)
        idx_2: Indices of X_2 (the observed variables)
        x_2: Observed value of X_2

    Returns:
        (mu_cond, Sigma_cond): Parameters of the conditional distribution
    """
    # Extract subvectors/submatrices
    mu_1 = mu[idx_1]
    mu_2 = mu[idx_2]
    Sigma_11 = Sigma[np.ix_(idx_1, idx_1)]
    Sigma_12 = Sigma[np.ix_(idx_1, idx_2)]
    Sigma_21 = Sigma[np.ix_(idx_2, idx_1)]
    Sigma_22 = Sigma[np.ix_(idx_2, idx_2)]

    Sigma_22_inv = np.linalg.inv(Sigma_22)

    # Conditional mean: μ_{1|2} = μ_1 + Σ_12 Σ_22⁻¹ (x_2 - μ_2)
    mu_cond = mu_1 + Sigma_12 @ Sigma_22_inv @ (x_2 - mu_2)

    # Conditional covariance: Σ_{1|2} = Σ_11 - Σ_12 Σ_22⁻¹ Σ_21
    Sigma_cond = Sigma_11 - Sigma_12 @ Sigma_22_inv @ Sigma_21

    return mu_cond, Sigma_cond

def marginal_gaussian(mu: np.ndarray, Sigma: np.ndarray, indices: list) -> tuple:
    """
    Compute the marginal distribution p(X_I) for the subset I.

    For Gaussians, this is simply extracting the subvector/submatrix!
    """
    mu_marginal = mu[indices]
    Sigma_marginal = Sigma[np.ix_(indices, indices)]
    return mu_marginal, Sigma_marginal

# Example: 3D Gaussian with conditioning
print("Gaussian Conditioning and Marginalization")
print("=" * 60)

# Define a 3D Gaussian
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 2.0, 0.7],
    [0.3, 0.7, 1.5]
])

print("Joint distribution:")
print(f"μ = {mu}")
print(f"Σ =\n{Sigma}")

# Marginal of (X₁, X₂) (indices 0, 1)
mu_marg, Sigma_marg = marginal_gaussian(mu, Sigma, [0, 1])
print("Marginal p(X₁, X₂):")
print(f"μ = {mu_marg}")
print(f"Σ =\n{Sigma_marg}")

# Conditional p(X₁ | X₂ = 2.5, X₃ = 3.5)
x_given = np.array([2.5, 3.5])  # values of X₂ and X₃
mu_cond, Sigma_cond = conditional_gaussian(mu, Sigma, [0], [1, 2], x_given)

print("Conditional p(X₁ | X₂=2.5, X₃=3.5):")
print(f"μ_cond = {mu_cond[0]:.4f}")
print(f"σ²_cond = {Sigma_cond[0, 0]:.4f}")
print(f"σ_cond = {np.sqrt(Sigma_cond[0, 0]):.4f}")

# Compare to the prior
print("Comparison to the prior p(X₁):")
print(f"Prior:     μ = {mu[0]:.4f}, σ = {np.sqrt(Sigma[0, 0]):.4f}")
print(f"Posterior: μ = {mu_cond[0]:.4f}, σ = {np.sqrt(Sigma_cond[0, 0]):.4f}")
print("(Variance decreases as we gain information from conditioning)")
```

Given n i.i.d. observations x₁, x₂, ..., xₙ from an MVN, we estimate the mean vector and covariance matrix.
The log-likelihood for n observations is:
$$\ell(\boldsymbol{\mu}, \mathbf{\Sigma}) = -\frac{nd}{2}\ln(2\pi) - \frac{n}{2}\ln|\mathbf{\Sigma}| - \frac{1}{2}\sum_{i=1}^n (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu})$$
MLE for Mean:
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i = \bar{\mathbf{x}}$$
The MLE for the mean is the sample mean vector.
MLE for Covariance:
$$\hat{\mathbf{\Sigma}}_{MLE} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T = \frac{1}{n}\mathbf{X}^T\mathbf{X}$$
where X is the centered data matrix.
The MLE for covariance is biased. The unbiased estimator divides by (n-1):
$$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
When d (dimensionality) is comparable to or larger than n (sample size):
1. Rank deficiency: The sample covariance matrix has rank at most min(n-1, d). If d ≥ n, it's singular and cannot be inverted.
2. High variance: Even when invertible, the estimated eigenvalues are highly variable—largest overestimated, smallest underestimated.
3. Solutions: regularize via shrinkage toward a simple target (e.g., Ledoit-Wolf), impose structure (diagonal or low-rank-plus-diagonal covariance), or enforce sparsity in the precision matrix (graphical lasso).
```python
import numpy as np

def estimate_mvn_parameters(data: np.ndarray, regularization: float = 0.0) -> dict:
    """
    Estimate MVN parameters from data.

    Args:
        data: n x d data matrix (n samples, d dimensions)
        regularization: Shrinkage toward a scaled identity (Ledoit-Wolf style)

    Returns:
        Dictionary with the mean, covariance, and related quantities
    """
    n, d = data.shape

    # Mean estimation
    mu_hat = np.mean(data, axis=0)

    # Centered data
    X_centered = data - mu_hat

    # MLE covariance (biased, divides by n)
    Sigma_mle = (X_centered.T @ X_centered) / n

    # Unbiased covariance (divides by n - 1)
    Sigma_unbiased = (X_centered.T @ X_centered) / (n - 1)

    # Regularized covariance (shrink toward a spherical target)
    if regularization > 0:
        target = np.trace(Sigma_unbiased) / d * np.eye(d)
        Sigma_reg = (1 - regularization) * Sigma_unbiased + regularization * target
    else:
        Sigma_reg = Sigma_unbiased

    # Correlation matrix
    stds = np.sqrt(np.diag(Sigma_unbiased))
    corr = Sigma_unbiased / np.outer(stds, stds)

    # Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma_reg)

    # Condition number (ratio of largest to smallest eigenvalue)
    condition_number = eigenvalues.max() / max(eigenvalues.min(), 1e-10)

    return {
        'n': n,
        'd': d,
        'mu': mu_hat,
        'Sigma_mle': Sigma_mle,
        'Sigma_unbiased': Sigma_unbiased,
        'Sigma_regularized': Sigma_reg,
        'correlation': corr,
        'eigenvalues': eigenvalues[::-1],        # descending order
        'eigenvectors': eigenvectors[:, ::-1],
        'condition_number': condition_number,
    }

def ledoit_wolf_shrinkage(data: np.ndarray) -> tuple:
    """
    Ledoit-Wolf optimal shrinkage estimator.

    Shrinks toward a scaled identity matrix with a data-driven optimal
    shrinkage intensity. The full formula involves fourth moments, so
    we delegate to scikit-learn's implementation.
    """
    from sklearn.covariance import LedoitWolf
    lw = LedoitWolf().fit(data)
    return lw.covariance_, lw.shrinkage_

# Example
np.random.seed(42)

# Generate data from a known MVN
true_mu = np.array([1.0, 2.0, 3.0])
true_Sigma = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 2.0, 0.7],
    [0.3, 0.7, 1.5]
])

n_samples = 100
data = np.random.multivariate_normal(true_mu, true_Sigma, n_samples)

# Estimate parameters
results = estimate_mvn_parameters(data)

print("MVN Parameter Estimation")
print("=" * 60)
print(f"True μ:      {true_mu}")
print(f"Estimated μ: {results['mu'].round(4)}")
print(f"True Σ:\n{true_Sigma}")
print(f"Estimated Σ (unbiased):\n{results['Sigma_unbiased'].round(4)}")
print(f"Correlation matrix:\n{results['correlation'].round(4)}")
print(f"Eigenvalues: {results['eigenvalues'].round(4)}")
print(f"Condition number: {results['condition_number']:.2f}")
```

The MVN family is closed under linear transformations—a crucial property for many applications.
If X ~ N(μ, Σ) and Y = AX + b, where A is a matrix and b is a vector, then:
$$\mathbf{Y} \sim N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T)$$
This generalizes the univariate result (aX + b ~ N(aμ + b, a²σ²)).
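This closure property is easy to verify with a Monte Carlo sketch; the matrices below are illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 1.0],
              [0.5, -1.0]])
b = np.array([0.0, 3.0])

# Analytic parameters of Y = A X + b
mu_Y = A @ mu + b
Sigma_Y = A @ Sigma @ A.T

# Monte Carlo check: sample X, transform, and compare moments
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b

print(mu_Y, Y.mean(axis=0))   # close
print(Sigma_Y)
print(np.cov(Y.T))            # close
```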
Whitening transforms data to have identity covariance:
$$\mathbf{Z} = \mathbf{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu}) \sim N(\mathbf{0}, \mathbf{I})$$
where Σ^{-1/2} is the inverse matrix square root (computed via eigendecomposition).
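A minimal sketch of that computation, with illustrative parameter values: build Σ^{-1/2} from the eigendecomposition and check that the whitened samples have near-identity covariance.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([2.0, -1.0])
Sigma = np.array([[3.0, 1.2],
                  [1.2, 2.0]])

# Inverse matrix square root via eigendecomposition: Σ^{-1/2} = U Λ^{-1/2} Uᵀ
evals, U = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = U @ np.diag(evals ** -0.5) @ U.T

# Whiten samples: Z = Σ^{-1/2}(X - μ)
X = rng.multivariate_normal(mu, Sigma, size=100_000)
Z = (X - mu) @ Sigma_inv_sqrt.T

print(np.cov(Z.T))  # approximately the identity matrix
```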
Applications: decorrelating features before modeling, preprocessing for methods like ICA, and sampling from N(μ, Σ) by inverting the transform (x = μ + Σ^{1/2}z with z ~ N(0, I)).
If X ~ N(μₓ, Σₓ) and Y ~ N(μᵧ, Σᵧ) are independent, then:
$$\mathbf{X} + \mathbf{Y} \sim N(\boldsymbol{\mu}_X + \boldsymbol{\mu}_Y, \mathbf{\Sigma}_X + \mathbf{\Sigma}_Y)$$
Means add, covariances add (generalizing the univariate case).
The product of two Gaussian PDFs (up to normalization) is Gaussian—essential for Bayesian updating:
If we have prior x ~ N(μ₀, Σ₀) and likelihood y | x ~ N(x, Σₗ), the posterior is:
$$\mathbf{x} \mid \mathbf{y} \sim N(\boldsymbol{\mu}_{post}, \mathbf{\Sigma}_{post})$$
where:
$$\mathbf{\Sigma}_{post}^{-1} = \mathbf{\Sigma}_0^{-1} + \mathbf{\Sigma}_\ell^{-1}$$ $$\boldsymbol{\mu}_{post} = \mathbf{\Sigma}_{post}(\mathbf{\Sigma}_0^{-1}\boldsymbol{\mu}_0 + \mathbf{\Sigma}_\ell^{-1}\mathbf{y})$$
This is precision-weighted averaging, exactly as in the univariate case.
Many computations are simpler in the 'information form' (also called canonical form) using precision matrix Λ = Σ⁻¹ and information vector η = Λμ. Products of Gaussians become sums of precision matrices and information vectors, making Bayesian updates additive.
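The precision-weighted update above is a few lines of code; the prior and noise values below are illustrative:

```python
import numpy as np

def gaussian_posterior(mu0, Sigma0, y, Sigma_l):
    """Posterior for prior x ~ N(μ₀, Σ₀) and likelihood y | x ~ N(x, Σ_ℓ)."""
    Lam0 = np.linalg.inv(Sigma0)    # prior precision
    Lam_l = np.linalg.inv(Sigma_l)  # likelihood precision
    Sigma_post = np.linalg.inv(Lam0 + Lam_l)         # precisions add
    mu_post = Sigma_post @ (Lam0 @ mu0 + Lam_l @ y)  # information vectors add
    return mu_post, Sigma_post

mu0 = np.array([0.0, 0.0])
Sigma0 = np.eye(2) * 4.0   # broad prior
y = np.array([1.0, 2.0])
Sigma_l = np.eye(2) * 1.0  # observation noise

mu_post, Sigma_post = gaussian_posterior(mu0, Sigma0, y, Sigma_l)
print(mu_post)     # pulled most of the way toward y, since the prior is weak
print(Sigma_post)  # tighter than both prior and likelihood
```

Because the prior variance (4) is four times the observation variance (1), the posterior mean lands 4/5 of the way from μ₀ to y.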
The Multivariate Gaussian is ubiquitous in machine learning. Let's explore its key applications.
PCA can be viewed as fitting an MVN and finding its principal axes:
Dimensionality reduction projects onto the top k eigenvectors.
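This view of PCA can be sketched directly: fit the mean and covariance, eigendecompose, and project (the data-generating parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2D data
X = rng.multivariate_normal([0, 0], [[3.0, 1.4], [1.4, 1.0]], size=5000)

# Fit the MVN: sample mean and covariance
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X.T)

# Principal axes = eigenvectors of the covariance, sorted by eigenvalue
evals, evecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Project onto the top k = 1 eigenvector
k = 1
X_proj = (X - mu_hat) @ evecs[:, :k]

explained = evals[0] / evals.sum()
print("Explained variance ratio:", explained)
```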
GMMs model data as a mixture of K MVN components:
$$p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, N(\mathbf{x} | \boldsymbol{\mu}_k, \mathbf{\Sigma}_k)$$
where πₖ are the mixing weights. GMMs are trained via the EM algorithm and are fundamental for clustering, density estimation, and anomaly detection.
```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

class GaussianDiscriminantAnalysis:
    """
    Gaussian Discriminant Analysis (LDA/QDA).

    Models each class with a Multivariate Gaussian:
        p(x | y=k) = N(x | μ_k, Σ_k)

    LDA: shared covariance across classes
    QDA: separate covariance per class
    """

    def __init__(self, shared_covariance: bool = True):
        """
        Args:
            shared_covariance: True for LDA, False for QDA
        """
        self.shared_cov = shared_covariance
        self.classes = None
        self.priors = {}
        self.means = {}
        self.covariances = {}

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a Gaussian to each class."""
        self.classes = np.unique(y)
        n = len(y)
        d = X.shape[1]

        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / n
            self.means[c] = np.mean(X_c, axis=0)
            if not self.shared_cov:
                self.covariances[c] = np.cov(X_c.T)

        if self.shared_cov:
            # Pooled within-class covariance
            Sw = np.zeros((d, d))
            for c in self.classes:
                X_c = X[y == c]
                X_centered = X_c - self.means[c]
                Sw += X_centered.T @ X_centered
            self.shared_Sigma = Sw / (n - len(self.classes))

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Compute posterior probabilities P(y=k | x)."""
        log_posteriors = np.zeros((len(X), len(self.classes)))

        for i, c in enumerate(self.classes):
            Sigma = self.shared_Sigma if self.shared_cov else self.covariances[c]
            mvn = multivariate_normal(self.means[c], Sigma)
            log_posteriors[:, i] = np.log(self.priors[c]) + mvn.logpdf(X)

        # Normalize in log space for numerical stability
        log_norm = logsumexp(log_posteriors, axis=1, keepdims=True)
        return np.exp(log_posteriors - log_norm)

    def predict(self, X: np.ndarray) -> np.ndarray:
        proba = self.predict_proba(X)
        return self.classes[np.argmax(proba, axis=1)]

# Example: classification with LDA
np.random.seed(42)

# Generate 3-class data
n_per_class = 100
X_0 = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], n_per_class)
X_1 = np.random.multivariate_normal([3, 3], [[1, -0.3], [-0.3, 1]], n_per_class)
X_2 = np.random.multivariate_normal([0, 4], [[0.8, 0], [0, 0.8]], n_per_class)

X = np.vstack([X_0, X_1, X_2])
y = np.array([0] * n_per_class + [1] * n_per_class + [2] * n_per_class)

# Fit LDA
lda = GaussianDiscriminantAnalysis(shared_covariance=True)
lda.fit(X, y)

# Evaluate
y_pred = lda.predict(X)
accuracy = np.mean(y_pred == y)

print("Gaussian Discriminant Analysis (LDA)")
print("=" * 50)
print(f"Training accuracy: {accuracy:.2%}")
print("Class means:")
for c in lda.classes:
    print(f"  Class {c}: μ = {lda.means[c].round(3)}")
print(f"Shared covariance:\n{lda.shared_Sigma.round(3)}")

# Predict probabilities for a test point
test_point = np.array([[1.5, 2.0]])
probs = lda.predict_proba(test_point)[0]
print("Test point [1.5, 2.0] posteriors:")
for c, p in zip(lda.classes, probs):
    print(f"  P(y={c} | x) = {p:.4f}")
```

A Gaussian Process is a distribution over functions where any finite collection of function values is jointly MVN:
$$[f(x_1), f(x_2), \ldots, f(x_n)]^T \sim N(\boldsymbol{\mu}, \mathbf{K})$$
where K is the kernel (covariance) matrix with Kᵢⱼ = k(xᵢ, xⱼ).
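Since the function values at a finite set of inputs are jointly MVN, sampling from a GP prior is just sampling from N(0, K). A minimal sketch, assuming an RBF (squared-exponential) kernel with illustrative hyperparameters:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel: k(x, x') = σ² exp(-(x - x')² / (2ℓ²))."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale ** 2)

rng = np.random.default_rng(3)

# Finite set of inputs: the function values there are jointly MVN
x = np.linspace(0, 5, 50)
K = rbf_kernel(x, x)

# Sample three functions from the GP prior N(0, K)
jitter = 1e-8 * np.eye(len(x))  # numerical stabilizer for the Cholesky factor
L = np.linalg.cholesky(K + jitter)
samples = L @ rng.standard_normal((len(x), 3))

print(samples.shape)  # (50, 3): three sampled functions at 50 inputs
```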
GPs provide:
Flexible, nonparametric function approximation
Closed-form posterior predictions with calibrated uncertainty (every prediction is a Gaussian with a mean and a variance)
VAEs model the latent space as MVN: the encoder maps each input x to the parameters of a Gaussian approximate posterior q(z | x) = N(μ(x), diag(σ²(x))), while the prior over latents is the standard normal N(0, I).
The reparameterization trick enables gradient-based training: $$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim N(0, I)$$
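The trick is a direct application of the linear-transformation property: z = μ + σ ⊙ ε is an affine function of ε ~ N(0, I), so z ~ N(μ, diag(σ²)), and the randomness is isolated from the parameters. A NumPy sketch with illustrative encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical encoder outputs for one input
mu = np.array([0.5, -0.3, 0.2])
sigma = np.array([1.2, 0.8, 1.1])

# Reparameterization: z = μ + σ ⊙ ε, ε ~ N(0, I)
# All randomness lives in ε, so gradients flow through μ and σ.
eps = rng.standard_normal((100_000, 3))
z = mu + sigma * eps

print(z.mean(axis=0))  # ≈ μ
print(z.std(axis=0))   # ≈ σ
```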
Kalman filters propagate MVN beliefs through linear dynamical systems:
Prediction: p(xₜ | y₁:ₜ₋₁) = N(μₜ|ₜ₋₁, Σₜ|ₜ₋₁)
Update: p(xₜ | y₁:ₜ) = N(μₜ|ₜ, Σₜ|ₜ)
All operations remain Gaussian, enabling efficient recursive estimation.
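A minimal scalar Kalman filter illustrating the prediction/update recursion (the dynamics and noise values are hypothetical; here the process noise is zero, so the filter estimates a constant state):

```python
import numpy as np

# Scalar model: x_t = a x_{t-1} + w_t,  y_t = x_t + v_t
a, q, r = 1.0, 0.0, 1.0  # dynamics, process noise variance, observation noise variance

def kalman_step(mu, P, y):
    # Prediction: propagate the Gaussian belief through the linear dynamics
    mu_pred = a * mu
    P_pred = a * P * a + q
    # Update: condition on the new observation (the belief stays Gaussian)
    K = P_pred / (P_pred + r)         # Kalman gain
    mu_new = mu_pred + K * (y - mu_pred)
    P_new = (1 - K) * P_pred
    return mu_new, P_new

rng = np.random.default_rng(5)
x_true = 2.0
mu, P = 0.0, 10.0  # broad initial belief
for _ in range(50):
    y = x_true + rng.normal(0, np.sqrt(r))
    mu, P = kalman_step(mu, P, y)

print(mu, P)  # mean near 2, variance shrinks as observations accumulate
```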
The Kullback-Leibler (KL) divergence measures how one distribution differs from another. For MVNs, we have closed-form expressions.
For p = N(μ₁, Σ₁) and q = N(μ₂, Σ₂):
$$D_{KL}(p \| q) = \frac{1}{2}\left[\text{tr}(\mathbf{\Sigma}_2^{-1}\mathbf{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T\mathbf{\Sigma}_2^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) - d + \ln\frac{|\mathbf{\Sigma}_2|}{|\mathbf{\Sigma}_1|}\right]$$
Special case (q = N(0, I)):
$$D_{KL}(N(\boldsymbol{\mu}, \mathbf{\Sigma}) \| N(\mathbf{0}, \mathbf{I})) = \frac{1}{2}\left[\text{tr}(\mathbf{\Sigma}) + \boldsymbol{\mu}^T\boldsymbol{\mu} - d - \ln|\mathbf{\Sigma}|\right]$$
This is the regularization term in VAE training!
```python
import numpy as np

def kl_divergence_gaussians(mu1: np.ndarray, Sigma1: np.ndarray,
                            mu2: np.ndarray, Sigma2: np.ndarray) -> float:
    """
    Compute the KL divergence D_KL(p || q) where:
        p = N(μ₁, Σ₁)
        q = N(μ₂, Σ₂)

    D_KL = 0.5 * [tr(Σ₂⁻¹Σ₁) + (μ₂-μ₁)ᵀΣ₂⁻¹(μ₂-μ₁) - d + ln(|Σ₂|/|Σ₁|)]
    """
    d = len(mu1)

    Sigma2_inv = np.linalg.inv(Sigma2)
    mu_diff = mu2 - mu1

    # Trace term
    trace_term = np.trace(Sigma2_inv @ Sigma1)

    # Mahalanobis term
    mahal_term = mu_diff @ Sigma2_inv @ mu_diff

    # Log determinant term
    log_det_term = np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))

    return 0.5 * (trace_term + mahal_term - d + log_det_term)

def kl_to_standard_normal(mu: np.ndarray, Sigma: np.ndarray) -> float:
    """
    KL divergence from N(μ, Σ) to N(0, I).

    This is the regularization term in VAE training:
        D_KL = 0.5 * [tr(Σ) + μᵀμ - d - ln|Σ|]
    """
    d = len(mu)
    trace_term = np.trace(Sigma)
    mu_sq_term = np.sum(mu ** 2)
    log_det_term = np.log(np.linalg.det(Sigma))
    return 0.5 * (trace_term + mu_sq_term - d - log_det_term)

# Example
print("KL Divergence Between Gaussians")
print("=" * 50)

# Two 2D Gaussians
mu1 = np.array([0.0, 0.0])
Sigma1 = np.array([[1.0, 0.3], [0.3, 1.0]])

mu2 = np.array([1.0, 1.0])
Sigma2 = np.array([[2.0, 0.5], [0.5, 2.0]])

kl_12 = kl_divergence_gaussians(mu1, Sigma1, mu2, Sigma2)
kl_21 = kl_divergence_gaussians(mu2, Sigma2, mu1, Sigma1)

print(f"p = N({mu1}, Σ₁)")
print(f"q = N({mu2}, Σ₂)")
print(f"D_KL(p || q) = {kl_12:.4f}")
print(f"D_KL(q || p) = {kl_21:.4f}")
print("(Note: KL is asymmetric!)")

# VAE regularization term
mu_vae = np.array([0.5, -0.3, 0.2])
sigma_vae = np.array([1.2, 0.8, 1.1])  # diagonal standard deviations
Sigma_vae = np.diag(sigma_vae ** 2)

kl_vae = kl_to_standard_normal(mu_vae, Sigma_vae)
print("VAE example:")
print(f"Encoder output: μ = {mu_vae}, σ = {sigma_vae}")
print(f"KL regularization term: {kl_vae:.4f}")
```

The Multivariate Gaussian is the cornerstone of high-dimensional probabilistic modeling.
Module Complete!
You have now completed the module on Common Probability Distributions, covering the key distributions from the Bernoulli through the Multivariate Gaussian.
These distributions form the probabilistic foundation of machine learning. Understanding their properties, estimation methods, and interrelationships equips you to build, analyze, and improve probabilistic models across diverse applications.
Congratulations! You now have comprehensive knowledge of the probability distributions that underpin machine learning. From the simplest Bernoulli to the sophisticated Multivariate Gaussian, these distributions provide the language and tools for reasoning about uncertainty, building models, and making predictions. This foundation is essential for understanding more advanced topics in probabilistic machine learning.