When we move from single variables to vectors—from one measurement to many—we enter the realm of multivariate probability. And just as the univariate Gaussian dominates single-variable modeling, the Multivariate Gaussian (multivariate normal, MVN) dominates high-dimensional probabilistic modeling.
The Multivariate Gaussian is not merely the univariate Gaussian applied to each dimension independently. It models the joint distribution of multiple random variables, capturing both individual variability and the correlations between variables. This ability to model dependencies is what makes the MVN so powerful.
In machine learning, the MVN appears everywhere: Gaussian mixture models, Gaussian processes, variational autoencoders, Kalman filters, linear and quadratic discriminant analysis, and PCA all build directly on it.
Mastering the Multivariate Gaussian is essential for understanding and developing modern probabilistic machine learning methods.
By the end of this page, you will understand the Multivariate Gaussian: its mathematical definition, the role of the mean vector and covariance matrix, geometric interpretation, conditional and marginal distributions, parameter estimation, and its pervasive applications in machine learning.
A random vector X = [X₁, X₂, ..., X_d]ᵀ follows a Multivariate Gaussian (Normal) distribution with mean vector μ and covariance matrix Σ, written X ~ N(μ, Σ), if its probability density function (PDF) is:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where:
x ∈ ℝᵈ is the point at which the density is evaluated
μ ∈ ℝᵈ is the mean vector
Σ is the d × d covariance matrix, with determinant |Σ| and inverse Σ⁻¹
The MVN PDF has two parts: (1) the normalizing constant 1/((2π)^{d/2}|Σ|^{1/2}), which ensures the density integrates to 1, and (2) the squared Mahalanobis distance (x-μ)ᵀΣ⁻¹(x-μ) in the exponent, which measures 'standardized' distance from the mean, accounting for correlations and different scales.
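As a sanity check on the formula, the two parts can be computed directly and compared against `scipy.stats.multivariate_normal` (the numbers below are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Evaluate the MVN density directly from the formula above."""
    d = len(mu)
    diff = x - mu
    # Squared Mahalanobis distance: (x - μ)ᵀ Σ⁻¹ (x - μ)
    mahal_sq = diff @ np.linalg.inv(Sigma) @ diff
    # Normalizing constant: 1 / ((2π)^{d/2} |Σ|^{1/2})
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * mahal_sq)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, 0.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))  # agrees with the manual computation
```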
The covariance matrix Σ is symmetric positive semi-definite (SPD) and encodes:
Diagonal entries: Σᵢᵢ = Var(Xᵢ) — the variance of each variable
Off-diagonal entries: Σᵢⱼ = Cov(Xᵢ, Xⱼ) = E[(Xᵢ - μᵢ)(Xⱼ - μⱼ)] — covariance between pairs
The correlation matrix is:
$$\rho_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii} \Sigma_{jj}}}$$
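Converting a covariance matrix to a correlation matrix is a one-liner; the matrix below is an illustrative example:

```python
import numpy as np

Sigma = np.array([[4.0, 1.5],
                  [1.5, 2.0]])

stds = np.sqrt(np.diag(Sigma))        # standard deviations σ_i
corr = Sigma / np.outer(stds, stds)   # ρ_ij = Σ_ij / (σ_i σ_j)

print(corr)  # the diagonal is exactly 1
```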
The inverse covariance matrix Λ = Σ⁻¹ is called the precision matrix. It has important properties:
Λᵢⱼ = 0 if and only if Xᵢ and Xⱼ are conditionally independent given all other variables (the basis of Gaussian graphical models)
The diagonal entries satisfy Λᵢᵢ = 1/Var(Xᵢ | all other variables), the inverse of the conditional variance
| Property | Formula | Interpretation |
|---|---|---|
| Mean | E[X] = μ | Center of the distribution |
| Covariance | Cov(X) = Σ | Spread and correlations |
| Marginals | Xᵢ ~ N(μᵢ, Σᵢᵢ) | Each component is univariate Gaussian |
| Mode | μ | Peak of the density |
| Symmetry | About μ | Ellipsoidal symmetry |
| Entropy | ½ log((2πe)ᵈ|Σ|) | Information content |
| KL Divergence | Closed form | See the formula below |
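The entropy row of the table can be verified numerically; the covariance below is an illustrative example:

```python
import numpy as np
from scipy.stats import multivariate_normal

Sigma = np.array([[4.0, 1.5],
                  [1.5, 2.0]])
d = Sigma.shape[0]

# H = ½ ln((2πe)^d |Σ|)
entropy = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

print(entropy)
print(multivariate_normal(np.zeros(d), Sigma).entropy())  # same value
```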
The geometry of the Multivariate Gaussian reveals deep insights about its structure.
Points of equal probability density satisfy:
$$(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = c^2$$
This is the equation of an ellipsoid centered at μ. The quantity on the left is the squared Mahalanobis distance from x to μ.
The covariance matrix can be decomposed as:
$$\mathbf{\Sigma} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$$
where:
U is an orthogonal matrix whose columns are the eigenvectors of Σ
Λ is a diagonal matrix of the (non-negative) eigenvalues λ₁, ..., λ_d
Geometric interpretation:
The eigenvectors give the directions of the ellipsoid's principal axes
The eigenvalues give the variances along those axes, so the semi-axis lengths are proportional to √λᵢ
The Mahalanobis distance generalizes Euclidean distance to account for correlations:
$$d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Properties:
Reduces to Euclidean distance when Σ = I
Accounts for different scales and for correlations between variables
Measures distance in units of standard deviations along each principal axis
For X ~ N(μ, Σ), the squared Mahalanobis distance follows:
$$(\mathbf{X} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \sim \chi^2(d)$$
This allows computing the probability that X falls within a given ellipsoid. For example, the 95% confidence ellipsoid satisfies dₘ² ≤ χ²₀.₉₅(d).
```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gamma

def mahalanobis_distance(x: np.ndarray, mu: np.ndarray, Sigma: np.ndarray) -> float:
    """
    Compute the Mahalanobis distance from x to mu under covariance Sigma.

    d_M = sqrt((x - μ)ᵀ Σ⁻¹ (x - μ))
    """
    diff = x - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return np.sqrt(diff @ Sigma_inv @ diff)

def probability_ellipsoid_volume(Sigma: np.ndarray, confidence: float = 0.95) -> dict:
    """
    Compute the confidence ellipsoid parameters.

    For X ~ N(μ, Σ), the squared Mahalanobis distance follows χ²(d).
    """
    d = Sigma.shape[0]

    # Chi-squared critical value
    chi2_critical = chi2.ppf(confidence, df=d)

    # Eigendecomposition for the ellipsoid axes
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

    # Sort by decreasing eigenvalue
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Ellipsoid semi-axis lengths (at the given confidence level)
    semi_axes = np.sqrt(chi2_critical * eigenvalues)

    # Volume of the ellipsoid: V = (π^(d/2) / Γ(d/2 + 1)) · Π(semi_axes)
    volume = (np.pi ** (d / 2) / gamma(d / 2 + 1)) * np.prod(semi_axes)

    return {
        'dimension': d,
        'confidence': confidence,
        'chi2_critical': chi2_critical,
        'eigenvalues': eigenvalues,
        'eigenvectors': eigenvectors,
        'semi_axes': semi_axes,
        'volume': volume,
    }

# Example: 2D Gaussian
print("Multivariate Gaussian Geometry")
print("=" * 60)

mu = np.array([2.0, 3.0])
Sigma = np.array([
    [4.0, 1.5],
    [1.5, 2.0]
])

print(f"Mean: μ = {mu}")
print("Covariance matrix Σ:")
print(Sigma)

# Eigendecomposition and 95% confidence ellipsoid
result = probability_ellipsoid_volume(Sigma, confidence=0.95)

print(f"Eigenvalues: {result['eigenvalues']}")
print("Principal axes (eigenvectors):")
print(result['eigenvectors'])
print("95% confidence ellipsoid:")
print(f"  χ² critical value: {result['chi2_critical']:.4f}")
print(f"  Semi-axis lengths: {result['semi_axes']}")
print(f"  Volume: {result['volume']:.4f}")

# Mahalanobis distances
print("Mahalanobis distances from μ:")
test_points = [
    np.array([2.0, 3.0]),  # at the mean
    np.array([4.0, 4.0]),  # offset from the mean
    np.array([5.0, 5.0]),  # farther offset
]
for pt in test_points:
    d_m = mahalanobis_distance(pt, mu, Sigma)
    prob_outside = 1 - chi2.cdf(d_m ** 2, df=2)
    print(f"  x = {pt}: d_M = {d_m:.4f}, P(further) = {prob_outside:.4f}")
```

While we can visualize 2D and 3D Gaussians as ellipses and ellipsoids, the geometry extends to any dimension. The key insight is that the covariance matrix defines a metric (the Mahalanobis distance) that transforms the ellipsoid into a sphere when we 'whiten' the data: X → Σ^{-1/2}(X - μ).
A remarkable property of the MVN is its closure under conditioning and marginalization—both operations yield Gaussian distributions with analytically tractable parameters.
Partition the random vector and parameters:
$$\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad \mathbf{\Sigma} = \begin{pmatrix} \mathbf{\Sigma}_{11} & \mathbf{\Sigma}_{12} \\ \mathbf{\Sigma}_{21} & \mathbf{\Sigma}_{22} \end{pmatrix}$$
where X₁ has dimension d₁ and X₂ has dimension d₂.
The marginal distribution of any subset of variables is Gaussian:
$$\mathbf{X}_1 \sim N(\boldsymbol{\mu}_1, \mathbf{\Sigma}_{11}), \quad \mathbf{X}_2 \sim N(\boldsymbol{\mu}_2, \mathbf{\Sigma}_{22})$$
To marginalize: simply extract the corresponding subvector/submatrix!
This is remarkably simple—no integration required (unlike most distributions).
The conditional distribution of X₁ given X₂ = x₂ is also Gaussian:
$$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim N(\boldsymbol{\mu}_{1|2}, \mathbf{\Sigma}_{1|2})$$
Conditional Mean: $$\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \mathbf{\Sigma}_{12} \mathbf{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2)$$
Conditional Covariance: $$\mathbf{\Sigma}_{1|2} = \mathbf{\Sigma}_{11} - \mathbf{\Sigma}_{12} \mathbf{\Sigma}_{22}^{-1} \mathbf{\Sigma}_{21}$$
Key insights:
The conditional mean is a linear function of the observed value x₂
The conditional covariance Σ₁|₂ does not depend on x₂, and it is never "larger" than Σ₁₁: observing X₂ can only reduce (or leave unchanged) our uncertainty about X₁
For jointly Gaussian variables, uncorrelated implies independent:
$$\mathbf{\Sigma}_{12} = \mathbf{0} \iff \mathbf{X}_1 \perp\!\!\!\perp \mathbf{X}_2$$
This equivalence is special to the jointly Gaussian case—for general distributions, zero correlation does not imply independence.
```python
import numpy as np

def conditional_gaussian(mu: np.ndarray, Sigma: np.ndarray,
                         idx_1: list, idx_2: list,
                         x_2: np.ndarray) -> tuple:
    """
    Compute the conditional distribution p(X_1 | X_2 = x_2)
    for a multivariate Gaussian.

    Args:
        mu: Mean vector of the joint distribution
        Sigma: Covariance matrix of the joint distribution
        idx_1: Indices of X_1 (the variables of interest)
        idx_2: Indices of X_2 (the observed variables)
        x_2: Observed value of X_2

    Returns:
        (mu_cond, Sigma_cond): Parameters of the conditional distribution
    """
    # Extract subvectors/submatrices
    mu_1 = mu[idx_1]
    mu_2 = mu[idx_2]
    Sigma_11 = Sigma[np.ix_(idx_1, idx_1)]
    Sigma_12 = Sigma[np.ix_(idx_1, idx_2)]
    Sigma_21 = Sigma[np.ix_(idx_2, idx_1)]
    Sigma_22 = Sigma[np.ix_(idx_2, idx_2)]

    Sigma_22_inv = np.linalg.inv(Sigma_22)

    # Conditional mean: μ_{1|2} = μ_1 + Σ_12 Σ_22⁻¹ (x_2 - μ_2)
    mu_cond = mu_1 + Sigma_12 @ Sigma_22_inv @ (x_2 - mu_2)

    # Conditional covariance: Σ_{1|2} = Σ_11 - Σ_12 Σ_22⁻¹ Σ_21
    Sigma_cond = Sigma_11 - Sigma_12 @ Sigma_22_inv @ Sigma_21

    return mu_cond, Sigma_cond

def marginal_gaussian(mu: np.ndarray, Sigma: np.ndarray, indices: list) -> tuple:
    """
    Compute the marginal distribution p(X_I) for the subset I.

    For Gaussians, this is simply extracting the subvector/submatrix!
    """
    mu_marginal = mu[indices]
    Sigma_marginal = Sigma[np.ix_(indices, indices)]
    return mu_marginal, Sigma_marginal

# Example: 3D Gaussian with conditioning
print("Gaussian Conditioning and Marginalization")
print("=" * 60)

# Define a 3D Gaussian
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 2.0, 0.7],
    [0.3, 0.7, 1.5]
])

print("Joint distribution:")
print(f"μ = {mu}")
print(f"Σ =\n{Sigma}")

# Marginal of (X₁, X₂) (indices 0, 1)
mu_marg, Sigma_marg = marginal_gaussian(mu, Sigma, [0, 1])
print("Marginal p(X₁, X₂):")
print(f"μ = {mu_marg}")
print(f"Σ =\n{Sigma_marg}")

# Conditional p(X₁ | X₂ = 2.5, X₃ = 3.5)
x_given = np.array([2.5, 3.5])  # values of X₂ and X₃
mu_cond, Sigma_cond = conditional_gaussian(mu, Sigma, [0], [1, 2], x_given)

print("Conditional p(X₁ | X₂=2.5, X₃=3.5):")
print(f"μ_cond = {mu_cond[0]:.4f}")
print(f"σ²_cond = {Sigma_cond[0, 0]:.4f}")
print(f"σ_cond = {np.sqrt(Sigma_cond[0, 0]):.4f}")

# Compare to the prior
print("Comparison to the prior p(X₁):")
print(f"Prior:     μ = {mu[0]:.4f}, σ = {np.sqrt(Sigma[0, 0]):.4f}")
print(f"Posterior: μ = {mu_cond[0]:.4f}, σ = {np.sqrt(Sigma_cond[0, 0]):.4f}")
print("(Variance decreases as we gain information from conditioning)")
```

Given n i.i.d. observations x₁, x₂, ..., xₙ from an MVN, we estimate the mean vector and covariance matrix.
The log-likelihood for n observations is:
$$\ell(\boldsymbol{\mu}, \mathbf{\Sigma}) = -\frac{nd}{2}\ln(2\pi) - \frac{n}{2}\ln|\mathbf{\Sigma}| - \frac{1}{2}\sum_{i=1}^n (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu})$$
MLE for Mean:
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i = \bar{\mathbf{x}}$$
The MLE for the mean is the sample mean vector.
MLE for Covariance:
$$\hat{\mathbf{\Sigma}}_{MLE} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T = \frac{1}{n}\mathbf{X}^T\mathbf{X}$$
where X is the centered data matrix.
The MLE for covariance is biased. The unbiased estimator divides by (n-1):
$$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
When d (dimensionality) is comparable to or larger than n (sample size):
1. Rank deficiency: The sample covariance matrix has rank at most min(n-1, d). If d ≥ n, it's singular and cannot be inverted.
2. High variance: Even when invertible, the estimated eigenvalues are highly variable—largest overestimated, smallest underestimated.
3. Solutions: regularize via shrinkage toward a simple target (e.g., Ledoit-Wolf), impose structure (diagonal or low-rank-plus-diagonal covariance), or enforce sparsity in the precision matrix (graphical lasso).
```python
import numpy as np

def estimate_mvn_parameters(data: np.ndarray, regularization: float = 0.0) -> dict:
    """
    Estimate MVN parameters from data.

    Args:
        data: n x d data matrix (n samples, d dimensions)
        regularization: Shrinkage toward a scaled identity (Ledoit-Wolf style)

    Returns:
        Dictionary with the mean, covariance, and related quantities
    """
    n, d = data.shape

    # Mean estimation
    mu_hat = np.mean(data, axis=0)

    # Centered data
    X_centered = data - mu_hat

    # MLE covariance (biased, divides by n)
    Sigma_mle = (X_centered.T @ X_centered) / n

    # Unbiased covariance (divides by n - 1)
    Sigma_unbiased = (X_centered.T @ X_centered) / (n - 1)

    # Regularized covariance (shrink toward a spherical target)
    if regularization > 0:
        target = np.trace(Sigma_unbiased) / d * np.eye(d)
        Sigma_reg = (1 - regularization) * Sigma_unbiased + regularization * target
    else:
        Sigma_reg = Sigma_unbiased

    # Correlation matrix
    stds = np.sqrt(np.diag(Sigma_unbiased))
    corr = Sigma_unbiased / np.outer(stds, stds)

    # Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma_reg)

    # Condition number (ratio of largest to smallest eigenvalue)
    condition_number = eigenvalues.max() / max(eigenvalues.min(), 1e-10)

    return {
        'n': n,
        'd': d,
        'mu': mu_hat,
        'Sigma_mle': Sigma_mle,
        'Sigma_unbiased': Sigma_unbiased,
        'Sigma_regularized': Sigma_reg,
        'correlation': corr,
        'eigenvalues': eigenvalues[::-1],        # descending order
        'eigenvectors': eigenvectors[:, ::-1],
        'condition_number': condition_number,
    }

def ledoit_wolf_shrinkage(data: np.ndarray) -> tuple:
    """
    Ledoit-Wolf optimal shrinkage estimator.

    Shrinks toward a scaled identity matrix with a data-driven optimal
    shrinkage intensity. The full formula involves fourth moments, so
    we delegate to scikit-learn's implementation.
    """
    from sklearn.covariance import LedoitWolf
    lw = LedoitWolf().fit(data)
    return lw.covariance_, lw.shrinkage_

# Example
np.random.seed(42)

# Generate data from a known MVN
true_mu = np.array([1.0, 2.0, 3.0])
true_Sigma = np.array([
    [1.0, 0.5, 0.3],
    [0.5, 2.0, 0.7],
    [0.3, 0.7, 1.5]
])

n_samples = 100
data = np.random.multivariate_normal(true_mu, true_Sigma, n_samples)

# Estimate parameters
results = estimate_mvn_parameters(data)

print("MVN Parameter Estimation")
print("=" * 60)
print(f"True μ:      {true_mu}")
print(f"Estimated μ: {results['mu'].round(4)}")
print(f"True Σ:\n{true_Sigma}")
print(f"Estimated Σ (unbiased):\n{results['Sigma_unbiased'].round(4)}")
print(f"Correlation matrix:\n{results['correlation'].round(4)}")
print(f"Eigenvalues: {results['eigenvalues'].round(4)}")
print(f"Condition number: {results['condition_number']:.2f}")
```

The MVN family is closed under linear transformations—a crucial property for many applications.
If X ~ N(μ, Σ) and Y = AX + b, where A is a matrix and b is a vector, then:
$$\mathbf{Y} \sim N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^T)$$
This generalizes the univariate result (aX + b ~ N(aμ + b, a²σ²)).
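This closure property is easy to verify with a Monte Carlo sketch; the matrices below are illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
A = np.array([[1.0, 1.0],
              [0.5, -1.0]])
b = np.array([0.0, 3.0])

# Analytic parameters of Y = A X + b
mu_Y = A @ mu + b
Sigma_Y = A @ Sigma @ A.T

# Monte Carlo check: sample X, transform, and compare moments
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b

print(mu_Y, Y.mean(axis=0))   # close
print(Sigma_Y)
print(np.cov(Y.T))            # close
```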
Whitening transforms data to have identity covariance:
$$\mathbf{Z} = \mathbf{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu}) \sim N(\mathbf{0}, \mathbf{I})$$
where Σ^{-1/2} is the inverse matrix square root (computed via eigendecomposition).
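A minimal sketch of that computation, with illustrative parameter values: build Σ^{-1/2} from the eigendecomposition and check that the whitened samples have near-identity covariance.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([2.0, -1.0])
Sigma = np.array([[3.0, 1.2],
                  [1.2, 2.0]])

# Inverse matrix square root via eigendecomposition: Σ^{-1/2} = U Λ^{-1/2} Uᵀ
evals, U = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = U @ np.diag(evals ** -0.5) @ U.T

# Whiten samples: Z = Σ^{-1/2}(X - μ)
X = rng.multivariate_normal(mu, Sigma, size=100_000)
Z = (X - mu) @ Sigma_inv_sqrt.T

print(np.cov(Z.T))  # approximately the identity matrix
```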
Applications: decorrelating features before modeling, preprocessing for methods like ICA, and sampling from N(μ, Σ) by inverting the transform (x = μ + Σ^{1/2}z with z ~ N(0, I)).
If X ~ N(μₓ, Σₓ) and Y ~ N(μᵧ, Σᵧ) are independent, then:
$$\mathbf{X} + \mathbf{Y} \sim N(\boldsymbol{\mu}_X + \boldsymbol{\mu}_Y, \mathbf{\Sigma}_X + \mathbf{\Sigma}_Y)$$
Means add, covariances add (generalizing the univariate case).
The product of two Gaussian PDFs (up to normalization) is Gaussian—essential for Bayesian updating:
If we have prior x ~ N(μ₀, Σ₀) and likelihood y | x ~ N(x, Σₗ), the posterior is:
$$\mathbf{x} \mid \mathbf{y} \sim N(\boldsymbol{\mu}_{post}, \mathbf{\Sigma}_{post})$$
where:
$$\mathbf{\Sigma}_{post}^{-1} = \mathbf{\Sigma}_0^{-1} + \mathbf{\Sigma}_\ell^{-1}$$ $$\boldsymbol{\mu}_{post} = \mathbf{\Sigma}_{post}(\mathbf{\Sigma}_0^{-1}\boldsymbol{\mu}_0 + \mathbf{\Sigma}_\ell^{-1}\mathbf{y})$$
This is precision-weighted averaging, exactly as in the univariate case.
Many computations are simpler in the 'information form' (also called canonical form) using precision matrix Λ = Σ⁻¹ and information vector η = Λμ. Products of Gaussians become sums of precision matrices and information vectors, making Bayesian updates additive.
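The precision-weighted update above is a few lines of code; the prior and noise values below are illustrative:

```python
import numpy as np

def gaussian_posterior(mu0, Sigma0, y, Sigma_l):
    """Posterior for prior x ~ N(μ₀, Σ₀) and likelihood y | x ~ N(x, Σ_ℓ)."""
    Lam0 = np.linalg.inv(Sigma0)    # prior precision
    Lam_l = np.linalg.inv(Sigma_l)  # likelihood precision
    Sigma_post = np.linalg.inv(Lam0 + Lam_l)         # precisions add
    mu_post = Sigma_post @ (Lam0 @ mu0 + Lam_l @ y)  # information vectors add
    return mu_post, Sigma_post

mu0 = np.array([0.0, 0.0])
Sigma0 = np.eye(2) * 4.0   # broad prior
y = np.array([1.0, 2.0])
Sigma_l = np.eye(2) * 1.0  # observation noise

mu_post, Sigma_post = gaussian_posterior(mu0, Sigma0, y, Sigma_l)
print(mu_post)     # pulled most of the way toward y, since the prior is weak
print(Sigma_post)  # tighter than both prior and likelihood
```

Because the prior variance (4) is four times the observation variance (1), the posterior mean lands 4/5 of the way from μ₀ to y.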
The Multivariate Gaussian is ubiquitous in machine learning. Let's explore its key applications.
PCA can be viewed as fitting an MVN and finding its principal axes:
Dimensionality reduction projects onto the top k eigenvectors.
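This view of PCA can be sketched directly: fit the mean and covariance, eigendecompose, and project (the data-generating parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2D data
X = rng.multivariate_normal([0, 0], [[3.0, 1.4], [1.4, 1.0]], size=5000)

# Fit the MVN: sample mean and covariance
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X.T)

# Principal axes = eigenvectors of the covariance, sorted by eigenvalue
evals, evecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Project onto the top k = 1 eigenvector
k = 1
X_proj = (X - mu_hat) @ evecs[:, :k]

explained = evals[0] / evals.sum()
print("Explained variance ratio:", explained)
```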
GMMs model data as a mixture of K MVN components:
$$p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, N(\mathbf{x} | \boldsymbol{\mu}_k, \mathbf{\Sigma}_k)$$
where πₖ are the mixing weights. GMMs are trained via the EM algorithm and are fundamental for clustering, density estimation, and anomaly detection.
```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

class GaussianDiscriminantAnalysis:
    """
    Gaussian Discriminant Analysis (LDA/QDA).

    Models each class with a Multivariate Gaussian:
        p(x | y=k) = N(x | μ_k, Σ_k)

    LDA: shared covariance across classes
    QDA: separate covariance per class
    """

    def __init__(self, shared_covariance: bool = True):
        """
        Args:
            shared_covariance: True for LDA, False for QDA
        """
        self.shared_cov = shared_covariance
        self.classes = None
        self.priors = {}
        self.means = {}
        self.covariances = {}

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit a Gaussian to each class."""
        self.classes = np.unique(y)
        n = len(y)
        d = X.shape[1]

        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / n
            self.means[c] = np.mean(X_c, axis=0)
            if not self.shared_cov:
                self.covariances[c] = np.cov(X_c.T)

        if self.shared_cov:
            # Pooled within-class covariance
            Sw = np.zeros((d, d))
            for c in self.classes:
                X_c = X[y == c]
                X_centered = X_c - self.means[c]
                Sw += X_centered.T @ X_centered
            self.shared_Sigma = Sw / (n - len(self.classes))

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Compute posterior probabilities P(y=k | x)."""
        log_posteriors = np.zeros((len(X), len(self.classes)))

        for i, c in enumerate(self.classes):
            Sigma = self.shared_Sigma if self.shared_cov else self.covariances[c]
            mvn = multivariate_normal(self.means[c], Sigma)
            log_posteriors[:, i] = np.log(self.priors[c]) + mvn.logpdf(X)

        # Normalize in log space for numerical stability
        log_norm = logsumexp(log_posteriors, axis=1, keepdims=True)
        return np.exp(log_posteriors - log_norm)

    def predict(self, X: np.ndarray) -> np.ndarray:
        proba = self.predict_proba(X)
        return self.classes[np.argmax(proba, axis=1)]

# Example: classification with LDA
np.random.seed(42)

# Generate 3-class data
n_per_class = 100
X_0 = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], n_per_class)
X_1 = np.random.multivariate_normal([3, 3], [[1, -0.3], [-0.3, 1]], n_per_class)
X_2 = np.random.multivariate_normal([0, 4], [[0.8, 0], [0, 0.8]], n_per_class)

X = np.vstack([X_0, X_1, X_2])
y = np.array([0] * n_per_class + [1] * n_per_class + [2] * n_per_class)

# Fit LDA
lda = GaussianDiscriminantAnalysis(shared_covariance=True)
lda.fit(X, y)

# Evaluate
y_pred = lda.predict(X)
accuracy = np.mean(y_pred == y)

print("Gaussian Discriminant Analysis (LDA)")
print("=" * 50)
print(f"Training accuracy: {accuracy:.2%}")
print("Class means:")
for c in lda.classes:
    print(f"  Class {c}: μ = {lda.means[c].round(3)}")
print(f"Shared covariance:\n{lda.shared_Sigma.round(3)}")

# Predict probabilities for a test point
test_point = np.array([[1.5, 2.0]])
probs = lda.predict_proba(test_point)[0]
print("Test point [1.5, 2.0] posteriors:")
for c, p in zip(lda.classes, probs):
    print(f"  P(y={c} | x) = {p:.4f}")
```

A Gaussian Process is a distribution over functions where any finite collection of function values is jointly MVN:
$$[f(x_1), f(x_2), \ldots, f(x_n)]^T \sim N(\boldsymbol{\mu}, \mathbf{K})$$
where K is the kernel (covariance) matrix with Kᵢⱼ = k(xᵢ, xⱼ).
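Since the function values at a finite set of inputs are jointly MVN, sampling from a GP prior is just sampling from N(0, K). A minimal sketch, assuming an RBF (squared-exponential) kernel with illustrative hyperparameters:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel: k(x, x') = σ² exp(-(x - x')² / (2ℓ²))."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale ** 2)

rng = np.random.default_rng(3)

# Finite set of inputs: the function values there are jointly MVN
x = np.linspace(0, 5, 50)
K = rbf_kernel(x, x)

# Sample three functions from the GP prior N(0, K)
jitter = 1e-8 * np.eye(len(x))  # numerical stabilizer for the Cholesky factor
L = np.linalg.cholesky(K + jitter)
samples = L @ rng.standard_normal((len(x), 3))

print(samples.shape)  # (50, 3): three sampled functions at 50 inputs
```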
GPs provide:
Flexible, nonparametric function approximation
Closed-form posterior predictions with calibrated uncertainty (every prediction is a Gaussian with a mean and a variance)
VAEs model the latent space as MVN: the encoder maps each input x to the parameters of a Gaussian approximate posterior q(z | x) = N(μ(x), diag(σ²(x))), while the prior over latents is the standard normal N(0, I).
The reparameterization trick enables gradient-based training: $$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim N(0, I)$$
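The trick is a direct application of the linear-transformation property: z = μ + σ ⊙ ε is an affine function of ε ~ N(0, I), so z ~ N(μ, diag(σ²)), and the randomness is isolated from the parameters. A NumPy sketch with illustrative encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical encoder outputs for one input
mu = np.array([0.5, -0.3, 0.2])
sigma = np.array([1.2, 0.8, 1.1])

# Reparameterization: z = μ + σ ⊙ ε, ε ~ N(0, I)
# All randomness lives in ε, so gradients flow through μ and σ.
eps = rng.standard_normal((100_000, 3))
z = mu + sigma * eps

print(z.mean(axis=0))  # ≈ μ
print(z.std(axis=0))   # ≈ σ
```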
Kalman filters propagate MVN beliefs through linear dynamical systems:
Prediction: p(xₜ | y₁:ₜ₋₁) = N(μₜ|ₜ₋₁, Σₜ|ₜ₋₁)
Update: p(xₜ | y₁:ₜ) = N(μₜ|ₜ, Σₜ|ₜ)
All operations remain Gaussian, enabling efficient recursive estimation.
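A minimal scalar Kalman filter illustrating the prediction/update recursion (the dynamics and noise values are hypothetical; here the process noise is zero, so the filter estimates a constant state):

```python
import numpy as np

# Scalar model: x_t = a x_{t-1} + w_t,  y_t = x_t + v_t
a, q, r = 1.0, 0.0, 1.0  # dynamics, process noise variance, observation noise variance

def kalman_step(mu, P, y):
    # Prediction: propagate the Gaussian belief through the linear dynamics
    mu_pred = a * mu
    P_pred = a * P * a + q
    # Update: condition on the new observation (the belief stays Gaussian)
    K = P_pred / (P_pred + r)         # Kalman gain
    mu_new = mu_pred + K * (y - mu_pred)
    P_new = (1 - K) * P_pred
    return mu_new, P_new

rng = np.random.default_rng(5)
x_true = 2.0
mu, P = 0.0, 10.0  # broad initial belief
for _ in range(50):
    y = x_true + rng.normal(0, np.sqrt(r))
    mu, P = kalman_step(mu, P, y)

print(mu, P)  # mean near 2, variance shrinks as observations accumulate
```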
The Kullback-Leibler (KL) divergence measures how one distribution differs from another. For MVNs, we have closed-form expressions.
For p = N(μ₁, Σ₁) and q = N(μ₂, Σ₂):
$$D_{KL}(p \| q) = \frac{1}{2}\left[\text{tr}(\mathbf{\Sigma}_2^{-1}\mathbf{\Sigma}_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T\mathbf{\Sigma}_2^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) - d + \ln\frac{|\mathbf{\Sigma}_2|}{|\mathbf{\Sigma}_1|}\right]$$
Special case (q = N(0, I)):
$$D_{KL}(N(\boldsymbol{\mu}, \mathbf{\Sigma}) \| N(\mathbf{0}, \mathbf{I})) = \frac{1}{2}\left[\text{tr}(\mathbf{\Sigma}) + \boldsymbol{\mu}^T\boldsymbol{\mu} - d - \ln|\mathbf{\Sigma}|\right]$$
This is the regularization term in VAE training!
```python
import numpy as np

def kl_divergence_gaussians(mu1: np.ndarray, Sigma1: np.ndarray,
                            mu2: np.ndarray, Sigma2: np.ndarray) -> float:
    """
    Compute the KL divergence D_KL(p || q) where:
        p = N(μ₁, Σ₁)
        q = N(μ₂, Σ₂)

    D_KL = 0.5 * [tr(Σ₂⁻¹Σ₁) + (μ₂-μ₁)ᵀΣ₂⁻¹(μ₂-μ₁) - d + ln(|Σ₂|/|Σ₁|)]
    """
    d = len(mu1)

    Sigma2_inv = np.linalg.inv(Sigma2)
    mu_diff = mu2 - mu1

    # Trace term
    trace_term = np.trace(Sigma2_inv @ Sigma1)

    # Mahalanobis term
    mahal_term = mu_diff @ Sigma2_inv @ mu_diff

    # Log determinant term
    log_det_term = np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))

    return 0.5 * (trace_term + mahal_term - d + log_det_term)

def kl_to_standard_normal(mu: np.ndarray, Sigma: np.ndarray) -> float:
    """
    KL divergence from N(μ, Σ) to N(0, I).

    This is the regularization term in VAE training:
        D_KL = 0.5 * [tr(Σ) + μᵀμ - d - ln|Σ|]
    """
    d = len(mu)
    trace_term = np.trace(Sigma)
    mu_sq_term = np.sum(mu ** 2)
    log_det_term = np.log(np.linalg.det(Sigma))
    return 0.5 * (trace_term + mu_sq_term - d - log_det_term)

# Example
print("KL Divergence Between Gaussians")
print("=" * 50)

# Two 2D Gaussians
mu1 = np.array([0.0, 0.0])
Sigma1 = np.array([[1.0, 0.3], [0.3, 1.0]])

mu2 = np.array([1.0, 1.0])
Sigma2 = np.array([[2.0, 0.5], [0.5, 2.0]])

kl_12 = kl_divergence_gaussians(mu1, Sigma1, mu2, Sigma2)
kl_21 = kl_divergence_gaussians(mu2, Sigma2, mu1, Sigma1)

print(f"p = N({mu1}, Σ₁)")
print(f"q = N({mu2}, Σ₂)")
print(f"D_KL(p || q) = {kl_12:.4f}")
print(f"D_KL(q || p) = {kl_21:.4f}")
print("(Note: KL is asymmetric!)")

# VAE regularization term
mu_vae = np.array([0.5, -0.3, 0.2])
sigma_vae = np.array([1.2, 0.8, 1.1])  # diagonal standard deviations
Sigma_vae = np.diag(sigma_vae ** 2)

kl_vae = kl_to_standard_normal(mu_vae, Sigma_vae)
print("VAE example:")
print(f"Encoder output: μ = {mu_vae}, σ = {sigma_vae}")
print(f"KL regularization term: {kl_vae:.4f}")
```

The Multivariate Gaussian is the cornerstone of high-dimensional probabilistic modeling.
Module Complete!
You have now completed the module on Common Probability Distributions, covering the key distributions from the Bernoulli through the Multivariate Gaussian.
These distributions form the probabilistic foundation of machine learning. Understanding their properties, estimation methods, and interrelationships equips you to build, analyze, and improve probabilistic models across diverse applications.
Congratulations! You now have comprehensive knowledge of the probability distributions that underpin machine learning. From the simplest Bernoulli to the sophisticated Multivariate Gaussian, these distributions provide the language and tools for reasoning about uncertainty, building models, and making predictions. This foundation is essential for understanding more advanced topics in probabilistic machine learning.