In the real world, phenomena rarely exist in isolation. A patient's health depends on multiple biomarkers simultaneously. House prices correlate with size, location, and age—all at once. A machine learning classifier must reason about the joint occurrence of many features to make accurate predictions.
Single-variable probability distributions are insufficient for capturing these complex relationships. When we measure height and weight, knowing someone is tall tells us something about their likely weight. When modeling stock prices, knowing the market trend informs predictions for individual stocks. These interconnections demand a mathematical framework for describing multiple random variables together.
This is precisely what joint probability distributions provide: a rigorous way to describe the complete probabilistic behavior of two or more random variables, including all their interdependencies.
By the end of this page, you will master joint probability distributions for both discrete and continuous random variables. You'll understand how to specify, visualize, and manipulate joint distributions, and you'll see why they form the essential mathematical foundation for multivariate machine learning.
Before diving into joint distributions, let's recall what we know about single (univariate) random variables and understand why extending to multiple variables requires new machinery.
Univariate distributions describe the probability behavior of a single random variable $X$ through its PMF (discrete) or PDF (continuous) and its CDF.
For a single variable, this is complete—we know everything about $X$ from its distribution. But consider two random variables $X$ and $Y$:
Knowing $P(X)$ and $P(Y)$ separately is NOT enough.
To see why, consider two scenarios with the same marginal distributions. In Scenario A, $X$ and $Y$ are two independent fair coin flips; in Scenario B, a single fair coin is flipped and $Y$ is set equal to $X$.

Both scenarios have identical individual distributions for $X$ and $Y$ (each is a fair coin), yet their joint behavior is completely different: in Scenario A the outcome $(X=1, Y=0)$ occurs with probability $1/4$, while in Scenario B it never occurs. The joint distribution captures what the individual distributions cannot.
Individual distributions answer: 'What values can X take?' and 'What values can Y take?' Joint distributions answer the crucial additional question: 'What combinations of values can X and Y take together, and with what probabilities?'
Machine Learning Relevance:

In ML, we constantly deal with multiple random variables: input features, target labels, latent variables, and noise terms all vary together.

Without joint distributions, we cannot formalize how features co-vary, how a label depends on a feature vector, or what it means to generate new data, which are the questions at the heart of multivariate modeling.
For discrete random variables, we extend the PMF concept to multiple variables through the joint probability mass function.
Definition (Joint PMF):
For two discrete random variables $X$ and $Y$, the joint PMF is defined as:
$$p_{X,Y}(x, y) = P(X = x \text{ and } Y = y) = P(X = x, Y = y)$$
This function assigns a probability to every possible pair $(x, y)$ of values that $X$ and $Y$ can take.
Properties of Joint PMF:
Non-negativity: $p_{X,Y}(x, y) \geq 0$ for all $(x, y)$
Normalization: The sum over all possible pairs equals 1: $$\sum_{x} \sum_{y} p_{X,Y}(x, y) = 1$$
Probability of events: For any event $A \subseteq \mathbb{R}^2$: $$P((X, Y) \in A) = \sum_{(x,y) \in A} p_{X,Y}(x, y)$$
These properties directly parallel those of univariate PMFs, extended to the two-dimensional case.
| | Y = Heads (0) | Y = Tails (1) |
|---|---|---|
| X = 1 | 1/12 | 1/12 |
| X = 2 | 1/12 | 1/12 |
| X = 3 | 1/12 | 1/12 |
| X = 4 | 1/12 | 1/12 |
| X = 5 | 1/12 | 1/12 |
| X = 6 | 1/12 | 1/12 |
In the table above, we have $X$ representing a fair die roll (values 1-6) and $Y$ representing a fair coin flip (0 for heads, 1 for tails). Since these are independent, each cell has probability $\frac{1}{6} \cdot \frac{1}{2} = \frac{1}{12}$.
Notice that the table itself is the joint distribution—it tells us everything about the joint behavior of $X$ and $Y$. Each cell is $p_{X,Y}(x, y)$, and summing all 12 cells gives 1.
For discrete random variables with finite support, the joint PMF can always be represented as a table (for two variables) or a higher-dimensional array (for more variables). Each entry is a probability, and the entire table sums to 1.
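To make this concrete, here is a minimal sketch (assuming NumPy) that stores the die-and-coin table above as an array and queries it; the "even roll AND heads" event at the end is just an illustrative choice.

```python
import numpy as np

# Joint PMF from the table above: rows = die faces 1..6, columns = coin (0 = heads, 1 = tails)
joint = np.full((6, 2), 1 / 12)

print(joint.sum())                  # normalization: sums to 1
print(joint[2, 0])                  # P(X = 3, Y = heads) = 1/12 (row index 2 is the face "3")

# Probability of an event is the sum over its cells, e.g. "even roll AND heads"
print(joint[[1, 3, 5], 0].sum())    # 3/12 = 0.25
```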
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Example: Joint PMF of two discrete random variables
# X: Number of heads in 2 coin flips (0, 1, or 2)
# Y: Number of heads in 3 coin flips (0, 1, 2, or 3)
# Both use fair coins independently

def joint_pmf_coin_flips():
    """
    Compute and visualize joint PMF for two binomial random variables.
    X ~ Binomial(2, 0.5)
    Y ~ Binomial(3, 0.5)
    Since independent: P(X=x, Y=y) = P(X=x) * P(Y=y)
    """
    from scipy.stats import binom

    # Define the distributions
    n_x, p = 2, 0.5  # X: 2 flips
    n_y = 3          # Y: 3 flips

    # Possible values of X and Y
    x_values = np.arange(n_x + 1)  # 0, 1, 2
    y_values = np.arange(n_y + 1)  # 0, 1, 2, 3

    # Compute marginal PMFs
    pmf_x = binom.pmf(x_values, n_x, p)
    pmf_y = binom.pmf(y_values, n_y, p)

    # Compute joint PMF (outer product for independent variables)
    joint_pmf = np.outer(pmf_x, pmf_y)

    print("Joint PMF Table:")
    print("=" * 50)
    header = "X \\ Y"  # kept outside the f-string expression for compatibility
    print(f"{header:>8}", end="")
    for y in y_values:
        print(f"    Y={y:d}", end="")
    print()
    print("-" * 50)
    for i, x in enumerate(x_values):
        print(f"X={x:d}", end="   ")
        for j, y in enumerate(y_values):
            print(f" {joint_pmf[i, j]:.4f}", end="")
        print()
    print("-" * 50)
    print(f"Sum of all probabilities: {joint_pmf.sum():.4f}")

    # Visualize as 3D bar chart
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')

    X, Y = np.meshgrid(x_values, y_values)
    X_flat = X.flatten()
    Y_flat = Y.flatten()
    Z_flat = joint_pmf.T.flatten()

    ax.bar3d(X_flat, Y_flat, np.zeros_like(Z_flat), 0.4, 0.4, Z_flat,
             alpha=0.8, color='steelblue')
    ax.set_xlabel('X (2 coin flips)')
    ax.set_ylabel('Y (3 coin flips)')
    ax.set_zlabel('P(X=x, Y=y)')
    ax.set_title('Joint PMF: Independent Binomial Variables')

    return joint_pmf

# Run the visualization
joint_pmf_coin_flips()
plt.show()
```

For continuous random variables, we cannot assign probabilities to individual points (they would all be zero). Instead, we work with probability density functions that describe the density of probability in regions of the outcome space.
Definition (Joint PDF):
For two continuous random variables $X$ and $Y$, the joint PDF $f_{X,Y}(x, y)$ is a function such that:
$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$$
Intuitively, $f_{X,Y}(x, y) \cdot dx \cdot dy$ represents the probability that $X$ falls in a tiny interval around $x$ AND $Y$ falls in a tiny interval around $y$.
Properties of Joint PDF:
Non-negativity: $f_{X,Y}(x, y) \geq 0$ for all $(x, y) \in \mathbb{R}^2$
Normalization: The integral over the entire plane equals 1: $$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx \, dy = 1$$

Probability via integration: For any region $A$: $$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$$
Critical distinction from PMF: The value $f_{X,Y}(x, y)$ is NOT a probability—it's a probability density. It can exceed 1, as long as the total integral equals 1. Only when we integrate over a region do we get actual probabilities.
A joint PDF value f(x,y) = 3 is perfectly valid—it just means probability is concentrated in that region. What matters is that ∫∫f(x,y)dxdy = 1. Think of it like mass density: a small, dense object can have high density but still have finite mass.
Example: Uniform Distribution on the Unit Square
The simplest continuous joint distribution is the uniform distribution over a bounded region. For the unit square $[0, 1] \times [0, 1]$:
$$f_{X,Y}(x, y) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1 \\ 0 & \text{otherwise} \end{cases}$$

Let's verify normalization: $$\int_0^1 \int_0^1 1 \, dx \, dy = 1 \cdot 1 = 1 \checkmark$$
Computing probabilities:
What is $P(X + Y < 1)$? This is the probability that the point $(X, Y)$ falls in the region below the line $x + y = 1$:
$$P(X + Y < 1) = \int_0^1 \int_0^{1-x} 1 \, dy \, dx = \int_0^1 (1-x) \, dx = \frac{1}{2}$$
Geometrically, this is the area of the triangle below the line, which is indeed half the unit square.
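The same probability can be checked numerically. The sketch below, assuming SciPy's `dblquad` and a plain Monte Carlo estimate, reproduces the analytic value of $\frac{1}{2}$.

```python
import numpy as np
from scipy.integrate import dblquad

# Direct integration of the density f(x, y) = 1 over the triangle {x + y < 1}.
# dblquad integrates func(y, x), with the y-limits given as functions of x.
prob, _ = dblquad(lambda y, x: 1.0, 0, 1, lambda x: 0.0, lambda x: 1.0 - x)
print(prob)                                  # 0.5

# Monte Carlo cross-check with uniform samples on the unit square
rng = np.random.default_rng(0)
pts = rng.uniform(size=(1_000_000, 2))
print(np.mean(pts.sum(axis=1) < 1))          # ≈ 0.5
```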
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats

def visualize_joint_pdf():
    """
    Visualize joint PDFs: uniform on unit square and bivariate Gaussian.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 6),
                             subplot_kw={'projection': '3d'})

    # Example 1: Uniform on unit square
    ax1 = axes[0]
    x = np.linspace(-0.2, 1.2, 50)
    y = np.linspace(-0.2, 1.2, 50)
    X, Y = np.meshgrid(x, y)

    # PDF is 1 inside [0,1]x[0,1], 0 outside
    Z_uniform = np.where((X >= 0) & (X <= 1) & (Y >= 0) & (Y <= 1), 1.0, 0.0)

    ax1.plot_surface(X, Y, Z_uniform, cmap='Blues', alpha=0.8, edgecolor='none')
    ax1.set_xlabel('X')
    ax1.set_ylabel('Y')
    ax1.set_zlabel('f(x, y)')
    ax1.set_title('Uniform Distribution on Unit Square')

    # Example 2: Standard Bivariate Gaussian
    ax2 = axes[1]
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)

    # Standard bivariate Gaussian (μ=0, σ=1, ρ=0.5)
    mu = np.array([0, 0])
    rho = 0.5
    cov = np.array([[1, rho], [rho, 1]])

    # Compute PDF at each point
    pos = np.dstack((X, Y))
    rv = stats.multivariate_normal(mu, cov)
    Z_gaussian = rv.pdf(pos)

    ax2.plot_surface(X, Y, Z_gaussian, cmap='Reds', alpha=0.8, edgecolor='none')
    ax2.set_xlabel('X')
    ax2.set_ylabel('Y')
    ax2.set_zlabel('f(x, y)')
    ax2.set_title(f'Bivariate Gaussian (ρ = {rho})')

    plt.tight_layout()

    # Also show contour plots (more common in practice)
    fig2, axes2 = plt.subplots(1, 2, figsize=(14, 6))

    # Contour for uniform
    ax3 = axes2[0]
    x = np.linspace(-0.2, 1.2, 100)
    y = np.linspace(-0.2, 1.2, 100)
    X, Y = np.meshgrid(x, y)
    Z_uniform = np.where((X >= 0) & (X <= 1) & (Y >= 0) & (Y <= 1), 1.0, 0.0)
    ax3.contourf(X, Y, Z_uniform, levels=20, cmap='Blues')
    ax3.set_xlabel('X')
    ax3.set_ylabel('Y')
    ax3.set_title('Uniform on Unit Square (Contour)')
    ax3.set_aspect('equal')

    # Contour for Gaussian
    ax4 = axes2[1]
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    pos = np.dstack((X, Y))
    Z_gaussian = rv.pdf(pos)
    ax4.contourf(X, Y, Z_gaussian, levels=20, cmap='Reds')
    ax4.set_xlabel('X')
    ax4.set_ylabel('Y')
    ax4.set_title(f'Bivariate Gaussian (ρ = {rho}) (Contour)')
    ax4.set_aspect('equal')

    plt.tight_layout()
    plt.show()

    return Z_gaussian

visualize_joint_pdf()
```

The bivariate Gaussian (normal) distribution is the most important continuous joint distribution in statistics and machine learning. Its mathematical tractability, theoretical properties, and prevalence in real-world data make it foundational.
Definition (Bivariate Gaussian):
For random variables $X$ and $Y$ jointly normally distributed with means $\mu_X, \mu_Y$, standard deviations $\sigma_X, \sigma_Y$, and correlation coefficient $\rho$, the joint PDF is:
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}\right]\right)$$
Matrix Notation (cleaner):
Let $\mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix}$, $\boldsymbol{\mu} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}$, and covariance matrix:

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}$$
Then the joint PDF becomes:
$$f(\mathbf{x}) = \frac{1}{2\pi|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
This matrix form generalizes directly to $d$ dimensions (the multivariate Gaussian).
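To see that the matrix form agrees with standard library code, the sketch below evaluates it directly and compares against SciPy's `multivariate_normal`; the particular mean vector and covariance matrix are arbitrary illustration choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -0.5])           # illustrative parameters
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

def bivariate_gaussian_pdf(x, mu, Sigma):
    """Evaluate the matrix-form density above for d = 2."""
    diff = x - mu
    quad_form = diff @ np.linalg.inv(Sigma) @ diff
    norm_const = 2 * np.pi * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

x = np.array([0.3, 0.7])
print(bivariate_gaussian_pdf(x, mu, Sigma))     # manual evaluation of the formula
print(multivariate_normal(mu, Sigma).pdf(x))    # SciPy gives the same value
```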
The covariance matrix Σ completely determines the shape and orientation of the Gaussian. Its eigenvalues determine the spread along principal axes, and its eigenvectors determine those axes' directions. This connects to PCA—the principal components are exactly the eigenvectors of the data covariance matrix.
Key Properties of Bivariate Gaussian:
Contours are ellipses: Lines of constant probability density form ellipses centered at $(\mu_X, \mu_Y)$
Marginals are Gaussian: If $(X, Y)$ is bivariate Gaussian, then $X$ alone is Gaussian($\mu_X, \sigma_X^2$) and $Y$ alone is Gaussian($\mu_Y, \sigma_Y^2$)
Conditionals are Gaussian: Given $X = x$, the conditional distribution of $Y$ is Gaussian with mean $\mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and variance $\sigma_Y^2(1 - \rho^2)$
Zero correlation implies independence: For Gaussian variables ONLY, $\rho = 0$ implies $X$ and $Y$ are independent. This is NOT true for general distributions; a counterexample sketch follows this list.
Linear combinations are Gaussian: Any linear combination $aX + bY$ is also Gaussian.
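As promised above, here is a minimal counterexample sketch for the zero-correlation caveat: with $X$ standard normal and $Y = X^2$, the pair is uncorrelated but clearly dependent. The sample size and thresholds below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2                     # Y is a deterministic function of X, so clearly dependent on it

# Correlation is (essentially) zero because E[X^3] = 0 for a symmetric distribution
print(np.corrcoef(x, y)[0, 1])            # ≈ 0

# ...yet the variables are not independent: |X| > 1 forces Y > 1
print(np.mean(y > 1))                     # ≈ 0.32 unconditionally
print(np.mean(y[np.abs(x) > 1] > 1))      # = 1.0 given |X| > 1
```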
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def explore_bivariate_gaussian():
    """
    Explore properties of bivariate Gaussian distributions
    with different correlation values.
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    correlations = [-0.8, 0, 0.5, 0.9, 0.99]
    mu = np.array([0, 0])

    # Create grid for plotting
    x = np.linspace(-4, 4, 200)
    y = np.linspace(-4, 4, 200)
    X, Y = np.meshgrid(x, y)
    pos = np.dstack((X, Y))

    for idx, rho in enumerate(correlations):
        ax = axes.flatten()[idx]

        # Covariance matrix
        cov = np.array([[1, rho], [rho, 1]])

        # Create bivariate normal
        rv = stats.multivariate_normal(mu, cov)
        Z = rv.pdf(pos)

        # Plot contours
        contours = ax.contour(X, Y, Z, levels=10, cmap='RdBu_r')
        ax.contourf(X, Y, Z, levels=50, cmap='RdBu_r', alpha=0.5)

        # Add eigenvalue analysis
        eigenvalues, eigenvectors = np.linalg.eigh(cov)

        # Plot eigenvector directions (scaled by sqrt(eigenvalue))
        for i in range(2):
            vec = eigenvectors[:, i] * np.sqrt(eigenvalues[i]) * 2
            ax.arrow(0, 0, vec[0], vec[1], head_width=0.15, head_length=0.1,
                     fc='black', ec='black', linewidth=2)

        ax.set_xlim(-4, 4)
        ax.set_ylim(-4, 4)
        ax.set_aspect('equal')
        ax.set_xlabel('X')
        ax.set_ylabel('Y')
        ax.set_title(f'ρ = {rho}\nEigenvalues: [{eigenvalues[0]:.2f}, {eigenvalues[1]:.2f}]')
        ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
        ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

    # Last panel: explain the relationship
    ax = axes.flatten()[5]
    ax.text(0.5, 0.9, "Bivariate Gaussian Properties:", transform=ax.transAxes,
            fontsize=12, fontweight='bold', ha='center')
    ax.text(0.5, 0.7, "• Contours are ellipses", transform=ax.transAxes,
            fontsize=10, ha='center')
    ax.text(0.5, 0.55, "• Eigenvalues = variances along principal axes",
            transform=ax.transAxes, fontsize=10, ha='center')
    ax.text(0.5, 0.4, "• Eigenvectors = directions of principal axes",
            transform=ax.transAxes, fontsize=10, ha='center')
    ax.text(0.5, 0.25, "• ρ → ±1: ellipse becomes a line", transform=ax.transAxes,
            fontsize=10, ha='center')
    ax.text(0.5, 0.1, "• ρ = 0: ellipse axes align with X, Y", transform=ax.transAxes,
            fontsize=10, ha='center')
    ax.axis('off')

    plt.tight_layout()
    plt.savefig('bivariate_gaussian_correlations.png', dpi=150, bbox_inches='tight')
    plt.show()

def demonstrate_conditional_gaussian():
    """
    Show that conditional distributions are also Gaussian.
    """
    # Parameters
    mu_x, mu_y = 2, 3
    sigma_x, sigma_y = 1, 2
    rho = 0.7

    # Generate samples
    mu = np.array([mu_x, mu_y])
    cov = np.array([[sigma_x**2, rho*sigma_x*sigma_y],
                    [rho*sigma_x*sigma_y, sigma_y**2]])
    samples = np.random.multivariate_normal(mu, cov, 10000)

    # Condition on X being close to a specific value
    x_given = 3.5
    tolerance = 0.1
    mask = np.abs(samples[:, 0] - x_given) < tolerance
    conditional_y_samples = samples[mask, 1]

    # Theoretical conditional distribution
    # E[Y|X=x]   = mu_y + rho * (sigma_y/sigma_x) * (x - mu_x)
    # Var[Y|X=x] = sigma_y^2 * (1 - rho^2)
    conditional_mean = mu_y + rho * (sigma_y/sigma_x) * (x_given - mu_x)
    conditional_var = sigma_y**2 * (1 - rho**2)
    conditional_std = np.sqrt(conditional_var)

    print(f"Conditioning on X ≈ {x_given}")
    print(f"Theoretical conditional mean: {conditional_mean:.4f}")
    print(f"Sample conditional mean: {np.mean(conditional_y_samples):.4f}")
    print(f"Theoretical conditional std: {conditional_std:.4f}")
    print(f"Sample conditional std: {np.std(conditional_y_samples):.4f}")

    # Visualize
    fig, ax = plt.subplots(figsize=(10, 6))

    # Histogram of conditional samples
    ax.hist(conditional_y_samples, bins=30, density=True, alpha=0.7,
            label='Empirical conditional distribution')

    # Theoretical conditional PDF
    y_range = np.linspace(
        conditional_mean - 4*conditional_std,
        conditional_mean + 4*conditional_std,
        200
    )
    theoretical_pdf = stats.norm.pdf(y_range, conditional_mean, conditional_std)
    ax.plot(y_range, theoretical_pdf, 'r-', linewidth=2,
            label=f'Theoretical: N({conditional_mean:.2f}, {conditional_var:.2f})')

    ax.set_xlabel('Y')
    ax.set_ylabel('Density')
    ax.set_title(f'Conditional Distribution: P(Y | X = {x_given})')
    ax.legend()
    plt.tight_layout()
    plt.show()

explore_bivariate_gaussian()
demonstrate_conditional_gaussian()
```

Just as univariate distributions can be characterized by their CDF, joint distributions have a joint CDF that provides cumulative probabilities.
Definition (Joint CDF):
For random variables $X$ and $Y$, the joint CDF is:
$$F_{X,Y}(x, y) = P(X \leq x \text{ and } Y \leq y) = P(X \leq x, Y \leq y)$$
Properties of Joint CDF:
Boundary conditions: $F_{X,Y}(x, y) \to 0$ as $x \to -\infty$ or $y \to -\infty$, and $F_{X,Y}(x, y) \to 1$ as both $x \to \infty$ and $y \to \infty$
Monotonicity: $F_{X,Y}$ is non-decreasing in both arguments
Right-continuity: $F_{X,Y}$ is right-continuous in each argument
Probability of rectangles: $$P(a < X \leq b, c < Y \leq d) = F(b,d) - F(a,d) - F(b,c) + F(a,c)$$
This last property is the inclusion-exclusion formula for rectangular regions.
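Here is a brief sketch of the rectangle formula in code, assuming a SciPy version whose `multivariate_normal` exposes a `cdf` method; the correlation, rectangle bounds, and sample size are arbitrary illustration choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

rv = multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]])

def F(x, y):
    """Joint CDF F(x, y) = P(X <= x, Y <= y)."""
    return rv.cdf([x, y])

# P(a < X <= b, c < Y <= d) via inclusion-exclusion on the CDF
a, b, c, d = -1.0, 1.0, 0.0, 2.0
rect_prob = F(b, d) - F(a, d) - F(b, c) + F(a, c)

# Monte Carlo cross-check
samples = rv.rvs(size=500_000, random_state=np.random.default_rng(0))
inside = (samples[:, 0] > a) & (samples[:, 0] <= b) & (samples[:, 1] > c) & (samples[:, 1] <= d)
print(rect_prob, inside.mean())   # the two estimates should agree to a few decimals
```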
Relationship to PDF (Continuous Case):
For continuous random variables, the PDF is the mixed partial derivative of the CDF:
$$f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x \, \partial y}$$
Conversely, the CDF is obtained by integrating the PDF:
$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(s, t) \, dt \, ds$$
Example: CDF for Uniform on Unit Square
For $(X, Y)$ uniform on $[0,1] \times [0,1]$:
$$F_{X,Y}(x, y) = \begin{cases} 0 & \text{if } x < 0 \text{ or } y < 0 \\ xy & \text{if } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1 \\ x & \text{if } 0 \leq x \leq 1 \text{ and } y > 1 \\ y & \text{if } x > 1 \text{ and } 0 \leq y \leq 1 \\ 1 & \text{if } x > 1 \text{ and } y > 1 \end{cases}$$
Let's verify: $\frac{\partial^2}{\partial x \partial y}(xy) = 1 = f_{X,Y}(x,y)$ on the unit square. ✓
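A small numerical check of the same relationship: the piecewise CDF above collapses to a one-line expression, and a finite-difference mixed partial recovers the density inside the square. The evaluation point and step size below are arbitrary.

```python
import numpy as np

def F_uniform(x, y):
    """Joint CDF of the uniform distribution on the unit square (the piecewise form above)."""
    return np.clip(x, 0, 1) * np.clip(y, 0, 1)

# Finite-difference mixed partial should recover the density f(x, y) = 1 inside the square
h = 1e-5
x0, y0 = 0.4, 0.7
mixed = (F_uniform(x0 + h, y0 + h) - F_uniform(x0 + h, y0)
         - F_uniform(x0, y0 + h) + F_uniform(x0, y0)) / h**2
print(mixed)   # ≈ 1.0
```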
CDFs are particularly useful for: (1) computing probabilities of rectangular regions using inclusion-exclusion, (2) theoretical analyses involving limits and continuity, and (3) generating random samples via inverse CDF methods. PDFs/PMFs are easier for computing specific probabilities and visualizing distributions.
While we've focused on two random variables, the concepts generalize directly to any number of variables—essential for machine learning with high-dimensional data.
Notation for $d$ Random Variables:
For random variables $X_1, X_2, \ldots, X_d$ (often written as a random vector $\mathbf{X} = (X_1, \ldots, X_d)^T$):
Discrete case (Joint PMF): $$p_{X_1, \ldots, X_d}(x_1, \ldots, x_d) = P(X_1 = x_1, X_2 = x_2, \ldots, X_d = x_d)$$
Continuous case (Joint PDF): $$P(\mathbf{X} \in A) = \int \cdots \int_A f_{X_1, \ldots, X_d}(x_1, \ldots, x_d) \, dx_1 \cdots dx_d$$
Joint CDF: $$F_{X_1, \ldots, X_d}(x_1, \ldots, x_d) = P(X_1 \leq x_1, \ldots, X_d \leq x_d)$$
The Multivariate Gaussian:
The $d$-dimensional Gaussian is defined by mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$ and $d \times d$ covariance matrix $\boldsymbol{\Sigma}$:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
The covariance matrix $\boldsymbol{\Sigma}$ is symmetric and positive semi-definite, with diagonal entries $\Sigma_{ii} = \text{Var}(X_i)$ and off-diagonal entries $\Sigma_{ij} = \text{Cov}(X_i, X_j)$.
Why this matters for ML:
In machine learning, the feature vector $\mathbf{x}$ for a data point is exactly a realization from a joint distribution. When we assume features follow a multivariate Gaussian, the entire joint behavior is summarized by a mean vector and a covariance matrix, marginals and conditionals remain Gaussian with closed-form parameters, and distance from the mean can be measured with the Mahalanobis distance $(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$.
As dimension d increases, the complexity of joint distributions grows exponentially. A joint distribution over d binary variables has 2^d possible outcomes. For continuous variables, estimation requires more samples as d grows. This 'curse of dimensionality' motivates assumptions like conditional independence (Naive Bayes) or low-rank structure (factor models).
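A brief illustration of that growth, printing how many entries a full joint PMF table over $d$ binary variables would need:

```python
# Number of entries in a full joint PMF table over d binary variables grows as 2^d
for d in (5, 10, 20, 30, 50):
    print(f"d = {d:>2}:  {2**d:,} joint probabilities")
```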
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def multivariate_gaussian_demo():
    """
    Demonstrate multivariate Gaussian in higher dimensions.
    """
    np.random.seed(42)

    # 5-dimensional Gaussian
    d = 5
    mu = np.zeros(d)

    # Create a random positive definite covariance matrix
    A = np.random.randn(d, d)
    Sigma = A @ A.T + 0.1 * np.eye(d)  # Ensure positive definiteness

    print("=" * 60)
    print("5-Dimensional Multivariate Gaussian")
    print("=" * 60)
    print(f"\nMean vector μ:\n{mu}")
    print(f"\nCovariance matrix Σ:\n{np.round(Sigma, 3)}")

    # Eigendecomposition (principal axes)
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
    print(f"\nEigenvalues: {np.round(eigenvalues, 3)}")
    print("(These are variances along principal axes)")

    # Generate samples
    samples = np.random.multivariate_normal(mu, Sigma, 1000)

    # Verify sample statistics match theoretical
    sample_mean = np.mean(samples, axis=0)
    sample_cov = np.cov(samples.T)
    print(f"\nSample mean (should be ≈ 0):\n{np.round(sample_mean, 3)}")
    print(f"\nSample covariance (should match Σ):\n{np.round(sample_cov, 3)}")

    # Visualize pairwise marginals
    fig, axes = plt.subplots(d, d, figsize=(15, 15))
    for i in range(d):
        for j in range(d):
            ax = axes[i, j]
            if i == j:
                # Diagonal: histogram of marginal
                ax.hist(samples[:, i], bins=30, density=True, alpha=0.7)
                # Theoretical marginal
                x_range = np.linspace(samples[:, i].min(), samples[:, i].max(), 100)
                ax.plot(x_range, stats.norm.pdf(x_range, mu[i], np.sqrt(Sigma[i, i])),
                        'r-', linewidth=2)
                ax.set_ylabel(f'X{i+1}')
            else:
                # Off-diagonal: scatter plot
                ax.scatter(samples[:, j], samples[:, i], alpha=0.1, s=5)
                ax.set_xlabel(f'X{j+1}')
                ax.set_ylabel(f'X{i+1}')

    plt.suptitle('Pairwise Marginal Distributions (5D Gaussian)', fontsize=14)
    plt.tight_layout()
    plt.savefig('multivariate_gaussian_pairwise.png', dpi=100)
    plt.show()

    # Mahalanobis distance demonstration
    print("\n" + "=" * 60)
    print("Mahalanobis Distance")
    print("=" * 60)

    # Mahalanobis distance: (x - μ)^T Σ^(-1) (x - μ)
    Sigma_inv = np.linalg.inv(Sigma)
    mahal_distances = []
    for sample in samples:
        diff = sample - mu
        mahal = np.sqrt(diff @ Sigma_inv @ diff)
        mahal_distances.append(mahal)
    mahal_distances = np.array(mahal_distances)

    # For a d-dim Gaussian, squared Mahalanobis distance follows chi-square(d)
    print(f"Mean Mahalanobis distance: {np.mean(mahal_distances):.3f}")
    print(f"Expected (√d ≈ √5): {np.sqrt(d):.3f}")

    # Plot distribution of Mahalanobis distances
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.hist(mahal_distances**2, bins=50, density=True, alpha=0.7,
            label='Empirical (squared Mahalanobis)')
    x_range = np.linspace(0, max(mahal_distances**2), 200)
    ax.plot(x_range, stats.chi2.pdf(x_range, df=d), 'r-', linewidth=2,
            label=f'Chi-square({d})')
    ax.set_xlabel('Squared Mahalanobis Distance')
    ax.set_ylabel('Density')
    ax.set_title('Squared Mahalanobis Distance follows Chi-square Distribution')
    ax.legend()
    plt.tight_layout()
    plt.show()

multivariate_gaussian_demo()
```

Joint distributions permeate machine learning. Understanding their role clarifies many ML techniques.
1. Modeling the Data Generation Process
In ML, we typically assume data points $(\mathbf{x}_i, y_i)$ are i.i.d. samples from some joint distribution $P(\mathbf{X}, Y)$: features and label are drawn together, so every dependency between them is encoded in that single distribution.
The goal of supervised learning is to learn enough about this joint distribution to make good predictions.
2. Classification via Bayes' Rule
Given the joint distribution, classification is straightforward via Bayes' rule:
$$P(Y = k | \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} | Y = k) \cdot P(Y = k)}{P(\mathbf{X} = \mathbf{x})}$$
This connects the class-conditional likelihood $P(\mathbf{X} \mid Y = k)$, the class prior $P(Y = k)$, and the evidence $P(\mathbf{X})$, all of which are determined by the joint distribution $P(\mathbf{X}, Y)$, to the posterior used for prediction.
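As a sketch of this pipeline, the code below builds a toy generative classifier with Gaussian class-conditional densities and evaluates the posterior at one point; the priors, means, covariances, and query point are invented purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class generative model (all parameters invented for illustration)
priors = {0: 0.6, 1: 0.4}                                  # P(Y = k)
likelihoods = {                                            # P(X | Y = k), one Gaussian per class
    0: multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]]),
    1: multivariate_normal(mean=[2.0, 1.0], cov=[[1.5, -0.2], [-0.2, 0.8]]),
}

def posterior(x):
    """P(Y = k | X = x) via Bayes' rule: normalize the joint p(x, k) = p(x | k) P(k)."""
    joint = {k: likelihoods[k].pdf(x) * priors[k] for k in priors}
    evidence = sum(joint.values())                         # P(X = x), the normalizer
    return {k: joint[k] / evidence for k in joint}

post = posterior([1.0, 0.5])
print(post)                          # posterior probabilities, summing to 1
print(max(post, key=post.get))       # predicted class
```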
3. The Generative vs. Discriminative Divide
A fundamental choice in ML is whether to model the joint distribution:
| Approach | Models | Advantages | Disadvantages |
|---|---|---|---|
| Generative | $P(\mathbf{X}, Y)$ | Can generate data, handle missing values, see feature relationships | More parameters to estimate, stronger assumptions |
| Discriminative | $P(Y \mid \mathbf{X})$ | Fewer parameters, often better for pure classification | Can't generate data, no insight into feature distribution |
Neither is universally better—the choice depends on the task and data availability.
Every probabilistic machine learning model—from linear regression to deep neural networks—implicitly or explicitly involves assumptions about joint distributions. Understanding joint distributions unlocks the mathematical reasoning behind ML algorithms.
We've established the foundational framework for reasoning about multiple random variables simultaneously. The key concepts: the joint PMF describes discrete variables, the joint PDF describes continuous variables, the joint CDF unifies both through cumulative probabilities, and the bivariate Gaussian (and its $d$-dimensional generalization) is the workhorse joint distribution in practice.
What's next:
Now that we understand joint distributions as the complete description of multiple variables together, we need tools to extract information from them. The next page covers marginal distributions—how to "integrate out" variables we don't care about to focus on the distribution of a subset.
You now understand joint probability distributions—the mathematical framework for describing the complete probabilistic behavior of multiple random variables. This is the foundation for all subsequent topics in this module: marginal distributions, conditional distributions, covariance, correlation, and independence.