In the real world, phenomena rarely exist in isolation. A patient's health depends on multiple biomarkers simultaneously. House prices correlate with size, location, and age—all at once. A machine learning classifier must reason about the joint occurrence of many features to make accurate predictions.
Single-variable probability distributions are insufficient for capturing these complex relationships. When we measure height and weight, knowing someone is tall tells us something about their likely weight. When modeling stock prices, knowing the market trend informs predictions for individual stocks. These interconnections demand a mathematical framework for describing multiple random variables together.
This is precisely what joint probability distributions provide: a rigorous way to describe the complete probabilistic behavior of two or more random variables, including all their interdependencies.
By the end of this page, you will master joint probability distributions for both discrete and continuous random variables. You'll understand how to specify, visualize, and manipulate joint distributions, and you'll see why they form the essential mathematical foundation for multivariate machine learning.
Before diving into joint distributions, let's recall what we know about single (univariate) random variables and understand why extending to multiple variables requires new machinery.
Univariate distributions describe the probability behavior of a single random variable $X$ through its PMF (discrete) or PDF (continuous) and its CDF.
For a single variable, this is complete—we know everything about $X$ from its distribution. But consider two random variables $X$ and $Y$:
Knowing $P(X)$ and $P(Y)$ separately is NOT enough.
To see why, consider two scenarios with the same marginal distributions. In Scenario A, $X$ and $Y$ are two independent fair coin flips; in Scenario B, a single fair coin is flipped and $Y$ is set equal to $X$.

Both scenarios have identical individual distributions for $X$ and $Y$ (each is a fair coin), yet their joint behavior is completely different: in Scenario A the outcome $(X=1, Y=0)$ occurs with probability $1/4$, while in Scenario B it never occurs. The joint distribution captures what the individual distributions cannot.
Individual distributions answer: 'What values can X take?' and 'What values can Y take?' Joint distributions answer the crucial additional question: 'What combinations of values can X and Y take together, and with what probabilities?'
Machine Learning Relevance:

In ML, we constantly deal with multiple random variables: input features, target labels, latent variables, and noise terms all vary together.

Without joint distributions, we cannot formalize how features co-vary, how a label depends on a feature vector, or what it means to generate new data, which are the questions at the heart of multivariate modeling.
For discrete random variables, we extend the PMF concept to multiple variables through the joint probability mass function.
Definition (Joint PMF):
For two discrete random variables $X$ and $Y$, the joint PMF is defined as:
$$p_{X,Y}(x, y) = P(X = x \text{ and } Y = y) = P(X = x, Y = y)$$
This function assigns a probability to every possible pair $(x, y)$ of values that $X$ and $Y$ can take.
Properties of Joint PMF:
Non-negativity: $p_{X,Y}(x, y) \geq 0$ for all $(x, y)$
Normalization: The sum over all possible pairs equals 1: $$\sum_{x} \sum_{y} p_{X,Y}(x, y) = 1$$
Probability of events: For any event $A \subseteq \mathbb{R}^2$: $$P((X, Y) \in A) = \sum_{(x,y) \in A} p_{X,Y}(x, y)$$
These properties directly parallel those of univariate PMFs, extended to the two-dimensional case.
| | Y = Heads (0) | Y = Tails (1) |
|---|---|---|
| X = 1 | 1/12 | 1/12 |
| X = 2 | 1/12 | 1/12 |
| X = 3 | 1/12 | 1/12 |
| X = 4 | 1/12 | 1/12 |
| X = 5 | 1/12 | 1/12 |
| X = 6 | 1/12 | 1/12 |
In the table above, we have $X$ representing a fair die roll (values 1-6) and $Y$ representing a fair coin flip (0 for heads, 1 for tails). Since these are independent, each cell has probability $\frac{1}{6} \cdot \frac{1}{2} = \frac{1}{12}$.
Notice that the table itself is the joint distribution—it tells us everything about the joint behavior of $X$ and $Y$. Each cell is $p_{X,Y}(x, y)$, and summing all 12 cells gives 1.
For discrete random variables with finite support, the joint PMF can always be represented as a table (for two variables) or a higher-dimensional array (for more variables). Each entry is a probability, and the entire table sums to 1.
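To make this concrete, here is a minimal sketch (assuming NumPy) that stores the die-and-coin table above as an array and queries it; the "even roll AND heads" event at the end is just an illustrative choice.

```python
import numpy as np

# Joint PMF from the table above: rows = die faces 1..6, columns = coin (0 = heads, 1 = tails)
joint = np.full((6, 2), 1 / 12)

print(joint.sum())                  # normalization: sums to 1
print(joint[2, 0])                  # P(X = 3, Y = heads) = 1/12 (row index 2 is the face "3")

# Probability of an event is the sum over its cells, e.g. "even roll AND heads"
print(joint[[1, 3, 5], 0].sum())    # 3/12 = 0.25
```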
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Example: Joint PMF of two discrete random variables
# X: Number of heads in 2 coin flips (0, 1, or 2)
# Y: Number of heads in 3 coin flips (0, 1, 2, or 3)
# Both use fair coins independently

def joint_pmf_coin_flips():
    """
    Compute and visualize joint PMF for two binomial random variables.
    X ~ Binomial(2, 0.5)
    Y ~ Binomial(3, 0.5)
    Since independent: P(X=x, Y=y) = P(X=x) * P(Y=y)
    """
    from scipy.stats import binom

    # Define the distributions
    n_x, p = 2, 0.5  # X: 2 flips
    n_y = 3          # Y: 3 flips

    # Possible values of X and Y
    x_values = np.arange(n_x + 1)  # 0, 1, 2
    y_values = np.arange(n_y + 1)  # 0, 1, 2, 3

    # Compute marginal PMFs
    pmf_x = binom.pmf(x_values, n_x, p)
    pmf_y = binom.pmf(y_values, n_y, p)

    # Compute joint PMF (outer product for independent variables)
    joint_pmf = np.outer(pmf_x, pmf_y)

    print("Joint PMF Table:")
    print("=" * 50)
    header = "X \\ Y"  # kept outside the f-string expression for compatibility
    print(f"{header:>8}", end="")
    for y in y_values:
        print(f"    Y={y:d}", end="")
    print()
    print("-" * 50)
    for i, x in enumerate(x_values):
        print(f"X={x:d}", end="   ")
        for j, y in enumerate(y_values):
            print(f" {joint_pmf[i, j]:.4f}", end="")
        print()
    print("-" * 50)
    print(f"Sum of all probabilities: {joint_pmf.sum():.4f}")

    # Visualize as 3D bar chart
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')

    X, Y = np.meshgrid(x_values, y_values)
    X_flat = X.flatten()
    Y_flat = Y.flatten()
    Z_flat = joint_pmf.T.flatten()

    ax.bar3d(X_flat, Y_flat, np.zeros_like(Z_flat), 0.4, 0.4, Z_flat,
             alpha=0.8, color='steelblue')
    ax.set_xlabel('X (2 coin flips)')
    ax.set_ylabel('Y (3 coin flips)')
    ax.set_zlabel('P(X=x, Y=y)')
    ax.set_title('Joint PMF: Independent Binomial Variables')

    return joint_pmf

# Run the visualization
joint_pmf_coin_flips()
plt.show()
```

For continuous random variables, we cannot assign probabilities to individual points (they would all be zero). Instead, we work with probability density functions that describe the density of probability in regions of the outcome space.
Definition (Joint PDF):
For two continuous random variables $X$ and $Y$, the joint PDF $f_{X,Y}(x, y)$ is a function such that:
$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$$
Intuitively, $f_{X,Y}(x, y) \cdot dx \cdot dy$ represents the probability that $X$ falls in a tiny interval around $x$ AND $Y$ falls in a tiny interval around $y$.
Properties of Joint PDF:
Non-negativity: $f_{X,Y}(x, y) \geq 0$ for all $(x, y) \in \mathbb{R}^2$
Normalization: The integral over the entire plane equals 1: $$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx \, dy = 1$$

Probability via integration: For any region $A$: $$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y) \, dx \, dy$$
Critical distinction from PMF: The value $f_{X,Y}(x, y)$ is NOT a probability—it's a probability density. It can exceed 1, as long as the total integral equals 1. Only when we integrate over a region do we get actual probabilities.
A joint PDF value f(x,y) = 3 is perfectly valid—it just means probability is concentrated in that region. What matters is that ∫∫f(x,y)dxdy = 1. Think of it like mass density: a small, dense object can have high density but still have finite mass.
Example: Uniform Distribution on the Unit Square
The simplest continuous joint distribution is the uniform distribution over a bounded region. For the unit square $[0, 1] \times [0, 1]$:
$$f_{X,Y}(x, y) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1 \\ 0 & \text{otherwise} \end{cases}$$

Let's verify normalization: $$\int_0^1 \int_0^1 1 \, dx \, dy = 1 \cdot 1 = 1 \checkmark$$
Computing probabilities:
What is $P(X + Y < 1)$? This is the probability that the point $(X, Y)$ falls in the region below the line $x + y = 1$:
$$P(X + Y < 1) = \int_0^1 \int_0^{1-x} 1 \, dy \, dx = \int_0^1 (1-x) \, dx = \frac{1}{2}$$
Geometrically, this is the area of the triangle below the line, which is indeed half the unit square.
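The same probability can be checked numerically. The sketch below, assuming SciPy's `dblquad` and a plain Monte Carlo estimate, reproduces the analytic value of $\frac{1}{2}$.

```python
import numpy as np
from scipy.integrate import dblquad

# Direct integration of the density f(x, y) = 1 over the triangle {x + y < 1}.
# dblquad integrates func(y, x), with the y-limits given as functions of x.
prob, _ = dblquad(lambda y, x: 1.0, 0, 1, lambda x: 0.0, lambda x: 1.0 - x)
print(prob)                                  # 0.5

# Monte Carlo cross-check with uniform samples on the unit square
rng = np.random.default_rng(0)
pts = rng.uniform(size=(1_000_000, 2))
print(np.mean(pts.sum(axis=1) < 1))          # ≈ 0.5
```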
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats

def visualize_joint_pdf():
    """
    Visualize joint PDFs: uniform on unit square and bivariate Gaussian.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 6),
                             subplot_kw={'projection': '3d'})

    # Example 1: Uniform on unit square
    ax1 = axes[0]
    x = np.linspace(-0.2, 1.2, 50)
    y = np.linspace(-0.2, 1.2, 50)
    X, Y = np.meshgrid(x, y)

    # PDF is 1 inside [0,1]x[0,1], 0 outside
    Z_uniform = np.where((X >= 0) & (X <= 1) & (Y >= 0) & (Y <= 1), 1.0, 0.0)

    ax1.plot_surface(X, Y, Z_uniform, cmap='Blues', alpha=0.8, edgecolor='none')
    ax1.set_xlabel('X')
    ax1.set_ylabel('Y')
    ax1.set_zlabel('f(x, y)')
    ax1.set_title('Uniform Distribution on Unit Square')

    # Example 2: Standard Bivariate Gaussian
    ax2 = axes[1]
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)

    # Standard bivariate Gaussian (μ=0, σ=1, ρ=0.5)
    mu = np.array([0, 0])
    rho = 0.5
    cov = np.array([[1, rho], [rho, 1]])

    # Compute PDF at each point
    pos = np.dstack((X, Y))
    rv = stats.multivariate_normal(mu, cov)
    Z_gaussian = rv.pdf(pos)

    ax2.plot_surface(X, Y, Z_gaussian, cmap='Reds', alpha=0.8, edgecolor='none')
    ax2.set_xlabel('X')
    ax2.set_ylabel('Y')
    ax2.set_zlabel('f(x, y)')
    ax2.set_title(f'Bivariate Gaussian (ρ = {rho})')

    plt.tight_layout()

    # Also show contour plots (more common in practice)
    fig2, axes2 = plt.subplots(1, 2, figsize=(14, 6))

    # Contour for uniform
    ax3 = axes2[0]
    x = np.linspace(-0.2, 1.2, 100)
    y = np.linspace(-0.2, 1.2, 100)
    X, Y = np.meshgrid(x, y)
    Z_uniform = np.where((X >= 0) & (X <= 1) & (Y >= 0) & (Y <= 1), 1.0, 0.0)
    ax3.contourf(X, Y, Z_uniform, levels=20, cmap='Blues')
    ax3.set_xlabel('X')
    ax3.set_ylabel('Y')
    ax3.set_title('Uniform on Unit Square (Contour)')
    ax3.set_aspect('equal')

    # Contour for Gaussian
    ax4 = axes2[1]
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    pos = np.dstack((X, Y))
    Z_gaussian = rv.pdf(pos)
    ax4.contourf(X, Y, Z_gaussian, levels=20, cmap='Reds')
    ax4.set_xlabel('X')
    ax4.set_ylabel('Y')
    ax4.set_title(f'Bivariate Gaussian (ρ = {rho}) (Contour)')
    ax4.set_aspect('equal')

    plt.tight_layout()
    plt.show()

    return Z_gaussian

visualize_joint_pdf()
```

The bivariate Gaussian (normal) distribution is the most important continuous joint distribution in statistics and machine learning. Its mathematical tractability, theoretical properties, and prevalence in real-world data make it foundational.
Definition (Bivariate Gaussian):
For random variables $X$ and $Y$ jointly normally distributed with means $\mu_X, \mu_Y$, standard deviations $\sigma_X, \sigma_Y$, and correlation coefficient $\rho$, the joint PDF is:
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}\right]\right)$$
Matrix Notation (cleaner):
Let $\mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix}$, $\boldsymbol{\mu} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}$, and covariance matrix:

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix}$$
Then the joint PDF becomes:
$$f(\mathbf{x}) = \frac{1}{2\pi|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
This matrix form generalizes directly to $d$ dimensions (the multivariate Gaussian).
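To see that the matrix form agrees with standard library code, the sketch below evaluates it directly and compares against SciPy's `multivariate_normal`; the particular mean vector and covariance matrix are arbitrary illustration choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -0.5])           # illustrative parameters
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

def bivariate_gaussian_pdf(x, mu, Sigma):
    """Evaluate the matrix-form density above for d = 2."""
    diff = x - mu
    quad_form = diff @ np.linalg.inv(Sigma) @ diff
    norm_const = 2 * np.pi * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

x = np.array([0.3, 0.7])
print(bivariate_gaussian_pdf(x, mu, Sigma))     # manual evaluation of the formula
print(multivariate_normal(mu, Sigma).pdf(x))    # SciPy gives the same value
```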
The covariance matrix Σ completely determines the shape and orientation of the Gaussian. Its eigenvalues determine the spread along principal axes, and its eigenvectors determine those axes' directions. This connects to PCA—the principal components are exactly the eigenvectors of the data covariance matrix.
Key Properties of Bivariate Gaussian:
Contours are ellipses: Lines of constant probability density form ellipses centered at $(\mu_X, \mu_Y)$
Marginals are Gaussian: If $(X, Y)$ is bivariate Gaussian, then $X$ alone is Gaussian($\mu_X, \sigma_X^2$) and $Y$ alone is Gaussian($\mu_Y, \sigma_Y^2$)
Conditionals are Gaussian: Given $X = x$, the conditional distribution of $Y$ is Gaussian with mean $\mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and variance $\sigma_Y^2(1 - \rho^2)$
Zero correlation implies independence: For Gaussian variables ONLY, $\rho = 0$ implies $X$ and $Y$ are independent. This is NOT true for general distributions; a counterexample sketch follows this list.
Linear combinations are Gaussian: Any linear combination $aX + bY$ is also Gaussian.
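As promised above, here is a minimal counterexample sketch for the zero-correlation caveat: with $X$ standard normal and $Y = X^2$, the pair is uncorrelated but clearly dependent. The sample size and thresholds below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2                     # Y is a deterministic function of X, so clearly dependent on it

# Correlation is (essentially) zero because E[X^3] = 0 for a symmetric distribution
print(np.corrcoef(x, y)[0, 1])            # ≈ 0

# ...yet the variables are not independent: |X| > 1 forces Y > 1
print(np.mean(y > 1))                     # ≈ 0.32 unconditionally
print(np.mean(y[np.abs(x) > 1] > 1))      # = 1.0 given |X| > 1
```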
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def explore_bivariate_gaussian():
    """
    Explore properties of bivariate Gaussian distributions
    with different correlation values.
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    correlations = [-0.8, 0, 0.5, 0.9, 0.99]
    mu = np.array([0, 0])

    # Create grid for plotting
    x = np.linspace(-4, 4, 200)
    y = np.linspace(-4, 4, 200)
    X, Y = np.meshgrid(x, y)
    pos = np.dstack((X, Y))

    for idx, rho in enumerate(correlations):
        ax = axes.flatten()[idx]

        # Covariance matrix
        cov = np.array([[1, rho], [rho, 1]])

        # Create bivariate normal
        rv = stats.multivariate_normal(mu, cov)
        Z = rv.pdf(pos)

        # Plot contours
        contours = ax.contour(X, Y, Z, levels=10, cmap='RdBu_r')
        ax.contourf(X, Y, Z, levels=50, cmap='RdBu_r', alpha=0.5)

        # Add eigenvalue analysis
        eigenvalues, eigenvectors = np.linalg.eigh(cov)

        # Plot eigenvector directions (scaled by sqrt(eigenvalue))
        for i in range(2):
            vec = eigenvectors[:, i] * np.sqrt(eigenvalues[i]) * 2
            ax.arrow(0, 0, vec[0], vec[1], head_width=0.15, head_length=0.1,
                     fc='black', ec='black', linewidth=2)

        ax.set_xlim(-4, 4)
        ax.set_ylim(-4, 4)
        ax.set_aspect('equal')
        ax.set_xlabel('X')
        ax.set_ylabel('Y')
        ax.set_title(f'ρ = {rho}\nEigenvalues: [{eigenvalues[0]:.2f}, {eigenvalues[1]:.2f}]')
        ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
        ax.axvline(x=0, color='gray', linestyle='--', alpha=0.5)

    # Last panel: explain the relationship
    ax = axes.flatten()[5]
    ax.text(0.5, 0.9, "Bivariate Gaussian Properties:", transform=ax.transAxes,
            fontsize=12, fontweight='bold', ha='center')
    ax.text(0.5, 0.7, "• Contours are ellipses", transform=ax.transAxes,
            fontsize=10, ha='center')
    ax.text(0.5, 0.55, "• Eigenvalues = variances along principal axes",
            transform=ax.transAxes, fontsize=10, ha='center')
    ax.text(0.5, 0.4, "• Eigenvectors = directions of principal axes",
            transform=ax.transAxes, fontsize=10, ha='center')
    ax.text(0.5, 0.25, "• ρ → ±1: ellipse becomes a line", transform=ax.transAxes,
            fontsize=10, ha='center')
    ax.text(0.5, 0.1, "• ρ = 0: ellipse axes align with X, Y", transform=ax.transAxes,
            fontsize=10, ha='center')
    ax.axis('off')

    plt.tight_layout()
    plt.savefig('bivariate_gaussian_correlations.png', dpi=150, bbox_inches='tight')
    plt.show()

def demonstrate_conditional_gaussian():
    """
    Show that conditional distributions are also Gaussian.
    """
    # Parameters
    mu_x, mu_y = 2, 3
    sigma_x, sigma_y = 1, 2
    rho = 0.7

    # Generate samples
    mu = np.array([mu_x, mu_y])
    cov = np.array([[sigma_x**2, rho*sigma_x*sigma_y],
                    [rho*sigma_x*sigma_y, sigma_y**2]])
    samples = np.random.multivariate_normal(mu, cov, 10000)

    # Condition on X being close to a specific value
    x_given = 3.5
    tolerance = 0.1
    mask = np.abs(samples[:, 0] - x_given) < tolerance
    conditional_y_samples = samples[mask, 1]

    # Theoretical conditional distribution
    # E[Y|X=x]   = mu_y + rho * (sigma_y/sigma_x) * (x - mu_x)
    # Var[Y|X=x] = sigma_y^2 * (1 - rho^2)
    conditional_mean = mu_y + rho * (sigma_y/sigma_x) * (x_given - mu_x)
    conditional_var = sigma_y**2 * (1 - rho**2)
    conditional_std = np.sqrt(conditional_var)

    print(f"Conditioning on X ≈ {x_given}")
    print(f"Theoretical conditional mean: {conditional_mean:.4f}")
    print(f"Sample conditional mean: {np.mean(conditional_y_samples):.4f}")
    print(f"Theoretical conditional std: {conditional_std:.4f}")
    print(f"Sample conditional std: {np.std(conditional_y_samples):.4f}")

    # Visualize
    fig, ax = plt.subplots(figsize=(10, 6))

    # Histogram of conditional samples
    ax.hist(conditional_y_samples, bins=30, density=True, alpha=0.7,
            label='Empirical conditional distribution')

    # Theoretical conditional PDF
    y_range = np.linspace(
        conditional_mean - 4*conditional_std,
        conditional_mean + 4*conditional_std,
        200
    )
    theoretical_pdf = stats.norm.pdf(y_range, conditional_mean, conditional_std)
    ax.plot(y_range, theoretical_pdf, 'r-', linewidth=2,
            label=f'Theoretical: N({conditional_mean:.2f}, {conditional_var:.2f})')

    ax.set_xlabel('Y')
    ax.set_ylabel('Density')
    ax.set_title(f'Conditional Distribution: P(Y | X = {x_given})')
    ax.legend()
    plt.tight_layout()
    plt.show()

explore_bivariate_gaussian()
demonstrate_conditional_gaussian()
```

Just as univariate distributions can be characterized by their CDF, joint distributions have a joint CDF that provides cumulative probabilities.
Definition (Joint CDF):
For random variables $X$ and $Y$, the joint CDF is:
$$F_{X,Y}(x, y) = P(X \leq x \text{ and } Y \leq y) = P(X \leq x, Y \leq y)$$
Properties of Joint CDF:
Boundary conditions: $F_{X,Y}(x, y) \to 0$ as $x \to -\infty$ or $y \to -\infty$, and $F_{X,Y}(x, y) \to 1$ as both $x \to \infty$ and $y \to \infty$
Monotonicity: $F_{X,Y}$ is non-decreasing in both arguments
Right-continuity: $F_{X,Y}$ is right-continuous in each argument
Probability of rectangles: $$P(a < X \leq b, c < Y \leq d) = F(b,d) - F(a,d) - F(b,c) + F(a,c)$$
This last property is the inclusion-exclusion formula for rectangular regions.
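Here is a brief sketch of the rectangle formula in code, assuming a SciPy version whose `multivariate_normal` exposes a `cdf` method; the correlation, rectangle bounds, and sample size are arbitrary illustration choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

rv = multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]])

def F(x, y):
    """Joint CDF F(x, y) = P(X <= x, Y <= y)."""
    return rv.cdf([x, y])

# P(a < X <= b, c < Y <= d) via inclusion-exclusion on the CDF
a, b, c, d = -1.0, 1.0, 0.0, 2.0
rect_prob = F(b, d) - F(a, d) - F(b, c) + F(a, c)

# Monte Carlo cross-check
samples = rv.rvs(size=500_000, random_state=np.random.default_rng(0))
inside = (samples[:, 0] > a) & (samples[:, 0] <= b) & (samples[:, 1] > c) & (samples[:, 1] <= d)
print(rect_prob, inside.mean())   # the two estimates should agree to a few decimals
```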
Relationship to PDF (Continuous Case):
For continuous random variables, the PDF is the mixed partial derivative of the CDF:
$$f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x \, \partial y}$$
Conversely, the CDF is obtained by integrating the PDF:
$$F_{X,Y}(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(s, t) \, dt \, ds$$
Example: CDF for Uniform on Unit Square
For $(X, Y)$ uniform on $[0,1] \times [0,1]$:
$$F_{X,Y}(x, y) = \begin{cases} 0 & \text{if } x < 0 \text{ or } y < 0 \\ xy & \text{if } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq 1 \\ x & \text{if } 0 \leq x \leq 1 \text{ and } y > 1 \\ y & \text{if } x > 1 \text{ and } 0 \leq y \leq 1 \\ 1 & \text{if } x > 1 \text{ and } y > 1 \end{cases}$$
Let's verify: $\frac{\partial^2}{\partial x \partial y}(xy) = 1 = f_{X,Y}(x,y)$ on the unit square. ✓
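A small numerical check of the same relationship: the piecewise CDF above collapses to a one-line expression, and a finite-difference mixed partial recovers the density inside the square. The evaluation point and step size below are arbitrary.

```python
import numpy as np

def F_uniform(x, y):
    """Joint CDF of the uniform distribution on the unit square (the piecewise form above)."""
    return np.clip(x, 0, 1) * np.clip(y, 0, 1)

# Finite-difference mixed partial should recover the density f(x, y) = 1 inside the square
h = 1e-5
x0, y0 = 0.4, 0.7
mixed = (F_uniform(x0 + h, y0 + h) - F_uniform(x0 + h, y0)
         - F_uniform(x0, y0 + h) + F_uniform(x0, y0)) / h**2
print(mixed)   # ≈ 1.0
```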
CDFs are particularly useful for: (1) computing probabilities of rectangular regions using inclusion-exclusion, (2) theoretical analyses involving limits and continuity, and (3) generating random samples via inverse CDF methods. PDFs/PMFs are easier for computing specific probabilities and visualizing distributions.
While we've focused on two random variables, the concepts generalize directly to any number of variables—essential for machine learning with high-dimensional data.
Notation for $d$ Random Variables:
For random variables $X_1, X_2, \ldots, X_d$ (often written as a random vector $\mathbf{X} = (X_1, \ldots, X_d)^T$):
Discrete case (Joint PMF): $$p_{X_1, \ldots, X_d}(x_1, \ldots, x_d) = P(X_1 = x_1, X_2 = x_2, \ldots, X_d = x_d)$$
Continuous case (Joint PDF): $$P(\mathbf{X} \in A) = \int \cdots \int_A f_{X_1, \ldots, X_d}(x_1, \ldots, x_d) \, dx_1 \cdots dx_d$$
Joint CDF: $$F_{X_1, \ldots, X_d}(x_1, \ldots, x_d) = P(X_1 \leq x_1, \ldots, X_d \leq x_d)$$
The Multivariate Gaussian:
The $d$-dimensional Gaussian is defined by mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$ and $d \times d$ covariance matrix $\boldsymbol{\Sigma}$:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
The covariance matrix $\boldsymbol{\Sigma}$ is symmetric and positive semi-definite, with diagonal entries $\Sigma_{ii} = \text{Var}(X_i)$ and off-diagonal entries $\Sigma_{ij} = \text{Cov}(X_i, X_j)$.
Why this matters for ML:
In machine learning, the feature vector $\mathbf{x}$ for a data point is exactly a realization from a joint distribution. When we assume features follow a multivariate Gaussian, the entire joint behavior is summarized by a mean vector and a covariance matrix, marginals and conditionals remain Gaussian with closed-form parameters, and distance from the mean can be measured with the Mahalanobis distance $(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$.
As dimension d increases, the complexity of joint distributions grows exponentially. A joint distribution over d binary variables has 2^d possible outcomes. For continuous variables, estimation requires more samples as d grows. This 'curse of dimensionality' motivates assumptions like conditional independence (Naive Bayes) or low-rank structure (factor models).
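A brief illustration of that growth, printing how many entries a full joint PMF table over $d$ binary variables would need:

```python
# Number of entries in a full joint PMF table over d binary variables grows as 2^d
for d in (5, 10, 20, 30, 50):
    print(f"d = {d:>2}:  {2**d:,} joint probabilities")
```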
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def multivariate_gaussian_demo():
    """
    Demonstrate multivariate Gaussian in higher dimensions.
    """
    np.random.seed(42)

    # 5-dimensional Gaussian
    d = 5
    mu = np.zeros(d)

    # Create a random positive definite covariance matrix
    A = np.random.randn(d, d)
    Sigma = A @ A.T + 0.1 * np.eye(d)  # Ensure positive definiteness

    print("=" * 60)
    print("5-Dimensional Multivariate Gaussian")
    print("=" * 60)
    print(f"\nMean vector μ:\n{mu}")
    print(f"\nCovariance matrix Σ:\n{np.round(Sigma, 3)}")

    # Eigendecomposition (principal axes)
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
    print(f"\nEigenvalues: {np.round(eigenvalues, 3)}")
    print("(These are variances along principal axes)")

    # Generate samples
    samples = np.random.multivariate_normal(mu, Sigma, 1000)

    # Verify sample statistics match theoretical
    sample_mean = np.mean(samples, axis=0)
    sample_cov = np.cov(samples.T)
    print(f"\nSample mean (should be ≈ 0):\n{np.round(sample_mean, 3)}")
    print(f"\nSample covariance (should match Σ):\n{np.round(sample_cov, 3)}")

    # Visualize pairwise marginals
    fig, axes = plt.subplots(d, d, figsize=(15, 15))
    for i in range(d):
        for j in range(d):
            ax = axes[i, j]
            if i == j:
                # Diagonal: histogram of marginal
                ax.hist(samples[:, i], bins=30, density=True, alpha=0.7)
                # Theoretical marginal
                x_range = np.linspace(samples[:, i].min(), samples[:, i].max(), 100)
                ax.plot(x_range, stats.norm.pdf(x_range, mu[i], np.sqrt(Sigma[i, i])),
                        'r-', linewidth=2)
                ax.set_ylabel(f'X{i+1}')
            else:
                # Off-diagonal: scatter plot
                ax.scatter(samples[:, j], samples[:, i], alpha=0.1, s=5)
                ax.set_xlabel(f'X{j+1}')
                ax.set_ylabel(f'X{i+1}')

    plt.suptitle('Pairwise Marginal Distributions (5D Gaussian)', fontsize=14)
    plt.tight_layout()
    plt.savefig('multivariate_gaussian_pairwise.png', dpi=100)
    plt.show()

    # Mahalanobis distance demonstration
    print("\n" + "=" * 60)
    print("Mahalanobis Distance")
    print("=" * 60)

    # Mahalanobis distance: (x - μ)^T Σ^(-1) (x - μ)
    Sigma_inv = np.linalg.inv(Sigma)
    mahal_distances = []
    for sample in samples:
        diff = sample - mu
        mahal = np.sqrt(diff @ Sigma_inv @ diff)
        mahal_distances.append(mahal)
    mahal_distances = np.array(mahal_distances)

    # For a d-dim Gaussian, squared Mahalanobis distance follows chi-square(d)
    print(f"Mean Mahalanobis distance: {np.mean(mahal_distances):.3f}")
    print(f"Expected (√d ≈ √5): {np.sqrt(d):.3f}")

    # Plot distribution of Mahalanobis distances
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.hist(mahal_distances**2, bins=50, density=True, alpha=0.7,
            label='Empirical (squared Mahalanobis)')
    x_range = np.linspace(0, max(mahal_distances**2), 200)
    ax.plot(x_range, stats.chi2.pdf(x_range, df=d), 'r-', linewidth=2,
            label=f'Chi-square({d})')
    ax.set_xlabel('Squared Mahalanobis Distance')
    ax.set_ylabel('Density')
    ax.set_title('Squared Mahalanobis Distance follows Chi-square Distribution')
    ax.legend()
    plt.tight_layout()
    plt.show()

multivariate_gaussian_demo()
```

Joint distributions permeate machine learning. Understanding their role clarifies many ML techniques.
1. Modeling the Data Generation Process
In ML, we typically assume data points $(\mathbf{x}_i, y_i)$ are i.i.d. samples from some joint distribution $P(\mathbf{X}, Y)$: features and label are drawn together, so every dependency between them is encoded in that single distribution.
The goal of supervised learning is to learn enough about this joint distribution to make good predictions.
2. Classification via Bayes' Rule
Given the joint distribution, classification is straightforward via Bayes' rule:
$$P(Y = k | \mathbf{X} = \mathbf{x}) = \frac{P(\mathbf{X} = \mathbf{x} | Y = k) \cdot P(Y = k)}{P(\mathbf{X} = \mathbf{x})}$$
This connects the class-conditional likelihood $P(\mathbf{X} \mid Y = k)$, the class prior $P(Y = k)$, and the evidence $P(\mathbf{X})$, all of which are determined by the joint distribution $P(\mathbf{X}, Y)$, to the posterior used for prediction.
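As a sketch of this pipeline, the code below builds a toy generative classifier with Gaussian class-conditional densities and evaluates the posterior at one point; the priors, means, covariances, and query point are invented purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class generative model (all parameters invented for illustration)
priors = {0: 0.6, 1: 0.4}                                  # P(Y = k)
likelihoods = {                                            # P(X | Y = k), one Gaussian per class
    0: multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]]),
    1: multivariate_normal(mean=[2.0, 1.0], cov=[[1.5, -0.2], [-0.2, 0.8]]),
}

def posterior(x):
    """P(Y = k | X = x) via Bayes' rule: normalize the joint p(x, k) = p(x | k) P(k)."""
    joint = {k: likelihoods[k].pdf(x) * priors[k] for k in priors}
    evidence = sum(joint.values())                         # P(X = x), the normalizer
    return {k: joint[k] / evidence for k in joint}

post = posterior([1.0, 0.5])
print(post)                          # posterior probabilities, summing to 1
print(max(post, key=post.get))       # predicted class
```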
3. The Generative vs. Discriminative Divide
A fundamental choice in ML is whether to model the joint distribution:
| Approach | Models | Advantages | Disadvantages |
|---|---|---|---|
| Generative | $P(\mathbf{X}, Y)$ | Can generate data, handle missing values, see feature relationships | More parameters to estimate, stronger assumptions |
| Discriminative | $P(Y \mid \mathbf{X})$ | Fewer parameters, often better for pure classification | Can't generate data, no insight into feature distribution |
Neither is universally better—the choice depends on the task and data availability.
Every probabilistic machine learning model—from linear regression to deep neural networks—implicitly or explicitly involves assumptions about joint distributions. Understanding joint distributions unlocks the mathematical reasoning behind ML algorithms.
We've established the foundational framework for reasoning about multiple random variables simultaneously. The key concepts: the joint PMF describes discrete variables, the joint PDF describes continuous variables, the joint CDF unifies both through cumulative probabilities, and the bivariate Gaussian (and its $d$-dimensional generalization) is the workhorse joint distribution in practice.
What's next:
Now that we understand joint distributions as the complete description of multiple variables together, we need tools to extract information from them. The next page covers marginal distributions—how to "integrate out" variables we don't care about to focus on the distribution of a subset.
You now understand joint probability distributions—the mathematical framework for describing the complete probabilistic behavior of multiple random variables. This is the foundation for all subsequent topics in this module: marginal distributions, conditional distributions, covariance, correlation, and independence.