Given a joint distribution over multiple variables, we often need to answer questions about a single variable, ignoring all others. If we know $P(X, Y)$ completely, what is the distribution of $X$ alone?
This operation—marginalization—is one of the most fundamental tools in probability theory. The name comes from the fact that in tabular representations, summing across rows or columns places the result in the "margin" of the table.
The Core Question: Given complete knowledge of how $(X, Y)$ behave together, how do we extract the distribution of $X$ by itself, effectively "forgetting" about $Y$?
By the end of this page, you will master marginalization for both discrete and continuous distributions, understand its computational implications, and see how it connects to machine learning tasks like computing likelihoods and performing inference.
Definition (Marginal PMF):
For discrete random variables $X$ and $Y$ with joint PMF $p_{X,Y}(x, y)$, the marginal PMF of $X$ is obtained by summing over all possible values of $Y$:
$$p_X(x) = \sum_{y} p_{X,Y}(x, y) = \sum_{y} P(X = x, Y = y)$$
Similarly, the marginal PMF of $Y$ is:
$$p_Y(y) = \sum_{x} p_{X,Y}(x, y)$$
Intuition: The event $\{X = x\}$ can occur together with any value of $Y$. The events $\{X = x, Y = y\}$ for different $y$ are mutually exclusive and exhaust all the ways $\{X = x\}$ can happen, so by countable additivity we sum their probabilities.
| | Y = 0 | Y = 1 | Y = 2 | P(X = x) (Marginal) |
|---|---|---|---|---|
| X = 0 | 0.10 | 0.05 | 0.05 | 0.20 |
| X = 1 | 0.20 | 0.15 | 0.10 | 0.45 |
| X = 2 | 0.10 | 0.15 | 0.10 | 0.35 |
| P(Y = y) | 0.40 | 0.35 | 0.25 | 1.00 |
In the table above:
- Each row sum gives a value of the marginal $P(X = x)$, shown in the right margin.
- Each column sum gives a value of the marginal $P(Y = y)$, shown in the bottom margin.
- The bottom-right corner confirms that the total probability is 1.00.
Key Observation: Knowing only the marginals does NOT let you recover the joint. Many different joint distributions can produce the same marginals.
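To make this concrete, here is a small sketch (with made-up numbers) of two joints over $\{0,1\}^2$ — one independent, one perfectly correlated — that share identical marginals:

```python
import numpy as np

# Joint A: X and Y independent, each uniform on {0, 1}
joint_a = np.array([[0.25, 0.25],
                    [0.25, 0.25]])

# Joint B: X and Y perfectly correlated (always equal)
joint_b = np.array([[0.5, 0.0],
                    [0.0, 0.5]])

# Both joints yield the same marginals...
assert np.allclose(joint_a.sum(axis=1), joint_b.sum(axis=1))  # P(X)
assert np.allclose(joint_a.sum(axis=0), joint_b.sum(axis=0))  # P(Y)

# ...yet the joints themselves are very different
assert not np.allclose(joint_a, joint_b)
```

Under joint B, knowing $X$ determines $Y$ exactly; under joint A it tells you nothing — yet the marginals alone cannot distinguish the two.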
```python
import numpy as np

def compute_marginals_discrete():
    """Demonstrate marginalization for discrete distributions."""
    # Joint PMF as a 2D array
    joint_pmf = np.array([
        [0.10, 0.05, 0.05],  # X = 0
        [0.20, 0.15, 0.10],  # X = 1
        [0.10, 0.15, 0.10],  # X = 2
    ])

    print("Joint PMF P(X, Y):")
    print(joint_pmf)
    print(f"Sum = {joint_pmf.sum():.2f}")

    # Marginal of X: sum over Y (columns)
    marginal_x = joint_pmf.sum(axis=1)
    print(f"Marginal P(X): {marginal_x}")

    # Marginal of Y: sum over X (rows)
    marginal_y = joint_pmf.sum(axis=0)
    print(f"Marginal P(Y): {marginal_y}")

    # Verify marginals sum to 1
    print(f"Sum of P(X) = {marginal_x.sum():.2f}")
    print(f"Sum of P(Y) = {marginal_y.sum():.2f}")

compute_marginals_discrete()
```

Definition (Marginal PDF):
For continuous random variables with joint PDF $f_{X,Y}(x, y)$, the marginal PDF of $X$ is obtained by integrating out $Y$:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$$
Similarly for $Y$:
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx$$
Intuition: Integration replaces summation when moving from discrete to continuous. We "integrate out" the variable we want to ignore.
Example: Uniform on the Unit Square
For $(X, Y)$ uniform on $[0, 1] \times [0, 1]$ with $f_{X,Y}(x, y) = 1$:
$$f_X(x) = \int_0^1 1 \, dy = 1 \quad \text{for } x \in [0, 1]$$
So $X \sim \text{Uniform}(0, 1)$. Similarly $Y \sim \text{Uniform}(0, 1)$.
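As a quick numerical sanity check (a sketch using `scipy.integrate.quad`), we can recover this marginal by integrating the joint density out over $y$ at a few sample points:

```python
import numpy as np
from scipy import integrate

# Joint density of the uniform distribution on the unit square:
# f(x, y) = 1 when (x, y) is in [0, 1] x [0, 1], else 0.
def joint(y, x):
    return 1.0 if (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) else 0.0

# Integrate out y at a few x values; each marginal value should be 1
for x in [0.1, 0.5, 0.9]:
    fx, _ = integrate.quad(joint, 0, 1, args=(x,))
    assert abs(fx - 1.0) < 1e-8
```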
Example: Bivariate Gaussian Marginals
For a bivariate Gaussian $(X, Y)$ with means $\mu_X, \mu_Y$, variances $\sigma_X^2, \sigma_Y^2$, and correlation $\rho$:
$$f_X(x) = \frac{1}{\sqrt{2\pi}\sigma_X} \exp\left(-\frac{(x - \mu_X)^2}{2\sigma_X^2}\right)$$
The marginal is Gaussian with the same mean and variance—the correlation $\rho$ disappears completely. This is a key property: marginals don't capture correlation.
Marginalization is a lossy operation. Two very different joint distributions (e.g., perfectly correlated vs. independent) can have identical marginals. The marginal tells you about individual variables but nothing about their relationships.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def visualize_marginals():
    """Visualize joint and marginal distributions for a bivariate Gaussian."""
    # Parameters
    mu = [0, 0]
    rho = 0.7
    cov = [[1, rho], [rho, 1]]

    # Create grid
    x = np.linspace(-4, 4, 200)
    y = np.linspace(-4, 4, 200)
    X, Y = np.meshgrid(x, y)
    pos = np.dstack((X, Y))

    # Joint PDF
    rv = stats.multivariate_normal(mu, cov)
    Z = rv.pdf(pos)

    # Marginals (standard normal for both)
    marginal_x = stats.norm.pdf(x, 0, 1)
    marginal_y = stats.norm.pdf(y, 0, 1)

    # Plotting
    fig = plt.figure(figsize=(10, 10))

    # Joint distribution (center)
    ax_joint = fig.add_axes([0.2, 0.2, 0.6, 0.6])
    ax_joint.contourf(X, Y, Z, levels=20, cmap='Blues')
    ax_joint.set_xlabel('X')
    ax_joint.set_ylabel('Y')
    ax_joint.set_title(f'Joint Distribution (ρ = {rho})')

    # Marginal of X (top)
    ax_marg_x = fig.add_axes([0.2, 0.82, 0.6, 0.15])
    ax_marg_x.fill_between(x, marginal_x, alpha=0.5)
    ax_marg_x.set_xlim(-4, 4)
    ax_marg_x.set_title('Marginal P(X)')
    ax_marg_x.set_xticks([])

    # Marginal of Y (right)
    ax_marg_y = fig.add_axes([0.82, 0.2, 0.15, 0.6])
    ax_marg_y.fill_betweenx(y, marginal_y, alpha=0.5)
    ax_marg_y.set_ylim(-4, 4)
    ax_marg_y.set_title('P(Y)', rotation=-90, pad=20)
    ax_marg_y.set_yticks([])

    plt.savefig('joint_with_marginals.png', dpi=150, bbox_inches='tight')
    plt.show()

visualize_marginals()
```

Marginalization is a direct application of the Law of Total Probability.
Law of Total Probability:
If ${B_i}$ is a partition of the sample space (mutually exclusive, exhaustive events), then for any event $A$:
$$P(A) = \sum_i P(A \cap B_i) = \sum_i P(A | B_i) P(B_i)$$
Applied to Random Variables:
The possible values of $Y$ partition the sample space. Therefore:
$$P(X = x) = \sum_y P(X = x, Y = y)$$
This is exactly the marginalization formula. In the continuous case, the sum becomes an integral.
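We can verify this equivalence numerically on the joint table from earlier: decomposing the joint into $P(X = x \mid Y = y)\,P(Y = y)$ and recombining recovers exactly the marginal obtained by direct row sums.

```python
import numpy as np

# Joint PMF from the table earlier on this page
joint = np.array([[0.10, 0.05, 0.05],
                  [0.20, 0.15, 0.10],
                  [0.10, 0.15, 0.10]])

p_y = joint.sum(axis=0)                # P(Y = y), column sums
p_x_given_y = joint / p_y              # column j holds P(X = x | Y = j)

# Law of total probability: P(X = x) = sum_y P(X = x | Y = y) P(Y = y)
p_x = (p_x_given_y * p_y).sum(axis=1)

# Matches direct marginalization (row sums)
assert np.allclose(p_x, joint.sum(axis=1))
```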
Why This Matters:
This connection reveals that marginalization isn't an arbitrary operation—it's logically necessary from the axioms of probability. It's the unique way to derive individual distributions from joint distributions that remains consistent with probability theory.
Think of marginalization as answering: 'What's the probability of X = x, considering ALL possible values of Y?' You sum/integrate over every possible Y-configuration, weighted by its probability.
For random vectors $\mathbf{X} = (X_1, X_2, \ldots, X_d)$, we can marginalize over any subset of variables.
General Formula (Continuous):
To get the marginal of $(X_1, X_2)$ from the joint of $(X_1, X_2, \ldots, X_d)$:
$$f_{X_1, X_2}(x_1, x_2) = \int \cdots \int f_{X_1, \ldots, X_d}(x_1, \ldots, x_d) \, dx_3 \cdots dx_d$$
Multivariate Gaussian Case:
For a Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the marginal of any subset is also Gaussian with the corresponding sub-vector of $\boldsymbol{\mu}$ and sub-matrix of $\boldsymbol{\Sigma}$.
If $\mathbf{X} = \begin{pmatrix} \mathbf{X}_A \\ \mathbf{X}_B \end{pmatrix}$ with $\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_A \\ \boldsymbol{\mu}_B \end{pmatrix}$ and $\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{AA} & \boldsymbol{\Sigma}_{AB} \\ \boldsymbol{\Sigma}_{BA} & \boldsymbol{\Sigma}_{BB} \end{pmatrix}$
Then: $\mathbf{X}_A \sim \mathcal{N}(\boldsymbol{\mu}_A, \boldsymbol{\Sigma}_{AA})$
No integration required—just extract the relevant blocks!
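A sketch with a made-up 3-D Gaussian verifies the block-extraction rule against Monte Carlo samples: the first two coordinates of samples from the full distribution have exactly the mean sub-vector and covariance sub-matrix of those coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 3-D Gaussian for illustration
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Marginal of (X1, X2): just extract the corresponding blocks
mu_A = mu[:2]
Sigma_AA = Sigma[:2, :2]

# Compare against empirical moments of the first two coordinates
samples = rng.multivariate_normal(mu, Sigma, size=200_000)[:, :2]
assert np.allclose(samples.mean(axis=0), mu_A, atol=0.02)
assert np.allclose(np.cov(samples.T), Sigma_AA, atol=0.05)
```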
Marginalization appears throughout ML, often in computational challenges.
1. Computing Marginal Likelihood (Model Evidence)
In Bayesian inference, the marginal likelihood integrates over all parameter values:
$$P(\mathbf{X}) = \int P(\mathbf{X} \mid \boldsymbol{\theta}) \, P(\boldsymbol{\theta}) \, d\boldsymbol{\theta}$$
This integral is often intractable, motivating approximations like MCMC and variational inference.
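In low dimensions the integral can still be done directly. As a hedged sketch (a made-up Beta-Binomial coin model, where the evidence has a known closed form via the Beta function), numerical integration over the parameter recovers it:

```python
from scipy import integrate, stats
from scipy.special import comb, beta as B

# Made-up example: k = 7 heads in n = 10 flips, Beta(2, 2) prior on theta
n, k = 10, 7
prior = stats.beta(2, 2)

# Marginal likelihood: integrate likelihood x prior over theta
integrand = lambda theta: stats.binom.pmf(k, n, theta) * prior.pdf(theta)
evidence, _ = integrate.quad(integrand, 0, 1)

# Closed form for the Beta-Binomial evidence, for comparison
closed = comb(n, k) * B(k + 2, n - k + 2) / B(2, 2)
assert abs(evidence - closed) < 1e-10
```

For models with many parameters, this integral has no closed form and grid or quadrature methods become infeasible, which is precisely what motivates MCMC and variational approximations.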
2. Latent Variable Models
Models like GMMs, VAEs, and HMMs have latent variables $\mathbf{Z}$. The observed likelihood marginalizes over latents:
$$P(\mathbf{X}) = \sum_{\mathbf{z}} P(\mathbf{X}, \mathbf{Z} = \mathbf{z}) = \sum_{\mathbf{z}} P(\mathbf{X} \mid \mathbf{Z} = \mathbf{z}) \, P(\mathbf{Z} = \mathbf{z})$$
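For a Gaussian mixture this sum is small and explicit. A minimal sketch (a made-up 2-component 1-D GMM) shows the observed density as a marginalization over the component indicator $Z$:

```python
import numpy as np
from scipy import stats

# Made-up 2-component 1-D GMM parameters
weights = np.array([0.3, 0.7])   # P(Z = z), mixing proportions
means = np.array([-2.0, 1.0])
stds = np.array([1.0, 0.5])

def gmm_density(x):
    # Marginalize Z: P(x) = sum_z P(x | Z = z) P(Z = z)
    return np.sum(weights * stats.norm.pdf(x, means, stds))

# The resulting marginal density integrates to 1
grid = np.linspace(-10, 10, 4001)
total = np.trapz([gmm_density(x) for x in grid], grid)
assert abs(total - 1.0) < 1e-6
```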
3. Graphical Models
Variable elimination in Bayesian networks is systematic marginalization—removing variables one by one to answer queries about remaining variables.
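A toy sketch of this idea (a made-up chain $X \to Y \to Z$ with hypothetical conditional probability tables): computing $P(Z)$ by summing out $X$ first, then $Y$, where each elimination step is a marginalization.

```python
import numpy as np

# Made-up CPTs for a chain X -> Y -> Z
p_x = np.array([0.6, 0.4])            # P(X)
p_y_given_x = np.array([[0.7, 0.3],   # P(Y | X), rows indexed by X
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.9, 0.1],   # P(Z | Y), rows indexed by Y
                        [0.5, 0.5]])

# Eliminate X: P(Y = y) = sum_x P(X = x) P(Y = y | X = x)
p_y = np.einsum('x,xy->y', p_x, p_y_given_x)

# Eliminate Y: P(Z = z) = sum_y P(Y = y) P(Z = z | Y = y)
p_z = np.einsum('y,yz->z', p_y, p_z_given_y)

assert np.allclose(p_z, [0.7, 0.3])
```

Eliminating variables one at a time in this way avoids ever materializing the full joint table, which is what makes exact inference feasible in sparse graphical models.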
What's next:
While marginalization lets us ignore variables, often we want to use information about one variable to refine our knowledge of another. The next page covers conditional distributions—the mathematical foundation for prediction in machine learning.
You now understand marginal distributions—how to extract individual variable behavior from joint distributions by summing or integrating over other variables.