Given a joint distribution over multiple variables, we often need to answer questions about a single variable, ignoring all others. If we know $P(X, Y)$ completely, what is the distribution of $X$ alone?
This operation—marginalization—is one of the most fundamental tools in probability theory. The name comes from the fact that in tabular representations, summing across rows or columns places the result in the "margin" of the table.
The Core Question: Given complete knowledge of how $(X, Y)$ behave together, how do we extract the distribution of $X$ by itself, effectively "forgetting" about $Y$?
By the end of this page, you will master marginalization for both discrete and continuous distributions, understand its computational implications, and see how it connects to machine learning tasks like computing likelihoods and performing inference.
Definition (Marginal PMF):
For discrete random variables $X$ and $Y$ with joint PMF $p_{X,Y}(x, y)$, the marginal PMF of $X$ is obtained by summing over all possible values of $Y$:
$$p_X(x) = \sum_{y} p_{X,Y}(x, y) = \sum_{y} P(X = x, Y = y)$$
Similarly, the marginal PMF of $Y$ is:
$$p_Y(y) = \sum_{x} p_{X,Y}(x, y)$$
Intuition: The event $\{X = x\}$ can occur together with any value of $Y$. The events $\{X = x, Y = y\}$ for different $y$ are mutually exclusive and exhaust all the ways $\{X = x\}$ can happen, so by countable additivity we sum their probabilities.
| | Y = 0 | Y = 1 | Y = 2 | P(X = x) (Marginal) |
|---|---|---|---|---|
| X = 0 | 0.10 | 0.05 | 0.05 | 0.20 |
| X = 1 | 0.20 | 0.15 | 0.10 | 0.45 |
| X = 2 | 0.10 | 0.15 | 0.10 | 0.35 |
| P(Y = y) | 0.40 | 0.35 | 0.25 | 1.00 |
In the table above:
- Each row sum gives a value of the marginal $P(X = x)$, shown in the right margin.
- Each column sum gives a value of the marginal $P(Y = y)$, shown in the bottom margin.
- The bottom-right corner confirms that the total probability is 1.00.
Key Observation: Knowing only the marginals does NOT let you recover the joint. Many different joint distributions can produce the same marginals.
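To make this concrete, here is a small sketch (with made-up numbers) of two joints over $\{0,1\}^2$ — one independent, one perfectly correlated — that share identical marginals:

```python
import numpy as np

# Joint A: X and Y independent, each uniform on {0, 1}
joint_a = np.array([[0.25, 0.25],
                    [0.25, 0.25]])

# Joint B: X and Y perfectly correlated (always equal)
joint_b = np.array([[0.5, 0.0],
                    [0.0, 0.5]])

# Both joints yield the same marginals...
assert np.allclose(joint_a.sum(axis=1), joint_b.sum(axis=1))  # P(X)
assert np.allclose(joint_a.sum(axis=0), joint_b.sum(axis=0))  # P(Y)

# ...yet the joints themselves are very different
assert not np.allclose(joint_a, joint_b)
```

Under joint B, knowing $X$ determines $Y$ exactly; under joint A it tells you nothing — yet the marginals alone cannot distinguish the two.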
```python
import numpy as np

def compute_marginals_discrete():
    """Demonstrate marginalization for discrete distributions."""
    # Joint PMF as a 2D array
    joint_pmf = np.array([
        [0.10, 0.05, 0.05],  # X = 0
        [0.20, 0.15, 0.10],  # X = 1
        [0.10, 0.15, 0.10],  # X = 2
    ])

    print("Joint PMF P(X, Y):")
    print(joint_pmf)
    print(f"Sum = {joint_pmf.sum():.2f}")

    # Marginal of X: sum over Y (columns)
    marginal_x = joint_pmf.sum(axis=1)
    print(f"Marginal P(X): {marginal_x}")

    # Marginal of Y: sum over X (rows)
    marginal_y = joint_pmf.sum(axis=0)
    print(f"Marginal P(Y): {marginal_y}")

    # Verify marginals sum to 1
    print(f"Sum of P(X) = {marginal_x.sum():.2f}")
    print(f"Sum of P(Y) = {marginal_y.sum():.2f}")

compute_marginals_discrete()
```

Definition (Marginal PDF):
For continuous random variables with joint PDF $f_{X,Y}(x, y)$, the marginal PDF of $X$ is obtained by integrating out $Y$:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$$
Similarly for $Y$:
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx$$
Intuition: Integration replaces summation when moving from discrete to continuous. We "integrate out" the variable we want to ignore.
Example: Uniform on the Unit Square
For $(X, Y)$ uniform on $[0, 1] \times [0, 1]$ with $f_{X,Y}(x, y) = 1$:
$$f_X(x) = \int_0^1 1 \, dy = 1 \quad \text{for } x \in [0, 1]$$
So $X \sim \text{Uniform}(0, 1)$. Similarly $Y \sim \text{Uniform}(0, 1)$.
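As a quick numerical sanity check (a sketch using `scipy.integrate.quad`), we can recover this marginal by integrating the joint density out over $y$ at a few sample points:

```python
import numpy as np
from scipy import integrate

# Joint density of the uniform distribution on the unit square:
# f(x, y) = 1 when (x, y) is in [0, 1] x [0, 1], else 0.
def joint(y, x):
    return 1.0 if (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) else 0.0

# Integrate out y at a few x values; each marginal value should be 1
for x in [0.1, 0.5, 0.9]:
    fx, _ = integrate.quad(joint, 0, 1, args=(x,))
    assert abs(fx - 1.0) < 1e-8
```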
Example: Bivariate Gaussian Marginals
For a bivariate Gaussian $(X, Y)$ with means $\mu_X, \mu_Y$, variances $\sigma_X^2, \sigma_Y^2$, and correlation $\rho$:
$$f_X(x) = \frac{1}{\sqrt{2\pi}\sigma_X} \exp\left(-\frac{(x - \mu_X)^2}{2\sigma_X^2}\right)$$
The marginal is Gaussian with the same mean and variance—the correlation $\rho$ disappears completely. This is a key property: marginals don't capture correlation.
Marginalization is a lossy operation. Two very different joint distributions (e.g., perfectly correlated vs. independent) can have identical marginals. The marginal tells you about individual variables but nothing about their relationships.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def visualize_marginals():
    """Visualize joint and marginal distributions for a bivariate Gaussian."""
    # Parameters
    mu = [0, 0]
    rho = 0.7
    cov = [[1, rho], [rho, 1]]

    # Create grid
    x = np.linspace(-4, 4, 200)
    y = np.linspace(-4, 4, 200)
    X, Y = np.meshgrid(x, y)
    pos = np.dstack((X, Y))

    # Joint PDF
    rv = stats.multivariate_normal(mu, cov)
    Z = rv.pdf(pos)

    # Marginals (standard normal for both)
    marginal_x = stats.norm.pdf(x, 0, 1)
    marginal_y = stats.norm.pdf(y, 0, 1)

    # Plotting
    fig = plt.figure(figsize=(10, 10))

    # Joint distribution (center)
    ax_joint = fig.add_axes([0.2, 0.2, 0.6, 0.6])
    ax_joint.contourf(X, Y, Z, levels=20, cmap='Blues')
    ax_joint.set_xlabel('X')
    ax_joint.set_ylabel('Y')
    ax_joint.set_title(f'Joint Distribution (ρ = {rho})')

    # Marginal of X (top)
    ax_marg_x = fig.add_axes([0.2, 0.82, 0.6, 0.15])
    ax_marg_x.fill_between(x, marginal_x, alpha=0.5)
    ax_marg_x.set_xlim(-4, 4)
    ax_marg_x.set_title('Marginal P(X)')
    ax_marg_x.set_xticks([])

    # Marginal of Y (right)
    ax_marg_y = fig.add_axes([0.82, 0.2, 0.15, 0.6])
    ax_marg_y.fill_betweenx(y, marginal_y, alpha=0.5)
    ax_marg_y.set_ylim(-4, 4)
    ax_marg_y.set_title('P(Y)', rotation=-90, pad=20)
    ax_marg_y.set_yticks([])

    plt.savefig('joint_with_marginals.png', dpi=150, bbox_inches='tight')
    plt.show()

visualize_marginals()
```

Marginalization is a direct application of the Law of Total Probability.
Law of Total Probability:
If ${B_i}$ is a partition of the sample space (mutually exclusive, exhaustive events), then for any event $A$:
$$P(A) = \sum_i P(A \cap B_i) = \sum_i P(A | B_i) P(B_i)$$
Applied to Random Variables:
The possible values of $Y$ partition the sample space. Therefore:
$$P(X = x) = \sum_y P(X = x, Y = y)$$
This is exactly the marginalization formula. In the continuous case, the sum becomes an integral.
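We can verify this equivalence numerically on the joint table from earlier: decomposing the joint into $P(X = x \mid Y = y)\,P(Y = y)$ and recombining recovers exactly the marginal obtained by direct row sums.

```python
import numpy as np

# Joint PMF from the table earlier on this page
joint = np.array([[0.10, 0.05, 0.05],
                  [0.20, 0.15, 0.10],
                  [0.10, 0.15, 0.10]])

p_y = joint.sum(axis=0)                # P(Y = y), column sums
p_x_given_y = joint / p_y              # column j holds P(X = x | Y = j)

# Law of total probability: P(X = x) = sum_y P(X = x | Y = y) P(Y = y)
p_x = (p_x_given_y * p_y).sum(axis=1)

# Matches direct marginalization (row sums)
assert np.allclose(p_x, joint.sum(axis=1))
```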
Why This Matters:
This connection reveals that marginalization isn't an arbitrary operation—it's logically necessary from the axioms of probability. It's the unique way to derive individual distributions from joint distributions that remains consistent with probability theory.
Think of marginalization as answering: 'What's the probability of X = x, considering ALL possible values of Y?' You sum/integrate over every possible Y-configuration, weighted by its probability.
For random vectors $\mathbf{X} = (X_1, X_2, \ldots, X_d)$, we can marginalize over any subset of variables.
General Formula (Continuous):
To get the marginal of $(X_1, X_2)$ from the joint of $(X_1, X_2, \ldots, X_d)$:
$$f_{X_1, X_2}(x_1, x_2) = \int \cdots \int f_{X_1, \ldots, X_d}(x_1, \ldots, x_d) \, dx_3 \cdots dx_d$$
Multivariate Gaussian Case:
For a Gaussian $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the marginal of any subset is also Gaussian with the corresponding sub-vector of $\boldsymbol{\mu}$ and sub-matrix of $\boldsymbol{\Sigma}$.
If $\mathbf{X} = \begin{pmatrix} \mathbf{X}_A \\ \mathbf{X}_B \end{pmatrix}$ with $\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_A \\ \boldsymbol{\mu}_B \end{pmatrix}$ and $\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{AA} & \boldsymbol{\Sigma}_{AB} \\ \boldsymbol{\Sigma}_{BA} & \boldsymbol{\Sigma}_{BB} \end{pmatrix}$
Then: $\mathbf{X}_A \sim \mathcal{N}(\boldsymbol{\mu}_A, \boldsymbol{\Sigma}_{AA})$
No integration required—just extract the relevant blocks!
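A sketch with a made-up 3-D Gaussian verifies the block-extraction rule against Monte Carlo samples: the first two coordinates of samples from the full distribution have exactly the mean sub-vector and covariance sub-matrix of those coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 3-D Gaussian for illustration
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Marginal of (X1, X2): just extract the corresponding blocks
mu_A = mu[:2]
Sigma_AA = Sigma[:2, :2]

# Compare against empirical moments of the first two coordinates
samples = rng.multivariate_normal(mu, Sigma, size=200_000)[:, :2]
assert np.allclose(samples.mean(axis=0), mu_A, atol=0.02)
assert np.allclose(np.cov(samples.T), Sigma_AA, atol=0.05)
```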
Marginalization appears throughout ML, often in computational challenges.
1. Computing Marginal Likelihood (Model Evidence)
In Bayesian inference, the marginal likelihood integrates over all parameter values:
$$P(\mathbf{X}) = \int P(\mathbf{X} \mid \boldsymbol{\theta}) \, P(\boldsymbol{\theta}) \, d\boldsymbol{\theta}$$
This integral is often intractable, motivating approximations like MCMC and variational inference.
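In low dimensions the integral can still be done directly. As a hedged sketch (a made-up Beta-Binomial coin model, where the evidence has a known closed form via the Beta function), numerical integration over the parameter recovers it:

```python
from scipy import integrate, stats
from scipy.special import comb, beta as B

# Made-up example: k = 7 heads in n = 10 flips, Beta(2, 2) prior on theta
n, k = 10, 7
prior = stats.beta(2, 2)

# Marginal likelihood: integrate likelihood x prior over theta
integrand = lambda theta: stats.binom.pmf(k, n, theta) * prior.pdf(theta)
evidence, _ = integrate.quad(integrand, 0, 1)

# Closed form for the Beta-Binomial evidence, for comparison
closed = comb(n, k) * B(k + 2, n - k + 2) / B(2, 2)
assert abs(evidence - closed) < 1e-10
```

For models with many parameters, this integral has no closed form and grid or quadrature methods become infeasible, which is precisely what motivates MCMC and variational approximations.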
2. Latent Variable Models
Models like GMMs, VAEs, and HMMs have latent variables $\mathbf{Z}$. The observed likelihood marginalizes over latents:
$$P(\mathbf{X}) = \sum_{\mathbf{z}} P(\mathbf{X}, \mathbf{Z} = \mathbf{z}) = \sum_{\mathbf{z}} P(\mathbf{X} \mid \mathbf{Z} = \mathbf{z}) \, P(\mathbf{Z} = \mathbf{z})$$
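For a Gaussian mixture this sum is small and explicit. A minimal sketch (a made-up 2-component 1-D GMM) shows the observed density as a marginalization over the component indicator $Z$:

```python
import numpy as np
from scipy import stats

# Made-up 2-component 1-D GMM parameters
weights = np.array([0.3, 0.7])   # P(Z = z), mixing proportions
means = np.array([-2.0, 1.0])
stds = np.array([1.0, 0.5])

def gmm_density(x):
    # Marginalize Z: P(x) = sum_z P(x | Z = z) P(Z = z)
    return np.sum(weights * stats.norm.pdf(x, means, stds))

# The resulting marginal density integrates to 1
grid = np.linspace(-10, 10, 4001)
total = np.trapz([gmm_density(x) for x in grid], grid)
assert abs(total - 1.0) < 1e-6
```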
3. Graphical Models
Variable elimination in Bayesian networks is systematic marginalization—removing variables one by one to answer queries about remaining variables.
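A toy sketch of this idea (a made-up chain $X \to Y \to Z$ with hypothetical conditional probability tables): computing $P(Z)$ by summing out $X$ first, then $Y$, where each elimination step is a marginalization.

```python
import numpy as np

# Made-up CPTs for a chain X -> Y -> Z
p_x = np.array([0.6, 0.4])            # P(X)
p_y_given_x = np.array([[0.7, 0.3],   # P(Y | X), rows indexed by X
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.9, 0.1],   # P(Z | Y), rows indexed by Y
                        [0.5, 0.5]])

# Eliminate X: P(Y = y) = sum_x P(X = x) P(Y = y | X = x)
p_y = np.einsum('x,xy->y', p_x, p_y_given_x)

# Eliminate Y: P(Z = z) = sum_y P(Y = y) P(Z = z | Y = y)
p_z = np.einsum('y,yz->z', p_y, p_z_given_y)

assert np.allclose(p_z, [0.7, 0.3])
```

Eliminating variables one at a time in this way avoids ever materializing the full joint table, which is what makes exact inference feasible in sparse graphical models.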
What's next:
While marginalization lets us ignore variables, often we want to use information about one variable to refine our knowledge of another. The next page covers conditional distributions—the mathematical foundation for prediction in machine learning.
You now understand marginal distributions—how to extract individual variable behavior from joint distributions by summing or integrating over other variables.