At the core of every machine learning prediction lies a simple question: Given what we observe, what can we infer about what we don't observe?
When a doctor knows a patient's symptoms, what's the distribution of possible diseases? When we know a house's features, what's the distribution of its price? When we see today's stock price, what's the distribution of tomorrow's?
These questions are answered by conditional distributions—the probability distribution of one variable given that we know the value of another. If joint distributions describe complete multivariate behavior, and marginals describe individual variables, then conditionals describe how knowing something changes our beliefs about something else.
By the end of this page, you will master conditional probability distributions for discrete and continuous cases, understand their derivation from joint distributions, and see how they form the mathematical foundation for supervised learning, Bayesian inference, and prediction.
Definition (Conditional PMF):
For discrete random variables $X$ and $Y$, the conditional PMF of $Y$ given $X = x$ is:
$$p_{Y|X}(y | x) = P(Y = y | X = x) = \frac{P(X = x, Y = y)}{P(X = x)} = \frac{p_{X,Y}(x, y)}{p_X(x)}$$
This is defined whenever $p_X(x) > 0$.
Interpretation: Given that $X$ has taken value $x$, what are the probabilities that $Y$ takes each of its possible values?
Properties:
Valid PMF: For each fixed $x$, $p_{Y|X}(\cdot | x)$ is a valid PMF over $y$: $p_{Y|X}(y | x) \geq 0$ and $\sum_y p_{Y|X}(y | x) = 1$.
Relationship to joint: $p_{X,Y}(x, y) = p_{Y|X}(y | x) \cdot p_X(x)$
Chain rule: For multiple variables: $$p(x_1, x_2, x_3) = p(x_1) \cdot p(x_2 | x_1) \cdot p(x_3 | x_1, x_2)$$
Example: Computing Conditionals from a Joint PMF
Consider the following joint PMF, with marginals in the last row and column:
| | Y = 0 | Y = 1 | P(X = x) |
|---|---|---|---|
| X = 0 | 0.12 | 0.08 | 0.20 |
| X = 1 | 0.28 | 0.12 | 0.40 |
| X = 2 | 0.20 | 0.20 | 0.40 |
| P(Y = y) | 0.60 | 0.40 | 1.00 |
To find $P(Y = 0 | X = 1)$:
$$P(Y = 0 | X = 1) = \frac{P(X = 1, Y = 0)}{P(X = 1)} = \frac{0.28}{0.40} = 0.70$$
Similarly: $P(Y = 1 | X = 1) = \frac{0.12}{0.40} = 0.30$
Note: $0.70 + 0.30 = 1$ ✓ (it's a valid distribution over $Y$).
Observation: $P(Y = 0) = 0.60$ unconditionally, but $P(Y = 0 | X = 1) = 0.70$. Knowing $X = 1$ changed our belief about $Y$!
```python
import numpy as np

def compute_conditionals():
    """Compute and verify conditional distributions."""
    # Joint PMF P(X, Y)
    joint = np.array([
        [0.12, 0.08],  # X = 0
        [0.28, 0.12],  # X = 1
        [0.20, 0.20],  # X = 2
    ])

    # Marginals
    p_x = joint.sum(axis=1)  # [0.20, 0.40, 0.40]
    p_y = joint.sum(axis=0)  # [0.60, 0.40]

    print("Joint P(X,Y):")
    print(joint)
    print(f"\nMarginal P(X): {p_x}")
    print(f"Marginal P(Y): {p_y}")

    # Conditional P(Y|X) for each X value
    print("\nConditional P(Y|X):")
    for x in range(3):
        p_y_given_x = joint[x, :] / p_x[x]
        print(f"  P(Y|X={x}) = {p_y_given_x} (sum = {p_y_given_x.sum():.2f})")

    # Verify: P(Y|X) * P(X) = P(X,Y)
    print("\nVerification: P(Y|X) * P(X) should equal P(X,Y)")
    for x in range(3):
        p_y_given_x = joint[x, :] / p_x[x]
        reconstructed = p_y_given_x * p_x[x]
        print(f"  X={x}: joint={joint[x,:]}, reconstructed={reconstructed}")

compute_conditionals()
```
Definition (Conditional PDF):
For continuous random variables with joint PDF $f_{X,Y}(x, y)$ and marginal $f_X(x) > 0$:
$$f_{Y|X}(y | x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$$
This is a PDF in $y$ (for each fixed $x$), meaning: $$\int_{-\infty}^{\infty} f_{Y|X}(y | x) \, dy = 1$$
Technical Note: Unlike the discrete case, where we condition on a point event with positive probability, here $P(X = x) = 0$ for any specific $x$. The conditional PDF is instead defined as a limit of conditioning on a shrinking interval around $x$: $$f_{Y|X}(y | x) = \lim_{\epsilon \to 0} \frac{\partial}{\partial y} P(Y \leq y \mid x \leq X \leq x + \epsilon)$$
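To make the normalization property concrete, here is a minimal sketch (my own illustration, not part of the lesson) using a hypothetical joint density $f(x, y) = x + y$ on the unit square: dividing a slice of the joint by the marginal produces a conditional that integrates to 1 in $y$.

```python
from scipy import integrate

# Hypothetical joint PDF (an assumption for illustration): f(x, y) = x + y
# on the unit square [0, 1] x [0, 1]. It integrates to 1, so it is a valid joint PDF.
def joint_pdf(x, y):
    return x + y

def marginal_x(x):
    """f_X(x) = integral of f(x, y) over y in [0, 1]; analytically x + 1/2."""
    value, _ = integrate.quad(lambda y: joint_pdf(x, y), 0, 1)
    return value

def conditional_pdf(y, x):
    """f_{Y|X}(y | x) = f(x, y) / f_X(x)."""
    return joint_pdf(x, y) / marginal_x(x)

# Each conditional slice should integrate to 1 over y
for x in [0.1, 0.5, 0.9]:
    total, _ = integrate.quad(lambda y: conditional_pdf(y, x), 0, 1)
    print(f"x = {x}: integral of f(y|x) dy = {total:.6f}")  # ≈ 1.000000
```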
Example: Conditional Distribution of Bivariate Gaussian
For bivariate Gaussian $(X, Y)$ with $\mu_X = 0$, $\mu_Y = 0$, $\sigma_X = \sigma_Y = 1$, and correlation $\rho$:
$$Y | X = x \sim \mathcal{N}\left(\rho x,\; 1 - \rho^2\right)$$
Key observations:
- The conditional mean $\mathbb{E}[Y | X = x] = \rho x$ is a linear function of $x$.
- The conditional variance $1 - \rho^2$ does not depend on $x$ and is smaller than the marginal variance of $1$ whenever $\rho \neq 0$: observing $X$ reduces our uncertainty about $Y$.
- When $\rho = 0$, the conditional equals the marginal $\mathcal{N}(0, 1)$: knowing $X$ tells us nothing about $Y$.
This is why linear regression works: the conditional mean of $Y$ given $X$ is a linear function of $X$ when they're jointly Gaussian.
Visualize the conditional as taking a vertical slice through the joint PDF at X = x, then normalizing so it integrates to 1. Each slice is a valid PDF for Y, but its shape changes depending on where you slice.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def visualize_conditional_gaussian():
    """Show how conditional distribution varies with x."""
    rho = 0.7
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Left: Joint distribution with conditional slices
    ax1 = axes[0]
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    cov = [[1, rho], [rho, 1]]
    rv = stats.multivariate_normal([0, 0], cov)
    Z = rv.pdf(np.dstack((X, Y)))
    ax1.contourf(X, Y, Z, levels=20, cmap='Blues', alpha=0.7)

    # Mark conditioning values
    x_vals = [-1.5, 0, 1.5]
    colors = ['red', 'green', 'purple']
    for xv, c in zip(x_vals, colors):
        ax1.axvline(x=xv, color=c, linestyle='--', linewidth=2)
    ax1.set_xlabel('X')
    ax1.set_ylabel('Y')
    ax1.set_title(f'Joint PDF (ρ = {rho})\nVertical lines show conditioning values')

    # Middle: Conditional PDFs for different x values
    ax2 = axes[1]
    y_range = np.linspace(-4, 4, 200)
    for xv, c in zip(x_vals, colors):
        # Conditional: Y|X=x ~ N(rho*x, 1-rho^2)
        cond_mean = rho * xv
        cond_var = 1 - rho**2
        cond_pdf = stats.norm.pdf(y_range, cond_mean, np.sqrt(cond_var))
        ax2.plot(y_range, cond_pdf, color=c, linewidth=2,
                 label=f'P(Y|X={xv}): N({cond_mean:.1f}, {cond_var:.2f})')

    # Marginal for comparison
    ax2.plot(y_range, stats.norm.pdf(y_range, 0, 1), 'k--', linewidth=2,
             label='Marginal P(Y): N(0, 1)')
    ax2.set_xlabel('Y')
    ax2.set_ylabel('Density')
    ax2.set_title('Conditional PDFs vs Marginal')
    ax2.legend(fontsize=9)

    # Right: How conditional mean relates to x (regression line)
    ax3 = axes[2]
    samples = np.random.multivariate_normal([0, 0], cov, 500)
    ax3.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=20)
    x_line = np.linspace(-3, 3, 100)
    ax3.plot(x_line, rho * x_line, 'r-', linewidth=3,
             label=f'E[Y|X] = {rho}X (regression line)')
    ax3.set_xlabel('X')
    ax3.set_ylabel('Y')
    ax3.set_title('Conditional Mean = Regression Line')
    ax3.legend()
    ax3.set_aspect('equal')

    plt.tight_layout()
    plt.savefig('conditional_gaussian.png', dpi=150)
    plt.show()

visualize_conditional_gaussian()
```
Bayes' Theorem relates two different conditional distributions—flipping the direction of conditioning.
For Discrete Variables: $$p_{X|Y}(x | y) = \frac{p_{Y|X}(y | x) \cdot p_X(x)}{p_Y(y)} = \frac{p_{Y|X}(y | x) \cdot p_X(x)}{\sum_{x'} p_{Y|X}(y | x') \cdot p_X(x')}$$
For Continuous Variables: $$f_{X|Y}(x | y) = \frac{f_{Y|X}(y | x) \cdot f_X(x)}{f_Y(y)} = \frac{f_{Y|X}(y | x) \cdot f_X(x)}{\int f_{Y|X}(y | x') \cdot f_X(x') \, dx'}$$
The Bayesian Interpretation:
- $f_X(x)$ is the prior: our belief about $X$ before seeing any data.
- $f_{Y|X}(y | x)$ is the likelihood: how probable the observation $y$ is under each candidate value of $X$.
- $f_{X|Y}(x | y)$ is the posterior: our updated belief about $X$ after observing $Y = y$.
- $f_Y(y)$ is the evidence (marginal likelihood): the normalizing constant in the denominator.
The denominator (evidence) requires summing/integrating over all possible values of X. In high dimensions, this is often intractable, motivating approximate inference methods like MCMC, variational inference, and Laplace approximations.
Example: Medical Diagnosis
Let $D$ = disease (1 = present, 0 = absent) and $T$ = test result (1 = positive, 0 = negative). Suppose the disease prevalence is $P(D = 1) = 0.01$, the test's sensitivity is $P(T = 1 | D = 1) = 0.95$, and its false-positive rate is $P(T = 1 | D = 0) = 0.05$.
Given a positive test, what's the probability of disease?
$$P(D = 1 | T = 1) = \frac{P(T = 1 | D = 1) P(D = 1)}{P(T = 1)}$$
$$P(T = 1) = P(T = 1 | D = 1)P(D = 1) + P(T = 1 | D = 0)P(D = 0) = 0.95(0.01) + 0.05(0.99) = 0.059$$
$$P(D = 1 | T = 1) = \frac{0.95 \times 0.01}{0.059} \approx 0.161$$
Only 16.1% chance of disease despite positive test! The low prior dominates.
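The same calculation is a few lines of code. This is a minimal sketch (the function name is my own) that also shows how strongly the posterior depends on the prior prevalence:

```python
def posterior_disease_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(D=1 | T=1) via Bayes' theorem with the total-probability denominator."""
    # Evidence: P(T=1) = P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    # Posterior: P(D=1|T=1) = P(T=1|D=1)P(D=1) / P(T=1)
    return sensitivity * prevalence / p_positive

# Numbers from the example above
print(posterior_disease_given_positive(0.01, 0.95, 0.05))  # ≈ 0.161

# The prior matters: the same test looks very different at higher prevalence
for prev in [0.001, 0.01, 0.1, 0.5]:
    post = posterior_disease_given_positive(prev, 0.95, 0.05)
    print(f"prevalence = {prev:>5}: P(disease | positive) = {post:.3f}")
```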
The chain rule expresses any joint distribution as a product of conditionals.
For Two Variables: $$P(X, Y) = P(Y | X) P(X) = P(X | Y) P(Y)$$
For Multiple Variables: $$P(X_1, X_2, \ldots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i | X_1, \ldots, X_{i-1})$$
This factorization is always valid—it's an algebraic identity from the definition of conditional probability.
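To see that the factorization is an identity rather than an approximation, here is a small sketch (my own illustration, not from the lesson) that factors a random three-variable joint PMF and rebuilds it exactly from its chain-rule factors:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint PMF over three binary variables (X1, X2, X3)
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Chain-rule factors, each obtained by marginalizing and dividing
p_x1 = joint.sum(axis=(1, 2))                              # P(X1)
p_x2_given_x1 = joint.sum(axis=2) / p_x1[:, None]          # P(X2 | X1)
p_x3_given_x1x2 = joint / joint.sum(axis=2, keepdims=True) # P(X3 | X1, X2)

# Reconstruct the joint: P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2)
reconstructed = (p_x1[:, None, None]
                 * p_x2_given_x1[:, :, None]
                 * p_x3_given_x1x2)

print(np.allclose(joint, reconstructed))  # True: the identity holds exactly
```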
Why This Matters for ML:
Autoregressive Models: Language models like GPT use this directly: $$P(w_1, \ldots, w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) \cdots P(w_n|w_1, \ldots, w_{n-1})$$
Bayesian Networks: Factor the joint according to a DAG structure, where each variable is conditioned only on its parents: $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Parents}(X_i))$$
Sequential Decision Making: In reinforcement learning, trajectories factor as: $$P(s_0, a_0, s_1, a_1, \ldots) = P(s_0) \prod_t P(a_t|s_t) P(s_{t+1}|s_t, a_t)$$
The chain rule isn't just a mathematical identity—it's a modeling recipe. Different orderings and conditional independence assumptions lead to different model architectures. Language models, Bayesian networks, and diffusion models all exploit this factorization creatively.
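As a toy illustration of the autoregressive idea, the sketch below scores a sequence under a first-order (bigram) Markov assumption with made-up probabilities; a real language model conditions on the full history rather than only the previous word.

```python
import numpy as np

# A toy autoregressive "language model" over a 3-word vocabulary
# (entirely hypothetical numbers, chosen only to illustrate the factorization).
vocab = ["the", "cat", "sat"]
p_first = np.array([0.6, 0.2, 0.2])  # P(w1)
p_next = np.array([                  # P(w_t | w_{t-1}), bigram assumption
    [0.1, 0.5, 0.4],  # after "the"
    [0.2, 0.1, 0.7],  # after "cat"
    [0.5, 0.3, 0.2],  # after "sat"
])

def sequence_log_prob(words):
    """log P(w1, ..., wn) = log P(w1) + sum_t log P(w_t | w_{t-1})."""
    idx = [vocab.index(w) for w in words]
    log_p = np.log(p_first[idx[0]])
    for prev, cur in zip(idx[:-1], idx[1:]):
        log_p += np.log(p_next[prev, cur])
    return log_p

print(sequence_log_prob(["the", "cat", "sat"]))  # log(0.6 * 0.5 * 0.7)
```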
Given a conditional distribution, we can compute conditional moments.
Conditional Expectation: $$\mathbb{E}[Y | X = x] = \sum_y y \cdot p_{Y|X}(y | x) \quad \text{(discrete)}$$ $$\mathbb{E}[Y | X = x] = \int y \cdot f_{Y|X}(y | x) \, dy \quad \text{(continuous)}$$
This is a function of $x$, often written as $g(x) = \mathbb{E}[Y | X = x]$.
Key Properties:
Law of Iterated Expectations (Tower Property): $$\mathbb{E}[\mathbb{E}[Y | X]] = \mathbb{E}[Y]$$
The unconditional mean equals the average of conditional means, weighted by the distribution of $X$.
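As a quick check against the discrete joint PMF from earlier on this page, here is a minimal sketch that computes $\mathbb{E}[Y | X = x]$ for each $x$ and verifies the law of iterated expectations numerically:

```python
import numpy as np

# Joint PMF from the discrete example above (rows: X = 0, 1, 2; columns: Y = 0, 1)
joint = np.array([
    [0.12, 0.08],
    [0.28, 0.12],
    [0.20, 0.20],
])
p_x = joint.sum(axis=1)
y_vals = np.array([0, 1])

# Conditional expectation E[Y | X = x] = sum_y y * P(Y = y | X = x)
cond_mean = (joint / p_x[:, None]) @ y_vals
print(cond_mean)                   # [0.4 0.3 0.5]

# Law of Iterated Expectations: sum_x E[Y | X = x] P(X = x) = E[Y]
print(cond_mean @ p_x)             # 0.4
print(y_vals @ joint.sum(axis=0))  # E[Y] directly from the marginal: 0.4
```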
Conditional Variance: $$\text{Var}(Y | X = x) = \mathbb{E}[Y^2 | X = x] - (\mathbb{E}[Y | X = x])^2$$
Law of Total Variance: $$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y | X)] + \text{Var}(\mathbb{E}[Y | X])$$
Total variance = average within-group variance + between-group variance.
```python
import numpy as np

def demonstrate_conditional_expectation():
    """Verify Law of Iterated Expectations via simulation."""
    np.random.seed(42)

    # Generate correlated bivariate normal
    rho = 0.6
    n = 100000
    x = np.random.normal(0, 1, n)
    y = rho * x + np.sqrt(1 - rho**2) * np.random.normal(0, 1, n)

    # Unconditional E[Y]
    uncond_mean = np.mean(y)
    print(f"Unconditional E[Y]: {uncond_mean:.4f} (should be ≈ 0)")

    # Conditional E[Y|X] = rho * X
    conditional_means = rho * x

    # E[E[Y|X]] should equal E[Y]
    mean_of_cond_means = np.mean(conditional_means)
    print(f"E[E[Y|X]]: {mean_of_cond_means:.4f}")

    # Verify Law of Total Variance
    # Var(Y) = E[Var(Y|X)] + Var(E[Y|X])
    var_y = np.var(y)
    # Var(Y|X) = 1 - rho^2 (constant for Gaussian)
    within_var = 1 - rho**2
    # Var(E[Y|X]) = Var(rho*X) = rho^2 * Var(X) = rho^2
    between_var = np.var(conditional_means)
    print(f"\nVar(Y): {var_y:.4f}")
    print(f"E[Var(Y|X)] + Var(E[Y|X]): {within_var + between_var:.4f}")
    print(f"  - Within-group variance: {within_var:.4f}")
    print(f"  - Between-group variance: {between_var:.4f}")

demonstrate_conditional_expectation()
```
Conditional distributions are everywhere in ML—they are prediction.
1. Supervised Learning as Conditional Estimation
Classification and regression both estimate conditional distributions:
- Regression estimates the conditional mean $\mathbb{E}[Y | \mathbf{X} = \mathbf{x}]$ (or, in probabilistic variants, the full conditional density $p(y | \mathbf{x})$).
- Classification estimates the conditional class probabilities $P(Y = k | \mathbf{X} = \mathbf{x})$.
Neural networks with softmax outputs estimate $P(Y | \mathbf{X})$ directly.
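As a minimal sketch with made-up logits, the softmax turns a vector of raw scores for one input into a valid conditional PMF over the classes:

```python
import numpy as np

def softmax(logits):
    """Convert raw network scores into a conditional distribution P(Y | x)."""
    z = logits - logits.max()  # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits a classifier might output for one input x (made-up numbers)
logits = np.array([2.0, 0.5, -1.0])
p_y_given_x = softmax(logits)
print(p_y_given_x, p_y_given_x.sum())  # non-negative entries, sum ≈ 1
```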
2. Generative Modeling
Generative models build complex joint distributions out of simpler conditionals, exploiting the chain-rule factorization described above.
3. Sequence Modeling
Language models, time-series forecasters, and reinforcement learning agents all predict the next element from its history via conditionals of the form $P(x_t | x_1, \ldots, x_{t-1})$.
What's next:
Now that we understand how variables relate through conditional distributions, we turn to quantifying the strength and nature of these relationships. The next page covers covariance and correlation—numerical summaries of how two variables move together.
You now understand conditional distributions—the mathematical foundation for prediction and inference in machine learning. Every supervised learning model is, at its core, estimating a conditional distribution.