At the core of every machine learning prediction lies a simple question: Given what we observe, what can we infer about what we don't observe?
When a doctor knows a patient's symptoms, what's the distribution of possible diseases? When we know a house's features, what's the distribution of its price? When we see today's stock price, what's the distribution of tomorrow's?
These questions are answered by conditional distributions—the probability distribution of one variable given that we know the value of another. If joint distributions describe complete multivariate behavior, and marginals describe individual variables, then conditionals describe how knowing something changes our beliefs about something else.
By the end of this page, you will master conditional probability distributions for discrete and continuous cases, understand their derivation from joint distributions, and see how they form the mathematical foundation for supervised learning, Bayesian inference, and prediction.
Definition (Conditional PMF):
For discrete random variables $X$ and $Y$, the conditional PMF of $Y$ given $X = x$ is:
$$p_{Y|X}(y | x) = P(Y = y | X = x) = \frac{P(X = x, Y = y)}{P(X = x)} = \frac{p_{X,Y}(x, y)}{p_X(x)}$$
This is defined whenever $p_X(x) > 0$.
Interpretation: Given that $X$ has taken value $x$, what are the probabilities that $Y$ takes each of its possible values?
Properties:
Valid PMF: For each fixed $x$, $p_{Y|X}(\cdot | x)$ is a valid PMF over $y$: $p_{Y|X}(y | x) \geq 0$ and $\sum_y p_{Y|X}(y | x) = 1$.
Relationship to joint: $p_{X,Y}(x, y) = p_{Y|X}(y | x) \cdot p_X(x)$
Chain rule: For multiple variables: $$p(x_1, x_2, x_3) = p(x_1) \cdot p(x_2 | x_1) \cdot p(x_3 | x_1, x_2)$$
Example: Computing Conditionals from a Joint PMF
Consider the following joint PMF, with marginals in the last row and column:
| | Y = 0 | Y = 1 | P(X = x) |
|---|---|---|---|
| X = 0 | 0.12 | 0.08 | 0.20 |
| X = 1 | 0.28 | 0.12 | 0.40 |
| X = 2 | 0.20 | 0.20 | 0.40 |
| P(Y = y) | 0.60 | 0.40 | 1.00 |
To find $P(Y = 0 | X = 1)$:
$$P(Y = 0 | X = 1) = \frac{P(X = 1, Y = 0)}{P(X = 1)} = \frac{0.28}{0.40} = 0.70$$
Similarly: $P(Y = 1 | X = 1) = \frac{0.12}{0.40} = 0.30$
Note: $0.70 + 0.30 = 1$ ✓ (it's a valid distribution over $Y$).
Observation: $P(Y = 0) = 0.60$ unconditionally, but $P(Y = 0 | X = 1) = 0.70$. Knowing $X = 1$ changed our belief about $Y$!
```python
import numpy as np

def compute_conditionals():
    """Compute and verify conditional distributions."""
    # Joint PMF P(X, Y)
    joint = np.array([
        [0.12, 0.08],  # X = 0
        [0.28, 0.12],  # X = 1
        [0.20, 0.20],  # X = 2
    ])

    # Marginals
    p_x = joint.sum(axis=1)  # [0.20, 0.40, 0.40]
    p_y = joint.sum(axis=0)  # [0.60, 0.40]

    print("Joint P(X,Y):")
    print(joint)
    print(f"\nMarginal P(X): {p_x}")
    print(f"Marginal P(Y): {p_y}")

    # Conditional P(Y|X) for each X value
    print("\nConditional P(Y|X):")
    for x in range(3):
        p_y_given_x = joint[x, :] / p_x[x]
        print(f"  P(Y|X={x}) = {p_y_given_x} (sum = {p_y_given_x.sum():.2f})")

    # Verify: P(Y|X) * P(X) = P(X,Y)
    print("\nVerification: P(Y|X) * P(X) should equal P(X,Y)")
    for x in range(3):
        p_y_given_x = joint[x, :] / p_x[x]
        reconstructed = p_y_given_x * p_x[x]
        print(f"  X={x}: joint={joint[x,:]}, reconstructed={reconstructed}")

compute_conditionals()
```
Definition (Conditional PDF):
For continuous random variables with joint PDF $f_{X,Y}(x, y)$ and marginal $f_X(x) > 0$:
$$f_{Y|X}(y | x) = \frac{f_{X,Y}(x, y)}{f_X(x)}$$
This is a PDF in $y$ (for each fixed $x$), meaning: $$\int_{-\infty}^{\infty} f_{Y|X}(y | x) \, dy = 1$$
Technical Note: Unlike the discrete case, where we condition on a point event with positive probability, here $P(X = x) = 0$ for any specific $x$. The conditional PDF is instead defined as a limit of conditioning on a shrinking interval around $x$: $$f_{Y|X}(y | x) = \lim_{\epsilon \to 0} \frac{\partial}{\partial y} P(Y \leq y \mid x \leq X \leq x + \epsilon)$$
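To make the normalization property concrete, here is a minimal sketch (my own illustration, not part of the lesson) using a hypothetical joint density $f(x, y) = x + y$ on the unit square: dividing a slice of the joint by the marginal produces a conditional that integrates to 1 in $y$.

```python
from scipy import integrate

# Hypothetical joint PDF (an assumption for illustration): f(x, y) = x + y
# on the unit square [0, 1] x [0, 1]. It integrates to 1, so it is a valid joint PDF.
def joint_pdf(x, y):
    return x + y

def marginal_x(x):
    """f_X(x) = integral of f(x, y) over y in [0, 1]; analytically x + 1/2."""
    value, _ = integrate.quad(lambda y: joint_pdf(x, y), 0, 1)
    return value

def conditional_pdf(y, x):
    """f_{Y|X}(y | x) = f(x, y) / f_X(x)."""
    return joint_pdf(x, y) / marginal_x(x)

# Each conditional slice should integrate to 1 over y
for x in [0.1, 0.5, 0.9]:
    total, _ = integrate.quad(lambda y: conditional_pdf(y, x), 0, 1)
    print(f"x = {x}: integral of f(y|x) dy = {total:.6f}")  # ≈ 1.000000
```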
Example: Conditional Distribution of Bivariate Gaussian
For bivariate Gaussian $(X, Y)$ with $\mu_X = 0$, $\mu_Y = 0$, $\sigma_X = \sigma_Y = 1$, and correlation $\rho$:
$$Y | X = x \sim \mathcal{N}\left(\rho x,\; 1 - \rho^2\right)$$
Key observations:
- The conditional mean $\mathbb{E}[Y | X = x] = \rho x$ is a linear function of $x$.
- The conditional variance $1 - \rho^2$ does not depend on $x$ and is smaller than the marginal variance of $1$ whenever $\rho \neq 0$: observing $X$ reduces our uncertainty about $Y$.
- When $\rho = 0$, the conditional equals the marginal $\mathcal{N}(0, 1)$: knowing $X$ tells us nothing about $Y$.
This is why linear regression works: the conditional mean of $Y$ given $X$ is a linear function of $X$ when they're jointly Gaussian.
Visualize the conditional as taking a vertical slice through the joint PDF at X = x, then normalizing so it integrates to 1. Each slice is a valid PDF for Y, but its shape changes depending on where you slice.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def visualize_conditional_gaussian():
    """Show how conditional distribution varies with x."""
    rho = 0.7
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    # Left: Joint distribution with conditional slices
    ax1 = axes[0]
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    cov = [[1, rho], [rho, 1]]
    rv = stats.multivariate_normal([0, 0], cov)
    Z = rv.pdf(np.dstack((X, Y)))
    ax1.contourf(X, Y, Z, levels=20, cmap='Blues', alpha=0.7)

    # Mark conditioning values
    x_vals = [-1.5, 0, 1.5]
    colors = ['red', 'green', 'purple']
    for xv, c in zip(x_vals, colors):
        ax1.axvline(x=xv, color=c, linestyle='--', linewidth=2)
    ax1.set_xlabel('X')
    ax1.set_ylabel('Y')
    ax1.set_title(f'Joint PDF (ρ = {rho})\nVertical lines show conditioning values')

    # Middle: Conditional PDFs for different x values
    ax2 = axes[1]
    y_range = np.linspace(-4, 4, 200)
    for xv, c in zip(x_vals, colors):
        # Conditional: Y|X=x ~ N(rho*x, 1-rho^2)
        cond_mean = rho * xv
        cond_var = 1 - rho**2
        cond_pdf = stats.norm.pdf(y_range, cond_mean, np.sqrt(cond_var))
        ax2.plot(y_range, cond_pdf, color=c, linewidth=2,
                 label=f'P(Y|X={xv}): N({cond_mean:.1f}, {cond_var:.2f})')

    # Marginal for comparison
    ax2.plot(y_range, stats.norm.pdf(y_range, 0, 1), 'k--', linewidth=2,
             label='Marginal P(Y): N(0, 1)')
    ax2.set_xlabel('Y')
    ax2.set_ylabel('Density')
    ax2.set_title('Conditional PDFs vs Marginal')
    ax2.legend(fontsize=9)

    # Right: How conditional mean relates to x (regression line)
    ax3 = axes[2]
    samples = np.random.multivariate_normal([0, 0], cov, 500)
    ax3.scatter(samples[:, 0], samples[:, 1], alpha=0.3, s=20)
    x_line = np.linspace(-3, 3, 100)
    ax3.plot(x_line, rho * x_line, 'r-', linewidth=3,
             label=f'E[Y|X] = {rho}X (regression line)')
    ax3.set_xlabel('X')
    ax3.set_ylabel('Y')
    ax3.set_title('Conditional Mean = Regression Line')
    ax3.legend()
    ax3.set_aspect('equal')

    plt.tight_layout()
    plt.savefig('conditional_gaussian.png', dpi=150)
    plt.show()

visualize_conditional_gaussian()
```
Bayes' Theorem relates two different conditional distributions—flipping the direction of conditioning.
For Discrete Variables: $$p_{X|Y}(x | y) = \frac{p_{Y|X}(y | x) \cdot p_X(x)}{p_Y(y)} = \frac{p_{Y|X}(y | x) \cdot p_X(x)}{\sum_{x'} p_{Y|X}(y | x') \cdot p_X(x')}$$
For Continuous Variables: $$f_{X|Y}(x | y) = \frac{f_{Y|X}(y | x) \cdot f_X(x)}{f_Y(y)} = \frac{f_{Y|X}(y | x) \cdot f_X(x)}{\int f_{Y|X}(y | x') \cdot f_X(x') \, dx'}$$
The Bayesian Interpretation:
- $f_X(x)$ is the prior: our belief about $X$ before seeing any data.
- $f_{Y|X}(y | x)$ is the likelihood: how probable the observation $y$ is under each candidate value of $X$.
- $f_{X|Y}(x | y)$ is the posterior: our updated belief about $X$ after observing $Y = y$.
- $f_Y(y)$ is the evidence (marginal likelihood): the normalizing constant in the denominator.
The denominator (evidence) requires summing/integrating over all possible values of X. In high dimensions, this is often intractable, motivating approximate inference methods like MCMC, variational inference, and Laplace approximations.
Example: Medical Diagnosis
Let $D$ = disease (1 = present, 0 = absent) and $T$ = test result (1 = positive, 0 = negative). Suppose the disease prevalence is $P(D = 1) = 0.01$, the test's sensitivity is $P(T = 1 | D = 1) = 0.95$, and its false-positive rate is $P(T = 1 | D = 0) = 0.05$.
Given a positive test, what's the probability of disease?
$$P(D = 1 | T = 1) = \frac{P(T = 1 | D = 1) P(D = 1)}{P(T = 1)}$$
$$P(T = 1) = P(T = 1 | D = 1)P(D = 1) + P(T = 1 | D = 0)P(D = 0) = 0.95(0.01) + 0.05(0.99) = 0.059$$
$$P(D = 1 | T = 1) = \frac{0.95 \times 0.01}{0.059} \approx 0.161$$
Only 16.1% chance of disease despite positive test! The low prior dominates.
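The same calculation is a few lines of code. This is a minimal sketch (the function name is my own) that also shows how strongly the posterior depends on the prior prevalence:

```python
def posterior_disease_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(D=1 | T=1) via Bayes' theorem with the total-probability denominator."""
    # Evidence: P(T=1) = P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    # Posterior: P(D=1|T=1) = P(T=1|D=1)P(D=1) / P(T=1)
    return sensitivity * prevalence / p_positive

# Numbers from the example above
print(posterior_disease_given_positive(0.01, 0.95, 0.05))  # ≈ 0.161

# The prior matters: the same test looks very different at higher prevalence
for prev in [0.001, 0.01, 0.1, 0.5]:
    post = posterior_disease_given_positive(prev, 0.95, 0.05)
    print(f"prevalence = {prev:>5}: P(disease | positive) = {post:.3f}")
```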
The chain rule expresses any joint distribution as a product of conditionals.
For Two Variables: $$P(X, Y) = P(Y | X) P(X) = P(X | Y) P(Y)$$
For Multiple Variables: $$P(X_1, X_2, \ldots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i | X_1, \ldots, X_{i-1})$$
This factorization is always valid—it's an algebraic identity from the definition of conditional probability.
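To see that the factorization is an identity rather than an approximation, here is a small sketch (my own illustration, not from the lesson) that factors a random three-variable joint PMF and rebuilds it exactly from its chain-rule factors:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint PMF over three binary variables (X1, X2, X3)
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Chain-rule factors, each obtained by marginalizing and dividing
p_x1 = joint.sum(axis=(1, 2))                              # P(X1)
p_x2_given_x1 = joint.sum(axis=2) / p_x1[:, None]          # P(X2 | X1)
p_x3_given_x1x2 = joint / joint.sum(axis=2, keepdims=True) # P(X3 | X1, X2)

# Reconstruct the joint: P(x1, x2, x3) = P(x1) P(x2 | x1) P(x3 | x1, x2)
reconstructed = (p_x1[:, None, None]
                 * p_x2_given_x1[:, :, None]
                 * p_x3_given_x1x2)

print(np.allclose(joint, reconstructed))  # True: the identity holds exactly
```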
Why This Matters for ML:
Autoregressive Models: Language models like GPT use this directly: $$P(w_1, \ldots, w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) \cdots P(w_n|w_1, \ldots, w_{n-1})$$
Bayesian Networks: Factor the joint according to a DAG structure, where each variable is conditioned only on its parents: $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Parents}(X_i))$$
Sequential Decision Making: In reinforcement learning, trajectories factor as: $$P(s_0, a_0, s_1, a_1, \ldots) = P(s_0) \prod_t P(a_t|s_t) P(s_{t+1}|s_t, a_t)$$
The chain rule isn't just a mathematical identity—it's a modeling recipe. Different orderings and conditional independence assumptions lead to different model architectures. Language models, Bayesian networks, and diffusion models all exploit this factorization creatively.
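As a toy illustration of the autoregressive idea, the sketch below scores a sequence under a first-order (bigram) Markov assumption with made-up probabilities; a real language model conditions on the full history rather than only the previous word.

```python
import numpy as np

# A toy autoregressive "language model" over a 3-word vocabulary
# (entirely hypothetical numbers, chosen only to illustrate the factorization).
vocab = ["the", "cat", "sat"]
p_first = np.array([0.6, 0.2, 0.2])  # P(w1)
p_next = np.array([                  # P(w_t | w_{t-1}), bigram assumption
    [0.1, 0.5, 0.4],  # after "the"
    [0.2, 0.1, 0.7],  # after "cat"
    [0.5, 0.3, 0.2],  # after "sat"
])

def sequence_log_prob(words):
    """log P(w1, ..., wn) = log P(w1) + sum_t log P(w_t | w_{t-1})."""
    idx = [vocab.index(w) for w in words]
    log_p = np.log(p_first[idx[0]])
    for prev, cur in zip(idx[:-1], idx[1:]):
        log_p += np.log(p_next[prev, cur])
    return log_p

print(sequence_log_prob(["the", "cat", "sat"]))  # log(0.6 * 0.5 * 0.7)
```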
Given a conditional distribution, we can compute conditional moments.
Conditional Expectation: $$\mathbb{E}[Y | X = x] = \sum_y y \cdot p_{Y|X}(y | x) \quad \text{(discrete)}$$ $$\mathbb{E}[Y | X = x] = \int y \cdot f_{Y|X}(y | x) \, dy \quad \text{(continuous)}$$
This is a function of $x$, often written as $g(x) = \mathbb{E}[Y | X = x]$.
Key Properties:
Law of Iterated Expectations (Tower Property): $$\mathbb{E}[\mathbb{E}[Y | X]] = \mathbb{E}[Y]$$
The unconditional mean equals the average of conditional means, weighted by the distribution of $X$.
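As a quick check against the discrete joint PMF from earlier on this page, here is a minimal sketch that computes $\mathbb{E}[Y | X = x]$ for each $x$ and verifies the law of iterated expectations numerically:

```python
import numpy as np

# Joint PMF from the discrete example above (rows: X = 0, 1, 2; columns: Y = 0, 1)
joint = np.array([
    [0.12, 0.08],
    [0.28, 0.12],
    [0.20, 0.20],
])
p_x = joint.sum(axis=1)
y_vals = np.array([0, 1])

# Conditional expectation E[Y | X = x] = sum_y y * P(Y = y | X = x)
cond_mean = (joint / p_x[:, None]) @ y_vals
print(cond_mean)                   # [0.4 0.3 0.5]

# Law of Iterated Expectations: sum_x E[Y | X = x] P(X = x) = E[Y]
print(cond_mean @ p_x)             # 0.4
print(y_vals @ joint.sum(axis=0))  # E[Y] directly from the marginal: 0.4
```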
Conditional Variance: $$\text{Var}(Y | X = x) = \mathbb{E}[Y^2 | X = x] - (\mathbb{E}[Y | X = x])^2$$
Law of Total Variance: $$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y | X)] + \text{Var}(\mathbb{E}[Y | X])$$
Total variance = average within-group variance + between-group variance.
```python
import numpy as np

def demonstrate_conditional_expectation():
    """Verify Law of Iterated Expectations via simulation."""
    np.random.seed(42)

    # Generate correlated bivariate normal
    rho = 0.6
    n = 100000
    x = np.random.normal(0, 1, n)
    y = rho * x + np.sqrt(1 - rho**2) * np.random.normal(0, 1, n)

    # Unconditional E[Y]
    uncond_mean = np.mean(y)
    print(f"Unconditional E[Y]: {uncond_mean:.4f} (should be ≈ 0)")

    # Conditional E[Y|X] = rho * X
    conditional_means = rho * x

    # E[E[Y|X]] should equal E[Y]
    mean_of_cond_means = np.mean(conditional_means)
    print(f"E[E[Y|X]]: {mean_of_cond_means:.4f}")

    # Verify Law of Total Variance
    # Var(Y) = E[Var(Y|X)] + Var(E[Y|X])
    var_y = np.var(y)
    # Var(Y|X) = 1 - rho^2 (constant for Gaussian)
    within_var = 1 - rho**2
    # Var(E[Y|X]) = Var(rho*X) = rho^2 * Var(X) = rho^2
    between_var = np.var(conditional_means)
    print(f"\nVar(Y): {var_y:.4f}")
    print(f"E[Var(Y|X)] + Var(E[Y|X]): {within_var + between_var:.4f}")
    print(f"  - Within-group variance: {within_var:.4f}")
    print(f"  - Between-group variance: {between_var:.4f}")

demonstrate_conditional_expectation()
```
Conditional distributions are everywhere in ML—they are prediction.
1. Supervised Learning as Conditional Estimation
Classification and regression both estimate conditional distributions:
- Regression estimates the conditional mean $\mathbb{E}[Y | \mathbf{X} = \mathbf{x}]$ (or, in probabilistic variants, the full conditional density $p(y | \mathbf{x})$).
- Classification estimates the conditional class probabilities $P(Y = k | \mathbf{X} = \mathbf{x})$.
Neural networks with softmax outputs estimate $P(Y | \mathbf{X})$ directly.
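As a minimal sketch with made-up logits, the softmax turns a vector of raw scores for one input into a valid conditional PMF over the classes:

```python
import numpy as np

def softmax(logits):
    """Convert raw network scores into a conditional distribution P(Y | x)."""
    z = logits - logits.max()  # subtract max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits a classifier might output for one input x (made-up numbers)
logits = np.array([2.0, 0.5, -1.0])
p_y_given_x = softmax(logits)
print(p_y_given_x, p_y_given_x.sum())  # non-negative entries, sum ≈ 1
```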
2. Generative Modeling
Generative models build complex joint distributions out of simpler conditionals, exploiting the chain-rule factorization described above.
3. Sequence Modeling
Language models, time-series forecasters, and reinforcement learning agents all predict the next element from its history via conditionals of the form $P(x_t | x_1, \ldots, x_{t-1})$.
What's next:
Now that we understand how variables relate through conditional distributions, we turn to quantifying the strength and nature of these relationships. The next page covers covariance and correlation—numerical summaries of how two variables move together.
You now understand conditional distributions—the mathematical foundation for prediction and inference in machine learning. Every supervised learning model is, at its core, estimating a conditional distribution.