We've seen that zero correlation means no linear relationship—but variables can still be strongly dependent nonlinearly. Independence is the strongest possible statement about two variables not affecting each other: knowing one tells you absolutely nothing about the other.
Independence is both a theoretical cornerstone and a practical assumption. When we claim data points are i.i.d. (independent and identically distributed), we're invoking independence. When Naive Bayes assumes features are conditionally independent given the class, that's an independence assumption. When we split data into train/test sets, we need them to behave independently.
Understanding independence—when it holds, when it fails, and what it implies—is essential for correctly applying machine learning methods.
By the end of this page, you will master the formal definition of independence, understand conditional independence, test for independence, and appreciate why independence assumptions are both powerful and dangerous in machine learning.
Definition (Independence of Random Variables):
Random variables $X$ and $Y$ are independent, written $X \perp Y$, if and only if:
$$P(X \in A, Y \in B) = P(X \in A) \cdot P(Y \in B)$$
for all measurable sets $A$ and $B$.
Equivalent Characterizations:
Joint = Product of Marginals (Discrete): $$p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y) \quad \text{for all } x, y$$
Joint = Product of Marginals (Continuous): $$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } x, y$$
CDF Factorization: $$F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y)$$
Conditional = Marginal: $$P(Y = y | X = x) = P(Y = y) \quad \text{for all } x, y$$
This last one captures the intuition: knowing $X$'s value doesn't change our belief about $Y$.
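As a minimal sketch of these characterizations, the snippet below builds a small discrete joint pmf as the outer product of made-up marginals (so independence holds by construction) and verifies that every conditional $P(Y \mid X = x)$ equals the marginal $P(Y)$.

```python
import numpy as np

# Made-up marginals; the joint is built as their outer product,
# so X and Y are independent by construction.
p_x = np.array([0.4, 0.6])
p_y = np.array([0.2, 0.5, 0.3])
joint = np.outer(p_x, p_y)          # p(x, y) = p(x) * p(y)

# Marginals recovered by summing out the other variable
marg_x = joint.sum(axis=1)          # recovers p_x
marg_y = joint.sum(axis=0)          # recovers p_y

# Conditional P(Y | X = x) equals the marginal P(Y) for every x
for x in range(len(p_x)):
    cond = joint[x] / marg_x[x]
    print(f"P(Y | X={x}) = {np.round(cond, 3)},  P(Y) = {np.round(marg_y, 3)}")
```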
Implications of Independence:
If $X \perp Y$, then:
- $E[XY] = E[X]\,E[Y]$
- $\text{Cov}(X, Y) = 0$, and hence $\rho_{XY} = 0$
- $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
Critical Note: These are all necessary but not sufficient conditions. Zero covariance does NOT imply independence. The factorization of the joint distribution is the complete, definitive test.
Uncorrelated means Cov(X,Y) = 0. Independent means the entire joint distribution factors. Independence ⟹ Uncorrelated, but Uncorrelated ⟹̸ Independent. The one notable exception: for jointly Gaussian variables (a multivariate normal vector), uncorrelated DOES imply independent.
```python
import numpy as np

def test_independence():
    """Demonstrate independence vs uncorrelated."""
    np.random.seed(42)
    n = 10000

    # Example 1: Independent variables
    x1 = np.random.normal(0, 1, n)
    y1 = np.random.normal(0, 1, n)

    print("=== Independent Variables ===")
    print(f"Correlation: {np.corrcoef(x1, y1)[0, 1]:.4f}")
    print(f"E[XY]: {np.mean(x1 * y1):.4f}")
    print(f"E[X]*E[Y]: {np.mean(x1) * np.mean(y1):.4f}")

    # Example 2: Dependent but uncorrelated (Y = X²)
    x2 = np.random.normal(0, 1, n)
    y2 = x2**2

    print("\n=== Y = X² (Dependent but Uncorrelated) ===")
    print(f"Correlation: {np.corrcoef(x2, y2)[0, 1]:.4f}")
    print("But Y is completely determined by X!")

    # Check whether the joint factors (it shouldn't for y = x²):
    # discretize into quartiles and compare joint vs product of marginals
    x_bins = np.percentile(x2, [0, 25, 50, 75, 100])
    y_bins = np.percentile(y2, [0, 25, 50, 75, 100])
    x_disc = np.digitize(x2, x_bins[:-1])
    y_disc = np.digitize(y2, y_bins[:-1])

    # Compute joint and marginal distributions
    joint = np.zeros((4, 4))
    for i in range(n):
        joint[x_disc[i] - 1, y_disc[i] - 1] += 1
    joint /= n
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)

    # If independent, the joint should equal the outer product of marginals
    product = np.outer(p_x, p_y)
    print(f"\nMax |P(X,Y) - P(X)P(Y)|: {np.max(np.abs(joint - product)):.4f}")
    print("(Should be ~0 if independent, larger if dependent)")

test_independence()
```

Definition (Conditional Independence):
$X$ and $Y$ are conditionally independent given $Z$, written $X \perp Y | Z$, if:
$$P(X, Y | Z) = P(X | Z) \cdot P(Y | Z)$$
for all values of $Z$.
Equivalently: $$P(X | Y, Z) = P(X | Z)$$
Once we know $Z$, additional knowledge of $Y$ doesn't change our belief about $X$.
Key Points:
Conditional independence ≠ marginal independence: neither property implies the other. Variables can be marginally dependent yet conditionally independent given a third variable, or marginally independent yet conditionally dependent once a common effect is observed. The two examples below, simulated in the code that follows, illustrate each direction.
Example: Common Cause
Rain ($Z$) causes both wet grass ($X$) and a wet sidewalk ($Y$). Marginally, $X$ and $Y$ are strongly dependent: seeing wet grass makes a wet sidewalk more likely. But once we condition on whether it rained, the dependence disappears: $X \perp Y \mid Z$ even though $X \not\perp Y$.
Example: Common Effect (Explaining Away)
An earthquake ($X$) and a burglary ($Y$) can each trigger an alarm ($Z$). Marginally, $X \perp Y$. But once we observe the alarm, they become dependent: learning that an earthquake occurred "explains away" the alarm and makes a burglary less likely, so $X \not\perp Y \mid Z$.
Naive Bayes classifiers assume features are conditionally independent given the class label: P(X₁, ..., Xₙ | Y) = ∏ P(Xᵢ | Y). This is almost always wrong, but often works well in practice because the conditional dependencies are weak or cancel out.
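To make that factorization concrete, here is a toy Bernoulli Naive Bayes sketch; the class priors and per-feature conditionals below are invented for illustration, not estimated from data.

```python
import numpy as np

# Toy Bernoulli Naive Bayes with made-up parameters (for illustration only)
p_y = np.array([0.6, 0.4])                    # P(Y = 0), P(Y = 1)
p_xi_given_y = np.array([[0.1, 0.7, 0.4],     # P(X_i = 1 | Y = 0)
                         [0.8, 0.2, 0.5]])    # P(X_i = 1 | Y = 1)

x = np.array([1, 0, 1])                       # observed binary feature vector

# Naive Bayes assumption: P(X | Y) = prod_i P(X_i | Y)
likelihood = np.prod(np.where(x == 1, p_xi_given_y, 1 - p_xi_given_y), axis=1)

# Bayes' rule, normalized over the two classes
posterior = likelihood * p_y
posterior /= posterior.sum()
print(f"P(Y | X) = {np.round(posterior, 3)}")
```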
```python
import numpy as np

def demonstrate_conditional_independence():
    """Demonstrate conditional vs marginal independence."""
    np.random.seed(42)
    n = 10000

    # Common cause: Z -> X and Z -> Y
    # Z = latent cause, X and Y are effects
    z = np.random.binomial(1, 0.5, n)             # Rain: 0 or 1
    x = np.random.binomial(1, 0.1 + 0.8 * z, n)   # Wet grass
    y = np.random.binomial(1, 0.05 + 0.9 * z, n)  # Wet sidewalk

    print("=== Common Cause: Rain -> {Grass, Sidewalk} ===")

    # Marginal dependence
    marginal_corr = np.corrcoef(x, y)[0, 1]
    print(f"Marginal corr(X, Y): {marginal_corr:.4f}")

    # Conditional independence given Z
    for z_val in [0, 1]:
        mask = z == z_val
        if mask.sum() > 100:
            cond_corr = np.corrcoef(x[mask], y[mask])[0, 1]
            print(f"Conditional correlation given Z={z_val}: {cond_corr:.4f}")

    # Common effect: X -> Z <- Y (collider)
    print("\n=== Common Effect (Collider): Earthquake, Burglary -> Alarm ===")
    earthquake = np.random.binomial(1, 0.01, n)   # Rare earthquake
    burglary = np.random.binomial(1, 0.02, n)     # Rare burglary

    # Alarm goes off if either happens (with some noise)
    alarm = np.random.binomial(1, 0.1 + 0.85 * (earthquake | burglary), n)

    # Marginal independence (earthquake and burglary are unrelated)
    marginal_corr = np.corrcoef(earthquake, burglary)[0, 1]
    print(f"Marginal corr(Earthquake, Burglary): {marginal_corr:.4f}")

    # Conditional dependence given alarm (explaining away)
    for a_val in [0, 1]:
        mask = alarm == a_val
        if mask.sum() > 100:  # Need enough samples of each
            eq = earthquake[mask]
            bg = burglary[mask]
            if eq.std() > 0 and bg.std() > 0:
                cond_corr = np.corrcoef(eq, bg)[0, 1]
                print(f"Conditional corr given Alarm={a_val}: {cond_corr:.4f}")

demonstrate_conditional_independence()
```

For more than two variables, mutual independence is stronger than pairwise independence.
Definition (Mutual Independence):
Random variables $X_1, X_2, \ldots, X_n$ are mutually independent if:
$$P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i)$$
for all measurable sets $A_1, \ldots, A_n$.
Equivalently, the joint distribution factors: $$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$$
Pairwise vs. Mutual Independence:
Pairwise independence does NOT imply mutual independence!
Example: Let $X_1, X_2$ be i.i.d. fair coin flips taking values $\pm 1$, and let $X_3 = X_1 \cdot X_2$. Each pair $(X_1, X_2)$, $(X_1, X_3)$, $(X_2, X_3)$ is independent: each variable is a fair $\pm 1$ coin, and knowing one member of a pair tells you nothing about the other. But the three are not mutually independent, since $X_3$ is completely determined by $X_1$ and $X_2$; for instance $P(X_1 = 1, X_2 = 1, X_3 = -1) = 0 \neq \tfrac{1}{8}$. The simulation below checks this numerically.
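The sketch below simulates this construction (the seed and sample size are arbitrary choices): each pair satisfies the factorization $P(a, b) = P(a)\,P(b)$ up to sampling error, while the joint probability of all three does not factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X1, X2: i.i.d. fair ±1 coin flips; X3 = X1 * X2
x1 = rng.choice([-1, 1], size=n)
x2 = rng.choice([-1, 1], size=n)
x3 = x1 * x2

# Pairwise check: P(a = +1, b = +1) should match P(a = +1) * P(b = +1)
for a, b, name in [(x1, x2, "X1,X2"), (x1, x3, "X1,X3"), (x2, x3, "X2,X3")]:
    p_joint = np.mean((a == 1) & (b == 1))
    p_prod = np.mean(a == 1) * np.mean(b == 1)
    print(f"{name}: P(+,+) = {p_joint:.3f}  vs  P(+)P(+) = {p_prod:.3f}")

# Mutual independence fails: X3 is determined by X1 and X2
p_all = np.mean((x1 == 1) & (x2 == 1) & (x3 == 1))
p_prod3 = np.mean(x1 == 1) * np.mean(x2 == 1) * np.mean(x3 == 1)
print(f"P(all +1) = {p_all:.3f}  vs  product of marginals = {p_prod3:.3f}")
```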
The standard ML assumption that data points are 'i.i.d.' means: (1) mutual independence across samples, and (2) same distribution for each sample. Both parts matter—time series data often violates independence, and distribution shift violates identical distribution.
In practice, we need to test whether observed data are consistent with independence.
Chi-Square Test for Independence (Discrete):
For categorical variables with observed counts $O_{ij}$:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $E_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{n}$ is the expected count under independence.
Under $H_0$: independence, $\chi^2 \sim \chi^2_{(r-1)(c-1)}$.
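To connect the formula to code, here is a small sketch that computes $\chi^2$ by hand from a made-up contingency table and compares the result with `scipy.stats.chi2_contingency` (the counts are purely illustrative).

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table of observed counts O_ij (2 rows, 3 columns)
observed = np.array([[30, 20, 50],
                     [20, 30, 50]])
n = observed.sum()

# Expected counts under independence: E_ij = (row i total)(column j total) / n
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
expected = row_totals * col_totals / n

chi2_manual = ((observed - expected) ** 2 / expected).sum()

# scipy computes the same statistic (no continuity correction is applied
# for tables larger than 2x2)
chi2_scipy, p_value, dof, _ = stats.chi2_contingency(observed)
print(f"Manual chi2 = {chi2_manual:.3f}, scipy chi2 = {chi2_scipy:.3f}")
print(f"dof = {dof}, p-value = {p_value:.3f}")
```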
Tests for Continuous Variables:
Pearson correlation detects only linear dependence, and Spearman rank correlation extends this to monotonic dependence. For general (e.g., non-monotonic) dependence, one practical option is to discretize both variables and apply the chi-square test, as the code below does.
```python
import numpy as np
from scipy import stats

def test_independence_methods():
    """Demonstrate various independence tests."""
    np.random.seed(42)
    n = 500

    # Case 1: Independent
    x1 = np.random.normal(0, 1, n)
    y1 = np.random.normal(0, 1, n)

    # Case 2: Linearly dependent
    x2 = np.random.normal(0, 1, n)
    y2 = 0.5 * x2 + np.random.normal(0, 1, n)

    # Case 3: Nonlinearly dependent (Y = X²)
    x3 = np.random.normal(0, 1, n)
    y3 = x3**2 + 0.1 * np.random.normal(0, 1, n)

    cases = [(x1, y1, "Independent"),
             (x2, y2, "Linear dependence"),
             (x3, y3, "Y = X² (nonlinear)")]

    print("Testing Independence:")
    print("=" * 60)

    for x, y, name in cases:
        print(f"\n{name}:")

        # Pearson correlation
        r, p_pearson = stats.pearsonr(x, y)
        print(f"  Pearson r = {r:.4f}, p = {p_pearson:.4f}")

        # Spearman correlation
        rho, p_spearman = stats.spearmanr(x, y)
        print(f"  Spearman ρ = {rho:.4f}, p = {p_spearman:.4f}")

        # Chi-square (discretize first)
        x_cat = np.digitize(x, np.percentile(x, [25, 50, 75]))
        y_cat = np.digitize(y, np.percentile(y, [25, 50, 75]))
        contingency = np.histogram2d(x_cat, y_cat, bins=[4, 4])[0]
        chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)
        print(f"  Chi-square = {chi2:.2f}, p = {p_chi:.4f}")

test_independence_methods()
```

Machine learning algorithms routinely make independence assumptions—sometimes justified, sometimes not.
1. i.i.d. Assumption
Most ML theory assumes training samples are i.i.d. This fails when:
- the data form a time series, so consecutive observations are correlated rather than independent
- samples are grouped or clustered (e.g., many observations from the same user, patient, or site)
- the data-collection process creates feedback loops, so earlier predictions influence later samples
- the distribution shifts between training and deployment, violating the "identically distributed" part
The sketch after this list illustrates the time-series case.
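As an illustration of the time-series case, the sketch below generates an AR(1) sequence (the coefficient 0.9 is an arbitrary choice) and shows that consecutive "samples" are strongly correlated, unlike genuinely i.i.d. draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n, phi = 5000, 0.9               # illustrative length and AR(1) coefficient

# AR(1) process: x_t = phi * x_{t-1} + noise  (clearly not i.i.d.)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# i.i.d. baseline for comparison
iid = rng.normal(size=n)

# Lag-1 correlation: near phi for the AR(1) series, near 0 for i.i.d. draws
print(f"AR(1) lag-1 corr:  {np.corrcoef(x[:-1], x[1:])[0, 1]:.3f}")
print(f"i.i.d. lag-1 corr: {np.corrcoef(iid[:-1], iid[1:])[0, 1]:.3f}")
```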
2. Naive Bayes: Conditional Feature Independence
$$P(\mathbf{X} | Y) = \prod_i P(X_i | Y)$$
Features are assumed independent given the class. Almost always false, surprisingly often effective.
3. Graphical Models: Local Markov Property
A variable is independent of non-descendants given its parents. Encodes complex dependence structure with local factors.
4. Dropout Regularization
Randomly dropping neurons during training—each mask is independent across samples and iterations.
Independence assumptions are often knowingly false but useful. They reduce the number of parameters to estimate (a d×d covariance matrix becomes d variances) and make inference tractable. The art is knowing when violations matter enough to model explicitly.
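To put rough numbers on that parameter savings (the dimension below is arbitrary), compare a full-covariance Gaussian with the diagonal model implied by an independence assumption:

```python
d = 100  # illustrative feature dimension

# Full multivariate Gaussian: d means + d(d+1)/2 distinct covariance entries
full_params = d + d * (d + 1) // 2

# Independence assumption (diagonal covariance): d means + d variances
diag_params = d + d

print(f"d = {d}: full covariance needs {full_params} parameters, "
      f"the independence assumption needs {diag_params}")
```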
Module Complete:
You've now completed the module on Joint and Marginal Distributions. You've learned how joint distributions describe several random variables at once, how marginal and conditional distributions are derived from them, how covariance and correlation quantify linear relationships, and how independence and conditional independence formalize the absence of any relationship.
These concepts form the probabilistic foundation for understanding machine learning models that reason about multiple features, predict targets from inputs, and quantify uncertainty.
Congratulations! You've mastered joint and marginal distributions—the mathematical framework for reasoning about multiple random variables and their relationships. This knowledge is fundamental to probabilistic machine learning, Bayesian inference, and statistical modeling.