We've seen that zero correlation means no linear relationship—but variables can still be strongly dependent nonlinearly. Independence is the strongest possible statement about two variables not affecting each other: knowing one tells you absolutely nothing about the other.
Independence is both a theoretical cornerstone and a practical assumption. When we claim data points are i.i.d. (independent and identically distributed), we're invoking independence. When Naive Bayes assumes features are conditionally independent given the class, that's an independence assumption. When we split data into train/test sets, we need them to behave independently.
Understanding independence—when it holds, when it fails, and what it implies—is essential for correctly applying machine learning methods.
By the end of this page, you will master the formal definition of independence, understand conditional independence, test for independence, and appreciate why independence assumptions are both powerful and dangerous in machine learning.
Definition (Independence of Random Variables):
Random variables $X$ and $Y$ are independent, written $X \perp Y$, if and only if:
$$P(X \in A, Y \in B) = P(X \in A) \cdot P(Y \in B)$$
for all measurable sets $A$ and $B$.
Equivalent Characterizations:
Joint = Product of Marginals (Discrete): $$p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y) \quad \text{for all } x, y$$
Joint = Product of Marginals (Continuous): $$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } x, y$$
CDF Factorization: $$F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y)$$
Conditional = Marginal: $$P(Y = y | X = x) = P(Y = y) \quad \text{for all } x, y$$
This last one captures the intuition: knowing $X$'s value doesn't change our belief about $Y$.
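As a minimal sketch of these characterizations, the snippet below builds a small discrete joint pmf as the outer product of made-up marginals (so independence holds by construction) and verifies that every conditional $P(Y \mid X = x)$ equals the marginal $P(Y)$.

```python
import numpy as np

# Made-up marginals; the joint is built as their outer product,
# so X and Y are independent by construction.
p_x = np.array([0.4, 0.6])
p_y = np.array([0.2, 0.5, 0.3])
joint = np.outer(p_x, p_y)          # p(x, y) = p(x) * p(y)

# Marginals recovered by summing out the other variable
marg_x = joint.sum(axis=1)          # recovers p_x
marg_y = joint.sum(axis=0)          # recovers p_y

# Conditional P(Y | X = x) equals the marginal P(Y) for every x
for x in range(len(p_x)):
    cond = joint[x] / marg_x[x]
    print(f"P(Y | X={x}) = {np.round(cond, 3)},  P(Y) = {np.round(marg_y, 3)}")
```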
Implications of Independence:
If $X \perp Y$, then:
- $E[XY] = E[X]\,E[Y]$
- $\text{Cov}(X, Y) = 0$, and hence $\rho_{XY} = 0$
- $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
Critical Note: These are all necessary but not sufficient conditions. Zero covariance does NOT imply independence. The factorization of the joint distribution is the complete, definitive test.
Uncorrelated means Cov(X,Y) = 0. Independent means the entire joint distribution factors. Independence ⟹ Uncorrelated, but Uncorrelated ⟹̸ Independent. The one notable exception: for jointly Gaussian variables (a multivariate normal vector), uncorrelated DOES imply independent.
```python
import numpy as np

def test_independence():
    """Demonstrate independence vs uncorrelated."""
    np.random.seed(42)
    n = 10000

    # Example 1: Independent variables
    x1 = np.random.normal(0, 1, n)
    y1 = np.random.normal(0, 1, n)

    print("=== Independent Variables ===")
    print(f"Correlation: {np.corrcoef(x1, y1)[0, 1]:.4f}")
    print(f"E[XY]: {np.mean(x1 * y1):.4f}")
    print(f"E[X]*E[Y]: {np.mean(x1) * np.mean(y1):.4f}")

    # Example 2: Dependent but uncorrelated (Y = X²)
    x2 = np.random.normal(0, 1, n)
    y2 = x2**2

    print("\n=== Y = X² (Dependent but Uncorrelated) ===")
    print(f"Correlation: {np.corrcoef(x2, y2)[0, 1]:.4f}")
    print("But Y is completely determined by X!")

    # Check whether the joint factors (it shouldn't for y = x²):
    # discretize into quartiles and compare joint vs product of marginals
    x_bins = np.percentile(x2, [0, 25, 50, 75, 100])
    y_bins = np.percentile(y2, [0, 25, 50, 75, 100])
    x_disc = np.digitize(x2, x_bins[:-1])
    y_disc = np.digitize(y2, y_bins[:-1])

    # Compute joint and marginal distributions
    joint = np.zeros((4, 4))
    for i in range(n):
        joint[x_disc[i] - 1, y_disc[i] - 1] += 1
    joint /= n
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)

    # If independent, the joint should equal the outer product of marginals
    product = np.outer(p_x, p_y)
    print(f"\nMax |P(X,Y) - P(X)P(Y)|: {np.max(np.abs(joint - product)):.4f}")
    print("(Should be ~0 if independent, larger if dependent)")

test_independence()
```

Definition (Conditional Independence):
$X$ and $Y$ are conditionally independent given $Z$, written $X \perp Y | Z$, if:
$$P(X, Y | Z) = P(X | Z) \cdot P(Y | Z)$$
for all values of $Z$.
Equivalently: $$P(X | Y, Z) = P(X | Z)$$
Once we know $Z$, additional knowledge of $Y$ doesn't change our belief about $X$.
Key Points:
Conditional independence ≠ marginal independence: neither property implies the other. Variables can be marginally dependent yet conditionally independent given a third variable, or marginally independent yet conditionally dependent once a common effect is observed. The two examples below, simulated in the code that follows, illustrate each direction.
Example: Common Cause
Rain ($Z$) causes both wet grass ($X$) and a wet sidewalk ($Y$). Marginally, $X$ and $Y$ are strongly dependent: seeing wet grass makes a wet sidewalk more likely. But once we condition on whether it rained, the dependence disappears: $X \perp Y \mid Z$ even though $X \not\perp Y$.
Example: Common Effect (Explaining Away)
An earthquake ($X$) and a burglary ($Y$) can each trigger an alarm ($Z$). Marginally, $X \perp Y$. But once we observe the alarm, they become dependent: learning that an earthquake occurred "explains away" the alarm and makes a burglary less likely, so $X \not\perp Y \mid Z$.
Naive Bayes classifiers assume features are conditionally independent given the class label: P(X₁, ..., Xₙ | Y) = ∏ P(Xᵢ | Y). This is almost always wrong, but often works well in practice because the conditional dependencies are weak or cancel out.
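To make that factorization concrete, here is a toy Bernoulli Naive Bayes sketch; the class priors and per-feature conditionals below are invented for illustration, not estimated from data.

```python
import numpy as np

# Toy Bernoulli Naive Bayes with made-up parameters (for illustration only)
p_y = np.array([0.6, 0.4])                    # P(Y = 0), P(Y = 1)
p_xi_given_y = np.array([[0.1, 0.7, 0.4],     # P(X_i = 1 | Y = 0)
                         [0.8, 0.2, 0.5]])    # P(X_i = 1 | Y = 1)

x = np.array([1, 0, 1])                       # observed binary feature vector

# Naive Bayes assumption: P(X | Y) = prod_i P(X_i | Y)
likelihood = np.prod(np.where(x == 1, p_xi_given_y, 1 - p_xi_given_y), axis=1)

# Bayes' rule, normalized over the two classes
posterior = likelihood * p_y
posterior /= posterior.sum()
print(f"P(Y | X) = {np.round(posterior, 3)}")
```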
```python
import numpy as np

def demonstrate_conditional_independence():
    """Demonstrate conditional vs marginal independence."""
    np.random.seed(42)
    n = 10000

    # Common cause: Z -> X and Z -> Y
    # Z = latent cause, X and Y are effects
    z = np.random.binomial(1, 0.5, n)             # Rain: 0 or 1
    x = np.random.binomial(1, 0.1 + 0.8 * z, n)   # Wet grass
    y = np.random.binomial(1, 0.05 + 0.9 * z, n)  # Wet sidewalk

    print("=== Common Cause: Rain -> {Grass, Sidewalk} ===")

    # Marginal dependence
    marginal_corr = np.corrcoef(x, y)[0, 1]
    print(f"Marginal corr(X, Y): {marginal_corr:.4f}")

    # Conditional independence given Z
    for z_val in [0, 1]:
        mask = z == z_val
        if mask.sum() > 100:
            cond_corr = np.corrcoef(x[mask], y[mask])[0, 1]
            print(f"Conditional correlation given Z={z_val}: {cond_corr:.4f}")

    # Common effect: X -> Z <- Y (collider)
    print("\n=== Common Effect (Collider): Earthquake, Burglary -> Alarm ===")
    earthquake = np.random.binomial(1, 0.01, n)   # Rare earthquake
    burglary = np.random.binomial(1, 0.02, n)     # Rare burglary

    # Alarm goes off if either happens (with some noise)
    alarm = np.random.binomial(1, 0.1 + 0.85 * (earthquake | burglary), n)

    # Marginal independence (earthquake and burglary are unrelated)
    marginal_corr = np.corrcoef(earthquake, burglary)[0, 1]
    print(f"Marginal corr(Earthquake, Burglary): {marginal_corr:.4f}")

    # Conditional dependence given alarm (explaining away)
    for a_val in [0, 1]:
        mask = alarm == a_val
        if mask.sum() > 100:  # Need enough samples of each
            eq = earthquake[mask]
            bg = burglary[mask]
            if eq.std() > 0 and bg.std() > 0:
                cond_corr = np.corrcoef(eq, bg)[0, 1]
                print(f"Conditional corr given Alarm={a_val}: {cond_corr:.4f}")

demonstrate_conditional_independence()
```

For more than two variables, mutual independence is stronger than pairwise independence.
Definition (Mutual Independence):
Random variables $X_1, X_2, \ldots, X_n$ are mutually independent if:
$$P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i)$$
for all measurable sets $A_1, \ldots, A_n$.
Equivalently, the joint distribution factors: $$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$$
Pairwise vs. Mutual Independence:
Pairwise independence does NOT imply mutual independence!
Example: Let $X_1, X_2$ be i.i.d. fair coin flips taking values $\pm 1$, and let $X_3 = X_1 \cdot X_2$. Each pair $(X_1, X_2)$, $(X_1, X_3)$, $(X_2, X_3)$ is independent: each variable is a fair $\pm 1$ coin, and knowing one member of a pair tells you nothing about the other. But the three are not mutually independent, since $X_3$ is completely determined by $X_1$ and $X_2$; for instance $P(X_1 = 1, X_2 = 1, X_3 = -1) = 0 \neq \tfrac{1}{8}$. The simulation below checks this numerically.
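The sketch below simulates this construction (the seed and sample size are arbitrary choices): each pair satisfies the factorization $P(a, b) = P(a)\,P(b)$ up to sampling error, while the joint probability of all three does not factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X1, X2: i.i.d. fair ±1 coin flips; X3 = X1 * X2
x1 = rng.choice([-1, 1], size=n)
x2 = rng.choice([-1, 1], size=n)
x3 = x1 * x2

# Pairwise check: P(a = +1, b = +1) should match P(a = +1) * P(b = +1)
for a, b, name in [(x1, x2, "X1,X2"), (x1, x3, "X1,X3"), (x2, x3, "X2,X3")]:
    p_joint = np.mean((a == 1) & (b == 1))
    p_prod = np.mean(a == 1) * np.mean(b == 1)
    print(f"{name}: P(+,+) = {p_joint:.3f}  vs  P(+)P(+) = {p_prod:.3f}")

# Mutual independence fails: X3 is determined by X1 and X2
p_all = np.mean((x1 == 1) & (x2 == 1) & (x3 == 1))
p_prod3 = np.mean(x1 == 1) * np.mean(x2 == 1) * np.mean(x3 == 1)
print(f"P(all +1) = {p_all:.3f}  vs  product of marginals = {p_prod3:.3f}")
```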
The standard ML assumption that data points are 'i.i.d.' means: (1) mutual independence across samples, and (2) same distribution for each sample. Both parts matter—time series data often violates independence, and distribution shift violates identical distribution.
In practice, we need to test whether observed data are consistent with independence.
Chi-Square Test for Independence (Discrete):
For categorical variables with observed counts $O_{ij}$:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $E_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{n}$ is the expected count under independence.
Under $H_0$: independence, $\chi^2 \sim \chi^2_{(r-1)(c-1)}$.
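To connect the formula to code, here is a small sketch that computes $\chi^2$ by hand from a made-up contingency table and compares the result with `scipy.stats.chi2_contingency` (the counts are purely illustrative).

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table of observed counts O_ij (2 rows, 3 columns)
observed = np.array([[30, 20, 50],
                     [20, 30, 50]])
n = observed.sum()

# Expected counts under independence: E_ij = (row i total)(column j total) / n
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
expected = row_totals * col_totals / n

chi2_manual = ((observed - expected) ** 2 / expected).sum()

# scipy computes the same statistic (no continuity correction is applied
# for tables larger than 2x2)
chi2_scipy, p_value, dof, _ = stats.chi2_contingency(observed)
print(f"Manual chi2 = {chi2_manual:.3f}, scipy chi2 = {chi2_scipy:.3f}")
print(f"dof = {dof}, p-value = {p_value:.3f}")
```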
Tests for Continuous Variables:
Pearson correlation detects only linear dependence, and Spearman rank correlation extends this to monotonic dependence. For general (e.g., non-monotonic) dependence, one practical option is to discretize both variables and apply the chi-square test, as the code below does.
```python
import numpy as np
from scipy import stats

def test_independence_methods():
    """Demonstrate various independence tests."""
    np.random.seed(42)
    n = 500

    # Case 1: Independent
    x1 = np.random.normal(0, 1, n)
    y1 = np.random.normal(0, 1, n)

    # Case 2: Linearly dependent
    x2 = np.random.normal(0, 1, n)
    y2 = 0.5 * x2 + np.random.normal(0, 1, n)

    # Case 3: Nonlinearly dependent (Y = X²)
    x3 = np.random.normal(0, 1, n)
    y3 = x3**2 + 0.1 * np.random.normal(0, 1, n)

    cases = [(x1, y1, "Independent"),
             (x2, y2, "Linear dependence"),
             (x3, y3, "Y = X² (nonlinear)")]

    print("Testing Independence:")
    print("=" * 60)

    for x, y, name in cases:
        print(f"\n{name}:")

        # Pearson correlation
        r, p_pearson = stats.pearsonr(x, y)
        print(f"  Pearson r = {r:.4f}, p = {p_pearson:.4f}")

        # Spearman correlation
        rho, p_spearman = stats.spearmanr(x, y)
        print(f"  Spearman ρ = {rho:.4f}, p = {p_spearman:.4f}")

        # Chi-square (discretize first)
        x_cat = np.digitize(x, np.percentile(x, [25, 50, 75]))
        y_cat = np.digitize(y, np.percentile(y, [25, 50, 75]))
        contingency = np.histogram2d(x_cat, y_cat, bins=[4, 4])[0]
        chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)
        print(f"  Chi-square = {chi2:.2f}, p = {p_chi:.4f}")

test_independence_methods()
```

Machine learning algorithms routinely make independence assumptions—sometimes justified, sometimes not.
1. i.i.d. Assumption
Most ML theory assumes training samples are i.i.d. This fails when:
- the data form a time series, so consecutive observations are correlated rather than independent
- samples are grouped or clustered (e.g., many observations from the same user, patient, or site)
- the data-collection process creates feedback loops, so earlier predictions influence later samples
- the distribution shifts between training and deployment, violating the "identically distributed" part
The sketch after this list illustrates the time-series case.
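As an illustration of the time-series case, the sketch below generates an AR(1) sequence (the coefficient 0.9 is an arbitrary choice) and shows that consecutive "samples" are strongly correlated, unlike genuinely i.i.d. draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n, phi = 5000, 0.9               # illustrative length and AR(1) coefficient

# AR(1) process: x_t = phi * x_{t-1} + noise  (clearly not i.i.d.)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# i.i.d. baseline for comparison
iid = rng.normal(size=n)

# Lag-1 correlation: near phi for the AR(1) series, near 0 for i.i.d. draws
print(f"AR(1) lag-1 corr:  {np.corrcoef(x[:-1], x[1:])[0, 1]:.3f}")
print(f"i.i.d. lag-1 corr: {np.corrcoef(iid[:-1], iid[1:])[0, 1]:.3f}")
```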
2. Naive Bayes: Conditional Feature Independence
$$P(\mathbf{X} | Y) = \prod_i P(X_i | Y)$$
Features are assumed independent given the class. Almost always false, surprisingly often effective.
3. Graphical Models: Local Markov Property
A variable is independent of non-descendants given its parents. Encodes complex dependence structure with local factors.
4. Dropout Regularization
Randomly dropping neurons during training—each mask is independent across samples and iterations.
Independence assumptions are often knowingly false but useful. They reduce the number of parameters to estimate (a d×d covariance matrix becomes d variances) and make inference tractable. The art is knowing when violations matter enough to model explicitly.
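To put rough numbers on that parameter savings (the dimension below is arbitrary), compare a full-covariance Gaussian with the diagonal model implied by an independence assumption:

```python
d = 100  # illustrative feature dimension

# Full multivariate Gaussian: d means + d(d+1)/2 distinct covariance entries
full_params = d + d * (d + 1) // 2

# Independence assumption (diagonal covariance): d means + d variances
diag_params = d + d

print(f"d = {d}: full covariance needs {full_params} parameters, "
      f"the independence assumption needs {diag_params}")
```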
Module Complete:
You've now completed the module on Joint and Marginal Distributions. You've learned how joint distributions describe several random variables at once, how marginal and conditional distributions are derived from them, how covariance and correlation quantify linear relationships, and how independence and conditional independence formalize the absence of any relationship.
These concepts form the probabilistic foundation for understanding machine learning models that reason about multiple features, predict targets from inputs, and quantify uncertainty.
Congratulations! You've mastered joint and marginal distributions—the mathematical framework for reasoning about multiple random variables and their relationships. This knowledge is fundamental to probabilistic machine learning, Bayesian inference, and statistical modeling.