We've seen how conditional probability captures the idea that knowing B occurred changes our assessment of A's likelihood. But sometimes, knowing B tells us nothing about A. The probability of A remains the same whether or not B occurred.
This special relationship is called independence—and it's one of the most important concepts in probability and machine learning.
Independence is not just a theoretical curiosity. In ML, independence assumptions make otherwise intractable models workable: they let joint distributions factor into small pieces, justify treating training samples as i.i.d., and underlie classifiers such as Naive Bayes.
But wrongly assuming independence leads to invalid models. Understanding independence deeply is essential.
By the end of this page, you will master the formal definition of independence, extend it to mutual independence of multiple events, understand conditional independence and its role in graphical models, and recognize independence assumptions throughout machine learning.
Two events A and B are independent if knowing that B occurred does not change the probability of A. Formally:

P(A|B) = P(A)

Equivalently, using the definition of conditional probability:

P(A|B) = P(A ∩ B) / P(B) = P(A)

Rearranging:

P(A ∩ B) = P(A) · P(B)
This product rule is the most commonly used criterion for independence. It's symmetric: if A is independent of B, then B is independent of A (since multiplication is commutative).
We write A ⊥ B to denote 'A is independent of B' (some texts use A ⫫ B or A ⊥⊥ B).
Definition 1: P(A|B) = P(A) — conditioning on B doesn't change A's probability
Definition 2: P(A ∩ B) = P(A) · P(B) — joint probability factors into product of marginals
Use Definition 2 when P(B) might be zero (Definition 1 requires P(B) > 0). Use Definition 1 for intuition about 'no information gain.'
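To make both definitions concrete, here is a minimal sketch that checks the two criteria by enumerating the 36 equally likely outcomes of two fair dice (the same dice example reappears later on this page):

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(outcomes))

# Event A: first die shows 6.  Event B: second die is even.
A = {o for o in outcomes if o[0] == 6}
B = {o for o in outcomes if o[1] % 2 == 0}

# Definition 1: conditioning on B does not change A's probability.
p_a_given_b = Fraction(len(A & B), len(B))
print(p_a_given_b == prob(A))              # True: 1/6 either way

# Definition 2: the joint factors into the product of marginals.
print(prob(A & B) == prob(A) * prob(B))    # True: 1/12 = 1/6 * 1/2
```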
One of the most common errors in probability is confusing disjoint (mutually exclusive) with independent. These concepts are almost opposites!
A and B are disjoint if A ∩ B = ∅. They cannot both occur.
A and B are independent if P(A ∩ B) = P(A) · P(B).
If A and B are disjoint and both have positive probability, they cannot be independent. Knowing A occurred tells you that B definitely did NOT occur. That's strong information, the exact opposite of independence:

P(B|A) = P(A ∩ B) / P(A) = 0 / P(A) = 0

But P(B) > 0, so P(B|A) ≠ P(B). Maximum dependence!
| Property | Disjoint | Independent |
|---|---|---|
| Definition | A ∩ B = ∅ (no overlap) | P(A ∩ B) = P(A)P(B) |
| Can both occur? | No | Yes (usually) |
| P(A ∩ B) | 0 | P(A) · P(B) |
| Information relationship | A tells us B didn't happen | A tells us nothing about B |
| Dependence | Strongly dependent (if P(A), P(B) > 0) | By definition, not dependent |
| Example | Rolling 1 vs rolling 6 | First die result vs second die result |
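A short sketch of the disjoint case, using the single-die events from the table (rolling a 1 versus rolling a 6), shows how the product rule fails:

```python
from fractions import Fraction

# One fair die.  A = "roll a 1" and B = "roll a 6" are disjoint events.
outcomes = set(range(1, 7))
A, B = {1}, {6}

def prob(event):
    return Fraction(len(event), len(outcomes))

print(prob(A & B))                       # 0: the events can never co-occur
print(prob(A) * prob(B))                 # 1/36: what independence would require
print(prob(A & B) == prob(A) * prob(B))  # False: disjoint, therefore NOT independent

# Knowing A occurred is strong information about B:
p_b_given_a = Fraction(len(A & B), len(A))
print(p_b_given_a, prob(B))              # 0 vs 1/6: maximum dependence
```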
Can disjoint events ever be independent? Only when at least one has probability zero. If P(A) = 0 or P(B) = 0, then:

P(A ∩ B) = 0 = P(A) · P(B)

so the product rule holds trivially. But this is a degenerate case: the event essentially never happens.
For more than two events, we need to be careful about what 'independent' means. There are two distinct concepts:
Events A₁, A₂, ..., Aₙ are pairwise independent if every pair is independent:
P(Aᵢ ∩ Aⱼ) = P(Aᵢ) · P(Aⱼ) for all i ≠ j
Events A₁, A₂, ..., Aₙ are mutually independent if the probability of any intersection equals the product of individual probabilities:
P(Aᵢ₁ ∩ Aᵢ₂ ∩ ... ∩ Aᵢₖ) = P(Aᵢ₁) · P(Aᵢ₂) · ... · P(Aᵢₖ)
for every subset {i₁, i₂, ..., iₖ} of {1, 2, ..., n}.
Pairwise independence does NOT imply mutual independence! Events can be pairwise independent but collectively dependent. The classic example involves XOR of random bits:
Let X and Y be independent fair coin flips (each 0 or 1 with probability 1/2) and let Z = X XOR Y. Then each of the pairs (X, Y), (X, Z), and (Y, Z) is independent: knowing any single bit tells you nothing about any other. But the three are not mutually independent, because any two of the bits determine the third. For example, P(X = 1, Y = 1, Z = 1) = 0, while P(X = 1) · P(Y = 1) · P(Z = 1) = 1/8.
For A, B, C to be mutually independent, we need ALL of these:

1. P(A ∩ B) = P(A) · P(B)
2. P(A ∩ C) = P(A) · P(C)
3. P(B ∩ C) = P(B) · P(C)
4. P(A ∩ B ∩ C) = P(A) · P(B) · P(C)
Conditions 1-3 are pairwise independence. Condition 4 is the additional requirement for mutual independence.
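The XOR example above can be checked by brute-force enumeration. This sketch verifies all four conditions over the four equally likely (X, Y) worlds:

```python
from fractions import Fraction
from itertools import product

# Sample space: two independent fair bits (X, Y), with Z = X XOR Y.
worlds = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]  # 4 equally likely worlds

def prob(event):
    """Probability of the set of worlds satisfying the predicate `event`."""
    return Fraction(sum(event(w) for w in worlds), len(worlds))

X = lambda w: w[0] == 1   # event "X = 1"
Y = lambda w: w[1] == 1   # event "Y = 1"
Z = lambda w: w[2] == 1   # event "Z = 1"

# Conditions 1-3 (pairwise independence) all hold:
print(prob(lambda w: X(w) and Y(w)) == prob(X) * prob(Y))   # True
print(prob(lambda w: X(w) and Z(w)) == prob(X) * prob(Z))   # True
print(prob(lambda w: Y(w) and Z(w)) == prob(Y) * prob(Z))   # True

# Condition 4 (the triple intersection) fails: X = 1 and Y = 1 force Z = 0.
triple = prob(lambda w: X(w) and Y(w) and Z(w))
print(triple, prob(X) * prob(Y) * prob(Z))                  # 0 vs 1/8
```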
For n events, mutual independence requires one product condition for every subset of two or more events. There are 2ⁿ subsets in total; removing the empty set and the n singletons leaves:

Total: 2ⁿ - n - 1 conditions
Events can be dependent marginally but become independent once we condition on another variable. This is conditional independence—perhaps the most important concept in probabilistic graphical models.
Events A and B are conditionally independent given C (written A ⊥ B | C) if:

P(A ∩ B | C) = P(A | C) · P(B | C)

Equivalently: P(A | B, C) = P(A | C)
Once we know C, learning B doesn't further change our belief about A.
Three caveats are worth keeping in mind:

- Conditional independence ≠ marginal independence: either can hold without the other.
- Conditioning can create or destroy independence.
- The relation is not symmetric in the conditioning variable: A ⊥ B | C does not imply A ⊥ C | B.
Bayesian networks encode conditional independence through graph structure: each variable is conditionally independent of its non-descendants given its parents.
This allows complex joint distributions to be factored into simple conditional distributions, enabling tractable inference.
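As a sketch of this idea, the following builds a tiny common-cause network C → A, C → B with made-up parameters, constructs the joint from the factorization P(A, B, C) = P(C) · P(A|C) · P(B|C), and checks that A and B are conditionally independent given C yet dependent marginally:

```python
import numpy as np

# Hypothetical parameters for a common-cause network C -> A, C -> B.
p_c = np.array([0.3, 0.7])                # P(C)
p_a_given_c = np.array([[0.9, 0.1],       # P(A|C), rows indexed by c, columns by a
                        [0.2, 0.8]])
p_b_given_c = np.array([[0.8, 0.2],       # P(B|C)
                        [0.3, 0.7]])

# joint[a, b, c] = P(C=c) * P(A=a|C=c) * P(B=b|C=c)
joint = np.einsum('c,ca,cb->abc', p_c, p_a_given_c, p_b_given_c)

# Conditional independence given C holds by construction:
p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)   # P(A, B | C)
p_a_c = p_ab_given_c.sum(axis=1, keepdims=True)                # P(A | C)
p_b_c = p_ab_given_c.sum(axis=0, keepdims=True)                # P(B | C)
print(np.allclose(p_ab_given_c, p_a_c * p_b_c))                # True

# Marginally, A and B are dependent (C is an unobserved common cause):
p_ab = joint.sum(axis=2)                                       # P(A, B)
p_a = p_ab.sum(axis=1, keepdims=True)
p_b = p_ab.sum(axis=0, keepdims=True)
print(np.allclose(p_ab, p_a * p_b))                            # False
```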
In practice, we don't know the true probability distribution—we have data. Testing independence from data requires statistical methods.
Given data, estimate the empirical probabilities P̂(A), P̂(B), and P̂(A ∩ B) from relative frequencies, then check whether P̂(A ∩ B) ≈ P̂(A) · P̂(B).
But how close is 'close enough'? We need statistical tests.
For categorical variables, the chi-square test compares observed joint frequencies with expected frequencies under independence:
χ² = Σ (Observed - Expected)² / Expected
If χ² is large (compared to critical value), we reject independence.
```python
import numpy as np
from scipy import stats
from typing import Dict


def test_independence_empirical(
    joint_prob: float,
    marginal_a: float,
    marginal_b: float,
    tolerance: float = 0.01
) -> Dict:
    """
    Check if joint probability approximately equals product of marginals.

    Parameters:
        joint_prob: P(A ∩ B)
        marginal_a: P(A)
        marginal_b: P(B)
        tolerance: Maximum allowed difference

    Returns:
        Dictionary with independence analysis
    """
    product = marginal_a * marginal_b
    difference = abs(joint_prob - product)
    ratio = joint_prob / product if product > 0 else float('inf')

    return {
        "P(A ∩ B)": joint_prob,
        "P(A) × P(B)": product,
        "difference": difference,
        "ratio": ratio,
        "appears_independent": difference < tolerance
    }


def chi_square_independence_test(
    contingency_table: np.ndarray,
    alpha: float = 0.05
) -> Dict:
    """
    Chi-square test for independence between two categorical variables.

    Parameters:
        contingency_table: 2D array of observed frequencies
        alpha: Significance level (default 0.05)

    Returns:
        Dictionary with test statistics and conclusion
    """
    # Perform chi-square test
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

    return {
        "chi_square_statistic": chi2,
        "p_value": p_value,
        "degrees_of_freedom": dof,
        "expected_frequencies": expected,
        "reject_independence": p_value < alpha,
        "interpretation": (
            "Evidence of dependence" if p_value < alpha
            else "Cannot reject independence"
        )
    }


def simulate_independence_check(
    n_samples: int = 10000,
    p_a: float = 0.3,
    p_b_given_a: float = 0.5,      # If equal to p_b_given_not_a, independent
    p_b_given_not_a: float = 0.5
) -> Dict:
    """
    Simulate two events and test their independence.

    Parameters:
        n_samples: Number of samples to generate
        p_a: Marginal probability of A
        p_b_given_a: P(B|A)
        p_b_given_not_a: P(B|¬A)

    Returns:
        Analysis of independence from simulated data
    """
    # Generate samples
    a_samples = np.random.binomial(1, p_a, n_samples)

    # B depends on A unless p_b_given_a == p_b_given_not_a
    b_probs = np.where(a_samples == 1, p_b_given_a, p_b_given_not_a)
    b_samples = np.random.binomial(1, b_probs)

    # Compute empirical probabilities
    p_a_emp = a_samples.mean()
    p_b_emp = b_samples.mean()
    p_joint_emp = (a_samples * b_samples).mean()

    # Theoretical values
    p_b_true = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a
    p_joint_true = p_a * p_b_given_a

    # Build contingency table
    contingency = np.array([
        [(1 - a_samples).sum() - ((1 - a_samples) * b_samples).sum(),
         ((1 - a_samples) * b_samples).sum()],
        [a_samples.sum() - (a_samples * b_samples).sum(),
         (a_samples * b_samples).sum()]
    ])

    chi_result = chi_square_independence_test(contingency)

    return {
        "n_samples": n_samples,
        "theoretical": {
            "P(A)": p_a,
            "P(B)": round(p_b_true, 4),
            "P(A∩B)": round(p_joint_true, 4),
            "P(A)P(B)": round(p_a * p_b_true, 4),
            "are_independent": p_b_given_a == p_b_given_not_a
        },
        "empirical": {
            "P(A)": round(p_a_emp, 4),
            "P(B)": round(p_b_emp, 4),
            "P(A∩B)": round(p_joint_emp, 4),
            "P(A)P(B)": round(p_a_emp * p_b_emp, 4)
        },
        "chi_square_test": chi_result
    }


# Mutual information: measures dependence strength
def mutual_information(contingency_table: np.ndarray) -> float:
    """
    Compute mutual information I(A; B) from contingency table.

    MI = 0 if and only if A and B are independent.
    Higher MI indicates stronger dependence.

    I(A;B) = Σᵢⱼ P(i,j) log(P(i,j) / (P(i)P(j)))
    """
    # Normalize to get joint distribution
    joint = contingency_table / contingency_table.sum()

    # Marginals
    p_a = joint.sum(axis=1, keepdims=True)
    p_b = joint.sum(axis=0, keepdims=True)

    # Expected under independence
    expected = p_a @ p_b

    # MI (handle zeros)
    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / expected[i, j])

    return mi


if __name__ == "__main__":
    print("=" * 60)
    print("Independence Testing Examples")
    print("=" * 60)

    # Example 1: Two fair dice (independent)
    print("\nExample 1: First die = 6, Second die = even (INDEPENDENT)")
    result = test_independence_empirical(
        joint_prob=3/36,   # P(first=6 AND second=even)
        marginal_a=1/6,    # P(first=6)
        marginal_b=3/6     # P(second=even)
    )
    for k, v in result.items():
        print(f"  {k}: {v}")

    # Example 2: First die = 6, Sum >= 10 (dependent)
    print("\nExample 2: First die = 6, Sum >= 10 (DEPENDENT)")
    result = test_independence_empirical(
        joint_prob=3/36,   # P(first=6 AND sum>=10)
        marginal_a=1/6,    # P(first=6)
        marginal_b=6/36    # P(sum>=10)
    )
    for k, v in result.items():
        print(f"  {k}: {v}")

    # Example 3: Simulation with independence
    print("\n" + "=" * 60)
    print("Simulating INDEPENDENT events")
    print("=" * 60)
    result = simulate_independence_check(
        n_samples=10000,
        p_a=0.4,
        p_b_given_a=0.3,      # Same probabilities
        p_b_given_not_a=0.3   # -> independence
    )
    print(f"Theoretical independence: {result['theoretical']['are_independent']}")
    print(f"Chi-square test: {result['chi_square_test']['interpretation']}")
    print(f"p-value: {result['chi_square_test']['p_value']:.4f}")

    # Example 4: Simulation with dependence
    print("\n" + "=" * 60)
    print("Simulating DEPENDENT events")
    print("=" * 60)
    result = simulate_independence_check(
        n_samples=10000,
        p_a=0.4,
        p_b_given_a=0.7,      # Different probabilities
        p_b_given_not_a=0.2   # -> dependence
    )
    print(f"Theoretical independence: {result['theoretical']['are_independent']}")
    print(f"Chi-square test: {result['chi_square_test']['interpretation']}")
    print(f"p-value: {result['chi_square_test']['p_value']:.6f}")
```

Independence assumptions pervade machine learning. They're often wrong in detail but useful in practice—trading accuracy for tractability.
Naive Bayes classifies by assuming features are conditionally independent given the class:
P(X₁, X₂, ..., Xₙ | Y) = ∏ᵢ P(Xᵢ | Y)
This is almost always wrong in detail (the words 'cheap' and 'viagra' are hardly independent given that an email is spam), but Naive Bayes often works well despite it.
Why it works: classification only needs the true class to receive the highest score, not calibrated probabilities, so even distorted class-conditional estimates often yield the correct argmax, and the factorized model has far fewer parameters to estimate, which keeps it stable when data is limited.
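A minimal, hypothetical spam-scoring sketch (the priors and word probabilities below are invented for illustration) shows how the conditional-independence assumption turns the class-conditional likelihood into a sum of per-word log terms, where only the argmax matters:

```python
import numpy as np

# Toy Naive Bayes with made-up parameters: one Bernoulli per word per class.
log_prior = {"spam": np.log(0.4), "ham": np.log(0.6)}
p_word = {
    "spam": {"cheap": 0.5, "viagra": 0.3, "meeting": 0.05},
    "ham":  {"cheap": 0.05, "viagra": 0.001, "meeting": 0.4},
}

def score(words_present, y):
    """log P(y) + sum_i log P(x_i | y), using the independence factorization."""
    total = log_prior[y]
    for word, p in p_word[y].items():
        total += np.log(p) if word in words_present else np.log(1 - p)
    return total

email = {"cheap", "viagra"}
scores = {y: score(email, y) for y in log_prior}
print(scores)
print(max(scores, key=scores.get))   # 'spam': only the ranking matters for the decision
```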
| Model/Method | Independence Assumption | What's Actually Assumed Independent |
|---|---|---|
| Naive Bayes | Conditional independence | Features are independent given class |
| IID Training Data | Mutual independence | Training samples are independent of each other |
| Bayesian Networks | Conditional independence (structured) | Each variable independent of non-descendants given parents |
| SGD Noise | Independence across iterations | Gradient noise is independent at each step |
| Dropout | Independence | Units are dropped independently |
| Mean Field Variational Inference | Full factorization | Posterior factorizes into independent components |
HMMs model sequences with two key independence assumptions:
Markov property: the current hidden state depends only on the previous state:

P(Zₜ | Z₁, ..., Zₜ₋₁) = P(Zₜ | Zₜ₋₁)

Observation independence: each observation depends only on its hidden state:

P(Xₜ | Z₁, ..., Zₜ, X₁, ..., Xₜ₋₁) = P(Xₜ | Zₜ)

These structured assumptions make inference tractable: O(n) in the sequence length rather than exponential.
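As a sketch, the joint probability of one hidden path together with one observation sequence follows directly from these two assumptions; the transition and emission numbers below are hypothetical:

```python
import numpy as np

# Hypothetical two-state HMM parameters.
pi = np.array([0.6, 0.4])        # initial distribution P(Z_1)
A = np.array([[0.7, 0.3],        # transitions P(Z_t | Z_{t-1})
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],        # emissions P(X_t | Z_t)
              [0.3, 0.7]])

z = [0, 0, 1, 1]                 # one hidden state path
x = [0, 0, 1, 1]                 # one observation sequence

# P(x, z) = P(z_1) * prod_t P(z_t | z_{t-1}) * prod_t P(x_t | z_t)
prob = pi[z[0]] * B[z[0], x[0]]
for t in range(1, len(z)):
    prob *= A[z[t - 1], z[t]] * B[z[t], x[t]]
print(prob)
```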
Modern transformers explicitly capture dependencies through attention: every position can attend to every other position, so dependencies are learned rather than assumed away.
In ML, independence assumptions are modeling choices, not beliefs about reality. The question isn't 'Are the features truly independent?' but rather 'Does assuming independence lead to models that work well?' Often the answer is yes—especially when data is limited and the bias from wrong independence helps more than it hurts (bias-variance tradeoff).
Independence is computationally magical. It transforms exponential problems into tractable ones.
For n binary variables, the full joint distribution P(X₁, ..., Xₙ) has 2ⁿ - 1 free parameters (one probability per outcome, minus one for normalization).

For n = 100: This is 2¹⁰⁰ ≈ 10³⁰ parameters. Impossible to store or estimate.
If X₁, ..., Xₙ are mutually independent:
P(X₁, ..., Xₙ) = ∏ᵢ P(Xᵢ)
For n = 100: Just 100 parameters. Trivially tractable!
Independence reduces complexity from exponential to linear.
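A quick sanity check of the parameter counts, comparing the 2ⁿ - 1 free parameters of the full joint with the n parameters of the fully factorized model:

```python
# Parameter counts for n binary variables: full joint vs. mutual independence.
for n in (10, 20, 100):
    full_joint = 2**n - 1   # one probability per outcome, minus normalization
    independent = n         # one Bernoulli parameter per variable
    print(f"n={n:4d}  full joint: {full_joint:.3e}  independent: {independent}")
```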
For i.i.d. data, the log-likelihood factorizes into a sum:
log P(X₁, ..., Xₙ | θ) = Σᵢ log P(Xᵢ | θ)
This enables numerically stable computation (sums of log-probabilities instead of products of tiny numbers), evaluation that parallelizes across data points, and the simple gradient decomposition below.
With i.i.d. data:
∇θ log P(X₁, ..., Xₙ | θ) = Σᵢ ∇θ log P(Xᵢ | θ)
The gradient is a sum of individual gradients, enabling mini-batch gradient descent.
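A small sketch with a Bernoulli model (the data, parameter value, and batch size below are arbitrary) illustrates the point: the full-data gradient is a sum of per-sample gradients, so a scaled mini-batch gradient is an unbiased estimate of it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100_000)   # i.i.d. Bernoulli(0.3) samples
theta = 0.5                              # current parameter value

def grad_log_lik(xi, theta):
    """d/dtheta of log P(x_i | theta) for a Bernoulli model."""
    return xi / theta - (1 - xi) / (1 - theta)

full_grad = grad_log_lik(x, theta).sum()                 # sum of per-sample gradients
batch = rng.choice(x, size=256, replace=False)           # random mini-batch
estimate = grad_log_lik(batch, theta).mean() * len(x)    # scaled-up mini-batch gradient

print(full_grad, estimate)   # close in expectation; the mini-batch estimate is unbiased
```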
The i.i.d. assumption is what makes training neural networks on millions of examples feasible. Without it, every gradient computation would need to consider dependencies between all samples—computationally catastrophic. The assumption may be technically false (similar images in a batch may be correlated), but it's approximately true enough to work.
Independence—when knowing one event tells us nothing about another—is both conceptually elegant and computationally essential.
What's Next:
We now have all the ingredients for the crown jewel of probability theory: Bayes' Theorem. Combining conditional probability, the law of total probability, and our understanding of independence, Bayes' theorem tells us how to invert conditional probabilities—how to go from P(evidence | hypothesis) to P(hypothesis | evidence). This is the mathematical foundation of learning from data.
You now understand independence in its full depth—from the basic product rule to conditional independence structures, from common confusions (disjoint ≠ independent) to practical testing methods. You recognize that independence assumptions pervade ML and that they transform intractable problems into solvable ones. Next: Bayes' theorem completes your probabilistic toolkit.