We've seen how conditional probability captures the idea that knowing B occurred changes our assessment of A's likelihood. But sometimes, knowing B tells us nothing about A. The probability of A remains the same whether or not B occurred.
This special relationship is called independence—and it's one of the most important concepts in probability and machine learning.
Independence is not just a theoretical curiosity. In ML, independence assumptions make otherwise intractable models workable: they let joint distributions factor into small pieces, justify treating training samples as i.i.d., and underlie classifiers such as Naive Bayes.
But wrongly assuming independence leads to invalid models. Understanding independence deeply is essential.
By the end of this page, you will master the formal definition of independence, extend it to mutual independence of multiple events, understand conditional independence and its role in graphical models, and recognize independence assumptions throughout machine learning.
Two events A and B are independent if knowing that B occurred does not change the probability of A. Formally:

P(A|B) = P(A)

Equivalently, using the definition of conditional probability:

P(A|B) = P(A ∩ B) / P(B) = P(A)

Rearranging:

P(A ∩ B) = P(A) · P(B)
This product rule is the most commonly used criterion for independence. It's symmetric: if A is independent of B, then B is independent of A (since multiplication is commutative).
We write A ⊥ B to denote 'A is independent of B' (some texts use A ⫫ B or A ⊥⊥ B).
Definition 1: P(A|B) = P(A) — conditioning on B doesn't change A's probability
Definition 2: P(A ∩ B) = P(A) · P(B) — joint probability factors into product of marginals
Use Definition 2 when P(B) might be zero (Definition 1 requires P(B) > 0). Use Definition 1 for intuition about 'no information gain.'
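To make both definitions concrete, here is a minimal sketch that checks the two criteria by enumerating the 36 equally likely outcomes of two fair dice (the same dice example reappears later on this page):

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(outcomes))

# Event A: first die shows 6.  Event B: second die is even.
A = {o for o in outcomes if o[0] == 6}
B = {o for o in outcomes if o[1] % 2 == 0}

# Definition 1: conditioning on B does not change A's probability.
p_a_given_b = Fraction(len(A & B), len(B))
print(p_a_given_b == prob(A))              # True: 1/6 either way

# Definition 2: the joint factors into the product of marginals.
print(prob(A & B) == prob(A) * prob(B))    # True: 1/12 = 1/6 * 1/2
```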
One of the most common errors in probability is confusing disjoint (mutually exclusive) with independent. These concepts are almost opposites!
A and B are disjoint if A ∩ B = ∅. They cannot both occur.
A and B are independent if P(A ∩ B) = P(A) · P(B).
If A and B are disjoint and both have positive probability, they cannot be independent. Knowing A occurred tells you that B definitely did NOT occur. That's strong information, the exact opposite of independence:

P(B|A) = P(A ∩ B) / P(A) = 0 / P(A) = 0

But P(B) > 0, so P(B|A) ≠ P(B). Maximum dependence!
| Property | Disjoint | Independent |
|---|---|---|
| Definition | A ∩ B = ∅ (no overlap) | P(A ∩ B) = P(A)P(B) |
| Can both occur? | No | Yes (usually) |
| P(A ∩ B) | 0 | P(A) · P(B) |
| Information relationship | A tells us B didn't happen | A tells us nothing about B |
| Dependence | Strongly dependent (if P(A), P(B) > 0) | By definition, not dependent |
| Example | Rolling 1 vs rolling 6 | First die result vs second die result |
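A short sketch of the disjoint case, using the single-die events from the table (rolling a 1 versus rolling a 6), shows how the product rule fails:

```python
from fractions import Fraction

# One fair die.  A = "roll a 1" and B = "roll a 6" are disjoint events.
outcomes = set(range(1, 7))
A, B = {1}, {6}

def prob(event):
    return Fraction(len(event), len(outcomes))

print(prob(A & B))                       # 0: the events can never co-occur
print(prob(A) * prob(B))                 # 1/36: what independence would require
print(prob(A & B) == prob(A) * prob(B))  # False: disjoint, therefore NOT independent

# Knowing A occurred is strong information about B:
p_b_given_a = Fraction(len(A & B), len(A))
print(p_b_given_a, prob(B))              # 0 vs 1/6: maximum dependence
```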
Can disjoint events ever be independent? Only when at least one has probability zero. If P(A) = 0 or P(B) = 0, then:

P(A ∩ B) = 0 = P(A) · P(B)

so the product rule holds trivially. But this is a degenerate case: the event essentially never happens.
For more than two events, we need to be careful about what 'independent' means. There are two distinct concepts:
Events A₁, A₂, ..., Aₙ are pairwise independent if every pair is independent:
P(Aᵢ ∩ Aⱼ) = P(Aᵢ) · P(Aⱼ) for all i ≠ j
Events A₁, A₂, ..., Aₙ are mutually independent if the probability of any intersection equals the product of individual probabilities:
P(Aᵢ₁ ∩ Aᵢ₂ ∩ ... ∩ Aᵢₖ) = P(Aᵢ₁) · P(Aᵢ₂) · ... · P(Aᵢₖ)
for every subset {i₁, i₂, ..., iₖ} of {1, 2, ..., n}.
Pairwise independence does NOT imply mutual independence! Events can be pairwise independent but collectively dependent. The classic example involves XOR of random bits:
Let X and Y be independent fair coin flips (each 0 or 1 with probability 1/2) and let Z = X XOR Y. Then each of the pairs (X, Y), (X, Z), and (Y, Z) is independent: knowing any single bit tells you nothing about any other. But the three are not mutually independent, because any two of the bits determine the third. For example, P(X = 1, Y = 1, Z = 1) = 0, while P(X = 1) · P(Y = 1) · P(Z = 1) = 1/8.
For A, B, C to be mutually independent, we need ALL of these:

1. P(A ∩ B) = P(A) · P(B)
2. P(A ∩ C) = P(A) · P(C)
3. P(B ∩ C) = P(B) · P(C)
4. P(A ∩ B ∩ C) = P(A) · P(B) · P(C)
Conditions 1-3 are pairwise independence. Condition 4 is the additional requirement for mutual independence.
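The XOR example above can be checked by brute-force enumeration. This sketch verifies all four conditions over the four equally likely (X, Y) worlds:

```python
from fractions import Fraction
from itertools import product

# Sample space: two independent fair bits (X, Y), with Z = X XOR Y.
worlds = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]  # 4 equally likely worlds

def prob(event):
    """Probability of the set of worlds satisfying the predicate `event`."""
    return Fraction(sum(event(w) for w in worlds), len(worlds))

X = lambda w: w[0] == 1   # event "X = 1"
Y = lambda w: w[1] == 1   # event "Y = 1"
Z = lambda w: w[2] == 1   # event "Z = 1"

# Conditions 1-3 (pairwise independence) all hold:
print(prob(lambda w: X(w) and Y(w)) == prob(X) * prob(Y))   # True
print(prob(lambda w: X(w) and Z(w)) == prob(X) * prob(Z))   # True
print(prob(lambda w: Y(w) and Z(w)) == prob(Y) * prob(Z))   # True

# Condition 4 (the triple intersection) fails: X = 1 and Y = 1 force Z = 0.
triple = prob(lambda w: X(w) and Y(w) and Z(w))
print(triple, prob(X) * prob(Y) * prob(Z))                  # 0 vs 1/8
```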
For n events, mutual independence requires one product condition for every subset of two or more events. There are 2ⁿ subsets in total; removing the empty set and the n singletons leaves:

Total: 2ⁿ - n - 1 conditions
Events can be dependent marginally but become independent once we condition on another variable. This is conditional independence—perhaps the most important concept in probabilistic graphical models.
Events A and B are conditionally independent given C (written A ⊥ B | C) if:

P(A ∩ B | C) = P(A | C) · P(B | C)

Equivalently: P(A | B, C) = P(A | C)
Once we know C, learning B doesn't further change our belief about A.
Three caveats are worth keeping in mind:

- Conditional independence ≠ marginal independence: either can hold without the other.
- Conditioning can create or destroy independence.
- The relation is not symmetric in the conditioning variable: A ⊥ B | C does not imply A ⊥ C | B.
Bayesian networks encode conditional independence through graph structure: each variable is conditionally independent of its non-descendants given its parents.
This allows complex joint distributions to be factored into simple conditional distributions, enabling tractable inference.
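As a sketch of this idea, the following builds a tiny common-cause network C → A, C → B with made-up parameters, constructs the joint from the factorization P(A, B, C) = P(C) · P(A|C) · P(B|C), and checks that A and B are conditionally independent given C yet dependent marginally:

```python
import numpy as np

# Hypothetical parameters for a common-cause network C -> A, C -> B.
p_c = np.array([0.3, 0.7])                # P(C)
p_a_given_c = np.array([[0.9, 0.1],       # P(A|C), rows indexed by c, columns by a
                        [0.2, 0.8]])
p_b_given_c = np.array([[0.8, 0.2],       # P(B|C)
                        [0.3, 0.7]])

# joint[a, b, c] = P(C=c) * P(A=a|C=c) * P(B=b|C=c)
joint = np.einsum('c,ca,cb->abc', p_c, p_a_given_c, p_b_given_c)

# Conditional independence given C holds by construction:
p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)   # P(A, B | C)
p_a_c = p_ab_given_c.sum(axis=1, keepdims=True)                # P(A | C)
p_b_c = p_ab_given_c.sum(axis=0, keepdims=True)                # P(B | C)
print(np.allclose(p_ab_given_c, p_a_c * p_b_c))                # True

# Marginally, A and B are dependent (C is an unobserved common cause):
p_ab = joint.sum(axis=2)                                       # P(A, B)
p_a = p_ab.sum(axis=1, keepdims=True)
p_b = p_ab.sum(axis=0, keepdims=True)
print(np.allclose(p_ab, p_a * p_b))                            # False
```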
In practice, we don't know the true probability distribution—we have data. Testing independence from data requires statistical methods.
Given data, estimate the empirical probabilities P̂(A), P̂(B), and P̂(A ∩ B) from relative frequencies, then check whether P̂(A ∩ B) ≈ P̂(A) · P̂(B).
But how close is 'close enough'? We need statistical tests.
For categorical variables, the chi-square test compares observed joint frequencies with expected frequencies under independence:
χ² = Σ (Observed - Expected)² / Expected
If χ² is large (compared to critical value), we reject independence.
```python
import numpy as np
from scipy import stats
from typing import Dict


def test_independence_empirical(
    joint_prob: float,
    marginal_a: float,
    marginal_b: float,
    tolerance: float = 0.01
) -> Dict:
    """
    Check if joint probability approximately equals product of marginals.

    Parameters:
        joint_prob: P(A ∩ B)
        marginal_a: P(A)
        marginal_b: P(B)
        tolerance: Maximum allowed difference

    Returns:
        Dictionary with independence analysis
    """
    product = marginal_a * marginal_b
    difference = abs(joint_prob - product)
    ratio = joint_prob / product if product > 0 else float('inf')

    return {
        "P(A ∩ B)": joint_prob,
        "P(A) × P(B)": product,
        "difference": difference,
        "ratio": ratio,
        "appears_independent": difference < tolerance
    }


def chi_square_independence_test(
    contingency_table: np.ndarray,
    alpha: float = 0.05
) -> Dict:
    """
    Chi-square test for independence between two categorical variables.

    Parameters:
        contingency_table: 2D array of observed frequencies
        alpha: Significance level (default 0.05)

    Returns:
        Dictionary with test statistics and conclusion
    """
    # Perform chi-square test
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

    return {
        "chi_square_statistic": chi2,
        "p_value": p_value,
        "degrees_of_freedom": dof,
        "expected_frequencies": expected,
        "reject_independence": p_value < alpha,
        "interpretation": (
            "Evidence of dependence" if p_value < alpha
            else "Cannot reject independence"
        )
    }


def simulate_independence_check(
    n_samples: int = 10000,
    p_a: float = 0.3,
    p_b_given_a: float = 0.5,      # If equal to p_b_given_not_a, independent
    p_b_given_not_a: float = 0.5
) -> Dict:
    """
    Simulate two events and test their independence.

    Parameters:
        n_samples: Number of samples to generate
        p_a: Marginal probability of A
        p_b_given_a: P(B|A)
        p_b_given_not_a: P(B|¬A)

    Returns:
        Analysis of independence from simulated data
    """
    # Generate samples
    a_samples = np.random.binomial(1, p_a, n_samples)

    # B depends on A unless p_b_given_a == p_b_given_not_a
    b_probs = np.where(a_samples == 1, p_b_given_a, p_b_given_not_a)
    b_samples = np.random.binomial(1, b_probs)

    # Compute empirical probabilities
    p_a_emp = a_samples.mean()
    p_b_emp = b_samples.mean()
    p_joint_emp = (a_samples * b_samples).mean()

    # Theoretical values
    p_b_true = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a
    p_joint_true = p_a * p_b_given_a

    # Build contingency table
    contingency = np.array([
        [(1 - a_samples).sum() - ((1 - a_samples) * b_samples).sum(),
         ((1 - a_samples) * b_samples).sum()],
        [a_samples.sum() - (a_samples * b_samples).sum(),
         (a_samples * b_samples).sum()]
    ])

    chi_result = chi_square_independence_test(contingency)

    return {
        "n_samples": n_samples,
        "theoretical": {
            "P(A)": p_a,
            "P(B)": round(p_b_true, 4),
            "P(A∩B)": round(p_joint_true, 4),
            "P(A)P(B)": round(p_a * p_b_true, 4),
            "are_independent": p_b_given_a == p_b_given_not_a
        },
        "empirical": {
            "P(A)": round(p_a_emp, 4),
            "P(B)": round(p_b_emp, 4),
            "P(A∩B)": round(p_joint_emp, 4),
            "P(A)P(B)": round(p_a_emp * p_b_emp, 4)
        },
        "chi_square_test": chi_result
    }


# Mutual information: measures dependence strength
def mutual_information(contingency_table: np.ndarray) -> float:
    """
    Compute mutual information I(A; B) from contingency table.

    MI = 0 if and only if A and B are independent.
    Higher MI indicates stronger dependence.

    I(A;B) = Σᵢⱼ P(i,j) log(P(i,j) / (P(i)P(j)))
    """
    # Normalize to get joint distribution
    joint = contingency_table / contingency_table.sum()

    # Marginals
    p_a = joint.sum(axis=1, keepdims=True)
    p_b = joint.sum(axis=0, keepdims=True)

    # Expected under independence
    expected = p_a @ p_b

    # MI (handle zeros)
    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / expected[i, j])

    return mi


if __name__ == "__main__":
    print("=" * 60)
    print("Independence Testing Examples")
    print("=" * 60)

    # Example 1: Two fair dice (independent)
    print("\nExample 1: First die = 6, Second die = even (INDEPENDENT)")
    result = test_independence_empirical(
        joint_prob=3/36,   # P(first=6 AND second=even)
        marginal_a=1/6,    # P(first=6)
        marginal_b=3/6     # P(second=even)
    )
    for k, v in result.items():
        print(f"  {k}: {v}")

    # Example 2: First die = 6, Sum >= 10 (dependent)
    print("\nExample 2: First die = 6, Sum >= 10 (DEPENDENT)")
    result = test_independence_empirical(
        joint_prob=3/36,   # P(first=6 AND sum>=10)
        marginal_a=1/6,    # P(first=6)
        marginal_b=6/36    # P(sum>=10)
    )
    for k, v in result.items():
        print(f"  {k}: {v}")

    # Example 3: Simulation with independence
    print("\n" + "=" * 60)
    print("Simulating INDEPENDENT events")
    print("=" * 60)
    result = simulate_independence_check(
        n_samples=10000,
        p_a=0.4,
        p_b_given_a=0.3,      # Same probabilities
        p_b_given_not_a=0.3   # -> independence
    )
    print(f"Theoretical independence: {result['theoretical']['are_independent']}")
    print(f"Chi-square test: {result['chi_square_test']['interpretation']}")
    print(f"p-value: {result['chi_square_test']['p_value']:.4f}")

    # Example 4: Simulation with dependence
    print("\n" + "=" * 60)
    print("Simulating DEPENDENT events")
    print("=" * 60)
    result = simulate_independence_check(
        n_samples=10000,
        p_a=0.4,
        p_b_given_a=0.7,      # Different probabilities
        p_b_given_not_a=0.2   # -> dependence
    )
    print(f"Theoretical independence: {result['theoretical']['are_independent']}")
    print(f"Chi-square test: {result['chi_square_test']['interpretation']}")
    print(f"p-value: {result['chi_square_test']['p_value']:.6f}")
```

Independence assumptions pervade machine learning. They're often wrong in detail but useful in practice—trading accuracy for tractability.
Naive Bayes classifies by assuming features are conditionally independent given the class:
P(X₁, X₂, ..., Xₙ | Y) = ∏ᵢ P(Xᵢ | Y)
This is almost always wrong in detail (the words 'cheap' and 'viagra' are hardly independent given that an email is spam), but Naive Bayes often works well despite it.
Why it works: classification only needs the true class to receive the highest score, not calibrated probabilities, so even distorted class-conditional estimates often yield the correct argmax, and the factorized model has far fewer parameters to estimate, which keeps it stable when data is limited.
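A minimal, hypothetical spam-scoring sketch (the priors and word probabilities below are invented for illustration) shows how the conditional-independence assumption turns the class-conditional likelihood into a sum of per-word log terms, where only the argmax matters:

```python
import numpy as np

# Toy Naive Bayes with made-up parameters: one Bernoulli per word per class.
log_prior = {"spam": np.log(0.4), "ham": np.log(0.6)}
p_word = {
    "spam": {"cheap": 0.5, "viagra": 0.3, "meeting": 0.05},
    "ham":  {"cheap": 0.05, "viagra": 0.001, "meeting": 0.4},
}

def score(words_present, y):
    """log P(y) + sum_i log P(x_i | y), using the independence factorization."""
    total = log_prior[y]
    for word, p in p_word[y].items():
        total += np.log(p) if word in words_present else np.log(1 - p)
    return total

email = {"cheap", "viagra"}
scores = {y: score(email, y) for y in log_prior}
print(scores)
print(max(scores, key=scores.get))   # 'spam': only the ranking matters for the decision
```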
| Model/Method | Independence Assumption | What's Actually Assumed Independent |
|---|---|---|
| Naive Bayes | Conditional independence | Features are independent given class |
| IID Training Data | Mutual independence | Training samples are independent of each other |
| Bayesian Networks | Conditional independence (structured) | Each variable independent of non-descendants given parents |
| SGD Noise | Independence across iterations | Gradient noise is independent at each step |
| Dropout | Independence | Units are dropped independently |
| Mean Field Variational Inference | Full factorization | Posterior factorizes into independent components |
HMMs model sequences with two key independence assumptions:
Markov property: the current hidden state depends only on the previous state:

P(Zₜ | Z₁, ..., Zₜ₋₁) = P(Zₜ | Zₜ₋₁)

Observation independence: each observation depends only on its hidden state:

P(Xₜ | Z₁, ..., Zₜ, X₁, ..., Xₜ₋₁) = P(Xₜ | Zₜ)

These structured assumptions make inference tractable: O(n) in the sequence length rather than exponential.
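As a sketch, the joint probability of one hidden path together with one observation sequence follows directly from these two assumptions; the transition and emission numbers below are hypothetical:

```python
import numpy as np

# Hypothetical two-state HMM parameters.
pi = np.array([0.6, 0.4])        # initial distribution P(Z_1)
A = np.array([[0.7, 0.3],        # transitions P(Z_t | Z_{t-1})
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],        # emissions P(X_t | Z_t)
              [0.3, 0.7]])

z = [0, 0, 1, 1]                 # one hidden state path
x = [0, 0, 1, 1]                 # one observation sequence

# P(x, z) = P(z_1) * prod_t P(z_t | z_{t-1}) * prod_t P(x_t | z_t)
prob = pi[z[0]] * B[z[0], x[0]]
for t in range(1, len(z)):
    prob *= A[z[t - 1], z[t]] * B[z[t], x[t]]
print(prob)
```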
Modern transformers explicitly capture dependencies through attention: every position can attend to every other position, so dependencies are learned rather than assumed away.
In ML, independence assumptions are modeling choices, not beliefs about reality. The question isn't 'Are the features truly independent?' but rather 'Does assuming independence lead to models that work well?' Often the answer is yes—especially when data is limited and the bias from wrong independence helps more than it hurts (bias-variance tradeoff).
Independence is computationally magical. It transforms exponential problems into tractable ones.
For n binary variables, the full joint distribution P(X₁, ..., Xₙ) has 2ⁿ - 1 free parameters (one probability per outcome, minus one for normalization).

For n = 100: This is 2¹⁰⁰ ≈ 10³⁰ parameters. Impossible to store or estimate.
If X₁, ..., Xₙ are mutually independent:
P(X₁, ..., Xₙ) = ∏ᵢ P(Xᵢ)
For n = 100: Just 100 parameters. Trivially tractable!
Independence reduces complexity from exponential to linear.
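A quick sanity check of the parameter counts, comparing the 2ⁿ - 1 free parameters of the full joint with the n parameters of the fully factorized model:

```python
# Parameter counts for n binary variables: full joint vs. mutual independence.
for n in (10, 20, 100):
    full_joint = 2**n - 1   # one probability per outcome, minus normalization
    independent = n         # one Bernoulli parameter per variable
    print(f"n={n:4d}  full joint: {full_joint:.3e}  independent: {independent}")
```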
For i.i.d. data, the log-likelihood factorizes into a sum:
log P(X₁, ..., Xₙ | θ) = Σᵢ log P(Xᵢ | θ)
This enables numerically stable computation (sums of log-probabilities instead of products of tiny numbers), evaluation that parallelizes across data points, and the simple gradient decomposition below.
With i.i.d. data:
∇θ log P(X₁, ..., Xₙ | θ) = Σᵢ ∇θ log P(Xᵢ | θ)
The gradient is a sum of individual gradients, enabling mini-batch gradient descent.
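A small sketch with a Bernoulli model (the data, parameter value, and batch size below are arbitrary) illustrates the point: the full-data gradient is a sum of per-sample gradients, so a scaled mini-batch gradient is an unbiased estimate of it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100_000)   # i.i.d. Bernoulli(0.3) samples
theta = 0.5                              # current parameter value

def grad_log_lik(xi, theta):
    """d/dtheta of log P(x_i | theta) for a Bernoulli model."""
    return xi / theta - (1 - xi) / (1 - theta)

full_grad = grad_log_lik(x, theta).sum()                 # sum of per-sample gradients
batch = rng.choice(x, size=256, replace=False)           # random mini-batch
estimate = grad_log_lik(batch, theta).mean() * len(x)    # scaled-up mini-batch gradient

print(full_grad, estimate)   # close in expectation; the mini-batch estimate is unbiased
```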
The i.i.d. assumption is what makes training neural networks on millions of examples feasible. Without it, every gradient computation would need to consider dependencies between all samples—computationally catastrophic. The assumption may be technically false (similar images in a batch may be correlated), but it's approximately true enough to work.
Independence—when knowing one event tells us nothing about another—is both conceptually elegant and computationally essential.
What's Next:
We now have all the ingredients for the crown jewel of probability theory: Bayes' Theorem. Combining conditional probability, the law of total probability, and our understanding of independence, Bayes' theorem tells us how to invert conditional probabilities—how to go from P(evidence | hypothesis) to P(hypothesis | evidence). This is the mathematical foundation of learning from data.
You now understand independence in its full depth—from the basic product rule to conditional independence structures, from common confusions (disjoint ≠ independent) to practical testing methods. You recognize that independence assumptions pervade ML and that they transform intractable problems into solvable ones. Next: Bayes' theorem completes your probabilistic toolkit.