Consider a medical diagnosis system reasoning about a patient. The patient has hundreds of attributes: symptoms, vital signs, lab results, medical history, demographics. In principle, every attribute could influence every other; in practice, most attributes tell us nothing new about each other once a few key facts are known.
Observation: Given that a patient has pneumonia, their cough symptoms are independent of whether they live in a city or suburb. The disease screens off the symptom from the location.
Observation: Given a patient's genetic test result, their disease risk becomes independent of their parents' disease status. The test result is a more direct indicator.
These observations illustrate conditional independence—the phenomenon where, given certain information, some variables cease to provide additional predictive value about others. This isn't just a simplifying assumption; it reflects genuine structure in how the world works.
Why conditional independence is foundational:
Conditional independence is what transforms probabilistic graphical models from a theoretical curiosity into a practical tool. It's the mathematical mechanism that lets joint distributions over many variables be represented compactly, lets inference decompose into local computations, and makes model structure learnable from reasonable amounts of data.
By the end of this page, you will have a rigorous understanding of conditional independence: its formal definition, key mathematical properties, relationship to graphical model factorization, and practical implications. You will be able to reason precisely about when and why variables are independent, and how this independence enables tractable computation.
Let's establish the formal definition with mathematical precision.
Definition (Conditional Independence):
Random variables $X$ and $Y$ are conditionally independent given $Z$, written:
$$X \perp\!\!\!\perp Y \mid Z$$
if and only if, for all values $x, y, z$ in their respective domains:
$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z) \cdot P(Y = y \mid Z = z)$$
whenever $P(Z = z) > 0$.
Interpretation: Once we know the value of $Z$, knowing $X$ provides no additional information about $Y$, and vice versa. The variables are "screened off" from each other by $Z$.
Equivalent formulations:
The following statements are all equivalent to $X \perp\!\!\!\perp Y \mid Z$ (whenever the conditioning events have positive probability):
1. $P(X, Y \mid Z) = P(X \mid Z) \cdot P(Y \mid Z)$
2. $P(X \mid Y, Z) = P(X \mid Z)$
3. $P(Y \mid X, Z) = P(Y \mid Z)$
4. $P(X, Y, Z) \cdot P(Z) = P(X, Z) \cdot P(Y, Z)$
Formulation 2 says: "If you already know $Z$, learning $Y$ doesn't change your belief about $X$."
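To see why formulation 2 follows from the product definition, divide both sides by $P(Y = y \mid Z = z)$ (assuming it is positive so the conditional is defined):
$$P(X = x \mid Y = y, Z = z) = \frac{P(X = x, Y = y \mid Z = z)}{P(Y = y \mid Z = z)} = \frac{P(X = x \mid Z = z) \cdot P(Y = y \mid Z = z)}{P(Y = y \mid Z = z)} = P(X = x \mid Z = z)$$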
Special case — Marginal Independence:
When $Z$ is empty (unconditional), we write:
$$X \perp\!\!\!\perp Y$$
meaning $P(X, Y) = P(X) \cdot P(Y)$. This is marginal or unconditional independence.
```python
def verify_conditional_independence(P_XYZ, X_vals, Y_vals, Z_vals):
    """
    Verify whether X ⟂ Y | Z holds in a joint distribution P(X, Y, Z).

    P_XYZ: dict mapping (x, y, z) -> probability
    Returns: (True, None) if conditionally independent, else (False, violation details)
    """
    tolerance = 1e-10
    for z in Z_vals:
        # Compute P(Z = z)
        P_z = sum(P_XYZ.get((x, y, z), 0) for x in X_vals for y in Y_vals)
        if P_z < tolerance:
            continue  # Skip if P(Z = z) = 0

        for x in X_vals:
            for y in Y_vals:
                # P(X = x, Y = y | Z = z)
                P_xy_given_z = P_XYZ.get((x, y, z), 0) / P_z
                # P(X = x | Z = z)
                P_x_given_z = sum(P_XYZ.get((x, y_, z), 0) for y_ in Y_vals) / P_z
                # P(Y = y | Z = z)
                P_y_given_z = sum(P_XYZ.get((x_, y, z), 0) for x_ in X_vals) / P_z

                # Check the factorization
                expected = P_x_given_z * P_y_given_z
                if abs(P_xy_given_z - expected) > tolerance:
                    return False, {
                        'x': x, 'y': y, 'z': z,
                        'P(x,y|z)': P_xy_given_z,
                        'P(x|z)*P(y|z)': expected,
                        'difference': abs(P_xy_given_z - expected)
                    }
    return True, None

# Example 1: Classic "screening off" scenario
# Z = Disease, X = Symptom1, Y = Symptom2
# Given the disease, symptoms are independent (caused by disease, not each other)

X_vals = [0, 1]  # Symptom1 absent/present
Y_vals = [0, 1]  # Symptom2 absent/present
Z_vals = [0, 1]  # Disease absent/present

# Define a joint distribution with X ⟂ Y | Z
# P(Z=0) = 0.7 (no disease), P(Z=1) = 0.3 (disease)
# P(X=1|Z=0) = 0.1, P(X=1|Z=1) = 0.8  (symptom 1)
# P(Y=1|Z=0) = 0.2, P(Y=1|Z=1) = 0.7  (symptom 2)
P_Z = {0: 0.7, 1: 0.3}
P_X_given_Z = {(1, 0): 0.1, (0, 0): 0.9, (1, 1): 0.8, (0, 1): 0.2}
P_Y_given_Z = {(1, 0): 0.2, (0, 0): 0.8, (1, 1): 0.7, (0, 1): 0.3}

P_XYZ = {}
for z in Z_vals:
    for x in X_vals:
        for y in Y_vals:
            # Build P(X,Y,Z) = P(Z) * P(X|Z) * P(Y|Z)
            # This construction GUARANTEES X ⟂ Y | Z
            P_XYZ[(x, y, z)] = P_Z[z] * P_X_given_Z[(x, z)] * P_Y_given_Z[(y, z)]

is_ci, violation = verify_conditional_independence(P_XYZ, X_vals, Y_vals, Z_vals)
print("Example 1 (Symptoms given Disease):")
print(f"  X ⟂ Y | Z? {is_ci}")

# But are X and Y marginally independent?
# Compute P(X, Y) by marginalizing over Z
P_XY = {}
for x in X_vals:
    for y in Y_vals:
        P_XY[(x, y)] = sum(P_XYZ[(x, y, z)] for z in Z_vals)

P_X = {x: sum(P_XY[(x, y)] for y in Y_vals) for x in X_vals}
P_Y = {y: sum(P_XY[(x, y)] for x in X_vals) for y in Y_vals}

print("\nMarginal independence check:")
for x in X_vals:
    for y in Y_vals:
        product = P_X[x] * P_Y[y]
        joint = P_XY[(x, y)]
        print(f"  P(X={x},Y={y})={joint:.4f}, P(X={x})*P(Y={y})={product:.4f}")

# Note: They're NOT marginally independent!
```
Conditional independence ≠ marginal independence. A critical insight: $X \perp\!\!\!\perp Y \mid Z$ does NOT imply $X \perp\!\!\!\perp Y$, and vice versa. Variables can be marginally dependent but conditionally independent (the "screening off" pattern). Variables can also be marginally independent but conditionally dependent (the "explaining away" pattern in v-structures). Always distinguish which type of independence you're asserting.
Conditional independence satisfies several important properties known as the graphoid axioms. These axioms govern how independence statements can be combined and derived.
Let $X$, $Y$, $Z$, $W$ be disjoint sets of random variables. The following properties hold for any probability distribution:
Semi-Graphoid Axioms (Always Hold):
1. Symmetry: $$X \perp\!\!\!\perp Y \mid Z \Rightarrow Y \perp\!\!\!\perp X \mid Z$$
Independence is symmetric—if X is independent of Y, then Y is independent of X.
2. Decomposition: $$X \perp\!\!\!\perp Y, W \mid Z \Rightarrow X \perp\!\!\!\perp Y \mid Z \text{ and } X \perp\!\!\!\perp W \mid Z$$
If X is independent of the pair (Y, W), then X is independent of each separately.
3. Weak Union: $$X \perp\!\!\!\perp Y, W \mid Z \Rightarrow X \perp\!\!\!\perp Y \mid Z, W$$
If X is independent of (Y, W) given Z, then X is independent of Y given both Z and W. Adding more conditioning can't create dependence where none existed.
4. Contraction: $$X \perp\!\!\!\perp Y \mid Z, W \text{ and } X \perp\!\!\!\perp W \mid Z \Rightarrow X \perp\!\!\!\perp Y, W \mid Z$$
This is the converse of decomposition: if independences hold separately, they can be combined.
Graphoid Axiom (Holds for Positive Distributions):
5. Intersection: $$X \perp\!\!\!\perp Y \mid Z, W \text{ and } X \perp\!\!\!\perp W \mid Z, Y \Rightarrow X \perp\!\!\!\perp Y, W \mid Z$$
This stronger property holds when all probabilities are strictly positive ($P > 0$ everywhere). It's crucial for connecting graphical models to probability distributions.
Why these axioms matter:
Sound reasoning: We can derive new independence statements from known ones using these rules.
Graph characterization: D-separation (covered next) is sound because it only derives independencies that follow from these axioms.
Completeness results: For positive distributions, the axioms characterize exactly what can be derived about conditional independence.
| Axiom | Statement | Intuition |
|---|---|---|
| Symmetry | $(X \perp\!\!\!\perp Y \mid Z) \Rightarrow (Y \perp\!\!\!\perp X \mid Z)$ | Independence is a symmetric relation |
| Decomposition | $(X \perp\!\!\!\perp Y,W \mid Z) \Rightarrow (X \perp\!\!\!\perp Y \mid Z)$ | Independent of a set implies independent of subsets |
| Weak Union | $(X \perp\!\!\!\perp Y,W \mid Z) \Rightarrow (X \perp\!\!\!\perp Y \mid Z,W)$ | More conditioning preserves independence |
| Contraction | $(X \perp\!\!\!\perp Y \mid Z,W) \land (X \perp\!\!\!\perp W \mid Z) \Rightarrow (X \perp\!\!\!\perp Y,W \mid Z)$ | Combine separate independencies |
| Intersection | $(X \perp\!\!\!\perp Y \mid Z,W) \land (X \perp\!\!\!\perp W \mid Z,Y) \Rightarrow (X \perp\!\!\!\perp Y,W \mid Z)$ | Stronger combination (positive distributions) |
The intersection axiom requires that all joint probabilities be strictly positive. This positivity assumption is commonly made in graphical models, and it is distinct from faithfulness (the assumption that the distribution has no independencies beyond those implied by the graph). Without positivity, deterministic relationships and zero-probability configurations can create "accidental" independencies, leading to scenarios where the graph doesn't fully capture the distribution's independence structure.
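To make the axioms concrete, here is a small sketch (not from the original text; the variable names and helper functions are illustrative) that builds a joint distribution over four binary variables satisfying $X \perp\!\!\!\perp \{Y, W\} \mid Z$ by construction, and then numerically confirms the conclusions of decomposition and weak union:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
VARS = ["X", "Y", "W", "Z"]

# Construct P(x, y, w, z) = P(z) * P(x|z) * P(y, w|z), which guarantees X ⟂ {Y, W} | Z.
P_Z = np.array([0.4, 0.6])
P_X_given_Z = rng.dirichlet(np.ones(2), size=2)    # P_X_given_Z[z, x]
P_YW_given_Z = rng.dirichlet(np.ones(4), size=2)   # P_YW_given_Z[z, 2*y + w]

joint = {}
for x, y, w, z in itertools.product([0, 1], repeat=4):
    joint[(x, y, w, z)] = P_Z[z] * P_X_given_Z[z, x] * P_YW_given_Z[z, 2 * y + w]

def marginal(keep):
    """Marginal over the variables named in `keep` (a tuple of names, in that order)."""
    idx = [VARS.index(v) for v in keep]
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def holds_ci(A, B, C, tol=1e-10):
    """Check A ⟂ B | C via P(A,B,C)*P(C) = P(A,C)*P(B,C) for every assignment."""
    P_ABC, P_AC, P_BC = marginal(A + B + C), marginal(A + C), marginal(B + C)
    P_C = marginal(C) if C else {(): 1.0}
    for key, p_abc in P_ABC.items():
        a, b, c = key[:len(A)], key[len(A):len(A) + len(B)], key[len(A) + len(B):]
        if abs(p_abc * P_C[c] - P_AC[a + c] * P_BC[b + c]) > tol:
            return False
    return True

print("Premise    X ⟂ {Y,W} | Z :", holds_ci(("X",), ("Y", "W"), ("Z",)))
print("Decomp.    X ⟂ Y | Z     :", holds_ci(("X",), ("Y",), ("Z",)))
print("Decomp.    X ⟂ W | Z     :", holds_ci(("X",), ("W",), ("Z",)))
print("Weak union X ⟂ Y | Z,W   :", holds_ci(("X",), ("Y",), ("Z", "W")))
```

All four checks print `True`: the premise holds by construction, and the decomposition and weak-union conclusions follow, as the axioms promise.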
Conditional independence and factorization are intimately connected. In fact, they are two sides of the same coin:
Theorem (Factorization ↔ Conditional Independence):
For a probability distribution $P$ and a graphical model $G$:
$$P \text{ factorizes according to } G \Leftrightarrow P \text{ satisfies the conditional independencies of } G$$
(For undirected models, the direction from independencies to factorization additionally requires positivity; this is the Hammersley–Clifford theorem.)
This is the foundational result connecting the algebraic view (factorization) with the logical view (independence statements).
Example with Bayesian Networks:
Consider a chain: $A \to B \to C$
Factorization: $P(A, B, C) = P(A) \cdot P(B|A) \cdot P(C|B)$
This factorization implies $A \perp\!\!\!\perp C \mid B$. Let's verify:
$$P(A, C | B) = \frac{P(A, B, C)}{P(B)} = \frac{P(A) \cdot P(B|A) \cdot P(C|B)}{P(B)}$$
$$= \frac{P(A, B)}{P(B)} \cdot P(C|B) = P(A|B) \cdot P(C|B)$$
The joint over A and C given B factors into a product—that's exactly conditional independence!
The Markov Properties:
Both directed and undirected models define independence through Markov properties:
For Bayesian Networks (Local Markov Property):
$$X_i \perp\!\!\!\perp \text{NonDescendants}(X_i) \mid \text{Parents}(X_i)$$
Each node is independent of its non-descendants given its parents.
For Markov Random Fields (Local Markov Property):
$$X_i \perp\!\!\!\perp X_{V \setminus \{i\} \setminus \text{Neighbors}(i)} \mid X_{\text{Neighbors}(i)}$$
Each node is independent of all others given its neighbors.
Global Markov Property (Both):
If sets A and B are separated by set C in the graph (meaning all paths from A to B go through C), then:
$$A \perp\!\!\!\perp B \mid C$$
The precise definition of "separated" differs between directed (d-separation) and undirected (simple path separation) graphs.
"""Demonstrating the equivalence of factorization and conditional independence. For a Bayesian Network A → B → C:- Factorization: P(A,B,C) = P(A) P(B|A) P(C|B)- Independence: A ⟂ C | B We'll verify both directions of this equivalence.""" import numpy as npfrom itertools import product # Define a distribution that factors as P(A) P(B|A) P(C|B)class ChainBayesianNetwork: def __init__(self): # P(A): A is Bernoulli(0.4) self.P_A = {0: 0.6, 1: 0.4} # P(B|A) self.P_B_given_A = { (0, 0): 0.8, (1, 0): 0.2, # A=0: B=0 likely (0, 1): 0.3, (1, 1): 0.7, # A=1: B=1 likely } # P(C|B): C depends only on B, not on A self.P_C_given_B = { (0, 0): 0.9, (1, 0): 0.1, # B=0: C=0 likely (0, 1): 0.2, (1, 1): 0.8, # B=1: C=1 likely } def joint(self, a, b, c): """P(A=a, B=b, C=c) = P(A) P(B|A) P(C|B)""" return self.P_A[a] * self.P_B_given_A[(b, a)] * self.P_C_given_B[(c, b)] def marginal_B(self, b): """P(B=b) by marginalizing over A, C""" return sum(self.joint(a, b, c) for a in [0,1] for c in [0,1]) def conditional_A_given_B(self, a, b): """P(A=a | B=b)""" return sum(self.joint(a, b, c) for c in [0,1]) / self.marginal_B(b) def conditional_C_given_B(self, c, b): """P(C=c | B=b)""" return sum(self.joint(a, b, c) for a in [0,1]) / self.marginal_B(b) def conditional_AC_given_B(self, a, c, b): """P(A=a, C=c | B=b)""" return self.joint(a, b, c) / self.marginal_B(b) # Verify A ⟂ C | Bbn = ChainBayesianNetwork() print("Verifying A ⟂ C | B from factorization P(A)P(B|A)P(C|B):")print("=" * 60) for b in [0, 1]: print(f"\nConditioning on B = {b}:") for a in [0, 1]: for c in [0, 1]: p_ac_given_b = bn.conditional_AC_given_B(a, c, b) p_a_given_b = bn.conditional_A_given_B(a, b) p_c_given_b = bn.conditional_C_given_B(c, b) product = p_a_given_b * p_c_given_b match = "✓" if abs(p_ac_given_b - product) < 1e-10 else "✗" print(f" P(A={a},C={c}|B={b}) = {p_ac_given_b:.6f}") print(f" P(A={a}|B={b}) * P(C={c}|B={b}) = {p_a_given_b:.4f} * {p_c_given_b:.4f} = {product:.6f} {match}") print("\n" + "=" * 60)print("CONCLUSION: A ⟂ C | B holds exactly, as implied by the factorization.")print("This verifies: Factorization → Conditional Independence")The factorization-independence equivalence is why graphical models work. When we draw a graph and specify factors according to its structure, we're simultaneously making precise assertions about which conditional independencies hold. Conversely, if we know which independencies hold in a distribution, we know how it can be factorized—and thus which graphical representations are valid.
Before diving into the full d-separation algorithm (next page), let's understand the three fundamental patterns that govern independence in Bayesian networks. Every independence query reduces to combinations of these patterns.
Pattern 1: Chain (Indirect Cause)
A → B → C
Factorization: $P(A,B,C) = P(A) \cdot P(B|A) \cdot P(C|B)$
Intuition: B is a "mediator." If you know the mediator's value, the cause tells you nothing new about the effect.
Example: Smoking → Tar in lungs → Lung cancer. If you know the tar level, smoking status provides no additional information about cancer risk.
Pattern 2: Fork (Common Cause)
B ← A → C
Factorization: $P(A,B,C) = P(A) \cdot P(B|A) \cdot P(C|A)$
Intuition: A is a "confounder." It makes B and C appear related, but once you condition on A, they're independent.
Example: Intelligence → SAT score, Intelligence → GPA. SAT and GPA are correlated, but given intelligence, they provide independent information.
Pattern 3: Collider / V-Structure (Common Effect)
A → C ← B
Factorization: $P(A,B,C) = P(A) \cdot P(B) \cdot P(C|A,B)$
Intuition: This is "explaining away" or "Berkson's paradox." If you know the effect occurred and one cause didn't happen, the other cause becomes more likely.
Example: Talent → Movie success, Luck → Movie success. Talent and luck are independent. But given that someone is a successful actor, if you learn they're not particularly talented, you infer they must have been lucky.
Critical insight: The collider pattern is the OPPOSITE of chains and forks. For chains and forks, conditioning blocks information flow. For colliders, conditioning opens information flow.
Descendant conditioning: The effect extends to descendants of colliders. If $C$ is a collider and $D$ is a descendant of $C$, then conditioning on $D$ also opens the collider (partially).
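The explaining-away effect is easy to verify numerically. The sketch below (illustrative numbers, not from the original text) builds the collider Talent → Success ← Luck and shows that conditioning on the common effect makes the marginally independent causes dependent:

```python
import itertools

# Hypothetical collider: Talent → Success ← Luck (all probabilities are made up)
P_T = {0: 0.9, 1: 0.1}           # P(Talent)
P_L = {0: 0.8, 1: 0.2}           # P(Luck)
P_S1_given_TL = {                 # P(Success = 1 | Talent, Luck)
    (0, 0): 0.01, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.80,
}

def joint(t, l, s):
    p_s1 = P_S1_given_TL[(t, l)]
    return P_T[t] * P_L[l] * (p_s1 if s == 1 else 1 - p_s1)

def prob(event):
    """Sum the joint over all (t, l, s) assignments satisfying `event`."""
    return sum(joint(t, l, s)
               for t, l, s in itertools.product([0, 1], repeat=3)
               if event(t, l, s))

# Marginally, Talent and Luck are independent by construction.
p_t1 = prob(lambda t, l, s: t == 1)
p_t1_given_l1 = prob(lambda t, l, s: t == 1 and l == 1) / prob(lambda t, l, s: l == 1)
print(f"P(T=1)            = {p_t1:.3f}")
print(f"P(T=1 | L=1)      = {p_t1_given_l1:.3f}   (same value: marginally independent)")

# Condition on the common effect: Success = 1.
p_t1_given_s1 = prob(lambda t, l, s: t == 1 and s == 1) / prob(lambda t, l, s: s == 1)
p_t1_given_s1_l1 = (prob(lambda t, l, s: t == 1 and s == 1 and l == 1)
                    / prob(lambda t, l, s: s == 1 and l == 1))
print(f"P(T=1 | S=1)      = {p_t1_given_s1:.3f}")       # ≈ 0.481
print(f"P(T=1 | S=1, L=1) = {p_t1_given_s1_l1:.3f}")    # ≈ 0.308: luck 'explains away' talent
```

Given success, learning that the person was lucky lowers the probability that they are talented, even though talent and luck are unconditionally independent.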
| Pattern | Structure | Marginal | Given Middle Node |
|---|---|---|---|
| Chain | A → B → C | A ↔ C (dependent) | A ⟂ C \| B (B blocks) |
| Fork | A ← B → C | A ↔ C (dependent) | A ⟂ C \| B (B blocks) |
| Collider | A → B ← C | A ⟂ C (independent) | A ↔ C \| B (B opens!) |
The collider pattern catches many people off guard. It's counterintuitive that conditioning on a variable can CREATE dependence where none existed. This is a major source of selection bias in statistics (Berkson's paradox) and must be accounted for in causal inference. Always check if you're conditioning on a collider—it might induce spurious correlations!
For Markov Random Fields (undirected graphs), the independence structure is simpler than for Bayesian networks. There are no colliders to worry about—blocking is purely about graph separation.
Global Markov Property for MRFs:
In an undirected graph $G$, separation encodes independence: if every path from $A$ to $B$ passes through $C$, the graph asserts
$$A \perp\!\!\!\perp B \mid C$$
This is simple graph separation: remove the nodes in $C$, and check if $A$ and $B$ are in disconnected components.
Local Markov Property:
Each node is independent of all non-neighbors given its neighbors:
$$X_i \perp\!\!\!\perp X_{V \setminus \{i\} \setminus \text{Neighbors}(i)} \mid X_{\text{Neighbors}(i)}$$
The neighbors form a "shield" around each node, blocking all information from the rest of the graph.
Pairwise Markov Property:
If there's no edge between nodes $i$ and $j$, then:
$$X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i, j\}}$$
They're independent given all other nodes.
"""Graph separation in undirected graphs is straightforward:Remove conditioning nodes and check connectivity. Example: Grid MRF for an image A — B — C | | | D — E — F""" from collections import defaultdict, deque class UndirectedGraph: def __init__(self): self.edges = defaultdict(set) def add_edge(self, u, v): self.edges[u].add(v) self.edges[v].add(u) def neighbors(self, node): return self.edges[node] def is_separated(self, A: set, B: set, C: set) -> bool: """ Check if A and B are separated by C in the undirected graph. This determines A ⟂ B | C. """ # Remove C from graph and check if A and B are connected # Do BFS from A, see if we can reach B visited = set() queue = deque(A - C) # Start from A nodes not in C visited.update(A - C) while queue: node = queue.popleft() # Check if we reached B if node in B: return False # A and B are connected, not separated # Explore neighbors, avoiding C for neighbor in self.neighbors(node): if neighbor not in visited and neighbor not in C: visited.add(neighbor) queue.append(neighbor) # Check if any B node was reachable return B.isdisjoint(visited) # Build the grid graphg = UndirectedGraph()# Horizontal edgesfor row in ['A', 'D']: g.add_edge(row, chr(ord(row)+1)) # Simplified naming for example # Let's use meaningful node namesg = UndirectedGraph()g.add_edge('A', 'B')g.add_edge('B', 'C')g.add_edge('A', 'D')g.add_edge('B', 'E')g.add_edge('C', 'F')g.add_edge('D', 'E')g.add_edge('E', 'F') # Test separation queriesqueries = [ ({'A'}, {'C'}, {'B'}), # A ⟂ C | B? ({'A'}, {'F'}, {'B', 'D'}), # A ⟂ F | {B, D}? ({'A'}, {'F'}, {'E'}), # A ⟂ F | E? ({'A'}, {'C'}, set()), # A ⟂ C (marginally)? ({'D'}, {'C'}, {'A', 'B'}), # D ⟂ C | {A, B}?] print("Separation queries in grid MRF:")print("=" * 50)for A, B, C in queries: separated = g.is_separated(A, B, C) symbol = "⟂" if separated else "↔" C_str = f"| {C}" if C else "(marginal)" print(f" {A} {symbol} {B} {C_str}") print("\n" + "=" * 50)print("Key insight: In undirected graphs, conditioning ALWAYS blocks.")print("There are no colliders that 'open' paths like in Bayesian networks.")Undirected graphs have simpler independence semantics than directed graphs because there are no colliders. Every conditioning set can only BLOCK paths, never open them. This makes reasoning about independence more straightforward, though it limits expressiveness (can't represent explaining away).
Conditional independence isn't just a theoretical nicety—it's the enabler of computational tractability. Let's quantify the benefits.
Benefit 1: Compact Representation
Without structure: A joint over $n$ binary variables needs $2^n - 1$ parameters.
With structure: A Bayesian network where each node has at most $k$ parents needs at most $n \cdot 2^k$ parameters.
For $n = 100$ and $k = 3$: the unstructured joint needs $2^{100} - 1 \approx 1.3 \times 10^{30}$ parameters, while the structured network needs at most $100 \cdot 2^3 = 800$.
This is an exponential reduction!
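The arithmetic is easy to check; a minimal snippet (assuming binary variables and the parameter bounds stated above) makes the gap explicit:

```python
n, k = 100, 3
full_joint_params = 2 ** n - 1      # unstructured joint over n binary variables
bn_params = n * 2 ** k              # at most 2^k entries per CPT for a binary node
print(f"Full joint   : {full_joint_params:.3e} parameters")   # ~1.27e+30
print(f"Bayesian net : {bn_params} parameters")                # 800
```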
Benefit 2: Efficient Inference
The variable elimination algorithm exploits factorization. To compute $P(X_1)$ from $P(X_1, \ldots, X_n)$:
Naive approach: Sum over all $2^{n-1}$ configurations of other variables.
With factorization: "Push" sums inside products:
$$P(X_1) = \sum_{X_2} \cdots \sum_{X_n} \prod_i \phi_i(X_{\text{scope}(i)})$$
By summing out variables in the right order, intermediate results stay small. For tree-structured graphs, inference is linear in $n$!
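As a concrete illustration (a minimal sketch with made-up CPD numbers), the marginal of the last variable in a binary chain can be computed with $n - 1$ tiny matrix-vector products rather than a sum over $2^{n-1}$ joint configurations:

```python
import numpy as np

n = 100
p_x1 = np.array([0.6, 0.4])           # P(X1)
cpd = np.array([[0.9, 0.1],           # P(X_{i+1} = 0 or 1 | X_i = 0)
                [0.3, 0.7]])          # P(X_{i+1} = 0 or 1 | X_i = 1)

# Variable elimination along the chain: push each sum inside the product.
# After eliminating X1..Xi, `message` holds the marginal P(X_{i+1}).
message = p_x1
for _ in range(n - 1):
    message = message @ cpd           # Σ_{x_i} P(x_i) · P(x_{i+1} | x_i)

print("P(X_100) =", message)          # total work: O(n · 2^2) instead of O(2^n)
```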
"""Demonstrating the computational savings from exploiting factorizationvia variable elimination. Chain: X1 → X2 → ... → XnGoal: Compute P(X1)""" import numpy as npimport time def naive_marginalization(n): """ Naive approach: enumerate all 2^n configurations, sum over all but X1. Complexity: O(2^n) """ # Simulate by counting operations configs = 2 ** n ops_per_config = n # Multiply n factors total_ops = configs * ops_per_config return total_ops def factored_marginalization(n, k=1): """ Exploiting chain factorization: P(X1,...,Xn) = P(X1) Π P(Xi|Xi-1) P(X1) = Σ_X2...Xn P(X1)P(X2|X1)...P(Xn|Xn-1) Work from right to left: 1. Σ_Xn P(Xn|Xn-1) = 1 (CPD sums to 1!) - actually gives a factor over Xn-1 2. Continue eliminating from right Each elimination step is O(2^k) where k is factor width. Total: O(n * 2^k) """ factor_width = k + 1 # Variable + its parent ops_per_elimination = 2 ** factor_width # Sum over variable values total_ops = n * ops_per_elimination return total_ops print("Computational Complexity: Naive vs Factored Marginalization")print("=" * 65)print(f"{'n':<6} {'Naive (2^n)':<20} {'Factored (n*2^2)':<20} {'Speedup':<15}")print("-" * 65) for n in [10, 20, 30, 50, 100]: naive = naive_marginalization(n) factored = factored_marginalization(n, k=1) # Each variable has 1 parent if factored > 0: speedup = naive / factored naive_str = f"{naive:,.0f}" if naive < 1e15 else f"{naive:.2e}" print(f"{n:<6} {naive_str:<20} {factored:<20,} {speedup:<15,.0f}x") print("\n" + "=" * 65)print("For n=100: naive needs 10^30 operations, factored needs 400.")print("This is the power of exploiting conditional independence!")Benefit 3: Sample-Efficient Learning
Learning with fewer parameters requires fewer samples: each small conditional distribution can be estimated reliably from the data relevant to its handful of variables, whereas a full joint table over $n$ variables would need enough data to populate every one of its exponentially many cells.
Conditional independence reduces the statistical complexity of learning, not just the computational complexity.
Benefit 4: Modularity
With factorization, we can specify, estimate, and update each factor locally: improve one conditional distribution without touching the rest of the model, attach new variables by adding new factors, and reuse well-understood components across systems.
This modularity is crucial for building and maintaining large probabilistic systems.
Probabilistic inference is intractable in the worst case (#P-complete). But most real-world problems have structure—conditional independencies that reflect how the world actually works. By exploiting this structure through graphical models, we transform intractable problems into tractable ones. This is the fundamental insight that makes probabilistic AI practical.
In practice, we need methods to both test whether conditional independence holds in data and discover independence structure automatically.
Statistical Tests for Conditional Independence:
Chi-squared test (discrete variables): Test whether the observed joint frequencies differ from expected under independence.
Partial correlation (continuous Gaussian): Variables $X$ and $Y$ are conditionally independent given $Z$ iff the partial correlation $\rho_{XY|Z} = 0$ (a sketch appears after this list).
Conditional mutual information: $I(X; Y \mid Z) = 0$ iff $X \perp\!\!\!\perp Y \mid Z$.
Kernel-based tests: HSIC (Hilbert-Schmidt Independence Criterion) and kernel conditional independence tests for general distributions.
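As an illustration of the partial-correlation test mentioned above, the following sketch (synthetic Gaussian data with illustrative coefficients) regresses $Z$ out of both variables and checks that the residual correlation is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Synthetic Gaussian fork: Z drives both X and Y, so X ⟂ Y | Z holds by construction.
Z = rng.normal(size=n)
X = 2.0 * Z + rng.normal(size=n)
Y = -1.5 * Z + rng.normal(size=n)

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of each."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)   # residual of x ~ z
    ry = y - np.polyval(np.polyfit(z, y, 1), z)   # residual of y ~ z
    return np.corrcoef(rx, ry)[0, 1]

print(f"corr(X, Y)          = {np.corrcoef(X, Y)[0, 1]:+.3f}   (strongly dependent)")
print(f"partial corr(X,Y|Z) = {partial_corr(X, Y, Z):+.3f}   (≈ 0: conditionally independent)")
```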
Challenges in testing: the sample size needed grows quickly with the size of the conditioning set, since each configuration of $Z$ must contain enough data to test reliably; running many tests inflates the false-discovery rate unless corrections (such as Bonferroni) are applied; and failing to reject independence is weak evidence for it when the test has low power.
"""Testing conditional independence from data. Example: Testing X ⟂ Y | Z using chi-squared test for discrete variables.""" import numpy as npfrom scipy.stats import chi2_contingencyfrom collections import Counter def test_conditional_independence(X, Y, Z, alpha=0.05): """ Test H0: X ⟂ Y | Z using stratified chi-squared tests. For each value z of Z: - Build contingency table of X vs Y given Z=z - Compute chi-squared statistic - Combine using Fisher's method or average Returns: (independent, p_value) """ n = len(X) z_values = list(set(Z)) p_values = [] for z in z_values: # Filter data for Z = z mask = [i for i in range(n) if Z[i] == z] if len(mask) < 20: # Need enough data continue X_z = [X[i] for i in mask] Y_z = [Y[i] for i in mask] # Build contingency table x_vals = sorted(set(X_z)) y_vals = sorted(set(Y_z)) table = np.zeros((len(x_vals), len(y_vals))) for x, y in zip(X_z, Y_z): i = x_vals.index(x) j = y_vals.index(y) table[i, j] += 1 # Chi-squared test if table.min() >= 5: # Sufficient expected counts try: chi2, p, dof, expected = chi2_contingency(table) p_values.append(p) except: pass if not p_values: return None, None # Combine p-values (simplified: use minimum with Bonferroni) min_p = min(p_values) * len(p_values) # Bonferroni correction return min_p > alpha, min_p # Generate data where X ⟂ Y | Z actually holdsnp.random.seed(42)n = 2000 # Z causes both X and Y (common cause)Z = np.random.binomial(1, 0.5, n) # X depends on ZP_X_given_Z = {0: 0.3, 1: 0.8}X = np.array([np.random.binomial(1, P_X_given_Z[z]) for z in Z]) # Y depends on Z (independent of X given Z)P_Y_given_Z = {0: 0.2, 1: 0.7}Y = np.array([np.random.binomial(1, P_Y_given_Z[z]) for z in Z]) # Test X ⟂ Y | Zindependent, p_val = test_conditional_independence(X, Y, Z)print(f"Test X ⟂ Y | Z: independent={independent}, p-value={p_val:.4f}") # Test X ⟂ Y (marginally) - should be dependent!# Use chi-squared on the marginal 2x2 tablefrom scipy.stats import chi2_contingencytable_marginal = np.zeros((2, 2))for x, y in zip(X, Y): table_marginal[x, y] += 1chi2, p_marginal, _, _ = chi2_contingency(table_marginal)print(f"Test X ⟂ Y (marginal): p-value={p_marginal:.6f}")print(f" -> Marginally {'independent' if p_marginal > 0.05 else 'DEPENDENT'}") print("\nConclusion: X and Y are marginally dependent but conditionally independent given Z.")print("This confirms the fork pattern: Z → X and Z → Y")Structure Learning via Independence Tests:
The PC algorithm (named after its inventors, Peter Spirtes and Clark Glymour) discovers graph structure by testing conditional independencies:
1. Start with the complete undirected graph over all variables.
2. For increasing conditioning-set sizes, remove the edge between any pair of variables that tests conditionally independent given some subset of their neighbors, recording the separating set.
3. Orient v-structures (colliders) using the recorded separating sets.
4. Propagate the remaining edge orientations using consistency rules.
This constraint-based approach uses conditional independence directly as its building block; a simplified sketch of the edge-removal phase appears below.
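Below is a heavily simplified sketch of the PC skeleton (edge-removal) phase, using a toy independence oracle in place of a statistical test; `pc_skeleton` and `oracle` are illustrative names, not a real library API:

```python
from itertools import combinations

def pc_skeleton(variables, ci_test, max_cond_size=2):
    """
    Simplified PC-style skeleton discovery.
    ci_test(x, y, S) -> True if the test accepts x ⟂ y | S.
    """
    # 1. Start from the complete undirected graph.
    adjacent = {v: set(variables) - {v} for v in variables}
    sepsets = {}

    # 2. Remove edges whose endpoints test independent given some small set.
    for k in range(max_cond_size + 1):              # grow conditioning-set size
        for x, y in combinations(variables, 2):
            if y not in adjacent[x]:
                continue
            candidates = adjacent[x] - {y}          # condition on current neighbours of x
            for S in combinations(sorted(candidates), k):
                if ci_test(x, y, set(S)):
                    adjacent[x].discard(y)          # remove edge x — y
                    adjacent[y].discard(x)
                    sepsets[frozenset((x, y))] = set(S)
                    break
    return adjacent, sepsets                        # sepsets are later used to orient v-structures

# Toy oracle encoding the true independencies of the fork X ← Z → Y:
# the only conditional independence is X ⟂ Y | Z.
def oracle(x, y, S):
    return frozenset((x, y)) == frozenset(("X", "Y")) and S == {"Z"}

skeleton, seps = pc_skeleton(["X", "Y", "Z"], oracle)
print(skeleton)   # X and Y end up non-adjacent; both stay adjacent to Z
print(seps)       # {frozenset({'X', 'Y'}): {'Z'}}
```

With the oracle standing in for a data-driven CI test, the recovered skeleton is exactly the fork's undirected structure: X and Y are disconnected once Z is found as their separating set.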
Let's consolidate the essential insights about conditional independence: it is defined by the factorization $P(X, Y \mid Z) = P(X \mid Z) \cdot P(Y \mid Z)$; it is distinct from marginal independence, and neither implies the other; it obeys the graphoid axioms, which license sound derivations of new independencies; it is equivalent to the factorization asserted by a graphical model; in directed graphs it follows the chain, fork, and collider patterns, while in undirected graphs it is plain graph separation; and it is precisely what makes representation, inference, and learning tractable.
What's next:
Now that we understand conditional independence conceptually, we need a precise algorithm to read independencies from directed graphs. The next page covers d-separation—the complete graphical criterion that tells you exactly which conditional independencies hold in a Bayesian network.
You now have a rigorous understanding of conditional independence—the mathematical foundation of probabilistic graphical models. You can formally define it, understand its properties, connect it to factorization, and reason about the three fundamental patterns in directed graphs. This foundation prepares you for the next topic: d-separation, the algorithmic tool for reading conditional independence from Bayesian network structure.