Consider a medical diagnosis system reasoning about a patient. The patient has hundreds of attributes: symptoms, vital signs, lab results, medical history, demographics. In principle, every attribute could influence every other; in practice, most attributes tell us nothing new about each other once a few key facts are known.
Observation: Given that a patient has pneumonia, their cough symptoms are independent of whether they live in a city or suburb. The disease screens off the symptom from the location.
Observation: Given a patient's genetic test result, their disease risk becomes independent of their parents' disease status. The test result is a more direct indicator.
These observations illustrate conditional independence—the phenomenon where, given certain information, some variables cease to provide additional predictive value about others. This isn't just a simplifying assumption; it reflects genuine structure in how the world works.
Why conditional independence is foundational:
Conditional independence is what transforms probabilistic graphical models from a theoretical curiosity into a practical tool. It's the mathematical mechanism that lets joint distributions over many variables be represented compactly, lets inference decompose into local computations, and makes model structure learnable from reasonable amounts of data.
By the end of this page, you will have a rigorous understanding of conditional independence: its formal definition, key mathematical properties, relationship to graphical model factorization, and practical implications. You will be able to reason precisely about when and why variables are independent, and how this independence enables tractable computation.
Let's establish the formal definition with mathematical precision.
Definition (Conditional Independence):
Random variables $X$ and $Y$ are conditionally independent given $Z$, written:
$$X \perp\!\!\!\perp Y \mid Z$$
if and only if, for all values $x, y, z$ in their respective domains:
$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z) \cdot P(Y = y \mid Z = z)$$
whenever $P(Z = z) > 0$.
Interpretation: Once we know the value of $Z$, knowing $X$ provides no additional information about $Y$, and vice versa. The variables are "screened off" from each other by $Z$.
Equivalent formulations:
The following statements are all equivalent to $X \perp\!\!\!\perp Y \mid Z$ (whenever the conditioning events have positive probability):
1. $P(X, Y \mid Z) = P(X \mid Z) \cdot P(Y \mid Z)$
2. $P(X \mid Y, Z) = P(X \mid Z)$
3. $P(Y \mid X, Z) = P(Y \mid Z)$
4. $P(X, Y, Z) \cdot P(Z) = P(X, Z) \cdot P(Y, Z)$
Formulation 2 says: "If you already know $Z$, learning $Y$ doesn't change your belief about $X$."
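To see why formulation 2 follows from the product definition, divide both sides by $P(Y = y \mid Z = z)$ (assuming it is positive so the conditional is defined):
$$P(X = x \mid Y = y, Z = z) = \frac{P(X = x, Y = y \mid Z = z)}{P(Y = y \mid Z = z)} = \frac{P(X = x \mid Z = z) \cdot P(Y = y \mid Z = z)}{P(Y = y \mid Z = z)} = P(X = x \mid Z = z)$$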
Special case — Marginal Independence:
When $Z$ is empty (unconditional), we write:
$$X \perp\!\!\!\perp Y$$
meaning $P(X, Y) = P(X) \cdot P(Y)$. This is marginal or unconditional independence.
```python
def verify_conditional_independence(P_XYZ, X_vals, Y_vals, Z_vals):
    """
    Verify whether X ⟂ Y | Z holds in a joint distribution P(X, Y, Z).

    P_XYZ: dict mapping (x, y, z) -> probability
    Returns: (True, None) if conditionally independent, else (False, violation details)
    """
    tolerance = 1e-10
    for z in Z_vals:
        # Compute P(Z = z)
        P_z = sum(P_XYZ.get((x, y, z), 0) for x in X_vals for y in Y_vals)
        if P_z < tolerance:
            continue  # Skip if P(Z = z) = 0

        for x in X_vals:
            for y in Y_vals:
                # P(X = x, Y = y | Z = z)
                P_xy_given_z = P_XYZ.get((x, y, z), 0) / P_z
                # P(X = x | Z = z)
                P_x_given_z = sum(P_XYZ.get((x, y_, z), 0) for y_ in Y_vals) / P_z
                # P(Y = y | Z = z)
                P_y_given_z = sum(P_XYZ.get((x_, y, z), 0) for x_ in X_vals) / P_z

                # Check the factorization
                expected = P_x_given_z * P_y_given_z
                if abs(P_xy_given_z - expected) > tolerance:
                    return False, {
                        'x': x, 'y': y, 'z': z,
                        'P(x,y|z)': P_xy_given_z,
                        'P(x|z)*P(y|z)': expected,
                        'difference': abs(P_xy_given_z - expected)
                    }
    return True, None

# Example 1: Classic "screening off" scenario
# Z = Disease, X = Symptom1, Y = Symptom2
# Given the disease, symptoms are independent (caused by disease, not each other)

X_vals = [0, 1]  # Symptom1 absent/present
Y_vals = [0, 1]  # Symptom2 absent/present
Z_vals = [0, 1]  # Disease absent/present

# Define a joint distribution with X ⟂ Y | Z
# P(Z=0) = 0.7 (no disease), P(Z=1) = 0.3 (disease)
# P(X=1|Z=0) = 0.1, P(X=1|Z=1) = 0.8  (symptom 1)
# P(Y=1|Z=0) = 0.2, P(Y=1|Z=1) = 0.7  (symptom 2)
P_Z = {0: 0.7, 1: 0.3}
P_X_given_Z = {(1, 0): 0.1, (0, 0): 0.9, (1, 1): 0.8, (0, 1): 0.2}
P_Y_given_Z = {(1, 0): 0.2, (0, 0): 0.8, (1, 1): 0.7, (0, 1): 0.3}

P_XYZ = {}
for z in Z_vals:
    for x in X_vals:
        for y in Y_vals:
            # Build P(X,Y,Z) = P(Z) * P(X|Z) * P(Y|Z)
            # This construction GUARANTEES X ⟂ Y | Z
            P_XYZ[(x, y, z)] = P_Z[z] * P_X_given_Z[(x, z)] * P_Y_given_Z[(y, z)]

is_ci, violation = verify_conditional_independence(P_XYZ, X_vals, Y_vals, Z_vals)
print("Example 1 (Symptoms given Disease):")
print(f"  X ⟂ Y | Z? {is_ci}")

# But are X and Y marginally independent?
# Compute P(X, Y) by marginalizing over Z
P_XY = {}
for x in X_vals:
    for y in Y_vals:
        P_XY[(x, y)] = sum(P_XYZ[(x, y, z)] for z in Z_vals)

P_X = {x: sum(P_XY[(x, y)] for y in Y_vals) for x in X_vals}
P_Y = {y: sum(P_XY[(x, y)] for x in X_vals) for y in Y_vals}

print("\nMarginal independence check:")
for x in X_vals:
    for y in Y_vals:
        product = P_X[x] * P_Y[y]
        joint = P_XY[(x, y)]
        print(f"  P(X={x},Y={y})={joint:.4f}, P(X={x})*P(Y={y})={product:.4f}")

# Note: They're NOT marginally independent!
```
Conditional independence ≠ marginal independence. A critical insight: $X \perp\!\!\!\perp Y \mid Z$ does NOT imply $X \perp\!\!\!\perp Y$, and vice versa. Variables can be marginally dependent but conditionally independent (the "screening off" pattern). Variables can also be marginally independent but conditionally dependent (the "explaining away" pattern in v-structures). Always distinguish which type of independence you're asserting.
Conditional independence satisfies several important properties known as the graphoid axioms. These axioms govern how independence statements can be combined and derived.
Let $X$, $Y$, $Z$, $W$ be disjoint sets of random variables. The following properties hold for any probability distribution:
Semi-Graphoid Axioms (Always Hold):
1. Symmetry: $$X \perp\!\!\!\perp Y \mid Z \Rightarrow Y \perp\!\!\!\perp X \mid Z$$
Independence is symmetric—if X is independent of Y, then Y is independent of X.
2. Decomposition: $$X \perp\!\!\!\perp Y, W \mid Z \Rightarrow X \perp\!\!\!\perp Y \mid Z \text{ and } X \perp\!\!\!\perp W \mid Z$$
If X is independent of the pair (Y, W), then X is independent of each separately.
3. Weak Union: $$X \perp\!\!\!\perp Y, W \mid Z \Rightarrow X \perp\!\!\!\perp Y \mid Z, W$$
If X is independent of (Y, W) given Z, then X is independent of Y given both Z and W. Adding more conditioning can't create dependence where none existed.
4. Contraction: $$X \perp\!\!\!\perp Y \mid Z, W \text{ and } X \perp\!\!\!\perp W \mid Z \Rightarrow X \perp\!\!\!\perp Y, W \mid Z$$
This is the converse of decomposition: if independences hold separately, they can be combined.
Graphoid Axiom (Holds for Positive Distributions):
5. Intersection: $$X \perp\!\!\!\perp Y \mid Z, W \text{ and } X \perp\!\!\!\perp W \mid Z, Y \Rightarrow X \perp\!\!\!\perp Y, W \mid Z$$
This stronger property holds when all probabilities are strictly positive ($P > 0$ everywhere). It's crucial for connecting graphical models to probability distributions.
Why these axioms matter:
Sound reasoning: We can derive new independence statements from known ones using these rules.
Graph characterization: D-separation (covered next) is sound because it only derives independencies that follow from these axioms.
Completeness results: For positive distributions, the axioms characterize exactly what can be derived about conditional independence.
| Axiom | Statement | Intuition |
|---|---|---|
| Symmetry | $(X \perp\!\!\!\perp Y \mid Z) \Rightarrow (Y \perp\!\!\!\perp X \mid Z)$ | Independence is a symmetric relation |
| Decomposition | $(X \perp\!\!\!\perp Y,W \mid Z) \Rightarrow (X \perp\!\!\!\perp Y \mid Z)$ | Independent of a set implies independent of subsets |
| Weak Union | $(X \perp\!\!\!\perp Y,W \mid Z) \Rightarrow (X \perp\!\!\!\perp Y \mid Z,W)$ | More conditioning preserves independence |
| Contraction | $(X \perp\!\!\!\perp Y \mid Z,W) \land (X \perp\!\!\!\perp W \mid Z) \Rightarrow (X \perp\!\!\!\perp Y,W \mid Z)$ | Combine separate independencies |
| Intersection | $(X \perp\!\!\!\perp Y \mid Z,W) \land (X \perp\!\!\!\perp W \mid Z,Y) \Rightarrow (X \perp\!\!\!\perp Y,W \mid Z)$ | Stronger combination (positive distributions) |
The intersection axiom requires that all joint probabilities be strictly positive. This positivity assumption is commonly made in graphical models, and it is distinct from faithfulness (the assumption that the distribution has no independencies beyond those implied by the graph). Without positivity, deterministic relationships and zero-probability configurations can create "accidental" independencies, leading to scenarios where the graph doesn't fully capture the distribution's independence structure.
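To make the axioms concrete, here is a small sketch (not from the original text; the variable names and helper functions are illustrative) that builds a joint distribution over four binary variables satisfying $X \perp\!\!\!\perp \{Y, W\} \mid Z$ by construction, and then numerically confirms the conclusions of decomposition and weak union:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
VARS = ["X", "Y", "W", "Z"]

# Construct P(x, y, w, z) = P(z) * P(x|z) * P(y, w|z), which guarantees X ⟂ {Y, W} | Z.
P_Z = np.array([0.4, 0.6])
P_X_given_Z = rng.dirichlet(np.ones(2), size=2)    # P_X_given_Z[z, x]
P_YW_given_Z = rng.dirichlet(np.ones(4), size=2)   # P_YW_given_Z[z, 2*y + w]

joint = {}
for x, y, w, z in itertools.product([0, 1], repeat=4):
    joint[(x, y, w, z)] = P_Z[z] * P_X_given_Z[z, x] * P_YW_given_Z[z, 2 * y + w]

def marginal(keep):
    """Marginal over the variables named in `keep` (a tuple of names, in that order)."""
    idx = [VARS.index(v) for v in keep]
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def holds_ci(A, B, C, tol=1e-10):
    """Check A ⟂ B | C via P(A,B,C)*P(C) = P(A,C)*P(B,C) for every assignment."""
    P_ABC, P_AC, P_BC = marginal(A + B + C), marginal(A + C), marginal(B + C)
    P_C = marginal(C) if C else {(): 1.0}
    for key, p_abc in P_ABC.items():
        a, b, c = key[:len(A)], key[len(A):len(A) + len(B)], key[len(A) + len(B):]
        if abs(p_abc * P_C[c] - P_AC[a + c] * P_BC[b + c]) > tol:
            return False
    return True

print("Premise    X ⟂ {Y,W} | Z :", holds_ci(("X",), ("Y", "W"), ("Z",)))
print("Decomp.    X ⟂ Y | Z     :", holds_ci(("X",), ("Y",), ("Z",)))
print("Decomp.    X ⟂ W | Z     :", holds_ci(("X",), ("W",), ("Z",)))
print("Weak union X ⟂ Y | Z,W   :", holds_ci(("X",), ("Y",), ("Z", "W")))
```

All four checks print `True`: the premise holds by construction, and the decomposition and weak-union conclusions follow, as the axioms promise.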
Conditional independence and factorization are intimately connected. In fact, they are two sides of the same coin:
Theorem (Factorization ↔ Conditional Independence):
For a probability distribution $P$ and a graphical model $G$:
$$P \text{ factorizes according to } G \Leftrightarrow P \text{ satisfies the conditional independencies of } G$$
(For undirected models, the direction from independencies to factorization additionally requires positivity; this is the Hammersley–Clifford theorem.)
This is the foundational result connecting the algebraic view (factorization) with the logical view (independence statements).
Example with Bayesian Networks:
Consider a chain: $A \to B \to C$
Factorization: $P(A, B, C) = P(A) \cdot P(B|A) \cdot P(C|B)$
This factorization implies $A \perp\!\!\!\perp C \mid B$. Let's verify:
$$P(A, C | B) = \frac{P(A, B, C)}{P(B)} = \frac{P(A) \cdot P(B|A) \cdot P(C|B)}{P(B)}$$
$$= \frac{P(A, B)}{P(B)} \cdot P(C|B) = P(A|B) \cdot P(C|B)$$
The joint over A and C given B factors into a product—that's exactly conditional independence!
The Markov Properties:
Both directed and undirected models define independence through Markov properties:
For Bayesian Networks (Local Markov Property):
$$X_i \perp\!\!\!\perp \text{NonDescendants}(X_i) \mid \text{Parents}(X_i)$$
Each node is independent of its non-descendants given its parents.
For Markov Random Fields (Local Markov Property):
$$X_i \perp\!\!\!\perp X_{V \setminus \{i\} \setminus \text{Neighbors}(i)} \mid X_{\text{Neighbors}(i)}$$
Each node is independent of all others given its neighbors.
Global Markov Property (Both):
If sets A and B are separated by set C in the graph (meaning all paths from A to B go through C), then:
$$A \perp\!\!\!\perp B \mid C$$
The precise definition of "separated" differs between directed (d-separation) and undirected (simple path separation) graphs.
"""Demonstrating the equivalence of factorization and conditional independence. For a Bayesian Network A → B → C:- Factorization: P(A,B,C) = P(A) P(B|A) P(C|B)- Independence: A ⟂ C | B We'll verify both directions of this equivalence.""" import numpy as npfrom itertools import product # Define a distribution that factors as P(A) P(B|A) P(C|B)class ChainBayesianNetwork: def __init__(self): # P(A): A is Bernoulli(0.4) self.P_A = {0: 0.6, 1: 0.4} # P(B|A) self.P_B_given_A = { (0, 0): 0.8, (1, 0): 0.2, # A=0: B=0 likely (0, 1): 0.3, (1, 1): 0.7, # A=1: B=1 likely } # P(C|B): C depends only on B, not on A self.P_C_given_B = { (0, 0): 0.9, (1, 0): 0.1, # B=0: C=0 likely (0, 1): 0.2, (1, 1): 0.8, # B=1: C=1 likely } def joint(self, a, b, c): """P(A=a, B=b, C=c) = P(A) P(B|A) P(C|B)""" return self.P_A[a] * self.P_B_given_A[(b, a)] * self.P_C_given_B[(c, b)] def marginal_B(self, b): """P(B=b) by marginalizing over A, C""" return sum(self.joint(a, b, c) for a in [0,1] for c in [0,1]) def conditional_A_given_B(self, a, b): """P(A=a | B=b)""" return sum(self.joint(a, b, c) for c in [0,1]) / self.marginal_B(b) def conditional_C_given_B(self, c, b): """P(C=c | B=b)""" return sum(self.joint(a, b, c) for a in [0,1]) / self.marginal_B(b) def conditional_AC_given_B(self, a, c, b): """P(A=a, C=c | B=b)""" return self.joint(a, b, c) / self.marginal_B(b) # Verify A ⟂ C | Bbn = ChainBayesianNetwork() print("Verifying A ⟂ C | B from factorization P(A)P(B|A)P(C|B):")print("=" * 60) for b in [0, 1]: print(f"\nConditioning on B = {b}:") for a in [0, 1]: for c in [0, 1]: p_ac_given_b = bn.conditional_AC_given_B(a, c, b) p_a_given_b = bn.conditional_A_given_B(a, b) p_c_given_b = bn.conditional_C_given_B(c, b) product = p_a_given_b * p_c_given_b match = "✓" if abs(p_ac_given_b - product) < 1e-10 else "✗" print(f" P(A={a},C={c}|B={b}) = {p_ac_given_b:.6f}") print(f" P(A={a}|B={b}) * P(C={c}|B={b}) = {p_a_given_b:.4f} * {p_c_given_b:.4f} = {product:.6f} {match}") print("\n" + "=" * 60)print("CONCLUSION: A ⟂ C | B holds exactly, as implied by the factorization.")print("This verifies: Factorization → Conditional Independence")The factorization-independence equivalence is why graphical models work. When we draw a graph and specify factors according to its structure, we're simultaneously making precise assertions about which conditional independencies hold. Conversely, if we know which independencies hold in a distribution, we know how it can be factorized—and thus which graphical representations are valid.
Before diving into the full d-separation algorithm (next page), let's understand the three fundamental patterns that govern independence in Bayesian networks. Every independence query reduces to combinations of these patterns.
Pattern 1: Chain (Indirect Cause)
A → B → C
Factorization: $P(A,B,C) = P(A) \cdot P(B|A) \cdot P(C|B)$
Intuition: B is a "mediator." If you know the mediator's value, the cause tells you nothing new about the effect.
Example: Smoking → Tar in lungs → Lung cancer. If you know the tar level, smoking status provides no additional information about cancer risk.
Pattern 2: Fork (Common Cause)
B ← A → C
Factorization: $P(A,B,C) = P(A) \cdot P(B|A) \cdot P(C|A)$
Intuition: A is a "confounder." It makes B and C appear related, but once you condition on A, they're independent.
Example: Intelligence → SAT score, Intelligence → GPA. SAT and GPA are correlated, but given intelligence, they provide independent information.
Pattern 3: Collider / V-Structure (Common Effect)
A → C ← B
Factorization: $P(A,B,C) = P(A) \cdot P(B) \cdot P(C|A,B)$
Intuition: This is "explaining away" or "Berkson's paradox." If you know the effect occurred and one cause didn't happen, the other cause becomes more likely.
Example: Talent → Movie success, Luck → Movie success. Talent and luck are independent. But given that someone is a successful actor, if you learn they're not particularly talented, you infer they must have been lucky.
Critical insight: The collider pattern is the OPPOSITE of chains and forks. For chains and forks, conditioning blocks information flow. For colliders, conditioning opens information flow.
Descendant conditioning: The effect extends to descendants of colliders. If $C$ is a collider and $D$ is a descendant of $C$, then conditioning on $D$ also opens the collider (partially).
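The explaining-away effect is easy to verify numerically. The sketch below (illustrative numbers, not from the original text) builds the collider Talent → Success ← Luck and shows that conditioning on the common effect makes the marginally independent causes dependent:

```python
import itertools

# Hypothetical collider: Talent → Success ← Luck (all probabilities are made up)
P_T = {0: 0.9, 1: 0.1}           # P(Talent)
P_L = {0: 0.8, 1: 0.2}           # P(Luck)
P_S1_given_TL = {                 # P(Success = 1 | Talent, Luck)
    (0, 0): 0.01, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.80,
}

def joint(t, l, s):
    p_s1 = P_S1_given_TL[(t, l)]
    return P_T[t] * P_L[l] * (p_s1 if s == 1 else 1 - p_s1)

def prob(event):
    """Sum the joint over all (t, l, s) assignments satisfying `event`."""
    return sum(joint(t, l, s)
               for t, l, s in itertools.product([0, 1], repeat=3)
               if event(t, l, s))

# Marginally, Talent and Luck are independent by construction.
p_t1 = prob(lambda t, l, s: t == 1)
p_t1_given_l1 = prob(lambda t, l, s: t == 1 and l == 1) / prob(lambda t, l, s: l == 1)
print(f"P(T=1)            = {p_t1:.3f}")
print(f"P(T=1 | L=1)      = {p_t1_given_l1:.3f}   (same value: marginally independent)")

# Condition on the common effect: Success = 1.
p_t1_given_s1 = prob(lambda t, l, s: t == 1 and s == 1) / prob(lambda t, l, s: s == 1)
p_t1_given_s1_l1 = (prob(lambda t, l, s: t == 1 and s == 1 and l == 1)
                    / prob(lambda t, l, s: s == 1 and l == 1))
print(f"P(T=1 | S=1)      = {p_t1_given_s1:.3f}")       # ≈ 0.481
print(f"P(T=1 | S=1, L=1) = {p_t1_given_s1_l1:.3f}")    # ≈ 0.308: luck 'explains away' talent
```

Given success, learning that the person was lucky lowers the probability that they are talented, even though talent and luck are unconditionally independent.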
| Pattern | Structure | Marginal | Given Middle Node |
|---|---|---|---|
| Chain | A → B → C | A ↔ C (dependent) | A ⟂ C \| B (B blocks) |
| Fork | A ← B → C | A ↔ C (dependent) | A ⟂ C \| B (B blocks) |
| Collider | A → B ← C | A ⟂ C (independent) | A ↔ C \| B (B opens!) |
The collider pattern catches many people off guard. It's counterintuitive that conditioning on a variable can CREATE dependence where none existed. This is a major source of selection bias in statistics (Berkson's paradox) and must be accounted for in causal inference. Always check if you're conditioning on a collider—it might induce spurious correlations!
For Markov Random Fields (undirected graphs), the independence structure is simpler than for Bayesian networks. There are no colliders to worry about—blocking is purely about graph separation.
Global Markov Property for MRFs:
In an undirected graph $G$, separation encodes independence: if every path from $A$ to $B$ passes through $C$, the graph asserts
$$A \perp\!\!\!\perp B \mid C$$
This is simple graph separation: remove the nodes in $C$, and check if $A$ and $B$ are in disconnected components.
Local Markov Property:
Each node is independent of all non-neighbors given its neighbors:
$$X_i \perp\!\!\!\perp X_{V \setminus \{i\} \setminus \text{Neighbors}(i)} \mid X_{\text{Neighbors}(i)}$$
The neighbors form a "shield" around each node, blocking all information from the rest of the graph.
Pairwise Markov Property:
If there's no edge between nodes $i$ and $j$, then:
$$X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i, j\}}$$
They're independent given all other nodes.
"""Graph separation in undirected graphs is straightforward:Remove conditioning nodes and check connectivity. Example: Grid MRF for an image A — B — C | | | D — E — F""" from collections import defaultdict, deque class UndirectedGraph: def __init__(self): self.edges = defaultdict(set) def add_edge(self, u, v): self.edges[u].add(v) self.edges[v].add(u) def neighbors(self, node): return self.edges[node] def is_separated(self, A: set, B: set, C: set) -> bool: """ Check if A and B are separated by C in the undirected graph. This determines A ⟂ B | C. """ # Remove C from graph and check if A and B are connected # Do BFS from A, see if we can reach B visited = set() queue = deque(A - C) # Start from A nodes not in C visited.update(A - C) while queue: node = queue.popleft() # Check if we reached B if node in B: return False # A and B are connected, not separated # Explore neighbors, avoiding C for neighbor in self.neighbors(node): if neighbor not in visited and neighbor not in C: visited.add(neighbor) queue.append(neighbor) # Check if any B node was reachable return B.isdisjoint(visited) # Build the grid graphg = UndirectedGraph()# Horizontal edgesfor row in ['A', 'D']: g.add_edge(row, chr(ord(row)+1)) # Simplified naming for example # Let's use meaningful node namesg = UndirectedGraph()g.add_edge('A', 'B')g.add_edge('B', 'C')g.add_edge('A', 'D')g.add_edge('B', 'E')g.add_edge('C', 'F')g.add_edge('D', 'E')g.add_edge('E', 'F') # Test separation queriesqueries = [ ({'A'}, {'C'}, {'B'}), # A ⟂ C | B? ({'A'}, {'F'}, {'B', 'D'}), # A ⟂ F | {B, D}? ({'A'}, {'F'}, {'E'}), # A ⟂ F | E? ({'A'}, {'C'}, set()), # A ⟂ C (marginally)? ({'D'}, {'C'}, {'A', 'B'}), # D ⟂ C | {A, B}?] print("Separation queries in grid MRF:")print("=" * 50)for A, B, C in queries: separated = g.is_separated(A, B, C) symbol = "⟂" if separated else "↔" C_str = f"| {C}" if C else "(marginal)" print(f" {A} {symbol} {B} {C_str}") print("\n" + "=" * 50)print("Key insight: In undirected graphs, conditioning ALWAYS blocks.")print("There are no colliders that 'open' paths like in Bayesian networks.")Undirected graphs have simpler independence semantics than directed graphs because there are no colliders. Every conditioning set can only BLOCK paths, never open them. This makes reasoning about independence more straightforward, though it limits expressiveness (can't represent explaining away).
Conditional independence isn't just a theoretical nicety—it's the enabler of computational tractability. Let's quantify the benefits.
Benefit 1: Compact Representation
Without structure: A joint over $n$ binary variables needs $2^n - 1$ parameters.
With structure: A Bayesian network where each node has at most $k$ parents needs at most $n \cdot 2^k$ parameters.
For $n = 100$ and $k = 3$: the unstructured joint needs $2^{100} - 1 \approx 1.3 \times 10^{30}$ parameters, while the structured network needs at most $100 \cdot 2^3 = 800$.
This is an exponential reduction!
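The arithmetic is easy to check; a minimal snippet (assuming binary variables and the parameter bounds stated above) makes the gap explicit:

```python
n, k = 100, 3
full_joint_params = 2 ** n - 1      # unstructured joint over n binary variables
bn_params = n * 2 ** k              # at most 2^k entries per CPT for a binary node
print(f"Full joint   : {full_joint_params:.3e} parameters")   # ~1.27e+30
print(f"Bayesian net : {bn_params} parameters")                # 800
```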
Benefit 2: Efficient Inference
The variable elimination algorithm exploits factorization. To compute $P(X_1)$ from $P(X_1, \ldots, X_n)$:
Naive approach: Sum over all $2^{n-1}$ configurations of other variables.
With factorization: "Push" sums inside products:
$$P(X_1) = \sum_{X_2} \cdots \sum_{X_n} \prod_i \phi_i(X_{\text{scope}(i)})$$
By summing out variables in the right order, intermediate results stay small. For tree-structured graphs, inference is linear in $n$!
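As a concrete illustration (a minimal sketch with made-up CPD numbers), the marginal of the last variable in a binary chain can be computed with $n - 1$ tiny matrix-vector products rather than a sum over $2^{n-1}$ joint configurations:

```python
import numpy as np

n = 100
p_x1 = np.array([0.6, 0.4])           # P(X1)
cpd = np.array([[0.9, 0.1],           # P(X_{i+1} = 0 or 1 | X_i = 0)
                [0.3, 0.7]])          # P(X_{i+1} = 0 or 1 | X_i = 1)

# Variable elimination along the chain: push each sum inside the product.
# After eliminating X1..Xi, `message` holds the marginal P(X_{i+1}).
message = p_x1
for _ in range(n - 1):
    message = message @ cpd           # Σ_{x_i} P(x_i) · P(x_{i+1} | x_i)

print("P(X_100) =", message)          # total work: O(n · 2^2) instead of O(2^n)
```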
"""Demonstrating the computational savings from exploiting factorizationvia variable elimination. Chain: X1 → X2 → ... → XnGoal: Compute P(X1)""" import numpy as npimport time def naive_marginalization(n): """ Naive approach: enumerate all 2^n configurations, sum over all but X1. Complexity: O(2^n) """ # Simulate by counting operations configs = 2 ** n ops_per_config = n # Multiply n factors total_ops = configs * ops_per_config return total_ops def factored_marginalization(n, k=1): """ Exploiting chain factorization: P(X1,...,Xn) = P(X1) Π P(Xi|Xi-1) P(X1) = Σ_X2...Xn P(X1)P(X2|X1)...P(Xn|Xn-1) Work from right to left: 1. Σ_Xn P(Xn|Xn-1) = 1 (CPD sums to 1!) - actually gives a factor over Xn-1 2. Continue eliminating from right Each elimination step is O(2^k) where k is factor width. Total: O(n * 2^k) """ factor_width = k + 1 # Variable + its parent ops_per_elimination = 2 ** factor_width # Sum over variable values total_ops = n * ops_per_elimination return total_ops print("Computational Complexity: Naive vs Factored Marginalization")print("=" * 65)print(f"{'n':<6} {'Naive (2^n)':<20} {'Factored (n*2^2)':<20} {'Speedup':<15}")print("-" * 65) for n in [10, 20, 30, 50, 100]: naive = naive_marginalization(n) factored = factored_marginalization(n, k=1) # Each variable has 1 parent if factored > 0: speedup = naive / factored naive_str = f"{naive:,.0f}" if naive < 1e15 else f"{naive:.2e}" print(f"{n:<6} {naive_str:<20} {factored:<20,} {speedup:<15,.0f}x") print("\n" + "=" * 65)print("For n=100: naive needs 10^30 operations, factored needs 400.")print("This is the power of exploiting conditional independence!")Benefit 3: Sample-Efficient Learning
Learning with fewer parameters requires fewer samples: each small conditional distribution can be estimated reliably from the data relevant to its handful of variables, whereas a full joint table over $n$ variables would need enough data to populate every one of its exponentially many cells.
Conditional independence reduces the statistical complexity of learning, not just the computational complexity.
Benefit 4: Modularity
With factorization, we can specify, estimate, and update each factor locally: improve one conditional distribution without touching the rest of the model, attach new variables by adding new factors, and reuse well-understood components across systems.
This modularity is crucial for building and maintaining large probabilistic systems.
Probabilistic inference is intractable in the worst case (#P-complete). But most real-world problems have structure—conditional independencies that reflect how the world actually works. By exploiting this structure through graphical models, we transform intractable problems into tractable ones. This is the fundamental insight that makes probabilistic AI practical.
In practice, we need methods to both test whether conditional independence holds in data and discover independence structure automatically.
Statistical Tests for Conditional Independence:
Chi-squared test (discrete variables): Test whether the observed joint frequencies differ from expected under independence.
Partial correlation (continuous Gaussian): Variables $X$ and $Y$ are conditionally independent given $Z$ iff the partial correlation $\rho_{XY|Z} = 0$ (a sketch appears after this list).
Conditional mutual information: $I(X; Y \mid Z) = 0$ iff $X \perp\!\!\!\perp Y \mid Z$.
Kernel-based tests: HSIC (Hilbert-Schmidt Independence Criterion) and kernel conditional independence tests for general distributions.
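As an illustration of the partial-correlation test mentioned above, the following sketch (synthetic Gaussian data with illustrative coefficients) regresses $Z$ out of both variables and checks that the residual correlation is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Synthetic Gaussian fork: Z drives both X and Y, so X ⟂ Y | Z holds by construction.
Z = rng.normal(size=n)
X = 2.0 * Z + rng.normal(size=n)
Y = -1.5 * Z + rng.normal(size=n)

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of each."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)   # residual of x ~ z
    ry = y - np.polyval(np.polyfit(z, y, 1), z)   # residual of y ~ z
    return np.corrcoef(rx, ry)[0, 1]

print(f"corr(X, Y)          = {np.corrcoef(X, Y)[0, 1]:+.3f}   (strongly dependent)")
print(f"partial corr(X,Y|Z) = {partial_corr(X, Y, Z):+.3f}   (≈ 0: conditionally independent)")
```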
Challenges in testing: the sample size needed grows quickly with the size of the conditioning set, since each configuration of $Z$ must contain enough data to test reliably; running many tests inflates the false-discovery rate unless corrections (such as Bonferroni) are applied; and failing to reject independence is weak evidence for it when the test has low power.
"""Testing conditional independence from data. Example: Testing X ⟂ Y | Z using chi-squared test for discrete variables.""" import numpy as npfrom scipy.stats import chi2_contingencyfrom collections import Counter def test_conditional_independence(X, Y, Z, alpha=0.05): """ Test H0: X ⟂ Y | Z using stratified chi-squared tests. For each value z of Z: - Build contingency table of X vs Y given Z=z - Compute chi-squared statistic - Combine using Fisher's method or average Returns: (independent, p_value) """ n = len(X) z_values = list(set(Z)) p_values = [] for z in z_values: # Filter data for Z = z mask = [i for i in range(n) if Z[i] == z] if len(mask) < 20: # Need enough data continue X_z = [X[i] for i in mask] Y_z = [Y[i] for i in mask] # Build contingency table x_vals = sorted(set(X_z)) y_vals = sorted(set(Y_z)) table = np.zeros((len(x_vals), len(y_vals))) for x, y in zip(X_z, Y_z): i = x_vals.index(x) j = y_vals.index(y) table[i, j] += 1 # Chi-squared test if table.min() >= 5: # Sufficient expected counts try: chi2, p, dof, expected = chi2_contingency(table) p_values.append(p) except: pass if not p_values: return None, None # Combine p-values (simplified: use minimum with Bonferroni) min_p = min(p_values) * len(p_values) # Bonferroni correction return min_p > alpha, min_p # Generate data where X ⟂ Y | Z actually holdsnp.random.seed(42)n = 2000 # Z causes both X and Y (common cause)Z = np.random.binomial(1, 0.5, n) # X depends on ZP_X_given_Z = {0: 0.3, 1: 0.8}X = np.array([np.random.binomial(1, P_X_given_Z[z]) for z in Z]) # Y depends on Z (independent of X given Z)P_Y_given_Z = {0: 0.2, 1: 0.7}Y = np.array([np.random.binomial(1, P_Y_given_Z[z]) for z in Z]) # Test X ⟂ Y | Zindependent, p_val = test_conditional_independence(X, Y, Z)print(f"Test X ⟂ Y | Z: independent={independent}, p-value={p_val:.4f}") # Test X ⟂ Y (marginally) - should be dependent!# Use chi-squared on the marginal 2x2 tablefrom scipy.stats import chi2_contingencytable_marginal = np.zeros((2, 2))for x, y in zip(X, Y): table_marginal[x, y] += 1chi2, p_marginal, _, _ = chi2_contingency(table_marginal)print(f"Test X ⟂ Y (marginal): p-value={p_marginal:.6f}")print(f" -> Marginally {'independent' if p_marginal > 0.05 else 'DEPENDENT'}") print("\nConclusion: X and Y are marginally dependent but conditionally independent given Z.")print("This confirms the fork pattern: Z → X and Z → Y")Structure Learning via Independence Tests:
The PC algorithm (named after its inventors, Peter Spirtes and Clark Glymour) discovers graph structure by testing conditional independencies:
1. Start with the complete undirected graph over all variables.
2. For increasing conditioning-set sizes, remove the edge between any pair of variables that tests conditionally independent given some subset of their neighbors, recording the separating set.
3. Orient v-structures (colliders) using the recorded separating sets.
4. Propagate the remaining edge orientations using consistency rules.
This constraint-based approach uses conditional independence directly as its building block; a simplified sketch of the edge-removal phase appears below.
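Below is a heavily simplified sketch of the PC skeleton (edge-removal) phase, using a toy independence oracle in place of a statistical test; `pc_skeleton` and `oracle` are illustrative names, not a real library API:

```python
from itertools import combinations

def pc_skeleton(variables, ci_test, max_cond_size=2):
    """
    Simplified PC-style skeleton discovery.
    ci_test(x, y, S) -> True if the test accepts x ⟂ y | S.
    """
    # 1. Start from the complete undirected graph.
    adjacent = {v: set(variables) - {v} for v in variables}
    sepsets = {}

    # 2. Remove edges whose endpoints test independent given some small set.
    for k in range(max_cond_size + 1):              # grow conditioning-set size
        for x, y in combinations(variables, 2):
            if y not in adjacent[x]:
                continue
            candidates = adjacent[x] - {y}          # condition on current neighbours of x
            for S in combinations(sorted(candidates), k):
                if ci_test(x, y, set(S)):
                    adjacent[x].discard(y)          # remove edge x — y
                    adjacent[y].discard(x)
                    sepsets[frozenset((x, y))] = set(S)
                    break
    return adjacent, sepsets                        # sepsets are later used to orient v-structures

# Toy oracle encoding the true independencies of the fork X ← Z → Y:
# the only conditional independence is X ⟂ Y | Z.
def oracle(x, y, S):
    return frozenset((x, y)) == frozenset(("X", "Y")) and S == {"Z"}

skeleton, seps = pc_skeleton(["X", "Y", "Z"], oracle)
print(skeleton)   # X and Y end up non-adjacent; both stay adjacent to Z
print(seps)       # {frozenset({'X', 'Y'}): {'Z'}}
```

With the oracle standing in for a data-driven CI test, the recovered skeleton is exactly the fork's undirected structure: X and Y are disconnected once Z is found as their separating set.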
Let's consolidate the essential insights about conditional independence: it is defined by the factorization $P(X, Y \mid Z) = P(X \mid Z) \cdot P(Y \mid Z)$; it is distinct from marginal independence, and neither implies the other; it obeys the graphoid axioms, which license sound derivations of new independencies; it is equivalent to the factorization asserted by a graphical model; in directed graphs it follows the chain, fork, and collider patterns, while in undirected graphs it is plain graph separation; and it is precisely what makes representation, inference, and learning tractable.
What's next:
Now that we understand conditional independence conceptually, we need a precise algorithm to read independencies from directed graphs. The next page covers d-separation—the complete graphical criterion that tells you exactly which conditional independencies hold in a Bayesian network.
You now have a rigorous understanding of conditional independence—the mathematical foundation of probabilistic graphical models. You can formally define it, understand its properties, connect it to factorization, and reason about the three fundamental patterns in directed graphs. This foundation prepares you for the next topic: d-separation, the algorithmic tool for reading conditional independence from Bayesian network structure.