The power of Bayesian Networks lies not in representing arbitrary joint distributions, but in exploiting structure to make the representation tractable. A joint distribution over $n$ binary variables requires $2^n - 1$ parameters in the general case—a number that becomes astronomical for even modest $n$. For 30 variables, that's over a billion parameters.
Factorization is the mathematical mechanism that transforms this exponential nightmare into something manageable. By decomposing the joint distribution into a product of local conditional distributions—each involving only a variable and its parents—we leverage the DAG structure to achieve dramatic reductions in representational complexity.
This page develops the theory of factorization in Bayesian Networks, from the chain rule foundation through independence-exploiting refinements, culminating in practical understanding of how structure translates to efficiency.
By completing this page, you will: (1) Derive the chain rule factorization from first principles, (2) Understand how conditional independence simplifies the chain rule, (3) Quantify the parameter savings from DAG structure, (4) Compute joint probabilities using factorized representations, and (5) Recognize factorization as the bridge between graph structure and probabilistic computation.
Every factorization in Bayesian Networks derives from the fundamental chain rule of probability (also called the product rule). This is not a Bayesian Network-specific result—it applies to any joint distribution.
Theorem (Chain Rule): For any joint distribution over variables $X_1, X_2, \ldots, X_n$: $$P(X_1, X_2, \ldots, X_n) = P(X_1) \cdot P(X_2 | X_1) \cdot P(X_3 | X_1, X_2) \cdots P(X_n | X_1, \ldots, X_{n-1})$$
Or more compactly: $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | X_1, \ldots, X_{i-1})$$
Proof: This follows directly from the definition of conditional probability applied recursively: $$P(A, B) = P(A) \cdot P(B | A)$$
Extending to three variables: $$P(A, B, C) = P(A) \cdot P(B | A) \cdot P(C | A, B)$$
And by induction to $n$ variables.
The chain rule holds for ANY ordering of variables. We could equally write P(X₃)·P(X₁|X₃)·P(X₂|X₃,X₁). Different orderings give different factorizations with the same numerical result. Bayesian Networks exploit this freedom by choosing orderings consistent with the DAG topology.
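As a quick numeric sanity check, the sketch below (our own example, using a randomly generated joint over three binary variables rather than anything from the text) confirms that two different orderings reproduce the same joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution over three binary variables X1, X2, X3,
# stored as a 2x2x2 table whose entries sum to 1 (axis 0 = X1, 1 = X2, 2 = X3).
P = rng.random((2, 2, 2))
P /= P.sum()

# Marginals needed for the two orderings
P1  = P.sum(axis=(1, 2), keepdims=True)   # P(X1)
P12 = P.sum(axis=2, keepdims=True)        # P(X1, X2)
P3  = P.sum(axis=(0, 1), keepdims=True)   # P(X3)
P13 = P.sum(axis=1, keepdims=True)        # P(X1, X3)

# Ordering (X1, X2, X3):  P(X1) * P(X2|X1) * P(X3|X1,X2)
chain_123 = P1 * (P12 / P1) * (P / P12)
# Ordering (X3, X1, X2):  P(X3) * P(X1|X3) * P(X2|X3,X1)
chain_312 = P3 * (P13 / P3) * (P / P13)

print(np.allclose(chain_123, P), np.allclose(chain_312, P))  # True True
```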
The Problem with the Naive Chain Rule:
While mathematically correct, the naive chain rule doesn't save parameters in the general case. Consider the last factor $P(X_n | X_1, \ldots, X_{n-1})$: it conditions on all $n - 1$ other variables, so for binary variables this single factor already requires $2^{n-1}$ parameters—the same exponential blowup we started with.
The chain rule alone is just a rewriting—it doesn't exploit any structure. What Bayesian Networks add is the observation that many of these conditioning variables are often irrelevant.
Key Insight: If $X_i \perp X_j | X_k$ (conditional independence), then: $$P(X_i | X_j, X_k) = P(X_i | X_k)$$
Conditioning on $X_j$ adds nothing when $X_k$ is known. This is the leverage point that Bayesian Networks exploit.
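As a small numeric illustration (with made-up CPT values, not from the text), the fork $Z \rightarrow X$, $Z \rightarrow Y$ below satisfies $X \perp Y \mid Z$ by construction, and computing $P(X \mid Y, Z)$ from the joint indeed returns $P(X \mid Z)$ regardless of the value of $Y$.

```python
# Build a fork Z -> X, Z -> Y, so X ⊥ Y | Z holds by construction,
# then confirm numerically that P(X | Y, Z) equals P(X | Z).
P_Z = {0: 0.6, 1: 0.4}
P_X_given_Z = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}  # P(X=x | Z=z)
P_Y_given_Z = {(0, 0): 0.2, (1, 0): 0.8, (0, 1): 0.5, (1, 1): 0.5}  # P(Y=y | Z=z)

def joint(x, y, z):
    return P_Z[z] * P_X_given_Z[(x, z)] * P_Y_given_Z[(y, z)]

for z in (0, 1):
    for y in (0, 1):
        # P(X=1 | Y=y, Z=z) computed directly from the joint
        num = joint(1, y, z)
        den = joint(0, y, z) + joint(1, y, z)
        print(f"P(X=1 | Y={y}, Z={z}) = {num/den:.3f}   vs   "
              f"P(X=1 | Z={z}) = {P_X_given_Z[(1, z)]:.3f}")
```

The longer listing that follows quantifies the other side of the coin: how many parameters the naive chain rule needs when no such independencies are exploited.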
```python
def naive_chain_rule_parameters(n, cardinalities):
    """
    Calculate parameters needed for the naive chain rule factorization.

    Each factor P(X_i | X_1, ..., X_{i-1}) requires the product of the earlier
    cardinalities parameters for each value of X_i.

    Args:
        n: Number of variables
        cardinalities: List of number of states per variable

    Returns:
        Total parameters needed
    """
    total = 0
    for i in range(n):
        # For P(X_i | X_1, ..., X_{i-1}):
        # need (cardinality of X_i - 1) per combination of conditioning values
        # (one value is determined by normalization)
        conditioning_combinations = 1
        for j in range(i):
            conditioning_combinations *= cardinalities[j]
        factor_params = (cardinalities[i] - 1) * conditioning_combinations
        total += factor_params
        print(f"P(X_{i+1} | X_1,...,X_{i}): {factor_params} parameters")
    return total


# Example: 5 binary variables
print("=== Naive Chain Rule: 5 Binary Variables ===\n")
cardinalities = [2, 2, 2, 2, 2]
total = naive_chain_rule_parameters(5, cardinalities)
print(f"\nTotal parameters: {total}")
print(f"Full joint table would need: {2**5 - 1} = 31 parameters")

# Scaling problem
print("\n=== Scaling: n Binary Variables ===\n")
for n in [5, 10, 15, 20, 25]:
    # The last factor alone needs 2^(n-1) parameters
    last_factor = 2 ** (n - 1)
    print(f"n={n}: Last factor alone needs {last_factor:,} parameters")
```

The defining property of a Bayesian Network is that the joint distribution factorizes according to the DAG structure. This is not merely a computational convenience—it is the mathematical definition of what it means for a distribution to 'respect' a given graph.
Definition (Bayesian Network Factorization): Let $G = (V, E)$ be a DAG with vertices $X_1, \ldots, X_n$. A probability distribution $P$ is said to factorize according to $G$ (or is Markov with respect to $G$) if: $$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | Pa(X_i))$$
where $Pa(X_i)$ denotes the parents of $X_i$ in the DAG.
Crucial Observation: Compare this to the naive chain rule. There, each factor $P(X_i | X_1, \ldots, X_{i-1})$ conditions on all predecessors; here, each factor $P(X_i | Pa(X_i))$ conditions only on the parents, and for any topological ordering $Pa(X_i) \subseteq \{X_1, \ldots, X_{i-1}\}$.
The difference is enormous. Parents are typically a small subset of predecessors.
Why can we drop non-parent predecessors? Because the Local Markov Property guarantees X_i ⊥ NonDesc(X_i) | Pa(X_i). All predecessors except parents are non-descendants, so given parents, they provide no additional information. This independence justifies the simplified factorization.
Example: The Classic Rain-Sprinkler-Grass Network
Consider four variables:
The DAG structure:
The factorization: $$P(C, R, S, W) = P(C) \cdot P(R|C) \cdot P(S|C) \cdot P(W|R,S)$$
Parameter Count:

- $P(C)$: 1 parameter (no parents)
- $P(R|C)$: 2 parameters (one per value of $C$)
- $P(S|C)$: 2 parameters (one per value of $C$)
- $P(W|R,S)$: 4 parameters (one per combination of $R$ and $S$)

Total: 1 + 2 + 2 + 4 = 9 parameters
Naive approach: $2^4 - 1 = 15$ parameters for the full joint table.
We've reduced parameters by 40%—and this is a tiny network!
| Approach | Factor Structure | Parameters |
|---|---|---|
| Full Joint Table | Single table P(C,R,S,W) | 15 |
| BN Factorization | P(C)·P(R|C)·P(S|C)·P(W|R,S) | 9 |
| Savings | — | 40% reduction |
```python
# Define the Bayesian Network for Rain-Sprinkler-Grass.
# Each CPT is represented as a dictionary.

# P(Cloudy) - no parents
P_C = {
    'cloudy': 0.5,
    'not_cloudy': 0.5
}

# P(Rain | Cloudy)
P_R_given_C = {
    ('rain', 'cloudy'): 0.8,
    ('no_rain', 'cloudy'): 0.2,
    ('rain', 'not_cloudy'): 0.2,
    ('no_rain', 'not_cloudy'): 0.8
}

# P(Sprinkler | Cloudy)
P_S_given_C = {
    ('on', 'cloudy'): 0.1,
    ('off', 'cloudy'): 0.9,
    ('on', 'not_cloudy'): 0.5,
    ('off', 'not_cloudy'): 0.5
}

# P(WetGrass | Rain, Sprinkler)
P_W_given_RS = {
    ('wet', 'rain', 'on'): 0.99,
    ('dry', 'rain', 'on'): 0.01,
    ('wet', 'rain', 'off'): 0.9,
    ('dry', 'rain', 'off'): 0.1,
    ('wet', 'no_rain', 'on'): 0.9,
    ('dry', 'no_rain', 'on'): 0.1,
    ('wet', 'no_rain', 'off'): 0.0,
    ('dry', 'no_rain', 'off'): 1.0
}


def compute_joint_probability(c, r, s, w):
    """
    Compute P(C=c, R=r, S=s, W=w) using the BN factorization:
    P(C,R,S,W) = P(C) · P(R|C) · P(S|C) · P(W|R,S)
    """
    p_c = P_C[c]
    p_r = P_R_given_C[(r, c)]
    p_s = P_S_given_C[(s, c)]
    p_w = P_W_given_RS[(w, r, s)]

    joint = p_c * p_r * p_s * p_w

    print(f"P({c}, {r}, {s}, {w}):")
    print(f"  = P({c}) × P({r}|{c}) × P({s}|{c}) × P({w}|{r},{s})")
    print(f"  = {p_c:.3f} × {p_r:.3f} × {p_s:.3f} × {p_w:.3f}")
    print(f"  = {joint:.6f}")
    return joint


# Example calculations
print("=== Bayesian Network Factorization ===\n")

# Compute specific joint probabilities
compute_joint_probability('cloudy', 'rain', 'off', 'wet')
print()
compute_joint_probability('not_cloudy', 'no_rain', 'on', 'wet')
print()

# Verify normalization by summing over all states
total = 0
for c in ['cloudy', 'not_cloudy']:
    for r in ['rain', 'no_rain']:
        for s in ['on', 'off']:
            for w in ['wet', 'dry']:
                p_c = P_C[c]
                p_r = P_R_given_C[(r, c)]
                p_s = P_S_given_C[(s, c)]
                p_w = P_W_given_RS[(w, r, s)]
                total += p_c * p_r * p_s * p_w

print(f"Sum of all joint probabilities: {total:.6f}")
print("(Should equal 1.0 for valid distribution)")
```

Understanding the parameter count in Bayesian Networks is essential for both theoretical analysis and practical model design. The savings from factorization can be dramatic—often reducing exponential complexity to linear.
General Parameter Count Formula:
For a Bayesian Network with variables $X_1, \ldots, X_n$, each variable $X_i$ having $r_i$ possible states and parent set $Pa(X_i)$:
$$\text{Total Parameters} = \sum_{i=1}^{n} (r_i - 1) \cdot \prod_{X_j \in Pa(X_i)} r_j$$
The factor $(r_i - 1)$ accounts for normalization—one probability in each row is determined by the constraint that probabilities sum to 1.
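As a quick worked instance of the formula (using the Rain-Sprinkler example above, with $r_W$, $r_R$, $r_S$ denoting the number of states of WetGrass, Rain, and Sprinkler), the WetGrass node has $r_W = 2$ and two binary parents, so its CPT contributes

$$(r_W - 1) \cdot r_R \cdot r_S = 1 \cdot 2 \cdot 2 = 4 \text{ parameters,}$$

matching the count of 4 used in the example's total of 9.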
Special Case: All Binary Variables
If all variables are binary ($r_i = 2$ for all $i$): $$\text{Total Parameters} = \sum_{i=1}^{n} 2^{|Pa(X_i)|}$$
Compared to the full joint table: $2^n - 1$ parameters.
For a sparse DAG where each variable has at most k parents (bounded in-degree), the total parameters are O(n · 2^k). If k is small (say 3-5), this is linear in n with a manageable constant. Compare to 2^n for the full joint—exponential vs linear!
Concrete Example: Bounded In-Degree
Consider a network with:

- 100 binary variables
- at most 3 parents per variable
Full joint table: $2^{100} - 1 \approx 10^{30}$ parameters (impossible to store)
BN with bounded in-degree: At most $100 \times 2^3 = 800$ parameters
This is a reduction factor of over $10^{27}$!
Why This Works:
The exponential explosion in joint distributions comes from having to specify a probability for every combination of variable values. The BN factorization says: we don't need to specify everything independently—once we know a variable's parents, we know its distribution. The rest of the network doesn't matter for that local decision.
This is conditional independence in action: structure reduces redundancy.
| Variables (n) | Full Joint | BN (max 3 parents) | Reduction Factor |
|---|---|---|---|
| 10 | 1,023 | 80 | ~13x |
| 20 | ~1 million | 160 | ~6,500x |
| 50 | ~10^15 | 400 | ~10^12x |
| 100 | ~10^30 | 800 | ~10^27x |
| 1000 | ~10^301 | 8,000 | ~10^297x |
```python
import math


def bn_parameter_count(variable_cardinalities, parent_sets):
    """
    Calculate total parameters for a Bayesian Network.

    Args:
        variable_cardinalities: Dict mapping variable -> number of states
        parent_sets: Dict mapping variable -> list of parent variables

    Returns:
        Total parameter count and a per-variable breakdown
    """
    total = 0
    details = []

    for var, cardinality in variable_cardinalities.items():
        parents = parent_sets.get(var, [])

        # Product of parent cardinalities
        parent_combinations = 1
        for parent in parents:
            parent_combinations *= variable_cardinalities[parent]

        # Parameters for this CPT: (cardinality - 1) per parent combination
        var_params = (cardinality - 1) * parent_combinations
        total += var_params

        details.append({
            'variable': var,
            'cardinality': cardinality,
            'parents': parents,
            'parent_combinations': parent_combinations,
            'parameters': var_params
        })

    return total, details


def compare_scalability():
    """Compare BN vs full joint for different network sizes."""
    print("=== Scalability Comparison ===\n")
    print(f"{'n':>5} | {'Full Joint':>15} | {'BN (≤3 parents)':>15} | {'Reduction':>15}")
    print("-" * 60)

    for n in [10, 20, 30, 50, 100]:
        # Full joint for n binary variables
        full_joint = 2 ** n - 1

        # BN with each variable having at most 3 binary parents
        # Worst case: n * 2^3 = 8n
        bn_params = n * 8

        reduction = full_joint / bn_params

        # Format large numbers
        if full_joint > 1e9:
            fj_str = f"~10^{int(math.log10(full_joint))}"
        else:
            fj_str = f"{full_joint:,}"

        if reduction > 1e6:
            red_str = f"~10^{int(math.log10(reduction))}"
        else:
            red_str = f"{reduction:,.0f}x"

        print(f"{n:>5} | {fj_str:>15} | {bn_params:>15,} | {red_str:>15}")


# Example: Analyze a specific network
print("=== Specific Network Analysis ===\n")

cardinalities = {
    'Cloudy': 2,
    'Rain': 2,
    'Sprinkler': 2,
    'WetGrass': 2
}

parents = {
    'Cloudy': [],
    'Rain': ['Cloudy'],
    'Sprinkler': ['Cloudy'],
    'WetGrass': ['Rain', 'Sprinkler']
}

total, details = bn_parameter_count(cardinalities, parents)

for d in details:
    print(f"P({d['variable']} | {d['parents'] if d['parents'] else '∅'}): "
          f"{d['parameters']} parameters")

print(f"\nTotal BN parameters: {total}")
print(f"Full joint parameters: {2**len(cardinalities) - 1}")
print()

compare_scalability()
```

The factorized representation is not just a compressed storage format—it enables efficient computation. Let's examine how to compute various quantities using only the local factors.
Computing Joint Probabilities:
To compute $P(X_1 = x_1, \ldots, X_n = x_n)$:

1. For each variable $X_i$, look up $P(X_i = x_i \mid Pa(X_i) = pa_i)$ in its CPT, where $pa_i$ is the assignment to $X_i$'s parents within $(x_1, \ldots, x_n)$.
2. Multiply the $n$ looked-up values together.
This requires looking up $n$ values and performing $n-1$ multiplications—linear time regardless of the number of variables!
Computing Marginal Probabilities:
To compute $P(X_k = x_k)$ (marginalizing out all other variables), we sum over all configurations of the other variables: $$P(X_k = x_k) = \sum_{x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_n} P(X_1 = x_1, \ldots, X_n = x_n)$$
Naively, this is exponential. However, the factorization enables variable elimination—summing out variables one at a time, exploiting the fact that each factor involves only a subset of variables.
When summing out variable X, any factor that doesn't involve X can be moved outside the sum. This allows us to eliminate variables one by one, combining only the relevant factors at each step. The complexity depends on the elimination order—choosing a good order is itself a challenging optimization problem!
Example: Marginal Computation
Using our Rain-Sprinkler network, let's compute $P(W = \text{wet})$:
$$P(W=\text{wet}) = \sum_{c,r,s} P(C=c) \cdot P(R=r|C=c) \cdot P(S=s|C=c) \cdot P(W=\text{wet}|R=r,S=s)$$
Naive approach: 8 terms (all combinations of $c, r, s$), each requiring 3 multiplications.
Smart approach using variable elimination:
Step 1: Sum out $C$ first. It is tempting to split the sum into two independent pieces, $\phi_R(r) = \sum_c P(C=c) \cdot P(R=r|C=c)$ and $\phi_S(s) = \sum_c P(C=c) \cdot P(S=s|C=c)$, but this is invalid: $C$ appears in both the $R$ and $S$ factors, so the sum over $c$ cannot be distributed across them. The correct step eliminates $C$ from all factors that mention it at once, producing a single new factor over $R$ and $S$: $$\phi(r, s) = \sum_c P(C=c) \cdot P(R=r|C=c) \cdot P(S=s|C=c) = P(R=r, S=s)$$

Step 2: Sum the remaining product over $R$ and $S$: $$P(W=\text{wet}) = \sum_{r,s} \phi(r, s) \cdot P(W=\text{wet}|R=r, S=s)$$

This reveals an important truth: the network structure determines which variables share factors, and therefore how much work each elimination step costs. For this small network any order is cheap, but in larger networks the choice of elimination order matters enormously.
```python
# Define CPTs (Conditional Probability Tables) as factor functions

def P_C(c):
    return 0.5  # P(cloudy) = P(not_cloudy) = 0.5


def P_R_given_C(r, c):
    table = {
        ('rain', 'cloudy'): 0.8, ('no_rain', 'cloudy'): 0.2,
        ('rain', 'not_cloudy'): 0.2, ('no_rain', 'not_cloudy'): 0.8
    }
    return table[(r, c)]


def P_S_given_C(s, c):
    table = {
        ('on', 'cloudy'): 0.1, ('off', 'cloudy'): 0.9,
        ('on', 'not_cloudy'): 0.5, ('off', 'not_cloudy'): 0.5
    }
    return table[(s, c)]


def P_W_given_RS(w, r, s):
    table = {
        ('wet', 'rain', 'on'): 0.99, ('dry', 'rain', 'on'): 0.01,
        ('wet', 'rain', 'off'): 0.9, ('dry', 'rain', 'off'): 0.1,
        ('wet', 'no_rain', 'on'): 0.9, ('dry', 'no_rain', 'on'): 0.1,
        ('wet', 'no_rain', 'off'): 0.0, ('dry', 'no_rain', 'off'): 1.0
    }
    return table[(w, r, s)]


def compute_marginal_naive(target_var, target_val):
    """
    Compute P(target_var = target_val) by summing over all other variables.
    This is the naive, exponential approach.
    """
    c_vals = ['cloudy', 'not_cloudy']
    r_vals = ['rain', 'no_rain']
    s_vals = ['on', 'off']
    w_vals = ['wet', 'dry']

    total = 0.0
    for c in c_vals:
        for r in r_vals:
            for s in s_vals:
                for w in w_vals:
                    # Apply constraint if this is the target variable
                    if target_var == 'W' and w != target_val:
                        continue
                    if target_var == 'R' and r != target_val:
                        continue
                    if target_var == 'S' and s != target_val:
                        continue
                    if target_var == 'C' and c != target_val:
                        continue

                    joint = P_C(c) * P_R_given_C(r, c) * P_S_given_C(s, c) * P_W_given_RS(w, r, s)
                    total += joint
    return total


def compute_marginal_efficient(target_var, target_val):
    """
    Compute P(target_var = target_val) using variable elimination.
    Demonstrates how structure can be exploited.
    """
    # For P(W=w), factor the computation:
    #   sum over (R, S) of P(W|R,S) * [sum over C of P(C) * P(R|C) * P(S|C)]
    if target_var == 'W':
        w = target_val
        total = 0.0
        # For each (R, S) configuration, compute its contribution
        for r in ['rain', 'no_rain']:
            for s in ['on', 'off']:
                # P(W|R,S) doesn't depend on C
                p_w = P_W_given_RS(w, r, s)

                # Sum over C: P(C) * P(R|C) * P(S|C)
                # This gives P(R, S) - the joint marginal of R and S
                p_rs = 0.0
                for c in ['cloudy', 'not_cloudy']:
                    p_rs += P_C(c) * P_R_given_C(r, c) * P_S_given_C(s, c)

                total += p_rs * p_w
        return total

    return None  # For other variables, would implement similarly


# Compare approaches
print("=== Marginal Probability Computation ===\n")

print("P(W = wet):")
print(f"  Naive enumeration:    {compute_marginal_naive('W', 'wet'):.6f}")
print(f"  Variable elimination: {compute_marginal_efficient('W', 'wet'):.6f}")

print("\nP(W = dry):")
print(f"  Naive enumeration:    {compute_marginal_naive('W', 'dry'):.6f}")
print(f"  Variable elimination: {compute_marginal_efficient('W', 'dry'):.6f}")

print("\nP(R = rain):")
print(f"  Naive enumeration:    {compute_marginal_naive('R', 'rain'):.6f}")

print("\nVerification: P(W=wet) + P(W=dry) =",
      compute_marginal_naive('W', 'wet') + compute_marginal_naive('W', 'dry'))
```

The deep connection between conditional independence and factorization is the theoretical cornerstone of Bayesian Networks. Let us formalize this relationship.
Theorem (Factorization Implies Local Markov Property):
If a distribution $P$ factorizes according to DAG $G$: $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | Pa(X_i))$$
Then $P$ satisfies the local Markov property with respect to $G$: $$X_i \perp NonDesc(X_i) \mid Pa(X_i) \quad \forall i$$
Proof Sketch:
Consider any variable $X_i$ and any non-descendant $X_j$ (where $j$ is not $i$ or a descendant of $i$). We want to show $X_i \perp X_j | Pa(X_i)$.
From the factorization, the joint is a product of local factors, and $X_i$ appears in exactly two kinds of factors: its own factor $P(X_i | Pa(X_i))$ and the factors of its children, where it plays the role of a parent.
When we condition on $Pa(X_i)$, the factor $P(X_i | Pa(X_i))$ is fully determined and does not depend on $X_j$. The remaining factors either do not involve $X_i$ at all, or belong to descendants of $X_i$; summing those descendants out leaves no term linking $X_i$ to the non-descendant $X_j$ once the parents are fixed. □
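To make the sketch concrete, here is a small self-contained numeric check using the CPT numbers from the sprinkler example above (the helper name `joint` is ours). Sprinkler's only parent is Cloudy and Rain is a non-descendant of Sprinkler, so the theorem predicts $P(S \mid R, C) = P(S \mid C)$.

```python
# Numeric check of the local Markov property on the sprinkler network.
P_C = {'cloudy': 0.5, 'not_cloudy': 0.5}
P_R = {('rain', 'cloudy'): 0.8, ('no_rain', 'cloudy'): 0.2,
       ('rain', 'not_cloudy'): 0.2, ('no_rain', 'not_cloudy'): 0.8}
P_S = {('on', 'cloudy'): 0.1, ('off', 'cloudy'): 0.9,
       ('on', 'not_cloudy'): 0.5, ('off', 'not_cloudy'): 0.5}
P_W = {('wet', 'rain', 'on'): 0.99, ('dry', 'rain', 'on'): 0.01,
       ('wet', 'rain', 'off'): 0.9, ('dry', 'rain', 'off'): 0.1,
       ('wet', 'no_rain', 'on'): 0.9, ('dry', 'no_rain', 'on'): 0.1,
       ('wet', 'no_rain', 'off'): 0.0, ('dry', 'no_rain', 'off'): 1.0}

def joint(c, r, s, w):
    """P(C,R,S,W) from the BN factorization."""
    return P_C[c] * P_R[(r, c)] * P_S[(s, c)] * P_W[(w, r, s)]

for c in P_C:
    for r in ('rain', 'no_rain'):
        # P(S=on | R=r, C=c), computed by summing the joint over W
        num = sum(joint(c, r, 'on', w) for w in ('wet', 'dry'))
        den = sum(joint(c, r, s, w) for s in ('on', 'off') for w in ('wet', 'dry'))
        print(f"P(S=on | R={r}, C={c}) = {num/den:.3f}   "
              f"CPT P(S=on | C={c}) = {P_S[('on', c)]:.3f}")
```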
For distributions where P(x) > 0 for all configurations x, the converse is also true: if P satisfies the local Markov property with respect to G, then P factorizes according to G. This requires the Hammersley-Clifford theorem for the undirected case and can be shown for DAGs via the d-separation criterion.
Reading Independence from Factorization:
Given the factorized form, we can directly read off conditional independencies:
Each factor reveals a local independence: $P(X_i | Pa(X_i))$ tells us that $X_i$, given its parents, doesn't directly depend on anything else in its factor.
Structure limits information flow: A variable can only influence another through paths in the graph. If there's no path from $X$ to $Y$ (in either direction) that isn't blocked by the conditioning set, independence holds.
Factorization = Minimal representation: The factorized form captures exactly the dependencies that exist—no more, no less. Any additional independence in the true distribution would mean a coarser (simpler) factorization is possible.
Example: Reading Independence
From $P(C,R,S,W) = P(C) \cdot P(R|C) \cdot P(S|C) \cdot P(W|R,S)$:
| Statement | From Factorization | Reasoning |
|---|---|---|
| R ⊥ S \| C | Both condition only on C | Fork structure at C; conditioning blocks the path |
| W ⊥ C \| R, S | W's factor involves only R, S | R, S d-separate W from C |
| C ⊥ W \| R, S | Symmetric to the above | Same d-separation argument |
| R ⊥̸ S (marginal) | Both depend on C | Shared cause creates dependence |
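The first and last rows of this table can be checked numerically. Below is a minimal, self-contained sketch using the CPT numbers from the sprinkler example; the helper name `p_crs` is ours, and WetGrass is omitted because summing it out contributes a factor of 1.

```python
P_C = {'cloudy': 0.5, 'not_cloudy': 0.5}
P_R = {('rain', 'cloudy'): 0.8, ('rain', 'not_cloudy'): 0.2,
       ('no_rain', 'cloudy'): 0.2, ('no_rain', 'not_cloudy'): 0.8}
P_S = {('on', 'cloudy'): 0.1, ('on', 'not_cloudy'): 0.5,
       ('off', 'cloudy'): 0.9, ('off', 'not_cloudy'): 0.5}

def p_crs(c, r, s):
    """P(C=c, R=r, S=s) from the factorization (W already summed out)."""
    return P_C[c] * P_R[(r, c)] * P_S[(s, c)]

# Conditional independence R ⊥ S | C: P(R,S|C) = P(R|C) P(S|C) for every c
for c in P_C:
    p_rs_given_c = p_crs(c, 'rain', 'on') / P_C[c]
    print(f"C={c}: P(R,S|C) = {p_rs_given_c:.3f}, "
          f"P(R|C)P(S|C) = {P_R[('rain', c)] * P_S[('on', c)]:.3f}")

# Marginal dependence R ⊥̸ S: the same product test fails without conditioning
p_rs = sum(p_crs(c, 'rain', 'on') for c in P_C)
p_r = sum(p_crs(c, 'rain', s) for c in P_C for s in ('on', 'off'))
p_s = sum(p_crs(c, r, 'on') for c in P_C for r in ('rain', 'no_rain'))
print(f"P(R,S) = {p_rs:.3f}  vs  P(R)·P(S) = {p_r * p_s:.3f}  (not equal)")
```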
One of the most powerful practical consequences of factorization is that it enables modular, independent learning of different parts of the network. Each conditional probability table can be estimated in isolation.
The Decomposability Principle:
Given the log-likelihood of data $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$:
$$\log P(\mathcal{D} | \theta) = \sum_{j=1}^{N} \log P(\mathbf{x}^{(j)} | \theta)$$
Using factorization: $$= \sum_{j=1}^{N} \sum_{i=1}^{n} \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$$
Rearranging sums: $$= \sum_{i=1}^{n} \sum_{j=1}^{N} \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$$
This shows the log-likelihood decomposes into a sum of local terms, one per variable. Each term $\sum_j \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$ depends only on the parameters $\theta_i$ for variable $X_i$'s CPT.
Because the log-likelihood decomposes into independent terms, we can maximize each term separately! Learning P(X_i | Pa(X_i)) requires only examining columns X_i and Pa(X_i) in the data—not the entire dataset. This is a massive computational advantage.
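As a tiny illustration of the decomposition, the sketch below (a hypothetical two-variable network $C \rightarrow R$ with a made-up five-sample dataset) computes the log-likelihood once through the factorized joint and once as the sum of two local terms; the results agree up to floating-point rounding, and each local term touches only its own columns of the data.

```python
import math

# Illustrative dataset of (C, R) pairs and illustrative CPT values
data = [('cloudy', 'rain'), ('cloudy', 'rain'), ('cloudy', 'no_rain'),
        ('not_cloudy', 'no_rain'), ('not_cloudy', 'rain')]
P_C = {'cloudy': 0.5, 'not_cloudy': 0.5}
P_R_given_C = {('rain', 'cloudy'): 0.8, ('no_rain', 'cloudy'): 0.2,
               ('rain', 'not_cloudy'): 0.2, ('no_rain', 'not_cloudy'): 0.8}

# (a) Log-likelihood via the factorized joint, sample by sample
ll_joint = sum(math.log(P_C[c] * P_R_given_C[(r, c)]) for c, r in data)

# (b) The same quantity regrouped into one local term per CPT
ll_C_term = sum(math.log(P_C[c]) for c, _ in data)                   # uses only column C
ll_R_term = sum(math.log(P_R_given_C[(r, c)]) for c, r in data)      # uses columns C and R

print(ll_joint, ll_C_term + ll_R_term)  # equal up to floating-point rounding
```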
Maximum Likelihood Estimation with Factorization:
For complete data (no missing values), the MLE for each CPT entry is simply the empirical conditional frequency:
$$\hat{P}(X_i = x | Pa(X_i) = \pi) = \frac{\text{Count}(X_i = x, Pa(X_i) = \pi)}{\text{Count}(Pa(X_i) = \pi)}$$
This can be computed with a single pass through the data, counting co-occurrences. No iterative optimization is needed.
Implications for Large Networks:
Parallelization: Each CPT can be learned on a separate processor
Incremental updates: Adding data updates only the relevant counts
Local modifications: Changing network structure requires re-learning only affected CPTs
Missing data handling: Even with missing data, the EM algorithm can exploit factorization within each E and M step
```python
import numpy as np
from collections import defaultdict


def learn_cpt_from_data(data, variable, parents, var_names):
    """
    Learn a single CPT P(variable | parents) from data.

    This demonstrates modular learning - each CPT can be learned independently.

    Args:
        data: Array of shape (N, n_variables) - each row is a sample
        variable: Index of the variable whose CPT we're learning
        parents: List of indices of parent variables
        var_names: List of variable names for display

    Returns:
        CPT as a dictionary mapping (var_value, parent_values) -> probability
    """
    # Count occurrences
    joint_counts = defaultdict(lambda: defaultdict(int))
    parent_counts = defaultdict(int)

    for sample in data:
        var_value = sample[variable]
        parent_values = tuple(sample[p] for p in parents)
        joint_counts[parent_values][var_value] += 1
        parent_counts[parent_values] += 1

    # Compute conditional probabilities
    cpt = {}
    for parent_values, var_counts in joint_counts.items():
        total = parent_counts[parent_values]
        for var_value, count in var_counts.items():
            cpt[(var_value, parent_values)] = count / total

    # Display
    var_name = var_names[variable]
    parent_names = [var_names[p] for p in parents] if parents else []
    print(f"\nLearned CPT for P({var_name} | {parent_names or '∅'}):")
    for (val, parents_val), prob in sorted(cpt.items()):
        if parents:
            print(f"  P({var_name}={val} | {dict(zip(parent_names, parents_val))}) = {prob:.3f}")
        else:
            print(f"  P({var_name}={val}) = {prob:.3f}")

    return cpt


def demonstrate_modular_learning():
    """Demonstrate that each CPT can be learned independently."""
    np.random.seed(42)

    # Generate synthetic data from a known BN
    # Structure: A -> B, A -> C, B -> D, C -> D
    N = 1000
    # Variable indices: A=0, B=1, C=2, D=3
    data = np.zeros((N, 4), dtype=int)

    # Generate according to the BN
    for i in range(N):
        # P(A=1) = 0.3
        a = 1 if np.random.random() < 0.3 else 0

        # P(B=1 | A): if A=1, P=0.8; if A=0, P=0.2
        p_b = 0.8 if a == 1 else 0.2
        b = 1 if np.random.random() < p_b else 0

        # P(C=1 | A): if A=1, P=0.7; if A=0, P=0.1
        p_c = 0.7 if a == 1 else 0.1
        c = 1 if np.random.random() < p_c else 0

        # P(D=1 | B, C)
        if b == 1 and c == 1:
            p_d = 0.95
        elif b == 1 or c == 1:
            p_d = 0.6
        else:
            p_d = 0.05
        d = 1 if np.random.random() < p_d else 0

        data[i] = [a, b, c, d]

    var_names = ['A', 'B', 'C', 'D']

    print("=== Modular CPT Learning ===")
    print(f"\nLearning from {N} samples...")
    print("Each CPT is learned INDEPENDENTLY using only relevant columns.")

    # Learn each CPT independently
    cpt_A = learn_cpt_from_data(data, variable=0, parents=[], var_names=var_names)
    cpt_B = learn_cpt_from_data(data, variable=1, parents=[0], var_names=var_names)
    cpt_C = learn_cpt_from_data(data, variable=2, parents=[0], var_names=var_names)
    cpt_D = learn_cpt_from_data(data, variable=3, parents=[1, 2], var_names=var_names)

    print("\n" + "=" * 50)
    print("Note: Each CPT was learned using only its variable and parents.")
    print("No global optimization was needed!")


demonstrate_modular_learning()
```

A crucial subtlety: a single probability distribution may factorize according to multiple different DAG structures. Two different DAGs can represent exactly the same set of conditional independencies, making them observationally equivalent from data alone.
Definition (I-Equivalence): Two DAGs $G_1$ and $G_2$ are I-equivalent (or Markov equivalent) if they encode the same set of conditional independence relations—meaning a distribution $P$ factorizes according to $G_1$ if and only if it factorizes according to $G_2$.
Example: Three Variable Chains
Consider three variables $A$, $B$, $C$. The following three DAGs are I-equivalent:

- Chain: $A \rightarrow B \rightarrow C$
- Reversed chain: $A \leftarrow B \leftarrow C$
- Fork: $A \leftarrow B \rightarrow C$
All three encode exactly one independence: $A \perp C | B$.
But $A \rightarrow B \leftarrow C$ (collider) is NOT equivalent—it encodes $A \perp C$ marginally, which is different.
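A minimal numeric sketch of this equivalence, using illustrative CPT values of our own: a distribution defined by the chain $A \rightarrow B \rightarrow C$ also factorizes exactly along the reversed chain $C \rightarrow B \rightarrow A$, because both structures encode only $A \perp C \mid B$.

```python
# Define the joint via the chain A -> B -> C (illustrative CPT values)
P_A = {0: 0.3, 1: 0.7}
P_B_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}    # P(B=b | A=a)
P_C_given_B = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}  # P(C=c | B=b)

def joint(a, b, c):                      # chain factorization
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

# Recover the reversed-chain factors P(C), P(B|C), P(A|B) from the same joint
vals = (0, 1)
P_C = {c: sum(joint(a, b, c) for a in vals for b in vals) for c in vals}
P_BC = {(b, c): sum(joint(a, b, c) for a in vals) for b in vals for c in vals}
P_AB = {(a, b): sum(joint(a, b, c) for c in vals) for a in vals for b in vals}
P_B = {b: sum(P_AB[(a, b)] for a in vals) for b in vals}

def joint_reversed(a, b, c):             # C -> B -> A factorization
    return P_C[c] * (P_BC[(b, c)] / P_C[c]) * (P_AB[(a, b)] / P_B[b])

print(all(abs(joint(a, b, c) - joint_reversed(a, b, c)) < 1e-12
          for a in vals for b in vals for c in vals))  # True
```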
I-equivalence is why we cannot learn causal direction from observational data alone. Seeing correlation between A, B, C with A ⊥ C | B is consistent with A causing B causing C, or C causing B causing A, or B causing both. Only interventions or additional assumptions can distinguish these.
Characterizing I-Equivalence:
Theorem (Verma & Pearl, 1990): Two DAGs are I-equivalent if and only if they have:

- the same skeleton (the same edges, ignoring direction), and
- the same set of v-structures (colliders $X \rightarrow Z \leftarrow Y$ in which $X$ and $Y$ are not adjacent).
This means: most edge directions can be reversed freely, but collider structures must be preserved.
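This criterion is easy to check mechanically. The sketch below (our own helper functions, not a library routine) compares skeletons and v-structures for the three-variable examples above, with DAGs given as sets of directed edges.

```python
def skeleton(edges):
    """Undirected edge set: direction is dropped."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Colliders X -> Z <- Y with X and Y not adjacent."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    skel = skeleton(edges)
    vs = set()
    for z, ps in parents.items():
        for x in ps:
            for y in ps:
                if x < y and frozenset((x, y)) not in skel:
                    vs.add((x, z, y))
    return vs

def i_equivalent(e1, e2):
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

chain     = {('A', 'B'), ('B', 'C')}   # A -> B -> C
rev_chain = {('C', 'B'), ('B', 'A')}   # C -> B -> A
collider  = {('A', 'B'), ('C', 'B')}   # A -> B <- C

print(i_equivalent(chain, rev_chain))  # True  (same skeleton, no v-structures)
print(i_equivalent(chain, collider))   # False (v-structure at B)
```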
Equivalence Classes:
I-equivalent DAGs form an equivalence class, often represented by a Completed Partially Directed Acyclic Graph (CPDAG) or Essential Graph: edges whose direction agrees across every DAG in the class are drawn as directed, while edges whose direction varies across the class are left undirected.
A CPDAG is the most informative structure learnable from purely observational data.
| DAG 1 | DAG 2 | Relationship | Reason |
|---|---|---|---|
| A→B→C | C→B→A | I-equivalent | Same skeleton, no v-structures |
| A→B→C | A←B→C | I-equivalent | Same skeleton, no v-structures |
| A→B←C | A→B→C | NOT equivalent | Different v-structures |
| A→B←C | A←B→C | NOT equivalent | Collider vs fork |
Implications for Practice:
Structure learning from data can at best identify the equivalence class, not a unique DAG
Causal interpretation requires additional assumptions (e.g., faithfulness, causal sufficiency) or experimental data
Model comparison should compare equivalence classes, not individual DAGs, when using observational data
Prior knowledge about causal directions can break equivalence and specify a unique DAG
Factorization is the mathematical engine that powers Bayesian Networks. We have developed a thorough understanding of how structure enables compact representation and efficient computation. Let us consolidate the key insights:

- The chain rule rewrites any joint distribution as a product of conditionals; it holds for every variable ordering but, by itself, saves no parameters.
- A Bayesian Network factorizes the joint as $\prod_i P(X_i | Pa(X_i))$, replacing "all predecessors" with "parents" by exploiting conditional independence.
- With in-degree bounded by $k$, the parameter count drops from $O(2^n)$ to $O(n \cdot 2^k)$.
- The factorization enables efficient computation: joint probabilities in linear time and marginals via variable elimination.
- The log-likelihood decomposes per CPT, so parameters can be learned modularly from local counts.
- Different DAGs can encode the same independencies (I-equivalence), so observational data identifies only an equivalence class, not a unique causal structure.
What's Next:
The next page explores Conditional Probability Tables (CPTs)—the data structures that store the local factors $P(X_i | Pa(X_i))$. We'll examine representation formats, normalization constraints, parameterization choices, and practical considerations for working with CPTs in real Bayesian Network implementations.
You now understand factorization as the core mechanism enabling tractable probabilistic reasoning in Bayesian Networks. You can compute joint probabilities, analyze parameter complexity, and recognize how structure translates to computational efficiency. Next, we'll examine how the local factors are actually represented and stored.