The power of Bayesian Networks lies not in representing arbitrary joint distributions, but in exploiting structure to make the representation tractable. A joint distribution over $n$ binary variables requires $2^n - 1$ parameters in the general case—a number that becomes astronomical for even modest $n$. For 30 variables, that's over a billion parameters.
Factorization is the mathematical mechanism that transforms this exponential nightmare into something manageable. By decomposing the joint distribution into a product of local conditional distributions—each involving only a variable and its parents—we leverage the DAG structure to achieve dramatic reductions in representational complexity.
This page develops the theory of factorization in Bayesian Networks, from the chain rule foundation through independence-exploiting refinements, culminating in practical understanding of how structure translates to efficiency.
By completing this page, you will: (1) Derive the chain rule factorization from first principles, (2) Understand how conditional independence simplifies the chain rule, (3) Quantify the parameter savings from DAG structure, (4) Compute joint probabilities using factorized representations, and (5) Recognize factorization as the bridge between graph structure and probabilistic computation.
Every factorization in Bayesian Networks derives from the fundamental chain rule of probability (also called the product rule). This is not a Bayesian Network-specific result—it applies to any joint distribution.
Theorem (Chain Rule): For any joint distribution over variables $X_1, X_2, \ldots, X_n$: $$P(X_1, X_2, \ldots, X_n) = P(X_1) \cdot P(X_2 | X_1) \cdot P(X_3 | X_1, X_2) \cdots P(X_n | X_1, \ldots, X_{n-1})$$
Or more compactly: $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | X_1, \ldots, X_{i-1})$$
Proof: This follows directly from the definition of conditional probability applied recursively: $$P(A, B) = P(A) \cdot P(B | A)$$
Extending to three variables: $$P(A, B, C) = P(A) \cdot P(B | A) \cdot P(C | A, B)$$
And by induction to $n$ variables.
The chain rule holds for ANY ordering of variables. We could equally write P(X₃)·P(X₁|X₃)·P(X₂|X₃,X₁). Different orderings give different factorizations with the same numerical result. Bayesian Networks exploit this freedom by choosing orderings consistent with the DAG topology.
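As a quick numeric sanity check, the sketch below (our own example, using a randomly generated joint over three binary variables rather than anything from the text) confirms that two different orderings reproduce the same joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution over three binary variables X1, X2, X3,
# stored as a 2x2x2 table whose entries sum to 1 (axis 0 = X1, 1 = X2, 2 = X3).
P = rng.random((2, 2, 2))
P /= P.sum()

# Marginals needed for the two orderings
P1  = P.sum(axis=(1, 2), keepdims=True)   # P(X1)
P12 = P.sum(axis=2, keepdims=True)        # P(X1, X2)
P3  = P.sum(axis=(0, 1), keepdims=True)   # P(X3)
P13 = P.sum(axis=1, keepdims=True)        # P(X1, X3)

# Ordering (X1, X2, X3):  P(X1) * P(X2|X1) * P(X3|X1,X2)
chain_123 = P1 * (P12 / P1) * (P / P12)
# Ordering (X3, X1, X2):  P(X3) * P(X1|X3) * P(X2|X3,X1)
chain_312 = P3 * (P13 / P3) * (P / P13)

print(np.allclose(chain_123, P), np.allclose(chain_312, P))  # True True
```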
The Problem with the Naive Chain Rule:
While mathematically correct, the naive chain rule doesn't save parameters in the general case. Consider the last factor $P(X_n | X_1, \ldots, X_{n-1})$: it conditions on all $n - 1$ other variables, so for binary variables this single factor already requires $2^{n-1}$ parameters—the same exponential blowup we started with.
The chain rule alone is just a rewriting—it doesn't exploit any structure. What Bayesian Networks add is the observation that many of these conditioning variables are often irrelevant.
Key Insight: If $X_i \perp X_j | X_k$ (conditional independence), then: $$P(X_i | X_j, X_k) = P(X_i | X_k)$$
Conditioning on $X_j$ adds nothing when $X_k$ is known. This is the leverage point that Bayesian Networks exploit.
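As a small numeric illustration (with made-up CPT values, not from the text), the fork $Z \rightarrow X$, $Z \rightarrow Y$ below satisfies $X \perp Y \mid Z$ by construction, and computing $P(X \mid Y, Z)$ from the joint indeed returns $P(X \mid Z)$ regardless of the value of $Y$.

```python
# Build a fork Z -> X, Z -> Y, so X ⊥ Y | Z holds by construction,
# then confirm numerically that P(X | Y, Z) equals P(X | Z).
P_Z = {0: 0.6, 1: 0.4}
P_X_given_Z = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}  # P(X=x | Z=z)
P_Y_given_Z = {(0, 0): 0.2, (1, 0): 0.8, (0, 1): 0.5, (1, 1): 0.5}  # P(Y=y | Z=z)

def joint(x, y, z):
    return P_Z[z] * P_X_given_Z[(x, z)] * P_Y_given_Z[(y, z)]

for z in (0, 1):
    for y in (0, 1):
        # P(X=1 | Y=y, Z=z) computed directly from the joint
        num = joint(1, y, z)
        den = joint(0, y, z) + joint(1, y, z)
        print(f"P(X=1 | Y={y}, Z={z}) = {num/den:.3f}   vs   "
              f"P(X=1 | Z={z}) = {P_X_given_Z[(1, z)]:.3f}")
```

The longer listing that follows quantifies the other side of the coin: how many parameters the naive chain rule needs when no such independencies are exploited.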
```python
def naive_chain_rule_parameters(n, cardinalities):
    """
    Calculate parameters needed for the naive chain rule factorization.

    Each factor P(X_i | X_1, ..., X_{i-1}) requires the product of the earlier
    cardinalities parameters for each value of X_i.

    Args:
        n: Number of variables
        cardinalities: List of number of states per variable

    Returns:
        Total parameters needed
    """
    total = 0
    for i in range(n):
        # For P(X_i | X_1, ..., X_{i-1}):
        # need (cardinality of X_i - 1) per combination of conditioning values
        # (one value is determined by normalization)
        conditioning_combinations = 1
        for j in range(i):
            conditioning_combinations *= cardinalities[j]
        factor_params = (cardinalities[i] - 1) * conditioning_combinations
        total += factor_params
        print(f"P(X_{i+1} | X_1,...,X_{i}): {factor_params} parameters")
    return total


# Example: 5 binary variables
print("=== Naive Chain Rule: 5 Binary Variables ===\n")
cardinalities = [2, 2, 2, 2, 2]
total = naive_chain_rule_parameters(5, cardinalities)
print(f"\nTotal parameters: {total}")
print(f"Full joint table would need: {2**5 - 1} = 31 parameters")

# Scaling problem
print("\n=== Scaling: n Binary Variables ===\n")
for n in [5, 10, 15, 20, 25]:
    # The last factor alone needs 2^(n-1) parameters
    last_factor = 2 ** (n - 1)
    print(f"n={n}: Last factor alone needs {last_factor:,} parameters")
```

The defining property of a Bayesian Network is that the joint distribution factorizes according to the DAG structure. This is not merely a computational convenience—it is the mathematical definition of what it means for a distribution to 'respect' a given graph.
Definition (Bayesian Network Factorization): Let $G = (V, E)$ be a DAG with vertices $X_1, \ldots, X_n$. A probability distribution $P$ is said to factorize according to $G$ (or is Markov with respect to $G$) if: $$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | Pa(X_i))$$
where $Pa(X_i)$ denotes the parents of $X_i$ in the DAG.
Crucial Observation: Compare this to the naive chain rule. There, each factor $P(X_i | X_1, \ldots, X_{i-1})$ conditions on all predecessors; here, each factor $P(X_i | Pa(X_i))$ conditions only on the parents, and for any topological ordering $Pa(X_i) \subseteq \{X_1, \ldots, X_{i-1}\}$.
The difference is enormous. Parents are typically a small subset of predecessors.
Why can we drop non-parent predecessors? Because the Local Markov Property guarantees X_i ⊥ NonDesc(X_i) | Pa(X_i). All predecessors except parents are non-descendants, so given parents, they provide no additional information. This independence justifies the simplified factorization.
Example: The Classic Rain-Sprinkler-Grass Network
Consider four variables:
The DAG structure:
The factorization: $$P(C, R, S, W) = P(C) \cdot P(R|C) \cdot P(S|C) \cdot P(W|R,S)$$
Parameter Count:

- $P(C)$: 1 parameter (no parents)
- $P(R|C)$: 2 parameters (one per value of $C$)
- $P(S|C)$: 2 parameters (one per value of $C$)
- $P(W|R,S)$: 4 parameters (one per combination of $R$ and $S$)

Total: 1 + 2 + 2 + 4 = 9 parameters
Naive approach: $2^4 - 1 = 15$ parameters for the full joint table.
We've reduced parameters by 40%—and this is a tiny network!
| Approach | Factor Structure | Parameters |
|---|---|---|
| Full Joint Table | Single table P(C,R,S,W) | 15 |
| BN Factorization | P(C)·P(R|C)·P(S|C)·P(W|R,S) | 9 |
| Savings | — | 40% reduction |
```python
# Define the Bayesian Network for Rain-Sprinkler-Grass.
# Each CPT is represented as a dictionary.

# P(Cloudy) - no parents
P_C = {
    'cloudy': 0.5,
    'not_cloudy': 0.5
}

# P(Rain | Cloudy)
P_R_given_C = {
    ('rain', 'cloudy'): 0.8,
    ('no_rain', 'cloudy'): 0.2,
    ('rain', 'not_cloudy'): 0.2,
    ('no_rain', 'not_cloudy'): 0.8
}

# P(Sprinkler | Cloudy)
P_S_given_C = {
    ('on', 'cloudy'): 0.1,
    ('off', 'cloudy'): 0.9,
    ('on', 'not_cloudy'): 0.5,
    ('off', 'not_cloudy'): 0.5
}

# P(WetGrass | Rain, Sprinkler)
P_W_given_RS = {
    ('wet', 'rain', 'on'): 0.99,
    ('dry', 'rain', 'on'): 0.01,
    ('wet', 'rain', 'off'): 0.9,
    ('dry', 'rain', 'off'): 0.1,
    ('wet', 'no_rain', 'on'): 0.9,
    ('dry', 'no_rain', 'on'): 0.1,
    ('wet', 'no_rain', 'off'): 0.0,
    ('dry', 'no_rain', 'off'): 1.0
}


def compute_joint_probability(c, r, s, w):
    """
    Compute P(C=c, R=r, S=s, W=w) using the BN factorization:
    P(C,R,S,W) = P(C) · P(R|C) · P(S|C) · P(W|R,S)
    """
    p_c = P_C[c]
    p_r = P_R_given_C[(r, c)]
    p_s = P_S_given_C[(s, c)]
    p_w = P_W_given_RS[(w, r, s)]

    joint = p_c * p_r * p_s * p_w

    print(f"P({c}, {r}, {s}, {w}):")
    print(f"  = P({c}) × P({r}|{c}) × P({s}|{c}) × P({w}|{r},{s})")
    print(f"  = {p_c:.3f} × {p_r:.3f} × {p_s:.3f} × {p_w:.3f}")
    print(f"  = {joint:.6f}")
    return joint


# Example calculations
print("=== Bayesian Network Factorization ===\n")

# Compute specific joint probabilities
compute_joint_probability('cloudy', 'rain', 'off', 'wet')
print()
compute_joint_probability('not_cloudy', 'no_rain', 'on', 'wet')
print()

# Verify normalization by summing over all states
total = 0
for c in ['cloudy', 'not_cloudy']:
    for r in ['rain', 'no_rain']:
        for s in ['on', 'off']:
            for w in ['wet', 'dry']:
                p_c = P_C[c]
                p_r = P_R_given_C[(r, c)]
                p_s = P_S_given_C[(s, c)]
                p_w = P_W_given_RS[(w, r, s)]
                total += p_c * p_r * p_s * p_w

print(f"Sum of all joint probabilities: {total:.6f}")
print("(Should equal 1.0 for valid distribution)")
```

Understanding the parameter count in Bayesian Networks is essential for both theoretical analysis and practical model design. The savings from factorization can be dramatic—often reducing exponential complexity to linear.
General Parameter Count Formula:
For a Bayesian Network with variables $X_1, \ldots, X_n$, each variable $X_i$ having $r_i$ possible states and parent set $Pa(X_i)$:
$$\text{Total Parameters} = \sum_{i=1}^{n} (r_i - 1) \cdot \prod_{X_j \in Pa(X_i)} r_j$$
The factor $(r_i - 1)$ accounts for normalization—one probability in each row is determined by the constraint that probabilities sum to 1.
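As a quick worked instance of the formula (using the Rain-Sprinkler example above, with $r_W$, $r_R$, $r_S$ denoting the number of states of WetGrass, Rain, and Sprinkler), the WetGrass node has $r_W = 2$ and two binary parents, so its CPT contributes

$$(r_W - 1) \cdot r_R \cdot r_S = 1 \cdot 2 \cdot 2 = 4 \text{ parameters,}$$

matching the count of 4 used in the example's total of 9.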
Special Case: All Binary Variables
If all variables are binary ($r_i = 2$ for all $i$): $$\text{Total Parameters} = \sum_{i=1}^{n} 2^{|Pa(X_i)|}$$
Compared to the full joint table: $2^n - 1$ parameters.
For a sparse DAG where each variable has at most k parents (bounded in-degree), the total parameters are O(n · 2^k). If k is small (say 3-5), this is linear in n with a manageable constant. Compare to 2^n for the full joint—exponential vs linear!
Concrete Example: Bounded In-Degree
Consider a network with:

- 100 binary variables
- at most 3 parents per variable
Full joint table: $2^{100} - 1 \approx 10^{30}$ parameters (impossible to store)
BN with bounded in-degree: At most $100 \times 2^3 = 800$ parameters
This is a reduction factor of over $10^{27}$!
Why This Works:
The exponential explosion in joint distributions comes from having to specify a probability for every combination of variable values. The BN factorization says: we don't need to specify everything independently—once we know a variable's parents, we know its distribution. The rest of the network doesn't matter for that local decision.
This is conditional independence in action: structure reduces redundancy.
| Variables (n) | Full Joint | BN (max 3 parents) | Reduction Factor |
|---|---|---|---|
| 10 | 1,023 | 80 | ~13x |
| 20 | ~1 million | 160 | ~6,500x |
| 50 | ~10^15 | 400 | ~10^12x |
| 100 | ~10^30 | 800 | ~10^27x |
| 1000 | ~10^301 | 8,000 | ~10^297x |
```python
import math


def bn_parameter_count(variable_cardinalities, parent_sets):
    """
    Calculate total parameters for a Bayesian Network.

    Args:
        variable_cardinalities: Dict mapping variable -> number of states
        parent_sets: Dict mapping variable -> list of parent variables

    Returns:
        Total parameter count and a per-variable breakdown
    """
    total = 0
    details = []

    for var, cardinality in variable_cardinalities.items():
        parents = parent_sets.get(var, [])

        # Product of parent cardinalities
        parent_combinations = 1
        for parent in parents:
            parent_combinations *= variable_cardinalities[parent]

        # Parameters for this CPT: (cardinality - 1) per parent combination
        var_params = (cardinality - 1) * parent_combinations
        total += var_params

        details.append({
            'variable': var,
            'cardinality': cardinality,
            'parents': parents,
            'parent_combinations': parent_combinations,
            'parameters': var_params
        })

    return total, details


def compare_scalability():
    """Compare BN vs full joint for different network sizes."""
    print("=== Scalability Comparison ===\n")
    print(f"{'n':>5} | {'Full Joint':>15} | {'BN (≤3 parents)':>15} | {'Reduction':>15}")
    print("-" * 60)

    for n in [10, 20, 30, 50, 100]:
        # Full joint for n binary variables
        full_joint = 2 ** n - 1

        # BN with each variable having at most 3 binary parents
        # Worst case: n * 2^3 = 8n
        bn_params = n * 8

        reduction = full_joint / bn_params

        # Format large numbers
        if full_joint > 1e9:
            fj_str = f"~10^{int(math.log10(full_joint))}"
        else:
            fj_str = f"{full_joint:,}"

        if reduction > 1e6:
            red_str = f"~10^{int(math.log10(reduction))}"
        else:
            red_str = f"{reduction:,.0f}x"

        print(f"{n:>5} | {fj_str:>15} | {bn_params:>15,} | {red_str:>15}")


# Example: Analyze a specific network
print("=== Specific Network Analysis ===\n")

cardinalities = {
    'Cloudy': 2,
    'Rain': 2,
    'Sprinkler': 2,
    'WetGrass': 2
}

parents = {
    'Cloudy': [],
    'Rain': ['Cloudy'],
    'Sprinkler': ['Cloudy'],
    'WetGrass': ['Rain', 'Sprinkler']
}

total, details = bn_parameter_count(cardinalities, parents)

for d in details:
    print(f"P({d['variable']} | {d['parents'] if d['parents'] else '∅'}): "
          f"{d['parameters']} parameters")

print(f"\nTotal BN parameters: {total}")
print(f"Full joint parameters: {2**len(cardinalities) - 1}")
print()

compare_scalability()
```

The factorized representation is not just a compressed storage format—it enables efficient computation. Let's examine how to compute various quantities using only the local factors.
Computing Joint Probabilities:
To compute $P(X_1 = x_1, \ldots, X_n = x_n)$:

1. For each variable $X_i$, look up $P(X_i = x_i \mid Pa(X_i) = pa_i)$ in its CPT, where $pa_i$ is the assignment to $X_i$'s parents within $(x_1, \ldots, x_n)$.
2. Multiply the $n$ looked-up values together.
This requires looking up $n$ values and performing $n-1$ multiplications—linear time regardless of the number of variables!
Computing Marginal Probabilities:
To compute $P(X_k = x_k)$ (marginalizing out all other variables), we sum over all configurations of the other variables: $$P(X_k = x_k) = \sum_{x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_n} P(X_1 = x_1, \ldots, X_n = x_n)$$
Naively, this is exponential. However, the factorization enables variable elimination—summing out variables one at a time, exploiting the fact that each factor involves only a subset of variables.
When summing out variable X, any factor that doesn't involve X can be moved outside the sum. This allows us to eliminate variables one by one, combining only the relevant factors at each step. The complexity depends on the elimination order—choosing a good order is itself a challenging optimization problem!
Example: Marginal Computation
Using our Rain-Sprinkler network, let's compute $P(W = \text{wet})$:
$$P(W=\text{wet}) = \sum_{c,r,s} P(C=c) \cdot P(R=r|C=c) \cdot P(S=s|C=c) \cdot P(W=\text{wet}|R=r,S=s)$$
Naive approach: 8 terms (all combinations of $c, r, s$), each requiring 3 multiplications.
Smart approach using variable elimination:
Step 1: Sum out $C$ first. It is tempting to split the sum into two independent pieces, $\phi_R(r) = \sum_c P(C=c) \cdot P(R=r|C=c)$ and $\phi_S(s) = \sum_c P(C=c) \cdot P(S=s|C=c)$, but this is invalid: $C$ appears in both the $R$ and $S$ factors, so the sum over $c$ cannot be distributed across them. The correct step eliminates $C$ from all factors that mention it at once, producing a single new factor over $R$ and $S$: $$\phi(r, s) = \sum_c P(C=c) \cdot P(R=r|C=c) \cdot P(S=s|C=c) = P(R=r, S=s)$$

Step 2: Sum the remaining product over $R$ and $S$: $$P(W=\text{wet}) = \sum_{r,s} \phi(r, s) \cdot P(W=\text{wet}|R=r, S=s)$$

This reveals an important truth: the network structure determines which variables share factors, and therefore how much work each elimination step costs. For this small network any order is cheap, but in larger networks the choice of elimination order matters enormously.
```python
# Define CPTs (Conditional Probability Tables) as factor functions

def P_C(c):
    return 0.5  # P(cloudy) = P(not_cloudy) = 0.5


def P_R_given_C(r, c):
    table = {
        ('rain', 'cloudy'): 0.8, ('no_rain', 'cloudy'): 0.2,
        ('rain', 'not_cloudy'): 0.2, ('no_rain', 'not_cloudy'): 0.8
    }
    return table[(r, c)]


def P_S_given_C(s, c):
    table = {
        ('on', 'cloudy'): 0.1, ('off', 'cloudy'): 0.9,
        ('on', 'not_cloudy'): 0.5, ('off', 'not_cloudy'): 0.5
    }
    return table[(s, c)]


def P_W_given_RS(w, r, s):
    table = {
        ('wet', 'rain', 'on'): 0.99, ('dry', 'rain', 'on'): 0.01,
        ('wet', 'rain', 'off'): 0.9, ('dry', 'rain', 'off'): 0.1,
        ('wet', 'no_rain', 'on'): 0.9, ('dry', 'no_rain', 'on'): 0.1,
        ('wet', 'no_rain', 'off'): 0.0, ('dry', 'no_rain', 'off'): 1.0
    }
    return table[(w, r, s)]


def compute_marginal_naive(target_var, target_val):
    """
    Compute P(target_var = target_val) by summing over all other variables.
    This is the naive, exponential approach.
    """
    c_vals = ['cloudy', 'not_cloudy']
    r_vals = ['rain', 'no_rain']
    s_vals = ['on', 'off']
    w_vals = ['wet', 'dry']

    total = 0.0
    for c in c_vals:
        for r in r_vals:
            for s in s_vals:
                for w in w_vals:
                    # Apply constraint if this is the target variable
                    if target_var == 'W' and w != target_val:
                        continue
                    if target_var == 'R' and r != target_val:
                        continue
                    if target_var == 'S' and s != target_val:
                        continue
                    if target_var == 'C' and c != target_val:
                        continue

                    joint = P_C(c) * P_R_given_C(r, c) * P_S_given_C(s, c) * P_W_given_RS(w, r, s)
                    total += joint
    return total


def compute_marginal_efficient(target_var, target_val):
    """
    Compute P(target_var = target_val) using variable elimination.
    Demonstrates how structure can be exploited.
    """
    # For P(W=w), factor the computation:
    #   sum over (R, S) of P(W|R,S) * [sum over C of P(C) * P(R|C) * P(S|C)]
    if target_var == 'W':
        w = target_val
        total = 0.0
        # For each (R, S) configuration, compute its contribution
        for r in ['rain', 'no_rain']:
            for s in ['on', 'off']:
                # P(W|R,S) doesn't depend on C
                p_w = P_W_given_RS(w, r, s)

                # Sum over C: P(C) * P(R|C) * P(S|C)
                # This gives P(R, S) - the joint marginal of R and S
                p_rs = 0.0
                for c in ['cloudy', 'not_cloudy']:
                    p_rs += P_C(c) * P_R_given_C(r, c) * P_S_given_C(s, c)

                total += p_rs * p_w
        return total

    return None  # For other variables, would implement similarly


# Compare approaches
print("=== Marginal Probability Computation ===\n")

print("P(W = wet):")
print(f"  Naive enumeration:    {compute_marginal_naive('W', 'wet'):.6f}")
print(f"  Variable elimination: {compute_marginal_efficient('W', 'wet'):.6f}")

print("\nP(W = dry):")
print(f"  Naive enumeration:    {compute_marginal_naive('W', 'dry'):.6f}")
print(f"  Variable elimination: {compute_marginal_efficient('W', 'dry'):.6f}")

print("\nP(R = rain):")
print(f"  Naive enumeration:    {compute_marginal_naive('R', 'rain'):.6f}")

print("\nVerification: P(W=wet) + P(W=dry) =",
      compute_marginal_naive('W', 'wet') + compute_marginal_naive('W', 'dry'))
```

The deep connection between conditional independence and factorization is the theoretical cornerstone of Bayesian Networks. Let us formalize this relationship.
Theorem (Factorization Implies Local Markov Property):
If a distribution $P$ factorizes according to DAG $G$: $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | Pa(X_i))$$
Then $P$ satisfies the local Markov property with respect to $G$: $$X_i \perp NonDesc(X_i) \mid Pa(X_i) \quad \forall i$$
Proof Sketch:
Consider any variable $X_i$ and any non-descendant $X_j$ (where $j$ is not $i$ or a descendant of $i$). We want to show $X_i \perp X_j | Pa(X_i)$.
From the factorization, the joint is a product of local factors, and $X_i$ appears in exactly two kinds of factors: its own factor $P(X_i | Pa(X_i))$ and the factors of its children, where it plays the role of a parent.
When we condition on $Pa(X_i)$, the factor $P(X_i | Pa(X_i))$ is fully determined and does not depend on $X_j$. The remaining factors either do not involve $X_i$ at all, or belong to descendants of $X_i$; summing those descendants out leaves no term linking $X_i$ to the non-descendant $X_j$ once the parents are fixed. □
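To make the sketch concrete, here is a small self-contained numeric check using the CPT numbers from the sprinkler example above (the helper name `joint` is ours). Sprinkler's only parent is Cloudy and Rain is a non-descendant of Sprinkler, so the theorem predicts $P(S \mid R, C) = P(S \mid C)$.

```python
# Numeric check of the local Markov property on the sprinkler network.
P_C = {'cloudy': 0.5, 'not_cloudy': 0.5}
P_R = {('rain', 'cloudy'): 0.8, ('no_rain', 'cloudy'): 0.2,
       ('rain', 'not_cloudy'): 0.2, ('no_rain', 'not_cloudy'): 0.8}
P_S = {('on', 'cloudy'): 0.1, ('off', 'cloudy'): 0.9,
       ('on', 'not_cloudy'): 0.5, ('off', 'not_cloudy'): 0.5}
P_W = {('wet', 'rain', 'on'): 0.99, ('dry', 'rain', 'on'): 0.01,
       ('wet', 'rain', 'off'): 0.9, ('dry', 'rain', 'off'): 0.1,
       ('wet', 'no_rain', 'on'): 0.9, ('dry', 'no_rain', 'on'): 0.1,
       ('wet', 'no_rain', 'off'): 0.0, ('dry', 'no_rain', 'off'): 1.0}

def joint(c, r, s, w):
    """P(C,R,S,W) from the BN factorization."""
    return P_C[c] * P_R[(r, c)] * P_S[(s, c)] * P_W[(w, r, s)]

for c in P_C:
    for r in ('rain', 'no_rain'):
        # P(S=on | R=r, C=c), computed by summing the joint over W
        num = sum(joint(c, r, 'on', w) for w in ('wet', 'dry'))
        den = sum(joint(c, r, s, w) for s in ('on', 'off') for w in ('wet', 'dry'))
        print(f"P(S=on | R={r}, C={c}) = {num/den:.3f}   "
              f"CPT P(S=on | C={c}) = {P_S[('on', c)]:.3f}")
```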
For distributions where P(x) > 0 for all configurations x, the converse is also true: if P satisfies the local Markov property with respect to G, then P factorizes according to G. This requires the Hammersley-Clifford theorem for the undirected case and can be shown for DAGs via the d-separation criterion.
Reading Independence from Factorization:
Given the factorized form, we can directly read off conditional independencies:
Each factor reveals a local independence: $P(X_i | Pa(X_i))$ tells us that $X_i$, given its parents, doesn't directly depend on anything else in its factor.
Structure limits information flow: A variable can only influence another through paths in the graph. If there's no path from $X$ to $Y$ (in either direction) that isn't blocked by the conditioning set, independence holds.
Factorization = Minimal representation: The factorized form captures exactly the dependencies that exist—no more, no less. Any additional independence in the true distribution would mean a coarser (simpler) factorization is possible.
Example: Reading Independence
From $P(C,R,S,W) = P(C) \cdot P(R|C) \cdot P(S|C) \cdot P(W|R,S)$:
| Statement | From Factorization | Reasoning |
|---|---|---|
| R ⊥ S \| C | Both condition only on C | Fork structure at C; conditioning blocks the path |
| W ⊥ C \| R, S | W's factor involves only R, S | R, S d-separate W from C |
| C ⊥ W \| R, S | Symmetric to the above | Same d-separation argument |
| R ⊥̸ S (marginal) | Both depend on C | Shared cause creates dependence |
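The first and last rows of this table can be checked numerically. Below is a minimal, self-contained sketch using the CPT numbers from the sprinkler example; the helper name `p_crs` is ours, and WetGrass is omitted because summing it out contributes a factor of 1.

```python
P_C = {'cloudy': 0.5, 'not_cloudy': 0.5}
P_R = {('rain', 'cloudy'): 0.8, ('rain', 'not_cloudy'): 0.2,
       ('no_rain', 'cloudy'): 0.2, ('no_rain', 'not_cloudy'): 0.8}
P_S = {('on', 'cloudy'): 0.1, ('on', 'not_cloudy'): 0.5,
       ('off', 'cloudy'): 0.9, ('off', 'not_cloudy'): 0.5}

def p_crs(c, r, s):
    """P(C=c, R=r, S=s) from the factorization (W already summed out)."""
    return P_C[c] * P_R[(r, c)] * P_S[(s, c)]

# Conditional independence R ⊥ S | C: P(R,S|C) = P(R|C) P(S|C) for every c
for c in P_C:
    p_rs_given_c = p_crs(c, 'rain', 'on') / P_C[c]
    print(f"C={c}: P(R,S|C) = {p_rs_given_c:.3f}, "
          f"P(R|C)P(S|C) = {P_R[('rain', c)] * P_S[('on', c)]:.3f}")

# Marginal dependence R ⊥̸ S: the same product test fails without conditioning
p_rs = sum(p_crs(c, 'rain', 'on') for c in P_C)
p_r = sum(p_crs(c, 'rain', s) for c in P_C for s in ('on', 'off'))
p_s = sum(p_crs(c, r, 'on') for c in P_C for r in ('rain', 'no_rain'))
print(f"P(R,S) = {p_rs:.3f}  vs  P(R)·P(S) = {p_r * p_s:.3f}  (not equal)")
```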
One of the most powerful practical consequences of factorization is that it enables modular, independent learning of different parts of the network. Each conditional probability table can be estimated in isolation.
The Decomposability Principle:
Given the log-likelihood of data $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$:
$$\log P(\mathcal{D} | \theta) = \sum_{j=1}^{N} \log P(\mathbf{x}^{(j)} | \theta)$$
Using factorization: $$= \sum_{j=1}^{N} \sum_{i=1}^{n} \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$$
Rearranging sums: $$= \sum_{i=1}^{n} \sum_{j=1}^{N} \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$$
This shows the log-likelihood decomposes into a sum of local terms, one per variable. Each term $\sum_j \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$ depends only on the parameters $\theta_i$ for variable $X_i$'s CPT.
Because the log-likelihood decomposes into independent terms, we can maximize each term separately! Learning P(X_i | Pa(X_i)) requires only examining columns X_i and Pa(X_i) in the data—not the entire dataset. This is a massive computational advantage.
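As a tiny illustration of the decomposition, the sketch below (a hypothetical two-variable network $C \rightarrow R$ with a made-up five-sample dataset) computes the log-likelihood once through the factorized joint and once as the sum of two local terms; the results agree up to floating-point rounding, and each local term touches only its own columns of the data.

```python
import math

# Illustrative dataset of (C, R) pairs and illustrative CPT values
data = [('cloudy', 'rain'), ('cloudy', 'rain'), ('cloudy', 'no_rain'),
        ('not_cloudy', 'no_rain'), ('not_cloudy', 'rain')]
P_C = {'cloudy': 0.5, 'not_cloudy': 0.5}
P_R_given_C = {('rain', 'cloudy'): 0.8, ('no_rain', 'cloudy'): 0.2,
               ('rain', 'not_cloudy'): 0.2, ('no_rain', 'not_cloudy'): 0.8}

# (a) Log-likelihood via the factorized joint, sample by sample
ll_joint = sum(math.log(P_C[c] * P_R_given_C[(r, c)]) for c, r in data)

# (b) The same quantity regrouped into one local term per CPT
ll_C_term = sum(math.log(P_C[c]) for c, _ in data)                   # uses only column C
ll_R_term = sum(math.log(P_R_given_C[(r, c)]) for c, r in data)      # uses columns C and R

print(ll_joint, ll_C_term + ll_R_term)  # equal up to floating-point rounding
```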
Maximum Likelihood Estimation with Factorization:
For complete data (no missing values), the MLE for each CPT entry is simply the empirical conditional frequency:
$$\hat{P}(X_i = x | Pa(X_i) = \pi) = \frac{\text{Count}(X_i = x, Pa(X_i) = \pi)}{\text{Count}(Pa(X_i) = \pi)}$$
This can be computed with a single pass through the data, counting co-occurrences. No iterative optimization is needed.
Implications for Large Networks:
Parallelization: Each CPT can be learned on a separate processor
Incremental updates: Adding data updates only the relevant counts
Local modifications: Changing network structure requires re-learning only affected CPTs
Missing data handling: Even with missing data, the EM algorithm can exploit factorization within each E and M step
```python
import numpy as np
from collections import defaultdict


def learn_cpt_from_data(data, variable, parents, var_names):
    """
    Learn a single CPT P(variable | parents) from data.

    This demonstrates modular learning - each CPT can be learned independently.

    Args:
        data: Array of shape (N, n_variables) - each row is a sample
        variable: Index of the variable whose CPT we're learning
        parents: List of indices of parent variables
        var_names: List of variable names for display

    Returns:
        CPT as a dictionary mapping (var_value, parent_values) -> probability
    """
    # Count occurrences
    joint_counts = defaultdict(lambda: defaultdict(int))
    parent_counts = defaultdict(int)

    for sample in data:
        var_value = sample[variable]
        parent_values = tuple(sample[p] for p in parents)
        joint_counts[parent_values][var_value] += 1
        parent_counts[parent_values] += 1

    # Compute conditional probabilities
    cpt = {}
    for parent_values, var_counts in joint_counts.items():
        total = parent_counts[parent_values]
        for var_value, count in var_counts.items():
            cpt[(var_value, parent_values)] = count / total

    # Display
    var_name = var_names[variable]
    parent_names = [var_names[p] for p in parents] if parents else []
    print(f"\nLearned CPT for P({var_name} | {parent_names or '∅'}):")
    for (val, parents_val), prob in sorted(cpt.items()):
        if parents:
            print(f"  P({var_name}={val} | {dict(zip(parent_names, parents_val))}) = {prob:.3f}")
        else:
            print(f"  P({var_name}={val}) = {prob:.3f}")

    return cpt


def demonstrate_modular_learning():
    """Demonstrate that each CPT can be learned independently."""
    np.random.seed(42)

    # Generate synthetic data from a known BN
    # Structure: A -> B, A -> C, B -> D, C -> D
    N = 1000
    # Variable indices: A=0, B=1, C=2, D=3
    data = np.zeros((N, 4), dtype=int)

    # Generate according to the BN
    for i in range(N):
        # P(A=1) = 0.3
        a = 1 if np.random.random() < 0.3 else 0

        # P(B=1 | A): if A=1, P=0.8; if A=0, P=0.2
        p_b = 0.8 if a == 1 else 0.2
        b = 1 if np.random.random() < p_b else 0

        # P(C=1 | A): if A=1, P=0.7; if A=0, P=0.1
        p_c = 0.7 if a == 1 else 0.1
        c = 1 if np.random.random() < p_c else 0

        # P(D=1 | B, C)
        if b == 1 and c == 1:
            p_d = 0.95
        elif b == 1 or c == 1:
            p_d = 0.6
        else:
            p_d = 0.05
        d = 1 if np.random.random() < p_d else 0

        data[i] = [a, b, c, d]

    var_names = ['A', 'B', 'C', 'D']

    print("=== Modular CPT Learning ===")
    print(f"\nLearning from {N} samples...")
    print("Each CPT is learned INDEPENDENTLY using only relevant columns.")

    # Learn each CPT independently
    cpt_A = learn_cpt_from_data(data, variable=0, parents=[], var_names=var_names)
    cpt_B = learn_cpt_from_data(data, variable=1, parents=[0], var_names=var_names)
    cpt_C = learn_cpt_from_data(data, variable=2, parents=[0], var_names=var_names)
    cpt_D = learn_cpt_from_data(data, variable=3, parents=[1, 2], var_names=var_names)

    print("\n" + "=" * 50)
    print("Note: Each CPT was learned using only its variable and parents.")
    print("No global optimization was needed!")


demonstrate_modular_learning()
```

A crucial subtlety: a single probability distribution may factorize according to multiple different DAG structures. Two different DAGs can represent exactly the same set of conditional independencies, making them observationally equivalent from data alone.
Definition (I-Equivalence): Two DAGs $G_1$ and $G_2$ are I-equivalent (or Markov equivalent) if they encode the same set of conditional independence relations—meaning a distribution $P$ factorizes according to $G_1$ if and only if it factorizes according to $G_2$.
Example: Three Variable Chains
Consider three variables $A$, $B$, $C$. The following three DAGs are I-equivalent:

- Chain: $A \rightarrow B \rightarrow C$
- Reversed chain: $A \leftarrow B \leftarrow C$
- Fork: $A \leftarrow B \rightarrow C$
All three encode exactly one independence: $A \perp C | B$.
But $A \rightarrow B \leftarrow C$ (collider) is NOT equivalent—it encodes $A \perp C$ marginally, which is different.
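A minimal numeric sketch of this equivalence, using illustrative CPT values of our own: a distribution defined by the chain $A \rightarrow B \rightarrow C$ also factorizes exactly along the reversed chain $C \rightarrow B \rightarrow A$, because both structures encode only $A \perp C \mid B$.

```python
# Define the joint via the chain A -> B -> C (illustrative CPT values)
P_A = {0: 0.3, 1: 0.7}
P_B_given_A = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}    # P(B=b | A=a)
P_C_given_B = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}  # P(C=c | B=b)

def joint(a, b, c):                      # chain factorization
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

# Recover the reversed-chain factors P(C), P(B|C), P(A|B) from the same joint
vals = (0, 1)
P_C = {c: sum(joint(a, b, c) for a in vals for b in vals) for c in vals}
P_BC = {(b, c): sum(joint(a, b, c) for a in vals) for b in vals for c in vals}
P_AB = {(a, b): sum(joint(a, b, c) for c in vals) for a in vals for b in vals}
P_B = {b: sum(P_AB[(a, b)] for a in vals) for b in vals}

def joint_reversed(a, b, c):             # C -> B -> A factorization
    return P_C[c] * (P_BC[(b, c)] / P_C[c]) * (P_AB[(a, b)] / P_B[b])

print(all(abs(joint(a, b, c) - joint_reversed(a, b, c)) < 1e-12
          for a in vals for b in vals for c in vals))  # True
```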
I-equivalence is why we cannot learn causal direction from observational data alone. Seeing correlation between A, B, C with A ⊥ C | B is consistent with A causing B causing C, or C causing B causing A, or B causing both. Only interventions or additional assumptions can distinguish these.
Characterizing I-Equivalence:
Theorem (Verma & Pearl, 1990): Two DAGs are I-equivalent if and only if they have:

- the same skeleton (the same edges, ignoring direction), and
- the same set of v-structures (colliders $X \rightarrow Z \leftarrow Y$ in which $X$ and $Y$ are not adjacent).
This means: most edge directions can be reversed freely, but collider structures must be preserved.
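This criterion is easy to check mechanically. The sketch below (our own helper functions, not a library routine) compares skeletons and v-structures for the three-variable examples above, with DAGs given as sets of directed edges.

```python
def skeleton(edges):
    """Undirected edge set: direction is dropped."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Colliders X -> Z <- Y with X and Y not adjacent."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    skel = skeleton(edges)
    vs = set()
    for z, ps in parents.items():
        for x in ps:
            for y in ps:
                if x < y and frozenset((x, y)) not in skel:
                    vs.add((x, z, y))
    return vs

def i_equivalent(e1, e2):
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

chain     = {('A', 'B'), ('B', 'C')}   # A -> B -> C
rev_chain = {('C', 'B'), ('B', 'A')}   # C -> B -> A
collider  = {('A', 'B'), ('C', 'B')}   # A -> B <- C

print(i_equivalent(chain, rev_chain))  # True  (same skeleton, no v-structures)
print(i_equivalent(chain, collider))   # False (v-structure at B)
```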
Equivalence Classes:
I-equivalent DAGs form an equivalence class, often represented by a Completed Partially Directed Acyclic Graph (CPDAG) or Essential Graph: edges whose direction agrees across every DAG in the class are drawn as directed, while edges whose direction varies across the class are left undirected.
A CPDAG is the most informative structure learnable from purely observational data.
| DAG 1 | DAG 2 | Relationship | Reason |
|---|---|---|---|
| A→B→C | C→B→A | I-equivalent | Same skeleton, no v-structures |
| A→B→C | A←B→C | I-equivalent | Same skeleton, no v-structures |
| A→B←C | A→B→C | NOT equivalent | Different v-structures |
| A→B←C | A←B→C | NOT equivalent | Collider vs fork |
Implications for Practice:
Structure learning from data can at best identify the equivalence class, not a unique DAG
Causal interpretation requires additional assumptions (e.g., faithfulness, causal sufficiency) or experimental data
Model comparison should compare equivalence classes, not individual DAGs, when using observational data
Prior knowledge about causal directions can break equivalence and specify a unique DAG
Factorization is the mathematical engine that powers Bayesian Networks. We have developed a thorough understanding of how structure enables compact representation and efficient computation. Let us consolidate the key insights:

- The chain rule rewrites any joint distribution as a product of conditionals; it holds for every variable ordering but, by itself, saves no parameters.
- A Bayesian Network factorizes the joint as $\prod_i P(X_i | Pa(X_i))$, replacing "all predecessors" with "parents" by exploiting conditional independence.
- With in-degree bounded by $k$, the parameter count drops from $O(2^n)$ to $O(n \cdot 2^k)$.
- The factorization enables efficient computation: joint probabilities in linear time and marginals via variable elimination.
- The log-likelihood decomposes per CPT, so parameters can be learned modularly from local counts.
- Different DAGs can encode the same independencies (I-equivalence), so observational data identifies only an equivalence class, not a unique causal structure.
What's Next:
The next page explores Conditional Probability Tables (CPTs)—the data structures that store the local factors $P(X_i | Pa(X_i))$. We'll examine representation formats, normalization constraints, parameterization choices, and practical considerations for working with CPTs in real Bayesian Network implementations.
You now understand factorization as the core mechanism enabling tractable probabilistic reasoning in Bayesian Networks. You can compute joint probabilities, analyze parameter complexity, and recognize how structure translates to computational efficiency. Next, we'll examine how the local factors are actually represented and stored.