The most challenging aspect of Bayesian Network learning is structure learning: discovering the DAG itself from data. While parameter learning fills in the CPTs of a graph whose structure is already known, structure learning must determine which edges exist, revealing the dependency structure hidden in the observations.
This is a fundamentally harder problem. The space of possible DAGs over $n$ variables is super-exponential, and observational data alone cannot distinguish between certain equivalent structures. Yet practical algorithms exist that can learn useful network structures from data, enabling automated discovery of probabilistic relationships in domains from genomics to social science.
This page develops the theory and practice of structure learning, covering constraint-based methods (independence tests), score-based methods (search optimization), and hybrid approaches that combine their strengths.
By completing this page, you will: (1) Understand the computational complexity of structure learning, (2) Apply constraint-based algorithms like PC and FCI, (3) Use score-based search with BIC and BDe scores, (4) Recognize the limitations of observational learning, and (5) Combine learned structure with domain knowledge.
Let us first appreciate the difficulty of structure learning by examining the search space and fundamental limitations.
Size of the DAG Space:
The number of possible DAGs on $n$ labeled nodes grows super-exponentially. Let $a(n)$ denote this count:
$$a(n) = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a(n-k)$$
with $a(0) = 1$.
| n | DAGs | Approximate |
|---|---|---|
| 3 | 25 | ~25 |
| 5 | 29,281 | ~3×10⁴ |
| 7 | 1.1×10⁹ | ~10⁹ |
| 10 | 4.2×10¹⁸ | ~10¹⁸ |
| 20 | — | ~10⁷² |
Exhaustive search is intractable beyond a handful of variables.
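As a quick check on the table above, here is a minimal sketch of the recursion (the function name `num_dags` is illustrative):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Count labeled DAGs on n nodes using Robinson's recursion above."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in [3, 5, 7, 10]:
    print(n, num_dags(n))  # 25, 29281, ≈1.1e9, ≈4.2e18
```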
NP-Hardness:
Finding the optimal DAG structure (maximizing a score) is NP-hard. Even with just 20 variables, we cannot check all candidates. Practical algorithms must use heuristics or exploit structure.
From observational data alone, we can learn structure only up to its I-equivalence (Markov equivalence) class. The three structures A→B→C, A←B→C, and A←B←C encode identical independencies (A⊥C|B) and cannot be distinguished without interventional data or additional assumptions.
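To see this concretely, the following sketch (illustrative variable names) simulates a chain A→B→C and a fork A←B→C; in both, the partial correlation of A and C given B is near zero, so a conditional-independence test cannot tell them apart:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Chain: A -> B -> C
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)

# Fork: A <- B -> C
B2 = rng.normal(size=n)
A2 = 0.8 * B2 + rng.normal(size=n)
C2 = 0.8 * B2 + rng.normal(size=n)

def partial_corr(x, y, z):
    """Correlation of residuals after regressing x and y on z (with intercept)."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

print(partial_corr(A, C, B))     # ≈ 0: A ⊥ C | B in the chain
print(partial_corr(A2, C2, B2))  # ≈ 0: the fork shows the same independence
```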
What Can Structure Learning Achieve?
Skeleton: The undirected graph showing which pairs of variables have direct relationships
V-structures (Colliders): Patterns A → C ← B where A and B are not adjacent. These ARE identifiable from data.
CPDAG/Essential Graph: A partially directed graph representing the entire equivalence class—directed where all equivalent DAGs agree, undirected otherwise.
One member of the equivalence class: By making arbitrary choices for undirected edges, we can obtain a fully directed DAG.
Two Main Approaches:
Constraint-based: Use statistical tests to detect (conditional) independence, build structure from independence patterns
Score-based: Define a scoring function over structures, search for high-scoring DAGs
Each has strengths and weaknesses; hybrid methods combine them.
Constraint-based methods discover structure by testing conditional independence relationships in the data.
The PC Algorithm (Peter-Clark):
The PC algorithm is the foundational constraint-based method:
Phase 1: Skeleton Discovery
Phase 2: Orient V-structures
Phase 3: Propagate Orientations
Apply rules to orient remaining edges without creating new v-structures or cycles:
```python
import numpy as np
from numpy.linalg import lstsq
from scipy import stats
from itertools import combinations

def partial_correlation(data, i, j, S):
    """
    Compute partial correlation between variables i and j given set S.
    Uses linear regression residuals.
    """
    if not S:
        return np.corrcoef(data[:, i], data[:, j])[0, 1]

    # Regress both i and j on S, compute correlation of residuals
    X_S = data[:, list(S)]
    X_S = np.column_stack([np.ones(len(data)), X_S])

    # Residuals of i on S
    coef_i, _, _, _ = lstsq(X_S, data[:, i], rcond=None)
    resid_i = data[:, i] - X_S @ coef_i

    # Residuals of j on S
    coef_j, _, _, _ = lstsq(X_S, data[:, j], rcond=None)
    resid_j = data[:, j] - X_S @ coef_j

    return np.corrcoef(resid_i, resid_j)[0, 1]

def test_conditional_independence(data, i, j, S, alpha=0.05):
    """
    Test X_i ⊥ X_j | X_S using Fisher's z-transform.
    Returns: True if independent (fail to reject), False if dependent
    """
    n = len(data)
    r = partial_correlation(data, i, j, S)

    # Handle edge cases
    if abs(r) >= 1:
        return abs(r) < 0.9999

    # Fisher's z-transform
    z = 0.5 * np.log((1 + r) / (1 - r))
    se = 1 / np.sqrt(n - len(S) - 3)

    # Two-sided test
    p_value = 2 * (1 - stats.norm.cdf(abs(z / se)))
    return p_value > alpha

def pc_skeleton(data, alpha=0.05, max_cond_size=None):
    """
    PC algorithm Phase 1: Learn the skeleton.
    Returns:
        adjacency: Boolean adjacency matrix
        sep_sets: Dict mapping (i,j) -> separating set S
    """
    n_vars = data.shape[1]
    if max_cond_size is None:
        max_cond_size = n_vars - 2

    # Start with complete graph
    adjacency = np.ones((n_vars, n_vars), dtype=bool)
    np.fill_diagonal(adjacency, False)
    sep_sets = {}

    # Test increasing conditioning set sizes
    for cond_size in range(max_cond_size + 1):
        for i in range(n_vars):
            for j in range(i + 1, n_vars):
                if not adjacency[i, j]:
                    continue

                # Get neighbors of i (excluding j)
                neighbors = [k for k in range(n_vars) if adjacency[i, k] and k != j]
                if len(neighbors) < cond_size:
                    continue

                # Test all conditioning sets of given size
                for S in combinations(neighbors, cond_size):
                    S = set(S)
                    if test_conditional_independence(data, i, j, S, alpha):
                        adjacency[i, j] = adjacency[j, i] = False
                        sep_sets[(i, j)] = sep_sets[(j, i)] = S
                        break

    return adjacency, sep_sets

# Example usage
np.random.seed(42)
n_samples = 1000

# Generate from known structure: A -> B -> C, A -> C
A = np.random.randn(n_samples)
B = 0.8 * A + 0.6 * np.random.randn(n_samples)
C = 0.5 * A + 0.7 * B + 0.4 * np.random.randn(n_samples)
data = np.column_stack([A, B, C])

print("=== PC Algorithm: Skeleton Discovery ===\n")
print("True structure: A → B → C, A → C (triangle)")

adj, seps = pc_skeleton(data, alpha=0.05)

print("\nLearned skeleton (adjacency):")
print("   A  B  C")
for i, label in enumerate(['A', 'B', 'C']):
    print(f"{label}: {adj[i].astype(int)}")

print("\nSeparating sets:")
for (i, j), S in seps.items():
    if i < j:
        labels = ['A', 'B', 'C']
        sep_labels = [labels[s] for s in S] if S else ['∅']
        print(f"  {labels[i]} ⊥ {labels[j]} | {sep_labels}")
```

The PC algorithm requires a conditional independence test. Common choices: Fisher's z-test (Gaussian data), G² test (discrete data), kernel-based tests (nonlinear). The choice affects both accuracy and computational cost. The significance level α controls the tradeoff between spurious edges (high α, fewer independencies detected) and missing edges (low α, more edges removed).
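Phase 1 above returns only the skeleton and the separating sets. As a rough sketch of Phase 2 (the helper below is hypothetical and reuses `np` plus the `adjacency` and `sep_sets` returned by `pc_skeleton`), orient i → k ← j whenever i and j are non-adjacent, share the neighbor k, and k is not in their separating set; Phase 3, propagating orientations with Meek's rules, is omitted:

```python
def orient_v_structures(adjacency, sep_sets):
    """Phase 2 sketch: directed[i, k] = True means the edge is oriented i -> k."""
    n_vars = adjacency.shape[0]
    directed = np.zeros_like(adjacency)
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if adjacency[i, j]:
                continue  # i and j must be non-adjacent
            for k in range(n_vars):
                # Common neighbor k outside the separating set => collider at k
                if adjacency[i, k] and adjacency[j, k] and k not in sep_sets.get((i, j), set()):
                    directed[i, k] = directed[j, k] = True
    return directed
```

On the triangle example above this finds no v-structures, because every pair of variables stays adjacent; the resulting CPDAG is fully undirected, reflecting that all complete three-node DAGs are I-equivalent.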
Score-based methods define a goodness measure for DAG structures and search for optimal structures.
Common Scoring Functions:
1. Bayesian Information Criterion (BIC): $$\text{BIC}(G, \mathcal{D}) = \log P(\mathcal{D} | \hat{\theta}_{\text{MLE}}, G) - \frac{d}{2} \log N$$
where $d$ is the number of free parameters. The penalty term $\frac{d}{2} \log N$ discourages complex models.
2. Bayesian Dirichlet equivalent (BDe) Score: $$\text{BDe}(G, \mathcal{D}) = \prod_i \prod_{\mathbf{u}} \frac{\Gamma(\alpha_{\mathbf{u}})}{\Gamma(\alpha_{\mathbf{u}} + N_{\mathbf{u}})} \prod_x \frac{\Gamma(\alpha_{x\mathbf{u}} + N_{x\mathbf{u}})}{\Gamma(\alpha_{x\mathbf{u}})}$$
This is the marginal likelihood with Dirichlet priors, integrating out parameters.
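As a rough sketch of the commonly used BDeu variant (an equivalent sample size α spread uniformly across cells), the following computes the local log-score for one variable from a table of counts $N_{x\mathbf{u}}$; the function name and count layout are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def bdeu_local_log_score(counts, alpha=1.0):
    """counts[u, x] = N_{x,u} for each parent configuration u and child state x."""
    counts = np.asarray(counts, dtype=float)
    q, r = counts.shape                # parent configurations, child states
    a_xu = alpha / (q * r)             # pseudo-count per (x, u) cell
    a_u = alpha / q                    # pseudo-count per parent configuration
    N_u = counts.sum(axis=1)
    score = np.sum(gammaln(a_u) - gammaln(a_u + N_u))
    score += np.sum(gammaln(a_xu + counts) - gammaln(a_xu))
    return score

# Binary child with one binary parent: rows are parent = 0 and parent = 1
print(bdeu_local_log_score([[30, 10], [5, 55]]))
```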
Decomposability:
Both scores decompose by variable: $$\text{Score}(G) = \sum_i \text{Score}_i(X_i, Pa(X_i))$$
This enables local search: changing one node's parents requires recomputing only that node's local score.
Search Strategies:
Greedy Hill Climbing:
Operators: add an edge, delete an edge, or reverse an edge. Starting from an initial graph (often empty), repeatedly apply the single operator that most improves the score, rejecting any change that creates a cycle, until no operator improves it.
Tabu Search:
Enhance hill climbing by maintaining a 'tabu list' of recently visited structures to avoid cycling and escape local optima.
Simulated Annealing:
Accept score-decreasing moves with probability $\exp(-\Delta/T)$, where $\Delta$ is the magnitude of the score decrease and $T$ is the temperature. Gradually reduce $T$ so the search converges.
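A minimal sketch of that acceptance rule (illustrative helper; `delta_score` is negative for a score-decreasing move):

```python
import numpy as np

def accept_move(delta_score, temperature, rng=None):
    """Always accept improvements; accept a worsening move with prob. exp(-|Δ|/T)."""
    if rng is None:
        rng = np.random.default_rng()
    if delta_score >= 0:
        return True
    return rng.random() < np.exp(delta_score / temperature)
```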
Genetic Algorithms:
Maintain a population of DAGs, apply crossover and mutation, select based on fitness (score).
```python
import numpy as np
from itertools import product

def bic_score_local(data, variable_idx, parent_indices, n_states=2):
    """
    Compute local BIC score for P(X_i | Pa(X_i)).
    BIC = log_likelihood - (d/2) * log(N)
    """
    N = len(data)

    # Count configurations
    parent_configs = list(product(range(n_states), repeat=len(parent_indices)))
    log_likelihood = 0
    n_params = 0

    for config in parent_configs:
        # Find samples matching this parent config
        if parent_indices:
            mask = np.all(data[:, parent_indices] == config, axis=1)
        else:
            mask = np.ones(N, dtype=bool)

        n_parent = mask.sum()
        if n_parent == 0:
            continue

        n_params += (n_states - 1)  # Free parameters for this row

        # Count child values
        for child_val in range(n_states):
            child_mask = mask & (data[:, variable_idx] == child_val)
            n_child = child_mask.sum()
            if n_child > 0:
                # MLE probability
                p = n_child / n_parent
                log_likelihood += n_child * np.log(p)

    bic = log_likelihood - (n_params / 2) * np.log(N)
    return bic

def greedy_hill_climbing(data, n_states=2, max_parents=3):
    """
    Greedy hill climbing for DAG structure learning.
    Returns:
        adjacency: Boolean adjacency matrix (adj[i,j] means i→j)
    """
    n_vars = data.shape[1]

    # Start with empty graph
    adjacency = np.zeros((n_vars, n_vars), dtype=bool)

    def get_parents(node):
        return [i for i in range(n_vars) if adjacency[i, node]]

    def is_acyclic(adj):
        """Check whether the directed graph is free of cycles using DFS."""
        visited = [False] * n_vars
        rec_stack = [False] * n_vars

        def dfs(v):
            visited[v] = True
            rec_stack[v] = True
            for u in range(n_vars):
                if adj[v, u]:
                    if not visited[u]:
                        if dfs(u):
                            return True
                    elif rec_stack[u]:
                        return True
            rec_stack[v] = False
            return False

        for v in range(n_vars):
            if not visited[v]:
                if dfs(v):
                    return False
        return True

    def compute_total_score(adj):
        score = 0
        for j in range(n_vars):
            parents = [i for i in range(n_vars) if adj[i, j]]
            score += bic_score_local(data, j, parents, n_states)
        return score

    current_score = compute_total_score(adjacency)
    improved = True

    while improved:
        improved = False
        best_op = None
        best_score = current_score

        for i in range(n_vars):
            for j in range(n_vars):
                if i == j:
                    continue

                # Try adding edge i -> j
                if not adjacency[i, j]:
                    if len(get_parents(j)) < max_parents:
                        adjacency[i, j] = True
                        if is_acyclic(adjacency):
                            score = compute_total_score(adjacency)
                            if score > best_score:
                                best_score = score
                                best_op = ('add', i, j)
                        adjacency[i, j] = False

                # Try deleting edge i -> j
                if adjacency[i, j]:
                    adjacency[i, j] = False
                    score = compute_total_score(adjacency)
                    if score > best_score:
                        best_score = score
                        best_op = ('del', i, j)
                    adjacency[i, j] = True

                # Try reversing edge i -> j
                if adjacency[i, j]:
                    adjacency[i, j] = False
                    if len(get_parents(i)) < max_parents:
                        adjacency[j, i] = True
                        if is_acyclic(adjacency):
                            score = compute_total_score(adjacency)
                            if score > best_score:
                                best_score = score
                                best_op = ('rev', i, j)
                        adjacency[j, i] = False
                    adjacency[i, j] = True

        if best_op:
            op, i, j = best_op
            if op == 'add':
                adjacency[i, j] = True
            elif op == 'del':
                adjacency[i, j] = False
            elif op == 'rev':
                adjacency[i, j] = False
                adjacency[j, i] = True
            current_score = best_score
            improved = True
            print(f"  {op} {i}→{j}, score: {current_score:.2f}")

    return adjacency

# Example
np.random.seed(42)
N = 500

# Generate data: 0 → 1, 0 → 2, 1 → 2
X0 = np.random.choice([0, 1], size=N, p=[0.4, 0.6])
X1 = np.array([np.random.choice([0, 1], p=[0.3, 0.7] if x == 1 else [0.8, 0.2])
               for x in X0])
X2 = np.array([np.random.choice([0, 1], p=[0.2, 0.8] if x1 == 1 and x0 == 1 else [0.6, 0.4])
               for x0, x1 in zip(X0, X1)])

data = np.column_stack([X0, X1, X2])

print("=== Greedy Hill Climbing ===\n")
print("True structure: 0 → 1, 0 → 2, 1 → 2\n")

adj = greedy_hill_climbing(data)

print("\nLearned adjacency (row i to col j):")
print("   0  1  2")
for i in range(3):
    print(f"{i}: {adj[i].astype(int)}")
```

Hybrid methods combine constraint-based and score-based approaches to leverage the strengths of both.
Max-Min Hill Climbing (MMHC):
A popular hybrid algorithm with two phases:
Phase 1: Skeleton via MMPC
For each variable X: grow a candidate set of parents and children using the max-min heuristic (repeatedly add the variable whose minimum association with X, over subsets of the current candidates, is largest), then shrink it by removing candidates that are conditionally independent of X given some subset. The union of these neighborhoods gives the skeleton.
Phase 2: Orientation via Scoring
Run a score-based search (such as greedy hill climbing) restricted to the edges found in Phase 1; the search determines the edge orientations.
Benefits: the constraint-based phase sharply prunes the search space, so the method scales to many variables, while the score-based phase is more robust to individual test errors when committing to edge orientations.
NOTEARS and Continuous Optimization:
Recent methods formulate structure learning as continuous optimization: $$\min_W \mathcal{L}(W) + \lambda \|W\|_1 \quad \text{s.t.} \quad h(W) = 0$$
where $W$ is a weighted adjacency matrix and $h(W) = 0$ encodes acyclicity. Gradient-based optimization enables scalability to many variables.
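In the NOTEARS formulation the acyclicity function is $h(W) = \operatorname{tr}\left(e^{W \circ W}\right) - n$, which equals zero exactly when the weighted graph is acyclic. A small numerical sketch (illustrative matrices):

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W):
    """h(W) = tr(exp(W ∘ W)) - n, where ∘ is the elementwise product."""
    return np.trace(expm(W * W)) - W.shape[0]

W_acyclic = np.array([[0.0, 0.8, 0.5],
                      [0.0, 0.0, 0.7],
                      [0.0, 0.0, 0.0]])
W_cyclic = W_acyclic.copy()
W_cyclic[2, 0] = 0.3                    # adds the cycle 0 → 2 → 0

print(notears_acyclicity(W_acyclic))    # 0: no directed cycles
print(notears_acyclicity(W_cyclic))     # > 0: cycle detected
```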
| Approach | Pros | Cons | Best When |
|---|---|---|---|
| Constraint-based (PC) | Principled, fast for sparse graphs | Sensitive to test errors | Large n, few edges |
| Score-based (Hill Climb) | Robust, handles noise | Local optima, slow | Small n, any density |
| Hybrid (MMHC) | Best of both worlds | More complex | General use |
| Continuous (NOTEARS) | Scalable, differentiable | Parametric (e.g., linear) assumptions; nonconvex, so no global-optimum guarantee | Many variables |
Pure data-driven structure learning is often insufficient. Incorporating domain knowledge dramatically improves results.
Types of Prior Knowledge:
1. Forbidden Edges: 'X cannot cause Y' (e.g., effect cannot precede cause temporally)
2. Required Edges: 'X must directly influence Y' (from established theory)
3. Tier/Layer Ordering: Variables in earlier tiers can cause later tiers but not vice versa
4. Edge Probability Priors: Soft preferences: P(edge exists) based on prior belief
Implementation:
Constraint-based: Skip independence tests for required edges; don't remove even if 'independent'
Score-based: Add forbidden edges to tabu list; add prior term to score for soft constraints
$$\text{Score}_{\text{prior}}(G) = \text{Score}(G) + \sum_{\text{edges}} \log P(\text{edge exists})$$
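For the hard constraints, a small sketch (hypothetical helper and argument names) of the check a score-based search could run before proposing an edge addition:

```python
def edge_allowed(i, j, forbidden=(), tiers=None):
    """Return True if adding the edge i -> j respects hard prior knowledge.

    forbidden: collection of (parent, child) pairs that must never appear.
    tiers: optional dict mapping node -> tier index; a node in a later tier
           cannot be a parent of a node in an earlier tier.
    """
    if (i, j) in forbidden:
        return False
    if tiers is not None and tiers[i] > tiers[j]:
        return False
    return True

# Required edges can be handled by starting the search from a graph that
# already contains them and excluding their deletion or reversal.
```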
Use prior knowledge when: (1) Domain expertise is available and reliable, (2) Sample size is limited relative to complexity, (3) Causal direction matters but can't be identified from data, (4) Physical or logical constraints are known. Priors prevent spurious discoveries but can bias results if wrong.
Evaluating structure learning requires comparing learned graphs to ground truth (in simulation) or assessing predictive performance (in practice).
Structural Metrics (vs ground truth):
Structural Hamming Distance (SHD): Number of edge insertions, deletions, and flips needed to transform learned CPDAG into true CPDAG.
True/False Positive Rate: the fraction of true edges that are recovered (TPR) and the fraction of absent edges that are incorrectly added (FPR).
Precision/Recall: precision is the fraction of learned edges that are present in the true graph; recall is the fraction of true edges that are learned.
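A minimal sketch of skeleton precision/recall and an SHD-style count (simplified: it compares directed adjacency matrices rather than CPDAGs, so reversals within the same equivalence class are still counted):

```python
import numpy as np

def structural_metrics(true_adj, learned_adj):
    """Skeleton precision/recall plus missing + extra + reversed edge count."""
    true_skel = true_adj | true_adj.T
    learn_skel = learned_adj | learned_adj.T
    tp = np.sum(np.triu(true_skel & learn_skel, 1))
    precision = tp / max(np.sum(np.triu(learn_skel, 1)), 1)
    recall = tp / max(np.sum(np.triu(true_skel, 1)), 1)
    missing = np.sum(np.triu(true_skel & ~learn_skel, 1))
    extra = np.sum(np.triu(~true_skel & learn_skel, 1))
    reversed_ = np.sum(true_adj & learned_adj.T & ~learned_adj)
    return precision, recall, missing + extra + reversed_

true_adj = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]], dtype=bool)  # 0→1, 0→2, 1→2
learned = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=bool)   # 0→1, 1→2, 2→0
print(structural_metrics(true_adj, learned))  # (1.0, 1.0, 1): one reversed edge
```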
Predictive Metrics:
Held-out Likelihood: Score on test data not used for learning.
Conditional Likelihood: Predict each variable given its Markov blanket; compare predictions to true values.
Module Complete:
You have now completed the comprehensive coverage of Bayesian Networks. From DAG structure through factorization, CPT representation, parameter learning, and structure learning, you possess the theoretical foundation and practical skills to build, learn, and apply Bayesian Networks in real-world machine learning applications.
Congratulations! You've mastered Bayesian Networks—DAG structure, factorization, CPT representation, parameter learning, and structure learning. You can now represent complex probabilistic dependencies, learn from data, and reason under uncertainty.