The most challenging aspect of Bayesian Network learning is structure learning: discovering the DAG itself from data. While parameter learning fills in the CPTs of a graph whose structure is already known, structure learning must determine which edges exist, revealing the dependency structure hidden in the observations.
This is a fundamentally harder problem. The space of possible DAGs over $n$ variables is super-exponential, and observational data alone cannot distinguish between certain equivalent structures. Yet practical algorithms exist that can learn useful network structures from data, enabling automated discovery of probabilistic relationships in domains from genomics to social science.
This page develops the theory and practice of structure learning, covering constraint-based methods (independence tests), score-based methods (search optimization), and hybrid approaches that combine their strengths.
By completing this page, you will: (1) Understand the computational complexity of structure learning, (2) Apply constraint-based algorithms like PC and FCI, (3) Use score-based search with BIC and BDe scores, (4) Recognize the limitations of observational learning, and (5) Combine learned structure with domain knowledge.
Let us first appreciate the difficulty of structure learning by examining the search space and fundamental limitations.
Size of the DAG Space:
The number of possible DAGs on $n$ labeled nodes grows super-exponentially. Let $a(n)$ denote this count:
$$a(n) = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a(n-k)$$
with $a(0) = 1$.
| n | DAGs | Approximate |
|---|---|---|
| 3 | 25 | ~25 |
| 5 | 29,281 | ~3×10⁴ |
| 7 | 1.1×10⁹ | ~10⁹ |
| 10 | 4.2×10¹⁸ | ~10¹⁸ |
| 20 | — | ~10⁷² |
Exhaustive search is intractable beyond a handful of variables.
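As a quick check on the table above, here is a minimal sketch of the recursion (the function name `num_dags` is illustrative):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Count labeled DAGs on n nodes using Robinson's recursion above."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in [3, 5, 7, 10]:
    print(n, num_dags(n))  # 25, 29281, ≈1.1e9, ≈4.2e18
```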
NP-Hardness:
Finding the optimal DAG structure (maximizing a score) is NP-hard. Even with just 20 variables, we cannot check all candidates. Practical algorithms must use heuristics or exploit structure.
From observational data alone, we can learn structure only up to its I-equivalence (Markov equivalence) class. The three structures A→B→C, A←B→C, and A←B←C encode identical independencies (A⊥C|B) and cannot be distinguished without interventional data or additional assumptions.
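To see this concretely, the following sketch (illustrative variable names) simulates a chain A→B→C and a fork A←B→C; in both, the partial correlation of A and C given B is near zero, so a conditional-independence test cannot tell them apart:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Chain: A -> B -> C
A = rng.normal(size=n)
B = 0.8 * A + rng.normal(size=n)
C = 0.8 * B + rng.normal(size=n)

# Fork: A <- B -> C
B2 = rng.normal(size=n)
A2 = 0.8 * B2 + rng.normal(size=n)
C2 = 0.8 * B2 + rng.normal(size=n)

def partial_corr(x, y, z):
    """Correlation of residuals after regressing x and y on z (with intercept)."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

print(partial_corr(A, C, B))     # ≈ 0: A ⊥ C | B in the chain
print(partial_corr(A2, C2, B2))  # ≈ 0: the fork shows the same independence
```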
What Can Structure Learning Achieve?
Skeleton: The undirected graph showing which pairs of variables have direct relationships
V-structures (Colliders): Patterns A → C ← B where A and B are not adjacent. These ARE identifiable from data.
CPDAG/Essential Graph: A partially directed graph representing the entire equivalence class—directed where all equivalent DAGs agree, undirected otherwise.
One member of the equivalence class: By making arbitrary choices for undirected edges, we can obtain a fully directed DAG.
Two Main Approaches:
Constraint-based: Use statistical tests to detect (conditional) independence, build structure from independence patterns
Score-based: Define a scoring function over structures, search for high-scoring DAGs
Each has strengths and weaknesses; hybrid methods combine them.
Constraint-based methods discover structure by testing conditional independence relationships in the data.
The PC Algorithm (Peter-Clark):
The PC algorithm is the foundational constraint-based method:
Phase 1: Skeleton Discovery
Phase 2: Orient V-structures
Phase 3: Propagate Orientations
Apply rules to orient remaining edges without creating new v-structures or cycles:
```python
import numpy as np
from numpy.linalg import lstsq
from scipy import stats
from itertools import combinations

def partial_correlation(data, i, j, S):
    """
    Compute partial correlation between variables i and j given set S.
    Uses linear regression residuals.
    """
    if not S:
        return np.corrcoef(data[:, i], data[:, j])[0, 1]

    # Regress both i and j on S, compute correlation of residuals
    X_S = data[:, list(S)]
    X_S = np.column_stack([np.ones(len(data)), X_S])

    # Residuals of i on S
    coef_i, _, _, _ = lstsq(X_S, data[:, i], rcond=None)
    resid_i = data[:, i] - X_S @ coef_i

    # Residuals of j on S
    coef_j, _, _, _ = lstsq(X_S, data[:, j], rcond=None)
    resid_j = data[:, j] - X_S @ coef_j

    return np.corrcoef(resid_i, resid_j)[0, 1]

def test_conditional_independence(data, i, j, S, alpha=0.05):
    """
    Test X_i ⊥ X_j | X_S using Fisher's z-transform.
    Returns: True if independent (fail to reject), False if dependent
    """
    n = len(data)
    r = partial_correlation(data, i, j, S)

    # Handle edge cases
    if abs(r) >= 1:
        return abs(r) < 0.9999

    # Fisher's z-transform
    z = 0.5 * np.log((1 + r) / (1 - r))
    se = 1 / np.sqrt(n - len(S) - 3)

    # Two-sided test
    p_value = 2 * (1 - stats.norm.cdf(abs(z / se)))
    return p_value > alpha

def pc_skeleton(data, alpha=0.05, max_cond_size=None):
    """
    PC algorithm Phase 1: Learn the skeleton.
    Returns:
        adjacency: Boolean adjacency matrix
        sep_sets: Dict mapping (i,j) -> separating set S
    """
    n_vars = data.shape[1]
    if max_cond_size is None:
        max_cond_size = n_vars - 2

    # Start with complete graph
    adjacency = np.ones((n_vars, n_vars), dtype=bool)
    np.fill_diagonal(adjacency, False)
    sep_sets = {}

    # Test increasing conditioning set sizes
    for cond_size in range(max_cond_size + 1):
        for i in range(n_vars):
            for j in range(i + 1, n_vars):
                if not adjacency[i, j]:
                    continue

                # Get neighbors of i (excluding j)
                neighbors = [k for k in range(n_vars) if adjacency[i, k] and k != j]
                if len(neighbors) < cond_size:
                    continue

                # Test all conditioning sets of given size
                for S in combinations(neighbors, cond_size):
                    S = set(S)
                    if test_conditional_independence(data, i, j, S, alpha):
                        adjacency[i, j] = adjacency[j, i] = False
                        sep_sets[(i, j)] = sep_sets[(j, i)] = S
                        break

    return adjacency, sep_sets

# Example usage
np.random.seed(42)
n_samples = 1000

# Generate from known structure: A -> B -> C, A -> C
A = np.random.randn(n_samples)
B = 0.8 * A + 0.6 * np.random.randn(n_samples)
C = 0.5 * A + 0.7 * B + 0.4 * np.random.randn(n_samples)
data = np.column_stack([A, B, C])

print("=== PC Algorithm: Skeleton Discovery ===\n")
print("True structure: A → B → C, A → C (triangle)")

adj, seps = pc_skeleton(data, alpha=0.05)

print("\nLearned skeleton (adjacency):")
print("   A  B  C")
for i, label in enumerate(['A', 'B', 'C']):
    print(f"{label}: {adj[i].astype(int)}")

print("\nSeparating sets:")
for (i, j), S in seps.items():
    if i < j:
        labels = ['A', 'B', 'C']
        sep_labels = [labels[s] for s in S] if S else ['∅']
        print(f"  {labels[i]} ⊥ {labels[j]} | {sep_labels}")
```

The PC algorithm requires a conditional independence test. Common choices: Fisher's z-test (Gaussian data), G² test (discrete data), kernel-based tests (nonlinear). The choice affects both accuracy and computational cost. The significance level α controls the tradeoff between spurious edges (high α, fewer independencies detected) and missing edges (low α, more edges removed).
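Phase 1 above returns only the skeleton and the separating sets. As a rough sketch of Phase 2 (the helper below is hypothetical and reuses `np` plus the `adjacency` and `sep_sets` returned by `pc_skeleton`), orient i → k ← j whenever i and j are non-adjacent, share the neighbor k, and k is not in their separating set; Phase 3, propagating orientations with Meek's rules, is omitted:

```python
def orient_v_structures(adjacency, sep_sets):
    """Phase 2 sketch: directed[i, k] = True means the edge is oriented i -> k."""
    n_vars = adjacency.shape[0]
    directed = np.zeros_like(adjacency)
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if adjacency[i, j]:
                continue  # i and j must be non-adjacent
            for k in range(n_vars):
                # Common neighbor k outside the separating set => collider at k
                if adjacency[i, k] and adjacency[j, k] and k not in sep_sets.get((i, j), set()):
                    directed[i, k] = directed[j, k] = True
    return directed
```

On the triangle example above this finds no v-structures, because every pair of variables stays adjacent; the resulting CPDAG is fully undirected, reflecting that all complete three-node DAGs are I-equivalent.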
Score-based methods define a goodness measure for DAG structures and search for optimal structures.
Common Scoring Functions:
1. Bayesian Information Criterion (BIC): $$\text{BIC}(G, \mathcal{D}) = \log P(\mathcal{D} | \hat{\theta}_{\text{MLE}}, G) - \frac{d}{2} \log N$$
where $d$ is the number of free parameters. The penalty term $\frac{d}{2} \log N$ discourages complex models.
2. Bayesian Dirichlet equivalent (BDe) Score: $$\text{BDe}(G, \mathcal{D}) = \prod_i \prod_{\mathbf{u}} \frac{\Gamma(\alpha_{\mathbf{u}})}{\Gamma(\alpha_{\mathbf{u}} + N_{\mathbf{u}})} \prod_x \frac{\Gamma(\alpha_{x\mathbf{u}} + N_{x\mathbf{u}})}{\Gamma(\alpha_{x\mathbf{u}})}$$
This is the marginal likelihood with Dirichlet priors, integrating out parameters.
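As a rough sketch of the commonly used BDeu variant (an equivalent sample size α spread uniformly across cells), the following computes the local log-score for one variable from a table of counts $N_{x\mathbf{u}}$; the function name and count layout are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def bdeu_local_log_score(counts, alpha=1.0):
    """counts[u, x] = N_{x,u} for each parent configuration u and child state x."""
    counts = np.asarray(counts, dtype=float)
    q, r = counts.shape                # parent configurations, child states
    a_xu = alpha / (q * r)             # pseudo-count per (x, u) cell
    a_u = alpha / q                    # pseudo-count per parent configuration
    N_u = counts.sum(axis=1)
    score = np.sum(gammaln(a_u) - gammaln(a_u + N_u))
    score += np.sum(gammaln(a_xu + counts) - gammaln(a_xu))
    return score

# Binary child with one binary parent: rows are parent = 0 and parent = 1
print(bdeu_local_log_score([[30, 10], [5, 55]]))
```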
Decomposability:
Both scores decompose by variable: $$\text{Score}(G) = \sum_i \text{Score}_i(X_i, Pa(X_i))$$
This enables local search: changing one node's parents requires recomputing only that node's local score.
Search Strategies:
Greedy Hill Climbing:
Operators: add an edge, delete an edge, or reverse an edge. Starting from an initial graph (often empty), repeatedly apply the single operator that most improves the score, rejecting any change that creates a cycle, until no operator improves it.
Tabu Search:
Enhance hill climbing by maintaining a 'tabu list' of recently visited structures to avoid cycling and escape local optima.
Simulated Annealing:
Accept score-decreasing moves with probability $\exp(-\Delta/T)$, where $\Delta$ is the magnitude of the score decrease and $T$ is the temperature. Gradually reduce $T$ so the search converges.
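A minimal sketch of that acceptance rule (illustrative helper; `delta_score` is negative for a score-decreasing move):

```python
import numpy as np

def accept_move(delta_score, temperature, rng=None):
    """Always accept improvements; accept a worsening move with prob. exp(-|Δ|/T)."""
    if rng is None:
        rng = np.random.default_rng()
    if delta_score >= 0:
        return True
    return rng.random() < np.exp(delta_score / temperature)
```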
Genetic Algorithms:
Maintain a population of DAGs, apply crossover and mutation, select based on fitness (score).
```python
import numpy as np
from itertools import product

def bic_score_local(data, variable_idx, parent_indices, n_states=2):
    """
    Compute local BIC score for P(X_i | Pa(X_i)).
    BIC = log_likelihood - (d/2) * log(N)
    """
    N = len(data)

    # Count configurations
    parent_configs = list(product(range(n_states), repeat=len(parent_indices)))
    log_likelihood = 0
    n_params = 0

    for config in parent_configs:
        # Find samples matching this parent config
        if parent_indices:
            mask = np.all(data[:, parent_indices] == config, axis=1)
        else:
            mask = np.ones(N, dtype=bool)

        n_parent = mask.sum()
        if n_parent == 0:
            continue

        n_params += (n_states - 1)  # Free parameters for this row

        # Count child values
        for child_val in range(n_states):
            child_mask = mask & (data[:, variable_idx] == child_val)
            n_child = child_mask.sum()
            if n_child > 0:
                # MLE probability
                p = n_child / n_parent
                log_likelihood += n_child * np.log(p)

    bic = log_likelihood - (n_params / 2) * np.log(N)
    return bic

def greedy_hill_climbing(data, n_states=2, max_parents=3):
    """
    Greedy hill climbing for DAG structure learning.
    Returns:
        adjacency: Boolean adjacency matrix (adj[i,j] means i→j)
    """
    n_vars = data.shape[1]

    # Start with empty graph
    adjacency = np.zeros((n_vars, n_vars), dtype=bool)

    def get_parents(node):
        return [i for i in range(n_vars) if adjacency[i, node]]

    def is_acyclic(adj):
        """Check whether the directed graph is free of cycles using DFS."""
        visited = [False] * n_vars
        rec_stack = [False] * n_vars

        def dfs(v):
            visited[v] = True
            rec_stack[v] = True
            for u in range(n_vars):
                if adj[v, u]:
                    if not visited[u]:
                        if dfs(u):
                            return True
                    elif rec_stack[u]:
                        return True
            rec_stack[v] = False
            return False

        for v in range(n_vars):
            if not visited[v]:
                if dfs(v):
                    return False
        return True

    def compute_total_score(adj):
        score = 0
        for j in range(n_vars):
            parents = [i for i in range(n_vars) if adj[i, j]]
            score += bic_score_local(data, j, parents, n_states)
        return score

    current_score = compute_total_score(adjacency)
    improved = True

    while improved:
        improved = False
        best_op = None
        best_score = current_score

        for i in range(n_vars):
            for j in range(n_vars):
                if i == j:
                    continue

                # Try adding edge i -> j
                if not adjacency[i, j]:
                    if len(get_parents(j)) < max_parents:
                        adjacency[i, j] = True
                        if is_acyclic(adjacency):
                            score = compute_total_score(adjacency)
                            if score > best_score:
                                best_score = score
                                best_op = ('add', i, j)
                        adjacency[i, j] = False

                # Try deleting edge i -> j
                if adjacency[i, j]:
                    adjacency[i, j] = False
                    score = compute_total_score(adjacency)
                    if score > best_score:
                        best_score = score
                        best_op = ('del', i, j)
                    adjacency[i, j] = True

                # Try reversing edge i -> j
                if adjacency[i, j]:
                    adjacency[i, j] = False
                    if len(get_parents(i)) < max_parents:
                        adjacency[j, i] = True
                        if is_acyclic(adjacency):
                            score = compute_total_score(adjacency)
                            if score > best_score:
                                best_score = score
                                best_op = ('rev', i, j)
                        adjacency[j, i] = False
                    adjacency[i, j] = True

        if best_op:
            op, i, j = best_op
            if op == 'add':
                adjacency[i, j] = True
            elif op == 'del':
                adjacency[i, j] = False
            elif op == 'rev':
                adjacency[i, j] = False
                adjacency[j, i] = True
            current_score = best_score
            improved = True
            print(f"  {op} {i}→{j}, score: {current_score:.2f}")

    return adjacency

# Example
np.random.seed(42)
N = 500

# Generate data: 0 → 1, 0 → 2, 1 → 2
X0 = np.random.choice([0, 1], size=N, p=[0.4, 0.6])
X1 = np.array([np.random.choice([0, 1], p=[0.3, 0.7] if x == 1 else [0.8, 0.2])
               for x in X0])
X2 = np.array([np.random.choice([0, 1], p=[0.2, 0.8] if x1 == 1 and x0 == 1 else [0.6, 0.4])
               for x0, x1 in zip(X0, X1)])

data = np.column_stack([X0, X1, X2])

print("=== Greedy Hill Climbing ===\n")
print("True structure: 0 → 1, 0 → 2, 1 → 2\n")

adj = greedy_hill_climbing(data)

print("\nLearned adjacency (row i to col j):")
print("   0  1  2")
for i in range(3):
    print(f"{i}: {adj[i].astype(int)}")
```

Hybrid methods combine constraint-based and score-based approaches to leverage the strengths of both.
Max-Min Hill Climbing (MMHC):
A popular hybrid algorithm with two phases:
Phase 1: Skeleton via MMPC
For each variable X: grow a candidate set of parents and children using the max-min heuristic (repeatedly add the variable whose minimum association with X, over subsets of the current candidates, is largest), then shrink it by removing candidates that are conditionally independent of X given some subset. The union of these neighborhoods gives the skeleton.
Phase 2: Orientation via Scoring
Run a score-based search (such as greedy hill climbing) restricted to the edges found in Phase 1; the search determines the edge orientations.
Benefits: the constraint-based phase sharply prunes the search space, so the method scales to many variables, while the score-based phase is more robust to individual test errors when committing to edge orientations.
NOTEARS and Continuous Optimization:
Recent methods formulate structure learning as continuous optimization: $$\min_W \mathcal{L}(W) + \lambda \|W\|_1 \quad \text{s.t.} \quad h(W) = 0$$
where $W$ is a weighted adjacency matrix and $h(W) = 0$ encodes acyclicity. Gradient-based optimization enables scalability to many variables.
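In the NOTEARS formulation the acyclicity function is $h(W) = \operatorname{tr}\left(e^{W \circ W}\right) - n$, which equals zero exactly when the weighted graph is acyclic. A small numerical sketch (illustrative matrices):

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W):
    """h(W) = tr(exp(W ∘ W)) - n, where ∘ is the elementwise product."""
    return np.trace(expm(W * W)) - W.shape[0]

W_acyclic = np.array([[0.0, 0.8, 0.5],
                      [0.0, 0.0, 0.7],
                      [0.0, 0.0, 0.0]])
W_cyclic = W_acyclic.copy()
W_cyclic[2, 0] = 0.3                    # adds the cycle 0 → 2 → 0

print(notears_acyclicity(W_acyclic))    # 0: no directed cycles
print(notears_acyclicity(W_cyclic))     # > 0: cycle detected
```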
| Approach | Pros | Cons | Best When |
|---|---|---|---|
| Constraint-based (PC) | Principled, fast for sparse graphs | Sensitive to test errors | Large n, few edges |
| Score-based (Hill Climb) | Robust, handles noise | Local optima, slow | Small n, any density |
| Hybrid (MMHC) | Best of both worlds | More complex | General use |
| Continuous (NOTEARS) | Scalable, differentiable | Parametric (e.g., linear) assumptions; nonconvex, so no global-optimum guarantee | Many variables |
Pure data-driven structure learning is often insufficient. Incorporating domain knowledge dramatically improves results.
Types of Prior Knowledge:
1. Forbidden Edges: 'X cannot cause Y' (e.g., effect cannot precede cause temporally)
2. Required Edges: 'X must directly influence Y' (from established theory)
3. Tier/Layer Ordering: Variables in earlier tiers can cause later tiers but not vice versa
4. Edge Probability Priors: Soft preferences: P(edge exists) based on prior belief
Implementation:
Constraint-based: Skip independence tests for required edges; don't remove even if 'independent'
Score-based: Add forbidden edges to tabu list; add prior term to score for soft constraints
$$\text{Score}_{\text{prior}}(G) = \text{Score}(G) + \sum_{\text{edges}} \log P(\text{edge exists})$$
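For the hard constraints, a small sketch (hypothetical helper and argument names) of the check a score-based search could run before proposing an edge addition:

```python
def edge_allowed(i, j, forbidden=(), tiers=None):
    """Return True if adding the edge i -> j respects hard prior knowledge.

    forbidden: collection of (parent, child) pairs that must never appear.
    tiers: optional dict mapping node -> tier index; a node in a later tier
           cannot be a parent of a node in an earlier tier.
    """
    if (i, j) in forbidden:
        return False
    if tiers is not None and tiers[i] > tiers[j]:
        return False
    return True

# Required edges can be handled by starting the search from a graph that
# already contains them and excluding their deletion or reversal.
```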
Use prior knowledge when: (1) Domain expertise is available and reliable, (2) Sample size is limited relative to complexity, (3) Causal direction matters but can't be identified from data, (4) Physical or logical constraints are known. Priors prevent spurious discoveries but can bias results if wrong.
Evaluating structure learning requires comparing learned graphs to ground truth (in simulation) or assessing predictive performance (in practice).
Structural Metrics (vs ground truth):
Structural Hamming Distance (SHD): Number of edge insertions, deletions, and flips needed to transform learned CPDAG into true CPDAG.
True/False Positive Rate: the fraction of true edges that are recovered (TPR) and the fraction of absent edges that are incorrectly added (FPR).
Precision/Recall: precision is the fraction of learned edges that are present in the true graph; recall is the fraction of true edges that are learned.
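A minimal sketch of skeleton precision/recall and an SHD-style count (simplified: it compares directed adjacency matrices rather than CPDAGs, so reversals within the same equivalence class are still counted):

```python
import numpy as np

def structural_metrics(true_adj, learned_adj):
    """Skeleton precision/recall plus missing + extra + reversed edge count."""
    true_skel = true_adj | true_adj.T
    learn_skel = learned_adj | learned_adj.T
    tp = np.sum(np.triu(true_skel & learn_skel, 1))
    precision = tp / max(np.sum(np.triu(learn_skel, 1)), 1)
    recall = tp / max(np.sum(np.triu(true_skel, 1)), 1)
    missing = np.sum(np.triu(true_skel & ~learn_skel, 1))
    extra = np.sum(np.triu(~true_skel & learn_skel, 1))
    reversed_ = np.sum(true_adj & learned_adj.T & ~learned_adj)
    return precision, recall, missing + extra + reversed_

true_adj = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]], dtype=bool)  # 0→1, 0→2, 1→2
learned = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=bool)   # 0→1, 1→2, 2→0
print(structural_metrics(true_adj, learned))  # (1.0, 1.0, 1): one reversed edge
```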
Predictive Metrics:
Held-out Likelihood: Score on test data not used for learning.
Conditional Likelihood: Predict each variable given its Markov blanket; compare predictions to true values.
Module Complete:
You have now completed the comprehensive coverage of Bayesian Networks. From DAG structure through factorization, CPT representation, parameter learning, and structure learning, you possess the theoretical foundation and practical skills to build, learn, and apply Bayesian Networks in real-world machine learning applications.
Congratulations! You've mastered Bayesian Networks—DAG structure, factorization, CPT representation, parameter learning, and structure learning. You can now represent complex probabilistic dependencies, learn from data, and reason under uncertainty.