Once the structure of a Bayesian Network is determined, the next challenge is parameter learning: estimating the entries of each CPT from observed data. This is where the rubber meets the road—where abstract probabilistic models become calibrated to real-world phenomena.
Parameter learning in Bayesian Networks benefits enormously from the factorization property: each CPT can be learned independently using only the relevant variables. This decomposability transforms what could be an intractable high-dimensional estimation problem into a set of manageable local estimation tasks.
This page develops the complete theory and practice of parameter learning, from maximum likelihood estimation through Bayesian methods to handling the inevitable challenge of missing data.
By completing this page, you will: (1) Derive maximum likelihood estimates for CPT parameters, (2) Apply Bayesian estimation with Dirichlet priors, (3) Handle missing data using the EM algorithm, (4) Understand the bias-variance tradeoffs in parameter estimation, and (5) Implement practical parameter learning algorithms.
The most fundamental approach to parameter learning is Maximum Likelihood Estimation (MLE): find parameters that maximize the probability of observed data.
Setup: Given a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$ of $N$ i.i.d. samples and a Bayesian Network structure, we seek parameters $\theta$ (all CPT entries) that maximize:
$$\mathcal{L}(\theta; \mathcal{D}) = P(\mathcal{D} | \theta) = \prod_{j=1}^{N} P(\mathbf{x}^{(j)} | \theta)$$
Decomposition Property:
Using the BN factorization: $$\log \mathcal{L}(\theta; \mathcal{D}) = \sum_{j=1}^{N} \sum_{i=1}^{n} \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$$
Rearranging: $$= \sum_{i=1}^{n} \sum_{j=1}^{N} \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$$
Each inner sum depends only on $\theta_i$, so we can maximize separately for each CPT!
The decomposition means parameter learning parallelizes perfectly. Each CPT can be learned on a separate processor using only the columns for that variable and its parents. This is a major computational advantage of Bayesian Networks.
MLE Solution:
For complete data, the MLE has a beautiful closed-form solution. Let $\theta_{x|\mathbf{u}}$ denote $P(X_i = x | Pa(X_i) = \mathbf{u})$. The MLE is:
$$\hat{\theta}_{x|\mathbf{u}}^{\text{MLE}} = \frac{N(X_i = x, Pa(X_i) = \mathbf{u})}{N(Pa(X_i) = \mathbf{u})}$$
where $N(\cdot)$ denotes the count in the dataset.
Intuition: The MLE is simply the empirical conditional frequency—the proportion of times $X = x$ among cases where parents equal $\mathbf{u}$.
Derivation: Using Lagrange multipliers to enforce $\sum_x \theta_{x|\mathbf{u}} = 1$, we maximize: $$\sum_x N(x, \mathbf{u}) \log \theta_{x|\mathbf{u}} + \lambda(\sum_x \theta_{x|\mathbf{u}} - 1)$$
Taking derivatives and solving yields the frequency estimator.
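Spelled out, the stationarity condition for each entry is:

$$\frac{\partial}{\partial \theta_{x|\mathbf{u}}}\left[\sum_{x'} N(x', \mathbf{u}) \log \theta_{x'|\mathbf{u}} + \lambda\Big(\sum_{x'} \theta_{x'|\mathbf{u}} - 1\Big)\right] = \frac{N(x, \mathbf{u})}{\theta_{x|\mathbf{u}}} + \lambda = 0 \quad\Rightarrow\quad \theta_{x|\mathbf{u}} = -\frac{N(x, \mathbf{u})}{\lambda}$$

Substituting into the constraint $\sum_x \theta_{x|\mathbf{u}} = 1$ gives $\lambda = -\sum_x N(x, \mathbf{u}) = -N(\mathbf{u})$, hence $\hat{\theta}_{x|\mathbf{u}} = N(x, \mathbf{u}) / N(\mathbf{u})$—the empirical frequency.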
```python
import numpy as np

def learn_cpt_mle(data, variable_idx, parent_indices, variable_states, parent_states_list):
    """
    Learn CPT parameters using Maximum Likelihood Estimation.

    Args:
        data: NxD array where each row is a sample
        variable_idx: Column index of the child variable
        parent_indices: List of column indices for parents
        variable_states: Number of states for child
        parent_states_list: List of number of states per parent

    Returns:
        CPT as numpy array with shape (parent_states..., variable_states)
    """
    # Initialize counts
    if parent_indices:
        shape = tuple(parent_states_list) + (variable_states,)
        joint_counts = np.zeros(shape)
        parent_counts = np.zeros(tuple(parent_states_list))
    else:
        joint_counts = np.zeros(variable_states)
        parent_counts = len(data)

    # Count occurrences
    for row in data:
        child_val = int(row[variable_idx])
        if parent_indices:
            parent_vals = tuple(int(row[p]) for p in parent_indices)
            joint_counts[parent_vals + (child_val,)] += 1
            parent_counts[parent_vals] += 1
        else:
            joint_counts[child_val] += 1

    # Compute MLE: joint / parent (zero-count rows fall back to uniform)
    if parent_indices:
        # Expand parent_counts to broadcast correctly
        parent_counts_expanded = parent_counts[..., np.newaxis]
        cpt = np.divide(joint_counts, parent_counts_expanded,
                        where=parent_counts_expanded > 0,
                        out=np.ones_like(joint_counts) / variable_states)
    else:
        cpt = joint_counts / parent_counts

    return cpt

# Example: Learn from synthetic data
np.random.seed(42)

# Generate data from known distribution
# Structure: A -> B, A -> C (A is root, B and C depend on A)
N = 1000
A = np.random.choice([0, 1], size=N, p=[0.3, 0.7])
B = np.array([np.random.choice([0, 1], p=[0.2, 0.8] if a else [0.9, 0.1]) for a in A])
C = np.array([np.random.choice([0, 1], p=[0.4, 0.6] if a else [0.7, 0.3]) for a in A])

data = np.column_stack([A, B, C])

print("=== MLE Parameter Learning ===\n")
print(f"Dataset size: {N} samples\n")

# Learn P(A)
cpt_A = learn_cpt_mle(data, 0, [], 2, [])
print("Learned P(A):")
print(f"  P(A=0) = {cpt_A[0]:.4f} (true: 0.30)")
print(f"  P(A=1) = {cpt_A[1]:.4f} (true: 0.70)")

# Learn P(B|A)
cpt_B = learn_cpt_mle(data, 1, [0], 2, [2])
print("\nLearned P(B|A):")
print(f"  P(B=0|A=0) = {cpt_B[0,0]:.4f} (true: 0.90)")
print(f"  P(B=1|A=0) = {cpt_B[0,1]:.4f} (true: 0.10)")
print(f"  P(B=0|A=1) = {cpt_B[1,0]:.4f} (true: 0.20)")
print(f"  P(B=1|A=1) = {cpt_B[1,1]:.4f} (true: 0.80)")
```

MLE has a critical weakness: with limited data, estimates can be extreme (0 or 1) and overconfident. Bayesian estimation addresses this by incorporating prior beliefs.
The Dirichlet Prior:
For a categorical distribution, the natural conjugate prior is the Dirichlet distribution. For a variable $X$ with $r$ states, given parents $\mathbf{u}$:
$$P(\theta_{\cdot|\mathbf{u}}) = \text{Dirichlet}(\alpha_1, \ldots, \alpha_r)$$
The hyperparameters $\alpha_k$ can be interpreted as 'pseudo-counts'—imaginary observations before seeing real data.
Posterior with Conjugate Prior:
After observing counts $N(X=k, Pa=\mathbf{u})$, the posterior is still Dirichlet:
$$P(\theta_{\cdot|\mathbf{u}} | \mathcal{D}) = \text{Dirichlet}(\alpha_1 + N_1, \ldots, \alpha_r + N_r)$$
MAP Estimate (Posterior Mode):
$$\hat{\theta}_{x|\mathbf{u}}^{\text{MAP}} = \frac{N(X=x, Pa=\mathbf{u}) + \alpha_x - 1}{N(Pa=\mathbf{u}) + \sum_k (\alpha_k - 1)}$$
Posterior Mean (Expected Value):
$$\hat{\theta}_{x|\mathbf{u}}^{\text{Mean}} = \frac{N(X=x, Pa=\mathbf{u}) + \alpha_x}{N(Pa=\mathbf{u}) + \sum_k \alpha_k}$$
Setting all αₖ = 1 gives Laplace (add-one) smoothing; setting αₖ = 1/2 gives the Jeffreys prior. Larger αₖ—and hence a larger equivalent sample size α₀ = rαₖ—pulls estimates more strongly toward the uniform distribution. The choice of α controls the strength of regularization toward the prior.
Advantages of Bayesian Estimation:
No zero probabilities: Even with N(x,u) = 0, the estimate is positive: $\alpha_x / (\sum \alpha)$
Smooth convergence: Estimates gradually move from prior to data as sample size increases
Uncertainty quantification: The posterior distribution captures estimation uncertainty, not just a point estimate
Prior incorporation: Domain knowledge (e.g., 'all states are roughly equally likely') can be encoded
Equivalent Sample Size (ESS):
The sum $\sum_k \alpha_k = \alpha_0$ is called the equivalent sample size—the strength of the prior in terms of imaginary observations. With real counts, the posterior mean is a convex combination of the MLE and the prior mean:

$$\hat{\theta}_{x|\mathbf{u}}^{\text{Mean}} = \frac{N(\mathbf{u})}{N(\mathbf{u}) + \alpha_0} \cdot \frac{N(x, \mathbf{u})}{N(\mathbf{u})} + \frac{\alpha_0}{N(\mathbf{u}) + \alpha_0} \cdot \frac{\alpha_x}{\alpha_0}$$

As $N \to \infty$, data dominates and estimates converge to the MLE.
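A quick numeric check of this convex-combination identity, using illustrative counts for a single CPT row:

```python
import numpy as np

# Illustrative counts N(X=x, Pa=u) for one CPT row over 3 states
counts = np.array([8.0, 1.0, 1.0])   # N(u) = 10 real samples
alpha = np.array([2.0, 2.0, 2.0])    # Dirichlet pseudo-counts, alpha_0 = 6

N_u = counts.sum()
alpha0 = alpha.sum()

# Posterior mean: (N_x + alpha_x) / (N_u + alpha_0)
posterior_mean = (counts + alpha) / (N_u + alpha0)

# Same value written as a convex combination of MLE and prior mean
mle = counts / N_u
prior_mean = alpha / alpha0
combo = (N_u / (N_u + alpha0)) * mle + (alpha0 / (N_u + alpha0)) * prior_mean

print(posterior_mean)                       # [0.625  0.1875 0.1875]
print(np.allclose(posterior_mean, combo))   # True
```

Note how the raw MLE of 0.8 for the first state is shrunk toward uniform (1/3) in proportion to the prior's equivalent sample size.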
```python
import numpy as np

def learn_cpt_bayesian(data, variable_idx, parent_indices, variable_states,
                       parent_states_list, prior_alpha=1.0):
    """
    Learn CPT with Bayesian estimation using a Dirichlet prior.

    Args:
        prior_alpha: Dirichlet hyperparameter (same for all states).
                     α=1 gives Laplace smoothing
                     α=0.5 gives Jeffreys prior

    Returns:
        Posterior mean estimates for CPT
    """
    # Initialize with prior pseudo-counts
    if parent_indices:
        shape = tuple(parent_states_list) + (variable_states,)
        counts = np.full(shape, prior_alpha)
    else:
        counts = np.full(variable_states, prior_alpha)

    # Add observed counts
    for row in data:
        child_val = int(row[variable_idx])
        if parent_indices:
            parent_vals = tuple(int(row[p]) for p in parent_indices)
            counts[parent_vals + (child_val,)] += 1
        else:
            counts[child_val] += 1

    # Compute posterior mean: (α + N) / sum(α + N)
    if parent_indices:
        totals = counts.sum(axis=-1, keepdims=True)
        cpt = counts / totals
    else:
        cpt = counts / counts.sum()

    return cpt

# Demonstrate Bayesian vs MLE with sparse data
print("=== Bayesian vs MLE: Sparse Data ===\n")

# Sparse data: only 10 observations
sparse_data = np.array([
    [0, 1],  # Parent=0, Child=1
    [0, 1],
    [0, 0],
    [1, 1],
    [1, 1],
    [1, 1],
    [1, 1],
    [1, 1],
    [1, 0],
    [1, 0],
])

print("Data: 10 samples")
print("Parent=0: 3 samples (Child: 1,1,0)")
print("Parent=1: 7 samples (Child: 1,1,1,1,1,0,0)\n")

# MLE estimates
cpt_mle = np.zeros((2, 2))
for p in [0, 1]:
    mask = sparse_data[:, 0] == p
    child_vals = sparse_data[mask, 1]
    if len(child_vals) > 0:
        cpt_mle[p, :] = [np.mean(child_vals == 0), np.mean(child_vals == 1)]

print("MLE Estimates P(Child|Parent):")
print(f"  P(C=0|P=0) = {cpt_mle[0,0]:.3f}, P(C=1|P=0) = {cpt_mle[0,1]:.3f}")
print(f"  P(C=0|P=1) = {cpt_mle[1,0]:.3f}, P(C=1|P=1) = {cpt_mle[1,1]:.3f}")

# Bayesian with Laplace smoothing (α=1)
cpt_bayes1 = learn_cpt_bayesian(sparse_data, 1, [0], 2, [2], prior_alpha=1.0)
print("\nBayesian (α=1, Laplace):")
print(f"  P(C=0|P=0) = {cpt_bayes1[0,0]:.3f}, P(C=1|P=0) = {cpt_bayes1[0,1]:.3f}")
print(f"  P(C=0|P=1) = {cpt_bayes1[1,0]:.3f}, P(C=1|P=1) = {cpt_bayes1[1,1]:.3f}")

# Bayesian with stronger prior (α=5)
cpt_bayes5 = learn_cpt_bayesian(sparse_data, 1, [0], 2, [2], prior_alpha=5.0)
print("\nBayesian (α=5, strong prior):")
print(f"  P(C=0|P=0) = {cpt_bayes5[0,0]:.3f}, P(C=1|P=0) = {cpt_bayes5[0,1]:.3f}")
print(f"  P(C=0|P=1) = {cpt_bayes5[1,0]:.3f}, P(C=1|P=1) = {cpt_bayes5[1,1]:.3f}")

print("\nNote: Stronger prior pulls estimates toward 0.5 (uniform)")
```

Real datasets invariably contain missing values. The Expectation-Maximization (EM) algorithm provides a principled approach to learning parameters when data is incomplete.
The Missing Data Problem:
With complete data, counting is straightforward. With missing values, we don't observe all variable configurations. For example, if sample $j$ has $X_2$ missing, we can't directly count its contribution to $P(X_2 | Pa(X_2))$.
The EM Algorithm:
EM iteratively refines parameter estimates:
E-Step (Expectation): Using current parameters $\theta^{(t)}$, compute expected counts by inferring the distribution over missing values: $$E[N(X=x, Pa=\mathbf{u}) | \mathcal{D}, \theta^{(t)}]$$
For each incomplete sample, run inference to get $P(\text{missing} | \text{observed}, \theta^{(t)})$.
M-Step (Maximization): Update parameters using expected counts as if they were real counts: $$\theta^{(t+1)} = \arg\max_\theta E[\log P(\mathcal{D} | \theta) | \mathcal{D}, \theta^{(t)}]$$
Convergence: EM monotonically increases the likelihood lower bound and converges to a local maximum (but not necessarily global).
EM finds local maxima, not global. Multiple random initializations are recommended. Also, with extensive missing data, the final estimates depend heavily on initialization. Consider using domain knowledge to initialize parameters reasonably.
EM for Bayesian Networks:
The BN structure dramatically simplifies EM:
E-Step uses standard BN inference: Given observed values, compute posterior over missing values using belief propagation or variable elimination.
M-Step decomposes by CPT: Expected counts for each CPT are computed independently, then parameters are estimated as before.
Sufficient statistics: Only expected counts matter—we don't need to enumerate all possible completions explicitly.
Algorithm Outline:
Initialize θ⁽⁰⁾ (random or prior-based)
repeat:
# E-Step
for each sample with missing values:
Compute P(missing | observed, θ⁽ᵗ⁾)
Accumulate expected counts
# M-Step
for each CPT:
θ⁽ᵗ⁺¹⁾ = expected_counts / expected_parent_counts
until convergence (Δ likelihood < ε)
```python
import numpy as np

def em_simple_bn(data, missing_mask, n_iter=50, tol=1e-6):
    """
    EM algorithm for a simple A -> B network with missing data.

    Args:
        data: Nx2 array (NaN for missing values)
        missing_mask: Boolean array indicating missing values
        n_iter: Maximum iterations
        tol: Convergence tolerance

    Returns:
        Learned P(A) and P(B|A)
    """
    N = len(data)

    # Initialize parameters (uniform here; in practice use random restarts)
    p_A = np.array([0.5, 0.5])                      # P(A=0), P(A=1)
    p_B_given_A = np.array([[0.5, 0.5], [0.5, 0.5]])  # P(B|A)

    prev_ll = float('-inf')

    for iteration in range(n_iter):
        # E-Step: Compute expected counts
        expected_A = np.zeros(2)
        expected_AB = np.zeros((2, 2))

        for i in range(N):
            a_val = data[i, 0]
            b_val = data[i, 1]
            a_missing = missing_mask[i, 0]
            b_missing = missing_mask[i, 1]

            if not a_missing and not b_missing:
                # Complete observation
                a, b = int(a_val), int(b_val)
                expected_A[a] += 1
                expected_AB[a, b] += 1
            elif a_missing and not b_missing:
                # A missing, B observed: P(A | B=b) ∝ P(A) * P(B=b | A)
                b = int(b_val)
                posterior_A = p_A * p_B_given_A[:, b]
                posterior_A /= posterior_A.sum()
                expected_A += posterior_A
                expected_AB[:, b] += posterior_A
            elif not a_missing and b_missing:
                # A observed, B missing
                a = int(a_val)
                expected_A[a] += 1
                expected_AB[a, :] += p_B_given_A[a, :]
            else:
                # Both missing - distribute over the joint
                for a in [0, 1]:
                    for b in [0, 1]:
                        prob = p_A[a] * p_B_given_A[a, b]
                        expected_A[a] += prob
                        expected_AB[a, b] += prob

        # M-Step: Update parameters
        p_A = expected_A / expected_A.sum()
        p_B_given_A = expected_AB / expected_AB.sum(axis=1, keepdims=True)

        # Check convergence (complete-case log-likelihood as a proxy)
        ll = compute_expected_ll(data, missing_mask, p_A, p_B_given_A)
        if abs(ll - prev_ll) < tol:
            print(f"Converged at iteration {iteration}")
            break
        prev_ll = ll

    return p_A, p_B_given_A

def compute_expected_ll(data, missing_mask, p_A, p_B_given_A):
    """Complete-case log-likelihood (used here only as a convergence proxy)."""
    ll = 0
    for i in range(len(data)):
        if not missing_mask[i].any():
            a, b = int(data[i, 0]), int(data[i, 1])
            ll += np.log(p_A[a] * p_B_given_A[a, b])
    return ll

# Example with missing data
np.random.seed(42)

# True parameters
true_pA = np.array([0.3, 0.7])
true_pB_A = np.array([[0.9, 0.1], [0.2, 0.8]])

# Generate complete data
N = 500
A = np.random.choice([0, 1], size=N, p=true_pA)
B = np.array([np.random.choice([0, 1], p=true_pB_A[a]) for a in A])
data = np.column_stack([A, B]).astype(float)

# Introduce 20% missing values randomly
missing_mask = np.random.random((N, 2)) < 0.2
data_missing = data.copy()
data_missing[missing_mask] = np.nan

print("=== EM for Missing Data ===\n")
print(f"Data: {N} samples, {missing_mask.sum()} missing values ({100*missing_mask.mean():.1f}%)\n")

p_A_learned, p_B_A_learned = em_simple_bn(data_missing, missing_mask)

print("\nLearned P(A):")
print(f"  P(A=0) = {p_A_learned[0]:.3f} (true: {true_pA[0]:.1f})")
print(f"  P(A=1) = {p_A_learned[1]:.3f} (true: {true_pA[1]:.1f})")
print("\nLearned P(B|A):")
print(f"  P(B=0|A=0) = {p_B_A_learned[0,0]:.3f} (true: {true_pB_A[0,0]:.1f})")
print(f"  P(B=1|A=1) = {p_B_A_learned[1,1]:.3f} (true: {true_pB_A[1,1]:.1f})")
```

The reliability of parameter estimates depends critically on sample size relative to CPT complexity. Understanding this relationship is essential for practical applications.
The Data Fragmentation Problem:
For a variable $X$ with $k$ parents each having $r$ states, the CPT has $r^k$ rows. Data is split across these rows, so each row sees only a fraction of samples.
Expected counts per row: $N / r^k$
With $N = 1000$, $k = 5$, $r = 2$: each row gets only $1000 / 32 \approx 31$ samples on average. Some rows may have very few or zero samples.
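Fragmentation is easy to see in simulation. The sketch below (illustrative, assuming uniformly random binary parents) counts how many samples land in each of the $2^5 = 32$ CPT rows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 1000, 5                      # 1000 samples, 5 binary parents
configs = rng.integers(0, 2, size=(N, k))

# Map each parent configuration to a row index 0 .. 2^k - 1
row_idx = configs @ (2 ** np.arange(k))
row_counts = np.bincount(row_idx, minlength=2 ** k)

# Mean count per row is exactly N / 2^k = 31.25
print(f"Rows: {2**k}, mean count per row: {row_counts.mean():.2f}")
print(f"Min count: {row_counts.min()}, rows with < 5 samples: {(row_counts < 5).sum()}")
```

With skewed (non-uniform) parent distributions the imbalance is worse: common configurations hoard the samples while rare rows get few or none.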
Rules of Thumb:
For reliable MLE estimates, each CPT row should have at least about 5 samples as a bare minimum, and ideally 100 or more—the per-row thresholds used in the table below.
Implication: Total samples needed scales as $O(r^k)$—exponentially in parent count.
| Binary Parents | CPT Rows | Min N (5/row) | Good N (100/row) |
|---|---|---|---|
| 2 | 4 | 20 | 400 |
| 3 | 8 | 40 | 800 |
| 4 | 16 | 80 | 1,600 |
| 5 | 32 | 160 | 3,200 |
| 6 | 64 | 320 | 6,400 |
| 10 | 1,024 | 5,120 | 102,400 |
When data is limited: (1) Use Bayesian estimation with informative priors, (2) Reduce parent count through structure learning constraints, (3) Use parameter tying to share parameters across similar contexts, (4) Consider compact representations like Noisy-OR, (5) Pool rare parent configurations.
Variance of MLE:
For a binomial proportion estimated from $n$ samples: $$\text{Var}(\hat{p}) = \frac{p(1-p)}{n}$$
Standard error: $\sqrt{p(1-p)/n}$
With $n = 10$ and $p = 0.5$: SE = 0.158 (95% CI width ≈ 0.63). With $n = 100$ and $p = 0.5$: SE = 0.05 (95% CI width ≈ 0.20).
Confidence Intervals:
Normal approximation 95% CI: $\hat{p} \pm 1.96 \sqrt{\hat{p}(1-\hat{p})/n}$
For small $n$, use exact binomial (Clopper-Pearson) or Bayesian credible intervals.
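The following sketch computes the normal-approximation interval from the text, alongside a Bayesian credible interval from the Beta posterior; the credible interval is computed by Monte Carlo sampling here (an assumption of this sketch, chosen to avoid a SciPy dependency for the exact Beta quantile):

```python
import numpy as np

def normal_ci(successes, n, z=1.96):
    """Normal-approximation 95% CI for a binomial proportion."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def beta_credible_interval(successes, n, alpha_prior=1.0, beta_prior=1.0,
                           level=0.95, n_draws=200_000, seed=0):
    """Equal-tailed credible interval from the Beta posterior, via sampling."""
    rng = np.random.default_rng(seed)
    draws = rng.beta(alpha_prior + successes, beta_prior + (n - successes), n_draws)
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi

# n = 10 vs n = 100 with p_hat = 0.5, as in the text
for n in (10, 100):
    lo, hi = normal_ci(n // 2, n)
    blo, bhi = beta_credible_interval(n // 2, n)
    print(f"n={n:3d}: normal CI width = {hi - lo:.2f}, "
          f"Beta credible width = {bhi - blo:.2f}")
```

The widths reproduce the ≈ 0.62 vs ≈ 0.20 figures quoted above, and the Beta interval is somewhat narrower at small $n$ because the uniform prior contributes two pseudo-observations.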
Monitoring Estimation Quality:
During learning, track the per-row sample counts $N(Pa = \mathbf{u})$, the standard errors of the resulting estimates, and the fraction of CPT rows that fall below your minimum-count threshold; rows with few samples warrant stronger priors or parameter tying.
When domain knowledge suggests that certain parameters should be equal, parameter tying can dramatically reduce the number of free parameters and improve estimation with limited data.
Types of Parameter Sharing:
1. Within-CPT Tying: Certain rows of a CPT share the same distribution.
Example: For P(Disease | Symptom1, Symptom2), we might assume the distribution is the same whether only Symptom1 or only Symptom2 is present: $P(D \mid S_1{=}1, S_2{=}0) = P(D \mid S_1{=}0, S_2{=}1)$. This halves the parameters for those rows.
2. Across-CPT Tying: Different nodes share CPT parameters.
Example: In a temporal model, P(Xₜ | Xₜ₋₁) might be the same for all time steps—stationary dynamics. One CPT serves all time slices.
3. Symmetry Constraints: Parameters obey symmetry relationships.
Example: P(Y=1 | X=1) = P(X=1 | Y=1) when the graph structure encodes symmetric relationships.
With tied parameters, aggregate counts from all configurations that share the parameter. MLE becomes: θ = (sum of counts across tied configs) / (sum of parent counts across tied configs). The effective sample size increases proportionally to the number of tied configurations.
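A minimal sketch of pooled-count MLE with tied rows, using hypothetical counts for the Disease/Symptoms example above (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical counts N(D=d, S1, S2): rows indexed by (s1, s2), columns by d
counts = np.array([
    [40.0, 10.0],   # (S1=0, S2=0)
    [12.0, 18.0],   # (S1=0, S2=1)
    [10.0, 20.0],   # (S1=1, S2=0)
    [ 3.0, 27.0],   # (S1=1, S2=1)
])

# Untied MLE: each row normalized separately
untied = counts / counts.sum(axis=1, keepdims=True)

# Tie rows (0,1) and (1,0): pool their counts, estimate once, share the result
tied = untied.copy()
pooled = counts[1] + counts[2]
tied[1] = tied[2] = pooled / pooled.sum()

print("Untied row (S1=0,S2=1):", untied[1])   # estimated from 30 samples
print("Tied rows share:       ", tied[1])     # estimated from 60 pooled samples
```

The tied estimate uses twice the data of either individual row, which is exactly the effective-sample-size gain described above.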
Common Tying Patterns:
Order-independent (exchangeable) parents: If the effect depends only on the number of active causes, not which ones, then all parent configurations with the same count of active causes can share a single CPT row.
Homogeneous networks: In network models where nodes represent similar entities (e.g., social network members), all nodes might share the same CPT template.
Temporal stationarity: In Dynamic Bayesian Networks, transition probabilities often don't change over time, allowing massive parameter sharing across time slices.
Benefits: fewer free parameters to estimate, larger effective sample sizes per parameter, and better generalization when data is limited.
Implementing parameter learning in practice requires attention to several engineering details that theory often glosses over.
Data Preprocessing: ensure consistent state encodings across samples and discretize continuous variables before counting.
Numerical Stability: work with log-probabilities when multiplying many factors to avoid underflow.
Validation: check learned parameters against held-out likelihood and domain sanity constraints.
Scalability: counting is a single pass over the data, so sufficient statistics can be accumulated in streaming or parallel fashion.
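Of these engineering concerns, numerical underflow is the one most likely to bite in practice: a product of many probabilities quickly drops below the smallest representable double. A minimal sketch of the standard log-space fix:

```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.uniform(0.01, 0.2, size=2000)   # per-sample factor probabilities

naive = np.prod(probs)           # underflows to 0.0 for long products
log_lik = np.sum(np.log(probs))  # stable: sum of logs

print(naive)     # 0.0 (underflow)
print(log_lik)   # a finite, very negative number
```

Likelihoods, EM convergence checks, and model-comparison scores should all be computed and compared in log space for this reason.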
What's Next:
The final page of this module addresses Structure Learning—discovering the DAG structure itself from data. This is the most challenging aspect of Bayesian Network learning, involving search through an exponentially large space of possible structures.
You now understand parameter learning for Bayesian Networks. You can implement MLE and Bayesian estimation, handle missing data with EM, and apply practical strategies for limited data scenarios.