Once the structure of a Bayesian Network is determined, the next challenge is parameter learning: estimating the entries of each CPT from observed data. This is where the rubber meets the road—where abstract probabilistic models become calibrated to real-world phenomena.
Parameter learning in Bayesian Networks benefits enormously from the factorization property: each CPT can be learned independently using only the relevant variables. This decomposability transforms what could be an intractable high-dimensional estimation problem into a set of manageable local estimation tasks.
This page develops the complete theory and practice of parameter learning, from maximum likelihood estimation through Bayesian methods to handling the inevitable challenge of missing data.
By completing this page, you will: (1) Derive maximum likelihood estimates for CPT parameters, (2) Apply Bayesian estimation with Dirichlet priors, (3) Handle missing data using the EM algorithm, (4) Understand the bias-variance tradeoffs in parameter estimation, and (5) Implement practical parameter learning algorithms.
The most fundamental approach to parameter learning is Maximum Likelihood Estimation (MLE): find parameters that maximize the probability of observed data.
Setup: Given a dataset $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$ of $N$ i.i.d. samples and a Bayesian Network structure, we seek parameters $\theta$ (all CPT entries) that maximize:
$$\mathcal{L}(\theta; \mathcal{D}) = P(\mathcal{D} | \theta) = \prod_{j=1}^{N} P(\mathbf{x}^{(j)} | \theta)$$
Decomposition Property:
Using the BN factorization: $$\log \mathcal{L}(\theta; \mathcal{D}) = \sum_{j=1}^{N} \sum_{i=1}^{n} \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$$
Rearranging: $$= \sum_{i=1}^{n} \sum_{j=1}^{N} \log P(X_i^{(j)} | Pa(X_i)^{(j)}, \theta_i)$$
Each inner sum depends only on $\theta_i$, so we can maximize separately for each CPT!
The decomposition means parameter learning parallelizes perfectly. Each CPT can be learned on a separate processor using only the columns for that variable and its parents. This is a major computational advantage of Bayesian Networks.
MLE Solution:
For complete data, the MLE has a beautiful closed-form solution. Let $\theta_{x|\mathbf{u}}$ denote $P(X_i = x | Pa(X_i) = \mathbf{u})$. The MLE is:
$$\hat{\theta}_{x|\mathbf{u}}^{\text{MLE}} = \frac{N(X_i = x, Pa(X_i) = \mathbf{u})}{N(Pa(X_i) = \mathbf{u})}$$
where $N(\cdot)$ denotes the count in the dataset.
Intuition: The MLE is simply the empirical conditional frequency—the proportion of times $X = x$ among cases where parents equal $\mathbf{u}$.
Derivation: Using Lagrange multipliers to enforce $\sum_x \theta_{x|\mathbf{u}} = 1$, we maximize: $$\sum_x N(x, \mathbf{u}) \log \theta_{x|\mathbf{u}} + \lambda(\sum_x \theta_{x|\mathbf{u}} - 1)$$
Taking derivatives and solving yields the frequency estimator.
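Spelled out, the stationarity condition for each entry is:

$$\frac{\partial}{\partial \theta_{x|\mathbf{u}}}\left[\sum_{x'} N(x', \mathbf{u}) \log \theta_{x'|\mathbf{u}} + \lambda\Big(\sum_{x'} \theta_{x'|\mathbf{u}} - 1\Big)\right] = \frac{N(x, \mathbf{u})}{\theta_{x|\mathbf{u}}} + \lambda = 0 \quad\Rightarrow\quad \theta_{x|\mathbf{u}} = -\frac{N(x, \mathbf{u})}{\lambda}$$

Substituting into the constraint $\sum_x \theta_{x|\mathbf{u}} = 1$ gives $\lambda = -\sum_x N(x, \mathbf{u}) = -N(\mathbf{u})$, hence $\hat{\theta}_{x|\mathbf{u}} = N(x, \mathbf{u}) / N(\mathbf{u})$—the empirical frequency.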
```python
import numpy as np

def learn_cpt_mle(data, variable_idx, parent_indices, variable_states, parent_states_list):
    """
    Learn CPT parameters using Maximum Likelihood Estimation.

    Args:
        data: NxD array where each row is a sample
        variable_idx: Column index of the child variable
        parent_indices: List of column indices for parents
        variable_states: Number of states for child
        parent_states_list: List of number of states per parent

    Returns:
        CPT as numpy array with shape (parent_states..., variable_states)
    """
    # Initialize counts
    if parent_indices:
        shape = tuple(parent_states_list) + (variable_states,)
        joint_counts = np.zeros(shape)
        parent_counts = np.zeros(tuple(parent_states_list))
    else:
        joint_counts = np.zeros(variable_states)
        parent_counts = len(data)

    # Count occurrences
    for row in data:
        child_val = int(row[variable_idx])
        if parent_indices:
            parent_vals = tuple(int(row[p]) for p in parent_indices)
            joint_counts[parent_vals + (child_val,)] += 1
            parent_counts[parent_vals] += 1
        else:
            joint_counts[child_val] += 1

    # Compute MLE: joint / parent (zero-count rows fall back to uniform)
    if parent_indices:
        # Expand parent_counts to broadcast correctly
        parent_counts_expanded = parent_counts[..., np.newaxis]
        cpt = np.divide(joint_counts, parent_counts_expanded,
                        where=parent_counts_expanded > 0,
                        out=np.ones_like(joint_counts) / variable_states)
    else:
        cpt = joint_counts / parent_counts

    return cpt

# Example: Learn from synthetic data
np.random.seed(42)

# Generate data from known distribution
# Structure: A -> B, A -> C (A is root, B and C depend on A)
N = 1000
A = np.random.choice([0, 1], size=N, p=[0.3, 0.7])
B = np.array([np.random.choice([0, 1], p=[0.2, 0.8] if a else [0.9, 0.1]) for a in A])
C = np.array([np.random.choice([0, 1], p=[0.4, 0.6] if a else [0.7, 0.3]) for a in A])

data = np.column_stack([A, B, C])

print("=== MLE Parameter Learning ===\n")
print(f"Dataset size: {N} samples\n")

# Learn P(A)
cpt_A = learn_cpt_mle(data, 0, [], 2, [])
print("Learned P(A):")
print(f"  P(A=0) = {cpt_A[0]:.4f} (true: 0.30)")
print(f"  P(A=1) = {cpt_A[1]:.4f} (true: 0.70)")

# Learn P(B|A)
cpt_B = learn_cpt_mle(data, 1, [0], 2, [2])
print("\nLearned P(B|A):")
print(f"  P(B=0|A=0) = {cpt_B[0,0]:.4f} (true: 0.90)")
print(f"  P(B=1|A=0) = {cpt_B[0,1]:.4f} (true: 0.10)")
print(f"  P(B=0|A=1) = {cpt_B[1,0]:.4f} (true: 0.20)")
print(f"  P(B=1|A=1) = {cpt_B[1,1]:.4f} (true: 0.80)")
```

MLE has a critical weakness: with limited data, estimates can be extreme (0 or 1) and overconfident. Bayesian estimation addresses this by incorporating prior beliefs.
The Dirichlet Prior:
For a categorical distribution, the natural conjugate prior is the Dirichlet distribution. For a variable $X$ with $r$ states, given parents $\mathbf{u}$:
$$P(\theta_{\cdot|\mathbf{u}}) = \text{Dirichlet}(\alpha_1, \ldots, \alpha_r)$$
The hyperparameters $\alpha_k$ can be interpreted as 'pseudo-counts'—imaginary observations before seeing real data.
Posterior with Conjugate Prior:
After observing counts $N(X=k, Pa=\mathbf{u})$, the posterior is still Dirichlet:
$$P(\theta_{\cdot|\mathbf{u}} | \mathcal{D}) = \text{Dirichlet}(\alpha_1 + N_1, \ldots, \alpha_r + N_r)$$
MAP Estimate (Posterior Mode):
$$\hat{\theta}_{x|\mathbf{u}}^{\text{MAP}} = \frac{N(X=x, Pa=\mathbf{u}) + \alpha_x - 1}{N(Pa=\mathbf{u}) + \sum_k (\alpha_k - 1)}$$
Posterior Mean (Expected Value):
$$\hat{\theta}_{x|\mathbf{u}}^{\text{Mean}} = \frac{N(X=x, Pa=\mathbf{u}) + \alpha_x}{N(Pa=\mathbf{u}) + \sum_k \alpha_k}$$
Setting all αₖ = 1 gives Laplace (add-one) smoothing; setting αₖ = 1/2 gives the Jeffreys prior. Larger αₖ—and hence a larger equivalent sample size α₀ = rαₖ—pulls estimates more strongly toward the uniform distribution. The choice of α controls the strength of regularization toward the prior.
Advantages of Bayesian Estimation:
No zero probabilities: Even with N(x,u) = 0, the estimate is positive: $\alpha_x / (\sum \alpha)$
Smooth convergence: Estimates gradually move from prior to data as sample size increases
Uncertainty quantification: The posterior distribution captures estimation uncertainty, not just a point estimate
Prior incorporation: Domain knowledge (e.g., 'all states are roughly equally likely') can be encoded
Equivalent Sample Size (ESS):
The sum $\sum_k \alpha_k = \alpha_0$ is called the equivalent sample size—the strength of the prior in terms of imaginary observations. With real counts, the posterior mean is a convex combination of the MLE and the prior mean:

$$\hat{\theta}_{x|\mathbf{u}}^{\text{Mean}} = \frac{N(\mathbf{u})}{N(\mathbf{u}) + \alpha_0} \cdot \frac{N(x, \mathbf{u})}{N(\mathbf{u})} + \frac{\alpha_0}{N(\mathbf{u}) + \alpha_0} \cdot \frac{\alpha_x}{\alpha_0}$$

As $N \to \infty$, data dominates and estimates converge to the MLE.
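A quick numeric check of this convex-combination identity, using illustrative counts for a single CPT row:

```python
import numpy as np

# Illustrative counts N(X=x, Pa=u) for one CPT row over 3 states
counts = np.array([8.0, 1.0, 1.0])   # N(u) = 10 real samples
alpha = np.array([2.0, 2.0, 2.0])    # Dirichlet pseudo-counts, alpha_0 = 6

N_u = counts.sum()
alpha0 = alpha.sum()

# Posterior mean: (N_x + alpha_x) / (N_u + alpha_0)
posterior_mean = (counts + alpha) / (N_u + alpha0)

# Same value written as a convex combination of MLE and prior mean
mle = counts / N_u
prior_mean = alpha / alpha0
combo = (N_u / (N_u + alpha0)) * mle + (alpha0 / (N_u + alpha0)) * prior_mean

print(posterior_mean)                       # [0.625  0.1875 0.1875]
print(np.allclose(posterior_mean, combo))   # True
```

Note how the raw MLE of 0.8 for the first state is shrunk toward uniform (1/3) in proportion to the prior's equivalent sample size.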
```python
import numpy as np

def learn_cpt_bayesian(data, variable_idx, parent_indices, variable_states,
                       parent_states_list, prior_alpha=1.0):
    """
    Learn CPT with Bayesian estimation using a Dirichlet prior.

    Args:
        prior_alpha: Dirichlet hyperparameter (same for all states).
                     α=1 gives Laplace smoothing
                     α=0.5 gives Jeffreys prior

    Returns:
        Posterior mean estimates for CPT
    """
    # Initialize with prior pseudo-counts
    if parent_indices:
        shape = tuple(parent_states_list) + (variable_states,)
        counts = np.full(shape, prior_alpha)
    else:
        counts = np.full(variable_states, prior_alpha)

    # Add observed counts
    for row in data:
        child_val = int(row[variable_idx])
        if parent_indices:
            parent_vals = tuple(int(row[p]) for p in parent_indices)
            counts[parent_vals + (child_val,)] += 1
        else:
            counts[child_val] += 1

    # Compute posterior mean: (α + N) / sum(α + N)
    if parent_indices:
        totals = counts.sum(axis=-1, keepdims=True)
        cpt = counts / totals
    else:
        cpt = counts / counts.sum()

    return cpt

# Demonstrate Bayesian vs MLE with sparse data
print("=== Bayesian vs MLE: Sparse Data ===\n")

# Sparse data: only 10 observations
sparse_data = np.array([
    [0, 1],  # Parent=0, Child=1
    [0, 1],
    [0, 0],
    [1, 1],
    [1, 1],
    [1, 1],
    [1, 1],
    [1, 1],
    [1, 0],
    [1, 0],
])

print("Data: 10 samples")
print("Parent=0: 3 samples (Child: 1,1,0)")
print("Parent=1: 7 samples (Child: 1,1,1,1,1,0,0)\n")

# MLE estimates
cpt_mle = np.zeros((2, 2))
for p in [0, 1]:
    mask = sparse_data[:, 0] == p
    child_vals = sparse_data[mask, 1]
    if len(child_vals) > 0:
        cpt_mle[p, :] = [np.mean(child_vals == 0), np.mean(child_vals == 1)]

print("MLE Estimates P(Child|Parent):")
print(f"  P(C=0|P=0) = {cpt_mle[0,0]:.3f}, P(C=1|P=0) = {cpt_mle[0,1]:.3f}")
print(f"  P(C=0|P=1) = {cpt_mle[1,0]:.3f}, P(C=1|P=1) = {cpt_mle[1,1]:.3f}")

# Bayesian with Laplace smoothing (α=1)
cpt_bayes1 = learn_cpt_bayesian(sparse_data, 1, [0], 2, [2], prior_alpha=1.0)
print("\nBayesian (α=1, Laplace):")
print(f"  P(C=0|P=0) = {cpt_bayes1[0,0]:.3f}, P(C=1|P=0) = {cpt_bayes1[0,1]:.3f}")
print(f"  P(C=0|P=1) = {cpt_bayes1[1,0]:.3f}, P(C=1|P=1) = {cpt_bayes1[1,1]:.3f}")

# Bayesian with stronger prior (α=5)
cpt_bayes5 = learn_cpt_bayesian(sparse_data, 1, [0], 2, [2], prior_alpha=5.0)
print("\nBayesian (α=5, strong prior):")
print(f"  P(C=0|P=0) = {cpt_bayes5[0,0]:.3f}, P(C=1|P=0) = {cpt_bayes5[0,1]:.3f}")
print(f"  P(C=0|P=1) = {cpt_bayes5[1,0]:.3f}, P(C=1|P=1) = {cpt_bayes5[1,1]:.3f}")

print("\nNote: Stronger prior pulls estimates toward 0.5 (uniform)")
```

Real datasets invariably contain missing values. The Expectation-Maximization (EM) algorithm provides a principled approach to learning parameters when data is incomplete.
The Missing Data Problem:
With complete data, counting is straightforward. With missing values, we don't observe all variable configurations. For example, if sample $j$ has $X_2$ missing, we can't directly count its contribution to $P(X_2 | Pa(X_2))$.
The EM Algorithm:
EM iteratively refines parameter estimates:
E-Step (Expectation): Using current parameters $\theta^{(t)}$, compute expected counts by inferring the distribution over missing values: $$E[N(X=x, Pa=\mathbf{u}) | \mathcal{D}, \theta^{(t)}]$$
For each incomplete sample, run inference to get $P(\text{missing} | \text{observed}, \theta^{(t)})$.
M-Step (Maximization): Update parameters using expected counts as if they were real counts: $$\theta^{(t+1)} = \arg\max_\theta E[\log P(\mathcal{D} | \theta) | \mathcal{D}, \theta^{(t)}]$$
Convergence: EM monotonically increases the likelihood lower bound and converges to a local maximum (but not necessarily global).
EM finds local maxima, not global. Multiple random initializations are recommended. Also, with extensive missing data, the final estimates depend heavily on initialization. Consider using domain knowledge to initialize parameters reasonably.
EM for Bayesian Networks:
The BN structure dramatically simplifies EM:
E-Step uses standard BN inference: Given observed values, compute posterior over missing values using belief propagation or variable elimination.
M-Step decomposes by CPT: Expected counts for each CPT are computed independently, then parameters are estimated as before.
Sufficient statistics: Only expected counts matter—we don't need to enumerate all possible completions explicitly.
Algorithm Outline:
Initialize θ⁽⁰⁾ (random or prior-based)
repeat:
# E-Step
for each sample with missing values:
Compute P(missing | observed, θ⁽ᵗ⁾)
Accumulate expected counts
# M-Step
for each CPT:
θ⁽ᵗ⁺¹⁾ = expected_counts / expected_parent_counts
until convergence (Δ likelihood < ε)
```python
import numpy as np

def em_simple_bn(data, missing_mask, n_iter=50, tol=1e-6):
    """
    EM algorithm for a simple A -> B network with missing data.

    Args:
        data: Nx2 array (NaN for missing values)
        missing_mask: Boolean array indicating missing values
        n_iter: Maximum iterations
        tol: Convergence tolerance

    Returns:
        Learned P(A) and P(B|A)
    """
    N = len(data)

    # Initialize parameters (uniform here; in practice use random restarts)
    p_A = np.array([0.5, 0.5])                      # P(A=0), P(A=1)
    p_B_given_A = np.array([[0.5, 0.5], [0.5, 0.5]])  # P(B|A)

    prev_ll = float('-inf')

    for iteration in range(n_iter):
        # E-Step: Compute expected counts
        expected_A = np.zeros(2)
        expected_AB = np.zeros((2, 2))

        for i in range(N):
            a_val = data[i, 0]
            b_val = data[i, 1]
            a_missing = missing_mask[i, 0]
            b_missing = missing_mask[i, 1]

            if not a_missing and not b_missing:
                # Complete observation
                a, b = int(a_val), int(b_val)
                expected_A[a] += 1
                expected_AB[a, b] += 1
            elif a_missing and not b_missing:
                # A missing, B observed: P(A | B=b) ∝ P(A) * P(B=b | A)
                b = int(b_val)
                posterior_A = p_A * p_B_given_A[:, b]
                posterior_A /= posterior_A.sum()
                expected_A += posterior_A
                expected_AB[:, b] += posterior_A
            elif not a_missing and b_missing:
                # A observed, B missing
                a = int(a_val)
                expected_A[a] += 1
                expected_AB[a, :] += p_B_given_A[a, :]
            else:
                # Both missing - distribute over the joint
                for a in [0, 1]:
                    for b in [0, 1]:
                        prob = p_A[a] * p_B_given_A[a, b]
                        expected_A[a] += prob
                        expected_AB[a, b] += prob

        # M-Step: Update parameters
        p_A = expected_A / expected_A.sum()
        p_B_given_A = expected_AB / expected_AB.sum(axis=1, keepdims=True)

        # Check convergence (complete-case log-likelihood as a proxy)
        ll = compute_expected_ll(data, missing_mask, p_A, p_B_given_A)
        if abs(ll - prev_ll) < tol:
            print(f"Converged at iteration {iteration}")
            break
        prev_ll = ll

    return p_A, p_B_given_A

def compute_expected_ll(data, missing_mask, p_A, p_B_given_A):
    """Complete-case log-likelihood (used here only as a convergence proxy)."""
    ll = 0
    for i in range(len(data)):
        if not missing_mask[i].any():
            a, b = int(data[i, 0]), int(data[i, 1])
            ll += np.log(p_A[a] * p_B_given_A[a, b])
    return ll

# Example with missing data
np.random.seed(42)

# True parameters
true_pA = np.array([0.3, 0.7])
true_pB_A = np.array([[0.9, 0.1], [0.2, 0.8]])

# Generate complete data
N = 500
A = np.random.choice([0, 1], size=N, p=true_pA)
B = np.array([np.random.choice([0, 1], p=true_pB_A[a]) for a in A])
data = np.column_stack([A, B]).astype(float)

# Introduce 20% missing values randomly
missing_mask = np.random.random((N, 2)) < 0.2
data_missing = data.copy()
data_missing[missing_mask] = np.nan

print("=== EM for Missing Data ===\n")
print(f"Data: {N} samples, {missing_mask.sum()} missing values ({100*missing_mask.mean():.1f}%)\n")

p_A_learned, p_B_A_learned = em_simple_bn(data_missing, missing_mask)

print("\nLearned P(A):")
print(f"  P(A=0) = {p_A_learned[0]:.3f} (true: {true_pA[0]:.1f})")
print(f"  P(A=1) = {p_A_learned[1]:.3f} (true: {true_pA[1]:.1f})")
print("\nLearned P(B|A):")
print(f"  P(B=0|A=0) = {p_B_A_learned[0,0]:.3f} (true: {true_pB_A[0,0]:.1f})")
print(f"  P(B=1|A=1) = {p_B_A_learned[1,1]:.3f} (true: {true_pB_A[1,1]:.1f})")
```

The reliability of parameter estimates depends critically on sample size relative to CPT complexity. Understanding this relationship is essential for practical applications.
The Data Fragmentation Problem:
For a variable $X$ with $k$ parents each having $r$ states, the CPT has $r^k$ rows. Data is split across these rows, so each row sees only a fraction of samples.
Expected counts per row: $N / r^k$
With $N = 1000$, $k = 5$, $r = 2$: each row gets only $1000 / 32 \approx 31$ samples on average. Some rows may have very few or zero samples.
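Fragmentation is easy to see in simulation. The sketch below (illustrative, assuming uniformly random binary parents) counts how many samples land in each of the $2^5 = 32$ CPT rows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 1000, 5                      # 1000 samples, 5 binary parents
configs = rng.integers(0, 2, size=(N, k))

# Map each parent configuration to a row index 0 .. 2^k - 1
row_idx = configs @ (2 ** np.arange(k))
row_counts = np.bincount(row_idx, minlength=2 ** k)

# Mean count per row is exactly N / 2^k = 31.25
print(f"Rows: {2**k}, mean count per row: {row_counts.mean():.2f}")
print(f"Min count: {row_counts.min()}, rows with < 5 samples: {(row_counts < 5).sum()}")
```

With skewed (non-uniform) parent distributions the imbalance is worse: common configurations hoard the samples while rare rows get few or none.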
Rules of Thumb:
For reliable MLE estimates, each CPT row should have at least about 5 samples as a bare minimum, and ideally 100 or more—the per-row thresholds used in the table below.
Implication: Total samples needed scales as $O(r^k)$—exponentially in parent count.
| Binary Parents | CPT Rows | Min N (5/row) | Good N (100/row) |
|---|---|---|---|
| 2 | 4 | 20 | 400 |
| 3 | 8 | 40 | 800 |
| 4 | 16 | 80 | 1,600 |
| 5 | 32 | 160 | 3,200 |
| 6 | 64 | 320 | 6,400 |
| 10 | 1,024 | 5,120 | 102,400 |
When data is limited: (1) Use Bayesian estimation with informative priors, (2) Reduce parent count through structure learning constraints, (3) Use parameter tying to share parameters across similar contexts, (4) Consider compact representations like Noisy-OR, (5) Pool rare parent configurations.
Variance of MLE:
For a binomial proportion estimated from $n$ samples: $$\text{Var}(\hat{p}) = \frac{p(1-p)}{n}$$
Standard error: $\sqrt{p(1-p)/n}$
With $n = 10$ and $p = 0.5$: SE = 0.158 (95% CI width ≈ 0.63). With $n = 100$ and $p = 0.5$: SE = 0.05 (95% CI width ≈ 0.20).
Confidence Intervals:
Normal approximation 95% CI: $\hat{p} \pm 1.96 \sqrt{\hat{p}(1-\hat{p})/n}$
For small $n$, use exact binomial (Clopper-Pearson) or Bayesian credible intervals.
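The following sketch computes the normal-approximation interval from the text, alongside a Bayesian credible interval from the Beta posterior; the credible interval is computed by Monte Carlo sampling here (an assumption of this sketch, chosen to avoid a SciPy dependency for the exact Beta quantile):

```python
import numpy as np

def normal_ci(successes, n, z=1.96):
    """Normal-approximation 95% CI for a binomial proportion."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def beta_credible_interval(successes, n, alpha_prior=1.0, beta_prior=1.0,
                           level=0.95, n_draws=200_000, seed=0):
    """Equal-tailed credible interval from the Beta posterior, via sampling."""
    rng = np.random.default_rng(seed)
    draws = rng.beta(alpha_prior + successes, beta_prior + (n - successes), n_draws)
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi

# n = 10 vs n = 100 with p_hat = 0.5, as in the text
for n in (10, 100):
    lo, hi = normal_ci(n // 2, n)
    blo, bhi = beta_credible_interval(n // 2, n)
    print(f"n={n:3d}: normal CI width = {hi - lo:.2f}, "
          f"Beta credible width = {bhi - blo:.2f}")
```

The widths reproduce the ≈ 0.62 vs ≈ 0.20 figures quoted above, and the Beta interval is somewhat narrower at small $n$ because the uniform prior contributes two pseudo-observations.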
Monitoring Estimation Quality:
During learning, track the per-row sample counts $N(Pa = \mathbf{u})$, the standard errors of the resulting estimates, and the fraction of CPT rows that fall below your minimum-count threshold; rows with few samples warrant stronger priors or parameter tying.
When domain knowledge suggests that certain parameters should be equal, parameter tying can dramatically reduce the number of free parameters and improve estimation with limited data.
Types of Parameter Sharing:
1. Within-CPT Tying: Certain rows of a CPT share the same distribution.
Example: For P(Disease | Symptom1, Symptom2), we might assume the distribution is the same whether only Symptom1 or only Symptom2 is present: $P(D \mid S_1{=}1, S_2{=}0) = P(D \mid S_1{=}0, S_2{=}1)$. This halves the parameters for those rows.
2. Across-CPT Tying: Different nodes share CPT parameters.
Example: In a temporal model, P(Xₜ | Xₜ₋₁) might be the same for all time steps—stationary dynamics. One CPT serves all time slices.
3. Symmetry Constraints: Parameters obey symmetry relationships.
Example: P(Y=1 | X=1) = P(X=1 | Y=1) when the graph structure encodes symmetric relationships.
With tied parameters, aggregate counts from all configurations that share the parameter. MLE becomes: θ = (sum of counts across tied configs) / (sum of parent counts across tied configs). The effective sample size increases proportionally to the number of tied configurations.
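A minimal sketch of pooled-count MLE with tied rows, using hypothetical counts for the Disease/Symptoms example above (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical counts N(D=d, S1, S2): rows indexed by (s1, s2), columns by d
counts = np.array([
    [40.0, 10.0],   # (S1=0, S2=0)
    [12.0, 18.0],   # (S1=0, S2=1)
    [10.0, 20.0],   # (S1=1, S2=0)
    [ 3.0, 27.0],   # (S1=1, S2=1)
])

# Untied MLE: each row normalized separately
untied = counts / counts.sum(axis=1, keepdims=True)

# Tie rows (0,1) and (1,0): pool their counts, estimate once, share the result
tied = untied.copy()
pooled = counts[1] + counts[2]
tied[1] = tied[2] = pooled / pooled.sum()

print("Untied row (S1=0,S2=1):", untied[1])   # estimated from 30 samples
print("Tied rows share:       ", tied[1])     # estimated from 60 pooled samples
```

The tied estimate uses twice the data of either individual row, which is exactly the effective-sample-size gain described above.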
Common Tying Patterns:
Order-independent (exchangeable) parents: If the effect depends only on the number of active causes, not which ones, then all parent configurations with the same count of active causes can share a single CPT row.
Homogeneous networks: In network models where nodes represent similar entities (e.g., social network members), all nodes might share the same CPT template.
Temporal stationarity: In Dynamic Bayesian Networks, transition probabilities often don't change over time, allowing massive parameter sharing across time slices.
Benefits: fewer free parameters to estimate, larger effective sample sizes per parameter, and better generalization when data is limited.
Implementing parameter learning in practice requires attention to several engineering details that theory often glosses over.
Data Preprocessing: ensure consistent state encodings across samples and discretize continuous variables before counting.
Numerical Stability: work with log-probabilities when multiplying many factors to avoid underflow.
Validation: check learned parameters against held-out likelihood and domain sanity constraints.
Scalability: counting is a single pass over the data, so sufficient statistics can be accumulated in streaming or parallel fashion.
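Of these engineering concerns, numerical underflow is the one most likely to bite in practice: a product of many probabilities quickly drops below the smallest representable double. A minimal sketch of the standard log-space fix:

```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.uniform(0.01, 0.2, size=2000)   # per-sample factor probabilities

naive = np.prod(probs)           # underflows to 0.0 for long products
log_lik = np.sum(np.log(probs))  # stable: sum of logs

print(naive)     # 0.0 (underflow)
print(log_lik)   # a finite, very negative number
```

Likelihoods, EM convergence checks, and model-comparison scores should all be computed and compared in log space for this reason.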
What's Next:
The final page of this module addresses Structure Learning—discovering the DAG structure itself from data. This is the most challenging aspect of Bayesian Network learning, involving search through an exponentially large space of possible structures.
You now understand parameter learning for Bayesian Networks. You can implement MLE and Bayesian estimation, handle missing data with EM, and apply practical strategies for limited data scenarios.