Consider two fundamentally different ways to describe a relationship between variables:
Scenario 1 — Causation: "Smoking causes lung cancer." There's a clear direction here: smoking comes first (causally), and lung cancer may follow as an effect. If we wanted to reason about this relationship, we'd naturally think in terms of "if you smoke, what's the probability of cancer?"—a conditional probability flowing from cause to effect.
Scenario 2 — Correlation: "Neighboring pixels in an image tend to have similar colors." Here, there's no direction. Pixels don't cause each other—they simply co-occur with similar values. The relationship is symmetric: knowing pixel A tells you about pixel B, and knowing pixel B tells you about pixel A.
These two scenarios motivate the two major families of probabilistic graphical models: directed models (Bayesian networks), which capture ordered, cause-to-effect relationships, and undirected models (Markov random fields), which capture symmetric relationships.
Understanding when and how to use each is essential for modeling real-world phenomena correctly.
By the end of this page, you will deeply understand the mathematical foundations of directed and undirected graphical models. You'll master their factorization properties, learn what types of dependencies each can and cannot represent, and develop intuition for choosing the right model family for your problem.
A Bayesian Network (also called a Belief Network, Bayes Net, or Directed Graphical Model) represents a joint probability distribution using a directed acyclic graph (DAG).
Formal Definition:
A Bayesian Network consists of:
- A directed acyclic graph $G = (V, E)$, where the nodes $V$ are random variables and the directed edges $E$ encode direct probabilistic dependencies
- A set of conditional probability distributions (CPDs) $\{P(X_i \mid \text{Pa}(X_i))\}$, where $\text{Pa}(X_i)$ denotes the set of parents of $X_i$ in $G$
The Chain Rule Factorization:
The joint distribution factorizes as:
$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))$$
Each variable depends directly only on its parents: given its parents, it is conditionally independent of its non-descendants. This is the local Markov property of Bayesian networks.
```python
# Example: Bayesian Network for a Simple Diagnostic System
#
# Graph Structure:
#   Smoking → Cancer
#   Cancer → Dyspnea (shortness of breath)
#   Cancer → PositiveXRay
#
# Pa(Smoking) = {} (no parents)
# Pa(Cancer) = {Smoking}
# Pa(Dyspnea) = {Cancer}
# Pa(PositiveXRay) = {Cancer}

# Define Conditional Probability Tables (CPTs)

# P(Smoking) - prior probability
P_Smoking = {
    True: 0.3,   # 30% of population smokes
    False: 0.7,
}

# P(Cancer | Smoking)
P_Cancer_given_Smoking = {
    (True, True): 0.10,    # P(Cancer=T | Smoking=T) = 10%
    (False, True): 0.90,   # P(Cancer=F | Smoking=T) = 90%
    (True, False): 0.01,   # P(Cancer=T | Smoking=F) = 1%
    (False, False): 0.99,  # P(Cancer=F | Smoking=F) = 99%
}

# P(Dyspnea | Cancer)
P_Dyspnea_given_Cancer = {
    (True, True): 0.65,    # P(Dyspnea=T | Cancer=T) = 65%
    (False, True): 0.35,   # P(Dyspnea=F | Cancer=T) = 35%
    (True, False): 0.05,   # P(Dyspnea=T | Cancer=F) = 5%
    (False, False): 0.95,  # P(Dyspnea=F | Cancer=F) = 95%
}

# P(PositiveXRay | Cancer)
P_XRay_given_Cancer = {
    (True, True): 0.90,    # P(XRay=T | Cancer=T) = 90%
    (False, True): 0.10,   # P(XRay=F | Cancer=T) = 10%
    (True, False): 0.05,   # P(XRay=T | Cancer=F) = 5% (false positive)
    (False, False): 0.95,  # P(XRay=F | Cancer=F) = 95%
}

def joint_probability(smoking, cancer, dyspnea, xray):
    """
    Compute P(Smoking, Cancer, Dyspnea, PositiveXRay) using the factorization:
    P(S, C, D, X) = P(S) * P(C|S) * P(D|C) * P(X|C)
    """
    p_s = P_Smoking[smoking]
    p_c_given_s = P_Cancer_given_Smoking[(cancer, smoking)]
    p_d_given_c = P_Dyspnea_given_Cancer[(dyspnea, cancer)]
    p_x_given_c = P_XRay_given_Cancer[(xray, cancer)]
    return p_s * p_c_given_s * p_d_given_c * p_x_given_c

# Verify normalization: sum over all configurations should equal 1
total = 0.0
for s in [True, False]:
    for c in [True, False]:
        for d in [True, False]:
            for x in [True, False]:
                p = joint_probability(s, c, d, x)
                total += p
                if p > 0.01:
                    print(f"P(S={s}, C={c}, D={d}, X={x}) = {p:.6f}")

print(f"Total (should be 1.0): {total:.10f}")

# Parameter count:
# - P(Smoking): 1 parameter (binary, so 2-1=1)
# - P(Cancer|Smoking): 2 parameters (2 parent configs, 2-1 each)
# - P(Dyspnea|Cancer): 2 parameters
# - P(XRay|Cancer): 2 parameters
# Total: 7 parameters
# Full joint would need: 2^4 - 1 = 15 parameters
print("Parameter savings: 7 vs 15 (53% reduction)")
```

Key properties of Bayesian Networks:
Acyclicity requirement: The graph must be a DAG. Cycles would create circular dependencies where a variable depends on itself—mathematically undefined.
Automatic normalization: Since each factor is a conditional probability distribution, the product is automatically a valid probability distribution (sums to 1). No separate normalization constant needed.
Causal interpretation: Arrows often (but not always) represent causal influence. If we draw an arrow from A to B, we're asserting that A directly influences B's probability.
Generative semantics: We can sample from the joint by following topological order—sample parents before children.
Compact CPDs: Each conditional probability table has at most $O(k^{|\text{Pa}(X_i)|})$ parameters where $k$ is the number of values per variable. Keeping parent sets small yields compact models.
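The generative semantics above can be sketched as ancestral sampling on the diagnostic network. This is a minimal sketch: the CPT values mirror the earlier example, stored here as the probability of the True outcome, keyed by the parent's value.

```python
import random

# CPTs from the diagnostic example: Smoking -> Cancer -> {Dyspnea, XRay}
P_SMOKING_TRUE = 0.3
P_CANCER_TRUE = {True: 0.10, False: 0.01}    # keyed by Smoking
P_DYSPNEA_TRUE = {True: 0.65, False: 0.05}   # keyed by Cancer
P_XRAY_TRUE = {True: 0.90, False: 0.05}      # keyed by Cancer

def sample_bernoulli(p):
    return random.random() < p

def ancestral_sample():
    # Topological order: every parent is sampled before its children.
    smoking = sample_bernoulli(P_SMOKING_TRUE)
    cancer = sample_bernoulli(P_CANCER_TRUE[smoking])
    dyspnea = sample_bernoulli(P_DYSPNEA_TRUE[cancer])
    xray = sample_bernoulli(P_XRAY_TRUE[cancer])
    return {"smoking": smoking, "cancer": cancer, "dyspnea": dyspnea, "xray": xray}

random.seed(0)
samples = [ancestral_sample() for _ in range(10_000)]
frac_smoking = sum(s["smoking"] for s in samples) / len(samples)
print(f"Empirical P(Smoking=T) ~ {frac_smoking:.3f}")  # close to the prior 0.30
```

Because each draw uses only already-sampled parent values, no inference machinery is needed to generate exact samples from the joint.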
The acyclicity requirement is not a limitation—it reflects a deep intuition about causation. In a causal system, effects cannot cause their own causes (ignoring time loops). If you find yourself wanting cycles, you may need to unroll time (create separate nodes for $X_t$ and $X_{t+1}$) or use undirected models.
A Markov Random Field (MRF), also called a Markov Network or Undirected Graphical Model, represents a joint distribution using an undirected graph.
Formal Definition:
A Markov Random Field consists of:
- An undirected graph $G = (V, E)$, where each node is a random variable and each undirected edge represents a symmetric direct interaction
- A set of non-negative potential functions $\{\phi_c(X_c)\}$ defined over cliques of the graph, scoring how compatible each clique's joint assignment is
The Gibbs Distribution Factorization:
$$P(X_1, X_2, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c)$$
Where:
- $\mathcal{C}$ is the set of (maximal) cliques of $G$
- $\phi_c(X_c) \geq 0$ is the potential function for clique $c$
- $Z = \sum_{X} \prod_{c \in \mathcal{C}} \phi_c(X_c)$ is the partition function that normalizes the distribution
```python
# Example: Markov Random Field for Binary Image Segmentation
#
# Graph: 3x3 grid where each pixel is a node
# Edges connect neighboring pixels (4-connectivity)
#
# X1 — X2 — X3
# |    |    |
# X4 — X5 — X6
# |    |    |
# X7 — X8 — X9
#
# Cliques: all edges (pairwise cliques) + all nodes (singleton cliques)

import numpy as np
from itertools import product

# Grid dimensions
H, W = 3, 3
n_variables = H * W

# Node potentials: preference for foreground (1) vs background (0)
# Higher value = more likely
def node_potential(x_i, observed_intensity):
    """
    Singleton potential based on observed pixel intensity.
    High intensity suggests foreground, low suggests background.
    """
    if x_i == 1:  # Foreground
        return np.exp(observed_intensity)
    else:         # Background
        return np.exp(1.0 - observed_intensity)

# Edge potentials: preference for neighbors to have the same label
def edge_potential(x_i, x_j, lambda_smooth=2.0):
    """
    Pairwise potential encouraging neighboring pixels to take the same label.
    This is the Ising model / Potts model potential.
    """
    if x_i == x_j:
        return np.exp(lambda_smooth)  # High potential for same labels
    else:
        return np.exp(0)              # exp(0) = 1, neutral for different labels

# Simulated observed intensities (grayscale values 0-1)
observed = np.array([
    [0.9, 0.8, 0.2],  # Top-left seems foreground, top-right background
    [0.7, 0.8, 0.3],
    [0.2, 0.3, 0.1],  # Bottom seems background
])

def compute_energy(labels, observed):
    """
    Compute the negative log-probability (energy) for a label configuration:
    E(x) = -sum_i log(φ_i(x_i)) - sum_(i,j) log(φ_ij(x_i, x_j))
    """
    energy = 0.0
    # Singleton potentials (unary terms)
    for i in range(H):
        for j in range(W):
            energy -= np.log(node_potential(labels[i, j], observed[i, j]))
    # Pairwise potentials (horizontal edges)
    for i in range(H):
        for j in range(W - 1):
            energy -= np.log(edge_potential(labels[i, j], labels[i, j+1]))
    # Pairwise potentials (vertical edges)
    for i in range(H - 1):
        for j in range(W):
            energy -= np.log(edge_potential(labels[i, j], labels[i+1, j]))
    return energy

def compute_unnormalized_prob(labels, observed):
    """Compute the unnormalized probability (product of potentials)."""
    return np.exp(-compute_energy(labels, observed))

# Compute the partition function by summing over all configurations
Z = 0.0
all_configs = []
for bits in product([0, 1], repeat=n_variables):
    labels = np.array(bits).reshape(H, W)
    unnorm_p = compute_unnormalized_prob(labels, observed)
    all_configs.append((labels.copy(), unnorm_p))
    Z += unnorm_p

print("Partition function Z:", Z)
print("Top 5 most likely configurations:")
all_configs.sort(key=lambda x: -x[1])
for labels, unnorm_p in all_configs[:5]:
    prob = unnorm_p / Z
    print(f"  P={prob:.4f}:")
    print(f"  {labels.tolist()}")

# Note: Computing Z required summing over 2^9 = 512 configurations.
# For a 100x100 image, that's 2^10000 - intractable!
# This is why approximate inference (MCMC, variational) is essential.
```

Key properties of Markov Random Fields:
Symmetric relationships: Edges have no direction, representing mutual influence. If A and B are neighbors, A influences B and B influences A equally.
Potential functions, not probabilities: The factors $\phi_c$ are not conditional probabilities—they're just non-negative compatibility scores. They don't need to sum to 1.
Partition function required: Because potentials aren't normalized, we need the partition function $Z$ to convert the product to a proper probability. Computing $Z$ is often the hardest part.
Energy-based interpretation: Defining $E(X) = -\sum_c \log \phi_c(X_c)$, we get the Boltzmann distribution: $P(X) = \frac{1}{Z} e^{-E(X)}$. Low energy = high probability.
Clique decomposition: The Hammersley-Clifford theorem (covered in the next module) proves that any positive distribution satisfying the Markov properties of a graph can be factorized this way.
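The energy-based interpretation above can be checked on a tiny example. This sketch uses a single made-up pairwise potential over two binary variables and verifies that normalizing the product of potentials gives exactly the Boltzmann distribution with $E(x) = -\sum_c \log \phi_c(x_c)$:

```python
import math
from itertools import product

# One pairwise potential over two binary variables (illustrative values)
phi = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

# Route 1: normalize the potential product directly
Z = sum(phi[x] for x in product([0, 1], repeat=2))
p_direct = {x: phi[x] / Z for x in phi}

# Route 2: Boltzmann form with E(x) = -log phi(x), P(x) = exp(-E(x)) / Z
energy = {x: -math.log(phi[x]) for x in phi}
Z_boltz = sum(math.exp(-energy[x]) for x in energy)
p_boltz = {x: math.exp(-energy[x]) / Z_boltz for x in energy}

# The two routes agree configuration by configuration
for x in phi:
    assert abs(p_direct[x] - p_boltz[x]) < 1e-12
print(p_direct[(0, 0)])  # 3 / 8 = 0.375
```

The identity is just algebra ($e^{-(-\log \phi)} = \phi$), but seeing both routes side by side makes the "low energy = high probability" reading concrete.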
Computing the partition function Z requires summing over all possible configurations—exponential in the number of variables. For n binary variables, that's 2^n terms. This makes working with MRFs computationally challenging. Exact computation is only feasible for small graphs or special structures (trees). For general graphs, we must use approximate inference methods.
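To illustrate the special-structure case, here is a sketch (with an illustrative shared pairwise potential) that computes $Z$ for a chain of binary variables two ways: brute-force enumeration in $O(2^n)$, and a transfer-matrix (dynamic programming) recursion in $O(n)$:

```python
import numpy as np
from itertools import product

# Pairwise potential phi(x_i, x_{i+1}) shared along the chain (illustrative values)
phi = np.array([[2.0, 1.0],
                [1.0, 2.0]])

def partition_brute_force(n):
    """Sum the product of pairwise potentials over all 2^n configurations."""
    Z = 0.0
    for x in product([0, 1], repeat=n):
        p = 1.0
        for i in range(n - 1):
            p *= phi[x[i], x[i + 1]]
        Z += p
    return Z

def partition_chain(n):
    """Transfer-matrix recursion: absorb one pairwise factor per step, O(n)."""
    msg = np.ones(2)        # message from the start of the chain
    for _ in range(n - 1):
        msg = phi.T @ msg   # sum out the previous variable
    return msg.sum()

n = 10
print(partition_brute_force(n), partition_chain(n))  # identical values
```

The same reorganization of the sum underlies exact inference on trees; it fails on graphs with loops, which is why grids require approximate methods.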
Let's directly compare how the two model families factorize the joint distribution:
Bayesian Network (Directed):
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))$$
Markov Random Field (Undirected):
$$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c)$$
| Aspect | Bayesian Networks | Markov Random Fields |
|---|---|---|
| Graph type | Directed Acyclic Graph (DAG) | Undirected Graph |
| Factors | Conditional Probability Distributions (CPDs) | Potential Functions (unnormalized) |
| Factor scope | Node + its parents | Maximal cliques |
| Normalization | Automatic (factors are CPDs) | Requires partition function Z |
| Parameter constraints | CPDs must be valid distributions | Potentials just need to be non-negative |
| Generative sampling | Easy (ancestral sampling) | Hard (requires MCMC) |
| Causal interpretation | Natural (edges suggest causation) | Not applicable |
A concrete example:
Consider three variables: A, B, C with A connected to B and B connected to C.
As a Bayesian Network (A → B → C): $$P(A, B, C) = P(A) \cdot P(B|A) \cdot P(C|B)$$
As a Markov Random Field (A — B — C): $$P(A, B, C) = \frac{1}{Z} \phi_{AB}(A, B) \cdot \phi_{BC}(B, C)$$
Both encode the same conditional independence, but with different semantics. The BN has a causal/temporal ordering; the MRF is symmetric.
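That shared conditional independence can be checked numerically. A minimal sketch with made-up CPD values for the directed chain A → B → C, verifying that $P(C \mid A, B)$ does not depend on $A$, i.e., $A \perp C \mid B$:

```python
from itertools import product

# Made-up CPDs for the chain A -> B -> C (binary variables)
P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # keyed (b, a)
P_C_given_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # keyed (c, b)

def joint(a, b, c):
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

def cond_c_given_ab(c, a, b):
    num = joint(a, b, c)
    den = sum(joint(a, b, cc) for cc in (0, 1))
    return num / den

# For every b and c, P(C=c | A=a, B=b) is identical for a=0 and a=1
for b, c in product((0, 1), repeat=2):
    assert abs(cond_c_given_ab(c, 0, b) - cond_c_given_ab(c, 1, b)) < 1e-12
print("A is conditionally independent of C given B")
```

The analogous check works for the undirected version: with only $\phi_{AB}$ and $\phi_{BC}$ factors, conditioning on B factorizes the distribution over A and C.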
A crucial question: Can every distribution representable by one family be represented by the other? The answer is nuanced.
Definition: I-map (Independence Map)
A graph $G$ is an I-map for a distribution $P$ if the conditional independencies implied by $G$ are a subset of the conditional independencies that hold in $P$. In other words, $G$ doesn't assert any independence that $P$ violates.
Key results:
Some independencies can only be expressed in directed models (Example: "Explaining away" pattern)
Some independencies can only be expressed in undirected models (Example: Four-way symmetric constraint)
Neither family is strictly more powerful than the other
Let's examine specific cases where each family has an advantage:
Case 1: V-Structure (Explaining Away) — Only Directed Models
Consider two independent causes that share a common effect:
```
A     B
 \   /
  ↓ ↓
   C
```
In this structure:
- $A$ and $B$ are marginally independent: $P(A, B) = P(A)P(B)$
- $A$ and $B$ become dependent once $C$ is observed: $A \not\perp B \mid C$
This is the famous explaining away phenomenon. Observing the common effect makes the two causes dependent—if one cause is ruled out, the other becomes more likely.
Why undirected graphs can't capture this:
To express $A \perp\!\!\!\perp B$ (marginal independence), an undirected graph would have no edge between A and B. But without an edge, A and B remain independent even given C. Undirected graphs cannot express: "independent marginally, dependent given a common descendant."
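Explaining away can be verified numerically. A sketch with illustrative CPTs for a v-structure $A \to C \leftarrow B$, computing posteriors over $A$ by brute-force enumeration:

```python
from itertools import product

# Illustrative v-structure A -> C <- B: two independent causes, one common effect
P_A = 0.3   # P(A=1)
P_B = 0.2   # P(B=1)
P_C1 = {(0, 0): 0.05, (0, 1): 0.70, (1, 0): 0.80, (1, 1): 0.95}  # P(C=1 | a, b)

def joint(a, b, c):
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C1[(a, b)] if c else 1 - P_C1[(a, b)]
    return pa * pb * pc

def posterior_a(b_obs=None, c_obs=None):
    """P(A=1 | observed values of B and/or C), by enumeration."""
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        if b_obs is not None and b != b_obs:
            continue
        if c_obs is not None and c != c_obs:
            continue
        p = joint(a, b, c)
        den += p
        if a == 1:
            num += p
    return num / den

print(f"P(A=1)            = {posterior_a():.3f}")                 # the prior, 0.300
print(f"P(A=1 | B=1)      = {posterior_a(b_obs=1):.3f}")          # unchanged: A and B independent
print(f"P(A=1 | C=1)      = {posterior_a(c_obs=1):.3f}")          # rises: C is evidence for A
print(f"P(A=1 | C=1, B=1) = {posterior_a(b_obs=1, c_obs=1):.3f}") # drops again: B explains C away
```

Observing B alone leaves the posterior over A at its prior, but once C is also observed, B pulls the posterior back down: exactly the "independent marginally, dependent given the effect" pattern no undirected graph over A, B, C can encode.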
Case 2: Diamond Structure — Only Undirected Models
Consider four variables in a cycle:
```
A — B
|   |
C — D
```
Independencies we want:
- $A \perp D \mid \{B, C\}$ (each pair of opposite corners is separated by the other two)
- $B \perp C \mid \{A, D\}$
An undirected graph with exactly these edges captures these independencies via graph separation.
Why directed graphs struggle:
To make this a DAG, we must add directions. Any acyclic orientation forces additional conditional independencies (through d-separation rules we'll cover later) that may not hold in the true distribution. Alternatively, we'd need to add edges to be a valid I-map, losing the precise independence structure.
The bottom line:
Neither family dominates. The choice depends on:
- Whether the domain's relationships are causal/ordered or symmetric
- Which conditional independencies you need the graph to express
- Which computational operations (sampling, learning, inference) you need to be cheap
Every Bayesian network can be converted to an equivalent Markov network through 'moralization'—connecting all parents of each node and dropping edge directions. However, this process may add edges, losing some independence information. The moralized graph is an I-map but may not be a perfect map.
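A minimal sketch of the moralization procedure (the `moralize` helper and node names here are illustrative): connect every pair of parents that share a child, then drop all edge directions.

```python
from itertools import combinations

def moralize(parents):
    """Convert a DAG, given as {child: [parents]}, to an undirected edge set."""
    edges = set()
    for child, pa in parents.items():
        # Drop directions on the existing parent -> child edges
        for p in pa:
            edges.add(frozenset((p, child)))
        # "Marry" every pair of parents of the same child
        for p, q in combinations(pa, 2):
            edges.add(frozenset((p, q)))
    return edges

# Illustrative v-structure: A -> C <- B
dag = {"A": [], "B": [], "C": ["A", "B"]}
moral = moralize(dag)
print(sorted(tuple(sorted(e)) for e in moral))
# The marrying edge A - B appears, so the marginal independence of
# A and B is no longer readable from the moralized graph.
```

Applied to the v-structure, moralization adds the A–B edge, which is exactly the independence information the conversion loses.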
Choosing between directed and undirected models is one of the most important modeling decisions. Here's a principled framework for making this choice:
Use Bayesian Networks when relationships have a natural direction (causal or temporal), when you need cheap generative sampling, or when conditional probabilities are easy to elicit from domain knowledge. Use Markov Random Fields when relationships are inherently symmetric, as in spatial or relational structure.

Classic Bayesian Network applications: medical diagnosis (as in the smoking/cancer example above), temporal models such as HMMs, and other systems with clear cause-effect structure.

Classic MRF applications: image segmentation and denoising on pixel grids, spin systems such as the Ising model, and sequence labeling with CRFs.
| Domain Characteristic | Recommended Model | Reason |
|---|---|---|
| Clear cause-effect relationships | Bayesian Network | Causal semantics align with domain |
| Temporal/sequential data | Bayesian Network (HMM) | Natural ordering from time |
| Grid/spatial structure | Markov Random Field | Symmetric neighbor relationships |
| Need to generate samples | Bayesian Network | Ancestral sampling is easy |
| Discriminative classification | CRF (undirected, conditional) | Models P(labels|features) directly |
| Feature dependencies + label correlations | CRF | Avoids feature independence assumption |
The choice between directed and undirected models has significant computational implications:
Bayesian Networks — Computational Advantages:
No partition function: Factors are CPDs that automatically normalize. No need to compute the intractable sum $Z$.
Easy sampling: To generate a sample, visit the nodes in topological order and draw each $X_i$ from $P(X_i \mid \text{Pa}(X_i))$ using the already-sampled parent values (ancestral sampling).
Local likelihood: The likelihood of data decomposes: $\log P(D) = \sum_i \log P(X_i | \text{Pa}(X_i))$. Each term can be optimized separately.
Parameter estimation: Maximum likelihood estimates for CPDs are just empirical conditional frequencies—easy to compute.
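The counting estimator can be sketched directly on some made-up binary (smoking, cancer) observations: the MLE for each CPT entry is just a conditional frequency.

```python
from collections import Counter

# Made-up binary observations of (smoking, cancer) pairs
data = [(1, 1), (1, 0), (1, 0), (0, 0), (0, 0), (0, 0), (0, 1), (1, 1)]

pair_counts = Counter(data)                    # count(s, c)
parent_counts = Counter(s for s, _ in data)    # count(s)

# MLE: P(Cancer=c | Smoking=s) = count(s, c) / count(s)
cpt = {
    (c, s): pair_counts[(s, c)] / parent_counts[s]
    for s in (0, 1) for c in (0, 1)
}
print(cpt)  # each conditional distribution sums to 1 by construction
```

Each row of the estimated table is a valid distribution automatically, and no term involving other variables' parameters ever appears: the decomposed likelihood is what makes this purely local counting correct.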
Markov Random Fields — Computational Challenges:
Partition function: Computing $Z = \sum_X \prod_c \phi_c(X_c)$ is #P-complete in general. For $n$ binary variables, it's $O(2^n)$.
Difficult sampling: No natural ordering for ancestral sampling. Must use MCMC (Gibbs sampling, Metropolis-Hastings).
Coupled likelihood: The log-likelihood $\log P(X) = \sum_c \log \phi_c(X_c) - \log Z$. The $\log Z$ term couples all parameters—can't optimize locally.
Gradient estimation: The gradient of the log-likelihood involves expectations under the model: $\nabla \log P(D) = \mathbb{E}_{\text{data}}[\nabla \log \phi] - \mathbb{E}_{\text{model}}[\nabla \log \phi]$. The second term requires sampling from the model.
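Both the sampling and the gradient difficulties above lead to MCMC in practice. A Gibbs-sampling sketch for a small Ising-style ring of binary variables (the coupling value is illustrative): each update resamples one variable from its conditional given its two neighbors, which needs only local potentials and never touches $Z$.

```python
import math
import random

# 1D ring of binary spins with an illustrative coupling strength
N, COUPLING = 8, 0.8

def conditional_p1(state, i):
    """P(x_i = 1 | neighbors): depends only on the two adjacent spins."""
    left, right = state[(i - 1) % N], state[(i + 1) % N]
    def local(v):
        # exp(COUPLING) per agreeing neighbor, exp(0) per disagreeing one
        return math.exp(COUPLING * ((v == left) + (v == right)))
    return local(1) / (local(0) + local(1))

random.seed(0)
state = [random.randint(0, 1) for _ in range(N)]
agree = 0
n_sweeps = 2000
for sweep in range(n_sweeps):
    for i in range(N):  # one sweep: resample every variable once
        state[i] = 1 if random.random() < conditional_p1(state, i) else 0
    agree += sum(state[i] == state[(i + 1) % N] for i in range(N))

frac_agree = agree / (n_sweeps * N)
print(f"Fraction of agreeing neighbor pairs: {frac_agree:.3f}")  # above 0.5 due to coupling
```

The sampled states can then be used to estimate the $\mathbb{E}_{\text{model}}[\cdot]$ term in the gradient; the cost is that the estimates are noisy and the chain must mix.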
When MRFs are still worth it:
Conditional Random Fields (CRFs) are undirected models that sidestep much of the computational difficulty. By modeling P(Y|X) instead of P(X,Y), the partition function Z(X) only sums over labels Y, not the potentially high-dimensional features X. This makes CRFs practical for sequence labeling tasks like POS tagging and named entity recognition.
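To see concretely why conditioning helps, here is a small sketch for a linear-chain model with $K$ labels over $T$ positions: a forward recursion computes $\log Z(X)$ in $O(TK^2)$, checked against brute-force enumeration. The random score arrays stand in for learned, feature-dependent potentials.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
T, K = 5, 3                                  # sequence length, number of labels
unary = rng.uniform(0.1, 1.0, size=(T, K))   # stand-ins for feature-dependent scores
pairwise = rng.uniform(0.1, 1.0, size=(K, K))  # label-transition potentials

def log_Z_forward():
    """Forward recursion over label positions: O(T * K^2)."""
    alpha = np.log(unary[0])
    for t in range(1, T):
        # log-sum-exp over the previous label, for each current label
        scores = alpha[:, None] + np.log(pairwise) + np.log(unary[t])[None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def log_Z_brute():
    """Enumerate all K^T label sequences: exponential in T."""
    Z = 0.0
    for ys in product(range(K), repeat=T):
        p = np.prod([unary[t, y] for t, y in enumerate(ys)])
        p *= np.prod([pairwise[ys[t], ys[t + 1]] for t in range(T - 1)])
        Z += p
    return np.log(Z)

print("forward log Z:", log_Z_forward(), " brute force:", log_Z_brute())
```

The recursion sums only over label configurations; the observed input enters solely through the potential values, which is what makes $Z(X)$ tractable no matter how rich the features are.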
In practice, the directed/undirected dichotomy is often too restrictive. Several representations extend beyond this basic division:
Factor Graphs:
A factor graph is a bipartite graph with two types of nodes:
- Variable nodes, one per random variable
- Factor nodes, one per factor in the factorization
Edges connect factor nodes to the variables they depend on.
$$P(X) = \frac{1}{Z} \prod_{a} f_a(X_{\mathcal{N}(a)})$$
Where $\mathcal{N}(a)$ denotes the neighbors of factor $a$.
Advantages of factor graphs:
- They make the factorization explicit: distributions with the same graph but different clique structure are distinguished at the factor level
- They subsume both directed and undirected models, since CPDs and potentials are both just factors
- They are the natural data structure for message-passing algorithms such as sum-product (belief propagation)
Chain Graphs:
A chain graph allows both directed and undirected edges, with the constraint that nodes can be partitioned into ordered 'chain components': edges within a component are undirected, while edges between components are directed and always point from an earlier component to a later one.
This models situations like a directed pipeline whose stages each contain symmetrically interacting variables, for instance a cause feeding into a block of spatially correlated measurements.
Conditional Random Fields (CRFs):
CRFs are undirected models conditioned on observations:
$$P(Y | X) = \frac{1}{Z(X)} \prod_{c} \phi_c(Y_c, X)$$
Note that:
- The partition function $Z(X)$ depends on the observed input $X$, and the sum defining it runs only over label configurations $Y$
- The model never represents $P(X)$, so arbitrary, highly dependent features of $X$ can be used freely
CRFs combine the flexibility of undirected models with the computational benefits of conditioning.
Hybrid Approaches:
Modern systems often combine directed components where causal or generative structure is natural, undirected components where constraints are symmetric, and conditioning (as in CRFs) to avoid modeling complex inputs.
All these representations—Bayesian networks, MRFs, factor graphs, CRFs—are different ways of specifying factorizations. The choice depends on what's most natural for your domain and what computational properties you need. Factor graphs provide the most general representation, while the others offer specific conveniences for common cases.
Let's consolidate the essential insights from this deep dive into directed and undirected graphical models:
- Bayesian networks factorize the joint as a product of CPDs over a DAG; normalization is automatic and ancestral sampling is easy
- Markov random fields factorize as a product of clique potentials divided by a partition function $Z$, which is generally intractable to compute exactly
- Neither family is strictly more expressive: v-structures (explaining away) require directed edges, while symmetric cyclic structures like the diamond require undirected ones
- Factor graphs, chain graphs, and CRFs extend the basic dichotomy when neither pure family fits
What's next:
Now that we understand the two major families of graphical models, we need to formalize the mathematical foundation that makes them work: conditional independence. The next page will give you a rigorous understanding of what conditional independence means, how it enables factorization, and how to read independence properties directly from graph structure.
You now have a deep understanding of directed and undirected graphical models. You can distinguish their factorization properties, explain what types of dependencies each can represent, and make informed choices about which family to use for a given problem. Next, we'll formalize conditional independence—the mathematical foundation that enables everything we've discussed.