Consider two fundamentally different ways to describe a relationship between variables:
Scenario 1 — Causation: "Smoking causes lung cancer." There's a clear direction here: smoking comes first (causally), and lung cancer may follow as an effect. If we wanted to reason about this relationship, we'd naturally think in terms of "if you smoke, what's the probability of cancer?"—a conditional probability flowing from cause to effect.
Scenario 2 — Correlation: "Neighboring pixels in an image tend to have similar colors." Here, there's no direction. Pixels don't cause each other—they simply co-occur with similar values. The relationship is symmetric: knowing pixel A tells you about pixel B, and knowing pixel B tells you about pixel A.
These two scenarios motivate the two major families of probabilistic graphical models: directed models (Bayesian networks), which capture ordered, cause-to-effect relationships, and undirected models (Markov random fields), which capture symmetric relationships.
Understanding when and how to use each is essential for modeling real-world phenomena correctly.
By the end of this page, you will deeply understand the mathematical foundations of directed and undirected graphical models. You'll master their factorization properties, learn what types of dependencies each can and cannot represent, and develop intuition for choosing the right model family for your problem.
A Bayesian Network (also called a Belief Network, Bayes Net, or Directed Graphical Model) represents a joint probability distribution using a directed acyclic graph (DAG).
Formal Definition:
A Bayesian Network consists of:
- A directed acyclic graph $G = (V, E)$, where the nodes $V$ are random variables and the directed edges $E$ encode direct probabilistic dependencies
- A set of conditional probability distributions (CPDs) $\{P(X_i \mid \text{Pa}(X_i))\}$, where $\text{Pa}(X_i)$ denotes the set of parents of $X_i$ in $G$
The Chain Rule Factorization:
The joint distribution factorizes as:
$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))$$
Each variable depends directly only on its parents: given its parents, it is conditionally independent of its non-descendants. This is the local Markov property of Bayesian networks.
```python
# Example: Bayesian Network for a Simple Diagnostic System
#
# Graph Structure:
#   Smoking → Cancer
#   Cancer → Dyspnea (shortness of breath)
#   Cancer → PositiveXRay
#
# Pa(Smoking) = {} (no parents)
# Pa(Cancer) = {Smoking}
# Pa(Dyspnea) = {Cancer}
# Pa(PositiveXRay) = {Cancer}

# Define Conditional Probability Tables (CPTs)

# P(Smoking) - prior probability
P_Smoking = {
    True: 0.3,   # 30% of population smokes
    False: 0.7,
}

# P(Cancer | Smoking)
P_Cancer_given_Smoking = {
    (True, True): 0.10,    # P(Cancer=T | Smoking=T) = 10%
    (False, True): 0.90,   # P(Cancer=F | Smoking=T) = 90%
    (True, False): 0.01,   # P(Cancer=T | Smoking=F) = 1%
    (False, False): 0.99,  # P(Cancer=F | Smoking=F) = 99%
}

# P(Dyspnea | Cancer)
P_Dyspnea_given_Cancer = {
    (True, True): 0.65,    # P(Dyspnea=T | Cancer=T) = 65%
    (False, True): 0.35,   # P(Dyspnea=F | Cancer=T) = 35%
    (True, False): 0.05,   # P(Dyspnea=T | Cancer=F) = 5%
    (False, False): 0.95,  # P(Dyspnea=F | Cancer=F) = 95%
}

# P(PositiveXRay | Cancer)
P_XRay_given_Cancer = {
    (True, True): 0.90,    # P(XRay=T | Cancer=T) = 90%
    (False, True): 0.10,   # P(XRay=F | Cancer=T) = 10%
    (True, False): 0.05,   # P(XRay=T | Cancer=F) = 5% (false positive)
    (False, False): 0.95,  # P(XRay=F | Cancer=F) = 95%
}

def joint_probability(smoking, cancer, dyspnea, xray):
    """
    Compute P(Smoking, Cancer, Dyspnea, PositiveXRay) using the factorization:
    P(S, C, D, X) = P(S) * P(C|S) * P(D|C) * P(X|C)
    """
    p_s = P_Smoking[smoking]
    p_c_given_s = P_Cancer_given_Smoking[(cancer, smoking)]
    p_d_given_c = P_Dyspnea_given_Cancer[(dyspnea, cancer)]
    p_x_given_c = P_XRay_given_Cancer[(xray, cancer)]
    return p_s * p_c_given_s * p_d_given_c * p_x_given_c

# Verify normalization: sum over all configurations should equal 1
total = 0.0
for s in [True, False]:
    for c in [True, False]:
        for d in [True, False]:
            for x in [True, False]:
                p = joint_probability(s, c, d, x)
                total += p
                if p > 0.01:
                    print(f"P(S={s}, C={c}, D={d}, X={x}) = {p:.6f}")

print(f"Total (should be 1.0): {total:.10f}")

# Parameter count:
# - P(Smoking): 1 parameter (binary, so 2-1=1)
# - P(Cancer|Smoking): 2 parameters (2 parent configs, 2-1 each)
# - P(Dyspnea|Cancer): 2 parameters
# - P(XRay|Cancer): 2 parameters
# Total: 7 parameters
# Full joint would need: 2^4 - 1 = 15 parameters
print("Parameter savings: 7 vs 15 (53% reduction)")
```

Key properties of Bayesian Networks:
Acyclicity requirement: The graph must be a DAG. Cycles would create circular dependencies where a variable depends on itself—mathematically undefined.
Automatic normalization: Since each factor is a conditional probability distribution, the product is automatically a valid probability distribution (sums to 1). No separate normalization constant needed.
Causal interpretation: Arrows often (but not always) represent causal influence. If we draw an arrow from A to B, we're asserting that A directly influences B's probability.
Generative semantics: We can sample from the joint by following topological order—sample parents before children.
Compact CPDs: Each conditional probability table has at most $O(k^{|\text{Pa}(X_i)|})$ parameters where $k$ is the number of values per variable. Keeping parent sets small yields compact models.
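The generative semantics above can be sketched as ancestral sampling on the diagnostic network. This is a minimal sketch: the CPT values mirror the earlier example, stored here as the probability of the True outcome, keyed by the parent's value.

```python
import random

# CPTs from the diagnostic example: Smoking -> Cancer -> {Dyspnea, XRay}
P_SMOKING_TRUE = 0.3
P_CANCER_TRUE = {True: 0.10, False: 0.01}    # keyed by Smoking
P_DYSPNEA_TRUE = {True: 0.65, False: 0.05}   # keyed by Cancer
P_XRAY_TRUE = {True: 0.90, False: 0.05}      # keyed by Cancer

def sample_bernoulli(p):
    return random.random() < p

def ancestral_sample():
    # Topological order: every parent is sampled before its children.
    smoking = sample_bernoulli(P_SMOKING_TRUE)
    cancer = sample_bernoulli(P_CANCER_TRUE[smoking])
    dyspnea = sample_bernoulli(P_DYSPNEA_TRUE[cancer])
    xray = sample_bernoulli(P_XRAY_TRUE[cancer])
    return {"smoking": smoking, "cancer": cancer, "dyspnea": dyspnea, "xray": xray}

random.seed(0)
samples = [ancestral_sample() for _ in range(10_000)]
frac_smoking = sum(s["smoking"] for s in samples) / len(samples)
print(f"Empirical P(Smoking=T) ~ {frac_smoking:.3f}")  # close to the prior 0.30
```

Because each draw uses only already-sampled parent values, no inference machinery is needed to generate exact samples from the joint.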
The acyclicity requirement is not a limitation—it reflects a deep intuition about causation. In a causal system, effects cannot cause their own causes (ignoring time loops). If you find yourself wanting cycles, you may need to unroll time (create separate nodes for $X_t$ and $X_{t+1}$) or use undirected models.
A Markov Random Field (MRF), also called a Markov Network or Undirected Graphical Model, represents a joint distribution using an undirected graph.
Formal Definition:
A Markov Random Field consists of:
- An undirected graph $G = (V, E)$, where each node is a random variable and each undirected edge represents a symmetric direct interaction
- A set of non-negative potential functions $\{\phi_c(X_c)\}$ defined over cliques of the graph, scoring how compatible each clique's joint assignment is
The Gibbs Distribution Factorization:
$$P(X_1, X_2, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c)$$
Where:
- $\mathcal{C}$ is the set of (maximal) cliques of $G$
- $\phi_c(X_c) \geq 0$ is the potential function for clique $c$
- $Z = \sum_{X} \prod_{c \in \mathcal{C}} \phi_c(X_c)$ is the partition function that normalizes the distribution
```python
# Example: Markov Random Field for Binary Image Segmentation
#
# Graph: 3x3 grid where each pixel is a node
# Edges connect neighboring pixels (4-connectivity)
#
# X1 — X2 — X3
# |    |    |
# X4 — X5 — X6
# |    |    |
# X7 — X8 — X9
#
# Cliques: all edges (pairwise cliques) + all nodes (singleton cliques)

import numpy as np
from itertools import product

# Grid dimensions
H, W = 3, 3
n_variables = H * W

# Node potentials: preference for foreground (1) vs background (0)
# Higher value = more likely
def node_potential(x_i, observed_intensity):
    """
    Singleton potential based on observed pixel intensity.
    High intensity suggests foreground, low suggests background.
    """
    if x_i == 1:  # Foreground
        return np.exp(observed_intensity)
    else:         # Background
        return np.exp(1.0 - observed_intensity)

# Edge potentials: preference for neighbors to have the same label
def edge_potential(x_i, x_j, lambda_smooth=2.0):
    """
    Pairwise potential encouraging neighboring pixels to take the same label.
    This is the Ising model / Potts model potential.
    """
    if x_i == x_j:
        return np.exp(lambda_smooth)  # High potential for same labels
    else:
        return np.exp(0)              # exp(0) = 1, neutral for different labels

# Simulated observed intensities (grayscale values 0-1)
observed = np.array([
    [0.9, 0.8, 0.2],  # Top-left seems foreground, top-right background
    [0.7, 0.8, 0.3],
    [0.2, 0.3, 0.1],  # Bottom seems background
])

def compute_energy(labels, observed):
    """
    Compute the negative log-probability (energy) for a label configuration:
    E(x) = -sum_i log(φ_i(x_i)) - sum_(i,j) log(φ_ij(x_i, x_j))
    """
    energy = 0.0
    # Singleton potentials (unary terms)
    for i in range(H):
        for j in range(W):
            energy -= np.log(node_potential(labels[i, j], observed[i, j]))
    # Pairwise potentials (horizontal edges)
    for i in range(H):
        for j in range(W - 1):
            energy -= np.log(edge_potential(labels[i, j], labels[i, j+1]))
    # Pairwise potentials (vertical edges)
    for i in range(H - 1):
        for j in range(W):
            energy -= np.log(edge_potential(labels[i, j], labels[i+1, j]))
    return energy

def compute_unnormalized_prob(labels, observed):
    """Compute the unnormalized probability (product of potentials)."""
    return np.exp(-compute_energy(labels, observed))

# Compute the partition function by summing over all configurations
Z = 0.0
all_configs = []
for bits in product([0, 1], repeat=n_variables):
    labels = np.array(bits).reshape(H, W)
    unnorm_p = compute_unnormalized_prob(labels, observed)
    all_configs.append((labels.copy(), unnorm_p))
    Z += unnorm_p

print("Partition function Z:", Z)
print("Top 5 most likely configurations:")
all_configs.sort(key=lambda x: -x[1])
for labels, unnorm_p in all_configs[:5]:
    prob = unnorm_p / Z
    print(f"  P={prob:.4f}:")
    print(f"  {labels.tolist()}")

# Note: Computing Z required summing over 2^9 = 512 configurations.
# For a 100x100 image, that's 2^10000 - intractable!
# This is why approximate inference (MCMC, variational) is essential.
```

Key properties of Markov Random Fields:
Symmetric relationships: Edges have no direction, representing mutual influence. If A and B are neighbors, A influences B and B influences A equally.
Potential functions, not probabilities: The factors $\phi_c$ are not conditional probabilities—they're just non-negative compatibility scores. They don't need to sum to 1.
Partition function required: Because potentials aren't normalized, we need the partition function $Z$ to convert the product to a proper probability. Computing $Z$ is often the hardest part.
Energy-based interpretation: Defining $E(X) = -\sum_c \log \phi_c(X_c)$, we get the Boltzmann distribution: $P(X) = \frac{1}{Z} e^{-E(X)}$. Low energy = high probability.
Clique decomposition: The Hammersley-Clifford theorem (covered in the next module) proves that any positive distribution satisfying the Markov properties of a graph can be factorized this way.
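The energy-based interpretation above can be checked on a tiny example. This sketch uses a single made-up pairwise potential over two binary variables and verifies that normalizing the product of potentials gives exactly the Boltzmann distribution with $E(x) = -\sum_c \log \phi_c(x_c)$:

```python
import math
from itertools import product

# One pairwise potential over two binary variables (illustrative values)
phi = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

# Route 1: normalize the potential product directly
Z = sum(phi[x] for x in product([0, 1], repeat=2))
p_direct = {x: phi[x] / Z for x in phi}

# Route 2: Boltzmann form with E(x) = -log phi(x), P(x) = exp(-E(x)) / Z
energy = {x: -math.log(phi[x]) for x in phi}
Z_boltz = sum(math.exp(-energy[x]) for x in energy)
p_boltz = {x: math.exp(-energy[x]) / Z_boltz for x in energy}

# The two routes agree configuration by configuration
for x in phi:
    assert abs(p_direct[x] - p_boltz[x]) < 1e-12
print(p_direct[(0, 0)])  # 3 / 8 = 0.375
```

The identity is just algebra ($e^{-(-\log \phi)} = \phi$), but seeing both routes side by side makes the "low energy = high probability" reading concrete.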
Computing the partition function Z requires summing over all possible configurations—exponential in the number of variables. For n binary variables, that's 2^n terms. This makes working with MRFs computationally challenging. Exact computation is only feasible for small graphs or special structures (trees). For general graphs, we must use approximate inference methods.
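To illustrate the special-structure case, here is a sketch (with an illustrative shared pairwise potential) that computes $Z$ for a chain of binary variables two ways: brute-force enumeration in $O(2^n)$, and a transfer-matrix (dynamic programming) recursion in $O(n)$:

```python
import numpy as np
from itertools import product

# Pairwise potential phi(x_i, x_{i+1}) shared along the chain (illustrative values)
phi = np.array([[2.0, 1.0],
                [1.0, 2.0]])

def partition_brute_force(n):
    """Sum the product of pairwise potentials over all 2^n configurations."""
    Z = 0.0
    for x in product([0, 1], repeat=n):
        p = 1.0
        for i in range(n - 1):
            p *= phi[x[i], x[i + 1]]
        Z += p
    return Z

def partition_chain(n):
    """Transfer-matrix recursion: absorb one pairwise factor per step, O(n)."""
    msg = np.ones(2)        # message from the start of the chain
    for _ in range(n - 1):
        msg = phi.T @ msg   # sum out the previous variable
    return msg.sum()

n = 10
print(partition_brute_force(n), partition_chain(n))  # identical values
```

The same reorganization of the sum underlies exact inference on trees; it fails on graphs with loops, which is why grids require approximate methods.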
Let's directly compare how the two model families factorize the joint distribution:
Bayesian Network (Directed):
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))$$
Markov Random Field (Undirected):
$$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c)$$
| Aspect | Bayesian Networks | Markov Random Fields |
|---|---|---|
| Graph type | Directed Acyclic Graph (DAG) | Undirected Graph |
| Factors | Conditional Probability Distributions (CPDs) | Potential Functions (unnormalized) |
| Factor scope | Node + its parents | Maximal cliques |
| Normalization | Automatic (factors are CPDs) | Requires partition function Z |
| Parameter constraints | CPDs must be valid distributions | Potentials just need to be non-negative |
| Generative sampling | Easy (ancestral sampling) | Hard (requires MCMC) |
| Causal interpretation | Natural (edges suggest causation) | Not applicable |
A concrete example:
Consider three variables: A, B, C with A connected to B and B connected to C.
As a Bayesian Network (A → B → C): $$P(A, B, C) = P(A) \cdot P(B|A) \cdot P(C|B)$$
As a Markov Random Field (A — B — C): $$P(A, B, C) = \frac{1}{Z} \phi_{AB}(A, B) \cdot \phi_{BC}(B, C)$$
Both encode the same conditional independence, but with different semantics. The BN has a causal/temporal ordering; the MRF is symmetric.
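That shared conditional independence can be checked numerically. A minimal sketch with made-up CPD values for the directed chain A → B → C, verifying that $P(C \mid A, B)$ does not depend on $A$, i.e., $A \perp C \mid B$:

```python
from itertools import product

# Made-up CPDs for the chain A -> B -> C (binary variables)
P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}  # keyed (b, a)
P_C_given_B = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # keyed (c, b)

def joint(a, b, c):
    return P_A[a] * P_B_given_A[(b, a)] * P_C_given_B[(c, b)]

def cond_c_given_ab(c, a, b):
    num = joint(a, b, c)
    den = sum(joint(a, b, cc) for cc in (0, 1))
    return num / den

# For every b and c, P(C=c | A=a, B=b) is identical for a=0 and a=1
for b, c in product((0, 1), repeat=2):
    assert abs(cond_c_given_ab(c, 0, b) - cond_c_given_ab(c, 1, b)) < 1e-12
print("A is conditionally independent of C given B")
```

The analogous check works for the undirected version: with only $\phi_{AB}$ and $\phi_{BC}$ factors, conditioning on B factorizes the distribution over A and C.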
A crucial question: Can every distribution representable by one family be represented by the other? The answer is nuanced.
Definition: I-map (Independence Map)
A graph $G$ is an I-map for a distribution $P$ if the conditional independencies implied by $G$ are a subset of the conditional independencies that hold in $P$. In other words, $G$ doesn't assert any independence that $P$ violates.
Key results:
Some independencies can only be expressed in directed models (Example: "Explaining away" pattern)
Some independencies can only be expressed in undirected models (Example: Four-way symmetric constraint)
Neither family is strictly more powerful than the other
Let's examine specific cases where each family has an advantage:
Case 1: V-Structure (Explaining Away) — Only Directed Models
Consider two independent causes that share a common effect:
```
A     B
 \   /
  ↓ ↓
   C
```
In this structure:
- $A$ and $B$ are marginally independent: $P(A, B) = P(A)P(B)$
- $A$ and $B$ become dependent once $C$ is observed: $A \not\perp B \mid C$
This is the famous explaining away phenomenon. Observing the common effect makes the two causes dependent—if one cause is ruled out, the other becomes more likely.
Why undirected graphs can't capture this:
To express $A \perp\!\!\!\perp B$ (marginal independence), an undirected graph would have no edge between A and B. But without an edge, A and B remain independent even given C. Undirected graphs cannot express: "independent marginally, dependent given a common descendant."
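Explaining away can be verified numerically. A sketch with illustrative CPTs for a v-structure $A \to C \leftarrow B$, computing posteriors over $A$ by brute-force enumeration:

```python
from itertools import product

# Illustrative v-structure A -> C <- B: two independent causes, one common effect
P_A = 0.3   # P(A=1)
P_B = 0.2   # P(B=1)
P_C1 = {(0, 0): 0.05, (0, 1): 0.70, (1, 0): 0.80, (1, 1): 0.95}  # P(C=1 | a, b)

def joint(a, b, c):
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = P_C1[(a, b)] if c else 1 - P_C1[(a, b)]
    return pa * pb * pc

def posterior_a(b_obs=None, c_obs=None):
    """P(A=1 | observed values of B and/or C), by enumeration."""
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        if b_obs is not None and b != b_obs:
            continue
        if c_obs is not None and c != c_obs:
            continue
        p = joint(a, b, c)
        den += p
        if a == 1:
            num += p
    return num / den

print(f"P(A=1)            = {posterior_a():.3f}")                 # the prior, 0.300
print(f"P(A=1 | B=1)      = {posterior_a(b_obs=1):.3f}")          # unchanged: A and B independent
print(f"P(A=1 | C=1)      = {posterior_a(c_obs=1):.3f}")          # rises: C is evidence for A
print(f"P(A=1 | C=1, B=1) = {posterior_a(b_obs=1, c_obs=1):.3f}") # drops again: B explains C away
```

Observing B alone leaves the posterior over A at its prior, but once C is also observed, B pulls the posterior back down: exactly the "independent marginally, dependent given the effect" pattern no undirected graph over A, B, C can encode.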
Case 2: Diamond Structure — Only Undirected Models
Consider four variables in a cycle:
```
A — B
|   |
C — D
```
Independencies we want:
- $A \perp D \mid \{B, C\}$ (each pair of opposite corners is separated by the other two)
- $B \perp C \mid \{A, D\}$
An undirected graph with exactly these edges captures these independencies via graph separation.
Why directed graphs struggle:
To make this a DAG, we must add directions. Any acyclic orientation forces additional conditional independencies (through d-separation rules we'll cover later) that may not hold in the true distribution. Alternatively, we'd need to add edges to be a valid I-map, losing the precise independence structure.
The bottom line:
Neither family dominates. The choice depends on:
- Whether the domain's relationships are causal/ordered or symmetric
- Which conditional independencies you need the graph to express
- Which computational operations (sampling, learning, inference) you need to be cheap
Every Bayesian network can be converted to an equivalent Markov network through 'moralization'—connecting all parents of each node and dropping edge directions. However, this process may add edges, losing some independence information. The moralized graph is an I-map but may not be a perfect map.
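A minimal sketch of the moralization procedure (the `moralize` helper and node names here are illustrative): connect every pair of parents that share a child, then drop all edge directions.

```python
from itertools import combinations

def moralize(parents):
    """Convert a DAG, given as {child: [parents]}, to an undirected edge set."""
    edges = set()
    for child, pa in parents.items():
        # Drop directions on the existing parent -> child edges
        for p in pa:
            edges.add(frozenset((p, child)))
        # "Marry" every pair of parents of the same child
        for p, q in combinations(pa, 2):
            edges.add(frozenset((p, q)))
    return edges

# Illustrative v-structure: A -> C <- B
dag = {"A": [], "B": [], "C": ["A", "B"]}
moral = moralize(dag)
print(sorted(tuple(sorted(e)) for e in moral))
# The marrying edge A - B appears, so the marginal independence of
# A and B is no longer readable from the moralized graph.
```

Applied to the v-structure, moralization adds the A–B edge, which is exactly the independence information the conversion loses.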
Choosing between directed and undirected models is one of the most important modeling decisions. Here's a principled framework for making this choice:
Use Bayesian Networks when relationships have a natural direction (causal or temporal), when you need cheap generative sampling, or when conditional probabilities are easy to elicit from domain knowledge. Use Markov Random Fields when relationships are inherently symmetric, as in spatial or relational structure.

Classic Bayesian Network applications: medical diagnosis (as in the smoking/cancer example above), temporal models such as HMMs, and other systems with clear cause-effect structure.

Classic MRF applications: image segmentation and denoising on pixel grids, spin systems such as the Ising model, and sequence labeling with CRFs.
| Domain Characteristic | Recommended Model | Reason |
|---|---|---|
| Clear cause-effect relationships | Bayesian Network | Causal semantics align with domain |
| Temporal/sequential data | Bayesian Network (HMM) | Natural ordering from time |
| Grid/spatial structure | Markov Random Field | Symmetric neighbor relationships |
| Need to generate samples | Bayesian Network | Ancestral sampling is easy |
| Discriminative classification | CRF (undirected, conditional) | Models P(labels|features) directly |
| Feature dependencies + label correlations | CRF | Avoids feature independence assumption |
The choice between directed and undirected models has significant computational implications:
Bayesian Networks — Computational Advantages:
No partition function: Factors are CPDs that automatically normalize. No need to compute the intractable sum $Z$.
Easy sampling: To generate a sample, visit the nodes in topological order and draw each $X_i$ from $P(X_i \mid \text{Pa}(X_i))$ using the already-sampled parent values (ancestral sampling).
Local likelihood: The likelihood of data decomposes: $\log P(D) = \sum_i \log P(X_i | \text{Pa}(X_i))$. Each term can be optimized separately.
Parameter estimation: Maximum likelihood estimates for CPDs are just empirical conditional frequencies—easy to compute.
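The counting estimator can be sketched directly on some made-up binary (smoking, cancer) observations: the MLE for each CPT entry is just a conditional frequency.

```python
from collections import Counter

# Made-up binary observations of (smoking, cancer) pairs
data = [(1, 1), (1, 0), (1, 0), (0, 0), (0, 0), (0, 0), (0, 1), (1, 1)]

pair_counts = Counter(data)                    # count(s, c)
parent_counts = Counter(s for s, _ in data)    # count(s)

# MLE: P(Cancer=c | Smoking=s) = count(s, c) / count(s)
cpt = {
    (c, s): pair_counts[(s, c)] / parent_counts[s]
    for s in (0, 1) for c in (0, 1)
}
print(cpt)  # each conditional distribution sums to 1 by construction
```

Each row of the estimated table is a valid distribution automatically, and no term involving other variables' parameters ever appears: the decomposed likelihood is what makes this purely local counting correct.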
Markov Random Fields — Computational Challenges:
Partition function: Computing $Z = \sum_X \prod_c \phi_c(X_c)$ is #P-complete in general. For $n$ binary variables, it's $O(2^n)$.
Difficult sampling: No natural ordering for ancestral sampling. Must use MCMC (Gibbs sampling, Metropolis-Hastings).
Coupled likelihood: The log-likelihood $\log P(X) = \sum_c \log \phi_c(X_c) - \log Z$. The $\log Z$ term couples all parameters—can't optimize locally.
Gradient estimation: The gradient of the log-likelihood involves expectations under the model: $\nabla \log P(D) = \mathbb{E}_{\text{data}}[\nabla \log \phi] - \mathbb{E}_{\text{model}}[\nabla \log \phi]$. The second term requires sampling from the model.
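Both the sampling and the gradient difficulties above lead to MCMC in practice. A Gibbs-sampling sketch for a small Ising-style ring of binary variables (the coupling value is illustrative): each update resamples one variable from its conditional given its two neighbors, which needs only local potentials and never touches $Z$.

```python
import math
import random

# 1D ring of binary spins with an illustrative coupling strength
N, COUPLING = 8, 0.8

def conditional_p1(state, i):
    """P(x_i = 1 | neighbors): depends only on the two adjacent spins."""
    left, right = state[(i - 1) % N], state[(i + 1) % N]
    def local(v):
        # exp(COUPLING) per agreeing neighbor, exp(0) per disagreeing one
        return math.exp(COUPLING * ((v == left) + (v == right)))
    return local(1) / (local(0) + local(1))

random.seed(0)
state = [random.randint(0, 1) for _ in range(N)]
agree = 0
n_sweeps = 2000
for sweep in range(n_sweeps):
    for i in range(N):  # one sweep: resample every variable once
        state[i] = 1 if random.random() < conditional_p1(state, i) else 0
    agree += sum(state[i] == state[(i + 1) % N] for i in range(N))

frac_agree = agree / (n_sweeps * N)
print(f"Fraction of agreeing neighbor pairs: {frac_agree:.3f}")  # above 0.5 due to coupling
```

The sampled states can then be used to estimate the $\mathbb{E}_{\text{model}}[\cdot]$ term in the gradient; the cost is that the estimates are noisy and the chain must mix.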
When MRFs are still worth it:
Conditional Random Fields (CRFs) are undirected models that sidestep much of the computational difficulty. By modeling P(Y|X) instead of P(X,Y), the partition function Z(X) only sums over labels Y, not the potentially high-dimensional features X. This makes CRFs practical for sequence labeling tasks like POS tagging and named entity recognition.
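To see concretely why conditioning helps, here is a small sketch for a linear-chain model with $K$ labels over $T$ positions: a forward recursion computes $\log Z(X)$ in $O(TK^2)$, checked against brute-force enumeration. The random score arrays stand in for learned, feature-dependent potentials.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
T, K = 5, 3                                  # sequence length, number of labels
unary = rng.uniform(0.1, 1.0, size=(T, K))   # stand-ins for feature-dependent scores
pairwise = rng.uniform(0.1, 1.0, size=(K, K))  # label-transition potentials

def log_Z_forward():
    """Forward recursion over label positions: O(T * K^2)."""
    alpha = np.log(unary[0])
    for t in range(1, T):
        # log-sum-exp over the previous label, for each current label
        scores = alpha[:, None] + np.log(pairwise) + np.log(unary[t])[None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def log_Z_brute():
    """Enumerate all K^T label sequences: exponential in T."""
    Z = 0.0
    for ys in product(range(K), repeat=T):
        p = np.prod([unary[t, y] for t, y in enumerate(ys)])
        p *= np.prod([pairwise[ys[t], ys[t + 1]] for t in range(T - 1)])
        Z += p
    return np.log(Z)

print("forward log Z:", log_Z_forward(), " brute force:", log_Z_brute())
```

The recursion sums only over label configurations; the observed input enters solely through the potential values, which is what makes $Z(X)$ tractable no matter how rich the features are.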
In practice, the directed/undirected dichotomy is often too restrictive. Several representations extend beyond this basic division:
Factor Graphs:
A factor graph is a bipartite graph with two types of nodes:
- Variable nodes, one per random variable
- Factor nodes, one per factor in the factorization
Edges connect factor nodes to the variables they depend on.
$$P(X) = \frac{1}{Z} \prod_{a} f_a(X_{\mathcal{N}(a)})$$
Where $\mathcal{N}(a)$ denotes the neighbors of factor $a$.
Advantages of factor graphs:
- They make the factorization explicit: distributions with the same graph but different clique structure are distinguished at the factor level
- They subsume both directed and undirected models, since CPDs and potentials are both just factors
- They are the natural data structure for message-passing algorithms such as sum-product (belief propagation)
Chain Graphs:
A chain graph allows both directed and undirected edges, with the constraint that nodes can be partitioned into ordered 'chain components': edges within a component are undirected, while edges between components are directed and always point from an earlier component to a later one.
This models situations like a directed pipeline whose stages each contain symmetrically interacting variables, for instance a cause feeding into a block of spatially correlated measurements.
Conditional Random Fields (CRFs):
CRFs are undirected models conditioned on observations:
$$P(Y | X) = \frac{1}{Z(X)} \prod_{c} \phi_c(Y_c, X)$$
Note that:
- The partition function $Z(X)$ depends on the observed input $X$, and the sum defining it runs only over label configurations $Y$
- The model never represents $P(X)$, so arbitrary, highly dependent features of $X$ can be used freely
CRFs combine the flexibility of undirected models with the computational benefits of conditioning.
Hybrid Approaches:
Modern systems often combine directed components where causal or generative structure is natural, undirected components where constraints are symmetric, and conditioning (as in CRFs) to avoid modeling complex inputs.
All these representations—Bayesian networks, MRFs, factor graphs, CRFs—are different ways of specifying factorizations. The choice depends on what's most natural for your domain and what computational properties you need. Factor graphs provide the most general representation, while the others offer specific conveniences for common cases.
Let's consolidate the essential insights from this deep dive into directed and undirected graphical models:
- Bayesian networks factorize the joint as a product of CPDs over a DAG; normalization is automatic and ancestral sampling is easy
- Markov random fields factorize as a product of clique potentials divided by a partition function $Z$, which is generally intractable to compute exactly
- Neither family is strictly more expressive: v-structures (explaining away) require directed edges, while symmetric cyclic structures like the diamond require undirected ones
- Factor graphs, chain graphs, and CRFs extend the basic dichotomy when neither pure family fits
What's next:
Now that we understand the two major families of graphical models, we need to formalize the mathematical foundation that makes them work: conditional independence. The next page will give you a rigorous understanding of what conditional independence means, how it enables factorization, and how to read independence properties directly from graph structure.
You now have a deep understanding of directed and undirected graphical models. You can distinguish their factorization properties, explain what types of dependencies each can represent, and make informed choices about which family to use for a given problem. Next, we'll formalize conditional independence—the mathematical foundation that enables everything we've discussed.