Imagine you're a physician diagnosing a patient. The patient has symptoms: fever, cough, and fatigue. Multiple diseases could cause these symptoms—influenza, COVID-19, tuberculosis, or a common cold. Each disease has different probabilities, and the symptoms themselves are correlated. The presence of one symptom might increase or decrease the likelihood of others. How do you systematically reason about this web of uncertain relationships?
Or consider a self-driving car processing sensor data. The car's LIDAR detects an object. Is it a pedestrian, a cyclist, or a mailbox? The camera provides another view. The radar gives velocity information. These sensors are not independent—they're all observing the same underlying reality. How do you combine these uncertain, correlated observations into a coherent understanding of the world?
Probabilistic Graphical Models (PGMs) provide the mathematical framework to answer these questions. They represent the gold standard for modeling complex systems where uncertainty and dependencies interweave, and they form the theoretical backbone of many modern machine learning systems.
By the end of this page, you will understand what probabilistic graphical models are, why they represent a fundamental advance in representing and reasoning about uncertainty, and how they provide a unifying language for expressing complex probabilistic relationships. You will see how PGMs bridge the gap between human-interpretable models and powerful computational inference.
Before we can appreciate the elegance of graphical models, we must understand the problem they solve. Consider a probability distribution over n binary random variables. In the most general case, specifying this distribution requires:
$$P(X_1, X_2, \ldots, X_n)$$
Since each variable can take 2 values, there are $2^n$ possible configurations. To fully specify the joint distribution, we need $2^n - 1$ parameters (the last is determined by normalization).
The combinatorial explosion:
With just 100 binary variables, $2^{100} - 1 \approx 10^{30}$. No computer can store or learn $10^{30}$ parameters. And real-world problems routinely involve hundreds, thousands, or millions of variables. Consider a 1000×1000 binary image—that's 1 million binary random variables, requiring on the order of $2^{1{,}000{,}000}$ parameters in the general case.
The fundamental question:
How can we represent high-dimensional probability distributions compactly, learn them from finite data, and perform inference efficiently?
Without structure, probability distributions suffer from an exponential curse of dimensionality. The full joint distribution over n variables requires $O(2^n)$ parameters, making direct representation and learning intractable for all but the smallest problems. Graphical models exploit conditional independence structure to achieve compact, tractable representations.
The independence assumption—too naive:
One extreme approach assumes all variables are mutually independent:
$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i)$$
This requires only $n$ parameters—tractable even for millions of variables. But it throws away all relationships between variables. In the medical diagnosis example, we'd assume symptoms are unrelated to diseases—clearly absurd.
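A minimal sketch makes the trade-off concrete: under full independence, n marginals determine all $2^n$ joint probabilities (the marginal values below are hypothetical, chosen only for illustration).

```python
import itertools

# Sketch: joint probability under the full-independence assumption.
# The marginals are hypothetical numbers for illustration.
marginals = [0.2, 0.7, 0.4]  # P(X_i = 1) for three binary variables

def joint_prob(x):
    """P(x) = prod_i P(X_i = x_i): only n parameters, but no relationships."""
    prob = 1.0
    for xi, pi in zip(x, marginals):
        prob *= pi if xi == 1 else 1 - pi
    return prob

# The n marginals fully determine all 2^n joint probabilities.
total = sum(joint_prob(x) for x in itertools.product([0, 1], repeat=3))  # sums to 1
```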
The other extreme—full dependence:
Assuming full dependence preserves all relationships but leads to intractable complexity. Neither extreme works.
The solution: structured independence
Real-world systems exhibit conditional independence—variables are related, but not all-to-all. A patient's cough depends on their disease, but given the disease, it might be independent of their geographic location. Graphical models exploit this structure by encoding precisely which conditional independencies hold.
A Probabilistic Graphical Model (PGM) is a marriage of two mathematical languages: probability theory, which supplies random variables, distributions, and the rules for reasoning under uncertainty; and graph theory, which supplies nodes, edges, and algorithms for reasoning about structure.
The key insight is profound: the graph structure encodes conditional independence assumptions, allowing the joint distribution to factorize into a product of simpler terms.
Formal definition:
A PGM consists of:

- A graph $G = (V, E)$ whose nodes correspond to random variables $X_1, \ldots, X_n$
- A set of parameters (local conditional distributions or potential functions) attached to the graph's structure
The graph serves as a qualitative representation of dependencies, while the parameters provide the quantitative specification of probabilities.
| Component | From Probability Theory | From Graph Theory |
|---|---|---|
| Core object | Random variables, probability distributions | Nodes, edges, graph structure |
| Relationships | Statistical dependencies, conditional probabilities | Edges indicating direct influence |
| Inference | Marginalization, conditioning, Bayes' rule | Message passing, graph algorithms |
| Independence | Conditional independence assertions | Graph separation criteria (d-separation, Markov blanket) |
| Representation | Exponentially large joint distribution | Compact graphical structure |
Why graphs?
Graphs are a natural language for expressing relationships:

- Nodes make the objects of interest (random variables) explicit
- Edges make direct interactions explicit, and missing edges make independence explicit
- Decades of graph algorithms (traversal, separation, message passing) translate directly into probabilistic operations
The power of the graphical representation lies in what it doesn't include. Every missing edge is an assertion that two variables are conditionally independent given appropriate conditioning sets. These absent edges are what enable compact representation and efficient inference.
In probabilistic graphical models, what's NOT connected is as important as what IS connected. Every missing edge represents a conditional independence assumption that simplifies the model. A sparse graph (few edges) implies strong independence assumptions and leads to more tractable inference. A dense graph preserves more dependencies but increases computational cost.
The defining characteristic of PGMs is that the graph structure implies a factorization of the joint distribution. Instead of specifying the full joint directly, we specify a set of factors—smaller functions that combine to define the joint.
The general form:
$$P(X_1, X_2, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c)$$
Where:

- $Z$ is the partition function, a normalization constant ensuring the distribution sums to 1
- $\mathcal{C}$ is a set of cliques (subsets of variables) determined by the graph
- $\phi_c(X_c)$ is a non-negative factor (potential function) defined over the variables $X_c$ in clique $c$
The magic is that each factor $\phi_c$ depends only on a small subset of variables—those that are directly connected in the graph. If the cliques are small, the factors are small, and the representation is compact.
```python
# Example: Factorization comparison

def full_joint_complexity(n_variables):
    """Parameters needed for the full joint distribution over n binary variables."""
    return 2**n_variables - 1

def chain_factorization_complexity(n_variables):
    """Parameters for a chain-structured PGM: X1 -> X2 -> ... -> Xn.

    Factorizes as: P(X1) * P(X2|X1) * P(X3|X2) * ... * P(Xn|Xn-1)
    - P(X1): 1 parameter
    - P(Xi|Xi-1): 2 parameters each (for binary variables)
    Total: 1 + 2*(n-1) = 2n - 1
    This is LINEAR in n, not exponential!
    """
    return 1 + 2 * (n_variables - 1)

# Compare complexities
for n in [5, 10, 20, 50, 100]:
    full = full_joint_complexity(n)
    chain = chain_factorization_complexity(n)
    ratio = full / chain
    print(f"n={n:3d}: Full joint: {full:30,d} | Chain PGM: {chain:4d} | Ratio: {ratio:,.0f}x")

# Output (abridged):
# n=  5: Full joint: 31                    | Chain PGM:    9 | Ratio: 3x
# n= 10: Full joint: 1,023                 | Chain PGM:   19 | Ratio: 54x
# n= 20: Full joint: 1,048,575             | Chain PGM:   39 | Ratio: 26,887x
# n= 50: Full joint: 1,125,899,906,842,623 | Chain PGM:   99 | Ratio: 11,372,726,331,744x
# n=100: Full joint: (a 31-digit number)   | Chain PGM:  199 | Ratio: ~6 * 10^27x
```

Why factorization enables tractability:
Compact representation: Instead of $2^n$ parameters, we need only parameters for each factor. For a chain of n variables, that's $O(n)$ parameters.
Efficient learning: With fewer parameters, we need less data to estimate them reliably. Each factor can be learned from local statistics.
Efficient inference: Marginalizing or conditioning can exploit the factorization structure. We can "push" sums inside products, computing intermediate results efficiently.
Modularity: Factors represent local relationships that can be understood, validated, and modified independently.
The key insight: By encoding conditional independence through the graph structure, PGMs achieve exponential compression of the representation while preserving the essential dependencies for modeling real-world phenomena.
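To make the "push sums inside products" idea concrete, here is a minimal sketch of forward marginalization on a binary chain; the conditional distributions are randomly generated placeholders.

```python
import numpy as np

# Sketch: computing P(Xn) on a chain X1 -> X2 -> ... -> Xn by pushing
# each sum inside the product, one variable at a time.
rng = np.random.default_rng(0)
n = 100

p1 = np.array([0.6, 0.4])                # P(X1)
cpts = rng.uniform(size=(n - 1, 2, 2))   # cpts[i][a, b] = P(X_{i+2}=b | X_{i+1}=a)
cpts /= cpts.sum(axis=2, keepdims=True)  # normalize each conditional

def marginal_last(p1, cpts):
    """P(Xn) = sum over x1..x_{n-1} of P(x1) * prod_i P(x_{i+1} | x_i).

    Summing variable by variable costs O(n * k^2) for k states per variable,
    instead of the O(k^n) cost of summing the full joint.
    """
    msg = p1
    for T in cpts:
        msg = msg @ T  # msg(x_{i+1}) = sum_{x_i} msg(x_i) * P(x_{i+1} | x_i)
    return msg

p_last = marginal_last(p1, cpts)  # a valid distribution: entries sum to 1
```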
Probabilistic graphical models come in two primary flavors, distinguished by the type of graph they use:
1. Directed Graphical Models (Bayesian Networks)
Use directed acyclic graphs (DAGs) where edges have arrows indicating causal or generative direction. Also called Belief Networks, Bayes Nets, or Causal Networks.
Factorization: $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Parents}(X_i))$$
Each variable is conditioned on its parent nodes—the nodes with arrows pointing to it.
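As a toy illustration of this factorization, consider a three-node network Disease → Fever, Disease → Cough; all probabilities below are invented for illustration.

```python
# Toy Bayesian network: Disease -> Fever, Disease -> Cough.
# All probabilities are made up for illustration.
p_disease = {True: 0.01, False: 0.99}     # P(Disease)
p_fever_given = {True: 0.9, False: 0.1}   # P(Fever=true | Disease)
p_cough_given = {True: 0.8, False: 0.05}  # P(Cough=true | Disease)

def joint(disease, fever, cough):
    """P(D, F, C) = P(D) * P(F | D) * P(C | D): each node given its parents."""
    pf = p_fever_given[disease] if fever else 1 - p_fever_given[disease]
    pc = p_cough_given[disease] if cough else 1 - p_cough_given[disease]
    return p_disease[disease] * pf * pc

# Diagnosis by Bayes' rule: P(Disease | Fever, Cough)
num = joint(True, True, True)
den = num + joint(False, True, True)
posterior = num / den  # about 0.59 with these made-up numbers
```

Three conditional tables (5 parameters) replace the 7 parameters of the full joint, and the gap widens rapidly as symptoms are added.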
2. Undirected Graphical Models (Markov Random Fields)
Use undirected graphs where edges have no direction. Also called Markov Networks or Gibbs Random Fields.
Factorization: $$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c)$$
Potential functions are defined over cliques—maximal fully-connected subgraphs.
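A minimal sketch of this factorization, using a hypothetical three-variable model with Ising-style agreement potentials (the weight is arbitrary):

```python
import itertools
import numpy as np

# Hypothetical 3-variable pairwise MRF with edges (0,1) and (1,2).
# phi(x_i, x_j) = exp(w * [x_i == x_j]) rewards neighboring agreement.
w = 1.5
edges = [(0, 1), (1, 2)]

def unnormalized(x):
    """Product of edge potentials, before normalization."""
    return np.exp(sum(w * (x[i] == x[j]) for i, j in edges))

# The partition function Z sums the unnormalized product over all 2^n states.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)

def prob(x):
    return unnormalized(x) / Z
```

Note that Z requires a sum over all states; for tiny models brute force works, but in general computing Z is itself a hard inference problem.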
When to use which?
Directed models are natural when:

- There is a clear causal or generative direction (diseases cause symptoms, classes generate features)
- You want to specify the model through local conditional distributions $P(X_i \mid \text{Parents}(X_i))$

Undirected models are natural when:

- Interactions are symmetric, with no obvious direction (neighboring pixels in an image, atoms in a lattice)
- It is easier to specify compatibility between configurations than conditional probabilities
Hybrid models (Factor Graphs):
Some applications combine both types. Conditional Random Fields (CRFs) are undirected models conditioned on observations. Chain graphs allow both directed and undirected edges. The field has developed general-purpose representations like factor graphs that can express any factorization.
A given probability distribution can often be represented by both directed and undirected models—the choice depends on which conditional independencies you want to encode directly. However, some independencies can only be captured by one type. This observation leads to the study of 'I-maps' (independence maps) and the different expressive powers of directed versus undirected models.
Working with probabilistic graphical models involves three core computational problems. Understanding these problems and their solutions is essential for applying PGMs in practice.
Problem 1: Representation
Given: A domain with random variables and their relationships

Goal: Construct a graph and parameterization that accurately models the joint distribution
This involves:

- Choosing which random variables to include
- Deciding between a directed or undirected graph
- Specifying the edges, i.e., the independence assumptions
- Parameterizing the factors (conditional distributions or potentials)
Problem 2: Inference
Given: A PGM with known structure and parameters, plus observed evidence

Goal: Compute posterior probabilities for unobserved variables
Key inference tasks:

- Marginal inference: compute $P(X_i \mid \text{evidence})$ for query variables
- MAP inference: find the most probable joint assignment of the unobserved variables
- Likelihood evaluation: compute the probability of the observed evidence
Problem 3: Learning
Given: Data samples from a domain, possibly with missing values

Goal: Learn the graph structure and/or parameters from data
| Problem | Input | Output | Key Challenges |
|---|---|---|---|
| Representation | Domain knowledge, independence assumptions | Graph structure + parameters | Model selection, expressiveness vs tractability |
| Inference | Model + observed evidence | Posterior probabilities | NP-hard in general, requires approximation |
| Learning | Data samples | Learned model (structure and/or parameters) | Structure learning is super-exponential, missing data |
The computational challenge:
Both inference and learning are NP-hard in general. For inference, even computing the partition function Z is #P-complete, a counting problem at least as hard as any problem in NP. For structure learning, the space of possible graphs grows super-exponentially with the number of variables.
However, the graph structure often enables efficient algorithms:

- On trees and chains, exact inference (belief propagation) runs in time linear in the number of variables
- On graphs of low treewidth, junction-tree methods remain tractable
- Sparse structure lets sums be pushed inside products (variable elimination)
When exact inference is intractable, we turn to approximate methods:

- Sampling methods (MCMC, Gibbs sampling, particle filters) that draw samples from the posterior
- Variational methods that optimize a tractable approximation to the posterior
- Loopy belief propagation, which applies message passing to graphs with cycles
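As one example of an approximate method, here is a minimal Gibbs sampler sketch for a tiny pairwise agreement model; all parameters are hypothetical.

```python
import numpy as np

# Sketch: Gibbs sampling for a tiny pairwise model (hypothetical parameters).
# Each variable is resampled in turn from its conditional given the rest.
rng = np.random.default_rng(0)
w = 1.0
edges = [(0, 1), (1, 2)]
n = 3

def unnormalized(x):
    return np.exp(sum(w * (x[i] == x[j]) for i, j in edges))

def gibbs(n_sweeps=20000):
    x = list(rng.integers(0, 2, size=n))
    counts = np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):
            # The conditional of X_i is proportional to the joint with X_i
            # set to each value; only factors touching X_i actually matter.
            x[i] = 0
            p0 = unnormalized(x)
            x[i] = 1
            p1 = unnormalized(x)
            x[i] = int(rng.random() < p1 / (p0 + p1))
        counts += x
    return counts / n_sweeps  # estimated P(X_i = 1)

est = gibbs()  # close to the exact marginals (0.5 each, by symmetry)
```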
Probabilistic graphical models are not just a theoretical framework—they underpin many successful machine learning systems and provide foundational concepts that extend to deep learning.
Direct applications: PGMs power working systems in language, vision, healthcare, robotics, and bioinformatics, as the table below illustrates.

Conceptual foundations:

Even when not using PGMs directly, ML practitioners benefit from understanding:

- Conditional independence and factorization, which underlie efficient model design
- Latent variable modeling and the distinction between inference and learning
- Message passing and variational inference, which reappear throughout deep learning
| Domain | PGM Application | Why It Works |
|---|---|---|
| Natural Language | HMMs, CRFs for sequence labeling | Sequential structure with local dependencies |
| Computer Vision | MRFs for image segmentation, stereo | Spatial structure with neighbor dependencies |
| Healthcare | Bayesian networks for diagnosis, epidemiology | Causal structure with interpretability requirements |
| Robotics | Particle filters, SLAM as factor graphs | Temporal structure with sequential observations |
| Bioinformatics | HMMs for gene finding, phylogenetic trees | Biological sequences with Markovian properties |
| Deep Learning | VAEs, energy-based models, attention as message passing | Latent variable modeling, structured outputs |
Modern deep learning incorporates many PGM ideas: Variational Autoencoders use variational inference, attention mechanisms can be viewed as soft message passing, graph neural networks extend belief propagation, and energy-based models are undirected graphical models with neural network potentials. Understanding PGMs provides crucial intuition for these advanced techniques.
In an era of black-box deep learning, probabilistic graphical models offer a crucial advantage: interpretability. The graph structure makes the model's assumptions explicit and human-readable.
What the graph tells us:

- Which variables directly influence which others
- Which variables become independent once certain evidence is observed
- What minimal information (the Markov blanket) suffices to predict a given variable

Benefits for stakeholders:

- Domain experts can inspect and validate the model's assumptions
- Practitioners can trace how evidence propagates through the graph to explain a prediction
- Missing or spurious edges can be debated and corrected explicitly
Comparison with deep learning:
PGMs and deep learning are not mutually exclusive. Many state-of-the-art systems combine both: using neural networks to learn complex feature representations while using graphical model structure to encode known relationships and enable interpretable reasoning. The choice depends on the task's requirements for interpretability, data availability, and the nature of domain knowledge.
One of the most powerful aspects of PGMs is their role as a unifying framework. Many seemingly different machine learning models are special cases of graphical models:
| Model | PGM Representation |
|---|---|
| Naive Bayes | Bayesian network with class → all features |
| Logistic Regression | Conditional model; the discriminative counterpart of naive Bayes |
| Hidden Markov Models | Chain-structured Bayesian network |
| Kalman Filters | Linear-Gaussian Bayesian network |
| Mixture Models | Bayesian network with latent cluster variable |
| LDA Topic Models | Hierarchical Bayesian network with Dirichlet priors |
| Ising Models | Pairwise Markov random field |
| Boltzmann Machines | Fully-connected MRF with hidden units |
Why unification matters:

- Algorithms developed for one model transfer to every other model in the family
- Theoretical guarantees carry over automatically
- New models can be designed by recombining known structural pieces

The PGM research program:

The graphical models community has developed a remarkable toolkit:

- Exact inference: variable elimination, belief propagation, junction trees
- Approximate inference: MCMC sampling, variational methods, loopy belief propagation
- Learning: maximum likelihood, Bayesian estimation, EM for missing data, structure search

This toolkit is general-purpose. Once you understand the PGM framework, you can:

- Recognize familiar models as special cases and apply the right algorithms immediately
- Design new models for new domains and inherit inference machinery for free
- Read research across subfields that share the graphical-model vocabulary
The vision: A 'compiler' for probabilistic models. Specify your model declaratively; let automated tools handle inference and learning.
This vision is realized in modern probabilistic programming languages like Stan, PyMC, Pyro, and TensorFlow Probability. You specify a graphical model using code, and the system automatically constructs and executes inference algorithms. Understanding PGMs is essential for using these powerful tools effectively.
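The flavor of that workflow can be sketched in plain Python: declare the variables and factors, then let one generic routine answer any query by enumeration. This is a toy stand-in; the probabilities are hypothetical, and real systems automate far more scalable inference.

```python
import itertools

# Toy 'declarative' probabilistic model: Rain and Sprinkler both make
# the grass wet. All probabilities are hypothetical.
variables = ["Rain", "Sprinkler", "WetGrass"]

def p_wet(a):
    """P(WetGrass | Rain, Sprinkler): wet is likely if either parent is true."""
    if a["Rain"] or a["Sprinkler"]:
        return 0.95 if a["WetGrass"] else 0.05
    return 0.05 if a["WetGrass"] else 0.95

factors = [
    lambda a: 0.2 if a["Rain"] else 0.8,       # P(Rain)
    lambda a: 0.1 if a["Sprinkler"] else 0.9,  # P(Sprinkler)
    p_wet,                                     # P(WetGrass | parents)
]

def query(target, evidence):
    """P(target = True | evidence), by summing the factor product over all states."""
    weights = {True: 0.0, False: 0.0}
    for values in itertools.product([True, False], repeat=len(variables)):
        a = dict(zip(variables, values))
        if all(a[k] == v for k, v in evidence.items()):
            w = 1.0
            for f in factors:
                w *= f(a)
            weights[a[target]] += w
    return weights[True] / (weights[True] + weights[False])

p_rain = query("Rain", {"WetGrass": True})  # observing wet grass raises P(Rain)
```

The model specification (the factor list) is cleanly separated from the inference engine (`query`), which is exactly the separation probabilistic programming languages exploit at scale.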
This module will give you a deep understanding of the foundational concepts in probabilistic graphical models. Here's what's ahead:
Page 1: Directed vs Undirected Models
We'll examine the two major families of PGMs in detail—Bayesian networks and Markov random fields. You'll understand their different factorization properties, the types of dependencies each can represent, and when to choose one over the other.
Page 2: Conditional Independence
The cornerstone of graphical models. We'll formalize what conditional independence means, how it enables factorization, and how to read independence from graphs. This is the mathematical foundation everything else builds on.
Page 3: D-Separation
The algorithmic tool for reading conditional independencies from directed graphs. We'll master d-separation—a graph-theoretic criterion that tells you exactly which variables are independent given observed evidence.
Page 4: Markov Blanket
For any variable in a graphical model, what's the minimal set of other variables that renders it independent of all others? The Markov blanket answers this question, with profound implications for inference and learning.
You now understand what probabilistic graphical models are and why they matter. They provide a principled framework for representing complex probability distributions compactly by exploiting conditional independence structure. The graph encodes qualitative assumptions about which variables directly influence which others, while the parameters encode quantitative probabilities. This foundation enables the inference and learning algorithms we'll explore throughout this chapter. Next, we'll examine the fundamental distinction between directed and undirected graphical models.