Imagine you're a physician diagnosing a patient. The patient has symptoms: fever, cough, and fatigue. Multiple diseases could cause these symptoms—influenza, COVID-19, tuberculosis, or a common cold. Each disease has different probabilities, and the symptoms themselves are correlated. The presence of one symptom might increase or decrease the likelihood of others. How do you systematically reason about this web of uncertain relationships?
Or consider a self-driving car processing sensor data. The car's LIDAR detects an object. Is it a pedestrian, a cyclist, or a mailbox? The camera provides another view. The radar gives velocity information. These sensors are not independent—they're all observing the same underlying reality. How do you combine these uncertain, correlated observations into a coherent understanding of the world?
Probabilistic Graphical Models (PGMs) provide the mathematical framework to answer these questions. They represent the gold standard for modeling complex systems where uncertainty and dependencies interweave, and they form the theoretical backbone of many modern machine learning systems.
By the end of this page, you will understand what probabilistic graphical models are, why they represent a fundamental advance in representing and reasoning about uncertainty, and how they provide a unifying language for expressing complex probabilistic relationships. You will see how PGMs bridge the gap between human-interpretable models and powerful computational inference.
Before we can appreciate the elegance of graphical models, we must understand the problem they solve. Consider a probability distribution over n binary random variables. In the most general case, specifying this distribution requires:
$$P(X_1, X_2, \ldots, X_n)$$
Since each variable can take 2 values, there are $2^n$ possible configurations. To fully specify the joint distribution, we need $2^n - 1$ parameters (the last is determined by normalization).
The combinatorial explosion:
With just 100 binary variables, $2^{100} - 1 \approx 10^{30}$. No computer can store or learn $10^{30}$ parameters. And real-world problems routinely involve hundreds, thousands, or millions of variables. Consider a 1000×1000 binary image—that's 1 million binary random variables, requiring on the order of $2^{1{,}000{,}000}$ parameters in the general case.
The fundamental question:
How can we represent high-dimensional probability distributions compactly, learn them from finite data, and perform inference efficiently?
Without structure, probability distributions suffer from an exponential curse of dimensionality. The full joint distribution over n variables requires $O(2^n)$ parameters, making direct representation and learning intractable for all but the smallest problems. Graphical models exploit conditional independence structure to achieve compact, tractable representations.
The independence assumption—too naive:
One extreme approach assumes all variables are mutually independent:
$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i)$$
This requires only $n$ parameters—tractable even for millions of variables. But it throws away all relationships between variables. In the medical diagnosis example, we'd assume symptoms are unrelated to diseases—clearly absurd.
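A minimal sketch makes the trade-off concrete: under full independence, n marginals determine all $2^n$ joint probabilities (the marginal values below are hypothetical, chosen only for illustration).

```python
import itertools

# Sketch: joint probability under the full-independence assumption.
# The marginals are hypothetical numbers for illustration.
marginals = [0.2, 0.7, 0.4]  # P(X_i = 1) for three binary variables

def joint_prob(x):
    """P(x) = prod_i P(X_i = x_i): only n parameters, but no relationships."""
    prob = 1.0
    for xi, pi in zip(x, marginals):
        prob *= pi if xi == 1 else 1 - pi
    return prob

# The n marginals fully determine all 2^n joint probabilities.
total = sum(joint_prob(x) for x in itertools.product([0, 1], repeat=3))  # sums to 1
```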
The other extreme—full dependence:
Assuming full dependence preserves all relationships but leads to intractable complexity. Neither extreme works.
The solution: structured independence
Real-world systems exhibit conditional independence—variables are related, but not all-to-all. A patient's cough depends on their disease, but given the disease, it might be independent of their geographic location. Graphical models exploit this structure by encoding precisely which conditional independencies hold.
A Probabilistic Graphical Model (PGM) is a marriage of two mathematical languages: probability theory, which supplies random variables, distributions, and the rules for reasoning under uncertainty; and graph theory, which supplies nodes, edges, and algorithms for reasoning about structure.
The key insight is profound: the graph structure encodes conditional independence assumptions, allowing the joint distribution to factorize into a product of simpler terms.
Formal definition:
A PGM consists of:

- A graph $G = (V, E)$ whose nodes correspond to random variables $X_1, \ldots, X_n$
- A set of parameters (local conditional distributions or potential functions) attached to the graph's structure
The graph serves as a qualitative representation of dependencies, while the parameters provide the quantitative specification of probabilities.
| Component | From Probability Theory | From Graph Theory |
|---|---|---|
| Core object | Random variables, probability distributions | Nodes, edges, graph structure |
| Relationships | Statistical dependencies, conditional probabilities | Edges indicating direct influence |
| Inference | Marginalization, conditioning, Bayes' rule | Message passing, graph algorithms |
| Independence | Conditional independence assertions | Graph separation criteria (d-separation, Markov blanket) |
| Representation | Exponentially large joint distribution | Compact graphical structure |
Why graphs?
Graphs are a natural language for expressing relationships:

- Nodes make the objects of interest (random variables) explicit
- Edges make direct interactions explicit, and missing edges make independence explicit
- Decades of graph algorithms (traversal, separation, message passing) translate directly into probabilistic operations
The power of the graphical representation lies in what it doesn't include. Every missing edge is an assertion that two variables are conditionally independent given appropriate conditioning sets. These absent edges are what enable compact representation and efficient inference.
In probabilistic graphical models, what's NOT connected is as important as what IS connected. Every missing edge represents a conditional independence assumption that simplifies the model. A sparse graph (few edges) implies strong independence assumptions and leads to more tractable inference. A dense graph preserves more dependencies but increases computational cost.
The defining characteristic of PGMs is that the graph structure implies a factorization of the joint distribution. Instead of specifying the full joint directly, we specify a set of factors—smaller functions that combine to define the joint.
The general form:
$$P(X_1, X_2, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c)$$
Where:

- $Z$ is the partition function, a normalization constant ensuring the distribution sums to 1
- $\mathcal{C}$ is a set of cliques (subsets of variables) determined by the graph
- $\phi_c(X_c)$ is a non-negative factor (potential function) defined over the variables $X_c$ in clique $c$
The magic is that each factor $\phi_c$ depends only on a small subset of variables—those that are directly connected in the graph. If the cliques are small, the factors are small, and the representation is compact.
```python
# Example: Factorization comparison

def full_joint_complexity(n_variables):
    """Parameters needed for the full joint distribution over n binary variables."""
    return 2**n_variables - 1

def chain_factorization_complexity(n_variables):
    """Parameters for a chain-structured PGM: X1 -> X2 -> ... -> Xn.

    Factorizes as: P(X1) * P(X2|X1) * P(X3|X2) * ... * P(Xn|Xn-1)
    - P(X1): 1 parameter
    - P(Xi|Xi-1): 2 parameters each (for binary variables)
    Total: 1 + 2*(n-1) = 2n - 1
    This is LINEAR in n, not exponential!
    """
    return 1 + 2 * (n_variables - 1)

# Compare complexities
for n in [5, 10, 20, 50, 100]:
    full = full_joint_complexity(n)
    chain = chain_factorization_complexity(n)
    ratio = full / chain
    print(f"n={n:3d}: Full joint: {full:30,d} | Chain PGM: {chain:4d} | Ratio: {ratio:,.0f}x")

# Output (abridged):
# n=  5: Full joint: 31                    | Chain PGM:    9 | Ratio: 3x
# n= 10: Full joint: 1,023                 | Chain PGM:   19 | Ratio: 54x
# n= 20: Full joint: 1,048,575             | Chain PGM:   39 | Ratio: 26,887x
# n= 50: Full joint: 1,125,899,906,842,623 | Chain PGM:   99 | Ratio: 11,372,726,331,744x
# n=100: Full joint: (a 31-digit number)   | Chain PGM:  199 | Ratio: ~6 * 10^27x
```

Why factorization enables tractability:
Compact representation: Instead of $2^n$ parameters, we need only parameters for each factor. For a chain of n variables, that's $O(n)$ parameters.
Efficient learning: With fewer parameters, we need less data to estimate them reliably. Each factor can be learned from local statistics.
Efficient inference: Marginalizing or conditioning can exploit the factorization structure. We can "push" sums inside products, computing intermediate results efficiently.
Modularity: Factors represent local relationships that can be understood, validated, and modified independently.
The key insight: By encoding conditional independence through the graph structure, PGMs achieve exponential compression of the representation while preserving the essential dependencies for modeling real-world phenomena.
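To make the "push sums inside products" idea concrete, here is a minimal sketch of forward marginalization on a binary chain; the conditional distributions are randomly generated placeholders.

```python
import numpy as np

# Sketch: computing P(Xn) on a chain X1 -> X2 -> ... -> Xn by pushing
# each sum inside the product, one variable at a time.
rng = np.random.default_rng(0)
n = 100

p1 = np.array([0.6, 0.4])                # P(X1)
cpts = rng.uniform(size=(n - 1, 2, 2))   # cpts[i][a, b] = P(X_{i+2}=b | X_{i+1}=a)
cpts /= cpts.sum(axis=2, keepdims=True)  # normalize each conditional

def marginal_last(p1, cpts):
    """P(Xn) = sum over x1..x_{n-1} of P(x1) * prod_i P(x_{i+1} | x_i).

    Summing variable by variable costs O(n * k^2) for k states per variable,
    instead of the O(k^n) cost of summing the full joint.
    """
    msg = p1
    for T in cpts:
        msg = msg @ T  # msg(x_{i+1}) = sum_{x_i} msg(x_i) * P(x_{i+1} | x_i)
    return msg

p_last = marginal_last(p1, cpts)  # a valid distribution: entries sum to 1
```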
Probabilistic graphical models come in two primary flavors, distinguished by the type of graph they use:
1. Directed Graphical Models (Bayesian Networks)
Use directed acyclic graphs (DAGs) where edges have arrows indicating causal or generative direction. Also called Belief Networks, Bayes Nets, or Causal Networks.
Factorization: $$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Parents}(X_i))$$
Each variable is conditioned on its parent nodes—the nodes with arrows pointing to it.
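As a toy illustration of this factorization, consider a three-node network Disease → Fever, Disease → Cough; all probabilities below are invented for illustration.

```python
# Toy Bayesian network: Disease -> Fever, Disease -> Cough.
# All probabilities are made up for illustration.
p_disease = {True: 0.01, False: 0.99}     # P(Disease)
p_fever_given = {True: 0.9, False: 0.1}   # P(Fever=true | Disease)
p_cough_given = {True: 0.8, False: 0.05}  # P(Cough=true | Disease)

def joint(disease, fever, cough):
    """P(D, F, C) = P(D) * P(F | D) * P(C | D): each node given its parents."""
    pf = p_fever_given[disease] if fever else 1 - p_fever_given[disease]
    pc = p_cough_given[disease] if cough else 1 - p_cough_given[disease]
    return p_disease[disease] * pf * pc

# Diagnosis by Bayes' rule: P(Disease | Fever, Cough)
num = joint(True, True, True)
den = num + joint(False, True, True)
posterior = num / den  # about 0.59 with these made-up numbers
```

Three conditional tables (5 parameters) replace the 7 parameters of the full joint, and the gap widens rapidly as symptoms are added.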
2. Undirected Graphical Models (Markov Random Fields)
Use undirected graphs where edges have no direction. Also called Markov Networks or Gibbs Random Fields.
Factorization: $$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c)$$
Potential functions are defined over cliques—maximal fully-connected subgraphs.
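A minimal sketch of this factorization, using a hypothetical three-variable model with Ising-style agreement potentials (the weight is arbitrary):

```python
import itertools
import numpy as np

# Hypothetical 3-variable pairwise MRF with edges (0,1) and (1,2).
# phi(x_i, x_j) = exp(w * [x_i == x_j]) rewards neighboring agreement.
w = 1.5
edges = [(0, 1), (1, 2)]

def unnormalized(x):
    """Product of edge potentials, before normalization."""
    return np.exp(sum(w * (x[i] == x[j]) for i, j in edges))

# The partition function Z sums the unnormalized product over all 2^n states.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)

def prob(x):
    return unnormalized(x) / Z
```

Note that Z requires a sum over all states; for tiny models brute force works, but in general computing Z is itself a hard inference problem.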
When to use which?
Directed models are natural when:

- There is a clear causal or generative direction (diseases cause symptoms, classes generate features)
- You want to specify the model through local conditional distributions $P(X_i \mid \text{Parents}(X_i))$

Undirected models are natural when:

- Interactions are symmetric, with no obvious direction (neighboring pixels in an image, atoms in a lattice)
- It is easier to specify compatibility between configurations than conditional probabilities
Hybrid models (Factor Graphs):
Some applications combine both types. Conditional Random Fields (CRFs) are undirected models conditioned on observations. Chain graphs allow both directed and undirected edges. The field has developed general-purpose representations like factor graphs that can express any factorization.
A given probability distribution can often be represented by both directed and undirected models—the choice depends on which conditional independencies you want to encode directly. However, some independencies can only be captured by one type. This observation leads to the study of 'I-maps' (independence maps) and the different expressive powers of directed versus undirected models.
Working with probabilistic graphical models involves three core computational problems. Understanding these problems and their solutions is essential for applying PGMs in practice.
Problem 1: Representation
Given: A domain with random variables and their relationships

Goal: Construct a graph and parameterization that accurately models the joint distribution
This involves:

- Choosing which random variables to include
- Deciding between a directed or undirected graph
- Specifying the edges, i.e., the independence assumptions
- Parameterizing the factors (conditional distributions or potentials)
Problem 2: Inference
Given: A PGM with known structure and parameters, plus observed evidence

Goal: Compute posterior probabilities for unobserved variables
Key inference tasks:

- Marginal inference: compute $P(X_i \mid \text{evidence})$ for query variables
- MAP inference: find the most probable joint assignment of the unobserved variables
- Likelihood evaluation: compute the probability of the observed evidence
Problem 3: Learning
Given: Data samples from a domain, possibly with missing values

Goal: Learn the graph structure and/or parameters from data
| Problem | Input | Output | Key Challenges |
|---|---|---|---|
| Representation | Domain knowledge, independence assumptions | Graph structure + parameters | Model selection, expressiveness vs tractability |
| Inference | Model + observed evidence | Posterior probabilities | NP-hard in general, requires approximation |
| Learning | Data samples | Learned model (structure and/or parameters) | Structure learning is super-exponential, missing data |
The computational challenge:
Both inference and learning are NP-hard in general. For inference, even computing the partition function Z is #P-complete, a counting problem at least as hard as any problem in NP. For structure learning, the space of possible graphs grows super-exponentially with the number of variables.
However, the graph structure often enables efficient algorithms:

- On trees and chains, exact inference (belief propagation) runs in time linear in the number of variables
- On graphs of low treewidth, junction-tree methods remain tractable
- Sparse structure lets sums be pushed inside products (variable elimination)
When exact inference is intractable, we turn to approximate methods:

- Sampling methods (MCMC, Gibbs sampling, particle filters) that draw samples from the posterior
- Variational methods that optimize a tractable approximation to the posterior
- Loopy belief propagation, which applies message passing to graphs with cycles
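As one example of an approximate method, here is a minimal Gibbs sampler sketch for a tiny pairwise agreement model; all parameters are hypothetical.

```python
import numpy as np

# Sketch: Gibbs sampling for a tiny pairwise model (hypothetical parameters).
# Each variable is resampled in turn from its conditional given the rest.
rng = np.random.default_rng(0)
w = 1.0
edges = [(0, 1), (1, 2)]
n = 3

def unnormalized(x):
    return np.exp(sum(w * (x[i] == x[j]) for i, j in edges))

def gibbs(n_sweeps=20000):
    x = list(rng.integers(0, 2, size=n))
    counts = np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(n):
            # The conditional of X_i is proportional to the joint with X_i
            # set to each value; only factors touching X_i actually matter.
            x[i] = 0
            p0 = unnormalized(x)
            x[i] = 1
            p1 = unnormalized(x)
            x[i] = int(rng.random() < p1 / (p0 + p1))
        counts += x
    return counts / n_sweeps  # estimated P(X_i = 1)

est = gibbs()  # close to the exact marginals (0.5 each, by symmetry)
```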
Probabilistic graphical models are not just a theoretical framework—they underpin many successful machine learning systems and provide foundational concepts that extend to deep learning.
Direct applications: PGMs power working systems in language, vision, healthcare, robotics, and bioinformatics, as the table below illustrates.

Conceptual foundations:

Even when not using PGMs directly, ML practitioners benefit from understanding:

- Conditional independence and factorization, which underlie efficient model design
- Latent variable modeling and the distinction between inference and learning
- Message passing and variational inference, which reappear throughout deep learning
| Domain | PGM Application | Why It Works |
|---|---|---|
| Natural Language | HMMs, CRFs for sequence labeling | Sequential structure with local dependencies |
| Computer Vision | MRFs for image segmentation, stereo | Spatial structure with neighbor dependencies |
| Healthcare | Bayesian networks for diagnosis, epidemiology | Causal structure with interpretability requirements |
| Robotics | Particle filters, SLAM as factor graphs | Temporal structure with sequential observations |
| Bioinformatics | HMMs for gene finding, phylogenetic trees | Biological sequences with Markovian properties |
| Deep Learning | VAEs, energy-based models, attention as message passing | Latent variable modeling, structured outputs |
Modern deep learning incorporates many PGM ideas: Variational Autoencoders use variational inference, attention mechanisms can be viewed as soft message passing, graph neural networks extend belief propagation, and energy-based models are undirected graphical models with neural network potentials. Understanding PGMs provides crucial intuition for these advanced techniques.
In an era of black-box deep learning, probabilistic graphical models offer a crucial advantage: interpretability. The graph structure makes the model's assumptions explicit and human-readable.
What the graph tells us:

- Which variables directly influence which others
- Which variables become independent once certain evidence is observed
- What minimal information (the Markov blanket) suffices to predict a given variable

Benefits for stakeholders:

- Domain experts can inspect and validate the model's assumptions
- Practitioners can trace how evidence propagates through the graph to explain a prediction
- Missing or spurious edges can be debated and corrected explicitly
Comparison with deep learning:
PGMs and deep learning are not mutually exclusive. Many state-of-the-art systems combine both: using neural networks to learn complex feature representations while using graphical model structure to encode known relationships and enable interpretable reasoning. The choice depends on the task's requirements for interpretability, data availability, and the nature of domain knowledge.
One of the most powerful aspects of PGMs is their role as a unifying framework. Many seemingly different machine learning models are special cases of graphical models:
| Model | PGM Representation |
|---|---|
| Naive Bayes | Bayesian network with class → all features |
| Logistic Regression | Conditional model; the discriminative counterpart of naive Bayes |
| Hidden Markov Models | Chain-structured Bayesian network |
| Kalman Filters | Linear-Gaussian Bayesian network |
| Mixture Models | Bayesian network with latent cluster variable |
| LDA Topic Models | Hierarchical Bayesian network with Dirichlet priors |
| Ising Models | Pairwise Markov random field |
| Boltzmann Machines | Fully-connected MRF with hidden units |
Why unification matters:

- Algorithms developed for one model transfer to every other model in the family
- Theoretical guarantees carry over automatically
- New models can be designed by recombining known structural pieces

The PGM research program:

The graphical models community has developed a remarkable toolkit:

- Exact inference: variable elimination, belief propagation, junction trees
- Approximate inference: MCMC sampling, variational methods, loopy belief propagation
- Learning: maximum likelihood, Bayesian estimation, EM for missing data, structure search

This toolkit is general-purpose. Once you understand the PGM framework, you can:

- Recognize familiar models as special cases and apply the right algorithms immediately
- Design new models for new domains and inherit inference machinery for free
- Read research across subfields that share the graphical-model vocabulary
The vision: A 'compiler' for probabilistic models. Specify your model declaratively; let automated tools handle inference and learning.
This vision is realized in modern probabilistic programming languages like Stan, PyMC, Pyro, and TensorFlow Probability. You specify a graphical model using code, and the system automatically constructs and executes inference algorithms. Understanding PGMs is essential for using these powerful tools effectively.
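The flavor of that workflow can be sketched in plain Python: declare the variables and factors, then let one generic routine answer any query by enumeration. This is a toy stand-in; the probabilities are hypothetical, and real systems automate far more scalable inference.

```python
import itertools

# Toy 'declarative' probabilistic model: Rain and Sprinkler both make
# the grass wet. All probabilities are hypothetical.
variables = ["Rain", "Sprinkler", "WetGrass"]

def p_wet(a):
    """P(WetGrass | Rain, Sprinkler): wet is likely if either parent is true."""
    if a["Rain"] or a["Sprinkler"]:
        return 0.95 if a["WetGrass"] else 0.05
    return 0.05 if a["WetGrass"] else 0.95

factors = [
    lambda a: 0.2 if a["Rain"] else 0.8,       # P(Rain)
    lambda a: 0.1 if a["Sprinkler"] else 0.9,  # P(Sprinkler)
    p_wet,                                     # P(WetGrass | parents)
]

def query(target, evidence):
    """P(target = True | evidence), by summing the factor product over all states."""
    weights = {True: 0.0, False: 0.0}
    for values in itertools.product([True, False], repeat=len(variables)):
        a = dict(zip(variables, values))
        if all(a[k] == v for k, v in evidence.items()):
            w = 1.0
            for f in factors:
                w *= f(a)
            weights[a[target]] += w
    return weights[True] / (weights[True] + weights[False])

p_rain = query("Rain", {"WetGrass": True})  # observing wet grass raises P(Rain)
```

The model specification (the factor list) is cleanly separated from the inference engine (`query`), which is exactly the separation probabilistic programming languages exploit at scale.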
This module will give you a deep understanding of the foundational concepts in probabilistic graphical models. Here's what's ahead:
Page 1: Directed vs Undirected Models
We'll examine the two major families of PGMs in detail—Bayesian networks and Markov random fields. You'll understand their different factorization properties, the types of dependencies each can represent, and when to choose one over the other.
Page 2: Conditional Independence
The cornerstone of graphical models. We'll formalize what conditional independence means, how it enables factorization, and how to read independence from graphs. This is the mathematical foundation everything else builds on.
Page 3: D-Separation
The algorithmic tool for reading conditional independencies from directed graphs. We'll master d-separation—a graph-theoretic criterion that tells you exactly which variables are independent given observed evidence.
Page 4: Markov Blanket
For any variable in a graphical model, what's the minimal set of other variables that renders it independent of all others? The Markov blanket answers this question, with profound implications for inference and learning.
You now understand what probabilistic graphical models are and why they matter. They provide a principled framework for representing complex probability distributions compactly by exploiting conditional independence structure. The graph encodes qualitative assumptions about which variables directly influence which others, while the parameters encode quantitative probabilities. This foundation enables the inference and learning algorithms we'll explore throughout this chapter. Next, we'll examine the fundamental distinction between directed and undirected graphical models.