Exact inference—variable elimination, junction trees—is the gold standard when tractable. But many real-world graphical models have structure that makes exact computation infeasible: high treewidth, continuous variables, or simply too many states to enumerate. When exact methods fail, we turn to approximate inference.
Approximate inference is a rich field with two major paradigms: sampling (Monte Carlo methods) and optimization (variational methods). Each paradigm offers different trade-offs in accuracy, speed, and theoretical guarantees. Mastering both is essential for practical probabilistic modeling.
By the end of this page, you will understand: (1) the fundamental divide between sampling and variational methods, (2) key sampling algorithms including importance sampling, MCMC, and Gibbs sampling, (3) variational inference basics including mean-field approximation, (4) practical trade-offs and selection criteria, and (5) how these methods apply to graphical model inference.
Approximate inference methods can be organized along several dimensions: stochastic vs. deterministic, local vs. global, and asymptotically exact vs. biased. Understanding this taxonomy helps in selecting the right method for a given problem.
The Two Major Paradigms:
Sampling (Monte Carlo) Methods: Generate samples from (or approximately from) the target distribution. Compute expectations by averaging over samples. Asymptotically exact as sample count increases.
Variational Methods: Approximate the target distribution with a simpler one from a tractable family. Optimize to make the approximation as close as possible. Deterministic and fast, but inherently biased.
| Aspect | Sampling Methods | Variational Methods |
|---|---|---|
| Core idea | Draw samples, estimate by averaging | Optimize within tractable family |
| Stochastic? | Yes—different random samples each run | No—deterministic optimization |
| Asymptotic behavior | Converges to true answer with enough samples | Converges to best approximation (may be biased) |
| Error characterization | Variance (decreases with samples) | Bias (depends on approximation family) |
| Speed | Can be slow (many samples needed) | Often fast (fixed optimization steps) |
| Parallelization | Highly parallelizable (independent samples) | Sequential updates; some parallel variants |
| Continuous variables | Handles naturally | Requires analytic tractability |
| Multimodal distributions | Can explore all modes (with care) | Often collapses to one mode |
Loopy belief propagation, covered in the previous page, is actually a variational method! Its fixed points minimize the Bethe free energy, a variational approximation. This connects message-passing to the broader variational framework.
Monte Carlo methods estimate expectations by averaging over random samples. If we want E[f(X)] where X ~ P, and we can draw samples x₁, x₂, ..., xₙ from P, then:
Ê[f(X)] = (1/n) Σᵢ f(xᵢ) → E[f(X)] as n → ∞
For graphical models, we want marginals P(Xᵢ = xᵢ), which are expectations of indicator functions. The challenge is: how do we sample from complex, high-dimensional joint distributions?
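Before turning to that question, the estimator itself can be sketched in a few lines of plain NumPy (the distribution and test function below are illustrative, not a graphical model): samples from a known distribution recover both a general expectation and a marginal, since a marginal is just the expectation of an indicator.

```python
import numpy as np

rng = np.random.default_rng(0)

# A known discrete distribution over {0, 1, 2}
p = np.array([0.2, 0.5, 0.3])
samples = rng.choice(3, size=100_000, p=p)

# Monte Carlo estimate of E[f(X)] for f(x) = x^2
f_estimate = np.mean(samples.astype(float) ** 2)
f_exact = np.sum(p * np.arange(3) ** 2)  # 0.2*0 + 0.5*1 + 0.3*4 = 1.7

# A marginal P(X = 1) is E[1{X = 1}]: average an indicator
marginal_estimate = np.mean(samples == 1)
```

With 100,000 samples both estimates land within a fraction of a percent of the true values, and the error shrinks as 1/√n.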
```python
import numpy as np
from typing import List, Dict, Callable
from scipy.special import logsumexp


def ancestral_sampling(
    bayesian_network: Dict[str, Dict],
    num_samples: int = 1000
) -> List[Dict[str, int]]:
    """
    Ancestral (forward) sampling from a Bayesian network.

    Works by sampling variables in topological order,
    conditioning on already-sampled parents.

    This is the simplest sampling method for BNs:
    - Exact samples from the joint distribution
    - Only works for Bayesian networks (not MRFs)
    - Cannot incorporate evidence easily

    Args:
        bayesian_network: Dict mapping variable to {parents, cpt}
        num_samples: Number of samples to draw

    Returns:
        List of sample dicts {var: value}
    """
    # Get topological order
    order = _topological_sort(bayesian_network)

    samples = []
    for _ in range(num_samples):
        sample = {}
        for var in order:
            info = bayesian_network[var]
            parents = info['parents']
            cpt = info['cpt']

            # Get parent values
            parent_values = tuple(sample[p] for p in parents)

            # Index into CPT to get distribution for this variable
            # CPT shape: (parent1_card, parent2_card, ..., var_card)
            if parents:
                dist = cpt[parent_values]
            else:
                dist = cpt

            # Sample from conditional distribution
            sample[var] = np.random.choice(len(dist), p=dist)

        samples.append(sample)

    return samples


def rejection_sampling(
    target_unnorm: Callable,
    proposal: Callable,
    proposal_sample: Callable,
    M: float,
    num_samples: int = 1000
) -> np.ndarray:
    """
    Rejection sampling from unnormalized target distribution.

    Requires a proposal distribution q(x) such that:
    - We can sample from q(x)
    - We can evaluate q(x)
    - target(x) <= M * q(x) for all x

    Acceptance rate = Z / M, where Z is target's normalizing constant.
    Poor for high dimensions or when M >> Z.

    Args:
        target_unnorm: Function returning unnormalized target density
        proposal: Function returning proposal density
        proposal_sample: Function to sample from proposal
        M: Upper bound constant (target <= M * proposal everywhere)
        num_samples: Number of accepted samples desired

    Returns:
        Array of accepted samples
    """
    samples = []
    while len(samples) < num_samples:
        # Sample from proposal
        x = proposal_sample()

        # Compute acceptance probability
        p_accept = target_unnorm(x) / (M * proposal(x))

        # Accept with this probability
        if np.random.random() < p_accept:
            samples.append(x)

    return np.array(samples)


def _topological_sort(bn: Dict) -> List[str]:
    """Return variables in topological order."""
    visited = set()
    order = []

    def visit(var):
        if var in visited:
            return
        visited.add(var)
        for parent in bn[var]['parents']:
            visit(parent)
        order.append(var)

    for var in bn:
        visit(var)

    return order
```

Importance sampling avoids rejection by reweighting samples from a different distribution. Instead of sampling from P (hard), we sample from a proposal Q (easy) and weight each sample by the importance ratio.
The Core Identity:
E_P[f(X)] = E_Q[f(X) · P(X)/Q(X)] = E_Q[f(X) · w(X)]
where w(X) = P(X)/Q(X) is the importance weight.
For unnormalized targets where P(X) = ψ(X)/Z, we use self-normalized importance sampling:
Ê[f(X)] = Σᵢ w̃ᵢ f(xᵢ) where w̃ᵢ = wᵢ / Σⱼ wⱼ
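As a sanity check of these identities, the sketch below (illustrative 1-D densities, not a graphical model) estimates moments of a standard normal target, known only up to its normalizing constant, using samples from a deliberately mismatched Gaussian proposal.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)

# Unnormalized target: psi(x) = exp(-x^2 / 2), i.e. N(0, 1) up to Z
def log_psi(x):
    return -0.5 * x ** 2

# Proposal Q = N(1, 2^2): easy to sample from and to evaluate
mu_q, sigma_q = 1.0, 2.0
x = rng.normal(mu_q, sigma_q, size=200_000)
log_q = -0.5 * ((x - mu_q) / sigma_q) ** 2 - np.log(sigma_q * np.sqrt(2 * np.pi))

# Self-normalized weights: w_i proportional to psi(x_i) / q(x_i)
log_w = log_psi(x) - log_q
w = np.exp(log_w - logsumexp(log_w))

# Estimate E_P[X] (true value 0) and E_P[X^2] (true value 1)
mean_est = np.sum(w * x)
second_moment_est = np.sum(w * x ** 2)
```

Note that Z never appears: the normalizing constant cancels when the weights are normalized, which is exactly why self-normalized IS works for unnormalized targets.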
```python
from typing import Tuple  # missing from the earlier imports


def importance_sampling(
    target_unnorm: Callable,
    proposal_log_prob: Callable,
    proposal_sample: Callable,
    f: Callable,
    num_samples: int = 1000
) -> Tuple[float, float]:
    """
    Self-normalized importance sampling.

    Estimates E_P[f(X)] using samples from proposal Q.
    Handles unnormalized target distributions.

    Args:
        target_unnorm: Function returning log of unnormalized target
        proposal_log_prob: Function returning log proposal density
        proposal_sample: Function to sample from proposal
        f: Function to compute expectation of
        num_samples: Number of samples

    Returns:
        - Estimated expectation
        - Effective sample size (ESS)
    """
    samples = [proposal_sample() for _ in range(num_samples)]

    # Compute log importance weights (unnormalized)
    log_weights = []
    for x in samples:
        log_w = target_unnorm(x) - proposal_log_prob(x)
        log_weights.append(log_w)
    log_weights = np.array(log_weights)

    # Normalize weights using logsumexp for stability
    log_norm = logsumexp(log_weights)
    normalized_weights = np.exp(log_weights - log_norm)

    # Compute weighted average
    f_values = np.array([f(x) for x in samples])
    estimate = np.sum(normalized_weights * f_values)

    # Effective sample size: measures weight concentration
    # ESS = 1 / sum(w_i^2), where w_i are normalized weights
    ess = 1.0 / np.sum(normalized_weights ** 2)

    return estimate, ess


def likelihood_weighting(
    bayesian_network: Dict[str, Dict],
    evidence: Dict[str, int],
    query_var: str,
    num_samples: int = 1000
) -> np.ndarray:
    """
    Likelihood weighting: importance sampling for Bayesian networks.

    Proposal: sample non-evidence variables ancestrally, fix evidence.
    Weight: product of evidence likelihoods given parents.

    Much more efficient than rejection sampling with evidence.

    Args:
        bayesian_network: BN structure
        evidence: Observed variable assignments
        query_var: Variable whose marginal we want
        num_samples: Number of weighted samples

    Returns:
        Estimated marginal distribution for query_var
    """
    order = _topological_sort(bayesian_network)
    query_card = bayesian_network[query_var]['cpt'].shape[-1]

    weighted_counts = np.zeros(query_card)
    total_weight = 0.0

    for _ in range(num_samples):
        sample = {}
        log_weight = 0.0

        for var in order:
            info = bayesian_network[var]
            parents = info['parents']
            cpt = info['cpt']

            parent_values = tuple(sample[p] for p in parents)
            if parents:
                dist = cpt[parent_values]
            else:
                dist = cpt

            if var in evidence:
                # Evidence variable: don't sample, add to weight
                sample[var] = evidence[var]
                log_weight += np.log(dist[evidence[var]] + 1e-10)
            else:
                # Hidden variable: sample as usual
                sample[var] = np.random.choice(len(dist), p=dist)

        weight = np.exp(log_weight)
        weighted_counts[sample[query_var]] += weight
        total_weight += weight

    return weighted_counts / total_weight


class SequentialMonteCarlo:
    """
    Sequential Monte Carlo (particle filtering) for dynamic models.

    Maintains a weighted set of particles (samples) that are
    propagated, reweighted, and resampled as evidence arrives.
    Essential for online inference in temporal graphical models
    (HMMs, DBNs).
    """

    def __init__(
        self,
        transition: Callable,
        emission: Callable,
        initial: Callable,
        num_particles: int = 100
    ):
        """
        Initialize SMC.

        Args:
            transition: P(x_t | x_{t-1}) sampler
            emission: P(y_t | x_t) density evaluator
            initial: P(x_0) sampler
            num_particles: Number of particles to maintain
        """
        self.transition = transition
        self.emission = emission
        self.initial = initial
        self.num_particles = num_particles
        self.particles = None
        self.weights = None

    def initialize(self):
        """Sample initial particles from prior."""
        self.particles = [self.initial() for _ in range(self.num_particles)]
        self.weights = np.ones(self.num_particles) / self.num_particles

    def step(self, observation):
        """
        Process one observation: propagate, reweight, resample.
        """
        # Propagate: sample new states from transition
        new_particles = [self.transition(p) for p in self.particles]

        # Reweight: multiply by observation likelihood
        log_weights = np.log(self.weights + 1e-10)
        for i, particle in enumerate(new_particles):
            log_weights[i] += np.log(self.emission(observation, particle) + 1e-10)

        # Normalize
        log_norm = logsumexp(log_weights)
        self.weights = np.exp(log_weights - log_norm)
        self.particles = new_particles

        # Resample if effective sample size is low
        ess = 1.0 / np.sum(self.weights ** 2)
        if ess < self.num_particles / 2:
            self._resample()

    def _resample(self):
        """Multinomial resampling to rejuvenate the particle set."""
        indices = np.random.choice(
            self.num_particles, size=self.num_particles, p=self.weights
        )
        self.particles = [self.particles[i] for i in indices]
        self.weights = np.ones(self.num_particles) / self.num_particles
```

Importance sampling suffers from weight degeneracy in high dimensions: a few samples dominate, and effective sample size becomes tiny. This is why naive IS doesn't scale to complex graphical models. Sequential methods (SMC) and MCMC address this limitation.
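The degeneracy is easy to demonstrate: keep the same slightly mismatched Gaussian proposal in every dimension and watch the effective sample size collapse as the dimension grows (the dimensions and variances below are arbitrary choices for illustration).

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(3)
n = 20_000

def ess_fraction(d, sigma_q=1.5):
    """ESS/n for target N(0, I_d) with proposal N(0, sigma_q^2 I_d)."""
    x = rng.normal(0.0, sigma_q, size=(n, d))
    # log w = log p(x) - log q(x); normalizing constants cancel
    # after self-normalization, so they are dropped here
    log_w = -0.5 * np.sum(x ** 2, axis=1) + 0.5 * np.sum((x / sigma_q) ** 2, axis=1)
    w = np.exp(log_w - logsumexp(log_w))
    return 1.0 / (n * np.sum(w ** 2))

# ESS/n degrades roughly geometrically in the dimension d
fractions = [ess_fraction(d) for d in (1, 10, 50)]
```

Even with a per-dimension mismatch this mild, the ESS fraction drops from roughly 0.8 at d = 1 to a tiny fraction of the sample budget by d = 50.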
MCMC constructs a Markov chain whose stationary distribution is the target P. By running the chain long enough, samples from the chain approximate samples from P—even when we only know P up to a normalizing constant.
Key Insight:
We don't need to sample P directly. Instead, we design a transition kernel T(x' | x) such that: (1) P is a stationary distribution of the chain, i.e. Σₓ P(x) T(x' | x) = P(x'), and (2) the chain is ergodic, so it converges to that stationary distribution from any starting state.
A sufficient condition is detailed balance: P(x) T(x' | x) = P(x') T(x | x')
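Detailed balance can be made concrete by building the Metropolis–Hastings kernel for a small discrete target and checking the condition numerically. (The 3-state target and symmetric proposal here are illustrative; the MH acceptance rule itself is introduced in the code below this section.)

```python
import numpy as np

# Target distribution on 3 states
p = np.array([0.2, 0.5, 0.3])

# Symmetric proposal: from each state, propose each other state w.p. 1/2
Q = (np.ones((3, 3)) - np.eye(3)) / 2

# Metropolis acceptance: alpha(x -> x') = min(1, p(x') / p(x))
T = np.zeros((3, 3))
for x in range(3):
    for x2 in range(3):
        if x != x2:
            T[x, x2] = Q[x, x2] * min(1.0, p[x2] / p[x])
    T[x, x] = 1.0 - T[x].sum()  # rejected moves stay put

# Detailed balance: p(x) T(x'|x) = p(x') T(x|x') for every pair
flow = p[:, None] * T
assert np.allclose(flow, flow.T)

# Hence p is stationary: p T = p
assert np.allclose(p @ T, p)
```

Note that only the ratios p(x')/p(x) enter the kernel, which is why MH works when P is known only up to a normalizing constant.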
```python
from dataclasses import dataclass


@dataclass
class Factor:
    """Minimal factor: a potential table over named discrete variables.

    (Assumed structure; a Factor class is referenced on this page
    but not defined elsewhere.)
    """
    variables: List[str]
    potential: np.ndarray  # one axis per variable, in order


class MetropolisHastings:
    """
    Metropolis-Hastings MCMC for graphical model inference.

    General-purpose MCMC that works with any proposal distribution.
    Automatically satisfies detailed balance via accept/reject step.
    """

    def __init__(
        self,
        log_target: Callable,
        proposal_sample: Callable,
        proposal_log_prob: Callable = None
    ):
        """
        Initialize MH sampler.

        Args:
            log_target: Function returning log unnormalized target density
            proposal_sample: Function(current_x) -> proposed_x
            proposal_log_prob: Function(x, x') -> log Q(x' | x).
                If None, assumes a symmetric proposal.
        """
        self.log_target = log_target
        self.proposal_sample = proposal_sample
        self.proposal_log_prob = proposal_log_prob
        self.symmetric = proposal_log_prob is None

    def sample(
        self,
        initial: np.ndarray,
        num_samples: int,
        burn_in: int = 100,
        thin: int = 1
    ) -> Tuple[List[np.ndarray], float]:
        """
        Run MH to collect samples.

        Args:
            initial: Starting state
            num_samples: Number of samples to return
            burn_in: Initial samples to discard
            thin: Keep every thin-th sample

        Returns:
            - List of samples
            - Acceptance rate
        """
        samples = []
        current = initial
        current_log_prob = self.log_target(current)

        total_steps = burn_in + num_samples * thin
        accepted = 0

        for step in range(total_steps):
            # Propose new state
            proposed = self.proposal_sample(current)
            proposed_log_prob = self.log_target(proposed)

            # Compute acceptance probability
            log_alpha = proposed_log_prob - current_log_prob
            if not self.symmetric:
                # Add Hastings correction for asymmetric proposals
                log_alpha += self.proposal_log_prob(proposed, current)
                log_alpha -= self.proposal_log_prob(current, proposed)

            # Accept or reject
            if np.log(np.random.random()) < log_alpha:
                current = proposed
                current_log_prob = proposed_log_prob
                accepted += 1

            # Collect sample after burn-in, with thinning
            if step >= burn_in and (step - burn_in) % thin == 0:
                samples.append(current.copy())

        acceptance_rate = accepted / total_steps
        return samples, acceptance_rate


class GibbsSampling:
    """
    Gibbs sampling for graphical models.

    Special case of MH where each variable is sampled from its full
    conditional distribution. Always accepts (acceptance rate = 1).

    Particularly efficient for graphical models because full
    conditionals depend only on the Markov blanket.
    """

    def __init__(
        self,
        variables: List[str],
        cardinalities: Dict[str, int],
        factors: List[Factor]
    ):
        """
        Initialize Gibbs sampler for a factor graph.

        Args:
            variables: List of variable names
            cardinalities: Dict mapping variable to number of values
            factors: List of factors defining the distribution
        """
        self.variables = variables
        self.cardinalities = cardinalities
        self.factors = factors

        # Precompute which factors involve each variable
        self.var_to_factors = {v: [] for v in variables}
        for factor in factors:
            for var in factor.variables:
                self.var_to_factors[var].append(factor)

    def compute_full_conditional(
        self,
        var: str,
        current_state: Dict[str, int]
    ) -> np.ndarray:
        """
        Compute P(var | all other variables) = P(var | Markov blanket).

        The full conditional only depends on factors containing var.
        """
        card = self.cardinalities[var]
        log_probs = np.zeros(card)

        for value in range(card):
            # Temporarily set this value
            test_state = current_state.copy()
            test_state[var] = value

            # Compute product of relevant factors
            log_prob = 0.0
            for factor in self.var_to_factors[var]:
                idx = tuple(test_state[v] for v in factor.variables)
                log_prob += np.log(factor.potential[idx] + 1e-10)

            log_probs[value] = log_prob

        # Normalize to get a valid distribution
        log_probs -= logsumexp(log_probs)
        return np.exp(log_probs)

    def sample(
        self,
        initial: Dict[str, int],
        num_samples: int,
        burn_in: int = 100
    ) -> List[Dict[str, int]]:
        """
        Run Gibbs sampling to collect samples.

        One iteration updates all variables in order.
        """
        samples = []
        current = initial.copy()

        for iteration in range(burn_in + num_samples):
            # Update each variable from its full conditional
            for var in self.variables:
                full_cond = self.compute_full_conditional(var, current)
                current[var] = np.random.choice(
                    self.cardinalities[var], p=full_cond
                )

            if iteration >= burn_in:
                samples.append(current.copy())

        return samples

    def estimate_marginals(
        self,
        samples: List[Dict[str, int]]
    ) -> Dict[str, np.ndarray]:
        """Estimate marginal distributions from Gibbs samples."""
        marginals = {
            v: np.zeros(self.cardinalities[v]) for v in self.variables
        }

        for sample in samples:
            for var, value in sample.items():
                marginals[var][value] += 1

        for var in marginals:
            marginals[var] /= len(samples)

        return marginals
```

Gibbs sampling exploits graphical model structure beautifully: each full conditional depends only on the Markov blanket (neighbors in the graph). This makes updates local and efficient. For many graphical models, Gibbs is the go-to MCMC method.
Variational inference (VI) transforms inference into optimization. Instead of sampling from P, we find the best approximation Q from a tractable family that minimizes the divergence from P.
The Key Objective:
We minimize KL(Q || P), but since we don't have access to P's normalizing constant, we equivalently maximize the Evidence Lower Bound (ELBO):
ELBO(Q) = E_Q[log P(X,E)] - E_Q[log Q(X)] = E_Q[log P(X,E)] + H(Q)
where E is evidence and X are latent variables. This is a lower bound on log P(E).
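The bound can be verified numerically on a tiny model (the joint table and the choice of Q below are arbitrary): for any Q, ELBO(Q) = log P(E) − KL(Q ‖ P(X|E)), so the ELBO never exceeds log P(E).

```python
import numpy as np

# Tiny model: latent X in {0, 1}, evidence E fixed at one observed value.
# The joint slice P(X = x, E = e) for x = 0, 1:
p_x_and_e = np.array([0.12, 0.28])

log_p_e = np.log(p_x_and_e.sum())          # log P(E = e)
posterior = p_x_and_e / p_x_and_e.sum()    # P(X | E = e)

# An arbitrary variational distribution Q(X)
q = np.array([0.5, 0.5])

# ELBO = E_Q[log P(X, E)] + H(Q)
elbo = np.sum(q * np.log(p_x_and_e)) - np.sum(q * np.log(q))

# KL(Q || P(X | E))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

# Identity: ELBO = log P(E) - KL, hence ELBO <= log P(E)
assert np.isclose(elbo, log_p_e - kl)
assert elbo <= log_p_e
```

Maximizing the ELBO over Q is therefore equivalent to minimizing KL(Q ‖ P(X|E)), without ever computing P(E).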
Mean-Field Approximation:
The simplest VI assumes Q fully factorizes:
Q(X) = ∏ᵢ Qᵢ(Xᵢ)
This ignores all dependencies in the approximate posterior! Despite this strong assumption, mean-field often works surprisingly well.
Coordinate ascent update for factor Qⱼ:
log Qⱼ(Xⱼ) = E_{Q_{-j}}[log P(X, E)] + const
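On a two-variable pairwise model this update is just a weighted row/column average of log potentials. The sketch below (potentials chosen arbitrarily) runs the coordinate ascent to convergence; for this symmetric model Q recovers the exact marginals, yet the product form still misses the strong correlation between the variables.

```python
import numpy as np

# Pairwise model over binary X0, X1: P(x0, x1) proportional to psi[x0, x1]
psi = np.array([[4.0, 1.0],
                [1.0, 4.0]])   # favors agreement
log_psi = np.log(psi)

def normalize_exp(log_q):
    q = np.exp(log_q - log_q.max())
    return q / q.sum()

# Mean-field factors, initialized away from the fixed point
q0 = np.array([0.5, 0.5])
q1 = np.array([0.7, 0.3])

# Coordinate ascent: log Q_j(x_j) = E_{Q_-j}[log psi] + const
for _ in range(100):
    q0 = normalize_exp(log_psi @ q1)   # expectation over X1
    q1 = normalize_exp(q0 @ log_psi)   # expectation over X0

# Q converges to the exact (uniform) marginals here...
exact_m0 = (psi / psi.sum()).sum(axis=1)

# ...but the factorized Q misrepresents the joint: the true model
# has P(X0 = X1) = 0.8, while Q0 x Q1 gives 0.5.
p_agree_true = (psi[0, 0] + psi[1, 1]) / psi.sum()
p_agree_q = q0 @ q1
```

Getting the marginals right while losing the dependence is the mean-field trade-off in miniature.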
Because KL(Q || P) is zero-forcing (Q avoids putting mass where P has none), mean-field VI tends to underestimate posterior variance. The approximation concentrates on a mode of P but neglects the tails.
```python
class MeanFieldVI:
    """
    Mean-field variational inference for discrete graphical models.

    Approximates the posterior with a fully factorized distribution
    and optimizes via coordinate ascent.
    """

    def __init__(
        self,
        variables: List[str],
        cardinalities: Dict[str, int],
        factors: List[Factor]
    ):
        """
        Initialize mean-field VI.

        Args:
            variables: List of variable names
            cardinalities: Dict mapping variable to cardinality
            factors: List of factors defining the model
        """
        self.variables = variables
        self.cardinalities = cardinalities
        self.factors = factors

        # Variational parameters: Q_i(x_i) for each variable
        # Initialize to uniform
        self.q_params = {
            v: np.ones(cardinalities[v]) / cardinalities[v]
            for v in variables
        }

        # Precompute factor associations
        self.var_to_factors = {v: [] for v in variables}
        for factor in factors:
            for var in factor.variables:
                self.var_to_factors[var].append(factor)

    def update_q(self, var: str):
        """
        Update Q for one variable (coordinate ascent step).

        log Q_j(x_j) = E_{Q_{-j}}[log P(X, E)] + const

        For discrete factors, this is tractable.
        """
        card = self.cardinalities[var]
        log_q = np.zeros(card)

        for value in range(card):
            # Compute expected log potential for this assignment
            for factor in self.var_to_factors[var]:
                log_q[value] += self._expected_log_factor(
                    factor, var, value
                )

        # Normalize to get a proper distribution
        log_q -= logsumexp(log_q)
        self.q_params[var] = np.exp(log_q)

    def _expected_log_factor(
        self,
        factor: Factor,
        fixed_var: str,
        fixed_value: int
    ) -> float:
        """
        Compute E_Q[log factor(X)] with one variable fixed.

        This is a sum over all configurations, weighted by
        Q probabilities for non-fixed variables.
        """
        log_potential = np.log(factor.potential + 1e-10)

        # Sum over all configurations
        result = 0.0
        for idx in np.ndindex(factor.potential.shape):
            # Check if this config is consistent with fixed assignment
            var_idx = factor.variables.index(fixed_var)
            if idx[var_idx] != fixed_value:
                continue

            # Compute Q probability of this configuration
            log_q_prob = 0.0
            for i, v in enumerate(factor.variables):
                if v != fixed_var:
                    log_q_prob += np.log(self.q_params[v][idx[i]] + 1e-10)

            result += np.exp(log_q_prob) * log_potential[idx]

        return result

    def run(
        self,
        max_iter: int = 100,
        tolerance: float = 1e-6
    ) -> Tuple[Dict[str, np.ndarray], float]:
        """
        Run coordinate ascent until convergence.

        Returns:
            - Final variational approximation Q
            - Final ELBO value
        """
        prev_elbo = float('-inf')

        for iteration in range(max_iter):
            # Update each Q in turn
            for var in self.variables:
                self.update_q(var)

            # Check convergence via ELBO
            elbo = self.compute_elbo()
            if abs(elbo - prev_elbo) < tolerance:
                break
            prev_elbo = elbo

        return self.q_params, elbo

    def compute_elbo(self) -> float:
        """
        Compute the Evidence Lower Bound.

        ELBO = E_Q[log P(X)] + H(Q)
        """
        # Expected log joint
        expected_log_joint = 0.0
        for factor in self.factors:
            expected_log_joint += self._expected_log_factor_full(factor)

        # Entropy of Q (factorized, so sum of individual entropies)
        entropy = 0.0
        for var in self.variables:
            q = self.q_params[var]
            entropy -= np.sum(q * np.log(q + 1e-10))

        return expected_log_joint + entropy

    def _expected_log_factor_full(self, factor: Factor) -> float:
        """Compute E_Q[log factor] summing over all configs."""
        log_potential = np.log(factor.potential + 1e-10)

        result = 0.0
        for idx in np.ndindex(factor.potential.shape):
            q_prob = 1.0
            for i, v in enumerate(factor.variables):
                q_prob *= self.q_params[v][idx[i]]
            result += q_prob * log_potential[idx]

        return result
```

Selecting the right inference method depends on the problem structure, desired accuracy, computational budget, and downstream use of the results. Here's a practical guide:
| Situation | Recommended Method | Rationale |
|---|---|---|
| Low treewidth (≤ 15-20) | Junction Tree | Exact, deterministic, optimal when tractable |
| Tree-structured model | Belief Propagation | Exact in linear time; use this for chains, trees |
| High treewidth, sparse | Loopy BP | Fast, often accurate for sparse graphs |
| Need exact samples | MCMC (Gibbs) | Eventually exact; good for graphical models |
| Sequential/online data | Particle Filtering (SMC) | Handles streaming evidence naturally |
| Speed critical, some bias OK | Mean-Field VI | Fast, deterministic, good for exploration |
| Need uncertainty estimates | MCMC or SVI (stochastic variational inference) | Sampling gives full posterior; useful for decisions |
| Very complex, continuous latent | Variational + sampling | Combine VI for approximation, MCMC for refinement |
In practice, combining methods often works best. Use VI to find a good initialization for MCMC. Run loopy BP to get approximate marginals, then refine with importance sampling. Use MCMC within a junction tree for mixed discrete-continuous models.
The field of approximate inference continues to advance rapidly. Here are some important modern developments that extend the classical methods.
The Connection to Deep Learning:
Modern approximate inference increasingly leverages deep learning:
Variational Autoencoders (VAEs): Use neural networks for both the generative model and the variational approximation. The amortized inference network learns to predict posteriors.
Graph Neural Networks: GNNs can be viewed as generalized message passing, connecting to belief propagation on factor graphs.
Attention Mechanisms: Various attention architectures implicitly perform approximate inference by weighting different parts of the input.
Understanding classical approximate inference provides the foundation for appreciating these modern developments.
Approximate inference provides the tools to handle graphical models that exceed exact methods' tractability limits. Whether through sampling or optimization, these methods make probabilistic reasoning practical for complex real-world problems.
Module Complete:
You have now completed the Inference in Graphical Models module. You understand exact inference (variable elimination and junction trees), message passing (belief propagation and its loopy variant), and the approximate methods covered here: Monte Carlo sampling and variational inference.
These tools form the computational backbone of probabilistic graphical models, enabling everything from medical diagnosis systems to speech recognition to computer vision.
Congratulations! You have mastered inference in graphical models. You can now select and implement appropriate inference algorithms for a wide range of probabilistic models, from exact methods for tractable cases to sophisticated approximations for complex real-world problems.