A generative model is only as useful as its ability to sample—to produce new, realistic instances from the learned distribution. Density estimation gives us $p_\theta(x)$, but users want actual samples: new images, new text, new molecules.
Sampling is the process of drawing random values $x$ such that $x \sim p(x)$. For simple distributions (Gaussian, uniform), standard libraries provide efficient sampling. But for complex, high-dimensional distributions learned by deep generative models, sampling becomes a sophisticated challenge.
This page explores the spectrum of sampling techniques, from foundational methods that work for any distribution in principle, to specialized approaches designed for modern generative architectures. Understanding these techniques is essential for deploying, evaluating, and improving generative models.
By the end of this page, you will understand:

- Ancestral sampling and the reparameterization trick
- Rejection and importance sampling, with their trade-offs
- Markov Chain Monte Carlo (MCMC) fundamentals
- The Metropolis-Hastings and Gibbs sampling algorithms
- Modern sampling strategies in neural generative models (flow sampling, diffusion denoising, autoregressive generation)
- Practical considerations for sample quality and diversity
Ancestral sampling (also called forward sampling) is the most direct approach: sample from a generative model by following its generative process.
For directed graphical models:
If a model factorizes as:
$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i | \text{parents}(x_i))$$
we sample each variable in topological order (parents before children), drawing $x_i \sim p(x_i \mid \text{parents}(x_i))$ using the parent values already sampled.
Each conditional $p(x_i | \cdot)$ should be a simple distribution we can sample from (e.g., Gaussian, categorical).
Example: Bayesian Network
Consider a weather model with four binary variables: cloudy influences both sprinkler and rain, and wet_grass depends on sprinkler and rain.
Sampling proceeds parents-first: cloudy → {sprinkler, rain} → wet_grass.
For autoregressive models:
Autoregressive models are explicitly designed for ancestral sampling:
$$p(x) = p(x_1) \cdot p(x_2|x_1) \cdot p(x_3|x_1,x_2) \cdots$$
Generate $x_1$, then $x_2$ given $x_1$, then $x_3$ given $x_1, x_2$, etc. This is how GPT generates text: token by token, each conditioned on all previous.
Ancestral sampling in autoregressive models is inherently sequential—each step depends on all previous. For a 1024-token sequence, we need 1024 forward passes. This is why autoregressive LLMs have high latency despite fast individual forward passes. Parallel sampling methods (speculative decoding, Jacobi iteration) try to mitigate this, but remain an active research area.
```python
import numpy as np

# Example: Ancestral sampling from a Bayesian network
np.random.seed(42)

def sample_weather():
    """
    Bayesian Network:
        cloudy -> sprinkler -> wet_grass
        cloudy -> rain      -> wet_grass
    """
    # P(cloudy) = 0.5
    cloudy = np.random.random() < 0.5
    # P(sprinkler | cloudy)
    p_sprinkler = 0.1 if cloudy else 0.5
    sprinkler = np.random.random() < p_sprinkler
    # P(rain | cloudy)
    p_rain = 0.8 if cloudy else 0.2
    rain = np.random.random() < p_rain
    # P(wet_grass | sprinkler, rain)
    if sprinkler and rain:
        p_wet = 0.99
    elif sprinkler or rain:
        p_wet = 0.9
    else:
        p_wet = 0.01
    wet_grass = np.random.random() < p_wet
    return {'cloudy': cloudy, 'sprinkler': sprinkler,
            'rain': rain, 'wet_grass': wet_grass}

# Generate samples
n_samples = 10000
samples = [sample_weather() for _ in range(n_samples)]

# Estimate marginal probabilities
print("Ancestral Sampling from Weather Bayesian Network:")
print(f"  Generated {n_samples} samples")
print("  Estimated marginal probabilities:")
print(f"    P(cloudy)    = {np.mean([s['cloudy'] for s in samples]):.3f}")
print(f"    P(rain)      = {np.mean([s['rain'] for s in samples]):.3f}")
print(f"    P(wet_grass) = {np.mean([s['wet_grass'] for s in samples]):.3f}")

# Estimate a conditional
rain_samples = [s for s in samples if s['rain']]
print("  Conditional estimates:")
print(f"    P(cloudy | rain) = {np.mean([s['cloudy'] for s in rain_samples]):.3f}")

# Autoregressive sampling example
print("\n--- Autoregressive Text Generation (simplified) ---")

# Vocabulary and transition probabilities (toy example)
vocab = ['the', 'cat', 'dog', 'sat', 'ran', '.']
# P(word | prev_word) - simplified Markov model
transitions = {
    '<start>': {'the': 0.8, 'cat': 0.1, 'dog': 0.1},
    'the':     {'cat': 0.5, 'dog': 0.5},
    'cat':     {'sat': 0.6, 'ran': 0.4},
    'dog':     {'sat': 0.3, 'ran': 0.7},
    'sat':     {'.': 1.0},
    'ran':     {'.': 1.0},
}

def sample_sentence():
    """Autoregressive sampling from a Markov language model."""
    sentence = []
    prev = '<start>'
    while True:
        probs = transitions.get(prev, {'.': 1.0})
        words = list(probs.keys())
        ps = list(probs.values())
        word = np.random.choice(words, p=ps)
        if word == '.':
            break
        sentence.append(word)
        prev = word
    return ' '.join(sentence)

print("  Generated sentences:")
for i in range(5):
    print(f"    {i+1}. {sample_sentence()}")
```

The reparameterization trick is a crucial technique enabling gradient-based training through stochastic layers. It's foundational for variational autoencoders and many modern generative models.
The Problem:
Suppose we need to optimize:
$$\mathcal{L}(\theta) = \mathbb{E}_{z \sim p_\theta(z)}[f(z)]$$
where $p_\theta(z)$ depends on parameters $\theta$. How do we compute $\nabla_\theta \mathcal{L}$?
Naively: sample $z \sim p_\theta(z)$, compute $f(z)$. But $z$ is stochastic—we can't backpropagate through the sampling operation.
The Solution: Reparameterize
Express $z$ as a deterministic function of $\theta$ and independent noise:
$$z = g_\theta(\epsilon), \quad \epsilon \sim q(\epsilon)$$
where $q(\epsilon)$ does NOT depend on $\theta$.
Now: $$\mathcal{L}(\theta) = \mathbb{E}_{\epsilon \sim q(\epsilon)}[f(g_\theta(\epsilon))]$$
Gradients flow through $g_\theta$: $$\nabla_\theta \mathcal{L} = \mathbb{E}_{\epsilon \sim q(\epsilon)}[\nabla_\theta f(g_\theta(\epsilon))]$$
Example: Gaussian
For $z \sim \mathcal{N}(\mu, \sigma^2)$:
$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$
Now $\nabla_\mu z = 1$ and $\nabla_\sigma z = \epsilon$, both well-defined.
The reparameterization trick is essential for VAE training. The encoder outputs parameters (μ, σ) of a Gaussian latent distribution. We need gradients through samples from this distribution to train end-to-end. Without reparameterization, the sampling operation blocks gradients, making training impossible.
Beyond Gaussians:
The reparameterization trick extends to many distributions:
Continuous distributions: location-scale families (Gaussian, Laplace, logistic) reparameterize as $z = \mu + \sigma \epsilon$; distributions with a tractable inverse CDF reparameterize as $z = F_\theta^{-1}(u)$ with $u \sim \text{Uniform}(0, 1)$.
Discrete distributions (harder):
For discrete variables, the mapping $g$ is inherently discontinuous. Solutions:
Gumbel-Softmax (Concrete): Differentiable relaxation of categorical sampling $$z_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}$$ where $g_i \sim \text{Gumbel}(0,1)$ and $\tau$ is temperature.
REINFORCE / Score function estimator: Unbiased but high-variance gradient estimate
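The variance gap between these two estimators is easy to see numerically. The sketch below is illustrative, not from this page's models: it assumes the toy objective $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, 1)$, for which the true gradient is $2\mu$, and compares the score-function (REINFORCE) estimate against the reparameterized (pathwise) one:

```python
import numpy as np

np.random.seed(0)
mu, n = 1.5, 100_000  # true gradient of E[z^2] w.r.t. mu is 2*mu = 3.0

eps = np.random.standard_normal(n)
z = mu + eps  # z ~ N(mu, 1)

# Score-function (REINFORCE) estimator: f(z) * d/dmu log N(z; mu, 1) = z^2 * (z - mu)
score_grads = z**2 * (z - mu)

# Reparameterized (pathwise) estimator: d/dmu f(mu + eps) = 2 * (mu + eps)
pathwise_grads = 2 * z

print(f"True gradient: {2*mu:.3f}")
print(f"REINFORCE: mean = {score_grads.mean():.3f}, var = {score_grads.var():.1f}")
print(f"Pathwise:  mean = {pathwise_grads.mean():.3f}, var = {pathwise_grads.var():.1f}")
```

Both estimators are unbiased, but the score-function variance is an order of magnitude larger here, which is why reparameterization is preferred whenever the distribution allows it.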
```python
import numpy as np

# Demonstrate reparameterization for a Gaussian
np.random.seed(42)

# Parameters of our distribution
mu = 3.0
sigma = 2.0

# Standard sampling (cannot differentiate through this!)
def sample_naive(mu, sigma, n=1000):
    """Sample directly - gradients blocked."""
    return np.random.normal(mu, sigma, n)

# Reparameterized sampling (gradients flow!)
def sample_reparameterized(mu, sigma, n=1000):
    """Sample via reparameterization: z = mu + sigma * epsilon."""
    epsilon = np.random.standard_normal(n)  # Independent of mu, sigma
    z = mu + sigma * epsilon                # Deterministic transform
    return z, epsilon

# Both produce the same distribution
z_naive = sample_naive(mu, sigma, 10000)
z_reparam, epsilon = sample_reparameterized(mu, sigma, 10000)

print("Reparameterization Trick Demonstration:")
print(f"  Target: N(μ={mu}, σ={sigma})")
print("  Naive sampling:")
print(f"    Mean: {z_naive.mean():.3f}, Std: {z_naive.std():.3f}")
print("  Reparameterized sampling:")
print(f"    Mean: {z_reparam.mean():.3f}, Std: {z_reparam.std():.3f}")

# Show gradients
print("  Gradients (reparameterized):")
print("    dz/dμ = 1 (for all samples)")
print("    dz/dσ = ε (sample-dependent)")
print(f"    Mean dz/dσ: {epsilon.mean():.3f} (≈ 0)")

# Gumbel-Softmax for discrete variables
print("\n--- Gumbel-Softmax for Categorical ---")

def gumbel_softmax(logits, temperature=1.0):
    """Differentiable approximation to categorical sampling."""
    # Sample Gumbel noise
    u = np.random.uniform(0, 1, len(logits))
    gumbel = -np.log(-np.log(u + 1e-10) + 1e-10)
    # Add noise and apply temperature-scaled softmax
    y = (logits + gumbel) / temperature
    y = np.exp(y) / np.exp(y).sum()
    return y

# Logits for 3 categories
logits = np.array([2.0, 1.0, 0.5])  # Category 0 most likely
true_probs = np.exp(logits) / np.exp(logits).sum()

print(f"  Logits: {logits}")
print(f"  True probabilities: {true_probs.round(3)}")

# Sample at different temperatures
for temp in [0.1, 0.5, 1.0, 2.0]:
    samples = [gumbel_softmax(logits, temp) for _ in range(1000)]
    avg = np.mean(samples, axis=0)
    print(f"  τ={temp}: avg samples = {avg.round(3)}")
print("  (Lower τ → closer to one-hot, higher τ → closer to uniform)")
```

When direct sampling isn't possible, we turn to indirect methods that sample from a different distribution and correct for the discrepancy.
Rejection Sampling
Idea: Sample from a simpler proposal distribution $q(x)$ that we can sample from, then accept/reject samples to match target $p(x)$.
Algorithm:
1. Find a constant $M$ such that $p(x) \le M q(x)$ for all $x$.
2. Sample $x \sim q(x)$ and $u \sim \text{Uniform}(0, 1)$.
3. Accept $x$ if $u < \frac{p(x)}{M q(x)}$; otherwise reject it and repeat from step 2.
Properties:
- Accepted samples are exact draws from $p(x)$.
- The expected acceptance rate is $1/M$, so a tight bound $M$ is crucial for efficiency.
- We only need to evaluate $p(x)$ pointwise, not sample from it directly.
The curse: In high dimensions, even a well-chosen $q$ typically requires exponentially large $M$, making rejection sampling impractical.
Importance Sampling
Idea: Don't reject samples—instead, weight them to correct for using the wrong distribution.
To estimate $\mathbb{E}_{x \sim p}[f(x)]$:
$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[f(x) \cdot \frac{p(x)}{q(x)}\right]$$
The importance weight $w(x) = p(x)/q(x)$ re-weights samples from $q$ to be representative of $p$.
Self-Normalized Importance Sampling:
When we only know $p$ up to a normalizing constant:
$$\hat{\mathbb{E}}[f(x)] = \frac{\sum_i w(x_i) f(x_i)}{\sum_i w(x_i)}$$
where $x_i \sim q(x)$.
Importance sampling works well when p and q are similar. In high dimensions, even small mismatches cause weights to have enormous variance—a few samples have huge weights, others near zero. The 'effective sample size' can be tiny even with many samples. This limits the practical utility of importance sampling for complex generative models.
```python
import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Target distribution: mixture of Gaussians (hard to sample directly)
def target_pdf(x):
    return 0.5 * norm.pdf(x, -2, 0.5) + 0.5 * norm.pdf(x, 2, 0.8)

def target_unnormalized(x):
    return target_pdf(x)  # In this case, same as the normalized density

# Proposal: broader Gaussian that covers both modes
proposal = norm(0, 3)

# Rejection Sampling
print("=== Rejection Sampling ===")

# Find M such that target(x) <= M * proposal(x) for all x
x_test = np.linspace(-6, 6, 1000)
ratios = target_pdf(x_test) / proposal.pdf(x_test)
M = np.max(ratios) * 1.1  # Add margin for safety

print("  Proposal: N(0, 3)")
print(f"  Bound M: {M:.3f}")
print(f"  Expected acceptance rate: {1/M:.3f}")

# Rejection sampling loop
n_target = 10000
accepted = []
attempts = 0
while len(accepted) < n_target:
    x = proposal.rvs()
    u = np.random.uniform(0, 1)
    attempts += 1
    if u < target_pdf(x) / (M * proposal.pdf(x)):
        accepted.append(x)

accepted = np.array(accepted)
print(f"  Requested: {n_target} samples")
print(f"  Attempts: {attempts}")
print(f"  Actual acceptance rate: {n_target/attempts:.3f}")
print(f"  Sample mean: {accepted.mean():.3f} (expected: 0)")
print(f"  Sample std: {accepted.std():.3f}")

# Importance Sampling
print("\n=== Importance Sampling ===")

# Estimate E[X^2] under the target distribution
def estimand(x):
    return x ** 2

# True value (analytically computed)
true_expectation = 0.5 * (4 + 0.5**2) + 0.5 * (4 + 0.8**2)  # E[X^2] for the mixture
print(f"  True E[X²]: {true_expectation:.3f}")

# Importance sampling estimate
n_samples = 10000
samples = proposal.rvs(n_samples)
weights = target_pdf(samples) / proposal.pdf(samples)
normalized_weights = weights / weights.sum()

is_estimate = np.sum(normalized_weights * estimand(samples))
print(f"  IS estimate: {is_estimate:.3f}")

# Effective sample size
ess = 1 / np.sum(normalized_weights ** 2)
print(f"  Effective Sample Size: {ess:.0f} out of {n_samples}")
print(f"  ESS ratio: {ess/n_samples:.1%}")
print("  (Low ESS = high weight variance, poor estimate)")

# Show weight distribution
print("  Weight statistics:")
print(f"    Max weight: {weights.max():.3f}")
print(f"    Min weight: {weights.min():.6f}")
print(f"    Max/Min ratio: {weights.max()/weights.min():.0f}")
```

Markov Chain Monte Carlo (MCMC) is a family of algorithms that sample from complex distributions by constructing a Markov chain whose stationary distribution is the target.
Core Idea:
Instead of sampling independently, we construct a sequence of samples $x_0, x_1, x_2, \ldots$ where each sample depends on the previous one (a Markov chain). If we design the transition probabilities correctly, the distribution of samples converges to the target $p(x)$ as we run the chain.
Key Concepts:
Markov Chain: A sequence where $x_{t+1}$ depends only on $x_t$: $$P(x_{t+1} | x_t, x_{t-1}, \ldots, x_0) = P(x_{t+1} | x_t) = T(x_{t+1} | x_t)$$
where $T$ is the transition kernel.
Stationary Distribution: A distribution $\pi(x)$ such that: $$\pi(x') = \int T(x'|x) \pi(x) dx$$
If we sample $x_t \sim \pi$, then $x_{t+1} \sim \pi$ too. The distribution is 'stable.'
Ergodicity: Under certain conditions (irreducibility, aperiodicity), the chain converges to its stationary distribution regardless of initialization: $$\lim_{t \to \infty} P(x_t | x_0) = \pi(x)$$
MCMC goal: Design $T$ such that $\pi = p$ (target distribution).
A sufficient (not necessary) condition for π to be stationary is detailed balance: π(x)T(x'|x) = π(x')T(x|x'). This says the probability flow from x to x' equals the flow from x' to x. Most MCMC algorithms are designed by constructing T that satisfies detailed balance with respect to the target distribution.
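Detailed balance is easy to verify numerically for a small discrete chain. The sketch below (an illustrative 3-state target, not taken from the text) builds a Metropolis-style transition matrix from a uniform proposal and checks both detailed balance and stationarity:

```python
import numpy as np

# Target distribution over 3 states (illustrative choice)
pi = np.array([0.2, 0.3, 0.5])

# Build a Metropolis transition kernel from a uniform proposal:
# propose state j uniformly, accept with probability min(1, pi[j]/pi[i])
K = 3
T = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            T[i, j] = (1.0 / K) * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[i].sum()  # rejected proposals stay at state i

# Detailed balance: pi[i] * T[i, j] == pi[j] * T[j, i] for all i, j
flow = pi[:, None] * T
assert np.allclose(flow, flow.T), "detailed balance violated"

# Stationarity follows: pi @ T == pi
print("pi @ T =", pi @ T)

# Ergodicity: any starting distribution converges to pi under repeated T
dist = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    dist = dist @ T
print("after 100 steps:", dist.round(4))
```

Summing the off-diagonal flows confirms the Metropolis rule works: the accepted mass $\frac{1}{K}\min(\pi_i, \pi_j)$ is symmetric in $i$ and $j$ by construction.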
Practical Considerations:
Burn-in: Initial samples are influenced by the starting point, not the target distribution. We discard these 'burn-in' samples.
Mixing: How quickly does the chain explore the distribution? Poor mixing = chain gets stuck in local regions, samples are highly correlated, effective sample size is low.
Thinning: To reduce autocorrelation, keep only every $k$-th sample. Trading samples for computational efficiency.
Diagnostics:
- Trace plots: plot $x_t$ over iterations; trends or sudden level shifts signal non-convergence.
- $\hat{R}$ (Gelman-Rubin): compare variance within and between multiple chains; values above ~1.1 indicate trouble.
- Autocorrelation / effective sample size (ESS): how many "independent-equivalent" samples the chain actually provides.
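As one concrete diagnostic, a simplified Gelman-Rubin $\hat{R}$ can be computed from a few parallel chains. The sketch below uses synthetic "chains" rather than real MCMC output, and implements the basic (non-split) variant of the formula:

```python
import numpy as np

def r_hat(chains):
    """Simplified Gelman-Rubin statistic for m equal-length chains.

    chains: array of shape (m, n) - m chains of n samples each.
    Values near 1.0 suggest convergence; > 1.1 is a common warning sign.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.normal(0, 1, size=(4, 2000))           # 4 chains agreeing on the target
stuck = mixed + np.array([[0], [0], [3], [3]])     # two chains offset: not converged

print(f"R-hat (well-mixed chains):  {r_hat(mixed):.3f}")   # ~1.0
print(f"R-hat (disagreeing chains): {r_hat(stuck):.2f}")   # well above 1.1
```

The second case mimics chains stuck in different modes: each chain looks stationary on its own, but the between-chain variance exposes the problem.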
Limitations of MCMC:
| Issue | Symptom | Remedy |
|---|---|---|
| Poor mixing | Autocorrelation high, ESS low | Tune step size, use HMC, reparameterize |
| Not converged | Trace plot trends, R-hat > 1.1 | Run longer, better initialization |
| Stuck in mode | Multimodal target, chain in one mode | Parallel tempering, multiple chains |
| High dimensionality | Acceptance rate too low/high | HMC, NUTS, blocked updates |
Metropolis-Hastings (MH) is the foundational MCMC algorithm. It constructs a Markov chain with the target as its stationary distribution by using a proposal distribution and accept/reject steps.
Algorithm:
1. Start from an initial state $x_0$.
2. Propose $x' \sim q(x'|x_t)$.
3. Compute the acceptance probability: $$\alpha = \min\left(1, \frac{p(x')\, q(x_t|x')}{p(x_t)\, q(x'|x_t)}\right)$$
4. With probability $\alpha$, set $x_{t+1} = x'$; otherwise set $x_{t+1} = x_t$.
5. Repeat from step 2.
Key Properties:
- Only the ratio $p(x')/p(x_t)$ is needed, so $p$ can be unnormalized: the normalizing constant cancels.
- The accept/reject rule enforces detailed balance, making $p$ the stationary distribution.
- Works for any target we can evaluate pointwise, at the cost of correlated samples.
Symmetric Proposals (Metropolis):
If $q(x'|x) = q(x|x')$ (symmetric, like Gaussian random walk), the formula simplifies: $$\alpha = \min\left(1, \frac{p(x')}{p(x_t)}\right)$$
Accept if we move to higher probability; randomly accept if we move to lower.
Tuning the Proposal:
The proposal distribution critically affects performance:
Too narrow (small steps): high acceptance rate, but the chain crawls through the space and consecutive samples are highly correlated.
Too wide (large steps): most proposals land in low-probability regions and are rejected, so the chain rarely moves.
Sweet spot: Acceptance rate around 20-50% for random walk proposals in high dimensions (theoretical optimal ~23.4% for high-dimensional Gaussian targets).
Adaptive MH:
Modern implementations adapt the proposal during the run: tuning the step size toward a target acceptance rate, and estimating the target's covariance online to shape the proposal. Adaptation must diminish over time (or be frozen) to preserve the correct stationary distribution.
```python
import numpy as np

# Metropolis-Hastings for sampling from a bimodal distribution
np.random.seed(42)

# Target: mixture of two Gaussians
def log_target(x):
    log_p1 = -0.5 * ((x - (-3)) ** 2) / 1.0
    log_p2 = -0.5 * ((x - 3) ** 2) / 1.0
    # log-sum-exp for numerical stability
    max_log = max(log_p1, log_p2)
    return max_log + np.log(np.exp(log_p1 - max_log) + np.exp(log_p2 - max_log))

def metropolis_hastings(log_target, x0, proposal_std, n_samples, burn_in=1000):
    """
    Random walk Metropolis-Hastings.

    Args:
        log_target: Function returning log p(x) (up to a constant)
        x0: Initial state
        proposal_std: Standard deviation of the Gaussian proposal
        n_samples: Number of samples after burn-in
        burn_in: Number of initial samples to discard
    """
    x = x0
    samples = []
    acceptances = 0
    total_iter = n_samples + burn_in
    for t in range(total_iter):
        # Propose
        x_prime = x + np.random.normal(0, proposal_std)
        # Compute acceptance probability (in log space)
        log_alpha = log_target(x_prime) - log_target(x)
        # Accept/reject
        if np.log(np.random.random()) < log_alpha:
            x = x_prime
            if t >= burn_in:
                acceptances += 1
        if t >= burn_in:
            samples.append(x)
    return np.array(samples), acceptances / n_samples

def compute_ess(samples):
    """Rough effective sample size from the autocorrelation function."""
    n = len(samples)
    mean = samples.mean()
    var = samples.var()
    if var == 0:
        return n
    autocorr = np.correlate(samples - mean, samples - mean, 'full')
    autocorr = autocorr[n-1:] / (var * np.arange(n, 0, -1))
    # Sum autocorrelations while they remain meaningfully positive
    ess_denom = 1 + 2 * sum(autocorr[1:min(100, n)] * (autocorr[1:min(100, n)] > 0.05))
    return n / max(1, ess_denom)

# Run with different proposal widths
n_samples = 10000
print("Metropolis-Hastings Sampling from Bimodal Distribution")
print("=" * 55)

for std in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    samples, acc_rate = metropolis_hastings(
        log_target, x0=0.0, proposal_std=std, n_samples=n_samples
    )
    ess = compute_ess(samples)
    print(f"  σ_proposal = {std:4.1f}: Accept rate = {acc_rate:.3f}, "
          f"ESS = {ess:6.0f} ({ess/n_samples*100:4.1f}%)")

print("  (Best: acceptance rate 20-50%, high ESS)")

# Check that both modes are visited
best_samples, _ = metropolis_hastings(
    log_target, x0=0.0, proposal_std=2.0, n_samples=50000, burn_in=5000
)
n_left_mode = np.sum(best_samples < 0)
n_right_mode = np.sum(best_samples >= 0)
print("  Mode coverage (σ=2.0):")
print(f"    Left mode (x<0):  {n_left_mode} samples ({n_left_mode/len(best_samples):.1%})")
print(f"    Right mode (x≥0): {n_right_mode} samples ({n_right_mode/len(best_samples):.1%})")
```

Gibbs sampling is a special case of MCMC particularly useful when the joint distribution is hard to sample from, but the conditional distributions are easy.
Algorithm:
For a multivariate distribution $p(x_1, x_2, \ldots, x_d)$:
1. Initialize $x^{(0)} = (x_1^{(0)}, \ldots, x_d^{(0)})$.
2. At each iteration $t$, cycle through the coordinates, sampling: $$x_i^{(t)} \sim p(x_i \mid x_1^{(t)}, \ldots, x_{i-1}^{(t)}, x_{i+1}^{(t-1)}, \ldots, x_d^{(t-1)})$$
Each step updates one variable by sampling from its full conditional distribution (conditioned on current values of all others).
Why it works:
Gibbs sampling is a special case of MH where the proposal is the full conditional, and the acceptance probability is always 1 (the updates are always accepted).
Advantages:
- No proposal distribution to tune and no rejected samples.
- Exploits conditional structure: each update is an exact draw from a simple distribution.
- A natural fit for graphical models and conjugate Bayesian models.
Gibbs sampling shines in graphical models where the conditional p(x_i | x_{-i}) depends only on the Markov blanket of x_i (parents, children, and co-parents). This locality makes conditionals tractable even in large models, and enables block updates where subsets of variables are updated together.
Block Gibbs Sampling:
Instead of updating one variable at a time, update blocks of variables together: $$x_A^{(t)} \sim p(x_A | x_B^{(t-1)})$$ $$x_B^{(t)} \sim p(x_B | x_A^{(t)})$$
Larger blocks improve mixing (less autocorrelation) but require sampling from more complex conditionals.
Collapsed Gibbs Sampling:
Integrate out (marginalize) some variables analytically, sampling only the remaining variables from the collapsed conditionals. This can dramatically improve mixing by reducing dependencies.
Limitations:
- Requires tractable full conditionals, which many models lack.
- Mixes slowly when variables are strongly correlated, since updates move in small axis-aligned steps.
- Like other single-chain MCMC methods, it can fail to hop between well-separated modes.
```python
import numpy as np

np.random.seed(42)

# Gibbs sampling for a bivariate Gaussian
# Target: p(x, y) = N([0, 0], [[1, ρ], [ρ, 1]])
# Conditionals:
#   p(x | y) = N(ρy, 1 - ρ²)
#   p(y | x) = N(ρx, 1 - ρ²)

def gibbs_bivariate_gaussian(rho, n_samples, burn_in=1000):
    """Gibbs sampling for a bivariate Gaussian with correlation rho."""
    conditional_std = np.sqrt(1 - rho ** 2)
    x, y = 0.0, 0.0  # Initialize
    samples = []
    for t in range(n_samples + burn_in):
        # Update x given y
        x = np.random.normal(rho * y, conditional_std)
        # Update y given x
        y = np.random.normal(rho * x, conditional_std)
        if t >= burn_in:
            samples.append([x, y])
    return np.array(samples)

# Sample with different correlations
print("Gibbs Sampling for Bivariate Gaussian")
print("=" * 45)

for rho in [0.0, 0.5, 0.9, 0.99]:
    samples = gibbs_bivariate_gaussian(rho, n_samples=10000)
    # Check statistics
    sample_corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
    # Estimate ESS (rough)
    centered = samples[:, 0] - samples[:, 0].mean()
    x_autocorr = np.correlate(centered, centered, 'full')
    x_autocorr = x_autocorr[len(samples)-1:len(samples)+50] / x_autocorr[len(samples)-1]
    ess_approx = len(samples) / (1 + 2 * np.sum(x_autocorr[1:] * (x_autocorr[1:] > 0.05)))
    print(f"  ρ = {rho:.2f}: Sample corr = {sample_corr:.3f}, "
          f"ESS ≈ {ess_approx:.0f} ({ess_approx/10000*100:.1f}%)")

print("  Note: Higher ρ → slower mixing, lower ESS")

# Gibbs for a Gaussian Mixture Model (toy example)
print("\n--- Gibbs for Gaussian Mixture Model ---")

# Generate data from a mixture
n_data = 200
true_labels = np.random.choice([0, 1], n_data, p=[0.4, 0.6])
true_means = [-3, 3]
data = np.array([np.random.normal(true_means[z], 1) for z in true_labels])

# Gibbs sampling: alternate between z (labels) and μ (means)
def gibbs_gmm(data, n_iter=500, burn_in=100):
    n = len(data)
    K = 2
    # Initialize
    mu = np.array([0.0, 1.0])     # Means
    z = np.random.choice(K, n)    # Assignments
    pi = np.array([0.5, 0.5])     # Mixing weights (held fixed here)
    samples = []
    for t in range(n_iter + burn_in):
        # Sample z given mu (assignments)
        for i in range(n):
            log_probs = np.log(pi) - 0.5 * (data[i] - mu) ** 2
            probs = np.exp(log_probs - log_probs.max())
            probs /= probs.sum()
            z[i] = np.random.choice(K, p=probs)
        # Sample mu given z (means, with prior N(0, 10))
        for k in range(K):
            mask = z == k
            n_k = mask.sum()
            if n_k > 0:
                precision = n_k + 1/10
                mean = (data[mask].sum() + 0/10) / precision
                mu[k] = np.random.normal(mean, 1/np.sqrt(precision))
        if t >= burn_in:
            samples.append(mu.copy())
    return np.array(samples)

mu_samples = gibbs_gmm(data)
print(f"  True means: {true_means}")
print(f"  Posterior mean estimates: [{mu_samples[:, 0].mean():.2f}, {mu_samples[:, 1].mean():.2f}]")
```

Modern deep generative models have developed specialized sampling procedures that diverge from classical MCMC. Let's examine how different architectures approach sampling.
Normalizing Flows: One-Shot Sampling
Flows transform a simple base distribution through an invertible function: $$z \sim \mathcal{N}(0, I) \rightarrow x = f_\theta(z)$$
Sampling is a single forward pass—no chains, no iterations, no rejection. This makes flows attractive for applications requiring fast sampling.
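A toy sketch of this one-shot pattern, using a single hand-picked affine layer in place of a trained flow: the forward pass maps base noise to samples in one step, and the change-of-variables formula gives exact densities:

```python
import numpy as np
from scipy.stats import norm

np.random.seed(0)

# Toy invertible flow: x = f(z) = a*z + b (a trained flow composes many such layers)
a, b = 3.0, 2.0
f = lambda z: a * z + b
f_inv = lambda x: (x - b) / a

# One-shot sampling: draw base noise, apply a single forward pass
z = np.random.standard_normal(100_000)
x = f(z)
print(f"Samples: mean = {x.mean():.2f}, std = {x.std():.2f}")  # ≈ 2, 3

# Exact density via change of variables: log p(x) = log N(f_inv(x); 0, 1) - log|a|
def log_px(x):
    z = f_inv(x)
    return -0.5 * (z**2 + np.log(2 * np.pi)) - np.log(abs(a))

# Matches the N(2, 3²) density at a test point
print(np.isclose(np.exp(log_px(2.5)), norm.pdf(2.5, 2, 3)))  # True
```

The same two ingredients, an invertible forward map and a tractable Jacobian determinant, are what real flows like RealNVP and Glow scale up to deep networks.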
VAEs: Latent Space Sampling
VAEs sample from a simple prior in latent space, then decode: $$z \sim \mathcal{N}(0, I) \rightarrow x = \text{Decoder}_\theta(z)$$
Also one-shot. The quality depends on how well the decoder captures the data distribution conditional on $z$.
GANs: Generator Forward Pass
Like VAEs, GANs sample noise and transform: $$z \sim \mathcal{N}(0, I) \rightarrow x = G_\theta(z)$$
One-shot, very fast. No density evaluation—purely implicit sampling.
Autoregressive Models: Sequential Generation
Autoregressive models (GPT, PixelCNN) sample dimension by dimension: $$x_1 \sim p(x_1), \quad x_2 \sim p(x_2|x_1), \quad \ldots$$
This is ancestral sampling, but sequential—one forward pass per dimension. For long sequences (text) or high-resolution images (millions of pixels), this can be slow.
Sampling Strategies for Autoregressive:
- Temperature scaling: divide logits by $\tau$ to sharpen ($\tau < 1$) or flatten ($\tau > 1$) the distribution.
- Top-$k$: sample only from the $k$ most probable tokens.
- Nucleus (top-$p$): sample from the smallest set of tokens whose cumulative probability exceeds $p$.
- Greedy / beam search: deterministic alternatives that maximize (approximate) likelihood at the cost of diversity.
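Common decoding strategies such as temperature scaling, top-$k$, and nucleus (top-$p$) sampling can be sketched in a few lines of numpy. The `sample_token` helper below is illustrative, not a real decoder; production systems add batching and KV caching:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from logits with common decoding strategies."""
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:  # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    if top_p is not None:  # nucleus: smallest set with cumulative mass >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()

    return rng.choice(len(probs), p=probs)

logits = [4.0, 3.0, 1.0, 0.5, -1.0]
print("low temperature (τ=0.1):", [sample_token(logits, temperature=0.1) for _ in range(5)])
print("top-k=2:                ", [sample_token(logits, top_k=2) for _ in range(5)])
print("nucleus p=0.9:          ", [sample_token(logits, top_p=0.9) for _ in range(5)])
```

For these logits, top-$k{=}2$ and top-$p{=}0.9$ both restrict sampling to tokens 0 and 1, while low temperature concentrates almost all mass on token 0.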
Diffusion Models: Iterative Denoising
Diffusion models sample through many denoising steps: $$x_T \sim \mathcal{N}(0, I) \rightarrow x_{T-1} \rightarrow \cdots \rightarrow x_0$$
Typically 20-1000 steps, each requiring a neural network forward pass. This is slower than one-shot methods but produces remarkably high-quality samples.
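The iterative noise-to-data loop can be illustrated with a close relative of diffusion sampling: unadjusted Langevin dynamics driven by a closed-form score. In a real diffusion model, a time-conditioned network replaces the analytic score below and the noise schedule is learned rather than fixed; this is only a sketch of the principle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target the sampler should reach: N(2, 1.5²).
# A trained diffusion model would replace this closed-form score with a network.
m, s = 2.0, 1.5
score = lambda x: -(x - m) / s**2   # d/dx log p(x)

# Start from pure noise and iteratively refine (unadjusted Langevin dynamics)
n, eta, steps = 20_000, 0.01, 5000
x = rng.standard_normal(n)          # "x_T": pure Gaussian noise
for _ in range(steps):
    x = x + 0.5 * eta * score(x) + np.sqrt(eta) * rng.standard_normal(n)

print(f"Final samples: mean = {x.mean():.2f}, std = {x.std():.2f}")  # ≈ 2.0, 1.5
```

Each step nudges samples toward high-density regions while injecting fresh noise, the same push-and-perturb structure that a DDPM denoising step performs with a learned network.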
Fast Diffusion Sampling:
- DDIM: a deterministic, non-Markovian sampler that produces good samples in roughly 20-50 steps.
- Higher-order ODE/SDE solvers (e.g., DPM-Solver) reduce the step count further.
- Distillation approaches (progressive distillation, consistency models) train student models that sample in 1-4 steps.
Faster sampling methods (flows, VAEs, GANs) often sacrifice sample quality or likelihood. Slower methods (diffusion, autoregressive) often produce better samples. Choosing the right model involves balancing sampling speed, sample quality, training stability, and mode coverage for your specific application.
| Model | Sampling Method | Steps/Sample | Typical Speed |
|---|---|---|---|
| Normalizing Flow | Forward pass | 1 | ~10ms |
| VAE | Sample + decode | 1 | ~5-10ms |
| GAN | Forward pass | 1 | ~5-10ms |
| Autoregressive (text) | Sequential | ~100-1000 | ~100-1000ms |
| Diffusion (fast) | DDIM denoising | ~20-50 | ~200-500ms |
| Diffusion (quality) | DDPM denoising | ~100-1000 | ~1-10s |
Sampling is where generative models meet the real world—transforming probability distributions into tangible new data. We've covered the spectrum from classical techniques to modern neural approaches.
What's next:
We've seen how generative models learn distributions (density estimation) and produce samples (sampling). But many of the most powerful generative models introduce latent variables—hidden representations that capture meaningful structure in the data. The next page explores this crucial concept, laying the groundwork for VAEs and other latent variable models.
You now understand the full landscape of sampling techniques for generative models. From classical MCMC to modern neural approaches, each method navigates the fundamental challenge of converting probability distributions into tangible samples. This knowledge is essential for implementing, debugging, and improving generative systems.