Rejection sampling and importance sampling share a fundamental limitation: they require a proposal distribution that globally matches the target. In high dimensions, no tractable proposal achieves this—acceptance rates and importance weights degenerate.
Markov Chain Monte Carlo (MCMC) takes a radically different approach. Instead of trying to directly sample from the target, we construct a Markov chain—a sequence of random states where each state depends only on the previous one—whose stationary distribution is the target distribution.
The power of this idea is profound: we never need independent draws from the target, only a rule for moving from one state to the next, and we only need to evaluate the target up to an unknown normalizing constant.
The Metropolis-Hastings (MH) algorithm provides a general recipe for constructing such chains. Published in 1953 (Metropolis) and generalized in 1970 (Hastings), it remains the foundation of modern Bayesian computation.
This page develops the Metropolis-Hastings algorithm from first principles. You will understand detailed balance, derive the acceptance probability, learn proposal design strategies, and master practical diagnostics for ensuring your MCMC chains have converged.
Before diving into the algorithm, we need the mathematical machinery of Markov chains.
Definition: Markov Chain
A sequence of random variables $X_0, X_1, X_2, \ldots$ forms a Markov chain if:
$$P(X_{t+1} | X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} | X_t)$$
The future depends on the past only through the present state. This "memoryless" property is the Markov property.
Transition kernel:
The chain is characterized by its transition kernel $K(x' | x)$—the probability (density) of moving from state $x$ to state $x'$:
$$P(X_{t+1} \in A | X_t = x) = \int_A K(x' | x) \, dx'$$
Stationary distribution:
A distribution $\pi$ is a stationary distribution for the chain if, when $X_t \sim \pi$, then $X_{t+1} \sim \pi$:
$$\pi(x') = \int K(x' | x) \, \pi(x) \, dx$$
If we start the chain from the stationary distribution, it stays there forever. More remarkably, under mild conditions, the chain converges to the stationary distribution regardless of where it starts.
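As a minimal numerical illustration (not from the text—the two-state transition matrix here is hand-picked), stationarity and convergence can both be checked directly:

```python
import numpy as np

# Hand-picked two-state transition kernel: K[i, j] = P(next = j | current = i)
K = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Candidate stationary distribution: solves pi K = pi
pi = np.array([0.75, 0.25])
print(np.allclose(pi @ K, pi))  # True: pi is stationary

# Convergence: starting from a point mass on state 0,
# repeated transitions approach pi regardless of the start
dist = np.array([1.0, 0.0])
for _ in range(50):
    dist = dist @ K
print(np.round(dist, 6))  # ≈ [0.75 0.25]
```

The second eigenvalue of this kernel is 0.6, so the distance to stationarity shrinks by a factor of 0.6 per step—geometric convergence, as the theory promises.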
MCMC inverts the usual setup: instead of starting with a transition kernel and finding its stationary distribution, we start with a target distribution π (like a Bayesian posterior) and construct a transition kernel whose stationary distribution is π. The Metropolis-Hastings algorithm provides this construction.
Conditions for convergence:
For a Markov chain to converge to a unique stationary distribution, it must be irreducible (every region with positive probability under $\pi$ is reachable from any starting point) and aperiodic (the chain does not cycle deterministically through disjoint sets of states).
For continuous state spaces, we also need technical regularity conditions (Harris recurrence).
Ergodic theorem:
If the chain satisfies these conditions, then for almost any starting point $X_0$:
$$\frac{1}{T} \sum_{t=1}^{T} f(X_t) \xrightarrow{a.s.} \mathbb{E}_{\pi}[f(X)] \quad \text{as } T \to \infty$$
This is the MCMC analog of the Law of Large Numbers: time averages converge to expectations under the stationary distribution.
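A quick numerical check of the ergodic theorem, sketched with an AR(1) chain whose stationary distribution is $\mathcal{N}(0, 1)$ (the chain, seed, and constants are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) chain: X_{t+1} = a X_t + sqrt(1 - a^2) eps, stationary dist N(0, 1)
a = 0.8
x = 5.0            # deliberately far from the stationary distribution
T = 200_000

running_sum = 0.0
for t in range(T):
    x = a * x + np.sqrt(1 - a**2) * rng.standard_normal()
    running_sum += x**2        # f(x) = x^2, with E_pi[X^2] = 1

time_average = running_sum / T
print(f"time average of X^2: {time_average:.3f}  (expectation under pi: 1.0)")
```

The time average converges to the stationary expectation even though the samples are dependent and the chain starts far from equilibrium.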
How do we construct a transition kernel with a desired stationary distribution? The answer lies in detailed balance.
Detailed balance condition:
A transition kernel $K$ satisfies detailed balance with respect to distribution $\pi$ if:
$$\pi(x) \, K(x' | x) = \pi(x') \, K(x | x')$$
for all states $x$ and $x'$.
Interpretation:
This says the "probability flow" from $x$ to $x'$ equals the flow from $x'$ to $x$. If we imagine many particles distributed according to $\pi$ and simultaneously apply one transition step, the flows balance—no net movement.
Why detailed balance implies stationarity:
Integrate both sides over $x$: $$\int \pi(x) \, K(x' | x) \, dx = \int \pi(x') \, K(x | x') \, dx = \pi(x') \int K(x | x') \, dx = \pi(x')$$
The left side is the distribution of $X_{t+1}$ when $X_t \sim \pi$. The right side is $\pi(x')$. So $\pi$ is stationary.
Detailed balance is sufficient, not necessary:
A chain can have a stationary distribution without satisfying detailed balance (such chains exhibit "circulation"). But detailed balance is the easiest condition to verify and engineer, making it the standard approach.
Detailed balance gives us a recipe: design a transition kernel such that π(x)K(x'|x) = π(x')K(x|x'). Metropolis-Hastings achieves this by introducing an acceptance probability that "corrects" any proposal distribution to satisfy detailed balance with respect to the target.
The Metropolis-Hastings algorithm constructs a valid MCMC transition kernel from any proposal distribution.
Setup: a target distribution $\pi$ that we can evaluate up to a normalizing constant, and a proposal distribution $q(x' | x)$ that we can both sample from and evaluate.
The algorithm:

```
Initialize X_0 arbitrarily
For t = 0, 1, 2, ..., T-1:
    1. Propose: draw x' ~ q(x' | X_t)
    2. Compute the acceptance probability:
           α(x', X_t) = min(1, π(x') q(X_t | x') / (π(X_t) q(x' | X_t)))
    3. Accept or reject:
           Draw u ~ Uniform(0, 1)
           If u ≤ α(x', X_t):  X_{t+1} = x'    (accept)
           Else:               X_{t+1} = X_t   (reject, stay)
```
The acceptance ratio:
$$\alpha(x', x) = \min\left(1, \frac{\pi(x') \, q(x | x')}{\pi(x) \, q(x' | x)}\right)$$
This involves the target density ratio $\pi(x')/\pi(x)$ and the proposal density ratio $q(x | x')/q(x' | x)$, which corrects for any asymmetry in the proposal.
A crucial feature: we only need the RATIO π(x')/π(x). Any normalizing constants cancel! For Bayesian posteriors:
$$\frac{\pi(\theta')}{\pi(\theta)} = \frac{p(D|\theta') \, p(\theta')}{p(D|\theta) \, p(\theta)}$$
The intractable evidence p(D) cancels completely. This is why MCMC transformed Bayesian computation—we never need to compute the normalizing constant.
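This cancellation is easy to verify numerically. In the sketch below, the offset `log_c` plays the role of an unknown log normalizing constant (the target, states, and constants are arbitrary illustrations):

```python
import numpy as np
from scipy import stats

# Log-density known only up to a constant: log pi(x) = log N(x; 3, 1) + log_c
def log_target_unnorm(x, log_c):
    return stats.norm.logpdf(x, 3, 1) + log_c

x_curr, x_prop = 2.0, 2.5

# The log acceptance ratio is a difference of log densities,
# so the unknown constant log_c cancels exactly
log_alpha_a = log_target_unnorm(x_prop, 0.0) - log_target_unnorm(x_curr, 0.0)
log_alpha_b = log_target_unnorm(x_prop, 123.4) - log_target_unnorm(x_curr, 123.4)
print(np.isclose(log_alpha_a, log_alpha_b))  # True
```

Every accept/reject decision is identical whatever the constant, which is exactly why the evidence $p(D)$ never needs to be computed.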
Why MH satisfies detailed balance:
The transition kernel is: $$K(x' | x) = q(x' | x) \, \alpha(x', x) + r(x) \, \delta(x' - x)$$
where $r(x) = 1 - \int q(x' | x) \, \alpha(x', x) \, dx'$ is the probability of rejection.
For $x \neq x'$, we need to verify: $$\pi(x) \, q(x' | x) \, \alpha(x', x) = \pi(x') \, q(x | x') \, \alpha(x, x')$$
Case 1: If $\pi(x') q(x | x') < \pi(x) q(x' | x)$, then $\alpha(x', x) = \frac{\pi(x') q(x | x')}{\pi(x) q(x' | x)}$ and $\alpha(x, x') = 1$.
LHS: $\pi(x) q(x' | x) \cdot \frac{\pi(x') q(x | x')}{\pi(x) q(x' | x)} = \pi(x') q(x | x')$ RHS: $\pi(x') q(x | x') \cdot 1 = \pi(x') q(x | x')$ ✓
Case 2: If the inequality is reversed, swap the roles of $x$ and $x'$ and the same computation balances. ✓
The acceptance probability is precisely designed to achieve detailed balance.
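The case analysis above can also be checked numerically. This sketch builds the full MH kernel for a made-up three-state target with a uniform proposal and verifies both detailed balance and stationarity (states and weights are illustrative):

```python
import numpy as np

# Made-up target on three states, specified only up to a constant
pi = np.array([1.0, 2.0, 3.0])
pi = pi / pi.sum()
n = len(pi)

# Uniform (symmetric) proposal over all states
q = np.full((n, n), 1.0 / n)

# Metropolis-Hastings kernel: off-diagonal moves weighted by acceptance,
# rejected probability mass kept on the diagonal
K = np.zeros((n, n))
for x in range(n):
    for xp in range(n):
        if xp != x:
            K[x, xp] = q[x, xp] * min(1.0, (pi[xp] * q[xp, x]) / (pi[x] * q[x, xp]))
    K[x, x] = 1.0 - K[x].sum()

# Detailed balance: pi(x) K(x' | x) == pi(x') K(x | x') for every pair
flow = pi[:, None] * K
print(np.allclose(flow, flow.T))  # True

# Hence pi is stationary: pi K = pi
print(np.allclose(pi @ K, pi))    # True
```

The flow matrix is exactly symmetric, so the two sides of the detailed balance identity agree for every pair of states.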
```python
import numpy as np
from scipy import stats

def metropolis_hastings(
    log_target, proposal_sampler, log_proposal_density,
    x0, n_samples, burn_in=1000
):
    """
    General Metropolis-Hastings algorithm.

    Parameters
    ----------
    log_target : callable
        Log of (possibly unnormalized) target density
    proposal_sampler : callable(x)
        Returns a sample x' from q(x'|x)
    log_proposal_density : callable(x_new, x_old)
        Returns log q(x_new | x_old)
    x0 : array-like
        Initial state
    n_samples : int
        Number of samples after burn-in
    burn_in : int
        Number of initial samples to discard

    Returns
    -------
    samples : ndarray
        MCMC samples from target distribution
    acceptance_rate : float
        Fraction of proposals accepted
    """
    x = np.array(x0, dtype=float)
    samples = []
    n_accepted = 0
    total_iterations = n_samples + burn_in

    for t in range(total_iterations):
        # Propose new state
        x_proposed = proposal_sampler(x)

        # Compute log acceptance ratio
        log_alpha = (log_target(x_proposed) - log_target(x)
                     + log_proposal_density(x, x_proposed)
                     - log_proposal_density(x_proposed, x))

        # Accept or reject
        if np.log(np.random.uniform()) < log_alpha:
            x = x_proposed
            n_accepted += 1

        # Store sample after burn-in
        if t >= burn_in:
            samples.append(x.copy())

    return np.array(samples), n_accepted / total_iterations

# Example: Sample from 2D posterior
# Likelihood: Y ~ N(theta_1 + theta_2, 1), observe Y = 5
# Prior: theta_1, theta_2 ~ N(0, 2^2) independently
observed_y = 5

def log_posterior(theta):
    """Log unnormalized posterior"""
    theta1, theta2 = theta
    log_prior = (stats.norm.logpdf(theta1, 0, 2)
                 + stats.norm.logpdf(theta2, 0, 2))
    log_likelihood = stats.norm.logpdf(observed_y, theta1 + theta2, 1)
    return log_prior + log_likelihood

# Random walk proposal: q(theta' | theta) = N(theta, sigma^2 I)
proposal_std = 0.5

def proposal_sampler(theta):
    return theta + np.random.normal(0, proposal_std, size=2)

def log_proposal_density(theta_new, theta_old):
    # Symmetric random walk: q(new|old) = q(old|new)
    return np.sum(stats.norm.logpdf(theta_new - theta_old, 0, proposal_std))

# Run MCMC
np.random.seed(42)
x0 = np.array([0.0, 0.0])
samples, acc_rate = metropolis_hastings(
    log_posterior, proposal_sampler, log_proposal_density,
    x0, n_samples=10000, burn_in=2000
)

print("=== Metropolis-Hastings: 2D Posterior ===")
print(f"Acceptance rate: {acc_rate:.3f}")
print(f"theta_1: mean = {np.mean(samples[:, 0]):.3f}, std = {np.std(samples[:, 0]):.3f}")
print(f"theta_2: mean = {np.mean(samples[:, 1]):.3f}, std = {np.std(samples[:, 1]):.3f}")
print(f"theta_1 + theta_2: mean = {np.mean(samples[:, 0] + samples[:, 1]):.3f}")
```

Two important special cases of Metropolis-Hastings simplify the algorithm.
1. Metropolis Algorithm (Symmetric Proposal)
When the proposal is symmetric—$q(x' | x) = q(x | x')$—the proposal ratio cancels:
$$\alpha(x', x) = \min\left(1, \frac{\pi(x')}{\pi(x)}\right)$$
The acceptance depends only on the target density ratio. This is the original Metropolis algorithm (1953).
Common symmetric proposals: the Gaussian random walk $x' = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, and the uniform random walk with $\epsilon$ drawn uniformly from a box or ball centered at the origin.
2. Independence Sampler
When the proposal doesn't depend on the current state—$q(x' | x) = q(x')$—the algorithm becomes an MCMC counterpart of importance sampling:
$$\alpha(x', x) = \min\left(1, \frac{\pi(x') / q(x')}{\pi(x) / q(x)}\right) = \min\left(1, \frac{w(x')}{w(x)}\right)$$
where $w(x) = \pi(x)/q(x)$ are importance weights.
Independence sampler properties: it mixes rapidly when $q$ is a good global approximation to $\pi$, but if $q$ has thinner tails than $\pi$, the chain can get stuck for long stretches at high-weight states.
| Variant | Proposal | Acceptance Ratio | Use Case |
|---|---|---|---|
| Full MH | $q(x'|x)$ any | $\frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)}$ | General purpose |
| Metropolis | $q(x'|x) = q(x|x')$ | $\frac{\pi(x')}{\pi(x)}$ | Random walk exploration |
| Independence | $q(x'|x) = q(x')$ | $\frac{w(x')}{w(x)}$ | Global proposals |
```python
import numpy as np
from scipy import stats

def random_walk_metropolis(log_target, x0, n_samples, step_size, burn_in=1000):
    """
    Random walk Metropolis (symmetric proposal).
    Simplified version since proposal ratio = 1.
    """
    x = np.array(x0, dtype=float)
    d = len(x)
    samples = []
    n_accepted = 0

    for t in range(n_samples + burn_in):
        # Symmetric random walk proposal
        x_proposed = x + np.random.normal(0, step_size, d)

        # Acceptance ratio (no proposal correction needed)
        log_alpha = log_target(x_proposed) - log_target(x)

        if np.log(np.random.uniform()) < log_alpha:
            x = x_proposed
            n_accepted += 1

        if t >= burn_in:
            samples.append(x.copy())

    return np.array(samples), n_accepted / (n_samples + burn_in)

def independence_sampler(log_target, log_proposal, proposal_sampler,
                         x0, n_samples, burn_in=1000):
    """
    Independence sampler (proposal doesn't depend on current state).
    """
    x = np.array(x0, dtype=float)
    log_w_current = log_target(x) - log_proposal(x)
    samples = []
    n_accepted = 0

    for t in range(n_samples + burn_in):
        # Independent proposal
        x_proposed = proposal_sampler()

        # Acceptance based on weight ratio
        log_w_proposed = log_target(x_proposed) - log_proposal(x_proposed)
        log_alpha = log_w_proposed - log_w_current

        if np.log(np.random.uniform()) < log_alpha:
            x = x_proposed
            log_w_current = log_w_proposed
            n_accepted += 1

        if t >= burn_in:
            samples.append(x.copy())

    return np.array(samples), n_accepted / (n_samples + burn_in)

# Compare on 2D banana-shaped distribution
def log_banana(x, a=1, b=100):
    """Log density of banana (Rosenbrock) distribution."""
    return -((a - x[0])**2 + b * (x[1] - x[0]**2)**2) / 2

# Random walk Metropolis
samples_rw, acc_rw = random_walk_metropolis(
    log_banana, x0=[0, 0], n_samples=10000, step_size=0.1
)

print("=== Random Walk Metropolis on Banana Distribution ===")
print(f"Acceptance rate: {acc_rw:.3f}")
print(f"Mean: [{np.mean(samples_rw[:, 0]):.3f}, {np.mean(samples_rw[:, 1]):.3f}]")

# Independence sampler with Gaussian proposal (will struggle)
def log_gaussian_proposal(x):
    return stats.multivariate_normal.logpdf(x, mean=[0, 0], cov=[[1, 0], [0, 1]])

def gaussian_proposal_sampler():
    return np.random.multivariate_normal([0, 0], [[1, 0], [0, 1]])

samples_ind, acc_ind = independence_sampler(
    log_banana, log_gaussian_proposal, gaussian_proposal_sampler,
    x0=[0, 0], n_samples=10000
)

print(f"\n=== Independence Sampler (poor proposal) ===")
print(f"Acceptance rate: {acc_ind:.3f} (low = proposal mismatch)")
```

The choice of proposal distribution critically affects MCMC efficiency. The goal is rapid exploration of the target distribution.
The acceptance rate trade-off:
For random walk proposals with step size $\sigma$: if $\sigma$ is too small, nearly every proposal is accepted but the chain moves in tiny increments (high autocorrelation); if $\sigma$ is too large, proposals land in low-density regions and are rejected, so the chain stays put. Efficiency peaks in between.
Theoretical guidance:
For Gaussian targets in $d$ dimensions, optimal scaling theory suggests:
$$\text{Optimal acceptance rate} \approx \begin{cases} 0.44 & d = 1 \\ 0.234 & d \to \infty \end{cases}$$
A common rule of thumb: aim for acceptance rate ≈ 20-40% for random walk proposals.
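The trade-off can be observed directly. This sketch (illustrative constants, standard normal target, random walk Metropolis) measures acceptance rates across a sweep of step sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    return -0.5 * x**2   # standard normal target, unnormalized

def acceptance_rate(step_size, n_iter=20_000):
    """Run random walk Metropolis; return the fraction of accepted moves."""
    x, accepted = 0.0, 0
    for _ in range(n_iter):
        x_prop = x + step_size * rng.standard_normal()
        if np.log(rng.uniform()) < log_target(x_prop) - log_target(x):
            x = x_prop
            accepted += 1
    return accepted / n_iter

rates = {step: acceptance_rate(step) for step in [0.1, 1.0, 2.4, 10.0]}
for step, rate in rates.items():
    print(f"step size {step:5.1f}: acceptance rate {rate:.2f}")
```

Tiny steps are accepted almost always but barely move the chain; huge steps are mostly rejected. For a 1D Gaussian target the sweet spot sits near step size 2.4, with acceptance close to the 0.44 figure quoted above.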
Adaptive MCMC:
Modern approaches tune the proposal during sampling: adaptive Metropolis updates the proposal covariance from the empirical covariance of past samples, and scale adaptation adjusts the step size toward a target acceptance rate.
Adaptation must decrease over time for rigorous convergence guarantees.
For d-dimensional Gaussian targets with covariance Σ, the optimal random walk proposal is N(x, (2.38²/d)Σ). This achieves ~23.4% acceptance and maximizes the expected squared jumping distance per iteration. Using the posterior covariance as a proxy for Σ substantially improves mixing.
```python
import numpy as np
from scipy import stats

def adaptive_metropolis(log_target, x0, n_samples,
                        initial_cov=None, adaptation_period=1000):
    """
    Adaptive Metropolis with empirical covariance tuning.
    Uses the 2.38^2/d scaling factor for optimal mixing.
    """
    x = np.array(x0, dtype=float)
    d = len(x)
    if initial_cov is None:
        initial_cov = 0.1 * np.eye(d)

    # Scaling factor for d dimensions
    scale = 2.38**2 / d
    proposal_cov = scale * initial_cov

    samples = [x.copy()]
    n_accepted = 0

    # For computing running mean and covariance
    running_mean = x.copy()
    running_cov = np.zeros((d, d))

    for t in range(1, n_samples):
        # Propose from current adaptive covariance
        try:
            x_proposed = np.random.multivariate_normal(x, proposal_cov)
        except np.linalg.LinAlgError:
            # Fallback if covariance becomes singular
            x_proposed = x + np.random.normal(0, 0.1, d)

        # Accept/reject
        log_alpha = log_target(x_proposed) - log_target(x)
        if np.log(np.random.uniform()) < log_alpha:
            x = x_proposed
            n_accepted += 1

        samples.append(x.copy())

        # Update running statistics (Welford's algorithm)
        old_mean = running_mean.copy()
        running_mean = old_mean + (x - old_mean) / (t + 1)
        running_cov = (t - 1) / t * running_cov + \
            np.outer(x - old_mean, x - running_mean) / t

        # Adapt proposal covariance periodically
        if t > adaptation_period and t % 100 == 0:
            proposal_cov = scale * (running_cov + 1e-6 * np.eye(d))

    return np.array(samples), n_accepted / n_samples

# Test on correlated 2D Gaussian
target_mean = [0, 0]
target_cov = [[1, 0.9], [0.9, 1]]  # Highly correlated

def log_target(x):
    return stats.multivariate_normal.logpdf(x, target_mean, target_cov)

# Compare non-adaptive vs adaptive
print("=== Proposal Tuning Comparison ===")
print("Target: 2D Gaussian with correlation = 0.9")
print()

# Non-adaptive with identity covariance (poor for correlated target)
def run_fixed_proposal(proposal_cov):
    samples = [np.array([0.0, 0.0])]
    x = samples[0]
    n_accepted = 0
    for _ in range(9999):
        x_proposed = np.random.multivariate_normal(x, proposal_cov)
        if np.log(np.random.uniform()) < log_target(x_proposed) - log_target(x):
            x = x_proposed
            n_accepted += 1
        samples.append(x.copy())
    return np.array(samples), n_accepted / 10000

samples_poor, acc_poor = run_fixed_proposal(0.1 * np.eye(2))
print(f"Non-adaptive (identity proposal):")
print(f"  Acceptance rate: {acc_poor:.3f}")
print(f"  Sample covariance:\n{np.cov(samples_poor.T)}")

# Adaptive
samples_adaptive, acc_adaptive = adaptive_metropolis(
    log_target, [0.0, 0.0], 10000
)
print(f"\nAdaptive Metropolis:")
print(f"  Acceptance rate: {acc_adaptive:.3f}")
print(f"  Sample covariance:\n{np.cov(samples_adaptive.T)}")
print(f"\nTrue target covariance:\n{np.array(target_cov)}")
```

A critical aspect of MCMC practice: how do we know the chain has converged?
Unlike optimization, where we can check gradients, MCMC gives no direct signal. We rely on diagnostics that detect lack of convergence rather than proving convergence.
1. Trace plots:
Plot parameter values vs. iteration. Look for: a stable "fuzzy caterpillar" appearance with no drift, rapid up-and-down movement (good mixing), and agreement between chains started from different points.
2. Autocorrelation:
Compute $\rho_k = \text{Cor}(X_t, X_{t+k})$ for lags $k = 1, 2, \ldots$ Autocorrelation that decays slowly with $k$ signals slow mixing.
3. Effective Sample Size (ESS):
$$\text{ESS} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}$$
ESS is the number of independent samples' worth of information in the chain. ESS $\ll N$ indicates high autocorrelation.
4. $\hat{R}$ (R-hat / Gelman-Rubin statistic):
Run multiple chains from dispersed starting points. Compare within-chain and between-chain variance:
$$\hat{R} = \sqrt{\frac{\text{Var}(\text{pooled})}{\text{Mean within-chain var}}}$$
All diagnostics can only indicate problems—they cannot prove convergence. A chain could pass all tests yet miss a distant mode. Best practice: run several chains from overdispersed starting points, require $\hat{R}$ close to 1 (e.g., below 1.01), check that the ESS supports the precision you report, and inspect trace plots.
```python
import numpy as np
from scipy import stats

def compute_ess(samples):
    """
    Compute effective sample size from autocorrelation.
    Uses the initial positive sequence estimator (Geyer's method).
    """
    N = len(samples)

    # Compute autocorrelation using FFT
    samples_centered = samples - np.mean(samples)
    fft = np.fft.fft(samples_centered, n=2*N)
    acf = np.fft.ifft(fft * np.conj(fft))[:N].real
    acf = acf / acf[0]

    # Initial positive sequence: sum pair sums Gamma_m = rho_{2m} + rho_{2m+1}
    # until a pair goes negative
    gamma_sum = 0.0
    for k in range(0, N - 1, 2):
        pair_sum = acf[k] + acf[k + 1]
        if pair_sum < 0:
            break
        gamma_sum += pair_sum

    # Integrated autocorrelation time: tau = 2 * sum(Gamma_m) - 1
    tau = max(1.0, 2 * gamma_sum - 1)
    return N / tau

def compute_rhat(chains):
    """
    Compute R-hat (Gelman-Rubin) statistic.
    chains: list of M chains, each of length N
    """
    N = len(chains[0])

    # Within-chain variance
    W = np.mean([np.var(chain, ddof=1) for chain in chains])

    # Between-chain variance
    chain_means = [np.mean(chain) for chain in chains]
    B = N * np.var(chain_means, ddof=1)

    # Pooled variance estimate
    var_estimate = ((N - 1) * W + B) / N

    # R-hat
    rhat = np.sqrt(var_estimate / W) if W > 0 else np.inf
    return rhat

def mcmc_diagnostics_demo():
    """
    Demonstrate MCMC diagnostics on a simple problem.
    """
    # Target: N(3, 1)
    def log_target(x):
        return stats.norm.logpdf(x, 3, 1)

    # Run 4 chains from dispersed starts
    n_chains = 4
    n_samples = 5000
    starting_points = [-10, 0, 5, 15]  # Dispersed

    chains = []
    for start in starting_points:
        samples = [start]
        x = start
        for _ in range(n_samples - 1):
            x_proposed = x + np.random.normal(0, 1)
            if np.log(np.random.uniform()) < log_target(x_proposed) - log_target(x):
                x = x_proposed
            samples.append(x)
        chains.append(np.array(samples))

    print("=== MCMC Convergence Diagnostics ===")
    print(f"Target: N(3, 1)")
    print(f"Chains: {n_chains}, Samples per chain: {n_samples}")
    print(f"Starting points: {starting_points}")
    print()

    # R-hat (should be close to 1); use second half of chains (after burn-in)
    rhat = compute_rhat([chain[n_samples//2:] for chain in chains])
    print(f"R-hat: {rhat:.4f}")
    print(f"  → {'Good convergence' if rhat < 1.1 else 'Convergence issue!'}")

    # ESS for each chain
    print(f"\nEffective Sample Size (per chain):")
    for i, chain in enumerate(chains):
        ess = compute_ess(chain[n_samples//2:])
        print(f"  Chain {i+1}: ESS = {ess:.0f} ({100*ess/(n_samples//2):.1f}%)")

    # Pooled statistics
    pooled = np.concatenate([chain[n_samples//2:] for chain in chains])
    print(f"\nPooled posterior summary:")
    print(f"  Mean: {np.mean(pooled):.4f} (true: 3.0)")
    print(f"  Std: {np.std(pooled):.4f} (true: 1.0)")
    print(f"  95% CI: [{np.percentile(pooled, 2.5):.4f}, {np.percentile(pooled, 97.5):.4f}]")

mcmc_diagnostics_demo()
```

Two common practices in MCMC sampling are burn-in and thinning. Understanding when and why to use them is important.
Burn-in:
The initial samples from an MCMC chain depend on the starting point and may not represent the target distribution. Burn-in discards these initial samples.
How much burn-in? There is no universal rule. Inspect trace plots from multiple dispersed starting points and discard samples up to the point where the chains overlap; many practitioners conservatively discard the first half of each chain.
Thinning:
Thinning keeps only every $k$th sample (e.g., every 10th), reducing autocorrelation.
Should you thin?
The traditional advice was to thin heavily. Modern understanding:
$$\text{Var}(\hat{\theta}) \approx \frac{\sigma^2}{N}\left(1 + 2\sum_{k=1}^{\infty} \rho_k\right) = \frac{\sigma^2}{\text{ESS}}$$
Thinning by $k$ reduces $N$ to $N/k$ but only moderately reduces the autocorrelation factor, so the variance of the estimate goes up. For inference, thinning usually hurts precision.
When thinning makes sense: when memory or storage for the full chain is prohibitive, or when each retained sample feeds an expensive downstream computation.
For most applications: use burn-in based on diagnostics, but don't thin. Keep all post-burn-in samples. Report ESS so readers know the effective information content. Thinning discards information you've already paid to compute. Exception: if storing 10 million samples is prohibitive, thin to a manageable size.
| Practice | Purpose | Recommendation |
|---|---|---|
| Burn-in | Remove initialization bias | Yes, use diagnostics to determine length |
| Thinning | Reduce autocorrelation | Generally no; keep all samples for inference |
| Multiple chains | Detect non-convergence | Yes, always run 2-4 chains minimum |
| Report ESS | Communicate precision | Yes, always report with estimates |
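As a rough empirical check of the thinning advice, this sketch compares the effective sample size of a full autocorrelated chain against a thinned copy, using a crude autocorrelation-truncation ESS estimate (an AR(1) chain stands in for MCMC output; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def crude_ess(x):
    """Crude ESS: truncate the FFT-based autocorrelation at its first negative value."""
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, 2 * n)
    acf = np.fft.irfft(f * np.conj(f))[:n]
    acf = acf / acf[0]
    s = 0.0
    for k in range(1, n):
        if acf[k] < 0:
            break
        s += acf[k]
    return n / (1 + 2 * s)

# AR(1) chain as a stand-in for autocorrelated MCMC output
a, n = 0.8, 50_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + rng.standard_normal()

full_ess = crude_ess(x)
thinned_ess = crude_ess(x[::20])   # keep every 20th sample

print(f"full chain:    N = {n},    ESS ≈ {full_ess:.0f}")
print(f"thinned by 20: N = {n // 20}, ESS ≈ {thinned_ess:.0f}")
```

The thinned chain is nearly uncorrelated, yet its ESS is still well below that of the full chain: throwing away samples discards information that was already paid for.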
The Metropolis-Hastings algorithm provides a general framework for sampling from complex distributions using only unnormalized density evaluations. It transformed Bayesian computation by making high-dimensional posterior sampling practical.
What's Next:
While Metropolis-Hastings is general-purpose, it updates all parameters jointly. For models with specific conditional structure, Gibbs sampling offers a powerful alternative: update each parameter (or block) from its full conditional distribution, cycling through the parameter space. This exploits model structure for faster mixing.
You now understand the Metropolis-Hastings algorithm: its theoretical foundations in Markov chains and detailed balance, the algorithm itself, proposal design strategies, and essential convergence diagnostics. This is the workhorse of Bayesian computation, enabling inference for virtually any model.