Rejection sampling and importance sampling share a fundamental limitation: they require a proposal distribution that globally matches the target. In high dimensions, no tractable proposal achieves this—acceptance rates and importance weights degenerate.
Markov Chain Monte Carlo (MCMC) takes a radically different approach. Instead of trying to directly sample from the target, we construct a Markov chain—a sequence of random states where each state depends only on the previous one—whose stationary distribution is the target distribution.
The power of this idea is profound: we never need independent draws from the target, only a rule for moving from one state to the next, and we only need to evaluate the target up to an unknown normalizing constant.
The Metropolis-Hastings (MH) algorithm provides a general recipe for constructing such chains. Published in 1953 (Metropolis) and generalized in 1970 (Hastings), it remains the foundation of modern Bayesian computation.
This page develops the Metropolis-Hastings algorithm from first principles. You will understand detailed balance, derive the acceptance probability, learn proposal design strategies, and master practical diagnostics for ensuring your MCMC chains have converged.
Before diving into the algorithm, we need the mathematical machinery of Markov chains.
Definition: Markov Chain
A sequence of random variables $X_0, X_1, X_2, \ldots$ forms a Markov chain if:
$$P(X_{t+1} | X_t, X_{t-1}, \ldots, X_0) = P(X_{t+1} | X_t)$$
The future depends on the past only through the present state. This "memoryless" property is the Markov property.
Transition kernel:
The chain is characterized by its transition kernel $K(x' | x)$—the probability (density) of moving from state $x$ to state $x'$:
$$P(X_{t+1} \in A | X_t = x) = \int_A K(x' | x) \, dx'$$
Stationary distribution:
A distribution $\pi$ is a stationary distribution for the chain if, when $X_t \sim \pi$, then $X_{t+1} \sim \pi$:
$$\pi(x') = \int K(x' | x) \, \pi(x) \, dx$$
If we start the chain from the stationary distribution, it stays there forever. More remarkably, under mild conditions, the chain converges to the stationary distribution regardless of where it starts.
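As a minimal numerical illustration (not from the text—the two-state transition matrix here is hand-picked), stationarity and convergence can both be checked directly:

```python
import numpy as np

# Hand-picked two-state transition kernel: K[i, j] = P(next = j | current = i)
K = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Candidate stationary distribution: solves pi K = pi
pi = np.array([0.75, 0.25])
print(np.allclose(pi @ K, pi))  # True: pi is stationary

# Convergence: starting from a point mass on state 0,
# repeated transitions approach pi regardless of the start
dist = np.array([1.0, 0.0])
for _ in range(50):
    dist = dist @ K
print(np.round(dist, 6))  # ≈ [0.75 0.25]
```

The second eigenvalue of this kernel is 0.6, so the distance to stationarity shrinks by a factor of 0.6 per step—geometric convergence, as the theory promises.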
MCMC inverts the usual setup: instead of starting with a transition kernel and finding its stationary distribution, we start with a target distribution π (like a Bayesian posterior) and construct a transition kernel whose stationary distribution is π. The Metropolis-Hastings algorithm provides this construction.
Conditions for convergence:
For a Markov chain to converge to a unique stationary distribution, it must be irreducible (every region with positive probability under $\pi$ is reachable from any starting point) and aperiodic (the chain does not cycle deterministically through disjoint sets of states).
For continuous state spaces, we also need technical regularity conditions (Harris recurrence).
Ergodic theorem:
If the chain satisfies these conditions, then for almost any starting point $X_0$:
$$\frac{1}{T} \sum_{t=1}^{T} f(X_t) \xrightarrow{a.s.} \mathbb{E}_{\pi}[f(X)] \quad \text{as } T \to \infty$$
This is the MCMC analog of the Law of Large Numbers: time averages converge to expectations under the stationary distribution.
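A quick numerical check of the ergodic theorem, sketched with an AR(1) chain whose stationary distribution is $\mathcal{N}(0, 1)$ (the chain, seed, and constants are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# AR(1) chain: X_{t+1} = a X_t + sqrt(1 - a^2) eps, stationary dist N(0, 1)
a = 0.8
x = 5.0            # deliberately far from the stationary distribution
T = 200_000

running_sum = 0.0
for t in range(T):
    x = a * x + np.sqrt(1 - a**2) * rng.standard_normal()
    running_sum += x**2        # f(x) = x^2, with E_pi[X^2] = 1

time_average = running_sum / T
print(f"time average of X^2: {time_average:.3f}  (expectation under pi: 1.0)")
```

The time average converges to the stationary expectation even though the samples are dependent and the chain starts far from equilibrium.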
How do we construct a transition kernel with a desired stationary distribution? The answer lies in detailed balance.
Detailed balance condition:
A transition kernel $K$ satisfies detailed balance with respect to distribution $\pi$ if:
$$\pi(x) \, K(x' | x) = \pi(x') \, K(x | x')$$
for all states $x$ and $x'$.
Interpretation:
This says the "probability flow" from $x$ to $x'$ equals the flow from $x'$ to $x$. If we imagine many particles distributed according to $\pi$ and simultaneously apply one transition step, the flows balance—no net movement.
Why detailed balance implies stationarity:
Integrate both sides over $x$: $$\int \pi(x) \, K(x' | x) \, dx = \int \pi(x') \, K(x | x') \, dx = \pi(x') \int K(x | x') \, dx = \pi(x')$$
The left side is the distribution of $X_{t+1}$ when $X_t \sim \pi$. The right side is $\pi(x')$. So $\pi$ is stationary.
Detailed balance is sufficient, not necessary:
A chain can have a stationary distribution without satisfying detailed balance (such chains exhibit "circulation"). But detailed balance is the easiest condition to verify and engineer, making it the standard approach.
Detailed balance gives us a recipe: design a transition kernel such that π(x)K(x'|x) = π(x')K(x|x'). Metropolis-Hastings achieves this by introducing an acceptance probability that "corrects" any proposal distribution to satisfy detailed balance with respect to the target.
The Metropolis-Hastings algorithm constructs a valid MCMC transition kernel from any proposal distribution.
Setup: a target distribution $\pi$ that we can evaluate up to a normalizing constant, and a proposal distribution $q(x' | x)$ that we can both sample from and evaluate.
The algorithm:

```
Initialize X_0 arbitrarily
For t = 0, 1, 2, ..., T-1:
    1. Propose: draw x' ~ q(x' | X_t)
    2. Compute the acceptance probability:
           α(x', X_t) = min(1, π(x') q(X_t | x') / (π(X_t) q(x' | X_t)))
    3. Accept or reject:
           Draw u ~ Uniform(0, 1)
           If u ≤ α(x', X_t):  X_{t+1} = x'    (accept)
           Else:               X_{t+1} = X_t   (reject, stay)
```
The acceptance ratio:
$$\alpha(x', x) = \min\left(1, \frac{\pi(x') \, q(x | x')}{\pi(x) \, q(x' | x)}\right)$$
This involves the target density ratio $\pi(x')/\pi(x)$ and the proposal density ratio $q(x | x')/q(x' | x)$, which corrects for any asymmetry in the proposal.
A crucial feature: we only need the RATIO π(x')/π(x). Any normalizing constants cancel! For Bayesian posteriors:
$$\frac{\pi(\theta')}{\pi(\theta)} = \frac{p(D|\theta') \, p(\theta')}{p(D|\theta) \, p(\theta)}$$
The intractable evidence p(D) cancels completely. This is why MCMC transformed Bayesian computation—we never need to compute the normalizing constant.
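This cancellation is easy to verify numerically. In the sketch below, the offset `log_c` plays the role of an unknown log normalizing constant (the target, states, and constants are arbitrary illustrations):

```python
import numpy as np
from scipy import stats

# Log-density known only up to a constant: log pi(x) = log N(x; 3, 1) + log_c
def log_target_unnorm(x, log_c):
    return stats.norm.logpdf(x, 3, 1) + log_c

x_curr, x_prop = 2.0, 2.5

# The log acceptance ratio is a difference of log densities,
# so the unknown constant log_c cancels exactly
log_alpha_a = log_target_unnorm(x_prop, 0.0) - log_target_unnorm(x_curr, 0.0)
log_alpha_b = log_target_unnorm(x_prop, 123.4) - log_target_unnorm(x_curr, 123.4)
print(np.isclose(log_alpha_a, log_alpha_b))  # True
```

Every accept/reject decision is identical whatever the constant, which is exactly why the evidence $p(D)$ never needs to be computed.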
Why MH satisfies detailed balance:
The transition kernel is: $$K(x' | x) = q(x' | x) \, \alpha(x', x) + r(x) \, \delta(x' - x)$$
where $r(x) = 1 - \int q(x' | x) \, \alpha(x', x) \, dx'$ is the probability of rejection.
For $x \neq x'$, we need to verify: $$\pi(x) \, q(x' | x) \, \alpha(x', x) = \pi(x') \, q(x | x') \, \alpha(x, x')$$
Case 1: If $\pi(x') q(x | x') < \pi(x) q(x' | x)$, then $\alpha(x', x) = \frac{\pi(x') q(x | x')}{\pi(x) q(x' | x)}$ and $\alpha(x, x') = 1$.
LHS: $\pi(x) q(x' | x) \cdot \frac{\pi(x') q(x | x')}{\pi(x) q(x' | x)} = \pi(x') q(x | x')$ RHS: $\pi(x') q(x | x') \cdot 1 = \pi(x') q(x | x')$ ✓
Case 2: If the inequality is reversed, swap the roles of $x$ and $x'$ and the same computation balances. ✓
The acceptance probability is precisely designed to achieve detailed balance.
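The case analysis above can also be checked numerically. This sketch builds the full MH kernel for a made-up three-state target with a uniform proposal and verifies both detailed balance and stationarity (states and weights are illustrative):

```python
import numpy as np

# Made-up target on three states, specified only up to a constant
pi = np.array([1.0, 2.0, 3.0])
pi = pi / pi.sum()
n = len(pi)

# Uniform (symmetric) proposal over all states
q = np.full((n, n), 1.0 / n)

# Metropolis-Hastings kernel: off-diagonal moves weighted by acceptance,
# rejected probability mass kept on the diagonal
K = np.zeros((n, n))
for x in range(n):
    for xp in range(n):
        if xp != x:
            K[x, xp] = q[x, xp] * min(1.0, (pi[xp] * q[xp, x]) / (pi[x] * q[x, xp]))
    K[x, x] = 1.0 - K[x].sum()

# Detailed balance: pi(x) K(x' | x) == pi(x') K(x | x') for every pair
flow = pi[:, None] * K
print(np.allclose(flow, flow.T))  # True

# Hence pi is stationary: pi K = pi
print(np.allclose(pi @ K, pi))    # True
```

The flow matrix is exactly symmetric, so the two sides of the detailed balance identity agree for every pair of states.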
```python
import numpy as np
from scipy import stats

def metropolis_hastings(
    log_target, proposal_sampler, log_proposal_density,
    x0, n_samples, burn_in=1000
):
    """
    General Metropolis-Hastings algorithm.

    Parameters
    ----------
    log_target : callable
        Log of (possibly unnormalized) target density
    proposal_sampler : callable(x)
        Returns a sample x' from q(x'|x)
    log_proposal_density : callable(x_new, x_old)
        Returns log q(x_new | x_old)
    x0 : array-like
        Initial state
    n_samples : int
        Number of samples after burn-in
    burn_in : int
        Number of initial samples to discard

    Returns
    -------
    samples : ndarray
        MCMC samples from target distribution
    acceptance_rate : float
        Fraction of proposals accepted
    """
    x = np.array(x0, dtype=float)
    samples = []
    n_accepted = 0
    total_iterations = n_samples + burn_in

    for t in range(total_iterations):
        # Propose new state
        x_proposed = proposal_sampler(x)

        # Compute log acceptance ratio
        log_alpha = (log_target(x_proposed) - log_target(x)
                     + log_proposal_density(x, x_proposed)
                     - log_proposal_density(x_proposed, x))

        # Accept or reject
        if np.log(np.random.uniform()) < log_alpha:
            x = x_proposed
            n_accepted += 1

        # Store sample after burn-in
        if t >= burn_in:
            samples.append(x.copy())

    return np.array(samples), n_accepted / total_iterations

# Example: Sample from 2D posterior
# Likelihood: Y ~ N(theta_1 + theta_2, 1), observe Y = 5
# Prior: theta_1, theta_2 ~ N(0, 2^2) independently
observed_y = 5

def log_posterior(theta):
    """Log unnormalized posterior"""
    theta1, theta2 = theta
    log_prior = (stats.norm.logpdf(theta1, 0, 2)
                 + stats.norm.logpdf(theta2, 0, 2))
    log_likelihood = stats.norm.logpdf(observed_y, theta1 + theta2, 1)
    return log_prior + log_likelihood

# Random walk proposal: q(theta' | theta) = N(theta, sigma^2 I)
proposal_std = 0.5

def proposal_sampler(theta):
    return theta + np.random.normal(0, proposal_std, size=2)

def log_proposal_density(theta_new, theta_old):
    # Symmetric random walk: q(new|old) = q(old|new)
    return np.sum(stats.norm.logpdf(theta_new - theta_old, 0, proposal_std))

# Run MCMC
np.random.seed(42)
x0 = np.array([0.0, 0.0])
samples, acc_rate = metropolis_hastings(
    log_posterior, proposal_sampler, log_proposal_density,
    x0, n_samples=10000, burn_in=2000
)

print("=== Metropolis-Hastings: 2D Posterior ===")
print(f"Acceptance rate: {acc_rate:.3f}")
print(f"theta_1: mean = {np.mean(samples[:, 0]):.3f}, std = {np.std(samples[:, 0]):.3f}")
print(f"theta_2: mean = {np.mean(samples[:, 1]):.3f}, std = {np.std(samples[:, 1]):.3f}")
print(f"theta_1 + theta_2: mean = {np.mean(samples[:, 0] + samples[:, 1]):.3f}")
```

Two important special cases of Metropolis-Hastings simplify the algorithm.
1. Metropolis Algorithm (Symmetric Proposal)
When the proposal is symmetric—$q(x' | x) = q(x | x')$—the proposal ratio cancels:
$$\alpha(x', x) = \min\left(1, \frac{\pi(x')}{\pi(x)}\right)$$
The acceptance depends only on the target density ratio. This is the original Metropolis algorithm (1953).
Common symmetric proposals: the Gaussian random walk $x' = x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, and the uniform random walk with $\epsilon$ drawn uniformly from a box or ball centered at the origin.
2. Independence Sampler
When the proposal doesn't depend on the current state—$q(x' | x) = q(x')$—the algorithm becomes an MCMC counterpart of importance sampling:
$$\alpha(x', x) = \min\left(1, \frac{\pi(x') / q(x')}{\pi(x) / q(x)}\right) = \min\left(1, \frac{w(x')}{w(x)}\right)$$
where $w(x) = \pi(x)/q(x)$ are importance weights.
Independence sampler properties: it mixes rapidly when $q$ is a good global approximation to $\pi$, but if $q$ has thinner tails than $\pi$, the chain can get stuck for long stretches at high-weight states.
| Variant | Proposal | Acceptance Ratio | Use Case |
|---|---|---|---|
| Full MH | $q(x'|x)$ any | $\frac{\pi(x')q(x|x')}{\pi(x)q(x'|x)}$ | General purpose |
| Metropolis | $q(x'|x) = q(x|x')$ | $\frac{\pi(x')}{\pi(x)}$ | Random walk exploration |
| Independence | $q(x'|x) = q(x')$ | $\frac{w(x')}{w(x)}$ | Global proposals |
```python
import numpy as np
from scipy import stats

def random_walk_metropolis(log_target, x0, n_samples, step_size, burn_in=1000):
    """
    Random walk Metropolis (symmetric proposal).
    Simplified version since proposal ratio = 1.
    """
    x = np.array(x0, dtype=float)
    d = len(x)
    samples = []
    n_accepted = 0

    for t in range(n_samples + burn_in):
        # Symmetric random walk proposal
        x_proposed = x + np.random.normal(0, step_size, d)

        # Acceptance ratio (no proposal correction needed)
        log_alpha = log_target(x_proposed) - log_target(x)

        if np.log(np.random.uniform()) < log_alpha:
            x = x_proposed
            n_accepted += 1

        if t >= burn_in:
            samples.append(x.copy())

    return np.array(samples), n_accepted / (n_samples + burn_in)

def independence_sampler(log_target, log_proposal, proposal_sampler,
                         x0, n_samples, burn_in=1000):
    """
    Independence sampler (proposal doesn't depend on current state).
    """
    x = np.array(x0, dtype=float)
    log_w_current = log_target(x) - log_proposal(x)
    samples = []
    n_accepted = 0

    for t in range(n_samples + burn_in):
        # Independent proposal
        x_proposed = proposal_sampler()

        # Acceptance based on weight ratio
        log_w_proposed = log_target(x_proposed) - log_proposal(x_proposed)
        log_alpha = log_w_proposed - log_w_current

        if np.log(np.random.uniform()) < log_alpha:
            x = x_proposed
            log_w_current = log_w_proposed
            n_accepted += 1

        if t >= burn_in:
            samples.append(x.copy())

    return np.array(samples), n_accepted / (n_samples + burn_in)

# Compare on 2D banana-shaped distribution
def log_banana(x, a=1, b=100):
    """Log density of banana (Rosenbrock) distribution."""
    return -((a - x[0])**2 + b * (x[1] - x[0]**2)**2) / 2

# Random walk Metropolis
samples_rw, acc_rw = random_walk_metropolis(
    log_banana, x0=[0, 0], n_samples=10000, step_size=0.1
)

print("=== Random Walk Metropolis on Banana Distribution ===")
print(f"Acceptance rate: {acc_rw:.3f}")
print(f"Mean: [{np.mean(samples_rw[:, 0]):.3f}, {np.mean(samples_rw[:, 1]):.3f}]")

# Independence sampler with Gaussian proposal (will struggle)
def log_gaussian_proposal(x):
    return stats.multivariate_normal.logpdf(x, mean=[0, 0], cov=[[1, 0], [0, 1]])

def gaussian_proposal_sampler():
    return np.random.multivariate_normal([0, 0], [[1, 0], [0, 1]])

samples_ind, acc_ind = independence_sampler(
    log_banana, log_gaussian_proposal, gaussian_proposal_sampler,
    x0=[0, 0], n_samples=10000
)

print(f"\n=== Independence Sampler (poor proposal) ===")
print(f"Acceptance rate: {acc_ind:.3f} (low = proposal mismatch)")
```

The choice of proposal distribution critically affects MCMC efficiency. The goal is rapid exploration of the target distribution.
The acceptance rate trade-off:
For random walk proposals with step size $\sigma$: if $\sigma$ is too small, nearly every proposal is accepted but the chain moves in tiny increments (high autocorrelation); if $\sigma$ is too large, proposals land in low-density regions and are rejected, so the chain stays put. Efficiency peaks in between.
Theoretical guidance:
For Gaussian targets in $d$ dimensions, optimal scaling theory suggests:
$$\text{Optimal acceptance rate} \approx \begin{cases} 0.44 & d = 1 \\ 0.234 & d \to \infty \end{cases}$$
A common rule of thumb: aim for acceptance rate ≈ 20-40% for random walk proposals.
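The trade-off can be observed directly. This sketch (illustrative constants, standard normal target, random walk Metropolis) measures acceptance rates across a sweep of step sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    return -0.5 * x**2   # standard normal target, unnormalized

def acceptance_rate(step_size, n_iter=20_000):
    """Run random walk Metropolis; return the fraction of accepted moves."""
    x, accepted = 0.0, 0
    for _ in range(n_iter):
        x_prop = x + step_size * rng.standard_normal()
        if np.log(rng.uniform()) < log_target(x_prop) - log_target(x):
            x = x_prop
            accepted += 1
    return accepted / n_iter

rates = {step: acceptance_rate(step) for step in [0.1, 1.0, 2.4, 10.0]}
for step, rate in rates.items():
    print(f"step size {step:5.1f}: acceptance rate {rate:.2f}")
```

Tiny steps are accepted almost always but barely move the chain; huge steps are mostly rejected. For a 1D Gaussian target the sweet spot sits near step size 2.4, with acceptance close to the 0.44 figure quoted above.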
Adaptive MCMC:
Modern approaches tune the proposal during sampling: adaptive Metropolis updates the proposal covariance from the empirical covariance of past samples, and scale adaptation adjusts the step size toward a target acceptance rate.
Adaptation must decrease over time for rigorous convergence guarantees.
For d-dimensional Gaussian targets with covariance Σ, the optimal random walk proposal is N(x, (2.38²/d)Σ). This achieves ~23.4% acceptance and maximizes the expected squared jumping distance per iteration. Using the posterior covariance as a proxy for Σ substantially improves mixing.
```python
import numpy as np
from scipy import stats

def adaptive_metropolis(log_target, x0, n_samples,
                        initial_cov=None, adaptation_period=1000):
    """
    Adaptive Metropolis with empirical covariance tuning.
    Uses the 2.38^2/d scaling factor for optimal mixing.
    """
    x = np.array(x0, dtype=float)
    d = len(x)
    if initial_cov is None:
        initial_cov = 0.1 * np.eye(d)

    # Scaling factor for d dimensions
    scale = 2.38**2 / d
    proposal_cov = scale * initial_cov

    samples = [x.copy()]
    n_accepted = 0

    # For computing running mean and covariance
    running_mean = x.copy()
    running_cov = np.zeros((d, d))

    for t in range(1, n_samples):
        # Propose from current adaptive covariance
        try:
            x_proposed = np.random.multivariate_normal(x, proposal_cov)
        except np.linalg.LinAlgError:
            # Fallback if covariance becomes singular
            x_proposed = x + np.random.normal(0, 0.1, d)

        # Accept/reject
        log_alpha = log_target(x_proposed) - log_target(x)
        if np.log(np.random.uniform()) < log_alpha:
            x = x_proposed
            n_accepted += 1

        samples.append(x.copy())

        # Update running statistics (Welford's algorithm)
        old_mean = running_mean.copy()
        running_mean = old_mean + (x - old_mean) / (t + 1)
        running_cov = (t - 1) / t * running_cov + \
            np.outer(x - old_mean, x - running_mean) / t

        # Adapt proposal covariance periodically
        if t > adaptation_period and t % 100 == 0:
            proposal_cov = scale * (running_cov + 1e-6 * np.eye(d))

    return np.array(samples), n_accepted / n_samples

# Test on correlated 2D Gaussian
target_mean = [0, 0]
target_cov = [[1, 0.9], [0.9, 1]]  # Highly correlated

def log_target(x):
    return stats.multivariate_normal.logpdf(x, target_mean, target_cov)

# Compare non-adaptive vs adaptive
print("=== Proposal Tuning Comparison ===")
print("Target: 2D Gaussian with correlation = 0.9")
print()

# Non-adaptive with identity covariance (poor for correlated target)
def run_fixed_proposal(proposal_cov):
    samples = [np.array([0.0, 0.0])]
    x = samples[0]
    n_accepted = 0
    for _ in range(9999):
        x_proposed = np.random.multivariate_normal(x, proposal_cov)
        if np.log(np.random.uniform()) < log_target(x_proposed) - log_target(x):
            x = x_proposed
            n_accepted += 1
        samples.append(x.copy())
    return np.array(samples), n_accepted / 10000

samples_poor, acc_poor = run_fixed_proposal(0.1 * np.eye(2))
print(f"Non-adaptive (identity proposal):")
print(f"  Acceptance rate: {acc_poor:.3f}")
print(f"  Sample covariance:\n{np.cov(samples_poor.T)}")

# Adaptive
samples_adaptive, acc_adaptive = adaptive_metropolis(
    log_target, [0.0, 0.0], 10000
)
print(f"\nAdaptive Metropolis:")
print(f"  Acceptance rate: {acc_adaptive:.3f}")
print(f"  Sample covariance:\n{np.cov(samples_adaptive.T)}")
print(f"\nTrue target covariance:\n{np.array(target_cov)}")
```

A critical aspect of MCMC practice: how do we know the chain has converged?
Unlike optimization, where we can check gradients, MCMC gives no direct signal. We rely on diagnostics that detect lack of convergence rather than proving convergence.
1. Trace plots:
Plot parameter values vs. iteration. Look for: a stable "fuzzy caterpillar" appearance with no drift, rapid up-and-down movement (good mixing), and agreement between chains started from different points.
2. Autocorrelation:
Compute $\rho_k = \text{Cor}(X_t, X_{t+k})$ for lags $k = 1, 2, \ldots$ Autocorrelation that decays slowly with $k$ signals slow mixing.
3. Effective Sample Size (ESS):
$$\text{ESS} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}$$
ESS is the number of independent samples' worth of information in the chain. ESS $\ll N$ indicates high autocorrelation.
4. $\hat{R}$ (R-hat / Gelman-Rubin statistic):
Run multiple chains from dispersed starting points. Compare within-chain and between-chain variance:
$$\hat{R} = \sqrt{\frac{\text{Var}(\text{pooled})}{\text{Mean within-chain var}}}$$
All diagnostics can only indicate problems—they cannot prove convergence. A chain could pass all tests yet miss a distant mode. Best practice: run several chains from overdispersed starting points, require $\hat{R}$ close to 1 (e.g., below 1.01), check that the ESS supports the precision you report, and inspect trace plots.
```python
import numpy as np
from scipy import stats

def compute_ess(samples):
    """
    Compute effective sample size from autocorrelation.
    Uses the initial positive sequence estimator (Geyer's method).
    """
    N = len(samples)

    # Compute autocorrelation using FFT
    samples_centered = samples - np.mean(samples)
    fft = np.fft.fft(samples_centered, n=2*N)
    acf = np.fft.ifft(fft * np.conj(fft))[:N].real
    acf = acf / acf[0]

    # Initial positive sequence: sum pair sums Gamma_m = rho_{2m} + rho_{2m+1}
    # until a pair goes negative
    gamma_sum = 0.0
    for k in range(0, N - 1, 2):
        pair_sum = acf[k] + acf[k + 1]
        if pair_sum < 0:
            break
        gamma_sum += pair_sum

    # Integrated autocorrelation time: tau = 2 * sum(Gamma_m) - 1
    tau = max(1.0, 2 * gamma_sum - 1)
    return N / tau

def compute_rhat(chains):
    """
    Compute R-hat (Gelman-Rubin) statistic.
    chains: list of M chains, each of length N
    """
    N = len(chains[0])

    # Within-chain variance
    W = np.mean([np.var(chain, ddof=1) for chain in chains])

    # Between-chain variance
    chain_means = [np.mean(chain) for chain in chains]
    B = N * np.var(chain_means, ddof=1)

    # Pooled variance estimate
    var_estimate = ((N - 1) * W + B) / N

    # R-hat
    rhat = np.sqrt(var_estimate / W) if W > 0 else np.inf
    return rhat

def mcmc_diagnostics_demo():
    """
    Demonstrate MCMC diagnostics on a simple problem.
    """
    # Target: N(3, 1)
    def log_target(x):
        return stats.norm.logpdf(x, 3, 1)

    # Run 4 chains from dispersed starts
    n_chains = 4
    n_samples = 5000
    starting_points = [-10, 0, 5, 15]  # Dispersed

    chains = []
    for start in starting_points:
        samples = [start]
        x = start
        for _ in range(n_samples - 1):
            x_proposed = x + np.random.normal(0, 1)
            if np.log(np.random.uniform()) < log_target(x_proposed) - log_target(x):
                x = x_proposed
            samples.append(x)
        chains.append(np.array(samples))

    print("=== MCMC Convergence Diagnostics ===")
    print(f"Target: N(3, 1)")
    print(f"Chains: {n_chains}, Samples per chain: {n_samples}")
    print(f"Starting points: {starting_points}")
    print()

    # R-hat (should be close to 1); use second half of chains (after burn-in)
    rhat = compute_rhat([chain[n_samples//2:] for chain in chains])
    print(f"R-hat: {rhat:.4f}")
    print(f"  → {'Good convergence' if rhat < 1.1 else 'Convergence issue!'}")

    # ESS for each chain
    print(f"\nEffective Sample Size (per chain):")
    for i, chain in enumerate(chains):
        ess = compute_ess(chain[n_samples//2:])
        print(f"  Chain {i+1}: ESS = {ess:.0f} ({100*ess/(n_samples//2):.1f}%)")

    # Pooled statistics
    pooled = np.concatenate([chain[n_samples//2:] for chain in chains])
    print(f"\nPooled posterior summary:")
    print(f"  Mean: {np.mean(pooled):.4f} (true: 3.0)")
    print(f"  Std: {np.std(pooled):.4f} (true: 1.0)")
    print(f"  95% CI: [{np.percentile(pooled, 2.5):.4f}, {np.percentile(pooled, 97.5):.4f}]")

mcmc_diagnostics_demo()
```

Two common practices in MCMC sampling are burn-in and thinning. Understanding when and why to use them is important.
Burn-in:
The initial samples from an MCMC chain depend on the starting point and may not represent the target distribution. Burn-in discards these initial samples.
How much burn-in? There is no universal rule. Inspect trace plots from multiple dispersed starting points and discard samples up to the point where the chains overlap; many practitioners conservatively discard the first half of each chain.
Thinning:
Thinning keeps only every $k$th sample (e.g., every 10th), reducing autocorrelation.
Should you thin?
The traditional advice was to thin heavily. Modern understanding:
$$\text{Var}(\hat{\theta}) \approx \frac{\sigma^2}{N}\left(1 + 2\sum_{k=1}^{\infty} \rho_k\right) = \frac{\sigma^2}{\text{ESS}}$$
Thinning by $k$ reduces $N$ to $N/k$ but only moderately reduces the autocorrelation factor, so the variance of the estimate goes up. For inference, thinning usually hurts precision.
When thinning makes sense: when memory or storage for the full chain is prohibitive, or when each retained sample feeds an expensive downstream computation.
For most applications: use burn-in based on diagnostics, but don't thin. Keep all post-burn-in samples. Report ESS so readers know the effective information content. Thinning discards information you've already paid to compute. Exception: if storing 10 million samples is prohibitive, thin to a manageable size.
| Practice | Purpose | Recommendation |
|---|---|---|
| Burn-in | Remove initialization bias | Yes, use diagnostics to determine length |
| Thinning | Reduce autocorrelation | Generally no; keep all samples for inference |
| Multiple chains | Detect non-convergence | Yes, always run 2-4 chains minimum |
| Report ESS | Communicate precision | Yes, always report with estimates |
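As a rough empirical check of the thinning advice, this sketch compares the effective sample size of a full autocorrelated chain against a thinned copy, using a crude autocorrelation-truncation ESS estimate (an AR(1) chain stands in for MCMC output; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def crude_ess(x):
    """Crude ESS: truncate the FFT-based autocorrelation at its first negative value."""
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, 2 * n)
    acf = np.fft.irfft(f * np.conj(f))[:n]
    acf = acf / acf[0]
    s = 0.0
    for k in range(1, n):
        if acf[k] < 0:
            break
        s += acf[k]
    return n / (1 + 2 * s)

# AR(1) chain as a stand-in for autocorrelated MCMC output
a, n = 0.8, 50_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + rng.standard_normal()

full_ess = crude_ess(x)
thinned_ess = crude_ess(x[::20])   # keep every 20th sample

print(f"full chain:    N = {n},    ESS ≈ {full_ess:.0f}")
print(f"thinned by 20: N = {n // 20}, ESS ≈ {thinned_ess:.0f}")
```

The thinned chain is nearly uncorrelated, yet its ESS is still well below that of the full chain: throwing away samples discards information that was already paid for.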
The Metropolis-Hastings algorithm provides a general framework for sampling from complex distributions using only unnormalized density evaluations. It transformed Bayesian computation by making high-dimensional posterior sampling practical.
What's Next:
While Metropolis-Hastings is general-purpose, it updates all parameters jointly. For models with specific conditional structure, Gibbs sampling offers a powerful alternative: update each parameter (or block) from its full conditional distribution, cycling through the parameter space. This exploits model structure for faster mixing.
You now understand the Metropolis-Hastings algorithm: its theoretical foundations in Markov chains and detailed balance, the algorithm itself, proposal design strategies, and essential convergence diagnostics. This is the workhorse of Bayesian computation, enabling inference for virtually any model.