Rejection sampling's core limitation is waste: samples that fall outside the envelope are discarded completely. In high dimensions, nearly all proposals are rejected, making the method impractical.
Importance sampling takes a fundamentally different approach. Instead of discarding samples that don't match the target, we keep all samples but assign them weights reflecting how well they represent the target distribution. Samples from over-represented regions get low weights; samples from under-represented regions get high weights.
This simple shift has profound implications: no samples are wasted, no envelope constant needs to be found, and the same set of draws can be reused to estimate expectations of different functions or even under different targets.
However, importance sampling introduces its own challenges. When the proposal poorly matches the target, weights become highly variable, and the effective sample size collapses. Understanding this variance-efficiency trade-off is essential for practical application.
This page develops importance sampling from first principles. You will understand importance weights, derive both standard and self-normalized estimators, analyze variance and effective sample size, and learn diagnostics for detecting when importance sampling is failing.
The mathematical foundation of importance sampling is a simple identity about expectations.
The importance sampling identity:
Suppose we want to compute $\mathbb{E}_{p}[f(x)]$ but can only sample from $q(x)$. A crucial observation:
$$\mathbb{E}_{p}[f(x)] = \int f(x) \, p(x) \, dx = \int f(x) \cdot \frac{p(x)}{q(x)} \cdot q(x) \, dx = \mathbb{E}_{q}\left[f(x) \cdot \frac{p(x)}{q(x)}\right]$$
The expectation under $p$ equals the expectation under $q$ of the weighted function $f(x) \cdot w(x)$, where:
$$w(x) = \frac{p(x)}{q(x)}$$
This ratio $w(x)$ is called the importance weight or likelihood ratio.
Intuition:
The weights correct for the mismatch between proposal and target: where $q$ over-samples relative to $p$, the ratio $p(x)/q(x) < 1$ down-weights those samples; where $q$ under-samples, $p(x)/q(x) > 1$ up-weights them.
For importance sampling to be valid, the proposal must have support at least as broad as the target: wherever p(x) > 0, we must have q(x) > 0. Otherwise, w(x) = p(x)/q(x) would divide by zero, and regions with p(x) > 0 but q(x) = 0 would be completely missed. Unlike rejection sampling, we don't require q to dominate p—just to cover its support.
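A minimal sketch of this failure mode (distributions chosen purely for illustration): a Uniform(0, 1) proposal silently misses all of the target's mass outside $(0, 1)$, and the weighted estimate converges to the wrong answer with no visible warning.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 200_000

# Target N(0, 1) has mass on all of R, but the proposal Uniform(0, 1) does not.
# The weights never "see" the missing region, so the self-normalized estimate
# quietly converges to the mean of N(0, 1) restricted to (0, 1), not to 0.
x = rng.uniform(0.0, 1.0, N)
w = stats.norm.pdf(x) / 1.0              # q(x) = 1 on (0, 1)

print(np.sum(x * w) / np.sum(w))         # ≈ 0.46, with no warning from the weights
```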
The importance sampling estimator:
Given samples $x^{(1)}, \ldots, x^{(N)} \sim q(x)$, the importance sampling (IS) estimator for $\mathbb{E}_p[f(x)]$ is:
$$\hat{I}_{\text{IS}} = \frac{1}{N} \sum_{n=1}^{N} f(x^{(n)}) \cdot w(x^{(n)}) = \frac{1}{N} \sum_{n=1}^{N} f(x^{(n)}) \cdot \frac{p(x^{(n)})}{q(x^{(n)})}$$
Properties: the estimator is unbiased, $\mathbb{E}_q[\hat{I}_{\text{IS}}] = \mathbb{E}_p[f(x)]$, and consistent by the law of large numbers. However, it requires evaluating both $p$ and $q$ as properly normalized densities, and its variance depends entirely on how variable $f(x) \, w(x)$ is under $q$.
```python
import numpy as np
from scipy import stats

def importance_sampling(f, target_pdf, proposal_sampler, proposal_pdf, N):
    """
    Estimate E_p[f(x)] using importance sampling.

    Parameters:
    -----------
    f : callable
        Function to integrate
    target_pdf : callable
        Target probability density p(x) (must be normalized for standard IS)
    proposal_sampler : callable(N)
        Returns N samples from proposal q(x)
    proposal_pdf : callable
        Proposal probability density q(x)
    N : int
        Number of samples

    Returns:
    --------
    estimate : float
        Importance sampling estimate
    std_error : float
        Estimated standard error
    weights : ndarray
        Importance weights for analysis
    """
    # Draw samples from proposal
    samples = proposal_sampler(N)

    # Compute importance weights
    weights = np.array([target_pdf(x) / proposal_pdf(x) for x in samples])

    # Weighted function values
    weighted_f = np.array([f(x) * w for x, w in zip(samples, weights)])

    # IS estimate
    estimate = np.mean(weighted_f)

    # Standard error
    std_error = np.std(weighted_f, ddof=1) / np.sqrt(N)

    return estimate, std_error, weights


# Example: Estimate E_p[X^2] where p = N(2, 0.5^2)
# Using proposal q = N(0, 1) (deliberately poor match)

def f(x):
    return x ** 2

def target_pdf(x):
    return stats.norm.pdf(x, loc=2, scale=0.5)

def proposal_sampler(N):
    return np.random.normal(0, 1, N)

def proposal_pdf(x):
    return stats.norm.pdf(x, loc=0, scale=1)

# True value: E[X^2] = Var(X) + E[X]^2 = 0.25 + 4 = 4.25
true_value = 4.25

print("=== Importance Sampling Example ===")
print("Target: N(2, 0.5²)")
print("Proposal: N(0, 1) [poor match]")
print(f"True E[X²] = {true_value}")
print()

for N in [100, 1000, 10000]:
    est, se, weights = importance_sampling(
        f, target_pdf, proposal_sampler, proposal_pdf, N
    )

    # Compute effective sample size
    ESS = np.sum(weights)**2 / np.sum(weights**2)

    print(f"N = {N:5d}: Estimate = {est:.4f} ± {1.96*se:.4f}")
    print(f"    ESS = {ESS:.1f} ({100*ESS/N:.1f}% of N)")
    print(f"    Max weight = {np.max(weights):.2f}")
    print()
```

In Bayesian inference, we typically know the target only up to a normalizing constant:
$$p(x) = \frac{\tilde{p}(x)}{Z_p}$$
where $Z_p = \int \tilde{p}(x) dx$ is unknown. The standard IS estimator requires the normalized $p(x)$, which we don't have.
The solution: Self-normalized importance sampling (SNIS)
We can estimate the normalizing constant itself using IS:
$$\hat{Z}_p = \frac{1}{N} \sum_{n=1}^{N} \frac{\tilde{p}(x^{(n)})}{q(x^{(n)})} = \frac{1}{N} \sum_{n=1}^{N} \tilde{w}(x^{(n)})$$
where $\tilde{w}(x) = \tilde{p}(x) / q(x)$ are the unnormalized weights.
The SNIS estimator:
$$\hat{I}_{\text{SNIS}} = \frac{\sum_{n=1}^{N} f(x^{(n)}) \cdot \tilde{w}(x^{(n)})}{\sum_{n=1}^{N} \tilde{w}(x^{(n)})} = \sum_{n=1}^{N} f(x^{(n)}) \cdot \bar{w}(x^{(n)})$$
where $\bar{w}(x^{(n)}) = \tilde{w}(x^{(n)}) / \sum_{m} \tilde{w}(x^{(m)})$ are the normalized weights that sum to 1.
This is a weighted average of function values, where weights reflect relative importance.
The SNIS estimator is biased but consistent. The bias arises because it's a ratio of random variables—E[A/B] ≠ E[A]/E[B]. However, the bias decreases as O(1/N), faster than the standard error decreases as O(N^{-1/2}). For large enough N, bias is negligible compared to variance. SNIS often has lower variance than standard IS, making the bias worthwhile.
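One way to see the $O(1/N)$ bias directly is to average SNIS estimates over many independent replications at small $N$. A hedged sketch, with the target, proposal, and replication count chosen only for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def snis_mean(N, reps=20_000):
    """Average SNIS estimate of E_p[x] over many replications."""
    x = rng.normal(1.0, 1.0, size=(reps, N))             # proposal q = N(1, 1)
    w = 2 * stats.norm.pdf(x) / stats.norm.pdf(x, 1, 1)  # unnormalized p̃ = 2·N(0,1)
    return ((x * w).sum(axis=1) / w.sum(axis=1)).mean()

# True E_p[x] = 0; averaging over replications exposes the bias, which
# shrinks toward 0 roughly in proportion to 1/N.
for N in [5, 10, 20, 40, 80]:
    print(f"N = {N:3d}: average SNIS estimate = {snis_mean(N):+.4f}")
```

The implementation below applies the same estimator to a Bayesian posterior.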
```python
import numpy as np
from scipy import stats

def self_normalized_importance_sampling(
    f, unnormalized_target, proposal_sampler, proposal_pdf, N
):
    """
    Self-normalized importance sampling for unnormalized targets.

    Parameters:
    -----------
    f : callable
        Function to integrate
    unnormalized_target : callable
        Unnormalized target density p̃(x) ∝ p(x)

    Returns estimate of E_p[f(x)] where p = p̃/Z is the normalized target.
    """
    # Draw samples from proposal
    samples = proposal_sampler(N)

    # Compute unnormalized importance weights
    unnorm_weights = np.array([
        unnormalized_target(x) / proposal_pdf(x) for x in samples
    ])

    # Normalize weights to sum to 1
    norm_weights = unnorm_weights / np.sum(unnorm_weights)

    # SNIS estimate (weighted average)
    estimate = np.sum([f(x) * w for x, w in zip(samples, norm_weights)])

    # Effective sample size
    ESS = 1 / np.sum(norm_weights ** 2)

    return estimate, norm_weights, ESS


# Example: Bayesian posterior inference
# Prior: theta ~ N(0, 1)
# Likelihood: X | theta ~ N(theta, 0.5^2), observe X = 2.0
# Posterior (unnormalized): p̃(theta) ∝ N(X; theta, 0.5) * N(theta; 0, 1)

observed_x = 2.0
likelihood_std = 0.5
prior_std = 1.0

def unnormalized_posterior(theta):
    """p̃(theta) = p(X|theta) * p(theta)"""
    likelihood = stats.norm.pdf(observed_x, loc=theta, scale=likelihood_std)
    prior = stats.norm.pdf(theta, loc=0, scale=prior_std)
    return likelihood * prior

# Proposal: N(1, 1) - compromise between prior and likelihood
def proposal_sampler(N):
    return np.random.normal(1, 1, N)

def proposal_pdf(theta):
    return stats.norm.pdf(theta, loc=1, scale=1)

# True posterior: conjugate normal-normal update
posterior_precision = 1/prior_std**2 + 1/likelihood_std**2
posterior_var = 1 / posterior_precision
posterior_mean = posterior_var * (0/prior_std**2 + observed_x/likelihood_std**2)
posterior_std = np.sqrt(posterior_var)

print("=== Self-Normalized Importance Sampling ===")
print(f"Prior: N(0, {prior_std}²)")
print(f"Likelihood: X ~ N(θ, {likelihood_std}²), observed X = {observed_x}")
print(f"True posterior: N({posterior_mean:.4f}, {posterior_std:.4f}²)")
print()

N = 10000

# Estimate posterior mean (f(theta) = theta)
est_mean, weights, ESS = self_normalized_importance_sampling(
    lambda x: x, unnormalized_posterior, proposal_sampler, proposal_pdf, N
)

# Estimate posterior variance (f(theta) = (theta - mean)^2)
est_var, _, _ = self_normalized_importance_sampling(
    lambda x: (x - est_mean)**2, unnormalized_posterior,
    proposal_sampler, proposal_pdf, N
)

print(f"SNIS Estimates (N = {N}):")
print(f"  Posterior mean: {est_mean:.4f} (true: {posterior_mean:.4f})")
print(f"  Posterior var:  {est_var:.4f} (true: {posterior_var:.4f})")
print(f"  ESS: {ESS:.1f} ({100*ESS/N:.1f}% of N)")
```

The variance of the IS estimator depends critically on the match between proposal and target. Understanding this variance is essential for practical importance sampling.
Variance of standard IS estimator:
$$\text{Var}_q(\hat{I}_{\text{IS}}) = \frac{1}{N} \text{Var}_q[f(x) \cdot w(x)] = \frac{1}{N} \left( \mathbb{E}_q[f(x)^2 w(x)^2] - I^2 \right)$$
$$= \frac{1}{N} \left( \mathbb{E}_p[f(x)^2 w(x)] - I^2 \right)$$
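The change-of-measure step here, $\mathbb{E}_q[f^2 w^2] = \mathbb{E}_p[f^2 w]$, can be checked numerically. A small sketch reusing the earlier example's distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 1_000_000

# Check E_q[f^2 w^2] = E_p[f^2 w] for p = N(2, 0.5^2), q = N(0, 1), f(x) = x^2.
f = lambda x: x**2
p = lambda x: stats.norm.pdf(x, 2, 0.5)
q = lambda x: stats.norm.pdf(x, 0, 1)

xq = rng.normal(0, 1, N)
lhs = np.mean(f(xq)**2 * (p(xq) / q(xq))**2)   # expectation under q

xp = rng.normal(2, 0.5, N)
rhs = np.mean(f(xp)**2 * (p(xp) / q(xp)))      # expectation under p

print(lhs, rhs)   # agree up to Monte Carlo error
```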
Comparison with standard MC:
If we could sample directly from $p$: $$\text{Var}_p(\hat{I}_{\text{MC}}) = \frac{1}{N} \left( \mathbb{E}_p[f(x)^2] - I^2 \right) = \frac{\sigma_f^2}{N}$$
The ratio of variances: $$\frac{\text{Var}(\hat{I}_{\text{IS}})}{\text{Var}(\hat{I}_{\text{MC}})} = \frac{\mathbb{E}_p[f(x)^2 w(x)] - I^2}{\sigma_f^2}$$
Key insight: IS can have lower variance than direct MC if $w(x)$ is small where $f(x)^2$ is large, or much higher variance if weights are volatile.
The variance-minimizing proposal for estimating E[f(x)] is:
$$q^*(x) \propto |f(x)| \, p(x)$$
This places proposal density proportional to the contribution of each point to the integral. With this optimal proposal, IS has zero variance for positive f(x)—but it requires knowing the integral we're trying to compute! Still, this guides intuition: good proposals concentrate samples where |f(x)| p(x) is large.
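The zero-variance claim can be checked numerically. For $p = \mathcal{N}(0,1)$ and $f(x) = x^2$, the optimal proposal $q^*(x) \propto x^2 \varphi(x)$ can be sampled exactly as a chi-distributed radius with 3 degrees of freedom and a random sign; a sketch under those assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 10_000

# q*(x) = x^2 φ(x) / Z with Z = E_p[x^2] = 1, so q* is exactly x^2 φ(x).
# |X| ~ chi(3) has density ∝ x^2 e^{-x^2/2}; attach a random sign for symmetry.
x = np.sqrt(rng.chisquare(df=3, size=N)) * rng.choice([-1.0, 1.0], size=N)

q_star = x**2 * stats.norm.pdf(x)        # already normalized since Z = 1
w = stats.norm.pdf(x) / q_star           # = 1 / x^2

terms = x**2 * w                         # f(x) · w(x) = 1 for every sample
print(terms.mean(), terms.std())         # ≈ 1.0 and ≈ 0: a zero-variance estimator
```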
When IS variance explodes:
The most dangerous situation is when the proposal has lighter tails than the target. In regions where $p(x) \gg q(x)$, weights become extremely large:
$$w(x) = \frac{p(x)}{q(x)} \to \infty \text{ as } q(x) \to 0$$
A single sample with a huge weight can dominate the entire estimate, leading to high variance and unstable results.
Practical diagnostic: weight distribution
Examine the distribution of importance weights after every run. Key quantities to monitor: the largest normalized weight, the ratio of the maximum weight to the mean weight, the coefficient of variation of the weights, and the effective sample size (introduced below).
```python
import numpy as np
from scipy import stats

def compare_is_variance():
    """
    Compare IS variance with different proposals.
    Shows how proposal choice affects efficiency.
    """
    # Target: N(0, 1)
    # Goal: Estimate P(X > 2), a rare event
    def target_pdf(x):
        return stats.norm.pdf(x, 0, 1)

    def indicator(x):
        return float(x > 2)

    # True value
    true_prob = 1 - stats.norm.cdf(2)  # ≈ 0.0228

    proposals = [
        # Proposal 1: Standard normal (same as target - baseline)
        {
            "name": "N(0, 1) [same as target]",
            "sampler": lambda N: np.random.normal(0, 1, N),
            "pdf": lambda x: stats.norm.pdf(x, 0, 1)
        },
        # Proposal 2: Shifted toward tail (good for rare event)
        {
            "name": "N(2.5, 1) [shifted to tail]",
            "sampler": lambda N: np.random.normal(2.5, 1, N),
            "pdf": lambda x: stats.norm.pdf(x, 2.5, 1)
        },
        # Proposal 3: Wrong direction (very bad)
        {
            "name": "N(-2, 1) [wrong direction]",
            "sampler": lambda N: np.random.normal(-2, 1, N),
            "pdf": lambda x: stats.norm.pdf(x, -2, 1)
        }
    ]

    N = 10000
    n_trials = 100

    print("=== Rare Event Estimation: P(X > 2) for X ~ N(0,1) ===")
    print(f"True probability: {true_prob:.6f}")
    print(f"N = {N} samples, {n_trials} trials")
    print()

    for proposal in proposals:
        estimates = []
        avg_ess = 0

        for _ in range(n_trials):
            samples = proposal["sampler"](N)

            # Importance weights
            weights = target_pdf(samples) / proposal["pdf"](samples)

            # IS estimate
            weighted_indicator = np.array([indicator(x) * w
                                           for x, w in zip(samples, weights)])
            estimate = np.mean(weighted_indicator)
            estimates.append(estimate)

            # ESS
            ESS = np.sum(weights)**2 / np.sum(weights**2)
            avg_ess += ESS / n_trials

        estimates = np.array(estimates)
        mean_est = np.mean(estimates)
        std_est = np.std(estimates)
        rmse = np.sqrt(np.mean((estimates - true_prob)**2))

        print(f"Proposal: {proposal['name']}")
        print(f"  Mean estimate:    {mean_est:.6f}")
        print(f"  Std of estimates: {std_est:.6f}")
        print(f"  RMSE:             {rmse:.6f}")
        print(f"  Average ESS:      {avg_ess:.1f} ({100*avg_ess/N:.1f}%)")
        print()

compare_is_variance()
```

The Effective Sample Size (ESS) quantifies how many independent, equally-weighted samples the importance-weighted samples are worth. It's the key diagnostic for importance sampling quality.
Definition:
For self-normalized importance sampling with normalized weights $\bar{w}_n$:
$$\text{ESS} = \frac{1}{\sum_{n=1}^{N} \bar{w}_n^2} = \frac{\left(\sum_n \tilde{w}_n\right)^2}{\sum_n \tilde{w}_n^2}$$
Interpretation: ESS ranges from $1$ (one weight carries all the mass) to $N$ (all weights equal). It answers the question: how many equally weighted, independent samples would give the same estimation precision?
Why ESS matters:
The variance of the SNIS estimator is approximately: $$\text{Var}(\hat{I}_{\text{SNIS}}) \approx \frac{\sigma_f^2}{\text{ESS}}$$
So ESS tells us the effective number of samples we have for variance purposes. With N = 10,000 samples but ESS = 50, our estimates have the precision of only 50 independent samples.
General guidelines for interpreting ESS:
• ESS/N > 50%: Excellent efficiency, proposal well-matched
• ESS/N = 20-50%: Acceptable, results reasonably reliable
• ESS/N = 5-20%: Marginal, consider improving proposal
• ESS/N < 5%: Poor, results likely unreliable
• ESS/N < 1%: Catastrophic, discard results and redesign
Always report ESS alongside IS estimates to communicate result reliability.
Connection to weight concentration:
ESS is inversely related to weight concentration. Define the weight coefficient of variation:
$$\text{CV}^2(w) = \frac{\text{Var}(w)}{\mathbb{E}[w]^2}$$
Then: $$\text{ESS} = \frac{N}{1 + \text{CV}^2(w)}$$
High weight variability directly reduces effective sample size.
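The identity is exact when the population variance is used, which makes a one-line numerical check possible; a small sketch with arbitrary skewed weights:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.5, size=5000)   # skewed unnormalized weights

# Direct ESS from normalized weights
wbar = w / w.sum()
ess_direct = 1.0 / np.sum(wbar**2)

# ESS via the CV identity (np.var defaults to ddof=0, the population variance)
cv2 = np.var(w) / np.mean(w)**2
ess_from_cv = len(w) / (1.0 + cv2)

print(ess_direct, ess_from_cv)   # identical up to floating-point rounding
```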
Alternative ESS measures:
The formula above uses the second moment of the normalized weights. Other measures exist: an entropy-based ESS, $\exp\left(-\sum_n \bar{w}_n \log \bar{w}_n\right)$ (the perplexity of the weight distribution), and a max-weight ESS, $1 / \max_n \bar{w}_n$ (both are computed in the code below).
All capture the same intuition: uniform weights → high ESS, concentrated weights → low ESS.
```python
import numpy as np
from scipy import stats

def analyze_ess(unnorm_weights):
    """
    Compute various ESS measures and weight diagnostics.
    """
    N = len(unnorm_weights)

    # Normalize weights
    norm_weights = unnorm_weights / np.sum(unnorm_weights)

    # Standard ESS (inverse of sum of squared normalized weights)
    ESS_standard = 1 / np.sum(norm_weights ** 2)

    # ESS from max weight
    ESS_max = 1 / np.max(norm_weights)

    # Entropy-based ESS
    # Handle zeros in weights
    nonzero_weights = norm_weights[norm_weights > 0]
    entropy = -np.sum(nonzero_weights * np.log(nonzero_weights))
    ESS_entropy = np.exp(entropy)

    # Weight diagnostics
    weight_cv = np.std(unnorm_weights) / np.mean(unnorm_weights)
    max_weight_ratio = np.max(unnorm_weights) / np.mean(unnorm_weights)

    return {
        'N': N,
        'ESS_standard': ESS_standard,
        'ESS_max': ESS_max,
        'ESS_entropy': ESS_entropy,
        'efficiency': ESS_standard / N,
        'weight_cv': weight_cv,
        'max_weight_ratio': max_weight_ratio,
        'norm_weights': norm_weights
    }

def demonstrate_ess_degradation():
    """
    Show how ESS degrades as target-proposal mismatch increases.
    """
    print("=== ESS Degradation with Proposal-Target Mismatch ===")
    print()

    N = 10000
    target_mean = 0
    target_std = 1

    def target_pdf(x):
        return stats.norm.pdf(x, target_mean, target_std)

    # Vary proposal mean to create mismatch
    proposal_means = [0, 0.5, 1, 2, 3, 4, 5]

    print(f"Target: N({target_mean}, {target_std}²)")
    print(f"N = {N} samples")
    print()
    print(f"{'Proposal Mean':>15} | {'ESS':>8} | {'Efficiency':>10} | {'Max w/Mean w':>12}")
    print("-" * 55)

    for pm in proposal_means:
        # Sample from proposal
        samples = np.random.normal(pm, target_std, N)

        # Compute weights
        proposal_pdf = lambda x, m=pm: stats.norm.pdf(x, m, target_std)
        weights = target_pdf(samples) / proposal_pdf(samples)

        # Analyze
        results = analyze_ess(weights)

        print(f"{pm:>15.1f} | {results['ESS_standard']:>8.1f} | "
              f"{results['efficiency']:>9.1%} | {results['max_weight_ratio']:>12.1f}")

demonstrate_ess_degradation()
print()

# Visualize weight distribution
print("=== Weight Distribution Examples ===")
print()

N = 1000
target_pdf = lambda x: stats.norm.pdf(x, 0, 1)

for mismatch, proposal_mean in [("None", 0), ("Moderate", 2), ("Severe", 4)]:
    samples = np.random.normal(proposal_mean, 1, N)
    proposal_pdf = lambda x, m=proposal_mean: stats.norm.pdf(x, m, 1)
    weights = target_pdf(samples) / proposal_pdf(samples)

    results = analyze_ess(weights)

    print(f"{mismatch} mismatch (proposal mean = {proposal_mean}):")
    print(f"  ESS = {results['ESS_standard']:.1f} ({100*results['efficiency']:.1f}%)")
    print(f"  Weight percentiles: 50th={np.percentile(weights, 50):.3f}, "
          f"90th={np.percentile(weights, 90):.3f}, "
          f"99th={np.percentile(weights, 99):.3f}, "
          f"max={np.max(weights):.3f}")
    print()
```

When a single proposal is inadequate—especially for multimodal targets—Multiple Importance Sampling (MIS) combines samples from several proposals.
The setup: we have $K$ proposals $q_1, \ldots, q_K$ and draw $N_k$ samples from proposal $q_k$, for $N = \sum_k N_k$ samples in total, all to be combined into a single estimate of $\mathbb{E}_p[f(x)]$.
Naive combination (bad):
Simply pooling samples with the standard per-proposal weights $p(x)/q_k(x)$ fails because each weight ignores the other proposals: a sample from $q_k$ that lands where $q_k$ is tiny but another proposal is dense still receives an enormous weight.
Balance heuristic (good):
The optimal combination uses the balance heuristic. For a sample $x^{(n)}$ from any proposal:
$$w_{\text{balance}}(x) = \frac{p(x)}{\sum_{k=1}^{K} \frac{N_k}{N} q_k(x)} = \frac{p(x)}{\bar{q}(x)}$$
where $\bar{q}(x) = \sum_k \frac{N_k}{N} q_k(x)$ is the mixture proposal.
Why this works:
The balance heuristic treats all samples as if they came from the mixture $\bar{q}$. A point covered by any proposal gets reasonable weight, avoiding the exploding weights that occur when a point is far from a single proposal.
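To make this concrete, compare the two weighting rules at a single point near one mode of the bimodal target used in the example below (an illustrative evaluation of the weight functions, not a full estimator, with equal sample allocation assumed):

```python
import numpy as np
from scipy import stats

# Bimodal target and one narrow proposal per mode
target = lambda x: 0.4 * stats.norm.pdf(x, -3, 0.7) + 0.6 * stats.norm.pdf(x, 3, 0.9)
q1 = lambda x: stats.norm.pdf(x, -3, 0.7)   # covers the left mode
q2 = lambda x: stats.norm.pdf(x, 3, 0.9)    # covers the right mode

x = 2.0   # near the right mode, deep in q1's tail

naive_w_q1 = target(x) / q1(x)                       # rule for a sample labeled "from q1"
balance_w = target(x) / (0.5 * q1(x) + 0.5 * q2(x))  # mixture denominator

print(f"naive weight under q1 alone: {naive_w_q1:.3e}")   # astronomically large
print(f"balance-heuristic weight:    {balance_w:.3f}")    # well behaved, ≈ 1
```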
Multiple importance sampling excels for multimodal targets. Place a narrow proposal near each mode. Samples from each proposal get good weights near their respective modes. The balance heuristic ensures samples aren't penalized for being far from other proposals' modes. This is far more efficient than a single broad proposal that wastes samples in low-probability regions between modes.
```python
import numpy as np
from scipy import stats

def multiple_importance_sampling(
    f, target_pdf, proposal_samplers, proposal_pdfs, samples_per_proposal
):
    """
    Multiple importance sampling with balance heuristic.

    Parameters:
    -----------
    proposal_samplers : list of callables
        Each sampler(N) returns N samples from proposal k
    proposal_pdfs : list of callables
        Each pdf(x) returns proposal k's density at x
    samples_per_proposal : list of int
        Number of samples from each proposal
    """
    K = len(proposal_samplers)
    N_total = sum(samples_per_proposal)
    weights_k = [n / N_total for n in samples_per_proposal]

    all_samples = []
    all_weights = []

    for k, (sampler, N_k) in enumerate(zip(proposal_samplers, samples_per_proposal)):
        samples = sampler(N_k)

        for x in samples:
            # Mixture proposal density (balance heuristic denominator)
            mixture_density = sum(
                w * pdf(x) for w, pdf in zip(weights_k, proposal_pdfs)
            )

            # Balance heuristic weight
            weight = target_pdf(x) / mixture_density

            all_samples.append(x)
            all_weights.append(weight)

    samples = np.array(all_samples)
    weights = np.array(all_weights)

    # Normalize weights
    norm_weights = weights / np.sum(weights)

    # SNIS estimate
    estimate = np.sum([f(x) * w for x, w in zip(samples, norm_weights)])

    # ESS
    ESS = 1 / np.sum(norm_weights ** 2)

    return estimate, ESS, samples, norm_weights


# Example: Bimodal target
# Target: 0.4 * N(-3, 0.7) + 0.6 * N(3, 0.9)

def target_pdf(x):
    return 0.4 * stats.norm.pdf(x, -3, 0.7) + 0.6 * stats.norm.pdf(x, 3, 0.9)

# Single proposal (covers both modes poorly)
single_sampler = lambda N: np.random.normal(0, 3, N)
single_pdf = lambda x: stats.norm.pdf(x, 0, 3)

# Multiple proposals (one per mode)
multi_samplers = [
    lambda N: np.random.normal(-3, 0.7, N),
    lambda N: np.random.normal(3, 0.9, N)
]
multi_pdfs = [
    lambda x: stats.norm.pdf(x, -3, 0.7),
    lambda x: stats.norm.pdf(x, 3, 0.9)
]

# True mean: 0.4 * (-3) + 0.6 * 3 = 0.6
true_mean = 0.4 * (-3) + 0.6 * 3
N = 5000

print("=== Multiple Importance Sampling: Bimodal Target ===")
print("Target: 0.4 * N(-3, 0.7) + 0.6 * N(3, 0.9)")
print(f"True mean: {true_mean}")
print()

# Single proposal approach
samples_single = single_sampler(N)
weights_single = target_pdf(samples_single) / single_pdf(samples_single)
norm_weights_single = weights_single / np.sum(weights_single)
est_single = np.sum(samples_single * norm_weights_single)
ESS_single = 1 / np.sum(norm_weights_single ** 2)

print(f"Single proposal N(0, 3):")
print(f"  Estimate: {est_single:.4f}")
print(f"  ESS: {ESS_single:.1f} ({100*ESS_single/N:.1f}%)")

# Multiple proposal approach
est_multi, ESS_multi, _, _ = multiple_importance_sampling(
    lambda x: x, target_pdf, multi_samplers, multi_pdfs,
    [int(0.4 * N), int(0.6 * N)]  # Match mode weights
)

print(f"\nMultiple proposals (one per mode):")
print(f"  Estimate: {est_multi:.4f}")
print(f"  ESS: {ESS_multi:.1f} ({100*ESS_multi/N:.1f}%)")
```

Importance sampling is particularly natural for Bayesian inference, where we typically know posteriors only up to normalizing constants.
Common setup: the target is the posterior $p(\theta | \mathcal{D}) \propto p(\mathcal{D} | \theta) \, p(\theta)$, known only up to the evidence $p(\mathcal{D})$, and the quantities of interest are posterior expectations $\mathbb{E}[f(\theta) | \mathcal{D}]$, estimated with SNIS.
Using the prior as proposal:
When $q(\theta) = p(\theta)$, the importance weight simplifies beautifully:
$$w(\theta) = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{p(\theta)} = p(\mathcal{D} | \theta)$$
The weight is just the likelihood! This is computationally convenient and interpretable: samples with high likelihood get high weight.
Limitations of prior sampling:
Using the prior works only when the posterior is "close" to the prior—i.e., when data is informative but not overwhelming. If the likelihood concentrates in a small region of prior mass, most prior samples get negligible weight.
Better proposals:
| Proposal | Weights | When to Use |
|---|---|---|
| Prior p(θ) | Likelihood p(D\|θ) | Weak data, prior reasonably accurate |
| Laplace approximation | Posterior / Gaussian | Unimodal, smooth posteriors |
| Variational approximation | Posterior / q_VI(θ) | Complex models with VI available |
| Previous posterior (t-1) | Likelihood ratio | Sequential/online settings |
| Mixture of proposals | Balance heuristic | Multimodal posteriors |
```python
import numpy as np
from scipy import stats, optimize

def bayesian_is_comparison():
    """
    Compare different proposals for Bayesian IS:
    1. Prior as proposal
    2. Laplace approximation as proposal
    """
    # Model: Observations X_i ~ N(theta, 1), prior theta ~ N(0, 2^2)
    # With many observations, posterior concentrates away from prior
    prior_mean, prior_std = 0, 2
    likelihood_std = 1
    true_theta = 3.5

    # Generate observations
    np.random.seed(42)
    n_obs = 50
    observations = np.random.normal(true_theta, likelihood_std, n_obs)

    # True posterior (conjugate): N(posterior_mean, posterior_std)
    posterior_precision = 1/prior_std**2 + n_obs/likelihood_std**2
    posterior_var = 1 / posterior_precision
    posterior_mean = posterior_var * (prior_mean/prior_std**2
                                      + np.sum(observations)/likelihood_std**2)
    posterior_std = np.sqrt(posterior_var)

    print("=== Bayesian IS: Prior vs Laplace Proposal ===")
    print(f"True theta: {true_theta}")
    print(f"Prior: N({prior_mean}, {prior_std}²)")
    print(f"Data: {n_obs} observations, sample mean = {np.mean(observations):.3f}")
    print(f"True posterior: N({posterior_mean:.3f}, {posterior_std:.3f}²)")
    print()

    def log_likelihood(theta):
        return np.sum(stats.norm.logpdf(observations, theta, likelihood_std))

    def log_prior(theta):
        return stats.norm.logpdf(theta, prior_mean, prior_std)

    def log_posterior_unnorm(theta):
        return log_likelihood(theta) + log_prior(theta)

    N = 10000

    # Method 1: Prior as proposal (weights = likelihood)
    prior_samples = np.random.normal(prior_mean, prior_std, N)
    log_w = np.array([log_likelihood(theta) for theta in prior_samples])
    # Subtract the max log-weight before exponentiating to avoid underflow;
    # normalized weights are unchanged by this constant shift.
    prior_weights = np.exp(log_w - np.max(log_w))
    prior_weights_norm = prior_weights / np.sum(prior_weights)

    ESS_prior = 1 / np.sum(prior_weights_norm ** 2)
    mean_est_prior = np.sum(prior_samples * prior_weights_norm)

    print(f"Prior as proposal:")
    print(f"  Mean estimate: {mean_est_prior:.4f}")
    print(f"  ESS: {ESS_prior:.1f} ({100*ESS_prior/N:.1f}%)")
    print(f"  Max weight / Mean weight: "
          f"{np.max(prior_weights) / np.mean(prior_weights):.1f}")

    # Method 2: Laplace approximation as proposal
    # Find posterior mode
    result = optimize.minimize(lambda x: -log_posterior_unnorm(x[0]), [0])
    laplace_mean = result.x[0]

    # Approximate posterior curvature at mode via finite differences
    eps = 1e-5
    hess_approx = (log_posterior_unnorm(laplace_mean + eps)
                   - 2*log_posterior_unnorm(laplace_mean)
                   + log_posterior_unnorm(laplace_mean - eps)) / eps**2
    laplace_std = np.sqrt(-1 / hess_approx)

    laplace_samples = np.random.normal(laplace_mean, laplace_std, N)
    laplace_proposal_log = stats.norm.logpdf(laplace_samples, laplace_mean, laplace_std)
    laplace_target_log = np.array([log_posterior_unnorm(x) for x in laplace_samples])

    log_w = laplace_target_log - laplace_proposal_log
    laplace_weights = np.exp(log_w - np.max(log_w))  # same stabilization trick
    laplace_weights_norm = laplace_weights / np.sum(laplace_weights)

    ESS_laplace = 1 / np.sum(laplace_weights_norm ** 2)
    mean_est_laplace = np.sum(laplace_samples * laplace_weights_norm)

    print(f"\nLaplace approximation as proposal:")
    print(f"  Laplace: N({laplace_mean:.4f}, {laplace_std:.4f}²)")
    print(f"  Mean estimate: {mean_est_laplace:.4f}")
    print(f"  ESS: {ESS_laplace:.1f} ({100*ESS_laplace/N:.1f}%)")
    print(f"  Improvement: {ESS_laplace/ESS_prior:.1f}x higher ESS")

bayesian_is_comparison()
```

Despite its flexibility, importance sampling has fundamental limitations that restrict its applicability.
The dimensional curse:
In high dimensions, even minor mismatches between proposal and target lead to weight degeneracy. Consider this scaling: if target and proposal factor across $d$ independent coordinates, the weight is a product of $d$ per-coordinate ratios, so $\log w$ is a sum of $d$ terms and the variance of the weights compounds multiplicatively. ESS$/N$ therefore decays geometrically in $d$: a per-coordinate mismatch that is harmless in one dimension becomes catastrophic in hundreds, as the sketch below shows.
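A minimal simulation of this effect, assuming a standard-normal target in $d$ dimensions and a proposal that is 20% too wide in every coordinate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 5000
sigma_q = 1.2   # mild per-coordinate mismatch

for d in [1, 5, 10, 50, 100, 200]:
    x = rng.normal(0.0, sigma_q, size=(N, d))
    # log w = sum over coordinates of log[ φ(x_i) / φ_{σ_q}(x_i) ]
    log_w = (stats.norm.logpdf(x).sum(axis=1)
             - stats.norm.logpdf(x, scale=sigma_q).sum(axis=1))
    w = np.exp(log_w - log_w.max())           # stabilized; normalization cancels
    ess = w.sum()**2 / (w**2).sum()
    print(f"d = {d:3d}: ESS/N = {ess / N:.4f}")   # decays geometrically with d
```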
The weight variance problem:
For IS to work, we need $\mathbb{E}_q[w(x)^2] < \infty$. If the target has heavier tails than the proposal:
$$\mathbb{E}_q\left[\frac{p(x)^2}{q(x)^2}\right] = \int \frac{p(x)^2}{q(x)} dx = \infty$$
This infinite variance makes IS unusable, even theoretically.
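The canonical example is a Cauchy target with a Gaussian proposal: the weight second moment diverges and the estimate never settles. A sketch (the tail probability and checkpoints are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 100_000

# Target: standard Cauchy (heavy tails). Proposal: N(0, 1) (light tails).
# w(x) = cauchy.pdf(x) / norm.pdf(x) grows like e^{x^2/2} / x^2, so E_q[w^2] = ∞.
x = rng.normal(0.0, 1.0, N)
w = stats.cauchy.pdf(x) / stats.norm.pdf(x)

# Estimate P(X > 2) under the Cauchy target. The running estimate is erratic and
# typically too low: the Gaussian proposal essentially never reaches the deep
# Cauchy tail, and occasional extreme weights cause large jumps.
for n in [1_000, 10_000, 100_000]:
    est = np.mean((x[:n] > 2) * w[:n])
    print(f"n = {n:6d}: estimate = {est:.4f}, max weight so far = {w[:n].max():.1f}")
print(f"true value: {1 - stats.cauchy.cdf(2):.4f}")
```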
Proposal design challenges:
Finding a good proposal requires knowing a lot about the target—which we're trying to learn. It's a chicken-and-egg problem. For complex, high-dimensional posteriors, no simple parametric proposal works well.
The most dangerous aspect of IS failure is that estimates can look reasonable while being completely wrong. A few samples with large weights can drive the estimate, but their randomness isn't reflected in standard errors (which assume balanced weights). Always compute and report ESS. If ESS << N, treat results with extreme skepticism.
| Scenario | Verdict | Reason |
|---|---|---|
| Low-dimensional, good proposal | ✓ Excellent | Efficient and exact |
| Rare event estimation | ✓ Excellent | Can beat standard MC by orders of magnitude |
| Sequential updating | ✓ Good | Reweight previous samples (particle filtering) |
| Moderate dimension, good proposal | △ Viable | Monitor ESS carefully |
| High dimension (d > 20) | ✗ Risky | Weight degeneracy likely |
| Heavy-tailed target, light-tailed proposal | ✗ Fails | Infinite variance |
| Unknown target structure | ✗ Dangerous | Can't design good proposal |
Importance sampling reweights samples from a proposal distribution to estimate expectations under a target distribution. It eliminates rejection sampling's waste but introduces variance challenges.
What's Next:
Both rejection sampling and importance sampling struggle in high dimensions. The solution: Markov Chain Monte Carlo (MCMC). Instead of sampling independently from a proposal, MCMC constructs a Markov chain whose stationary distribution is the target. The Metropolis-Hastings algorithm provides a general framework that requires only evaluating the unnormalized target density—transforming Bayesian computation.
You now understand importance sampling: the reweighting principle, self-normalization for unnormalized targets, variance and ESS analysis, and fundamental limitations. This prepares you for MCMC, which overcomes these limitations through the power of Markov chain convergence.