Rejection sampling's core limitation is waste: samples that fall outside the envelope are discarded completely. In high dimensions, nearly all proposals are rejected, making the method impractical.
Importance sampling takes a fundamentally different approach. Instead of discarding samples that don't match the target, we keep all samples but assign them weights reflecting how well they represent the target distribution. Samples from over-represented regions get low weights; samples from under-represented regions get high weights.
This simple shift has profound implications: no samples are wasted, no envelope constant needs to be found, and the same set of draws can be reused to estimate expectations of different functions or even under different targets.
However, importance sampling introduces its own challenges. When the proposal poorly matches the target, weights become highly variable, and the effective sample size collapses. Understanding this variance-efficiency trade-off is essential for practical application.
This page develops importance sampling from first principles. You will understand importance weights, derive both standard and self-normalized estimators, analyze variance and effective sample size, and learn diagnostics for detecting when importance sampling is failing.
The mathematical foundation of importance sampling is a simple identity about expectations.
The importance sampling identity:
Suppose we want to compute $\mathbb{E}_{p}[f(x)]$ but can only sample from $q(x)$. A crucial observation:
$$\mathbb{E}_{p}[f(x)] = \int f(x) \, p(x) \, dx = \int f(x) \cdot \frac{p(x)}{q(x)} \cdot q(x) \, dx = \mathbb{E}_{q}\left[f(x) \cdot \frac{p(x)}{q(x)}\right]$$
The expectation under $p$ equals the expectation under $q$ of the weighted function $f(x) \cdot w(x)$, where:
$$w(x) = \frac{p(x)}{q(x)}$$
This ratio $w(x)$ is called the importance weight or likelihood ratio.
Intuition:
The weights correct for the mismatch between proposal and target: where $q$ over-samples relative to $p$, the ratio $p(x)/q(x) < 1$ down-weights those samples; where $q$ under-samples, $p(x)/q(x) > 1$ up-weights them.
For importance sampling to be valid, the proposal must have support at least as broad as the target: wherever p(x) > 0, we must have q(x) > 0. Otherwise, w(x) = p(x)/q(x) would divide by zero, and regions with p(x) > 0 but q(x) = 0 would be completely missed. Unlike rejection sampling, we don't require q to dominate p—just to cover its support.
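A minimal sketch of this failure mode (distributions chosen purely for illustration): a Uniform(0, 1) proposal silently misses all of the target's mass outside $(0, 1)$, and the weighted estimate converges to the wrong answer with no visible warning.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 200_000

# Target N(0, 1) has mass on all of R, but the proposal Uniform(0, 1) does not.
# The weights never "see" the missing region, so the self-normalized estimate
# quietly converges to the mean of N(0, 1) restricted to (0, 1), not to 0.
x = rng.uniform(0.0, 1.0, N)
w = stats.norm.pdf(x) / 1.0              # q(x) = 1 on (0, 1)

print(np.sum(x * w) / np.sum(w))         # ≈ 0.46, with no warning from the weights
```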
The importance sampling estimator:
Given samples $x^{(1)}, \ldots, x^{(N)} \sim q(x)$, the importance sampling (IS) estimator for $\mathbb{E}_p[f(x)]$ is:
$$\hat{I}_{\text{IS}} = \frac{1}{N} \sum_{n=1}^{N} f(x^{(n)}) \cdot w(x^{(n)}) = \frac{1}{N} \sum_{n=1}^{N} f(x^{(n)}) \cdot \frac{p(x^{(n)})}{q(x^{(n)})}$$
Properties: the estimator is unbiased, $\mathbb{E}_q[\hat{I}_{\text{IS}}] = \mathbb{E}_p[f(x)]$, and consistent by the law of large numbers. However, it requires evaluating both $p$ and $q$ as properly normalized densities, and its variance depends entirely on how variable $f(x) \, w(x)$ is under $q$.
```python
import numpy as np
from scipy import stats

def importance_sampling(f, target_pdf, proposal_sampler, proposal_pdf, N):
    """
    Estimate E_p[f(x)] using importance sampling.

    Parameters:
    -----------
    f : callable
        Function to integrate
    target_pdf : callable
        Target probability density p(x) (must be normalized for standard IS)
    proposal_sampler : callable(N)
        Returns N samples from proposal q(x)
    proposal_pdf : callable
        Proposal probability density q(x)
    N : int
        Number of samples

    Returns:
    --------
    estimate : float
        Importance sampling estimate
    std_error : float
        Estimated standard error
    weights : ndarray
        Importance weights for analysis
    """
    # Draw samples from proposal
    samples = proposal_sampler(N)

    # Compute importance weights
    weights = np.array([target_pdf(x) / proposal_pdf(x) for x in samples])

    # Weighted function values
    weighted_f = np.array([f(x) * w for x, w in zip(samples, weights)])

    # IS estimate
    estimate = np.mean(weighted_f)

    # Standard error
    std_error = np.std(weighted_f, ddof=1) / np.sqrt(N)

    return estimate, std_error, weights


# Example: Estimate E_p[X^2] where p = N(2, 0.5^2)
# Using proposal q = N(0, 1) (deliberately poor match)

def f(x):
    return x ** 2

def target_pdf(x):
    return stats.norm.pdf(x, loc=2, scale=0.5)

def proposal_sampler(N):
    return np.random.normal(0, 1, N)

def proposal_pdf(x):
    return stats.norm.pdf(x, loc=0, scale=1)

# True value: E[X^2] = Var(X) + E[X]^2 = 0.25 + 4 = 4.25
true_value = 4.25

print("=== Importance Sampling Example ===")
print("Target: N(2, 0.5²)")
print("Proposal: N(0, 1) [poor match]")
print(f"True E[X²] = {true_value}")
print()

for N in [100, 1000, 10000]:
    est, se, weights = importance_sampling(
        f, target_pdf, proposal_sampler, proposal_pdf, N
    )

    # Compute effective sample size
    ESS = np.sum(weights)**2 / np.sum(weights**2)

    print(f"N = {N:5d}: Estimate = {est:.4f} ± {1.96*se:.4f}")
    print(f"    ESS = {ESS:.1f} ({100*ESS/N:.1f}% of N)")
    print(f"    Max weight = {np.max(weights):.2f}")
    print()
```

In Bayesian inference, we typically know the target only up to a normalizing constant:
$$p(x) = \frac{\tilde{p}(x)}{Z_p}$$
where $Z_p = \int \tilde{p}(x) dx$ is unknown. The standard IS estimator requires the normalized $p(x)$, which we don't have.
The solution: Self-normalized importance sampling (SNIS)
We can estimate the normalizing constant itself using IS:
$$\hat{Z}_p = \frac{1}{N} \sum_{n=1}^{N} \frac{\tilde{p}(x^{(n)})}{q(x^{(n)})} = \frac{1}{N} \sum_{n=1}^{N} \tilde{w}(x^{(n)})$$
where $\tilde{w}(x) = \tilde{p}(x) / q(x)$ are the unnormalized weights.
The SNIS estimator:
$$\hat{I}_{\text{SNIS}} = \frac{\sum_{n=1}^{N} f(x^{(n)}) \cdot \tilde{w}(x^{(n)})}{\sum_{n=1}^{N} \tilde{w}(x^{(n)})} = \sum_{n=1}^{N} f(x^{(n)}) \cdot \bar{w}(x^{(n)})$$
where $\bar{w}(x^{(n)}) = \tilde{w}(x^{(n)}) / \sum_{m} \tilde{w}(x^{(m)})$ are the normalized weights that sum to 1.
This is a weighted average of function values, where weights reflect relative importance.
The SNIS estimator is biased but consistent. The bias arises because it's a ratio of random variables—E[A/B] ≠ E[A]/E[B]. However, the bias decreases as O(1/N), faster than the standard error decreases as O(N^{-1/2}). For large enough N, bias is negligible compared to variance. SNIS often has lower variance than standard IS, making the bias worthwhile.
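One way to see the $O(1/N)$ bias directly is to average SNIS estimates over many independent replications at small $N$. A hedged sketch, with the target, proposal, and replication count chosen only for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def snis_mean(N, reps=20_000):
    """Average SNIS estimate of E_p[x] over many replications."""
    x = rng.normal(1.0, 1.0, size=(reps, N))             # proposal q = N(1, 1)
    w = 2 * stats.norm.pdf(x) / stats.norm.pdf(x, 1, 1)  # unnormalized p̃ = 2·N(0,1)
    return ((x * w).sum(axis=1) / w.sum(axis=1)).mean()

# True E_p[x] = 0; averaging over replications exposes the bias, which
# shrinks toward 0 roughly in proportion to 1/N.
for N in [5, 10, 20, 40, 80]:
    print(f"N = {N:3d}: average SNIS estimate = {snis_mean(N):+.4f}")
```

The implementation below applies the same estimator to a Bayesian posterior.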
```python
import numpy as np
from scipy import stats

def self_normalized_importance_sampling(
    f, unnormalized_target, proposal_sampler, proposal_pdf, N
):
    """
    Self-normalized importance sampling for unnormalized targets.

    Parameters:
    -----------
    f : callable
        Function to integrate
    unnormalized_target : callable
        Unnormalized target density p̃(x) ∝ p(x)

    Returns estimate of E_p[f(x)] where p = p̃/Z is the normalized target.
    """
    # Draw samples from proposal
    samples = proposal_sampler(N)

    # Compute unnormalized importance weights
    unnorm_weights = np.array([
        unnormalized_target(x) / proposal_pdf(x) for x in samples
    ])

    # Normalize weights to sum to 1
    norm_weights = unnorm_weights / np.sum(unnorm_weights)

    # SNIS estimate (weighted average)
    estimate = np.sum([f(x) * w for x, w in zip(samples, norm_weights)])

    # Effective sample size
    ESS = 1 / np.sum(norm_weights ** 2)

    return estimate, norm_weights, ESS


# Example: Bayesian posterior inference
# Prior: theta ~ N(0, 1)
# Likelihood: X | theta ~ N(theta, 0.5^2), observe X = 2.0
# Posterior (unnormalized): p̃(theta) ∝ N(X; theta, 0.5) * N(theta; 0, 1)

observed_x = 2.0
likelihood_std = 0.5
prior_std = 1.0

def unnormalized_posterior(theta):
    """p̃(theta) = p(X|theta) * p(theta)"""
    likelihood = stats.norm.pdf(observed_x, loc=theta, scale=likelihood_std)
    prior = stats.norm.pdf(theta, loc=0, scale=prior_std)
    return likelihood * prior

# Proposal: N(1, 1) - compromise between prior and likelihood
def proposal_sampler(N):
    return np.random.normal(1, 1, N)

def proposal_pdf(theta):
    return stats.norm.pdf(theta, loc=1, scale=1)

# True posterior: conjugate normal-normal update
posterior_precision = 1/prior_std**2 + 1/likelihood_std**2
posterior_var = 1 / posterior_precision
posterior_mean = posterior_var * (0/prior_std**2 + observed_x/likelihood_std**2)
posterior_std = np.sqrt(posterior_var)

print("=== Self-Normalized Importance Sampling ===")
print(f"Prior: N(0, {prior_std}²)")
print(f"Likelihood: X ~ N(θ, {likelihood_std}²), observed X = {observed_x}")
print(f"True posterior: N({posterior_mean:.4f}, {posterior_std:.4f}²)")
print()

N = 10000

# Estimate posterior mean (f(theta) = theta)
est_mean, weights, ESS = self_normalized_importance_sampling(
    lambda x: x, unnormalized_posterior, proposal_sampler, proposal_pdf, N
)

# Estimate posterior variance (f(theta) = (theta - mean)^2)
est_var, _, _ = self_normalized_importance_sampling(
    lambda x: (x - est_mean)**2, unnormalized_posterior,
    proposal_sampler, proposal_pdf, N
)

print(f"SNIS Estimates (N = {N}):")
print(f"  Posterior mean: {est_mean:.4f} (true: {posterior_mean:.4f})")
print(f"  Posterior var:  {est_var:.4f} (true: {posterior_var:.4f})")
print(f"  ESS: {ESS:.1f} ({100*ESS/N:.1f}% of N)")
```

The variance of the IS estimator depends critically on the match between proposal and target. Understanding this variance is essential for practical importance sampling.
Variance of standard IS estimator:
$$\text{Var}_q(\hat{I}_{\text{IS}}) = \frac{1}{N} \text{Var}_q[f(x) \cdot w(x)] = \frac{1}{N} \left( \mathbb{E}_q[f(x)^2 w(x)^2] - I^2 \right)$$
$$= \frac{1}{N} \left( \mathbb{E}_p[f(x)^2 w(x)] - I^2 \right)$$
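The change-of-measure step here, $\mathbb{E}_q[f^2 w^2] = \mathbb{E}_p[f^2 w]$, can be checked numerically. A small sketch reusing the earlier example's distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 1_000_000

# Check E_q[f^2 w^2] = E_p[f^2 w] for p = N(2, 0.5^2), q = N(0, 1), f(x) = x^2.
f = lambda x: x**2
p = lambda x: stats.norm.pdf(x, 2, 0.5)
q = lambda x: stats.norm.pdf(x, 0, 1)

xq = rng.normal(0, 1, N)
lhs = np.mean(f(xq)**2 * (p(xq) / q(xq))**2)   # expectation under q

xp = rng.normal(2, 0.5, N)
rhs = np.mean(f(xp)**2 * (p(xp) / q(xp)))      # expectation under p

print(lhs, rhs)   # agree up to Monte Carlo error
```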
Comparison with standard MC:
If we could sample directly from $p$: $$\text{Var}_p(\hat{I}_{\text{MC}}) = \frac{1}{N} \left( \mathbb{E}_p[f(x)^2] - I^2 \right) = \frac{\sigma_f^2}{N}$$
The ratio of variances: $$\frac{\text{Var}(\hat{I}_{\text{IS}})}{\text{Var}(\hat{I}_{\text{MC}})} = \frac{\mathbb{E}_p[f(x)^2 w(x)] - I^2}{\sigma_f^2}$$
Key insight: IS can have lower variance than direct MC if $w(x)$ is small where $f(x)^2$ is large, or much higher variance if weights are volatile.
The variance-minimizing proposal for estimating E[f(x)] is:
$$q^*(x) \propto |f(x)| \, p(x)$$
This places proposal density proportional to the contribution of each point to the integral. With this optimal proposal, IS has zero variance for positive f(x)—but it requires knowing the integral we're trying to compute! Still, this guides intuition: good proposals concentrate samples where |f(x)| p(x) is large.
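The zero-variance claim can be checked numerically. For $p = \mathcal{N}(0,1)$ and $f(x) = x^2$, the optimal proposal $q^*(x) \propto x^2 \varphi(x)$ can be sampled exactly as a chi-distributed radius with 3 degrees of freedom and a random sign; a sketch under those assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 10_000

# q*(x) = x^2 φ(x) / Z with Z = E_p[x^2] = 1, so q* is exactly x^2 φ(x).
# |X| ~ chi(3) has density ∝ x^2 e^{-x^2/2}; attach a random sign for symmetry.
x = np.sqrt(rng.chisquare(df=3, size=N)) * rng.choice([-1.0, 1.0], size=N)

q_star = x**2 * stats.norm.pdf(x)        # already normalized since Z = 1
w = stats.norm.pdf(x) / q_star           # = 1 / x^2

terms = x**2 * w                         # f(x) · w(x) = 1 for every sample
print(terms.mean(), terms.std())         # ≈ 1.0 and ≈ 0: a zero-variance estimator
```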
When IS variance explodes:
The most dangerous situation is when the proposal has lighter tails than the target. In regions where $p(x) \gg q(x)$, weights become extremely large:
$$w(x) = \frac{p(x)}{q(x)} \to \infty \text{ as } q(x) \to 0$$
A single sample with a huge weight can dominate the entire estimate, leading to high variance and unstable results.
Practical diagnostic: weight distribution
Examine the distribution of importance weights after every run. Key quantities to monitor: the largest normalized weight, the ratio of the maximum weight to the mean weight, the coefficient of variation of the weights, and the effective sample size (introduced below).
```python
import numpy as np
from scipy import stats

def compare_is_variance():
    """
    Compare IS variance with different proposals.
    Shows how proposal choice affects efficiency.
    """
    # Target: N(0, 1)
    # Goal: Estimate P(X > 2), a rare event
    def target_pdf(x):
        return stats.norm.pdf(x, 0, 1)

    def indicator(x):
        return float(x > 2)

    # True value
    true_prob = 1 - stats.norm.cdf(2)  # ≈ 0.0228

    proposals = [
        # Proposal 1: Standard normal (same as target - baseline)
        {
            "name": "N(0, 1) [same as target]",
            "sampler": lambda N: np.random.normal(0, 1, N),
            "pdf": lambda x: stats.norm.pdf(x, 0, 1)
        },
        # Proposal 2: Shifted toward tail (good for rare event)
        {
            "name": "N(2.5, 1) [shifted to tail]",
            "sampler": lambda N: np.random.normal(2.5, 1, N),
            "pdf": lambda x: stats.norm.pdf(x, 2.5, 1)
        },
        # Proposal 3: Wrong direction (very bad)
        {
            "name": "N(-2, 1) [wrong direction]",
            "sampler": lambda N: np.random.normal(-2, 1, N),
            "pdf": lambda x: stats.norm.pdf(x, -2, 1)
        }
    ]

    N = 10000
    n_trials = 100

    print("=== Rare Event Estimation: P(X > 2) for X ~ N(0,1) ===")
    print(f"True probability: {true_prob:.6f}")
    print(f"N = {N} samples, {n_trials} trials")
    print()

    for proposal in proposals:
        estimates = []
        avg_ess = 0

        for _ in range(n_trials):
            samples = proposal["sampler"](N)

            # Importance weights
            weights = target_pdf(samples) / proposal["pdf"](samples)

            # IS estimate
            weighted_indicator = np.array([indicator(x) * w
                                           for x, w in zip(samples, weights)])
            estimate = np.mean(weighted_indicator)
            estimates.append(estimate)

            # ESS
            ESS = np.sum(weights)**2 / np.sum(weights**2)
            avg_ess += ESS / n_trials

        estimates = np.array(estimates)
        mean_est = np.mean(estimates)
        std_est = np.std(estimates)
        rmse = np.sqrt(np.mean((estimates - true_prob)**2))

        print(f"Proposal: {proposal['name']}")
        print(f"  Mean estimate:    {mean_est:.6f}")
        print(f"  Std of estimates: {std_est:.6f}")
        print(f"  RMSE:             {rmse:.6f}")
        print(f"  Average ESS:      {avg_ess:.1f} ({100*avg_ess/N:.1f}%)")
        print()

compare_is_variance()
```

The Effective Sample Size (ESS) quantifies how many independent, equally-weighted samples the importance-weighted samples are worth. It's the key diagnostic for importance sampling quality.
Definition:
For self-normalized importance sampling with normalized weights $\bar{w}_n$:
$$\text{ESS} = \frac{1}{\sum_{n=1}^{N} \bar{w}_n^2} = \frac{\left(\sum_n \tilde{w}_n\right)^2}{\sum_n \tilde{w}_n^2}$$
Interpretation: ESS ranges from $1$ (one weight carries all the mass) to $N$ (all weights equal). It answers the question: how many equally weighted, independent samples would give the same estimation precision?
Why ESS matters:
The variance of the SNIS estimator is approximately: $$\text{Var}(\hat{I}_{\text{SNIS}}) \approx \frac{\sigma_f^2}{\text{ESS}}$$
So ESS tells us the effective number of samples we have for variance purposes. With N = 10,000 samples but ESS = 50, our estimates have the precision of only 50 independent samples.
General guidelines for interpreting ESS:
• ESS/N > 50%: Excellent efficiency, proposal well-matched
• ESS/N = 20-50%: Acceptable, results reasonably reliable
• ESS/N = 5-20%: Marginal, consider improving proposal
• ESS/N < 5%: Poor, results likely unreliable
• ESS/N < 1%: Catastrophic, discard results and redesign
Always report ESS alongside IS estimates to communicate result reliability.
Connection to weight concentration:
ESS is inversely related to weight concentration. Define the weight coefficient of variation:
$$\text{CV}^2(w) = \frac{\text{Var}(w)}{\mathbb{E}[w]^2}$$
Then: $$\text{ESS} = \frac{N}{1 + \text{CV}^2(w)}$$
High weight variability directly reduces effective sample size.
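The identity is exact when the population variance is used, which makes a one-line numerical check possible; a small sketch with arbitrary skewed weights:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.5, size=5000)   # skewed unnormalized weights

# Direct ESS from normalized weights
wbar = w / w.sum()
ess_direct = 1.0 / np.sum(wbar**2)

# ESS via the CV identity (np.var defaults to ddof=0, the population variance)
cv2 = np.var(w) / np.mean(w)**2
ess_from_cv = len(w) / (1.0 + cv2)

print(ess_direct, ess_from_cv)   # identical up to floating-point rounding
```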
Alternative ESS measures:
The formula above uses the second moment of the normalized weights. Other measures exist: an entropy-based ESS, $\exp\left(-\sum_n \bar{w}_n \log \bar{w}_n\right)$ (the perplexity of the weight distribution), and a max-weight ESS, $1 / \max_n \bar{w}_n$ (both are computed in the code below).
All capture the same intuition: uniform weights → high ESS, concentrated weights → low ESS.
```python
import numpy as np
from scipy import stats

def analyze_ess(unnorm_weights):
    """
    Compute various ESS measures and weight diagnostics.
    """
    N = len(unnorm_weights)

    # Normalize weights
    norm_weights = unnorm_weights / np.sum(unnorm_weights)

    # Standard ESS (inverse of sum of squared normalized weights)
    ESS_standard = 1 / np.sum(norm_weights ** 2)

    # ESS from max weight
    ESS_max = 1 / np.max(norm_weights)

    # Entropy-based ESS
    # Handle zeros in weights
    nonzero_weights = norm_weights[norm_weights > 0]
    entropy = -np.sum(nonzero_weights * np.log(nonzero_weights))
    ESS_entropy = np.exp(entropy)

    # Weight diagnostics
    weight_cv = np.std(unnorm_weights) / np.mean(unnorm_weights)
    max_weight_ratio = np.max(unnorm_weights) / np.mean(unnorm_weights)

    return {
        'N': N,
        'ESS_standard': ESS_standard,
        'ESS_max': ESS_max,
        'ESS_entropy': ESS_entropy,
        'efficiency': ESS_standard / N,
        'weight_cv': weight_cv,
        'max_weight_ratio': max_weight_ratio,
        'norm_weights': norm_weights
    }

def demonstrate_ess_degradation():
    """
    Show how ESS degrades as target-proposal mismatch increases.
    """
    print("=== ESS Degradation with Proposal-Target Mismatch ===")
    print()

    N = 10000
    target_mean = 0
    target_std = 1

    def target_pdf(x):
        return stats.norm.pdf(x, target_mean, target_std)

    # Vary proposal mean to create mismatch
    proposal_means = [0, 0.5, 1, 2, 3, 4, 5]

    print(f"Target: N({target_mean}, {target_std}²)")
    print(f"N = {N} samples")
    print()
    print(f"{'Proposal Mean':>15} | {'ESS':>8} | {'Efficiency':>10} | {'Max w/Mean w':>12}")
    print("-" * 55)

    for pm in proposal_means:
        # Sample from proposal
        samples = np.random.normal(pm, target_std, N)

        # Compute weights
        proposal_pdf = lambda x, m=pm: stats.norm.pdf(x, m, target_std)
        weights = target_pdf(samples) / proposal_pdf(samples)

        # Analyze
        results = analyze_ess(weights)

        print(f"{pm:>15.1f} | {results['ESS_standard']:>8.1f} | "
              f"{results['efficiency']:>9.1%} | {results['max_weight_ratio']:>12.1f}")

demonstrate_ess_degradation()
print()

# Visualize weight distribution
print("=== Weight Distribution Examples ===")
print()

N = 1000
target_pdf = lambda x: stats.norm.pdf(x, 0, 1)

for mismatch, proposal_mean in [("None", 0), ("Moderate", 2), ("Severe", 4)]:
    samples = np.random.normal(proposal_mean, 1, N)
    proposal_pdf = lambda x, m=proposal_mean: stats.norm.pdf(x, m, 1)
    weights = target_pdf(samples) / proposal_pdf(samples)

    results = analyze_ess(weights)

    print(f"{mismatch} mismatch (proposal mean = {proposal_mean}):")
    print(f"  ESS = {results['ESS_standard']:.1f} ({100*results['efficiency']:.1f}%)")
    print(f"  Weight percentiles: 50th={np.percentile(weights, 50):.3f}, "
          f"90th={np.percentile(weights, 90):.3f}, "
          f"99th={np.percentile(weights, 99):.3f}, "
          f"max={np.max(weights):.3f}")
    print()
```

When a single proposal is inadequate—especially for multimodal targets—Multiple Importance Sampling (MIS) combines samples from several proposals.
The setup: we have $K$ proposals $q_1, \ldots, q_K$ and draw $N_k$ samples from proposal $q_k$, for $N = \sum_k N_k$ samples in total, all to be combined into a single estimate of $\mathbb{E}_p[f(x)]$.
Naive combination (bad):
Simply pooling samples with the standard per-proposal weights $p(x)/q_k(x)$ fails because each weight ignores the other proposals: a sample from $q_k$ that lands where $q_k$ is tiny but another proposal is dense still receives an enormous weight.
Balance heuristic (good):
The optimal combination uses the balance heuristic. For a sample $x^{(n)}$ from any proposal:
$$w_{\text{balance}}(x) = \frac{p(x)}{\sum_{k=1}^{K} \frac{N_k}{N} q_k(x)} = \frac{p(x)}{\bar{q}(x)}$$
where $\bar{q}(x) = \sum_k \frac{N_k}{N} q_k(x)$ is the mixture proposal.
Why this works:
The balance heuristic treats all samples as if they came from the mixture $\bar{q}$. A point covered by any proposal gets reasonable weight, avoiding the exploding weights that occur when a point is far from a single proposal.
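To make this concrete, compare the two weighting rules at a single point near one mode of the bimodal target used in the example below (an illustrative evaluation of the weight functions, not a full estimator, with equal sample allocation assumed):

```python
import numpy as np
from scipy import stats

# Bimodal target and one narrow proposal per mode
target = lambda x: 0.4 * stats.norm.pdf(x, -3, 0.7) + 0.6 * stats.norm.pdf(x, 3, 0.9)
q1 = lambda x: stats.norm.pdf(x, -3, 0.7)   # covers the left mode
q2 = lambda x: stats.norm.pdf(x, 3, 0.9)    # covers the right mode

x = 2.0   # near the right mode, deep in q1's tail

naive_w_q1 = target(x) / q1(x)                       # rule for a sample labeled "from q1"
balance_w = target(x) / (0.5 * q1(x) + 0.5 * q2(x))  # mixture denominator

print(f"naive weight under q1 alone: {naive_w_q1:.3e}")   # astronomically large
print(f"balance-heuristic weight:    {balance_w:.3f}")    # well behaved, ≈ 1
```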
Multiple importance sampling excels for multimodal targets. Place a narrow proposal near each mode. Samples from each proposal get good weights near their respective modes. The balance heuristic ensures samples aren't penalized for being far from other proposals' modes. This is far more efficient than a single broad proposal that wastes samples in low-probability regions between modes.
```python
import numpy as np
from scipy import stats

def multiple_importance_sampling(
    f, target_pdf, proposal_samplers, proposal_pdfs, samples_per_proposal
):
    """
    Multiple importance sampling with balance heuristic.

    Parameters:
    -----------
    proposal_samplers : list of callables
        Each sampler(N) returns N samples from proposal k
    proposal_pdfs : list of callables
        Each pdf(x) returns proposal k's density at x
    samples_per_proposal : list of int
        Number of samples from each proposal
    """
    K = len(proposal_samplers)
    N_total = sum(samples_per_proposal)
    weights_k = [n / N_total for n in samples_per_proposal]

    all_samples = []
    all_weights = []

    for k, (sampler, N_k) in enumerate(zip(proposal_samplers, samples_per_proposal)):
        samples = sampler(N_k)

        for x in samples:
            # Mixture proposal density (balance heuristic denominator)
            mixture_density = sum(
                w * pdf(x) for w, pdf in zip(weights_k, proposal_pdfs)
            )

            # Balance heuristic weight
            weight = target_pdf(x) / mixture_density

            all_samples.append(x)
            all_weights.append(weight)

    samples = np.array(all_samples)
    weights = np.array(all_weights)

    # Normalize weights
    norm_weights = weights / np.sum(weights)

    # SNIS estimate
    estimate = np.sum([f(x) * w for x, w in zip(samples, norm_weights)])

    # ESS
    ESS = 1 / np.sum(norm_weights ** 2)

    return estimate, ESS, samples, norm_weights


# Example: Bimodal target
# Target: 0.4 * N(-3, 0.7) + 0.6 * N(3, 0.9)

def target_pdf(x):
    return 0.4 * stats.norm.pdf(x, -3, 0.7) + 0.6 * stats.norm.pdf(x, 3, 0.9)

# Single proposal (covers both modes poorly)
single_sampler = lambda N: np.random.normal(0, 3, N)
single_pdf = lambda x: stats.norm.pdf(x, 0, 3)

# Multiple proposals (one per mode)
multi_samplers = [
    lambda N: np.random.normal(-3, 0.7, N),
    lambda N: np.random.normal(3, 0.9, N)
]
multi_pdfs = [
    lambda x: stats.norm.pdf(x, -3, 0.7),
    lambda x: stats.norm.pdf(x, 3, 0.9)
]

# True mean: 0.4 * (-3) + 0.6 * 3 = 0.6
true_mean = 0.4 * (-3) + 0.6 * 3
N = 5000

print("=== Multiple Importance Sampling: Bimodal Target ===")
print("Target: 0.4 * N(-3, 0.7) + 0.6 * N(3, 0.9)")
print(f"True mean: {true_mean}")
print()

# Single proposal approach
samples_single = single_sampler(N)
weights_single = target_pdf(samples_single) / single_pdf(samples_single)
norm_weights_single = weights_single / np.sum(weights_single)
est_single = np.sum(samples_single * norm_weights_single)
ESS_single = 1 / np.sum(norm_weights_single ** 2)

print(f"Single proposal N(0, 3):")
print(f"  Estimate: {est_single:.4f}")
print(f"  ESS: {ESS_single:.1f} ({100*ESS_single/N:.1f}%)")

# Multiple proposal approach
est_multi, ESS_multi, _, _ = multiple_importance_sampling(
    lambda x: x, target_pdf, multi_samplers, multi_pdfs,
    [int(0.4 * N), int(0.6 * N)]  # Match mode weights
)

print(f"\nMultiple proposals (one per mode):")
print(f"  Estimate: {est_multi:.4f}")
print(f"  ESS: {ESS_multi:.1f} ({100*ESS_multi/N:.1f}%)")
```

Importance sampling is particularly natural for Bayesian inference, where we typically know posteriors only up to normalizing constants.
Common setup: the target is the posterior $p(\theta | \mathcal{D}) \propto p(\mathcal{D} | \theta) \, p(\theta)$, known only up to the evidence $p(\mathcal{D})$, and the quantities of interest are posterior expectations $\mathbb{E}[f(\theta) | \mathcal{D}]$, estimated with SNIS.
Using the prior as proposal:
When $q(\theta) = p(\theta)$, the importance weight simplifies beautifully:
$$w(\theta) = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{p(\theta)} = p(\mathcal{D} | \theta)$$
The weight is just the likelihood! This is computationally convenient and interpretable: samples with high likelihood get high weight.
Limitations of prior sampling:
Using the prior works only when the posterior is "close" to the prior—i.e., when data is informative but not overwhelming. If the likelihood concentrates in a small region of prior mass, most prior samples get negligible weight.
Better proposals:
| Proposal | Weights | When to Use |
|---|---|---|
| Prior p(θ) | Likelihood p(D\|θ) | Weak data, prior reasonably accurate |
| Laplace approximation | Posterior / Gaussian | Unimodal, smooth posteriors |
| Variational approximation | Posterior / q_VI(θ) | Complex models with VI available |
| Previous posterior (t-1) | Likelihood ratio | Sequential/online settings |
| Mixture of proposals | Balance heuristic | Multimodal posteriors |
```python
import numpy as np
from scipy import stats, optimize

def bayesian_is_comparison():
    """
    Compare different proposals for Bayesian IS:
    1. Prior as proposal
    2. Laplace approximation as proposal
    """
    # Model: Observations X_i ~ N(theta, 1), prior theta ~ N(0, 2^2)
    # With many observations, posterior concentrates away from prior
    prior_mean, prior_std = 0, 2
    likelihood_std = 1
    true_theta = 3.5

    # Generate observations
    np.random.seed(42)
    n_obs = 50
    observations = np.random.normal(true_theta, likelihood_std, n_obs)

    # True posterior (conjugate): N(posterior_mean, posterior_std)
    posterior_precision = 1/prior_std**2 + n_obs/likelihood_std**2
    posterior_var = 1 / posterior_precision
    posterior_mean = posterior_var * (prior_mean/prior_std**2
                                      + np.sum(observations)/likelihood_std**2)
    posterior_std = np.sqrt(posterior_var)

    print("=== Bayesian IS: Prior vs Laplace Proposal ===")
    print(f"True theta: {true_theta}")
    print(f"Prior: N({prior_mean}, {prior_std}²)")
    print(f"Data: {n_obs} observations, sample mean = {np.mean(observations):.3f}")
    print(f"True posterior: N({posterior_mean:.3f}, {posterior_std:.3f}²)")
    print()

    def log_likelihood(theta):
        return np.sum(stats.norm.logpdf(observations, theta, likelihood_std))

    def log_prior(theta):
        return stats.norm.logpdf(theta, prior_mean, prior_std)

    def log_posterior_unnorm(theta):
        return log_likelihood(theta) + log_prior(theta)

    N = 10000

    # Method 1: Prior as proposal (weights = likelihood)
    prior_samples = np.random.normal(prior_mean, prior_std, N)
    log_w = np.array([log_likelihood(theta) for theta in prior_samples])
    # Subtract the max log-weight before exponentiating to avoid underflow;
    # normalized weights are unchanged by this constant shift.
    prior_weights = np.exp(log_w - np.max(log_w))
    prior_weights_norm = prior_weights / np.sum(prior_weights)

    ESS_prior = 1 / np.sum(prior_weights_norm ** 2)
    mean_est_prior = np.sum(prior_samples * prior_weights_norm)

    print(f"Prior as proposal:")
    print(f"  Mean estimate: {mean_est_prior:.4f}")
    print(f"  ESS: {ESS_prior:.1f} ({100*ESS_prior/N:.1f}%)")
    print(f"  Max weight / Mean weight: "
          f"{np.max(prior_weights) / np.mean(prior_weights):.1f}")

    # Method 2: Laplace approximation as proposal
    # Find posterior mode
    result = optimize.minimize(lambda x: -log_posterior_unnorm(x[0]), [0])
    laplace_mean = result.x[0]

    # Approximate posterior curvature at mode via finite differences
    eps = 1e-5
    hess_approx = (log_posterior_unnorm(laplace_mean + eps)
                   - 2*log_posterior_unnorm(laplace_mean)
                   + log_posterior_unnorm(laplace_mean - eps)) / eps**2
    laplace_std = np.sqrt(-1 / hess_approx)

    laplace_samples = np.random.normal(laplace_mean, laplace_std, N)
    laplace_proposal_log = stats.norm.logpdf(laplace_samples, laplace_mean, laplace_std)
    laplace_target_log = np.array([log_posterior_unnorm(x) for x in laplace_samples])

    log_w = laplace_target_log - laplace_proposal_log
    laplace_weights = np.exp(log_w - np.max(log_w))  # same stabilization trick
    laplace_weights_norm = laplace_weights / np.sum(laplace_weights)

    ESS_laplace = 1 / np.sum(laplace_weights_norm ** 2)
    mean_est_laplace = np.sum(laplace_samples * laplace_weights_norm)

    print(f"\nLaplace approximation as proposal:")
    print(f"  Laplace: N({laplace_mean:.4f}, {laplace_std:.4f}²)")
    print(f"  Mean estimate: {mean_est_laplace:.4f}")
    print(f"  ESS: {ESS_laplace:.1f} ({100*ESS_laplace/N:.1f}%)")
    print(f"  Improvement: {ESS_laplace/ESS_prior:.1f}x higher ESS")

bayesian_is_comparison()
```

Despite its flexibility, importance sampling has fundamental limitations that restrict its applicability.
The dimensional curse:
In high dimensions, even minor mismatches between proposal and target lead to weight degeneracy. Consider this scaling: if target and proposal factor across $d$ independent coordinates, the weight is a product of $d$ per-coordinate ratios, so $\log w$ is a sum of $d$ terms and the variance of the weights compounds multiplicatively. ESS$/N$ therefore decays geometrically in $d$: a per-coordinate mismatch that is harmless in one dimension becomes catastrophic in hundreds, as the sketch below shows.
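A minimal simulation of this effect, assuming a standard-normal target in $d$ dimensions and a proposal that is 20% too wide in every coordinate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 5000
sigma_q = 1.2   # mild per-coordinate mismatch

for d in [1, 5, 10, 50, 100, 200]:
    x = rng.normal(0.0, sigma_q, size=(N, d))
    # log w = sum over coordinates of log[ φ(x_i) / φ_{σ_q}(x_i) ]
    log_w = (stats.norm.logpdf(x).sum(axis=1)
             - stats.norm.logpdf(x, scale=sigma_q).sum(axis=1))
    w = np.exp(log_w - log_w.max())           # stabilized; normalization cancels
    ess = w.sum()**2 / (w**2).sum()
    print(f"d = {d:3d}: ESS/N = {ess / N:.4f}")   # decays geometrically with d
```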
The weight variance problem:
For IS to work, we need $\mathbb{E}_q[w(x)^2] < \infty$. If the target has heavier tails than the proposal:
$$\mathbb{E}_q\left[\frac{p(x)^2}{q(x)^2}\right] = \int \frac{p(x)^2}{q(x)} dx = \infty$$
This infinite variance makes IS unusable, even theoretically.
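The canonical example is a Cauchy target with a Gaussian proposal: the weight second moment diverges and the estimate never settles. A sketch (the tail probability and checkpoints are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 100_000

# Target: standard Cauchy (heavy tails). Proposal: N(0, 1) (light tails).
# w(x) = cauchy.pdf(x) / norm.pdf(x) grows like e^{x^2/2} / x^2, so E_q[w^2] = ∞.
x = rng.normal(0.0, 1.0, N)
w = stats.cauchy.pdf(x) / stats.norm.pdf(x)

# Estimate P(X > 2) under the Cauchy target. The running estimate is erratic and
# typically too low: the Gaussian proposal essentially never reaches the deep
# Cauchy tail, and occasional extreme weights cause large jumps.
for n in [1_000, 10_000, 100_000]:
    est = np.mean((x[:n] > 2) * w[:n])
    print(f"n = {n:6d}: estimate = {est:.4f}, max weight so far = {w[:n].max():.1f}")
print(f"true value: {1 - stats.cauchy.cdf(2):.4f}")
```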
Proposal design challenges:
Finding a good proposal requires knowing a lot about the target—which we're trying to learn. It's a chicken-and-egg problem. For complex, high-dimensional posteriors, no simple parametric proposal works well.
The most dangerous aspect of IS failure is that estimates can look reasonable while being completely wrong. A few samples with large weights can drive the estimate, but their randomness isn't reflected in standard errors (which assume balanced weights). Always compute and report ESS. If ESS << N, treat results with extreme skepticism.
| Scenario | Verdict | Reason |
|---|---|---|
| Low-dimensional, good proposal | ✓ Excellent | Efficient and exact |
| Rare event estimation | ✓ Excellent | Can beat standard MC by orders of magnitude |
| Sequential updating | ✓ Good | Reweight previous samples (particle filtering) |
| Moderate dimension, good proposal | △ Viable | Monitor ESS carefully |
| High dimension (d > 20) | ✗ Risky | Weight degeneracy likely |
| Heavy-tailed target, light-tailed proposal | ✗ Fails | Infinite variance |
| Unknown target structure | ✗ Dangerous | Can't design good proposal |
Importance sampling reweights samples from a proposal distribution to estimate expectations under a target distribution. It eliminates rejection sampling's waste but introduces variance challenges.
What's Next:
Both rejection sampling and importance sampling struggle in high dimensions. The solution: Markov Chain Monte Carlo (MCMC). Instead of sampling independently from a proposal, MCMC constructs a Markov chain whose stationary distribution is the target. The Metropolis-Hastings algorithm provides a general framework that requires only evaluating the unnormalized target density—transforming Bayesian computation.
You now understand importance sampling: the reweighting principle, self-normalization for unnormalized targets, variance and ESS analysis, and fundamental limitations. This prepares you for MCMC, which overcomes these limitations through the power of Markov chain convergence.