At the heart of Bayesian statistics lies a beautiful but computationally demanding principle: update beliefs with evidence. Bayes' theorem tells us exactly how to do this—compute the posterior distribution by multiplying the prior by the likelihood and normalizing. In theory, this is elegant. In practice, it often demands computing integrals that have no closed-form solution.
The normalization problem:
$$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{\int p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'}$$
The denominator—the marginal likelihood or evidence—requires integrating over the entire parameter space. For most realistic models, this integral is analytically intractable. We cannot compute it in closed form, and numerical integration (like simple quadrature) fails catastrophically in high dimensions due to the curse of dimensionality.
This page introduces Monte Carlo methods—a class of computational techniques that approximate intractable integrals using random sampling. You will understand the theoretical foundations, convergence guarantees, and practical considerations that make Monte Carlo the workhorse of modern Bayesian computation.
Bayesian inference isn't just about computing posteriors—nearly every quantity of interest requires integration. Consider what we often need:
1. Posterior expectations: $$\mathbb{E}_{p(\theta|\mathcal{D})}[f(\theta)] = \int f(\theta) \, p(\theta | \mathcal{D}) \, d\theta$$
This includes posterior means ($f(\theta) = \theta$), variances, credible intervals, and any function of parameters.
2. Posterior predictive distribution: $$p(x_{\text{new}} | \mathcal{D}) = \int p(x_{\text{new}} | \theta) \, p(\theta | \mathcal{D}) \, d\theta$$
Predictions that properly account for parameter uncertainty require averaging over the posterior.
3. Marginal posteriors: $$p(\theta_1 | \mathcal{D}) = \int p(\theta_1, \theta_2 | \mathcal{D}) \, d\theta_2$$
Examining individual parameters requires marginalizing out all others.
4. Model evidence (for model comparison): $$p(\mathcal{D} | M) = \int p(\mathcal{D} | \theta, M) \, p(\theta | M) \, d\theta$$
Bayes factors and posterior model probabilities depend on this integral.
Traditional numerical integration (e.g., trapezoidal rule, Simpson's rule) places a grid of points over the integration domain. In $d$ dimensions with $n$ points per dimension, this requires $n^d$ function evaluations. For a modest model with 20 parameters and 10 points per dimension, we need $10^{20}$ evaluations—utterly infeasible. This exponential scaling renders grid-based methods useless for high-dimensional inference.
Why is this so hard?
The fundamental issue is that probability mass in high dimensions concentrates in counterintuitive ways. Consider a Gaussian distribution in $d$ dimensions. Although the mode (peak) is at the origin, most of the probability mass lies in a thin typical set—a shell of intermediate density values, far from both the mode and the tails.
Grid-based methods waste effort evaluating the function at points that contribute negligibly to the integral. The regions that matter (the typical set) occupy an increasingly small fraction of the domain as dimensionality grows. We need a smarter approach that focuses computational effort where it counts.
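The typical-set claim is easy to check numerically. The sketch below (plain NumPy, written for this page) draws standard-Gaussian samples in increasing dimension and shows that their distances from the origin cluster tightly around $\sqrt{d}$, far from the mode at the origin:

```python
import numpy as np

rng = np.random.default_rng(0)

# In d dimensions, standard-Gaussian samples land at distance ~sqrt(d)
# from the origin, even though the density is highest at the origin.
for d in [2, 10, 100, 1000]:
    samples = rng.standard_normal((10000, d))
    radii = np.linalg.norm(samples, axis=1)
    print(f"d={d:5d}: mean radius = {radii.mean():7.3f}, "
          f"sqrt(d) = {np.sqrt(d):7.3f}, spread = {radii.std():.3f}")
```

The spread of the radii stays roughly constant (about $1/\sqrt{2}$) while the mean radius grows, so the mass lives in an ever-thinner shell relative to its distance from the mode.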
| Method | Scaling | $d = 2$ | $d = 10$ | $d = 50$ |
|---|---|---|---|---|
| Grid Quadrature | $n^d$ | $100$ | $10^{10}$ | $10^{50}$ |
| Monte Carlo | $n$ (independent of d) | $100$ | $100$ | $100$ |
| Error (Quadrature) | $O(n^{-k/d})$ | Fast | Slow | Fails |
| Error (MC) | $O(n^{-1/2})$ | $O(0.1)$ | $O(0.1)$ | $O(0.1)$ |
Monte Carlo methods replace deterministic integration with stochastic approximation. The core insight is elegantly simple: if we can draw samples from a distribution, we can approximate expectations under that distribution using sample averages.
The fundamental Monte Carlo estimator:
Suppose we want to compute: $$I = \mathbb{E}_{p(\theta)}[f(\theta)] = \int f(\theta) \, p(\theta) \, d\theta$$
If we can generate i.i.d. samples $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\}$ from the distribution $p(\theta)$, then:
$$\hat{I}_N = \frac{1}{N} \sum_{n=1}^{N} f(\theta^{(n)})$$
This sample mean is an unbiased estimator of the true integral $I$.
The Law of Large Numbers guarantees that the sample average converges to the true expectation as the number of samples grows: $\hat{I}_N \xrightarrow{a.s.} I$ as $N \to \infty$. This is not an approximation that gets systematically better—it's a probabilistic guarantee of convergence. With enough samples, we can get arbitrarily close to the true value.
The variance of the Monte Carlo estimator:
Since each $f(\theta^{(n)})$ is an independent draw, the variance of the sample mean is:
$$\text{Var}(\hat{I}_N) = \frac{1}{N} \, \text{Var}_{p(\theta)}[f(\theta)] = \frac{\sigma_f^2}{N}$$
where $\sigma_f^2 = \mathbb{E}[(f(\theta) - I)^2]$ is the variance of $f$ under $p$.
The standard error of the MC estimator is: $$\text{SE}(\hat{I}_N) = \frac{\sigma_f}{\sqrt{N}}$$
The Central Limit Theorem gives us more: for large $N$, the estimator is approximately normally distributed:
$$\hat{I}_N \approx \mathcal{N}\left(I, \frac{\sigma_f^2}{N}\right)$$
This allows us to construct confidence intervals for our Monte Carlo estimates.
```python
import numpy as np

def monte_carlo_integral(sampler, f, N):
    """
    Estimate E_p[f(theta)] using Monte Carlo sampling.

    Parameters
    ----------
    sampler : callable
        Function that returns N i.i.d. samples from p(theta)
    f : callable
        Function to integrate
    N : int
        Number of samples

    Returns
    -------
    estimate : float
        Monte Carlo estimate of E[f(theta)]
    std_error : float
        Standard error of the estimate
    """
    # Generate samples from the target distribution
    samples = sampler(N)

    # Evaluate the function at each sample
    f_values = np.array([f(theta) for theta in samples])

    # Monte Carlo estimate (sample mean)
    estimate = np.mean(f_values)

    # Standard error = sample std / sqrt(N)
    std_error = np.std(f_values, ddof=1) / np.sqrt(N)

    return estimate, std_error

# Example: Estimate E[X^2] where X ~ N(0, 1)
# True value: Var(X) + E[X]^2 = 1 + 0 = 1

def standard_normal_sampler(N):
    return np.random.randn(N)

def squared(x):
    return x ** 2

# Run with increasing sample sizes
for N in [100, 1000, 10000, 100000]:
    est, se = monte_carlo_integral(standard_normal_sampler, squared, N)
    print(f"N={N:6d}: Estimate = {est:.4f} ± {1.96*se:.4f} (95% CI)")

print("          True value = 1.0000")

# Output (typical):
# N=   100: Estimate = 0.9823 ± 0.1843 (95% CI)
# N=  1000: Estimate = 1.0102 ± 0.0621 (95% CI)
# N= 10000: Estimate = 0.9987 ± 0.0196 (95% CI)
# N=100000: Estimate = 1.0003 ± 0.0062 (95% CI)
```

The remarkable property of Monte Carlo estimation is that its convergence rate is independent of the dimensionality of the integration problem. This stands in stark contrast to deterministic quadrature methods.
The $O(N^{-1/2})$ convergence rate:
The root-mean-square error of the Monte Carlo estimator scales as: $$\text{RMSE} = \sqrt{\mathbb{E}[(\hat{I}_N - I)^2]} = \frac{\sigma_f}{\sqrt{N}}$$
Halving the error requires quadrupling the number of samples. This rate is dimension-independent—whether we're integrating in 2 dimensions or 200, the same number of samples achieves the same error reduction.
Comparison with deterministic methods:
For a function with $k$ continuous derivatives, optimal deterministic quadrature achieves error $O(N^{-k/d})$ in $d$ dimensions. In high dimensions, this becomes arbitrarily slow. Monte Carlo's $O(N^{-1/2})$ rate, while seemingly modest, becomes superior as soon as:
$$\frac{k}{d} < \frac{1}{2} \quad \Rightarrow \quad d > 2k$$
For smooth functions ($k=2$), MC wins for $d > 4$. Real posterior distributions with dozens or hundreds of parameters make MC the only viable approach.
The $O(N^{-1/2})$ rate has practical implications for computational budgets:
• 1 decimal digit of precision needs ~$100$ samples
• 2 decimal digits needs ~$10,000$ samples
• 3 decimal digits needs ~$1,000,000$ samples
For most Bayesian applications, 2-3 significant figures suffice for posterior summaries, making MC eminently practical.
Estimating the variance:
In practice, we don't know $\sigma_f^2$—we must estimate it from the same samples:
$$\hat{\sigma}_f^2 = \frac{1}{N-1} \sum_{n=1}^{N} \left(f(\theta^{(n)}) - \hat{I}_N\right)^2$$
This gives us the estimated standard error: $$\widehat{\text{SE}}(\hat{I}_N) = \frac{\hat{\sigma}_f}{\sqrt{N}}$$
Confidence intervals can be constructed using the normal approximation: $$\hat{I}_N \pm z_{\alpha/2} \cdot \widehat{\text{SE}}(\hat{I}_N)$$
For a 95% CI, $z_{0.025} \approx 1.96$, giving roughly $\hat{I}_N \pm 2 \cdot \widehat{\text{SE}}$.
```python
import numpy as np

def analyze_convergence(sampler, f, true_value, max_N=100000, num_trials=50):
    """
    Analyze Monte Carlo convergence empirically.
    Demonstrates O(N^{-1/2}) error scaling.
    """
    sample_sizes = np.logspace(1, np.log10(max_N), 50).astype(int)
    sample_sizes = np.unique(sample_sizes)

    rmse_values = []
    theoretical_se = []

    for N in sample_sizes:
        errors = []
        for _ in range(num_trials):
            samples = sampler(N)
            f_vals = np.array([f(x) for x in samples])
            estimate = np.mean(f_vals)
            errors.append(estimate - true_value)

        rmse = np.sqrt(np.mean(np.array(errors)**2))
        rmse_values.append(rmse)

        # Theoretical: sigma_f / sqrt(N)
        # For X ~ N(0,1), f(x) = x^2, Var(X^2) = E[X^4] - E[X^2]^2 = 3 - 1 = 2
        sigma_f = np.sqrt(2)  # Known for this specific case
        theoretical_se.append(sigma_f / np.sqrt(N))

    return sample_sizes, rmse_values, theoretical_se

# Run analysis for E[X^2], X ~ N(0,1)
sampler = lambda N: np.random.randn(N)
f = lambda x: x**2
true_value = 1.0

sizes, rmse, theory = analyze_convergence(sampler, f, true_value)

# Verify O(N^{-1/2}) scaling
log_sizes = np.log10(sizes)
log_rmse = np.log10(rmse)
slope, intercept = np.polyfit(log_sizes, log_rmse, 1)

print(f"Empirical convergence rate: O(N^{{{slope:.3f}}})")
print(f"Expected rate:              O(N^{{-0.500}})")
print()
print("Sample size vs RMSE:")
for i in range(0, len(sizes), 10):
    print(f"  N = {sizes[i]:6d}: RMSE = {rmse[i]:.4f}, Theory = {theory[i]:.4f}")

# Typical output:
# Empirical convergence rate: O(N^{-0.498})
# Expected rate:              O(N^{-0.500})
```

The Monte Carlo principle directly addresses Bayesian computational challenges. If we can sample from the posterior $p(\theta | \mathcal{D})$, any posterior quantity becomes a sample average.
Posterior expectations: $$\mathbb{E}[f(\theta) | \mathcal{D}] = \int f(\theta) \, p(\theta | \mathcal{D}) \, d\theta \approx \frac{1}{N} \sum_{n=1}^{N} f(\theta^{(n)})$$
where $\theta^{(n)} \sim p(\theta | \mathcal{D})$.
Posterior predictive distribution: $$p(\tilde{x} | \mathcal{D}) = \int p(\tilde{x} | \theta) \, p(\theta | \mathcal{D}) \, d\theta \approx \frac{1}{N} \sum_{n=1}^{N} p(\tilde{x} | \theta^{(n)})$$
Generating predictive samples:
To sample from the predictive distribution, we use composition: for each $n = 1, \ldots, N$, draw $\theta^{(n)} \sim p(\theta | \mathcal{D})$, then draw $\tilde{x}^{(n)} \sim p(\tilde{x} | \theta^{(n)})$.

The resulting $\{\tilde{x}^{(n)}\}$ are samples from $p(\tilde{x} | \mathcal{D})$.
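As a minimal sketch of composition sampling, assume we already have posterior draws of $(\mu, \sigma)$ for a Gaussian model (the values below are simulated stand-ins, not output of a real inference run); pairing each posterior draw with one likelihood draw yields predictive samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws for a Gaussian model (in practice these
# would come from MCMC): mu centered at 2.5, sigma centered at 1.2.
N = 10000
mu_samples = rng.normal(2.5, 0.30, N)
sigma_samples = np.abs(rng.normal(1.2, 0.10, N))

# Composition: pair each posterior draw (mu, sigma) with one draw from
# the likelihood, x_new ~ N(mu, sigma). The resulting x_new are samples
# from the posterior predictive p(x_new | D).
x_new = rng.normal(mu_samples, sigma_samples)

print(f"Predictive mean: {x_new.mean():.3f}")
# Predictive sd exceeds 1.2: uncertainty about mu widens the predictions.
print(f"Predictive sd:   {x_new.std():.3f}")
```

Note that the predictive spread combines the likelihood's noise with the posterior uncertainty about $\mu$, which is exactly what the integral in the predictive distribution accomplishes.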
Credible intervals:
To find a 95% credible interval for a parameter $\theta_j$, simply take the 2.5th and 97.5th percentiles of the posterior samples.
Once you have samples from the posterior, computing any derived quantity is trivial. Want P(θ > 0)? Count the fraction of samples where θ > 0. Want the probability that model A outperforms model B? Apply both models to each posterior sample and count wins. The flexibility is remarkable—you're limited only by what you can compute from parameter samples.
```python
import numpy as np
from scipy import stats

def bayesian_inference_with_samples(posterior_samples, X_new=None):
    """
    Demonstrate Bayesian inference using posterior samples.

    Assume we have samples from posterior p(mu, sigma | data)
    for a Gaussian model.

    posterior_samples: (N, 2) array with columns [mu, sigma]
    """
    N = len(posterior_samples)
    mu_samples = posterior_samples[:, 0]
    sigma_samples = posterior_samples[:, 1]

    # 1. Posterior mean of mu
    mu_posterior_mean = np.mean(mu_samples)
    mu_posterior_std = np.std(mu_samples)

    # 2. 95% Credible Interval for mu
    mu_ci = np.percentile(mu_samples, [2.5, 97.5])

    # 3. Probability that mu > 0
    prob_mu_positive = np.mean(mu_samples > 0)

    # 4. Posterior of a derived quantity: coefficient of variation = sigma/|mu|
    cv_samples = sigma_samples / np.abs(mu_samples)
    cv_mean = np.mean(cv_samples)
    cv_ci = np.percentile(cv_samples, [2.5, 97.5])

    print("=== Bayesian Inference from Posterior Samples ===")
    print(f"Number of posterior samples: {N}")
    print()
    print(f"Posterior mean of μ: {mu_posterior_mean:.4f} ± {mu_posterior_std:.4f}")
    print(f"95% Credible Interval for μ: [{mu_ci[0]:.4f}, {mu_ci[1]:.4f}]")
    print(f"P(μ > 0 | data): {prob_mu_positive:.4f}")
    print()
    print("Coefficient of Variation (σ/|μ|):")
    print(f"  Posterior mean: {cv_mean:.4f}")
    print(f"  95% CI: [{cv_ci[0]:.4f}, {cv_ci[1]:.4f}]")

    # 5. Posterior predictive distribution
    if X_new is not None:
        # For each posterior sample, compute p(X_new | mu, sigma)
        predictive_densities = []
        for mu, sigma in posterior_samples:
            density = stats.norm.pdf(X_new, loc=mu, scale=sigma)
            predictive_densities.append(density)

        # Average over posterior samples
        predictive_density = np.mean(predictive_densities)
        print(f"\nPredictive density at X_new={X_new}: {predictive_density:.6f}")

    return {
        'mu_mean': mu_posterior_mean,
        'mu_ci': mu_ci,
        'prob_positive': prob_mu_positive,
        'cv_mean': cv_mean,
        'cv_ci': cv_ci
    }

# Simulate posterior samples (in practice, these come from MCMC)
# Pretend data came from N(2.5, 1.2)
np.random.seed(42)
true_mu, true_sigma = 2.5, 1.2
n_samples = 10000

# Simulate "posterior" (in reality, this would come from MCMC)
# Using posterior from conjugate normal-normal model as illustration
mu_samples = np.random.normal(2.48, 0.15, n_samples)              # Posterior of mu
sigma_samples = np.abs(np.random.normal(1.18, 0.12, n_samples))   # Posterior of sigma

posterior_samples = np.column_stack([mu_samples, sigma_samples])

results = bayesian_inference_with_samples(posterior_samples, X_new=3.0)
```

The Monte Carlo principle is beautiful in theory, but there's a critical gap: How do we actually generate samples from the posterior $p(\theta | \mathcal{D})$?
This is precisely where the computational difficulty lies. The posterior is defined only up to a normalizing constant:
$$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{Z}$$
where $Z = p(\mathcal{D}) = \int p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'$ is the very integral we're trying to avoid computing.
We can easily evaluate the unnormalized posterior: $$\tilde{p}(\theta | \mathcal{D}) = p(\mathcal{D} | \theta) \, p(\theta)$$
But sampling from an unnormalized distribution is not trivial. Standard techniques like inverse CDF sampling require the normalized distribution.
This is the fundamental computational barrier in Bayesian inference. We can compute ratios of posterior probabilities easily—p(θ₁|D)/p(θ₂|D)—since the normalization constants cancel. But generating samples requires more sophisticated machinery. The remainder of this module presents progressively more powerful techniques for overcoming this barrier.
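To make the "ratios are easy" point concrete, here is a toy sketch; the Gaussian likelihood and prior are chosen purely for illustration. It evaluates the unnormalized log-posterior and a posterior ratio without ever touching $Z$:

```python
import numpy as np
from scipy import stats

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 10).
rng = np.random.default_rng(0)
data = rng.normal(1.5, 1.0, size=50)

def log_unnorm_posterior(theta):
    # log p(D | theta) + log p(theta); the constant log Z is unknown
    log_lik = np.sum(stats.norm.logpdf(data, loc=theta, scale=1.0))
    log_prior = stats.norm.logpdf(theta, loc=0.0, scale=10.0)
    return log_lik + log_prior

# The ratio p(theta1 | D) / p(theta2 | D) is computable: Z cancels.
ratio = np.exp(log_unnorm_posterior(1.5) - log_unnorm_posterior(0.0))
print(f"p(theta=1.5 | D) / p(theta=0.0 | D) = {ratio:.3g}")
```

Working in log space, as above, is standard practice: the unnormalized posterior itself underflows or overflows easily, while differences of log-densities are numerically stable.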
The hierarchy of sampling methods:
Different sampling techniques apply under different conditions:
| Method | Requirements | Complexity |
|---|---|---|
| Direct sampling | Known, invertible CDF | Trivial |
| Transformation | Express as transform of simple RV | Simple |
| Rejection sampling | Bounding envelope | Moderate |
| Importance sampling | Proposal distribution | Moderate |
| MCMC | Ability to evaluate unnormalized density | General-purpose |
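As a quick illustration of the table's first row, inverse-CDF sampling for the exponential distribution (the function name and rate value are ours, for illustration): since $F(x) = 1 - e^{-\lambda x}$, we have $F^{-1}(u) = -\ln(1-u)/\lambda$, so uniform draws map directly to exponential draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(rate, N, rng):
    # Direct sampling via the inverse CDF: F^{-1}(u) = -log(1-u)/rate
    u = rng.uniform(size=N)        # U(0,1) draws
    return -np.log1p(-u) / rate    # log1p for numerical accuracy near u=0

samples = sample_exponential(rate=2.0, N=100000, rng=rng)
print(f"Sample mean: {samples.mean():.4f} (true 1/rate = 0.5)")
```

This only works because the exponential CDF inverts in closed form; for a typical posterior, no such inverse exists, which is why the rest of the hierarchy is needed.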
We will explore rejection sampling and importance sampling in the following pages as building blocks, culminating in the powerful MCMC framework.
Monte Carlo methods have a fascinating history that intertwines physics, computing, and statistics.
Origins at Los Alamos (1940s):
The Monte Carlo method was developed by Stanislaw Ulam and John von Neumann during work on the Manhattan Project. They needed to simulate neutron diffusion through fissile material—a problem with too many random interactions for analytical solutions.
Ulam, recovering from illness and playing solitaire, realized that rather than calculating the probability of a successful game analytically, he could simply play many games and count successes. This insight—simulating a process to estimate probabilities—became foundational.
The name references the Monte Carlo Casino in Monaco, nodding to the role of chance in the method. Nicholas Metropolis, who would later develop the Metropolis algorithm, coined the name.
Metropolis algorithm (1953):
Metropolis and colleagues published the foundational MCMC algorithm in "Equation of State Calculations by Fast Computing Machines." Originally developed to simulate thermodynamic equilibrium of molecules, it would transform Bayesian computation decades later.
The Bayesian revolution (1990s):
For decades, Bayesian methods remained largely theoretical due to computational barriers. The work of Gelfand and Smith (1990) demonstrating Gibbs sampling for Bayesian inference, followed by practical MCMC software (BUGS, WinBUGS), launched a revolution. Suddenly, complex hierarchical models became tractable.
The migration of Monte Carlo methods from physics to statistics is a beautiful example of cross-disciplinary fertilization. Physicists simulating molecules at thermal equilibrium and statisticians sampling posterior distributions face the same mathematical problem: generating samples from a distribution known only up to a normalizing constant. The canonical distribution in statistical mechanics, p(x) ∝ exp(-E(x)/kT), has the same form as Bayesian posteriors, p(θ|D) ∝ exp(log p(D|θ) + log p(θ)).
| Year | Development | Significance |
|---|---|---|
| 1946 | Ulam's solitaire insight | Conceptual foundation for simulation-based estimation |
| 1949 | Metropolis & Ulam paper | First formal description of Monte Carlo method |
| 1953 | Metropolis algorithm | MCMC for sampling from Boltzmann distributions |
| 1970 | Hastings generalization | Extended Metropolis to asymmetric proposals |
| 1984 | Geman & Geman (Gibbs sampling) | Named and formalized for image restoration |
| 1990 | Gelfand & Smith | Demonstrated Gibbs for Bayesian inference |
| 1994 | BUGS software released | Made MCMC accessible to applied statisticians |
| 2011 | Stan, Hamiltonian MC | Gradient-based MCMC for complex models |
While the theory of Monte Carlo estimation is elegant, practical application requires attention to several details.
Sample size determination:
How many samples do we need? This depends on the variance $\sigma_f^2$ of the integrand, the precision required for the quantities being reported, and the computational cost of each sample.
Rule of thumb: Start with 1,000-10,000 samples for exploration, increase to 100,000+ for final inference.
Monitoring convergence:
With i.i.d. samples, monitoring is straightforward: track the running estimate $\hat{I}_N$ and its estimated standard error as samples accumulate; the estimate should stabilize within a shrinking $\pm 2 \, \widehat{\text{SE}}$ band.
Effective sample size (for correlated samples):
When samples are correlated (as in MCMC), the effective sample size is: $$N_{\text{eff}} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}$$
where $\rho_k$ is the autocorrelation at lag $k$. Highly correlated chains have $N_{\text{eff}} \ll N$.
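A rough sketch of estimating $N_{\text{eff}}$ from a correlated chain, using a simulated AR(1) series and a simple truncate-at-the-first-negative-autocorrelation heuristic (one of several common truncation choices; production MCMC software uses more careful estimators):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a correlated chain: AR(1) with coefficient phi.
phi, N = 0.9, 100000
chain = np.empty(N)
chain[0] = rng.standard_normal()
for t in range(1, N):
    chain[t] = phi * chain[t-1] + np.sqrt(1 - phi**2) * rng.standard_normal()

def effective_sample_size(x, max_lag=1000):
    """N_eff = N / (1 + 2 * sum of autocorrelations), truncating the
    sum at the first non-positive autocorrelation (simple heuristic)."""
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    rho_sum = 0.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (n * var)
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

n_eff = effective_sample_size(chain)
# For AR(1), theory gives N_eff ≈ N * (1 - phi) / (1 + phi)
print(f"N = {N}, estimated N_eff = {n_eff:.0f}, "
      f"theory ≈ {N * (1 - phi) / (1 + phi):.0f}")
```

With $\phi = 0.9$ the 100,000 correlated draws carry roughly the information of 5,000 independent ones, which is why ESS, not raw chain length, governs MCMC precision.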
When standard Monte Carlo variance is too high, consider:
• Antithetic variates: Generate negatively correlated pairs to reduce variance
• Control variates: Use a correlated quantity with known expectation
• Stratification: Partition the sample space for more uniform coverage
• Rao-Blackwellization: Replace random quantities with conditional expectations
These techniques can dramatically improve efficiency without increasing sample size.
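As one concrete illustration, a sketch of antithetic variates for $I = \mathbb{E}[e^X]$ with $X \sim \mathcal{N}(0,1)$ (true value $e^{1/2} \approx 1.6487$). Pairing each draw $x$ with $-x$ induces negative correlation because $e^x$ is monotone, cutting the variance per function evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.exp      # integrand; I = E[exp(X)] = e^{1/2} for X ~ N(0,1)
N = 100000

# Plain Monte Carlo: N independent draws
x = rng.standard_normal(N)
plain = f(x)

# Antithetic: N/2 draws, each paired with its negation; same total
# number of function evaluations as plain MC.
x_half = rng.standard_normal(N // 2)
anti = 0.5 * (f(x_half) + f(-x_half))

print(f"True value:    {np.exp(0.5):.4f}")
print(f"Plain MC:      {plain.mean():.4f}  (variance per draw {plain.var():.3f})")
print(f"Antithetic MC: {anti.mean():.4f}  (variance per pair {anti.var():.3f})")
```

Both estimators are unbiased; the antithetic version simply averages terms with smaller spread, here roughly a threefold variance reduction at no extra cost.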
Monte Carlo methods provide the computational foundation for modern Bayesian inference. The core insight is profound yet simple: replace intractable integrals with sample averages.
What's Next:
The following pages develop the machinery for sampling from complex posteriors: rejection sampling, importance sampling, and finally the MCMC framework.
Each technique builds on the Monte Carlo foundation established here, progressively enabling inference in more challenging settings.
You now understand the theoretical foundations of Monte Carlo estimation: why it works, how fast it converges, and what makes it uniquely suited to high-dimensional Bayesian inference. Next, we'll develop rejection sampling—the first concrete technique for drawing samples from complex distributions.