At the heart of Bayesian statistics lies a beautiful but computationally demanding principle: update beliefs with evidence. Bayes' theorem tells us exactly how to do this—compute the posterior distribution by multiplying the prior by the likelihood and normalizing. In theory, this is elegant. In practice, it often demands computing integrals that have no closed-form solution.
The normalization problem:
$$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{\int p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'}$$
The denominator—the marginal likelihood or evidence—requires integrating over the entire parameter space. For most realistic models, this integral is analytically intractable. We cannot compute it in closed form, and numerical integration (like simple quadrature) fails catastrophically in high dimensions due to the curse of dimensionality.
This page introduces Monte Carlo methods—a class of computational techniques that approximate intractable integrals using random sampling. You will understand the theoretical foundations, convergence guarantees, and practical considerations that make Monte Carlo the workhorse of modern Bayesian computation.
Bayesian inference isn't just about computing posteriors—nearly every quantity of interest requires integration. Consider what we often need:
1. Posterior expectations: $$\mathbb{E}_{p(\theta|\mathcal{D})}[f(\theta)] = \int f(\theta) \, p(\theta | \mathcal{D}) \, d\theta$$
This includes posterior means ($f(\theta) = \theta$), variances, credible intervals, and any function of parameters.
2. Posterior predictive distribution: $$p(x_{\text{new}} | \mathcal{D}) = \int p(x_{\text{new}} | \theta) \, p(\theta | \mathcal{D}) \, d\theta$$
Predictions that properly account for parameter uncertainty require averaging over the posterior.
3. Marginal posteriors: $$p(\theta_1 | \mathcal{D}) = \int p(\theta_1, \theta_2 | \mathcal{D}) \, d\theta_2$$
Examining individual parameters requires marginalizing out all others.
4. Model evidence (for model comparison): $$p(\mathcal{D} | M) = \int p(\mathcal{D} | \theta, M) \, p(\theta | M) \, d\theta$$
Bayes factors and posterior model probabilities depend on this integral.
Traditional numerical integration (e.g., trapezoidal rule, Simpson's rule) places a grid of points over the integration domain. In $d$ dimensions with $n$ points per dimension, this requires $n^d$ function evaluations. For a modest model with 20 parameters and 10 points per dimension, we need $10^{20}$ evaluations—utterly infeasible. This exponential scaling renders grid-based methods useless for high-dimensional inference.
Why is this so hard?
The fundamental issue is that probability mass in high dimensions concentrates in counterintuitive ways. Consider a Gaussian distribution in $d$ dimensions. Although the mode (peak) is at the origin, most of the probability mass lies in a thin typical set—a shell of intermediate density values, far from both the mode and the tails.
Grid-based methods waste effort evaluating the function at points that contribute negligibly to the integral. The regions that matter (the typical set) occupy an increasingly small fraction of the domain as dimensionality grows. We need a smarter approach that focuses computational effort where it counts.
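The typical-set claim is easy to check numerically. The sketch below (plain NumPy, written for this page) draws standard-Gaussian samples in increasing dimension and shows that their distances from the origin cluster tightly around $\sqrt{d}$, far from the mode at the origin:

```python
import numpy as np

rng = np.random.default_rng(0)

# In d dimensions, standard-Gaussian samples land at distance ~sqrt(d)
# from the origin, even though the density is highest at the origin.
for d in [2, 10, 100, 1000]:
    samples = rng.standard_normal((10000, d))
    radii = np.linalg.norm(samples, axis=1)
    print(f"d={d:5d}: mean radius = {radii.mean():7.3f}, "
          f"sqrt(d) = {np.sqrt(d):7.3f}, spread = {radii.std():.3f}")
```

The spread of the radii stays roughly constant (about $1/\sqrt{2}$) while the mean radius grows, so the mass lives in an ever-thinner shell relative to its distance from the mode.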
| Method | Scaling | $d = 2$ | $d = 10$ | $d = 50$ |
|---|---|---|---|---|
| Grid Quadrature | $n^d$ | $100$ | $10^{10}$ | $10^{50}$ |
| Monte Carlo | $n$ (independent of d) | $100$ | $100$ | $100$ |
| Error (Quadrature) | $O(n^{-k/d})$ | Fast | Slow | Fails |
| Error (MC) | $O(n^{-1/2})$ | $O(0.1)$ | $O(0.1)$ | $O(0.1)$ |
Monte Carlo methods replace deterministic integration with stochastic approximation. The core insight is elegantly simple: if we can draw samples from a distribution, we can approximate expectations under that distribution using sample averages.
The fundamental Monte Carlo estimator:
Suppose we want to compute: $$I = \mathbb{E}_{p(\theta)}[f(\theta)] = \int f(\theta) \, p(\theta) \, d\theta$$
If we can generate i.i.d. samples $\{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(N)}\}$ from the distribution $p(\theta)$, then:
$$\hat{I}_N = \frac{1}{N} \sum_{n=1}^{N} f(\theta^{(n)})$$
This sample mean is an unbiased estimator of the true integral $I$.
The Law of Large Numbers guarantees that the sample average converges to the true expectation as the number of samples grows: $\hat{I}_N \xrightarrow{a.s.} I$ as $N \to \infty$. This is not an approximation that gets systematically better—it's a probabilistic guarantee of convergence. With enough samples, we can get arbitrarily close to the true value.
The variance of the Monte Carlo estimator:
Since each $f(\theta^{(n)})$ is an independent draw, the variance of the sample mean is:
$$\text{Var}(\hat{I}_N) = \frac{1}{N} \, \text{Var}_{p(\theta)}[f(\theta)] = \frac{\sigma_f^2}{N}$$
where $\sigma_f^2 = \mathbb{E}[(f(\theta) - I)^2]$ is the variance of $f$ under $p$.
The standard error of the MC estimator is: $$\text{SE}(\hat{I}_N) = \frac{\sigma_f}{\sqrt{N}}$$
The Central Limit Theorem gives us more: for large $N$, the estimator is approximately normally distributed:
$$\hat{I}_N \approx \mathcal{N}\left(I, \frac{\sigma_f^2}{N}\right)$$
This allows us to construct confidence intervals for our Monte Carlo estimates.
```python
import numpy as np

def monte_carlo_integral(sampler, f, N):
    """
    Estimate E_p[f(theta)] using Monte Carlo sampling.

    Parameters
    ----------
    sampler : callable
        Function that returns N i.i.d. samples from p(theta)
    f : callable
        Function to integrate
    N : int
        Number of samples

    Returns
    -------
    estimate : float
        Monte Carlo estimate of E[f(theta)]
    std_error : float
        Standard error of the estimate
    """
    # Generate samples from the target distribution
    samples = sampler(N)

    # Evaluate the function at each sample
    f_values = np.array([f(theta) for theta in samples])

    # Monte Carlo estimate (sample mean)
    estimate = np.mean(f_values)

    # Standard error = sample std / sqrt(N)
    std_error = np.std(f_values, ddof=1) / np.sqrt(N)

    return estimate, std_error

# Example: Estimate E[X^2] where X ~ N(0, 1)
# True value: Var(X) + E[X]^2 = 1 + 0 = 1

def standard_normal_sampler(N):
    return np.random.randn(N)

def squared(x):
    return x ** 2

# Run with increasing sample sizes
for N in [100, 1000, 10000, 100000]:
    est, se = monte_carlo_integral(standard_normal_sampler, squared, N)
    print(f"N={N:6d}: Estimate = {est:.4f} ± {1.96*se:.4f} (95% CI)")

print("          True value = 1.0000")

# Output (typical):
# N=   100: Estimate = 0.9823 ± 0.1843 (95% CI)
# N=  1000: Estimate = 1.0102 ± 0.0621 (95% CI)
# N= 10000: Estimate = 0.9987 ± 0.0196 (95% CI)
# N=100000: Estimate = 1.0003 ± 0.0062 (95% CI)
```

The remarkable property of Monte Carlo estimation is that its convergence rate is independent of the dimensionality of the integration problem. This stands in stark contrast to deterministic quadrature methods.
The $O(N^{-1/2})$ convergence rate:
The root-mean-square error of the Monte Carlo estimator scales as: $$\text{RMSE} = \sqrt{\mathbb{E}[(\hat{I}_N - I)^2]} = \frac{\sigma_f}{\sqrt{N}}$$
Halving the error requires quadrupling the number of samples. This rate is dimension-independent—whether we're integrating in 2 dimensions or 200, the same number of samples achieves the same error reduction.
Comparison with deterministic methods:
For a function with $k$ continuous derivatives, optimal deterministic quadrature achieves error $O(N^{-k/d})$ in $d$ dimensions. In high dimensions, this becomes arbitrarily slow. Monte Carlo's $O(N^{-1/2})$ rate, while seemingly modest, becomes superior as soon as:
$$\frac{k}{d} < \frac{1}{2} \quad \Rightarrow \quad d > 2k$$
For smooth functions ($k=2$), MC wins for $d > 4$. Real posterior distributions with dozens or hundreds of parameters make MC the only viable approach.
The $O(N^{-1/2})$ rate has practical implications for computational budgets:
• 1 decimal digit of precision needs ~$100$ samples
• 2 decimal digits needs ~$10,000$ samples
• 3 decimal digits needs ~$1,000,000$ samples
For most Bayesian applications, 2-3 significant figures suffice for posterior summaries, making MC eminently practical.
Estimating the variance:
In practice, we don't know $\sigma_f^2$—we must estimate it from the same samples:
$$\hat{\sigma}_f^2 = \frac{1}{N-1} \sum_{n=1}^{N} \left(f(\theta^{(n)}) - \hat{I}_N\right)^2$$
This gives us the estimated standard error: $$\widehat{\text{SE}}(\hat{I}_N) = \frac{\hat{\sigma}_f}{\sqrt{N}}$$
Confidence intervals can be constructed using the normal approximation: $$\hat{I}_N \pm z_{\alpha/2} \cdot \widehat{\text{SE}}(\hat{I}_N)$$
For a 95% CI, $z_{0.025} \approx 1.96$, giving roughly $\hat{I}_N \pm 2 \cdot \widehat{\text{SE}}$.
```python
import numpy as np

def analyze_convergence(sampler, f, true_value, max_N=100000, num_trials=50):
    """
    Analyze Monte Carlo convergence empirically.
    Demonstrates O(N^{-1/2}) error scaling.
    """
    sample_sizes = np.logspace(1, np.log10(max_N), 50).astype(int)
    sample_sizes = np.unique(sample_sizes)

    rmse_values = []
    theoretical_se = []

    for N in sample_sizes:
        errors = []
        for _ in range(num_trials):
            samples = sampler(N)
            f_vals = np.array([f(x) for x in samples])
            estimate = np.mean(f_vals)
            errors.append(estimate - true_value)

        rmse = np.sqrt(np.mean(np.array(errors)**2))
        rmse_values.append(rmse)

        # Theoretical: sigma_f / sqrt(N)
        # For X ~ N(0,1), f(x) = x^2, Var(X^2) = E[X^4] - E[X^2]^2 = 3 - 1 = 2
        sigma_f = np.sqrt(2)  # Known for this specific case
        theoretical_se.append(sigma_f / np.sqrt(N))

    return sample_sizes, rmse_values, theoretical_se

# Run analysis for E[X^2], X ~ N(0,1)
sampler = lambda N: np.random.randn(N)
f = lambda x: x**2
true_value = 1.0

sizes, rmse, theory = analyze_convergence(sampler, f, true_value)

# Verify O(N^{-1/2}) scaling
log_sizes = np.log10(sizes)
log_rmse = np.log10(rmse)
slope, intercept = np.polyfit(log_sizes, log_rmse, 1)

print(f"Empirical convergence rate: O(N^{{{slope:.3f}}})")
print(f"Expected rate:              O(N^{{-0.500}})")
print()
print("Sample size vs RMSE:")
for i in range(0, len(sizes), 10):
    print(f"  N = {sizes[i]:6d}: RMSE = {rmse[i]:.4f}, Theory = {theory[i]:.4f}")

# Typical output:
# Empirical convergence rate: O(N^{-0.498})
# Expected rate:              O(N^{-0.500})
```

The Monte Carlo principle directly addresses Bayesian computational challenges. If we can sample from the posterior $p(\theta | \mathcal{D})$, any posterior quantity becomes a sample average.
Posterior expectations: $$\mathbb{E}[f(\theta) | \mathcal{D}] = \int f(\theta) \, p(\theta | \mathcal{D}) \, d\theta \approx \frac{1}{N} \sum_{n=1}^{N} f(\theta^{(n)})$$
where $\theta^{(n)} \sim p(\theta | \mathcal{D})$.
Posterior predictive distribution: $$p(\tilde{x} | \mathcal{D}) = \int p(\tilde{x} | \theta) \, p(\theta | \mathcal{D}) \, d\theta \approx \frac{1}{N} \sum_{n=1}^{N} p(\tilde{x} | \theta^{(n)})$$
Generating predictive samples:
To sample from the predictive distribution, we use composition: for each $n = 1, \ldots, N$, draw $\theta^{(n)} \sim p(\theta | \mathcal{D})$, then draw $\tilde{x}^{(n)} \sim p(\tilde{x} | \theta^{(n)})$.

The resulting $\{\tilde{x}^{(n)}\}$ are samples from $p(\tilde{x} | \mathcal{D})$.
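As a minimal sketch of composition sampling, assume we already have posterior draws of $(\mu, \sigma)$ for a Gaussian model (the values below are simulated stand-ins, not output of a real inference run); pairing each posterior draw with one likelihood draw yields predictive samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws for a Gaussian model (in practice these
# would come from MCMC): mu centered at 2.5, sigma centered at 1.2.
N = 10000
mu_samples = rng.normal(2.5, 0.30, N)
sigma_samples = np.abs(rng.normal(1.2, 0.10, N))

# Composition: pair each posterior draw (mu, sigma) with one draw from
# the likelihood, x_new ~ N(mu, sigma). The resulting x_new are samples
# from the posterior predictive p(x_new | D).
x_new = rng.normal(mu_samples, sigma_samples)

print(f"Predictive mean: {x_new.mean():.3f}")
# Predictive sd exceeds 1.2: uncertainty about mu widens the predictions.
print(f"Predictive sd:   {x_new.std():.3f}")
```

Note that the predictive spread combines the likelihood's noise with the posterior uncertainty about $\mu$, which is exactly what the integral in the predictive distribution accomplishes.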
Credible intervals:
To find a 95% credible interval for a parameter $\theta_j$, simply take the 2.5th and 97.5th percentiles of the posterior samples.
Once you have samples from the posterior, computing any derived quantity is trivial. Want P(θ > 0)? Count the fraction of samples where θ > 0. Want the probability that model A outperforms model B? Apply both models to each posterior sample and count wins. The flexibility is remarkable—you're limited only by what you can compute from parameter samples.
```python
import numpy as np
from scipy import stats

def bayesian_inference_with_samples(posterior_samples, X_new=None):
    """
    Demonstrate Bayesian inference using posterior samples.

    Assume we have samples from posterior p(mu, sigma | data)
    for a Gaussian model.

    posterior_samples: (N, 2) array with columns [mu, sigma]
    """
    N = len(posterior_samples)
    mu_samples = posterior_samples[:, 0]
    sigma_samples = posterior_samples[:, 1]

    # 1. Posterior mean of mu
    mu_posterior_mean = np.mean(mu_samples)
    mu_posterior_std = np.std(mu_samples)

    # 2. 95% Credible Interval for mu
    mu_ci = np.percentile(mu_samples, [2.5, 97.5])

    # 3. Probability that mu > 0
    prob_mu_positive = np.mean(mu_samples > 0)

    # 4. Posterior of a derived quantity: coefficient of variation = sigma/|mu|
    cv_samples = sigma_samples / np.abs(mu_samples)
    cv_mean = np.mean(cv_samples)
    cv_ci = np.percentile(cv_samples, [2.5, 97.5])

    print("=== Bayesian Inference from Posterior Samples ===")
    print(f"Number of posterior samples: {N}")
    print()
    print(f"Posterior mean of μ: {mu_posterior_mean:.4f} ± {mu_posterior_std:.4f}")
    print(f"95% Credible Interval for μ: [{mu_ci[0]:.4f}, {mu_ci[1]:.4f}]")
    print(f"P(μ > 0 | data): {prob_mu_positive:.4f}")
    print()
    print("Coefficient of Variation (σ/|μ|):")
    print(f"  Posterior mean: {cv_mean:.4f}")
    print(f"  95% CI: [{cv_ci[0]:.4f}, {cv_ci[1]:.4f}]")

    # 5. Posterior predictive distribution
    if X_new is not None:
        # For each posterior sample, compute p(X_new | mu, sigma)
        predictive_densities = []
        for mu, sigma in posterior_samples:
            density = stats.norm.pdf(X_new, loc=mu, scale=sigma)
            predictive_densities.append(density)

        # Average over posterior samples
        predictive_density = np.mean(predictive_densities)
        print(f"\nPredictive density at X_new={X_new}: {predictive_density:.6f}")

    return {
        'mu_mean': mu_posterior_mean,
        'mu_ci': mu_ci,
        'prob_positive': prob_mu_positive,
        'cv_mean': cv_mean,
        'cv_ci': cv_ci
    }

# Simulate posterior samples (in practice, these come from MCMC)
# Pretend data came from N(2.5, 1.2)
np.random.seed(42)
true_mu, true_sigma = 2.5, 1.2
n_samples = 10000

# Simulate "posterior" (in reality, this would come from MCMC)
# Using posterior from conjugate normal-normal model as illustration
mu_samples = np.random.normal(2.48, 0.15, n_samples)              # Posterior of mu
sigma_samples = np.abs(np.random.normal(1.18, 0.12, n_samples))   # Posterior of sigma

posterior_samples = np.column_stack([mu_samples, sigma_samples])

results = bayesian_inference_with_samples(posterior_samples, X_new=3.0)
```

The Monte Carlo principle is beautiful in theory, but there's a critical gap: How do we actually generate samples from the posterior $p(\theta | \mathcal{D})$?
This is precisely where the computational difficulty lies. The posterior is defined only up to a normalizing constant:
$$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{Z}$$
where $Z = p(\mathcal{D}) = \int p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'$ is the very integral we're trying to avoid computing.
We can easily evaluate the unnormalized posterior: $$\tilde{p}(\theta | \mathcal{D}) = p(\mathcal{D} | \theta) \, p(\theta)$$
But sampling from an unnormalized distribution is not trivial. Standard techniques like inverse CDF sampling require the normalized distribution.
This is the fundamental computational barrier in Bayesian inference. We can compute ratios of posterior probabilities easily—p(θ₁|D)/p(θ₂|D)—since the normalization constants cancel. But generating samples requires more sophisticated machinery. The remainder of this module presents progressively more powerful techniques for overcoming this barrier.
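To make the "ratios are easy" point concrete, here is a toy sketch; the Gaussian likelihood and prior are chosen purely for illustration. It evaluates the unnormalized log-posterior and a posterior ratio without ever touching $Z$:

```python
import numpy as np
from scipy import stats

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 10).
rng = np.random.default_rng(0)
data = rng.normal(1.5, 1.0, size=50)

def log_unnorm_posterior(theta):
    # log p(D | theta) + log p(theta); the constant log Z is unknown
    log_lik = np.sum(stats.norm.logpdf(data, loc=theta, scale=1.0))
    log_prior = stats.norm.logpdf(theta, loc=0.0, scale=10.0)
    return log_lik + log_prior

# The ratio p(theta1 | D) / p(theta2 | D) is computable: Z cancels.
ratio = np.exp(log_unnorm_posterior(1.5) - log_unnorm_posterior(0.0))
print(f"p(theta=1.5 | D) / p(theta=0.0 | D) = {ratio:.3g}")
```

Working in log space, as above, is standard practice: the unnormalized posterior itself underflows or overflows easily, while differences of log-densities are numerically stable.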
The hierarchy of sampling methods:
Different sampling techniques apply under different conditions:
| Method | Requirements | Complexity |
|---|---|---|
| Direct sampling | Known, invertible CDF | Trivial |
| Transformation | Express as transform of simple RV | Simple |
| Rejection sampling | Bounding envelope | Moderate |
| Importance sampling | Proposal distribution | Moderate |
| MCMC | Ability to evaluate unnormalized density | General-purpose |
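As a quick illustration of the table's first row, inverse-CDF sampling for the exponential distribution (the function name and rate value are ours, for illustration): since $F(x) = 1 - e^{-\lambda x}$, we have $F^{-1}(u) = -\ln(1-u)/\lambda$, so uniform draws map directly to exponential draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(rate, N, rng):
    # Direct sampling via the inverse CDF: F^{-1}(u) = -log(1-u)/rate
    u = rng.uniform(size=N)        # U(0,1) draws
    return -np.log1p(-u) / rate    # log1p for numerical accuracy near u=0

samples = sample_exponential(rate=2.0, N=100000, rng=rng)
print(f"Sample mean: {samples.mean():.4f} (true 1/rate = 0.5)")
```

This only works because the exponential CDF inverts in closed form; for a typical posterior, no such inverse exists, which is why the rest of the hierarchy is needed.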
We will explore rejection sampling and importance sampling in the following pages as building blocks, culminating in the powerful MCMC framework.
Monte Carlo methods have a fascinating history that intertwines physics, computing, and statistics.
Origins at Los Alamos (1940s):
The Monte Carlo method was developed by Stanislaw Ulam and John von Neumann during work on the Manhattan Project. They needed to simulate neutron diffusion through fissile material—a problem with too many random interactions for analytical solutions.
Ulam, recovering from illness and playing solitaire, realized that rather than calculating the probability of a successful game analytically, he could simply play many games and count successes. This insight—simulating a process to estimate probabilities—became foundational.
The name references the Monte Carlo Casino in Monaco, nodding to the role of chance in the method. Nicholas Metropolis, who would later develop the Metropolis algorithm, coined the name.
Metropolis algorithm (1953):
Metropolis and colleagues published the foundational MCMC algorithm in "Equation of State Calculations by Fast Computing Machines." Originally developed to simulate thermodynamic equilibrium of molecules, it would transform Bayesian computation decades later.
The Bayesian revolution (1990s):
For decades, Bayesian methods remained largely theoretical due to computational barriers. The work of Gelfand and Smith (1990) demonstrating Gibbs sampling for Bayesian inference, followed by practical MCMC software (BUGS, WinBUGS), launched a revolution. Suddenly, complex hierarchical models became tractable.
The migration of Monte Carlo methods from physics to statistics is a beautiful example of cross-disciplinary fertilization. Physicists simulating molecules at thermal equilibrium and statisticians sampling posterior distributions face the same mathematical problem: generating samples from a distribution known only up to a normalizing constant. The canonical distribution in statistical mechanics, p(x) ∝ exp(-E(x)/kT), has the same form as Bayesian posteriors, p(θ|D) ∝ exp(log p(D|θ) + log p(θ)).
| Year | Development | Significance |
|---|---|---|
| 1946 | Ulam's solitaire insight | Conceptual foundation for simulation-based estimation |
| 1949 | Metropolis & Ulam paper | First formal description of Monte Carlo method |
| 1953 | Metropolis algorithm | MCMC for sampling from Boltzmann distributions |
| 1970 | Hastings generalization | Extended Metropolis to asymmetric proposals |
| 1984 | Geman & Geman (Gibbs sampling) | Named and formalized for image restoration |
| 1990 | Gelfand & Smith | Demonstrated Gibbs for Bayesian inference |
| 1994 | BUGS software released | Made MCMC accessible to applied statisticians |
| 2011 | Stan, Hamiltonian MC | Gradient-based MCMC for complex models |
While the theory of Monte Carlo estimation is elegant, practical application requires attention to several details.
Sample size determination:
How many samples do we need? This depends on the variance $\sigma_f^2$ of the integrand, the precision required for the quantities being reported, and the computational cost of each sample.
Rule of thumb: Start with 1,000-10,000 samples for exploration, increase to 100,000+ for final inference.
Monitoring convergence:
With i.i.d. samples, monitoring is straightforward: track the running estimate $\hat{I}_N$ and its estimated standard error as samples accumulate; the estimate should stabilize within a shrinking $\pm 2 \, \widehat{\text{SE}}$ band.
Effective sample size (for correlated samples):
When samples are correlated (as in MCMC), the effective sample size is: $$N_{\text{eff}} = \frac{N}{1 + 2\sum_{k=1}^{\infty} \rho_k}$$
where $\rho_k$ is the autocorrelation at lag $k$. Highly correlated chains have $N_{\text{eff}} \ll N$.
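A rough sketch of estimating $N_{\text{eff}}$ from a correlated chain, using a simulated AR(1) series and a simple truncate-at-the-first-negative-autocorrelation heuristic (one of several common truncation choices; production MCMC software uses more careful estimators):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a correlated chain: AR(1) with coefficient phi.
phi, N = 0.9, 100000
chain = np.empty(N)
chain[0] = rng.standard_normal()
for t in range(1, N):
    chain[t] = phi * chain[t-1] + np.sqrt(1 - phi**2) * rng.standard_normal()

def effective_sample_size(x, max_lag=1000):
    """N_eff = N / (1 + 2 * sum of autocorrelations), truncating the
    sum at the first non-positive autocorrelation (simple heuristic)."""
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    rho_sum = 0.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (n * var)
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

n_eff = effective_sample_size(chain)
# For AR(1), theory gives N_eff ≈ N * (1 - phi) / (1 + phi)
print(f"N = {N}, estimated N_eff = {n_eff:.0f}, "
      f"theory ≈ {N * (1 - phi) / (1 + phi):.0f}")
```

With $\phi = 0.9$ the 100,000 correlated draws carry roughly the information of 5,000 independent ones, which is why ESS, not raw chain length, governs MCMC precision.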
When standard Monte Carlo variance is too high, consider:
• Antithetic variates: Generate negatively correlated pairs to reduce variance
• Control variates: Use a correlated quantity with known expectation
• Stratification: Partition the sample space for more uniform coverage
• Rao-Blackwellization: Replace random quantities with conditional expectations
These techniques can dramatically improve efficiency without increasing sample size.
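As one concrete illustration, a sketch of antithetic variates for $I = \mathbb{E}[e^X]$ with $X \sim \mathcal{N}(0,1)$ (true value $e^{1/2} \approx 1.6487$). Pairing each draw $x$ with $-x$ induces negative correlation because $e^x$ is monotone, cutting the variance per function evaluation:

```python
import numpy as np

rng = np.random.default_rng(0)

f = np.exp      # integrand; I = E[exp(X)] = e^{1/2} for X ~ N(0,1)
N = 100000

# Plain Monte Carlo: N independent draws
x = rng.standard_normal(N)
plain = f(x)

# Antithetic: N/2 draws, each paired with its negation; same total
# number of function evaluations as plain MC.
x_half = rng.standard_normal(N // 2)
anti = 0.5 * (f(x_half) + f(-x_half))

print(f"True value:    {np.exp(0.5):.4f}")
print(f"Plain MC:      {plain.mean():.4f}  (variance per draw {plain.var():.3f})")
print(f"Antithetic MC: {anti.mean():.4f}  (variance per pair {anti.var():.3f})")
```

Both estimators are unbiased; the antithetic version simply averages terms with smaller spread, here roughly a threefold variance reduction at no extra cost.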
Monte Carlo methods provide the computational foundation for modern Bayesian inference. The core insight is profound yet simple: replace intractable integrals with sample averages.
What's Next:
The following pages develop the machinery for sampling from complex posteriors: rejection sampling, importance sampling, and finally the MCMC framework.
Each technique builds on the Monte Carlo foundation established here, progressively enabling inference in more challenging settings.
You now understand the theoretical foundations of Monte Carlo estimation: why it works, how fast it converges, and what makes it uniquely suited to high-dimensional Bayesian inference. Next, we'll develop rejection sampling—the first concrete technique for drawing samples from complex distributions.