In machine learning, we constantly face a fundamental question: Which model should we use? Given a dataset, we might consider a linear model, a polynomial model, a neural network, or countless other possibilities. Each model makes different assumptions about the underlying data-generating process, and each has different capacity to fit complex patterns.
The classical approach to model selection—choosing based on training error—is deeply flawed. A model with more parameters can always achieve lower training error, but this often leads to overfitting. Cross-validation offers some remedy, but it requires holding out data and doesn't provide a principled probabilistic framework.
Bayesian inference offers an elegant solution: the marginal likelihood. Also called the model evidence, the marginal likelihood quantifies how well a model explains the observed data while automatically penalizing model complexity. It's the foundation upon which all Bayesian model comparison rests.
By the end of this page, you will understand: (1) The precise mathematical definition of marginal likelihood and how it differs from the ordinary likelihood, (2) Why marginal likelihood automatically balances data fit against model complexity, (3) The computational challenges in calculating marginal likelihoods, (4) How to compute marginal likelihoods analytically when possible, and (5) Approximation methods for intractable cases.
To understand marginal likelihood, we must first revisit the structure of Bayesian inference. Given a model $\mathcal{M}$ with parameters $\boldsymbol{\theta}$ and observed data $\mathcal{D}$, Bayes' theorem gives us the posterior distribution over parameters:
$$p(\boldsymbol{\theta} | \mathcal{D}, \mathcal{M}) = \frac{p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} | \mathcal{M})}{p(\mathcal{D} | \mathcal{M})}$$
The denominator $p(\mathcal{D} | \mathcal{M})$ is the marginal likelihood. It's called 'marginal' because it's obtained by marginalizing (integrating) over all possible parameter values:
$$p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} | \mathcal{M}) \, d\boldsymbol{\theta}$$
The marginal likelihood goes by many names: model evidence, integrated likelihood, or simply evidence. In the context of model comparison, 'evidence' is particularly apt—it represents the total evidential support the data provides for the model as a whole, not for any particular parameter setting.
Let's parse this integral carefully:
$p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M})$: The likelihood function. Given specific parameter values $\boldsymbol{\theta}$, how probable is the observed data?
$p(\boldsymbol{\theta} | \mathcal{M})$: The prior distribution. Before seeing data, how plausible are different parameter values under our model assumptions?
The integral: We average the likelihood over all possible parameter values, weighted by their prior probability.
The key insight: The marginal likelihood doesn't evaluate the model at any single point in parameter space. Instead, it asks: 'Considering all possible parameter values this model could have, weighted by their prior probability, how well does this model explain the data on average?'
| Aspect | Maximum Likelihood | Marginal Likelihood |
|---|---|---|
| Formula | $\max_{\boldsymbol{\theta}} p(\mathcal{D} \mid \boldsymbol{\theta})$ | $\int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}$ |
| What it measures | Best-case fit (optimized parameters) | Average fit (all parameters weighted by prior) |
| Dependency on parameter count | Always improves with more parameters | May decrease with unnecessary parameters |
| Overfitting tendency | Highly prone to overfitting | Inherently regularized |
| Computational cost | Optimization (relatively fast) | Integration (often intractable) |
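To make the contrast concrete, here is a minimal sketch on made-up data (11 heads in 20 coin tosses), comparing a fixed fair-coin model against a biased-coin model with a free parameter $\theta$ and a uniform Beta(1, 1) prior. The maximized likelihood always prefers the model with the extra parameter; the marginal likelihood need not.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln, comb

# Hypothetical coin-flip data: 11 heads in 20 tosses (roughly fair-looking)
n, k = 20, 11

# Model A: fair coin, no free parameters -> its likelihood is also its evidence
log_lik_fair = stats.binom.logpmf(k, n, 0.5)

# Model B: biased coin with theta free and a uniform Beta(1, 1) prior
# Maximum likelihood: plug in the MLE theta_hat = k / n
log_lik_biased_max = stats.binom.logpmf(k, n, k / n)

# Marginal likelihood: integrate over theta analytically (Beta-Binomial)
log_ml_biased = np.log(comb(n, k)) + betaln(1 + k, 1 + n - k) - betaln(1, 1)

print(f"Fair coin    log-likelihood / evidence: {log_lik_fair:.3f}")
print(f"Biased coin  max log-likelihood:        {log_lik_biased_max:.3f}")
print(f"Biased coin  log marginal likelihood:   {log_ml_biased:.3f}")
# Maximizing over theta always prefers the extra parameter; the marginal
# likelihood can still favor the simpler fair-coin model for fair-looking data.
```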
The marginal likelihood has a beautiful geometric interpretation that illuminates why it naturally penalizes complexity.
Imagine the parameter space as a landscape. The height at each point $\boldsymbol{\theta}$ represents the likelihood $p(\mathcal{D} | \boldsymbol{\theta})$—how well those particular parameters explain the data. The prior $p(\boldsymbol{\theta})$ acts as a 'sampling weight' over this landscape.
The marginal likelihood is the volume under this weighted likelihood surface. It's the integral of the likelihood, weighted by the prior, over the entire parameter space.
Now consider two models:
- Model A: Simple model (few parameters)
- Model B: Complex model (many parameters)
In high-dimensional parameter spaces, most of the volume is in the 'corners' far from any good-fitting region. A complex model must spread its prior probability mass across this vast space, leaving little mass near the parameters that actually explain the data well. This geometric fact is why marginal likelihood penalizes unnecessary complexity.
An analogy from probability: Consider a Gaussian distribution. A narrow Gaussian (small variance) concentrates its probability mass in a small region, achieving high density there. A wide Gaussian (large variance) spreads its mass thin, achieving low density everywhere. The marginal likelihood follows similar logic—a simpler model that 'commits' to a smaller region of data space can achieve higher evidence if its predictions are correct.
The key principle: Marginal likelihood rewards models that make specific, accurate predictions. A complex model that could explain almost any data pattern gets penalized because it hasn't 'committed' to this particular pattern—its prior probability is spread across all possibilities.
The complexity-penalizing behavior of marginal likelihood can be made explicit through the concept of the Occam factor.
For models where the posterior is approximately Gaussian (which is common, by the Bernstein–von Mises theorem), we can decompose the log marginal likelihood as:
$$\log p(\mathcal{D} | \mathcal{M}) \approx \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + \log \text{Occam Factor}$$
where $\hat{\boldsymbol{\theta}}$ is the maximum a posteriori (MAP) estimate.
The Occam factor captures the ratio of posterior volume to prior volume in parameter space:
$$\text{Occam Factor} = \frac{\text{Posterior Volume}}{\text{Prior Volume}} = \frac{\Delta\boldsymbol{\theta}_{\text{posterior}}}{\Delta\boldsymbol{\theta}_{\text{prior}}}$$
Since data constrains parameters, the posterior is always more concentrated than the prior. Thus the Occam factor is always less than 1, contributing a negative term to the log marginal likelihood.
For a d-dimensional Gaussian posterior approximation, the Occam factor can be computed explicitly:
$$\log \text{Occam Factor} \approx \log p(\hat{\boldsymbol{\theta}} | \mathcal{M}) + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\boldsymbol{\Sigma}_{\text{post}}|$$
where $\boldsymbol{\Sigma}_{\text{post}}$ is the posterior covariance matrix and $p(\hat{\boldsymbol{\theta}} | \mathcal{M})$ is the prior density at the MAP estimate. The Gaussian term $(2\pi)^{d/2}|\boldsymbol{\Sigma}_{\text{post}}|^{1/2}$ plays the role of the posterior volume, while the prior density at the mode is roughly one over the prior volume.
For a model with isotropic priors of width $\sigma_{\text{prior}}$ and isotropic posterior of width $\sigma_{\text{post}}$:
$$\log \text{Occam Factor} \approx d \log \frac{\sigma_{\text{post}}}{\sigma_{\text{prior}}}$$
This is negative (a penalty) because $\sigma_{\text{post}} < \sigma_{\text{prior}}$. The penalty grows linearly with the number of parameters $d$, but is modulated by how much each parameter is constrained by the data.
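As a quick numerical illustration of this formula (with hypothetical widths chosen purely for the example), the penalty per parameter is simply $\log(\sigma_{\text{post}}/\sigma_{\text{prior}})$:

```python
import numpy as np

# Hypothetical widths: the prior believes each parameter lies roughly in +/- 2,
# while the data pin it down to roughly +/- 0.1
sigma_prior = 2.0
sigma_post = 0.1

for d in [1, 5, 20]:  # number of parameters
    log_occam = d * np.log(sigma_post / sigma_prior)
    print(f"d = {d:2d}: log Occam factor ~ {log_occam:7.2f}")
# Each well-constrained parameter costs about log(sigma_post / sigma_prior)
# ~ -3 nats here; the total penalty scales linearly with d.
```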
The Occam factor provides automatic regularization without any hand-tuned hyperparameters. Unlike L1/L2 regularization, where you choose $\lambda$, or cross-validation, where you choose the number of folds $k$, the Bayesian framework derives the complexity penalty from first principles. The 'right' amount of regularization emerges from the prior and the data.
For certain model classes, the marginal likelihood can be computed exactly in closed form. These cases typically involve conjugate prior relationships, where the prior and likelihood combine to produce a posterior in the same family as the prior.
Consider linear regression with Gaussian noise: $$y = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$$
With a Gaussian prior on weights: $$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$$
The marginal likelihood (assuming known $\sigma^2, \tau^2$) is:
$$p(\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2) = \mathcal{N}(\mathbf{y} | \mathbf{0}, \tau^2\mathbf{X}\mathbf{X}^\top + \sigma^2\mathbf{I})$$
This is a multivariate Gaussian with mean zero and covariance $\mathbf{K} = \tau^2\mathbf{X}\mathbf{X}^\top + \sigma^2\mathbf{I}$.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood_linear_regression(X, y, tau_sq, sigma_sq):
    """
    Compute log marginal likelihood for Bayesian linear regression.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Design matrix
    y : ndarray of shape (n_samples,)
        Target values
    tau_sq : float
        Prior variance on weights
    sigma_sq : float
        Noise variance

    Returns:
    --------
    log_ml : float
        Log marginal likelihood
    """
    n = len(y)

    # Covariance matrix: K = tau^2 * X @ X.T + sigma^2 * I
    K = tau_sq * X @ X.T + sigma_sq * np.eye(n)

    # Cholesky decomposition for numerical stability
    L, lower = cho_factor(K, lower=True)

    # Log determinant via Cholesky: log|K| = 2 * sum(log(diag(L)))
    log_det_K = 2 * np.sum(np.log(np.diag(L)))

    # Solve K^{-1} y via Cholesky
    alpha = cho_solve((L, lower), y)

    # Log marginal likelihood
    log_ml = -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))

    return log_ml

# Example: Compare models with different numbers of polynomial features
np.random.seed(42)
n = 50
x_raw = np.linspace(-1, 1, n)
y_true = 2 * x_raw + 0.5 * x_raw**2
y = y_true + 0.3 * np.random.randn(n)

# Model 1: Linear (degree 1)
X1 = np.column_stack([np.ones(n), x_raw])

# Model 2: Quadratic (degree 2)
X2 = np.column_stack([np.ones(n), x_raw, x_raw**2])

# Model 3: High-degree polynomial (degree 10) - overly complex
X3 = np.column_stack([x_raw**i for i in range(11)])

tau_sq = 1.0
sigma_sq = 0.1

for name, X in [("Linear (d=1)", X1), ("Quadratic (d=2)", X2), ("Degree-10", X3)]:
    log_ml = log_marginal_likelihood_linear_regression(X, y, tau_sq, sigma_sq)
    print(f"{name:20s}: log p(y|X,M) = {log_ml:.2f}")
```

For coin flipping with $k$ heads observed in $n$ tosses and a Beta$(\alpha, \beta)$ prior on the heads probability $\theta$:
The marginal likelihood is:
$$p(k | \alpha, \beta) = \binom{n}{k} \cdot \frac{B(\alpha + k, \beta + n - k)}{B(\alpha, \beta)}$$
where $B(\cdot, \cdot)$ is the Beta function.
This closed-form result arises from the conjugacy of the Beta prior with the Binomial likelihood.
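As a sanity check, the sketch below (assuming 22 heads in 30 tosses and a Beta(2, 2) prior, values chosen arbitrarily for illustration) evaluates the closed form and compares it against direct numerical integration of likelihood times prior:

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import betaln, comb

# Hypothetical data and prior
n, k = 30, 22            # 22 heads in 30 tosses
alpha, beta_ = 2.0, 2.0  # Beta(2, 2) prior on theta

# Closed form: p(k) = C(n, k) * B(alpha + k, beta + n - k) / B(alpha, beta)
log_ml_closed = (np.log(comb(n, k))
                 + betaln(alpha + k, beta_ + n - k)
                 - betaln(alpha, beta_))

# Numerical check: integrate likelihood * prior over theta in [0, 1]
integrand = lambda t: stats.binom.pmf(k, n, t) * stats.beta.pdf(t, alpha, beta_)
ml_numeric, _ = integrate.quad(integrand, 0.0, 1.0)

print(f"Closed form : {np.exp(log_ml_closed):.6e}")
print(f"Quadrature  : {ml_numeric:.6e}")
```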
Most practical models—neural networks, Gaussian processes with learned hyperparameters, complex graphical models—don't have closed-form marginal likelihoods. The integral is intractable, and we must resort to approximation methods covered later in this page.
When closed-form computation is impossible, we need approximation methods. Each has distinct strengths and is appropriate for different situations.
The Laplace approximation treats the posterior as approximately Gaussian, centered at its mode (MAP estimate). This gives:
$$\log p(\mathcal{D} | \mathcal{M}) \approx \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|$$
where $\hat{\boldsymbol{\theta}}$ is the MAP estimate and $\mathbf{H}$ is the Hessian of the negative log posterior at the mode.
Intuition: We approximate the integrand with a Gaussian bump centered at the MAP, then compute the volume under this Gaussian exactly.
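Here is a minimal sketch of that recipe on the coin-flip model (the same illustrative 22-heads-in-30-tosses data and Beta(2, 2) prior as above), using a bounded optimizer for the MAP and a finite-difference second derivative for the Hessian, then comparing against the exact Beta-Binomial evidence:

```python
import numpy as np
from scipy import optimize, stats
from scipy.special import betaln, comb

# Laplace approximation to the evidence for a Beta-Bernoulli coin model.
# Hypothetical data: k heads in n tosses, Beta(2, 2) prior on theta.
n, k = 30, 22
alpha, beta_ = 2.0, 2.0

def neg_log_joint(theta):
    """-log[ p(D | theta) * p(theta) ], the unnormalized negative log posterior."""
    return -(stats.binom.logpmf(k, n, theta) + stats.beta.logpdf(theta, alpha, beta_))

# 1. Find the MAP estimate
res = optimize.minimize_scalar(neg_log_joint, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map = res.x

# 2. Second derivative of the negative log posterior at the mode (finite differences)
eps = 1e-5
hessian = (neg_log_joint(theta_map + eps) - 2 * neg_log_joint(theta_map)
           + neg_log_joint(theta_map - eps)) / eps**2

# 3. Laplace: log p(D) ~ log p(D|MAP) + log p(MAP) + (d/2) log(2 pi) - (1/2) log|H|
d = 1
log_ml_laplace = -neg_log_joint(theta_map) + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(hessian)

# Exact Beta-Binomial evidence for comparison
log_ml_exact = np.log(comb(n, k)) + betaln(alpha + k, beta_ + n - k) - betaln(alpha, beta_)

print(f"Laplace approximation: {log_ml_laplace:.4f}")
print(f"Exact log evidence   : {log_ml_exact:.4f}")
```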
The Bayesian Information Criterion (BIC) is a rough approximation to the log marginal likelihood:
$$\text{BIC} = \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) - \frac{d}{2}\log n$$
where $d$ is the number of parameters and $n$ is the number of data points. (This is the log-evidence convention; the conventional BIC equals $-2$ times this quantity, so a smaller conventional BIC corresponds to a larger approximate evidence.)
Derivation sketch: BIC emerges from the Laplace approximation when we assume that $n$ is large, so the Hessian of the negative log posterior scales as $\mathbf{H} \approx n\mathbf{H}_1$ for a fixed per-observation information matrix $\mathbf{H}_1$, and we drop all terms that do not grow with $n$ (the prior density at the mode, $\frac{d}{2}\log 2\pi$, and $\frac{1}{2}\log|\mathbf{H}_1|$).
Interpretation: The $\frac{d}{2}\log n$ term is the complexity penalty. It grows with the number of parameters but is modulated by sample size—more data justifies more parameters.
AIC (Akaike Information Criterion) penalizes by $d$ rather than $\frac{d}{2}\log n$. AIC aims to minimize prediction error on new data, while BIC approximates Bayesian model evidence. For large $n$, BIC penalizes complexity more heavily, favoring simpler models. Neither is 'correct'—they optimize different objectives.
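The sketch below (assuming a known noise variance and a toy polynomial-regression setup like the one used earlier, with arbitrary example values) computes these BIC- and AIC-style scores for least-squares fits of increasing degree:

```python
import numpy as np

# BIC- and AIC-style scores (log-likelihood minus a penalty, in the convention
# above) for ordinary least-squares polynomial fits of increasing degree.
rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1, 1, n)
sigma = 0.3  # assumed known noise standard deviation
y = 2 * x + 0.5 * x**2 + sigma * rng.normal(size=n)

for degree in [1, 2, 10]:
    X = np.column_stack([x**i for i in range(degree + 1)])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    # Gaussian log-likelihood at the MLE of the coefficients
    log_lik = -0.5 * (n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / sigma**2)
    d = degree + 1
    bic_score = log_lik - 0.5 * d * np.log(n)  # higher is better in this convention
    aic_score = log_lik - d
    print(f"degree {degree:2d}: log-lik = {log_lik:7.2f}, "
          f"BIC-style = {bic_score:7.2f}, AIC-style = {aic_score:7.2f}")
```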
For arbitrary posteriors, we can estimate the marginal likelihood using Monte Carlo sampling:
Simple Monte Carlo (often poor): $$p(\mathcal{D}) \approx \frac{1}{S}\sum_{s=1}^{S} p(\mathcal{D} | \boldsymbol{\theta}^{(s)}), \quad \boldsymbol{\theta}^{(s)} \sim p(\boldsymbol{\theta})$$
This samples from the prior and averages likelihoods. It's unbiased but has extremely high variance—prior samples rarely land near high-likelihood regions.
Importance Sampling (better): $$p(\mathcal{D}) \approx \frac{1}{S}\sum_{s=1}^{S} \frac{p(\mathcal{D} | \boldsymbol{\theta}^{(s)}) p(\boldsymbol{\theta}^{(s)})}{q(\boldsymbol{\theta}^{(s)})}, \quad \boldsymbol{\theta}^{(s)} \sim q(\boldsymbol{\theta})$$
We sample from a proposal distribution $q$ concentrated near high-posterior regions.
Bridge Sampling, Path Sampling, Thermodynamic Integration: Advanced methods that construct a 'bridge' between prior and posterior, often via a sequence of intermediate distributions.
```python
import numpy as np
from scipy import stats

def importance_sampling_marginal_likelihood(log_likelihood_fn, prior, proposal,
                                             n_samples=10000, seed=42):
    """
    Estimate log marginal likelihood using importance sampling.

    Parameters:
    -----------
    log_likelihood_fn : callable
        Function that takes theta and returns log p(D|theta)
    prior : scipy.stats distribution
        Prior distribution p(theta)
    proposal : scipy.stats distribution
        Proposal distribution q(theta), should approximate posterior
    n_samples : int
        Number of importance samples

    Returns:
    --------
    log_ml : float
        Estimated log marginal likelihood
    log_ml_se : float
        Standard error of the estimate
    effective_sample_size : float
        Effective number of importance samples (Kish's formula)
    """
    np.random.seed(seed)

    # Sample from proposal
    theta_samples = proposal.rvs(n_samples)

    # Compute importance weights in log space
    log_likes = np.array([log_likelihood_fn(theta) for theta in theta_samples])
    log_priors = prior.logpdf(theta_samples)
    log_proposals = proposal.logpdf(theta_samples)

    # Log importance weights: log(p(D|theta) * p(theta) / q(theta))
    log_weights = log_likes + log_priors - log_proposals

    # Log-sum-exp trick for numerical stability
    max_log_w = np.max(log_weights)
    log_ml = max_log_w + np.log(np.mean(np.exp(log_weights - max_log_w)))

    # Effective sample size (Kish) and a rough standard-error estimate
    normalized_weights = np.exp(log_weights - log_ml)
    effective_sample_size = 1 / np.sum(normalized_weights**2 / n_samples**2)
    log_ml_se = np.std(log_weights) / np.sqrt(effective_sample_size)

    return log_ml, log_ml_se, effective_sample_size

# Example: Estimating marginal likelihood for Normal mean estimation
np.random.seed(0)  # reproducible synthetic data
true_mu = 2.0
sigma = 1.0
n = 20
data = np.random.normal(true_mu, sigma, n)

# Log likelihood function
def log_likelihood(mu):
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# Prior: Normal(0, 10)
prior = stats.norm(loc=0, scale=10)

# Proposal: Approximate posterior (Normal centered at MLE, slightly widened)
mle = np.mean(data)
posterior_std = sigma / np.sqrt(n)
proposal = stats.norm(loc=mle, scale=posterior_std * 1.5)  # slightly wider

log_ml, se, ess = importance_sampling_marginal_likelihood(
    log_likelihood, prior, proposal, n_samples=50000)
print(f"Estimated log marginal likelihood: {log_ml:.4f} ± {se:.4f}")
print(f"Effective sample size: {ess:.1f}")
```

A critical but often overlooked aspect of marginal likelihood is its sensitivity to prior specification. Unlike posterior inference, where reasonable priors often lead to similar conclusions when data is abundant, the marginal likelihood can be highly sensitive to prior choices.
Consider two priors for a parameter $\theta$:
- Prior A: a narrow prior concentrated over a small, plausible range of values
- Prior B: a very diffuse prior spread over a huge range
If the true parameter is $\theta = 0.5$ and data strongly supports this:
For posterior inference: Both priors lead to posteriors concentrated near 0.5. The prior gets 'washed out' by data.
For marginal likelihood: Prior B spreads probability mass over a huge range. Since most of that range has low likelihood, the marginal likelihood is lower. Prior A 'predicted' the true parameter region better, so it gets higher evidence.
The implication: When comparing models via marginal likelihood, you're also comparing how well each prior 'predicted' what the data would show. Vague, diffuse priors get penalized.
With improper or very diffuse priors, Bayesian model comparison breaks down. Consider testing H₀: θ = 0 vs. H₁: θ ≠ 0. As the prior under H₁ is made ever more diffuse (say, uniform over a wider and wider interval), the marginal likelihood under H₁ tends to zero, so H₀ is favored regardless of the data—the Jeffreys–Lindley paradox. This shows that priors must be proper and carefully considered for model comparison.
1. Use informative priors when possible: If you have domain knowledge about reasonable parameter ranges, encode it. This isn't 'cheating'—it's using available information.
2. Conduct sensitivity analysis: Compute marginal likelihoods under several reasonable prior specifications. If conclusions are robust, you can be more confident.
3. Use proper priors: Improper priors (e.g., flat priors over infinite ranges) lead to undefined or zero marginal likelihoods. Always use proper probability distributions.
4. Consider objective Bayes approaches: Methods like reference priors or Jeffreys priors attempt to be 'least informative' while remaining proper, but have their own limitations.
5. Match prior scales across models: When comparing models, ensure that priors on shared parameters have compatible scales. A model isn't 'worse' just because its prior was poorly calibrated.
| Prior Variance | Prior Predictive Range | Marginal Likelihood (relative) | Interpretation |
|---|---|---|---|
| σ² = 0.1 | θ ∈ [-0.6, 0.6] (95%) | 1.00 | Prior matches data scale well |
| σ² = 1.0 | θ ∈ [-2, 2] (95%) | 0.42 | Prior somewhat diffuse |
| σ² = 10 | θ ∈ [-6.3, 6.3] (95%) | 0.13 | Prior too vague—penalized |
| σ² = 100 | θ ∈ [-20, 20] (95%) | 0.04 | Prior extremely diffuse—heavy penalty |
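The pattern in this table can be reproduced with a conjugate Normal model, where the evidence is available in closed form as a multivariate Gaussian density. The sketch below uses made-up data whose sample mean is fixed at 0.5 for illustration:

```python
import numpy as np
from scipy import stats

# Evidence as a function of prior variance for a conjugate Normal model:
# y_i ~ N(theta, sigma^2) with sigma known, prior theta ~ N(0, tau^2).
# Marginally, y ~ N(0, sigma^2 * I + tau^2 * 11^T), a density we can evaluate exactly.
rng = np.random.default_rng(1)
n, sigma, true_theta = 25, 1.0, 0.5
noise = rng.normal(size=n)
noise -= noise.mean()               # center so the sample mean is exactly true_theta
y = true_theta + sigma * noise

ones = np.ones((n, n))
for tau_sq in [0.1, 1.0, 10.0, 100.0]:
    cov = sigma**2 * np.eye(n) + tau_sq * ones
    log_ml = stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)
    print(f"prior variance tau^2 = {tau_sq:6.1f}: log evidence = {log_ml:.3f}")
# The fit to the data barely changes, but the evidence drops steadily as the
# prior becomes more diffuse -- the pattern illustrated in the table above.
```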
Having understood the theory, let's address practical challenges in using marginal likelihood for model selection.
Marginal likelihoods span enormous ranges. For typical datasets, log marginal likelihoods can be -1000 or lower. Always work in log space:
✗ BAD: ml_ratio = ml_model1 / ml_model2 # Underflow/overflow
✓ GOOD: log_ml_diff = log_ml_model1 - log_ml_model2
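For example, posterior model probabilities can be computed from log evidences entirely in log space with the log-sum-exp trick (the numbers below are hypothetical):

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical log marginal likelihoods for three candidate models.
# exp(-1042.7) underflows to 0.0 in double precision, so direct ratios fail.
log_ml = np.array([-1042.7, -1038.2, -1045.9])

# Posterior model probabilities under equal prior model weights,
# computed entirely in log space to avoid underflow
log_post = log_ml - logsumexp(log_ml)
post_probs = np.exp(log_post)

for i, (lm, p) in enumerate(zip(log_ml, post_probs), start=1):
    print(f"Model {i}: log evidence = {lm:8.1f}, posterior probability = {p:.4f}")
```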
When comparing nested models (e.g., linear vs. polynomial regression), marginal likelihood automatically handles the 'nesting' through the Occam factor. You don't need to account for it separately—it emerges from the integral.
For comparing non-nested models, marginal likelihood excels. Unlike likelihood ratio tests (which require nested models), marginal likelihood compares any two models on equal footing. A Gaussian mixture vs. a kernel density estimator? Both get evaluated on how well they explain the data, weighted by their prior commitments.
If your goal is pure prediction accuracy on new data, cross-validation may be more appropriate. Marginal likelihood optimizes for 'posterior model probability'—the probability that this model generated the data—which isn't always aligned with predictive performance, especially when all models are misspecified.
Let's consolidate what we've learned about marginal likelihood:
1. It is the likelihood averaged over all parameter values, weighted by the prior—not the likelihood at any single best-fit point.
2. It automatically balances data fit against model complexity through the Occam factor.
3. It can be computed in closed form for conjugate models; otherwise it must be approximated (Laplace approximation, BIC, Monte Carlo methods).
4. It is sensitive to the prior in a way that posterior inference is not, so priors must be proper and chosen thoughtfully for model comparison.
Looking ahead: Marginal likelihood is rarely used in isolation. In the next page, we'll learn how to combine marginal likelihoods from different models to compute Bayes factors—the ratio of evidences that quantifies how much more one model is supported by the data compared to another.
Bayes factors provide a principled, calibrated scale for interpreting the strength of evidence, moving beyond mere p-values to a fully probabilistic comparison framework.
You now understand marginal likelihood—the foundation of Bayesian model comparison. You know what it measures, why it penalizes complexity, how to compute or approximate it, and its limitations. Next, we'll see how to interpret marginal likelihood ratios as Bayes factors for principled model selection.