In the previous page, we developed marginal likelihood—the probability of observing our data under a model, integrated over all possible parameter values. While marginal likelihood quantifies the absolute evidence for a single model, scientific inference rarely asks 'How probable is this model?' in isolation. Instead, we ask: 'Which model better explains the data?' or 'How much more evidence supports Model A over Model B?'
The Bayes factor answers precisely this question. It is the ratio of marginal likelihoods between two competing models, providing a principled, calibrated measure of relative evidential support. Unlike p-values (which measure how surprising data would be under a null hypothesis), Bayes factors directly quantify how much one model is favored over another given the observed data.
By the end of this page, you will understand: (1) The formal definition of Bayes factors and their relationship to posterior model probabilities, (2) Standard interpretation scales for Bayes factors (Jeffreys, Kass-Raftery), (3) How Bayes factors relate to hypothesis testing, (4) Computational strategies for Bayes factor estimation, (5) Common pitfalls and limitations in applying Bayes factors.
Given two models $\mathcal{M}_1$ and $\mathcal{M}_2$, the Bayes factor in favor of $\mathcal{M}_1$ over $\mathcal{M}_2$ is defined as the ratio of their marginal likelihoods:
$$\text{BF}_{12} = \frac{p(\mathcal{D} | \mathcal{M}_1)}{p(\mathcal{D} | \mathcal{M}_2)}$$
Expanding the marginal likelihoods:
$$\text{BF}_{12} = \frac{\int p(\mathcal{D} | \boldsymbol{\theta}_1, \mathcal{M}_1) p(\boldsymbol{\theta}_1 | \mathcal{M}_1) d\boldsymbol{\theta}_1}{\int p(\mathcal{D} | \boldsymbol{\theta}_2, \mathcal{M}_2) p(\boldsymbol{\theta}_2 | \mathcal{M}_2) d\boldsymbol{\theta}_2}$$
Interpretation: $\text{BF}_{12} = 10$ means the data is 10 times more probable under Model 1 than under Model 2; the data shift the odds in favor of Model 1 by a factor of 10.
The subscript order matters: BF₁₂ is evidence for Model 1 over Model 2. Values > 1 favor Model 1; values < 1 favor Model 2. Some authors use BF₁₀ for 'alternative over null' and BF₀₁ for 'null over alternative.' Always check the convention in any paper you read.
Bayes factors are intimately connected to posterior probabilities over models via Bayes' theorem at the model level:
$$\frac{p(\mathcal{M}_1 | \mathcal{D})}{p(\mathcal{M}_2 | \mathcal{D})} = \frac{p(\mathcal{D} | \mathcal{M}_1)}{p(\mathcal{D} | \mathcal{M}_2)} \cdot \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}$$
$$\underbrace{\text{Posterior Odds}}_{\text{after seeing data}} = \underbrace{\text{Bayes Factor}}_{\text{evidence from data}} \times \underbrace{\text{Prior Odds}}_{\text{before seeing data}}$$
The Bayes factor is the multiplicative update from prior odds to posterior odds. It represents purely the evidence contributed by the data, independent of prior model preferences.
If we start with equal prior probabilities ($p(\mathcal{M}_1) = p(\mathcal{M}_2) = 0.5$):
$$\text{Posterior Odds} = \text{BF}_{12}$$
This allows conversion to posterior probabilities:
$$p(\mathcal{M}_1 | \mathcal{D}) = \frac{\text{BF}_{12}}{1 + \text{BF}_{12}}$$
| Bayes Factor (BF₁₂) | Posterior Odds | P(M₁|D) | P(M₂|D) |
|---|---|---|---|
| 1 | 1 : 1 | 50% | 50% |
| 3 | 3 : 1 | 75% | 25% |
| 10 | 10 : 1 | 91% | 9% |
| 30 | 30 : 1 | 97% | 3% |
| 100 | 100 : 1 | 99% | 1% |
| 1000 | 1000 : 1 | 99.9% | 0.1% |
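As a quick sketch in Python (the helper name `posterior_model_prob` is just illustrative), the conversion from Bayes factor to posterior model probability is a one-liner; the loop reproduces the table above under equal prior odds and also shows what a sceptical 10-to-1 prior against Model 1 would do.

```python
def posterior_model_prob(bf12, prior_odds=1.0):
    """Convert a Bayes factor into P(M1 | D): posterior odds = BF12 * prior odds."""
    posterior_odds = bf12 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# Reproduce the table above (equal prior odds), then add a sceptical prior for contrast.
for bf in [1, 3, 10, 30, 100, 1000]:
    p_equal = posterior_model_prob(bf)                    # P(M1) = P(M2) = 0.5
    p_sceptic = posterior_model_prob(bf, prior_odds=0.1)  # prior 10-to-1 against M1
    print(f"BF12 = {bf:>5}:  P(M1|D) = {p_equal:.3f} (equal odds), {p_sceptic:.3f} (sceptical prior)")
```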
A Bayes factor of 5 means the data is 5 times more likely under one model than under the other. But is that 'strong' evidence? Researchers have proposed calibration scales to interpret Bayes factors consistently.
Harold Jeffreys, one of the founders of Bayesian statistics, proposed the first widely-used interpretation scale. He worked with $\log_{10}(\text{BF})$ for convenience:
| log₁₀(BF₁₂) | BF₁₂ | Interpretation |
|---|---|---|
| 0 to 0.5 | 1 to 3.2 | Not worth more than a bare mention |
| 0.5 to 1 | 3.2 to 10 | Substantial evidence |
| 1 to 1.5 | 10 to 32 | Strong evidence |
| 1.5 to 2 | 32 to 100 | Very strong evidence |
| > 2 | > 100 | Decisive evidence |
Robert Kass and Adrian Raftery proposed a modified scale using $2 \ln(\text{BF})$, which has the same scale as likelihood ratio test statistics and the deviance:
| 2 ln(BF₁₂) | BF₁₂ | Interpretation |
|---|---|---|
| 0 to 2 | 1 to 3 | Not worth more than a bare mention |
| 2 to 6 | 3 to 20 | Positive evidence |
| 6 to 10 | 20 to 150 | Strong evidence |
| > 10 | > 150 | Very strong evidence |
These scales are conventions, not theorems. A Bayes factor of 15 doesn't magically become 'strong' at an arbitrary threshold. Consider the scientific context: in a new field with weak prior information, BF = 10 might warrant caution; in a mature field with robust theory, BF = 10 might be decisive. Context matters.
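With that caveat in mind, a small helper (purely illustrative, not from any package) can report a Bayes factor on both numeric scales, log₁₀ and 2 ln, together with a Jeffreys-style verbal label; the thresholds below simply echo the two tables above.

```python
import numpy as np

def interpret_bf(bf12):
    """Report a Bayes factor on the Jeffreys (log10) and Kass-Raftery (2 ln) scales."""
    log10_bf = np.log10(bf12)
    two_ln_bf = 2 * np.log(bf12)

    # Jeffreys-style label, applied to whichever model the evidence favors
    magnitude = abs(log10_bf)
    if magnitude < 0.5:
        label = "Not worth more than a bare mention"
    elif magnitude < 1:
        label = "Substantial evidence"
    elif magnitude < 1.5:
        label = "Strong evidence"
    elif magnitude < 2:
        label = "Very strong evidence"
    else:
        label = "Decisive evidence"

    favored = "Model 1" if bf12 > 1 else "Model 2"
    return log10_bf, two_ln_bf, f"{label} for {favored}"

for bf in [1.5, 5, 15, 50, 200]:
    l10, tln, label = interpret_bf(bf)
    print(f"BF12 = {bf:>6.1f}  log10 = {l10:5.2f}  2ln = {tln:6.2f}  {label}")
```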
Bayes factors are symmetric under inversion:
$$\text{BF}_{21} = \frac{1}{\text{BF}_{12}}$$
This means the same scale applies in both directions: a Bayes factor of 1/10 carries the same evidential weight for Model 2 that a Bayes factor of 10 carries for Model 1.
This symmetry is philosophically important. Unlike p-values, which only measure evidence against a null hypothesis, Bayes factors measure evidence in both directions. A small Bayes factor IS evidence—evidence for the alternative model.
Understanding how Bayes factors differ from frequentist hypothesis testing illuminates their strengths and appropriate use cases.
A p-value answers: 'If the null hypothesis were true, how often would I see data at least as extreme as what I observed?'
A Bayes factor answers: 'How many times more probable is my observed data under Model 1 compared to Model 2?'
A famous example illustrating the difference: Consider testing whether a coin is fair ($\theta = 0.5$) vs. biased ($\theta \neq 0.5$).
With n = 100,000 flips and 50,500 heads, a frequentist test rejects fairness decisively (two-sided p ≈ 0.002), yet the Bayes factor, computed with a uniform prior on $\theta$ under the biased model, comes out slightly in favor of the fair coin. How can this be?
The frequentist test asks: 'Is 50,500/100,000 significantly different from 0.5?' With huge sample size, even tiny differences are 'significant.'
The Bayesian test compares two models: Model 1 asserts $\theta = 0.5$ exactly, while Model 2 spreads its prior belief across all values of $\theta$ in $(0, 1)$.
The observed proportion (0.505) is very close to 0.5. Model 1, having made a precise prediction that was nearly correct, gets rewarded. Model 2, having hedged its bets across all possibilities, gets penalized for not committing.
The philosophical lesson: Testing for 'any difference' is different from testing for 'meaningful difference.' Bayes factors respect this distinction; p-values do not.
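A minimal calculation makes the paradox concrete. It assumes a uniform Beta(1, 1) prior on $\theta$ under the biased model, so the exact numbers depend on that prior choice; conveniently, the binomial coefficient cancels in the ratio of marginal likelihoods.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln

n, k = 100_000, 50_500

# Frequentist: exact two-sided binomial test of theta = 0.5
p_value = stats.binomtest(k, n, 0.5).pvalue

# Bayesian: BF01 = p(D | theta = 0.5) / p(D | theta ~ Uniform(0, 1))
# The binomial coefficient cancels, leaving
#   log BF01 = n * log(0.5) - log Beta(k + 1, n - k + 1)
log_bf01 = n * np.log(0.5) - betaln(k + 1, n - k + 1)

print(f"p-value: {p_value:.4f}")           # ≈ 0.0016: 'highly significant'
print(f"BF01:    {np.exp(log_bf01):.2f}")  # ≈ 1.7: the data slightly favor the fair coin
```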
P-values are useful for quality control (detecting any departure from a standard) and as a first-pass filter. Bayes factors are better for scientific inference (weighing competing theories) and formal model comparison. Neither replaces thinking about the science—both are tools, not oracles.
Since Bayes factors are ratios of marginal likelihoods, all methods for computing marginal likelihoods apply here. But there are also specialized methods for Bayes factor computation.
If both marginal likelihoods can be computed (analytically or by approximation), the Bayes factor is simply their ratio:
$$\log \text{BF}_{12} = \log p(\mathcal{D} | \mathcal{M}_1) - \log p(\mathcal{D} | \mathcal{M}_2)$$
This straightforward approach works well when both models have tractable marginal likelihoods.
```python
import numpy as np
from scipy import stats
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood_bayesian_regression(X, y, prior_var, noise_var):
    """
    Compute log marginal likelihood for Bayesian linear regression.

    Assumes:
        y = X @ beta + epsilon
        beta ~ N(0, prior_var * I)
        epsilon ~ N(0, noise_var * I)
    """
    n = len(y)
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    L, lower = cho_factor(K, lower=True)
    log_det = 2 * np.sum(np.log(np.diag(L)))
    alpha = cho_solve((L, lower), y)
    log_ml = -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))
    return log_ml

def compute_bayes_factor_polynomial_comparison(x, y, degree1, degree2,
                                                prior_var=1.0, noise_var=0.1):
    """
    Compare two polynomial regression models via Bayes factor.

    Returns:
    --------
    log_bf : float
        Log Bayes factor in favor of degree1 over degree2
    interpretation : str
        Verbal interpretation of evidence strength
    """
    # Build design matrices
    X1 = np.column_stack([x**i for i in range(degree1 + 1)])
    X2 = np.column_stack([x**i for i in range(degree2 + 1)])

    # Compute log marginal likelihoods
    log_ml1 = log_marginal_likelihood_bayesian_regression(X1, y, prior_var, noise_var)
    log_ml2 = log_marginal_likelihood_bayesian_regression(X2, y, prior_var, noise_var)

    log_bf = log_ml1 - log_ml2
    bf = np.exp(log_bf)

    # Interpret according to Kass-Raftery scale
    two_ln_bf = 2 * log_bf
    if abs(two_ln_bf) < 2:
        strength = "Not worth more than a bare mention"
    elif abs(two_ln_bf) < 6:
        strength = "Positive evidence"
    elif abs(two_ln_bf) < 10:
        strength = "Strong evidence"
    else:
        strength = "Very strong evidence"

    favored = f"degree {degree1}" if log_bf > 0 else f"degree {degree2}"
    interpretation = f"{strength} for {favored}"

    return log_bf, bf, interpretation

# Example: Compare polynomial models on synthetic data
np.random.seed(42)
n = 100
x = np.linspace(-2, 2, n)

# True function: quadratic
y_true = 1 + 0.5*x + 0.8*x**2
y = y_true + 0.5 * np.random.randn(n)

print("Comparing polynomial regression models:\n")
print(f"{'Comparison':<25} {'log(BF)':<10} {'BF':<12} {'Interpretation'}")
print("-" * 80)

for d1, d2 in [(1, 2), (2, 3), (2, 5), (3, 10)]:
    log_bf, bf, interp = compute_bayes_factor_polynomial_comparison(
        x, y, d1, d2, prior_var=10.0, noise_var=0.25
    )
    print(f"Degree {d1} vs Degree {d2:<15} {log_bf:>8.2f} {bf:>10.2f} {interp}")
```

For nested models where one model is a special case of the other (e.g., testing $\theta = \theta_0$ vs. $\theta \neq \theta_0$), the Savage-Dickey density ratio provides an elegant shortcut:
$$\text{BF}_{01} = \frac{p(\theta = \theta_0 | \mathcal{D}, \mathcal{M}_1)}{p(\theta = \theta_0 | \mathcal{M}_1)}$$
Interpretation: The Bayes factor equals the posterior density at the null value divided by the prior density at the null value.
Advantage: You only need to fit the more complex model and evaluate densities at a point—no need to integrate.
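Here is a minimal sketch that revisits the coin example above, again assuming a uniform Beta(1, 1) prior under the unrestricted model. Because the prior is conjugate, the posterior is available in closed form, and the Savage-Dickey ratio reproduces the directly computed Bayes factor.

```python
import numpy as np
from scipy import stats

# Savage-Dickey for the coin example: H0: theta = 0.5 nested in
# M1: theta ~ Beta(1, 1) (uniform prior), binomial likelihood.
n, k = 100_000, 50_500
theta0 = 0.5

prior = stats.beta(1, 1)                  # prior under the full model
posterior = stats.beta(1 + k, 1 + n - k)  # conjugate posterior under the full model

bf01 = posterior.pdf(theta0) / prior.pdf(theta0)
print(f"Savage-Dickey BF01 ≈ {bf01:.2f}")  # matches the direct marginal-likelihood ratio (≈ 1.7)
```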
Bridge sampling is a powerful method for estimating Bayes factors, especially when models share parameters. The idea is to construct an 'optimal bridge' between two distributions.
Given samples from two posteriors, bridge sampling estimates:
$$\text{BF}_{12} \approx \frac{\frac{1}{n_2}\sum_{j} q_1(\boldsymbol{\theta}^{(2)}_j)\, h(\boldsymbol{\theta}^{(2)}_j)}{\frac{1}{n_1}\sum_{i} q_2(\boldsymbol{\theta}^{(1)}_i)\, h(\boldsymbol{\theta}^{(1)}_i)}$$
where $q_k(\boldsymbol{\theta}) = p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M}_k)\, p(\boldsymbol{\theta} | \mathcal{M}_k)$ is the unnormalized posterior of model $k$, the $\boldsymbol{\theta}^{(k)}_1, \dots, \boldsymbol{\theta}^{(k)}_{n_k}$ are draws from model $k$'s posterior, and $h(\cdot)$ is an optimally chosen 'bridge function.'
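The sketch below illustrates the identity on a toy conjugate model where the exact answer is known. It uses the simple geometric bridge $h = 1/\sqrt{q_1 q_2}$ rather than the iteratively optimized bridge used by production implementations, and the single data point, priors, and sample sizes are all invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d = 1.5  # single observation

def log_q(theta, prior_sd):
    """Unnormalized log posterior: likelihood N(d; theta, 1) times prior N(theta; 0, prior_sd^2)."""
    return stats.norm.logpdf(d, theta, 1) + stats.norm.logpdf(theta, 0, prior_sd)

def posterior_samples(prior_sd, size):
    """Exact Gaussian posterior for this conjugate toy model."""
    var = prior_sd**2 / (prior_sd**2 + 1)
    return rng.normal(d * var, np.sqrt(var), size)

sd1, sd2 = 1.0, 10.0
th1 = posterior_samples(sd1, 50_000)  # draws from the posterior under M1
th2 = posterior_samples(sd2, 50_000)  # draws from the posterior under M2

# Geometric bridge h = 1/sqrt(q1*q2):
#   BF12 = E_{post2}[sqrt(q1/q2)] / E_{post1}[sqrt(q2/q1)]
num = np.mean(np.exp(0.5 * (log_q(th2, sd1) - log_q(th2, sd2))))
den = np.mean(np.exp(0.5 * (log_q(th1, sd2) - log_q(th1, sd1))))
bf12_bridge = num / den

# Exact answer for comparison: each marginal likelihood is N(d; 0, sqrt(1 + prior_var))
bf12_exact = stats.norm.pdf(d, 0, np.sqrt(1 + sd1**2)) / stats.norm.pdf(d, 0, np.sqrt(1 + sd2**2))

print(f"Bridge estimate: {bf12_bridge:.3f}")
print(f"Exact BF12:      {bf12_exact:.3f}")
```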
Key properties: bridge sampling works directly from posterior draws that MCMC already provides, it is generally far more stable than naive importance-sampling or harmonic-mean estimators, and a well-tested implementation is available in the `bridgesampling` package in R.

For most applications, start with BIC as a rough approximation. If BIC suggests models are close (within 2-6 units), invest in more accurate methods like Laplace approximation or bridge sampling. Reserve computationally expensive methods like thermodynamic integration for high-stakes decisions.
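As a sketch of the BIC shortcut, recall that $\log \text{BF}_{12} \approx -(\text{BIC}_1 - \text{BIC}_2)/2$; the BIC values below are hypothetical stand-ins for two fitted models.

```python
import numpy as np

def approx_log_bf_from_bic(bic1, bic2):
    """Rough Bayes factor approximation: log BF12 ≈ -(BIC1 - BIC2) / 2."""
    return -(bic1 - bic2) / 2.0

# Hypothetical BIC values for two fitted models
log_bf12 = approx_log_bf_from_bic(bic1=1012.4, bic2=1019.8)
print(f"Approximate BF12 ≈ {np.exp(log_bf12):.1f}")  # exp(3.7) ≈ 40: strong evidence for model 1
```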
A particularly important application of Bayes factors is testing point null hypotheses: $H_0: \theta = \theta_0$ vs. $H_1: \theta \neq \theta_0$.
In frequentist testing, the null hypothesis $\theta = 0$ has probability zero in a continuous parameter space, yet we can still compute p-values (probability of data more extreme than observed, given null).
In Bayesian testing, we must assign actual probability to the null hypothesis. This is done using a mixture prior:
$$p(\theta) = \pi_0 \cdot \delta(\theta - \theta_0) + (1 - \pi_0) \cdot p_1(\theta)$$
where $\pi_0$ is the prior probability assigned to the null hypothesis, $\delta(\theta - \theta_0)$ is a point mass (Dirac delta) at $\theta_0$, and $p_1(\theta)$ is the prior distribution for $\theta$ under the alternative.
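A small sketch (the helper name is illustrative) shows how the pieces combine: the Bayes factor itself does not depend on $\pi_0$, but the posterior probability of the null does.

```python
def posterior_null_probability(bf01, pi0=0.5):
    """P(H0 | D) from the Bayes factor BF01 and the prior mass pi0 placed on the point null."""
    posterior_odds = bf01 * pi0 / (1 - pi0)  # posterior odds of H0 vs H1
    return posterior_odds / (1 + posterior_odds)

# Same Bayes factor (e.g., the coin example's BF01 ≈ 1.7), different prior mass on the null
for pi0 in [0.2, 0.5, 0.8]:
    print(f"pi0 = {pi0}:  P(H0 | D) = {posterior_null_probability(1.7, pi0):.3f}")
```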
For testing whether a mean differs from zero, Rouder et al. (2009) developed the JZS (Jeffreys-Zellner-Siow) prior, now widely used for Bayesian t-tests:
$$\text{Effect size } \delta \sim \text{Cauchy}(0, r)$$
where $r$ (often 0.707) controls the expected effect size.
Why Cauchy? Its heavy tails avoid ruling out large effects a priori, and Jeffreys argued that this choice keeps the test well behaved: as the evidence in the data grows without bound, so does the Bayes factor against the null.
The resulting Bayes factor compares $H_0: \delta = 0$ against $H_1: \delta \sim \text{Cauchy}(0, r)$.
```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bayesian_t_test(x, y=None, mu=0, r=0.707):
    """
    Compute Bayes factor for one-sample or two-sample t-test.
    Uses JZS prior (Cauchy on effect size).

    Parameters:
    -----------
    x : array-like
        First sample
    y : array-like or None
        Second sample (None for one-sample test)
    mu : float
        Null hypothesis value (for one-sample test)
    r : float
        Scale of Cauchy prior on effect size

    Returns:
    --------
    bf10 : float
        Bayes factor in favor of H1 (effect exists) over H0
    """
    x = np.asarray(x)

    if y is None:
        # One-sample test
        n = len(x)
        t_stat = (np.mean(x) - mu) / (np.std(x, ddof=1) / np.sqrt(n))
        df = n - 1
    else:
        # Two-sample test
        y = np.asarray(y)
        n1, n2 = len(x), len(y)
        n = n1 + n2
        # Pooled standard deviation
        var1, var2 = np.var(x, ddof=1), np.var(y, ddof=1)
        pooled_var = ((n1-1)*var1 + (n2-1)*var2) / (n1 + n2 - 2)
        se = np.sqrt(pooled_var * (1/n1 + 1/n2))
        t_stat = (np.mean(x) - np.mean(y)) / se
        df = n1 + n2 - 2

    # Compute Bayes factor using numerical integration
    # BF10 = integral over delta of: p(t|delta) * p(delta) / p(t|H0)
    def integrand(delta):
        # Non-central t distribution for effect size delta
        ncp = delta * np.sqrt(n) if y is None else delta * np.sqrt(n1*n2/(n1+n2))
        likelihood = stats.nct.pdf(t_stat, df, ncp)
        prior = stats.cauchy.pdf(delta, 0, r)
        return likelihood * prior

    # Likelihood under null
    null_likelihood = stats.t.pdf(t_stat, df)

    # Integrate over effect sizes
    integral, _ = quad(integrand, -np.inf, np.inf)

    bf10 = integral / null_likelihood
    return bf10

# Example: Testing treatment effect
np.random.seed(123)

# Control group
control = np.random.normal(100, 15, 30)

# Treatment groups with different effect sizes
print("Bayesian t-tests with JZS prior (r=0.707):\n")

for effect, name in [(0, "No effect"), (5, "Small effect"),
                     (10, "Medium effect"), (20, "Large effect")]:
    treatment = np.random.normal(100 + effect, 15, 30)
    bf10 = bayesian_t_test(treatment, control)

    # Interpret
    if bf10 > 100:
        interp = "Extreme evidence for H1"
    elif bf10 > 30:
        interp = "Very strong evidence for H1"
    elif bf10 > 10:
        interp = "Strong evidence for H1"
    elif bf10 > 3:
        interp = "Moderate evidence for H1"
    elif bf10 > 1:
        interp = "Anecdotal evidence for H1"
    elif bf10 > 1/3:
        interp = "Anecdotal evidence for H0"
    elif bf10 > 1/10:
        interp = "Moderate evidence for H0"
    else:
        interp = "Strong evidence for H0"

    print(f"{name:20s}: BF₁₀ = {bf10:>8.2f} ({interp})")
```

The choice of prior under H₁ (e.g., the scale r in JZS) affects the Bayes factor. A narrow prior expects small effects; a wide prior hedges across all effect sizes. There's no universally 'correct' choice—use domain knowledge or conduct sensitivity analysis.
Despite their advantages, Bayes factors have important limitations. Understanding these is essential for appropriate application.
We emphasized this for marginal likelihoods, and it's doubly important for Bayes factors. Since BF = ML₁/ML₂, prior misspecification in either model affects the comparison.
Example: Two researchers test the same hypothesis with the same data. Researcher A places a tightly concentrated prior on the effect size under the alternative and obtains a Bayes factor favoring the alternative; Researcher B uses a very diffuse prior, which dilutes the alternative's marginal likelihood, and obtains a Bayes factor favoring the null.
Who's right? Both—given their priors. The solution is transparency about prior choices and sensitivity analysis.
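A minimal sensitivity check might look like the sketch below; it reuses the `bayesian_t_test` function from the earlier code block and synthetic data invented here, recomputing BF₁₀ across a range of Cauchy scales $r$.

```python
import numpy as np

# Prior sensitivity analysis: same data, different Cauchy scales r for the
# JZS-style Bayes factor (assumes bayesian_t_test from the code block above).
np.random.seed(7)
control = np.random.normal(100, 15, 30)
treatment = np.random.normal(106, 15, 30)

for r in [0.2, 0.5, 0.707, 1.0, 1.5]:
    bf10 = bayesian_t_test(treatment, control, r=r)
    print(f"r = {r:>5.3f}  ->  BF10 = {bf10:6.2f}")
```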
A model can be handed an arbitrarily high Bayes factor simply by pitting it against an 'empty' alternative that makes no real predictions. Consider a model that says 'anything is possible,' with a uniform prior over vast ranges—its marginal likelihood is essentially the prior density, which can be made arbitrarily small, so the Bayes factor against it can be made arbitrarily large.
Lesson: Bayes factors compare specific, well-defined models. Comparing a precise model to a vague 'catch-all' alternative isn't informative.
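The sketch below makes this concrete on a toy setup (all numbers are invented for illustration): a fixed $\mathcal{N}(0, 1)$ model is compared with an alternative whose uniform prior on the mean is made progressively wider, and the Bayes factor in favor of the precise model grows roughly in proportion to the prior width.

```python
import numpy as np
from scipy import stats

# M0: x ~ N(0, 1) with no free parameters.
# MA: x ~ N(theta, 1) with theta ~ Uniform(-A, A), an ever-vaguer alternative.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=50)  # data generated under the precise model M0

log_ml0 = np.sum(stats.norm.logpdf(x, 0, 1))

def log_ml_vague(A):
    """Exact log marginal likelihood of MA for Gaussian data with known unit variance."""
    n, xbar = len(x), np.mean(x)
    # prod_i N(x_i; theta, 1) = exp(log_C) * exp(-n * (theta - xbar)^2 / 2)
    log_C = np.sum(stats.norm.logpdf(x, xbar, 1))
    # integral of exp(-n * (theta - xbar)^2 / 2) over [-A, A]
    log_integral = 0.5 * np.log(2 * np.pi / n) + np.log(
        stats.norm.cdf((A - xbar) * np.sqrt(n)) - stats.norm.cdf((-A - xbar) * np.sqrt(n))
    )
    return log_C + log_integral - np.log(2 * A)

for A in [1, 10, 100, 1000]:
    log_bf = log_ml0 - log_ml_vague(A)
    print(f"A = {A:>4}:  BF(M0 over MA) ≈ {np.exp(log_bf):10.1f}")
```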
When Bayes factors are problematic, consider alternatives such as cross-validation, information criteria, and posterior predictive checks.
Use Bayes factors as one tool among many. If Bayes factors, cross-validation, and posterior predictive checks all point to the same model, you can be confident. If they disagree, investigate why—the disagreement often reveals important aspects of your models or data.
Let's synthesize everything into a practical workflow for using Bayes factors in research.
When publishing results using Bayes factors:
Essential to report: the Bayes factor itself (with its direction and subscript convention), the priors placed on parameters within each model, and the prior model odds assumed.
Recommended to report: a sensitivity analysis over reasonable prior choices, the range of Bayes factors it produces, and how the marginal likelihoods were computed.
Example reporting: 'The data provided strong evidence for the interaction model over the main-effects model, BF₁₀ = 27.3 (assuming equal prior odds). Under the interaction model, we used a Normal(0, 10²) prior on coefficients. Sensitivity analysis with prior scales [5, 10, 20] yielded BF₁₀ ∈ [18.1, 35.6], consistently supporting the interaction model.'
Let's consolidate what we've learned about Bayes factors: they are ratios of marginal likelihoods that convert prior odds into posterior odds; conventional scales (Jeffreys, Kass-Raftery) help calibrate their interpretation; they can be computed directly, via the Savage-Dickey ratio for nested models, or by methods such as bridge sampling; and they are sensitive to the priors placed on parameters within each model, so sensitivity analysis and transparent reporting are essential.
Looking ahead: Bayes factors quantify relative evidence between two models, but the story doesn't end there. In the next page, we'll explore model evidence—understanding marginal likelihood as a measure of model quality that embodies a principled balance between fit and complexity, leading to the Bayesian interpretation of Occam's razor.
You now understand Bayes factors—the ratio of marginal likelihoods that quantifies relative model support. You know how to interpret them, compute them, and recognize their limitations. Next, we'll deepen our understanding of model evidence and the automatic Occam's razor embedded in Bayesian model comparison.