Imagine you're a medical researcher investigating a new diagnostic test for a rare disease. Before collecting any data, you already possess valuable knowledge: the disease affects approximately 1 in 10,000 people in the general population. This knowledge—prior to observing any test results—fundamentally shapes how you should interpret any evidence you subsequently gather.
This is the essence of prior beliefs in Bayesian inference: the formal mathematical encoding of what we know (or believe) about a phenomenon before observing new data. Unlike frequentist statistics, which treats parameters as fixed but unknown constants, Bayesian inference treats parameters as random variables with probability distributions that capture our uncertainty about their true values.
By the end of this page, you will understand the philosophical and mathematical foundations of prior distributions, be able to categorize priors by their informativeness, select appropriate priors for different problem contexts, and appreciate how prior choice influences posterior inference. This knowledge forms the bedrock upon which all subsequent Bayesian methods are built.
Before diving into mathematical formalism, we must understand why priors exist and what they represent. The concept of prior beliefs emerges from a fundamental philosophical position about the nature of probability itself.
Two Interpretations of Probability:
In the frequentist interpretation, probability represents the long-run frequency of events in repeated experiments. Under this view, parameters are fixed constants—there's nothing probabilistic about them. The only randomness comes from sampling variability.
In the Bayesian (subjectivist) interpretation, probability represents a degree of belief about uncertain propositions. Under this view, probability quantifies our state of knowledge, and parameters are random variables whose distributions encode our uncertainty.
This distinction is profound: Bayesians can assign probabilities to any uncertain proposition, including:

- "The probability that this coin's bias exceeds 0.5 is 0.7"
- "There is a 95% probability that the true treatment effect lies between 0.1 and 0.4"
- "The probability that this hypothesis is true is 0.9"
These statements are meaningless in frequentist probability because they concern fixed quantities, not random events. But in Bayesian probability, they represent coherent expressions of belief.
The Bayesian interpretation is often called 'subjectivist' because prior beliefs may differ between individuals. Two rational scientists with different background knowledge may specify different priors—and this is not a flaw but a feature. Bayesian inference provides a principled mechanism for updating these beliefs as evidence accumulates, with posteriors converging as data overwhelms prior differences.
Historical Context:
The Reverend Thomas Bayes (1701–1761) first formulated the theorem bearing his name in a posthumously published essay. However, the modern conception of subjective probability was developed extensively by Bruno de Finetti, Leonard Savage, and others in the 20th century.
De Finetti's famous theorem demonstrates that exchangeability (the assumption that the order of observations doesn't matter) implies the existence of a prior distribution over parameters. This provides a rigorous foundation for Bayesian inference that doesn't require accepting subjective probability as a philosophical primitive—it emerges naturally from more basic modeling assumptions.
The Coherence Argument:
Another powerful justification for Bayesian probability comes from Dutch book arguments. If your beliefs don't satisfy the axioms of probability (including Bayes' theorem for updating), then there exist sequences of bets that guarantee you lose money regardless of outcomes. Only probability-coherent beliefs avoid such exploitability.
This means Bayesian reasoning isn't merely one approach among many—it's the unique coherent approach to reasoning under uncertainty.
Having established the philosophical foundations, let's formalize prior beliefs mathematically. A prior distribution is a probability distribution over the parameter space that encodes our beliefs before observing data.
Notation and Basic Setup:
Let θ (theta) denote the unknown parameter(s) of our model. The prior distribution is denoted:
$$p(\theta)$$
This is a probability density function (for continuous parameters) or probability mass function (for discrete parameters) that satisfies the standard requirements:

- Non-negativity: $p(\theta) \geq 0$ for all $\theta$ in the parameter space
- Normalization: $\int p(\theta)\, d\theta = 1$ (or $\sum_{\theta} p(\theta) = 1$ in the discrete case)
The choice of $p(\theta)$ reflects our prior knowledge. More probability mass in a region means we believe values in that region are more plausible a priori.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Example: Prior distributions for a coin bias parameter θ ∈ [0, 1]
theta = np.linspace(0, 1, 1000)

# Uniform prior: Complete ignorance
uniform_prior = stats.uniform(0, 1).pdf(theta)

# Informative prior: Belief that coin is roughly fair
# Beta(10, 10) centered at 0.5 with moderate concentration
informative_prior = stats.beta(10, 10).pdf(theta)

# Expert prior: Strong belief coin is biased toward heads
# Beta(30, 10) centered at 0.75
expert_prior = stats.beta(30, 10).pdf(theta)

# Weakly informative prior: Slight preference for fairness
# Beta(2, 2) - gentle peak at 0.5
weakly_informative = stats.beta(2, 2).pdf(theta)

# Visualize different priors
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
priors = [
    (uniform_prior, "Uniform (Non-informative)", "Beta(1,1)"),
    (informative_prior, "Informative", "Beta(10,10)"),
    (expert_prior, "Expert Knowledge", "Beta(30,10)"),
    (weakly_informative, "Weakly Informative", "Beta(2,2)")
]

for ax, (prior, title, params) in zip(axes.flatten(), priors):
    ax.fill_between(theta, prior, alpha=0.3)
    ax.plot(theta, prior, linewidth=2)
    ax.set_xlabel("θ (coin bias)")
    ax.set_ylabel("p(θ)")
    ax.set_title(f"{title}\n{params}")
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("prior_distributions.png", dpi=150)
```

The Parameter Space:
The form of the prior depends critically on the parameter space—the set of all possible values the parameter can take:
| Parameter Type | Space | Common Prior Families |
|---|---|---|
| Probability | [0, 1] | Beta |
| Positive real | (0, ∞) | Gamma, Inverse-Gamma, Log-Normal |
| Real line | (-∞, ∞) | Normal, Cauchy, Student-t |
| Non-negative integer | {0, 1, 2, ...} | Poisson, Negative Binomial |
| Simplex (probabilities summing to 1) | Δ<sub>K-1</sub> | Dirichlet |
| Covariance matrix | Positive definite | Inverse-Wishart, LKJ |
The prior must have support matching the parameter space. Using a Normal prior for a probability parameter would be invalid because Normal distributions place mass on negative values and values greater than 1.
A common error is specifying priors whose support doesn't match the parameter space. If θ represents a variance (strictly positive), using a Normal prior centered at a small positive value will place some probability mass on negative values, creating invalid posterior samples. Always verify your prior's support matches your parameter's domain.
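As a quick check, you can compute how much mass a candidate prior assigns outside the parameter's domain. Below is a minimal SciPy sketch; the Normal(0.2, 0.3²) prior for a variance parameter is a deliberately misspecified, hypothetical choice used only for illustration.

```python
from scipy import stats

# Hypothetical misspecified prior: Normal(0.2, 0.3^2) for a variance parameter
bad_prior = stats.norm(loc=0.2, scale=0.3)

# Probability mass the prior assigns to impossible (negative) variance values
print(f"P(variance < 0) under Normal(0.2, 0.3^2): {bad_prior.cdf(0):.3f}")

# A prior whose support actually matches (0, ∞), e.g. a Half-Normal
good_prior = stats.halfnorm(scale=0.3)
print(f"P(variance < 0) under Half-Normal(0.3):   {good_prior.cdf(0):.3f}")
```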
Multivariate Priors:
When models have multiple parameters θ = (θ₁, θ₂, ..., θₖ), we specify a joint prior distribution:
$$p(\theta) = p(\theta_1, \theta_2, \ldots, \theta_k)$$
A common simplification is to assume prior independence:
$$p(\theta) = \prod_{i=1}^{k} p(\theta_i)$$
This factorization makes prior specification tractable but may not always be appropriate. Sometimes parameters have known relationships that should be encoded in the prior structure.
For example, in a hierarchical model where individual effects are drawn from a population distribution:

$$\theta_i \mid \mu, \sigma^2 \sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \ldots, k, \qquad \mu \sim p(\mu), \quad \sigma^2 \sim p(\sigma^2)$$

Here, the θᵢ are conditionally independent given μ and σ², but marginally they are correlated through their shared hyperparameters.
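A short simulation makes this concrete. This is a minimal sketch with hypothetical hyperpriors—Normal(0, 1) for μ and half-Normal(1) for σ—chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, k = 100_000, 2  # simulate pairs (theta_1, theta_2)

# Hypothetical hyperpriors (illustrative choices)
mu = rng.normal(0.0, 1.0, size=n_draws)              # mu ~ Normal(0, 1)
sigma = np.abs(rng.normal(0.0, 1.0, size=n_draws))   # sigma ~ Half-Normal(1)

# Conditionally independent individual effects given (mu, sigma)
theta = mu[:, None] + sigma[:, None] * rng.normal(size=(n_draws, k))

# Marginally, theta_1 and theta_2 are correlated through the shared hyperparameters
corr = np.corrcoef(theta[:, 0], theta[:, 1])[0, 1]
print(f"Marginal correlation of theta_1 and theta_2: {corr:.3f}")
```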
Priors can be classified along several dimensions. The most important classification concerns their informativeness—how much constraint they place on parameter values. Understanding this taxonomy is essential for appropriate prior selection.
Let's examine each category in depth.
Non-informative Priors:
The quest for 'objective' priors that represent complete ignorance has a long history. The most naive approach is the flat (uniform) prior:
$$p(\theta) \propto 1$$
This seems intuitive—all values are equally likely. However, flat priors have serious problems:
- Non-invariance under reparameterization: If $p(\theta) \propto 1$, then for $\phi = g(\theta)$, the prior on $\phi$ is NOT flat: $p(\phi) \propto |d\theta/d\phi|$
- Impropriety: On unbounded spaces, flat priors have infinite integral—they're not proper probability distributions
- Information paradox: A flat prior on [0, 1] for probability θ is actually informative—it implies specific beliefs about log-odds or other transformations
Jeffreys Prior:
Harold Jeffreys proposed a principled solution: the prior should be proportional to the square root of the Fisher Information:
$$p(\theta) \propto \sqrt{I(\theta)}$$
where $I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\log p(x|\theta)\right]$ is the Fisher Information.
Jeffreys priors are transformation-invariant: the same prior is obtained regardless of how the parameter is expressed. For a Bernoulli likelihood, Jeffreys prior is Beta(1/2, 1/2)—not uniform!
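For reference, the Bernoulli calculation behind this result is short:

$$\log p(x \mid \theta) = x\log\theta + (1-x)\log(1-\theta)
\quad\Rightarrow\quad
I(\theta) = \frac{E[x]}{\theta^2} + \frac{E[1-x]}{(1-\theta)^2} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}$$

so $p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, which is exactly the kernel of Beta(1/2, 1/2).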
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Jeffreys prior for Bernoulli parameter
# I(θ) = 1/(θ(1-θ)), so p(θ) ∝ θ^(-1/2) * (1-θ)^(-1/2) = Beta(1/2, 1/2)

theta = np.linspace(0.001, 0.999, 1000)

# Compare flat vs Jeffreys prior
flat_prior = np.ones_like(theta)                   # Uniform
jeffreys_prior = stats.beta(0.5, 0.5).pdf(theta)   # Beta(1/2, 1/2)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Prior comparison
ax1.plot(theta, flat_prior, 'b-', linewidth=2, label='Flat Prior (Beta(1,1))')
ax1.plot(theta, jeffreys_prior, 'r-', linewidth=2, label="Jeffreys Prior (Beta(1/2,1/2))")
ax1.fill_between(theta, jeffreys_prior, alpha=0.2, color='red')
ax1.set_xlabel("θ", fontsize=12)
ax1.set_ylabel("p(θ)", fontsize=12)
ax1.set_title("Non-informative Priors for Bernoulli Parameter")
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 1)

# Show transformation non-invariance of flat prior
# If θ ~ Uniform(0,1), what is the distribution of log-odds φ = log(θ/(1-θ))?
theta_samples = np.random.uniform(0, 1, 100000)
log_odds = np.log(theta_samples / (1 - theta_samples))

ax2.hist(log_odds, bins=100, density=True, alpha=0.7, color='blue')
ax2.set_xlabel("φ = log(θ/(1-θ))", fontsize=12)
ax2.set_ylabel("Implied p(φ)", fontsize=12)
ax2.set_title("Implied Prior on Log-Odds from Uniform θ")
ax2.grid(True, alpha=0.3)
ax2.set_xlim(-6, 6)
# The implied prior on φ is the standard logistic density, whose maximum is 1/4 at φ = 0
ax2.axhline(0.25, color='red', linestyle='--', label='Standard logistic density (max = 1/4)')
ax2.legend()

plt.tight_layout()
```

Weakly Informative Priors:
Modern Bayesian practice, particularly as advocated by Andrew Gelman and collaborators, emphasizes weakly informative priors as a practical default. These priors:

- Rule out clearly implausible parameter values (e.g., effect sizes of absurd magnitude)
- Spread substantial mass across the whole range of plausible values, so the data can dominate
- Provide mild regularization that stabilizes estimation, especially with small samples or weakly identified parameters
- Do not attempt to encode precise domain knowledge
For example, when modeling a regression coefficient β representing the effect of a standardized predictor on a standardized outcome:
$$\beta \sim \mathcal{N}(0, 1)$$
asserts that effects larger than about ±2 standard deviations are unlikely (roughly 95% prior probability between -2 and 2). This is weak enough to let data dominate with reasonable sample sizes, but provides regularization against overfitting.
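The regularizing effect is easy to see in a conjugate Normal–Normal update. The following is a small sketch with simulated data (the sample size, true effect, and known noise level are hypothetical choices), comparing the least-squares estimate with the posterior mean under a $\mathcal{N}(0, 1)$ prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: standardized predictor, true effect 0.3, known noise SD
n, true_beta, noise_sd = 20, 0.3, 1.0
x = rng.normal(size=n)
y = true_beta * x + rng.normal(scale=noise_sd, size=n)

# Maximum-likelihood (least-squares) estimate
beta_mle = (x @ y) / (x @ x)

# Conjugate posterior under beta ~ Normal(0, prior_var) with known noise_sd:
# posterior precision = x'x / sigma^2 + 1 / prior_var
# posterior mean      = (x'y / sigma^2) / posterior precision
prior_var = 1.0
precision_post = (x @ x) / noise_sd**2 + 1.0 / prior_var
beta_post_mean = ((x @ y) / noise_sd**2) / precision_post

print(f"MLE:            {beta_mle:.3f}")
print(f"Posterior mean: {beta_post_mean:.3f}  (shrunk toward 0 by the N(0,1) prior)")
```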
Informative Priors from Domain Knowledge:
When genuine prior knowledge exists, it should be used. Examples include:

- Results from previous studies or meta-analyses of similar interventions
- Physical or logical constraints (e.g., rates must be positive, proportions lie in [0, 1])
- Expert elicitation, in which subject-matter specialists quantify their beliefs about plausible ranges
- Known measurement properties of instruments (e.g., documented assay error distributions)
Encoding this knowledge as an informative prior is a strength of Bayesian inference, not a weakness. It formalizes the scientific process of building on prior work.
Before fitting a model, always examine the prior predictive distribution—data simulated from the prior × likelihood without conditioning on observed data. If the prior predicts outcomes that are nonsensical (e.g., negative heights, heights of 50 meters), your prior is poorly calibrated and should be revised.
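Here is a minimal sketch of such a prior predictive check, assuming a simple Normal model for adult heights in centimeters with hypothetical priors μ ~ Normal(170, 20) and σ ~ half-Normal(25); the specific numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_obs = 1000, 50

# Hypothetical priors for a model of adult heights (cm)
mu = rng.normal(170, 20, size=n_sims)              # mu ~ Normal(170, 20)
sigma = np.abs(rng.normal(0, 25, size=n_sims))     # sigma ~ Half-Normal(25)

# Prior predictive draws: simulate datasets from prior × likelihood
y_sim = rng.normal(mu[:, None], sigma[:, None], size=(n_sims, n_obs))

# Check whether the prior predicts absurd observations
print(f"Share of simulated heights below 0 cm:   {np.mean(y_sim < 0):.3%}")
print(f"Share of simulated heights above 250 cm: {np.mean(y_sim > 250):.3%}")
print(f"Prior predictive 1st-99th percentiles:   "
      f"[{np.percentile(y_sim, 1):.0f}, {np.percentile(y_sim, 99):.0f}] cm")
```

If a nontrivial share of simulated heights is negative or implausibly large, the priors should be tightened before any data are analyzed.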
Certain distribution families appear repeatedly in Bayesian modeling due to their mathematical convenience and flexibility. Mastering these families is essential for effective prior specification.
| Family | Support | Parameters | Use Case | Key Properties |
|---|---|---|---|---|
| Beta(α, β) | [0, 1] | α > 0, β > 0 (shape) | Probabilities, proportions | Conjugate to Binomial; mode at (α-1)/(α+β-2) for α, β > 1 |
| Normal(μ, σ²) | (-∞, ∞) | μ (mean), σ² (variance) | Unbounded continuous | Conjugate to itself; maximum entropy for given mean/variance |
| Gamma(α, β) | (0, ∞) | α (shape), β (rate) | Positive quantities, precisions | Conjugate to Poisson, Exponential |
| Inverse-Gamma(α, β) | (0, ∞) | α (shape), β (scale) | Variances | Conjugate to Normal variance |
| Dirichlet(α) | Simplex | α ∈ ℝ₊ᴷ (concentration) | Probability vectors | Conjugate to Multinomial |
| Half-Cauchy(s) | (0, ∞) | s (scale) | Scale parameters, standard deviations | Heavy-tailed; recommended for hierarchical SDs |
| Student-t(ν, μ, σ) | (-∞, ∞) | ν (df), μ (loc), σ (scale) | Robust alternatives to Normal | Heavy-tailed; ν=1 is Cauchy |
| LKJ(η) | Correlation matrices | η > 0 (shape) | Correlation structures | η=1 is uniform over correlations |
The Beta Distribution in Detail:
The Beta distribution is perhaps the most important prior family for beginners. It's the natural prior for parameters representing probabilities or proportions.
$$p(\theta | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}$$
Interpreting α and β:

- α − 1 can be read as a count of "prior successes" and β − 1 as "prior failures" relative to a uniform starting point
- The prior mean is α/(α + β)
- The sum α + β acts as a prior "sample size": larger values give a more concentrated, more informative prior

Special cases:

- Beta(1, 1) is the Uniform distribution on [0, 1]
- Beta(1/2, 1/2) is the Jeffreys prior for a Bernoulli/Binomial likelihood
- As α = β grows large, the distribution concentrates tightly around 0.5
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0.001, 0.999, 500)

# Various Beta distributions demonstrating flexibility
beta_params = [
    (1, 1, "Uniform", "#3498db"),
    (0.5, 0.5, "Jeffreys", "#e74c3c"),
    (2, 2, "Weakly informative (symmetric)", "#2ecc71"),
    (5, 1, "Concentrated near 1", "#9b59b6"),
    (1, 5, "Concentrated near 0", "#f39c12"),
    (10, 10, "Informative (centered)", "#1abc9c"),
    (50, 50, "Highly informative", "#34495e"),
    (2, 5, "Asymmetric", "#e91e63"),
]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for ax, (a, b, name, color) in zip(axes, beta_params):
    pdf = stats.beta(a, b).pdf(theta)
    ax.fill_between(theta, pdf, alpha=0.3, color=color)
    ax.plot(theta, pdf, color=color, linewidth=2)

    # Add mean and mode lines
    mean = a / (a + b)
    ax.axvline(mean, color='black', linestyle='--', alpha=0.7, label=f'Mean={mean:.2f}')
    if a > 1 and b > 1:
        mode = (a - 1) / (a + b - 2)
        ax.axvline(mode, color='red', linestyle=':', alpha=0.7, label=f'Mode={mode:.2f}')

    ax.set_title(f"Beta({a}, {b})\n{name}", fontsize=10)
    ax.set_xlabel("θ")
    ax.set_ylabel("p(θ)")
    ax.legend(fontsize=8)
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

plt.suptitle("Beta Distribution Family: Flexibility for Prior Specification", fontsize=14)
plt.tight_layout()
```

The Normal Distribution for Location Parameters:
For unbounded real-valued parameters, the Normal (Gaussian) distribution is a natural choice:
$$p(\theta | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\theta - \mu)^2}{2\sigma^2}\right)$$
The Normal distribution has maximum entropy among all distributions with a given mean and variance, making it a principled default when only these moments are known.
Guidelines for Normal priors:

- Center the prior at the most plausible value a priori (often 0 for standardized effects or contrasts)
- Choose the scale so that the central ~95% interval (μ ± 2σ) spans the range of scientifically plausible values, as sketched below
- Prefer a slightly-too-wide scale over a too-narrow one; an overly tight prior can effectively rule out the truth
- For extra robustness to surprises, consider a heavy-tailed Student-t or Cauchy alternative with the same center and scale
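The following small helper is a hypothetical convenience function (not part of any library) that converts a plausible range into Normal prior parameters along the lines above.

```python
from scipy import stats


def normal_prior_from_range(lower, upper, coverage=0.95):
    """Hypothetical helper: choose Normal(mu, sd) so that the central
    `coverage` interval spans [lower, upper]."""
    mu = (lower + upper) / 2
    z = stats.norm.ppf(0.5 + coverage / 2)   # ~1.96 for 95% coverage
    sd = (upper - lower) / (2 * z)
    return mu, sd


# Example: a treatment effect believed to lie between -5 and 15 (arbitrary units)
mu, sd = normal_prior_from_range(-5, 15)
print(f"Suggested prior: Normal({mu:.1f}, {sd:.2f}^2)")
print(f"Implied 95% prior interval: "
      f"[{stats.norm(mu, sd).ppf(0.025):.1f}, {stats.norm(mu, sd).ppf(0.975):.1f}]")
```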
Half-Cauchy for Scale Parameters:
For parameters that must be positive and represent scales or standard deviations, the half-Cauchy distribution is increasingly recommended:
$$p(\sigma | s) = \frac{2}{\pi s \left(1 + (\sigma/s)^2\right)}, \quad \sigma > 0$$
The half-Cauchy has heavier tails than Gamma or Inverse-Gamma alternatives, providing robustness when the true scale is unexpectedly large. This is particularly valuable in hierarchical models where shrinkage might otherwise be excessive.
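The difference in tail behavior is easy to see numerically. The sketch below uses illustrative parameter choices (all three priors have scale comparable to 1) and simply tabulates right-tail probabilities.

```python
from scipy import stats

# Illustrative comparison of right-tail mass for scale-parameter priors
half_cauchy = stats.halfcauchy(scale=1.0)       # heavy (polynomial) tail
half_normal = stats.halfnorm(scale=1.0)         # light (Gaussian) tail
gamma_prior = stats.gamma(a=2.0, scale=0.5)     # Gamma(shape=2, rate=2); exponential tail

for t in (3, 5, 10):
    print(f"P(sigma > {t:2d}):  "
          f"Half-Cauchy(1) = {half_cauchy.sf(t):.4f}   "
          f"Half-Normal(1) = {half_normal.sf(t):.2e}   "
          f"Gamma(2, rate=2) = {gamma_prior.sf(t):.2e}")
```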
Selecting appropriate priors is both an art and a science. Here we present a systematic framework for making principled prior choices.
Encoding Domain Knowledge:
Suppose you're modeling the probability θ that a patient responds to a new cancer treatment. From the literature, similar drugs in this class have response rates between 15% and 35%, with a best estimate around 25%.
Step 1: What prior encodes this?
We want a Beta(α, β) with:

- Prior mean near 0.25
- Most of its probability mass between roughly 0.15 and 0.35
Using numerical methods or trial-and-error, Beta(10, 30) roughly satisfies these constraints:

- Mean = 10/(10 + 30) = 0.25
- The bulk of its mass falls between about 0.15 and 0.35, with modest tails beyond that range
Alternatively, Beta(5, 15) gives similar mean but wider spread—appropriate if literature is less certain.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import minimize


def find_beta_params(target_mean, target_lower, target_upper, coverage=0.95):
    """
    Find Beta parameters that match desired mean and credible interval.

    Args:
        target_mean: Desired prior mean
        target_lower: Lower bound of desired credible interval
        target_upper: Upper bound of desired credible interval
        coverage: Probability coverage of interval (default 95%)

    Returns:
        (alpha, beta) parameters
    """
    alpha_lower = (1 - coverage) / 2
    alpha_upper = 1 - alpha_lower

    def objective(params):
        a, b = params
        if a <= 0 or b <= 0:
            return np.inf
        dist = stats.beta(a, b)
        mean_error = (dist.mean() - target_mean)**2
        lower_error = (dist.ppf(alpha_lower) - target_lower)**2
        upper_error = (dist.ppf(alpha_upper) - target_upper)**2
        return mean_error + lower_error + upper_error

    # Initial guess based on mean
    a_init = target_mean * 10
    b_init = (1 - target_mean) * 10

    result = minimize(objective, [a_init, b_init],
                      bounds=[(0.1, 100), (0.1, 100)])
    return result.x


# Example: Drug response rate
# Prior belief: 15-35% range, mean around 25%
alpha, beta = find_beta_params(
    target_mean=0.25,
    target_lower=0.15,
    target_upper=0.35
)

print(f"Optimal Beta parameters: α = {alpha:.2f}, β = {beta:.2f}")

# Verify
dist = stats.beta(alpha, beta)
print(f"Prior mean: {dist.mean():.3f}")
print(f"Prior 2.5th percentile: {dist.ppf(0.025):.3f}")
print(f"Prior 97.5th percentile: {dist.ppf(0.975):.3f}")

# Visualize
theta = np.linspace(0, 1, 500)
plt.figure(figsize=(10, 6))
plt.fill_between(theta, dist.pdf(theta), alpha=0.3, color='blue')
plt.plot(theta, dist.pdf(theta), 'b-', linewidth=2, label=f'Beta({alpha:.1f}, {beta:.1f})')
plt.axvline(0.15, color='red', linestyle='--', label='Target lower (0.15)')
plt.axvline(0.35, color='red', linestyle='--', label='Target upper (0.35)')
plt.axvline(0.25, color='green', linestyle='-', label='Target mean (0.25)')
plt.xlabel("θ (response rate)")
plt.ylabel("p(θ)")
plt.title("Prior Constructed from Domain Knowledge")
plt.legend()
plt.grid(True, alpha=0.3)
```

It's invalid to 'peek' at the data when selecting priors. Priors must be specified before observing the data, or the posterior will be mis-calibrated. If you must use the data for prior specification (e.g., empirical Bayes), use formal methods that account for this double-use of data.
An improper prior is a function that does not integrate to a finite value—it's not a valid probability distribution. The most common example is the flat prior on an unbounded space:
$$p(\theta) \propto 1, \quad \theta \in (-\infty, \infty)$$
Since $\int_{-\infty}^{\infty} 1 \, d\theta = \infty$, this is not a proper probability distribution.
Why use improper priors?
Despite their theoretical issues, improper priors are sometimes used because:

- They can still yield proper posteriors when the likelihood is sufficiently informative
- They often reproduce familiar frequentist answers (e.g., a flat prior on a Normal mean gives the sample mean as the posterior mean)
- Standard "objective" choices such as Jeffreys priors are frequently improper
- They can be viewed as convenient limits of increasingly diffuse proper priors
The Danger:
Improper priors can lead to improper posteriors—posteriors that also don't integrate to finite values. This makes inference impossible: you can't compute means, credible intervals, or any meaningful summaries.
Example of disaster:
Consider a hierarchical model in which group effects are drawn from a population distribution, $\theta_i \sim \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \ldots, n$, with improper hyperpriors $p(\mu) \propto 1$ and $p(\sigma^2) \propto 1/\sigma^2$.
If n = 1 (single observation), the posterior on σ² is improper—there's no finite integral. The model is unidentified and inference is meaningless.
Safe practices:
Before using an improper prior, verify that the posterior is proper. The condition is that the posterior's normalizing constant is finite: $\int p(\theta)\, p(y \mid \theta)\, d\theta < \infty$ for (almost) all possible datasets $y$. When in doubt, use proper (even very diffuse) priors instead.
A critical aspect of responsible Bayesian analysis is understanding how much conclusions depend on prior assumptions. Sensitivity analysis systematically varies the prior and observes how the posterior changes.
When to worry:

- The sample size is small relative to the number of parameters, so the prior carries real weight
- Reasonable alternative priors lead to substantively different conclusions
- The quantity of interest depends on the tails of the posterior (e.g., extreme quantiles or rare-event probabilities)
- The prior was chosen for convenience rather than from domain knowledge
Formal sensitivity analysis:
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def compute_beta_posterior(prior_alpha, prior_beta, successes, failures):
    """
    Compute Beta posterior analytically (conjugacy).
    """
    post_alpha = prior_alpha + successes
    post_beta = prior_beta + failures
    return post_alpha, post_beta


# Observed data: 3 successes in 10 trials
successes, failures = 3, 7

# Define a class of reasonable priors
priors = [
    (1, 1, "Flat (Beta(1,1))"),
    (0.5, 0.5, "Jeffreys (Beta(0.5,0.5))"),
    (2, 2, "Weakly informative (Beta(2,2))"),
    (1, 9, "Skeptical (Beta(1,9)) - favors low θ"),
    (5, 5, "Moderately informative (Beta(5,5))"),
    (10, 10, "Informative (Beta(10,10))"),
]

theta = np.linspace(0, 1, 500)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

posterior_means = []
posterior_95ci = []

for ax, (a, b, name) in zip(axes, priors):
    # Prior
    prior_dist = stats.beta(a, b)

    # Posterior
    post_a, post_b = compute_beta_posterior(a, b, successes, failures)
    post_dist = stats.beta(post_a, post_b)

    # Store summary statistics
    posterior_means.append(post_dist.mean())
    posterior_95ci.append((post_dist.ppf(0.025), post_dist.ppf(0.975)))

    # Plot
    ax.plot(theta, prior_dist.pdf(theta), 'b--', linewidth=2,
            label=f'Prior: Beta({a},{b})', alpha=0.7)
    ax.fill_between(theta, post_dist.pdf(theta), alpha=0.3, color='red')
    ax.plot(theta, post_dist.pdf(theta), 'r-', linewidth=2,
            label=f'Posterior: Beta({post_a},{post_b})')
    ax.axvline(post_dist.mean(), color='red', linestyle=':',
               label=f'Post. mean={post_dist.mean():.3f}')

    ax.set_title(f"{name}\nPosterior mean: {post_dist.mean():.3f}")
    ax.set_xlabel("θ")
    ax.set_ylabel("Density")
    ax.legend(fontsize=8)
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

plt.suptitle(f"Prior Sensitivity Analysis\nData: {successes} successes in {successes+failures} trials",
             fontsize=14)
plt.tight_layout()

# Summary report
print("=" * 60)
print("PRIOR SENSITIVITY ANALYSIS SUMMARY")
print("=" * 60)
print(f"{'Prior':<35} {'Post. Mean':<12} {'95% CI':<20}")
print("-" * 60)
for (a, b, name), mean, ci in zip(priors, posterior_means, posterior_95ci):
    print(f"{name:<35} {mean:<12.4f} [{ci[0]:.3f}, {ci[1]:.3f}]")
print("-" * 60)
print(f"Range of posterior means: [{min(posterior_means):.4f}, {max(posterior_means):.4f}]")
print(f"Spread: {max(posterior_means) - min(posterior_means):.4f}")
```

Interpreting Sensitivity:
In the example above, with only 10 observations (3 successes), the prior still matters:

- The flat and Jeffreys priors give posterior means near 0.32–0.33, close to the sample proportion of 0.30
- The skeptical Beta(1,9) prior pulls the posterior mean down to 0.20
- The informative Beta(10,10) prior pulls it up to about 0.43
- The spread of posterior means across these reasonable priors exceeds 0.2, so conclusions at this sample size depend materially on the prior
The Asymptotics of Prior Influence:
As sample size n → ∞, the posterior converges to the true parameter value regardless of the prior (assuming the model is well-specified and the prior places non-zero mass on the true value). This is called posterior consistency.
The posterior typically contracts around the true value at rate $O(1/\sqrt{n})$. In conjugate models the prior acts like a fixed number of pseudo-observations, so its relative weight in the posterior falls off roughly as $1/n$: after 100 observations, a Beta(2, 2) prior contributes only about 4 pseudo-counts against 100 real data points.
This provides theoretical comfort: with enough data, prior misspecification is automatically corrected. But in finite samples, prior choice matters.
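A quick sketch of this effect reuses the Beta–Binomial setup from the sensitivity analysis above; the true success probability of 0.3 and the particular priors are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
true_theta = 0.3
priors = {"Beta(1,1)": (1, 1), "Beta(10,10)": (10, 10), "Beta(1,9)": (1, 9)}

# Simulate a growing Bernoulli(0.3) dataset and track posterior means under each prior
data = rng.random(10_000) < true_theta
for n in (10, 100, 1000, 10_000):
    successes = int(data[:n].sum())
    # Conjugate posterior mean: (alpha + successes) / (alpha + beta + n)
    means = {name: (a + successes) / (a + b + n) for name, (a, b) in priors.items()}
    summary = "  ".join(f"{name}: {m:.3f}" for name, m in means.items())
    print(f"n = {n:>5}  posterior means ->  {summary}")
```

As n grows, the posterior means under all three priors converge toward the same value, while at n = 10 they still disagree noticeably.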
Prior beliefs form the foundation of Bayesian inference—they encode what we know before observing data. Let's consolidate the key concepts:

- Priors formalize pre-data knowledge as probability distributions over parameters; in the Bayesian view, probability measures degree of belief
- Priors range from non-informative (flat, Jeffreys) through weakly informative defaults to strongly informative priors built from domain knowledge
- A prior's support must match the parameter space; standard families (Beta, Normal, Gamma, Dirichlet, half-Cauchy, LKJ) cover the common cases
- Improper priors can be convenient, but the resulting posterior must be checked for propriety
- Prior predictive checks and sensitivity analyses are essential tools for validating prior choices
- With enough data the prior washes out, but in small samples it materially shapes the posterior
What's Next:
With prior beliefs established, we now turn to the likelihood function—the mathematical bridge connecting our model's parameters to observed data. The likelihood quantifies how probable our observations are under different parameter values, and its combination with the prior yields the posterior distribution.
You now understand the philosophical and mathematical foundations of prior distributions in Bayesian inference. You can categorize priors by informativeness, select appropriate distribution families, and conduct sensitivity analyses. Next, we'll explore how the likelihood function quantifies data evidence.