In classical linear regression, we learn a single set of weights w that best fits our training data. We find the point estimate that minimizes squared error, and we're done. But this approach hides a profound limitation: it tells us nothing about our uncertainty.
Consider fitting a line through three data points versus fitting the same line through three thousand points. Classical regression gives you the same type of answer in both cases—a single weight vector. But intuitively, you should be far more confident in the second case. Where does that confidence live in the classical framework? It doesn't.
Bayesian linear regression fundamentally reimagines this problem. Instead of asking "What are the best weights?", we ask "What is the probability distribution over all possible weights, given our data?" This shift from point estimates to probability distributions unlocks principled uncertainty quantification, robustness to limited data, and a framework that gracefully incorporates prior knowledge.
By the end of this page, you will understand how to mathematically encode prior beliefs about regression weights, why this seemingly small change has profound implications for machine learning, and how different prior choices lead to different learning behaviors. This foundation is essential for everything that follows in Bayesian linear regression.
The core philosophical departure of Bayesian statistics is treating parameters as random variables rather than fixed but unknown constants. This isn't merely a mathematical trick—it represents a fundamentally different interpretation of what parameters mean.
The Frequentist View (Classical Linear Regression):
In the frequentist paradigm, the true weight vector w* exists as a fixed, deterministic quantity. Our training data is a random sample from some population, and our estimate ŵ is a random variable that approximates w*. The randomness comes from the data, not from the parameters.
This view leads to questions like:
- How much would the estimate ŵ vary across repeated samples from the population?
- Is the estimator unbiased, and what is its sampling variance?
The Bayesian View:
In the Bayesian paradigm, w itself is a random variable with a probability distribution. Before seeing data, we have a prior distribution p(w) representing our beliefs about likely weight values. After seeing data D, we update to a posterior distribution p(w|D) that combines prior beliefs with evidence.
This view leads to different questions:
- Given the data we actually observed, how plausible is each possible value of w?
- What is the probability that a particular weight is positive, or lies within a given range?
| Aspect | Frequentist View | Bayesian View |
|---|---|---|
| Parameters | Fixed, unknown constants | Random variables with distributions |
| Probability | Long-run frequency of events | Degree of belief or certainty |
| Estimation Goal | Find a point estimate ŵ | Compute the posterior distribution p(w\|D) |
| Uncertainty | Confidence intervals (frequentist coverage) | Credible intervals (posterior probability) |
| Prior Knowledge | Difficult to incorporate formally | Naturally incorporated via prior p(w) |
| Small Data Regime | Often problematic (overfitting) | Priors provide regularization |
The frequentist and Bayesian views are different frameworks for reasoning about uncertainty. Each has strengths. Bayesian methods excel when incorporating prior knowledge is valuable, uncertainty quantification is critical, or data is limited. Frequentist methods often have computational advantages and well-understood theoretical guarantees. Understanding both makes you a more effective practitioner.
Before we can define a prior on weights, we need a clear probabilistic model for linear regression. Let's establish the framework carefully.
The Data:
We observe N training examples, each consisting of:
- a feature vector xₙ ∈ ℝᴰ, and
- a scalar target yₙ ∈ ℝ.
We collect all features into a design matrix X ∈ ℝᴺˣᴰ and all targets into a vector y ∈ ℝᴺ.
The Generative Model:
We assume that targets are generated by a linear function of features plus Gaussian noise:
$$y_n = \mathbf{w}^\top \mathbf{x}_n + \epsilon_n$$
where:
- w ∈ ℝᴰ is the unknown weight vector,
- εₙ ~ 𝒩(0, σ²) is independent Gaussian noise with variance σ².
This can be written equivalently as:
$$y_n | \mathbf{x}_n, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}_n, \sigma^2)$$
The Likelihood Function:
Under the assumption that observations are conditionally independent given w, the likelihood of observing all data is:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | \mathbf{w}^\top \mathbf{x}_n, \sigma^2)$$
Taking the log and expanding:
$$\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2$$
Maximizing this log-likelihood with respect to w is equivalent to minimizing the sum of squared errors—exactly what ordinary least squares (OLS) does. So OLS is the maximum likelihood estimator for this probabilistic model.
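As a quick sanity check of this equivalence, the sketch below (hypothetical synthetic data, noise level assumed known) fits the same model twice: once with the OLS closed form and once by numerically maximizing the Gaussian log-likelihood. The two weight vectors should agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical synthetic data: y = X w_true + Gaussian noise
rng = np.random.default_rng(0)
N, D = 50, 3
sigma = 0.5                                # assumed (known) noise standard deviation
w_true = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

# 1. OLS closed form: minimizes the sum of squared errors
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2. Maximum likelihood: maximize log p(y | X, w, sigma^2)
def neg_log_likelihood(w):
    resid = y - X @ w
    return (N / 2) * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(D)).x

print("OLS:", np.round(w_ols, 4))
print("MLE:", np.round(w_mle, 4))   # matches the OLS solution up to optimizer tolerance
```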
For now, we treat σ² as known or as a hyperparameter to be set. In full Bayesian treatment, we would place a prior on σ² as well (typically an Inverse-Gamma distribution, which is conjugate to the Gaussian likelihood). We'll focus on the weights for clarity.
Matrix Notation:
For computational convenience, we write the model in matrix form:
$$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$$
where ε ~ 𝒩(0, σ²I_N), giving us:
$$\mathbf{y} | \mathbf{X}, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N)$$
This multivariate Gaussian formulation is crucial for deriving the posterior distribution efficiently.
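To confirm that the matrix form is just a repackaging of the per-observation likelihood, the snippet below (hypothetical data and parameter values) evaluates the log-likelihood both as a sum of univariate Gaussian terms and as a single multivariate Gaussian density; the two numbers should match.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
N, D = 20, 2
sigma = 0.3
w = np.array([0.8, -1.2])                 # an arbitrary weight vector
X = rng.normal(size=(N, D))
y = X @ w + rng.normal(scale=sigma, size=N)

# Per-observation form: sum_n log N(y_n | w^T x_n, sigma^2)
log_lik_sum = norm.logpdf(y, loc=X @ w, scale=sigma).sum()

# Matrix form: log N(y | Xw, sigma^2 I_N)
log_lik_mvn = multivariate_normal.logpdf(y, mean=X @ w, cov=sigma**2 * np.eye(N))

print(log_lik_sum, log_lik_mvn)           # identical up to floating-point error
```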
The prior distribution p(w) encodes our beliefs about the weights before observing any data. This is where we inject domain knowledge, regularization preferences, or express uncertainty.
The Gaussian Prior:
The most common choice is a multivariate Gaussian prior:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)$$
where:
- m₀ ∈ ℝᴰ is the prior mean (our best guess for the weights before seeing data),
- S₀ ∈ ℝᴰˣᴰ is the prior covariance matrix (how uncertain we are, and how the weights co-vary).
Why Gaussian?
Conjugacy: Gaussian priors are conjugate to the Gaussian likelihood, meaning the posterior is also Gaussian. This allows closed-form solutions.
Interpretability: The prior mean represents our initial estimate; the prior covariance represents how uncertain we are and how features relate.
Flexibility: By choosing m₀ and S₀ appropriately, we can encode a wide range of prior beliefs.
Central Limit Theorem: In many settings, the Gaussian emerges as a natural limiting distribution.
The Isotropic Gaussian Prior (Most Common):
Often, we use a zero-mean isotropic prior:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I}_D)$$
where:
- α > 0 is the prior precision (inverse variance), shared across all D weights,
- α⁻¹ is the prior variance of each weight, and I_D is the D×D identity matrix.
This prior says: "Before seeing data, I believe each weight is equally likely to be positive or negative, with magnitude controlled by α."
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Define a 2D Gaussian prior for visualization
prior_mean = np.array([0.0, 0.0])
prior_precision = 1.0  # alpha
prior_cov = (1.0 / prior_precision) * np.eye(2)

# Create a grid for visualization
w1 = np.linspace(-3, 3, 100)
w2 = np.linspace(-3, 3, 100)
W1, W2 = np.meshgrid(w1, w2)
pos = np.dstack((W1, W2))

# Evaluate the prior density
prior_dist = multivariate_normal(prior_mean, prior_cov)
Z = prior_dist.pdf(pos)

# Visualize
plt.figure(figsize=(8, 6))
plt.contourf(W1, W2, Z, levels=20, cmap='Blues')
plt.colorbar(label='Prior Density p(w)')
plt.xlabel('$w_1$')
plt.ylabel('$w_2$')
plt.title('Isotropic Gaussian Prior: $w \\sim \\mathcal{N}(0, I)$')
plt.axis('equal')
plt.show()

# The prior expresses that:
# - We have no initial preference (mean at origin)
# - Weights near zero are most likely
# - Large weights are increasingly unlikely
# - The precision alpha controls this "shrinkage"
```

A common misconception is that priors inject arbitrary bias. In reality, priors formalize assumptions that all methods make implicitly. OLS implicitly assumes a uniform prior (all weight values equally likely), which is actually a very strong assumption: it says infinite weights are just as plausible as small ones. The Gaussian prior is often more realistic.
The prior precision α (or equivalently, the prior variance α⁻¹) is a critical hyperparameter that controls the strength of our prior beliefs.
High Precision (Large α):
- The prior variance α⁻¹ is small, so we strongly believe the weights lie close to zero (the prior mean).
- The data must provide substantial evidence to pull the weights away from zero.

Low Precision (Small α):
- The prior variance α⁻¹ is large, so the prior is broad and only weakly constrains the weights.
- Even modest amounts of data quickly dominate the prior.
The Limit α → 0 (Improper Flat Prior):
As α → 0, the prior variance → ∞, and we approach an "uninformative" or flat prior. In this limit, the Bayesian posterior mode matches the OLS solution. However, truly flat priors are technically improper (don't integrate to 1) and can cause issues.
The Limit α → ∞ (Infinitely Informative Prior):
As α → ∞, the prior collapses to a point mass at zero. No matter what data we see, the weights stay at zero. The prior completely dominates.
| Prior Precision α | Prior Variance | Effect on Weights | Analogy |
|---|---|---|---|
| α → 0 | ∞ (flat) | OLS solution (data dominates) | No regularization |
| α = 0.1 | 10.0 | Mild shrinkage toward zero | Light regularization |
| α = 1.0 | 1.0 | Moderate shrinkage | Standard regularization |
| α = 10.0 | 0.1 | Strong shrinkage to zero | Heavy regularization |
| α → ∞ | 0 (point mass) | Weights fixed at zero | Infinite regularization |
The Regularization Interpretation:
We'll explore this deeply in a later page, but the key insight is:
$$\text{Posterior Mode} = \arg\max_\mathbf{w} \left[ \log p(\mathbf{y}|\mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) \right]$$
For a Gaussian prior with precision α and Gaussian likelihood with noise variance σ²:
$$\text{Posterior Mode} = \arg\min_\mathbf{w} \left[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \alpha\sigma^2 \|\mathbf{w}\|^2 \right]$$
This is exactly Ridge Regression with regularization parameter λ = ασ². The Bayesian prior directly corresponds to the regularization term!
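To make explicit where the factor λ = ασ² comes from, write out the two log terms, dropping constants that do not depend on w:

$$-\log p(\mathbf{y}|\mathbf{X}, \mathbf{w}) - \log p(\mathbf{w}) = \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\alpha}{2}\|\mathbf{w}\|^2 + \text{const}$$

Multiplying by 2σ² leaves the minimizer unchanged and yields exactly the objective above, so the effective regularization parameter is λ = ασ².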
This connection reveals that every regularized method implicitly assumes some prior. Bayesian inference makes this assumption explicit and principled.
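The sketch below (hypothetical synthetic data; σ² treated as known) makes the correspondence concrete: for a sweep of prior precisions α it computes the ridge solution with λ = ασ², reproducing the shrinkage pattern from the table above, and then cross-checks one value of α against a direct numerical maximization of the log-posterior.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, D = 40, 4
sigma = 0.5                              # assumed known noise standard deviation
w_true = np.array([2.0, -1.0, 0.5, 0.0])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

def ridge_solution(lam):
    """Closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Sweep the prior precision alpha; the effective ridge penalty is lambda = alpha * sigma^2
for alpha in [1e-6, 0.1, 1.0, 10.0, 1e6]:
    w_map = ridge_solution(alpha * sigma**2)
    print(f"alpha={alpha:>8}: ||w_MAP|| = {np.linalg.norm(w_map):.4f}")
# Small alpha -> close to OLS; large alpha -> weights shrunk toward zero.

# Cross-check one alpha by minimizing the negative log-posterior directly
alpha = 1.0
def neg_log_posterior(w):
    resid = y - X @ w
    return resid @ resid / (2 * sigma**2) + alpha * (w @ w) / 2

w_opt = minimize(neg_log_posterior, x0=np.zeros(D)).x
print(np.allclose(w_opt, ridge_solution(alpha * sigma**2), atol=1e-4))  # True
```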
α can be selected via cross-validation (treat it as a hyperparameter), empirical Bayes (maximize marginal likelihood), or full Bayesian treatment (place a hyperprior on α and integrate it out). Each approach has different computational and philosophical tradeoffs.
While the isotropic zero-mean prior is common, real problems often benefit from more sophisticated prior specifications.
Non-Zero Mean:
If we have prior knowledge that certain features should have positive or negative effects, we can encode this:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)$$
with m₀ ≠ 0. For example, in a housing price model, we might believe:
- the weight on square footage is positive (larger homes tend to cost more),
- the weight on distance from the city center is negative,
- the weight on building age is negative, though we are less sure of its magnitude.

One way to encode such beliefs is sketched below.
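All feature names, units, and numbers in this sketch are hypothetical illustrations; in practice they would come from domain knowledge or previous studies.

```python
import numpy as np

# Hypothetical housing model with three standardized features:
# [square footage, distance to city center, building age]
feature_names = ["sqft", "dist_center", "age"]

# Prior mean: expected direction and rough magnitude of each effect
m0 = np.array([0.6, -0.4, -0.1])     # positive, negative, slightly negative

# Prior covariance: how confident we are in each guess
# (smaller variance = stronger belief; here we are most sure about sqft)
S0 = np.diag([0.2**2, 0.3**2, 0.5**2])

for name, mean, var in zip(feature_names, m0, np.diag(S0)):
    print(f"{name:>12}: prior mean {mean:+.2f}, prior std {np.sqrt(var):.2f}")
```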
Diagonal (Anisotropic) Prior:
Different features may warrant different prior variances:
$$\mathbf{S}_0 = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_D^2)$$
This is useful when:
- features are measured on very different scales,
- we have stronger prior knowledge about some coefficients than others,
- some features are suspected to be weak or irrelevant and should be shrunk more aggressively.
Full Covariance Prior:
For correlated features, a full covariance matrix captures prior beliefs about relationships:
$$\mathbf{S}_0 = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$
Positive correlation in S₀ says: "If weight 1 is large, weight 2 is likely large too." This can encode structural knowledge about feature relationships.
```python
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Create grid for visualization
w1 = np.linspace(-4, 4, 100)
w2 = np.linspace(-4, 4, 100)
W1, W2 = np.meshgrid(w1, w2)
pos = np.dstack((W1, W2))

# 1. Isotropic Prior
mean_iso = [0, 0]
cov_iso = [[1, 0], [0, 1]]
Z_iso = multivariate_normal(mean_iso, cov_iso).pdf(pos)
axes[0].contourf(W1, W2, Z_iso, levels=15, cmap='Blues')
axes[0].set_title('Isotropic: $S_0 = I$')
axes[0].set_xlabel('$w_1$'); axes[0].set_ylabel('$w_2$')
axes[0].set_aspect('equal')

# 2. Anisotropic Prior (different variances)
mean_aniso = [0, 0]
cov_aniso = [[0.5, 0], [0, 2.0]]  # w1 more constrained than w2
Z_aniso = multivariate_normal(mean_aniso, cov_aniso).pdf(pos)
axes[1].contourf(W1, W2, Z_aniso, levels=15, cmap='Greens')
axes[1].set_title('Anisotropic: $\\sigma_1^2=0.5, \\sigma_2^2=2$')
axes[1].set_xlabel('$w_1$'); axes[1].set_ylabel('$w_2$')
axes[1].set_aspect('equal')

# 3. Correlated Prior
mean_corr = [1, 0.5]  # Non-zero mean
cov_corr = [[1.0, 0.7], [0.7, 1.0]]  # Positive correlation
Z_corr = multivariate_normal(mean_corr, cov_corr).pdf(pos)
axes[2].contourf(W1, W2, Z_corr, levels=15, cmap='Oranges')
axes[2].set_title('Correlated: $\\rho=0.7$, mean=(1, 0.5)')
axes[2].set_xlabel('$w_1$'); axes[2].set_ylabel('$w_2$')
axes[2].set_aspect('equal')

plt.tight_layout()
plt.show()

# Key insight: The prior shape guides learning
# - Isotropic: All directions equally constrained
# - Anisotropic: Some features more flexible
# - Correlated: Encodes relationships between features
```

Constructing informative priors requires domain expertise. In practice, priors often come from previous studies (meta-analysis), expert knowledge (interviews with domain experts), or hierarchical models (learning priors from related tasks). Poor prior elicitation can harm performance, so sensitivity analysis is important.
While Gaussian priors are convenient and interpretable, other prior families encode different assumptions about weight structure.
Laplace Prior (Sparse Weights):
$$p(w_j) \propto \exp(-\lambda |w_j|)$$
The Laplace (double exponential) prior has a sharp peak at zero and heavy tails. This encourages sparsity—many weights exactly or nearly zero, with a few large weights. The Laplace prior corresponds to L1 regularization (Lasso).
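A quick way to see why the Laplace prior encourages sparsity is to compare its density with a Gaussian of the same variance: the Laplace puts more mass both very close to zero and far out in the tails. A minimal sketch (scales chosen arbitrarily for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import laplace, norm

w = np.linspace(-5, 5, 500)

# Match variances: Laplace(scale=b) has variance 2*b^2, so b = 1/sqrt(2) gives variance 1
laplace_pdf = laplace.pdf(w, scale=1 / np.sqrt(2))
gauss_pdf = norm.pdf(w, scale=1.0)

plt.figure(figsize=(7, 4))
plt.plot(w, gauss_pdf, label='Gaussian $\\mathcal{N}(0, 1)$')
plt.plot(w, laplace_pdf, label='Laplace (same variance)')
plt.xlabel('$w_j$')
plt.ylabel('Prior density')
plt.title('Laplace vs. Gaussian prior: sharper peak at zero, heavier tails')
plt.legend()
plt.show()
```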
Spike-and-Slab Prior (Explicit Sparsity):
$$p(w_j) = \pi \cdot \delta(w_j) + (1-\pi) \cdot \mathcal{N}(w_j | 0, \sigma^2)$$
A mixture of a point mass at zero (spike) and a diffuse Gaussian (slab). This explicitly models the belief that each weight is either exactly zero or drawn from a continuous distribution. Powerful for feature selection but computationally challenging.
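Because the spike-and-slab prior is a simple two-component mixture, sampling from it is straightforward. The sketch below draws weights that are exactly zero with probability π and Gaussian otherwise (π and the slab standard deviation are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_spike_and_slab(n_weights, pi=0.7, slab_std=1.0):
    """Each weight is exactly 0 with probability pi, else drawn from N(0, slab_std^2)."""
    is_spike = rng.random(n_weights) < pi
    slab_draws = rng.normal(scale=slab_std, size=n_weights)
    return np.where(is_spike, 0.0, slab_draws)

w_sample = sample_spike_and_slab(20)
print(np.round(w_sample, 2))
print("Fraction exactly zero:", np.mean(w_sample == 0.0))
```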
Horseshoe Prior (Adaptive Shrinkage):
$$w_j | \lambda_j \sim \mathcal{N}(0, \lambda_j^2), \quad \lambda_j \sim \text{Half-Cauchy}(0, \tau)$$
A hierarchical prior where each weight has its own scale parameter λⱼ with a heavy-tailed distribution. This provides adaptive shrinkage: truly zero weights are shrunk aggressively, while large true weights experience less shrinkage.
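Sampling from the horseshoe prior follows its hierarchical definition directly: draw a local scale λⱼ from a half-Cauchy, then draw the weight from a Gaussian with that scale. A minimal sketch (τ chosen arbitrarily):

```python
import numpy as np
from scipy.stats import halfcauchy
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n_weights = 10_000
tau = 1.0                                       # global scale (illustrative value)

# Local scales lambda_j ~ Half-Cauchy(0, tau), then w_j | lambda_j ~ N(0, lambda_j^2)
lam = halfcauchy.rvs(scale=tau, size=n_weights, random_state=rng)
w = rng.normal(scale=lam)

plt.hist(w, bins=200, range=(-10, 10), density=True)
plt.xlabel('$w_j$')
plt.title('Samples from the horseshoe prior: sharp spike at zero, very heavy tails')
plt.show()
```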
Automatic Relevance Determination (ARD):
$$w_j \sim \mathcal{N}(0, \alpha_j^{-1})$$
Each weight has its own precision αⱼ, learned from data. If αⱼ → ∞ during learning, the corresponding feature is deemed irrelevant. ARD provides automatic feature selection within a Gaussian framework.
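Under ARD the prior itself is still Gaussian, just with a separate precision per weight. The sketch below uses hand-picked αⱼ values purely for illustration (in the full ARD procedure they would be learned from data) to show how a very large αⱼ effectively pins the corresponding weight to zero:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical per-feature precisions: the last two features are "switched off"
alpha_j = np.array([1.0, 1.0, 1e6, 1e6])
prior_std = 1.0 / np.sqrt(alpha_j)

# Draw a few weight vectors from the ARD prior w_j ~ N(0, alpha_j^{-1})
samples = rng.normal(scale=prior_std, size=(5, len(alpha_j)))
print(np.round(samples, 3))
# Columns with huge alpha_j are essentially zero: those features are pruned.
```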
| Prior Family | Sparsity | Conjugate? | Regularization Analog | Computation |
|---|---|---|---|---|
| Gaussian (isotropic) | None | Yes | Ridge (L2) | Easy (closed-form) |
| Gaussian (ARD) | Soft | Yes (given αⱼ) | Adaptive L2 | Moderate (iterative) |
| Laplace | Yes | No | Lasso (L1) | Moderate (no closed-form) |
| Spike-and-Slab | Yes (exact) | No | Best subset | Hard (discrete) |
| Horseshoe | Yes (adaptive) | No | Adaptive L2 | Moderate (MCMC) |
Use sparse priors when you believe: (1) most features are irrelevant (high D, many noise features), (2) interpretability is important (identifying which features matter), or (3) the true model is genuinely sparse. For dense signals where all features contribute, Gaussian priors are often more appropriate.
One of the most profound insights from Bayesian linear regression is that every regularization scheme corresponds to some prior distribution. This isn't just a mathematical curiosity—it provides deep insight into what regularization actually means.
The MAP Estimator:
The Maximum A Posteriori (MAP) estimate is the mode of the posterior:
$$\mathbf{w}_{\text{MAP}} = \arg\max_\mathbf{w} p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \arg\max_\mathbf{w} \left[ p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w}) \right]$$
Taking logs:
$$\mathbf{w}_{\text{MAP}} = \arg\min_\mathbf{w} \left[ -\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) - \log p(\mathbf{w}) \right]$$
The first term is the negative log-likelihood (data fit). The second term is the negative log-prior (regularization).
Specific Correspondences:
| Prior Distribution | MAP Objective (Regularization) |
|---|---|
| Gaussian: 𝒩(0, α⁻¹I) | Ridge: ∥y - Xw∥² + λ∥w∥² |
| Laplace: ∏ exp(-λ\|wⱼ\|) | Lasso: ∥y - Xw∥² + λ∥w∥₁ |
| Uniform (improper) | OLS: ∥y - Xw∥² (no regularization) |
| Elastic Net prior | Elastic Net: ∥y - Xw∥² + λ₁∥w∥₁ + λ₂∥w∥² |
While the MAP estimator reveals the prior-regularization connection, it's only a point estimate. Full Bayesian inference uses the entire posterior distribution, providing uncertainty quantification that MAP alone cannot offer. Think of MAP as a bridge between frequentist and Bayesian thinking, but not the full power of the Bayesian approach.
Why This Matters:
Principled Hyperparameter Interpretation: Regularization strength λ isn't arbitrary—it's the ratio of prior precision to noise precision. Setting λ = 1 means you trust the prior equally to one data point.
Informed Prior Design: Want L1-like sparsity? Use a Laplace prior. Want smooth solutions? Use a prior that penalizes weight differences.
Unified Framework: All regularization methods become instances of Bayesian inference with different priors. This unified view clarifies when each method is appropriate.
Beyond Point Estimates: Once you recognize the prior, you can do full Bayesian inference with it—getting posterior distributions, not just MAP estimates.
Setting priors in practice requires balancing domain knowledge, computational tractability, and robustness. Here are key principles:
1. Feature Scaling and Priors:
If features are on different scales, an isotropic prior may be inappropriate. A weight of 0.01 for a feature measured in millions differs from 0.01 for a binary feature. Options:
- standardize features (zero mean, unit variance) before fitting, so a shared prior variance is reasonable, or
- use an anisotropic prior whose per-feature variances are scaled to each feature's units.
2. Prior Predictive Checks:
Before seeing data, sample from the prior and simulate predictions: $$\mathbf{w} \sim p(\mathbf{w}), \quad \mathbf{y}_{\text{simulated}} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$$
Are the simulated predictions plausible? If your prior produces predictions like "house prices of -$10 million," the prior is poorly calibrated.
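A prior predictive check is easy to script: draw weight vectors from the prior, push them through the model, and inspect the spread of simulated targets. The sketch below uses hypothetical standardized features and an isotropic prior; in a real project you would substitute your actual design matrix and units.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
N, D = 100, 5
sigma = 1.0                     # assumed noise standard deviation
alpha = 1.0                     # prior precision

X = rng.normal(size=(N, D))     # stand-in for your real (standardized) design matrix

# Draw many weight vectors from the prior and simulate targets
n_draws = 200
w_draws = rng.normal(scale=1 / np.sqrt(alpha), size=(n_draws, D))
y_sim = w_draws @ X.T + rng.normal(scale=sigma, size=(n_draws, N))

plt.hist(y_sim.ravel(), bins=60, density=True)
plt.xlabel('Simulated target value')
plt.title('Prior predictive distribution: are these values plausible for your problem?')
plt.show()
```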
3. Weakly Informative Priors:
If you lack strong prior knowledge, use weakly informative priors that:
- rule out clearly absurd values (for example, astronomically large weights), but
- remain broad enough that a reasonable amount of data can dominate them.
A common choice: 𝒩(0, s²) where s is set to cover plausible weight magnitudes.
4. Sensitivity Analysis:
Run inference with different prior settings. If conclusions change dramatically with minor prior adjustments, the data may be insufficient or the model misspecified.
Treat prior specification with the same rigor as any other modeling assumption. Document your choices, justify them, and test sensitivity. The prior is part of your model, not an afterthought.
We've established the foundational concepts for Bayesian linear regression by understanding how to place prior distributions on weights. Let's consolidate the key insights:
- Bayesian inference treats the weights as random variables: a prior p(w) before seeing data, a posterior p(w|D) afterward.
- The Gaussian prior 𝒩(m₀, S₀) is the standard choice because it is conjugate to the Gaussian likelihood and easy to interpret.
- The prior precision α controls how strongly the weights are shrunk toward the prior mean, and maps directly onto the regularization strength λ = ασ².
- Alternative priors (Laplace, spike-and-slab, horseshoe, ARD) encode sparsity and adaptive shrinkage.
- Prior specification is a modeling decision: scale your features, run prior predictive checks, and test sensitivity.
What's Next:
With the prior established, we're ready to see what we actually learn when data arrives. The next page derives the posterior distribution over weights—the result of combining our prior beliefs with the evidence from observed data. This posterior is the central object in Bayesian inference, from which all predictions and uncertainty estimates flow.
You now understand how to encode prior beliefs about regression weights as probability distributions. This is the starting point of Bayesian linear regression—the foundation upon which posterior derivation, predictive inference, and the connection to regularization all build. Next, we'll derive what happens when prior meets data: the posterior distribution.