In classical linear regression, we learn a single set of weights w that best fits our training data. We find the point estimate that minimizes squared error, and we're done. But this approach hides a profound limitation: it tells us nothing about our uncertainty.
Consider fitting a line through three data points versus fitting the same line through three thousand points. Classical regression gives you the same type of answer in both cases—a single weight vector. But intuitively, you should be far more confident in the second case. Where does that confidence live in the classical framework? It doesn't.
Bayesian linear regression fundamentally reimagines this problem. Instead of asking "What are the best weights?", we ask "What is the probability distribution over all possible weights, given our data?" This shift from point estimates to probability distributions unlocks principled uncertainty quantification, robustness to limited data, and a framework that gracefully incorporates prior knowledge.
By the end of this page, you will understand how to mathematically encode prior beliefs about regression weights, why this seemingly small change has profound implications for machine learning, and how different prior choices lead to different learning behaviors. This foundation is essential for everything that follows in Bayesian linear regression.
The core philosophical departure of Bayesian statistics is treating parameters as random variables rather than fixed but unknown constants. This isn't merely a mathematical trick—it represents a fundamentally different interpretation of what parameters mean.
The Frequentist View (Classical Linear Regression):
In the frequentist paradigm, the true weight vector w* exists as a fixed, deterministic quantity. Our training data is a random sample from some population, and our estimate ŵ is a random variable that approximates w*. The randomness comes from the data, not from the parameters.
This view leads to questions like:
- How much would the estimate ŵ vary across repeated samples from the population?
- Is the estimator unbiased, and what is its sampling variance?
The Bayesian View:
In the Bayesian paradigm, w itself is a random variable with a probability distribution. Before seeing data, we have a prior distribution p(w) representing our beliefs about likely weight values. After seeing data D, we update to a posterior distribution p(w|D) that combines prior beliefs with evidence.
This view leads to different questions:
- Given the data we actually observed, how plausible is each possible value of w?
- What is the probability that a particular weight is positive, or lies within a given range?
| Aspect | Frequentist View | Bayesian View |
|---|---|---|
| Parameters | Fixed, unknown constants | Random variables with distributions |
| Probability | Long-run frequency of events | Degree of belief or certainty |
| Estimation Goal | Find a point estimate ŵ | Compute the posterior distribution p(w\|D) |
| Uncertainty | Confidence intervals (frequentist coverage) | Credible intervals (posterior probability) |
| Prior Knowledge | Difficult to incorporate formally | Naturally incorporated via prior p(w) |
| Small Data Regime | Often problematic (overfitting) | Priors provide regularization |
The frequentist and Bayesian views are different frameworks for reasoning about uncertainty. Each has strengths. Bayesian methods excel when incorporating prior knowledge is valuable, uncertainty quantification is critical, or data is limited. Frequentist methods often have computational advantages and well-understood theoretical guarantees. Understanding both makes you a more effective practitioner.
Before we can define a prior on weights, we need a clear probabilistic model for linear regression. Let's establish the framework carefully.
The Data:
We observe N training examples, each consisting of:
- a feature vector xₙ ∈ ℝᴰ, and
- a scalar target yₙ ∈ ℝ.
We collect all features into a design matrix X ∈ ℝᴺˣᴰ and all targets into a vector y ∈ ℝᴺ.
The Generative Model:
We assume that targets are generated by a linear function of features plus Gaussian noise:
$$y_n = \mathbf{w}^\top \mathbf{x}_n + \epsilon_n$$
where:
- w ∈ ℝᴰ is the unknown weight vector,
- εₙ ~ 𝒩(0, σ²) is independent Gaussian noise with variance σ².
This can be written equivalently as:
$$y_n | \mathbf{x}_n, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}_n, \sigma^2)$$
The Likelihood Function:
Under the assumption that observations are conditionally independent given w, the likelihood of observing all data is:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | \mathbf{w}^\top \mathbf{x}_n, \sigma^2)$$
Taking the log and expanding:
$$\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2$$
Maximizing this log-likelihood with respect to w is equivalent to minimizing the sum of squared errors—exactly what ordinary least squares (OLS) does. So OLS is the maximum likelihood estimator for this probabilistic model.
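As a quick sanity check of this equivalence, the sketch below (hypothetical synthetic data, noise level assumed known) fits the same model twice: once with the OLS closed form and once by numerically maximizing the Gaussian log-likelihood. The two weight vectors should agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical synthetic data: y = X w_true + Gaussian noise
rng = np.random.default_rng(0)
N, D = 50, 3
sigma = 0.5                                # assumed (known) noise standard deviation
w_true = np.array([1.5, -2.0, 0.7])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

# 1. OLS closed form: minimizes the sum of squared errors
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2. Maximum likelihood: maximize log p(y | X, w, sigma^2)
def neg_log_likelihood(w):
    resid = y - X @ w
    return (N / 2) * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(D)).x

print("OLS:", np.round(w_ols, 4))
print("MLE:", np.round(w_mle, 4))   # matches the OLS solution up to optimizer tolerance
```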
For now, we treat σ² as known or as a hyperparameter to be set. In full Bayesian treatment, we would place a prior on σ² as well (typically an Inverse-Gamma distribution, which is conjugate to the Gaussian likelihood). We'll focus on the weights for clarity.
Matrix Notation:
For computational convenience, we write the model in matrix form:
$$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$$
where ε ~ 𝒩(0, σ²I_N), giving us:
$$\mathbf{y} | \mathbf{X}, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N)$$
This multivariate Gaussian formulation is crucial for deriving the posterior distribution efficiently.
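To confirm that the matrix form is just a repackaging of the per-observation likelihood, the snippet below (hypothetical data and parameter values) evaluates the log-likelihood both as a sum of univariate Gaussian terms and as a single multivariate Gaussian density; the two numbers should match.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
N, D = 20, 2
sigma = 0.3
w = np.array([0.8, -1.2])                 # an arbitrary weight vector
X = rng.normal(size=(N, D))
y = X @ w + rng.normal(scale=sigma, size=N)

# Per-observation form: sum_n log N(y_n | w^T x_n, sigma^2)
log_lik_sum = norm.logpdf(y, loc=X @ w, scale=sigma).sum()

# Matrix form: log N(y | Xw, sigma^2 I_N)
log_lik_mvn = multivariate_normal.logpdf(y, mean=X @ w, cov=sigma**2 * np.eye(N))

print(log_lik_sum, log_lik_mvn)           # identical up to floating-point error
```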
The prior distribution p(w) encodes our beliefs about the weights before observing any data. This is where we inject domain knowledge, regularization preferences, or express uncertainty.
The Gaussian Prior:
The most common choice is a multivariate Gaussian prior:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)$$
where:
- m₀ ∈ ℝᴰ is the prior mean (our best guess for the weights before seeing data),
- S₀ ∈ ℝᴰˣᴰ is the prior covariance matrix (how uncertain we are, and how the weights co-vary).
Why Gaussian?
Conjugacy: Gaussian priors are conjugate to the Gaussian likelihood, meaning the posterior is also Gaussian. This allows closed-form solutions.
Interpretability: The prior mean represents our initial estimate; the prior covariance represents how uncertain we are and how features relate.
Flexibility: By choosing m₀ and S₀ appropriately, we can encode a wide range of prior beliefs.
Central Limit Theorem: In many settings, the Gaussian emerges as a natural limiting distribution.
The Isotropic Gaussian Prior (Most Common):
Often, we use a zero-mean isotropic prior:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I}_D)$$
where:
- α > 0 is the prior precision (inverse variance), shared across all D weights,
- α⁻¹ is the prior variance of each weight, and I_D is the D×D identity matrix.
This prior says: "Before seeing data, I believe each weight is equally likely to be positive or negative, with magnitude controlled by α."
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Define a 2D Gaussian prior for visualization
prior_mean = np.array([0.0, 0.0])
prior_precision = 1.0  # alpha
prior_cov = (1.0 / prior_precision) * np.eye(2)

# Create a grid for visualization
w1 = np.linspace(-3, 3, 100)
w2 = np.linspace(-3, 3, 100)
W1, W2 = np.meshgrid(w1, w2)
pos = np.dstack((W1, W2))

# Evaluate the prior density
prior_dist = multivariate_normal(prior_mean, prior_cov)
Z = prior_dist.pdf(pos)

# Visualize
plt.figure(figsize=(8, 6))
plt.contourf(W1, W2, Z, levels=20, cmap='Blues')
plt.colorbar(label='Prior Density p(w)')
plt.xlabel('$w_1$')
plt.ylabel('$w_2$')
plt.title('Isotropic Gaussian Prior: $w \\sim \\mathcal{N}(0, I)$')
plt.axis('equal')
plt.show()

# The prior expresses that:
# - We have no initial preference (mean at origin)
# - Weights near zero are most likely
# - Large weights are increasingly unlikely
# - The precision alpha controls this "shrinkage"
```

A common misconception is that priors inject arbitrary bias. In reality, priors formalize assumptions that all methods make implicitly. OLS implicitly assumes a uniform prior (all weight values equally likely), which is actually a very strong assumption: it says infinite weights are just as plausible as small ones. The Gaussian prior is often more realistic.
The prior precision α (or equivalently, the prior variance α⁻¹) is a critical hyperparameter that controls the strength of our prior beliefs.
High Precision (Large α):
- The prior variance α⁻¹ is small, so we strongly believe the weights lie close to zero (the prior mean).
- The data must provide substantial evidence to pull the weights away from zero.

Low Precision (Small α):
- The prior variance α⁻¹ is large, so the prior is broad and only weakly constrains the weights.
- Even modest amounts of data quickly dominate the prior.
The Limit α → 0 (Improper Flat Prior):
As α → 0, the prior variance → ∞, and we approach an "uninformative" or flat prior. In this limit, the Bayesian posterior mode matches the OLS solution. However, truly flat priors are technically improper (don't integrate to 1) and can cause issues.
The Limit α → ∞ (Infinitely Informative Prior):
As α → ∞, the prior collapses to a point mass at zero. No matter what data we see, the weights stay at zero. The prior completely dominates.
| Prior Precision α | Prior Variance | Effect on Weights | Analogy |
|---|---|---|---|
| α → 0 | ∞ (flat) | OLS solution (data dominates) | No regularization |
| α = 0.1 | 10.0 | Mild shrinkage toward zero | Light regularization |
| α = 1.0 | 1.0 | Moderate shrinkage | Standard regularization |
| α = 10.0 | 0.1 | Strong shrinkage to zero | Heavy regularization |
| α → ∞ | 0 (point mass) | Weights fixed at zero | Infinite regularization |
The Regularization Interpretation:
We'll explore this deeply in a later page, but the key insight is:
$$\text{Posterior Mode} = \arg\max_\mathbf{w} \left[ \log p(\mathbf{y}|\mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) \right]$$
For a Gaussian prior with precision α and Gaussian likelihood with noise variance σ²:
$$\text{Posterior Mode} = \arg\min_\mathbf{w} \left[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \alpha\sigma^2 \|\mathbf{w}\|^2 \right]$$
This is exactly Ridge Regression with regularization parameter λ = ασ². The Bayesian prior directly corresponds to the regularization term!
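To make explicit where the factor λ = ασ² comes from, write out the two log terms, dropping constants that do not depend on w:

$$-\log p(\mathbf{y}|\mathbf{X}, \mathbf{w}) - \log p(\mathbf{w}) = \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\alpha}{2}\|\mathbf{w}\|^2 + \text{const}$$

Multiplying by 2σ² leaves the minimizer unchanged and yields exactly the objective above, so the effective regularization parameter is λ = ασ².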
This connection reveals that every regularized method implicitly assumes some prior. Bayesian inference makes this assumption explicit and principled.
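The sketch below (hypothetical synthetic data; σ² treated as known) makes the correspondence concrete: for a sweep of prior precisions α it computes the ridge solution with λ = ασ², reproducing the shrinkage pattern from the table above, and then cross-checks one value of α against a direct numerical maximization of the log-posterior.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, D = 40, 4
sigma = 0.5                              # assumed known noise standard deviation
w_true = np.array([2.0, -1.0, 0.5, 0.0])
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=sigma, size=N)

def ridge_solution(lam):
    """Closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Sweep the prior precision alpha; the effective ridge penalty is lambda = alpha * sigma^2
for alpha in [1e-6, 0.1, 1.0, 10.0, 1e6]:
    w_map = ridge_solution(alpha * sigma**2)
    print(f"alpha={alpha:>8}: ||w_MAP|| = {np.linalg.norm(w_map):.4f}")
# Small alpha -> close to OLS; large alpha -> weights shrunk toward zero.

# Cross-check one alpha by minimizing the negative log-posterior directly
alpha = 1.0
def neg_log_posterior(w):
    resid = y - X @ w
    return resid @ resid / (2 * sigma**2) + alpha * (w @ w) / 2

w_opt = minimize(neg_log_posterior, x0=np.zeros(D)).x
print(np.allclose(w_opt, ridge_solution(alpha * sigma**2), atol=1e-4))  # True
```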
α can be selected via cross-validation (treat it as a hyperparameter), empirical Bayes (maximize marginal likelihood), or full Bayesian treatment (place a hyperprior on α and integrate it out). Each approach has different computational and philosophical tradeoffs.
While the isotropic zero-mean prior is common, real problems often benefit from more sophisticated prior specifications.
Non-Zero Mean:
If we have prior knowledge that certain features should have positive or negative effects, we can encode this:
$$\mathbf{w} \sim \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)$$
with m₀ ≠ 0. For example, in a housing price model, we might believe:
- the weight on square footage is positive (larger homes tend to cost more),
- the weight on distance from the city center is negative,
- the weight on building age is negative, though we are less sure of its magnitude.

One way to encode such beliefs is sketched below.
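All feature names, units, and numbers in this sketch are hypothetical illustrations; in practice they would come from domain knowledge or previous studies.

```python
import numpy as np

# Hypothetical housing model with three standardized features:
# [square footage, distance to city center, building age]
feature_names = ["sqft", "dist_center", "age"]

# Prior mean: expected direction and rough magnitude of each effect
m0 = np.array([0.6, -0.4, -0.1])     # positive, negative, slightly negative

# Prior covariance: how confident we are in each guess
# (smaller variance = stronger belief; here we are most sure about sqft)
S0 = np.diag([0.2**2, 0.3**2, 0.5**2])

for name, mean, var in zip(feature_names, m0, np.diag(S0)):
    print(f"{name:>12}: prior mean {mean:+.2f}, prior std {np.sqrt(var):.2f}")
```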
Diagonal (Anisotropic) Prior:
Different features may warrant different prior variances:
$$\mathbf{S}_0 = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_D^2)$$
This is useful when:
- features are measured on very different scales,
- we have stronger prior knowledge about some coefficients than others,
- some features are suspected to be weak or irrelevant and should be shrunk more aggressively.
Full Covariance Prior:
For correlated features, a full covariance matrix captures prior beliefs about relationships:
$$\mathbf{S}_0 = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$
Positive correlation in S₀ says: "If weight 1 is large, weight 2 is likely large too." This can encode structural knowledge about feature relationships.
```python
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Create grid for visualization
w1 = np.linspace(-4, 4, 100)
w2 = np.linspace(-4, 4, 100)
W1, W2 = np.meshgrid(w1, w2)
pos = np.dstack((W1, W2))

# 1. Isotropic Prior
mean_iso = [0, 0]
cov_iso = [[1, 0], [0, 1]]
Z_iso = multivariate_normal(mean_iso, cov_iso).pdf(pos)
axes[0].contourf(W1, W2, Z_iso, levels=15, cmap='Blues')
axes[0].set_title('Isotropic: $S_0 = I$')
axes[0].set_xlabel('$w_1$'); axes[0].set_ylabel('$w_2$')
axes[0].set_aspect('equal')

# 2. Anisotropic Prior (different variances)
mean_aniso = [0, 0]
cov_aniso = [[0.5, 0], [0, 2.0]]  # w1 more constrained than w2
Z_aniso = multivariate_normal(mean_aniso, cov_aniso).pdf(pos)
axes[1].contourf(W1, W2, Z_aniso, levels=15, cmap='Greens')
axes[1].set_title('Anisotropic: $\\sigma_1^2=0.5, \\sigma_2^2=2$')
axes[1].set_xlabel('$w_1$'); axes[1].set_ylabel('$w_2$')
axes[1].set_aspect('equal')

# 3. Correlated Prior
mean_corr = [1, 0.5]  # Non-zero mean
cov_corr = [[1.0, 0.7], [0.7, 1.0]]  # Positive correlation
Z_corr = multivariate_normal(mean_corr, cov_corr).pdf(pos)
axes[2].contourf(W1, W2, Z_corr, levels=15, cmap='Oranges')
axes[2].set_title('Correlated: $\\rho=0.7$, mean=(1, 0.5)')
axes[2].set_xlabel('$w_1$'); axes[2].set_ylabel('$w_2$')
axes[2].set_aspect('equal')

plt.tight_layout()
plt.show()

# Key insight: The prior shape guides learning
# - Isotropic: All directions equally constrained
# - Anisotropic: Some features more flexible
# - Correlated: Encodes relationships between features
```

Constructing informative priors requires domain expertise. In practice, priors often come from previous studies (meta-analysis), expert knowledge (interviews with domain experts), or hierarchical models (learning priors from related tasks). Poor prior elicitation can harm performance, so sensitivity analysis is important.
While Gaussian priors are convenient and interpretable, other prior families encode different assumptions about weight structure.
Laplace Prior (Sparse Weights):
$$p(w_j) \propto \exp(-\lambda |w_j|)$$
The Laplace (double exponential) prior has a sharp peak at zero and heavy tails. This encourages sparsity—many weights exactly or nearly zero, with a few large weights. The Laplace prior corresponds to L1 regularization (Lasso).
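A quick way to see why the Laplace prior encourages sparsity is to compare its density with a Gaussian of the same variance: the Laplace puts more mass both very close to zero and far out in the tails. A minimal sketch (scales chosen arbitrarily for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import laplace, norm

w = np.linspace(-5, 5, 500)

# Match variances: Laplace(scale=b) has variance 2*b^2, so b = 1/sqrt(2) gives variance 1
laplace_pdf = laplace.pdf(w, scale=1 / np.sqrt(2))
gauss_pdf = norm.pdf(w, scale=1.0)

plt.figure(figsize=(7, 4))
plt.plot(w, gauss_pdf, label='Gaussian $\\mathcal{N}(0, 1)$')
plt.plot(w, laplace_pdf, label='Laplace (same variance)')
plt.xlabel('$w_j$')
plt.ylabel('Prior density')
plt.title('Laplace vs. Gaussian prior: sharper peak at zero, heavier tails')
plt.legend()
plt.show()
```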
Spike-and-Slab Prior (Explicit Sparsity):
$$p(w_j) = \pi \cdot \delta(w_j) + (1-\pi) \cdot \mathcal{N}(w_j | 0, \sigma^2)$$
A mixture of a point mass at zero (spike) and a diffuse Gaussian (slab). This explicitly models the belief that each weight is either exactly zero or drawn from a continuous distribution. Powerful for feature selection but computationally challenging.
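Because the spike-and-slab prior is a simple two-component mixture, sampling from it is straightforward. The sketch below draws weights that are exactly zero with probability π and Gaussian otherwise (π and the slab standard deviation are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_spike_and_slab(n_weights, pi=0.7, slab_std=1.0):
    """Each weight is exactly 0 with probability pi, else drawn from N(0, slab_std^2)."""
    is_spike = rng.random(n_weights) < pi
    slab_draws = rng.normal(scale=slab_std, size=n_weights)
    return np.where(is_spike, 0.0, slab_draws)

w_sample = sample_spike_and_slab(20)
print(np.round(w_sample, 2))
print("Fraction exactly zero:", np.mean(w_sample == 0.0))
```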
Horseshoe Prior (Adaptive Shrinkage):
$$w_j | \lambda_j \sim \mathcal{N}(0, \lambda_j^2), \quad \lambda_j \sim \text{Half-Cauchy}(0, \tau)$$
A hierarchical prior where each weight has its own scale parameter λⱼ with a heavy-tailed distribution. This provides adaptive shrinkage: truly zero weights are shrunk aggressively, while large true weights experience less shrinkage.
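Sampling from the horseshoe prior follows its hierarchical definition directly: draw a local scale λⱼ from a half-Cauchy, then draw the weight from a Gaussian with that scale. A minimal sketch (τ chosen arbitrarily):

```python
import numpy as np
from scipy.stats import halfcauchy
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n_weights = 10_000
tau = 1.0                                       # global scale (illustrative value)

# Local scales lambda_j ~ Half-Cauchy(0, tau), then w_j | lambda_j ~ N(0, lambda_j^2)
lam = halfcauchy.rvs(scale=tau, size=n_weights, random_state=rng)
w = rng.normal(scale=lam)

plt.hist(w, bins=200, range=(-10, 10), density=True)
plt.xlabel('$w_j$')
plt.title('Samples from the horseshoe prior: sharp spike at zero, very heavy tails')
plt.show()
```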
Automatic Relevance Determination (ARD):
$$w_j \sim \mathcal{N}(0, \alpha_j^{-1})$$
Each weight has its own precision αⱼ, learned from data. If αⱼ → ∞ during learning, the corresponding feature is deemed irrelevant. ARD provides automatic feature selection within a Gaussian framework.
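Under ARD the prior itself is still Gaussian, just with a separate precision per weight. The sketch below uses hand-picked αⱼ values purely for illustration (in the full ARD procedure they would be learned from data) to show how a very large αⱼ effectively pins the corresponding weight to zero:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical per-feature precisions: the last two features are "switched off"
alpha_j = np.array([1.0, 1.0, 1e6, 1e6])
prior_std = 1.0 / np.sqrt(alpha_j)

# Draw a few weight vectors from the ARD prior w_j ~ N(0, alpha_j^{-1})
samples = rng.normal(scale=prior_std, size=(5, len(alpha_j)))
print(np.round(samples, 3))
# Columns with huge alpha_j are essentially zero: those features are pruned.
```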
| Prior Family | Sparsity | Conjugate? | Regularization Analog | Computation |
|---|---|---|---|---|
| Gaussian (isotropic) | None | Yes | Ridge (L2) | Easy (closed-form) |
| Gaussian (ARD) | Soft | Yes (given αⱼ) | Adaptive L2 | Moderate (iterative) |
| Laplace | Yes | No | Lasso (L1) | Moderate (no closed-form) |
| Spike-and-Slab | Yes (exact) | No | Best subset | Hard (discrete) |
| Horseshoe | Yes (adaptive) | No | Adaptive L2 | Moderate (MCMC) |
Use sparse priors when you believe: (1) most features are irrelevant (high D, many noise features), (2) interpretability is important (identifying which features matter), or (3) the true model is genuinely sparse. For dense signals where all features contribute, Gaussian priors are often more appropriate.
One of the most profound insights from Bayesian linear regression is that every regularization scheme corresponds to some prior distribution. This isn't just a mathematical curiosity—it provides deep insight into what regularization actually means.
The MAP Estimator:
The Maximum A Posteriori (MAP) estimate is the mode of the posterior:
$$\mathbf{w}_{\text{MAP}} = \arg\max_\mathbf{w} p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \arg\max_\mathbf{w} \left[ p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w}) \right]$$
Taking logs:
$$\mathbf{w}_{\text{MAP}} = \arg\min_\mathbf{w} \left[ -\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) - \log p(\mathbf{w}) \right]$$
The first term is the negative log-likelihood (data fit). The second term is the negative log-prior (regularization).
Specific Correspondences:
| Prior Distribution | MAP Objective (Regularization) |
|---|---|
| Gaussian: 𝒩(0, α⁻¹I) | Ridge: ∥y - Xw∥² + λ∥w∥² |
| Laplace: ∏ exp(-λ\|wⱼ\|) | Lasso: ∥y - Xw∥² + λ∥w∥₁ |
| Uniform (improper) | OLS: ∥y - Xw∥² (no regularization) |
| Elastic Net prior | Elastic Net: ∥y - Xw∥² + λ₁∥w∥₁ + λ₂∥w∥² |
While the MAP estimator reveals the prior-regularization connection, it's only a point estimate. Full Bayesian inference uses the entire posterior distribution, providing uncertainty quantification that MAP alone cannot offer. Think of MAP as a bridge between frequentist and Bayesian thinking, but not the full power of the Bayesian approach.
Why This Matters:
Principled Hyperparameter Interpretation: Regularization strength λ isn't arbitrary—it's the ratio of prior precision to noise precision. Setting λ = 1 means you trust the prior equally to one data point.
Informed Prior Design: Want L1-like sparsity? Use a Laplace prior. Want smooth solutions? Use a prior that penalizes weight differences.
Unified Framework: All regularization methods become instances of Bayesian inference with different priors. This unified view clarifies when each method is appropriate.
Beyond Point Estimates: Once you recognize the prior, you can do full Bayesian inference with it—getting posterior distributions, not just MAP estimates.
Setting priors in practice requires balancing domain knowledge, computational tractability, and robustness. Here are key principles:
1. Feature Scaling and Priors:
If features are on different scales, an isotropic prior may be inappropriate. A weight of 0.01 for a feature measured in millions differs from 0.01 for a binary feature. Options:
- standardize features (zero mean, unit variance) before fitting, so a shared prior variance is reasonable, or
- use an anisotropic prior whose per-feature variances are scaled to each feature's units.
2. Prior Predictive Checks:
Before seeing data, sample from the prior and simulate predictions: $$\mathbf{w} \sim p(\mathbf{w}), \quad \mathbf{y}_{\text{simulated}} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$$
Are the simulated predictions plausible? If your prior produces predictions like "house prices of -$10 million," the prior is poorly calibrated.
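A prior predictive check is easy to script: draw weight vectors from the prior, push them through the model, and inspect the spread of simulated targets. The sketch below uses hypothetical standardized features and an isotropic prior; in a real project you would substitute your actual design matrix and units.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
N, D = 100, 5
sigma = 1.0                     # assumed noise standard deviation
alpha = 1.0                     # prior precision

X = rng.normal(size=(N, D))     # stand-in for your real (standardized) design matrix

# Draw many weight vectors from the prior and simulate targets
n_draws = 200
w_draws = rng.normal(scale=1 / np.sqrt(alpha), size=(n_draws, D))
y_sim = w_draws @ X.T + rng.normal(scale=sigma, size=(n_draws, N))

plt.hist(y_sim.ravel(), bins=60, density=True)
plt.xlabel('Simulated target value')
plt.title('Prior predictive distribution: are these values plausible for your problem?')
plt.show()
```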
3. Weakly Informative Priors:
If you lack strong prior knowledge, use weakly informative priors that:
- rule out clearly absurd values (for example, astronomically large weights), but
- remain broad enough that a reasonable amount of data can dominate them.
A common choice: 𝒩(0, s²) where s is set to cover plausible weight magnitudes.
4. Sensitivity Analysis:
Run inference with different prior settings. If conclusions change dramatically with minor prior adjustments, the data may be insufficient or the model misspecified.
Treat prior specification with the same rigor as any other modeling assumption. Document your choices, justify them, and test sensitivity. The prior is part of your model, not an afterthought.
We've established the foundational concepts for Bayesian linear regression by understanding how to place prior distributions on weights. Let's consolidate the key insights:
- Bayesian inference treats the weights as random variables: a prior p(w) before seeing data, a posterior p(w|D) afterward.
- The Gaussian prior 𝒩(m₀, S₀) is the standard choice because it is conjugate to the Gaussian likelihood and easy to interpret.
- The prior precision α controls how strongly the weights are shrunk toward the prior mean, and maps directly onto the regularization strength λ = ασ².
- Alternative priors (Laplace, spike-and-slab, horseshoe, ARD) encode sparsity and adaptive shrinkage.
- Prior specification is a modeling decision: scale your features, run prior predictive checks, and test sensitivity.
What's Next:
With the prior established, we're ready to see what we actually learn when data arrives. The next page derives the posterior distribution over weights—the result of combining our prior beliefs with the evidence from observed data. This posterior is the central object in Bayesian inference, from which all predictions and uncertainty estimates flow.
You now understand how to encode prior beliefs about regression weights as probability distributions. This is the starting point of Bayesian linear regression—the foundation upon which posterior derivation, predictive inference, and the connection to regularization all build. Next, we'll derive what happens when prior meets data: the posterior distribution.