With our prior beliefs about weights encoded as a probability distribution, we're now ready to answer the central question of Bayesian inference: How should we update our beliefs when we observe data?
The answer is the posterior distribution p(w|y, X)—a probability distribution over weights that incorporates both our prior knowledge and the evidence from training data. The posterior is the complete summary of what we know about the weights after learning.
What makes Bayesian linear regression special is that, with Gaussian priors and likelihoods, the posterior has a closed-form solution. We don't need sampling or approximations—we can write down the exact posterior distribution analytically. This is the reward of conjugacy: mathematical tractability without sacrificing principled inference.
By the end of this page, you will be able to derive the posterior distribution over regression weights from first principles, understand the geometric interpretation of Bayesian updating, and see how the posterior elegantly balances prior beliefs against data evidence.
The foundation of Bayesian inference is Bayes' theorem. For weights w, given observed data y and input features X:
$$p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \frac{p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w})}{p(\mathbf{y} | \mathbf{X})}$$
Let's unpack each term:
Posterior p(w|y, X): This is what we want to compute—the probability distribution over weights given the observed data. It represents our updated beliefs.
Likelihood p(y|X, w): The probability of observing the target values y if the weights were w. This measures how well the weights explain the data.
Prior p(w): Our beliefs about weights before seeing data. From the previous page, we use: $$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_0, \mathbf{S}_0)$$
Marginal Likelihood p(y|X): Also called the evidence or model evidence. It's the probability of the data averaged over all possible weights: $$p(\mathbf{y} | \mathbf{X}) = \int p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w}) \, d\mathbf{w}$$
For computing the posterior, this is a normalizing constant—we'll return to its interpretation in the model comparison page.
Since p(y|X) doesn't depend on w, we often write:
p(w|y, X) ∝ p(y|X, w) · p(w)
Posterior ∝ Likelihood × Prior
We can find the posterior by multiplying likelihood and prior, then normalizing. For Gaussians, this normalization happens automatically because the product of Gaussians is Gaussian.
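As a concrete one-dimensional illustration of this point, the sketch below multiplies a Gaussian prior by a Gaussian likelihood on a grid, normalizes numerically, and compares the result with the closed-form Gaussian whose precision is the sum of the two. The specific numbers (prior_mean, obs, and so on) are made up for the example.

```python
import numpy as np
from scipy.stats import norm

# 1D illustration: posterior ∝ likelihood × prior, and the product is again Gaussian.
prior_mean, prior_var = 0.0, 2.0    # prior  w ~ N(0, 2)
obs, noise_var = 1.5, 0.5           # model  y = w + noise, observed y = 1.5

w = np.linspace(-4.0, 4.0, 2001)
dw = w[1] - w[0]
unnormalized = norm.pdf(obs, loc=w, scale=np.sqrt(noise_var)) * norm.pdf(w, prior_mean, np.sqrt(prior_var))
posterior_grid = unnormalized / (unnormalized.sum() * dw)   # numerical normalization

# Closed form: precisions add, and the mean is a precision-weighted combination.
post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
post_mean = post_var * (prior_mean / prior_var + obs / noise_var)

# Grid result matches the analytic Gaussian (difference limited only by grid truncation).
print(np.max(np.abs(posterior_grid - norm.pdf(w, post_mean, np.sqrt(post_var)))))
```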
Recall our probabilistic model: observations are generated by a linear function plus Gaussian noise:
$$\mathbf{y} | \mathbf{X}, \mathbf{w} \sim \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N)$$
The likelihood function is the probability density of y treated as a function of w:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = \mathcal{N}(\mathbf{y} | \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N)$$
Expanding the Gaussian pdf:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) \right)$$
Useful Rewriting:
For the derivation, we'll need the log-likelihood (ignoring constants not depending on w):
$$\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = -\frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \text{const}$$
Expanding the quadratic:
$$\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$$
This is quadratic in w, which is crucial for the Gaussian posterior we'll derive.
Precision Notation:
It's often convenient to work with the noise precision β = 1/σ² instead of variance. This simplifies many expressions:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \propto \exp\left( -\frac{\beta}{2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 \right)$$
The likelihood is maximized when w = (XᵀX)⁻¹Xᵀy—the ordinary least squares solution. The posterior will be "pulled" toward this maximum likelihood estimate, with the prior providing counterbalancing force.
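As a quick numerical aside (not part of the derivation), the snippet below checks that the normal-equation formula matches np.linalg.lstsq on synthetic data; the data and seed are arbitrary choices for illustration.

```python
import numpy as np

# Maximum-likelihood weights = ordinary least squares (illustrative synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.3 * rng.normal(size=50)

w_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y      # (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)     # numerically preferred solver

print(np.allclose(w_normal_eq, w_lstsq))   # True: both give the OLS / max-likelihood solution
```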
Our Gaussian prior on weights is:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_0, \mathbf{S}_0)$$
Expanding:
$$p(\mathbf{w}) = (2\pi)^{-D/2} |\mathbf{S}_0|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^\top \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) \right)$$
In log form (ignoring constants):
$$\log p(\mathbf{w}) = -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^\top \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) + \text{const}$$
The Common Case (Zero-Mean Isotropic Prior):
For the standard zero-mean isotropic prior with precision α:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \alpha^{-1}\mathbf{I})$$
This simplifies to:
$$\log p(\mathbf{w}) = -\frac{\alpha}{2} \mathbf{w}^\top\mathbf{w} + \text{const}$$
The prior log-probability is quadratic in w, just like the log-likelihood. This shared quadratic structure is why the product remains Gaussian.
The inverse covariance matrix S₀⁻¹ is called the precision matrix or information matrix. Working with precision matrices often leads to cleaner formulas in Bayesian inference, especially when combining information from multiple sources.
Now we combine prior and likelihood using Bayes' theorem. The posterior is proportional to their product:
$$\log p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) + \text{const}$$
Substituting our expressions:
$$\log p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = -\frac{\beta}{2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) - \frac{1}{2}(\mathbf{w} - \mathbf{m}_0)^\top\mathbf{S}_0^{-1}(\mathbf{w} - \mathbf{m}_0) + \text{const}$$
Key Observation: This is still quadratic in w! Since the log of a Gaussian density is quadratic, and the sum of quadratics is quadratic, the posterior must be Gaussian.
Let's expand and group terms to find the posterior mean mₙ and covariance Sₙ.
Step 1: Expand the Quadratics
Likelihood term: $$-\frac{\beta}{2}(\mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w})$$
Prior term: $$-\frac{1}{2}(\mathbf{w}^\top\mathbf{S}_0^{-1}\mathbf{w} - 2\mathbf{w}^\top\mathbf{S}_0^{-1}\mathbf{m}_0 + \mathbf{m}_0^\top\mathbf{S}_0^{-1}\mathbf{m}_0)$$
Step 2: Collect Terms by Power of w
Quadratic in w: $$-\frac{1}{2}\mathbf{w}^\top(\beta\mathbf{X}^\top\mathbf{X} + \mathbf{S}_0^{-1})\mathbf{w}$$
Linear in w: $$\mathbf{w}^\top(\beta\mathbf{X}^\top\mathbf{y} + \mathbf{S}_0^{-1}\mathbf{m}_0)$$
Step 3: Complete the Square
A Gaussian with mean μ and precision matrix Λ has log-density: $$-\frac{1}{2}(\mathbf{w} - \boldsymbol{\mu})^\top\mathbf{\Lambda}(\mathbf{w} - \boldsymbol{\mu}) = -\frac{1}{2}\mathbf{w}^\top\mathbf{\Lambda}\mathbf{w} + \mathbf{w}^\top\mathbf{\Lambda}\boldsymbol{\mu} + \text{const}$$
Comparing with our collected terms: $$\mathbf{\Lambda}_N = \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\mathbf{X}^\top\mathbf{X}$$ $$\mathbf{\Lambda}_N\mathbf{m}_N = \mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\mathbf{X}^\top\mathbf{y}$$
p(w|y, X) = 𝒩(w | mₙ, Sₙ) where:
Posterior Precision: Sₙ⁻¹ = S₀⁻¹ + β X⊤X
Posterior Mean: mₙ = Sₙ(S₀⁻¹m₀ + β X⊤y)
Equivalently, the posterior covariance is: Sₙ = (S₀⁻¹ + β X⊤X)⁻¹
The most common scenario uses a zero-mean isotropic prior: m₀ = 0 and S₀ = α⁻¹I. The posterior simplifies beautifully:
Posterior Precision: $$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\mathbf{X}^\top\mathbf{X}$$
Posterior Covariance: $$\mathbf{S}_N = (\alpha\mathbf{I} + \beta\mathbf{X}^\top\mathbf{X})^{-1}$$
Posterior Mean: $$\mathbf{m}_N = \beta\mathbf{S}_N\mathbf{X}^\top\mathbf{y} = \beta(\alpha\mathbf{I} + \beta\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
Connection to Ridge Regression:
With λ = α/β = ασ², this posterior mean is exactly the Ridge regression solution:
$$\mathbf{m}_N = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
This confirms that Ridge regression computes the posterior mean of a Bayesian linear regression with Gaussian prior. But Bayesian inference gives us more—the entire posterior distribution, not just the mean.
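Before the full implementation below, here is a short check of this Ridge connection on synthetic data of my choosing: with λ = α/β, the Bayesian posterior mean should coincide with the Ridge solution.

```python
import numpy as np

# Verify numerically that the posterior mean equals the Ridge solution (illustrative data).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -0.5, 2.0, 0.0]) + 0.2 * rng.normal(size=30)

alpha, beta = 2.0, 25.0          # prior precision, noise precision
lam = alpha / beta               # equivalent Ridge penalty

S_N = np.linalg.inv(alpha * np.eye(4) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ y                                      # Bayesian posterior mean
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)   # Ridge solution

print(np.allclose(m_N, w_ridge))   # True
```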
```python
import numpy as np
from numpy.linalg import inv

def compute_bayesian_posterior(X, y, alpha, beta, m0=None, S0=None):
    """
    Compute the posterior distribution for Bayesian linear regression.

    Parameters
    ----------
    X : ndarray (N, D)
        Design matrix of N observations with D features
    y : ndarray (N,)
        Target values
    alpha : float
        Prior precision (inverse variance of prior)
    beta : float
        Noise precision (inverse of noise variance sigma^2)
    m0 : ndarray (D,), optional
        Prior mean. Defaults to zeros.
    S0 : ndarray (D, D), optional
        Prior covariance. Defaults to (1/alpha) * I

    Returns
    -------
    m_N : ndarray (D,)
        Posterior mean
    S_N : ndarray (D, D)
        Posterior covariance
    """
    N, D = X.shape

    # Default prior
    if m0 is None:
        m0 = np.zeros(D)
    if S0 is None:
        S0 = (1.0 / alpha) * np.eye(D)

    # Prior precision
    S0_inv = inv(S0)  # For isotropic prior: alpha * I

    # Posterior precision (information from prior + information from data)
    S_N_inv = S0_inv + beta * X.T @ X

    # Posterior covariance
    S_N = inv(S_N_inv)

    # Posterior mean (weighted combination of prior and data)
    m_N = S_N @ (S0_inv @ m0 + beta * X.T @ y)

    return m_N, S_N


# Example: Simple linear regression with 10 data points
np.random.seed(42)

# True weights
w_true = np.array([1.5, -2.0])

# Generate synthetic data
N = 10
X = np.random.randn(N, 2)
noise_std = 0.5
y = X @ w_true + noise_std * np.random.randn(N)

# Bayesian inference
alpha = 0.5                 # Prior precision
beta = 1.0 / noise_std**2   # Noise precision

m_N, S_N = compute_bayesian_posterior(X, y, alpha, beta)

print("True weights:", w_true)
print("Posterior mean:", m_N)
print("Posterior std:", np.sqrt(np.diag(S_N)))
print("\nPosterior covariance matrix:")
print(S_N)
```

The posterior distribution encodes everything we know about the weights after observing data. Let's unpack its components.
The Posterior Mean mₙ:
This is our "best guess" for the weights—the point estimate that minimizes expected squared error under the posterior. It balances two forces: the prior term S₀⁻¹m₀ pulls the mean toward m₀, while the data term β X⊤y pulls it toward the maximum-likelihood (OLS) solution.
With more data (larger XᵀX), the data term dominates and mₙ approaches the OLS solution.
The Posterior Covariance Sₙ:
This measures our remaining uncertainty about the weights after seeing data. Important properties:
Sₙ ≤ S₀: Posterior variance is never greater than prior variance (in the matrix sense). Data always reduces uncertainty or leaves it unchanged.
Inverse of precision sum: Sₙ⁻¹ = S₀⁻¹ + βXᵀX. Precision (certainty) adds: prior precision + data precision = posterior precision.
Direction dependence: Uncertainty shrinks most in directions where data is most informative (large eigenvalues of XᵀX).
| Condition | Effect on Posterior Mean | Effect on Posterior Covariance |
|---|---|---|
| Strong prior (large α) | Pulled toward m₀ | Small (prior dominates) |
| Weak prior (small α) | Close to OLS solution | Larger (constrained mainly by the data) |
| Low noise (large β) | Close to OLS solution | Small (data is reliable) |
| High noise (small β) | Between prior and OLS | Large (data is noisy) |
| Many observations (large N) | Approaches OLS | Shrinks as O(1/N) |
| Few observations | Prior-dominated | Remains large |
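The sketch below (with illustrative data and hyperparameter values chosen for the example) makes two rows of this table concrete: a strong prior pulls the posterior mean toward zero, while a weak prior with many observations drives it toward the OLS solution.

```python
import numpy as np

def posterior_mean(X, y, alpha, beta):
    # Zero-mean isotropic prior: m_N = beta * (alpha I + beta X^T X)^{-1} X^T y
    D = X.shape[1]
    return beta * np.linalg.solve(alpha * np.eye(D) + beta * X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])
X = rng.normal(size=(200, 2))
y = X @ w_true + 0.5 * rng.normal(size=200)
beta = 1.0 / 0.5**2

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS (N=200):        ", w_ols)
print("Strong prior, N=10: ", posterior_mean(X[:10], y[:10], 1e3, beta))   # shrunk toward 0
print("Weak prior,   N=10: ", posterior_mean(X[:10], y[:10], 1e-3, beta))  # near OLS of the subset
print("Weak prior,   N=200:", posterior_mean(X, y, 1e-3, beta))            # approaches full OLS
```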
The posterior precision is the sum of prior precision and data precision:
Sₙ⁻¹ = S₀⁻¹ + β X⊤X
This is a general principle in Bayesian inference: information from independent sources adds. The Fisher information from the likelihood (β X⊤X) augments the prior information (S₀⁻¹).
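A small numerical sketch of this additivity, using arbitrary synthetic data: build the prior and data precision matrices separately, form the posterior covariance from their sum, and confirm that the posterior variance along each eigendirection of X⊤X equals 1/(α + β·eigenvalue), smallest where the data are most informative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
alpha, beta = 1.0, 4.0

prior_precision = alpha * np.eye(3)     # S_0^{-1} for the isotropic prior
data_precision = beta * X.T @ X         # information contributed by the likelihood
S_N = np.linalg.inv(prior_precision + data_precision)

# Along each eigendirection of X^T X, the posterior variance is 1 / (alpha + beta * eigenvalue).
evals, evecs = np.linalg.eigh(X.T @ X)
post_var_along_axes = np.diag(evecs.T @ S_N @ evecs)
print(np.allclose(post_var_along_axes, 1.0 / (alpha + beta * evals)))  # True
print("prior variance:", 1.0 / alpha, "| posterior variances along data axes:", post_var_along_axes)
```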
One elegant property of Bayesian inference is sequential updating: we can incorporate data one point at a time and arrive at exactly the same posterior as batch processing.
The Procedure:
1. Start with the prior p(w) = 𝒩(m₀, S₀).
2. Observe a data point and compute the posterior using the update equations above.
3. Treat that posterior as the prior for the next data point.
4. Repeat until all data have been processed.
Why It Works:
Bayesian updating is associative: independent observations enter as multiplicative likelihood factors, so they can be folded in one at a time. For data points 1 and 2:
$$p(\mathbf{w}|y_1, y_2) \propto p(y_2|\mathbf{w}) \cdot p(y_1|\mathbf{w}) \cdot p(\mathbf{w})$$ $$= p(y_2|\mathbf{w}) \cdot \left[ p(y_1|\mathbf{w}) \cdot p(\mathbf{w}) \right]$$ $$= p(y_2|\mathbf{w}) \cdot p(\mathbf{w}|y_1)$$
The order doesn't matter—same posterior whether we process (1, then 2) or (2, then 1) or (1+2 together).
Practical Implications: Sequential updating makes Bayesian linear regression natural for online and streaming settings—each new observation updates the posterior without reprocessing old data, and only the current posterior mean and covariance need to be stored.
```python
import numpy as np
from numpy.linalg import inv

def sequential_bayesian_update(X, y, alpha, beta, verbose=False):
    """
    Demonstrate sequential Bayesian updating: process data one point at a time.
    Returns the same posterior as batch processing.
    """
    N, D = X.shape

    # Initial prior
    m_curr = np.zeros(D)
    S_curr = (1.0 / alpha) * np.eye(D)

    history = [(m_curr.copy(), S_curr.copy())]

    for n in range(N):
        x_n = X[n:n+1, :]  # (1, D)
        y_n = y[n:n+1]     # (1,)

        # Current prior = previous posterior
        S_prior_inv = inv(S_curr)

        # Update equations (single data point)
        S_new_inv = S_prior_inv + beta * x_n.T @ x_n
        S_new = inv(S_new_inv)
        m_new = S_new @ (S_prior_inv @ m_curr + beta * x_n.T @ y_n)

        m_curr = m_new.flatten()
        S_curr = S_new
        history.append((m_curr.copy(), S_curr.copy()))

        if verbose:
            print(f"After observation {n+1}:")
            print(f"  Posterior mean: {m_curr}")
            print(f"  Posterior std:  {np.sqrt(np.diag(S_curr))}\n")

    return m_curr, S_curr, history


# Verify: Sequential = Batch
np.random.seed(42)
w_true = np.array([1.5, -2.0])
N, D = 10, 2
X = np.random.randn(N, D)
y = X @ w_true + 0.5 * np.random.randn(N)

alpha, beta = 0.5, 4.0

# Batch processing (compute_bayesian_posterior is defined in the earlier example)
m_batch, S_batch = compute_bayesian_posterior(X, y, alpha, beta)

# Sequential processing
m_seq, S_seq, _ = sequential_bayesian_update(X, y, alpha, beta)

print("Batch posterior mean:     ", m_batch)
print("Sequential posterior mean:", m_seq)
print("Difference:", np.linalg.norm(m_batch - m_seq))  # Should be ~0
```

While mathematically equivalent, sequential updating has different numerical properties than batch. Each update requires O(D³) for matrix inversion, so N sequential updates cost O(ND³). Batch processing typically costs O(ND² + D³). Choose based on your use case: batch for fixed datasets, sequential for online/streaming scenarios.
The Gaussian posterior has a beautiful geometric interpretation. In weight space, constant-probability contours are ellipsoids.
Prior Ellipsoid:
The prior 𝒩(m₀, S₀) defines an ellipsoid centered at m₀. The axes are determined by the eigenvectors of S₀, with lengths proportional to the square roots of eigenvalues.
Likelihood 'Ellipsoid':
The likelihood function, when viewed as a function of w, also creates an ellipsoidal landscape. Its center is the OLS solution w_OLS = (XᵀX)⁻¹Xᵀy, and its shape is determined by XᵀX.
Posterior as Intersection:
The posterior ellipsoid is the result of "intersecting" the prior and likelihood ellipsoids in a precision-weighted sense. The posterior is centered between the prior mean and the OLS solution, weighted by their respective precisions, and its contours are tighter than either the prior's or the likelihood's, because the posterior precision is the sum of the two.
Visualization of Bayesian Updating:
Consider 2D weights (w₁, w₂). Before seeing data: the prior contours are circles (for an isotropic prior) centered at the origin, reflecting broad, symmetric uncertainty in every direction.
After seeing data: the contours become ellipses that shrink and shift toward the OLS solution, most strongly in the directions the data constrain.
With more data: the ellipses tighten further and their center moves ever closer to the OLS solution.
In the limit of infinite data: the posterior collapses toward a point at the true weights, and the prior's influence vanishes.
Gaussian contours are ellipsoids because the Mahalanobis distance (w - μ)ᵀΣ⁻¹(w - μ) = c defines an ellipsoid. The eigenvectors of Σ give the principal axes, and eigenvalues give the squared axis lengths. This geometry makes Gaussians natural for representing correlated uncertainties.
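To tie the algebra to the picture, the sketch below (illustrative values only) extracts the principal axes of a 2D prior and posterior covariance by eigendecomposition: the eigenvectors give the axis directions and the square roots of the eigenvalues give the semi-axis lengths.

```python
import numpy as np

def ellipse_axes(cov):
    """Principal axes of the 1-sigma contour of a 2D Gaussian with covariance `cov`."""
    evals, evecs = np.linalg.eigh(cov)
    return np.sqrt(evals), evecs   # semi-axis lengths, axis directions (columns)

rng = np.random.default_rng(7)
alpha, beta = 0.5, 4.0
X = rng.normal(size=(10, 2))

S0 = (1.0 / alpha) * np.eye(2)                            # prior covariance (circular contours)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)   # posterior covariance

for name, cov in [("prior", S0), ("posterior", S_N)]:
    lengths, directions = ellipse_axes(cov)
    print(f"{name:9s} semi-axis lengths: {lengths}")
# The posterior axes are shorter than the prior's, and shortest along the
# eigendirections of X^T X where the data are most informative.
```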
While the closed-form posterior is mathematically elegant, careful implementation is needed for numerical stability.
Potential Issues:
Ill-Conditioned X⊤X: If features are highly correlated, XᵀX may be nearly singular. The regularization from the prior (αI) helps, but severe collinearity can still cause issues.
Large N or D: For very large matrices, direct inversion is expensive and can accumulate numerical errors.
Extreme Precision Values: Very large α (strong prior) or very small β (high noise) can create numerical instability.
Best Practices:
Use Cholesky Decomposition: Instead of direct inversion, compute: $$\mathbf{L} = \text{chol}(\mathbf{S}_N^{-1})$$ Then solve systems using forward/backward substitution.
Woodbury Identity for Large D: When D >> N, use: $$\mathbf{S}_N = \alpha^{-1}\mathbf{I} - \alpha^{-1}\mathbf{X}^\top(\alpha^{-1}\mathbf{X}\mathbf{X}^\top + \beta^{-1}\mathbf{I})^{-1}\mathbf{X}\alpha^{-1}$$ This inverts an N×N matrix instead of a D×D one, which is cheaper when D > N.
Feature Scaling: Standardize features before fitting to avoid extremely different scales in XᵀX.
Work in Log Space: When computing marginal likelihoods (next module), work with log-determinants to avoid overflow.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def stable_bayesian_posterior(X, y, alpha, beta):
    """
    Numerically stable Bayesian posterior computation using Cholesky decomposition.
    """
    N, D = X.shape

    # Posterior precision matrix
    S_N_inv = alpha * np.eye(D) + beta * X.T @ X

    # Cholesky factorization: S_N_inv = L @ L.T
    # This is more stable than direct inversion
    L, lower = cho_factor(S_N_inv, lower=True)

    # Posterior mean: m_N = S_N @ (beta * X.T @ y)
    # Solve via: S_N_inv @ m_N = beta * X.T @ y
    b = beta * X.T @ y
    m_N = cho_solve((L, lower), b)

    # Posterior covariance: solve S_N_inv @ S_N = I
    S_N = cho_solve((L, lower), np.eye(D))

    # Log determinant (useful for marginal likelihood)
    log_det_S_N_inv = 2 * np.sum(np.log(np.diag(L)))
    log_det_S_N = -log_det_S_N_inv

    return m_N, S_N, log_det_S_N


# Alternative for D >> N: Woodbury identity
def woodbury_posterior(X, y, alpha, beta):
    """
    Use the Woodbury identity when D >> N.
    Inverts an N×N matrix instead of a D×D one.
    """
    N, D = X.shape

    # A_inv = (1/alpha) * I
    # (A + UBV)^{-1} = A^{-1} - A^{-1}U(B^{-1} + VA^{-1}U)^{-1}VA^{-1}
    # Here: U = X.T, V = X, B = beta * I_N
    alpha_inv = 1.0 / alpha
    beta_inv = 1.0 / beta

    # Inner matrix: (1/beta)I + (1/alpha)XX^T
    inner = beta_inv * np.eye(N) + alpha_inv * (X @ X.T)
    inner_inv = np.linalg.inv(inner)  # N×N inversion

    # S_N = (1/alpha)I - (1/alpha^2) X.T @ inner_inv @ X
    S_N = alpha_inv * np.eye(D) - (alpha_inv**2) * (X.T @ inner_inv @ X)

    # m_N = S_N @ (beta * X.T @ y)
    m_N = S_N @ (beta * X.T @ y)

    return m_N, S_N
```

We've derived the central object in Bayesian linear regression: the posterior distribution over weights. This is the complete description of what we know about the weights after observing data.
What's Next:
The posterior over weights is fundamental, but for prediction, we need the predictive distribution: given a new input x₊, what is the distribution over its target y₊? The next page derives this predictive distribution, showing how uncertainty in weights translates to uncertainty in predictions—the key output of Bayesian regression.
You can now derive the posterior distribution for Bayesian linear regression from first principles. This is the mathematical heart of the method—understanding how prior beliefs combine with evidence to yield updated beliefs. Next, we'll see how this posterior enables principled predictions with uncertainty estimates.