With our prior beliefs about weights encoded as a probability distribution, we're now ready to answer the central question of Bayesian inference: How should we update our beliefs when we observe data?
The answer is the posterior distribution p(w|y, X)—a probability distribution over weights that incorporates both our prior knowledge and the evidence from training data. The posterior is the complete summary of what we know about the weights after learning.
What makes Bayesian linear regression special is that, with Gaussian priors and likelihoods, the posterior has a closed-form solution. We don't need sampling or approximations—we can write down the exact posterior distribution analytically. This is the reward of conjugacy: mathematical tractability without sacrificing principled inference.
By the end of this page, you will be able to derive the posterior distribution over regression weights from first principles, understand the geometric interpretation of Bayesian updating, and see how the posterior elegantly balances prior beliefs against data evidence.
The foundation of Bayesian inference is Bayes' theorem. For weights w, given observed data y and input features X:
$$p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \frac{p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w})}{p(\mathbf{y} | \mathbf{X})}$$
Let's unpack each term:
Posterior p(w|y, X): This is what we want to compute—the probability distribution over weights given the observed data. It represents our updated beliefs.
Likelihood p(y|X, w): The probability of observing the target values y if the weights were w. This measures how well the weights explain the data.
Prior p(w): Our beliefs about weights before seeing data. From the previous page, we use: $$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_0, \mathbf{S}_0)$$
Marginal Likelihood p(y|X): Also called the evidence or model evidence. It's the probability of the data averaged over all possible weights: $$p(\mathbf{y} | \mathbf{X}) = \int p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w}) \, d\mathbf{w}$$
For computing the posterior, this is a normalizing constant—we'll return to its interpretation in the model comparison page.
Since p(y|X) doesn't depend on w, we often write:
p(w|y, X) ∝ p(y|X, w) · p(w)
Posterior ∝ Likelihood × Prior
We can find the posterior by multiplying likelihood and prior, then normalizing. For Gaussians, this normalization happens automatically because the product of Gaussians is Gaussian.
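As a concrete one-dimensional illustration of this point, the sketch below multiplies a Gaussian prior by a Gaussian likelihood on a grid, normalizes numerically, and compares the result with the closed-form Gaussian whose precision is the sum of the two. The specific numbers (prior_mean, obs, and so on) are made up for the example.

```python
import numpy as np
from scipy.stats import norm

# 1D illustration: posterior ∝ likelihood × prior, and the product is again Gaussian.
prior_mean, prior_var = 0.0, 2.0    # prior  w ~ N(0, 2)
obs, noise_var = 1.5, 0.5           # model  y = w + noise, observed y = 1.5

w = np.linspace(-4.0, 4.0, 2001)
dw = w[1] - w[0]
unnormalized = norm.pdf(obs, loc=w, scale=np.sqrt(noise_var)) * norm.pdf(w, prior_mean, np.sqrt(prior_var))
posterior_grid = unnormalized / (unnormalized.sum() * dw)   # numerical normalization

# Closed form: precisions add, and the mean is a precision-weighted combination.
post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
post_mean = post_var * (prior_mean / prior_var + obs / noise_var)

# Grid result matches the analytic Gaussian (difference limited only by grid truncation).
print(np.max(np.abs(posterior_grid - norm.pdf(w, post_mean, np.sqrt(post_var)))))
```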
Recall our probabilistic model: observations are generated by a linear function plus Gaussian noise:
$$\mathbf{y} | \mathbf{X}, \mathbf{w} \sim \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N)$$
The likelihood function is the probability density of y treated as a function of w:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = \mathcal{N}(\mathbf{y} | \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N)$$
Expanding the Gaussian pdf:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) \right)$$
Useful Rewriting:
For the derivation, we'll need the log-likelihood (ignoring constants not depending on w):
$$\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = -\frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \text{const}$$
Expanding the quadratic:
$$\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$$
This is quadratic in w, which is crucial for the Gaussian posterior we'll derive.
Precision Notation:
It's often convenient to work with the noise precision β = 1/σ² instead of variance. This simplifies many expressions:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \propto \exp\left( -\frac{\beta}{2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 \right)$$
The likelihood is maximized when w = (XᵀX)⁻¹Xᵀy—the ordinary least squares solution. The posterior will be "pulled" toward this maximum likelihood estimate, with the prior providing counterbalancing force.
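As a quick numerical aside (not part of the derivation), the snippet below checks that the normal-equation formula matches np.linalg.lstsq on synthetic data; the data and seed are arbitrary choices for illustration.

```python
import numpy as np

# Maximum-likelihood weights = ordinary least squares (illustrative synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.3 * rng.normal(size=50)

w_normal_eq = np.linalg.inv(X.T @ X) @ X.T @ y      # (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)     # numerically preferred solver

print(np.allclose(w_normal_eq, w_lstsq))   # True: both give the OLS / max-likelihood solution
```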
Our Gaussian prior on weights is:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_0, \mathbf{S}_0)$$
Expanding:
$$p(\mathbf{w}) = (2\pi)^{-D/2} |\mathbf{S}_0|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^\top \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) \right)$$
In log form (ignoring constants):
$$\log p(\mathbf{w}) = -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^\top \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) + \text{const}$$
The Common Case (Zero-Mean Isotropic Prior):
For the standard zero-mean isotropic prior with precision α:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \alpha^{-1}\mathbf{I})$$
This simplifies to:
$$\log p(\mathbf{w}) = -\frac{\alpha}{2} \mathbf{w}^\top\mathbf{w} + \text{const}$$
The prior log-probability is quadratic in w, just like the log-likelihood. This shared quadratic structure is why the product remains Gaussian.
The inverse covariance matrix S₀⁻¹ is called the precision matrix or information matrix. Working with precision matrices often leads to cleaner formulas in Bayesian inference, especially when combining information from multiple sources.
Now we combine prior and likelihood using Bayes' theorem. The posterior is proportional to their product:
$$\log p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) + \text{const}$$
Substituting our expressions:
$$\log p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = -\frac{\beta}{2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) - \frac{1}{2}(\mathbf{w} - \mathbf{m}_0)^\top\mathbf{S}_0^{-1}(\mathbf{w} - \mathbf{m}_0) + \text{const}$$
Key Observation: This is still quadratic in w! Since the log of a Gaussian density is quadratic, and the sum of quadratics is quadratic, the posterior must be Gaussian.
Let's expand and group terms to find the posterior mean mₙ and covariance Sₙ.
Step 1: Expand the Quadratics
Likelihood term: $$-\frac{\beta}{2}(\mathbf{y}^\top\mathbf{y} - 2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w})$$
Prior term: $$-\frac{1}{2}(\mathbf{w}^\top\mathbf{S}_0^{-1}\mathbf{w} - 2\mathbf{w}^\top\mathbf{S}_0^{-1}\mathbf{m}_0 + \mathbf{m}_0^\top\mathbf{S}_0^{-1}\mathbf{m}_0)$$
Step 2: Collect Terms by Power of w
Quadratic in w: $$-\frac{1}{2}\mathbf{w}^\top(\beta\mathbf{X}^\top\mathbf{X} + \mathbf{S}_0^{-1})\mathbf{w}$$
Linear in w: $$\mathbf{w}^\top(\beta\mathbf{X}^\top\mathbf{y} + \mathbf{S}_0^{-1}\mathbf{m}_0)$$
Step 3: Complete the Square
A Gaussian with mean μ and precision matrix Λ has log-density: $$-\frac{1}{2}(\mathbf{w} - \boldsymbol{\mu})^\top\mathbf{\Lambda}(\mathbf{w} - \boldsymbol{\mu}) = -\frac{1}{2}\mathbf{w}^\top\mathbf{\Lambda}\mathbf{w} + \mathbf{w}^\top\mathbf{\Lambda}\boldsymbol{\mu} + \text{const}$$
Comparing with our collected terms: $$\mathbf{\Lambda}_N = \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\mathbf{X}^\top\mathbf{X}$$ $$\mathbf{\Lambda}_N\mathbf{m}_N = \mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\mathbf{X}^\top\mathbf{y}$$
p(w|y, X) = 𝒩(w | mₙ, Sₙ) where:
Posterior Precision: Sₙ⁻¹ = S₀⁻¹ + β X⊤X
Posterior Mean: mₙ = Sₙ(S₀⁻¹m₀ + β X⊤y)
Equivalently, the posterior covariance is: Sₙ = (S₀⁻¹ + β X⊤X)⁻¹
The most common scenario uses a zero-mean isotropic prior: m₀ = 0 and S₀ = α⁻¹I. The posterior simplifies beautifully:
Posterior Precision: $$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\mathbf{X}^\top\mathbf{X}$$
Posterior Covariance: $$\mathbf{S}_N = (\alpha\mathbf{I} + \beta\mathbf{X}^\top\mathbf{X})^{-1}$$
Posterior Mean: $$\mathbf{m}_N = \beta\mathbf{S}_N\mathbf{X}^\top\mathbf{y} = \beta(\alpha\mathbf{I} + \beta\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
Connection to Ridge Regression:
With λ = α/β = ασ², this posterior mean is exactly the Ridge regression solution:
$$\mathbf{m}_N = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
This confirms that Ridge regression computes the posterior mean of a Bayesian linear regression with Gaussian prior. But Bayesian inference gives us more—the entire posterior distribution, not just the mean.
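Before the full implementation below, here is a short check of this Ridge connection on synthetic data of my choosing: with λ = α/β, the Bayesian posterior mean should coincide with the Ridge solution.

```python
import numpy as np

# Verify numerically that the posterior mean equals the Ridge solution (illustrative data).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -0.5, 2.0, 0.0]) + 0.2 * rng.normal(size=30)

alpha, beta = 2.0, 25.0          # prior precision, noise precision
lam = alpha / beta               # equivalent Ridge penalty

S_N = np.linalg.inv(alpha * np.eye(4) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ y                                      # Bayesian posterior mean
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)   # Ridge solution

print(np.allclose(m_N, w_ridge))   # True
```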
```python
import numpy as np
from numpy.linalg import inv

def compute_bayesian_posterior(X, y, alpha, beta, m0=None, S0=None):
    """
    Compute the posterior distribution for Bayesian linear regression.

    Parameters
    ----------
    X : ndarray (N, D)
        Design matrix of N observations with D features
    y : ndarray (N,)
        Target values
    alpha : float
        Prior precision (inverse variance of prior)
    beta : float
        Noise precision (inverse of noise variance sigma^2)
    m0 : ndarray (D,), optional
        Prior mean. Defaults to zeros.
    S0 : ndarray (D, D), optional
        Prior covariance. Defaults to (1/alpha) * I

    Returns
    -------
    m_N : ndarray (D,)
        Posterior mean
    S_N : ndarray (D, D)
        Posterior covariance
    """
    N, D = X.shape

    # Default prior
    if m0 is None:
        m0 = np.zeros(D)
    if S0 is None:
        S0 = (1.0 / alpha) * np.eye(D)

    # Prior precision
    S0_inv = inv(S0)  # For isotropic prior: alpha * I

    # Posterior precision (information from prior + information from data)
    S_N_inv = S0_inv + beta * X.T @ X

    # Posterior covariance
    S_N = inv(S_N_inv)

    # Posterior mean (weighted combination of prior and data)
    m_N = S_N @ (S0_inv @ m0 + beta * X.T @ y)

    return m_N, S_N


# Example: Simple linear regression with 10 data points
np.random.seed(42)

# True weights
w_true = np.array([1.5, -2.0])

# Generate synthetic data
N = 10
X = np.random.randn(N, 2)
noise_std = 0.5
y = X @ w_true + noise_std * np.random.randn(N)

# Bayesian inference
alpha = 0.5                 # Prior precision
beta = 1.0 / noise_std**2   # Noise precision

m_N, S_N = compute_bayesian_posterior(X, y, alpha, beta)

print("True weights:", w_true)
print("Posterior mean:", m_N)
print("Posterior std:", np.sqrt(np.diag(S_N)))
print("\nPosterior covariance matrix:")
print(S_N)
```

The posterior distribution encodes everything we know about the weights after observing data. Let's unpack its components.
The Posterior Mean mₙ:
This is our "best guess" for the weights—the point estimate that minimizes expected squared error under the posterior. It balances two forces: the prior term S₀⁻¹m₀ pulls the mean toward m₀, while the data term β X⊤y pulls it toward the maximum-likelihood (OLS) solution.
With more data (larger XᵀX), the data term dominates and mₙ approaches the OLS solution.
The Posterior Covariance Sₙ:
This measures our remaining uncertainty about the weights after seeing data. Important properties:
Sₙ ≤ S₀: Posterior variance is never greater than prior variance (in the matrix sense). Data always reduces uncertainty or leaves it unchanged.
Inverse of precision sum: Sₙ⁻¹ = S₀⁻¹ + βXᵀX. Precision (certainty) adds: prior precision + data precision = posterior precision.
Direction dependence: Uncertainty shrinks most in directions where data is most informative (large eigenvalues of XᵀX).
| Condition | Effect on Posterior Mean | Effect on Posterior Covariance |
|---|---|---|
| Strong prior (large α) | Pulled toward m₀ | Small (prior dominates) |
| Weak prior (small α) | Close to OLS solution | Larger (constrained mainly by the data) |
| Low noise (large β) | Close to OLS solution | Small (data is reliable) |
| High noise (small β) | Between prior and OLS | Large (data is noisy) |
| Many observations (large N) | Approaches OLS | Shrinks as O(1/N) |
| Few observations | Prior-dominated | Remains large |
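The sketch below (with illustrative data and hyperparameter values chosen for the example) makes two rows of this table concrete: a strong prior pulls the posterior mean toward zero, while a weak prior with many observations drives it toward the OLS solution.

```python
import numpy as np

def posterior_mean(X, y, alpha, beta):
    # Zero-mean isotropic prior: m_N = beta * (alpha I + beta X^T X)^{-1} X^T y
    D = X.shape[1]
    return beta * np.linalg.solve(alpha * np.eye(D) + beta * X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])
X = rng.normal(size=(200, 2))
y = X @ w_true + 0.5 * rng.normal(size=200)
beta = 1.0 / 0.5**2

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS (N=200):        ", w_ols)
print("Strong prior, N=10: ", posterior_mean(X[:10], y[:10], 1e3, beta))   # shrunk toward 0
print("Weak prior,   N=10: ", posterior_mean(X[:10], y[:10], 1e-3, beta))  # near OLS of the subset
print("Weak prior,   N=200:", posterior_mean(X, y, 1e-3, beta))            # approaches full OLS
```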
The posterior precision is the sum of prior precision and data precision:
Sₙ⁻¹ = S₀⁻¹ + β X⊤X
This is a general principle in Bayesian inference: information from independent sources adds. The Fisher information from the likelihood (β X⊤X) augments the prior information (S₀⁻¹).
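A small numerical sketch of this additivity, using arbitrary synthetic data: build the prior and data precision matrices separately, form the posterior covariance from their sum, and confirm that the posterior variance along each eigendirection of X⊤X equals 1/(α + β·eigenvalue), smallest where the data are most informative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
alpha, beta = 1.0, 4.0

prior_precision = alpha * np.eye(3)     # S_0^{-1} for the isotropic prior
data_precision = beta * X.T @ X         # information contributed by the likelihood
S_N = np.linalg.inv(prior_precision + data_precision)

# Along each eigendirection of X^T X, the posterior variance is 1 / (alpha + beta * eigenvalue).
evals, evecs = np.linalg.eigh(X.T @ X)
post_var_along_axes = np.diag(evecs.T @ S_N @ evecs)
print(np.allclose(post_var_along_axes, 1.0 / (alpha + beta * evals)))  # True
print("prior variance:", 1.0 / alpha, "| posterior variances along data axes:", post_var_along_axes)
```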
One elegant property of Bayesian inference is sequential updating: we can incorporate data one point at a time and arrive at exactly the same posterior as batch processing.
The Procedure:
1. Start with the prior p(w) = 𝒩(m₀, S₀).
2. Observe a data point and compute the posterior using the update equations above.
3. Treat that posterior as the prior for the next data point.
4. Repeat until all data have been processed.
Why It Works:
Bayesian updating is associative: independent observations enter as multiplicative likelihood factors, so they can be folded in one at a time. For data points 1 and 2:
$$p(\mathbf{w}|y_1, y_2) \propto p(y_2|\mathbf{w}) \cdot p(y_1|\mathbf{w}) \cdot p(\mathbf{w})$$ $$= p(y_2|\mathbf{w}) \cdot \left[ p(y_1|\mathbf{w}) \cdot p(\mathbf{w}) \right]$$ $$= p(y_2|\mathbf{w}) \cdot p(\mathbf{w}|y_1)$$
The order doesn't matter—same posterior whether we process (1, then 2) or (2, then 1) or (1+2 together).
Practical Implications: Sequential updating makes Bayesian linear regression natural for online and streaming settings—each new observation updates the posterior without reprocessing old data, and only the current posterior mean and covariance need to be stored.
```python
import numpy as np
from numpy.linalg import inv

def sequential_bayesian_update(X, y, alpha, beta, verbose=False):
    """
    Demonstrate sequential Bayesian updating: process data one point at a time.
    Returns the same posterior as batch processing.
    """
    N, D = X.shape

    # Initial prior
    m_curr = np.zeros(D)
    S_curr = (1.0 / alpha) * np.eye(D)

    history = [(m_curr.copy(), S_curr.copy())]

    for n in range(N):
        x_n = X[n:n+1, :]  # (1, D)
        y_n = y[n:n+1]     # (1,)

        # Current prior = previous posterior
        S_prior_inv = inv(S_curr)

        # Update equations (single data point)
        S_new_inv = S_prior_inv + beta * x_n.T @ x_n
        S_new = inv(S_new_inv)
        m_new = S_new @ (S_prior_inv @ m_curr + beta * x_n.T @ y_n)

        m_curr = m_new.flatten()
        S_curr = S_new
        history.append((m_curr.copy(), S_curr.copy()))

        if verbose:
            print(f"After observation {n+1}:")
            print(f"  Posterior mean: {m_curr}")
            print(f"  Posterior std:  {np.sqrt(np.diag(S_curr))}\n")

    return m_curr, S_curr, history


# Verify: Sequential = Batch
np.random.seed(42)
w_true = np.array([1.5, -2.0])
N, D = 10, 2
X = np.random.randn(N, D)
y = X @ w_true + 0.5 * np.random.randn(N)

alpha, beta = 0.5, 4.0

# Batch processing (compute_bayesian_posterior is defined in the earlier example)
m_batch, S_batch = compute_bayesian_posterior(X, y, alpha, beta)

# Sequential processing
m_seq, S_seq, _ = sequential_bayesian_update(X, y, alpha, beta)

print("Batch posterior mean:     ", m_batch)
print("Sequential posterior mean:", m_seq)
print("Difference:", np.linalg.norm(m_batch - m_seq))  # Should be ~0
```

While mathematically equivalent, sequential updating has different numerical properties than batch. Each update requires O(D³) for matrix inversion, so N sequential updates cost O(ND³). Batch processing typically costs O(ND² + D³). Choose based on your use case: batch for fixed datasets, sequential for online/streaming scenarios.
The Gaussian posterior has a beautiful geometric interpretation. In weight space, constant-probability contours are ellipsoids.
Prior Ellipsoid:
The prior 𝒩(m₀, S₀) defines an ellipsoid centered at m₀. The axes are determined by the eigenvectors of S₀, with lengths proportional to the square roots of eigenvalues.
Likelihood 'Ellipsoid':
The likelihood function, when viewed as a function of w, also creates an ellipsoidal landscape. Its center is the OLS solution w_OLS = (XᵀX)⁻¹Xᵀy, and its shape is determined by XᵀX.
Posterior as Intersection:
The posterior ellipsoid is the result of "intersecting" the prior and likelihood ellipsoids in a precision-weighted sense. The posterior is centered between the prior mean and the OLS solution, weighted by their respective precisions, and its contours are tighter than either the prior's or the likelihood's, because the posterior precision is the sum of the two.
Visualization of Bayesian Updating:
Consider 2D weights (w₁, w₂). Before seeing data: the prior contours are circles (for an isotropic prior) centered at the origin, reflecting broad, symmetric uncertainty in every direction.
After seeing data: the contours become ellipses that shrink and shift toward the OLS solution, most strongly in the directions the data constrain.
With more data: the ellipses tighten further and their center moves ever closer to the OLS solution.
In the limit of infinite data: the posterior collapses toward a point at the true weights, and the prior's influence vanishes.
Gaussian contours are ellipsoids because the Mahalanobis distance (w - μ)ᵀΣ⁻¹(w - μ) = c defines an ellipsoid. The eigenvectors of Σ give the principal axes, and eigenvalues give the squared axis lengths. This geometry makes Gaussians natural for representing correlated uncertainties.
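To tie the algebra to the picture, the sketch below (illustrative values only) extracts the principal axes of a 2D prior and posterior covariance by eigendecomposition: the eigenvectors give the axis directions and the square roots of the eigenvalues give the semi-axis lengths.

```python
import numpy as np

def ellipse_axes(cov):
    """Principal axes of the 1-sigma contour of a 2D Gaussian with covariance `cov`."""
    evals, evecs = np.linalg.eigh(cov)
    return np.sqrt(evals), evecs   # semi-axis lengths, axis directions (columns)

rng = np.random.default_rng(7)
alpha, beta = 0.5, 4.0
X = rng.normal(size=(10, 2))

S0 = (1.0 / alpha) * np.eye(2)                            # prior covariance (circular contours)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)   # posterior covariance

for name, cov in [("prior", S0), ("posterior", S_N)]:
    lengths, directions = ellipse_axes(cov)
    print(f"{name:9s} semi-axis lengths: {lengths}")
# The posterior axes are shorter than the prior's, and shortest along the
# eigendirections of X^T X where the data are most informative.
```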
While the closed-form posterior is mathematically elegant, careful implementation is needed for numerical stability.
Potential Issues:
Ill-Conditioned X⊤X: If features are highly correlated, XᵀX may be nearly singular. The regularization from the prior (αI) helps, but severe collinearity can still cause issues.
Large N or D: For very large matrices, direct inversion is expensive and can accumulate numerical errors.
Extreme Precision Values: Very large α (strong prior) or very small β (high noise) can create numerical instability.
Best Practices:
Use Cholesky Decomposition: Instead of direct inversion, compute: $$\mathbf{L} = \text{chol}(\mathbf{S}_N^{-1})$$ Then solve systems using forward/backward substitution.
Woodbury Identity for Large D: When D >> N, use: $$\mathbf{S}_N = \alpha^{-1}\mathbf{I} - \alpha^{-1}\mathbf{X}^\top(\alpha^{-1}\mathbf{X}\mathbf{X}^\top + \beta^{-1}\mathbf{I})^{-1}\mathbf{X}\alpha^{-1}$$ This inverts an N×N matrix instead of a D×D one, which is cheaper when D > N.
Feature Scaling: Standardize features before fitting to avoid extremely different scales in XᵀX.
Work in Log Space: When computing marginal likelihoods (next module), work with log-determinants to avoid overflow.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def stable_bayesian_posterior(X, y, alpha, beta):
    """
    Numerically stable Bayesian posterior computation using Cholesky decomposition.
    """
    N, D = X.shape

    # Posterior precision matrix
    S_N_inv = alpha * np.eye(D) + beta * X.T @ X

    # Cholesky factorization: S_N_inv = L @ L.T
    # This is more stable than direct inversion
    L, lower = cho_factor(S_N_inv, lower=True)

    # Posterior mean: m_N = S_N @ (beta * X.T @ y)
    # Solve via: S_N_inv @ m_N = beta * X.T @ y
    b = beta * X.T @ y
    m_N = cho_solve((L, lower), b)

    # Posterior covariance: solve S_N_inv @ S_N = I
    S_N = cho_solve((L, lower), np.eye(D))

    # Log determinant (useful for marginal likelihood)
    log_det_S_N_inv = 2 * np.sum(np.log(np.diag(L)))
    log_det_S_N = -log_det_S_N_inv

    return m_N, S_N, log_det_S_N


# Alternative for D >> N: Woodbury identity
def woodbury_posterior(X, y, alpha, beta):
    """
    Use the Woodbury identity when D >> N.
    Inverts an N×N matrix instead of a D×D one.
    """
    N, D = X.shape

    # A_inv = (1/alpha) * I
    # (A + UBV)^{-1} = A^{-1} - A^{-1}U(B^{-1} + VA^{-1}U)^{-1}VA^{-1}
    # Here: U = X.T, V = X, B = beta * I_N
    alpha_inv = 1.0 / alpha
    beta_inv = 1.0 / beta

    # Inner matrix: (1/beta)I + (1/alpha)XX^T
    inner = beta_inv * np.eye(N) + alpha_inv * (X @ X.T)
    inner_inv = np.linalg.inv(inner)  # N×N inversion

    # S_N = (1/alpha)I - (1/alpha^2) X.T @ inner_inv @ X
    S_N = alpha_inv * np.eye(D) - (alpha_inv**2) * (X.T @ inner_inv @ X)

    # m_N = S_N @ (beta * X.T @ y)
    m_N = S_N @ (beta * X.T @ y)

    return m_N, S_N
```

We've derived the central object in Bayesian linear regression: the posterior distribution over weights. This is the complete description of what we know about the weights after observing data.
What's Next:
The posterior over weights is fundamental, but for prediction, we need the predictive distribution: given a new input x₊, what is the distribution over its target y₊? The next page derives this predictive distribution, showing how uncertainty in weights translates to uncertainty in predictions—the key output of Bayesian regression.
You can now derive the posterior distribution for Bayesian linear regression from first principles. This is the mathematical heart of the method—understanding how prior beliefs combine with evidence to yield updated beliefs. Next, we'll see how this posterior enables principled predictions with uncertainty estimates.