We've derived the posterior distribution over weights—a complete description of our uncertainty about the model parameters. But in practice, we don't care about weights directly. What we care about is prediction: given a new input x₊, what is the corresponding output y₊?
Classical regression gives a point prediction: ŷ₊ = wᵀx₊. But this ignores all uncertainty. If we're uncertain about the weights, shouldn't we be uncertain about predictions made with those weights?
Bayesian linear regression answers with the predictive distribution: a full probability distribution over the target value for a new input. This distribution captures two sources of uncertainty: our uncertainty about the weights (epistemic) and the irreducible noise in the observations (aleatoric).
The predictive distribution integrates out weight uncertainty, giving us principled confidence intervals that reflect both what we don't know and what is inherently unknowable.
By the end of this page, you will be able to derive the predictive distribution for Bayesian linear regression, understand how weight uncertainty transforms into prediction uncertainty, compute prediction intervals that have valid probabilistic interpretation, and visualize predictive uncertainty across the input space.
Given a new input x₊ ∈ ℝᴰ, we want to predict the corresponding target value y₊. In the Bayesian framework, we compute the predictive distribution:
$$p(y_* | \mathbf{x}_*, \mathbf{y}, \mathbf{X})$$
This is the probability distribution over y₊ given:
• the new input x₊,
• the observed training targets y,
• the training inputs X.
The Key Insight: Marginalization
We don't know the true weights w. But we have a posterior distribution over them. The predictive distribution averages over all possible weights, weighted by their posterior probability:
$$p(y_* | \mathbf{x}_*, \mathbf{y}, \mathbf{X}) = \int p(y_* | \mathbf{x}_*, \mathbf{w}) \cdot p(\mathbf{w} | \mathbf{y}, \mathbf{X}) \, d\mathbf{w}$$
This integral marginalizes out the weights. The result accounts for all sources of uncertainty: what we still don't know about w after seeing the data, and the noise that will corrupt the new observation.
You could predict using ŷ₊ = mₙᵀx₊ (the posterior mean of weights). This is the MAP/point prediction. But it discards uncertainty: you get no confidence intervals, no honest assessment of how much you don't know. The predictive distribution gives you the full picture.
Let's compute the integral. We have:
Likelihood for new observation: $$p(y_* | \mathbf{x}_*, \mathbf{w}) = \mathcal{N}(y_* | \mathbf{w}^\top\mathbf{x}_*, \sigma^2)$$
Posterior over weights: $$p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N)$$
The predictive distribution is: $$p(y_* | \mathbf{x}_*, \mathbf{y}, \mathbf{X}) = \int \mathcal{N}(y_* | \mathbf{w}^\top\mathbf{x}_*, \sigma^2) \cdot \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N) \, d\mathbf{w}$$
This integral is a linear-Gaussian marginalization (equivalently, the convolution of the Gaussian over f₊ = wᵀx₊ with Gaussian noise), and the result is itself Gaussian. It follows from the general formula for marginalizing Gaussian random variables, stated below.
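For reference, the general linear-Gaussian marginalization identity being applied, written with a generic projection vector a, mean μ, and covariance Σ (in our case a = x₊, μ = mₙ, Σ = Sₙ):

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad p(y | \mathbf{w}) = \mathcal{N}(y | \mathbf{a}^\top\mathbf{w}, \sigma^2) \;\;\Rightarrow\;\; p(y) = \int p(y | \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w} = \mathcal{N}(y | \mathbf{a}^\top\boldsymbol{\mu}, \; \mathbf{a}^\top\boldsymbol{\Sigma}\mathbf{a} + \sigma^2)$$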
Step 1: Linear Transformation of Weights
Define the latent prediction (without noise): f₊ = wᵀx₊
Since w ~ 𝒩(mₙ, Sₙ), and f₊ is a linear function of w:
$$f_* = \mathbf{w}^\top\mathbf{x}_* \sim \mathcal{N}(\mathbf{m}_N^\top\mathbf{x}_*, \ \mathbf{x}_*^\top\mathbf{S}_N\mathbf{x}_*)$$
The mean is mₙᵀx₊ and variance is x₊ᵀSₙx₊.
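A quick Monte Carlo sanity check of this step, using a made-up posterior; the specific values of m_N, S_N, and x_star below are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical posterior and test input, purely for checking the identity above
rng = np.random.default_rng(0)
m_N = np.array([0.5, 1.5])
S_N = np.array([[0.20, 0.05],
                [0.05, 0.10]])
x_star = np.array([1.0, 0.7])

# Monte Carlo: draw weights from the posterior and project onto x_star
w = rng.multivariate_normal(m_N, S_N, size=200_000)
f_star = w @ x_star

print("MC mean:", f_star.mean(), " analytic:", m_N @ x_star)            # ~1.55
print("MC var: ", f_star.var(),  " analytic:", x_star @ S_N @ x_star)   # ~0.319
```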
Step 2: Add Observation Noise
The observed y₊ is f₊ plus independent Gaussian noise: $$y_* = f_* + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Since f₊ and ε are independent Gaussians, their sum is Gaussian with:
p(y₊ | x₊, y, X) = 𝒩(y₊ | μ₊, σ²₊) where:
Predictive Mean: μ₊ = mₙᵀx₊
Predictive Variance: σ²₊ = x₊ᵀSₙx₊ + σ²
The variance decomposes into: • x₊ᵀSₙx₊: Uncertainty from not knowing weights (epistemic) • σ²: Inherent observation noise (aleatoric)
The predictive variance σ²₊ = x₊ᵀSₙx₊ + σ² has a profound interpretation. Let's unpack each term.
Term 1: Epistemic Uncertainty (x₊ᵀSₙx₊)
This term arises because we don't know the true weights. Even with infinite precision observations (σ² = 0), we'd still have prediction uncertainty if we're uncertain about w.
Key properties:
• It depends on the test input x₊: small near the training data, large far from it.
• It shrinks as more data is collected, and vanishes in the limit N → ∞.
Term 2: Aleatoric Uncertainty (σ²)
This is irreducible noise—the inherent randomness in the data-generating process. Even with known true weights, observations would be noisy.
Key properties:
• It is constant across the input space: it does not depend on x₊.
• It cannot be reduced by collecting more data of the same kind; only better measurements would lower it.
| Property | Epistemic (x₊ᵀSₙx₊) | Aleatoric (σ²) |
|---|---|---|
| Source | Unknown weights | Inherent randomness |
| Reducible? | Yes, with more data | No, fundamental |
| Input-dependent? | Yes | No |
| Limit as N → ∞ | → 0 | Stays σ² |
| Also called | Model uncertainty | Noise variance |
| Can be estimated? | Yes, from posterior | Yes, from data |
Understanding the source of uncertainty guides action. High epistemic uncertainty? Collect more data in that region. High aleatoric uncertainty? Improve measurement quality or accept fundamental limits. Bayesian inference naturally provides this decomposition.
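The following minimal sketch (synthetic 1-D data, hypothetical hyperparameters α and σ) shows the decomposition in action: as N grows, the epistemic term at a fixed test point shrinks toward zero while the aleatoric term σ² stays constant.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 0.1, 0.5            # hypothetical prior precision and noise level
beta = 1.0 / sigma**2
x_star = np.array([1.0, 2.0])      # fixed test input: [bias, x]

for N in [5, 20, 100, 1000]:
    x = rng.uniform(-3, 3, N)
    X = np.column_stack([np.ones(N), x])                     # design matrix with bias
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)  # posterior covariance
    epistemic = x_star @ S_N @ x_star
    print(f"N={N:4d}  epistemic={epistemic:.5f}  aleatoric={sigma**2:.5f}")
```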
The predictive distribution enables principled prediction intervals—ranges that contain the true value with specified probability.
Credible Intervals:
For a Gaussian predictive distribution 𝒩(μ₊, σ²₊), the (1-α) credible interval is:
$$\text{CI}_{1-\alpha} = \left[ \mu_* - z_{\alpha/2}\,\sigma_*, \quad \mu_* + z_{\alpha/2}\,\sigma_* \right]$$
where z_{α/2} is the standard normal quantile. Common values:
• 90% interval: z = 1.645
• 95% interval: z = 1.960
• 99% interval: z = 2.576
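As a sketch, a small helper (the name credible_interval is hypothetical) that turns a Gaussian predictive mean and standard deviation into a central credible interval:

```python
from scipy import stats

def credible_interval(mu, sigma, level=0.95):
    """Central credible interval for a Gaussian predictive N(mu, sigma^2)."""
    z = stats.norm.ppf(0.5 + level / 2)   # 1.645 (90%), 1.960 (95%), 2.576 (99%)
    return mu - z * sigma, mu + z * sigma

print(credible_interval(2.0, 0.5))        # approximately (1.02, 2.98)
```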
Interpretation (Bayesian):
The 95% credible interval means: "Given our prior and observed data, there's a 95% probability the true y₊ lies in this interval."
This is subtly different from frequentist confidence intervals, which make statements about long-run coverage rather than probability of the true value.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def bayesian_prediction(X_train, y_train, X_test, alpha, beta):
    """
    Compute Bayesian predictions with uncertainty.

    Returns:
        mu: Predictive means
        sigma: Predictive standard deviations
    """
    N, D = X_train.shape

    # Posterior
    S_N_inv = alpha * np.eye(D) + beta * X_train.T @ X_train
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ X_train.T @ y_train

    # Predictive mean for each test point
    mu = X_test @ m_N

    # Predictive variance for each test point
    # sigma^2_* = x^T S_N x + 1/beta
    sigma_sq = np.sum((X_test @ S_N) * X_test, axis=1) + 1.0 / beta
    sigma = np.sqrt(sigma_sq)

    return mu, sigma, m_N, S_N

# Generate synthetic data
np.random.seed(42)
N_train = 20
X_train = np.sort(np.random.uniform(-3, 3, N_train))
X_train_design = np.column_stack([np.ones(N_train), X_train])  # Add bias

w_true = np.array([0.5, 1.5])  # True weights
noise_std = 0.5
y_train = X_train_design @ w_true + noise_std * np.random.randn(N_train)

# Bayesian inference
alpha = 0.1                      # Prior precision
beta = 1.0 / noise_std**2        # Noise precision

# Test points
X_test = np.linspace(-4, 4, 200)
X_test_design = np.column_stack([np.ones(len(X_test)), X_test])

mu, sigma, m_N, S_N = bayesian_prediction(
    X_train_design, y_train, X_test_design, alpha, beta)

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))

# Mean prediction
ax.plot(X_test, mu, 'b-', linewidth=2, label='Predictive Mean')

# Confidence bands (coverage of a central ±nσ interval is 2Φ(n) - 1)
for n_std, alpha_fill in [(1, 0.4), (2, 0.2), (3, 0.1)]:
    ax.fill_between(X_test, mu - n_std*sigma, mu + n_std*sigma,
                    alpha=alpha_fill, color='blue',
                    label=f'{n_std}σ interval ({100*(2*stats.norm.cdf(n_std) - 1):.1f}%)')

# Training data
ax.scatter(X_train, y_train, c='red', s=50, zorder=5, label='Training Data')

# True function
y_true = X_test_design @ w_true
ax.plot(X_test, y_true, 'g--', linewidth=2, label='True Function')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Bayesian Linear Regression: Predictive Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```

Key Observations from the Visualization:
• The predictive mean tracks the true function closely where training data is available.
• The uncertainty bands are narrowest near the training data and widen toward the edges of (and beyond) the training range.
• Most training points fall inside the 2σ band, consistent with calibrated uncertainty.
A distinctive feature of Bayesian prediction is that uncertainty varies across the input space. The epistemic component x₊ᵀSₙx₊ depends on where we're predicting.
Geometric Interpretation:
The posterior covariance Sₙ defines an ellipsoid of weight uncertainty. The quadratic form x₊ᵀSₙx₊ measures how x₊ "stretches" this ellipsoid:
• along directions the data has constrained well, Sₙ is small and the contribution to the variance is small;
• along poorly constrained directions, Sₙ is large and the contribution is large;
• the further x₊ extends into poorly constrained directions, the larger the epistemic variance.
Intuitive Understanding:
Imagine linear regression with two features x₁ and x₂, where the training data varies widely in x₁ but hardly at all in x₂. The posterior pins down w₁ tightly but leaves w₂ uncertain.
For a test point with large x₂, predictions will be uncertain because the uncertain weight w₂ is multiplied by a large factor.
Extrapolation Warning:
This is why Bayesian methods naturally warn about extrapolation: far from the training data, x₊ᵀSₙx₊ grows (for a linear model, roughly quadratically in the distance), so the predictive bands widen exactly where the model has the least evidence. The numeric sketch below illustrates the effect.
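A minimal numeric sketch, assuming synthetic 1-D data with training inputs confined to [-1, 1] and hypothetical hyperparameters: the epistemic variance is tiny inside the training range and grows rapidly outside it.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma = 0.1, 0.5
beta = 1.0 / sigma**2

x = rng.uniform(-1, 1, 15)                       # training inputs clustered near 0
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 1.5 * x + sigma * rng.standard_normal(15)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)

for x_star in [0.0, 1.0, 3.0, 10.0]:
    phi = np.array([1.0, x_star])                # test feature vector [bias, x*]
    print(f"x* = {x_star:5.1f}   epistemic variance = {phi @ S_N @ phi:.4f}")
```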
Bayesian methods refuse to be confidently wrong—they widen uncertainty where knowledge is lacking. This is philosophically correct but sometimes operationally inconvenient. A point prediction with wide error bars may be less actionable than an (overconfident) point estimate. The honest thing isn't always the useful thing, but it should be the starting point.
Often we need predictions at multiple test points simultaneously. Let X₊ be an N₊ × D matrix of N₊ test inputs. The joint predictive distribution is:
$$p(\mathbf{y}_* | \mathbf{X}_*, \mathbf{y}, \mathbf{X}) = \mathcal{N}(\mathbf{y}_* | \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$
where:
Predictive Mean Vector: $$\boldsymbol{\mu}_* = \mathbf{X}_*\mathbf{m}_N$$
Predictive Covariance Matrix: $$\boldsymbol{\Sigma}_* = \mathbf{X}_*\mathbf{S}_N\mathbf{X}_*^\top + \sigma^2\mathbf{I}_{N_*}$$
Structure of the Covariance:
The diagonal elements are the individual predictive variances: $$(\boldsymbol{\Sigma}_*)_{ii} = \mathbf{x}_{*i}^\top\mathbf{S}_N\mathbf{x}_{*i} + \sigma^2$$
The off-diagonal elements capture correlations between predictions: $$(\boldsymbol{\Sigma}_*)_{ij} = \mathbf{x}_{*i}^\top\mathbf{S}_N\mathbf{x}_{*j}$$
Note: Noise is independent across test points, so it only appears on the diagonal.
Why Correlations Matter:
Predictions at different test points are correlated through shared weight uncertainty. If we're uncertain about w₁, predictions for all test points with large x₁ will be jointly uncertain—they'll tend to be wrong in the same direction.
This correlation structure is crucial for:
• sampling coherent function realizations across many inputs (next subsection),
• propagating uncertainty into sums, averages, or other downstream quantities built from several predictions,
• computing joint probabilities over multiple predictions.
A sketch of the joint covariance and the resulting correlation follows.
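A minimal sketch, assuming synthetic 1-D data and hypothetical hyperparameters: build the joint predictive covariance for two nearby test points and read off their correlation, which is strongly positive because both predictions share the same weight uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 0.1, 0.5
beta = 1.0 / sigma**2

# Fit the posterior on a small synthetic dataset
x = rng.uniform(-2, 2, 20)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 1.5 * x + sigma * rng.standard_normal(20)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ y

# Joint predictive over two similar test inputs (both extrapolating to x ~ 2.5)
X_star = np.array([[1.0, 2.5],
                   [1.0, 2.6]])
mu = X_star @ m_N
Sigma = X_star @ S_N @ X_star.T + sigma**2 * np.eye(2)

corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print("joint covariance:\n", Sigma)
print("correlation between the two predictions:", round(corr, 3))
```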
Sampling from the Predictive Distribution:
To generate samples of consistent predictions:
$$\mathbf{y}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$
Or equivalently: sample weights w from the posterior 𝒩(mₙ, Sₙ), compute f₊ = X₊w, and add independent observation noise, as the code below does.
```python
import numpy as np
import matplotlib.pyplot as plt

def sample_predictive(X_train, y_train, X_test, alpha, beta, n_samples=10):
    """
    Sample from the predictive distribution.
    Shows multiple consistent function realizations.
    """
    N, D = X_train.shape

    # Posterior
    S_N_inv = alpha * np.eye(D) + beta * X_train.T @ X_train
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ X_train.T @ y_train
    sigma = np.sqrt(1.0 / beta)

    # Sample weights from posterior
    w_samples = np.random.multivariate_normal(m_N, S_N, size=n_samples)

    # Predictions for each weight sample
    f_samples = X_test @ w_samples.T  # (N_test, n_samples)

    # Add observation noise
    y_samples = f_samples + sigma * np.random.randn(*f_samples.shape)

    return f_samples, y_samples, m_N, S_N

# Setup
np.random.seed(42)
N_train = 10
x_train = np.sort(np.random.uniform(-2, 2, N_train))
X_train = np.column_stack([np.ones(N_train), x_train])
w_true = np.array([0.3, 1.2])
y_train = X_train @ w_true + 0.3 * np.random.randn(N_train)

x_test = np.linspace(-3, 3, 100)
X_test = np.column_stack([np.ones(len(x_test)), x_test])

alpha, beta = 0.5, 1.0 / 0.3**2

f_samples, y_samples, m_N, S_N = sample_predictive(
    X_train, y_train, X_test, alpha, beta, n_samples=20)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Samples from posterior (noiseless)
ax = axes[0]
for i in range(f_samples.shape[1]):
    ax.plot(x_test, f_samples[:, i], alpha=0.3, color='blue')
ax.plot(x_test, X_test @ m_N, 'r-', linewidth=2, label='Posterior Mean')
ax.scatter(x_train, y_train, c='black', s=50, zorder=5, label='Training Data')
ax.set_title('Samples from Posterior (Noiseless f)')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()

# Right: Samples with noise
ax = axes[1]
for i in range(y_samples.shape[1]):
    ax.plot(x_test, y_samples[:, i], alpha=0.3, color='green')
ax.plot(x_test, X_test @ m_N, 'r-', linewidth=2, label='Posterior Mean')
ax.scatter(x_train, y_train, c='black', s=50, zorder=5, label='Training Data')
ax.set_title('Samples from Predictive (With Noise)')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()

plt.tight_layout()
plt.show()

# Notice: Samples spread more outside training region
# All samples pass near the data points (constrained by likelihood)
# The right plot is noisier (observation noise added)
```

Frequentist linear regression also provides prediction intervals. How do they compare to Bayesian credible intervals?
Frequentist Prediction Interval:
For OLS with Gaussian errors, the (1-α) prediction interval at x₊ is:
$$\hat{y}_* \pm t_{n-p,\,\alpha/2} \cdot \hat{\sigma} \cdot \sqrt{1 + \mathbf{x}_*^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_*}$$
where:
• ŷ₊ = x₊ᵀŵ_OLS is the OLS point prediction,
• t_{n−p, α/2} is the Student-t quantile with n − p degrees of freedom,
• σ̂ is the residual standard error.
A sketch that computes this interval follows.
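Here is a sketch of that interval, assuming Gaussian errors and using scipy's Student-t quantile; the function name ols_prediction_interval is hypothetical, chosen for illustration.

```python
import numpy as np
from scipy import stats

def ols_prediction_interval(X, y, x_star, level=0.95):
    """Classical OLS prediction interval at a single test input x_star."""
    n, p = X.shape
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ w_hat
    sigma_hat = np.sqrt(resid @ resid / (n - p))                 # residual std error
    t = stats.t.ppf(0.5 + level / 2, df=n - p)                   # t quantile
    half = t * sigma_hat * np.sqrt(1.0 + x_star @ np.linalg.solve(X.T @ X, x_star))
    y_hat = x_star @ w_hat
    return y_hat - half, y_hat + half

# Example usage with synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.uniform(-2, 2, 30)])
y = X @ np.array([0.5, 1.5]) + 0.5 * rng.standard_normal(30)
print(ols_prediction_interval(X, y, np.array([1.0, 1.0])))
```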
Key Differences:
| Aspect | Bayesian Credible Interval | Frequentist Prediction Interval |
|---|---|---|
| Philosophy | Probability of true value | Long-run coverage guarantee |
| Prior information | Incorporated via p(w) | Not incorporated |
| Small samples | Prior regularizes, stable | Can be unstable |
| Interpretation | "95% probability y₊ is here" | "95% of such intervals contain y₊" |
| σ² treatment | Can be fixed or integrated out | Estimated with degrees-of-freedom correction |
| Collinearity | Prior helps, stable | Can blow up |
When They Agree:
With a flat/vague prior (α → 0) and known σ², the posterior mean mₙ converges to the OLS estimate and the Bayesian credible interval coincides with the frequentist interval (which, with σ² known, uses normal rather than t quantiles). The sketch below checks the first claim numerically.
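A minimal numerical check under these assumptions (synthetic data, noise level treated as known): as the prior precision α shrinks toward zero, the posterior mean approaches the OLS solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -0.5, 0.3]) + 0.4 * rng.standard_normal(200)

beta = 1.0 / 0.4**2                                   # known noise precision
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

for alpha in [1.0, 1e-3, 1e-8]:
    S_N = np.linalg.inv(alpha * np.eye(3) + beta * X.T @ X)
    m_N = beta * S_N @ X.T @ y
    print(f"alpha = {alpha:g}   max |m_N - w_OLS| = {np.abs(m_N - w_ols).max():.2e}")
```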
When Bayesian Is Better: small samples, genuine prior knowledge, many or collinear features, or settings where uncertainty must propagate into downstream decisions.
When Frequentist Is Better: when no defensible prior is available, when exact long-run coverage guarantees are the requirement, or when a simple, convention-backed procedure is preferred.
For large samples with well-behaved data, Bayesian and frequentist intervals are nearly identical. The differences matter most in challenging settings: small samples, many features, prior knowledge available, or uncertainty needs to propagate into downstream decisions. In these cases, Bayesian inference shines.
Implementing Bayesian prediction correctly requires attention to several details.
Efficient Variance Computation:
For many test points, computing x₊ᵀSₙx₊ individually is inefficient. Use vectorized computation:
```python
# Slow: loop over test points
var = np.array([x.T @ S_N @ x for x in X_test])

# Fast: vectorized
var = np.sum((X_test @ S_N) * X_test, axis=1)
```
Numerical Stability:
Ensure Sₙ is computed stably (via Cholesky decomposition). Small negative eigenvalues from numerical error can cause problems when computing variances.
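One way to do this, sketched below under the assumption that the posterior precision matrix (called S_N_inv here) is available: solve against the precision with a Cholesky factorization instead of forming Sₙ explicitly, and clip any tiny negative variances produced by round-off. The helper name is hypothetical.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def predictive_variance_stable(X_test, S_N_inv, beta):
    """Total predictive variance x^T S_N x + 1/beta without explicitly inverting S_N_inv."""
    c = cho_factor(S_N_inv, lower=True)
    V = cho_solve(c, X_test.T)                     # each column v_i solves S_N_inv v_i = x_i
    var_epistemic = np.sum(X_test * V.T, axis=1)   # x_i^T v_i = x_i^T S_N x_i
    return np.clip(var_epistemic, 0.0, None) + 1.0 / beta

# Example: 2x2 precision matrix and two test points
S_N_inv = np.array([[5.0, 1.0], [1.0, 3.0]])
X_test = np.array([[1.0, 0.5], [1.0, 2.0]])
print(predictive_variance_stable(X_test, S_N_inv, beta=4.0))
```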
Handling Unknown σ²:
If σ² is estimated from data, the predictive distribution is technically a Student-t, not Gaussian. For practical purposes with reasonable N, the Gaussian is an excellent approximation (a Student-t with many degrees of freedom is nearly normal; see the quantile comparison below), so plugging in the estimate σ̂² is standard practice.
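As a quick sanity check of this claim, compare the Student-t and Gaussian 97.5% quantiles as the degrees of freedom grow:

```python
from scipy import stats

# The t quantile approaches the normal quantile (1.960) as df increases
for df in [5, 10, 30, 100]:
    print(f"df={df:4d}   t 97.5% quantile = {stats.t.ppf(0.975, df):.3f}   "
          f"normal = {stats.norm.ppf(0.975):.3f}")
```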
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

class BayesianLinearRegression:
    """
    Complete implementation of Bayesian Linear Regression
    with predictions and uncertainty quantification.
    """

    def __init__(self, alpha=1.0, beta=None, fit_noise=True):
        """
        Parameters:
        -----------
        alpha : float
            Prior precision on weights
        beta : float or None
            Noise precision (1/sigma^2). If None, estimated from data.
        fit_noise : bool
            Whether to estimate noise from data
        """
        self.alpha = alpha
        self.beta = beta
        self.fit_noise = fit_noise
        self.m_N = None
        self.S_N = None
        self.L_N = None  # Cholesky factor

    def fit(self, X, y):
        """Compute posterior over weights."""
        N, D = X.shape

        # Estimate noise if needed
        if self.beta is None or self.fit_noise:
            # Quick OLS estimate for noise level
            w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
            residuals = y - X @ w_ols
            sigma_sq_hat = np.var(residuals) * N / max(N - D, 1)
            self.beta = 1.0 / max(sigma_sq_hat, 1e-10)

        # Posterior precision
        S_N_inv = self.alpha * np.eye(D) + self.beta * X.T @ X

        # Cholesky factorization for stability
        self.L_N, _ = cho_factor(S_N_inv, lower=True)

        # Posterior covariance
        self.S_N = cho_solve((self.L_N, True), np.eye(D))

        # Posterior mean
        self.m_N = self.beta * self.S_N @ X.T @ y

        return self

    def predict(self, X_test, return_std=False, return_cov=False):
        """
        Make predictions with uncertainty.

        Returns:
        --------
        mu : Predictive means
        std : Predictive standard deviations (if return_std=True)
        cov : Predictive covariance matrix (if return_cov=True)
        """
        mu = X_test @ self.m_N

        if not return_std and not return_cov:
            return mu

        # Epistemic variance for each point
        var_epistemic = np.sum((X_test @ self.S_N) * X_test, axis=1)

        # Total predictive variance
        var_total = var_epistemic + 1.0 / self.beta

        result = [mu]
        if return_std:
            result.append(np.sqrt(var_total))
        if return_cov:
            # Full covariance matrix
            cov = X_test @ self.S_N @ X_test.T + (1.0/self.beta) * np.eye(len(X_test))
            result.append(cov)

        return tuple(result)

    def sample_predictions(self, X_test, n_samples=100):
        """Sample from the predictive distribution."""
        # Sample weights
        w_samples = np.random.multivariate_normal(self.m_N, self.S_N, n_samples)

        # Noiseless predictions
        f_samples = X_test @ w_samples.T

        # Add observation noise
        sigma = np.sqrt(1.0 / self.beta)
        y_samples = f_samples + sigma * np.random.randn(*f_samples.shape)

        return y_samples

    def score(self, X, y):
        """Compute log predictive probability."""
        mu, std = self.predict(X, return_std=True)
        log_prob = -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((y - mu) / std)**2
        return np.mean(log_prob)

# Usage example
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    N, D = 50, 3
    X = np.random.randn(N, D)
    w_true = np.array([1.0, -0.5, 0.3])
    y = X @ w_true + 0.3 * np.random.randn(N)

    # Fit model
    model = BayesianLinearRegression(alpha=0.1)
    model.fit(X, y)

    # Predict
    X_test = np.random.randn(10, D)
    mu, std = model.predict(X_test, return_std=True)

    print("Predictions with 95% intervals:")
    for i, (m, s) in enumerate(zip(mu, std)):
        print(f"  Point {i}: {m:.3f} ± {1.96*s:.3f}")
```

The predictive distribution is the practical output of Bayesian linear regression: it tells us what we can expect for new inputs, with honest uncertainty quantification.
What's Next:
We've seen that Bayesian linear regression provides uncertainty estimates. But how do we use this uncertainty in practice? When is uncertainty high vs. low? How do we communicate it to stakeholders? The next page explores uncertainty quantification in depth—practical methods for interpreting, visualizing, and acting on predictive uncertainty.
You can now derive and compute the predictive distribution for Bayesian linear regression. This is the practical payoff of Bayesian inference—predictions that come with honest, calibrated uncertainty estimates. Next, we'll explore how to use these uncertainty estimates effectively.