We've derived the posterior distribution over weights—a complete description of our uncertainty about the model parameters. But in practice, we don't care about weights directly. What we care about is prediction: given a new input x₊, what is the corresponding output y₊?
Classical regression gives a point prediction: ŷ₊ = wᵀx₊. But this ignores all uncertainty. If we're uncertain about the weights, shouldn't we be uncertain about predictions made with those weights?
Bayesian linear regression answers with the predictive distribution: a full probability distribution over the target value for a new input. This distribution captures two sources of uncertainty: our uncertainty about the weights (epistemic) and the irreducible noise in the observations (aleatoric).
The predictive distribution integrates out weight uncertainty, giving us principled confidence intervals that reflect both what we don't know and what is inherently unknowable.
By the end of this page, you will be able to derive the predictive distribution for Bayesian linear regression, understand how weight uncertainty transforms into prediction uncertainty, compute prediction intervals that have valid probabilistic interpretation, and visualize predictive uncertainty across the input space.
Given a new input x₊ ∈ ℝᴰ, we want to predict the corresponding target value y₊. In the Bayesian framework, we compute the predictive distribution:
$$p(y_* | \mathbf{x}_*, \mathbf{y}, \mathbf{X})$$
This is the probability distribution over y₊ given:
• the new input x₊,
• the observed training targets y,
• the training inputs X.
The Key Insight: Marginalization
We don't know the true weights w. But we have a posterior distribution over them. The predictive distribution averages over all possible weights, weighted by their posterior probability:
$$p(y_* | \mathbf{x}_*, \mathbf{y}, \mathbf{X}) = \int p(y_* | \mathbf{x}_*, \mathbf{w}) \cdot p(\mathbf{w} | \mathbf{y}, \mathbf{X}) \, d\mathbf{w}$$
This integral marginalizes out the weights. The result accounts for all sources of uncertainty: what we still don't know about w after seeing the data, and the noise that will corrupt the new observation.
You could predict using ŷ₊ = mₙᵀx₊ (the posterior mean of weights). This is the MAP/point prediction. But it discards uncertainty: you get no confidence intervals, no honest assessment of how much you don't know. The predictive distribution gives you the full picture.
Let's compute the integral. We have:
Likelihood for new observation: $$p(y_* | \mathbf{x}_*, \mathbf{w}) = \mathcal{N}(y_* | \mathbf{w}^\top\mathbf{x}_*, \sigma^2)$$
Posterior over weights: $$p(\mathbf{w} | \mathbf{y}, \mathbf{X}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N)$$
The predictive distribution is: $$p(y_* | \mathbf{x}_*, \mathbf{y}, \mathbf{X}) = \int \mathcal{N}(y_* | \mathbf{w}^\top\mathbf{x}_*, \sigma^2) \cdot \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N) \, d\mathbf{w}$$
This integral is a linear-Gaussian marginalization (equivalently, the convolution of the Gaussian over f₊ = wᵀx₊ with Gaussian noise), and the result is itself Gaussian. It follows from the general formula for marginalizing Gaussian random variables, stated below.
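For reference, the general linear-Gaussian marginalization identity being applied, written with a generic projection vector a, mean μ, and covariance Σ (in our case a = x₊, μ = mₙ, Σ = Sₙ):

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad p(y | \mathbf{w}) = \mathcal{N}(y | \mathbf{a}^\top\mathbf{w}, \sigma^2) \;\;\Rightarrow\;\; p(y) = \int p(y | \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w} = \mathcal{N}(y | \mathbf{a}^\top\boldsymbol{\mu}, \; \mathbf{a}^\top\boldsymbol{\Sigma}\mathbf{a} + \sigma^2)$$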
Step 1: Linear Transformation of Weights
Define the latent prediction (without noise): f₊ = wᵀx₊
Since w ~ 𝒩(mₙ, Sₙ), and f₊ is a linear function of w:
$$f_* = \mathbf{w}^\top\mathbf{x}_* \sim \mathcal{N}(\mathbf{m}_N^\top\mathbf{x}_*, \ \mathbf{x}_*^\top\mathbf{S}_N\mathbf{x}_*)$$
The mean is mₙᵀx₊ and variance is x₊ᵀSₙx₊.
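A quick Monte Carlo sanity check of this step, using a made-up posterior; the specific values of m_N, S_N, and x_star below are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical posterior and test input, purely for checking the identity above
rng = np.random.default_rng(0)
m_N = np.array([0.5, 1.5])
S_N = np.array([[0.20, 0.05],
                [0.05, 0.10]])
x_star = np.array([1.0, 0.7])

# Monte Carlo: draw weights from the posterior and project onto x_star
w = rng.multivariate_normal(m_N, S_N, size=200_000)
f_star = w @ x_star

print("MC mean:", f_star.mean(), " analytic:", m_N @ x_star)            # ~1.55
print("MC var: ", f_star.var(),  " analytic:", x_star @ S_N @ x_star)   # ~0.319
```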
Step 2: Add Observation Noise
The observed y₊ is f₊ plus independent Gaussian noise: $$y_* = f_* + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
Since f₊ and ε are independent Gaussians, their sum is Gaussian with:
p(y₊ | x₊, y, X) = 𝒩(y₊ | μ₊, σ²₊) where:
Predictive Mean: μ₊ = mₙᵀx₊
Predictive Variance: σ²₊ = x₊ᵀSₙx₊ + σ²
The variance decomposes into: • x₊ᵀSₙx₊: Uncertainty from not knowing weights (epistemic) • σ²: Inherent observation noise (aleatoric)
The predictive variance σ²₊ = x₊ᵀSₙx₊ + σ² has a profound interpretation. Let's unpack each term.
Term 1: Epistemic Uncertainty (x₊ᵀSₙx₊)
This term arises because we don't know the true weights. Even with infinite precision observations (σ² = 0), we'd still have prediction uncertainty if we're uncertain about w.
Key properties:
• It depends on the test input x₊: small near the training data, large far from it.
• It shrinks as more data is collected, and vanishes in the limit N → ∞.
Term 2: Aleatoric Uncertainty (σ²)
This is irreducible noise—the inherent randomness in the data-generating process. Even with known true weights, observations would be noisy.
Key properties:
• It is constant across the input space: it does not depend on x₊.
• It cannot be reduced by collecting more data of the same kind; only better measurements would lower it.
| Property | Epistemic (x₊ᵀSₙx₊) | Aleatoric (σ²) |
|---|---|---|
| Source | Unknown weights | Inherent randomness |
| Reducible? | Yes, with more data | No, fundamental |
| Input-dependent? | Yes | No |
| Limit as N → ∞ | → 0 | Stays σ² |
| Also called | Model uncertainty | Noise variance |
| Can be estimated? | Yes, from posterior | Yes, from data |
Understanding the source of uncertainty guides action. High epistemic uncertainty? Collect more data in that region. High aleatoric uncertainty? Improve measurement quality or accept fundamental limits. Bayesian inference naturally provides this decomposition.
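The following minimal sketch (synthetic 1-D data, hypothetical hyperparameters α and σ) shows the decomposition in action: as N grows, the epistemic term at a fixed test point shrinks toward zero while the aleatoric term σ² stays constant.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 0.1, 0.5            # hypothetical prior precision and noise level
beta = 1.0 / sigma**2
x_star = np.array([1.0, 2.0])      # fixed test input: [bias, x]

for N in [5, 20, 100, 1000]:
    x = rng.uniform(-3, 3, N)
    X = np.column_stack([np.ones(N), x])                     # design matrix with bias
    S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)  # posterior covariance
    epistemic = x_star @ S_N @ x_star
    print(f"N={N:4d}  epistemic={epistemic:.5f}  aleatoric={sigma**2:.5f}")
```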
The predictive distribution enables principled prediction intervals—ranges that contain the true value with specified probability.
Credible Intervals:
For a Gaussian predictive distribution 𝒩(μ₊, σ²₊), the (1-α) credible interval is:
$$\text{CI}_{1-\alpha} = \left[ \mu_* - z_{\alpha/2}\,\sigma_*, \quad \mu_* + z_{\alpha/2}\,\sigma_* \right]$$
where z_{α/2} is the standard normal quantile. Common values:
• 90% interval: z = 1.645
• 95% interval: z = 1.960
• 99% interval: z = 2.576
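As a sketch, a small helper (the name credible_interval is hypothetical) that turns a Gaussian predictive mean and standard deviation into a central credible interval:

```python
from scipy import stats

def credible_interval(mu, sigma, level=0.95):
    """Central credible interval for a Gaussian predictive N(mu, sigma^2)."""
    z = stats.norm.ppf(0.5 + level / 2)   # 1.645 (90%), 1.960 (95%), 2.576 (99%)
    return mu - z * sigma, mu + z * sigma

print(credible_interval(2.0, 0.5))        # approximately (1.02, 2.98)
```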
Interpretation (Bayesian):
The 95% credible interval means: "Given our prior and observed data, there's a 95% probability the true y₊ lies in this interval."
This is subtly different from frequentist confidence intervals, which make statements about long-run coverage rather than probability of the true value.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def bayesian_prediction(X_train, y_train, X_test, alpha, beta):
    """
    Compute Bayesian predictions with uncertainty.

    Returns:
        mu: Predictive means
        sigma: Predictive standard deviations
    """
    N, D = X_train.shape

    # Posterior
    S_N_inv = alpha * np.eye(D) + beta * X_train.T @ X_train
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ X_train.T @ y_train

    # Predictive mean for each test point
    mu = X_test @ m_N

    # Predictive variance for each test point
    # sigma^2_* = x^T S_N x + 1/beta
    sigma_sq = np.sum((X_test @ S_N) * X_test, axis=1) + 1.0 / beta
    sigma = np.sqrt(sigma_sq)

    return mu, sigma, m_N, S_N

# Generate synthetic data
np.random.seed(42)
N_train = 20
X_train = np.sort(np.random.uniform(-3, 3, N_train))
X_train_design = np.column_stack([np.ones(N_train), X_train])  # Add bias

w_true = np.array([0.5, 1.5])  # True weights
noise_std = 0.5
y_train = X_train_design @ w_true + noise_std * np.random.randn(N_train)

# Bayesian inference
alpha = 0.1                      # Prior precision
beta = 1.0 / noise_std**2        # Noise precision

# Test points
X_test = np.linspace(-4, 4, 200)
X_test_design = np.column_stack([np.ones(len(X_test)), X_test])

mu, sigma, m_N, S_N = bayesian_prediction(
    X_train_design, y_train, X_test_design, alpha, beta)

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))

# Mean prediction
ax.plot(X_test, mu, 'b-', linewidth=2, label='Predictive Mean')

# Confidence bands (coverage of a central ±nσ interval is 2Φ(n) - 1)
for n_std, alpha_fill in [(1, 0.4), (2, 0.2), (3, 0.1)]:
    ax.fill_between(X_test, mu - n_std*sigma, mu + n_std*sigma,
                    alpha=alpha_fill, color='blue',
                    label=f'{n_std}σ interval ({100*(2*stats.norm.cdf(n_std) - 1):.1f}%)')

# Training data
ax.scatter(X_train, y_train, c='red', s=50, zorder=5, label='Training Data')

# True function
y_true = X_test_design @ w_true
ax.plot(X_test, y_true, 'g--', linewidth=2, label='True Function')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Bayesian Linear Regression: Predictive Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```

Key Observations from the Visualization:
• The predictive mean tracks the true function closely where training data is available.
• The uncertainty bands are narrowest near the training data and widen toward the edges of (and beyond) the training range.
• Most training points fall inside the 2σ band, consistent with calibrated uncertainty.
A distinctive feature of Bayesian prediction is that uncertainty varies across the input space. The epistemic component x₊ᵀSₙx₊ depends on where we're predicting.
Geometric Interpretation:
The posterior covariance Sₙ defines an ellipsoid of weight uncertainty. The quadratic form x₊ᵀSₙx₊ measures how x₊ "stretches" this ellipsoid:
• along directions the data has constrained well, Sₙ is small and the contribution to the variance is small;
• along poorly constrained directions, Sₙ is large and the contribution is large;
• the further x₊ extends into poorly constrained directions, the larger the epistemic variance.
Intuitive Understanding:
Imagine linear regression with two features x₁ and x₂, where the training data varies widely in x₁ but hardly at all in x₂. The posterior pins down w₁ tightly but leaves w₂ uncertain.
For a test point with large x₂, predictions will be uncertain because the uncertain weight w₂ is multiplied by a large factor.
Extrapolation Warning:
This is why Bayesian methods naturally warn about extrapolation: far from the training data, x₊ᵀSₙx₊ grows (for a linear model, roughly quadratically in the distance), so the predictive bands widen exactly where the model has the least evidence. The numeric sketch below illustrates the effect.
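A minimal numeric sketch, assuming synthetic 1-D data with training inputs confined to [-1, 1] and hypothetical hyperparameters: the epistemic variance is tiny inside the training range and grows rapidly outside it.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma = 0.1, 0.5
beta = 1.0 / sigma**2

x = rng.uniform(-1, 1, 15)                       # training inputs clustered near 0
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 1.5 * x + sigma * rng.standard_normal(15)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)

for x_star in [0.0, 1.0, 3.0, 10.0]:
    phi = np.array([1.0, x_star])                # test feature vector [bias, x*]
    print(f"x* = {x_star:5.1f}   epistemic variance = {phi @ S_N @ phi:.4f}")
```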
Bayesian methods refuse to be confidently wrong—they widen uncertainty where knowledge is lacking. This is philosophically correct but sometimes operationally inconvenient. A point prediction with wide error bars may be less actionable than an (overconfident) point estimate. The honest thing isn't always the useful thing, but it should be the starting point.
Often we need predictions at multiple test points simultaneously. Let X₊ be an N₊ × D matrix of N₊ test inputs. The joint predictive distribution is:
$$p(\mathbf{y}_* | \mathbf{X}_*, \mathbf{y}, \mathbf{X}) = \mathcal{N}(\mathbf{y}_* | \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$
where:
Predictive Mean Vector: $$\boldsymbol{\mu}_* = \mathbf{X}_*\mathbf{m}_N$$
Predictive Covariance Matrix: $$\boldsymbol{\Sigma}_* = \mathbf{X}_*\mathbf{S}_N\mathbf{X}_*^\top + \sigma^2\mathbf{I}_{N_*}$$
Structure of the Covariance:
The diagonal elements are the individual predictive variances: $$(\boldsymbol{\Sigma}_*)_{ii} = \mathbf{x}_{*i}^\top\mathbf{S}_N\mathbf{x}_{*i} + \sigma^2$$
The off-diagonal elements capture correlations between predictions: $$(\boldsymbol{\Sigma}_*)_{ij} = \mathbf{x}_{*i}^\top\mathbf{S}_N\mathbf{x}_{*j}$$
Note: Noise is independent across test points, so it only appears on the diagonal.
Why Correlations Matter:
Predictions at different test points are correlated through shared weight uncertainty. If we're uncertain about w₁, predictions for all test points with large x₁ will be jointly uncertain—they'll tend to be wrong in the same direction.
This correlation structure is crucial for:
• sampling coherent function realizations across many inputs (next subsection),
• propagating uncertainty into sums, averages, or other downstream quantities built from several predictions,
• computing joint probabilities over multiple predictions.
A sketch of the joint covariance and the resulting correlation follows.
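A minimal sketch, assuming synthetic 1-D data and hypothetical hyperparameters: build the joint predictive covariance for two nearby test points and read off their correlation, which is strongly positive because both predictions share the same weight uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma = 0.1, 0.5
beta = 1.0 / sigma**2

# Fit the posterior on a small synthetic dataset
x = rng.uniform(-2, 2, 20)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 1.5 * x + sigma * rng.standard_normal(20)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
m_N = beta * S_N @ X.T @ y

# Joint predictive over two similar test inputs (both extrapolating to x ~ 2.5)
X_star = np.array([[1.0, 2.5],
                   [1.0, 2.6]])
mu = X_star @ m_N
Sigma = X_star @ S_N @ X_star.T + sigma**2 * np.eye(2)

corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print("joint covariance:\n", Sigma)
print("correlation between the two predictions:", round(corr, 3))
```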
Sampling from the Predictive Distribution:
To generate samples of consistent predictions:
$$\mathbf{y}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$
Or equivalently: sample weights w from the posterior 𝒩(mₙ, Sₙ), compute f₊ = X₊w, and add independent observation noise, as the code below does.
```python
import numpy as np
import matplotlib.pyplot as plt

def sample_predictive(X_train, y_train, X_test, alpha, beta, n_samples=10):
    """
    Sample from the predictive distribution.
    Shows multiple consistent function realizations.
    """
    N, D = X_train.shape

    # Posterior
    S_N_inv = alpha * np.eye(D) + beta * X_train.T @ X_train
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ X_train.T @ y_train
    sigma = np.sqrt(1.0 / beta)

    # Sample weights from posterior
    w_samples = np.random.multivariate_normal(m_N, S_N, size=n_samples)

    # Predictions for each weight sample
    f_samples = X_test @ w_samples.T  # (N_test, n_samples)

    # Add observation noise
    y_samples = f_samples + sigma * np.random.randn(*f_samples.shape)

    return f_samples, y_samples, m_N, S_N

# Setup
np.random.seed(42)
N_train = 10
x_train = np.sort(np.random.uniform(-2, 2, N_train))
X_train = np.column_stack([np.ones(N_train), x_train])
w_true = np.array([0.3, 1.2])
y_train = X_train @ w_true + 0.3 * np.random.randn(N_train)

x_test = np.linspace(-3, 3, 100)
X_test = np.column_stack([np.ones(len(x_test)), x_test])

alpha, beta = 0.5, 1.0 / 0.3**2

f_samples, y_samples, m_N, S_N = sample_predictive(
    X_train, y_train, X_test, alpha, beta, n_samples=20)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Samples from posterior (noiseless)
ax = axes[0]
for i in range(f_samples.shape[1]):
    ax.plot(x_test, f_samples[:, i], alpha=0.3, color='blue')
ax.plot(x_test, X_test @ m_N, 'r-', linewidth=2, label='Posterior Mean')
ax.scatter(x_train, y_train, c='black', s=50, zorder=5, label='Training Data')
ax.set_title('Samples from Posterior (Noiseless f)')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()

# Right: Samples with noise
ax = axes[1]
for i in range(y_samples.shape[1]):
    ax.plot(x_test, y_samples[:, i], alpha=0.3, color='green')
ax.plot(x_test, X_test @ m_N, 'r-', linewidth=2, label='Posterior Mean')
ax.scatter(x_train, y_train, c='black', s=50, zorder=5, label='Training Data')
ax.set_title('Samples from Predictive (With Noise)')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()

plt.tight_layout()
plt.show()

# Notice: Samples spread more outside training region
# All samples pass near the data points (constrained by likelihood)
# The right plot is noisier (observation noise added)
```

Frequentist linear regression also provides prediction intervals. How do they compare to Bayesian credible intervals?
Frequentist Prediction Interval:
For OLS with Gaussian errors, the (1-α) prediction interval at x₊ is:
$$\hat{y}_* \pm t_{n-p,\,\alpha/2} \cdot \hat{\sigma} \cdot \sqrt{1 + \mathbf{x}_*^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_*}$$
where:
• ŷ₊ = x₊ᵀŵ_OLS is the OLS point prediction,
• t_{n−p, α/2} is the Student-t quantile with n − p degrees of freedom,
• σ̂ is the residual standard error.
A sketch that computes this interval follows.
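Here is a sketch of that interval, assuming Gaussian errors and using scipy's Student-t quantile; the function name ols_prediction_interval is hypothetical, chosen for illustration.

```python
import numpy as np
from scipy import stats

def ols_prediction_interval(X, y, x_star, level=0.95):
    """Classical OLS prediction interval at a single test input x_star."""
    n, p = X.shape
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ w_hat
    sigma_hat = np.sqrt(resid @ resid / (n - p))                 # residual std error
    t = stats.t.ppf(0.5 + level / 2, df=n - p)                   # t quantile
    half = t * sigma_hat * np.sqrt(1.0 + x_star @ np.linalg.solve(X.T @ X, x_star))
    y_hat = x_star @ w_hat
    return y_hat - half, y_hat + half

# Example usage with synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.uniform(-2, 2, 30)])
y = X @ np.array([0.5, 1.5]) + 0.5 * rng.standard_normal(30)
print(ols_prediction_interval(X, y, np.array([1.0, 1.0])))
```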
Key Differences:
| Aspect | Bayesian Credible Interval | Frequentist Prediction Interval |
|---|---|---|
| Philosophy | Probability of true value | Long-run coverage guarantee |
| Prior information | Incorporated via p(w) | Not incorporated |
| Small samples | Prior regularizes, stable | Can be unstable |
| Interpretation | "95% probability y₊ is here" | "95% of such intervals contain y₊" |
| σ² treatment | Can be fixed or integrated out | Estimated with degrees-of-freedom correction |
| Collinearity | Prior helps, stable | Can blow up |
When They Agree:
With a flat/vague prior (α → 0) and known σ², the posterior mean mₙ converges to the OLS estimate and the Bayesian credible interval coincides with the frequentist interval (which, with σ² known, uses normal rather than t quantiles). The sketch below checks the first claim numerically.
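A minimal numerical check under these assumptions (synthetic data, noise level treated as known): as the prior precision α shrinks toward zero, the posterior mean approaches the OLS solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -0.5, 0.3]) + 0.4 * rng.standard_normal(200)

beta = 1.0 / 0.4**2                                   # known noise precision
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

for alpha in [1.0, 1e-3, 1e-8]:
    S_N = np.linalg.inv(alpha * np.eye(3) + beta * X.T @ X)
    m_N = beta * S_N @ X.T @ y
    print(f"alpha = {alpha:g}   max |m_N - w_OLS| = {np.abs(m_N - w_ols).max():.2e}")
```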
When Bayesian Is Better: small samples, genuine prior knowledge, many or collinear features, or settings where uncertainty must propagate into downstream decisions.
When Frequentist Is Better: when no defensible prior is available, when exact long-run coverage guarantees are the requirement, or when a simple, convention-backed procedure is preferred.
For large samples with well-behaved data, Bayesian and frequentist intervals are nearly identical. The differences matter most in challenging settings: small samples, many features, prior knowledge available, or uncertainty needs to propagate into downstream decisions. In these cases, Bayesian inference shines.
Implementing Bayesian prediction correctly requires attention to several details.
Efficient Variance Computation:
For many test points, computing x₊ᵀSₙx₊ individually is inefficient. Use vectorized computation:
```python
# Slow: loop over test points
var = np.array([x.T @ S_N @ x for x in X_test])

# Fast: vectorized
var = np.sum((X_test @ S_N) * X_test, axis=1)
```
Numerical Stability:
Ensure Sₙ is computed stably (via Cholesky decomposition). Small negative eigenvalues from numerical error can cause problems when computing variances.
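One way to do this, sketched below under the assumption that the posterior precision matrix (called S_N_inv here) is available: solve against the precision with a Cholesky factorization instead of forming Sₙ explicitly, and clip any tiny negative variances produced by round-off. The helper name is hypothetical.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def predictive_variance_stable(X_test, S_N_inv, beta):
    """Total predictive variance x^T S_N x + 1/beta without explicitly inverting S_N_inv."""
    c = cho_factor(S_N_inv, lower=True)
    V = cho_solve(c, X_test.T)                     # each column v_i solves S_N_inv v_i = x_i
    var_epistemic = np.sum(X_test * V.T, axis=1)   # x_i^T v_i = x_i^T S_N x_i
    return np.clip(var_epistemic, 0.0, None) + 1.0 / beta

# Example: 2x2 precision matrix and two test points
S_N_inv = np.array([[5.0, 1.0], [1.0, 3.0]])
X_test = np.array([[1.0, 0.5], [1.0, 2.0]])
print(predictive_variance_stable(X_test, S_N_inv, beta=4.0))
```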
Handling Unknown σ²:
If σ² is estimated from data, the predictive distribution is technically a Student-t, not Gaussian. For practical purposes with reasonable N, the Gaussian is an excellent approximation (a Student-t with many degrees of freedom is nearly normal; see the quantile comparison below), so plugging in the estimate σ̂² is standard practice.
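As a quick sanity check of this claim, compare the Student-t and Gaussian 97.5% quantiles as the degrees of freedom grow:

```python
from scipy import stats

# The t quantile approaches the normal quantile (1.960) as df increases
for df in [5, 10, 30, 100]:
    print(f"df={df:4d}   t 97.5% quantile = {stats.t.ppf(0.975, df):.3f}   "
          f"normal = {stats.norm.ppf(0.975):.3f}")
```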
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

class BayesianLinearRegression:
    """
    Complete implementation of Bayesian Linear Regression
    with predictions and uncertainty quantification.
    """

    def __init__(self, alpha=1.0, beta=None, fit_noise=True):
        """
        Parameters:
        -----------
        alpha : float
            Prior precision on weights
        beta : float or None
            Noise precision (1/sigma^2). If None, estimated from data.
        fit_noise : bool
            Whether to estimate noise from data
        """
        self.alpha = alpha
        self.beta = beta
        self.fit_noise = fit_noise
        self.m_N = None
        self.S_N = None
        self.L_N = None  # Cholesky factor

    def fit(self, X, y):
        """Compute posterior over weights."""
        N, D = X.shape

        # Estimate noise if needed
        if self.beta is None or self.fit_noise:
            # Quick OLS estimate for noise level
            w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
            residuals = y - X @ w_ols
            sigma_sq_hat = np.var(residuals) * N / max(N - D, 1)
            self.beta = 1.0 / max(sigma_sq_hat, 1e-10)

        # Posterior precision
        S_N_inv = self.alpha * np.eye(D) + self.beta * X.T @ X

        # Cholesky factorization for stability
        self.L_N, _ = cho_factor(S_N_inv, lower=True)

        # Posterior covariance
        self.S_N = cho_solve((self.L_N, True), np.eye(D))

        # Posterior mean
        self.m_N = self.beta * self.S_N @ X.T @ y

        return self

    def predict(self, X_test, return_std=False, return_cov=False):
        """
        Make predictions with uncertainty.

        Returns:
        --------
        mu : Predictive means
        std : Predictive standard deviations (if return_std=True)
        cov : Predictive covariance matrix (if return_cov=True)
        """
        mu = X_test @ self.m_N

        if not return_std and not return_cov:
            return mu

        # Epistemic variance for each point
        var_epistemic = np.sum((X_test @ self.S_N) * X_test, axis=1)

        # Total predictive variance
        var_total = var_epistemic + 1.0 / self.beta

        result = [mu]
        if return_std:
            result.append(np.sqrt(var_total))
        if return_cov:
            # Full covariance matrix
            cov = X_test @ self.S_N @ X_test.T + (1.0/self.beta) * np.eye(len(X_test))
            result.append(cov)

        return tuple(result)

    def sample_predictions(self, X_test, n_samples=100):
        """Sample from the predictive distribution."""
        # Sample weights
        w_samples = np.random.multivariate_normal(self.m_N, self.S_N, n_samples)

        # Noiseless predictions
        f_samples = X_test @ w_samples.T

        # Add observation noise
        sigma = np.sqrt(1.0 / self.beta)
        y_samples = f_samples + sigma * np.random.randn(*f_samples.shape)

        return y_samples

    def score(self, X, y):
        """Compute log predictive probability."""
        mu, std = self.predict(X, return_std=True)
        log_prob = -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((y - mu) / std)**2
        return np.mean(log_prob)

# Usage example
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    N, D = 50, 3
    X = np.random.randn(N, D)
    w_true = np.array([1.0, -0.5, 0.3])
    y = X @ w_true + 0.3 * np.random.randn(N)

    # Fit model
    model = BayesianLinearRegression(alpha=0.1)
    model.fit(X, y)

    # Predict
    X_test = np.random.randn(10, D)
    mu, std = model.predict(X_test, return_std=True)

    print("Predictions with 95% intervals:")
    for i, (m, s) in enumerate(zip(mu, std)):
        print(f"  Point {i}: {m:.3f} ± {1.96*s:.3f}")
```

The predictive distribution is the practical output of Bayesian linear regression: it tells us what we can expect for new inputs, with honest uncertainty quantification.
What's Next:
We've seen that Bayesian linear regression provides uncertainty estimates. But how do we use this uncertainty in practice? When is uncertainty high vs. low? How do we communicate it to stakeholders? The next page explores uncertainty quantification in depth—practical methods for interpreting, visualizing, and acting on predictive uncertainty.
You can now derive and compute the predictive distribution for Bayesian linear regression. This is the practical payoff of Bayesian inference—predictions that come with honest, calibrated uncertainty estimates. Next, we'll explore how to use these uncertainty estimates effectively.