Having derived the predictive distribution, we now possess a powerful tool: an uncertainty estimate attached to every prediction. But a probability distribution is only useful if we know how to interpret it, trust it, and act on it.
In practice, uncertainty quantification serves many purposes: validating that a model's predictions can be trusted, communicating risk to stakeholders, deciding where to collect more data, and making decisions that account for what the model does not know.
This page bridges theory and practice—we'll learn how to validate that our uncertainty estimates are trustworthy and how to use them effectively in real-world applications.
By the end of this page, you will understand how to check if uncertainty estimates are calibrated, visualize uncertainty effectively for different audiences, decompose uncertainty into epistemic and aleatoric components, and make decisions that appropriately account for prediction uncertainty.
Calibration is the most important property for uncertainty estimates. A model is well-calibrated if its probabilistic predictions match observed frequencies.
Definition:
A predictive distribution is calibrated if, for any confidence level 1-α: $$P(y_* \in \text{CI}_{1-\alpha}(\mathbf{x}_*)) = 1 - \alpha$$
In plain terms: 95% confidence intervals should contain the true value 95% of the time. 80% intervals should contain 80% of observations. If 95% intervals only capture 70% of outcomes, the model is overconfident.
Why Calibration Matters:

Overconfident intervals lead decision-makers to take on more risk than they intended; underconfident intervals waste resources hedging against outcomes that rarely occur. Either way, the probabilities attached to predictions stop meaning what they claim to mean.

Checking Calibration:

For regression, we use calibration curves (reliability diagrams). The procedure:

1. Choose a grid of confidence levels (e.g., 10%, 20%, ..., 90%).
2. At each level, construct the corresponding predictive interval for every test point.
3. Compute the observed coverage: the fraction of test targets that fall inside their intervals.
4. Plot observed coverage against expected coverage.

Interpretation:

A curve along the diagonal indicates good calibration. A curve below the diagonal means intervals capture fewer observations than claimed (overconfidence); a curve above the diagonal means intervals are wider than necessary (underconfidence).

Alternatively, check the coverage rate for various confidence levels:
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def check_calibration(y_true, mu_pred, sigma_pred, n_bins=10):
    """
    Check calibration of Bayesian predictions.
    Returns calibration metrics and creates diagnostic plots.
    """
    # Probability Integral Transform (PIT)
    pit_values = stats.norm.cdf((y_true - mu_pred) / sigma_pred)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # 1. PIT histogram (should be uniform)
    ax = axes[0]
    ax.hist(pit_values, bins=n_bins, density=True, alpha=0.7, edgecolor='black')
    ax.axhline(1.0, color='red', linestyle='--', label='Ideal (Uniform)')
    ax.set_xlabel('PIT Value')
    ax.set_ylabel('Density')
    ax.set_title('PIT Histogram (Ideal: Uniform)')
    ax.legend()

    # 2. Calibration curve
    ax = axes[1]
    expected_coverage = np.linspace(0, 1, 100)
    observed_coverage = []
    for p in expected_coverage:
        # Symmetric interval containing probability p
        z = stats.norm.ppf((1 + p) / 2)  # z-score for symmetric interval
        in_interval = np.abs((y_true - mu_pred) / sigma_pred) <= z
        observed_coverage.append(np.mean(in_interval))
    ax.plot(expected_coverage, observed_coverage, 'b-', linewidth=2, label='Observed')
    ax.plot([0, 1], [0, 1], 'r--', linewidth=1, label='Ideal')
    ax.fill_between(expected_coverage,
                    expected_coverage - 2*np.sqrt(expected_coverage*(1-expected_coverage)/len(y_true)),
                    expected_coverage + 2*np.sqrt(expected_coverage*(1-expected_coverage)/len(y_true)),
                    alpha=0.2, color='gray', label='±2σ band')
    ax.set_xlabel('Expected Coverage')
    ax.set_ylabel('Observed Coverage')
    ax.set_title('Calibration Curve')
    ax.legend()

    # 3. Quantile-Quantile plot
    ax = axes[2]
    sorted_pit = np.sort(pit_values)
    theoretical_quantiles = np.linspace(0, 1, len(pit_values))
    ax.plot(theoretical_quantiles, sorted_pit, 'b.', alpha=0.5)
    ax.plot([0, 1], [0, 1], 'r--', linewidth=1)
    ax.set_xlabel('Theoretical Quantile')
    ax.set_ylabel('Observed Quantile')
    ax.set_title('Q-Q Plot for PIT')

    plt.tight_layout()
    plt.show()

    # Kolmogorov-Smirnov test for uniformity of the PIT values
    ks_stat, ks_pvalue = stats.kstest(pit_values, 'uniform')

    # Specific coverage rates
    coverage_results = {}
    for confidence in [0.5, 0.8, 0.9, 0.95, 0.99]:
        z = stats.norm.ppf((1 + confidence) / 2)
        coverage = np.mean(np.abs((y_true - mu_pred) / sigma_pred) <= z)
        coverage_results[confidence] = coverage

    print("Calibration Summary:")
    print(f"  KS test p-value: {ks_pvalue:.4f} (>0.05 suggests good calibration)")
    print("\nCoverage rates (expected → observed):")
    for conf, obs in coverage_results.items():
        status = '✓' if abs(obs - conf) < 0.05 else '✗'
        print(f"  {100*conf:.0f}%: {100*obs:.1f}% {status}")

    return pit_values, coverage_results

# Example usage
np.random.seed(42)
n_test = 500

# Generate test data
x_test = np.random.uniform(-3, 3, n_test)
w_true = 1.5
noise_true = 0.5
y_true = w_true * x_test + noise_true * np.random.randn(n_test)

# Simulate Bayesian predictions (well-calibrated)
mu_pred = w_true * x_test                      # Assume we know the true weight (ideal case)
sigma_pred = np.sqrt(0.05 + noise_true**2)     # Epistemic + aleatoric

pit, coverage = check_calibration(y_true, mu_pred, sigma_pred)
```

Bayesian linear regression with Gaussian assumptions produces calibrated predictions if the model is correct. In practice, miscalibration arises from model misspecification.
Common Causes of Overconfidence:
Model Misspecification: True relationship is nonlinear, but we fit a linear model. Predictions are systematically wrong, and this error isn't captured in σ².
Heteroscedastic Noise: True noise varies with input (σ²(x)), but we assume constant σ². Regions with high true noise are overconfident.
Outliers: Heavy-tailed noise (not Gaussian) causes more extreme observations than predicted.
Wrong Prior: The prior is too confident (α too large), shrinking estimates too strongly toward the prior mean.
Underestimated σ²: If σ² is estimated from data but the sample is small, it may be underestimated.
Common Causes of Underconfidence:
Overly Weak Prior: α too small, leading to excessive variance in weight estimates.
Overestimated σ²: The noise variance is set higher than its true value, so intervals are wider than necessary.
Conservative Modeling: Some practitioners deliberately inflate uncertainty "to be safe."
| Symptom | Likely Cause | Potential Fix |
|---|---|---|
| Coverage too low (overconfident) | Model misspecification | Use more flexible model (kernels, basis functions) |
| Coverage too low | Noise underestimated | Increase σ² or estimate it from residuals (see the sketch after this table) |
| Coverage too low | Prior too strong | Decrease α (weaker prior) |
| Coverage too high (underconfident) | Noise overestimated | Decrease σ² or estimate properly |
| Coverage too high | Prior too weak | Increase α (add regularization) |
| Coverage varies by region | Heteroscedastic noise | Model noise as function of x |
| Extreme outliers | Heavy-tailed noise | Use robust likelihood (Student-t) |
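As a concrete instance of the "estimate from residuals" fix, here is a minimal sketch. It assumes you have held-out targets and posterior-mean predictions (the names `y_val` and `mu_val` are illustrative, not from the text above) and treats the residual variance as an estimate of the noise variance; note that this slightly overstates the aleatoric noise because residuals also contain estimation error.

```python
import numpy as np

# Hypothetical held-out split (illustrative values; substitute your own
# validation targets and posterior-mean predictions).
rng = np.random.default_rng(0)
y_val = rng.normal(size=200)                 # held-out targets
mu_val = y_val + 0.7 * rng.normal(size=200)  # predictions with unmodelled noise

# Estimate the noise variance from validation residuals.
residuals = y_val - mu_val
sigma2_hat = np.var(residuals, ddof=1)   # unbiased sample variance
beta_hat = 1.0 / sigma2_hat              # noise precision, if your code works with beta

print(f"Estimated noise std: {np.sqrt(sigma2_hat):.3f}")
# Rebuild sigma_pred with this estimate and re-run check_calibration
# before trusting the intervals.
```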
Bayesian inference gives calibrated uncertainties if the model is correct. Miscalibration signals model misspecification. Don't patch uncertainty estimates—fix the modeling problem. If linear isn't enough, use Gaussian processes. If Gaussian noise doesn't fit, use robust alternatives.
Effective visualization communicates uncertainty to diverse audiences—from domain experts to executives. Different contexts call for different approaches.
1. Confidence Bands (Best for Continuous Predictions)
Shade regions representing different confidence levels, typically with darker shading closer to the mean prediction.
2. Error Bars (Best for Discrete Points)
Classic vertical line segments showing ±1σ or ±2σ. Simple and widely understood.
3. Fan Charts (Best for Time Series)
Widening bands as forecast horizon increases. Common in economic forecasting.
4. Probability Distribution Insets (Best for Critical Points)
Show the full distribution at specific points of interest. Reveals asymmetry, multimodality.
5. Spaghetti Plots (Best for Showing Variability)
Plot multiple sample trajectories from the predictive distribution. Shows correlation structure.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def comprehensive_uncertainty_plot(x_test, mu, sigma, x_train=None, y_train=None):
    """
    Create multiple uncertainty visualization approaches.
    """
    fig = plt.figure(figsize=(16, 10))

    # 1. Gradient bands (continuous probability)
    ax1 = fig.add_subplot(2, 2, 1)
    for alpha, z in zip(np.linspace(0.1, 0.5, 10), np.linspace(2, 0.5, 10)):
        ax1.fill_between(x_test, mu - z*sigma, mu + z*sigma,
                         alpha=alpha, color='blue', linewidth=0)
    ax1.plot(x_test, mu, 'b-', linewidth=2, label='Mean')
    if x_train is not None:
        ax1.scatter(x_train, y_train, c='red', s=30, zorder=5, label='Data')
    ax1.set_title('Gradient Confidence Bands')
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.legend()

    # 2. Discrete confidence intervals
    ax2 = fig.add_subplot(2, 2, 2)
    ax2.fill_between(x_test, mu - 2*sigma, mu + 2*sigma,
                     alpha=0.2, color='blue', label='95% CI')
    ax2.fill_between(x_test, mu - 1*sigma, mu + 1*sigma,
                     alpha=0.3, color='blue', label='68% CI')
    ax2.plot(x_test, mu, 'b-', linewidth=2, label='Mean')
    if x_train is not None:
        ax2.scatter(x_train, y_train, c='red', s=30, zorder=5)
    ax2.set_title('Discrete Confidence Intervals')
    ax2.set_xlabel('x')
    ax2.legend()

    # 3. Spaghetti plot (sample trajectories)
    ax3 = fig.add_subplot(2, 2, 3)
    n_samples = 50
    for i in range(n_samples):
        y_sample = mu + sigma * np.random.randn(len(mu))
        ax3.plot(x_test, y_sample, 'b-', alpha=0.1, linewidth=0.5)
    ax3.plot(x_test, mu, 'r-', linewidth=2, label='Mean')
    if x_train is not None:
        ax3.scatter(x_train, y_train, c='black', s=30, zorder=5)
    ax3.set_title('Spaghetti Plot (50 Samples)')
    ax3.set_xlabel('x')
    ax3.set_ylabel('y')

    # 4. Uncertainty magnitude
    ax4 = fig.add_subplot(2, 2, 4)
    ax4.fill_between(x_test, 0, sigma, alpha=0.5, color='purple')
    ax4.plot(x_test, sigma, 'purple', linewidth=2)
    ax4.axhline(np.sqrt(0.25), color='gray', linestyle='--',
                label='Aleatoric component (σ)')
    ax4.set_title('Predictive Standard Deviation')
    ax4.set_xlabel('x')
    ax4.set_ylabel('σ(x)')
    ax4.legend()

    plt.tight_layout()
    plt.show()

# Example
np.random.seed(42)
x_test = np.linspace(-3, 3, 100)
x_train = np.random.uniform(-2, 2, 15)
y_train = 0.5 * x_train + 0.5 * np.random.randn(15)

# Simulated Bayesian predictions
mu = 0.5 * x_test
# Uncertainty increases away from training data
sigma = 0.4 + 0.1 * np.minimum(np.abs(x_test - 0), 2)

comprehensive_uncertainty_plot(x_test, mu, sigma, x_train, y_train)
```

Technical audiences appreciate full distributions and calibration details. Decision-makers often prefer simple ranges ('expect 10-15, but 5-20 is possible'). Executives may need worst-case bounds. Translate your uncertainty into the language your audience understands.
Recall that predictive variance decomposes as:
$$\sigma_*^2 = \underbrace{\mathbf{x}_*^\top\mathbf{S}_N\mathbf{x}_*}_{\text{Epistemic}} + \underbrace{\sigma^2}_{\text{Aleatoric}}$$
Tracking these separately provides actionable insights.
Epistemic Uncertainty (Model Uncertainty):

Arises from not knowing the weights exactly. It corresponds to the first term in the decomposition above, is largest far from the training data, and shrinks as more data is collected.

Aleatoric Uncertainty (Data Uncertainty):

Arises from noise in the observations themselves. It cannot be reduced by collecting more data of the same kind, and under the homoscedastic Gaussian assumption it is the constant σ².
```python
import numpy as np
import matplotlib.pyplot as plt

def decompose_uncertainty(X_train, y_train, X_test, alpha, beta):
    """
    Decompose predictive uncertainty into epistemic and aleatoric.
    """
    N, D = X_train.shape

    # Posterior
    S_N_inv = alpha * np.eye(D) + beta * X_train.T @ X_train
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ X_train.T @ y_train

    # Predictions
    mu = X_test @ m_N

    # Epistemic (model uncertainty)
    epistemic = np.sum((X_test @ S_N) * X_test, axis=1)

    # Aleatoric (noise)
    aleatoric = 1.0 / beta

    # Total
    total = epistemic + aleatoric

    return mu, np.sqrt(epistemic), np.sqrt(aleatoric), np.sqrt(total)

# Generate data
np.random.seed(42)
n_train = 30
x = np.sort(np.concatenate([
    np.random.uniform(-4, -1.5, 10),
    np.random.uniform(-0.5, 0.5, 10),
    np.random.uniform(1.5, 4, 10)
]))  # Clustered data with gaps

X_train = np.column_stack([np.ones_like(x), x])
w_true = np.array([0.5, 1.2])
noise_std = 0.5
y_train = X_train @ w_true + noise_std * np.random.randn(n_train)

x_test = np.linspace(-5, 5, 200)
X_test = np.column_stack([np.ones_like(x_test), x_test])

alpha = 0.2
beta = 1.0 / noise_std**2

mu, sigma_epi, sigma_ale, sigma_total = decompose_uncertainty(
    X_train, y_train, X_test, alpha, beta)

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Top: Predictions with uncertainty
ax = axes[0]
ax.fill_between(x_test, mu - 2*sigma_total, mu + 2*sigma_total,
                alpha=0.2, color='blue', label='Total Uncertainty (±2σ)')
ax.plot(x_test, mu, 'b-', linewidth=2, label='Mean Prediction')
ax.scatter(x, y_train, c='red', s=50, zorder=5, label='Training Data')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Predictions with Total Uncertainty')
ax.legend()
ax.grid(alpha=0.3)

# Bottom: Uncertainty decomposition
ax = axes[1]
ax.fill_between(x_test, 0, sigma_ale * np.ones_like(x_test),
                alpha=0.5, color='orange', label='Aleatoric (σ)')
ax.fill_between(x_test, sigma_ale, sigma_ale + sigma_epi,
                alpha=0.5, color='purple', label='Epistemic')
ax.plot(x_test, sigma_total, 'k-', linewidth=2, label='Total σ')
ax.axhline(sigma_ale, color='orange', linestyle='--', alpha=0.7)
ax.set_xlabel('x')
ax.set_ylabel('Standard Deviation')
ax.set_title('Uncertainty Decomposition: Epistemic vs Aleatoric')
ax.legend()
ax.grid(alpha=0.3)

# Mark data regions
for ax in axes:
    ax.axvspan(-4, -1.5, alpha=0.1, color='green')
    ax.axvspan(-0.5, 0.5, alpha=0.1, color='green')
    ax.axvspan(1.5, 4, alpha=0.1, color='green')

plt.tight_layout()
plt.show()

print("Key Observations:")
print("- Epistemic uncertainty is LOW where data exists (green regions)")
print("- Epistemic uncertainty is HIGH in gaps between data clusters")
print("- Aleatoric uncertainty is CONSTANT everywhere")
print("- Total uncertainty = sqrt(epistemic² + aleatoric²)")
```

If epistemic >> aleatoric in some region: Collect more data there. If aleatoric >> epistemic everywhere: More data won't help much; focus on noise reduction. If both are high: The problem is inherently hard AND you need more data.
Uncertainty estimates enable principled decision-making that accounts for risk.
Expected Utility Framework:
Given a prediction with uncertainty, the optimal action a* maximizes expected utility:
$$a^* = \arg\max_a \mathbb{E}_{p(y|x)}[U(a, y)]$$
where U(a, y) is the utility of action a when the true outcome is y.
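To make this concrete, here is a minimal sketch (not from the text above) that approximates the expectation by Monte Carlo sampling from a Gaussian predictive distribution and grid-searches over actions. The asymmetric utility function and all numbers are illustrative assumptions.

```python
import numpy as np

# Gaussian predictive distribution for a single test point (illustrative values).
mu, sigma = 10.0, 3.0

# Hypothetical asymmetric utility: under-supplying (a < y) costs 5 per unit,
# over-supplying (a > y) costs 1 per unit. Not from the original text.
def utility(a, y):
    shortfall = np.maximum(y - a, 0.0)
    excess = np.maximum(a - y, 0.0)
    return -(5.0 * shortfall + 1.0 * excess)

# Monte Carlo approximation of E_p(y|x)[U(a, y)] on a grid of candidate actions.
rng = np.random.default_rng(0)
y_samples = rng.normal(mu, sigma, size=20_000)
actions = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 200)
expected_utility = np.array([utility(a, y_samples).mean() for a in actions])

a_star = actions[np.argmax(expected_utility)]
print(f"Optimal action: {a_star:.2f} (vs. mean prediction {mu:.2f})")
# With asymmetric costs the optimal action sits above the mean, and the
# shift grows with sigma: uncertainty changes the decision, not just the range.
```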
Example: Bid Pricing
Suppose you're predicting the value V of an item to decide how much to bid.
The expected profit for bid B is: $$\mathbb{E}[\text{profit}] = \int_{-\infty}^{B} (v - B) \cdot p(v) \, dv$$
With high σ (uncertainty), you might bid lower to reduce risk. With low σ, you can bid closer to μ.
Risk-Aware Predictions:
Rather than predicting the mean, uncertainty-aware systems can output:
Conservative estimate: μ - kσ for some k > 0
Optimistic bound: μ + kσ
Value at Risk (VaR): The α-quantile
Expected Shortfall: Mean of the worst α% outcomes
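For a Gaussian predictive distribution these quantities have simple closed forms. The sketch below, with illustrative μ and σ, computes each of them; the expected-shortfall line uses the standard closed form for a Gaussian lower tail and treats low outcomes as the "worst" ones, which is an assumption that depends on your application.

```python
import numpy as np
from scipy import stats

# Illustrative Gaussian predictive distribution for one test point.
mu, sigma = 10.0, 3.0
k = 2.0       # number of standard deviations for the conservative/optimistic bounds
alpha = 0.05  # tail probability for VaR / expected shortfall

conservative = mu - k * sigma   # pessimistic point estimate
optimistic = mu + k * sigma     # upper confidence bound (UCB)

# Value at Risk: the alpha-quantile (here, "bad" outcomes are low values).
z_alpha = stats.norm.ppf(alpha)
var_alpha = mu + z_alpha * sigma

# Expected shortfall: mean of the worst alpha fraction of outcomes
# (closed form for a Gaussian: mu - sigma * phi(z_alpha) / alpha).
es_alpha = mu - sigma * stats.norm.pdf(z_alpha) / alpha

print(f"Conservative (mu - 2σ): {conservative:.2f}")
print(f"Optimistic   (mu + 2σ): {optimistic:.2f}")
print(f"5% VaR:                 {var_alpha:.2f}")
print(f"5% Expected shortfall:  {es_alpha:.2f}")
```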
Example: Anomaly Detection
Flag as anomaly if: |y - μ| > kσ
This threshold adapts to local uncertainty. Where σ is high, the threshold is wider (we expect more variability).
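A minimal sketch of this rule, assuming you already have per-point predictive means and standard deviations (the arrays below are illustrative):

```python
import numpy as np

def flag_anomalies(y_obs, mu, sigma, k=3.0):
    """Flag points whose residual exceeds k predictive standard deviations.

    The threshold k*sigma adapts to local uncertainty: wider where the
    model is less certain, tighter where it is confident.
    """
    z = np.abs(y_obs - mu) / sigma
    return z > k

# Illustrative usage with synthetic values.
mu = np.array([1.0, 2.0, 3.0, 4.0])
sigma = np.array([0.2, 0.2, 1.0, 0.2])
y_obs = np.array([1.1, 2.9, 4.5, 4.1])

print(flag_anomalies(y_obs, mu, sigma, k=3.0))
# -> [False  True False False]; the 4.5 is tolerated because sigma is large there.
```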
| Strategy | Formula | Use Case |
|---|---|---|
| Point prediction | μ | When uncertainty doesn't matter |
| Conservative | μ - kσ | Cost of overestimation > underestimation |
| UCB | μ + kσ | Exploration in bandits, active learning |
| Threshold k | μ ± kσ | Anomaly detection, control limits |
| VaR α | μ + z_α σ | Financial risk, worst-case planning |
| Expected value | ∫ y p(y) dy | Long-run optimal under neutrality |
One powerful application of uncertainty quantification is active learning: choosing which data points to label next to maximize learning efficiency.
The Basic Idea:
Labeling data is expensive (human annotation, expensive experiments). If we can choose which points to label, we should pick the ones that reduce uncertainty the most.
Uncertainty Sampling:
The simplest active learning strategy: query the point with highest predictive variance.
$$\mathbf{x}_{\text{next}} = \arg\max_{\mathbf{x}} \sigma_*^2(\mathbf{x}) = \arg\max_{\mathbf{x}} \left[ \mathbf{x}^\top\mathbf{S}_N\mathbf{x} + \sigma^2 \right]$$
Since σ² is constant, this reduces to: $$\mathbf{x}_{\text{next}} = \arg\max_{\mathbf{x}} \mathbf{x}^\top\mathbf{S}_N\mathbf{x}$$
Query where epistemic uncertainty is highest.
Why It Works:
Observing a point reduces posterior variance especially in nearby regions. By targeting high-variance regions, we efficiently reduce overall uncertainty with fewer labeled examples.
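This can be made precise with the posterior update rule and the standard Sherman–Morrison identity (a short derivation, not from the original text): observing a new input x′ with noise precision β = 1/σ² adds a rank-one term to the posterior precision, so the epistemic variance at any test point x can only decrease, and it decreases most where x is similar to x′ under S_N:

$$\mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\,\mathbf{x}'\mathbf{x}'^\top \quad\Longrightarrow\quad \mathbf{x}^\top\mathbf{S}_N\mathbf{x} - \mathbf{x}^\top\mathbf{S}_{N+1}\mathbf{x} = \frac{\beta\,(\mathbf{x}^\top\mathbf{S}_N\mathbf{x}')^2}{1 + \beta\,\mathbf{x}'^\top\mathbf{S}_N\mathbf{x}'} \;\geq\; 0$$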
```python
import numpy as np
import matplotlib.pyplot as plt

def active_learning_demo(n_initial=5, n_queries=10, pool_size=100):
    """
    Demonstrate active learning with Bayesian linear regression.
    Compare uncertainty sampling to random sampling.
    """
    np.random.seed(42)

    # True function
    w_true = np.array([0.5, 1.0, -0.5])
    noise_std = 0.3

    def true_func(X):
        return X @ w_true + noise_std * np.random.randn(len(X))

    def compute_posterior(X, y, alpha=0.5, beta=1/0.3**2):
        D = X.shape[1]
        S_N_inv = alpha * np.eye(D) + beta * X.T @ X
        S_N = np.linalg.inv(S_N_inv)
        m_N = beta * S_N @ X.T @ y
        return m_N, S_N

    def epistemic_uncertainty(X_pool, S_N):
        return np.sum((X_pool @ S_N) * X_pool, axis=1)

    # Create data pool
    x_pool = np.linspace(-3, 3, pool_size)
    X_pool = np.column_stack([np.ones(pool_size), x_pool, x_pool**2])

    # Initial random samples
    initial_idx = np.random.choice(pool_size, n_initial, replace=False)

    results = {'uncertainty': [], 'random': []}

    for strategy in ['uncertainty', 'random']:
        labeled = list(initial_idx.copy())
        unlabeled = [i for i in range(pool_size) if i not in labeled]

        X_train = X_pool[labeled]
        y_train = true_func(X_train)

        mse_history = []

        for query in range(n_queries):
            # Current posterior
            m_N, S_N = compute_posterior(X_train, y_train)

            # Evaluate on test set
            mse = np.mean((X_pool @ m_N - X_pool @ w_true)**2)
            mse_history.append(mse)

            # Select next point
            if strategy == 'uncertainty':
                # Uncertainty sampling
                uncertainties = epistemic_uncertainty(X_pool[unlabeled], S_N)
                next_relative = np.argmax(uncertainties)
            else:
                # Random sampling
                next_relative = np.random.randint(len(unlabeled))

            next_idx = unlabeled[next_relative]

            # Query (get label)
            x_new = X_pool[next_idx:next_idx+1]
            y_new = true_func(x_new)

            # Update
            X_train = np.vstack([X_train, x_new])
            y_train = np.concatenate([y_train, y_new])
            labeled.append(next_idx)
            unlabeled.remove(next_idx)

        results[strategy] = mse_history

    # Plot comparison
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(results['uncertainty'], 'b-o', linewidth=2, markersize=6,
            label='Uncertainty Sampling')
    ax.plot(results['random'], 'r-s', linewidth=2, markersize=6,
            label='Random Sampling')
    ax.set_xlabel('Number of Queries')
    ax.set_ylabel('Mean Squared Error')
    ax.set_title('Active Learning: Uncertainty vs Random Sampling')
    ax.legend()
    ax.grid(alpha=0.3)
    ax.set_yscale('log')
    plt.show()

    print("Final MSE (Uncertainty):", results['uncertainty'][-1])
    print("Final MSE (Random):", results['random'][-1])
    print("Improvement:",
          (results['random'][-1] - results['uncertainty'][-1]) / results['random'][-1] * 100, "%")

active_learning_demo()
```

Active learning provides the biggest gains when: (1) Labeling is expensive relative to data acquisition, (2) The input space has regions of varying importance, (3) You can choose which points to query. If labeling is cheap or you must process whatever data arrives, active learning may not be applicable.
Uncertainty also helps choose between models. The key quantity is the marginal likelihood (also called model evidence):
$$p(\mathbf{y} | \mathbf{X}, \mathcal{M}) = \int p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w} | \mathcal{M}) \, d\mathbf{w}$$
This is the probability of the data averaged over all possible parameter values under model ℳ.
For Bayesian Linear Regression (closed-form):
$$\log p(\mathbf{y} | \mathbf{X}) = -\frac{1}{2}\left[ N\log(2\pi) + \log|\mathbf{K}| + \mathbf{y}^\top\mathbf{K}^{-1}\mathbf{y} \right]$$
where K = σ²I + X S₀ Xᵀ is the data covariance matrix.
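Here is a minimal sketch of this closed form, assuming the isotropic prior S₀ = α⁻¹I and noise variance σ² = 1/β used in the code above; the grid search over α at the end previews the empirical-Bayes selection described below. The data and all numeric values are illustrative.

```python
import numpy as np

def log_marginal_likelihood(X, y, alpha, beta):
    """Log evidence for Bayesian linear regression with prior N(0, alpha^{-1} I)
    and noise precision beta, using K = sigma^2 I + X S0 X^T."""
    N = X.shape[0]
    K = np.eye(N) / beta + (X @ X.T) / alpha   # data covariance matrix
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

# Illustrative data: linear truth with Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 50)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 1.2 * x + 0.5 * rng.normal(size=50)

# Empirical Bayes: pick the alpha that maximizes the log evidence (beta fixed here).
beta = 1 / 0.5**2
alphas = np.logspace(-3, 3, 25)
evidences = [log_marginal_likelihood(X, y, a, beta) for a in alphas]
best_alpha = alphas[int(np.argmax(evidences))]
print(f"alpha maximizing the evidence: {best_alpha:.3f}")
```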
Intuition:
The marginal likelihood balances two forces: how well the model fits the observed data, and how much probability the model spreads over datasets it did not observe. Models that balance fit and simplicity have high marginal likelihood. This is automatic Occam's razor: complex models spread probability mass over many possible datasets, leaving little for the one actually observed.
Applications:
Hyperparameter Selection: Different α values define different models. Choose α to maximize marginal likelihood. $$\alpha^* = \arg\max_\alpha \log p(\mathbf{y} | \mathbf{X}, \alpha)$$ This is called empirical Bayes or type-II maximum likelihood.
Feature Selection: Compare models with different feature subsets. Higher marginal likelihood = better model.
Model Class Comparison: Compare linear models to polynomial models to Gaussian processes. The marginal likelihood accounts for complexity.
BIC Approximation:
For quick comparisons, the Bayesian Information Criterion approximates log marginal likelihood: $$\text{BIC} \approx \log p(\mathbf{y}|\mathbf{X}) \approx \log p(\mathbf{y}|\mathbf{X}, \hat{\mathbf{w}}) - \frac{D}{2}\log N$$
The penalty (D/2)log N discourages complex models.
Marginal likelihood compares models within a model class. If the true data-generating process isn't in your model class, the highest marginal likelihood doesn't mean 'true.' It means 'best approximation among candidates.' Always consider whether your model class is rich enough.
Based on experience deploying Bayesian models in production, here are practical guidelines:
Always Check Calibration:
Before trusting any uncertainty estimate, verify calibration on held-out data. Run the diagnostic plots from Section 1. Miscalibrated uncertainties are worse than no uncertainties—they create false confidence.
Start Simple:
Bayesian linear regression with reasonable priors is a strong baseline. Its uncertainties are well-understood and cheap to compute. Only add complexity (GPs, neural networks) when linear isn't enough.
Be Transparent About Assumptions:

State the assumptions behind your intervals: linearity, Gaussian homoscedastic noise, and the choice of prior. The reported uncertainty is conditional on those assumptions, not an unconditional guarantee.

Communicate Appropriately:

Match the presentation to the audience, as discussed in the visualization section: full distributions and calibration diagnostics for technical reviewers, simple ranges or worst-case bounds for decision-makers and executives.
Uncertainty estimates are only as good as the model that produces them. A wrong model gives wrong uncertainties. Bayesian inference doesn't protect against model misspecification—it faithfully reports uncertainty within the model, which may not reflect reality. Always validate against real outcomes.
We've explored how to take the theoretical predictive distribution and use it effectively in practice—validating, visualizing, decomposing, and acting on uncertainty.
What's Next:
We've covered the entire Bayesian linear regression pipeline: priors, posteriors, predictions, and uncertainty quantification. The final page reveals a profound connection that ties it all together: the connection between Bayesian inference and regularization. We'll see that Ridge regression, LASSO, and other regularized methods are Bayesian methods in disguise—and this perspective illuminates when and why regularization works.
You now have practical tools for using Bayesian uncertainty: calibration checks, visualization techniques, decomposition into epistemic and aleatoric components, and frameworks for decision-making under uncertainty. Next, we'll complete the picture by revealing the deep connection between Bayesian inference and regularization.