Having derived the predictive distribution, we now possess a powerful tool: an uncertainty estimate attached to every prediction. But a probability distribution is only useful if we know how to interpret it, trust it, and act on it.
In practice, uncertainty quantification serves many purposes: validating that a model's predictions can be trusted, communicating risk to stakeholders, deciding where to collect more data, and making decisions that account for what the model does not know.
This page bridges theory and practice—we'll learn how to validate that our uncertainty estimates are trustworthy and how to use them effectively in real-world applications.
By the end of this page, you will understand how to check if uncertainty estimates are calibrated, visualize uncertainty effectively for different audiences, decompose uncertainty into epistemic and aleatoric components, and make decisions that appropriately account for prediction uncertainty.
Calibration is the most important property for uncertainty estimates. A model is well-calibrated if its probabilistic predictions match observed frequencies.
Definition:
A predictive distribution is calibrated if, for any confidence level 1-α: $$P(y_* \in \text{CI}_{1-\alpha}(\mathbf{x}_*)) = 1 - \alpha$$
In plain terms: 95% confidence intervals should contain the true value 95% of the time. 80% intervals should contain 80% of observations. If 95% intervals only capture 70% of outcomes, the model is overconfident.
Why Calibration Matters:

Overconfident intervals lead decision-makers to take on more risk than they intended; underconfident intervals waste resources hedging against outcomes that rarely occur. Either way, the probabilities attached to predictions stop meaning what they claim to mean.

Checking Calibration:

For regression, we use calibration curves (reliability diagrams). The procedure:

1. Choose a grid of confidence levels (e.g., 10%, 20%, ..., 90%).
2. At each level, construct the corresponding predictive interval for every test point.
3. Compute the observed coverage: the fraction of test targets that fall inside their intervals.
4. Plot observed coverage against expected coverage.

Interpretation:

A curve along the diagonal indicates good calibration. A curve below the diagonal means intervals capture fewer observations than claimed (overconfidence); a curve above the diagonal means intervals are wider than necessary (underconfidence).

Alternatively, check the coverage rate for various confidence levels:
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def check_calibration(y_true, mu_pred, sigma_pred, n_bins=10):
    """
    Check calibration of Bayesian predictions.
    Returns calibration metrics and creates diagnostic plots.
    """
    # Probability Integral Transform (PIT)
    pit_values = stats.norm.cdf((y_true - mu_pred) / sigma_pred)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # 1. PIT histogram (should be uniform)
    ax = axes[0]
    ax.hist(pit_values, bins=n_bins, density=True, alpha=0.7, edgecolor='black')
    ax.axhline(1.0, color='red', linestyle='--', label='Ideal (Uniform)')
    ax.set_xlabel('PIT Value')
    ax.set_ylabel('Density')
    ax.set_title('PIT Histogram (Ideal: Uniform)')
    ax.legend()

    # 2. Calibration curve
    ax = axes[1]
    expected_coverage = np.linspace(0, 1, 100)
    observed_coverage = []
    for p in expected_coverage:
        # Symmetric interval containing probability p
        z = stats.norm.ppf((1 + p) / 2)  # z-score for symmetric interval
        in_interval = np.abs((y_true - mu_pred) / sigma_pred) <= z
        observed_coverage.append(np.mean(in_interval))
    ax.plot(expected_coverage, observed_coverage, 'b-', linewidth=2, label='Observed')
    ax.plot([0, 1], [0, 1], 'r--', linewidth=1, label='Ideal')
    ax.fill_between(expected_coverage,
                    expected_coverage - 2*np.sqrt(expected_coverage*(1-expected_coverage)/len(y_true)),
                    expected_coverage + 2*np.sqrt(expected_coverage*(1-expected_coverage)/len(y_true)),
                    alpha=0.2, color='gray', label='±2σ band')
    ax.set_xlabel('Expected Coverage')
    ax.set_ylabel('Observed Coverage')
    ax.set_title('Calibration Curve')
    ax.legend()

    # 3. Quantile-Quantile plot
    ax = axes[2]
    sorted_pit = np.sort(pit_values)
    theoretical_quantiles = np.linspace(0, 1, len(pit_values))
    ax.plot(theoretical_quantiles, sorted_pit, 'b.', alpha=0.5)
    ax.plot([0, 1], [0, 1], 'r--', linewidth=1)
    ax.set_xlabel('Theoretical Quantile')
    ax.set_ylabel('Observed Quantile')
    ax.set_title('Q-Q Plot for PIT')

    plt.tight_layout()
    plt.show()

    # Kolmogorov-Smirnov test for uniformity of the PIT values
    ks_stat, ks_pvalue = stats.kstest(pit_values, 'uniform')

    # Specific coverage rates
    coverage_results = {}
    for confidence in [0.5, 0.8, 0.9, 0.95, 0.99]:
        z = stats.norm.ppf((1 + confidence) / 2)
        coverage = np.mean(np.abs((y_true - mu_pred) / sigma_pred) <= z)
        coverage_results[confidence] = coverage

    print("Calibration Summary:")
    print(f"  KS test p-value: {ks_pvalue:.4f} (>0.05 suggests good calibration)")
    print("\nCoverage rates (expected → observed):")
    for conf, obs in coverage_results.items():
        status = '✓' if abs(obs - conf) < 0.05 else '✗'
        print(f"  {100*conf:.0f}%: {100*obs:.1f}% {status}")

    return pit_values, coverage_results

# Example usage
np.random.seed(42)
n_test = 500

# Generate test data
x_test = np.random.uniform(-3, 3, n_test)
w_true = 1.5
noise_true = 0.5
y_true = w_true * x_test + noise_true * np.random.randn(n_test)

# Simulate Bayesian predictions (well-calibrated)
mu_pred = w_true * x_test                      # Assume we know the true weight (ideal case)
sigma_pred = np.sqrt(0.05 + noise_true**2)     # Epistemic + aleatoric

pit, coverage = check_calibration(y_true, mu_pred, sigma_pred)
```

Bayesian linear regression with Gaussian assumptions produces calibrated predictions if the model is correct. In practice, miscalibration arises from model misspecification.
Common Causes of Overconfidence:
Model Misspecification: True relationship is nonlinear, but we fit a linear model. Predictions are systematically wrong, and this error isn't captured in σ².
Heteroscedastic Noise: True noise varies with input (σ²(x)), but we assume constant σ². Regions with high true noise are overconfident.
Outliers: Heavy-tailed noise (not Gaussian) causes more extreme observations than predicted.
Wrong Prior: The prior is too confident (α too large), shrinking estimates too strongly toward the prior mean.
Underestimated σ²: If σ² is estimated from data but the sample is small, it may be underestimated.
Common Causes of Underconfidence:
Overly Weak Prior: α too small, leading to excessive variance in weight estimates.
Overestimated σ²: The noise variance is set higher than its true value, so intervals are wider than necessary.
Conservative Modeling: Some practitioners deliberately inflate uncertainty "to be safe."
| Symptom | Likely Cause | Potential Fix |
|---|---|---|
| Coverage too low (overconfident) | Model misspecification | Use more flexible model (kernels, basis functions) |
| Coverage too low | Noise underestimated | Increase σ² or estimate it from residuals (see the sketch after this table) |
| Coverage too low | Prior too strong | Decrease α (weaker prior) |
| Coverage too high (underconfident) | Noise overestimated | Decrease σ² or estimate properly |
| Coverage too high | Prior too weak | Increase α (add regularization) |
| Coverage varies by region | Heteroscedastic noise | Model noise as function of x |
| Extreme outliers | Heavy-tailed noise | Use robust likelihood (Student-t) |
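As a concrete instance of the "estimate from residuals" fix, here is a minimal sketch. It assumes you have held-out targets and posterior-mean predictions (the names `y_val` and `mu_val` are illustrative, not from the text above) and treats the residual variance as an estimate of the noise variance; note that this slightly overstates the aleatoric noise because residuals also contain estimation error.

```python
import numpy as np

# Hypothetical held-out split (illustrative values; substitute your own
# validation targets and posterior-mean predictions).
rng = np.random.default_rng(0)
y_val = rng.normal(size=200)                 # held-out targets
mu_val = y_val + 0.7 * rng.normal(size=200)  # predictions with unmodelled noise

# Estimate the noise variance from validation residuals.
residuals = y_val - mu_val
sigma2_hat = np.var(residuals, ddof=1)   # unbiased sample variance
beta_hat = 1.0 / sigma2_hat              # noise precision, if your code works with beta

print(f"Estimated noise std: {np.sqrt(sigma2_hat):.3f}")
# Rebuild sigma_pred with this estimate and re-run check_calibration
# before trusting the intervals.
```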
Bayesian inference gives calibrated uncertainties if the model is correct. Miscalibration signals model misspecification. Don't patch uncertainty estimates—fix the modeling problem. If linear isn't enough, use Gaussian processes. If Gaussian noise doesn't fit, use robust alternatives.
Effective visualization communicates uncertainty to diverse audiences—from domain experts to executives. Different contexts call for different approaches.
1. Confidence Bands (Best for Continuous Predictions)
Shade regions representing different confidence levels, typically with darker shading closer to the mean prediction.
2. Error Bars (Best for Discrete Points)
Classic vertical line segments showing ±1σ or ±2σ. Simple and widely understood.
3. Fan Charts (Best for Time Series)
Widening bands as forecast horizon increases. Common in economic forecasting.
4. Probability Distribution Insets (Best for Critical Points)
Show the full distribution at specific points of interest. Reveals asymmetry, multimodality.
5. Spaghetti Plots (Best for Showing Variability)
Plot multiple sample trajectories from the predictive distribution. Shows correlation structure.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def comprehensive_uncertainty_plot(x_test, mu, sigma, x_train=None, y_train=None):
    """
    Create multiple uncertainty visualization approaches.
    """
    fig = plt.figure(figsize=(16, 10))

    # 1. Gradient bands (continuous probability)
    ax1 = fig.add_subplot(2, 2, 1)
    for alpha, z in zip(np.linspace(0.1, 0.5, 10), np.linspace(2, 0.5, 10)):
        ax1.fill_between(x_test, mu - z*sigma, mu + z*sigma,
                         alpha=alpha, color='blue', linewidth=0)
    ax1.plot(x_test, mu, 'b-', linewidth=2, label='Mean')
    if x_train is not None:
        ax1.scatter(x_train, y_train, c='red', s=30, zorder=5, label='Data')
    ax1.set_title('Gradient Confidence Bands')
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')
    ax1.legend()

    # 2. Discrete confidence intervals
    ax2 = fig.add_subplot(2, 2, 2)
    ax2.fill_between(x_test, mu - 2*sigma, mu + 2*sigma,
                     alpha=0.2, color='blue', label='95% CI')
    ax2.fill_between(x_test, mu - 1*sigma, mu + 1*sigma,
                     alpha=0.3, color='blue', label='68% CI')
    ax2.plot(x_test, mu, 'b-', linewidth=2, label='Mean')
    if x_train is not None:
        ax2.scatter(x_train, y_train, c='red', s=30, zorder=5)
    ax2.set_title('Discrete Confidence Intervals')
    ax2.set_xlabel('x')
    ax2.legend()

    # 3. Spaghetti plot (sample trajectories)
    ax3 = fig.add_subplot(2, 2, 3)
    n_samples = 50
    for i in range(n_samples):
        y_sample = mu + sigma * np.random.randn(len(mu))
        ax3.plot(x_test, y_sample, 'b-', alpha=0.1, linewidth=0.5)
    ax3.plot(x_test, mu, 'r-', linewidth=2, label='Mean')
    if x_train is not None:
        ax3.scatter(x_train, y_train, c='black', s=30, zorder=5)
    ax3.set_title('Spaghetti Plot (50 Samples)')
    ax3.set_xlabel('x')
    ax3.set_ylabel('y')

    # 4. Uncertainty magnitude
    ax4 = fig.add_subplot(2, 2, 4)
    ax4.fill_between(x_test, 0, sigma, alpha=0.5, color='purple')
    ax4.plot(x_test, sigma, 'purple', linewidth=2)
    ax4.axhline(np.sqrt(0.25), color='gray', linestyle='--',
                label='Aleatoric component (σ)')
    ax4.set_title('Predictive Standard Deviation')
    ax4.set_xlabel('x')
    ax4.set_ylabel('σ(x)')
    ax4.legend()

    plt.tight_layout()
    plt.show()

# Example
np.random.seed(42)
x_test = np.linspace(-3, 3, 100)
x_train = np.random.uniform(-2, 2, 15)
y_train = 0.5 * x_train + 0.5 * np.random.randn(15)

# Simulated Bayesian predictions
mu = 0.5 * x_test
# Uncertainty increases away from training data
sigma = 0.4 + 0.1 * np.minimum(np.abs(x_test - 0), 2)

comprehensive_uncertainty_plot(x_test, mu, sigma, x_train, y_train)
```

Technical audiences appreciate full distributions and calibration details. Decision-makers often prefer simple ranges ('expect 10-15, but 5-20 is possible'). Executives may need worst-case bounds. Translate your uncertainty into the language your audience understands.
Recall that predictive variance decomposes as:
$$\sigma_*^2 = \underbrace{\mathbf{x}_*^\top\mathbf{S}_N\mathbf{x}_*}_{\text{Epistemic}} + \underbrace{\sigma^2}_{\text{Aleatoric}}$$
Tracking these separately provides actionable insights.
Epistemic Uncertainty (Model Uncertainty):

Arises from not knowing the weights exactly. It corresponds to the first term in the decomposition above, is largest far from the training data, and shrinks as more data is collected.

Aleatoric Uncertainty (Data Uncertainty):

Arises from noise in the observations themselves. It cannot be reduced by collecting more data of the same kind, and under the homoscedastic Gaussian assumption it is the constant σ².
```python
import numpy as np
import matplotlib.pyplot as plt

def decompose_uncertainty(X_train, y_train, X_test, alpha, beta):
    """
    Decompose predictive uncertainty into epistemic and aleatoric.
    """
    N, D = X_train.shape

    # Posterior
    S_N_inv = alpha * np.eye(D) + beta * X_train.T @ X_train
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ X_train.T @ y_train

    # Predictions
    mu = X_test @ m_N

    # Epistemic (model uncertainty)
    epistemic = np.sum((X_test @ S_N) * X_test, axis=1)

    # Aleatoric (noise)
    aleatoric = 1.0 / beta

    # Total
    total = epistemic + aleatoric

    return mu, np.sqrt(epistemic), np.sqrt(aleatoric), np.sqrt(total)

# Generate data
np.random.seed(42)
n_train = 30
x = np.sort(np.concatenate([
    np.random.uniform(-4, -1.5, 10),
    np.random.uniform(-0.5, 0.5, 10),
    np.random.uniform(1.5, 4, 10)
]))  # Clustered data with gaps

X_train = np.column_stack([np.ones_like(x), x])
w_true = np.array([0.5, 1.2])
noise_std = 0.5
y_train = X_train @ w_true + noise_std * np.random.randn(n_train)

x_test = np.linspace(-5, 5, 200)
X_test = np.column_stack([np.ones_like(x_test), x_test])

alpha = 0.2
beta = 1.0 / noise_std**2

mu, sigma_epi, sigma_ale, sigma_total = decompose_uncertainty(
    X_train, y_train, X_test, alpha, beta)

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Top: Predictions with uncertainty
ax = axes[0]
ax.fill_between(x_test, mu - 2*sigma_total, mu + 2*sigma_total,
                alpha=0.2, color='blue', label='Total Uncertainty (±2σ)')
ax.plot(x_test, mu, 'b-', linewidth=2, label='Mean Prediction')
ax.scatter(x, y_train, c='red', s=50, zorder=5, label='Training Data')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Predictions with Total Uncertainty')
ax.legend()
ax.grid(alpha=0.3)

# Bottom: Uncertainty decomposition
ax = axes[1]
ax.fill_between(x_test, 0, sigma_ale * np.ones_like(x_test),
                alpha=0.5, color='orange', label='Aleatoric (σ)')
ax.fill_between(x_test, sigma_ale, sigma_ale + sigma_epi,
                alpha=0.5, color='purple', label='Epistemic')
ax.plot(x_test, sigma_total, 'k-', linewidth=2, label='Total σ')
ax.axhline(sigma_ale, color='orange', linestyle='--', alpha=0.7)
ax.set_xlabel('x')
ax.set_ylabel('Standard Deviation')
ax.set_title('Uncertainty Decomposition: Epistemic vs Aleatoric')
ax.legend()
ax.grid(alpha=0.3)

# Mark data regions
for ax in axes:
    ax.axvspan(-4, -1.5, alpha=0.1, color='green')
    ax.axvspan(-0.5, 0.5, alpha=0.1, color='green')
    ax.axvspan(1.5, 4, alpha=0.1, color='green')

plt.tight_layout()
plt.show()

print("Key Observations:")
print("- Epistemic uncertainty is LOW where data exists (green regions)")
print("- Epistemic uncertainty is HIGH in gaps between data clusters")
print("- Aleatoric uncertainty is CONSTANT everywhere")
print("- Total uncertainty = sqrt(epistemic² + aleatoric²)")
```

If epistemic >> aleatoric in some region: Collect more data there. If aleatoric >> epistemic everywhere: More data won't help much; focus on noise reduction. If both are high: The problem is inherently hard AND you need more data.
Uncertainty estimates enable principled decision-making that accounts for risk.
Expected Utility Framework:
Given a prediction with uncertainty, the optimal action a* maximizes expected utility:
$$a^* = \arg\max_a \mathbb{E}_{p(y|x)}[U(a, y)]$$
where U(a, y) is the utility of action a when the true outcome is y.
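To make this concrete, here is a minimal sketch (not from the text above) that approximates the expectation by Monte Carlo sampling from a Gaussian predictive distribution and grid-searches over actions. The asymmetric utility function and all numbers are illustrative assumptions.

```python
import numpy as np

# Gaussian predictive distribution for a single test point (illustrative values).
mu, sigma = 10.0, 3.0

# Hypothetical asymmetric utility: under-supplying (a < y) costs 5 per unit,
# over-supplying (a > y) costs 1 per unit. Not from the original text.
def utility(a, y):
    shortfall = np.maximum(y - a, 0.0)
    excess = np.maximum(a - y, 0.0)
    return -(5.0 * shortfall + 1.0 * excess)

# Monte Carlo approximation of E_p(y|x)[U(a, y)] on a grid of candidate actions.
rng = np.random.default_rng(0)
y_samples = rng.normal(mu, sigma, size=20_000)
actions = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 200)
expected_utility = np.array([utility(a, y_samples).mean() for a in actions])

a_star = actions[np.argmax(expected_utility)]
print(f"Optimal action: {a_star:.2f} (vs. mean prediction {mu:.2f})")
# With asymmetric costs the optimal action sits above the mean, and the
# shift grows with sigma: uncertainty changes the decision, not just the range.
```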
Example: Bid Pricing
Suppose you're predicting the value V of an item to decide how much to bid.
The expected profit for bid B is: $$\mathbb{E}[\text{profit}] = \int_{-\infty}^{B} (v - B) \cdot p(v) \, dv$$
With high σ (uncertainty), you might bid lower to reduce risk. With low σ, you can bid closer to μ.
Risk-Aware Predictions:
Rather than predicting the mean, uncertainty-aware systems can output:
Conservative estimate: μ - kσ for some k > 0
Optimistic bound: μ + kσ
Value at Risk (VaR): The α-quantile
Expected Shortfall: Mean of the worst α% outcomes
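For a Gaussian predictive distribution these quantities have simple closed forms. The sketch below, with illustrative μ and σ, computes each of them; the expected-shortfall line uses the standard closed form for a Gaussian lower tail and treats low outcomes as the "worst" ones, which is an assumption that depends on your application.

```python
import numpy as np
from scipy import stats

# Illustrative Gaussian predictive distribution for one test point.
mu, sigma = 10.0, 3.0
k = 2.0       # number of standard deviations for the conservative/optimistic bounds
alpha = 0.05  # tail probability for VaR / expected shortfall

conservative = mu - k * sigma   # pessimistic point estimate
optimistic = mu + k * sigma     # upper confidence bound (UCB)

# Value at Risk: the alpha-quantile (here, "bad" outcomes are low values).
z_alpha = stats.norm.ppf(alpha)
var_alpha = mu + z_alpha * sigma

# Expected shortfall: mean of the worst alpha fraction of outcomes
# (closed form for a Gaussian: mu - sigma * phi(z_alpha) / alpha).
es_alpha = mu - sigma * stats.norm.pdf(z_alpha) / alpha

print(f"Conservative (mu - 2σ): {conservative:.2f}")
print(f"Optimistic   (mu + 2σ): {optimistic:.2f}")
print(f"5% VaR:                 {var_alpha:.2f}")
print(f"5% Expected shortfall:  {es_alpha:.2f}")
```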
Example: Anomaly Detection
Flag as anomaly if: |y - μ| > kσ
This threshold adapts to local uncertainty. Where σ is high, the threshold is wider (we expect more variability).
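A minimal sketch of this rule, assuming you already have per-point predictive means and standard deviations (the arrays below are illustrative):

```python
import numpy as np

def flag_anomalies(y_obs, mu, sigma, k=3.0):
    """Flag points whose residual exceeds k predictive standard deviations.

    The threshold k*sigma adapts to local uncertainty: wider where the
    model is less certain, tighter where it is confident.
    """
    z = np.abs(y_obs - mu) / sigma
    return z > k

# Illustrative usage with synthetic values.
mu = np.array([1.0, 2.0, 3.0, 4.0])
sigma = np.array([0.2, 0.2, 1.0, 0.2])
y_obs = np.array([1.1, 2.9, 4.5, 4.1])

print(flag_anomalies(y_obs, mu, sigma, k=3.0))
# -> [False  True False False]; the 4.5 is tolerated because sigma is large there.
```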
| Strategy | Formula | Use Case |
|---|---|---|
| Point prediction | μ | When uncertainty doesn't matter |
| Conservative | μ - kσ | Cost of overestimation > underestimation |
| UCB | μ + kσ | Exploration in bandits, active learning |
| Threshold k | μ ± kσ | Anomaly detection, control limits |
| VaR α | μ + z_α σ | Financial risk, worst-case planning |
| Expected value | ∫ y p(y) dy | Long-run optimal under neutrality |
One powerful application of uncertainty quantification is active learning: choosing which data points to label next to maximize learning efficiency.
The Basic Idea:
Labeling data is expensive (human annotation, expensive experiments). If we can choose which points to label, we should pick the ones that reduce uncertainty the most.
Uncertainty Sampling:
The simplest active learning strategy: query the point with highest predictive variance.
$$\mathbf{x}_{\text{next}} = \arg\max_{\mathbf{x}} \sigma_*^2(\mathbf{x}) = \arg\max_{\mathbf{x}} \left[ \mathbf{x}^\top\mathbf{S}_N\mathbf{x} + \sigma^2 \right]$$
Since σ² is constant, this reduces to: $$\mathbf{x}_{\text{next}} = \arg\max_{\mathbf{x}} \mathbf{x}^\top\mathbf{S}_N\mathbf{x}$$
Query where epistemic uncertainty is highest.
Why It Works:
Observing a point reduces posterior variance especially in nearby regions. By targeting high-variance regions, we efficiently reduce overall uncertainty with fewer labeled examples.
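This can be made precise with the posterior update rule and the standard Sherman–Morrison identity (a short derivation, not from the original text): observing a new input x′ with noise precision β = 1/σ² adds a rank-one term to the posterior precision, so the epistemic variance at any test point x can only decrease, and it decreases most where x is similar to x′ under S_N:

$$\mathbf{S}_{N+1}^{-1} = \mathbf{S}_N^{-1} + \beta\,\mathbf{x}'\mathbf{x}'^\top \quad\Longrightarrow\quad \mathbf{x}^\top\mathbf{S}_N\mathbf{x} - \mathbf{x}^\top\mathbf{S}_{N+1}\mathbf{x} = \frac{\beta\,(\mathbf{x}^\top\mathbf{S}_N\mathbf{x}')^2}{1 + \beta\,\mathbf{x}'^\top\mathbf{S}_N\mathbf{x}'} \;\geq\; 0$$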
```python
import numpy as np
import matplotlib.pyplot as plt

def active_learning_demo(n_initial=5, n_queries=10, pool_size=100):
    """
    Demonstrate active learning with Bayesian linear regression.
    Compare uncertainty sampling to random sampling.
    """
    np.random.seed(42)

    # True function
    w_true = np.array([0.5, 1.0, -0.5])
    noise_std = 0.3

    def true_func(X):
        return X @ w_true + noise_std * np.random.randn(len(X))

    def compute_posterior(X, y, alpha=0.5, beta=1/0.3**2):
        D = X.shape[1]
        S_N_inv = alpha * np.eye(D) + beta * X.T @ X
        S_N = np.linalg.inv(S_N_inv)
        m_N = beta * S_N @ X.T @ y
        return m_N, S_N

    def epistemic_uncertainty(X_pool, S_N):
        return np.sum((X_pool @ S_N) * X_pool, axis=1)

    # Create data pool
    x_pool = np.linspace(-3, 3, pool_size)
    X_pool = np.column_stack([np.ones(pool_size), x_pool, x_pool**2])

    # Initial random samples
    initial_idx = np.random.choice(pool_size, n_initial, replace=False)

    results = {'uncertainty': [], 'random': []}

    for strategy in ['uncertainty', 'random']:
        labeled = list(initial_idx.copy())
        unlabeled = [i for i in range(pool_size) if i not in labeled]

        X_train = X_pool[labeled]
        y_train = true_func(X_train)

        mse_history = []

        for query in range(n_queries):
            # Current posterior
            m_N, S_N = compute_posterior(X_train, y_train)

            # Evaluate on test set
            mse = np.mean((X_pool @ m_N - X_pool @ w_true)**2)
            mse_history.append(mse)

            # Select next point
            if strategy == 'uncertainty':
                # Uncertainty sampling
                uncertainties = epistemic_uncertainty(X_pool[unlabeled], S_N)
                next_relative = np.argmax(uncertainties)
            else:
                # Random sampling
                next_relative = np.random.randint(len(unlabeled))

            next_idx = unlabeled[next_relative]

            # Query (get label)
            x_new = X_pool[next_idx:next_idx+1]
            y_new = true_func(x_new)

            # Update
            X_train = np.vstack([X_train, x_new])
            y_train = np.concatenate([y_train, y_new])
            labeled.append(next_idx)
            unlabeled.remove(next_idx)

        results[strategy] = mse_history

    # Plot comparison
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(results['uncertainty'], 'b-o', linewidth=2, markersize=6,
            label='Uncertainty Sampling')
    ax.plot(results['random'], 'r-s', linewidth=2, markersize=6,
            label='Random Sampling')
    ax.set_xlabel('Number of Queries')
    ax.set_ylabel('Mean Squared Error')
    ax.set_title('Active Learning: Uncertainty vs Random Sampling')
    ax.legend()
    ax.grid(alpha=0.3)
    ax.set_yscale('log')
    plt.show()

    print("Final MSE (Uncertainty):", results['uncertainty'][-1])
    print("Final MSE (Random):", results['random'][-1])
    print("Improvement:",
          (results['random'][-1] - results['uncertainty'][-1]) / results['random'][-1] * 100, "%")

active_learning_demo()
```

Active learning provides the biggest gains when: (1) Labeling is expensive relative to data acquisition, (2) The input space has regions of varying importance, (3) You can choose which points to query. If labeling is cheap or you must process whatever data arrives, active learning may not be applicable.
Uncertainty also helps choose between models. The key quantity is the marginal likelihood (also called model evidence):
$$p(\mathbf{y} | \mathbf{X}, \mathcal{M}) = \int p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w} | \mathcal{M}) \, d\mathbf{w}$$
This is the probability of the data averaged over all possible parameter values under model ℳ.
For Bayesian Linear Regression (closed-form):
$$\log p(\mathbf{y} | \mathbf{X}) = -\frac{1}{2}\left[ N\log(2\pi) + \log|\mathbf{K}| + \mathbf{y}^\top\mathbf{K}^{-1}\mathbf{y} \right]$$
where K = σ²I + X S₀ Xᵀ is the data covariance matrix.
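Here is a minimal sketch of this closed form, assuming the isotropic prior S₀ = α⁻¹I and noise variance σ² = 1/β used in the code above; the grid search over α at the end previews the empirical-Bayes selection described below. The data and all numeric values are illustrative.

```python
import numpy as np

def log_marginal_likelihood(X, y, alpha, beta):
    """Log evidence for Bayesian linear regression with prior N(0, alpha^{-1} I)
    and noise precision beta, using K = sigma^2 I + X S0 X^T."""
    N = X.shape[0]
    K = np.eye(N) / beta + (X @ X.T) / alpha   # data covariance matrix
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

# Illustrative data: linear truth with Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 50)
X = np.column_stack([np.ones_like(x), x])
y = 0.5 + 1.2 * x + 0.5 * rng.normal(size=50)

# Empirical Bayes: pick the alpha that maximizes the log evidence (beta fixed here).
beta = 1 / 0.5**2
alphas = np.logspace(-3, 3, 25)
evidences = [log_marginal_likelihood(X, y, a, beta) for a in alphas]
best_alpha = alphas[int(np.argmax(evidences))]
print(f"alpha maximizing the evidence: {best_alpha:.3f}")
```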
Intuition:
The marginal likelihood balances two forces: how well the model fits the observed data, and how much probability the model spreads over datasets it did not observe. Models that balance fit and simplicity have high marginal likelihood. This is automatic Occam's razor: complex models spread probability mass over many possible datasets, leaving little for the one actually observed.
Applications:
Hyperparameter Selection: Different α values define different models. Choose α to maximize marginal likelihood. $$\alpha^* = \arg\max_\alpha \log p(\mathbf{y} | \mathbf{X}, \alpha)$$ This is called empirical Bayes or type-II maximum likelihood.
Feature Selection: Compare models with different feature subsets. Higher marginal likelihood = better model.
Model Class Comparison: Compare linear models to polynomial models to Gaussian processes. The marginal likelihood accounts for complexity.
BIC Approximation:
For quick comparisons, the Bayesian Information Criterion approximates log marginal likelihood: $$\text{BIC} \approx \log p(\mathbf{y}|\mathbf{X}) \approx \log p(\mathbf{y}|\mathbf{X}, \hat{\mathbf{w}}) - \frac{D}{2}\log N$$
The penalty (D/2)log N discourages complex models.
Marginal likelihood compares models within a model class. If the true data-generating process isn't in your model class, the highest marginal likelihood doesn't mean 'true.' It means 'best approximation among candidates.' Always consider whether your model class is rich enough.
Based on experience deploying Bayesian models in production, here are practical guidelines:
Always Check Calibration:
Before trusting any uncertainty estimate, verify calibration on held-out data. Run the diagnostic plots from Section 1. Miscalibrated uncertainties are worse than no uncertainties—they create false confidence.
Start Simple:
Bayesian linear regression with reasonable priors is a strong baseline. Its uncertainties are well-understood and cheap to compute. Only add complexity (GPs, neural networks) when linear isn't enough.
Be Transparent About Assumptions:

State the assumptions behind your intervals: linearity, Gaussian homoscedastic noise, and the choice of prior. The reported uncertainty is conditional on those assumptions, not an unconditional guarantee.

Communicate Appropriately:

Match the presentation to the audience, as discussed in the visualization section: full distributions and calibration diagnostics for technical reviewers, simple ranges or worst-case bounds for decision-makers and executives.
Uncertainty estimates are only as good as the model that produces them. A wrong model gives wrong uncertainties. Bayesian inference doesn't protect against model misspecification—it faithfully reports uncertainty within the model, which may not reflect reality. Always validate against real outcomes.
We've explored how to take the theoretical predictive distribution and use it effectively in practice—validating, visualizing, decomposing, and acting on uncertainty.
What's Next:
We've covered the entire Bayesian linear regression pipeline: priors, posteriors, predictions, and uncertainty quantification. The final page reveals a profound connection that ties it all together: the connection between Bayesian inference and regularization. We'll see that Ridge regression, LASSO, and other regularized methods are Bayesian methods in disguise—and this perspective illuminates when and why regularization works.
You now have practical tools for using Bayesian uncertainty: calibration checks, visualization techniques, decomposition into epistemic and aleatoric components, and frameworks for decision-making under uncertainty. Next, we'll complete the picture by revealing the deep connection between Bayesian inference and regularization.