Most machine learning models produce point predictions: a single "best guess" for each input. Gaussian Processes go further—they provide complete predictive distributions that quantify uncertainty at every point.
This uncertainty-aware prediction is not a luxury; it's essential for real-world applications. When a GP predicts temperature will be 25°C with standard deviation of 0.5°C, that's fundamentally different from predicting 25°C with standard deviation of 10°C. The mean is the same, but the implications for decision-making are vastly different.
This page explores how to interpret, visualize, and leverage the full predictive distribution from GP regression. We'll move beyond seeing uncertainty as a curiosity to understanding it as actionable information.
By the end of this page, you will understand how to interpret the posterior mean as a prediction and the posterior variance as uncertainty, how to construct and visualize confidence/credible intervals, the distinction between epistemic and observation uncertainty, and how to use predictive distributions for decision-making and active learning.
The posterior mean $\mu_*(x)$ serves as our "best estimate" of the function value at input $x$. But what exactly does "best" mean?
Optimal Under Squared Error Loss:
The posterior mean minimizes expected squared error: $$\mu_* = \arg\min_a \mathbb{E}[(f(x_*) - a)^2 | \mathcal{D}]$$
If we must commit to a single prediction and will be penalized by squared error, the mean is optimal.
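This optimality is easy to check numerically. The sketch below (values are illustrative, not from the text) draws samples from a hypothetical Gaussian posterior at a single test point and scans candidate point predictions: the expected squared error is minimized at the posterior mean.

```python
import numpy as np

# Hypothetical posterior at one test point: f(x*) | D ~ N(0.7, 0.3^2)
mu, sigma = 0.7, 0.3
rng = np.random.default_rng(0)
f_samples = rng.normal(mu, sigma, 100_000)  # draws from the posterior

# Expected squared error (Monte Carlo) for a grid of candidate predictions a
candidates = np.linspace(-1, 2, 601)
risks = [np.mean((f_samples - a) ** 2) for a in candidates]
best = candidates[np.argmin(risks)]

print(f"argmin of expected squared error: {best:.2f} (posterior mean is {mu})")
```

The minimizer lands on the posterior mean up to Monte Carlo and grid resolution; under absolute-error loss the minimizer would instead be the posterior median (identical here, since the posterior is Gaussian).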
As a Weighted Combination:
We showed earlier that: $$\mu_* = m(x_*) + \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m})$$
This reveals the structure: the prediction starts from the prior mean and adds a correction based on the discrepancy between observations and prior expectations, weighted by correlations.
Key Properties of the Posterior Mean:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Standard GP prediction returning mean and std."""
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))

np.random.seed(42)

# Demonstration 1: Mean as optimal estimate
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Generate data
X_train = np.array([[-2], [0], [1], [3]])
y_train = np.array([0.5, 0, 1, 0.5])
X_test = np.linspace(-4, 5, 200).reshape(-1, 1)

mean, std = gp_predict(X_train, y_train, X_test, length_scale=1.0, noise_var=0.01)

# Plot: Mean as central prediction
ax = axes[0, 0]
ax.fill_between(X_test.ravel(), mean - 2*std, mean + 2*std, alpha=0.2, color='C0')
ax.plot(X_test, mean, 'C0-', linewidth=2, label='Posterior mean')
ax.scatter(X_train, y_train, c='red', s=80, zorder=5, label='Data')
ax.axhline(0, color='gray', linestyle='--', alpha=0.3)
ax.set_title('Posterior Mean: Best Estimate Under L2 Loss')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot: Reversion to prior far from data
ax = axes[0, 1]
X_train_sparse = np.array([[0]])
y_train_sparse = np.array([2])
mean_sparse, std_sparse = gp_predict(X_train_sparse, y_train_sparse, X_test,
                                     length_scale=1.5, noise_var=0.01)

ax.fill_between(X_test.ravel(), mean_sparse - 2*std_sparse, mean_sparse + 2*std_sparse,
                alpha=0.2, color='C0')
ax.plot(X_test, mean_sparse, 'C0-', linewidth=2, label='Posterior mean')
ax.axhline(0, color='C1', linestyle='--', linewidth=2, label='Prior mean (= 0)')
ax.scatter(X_train_sparse, y_train_sparse, c='red', s=100, zorder=5,
           label='Single observation')
ax.set_title('Mean Reverts to Prior Far from Data')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot: Linearity in observations
ax = axes[1, 0]
y1 = np.array([1, 0, 1])
y2 = 2 * y1  # Double the observations
X_train_3 = np.array([[-1], [0], [1]])

mean1, _ = gp_predict(X_train_3, y1, X_test, noise_var=0.01)
mean2, _ = gp_predict(X_train_3, y2, X_test, noise_var=0.01)

ax.plot(X_test, mean1, 'C0-', linewidth=2, label='Original y')
ax.plot(X_test, mean2, 'C1--', linewidth=2, label='Doubled y')
ax.plot(X_test, 2*mean1, 'k:', linewidth=2, label='2 × Original mean')
ax.scatter(X_train_3, y1, c='C0', s=60, zorder=5)
ax.scatter(X_train_3, y2, c='C1', s=60, zorder=5)
ax.set_title('Mean is Linear in Observations')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot: Smoothness from kernel
ax = axes[1, 1]
X_train_4 = np.array([[-2], [0], [2]])
y_train_4 = np.array([0, 1, 0])

for ls, style, label in [(0.5, '-', 'ℓ=0.5 (less smooth)'),
                         (2.0, '--', 'ℓ=2.0 (smoother)')]:
    mean_ls, _ = gp_predict(X_train_4, y_train_4, X_test, length_scale=ls, noise_var=0.01)
    ax.plot(X_test, mean_ls, style, linewidth=2, label=label)

ax.scatter(X_train_4, y_train_4, c='red', s=80, zorder=5, label='Data')
ax.set_title('Smoothness Controlled by Length Scale')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Properties of the Posterior Mean', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

The posterior variance $\sigma_*^2(x)$ quantifies our uncertainty about the function value at each input. This is arguably the most valuable output of a GP—the ability to say not just "I predict 25" but "I predict 25 ± 2 with 95% confidence."
Understanding the Variance Formula:
$$\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$
The subtracted term is a quadratic form in a positive semi-definite matrix, so it is never negative and never exceeds the prior variance $k(x_*, x_*)$. Consequently the posterior variance stays between zero and the prior variance: conditioning on observations can only reduce uncertainty, never increase it.
Key Properties of the Posterior Variance:
The fact that variance is independent of observed y values is profound. It means we can plan experiments (decide where to sample next) purely based on input locations, without needing to run the experiments first. This is the foundation of active learning and Bayesian optimization.
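A minimal sketch of this point (the helper `gp_post_std` is illustrative, using the same squared-exponential kernel as the code above): the posterior standard deviation can be computed from the input locations alone, so the observed values never even enter the function.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_post_std(X_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Posterior std of a squared-exponential GP: note that y never appears."""
    def kernel(X1, X2):
        sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    L = cholesky(K, lower=True)
    v = solve_triangular(L, kernel(X_test, X_train).T, lower=True)
    return np.sqrt(np.maximum(signal_var - np.sum(v**2, axis=0), 0))

# Only input locations are needed, no observations:
X_train = np.array([[-1.0], [0.0], [2.0]])
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
std = gp_post_std(X_train, X_test)

# Uncertainty is low near training inputs and high far from them,
# regardless of what y-values will eventually be measured there.
print(std.min(), std.max())
```

This is exactly why an experiment can be designed (choose the next sampling location to minimize posterior variance) before any measurement is taken.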
Two Components of Predictive Uncertainty:
When making predictions, we must distinguish two sources of uncertainty:
Epistemic Uncertainty (Model Uncertainty)
Aleatoric Uncertainty (Observation Noise)
When predicting a new observation $y_* = f(x_*) + \epsilon$, the total predictive variance is:
$$\text{Var}(y_* | \mathcal{D}) = \sigma_*^2 + \sigma_n^2$$
The first term is epistemic uncertainty, which shrinks as data accumulates (reducible); the second is observation noise, which no amount of data removes (irreducible).
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular

def gp_predict_full(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """GP prediction with both epistemic and total predictive variance."""
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha

    # Epistemic (model) uncertainty
    epistemic_var = signal_var - np.sum(v**2, axis=0)
    epistemic_var = np.maximum(epistemic_var, 0)

    # Total predictive uncertainty (for a new observation)
    total_var = epistemic_var + noise_var

    return mean, np.sqrt(epistemic_var), np.sqrt(total_var)

np.random.seed(42)

# Generate data
X_train = np.array([[-3], [-1], [0], [2]])
y_train = np.array([0.5, -0.5, 0, 1])
X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

mean, epistemic_std, total_std = gp_predict_full(
    X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.2)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Both types of uncertainty
ax = axes[0]
ax.fill_between(X_test.ravel(), mean - 2*total_std, mean + 2*total_std,
                alpha=0.15, color='C1', label='Total (epistemic + noise)')
ax.fill_between(X_test.ravel(), mean - 2*epistemic_std, mean + 2*epistemic_std,
                alpha=0.3, color='C0', label='Epistemic only')
ax.plot(X_test, mean, 'C0-', linewidth=2, label='Posterior mean')
ax.scatter(X_train, y_train, c='red', s=80, zorder=5, label='Data')
ax.set_title('Epistemic vs Total Uncertainty')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: How uncertainty evolves as data accumulates
ax = axes[1]
X_test_local = np.linspace(-3, 3, 150).reshape(-1, 1)
X_sequential = [np.array([[-2]]),
                np.array([[-2], [0]]),
                np.array([[-2], [0], [2]]),
                np.array([[-2], [-1], [0], [1], [2]])]
y_sequential = [np.array([0]),
                np.array([0, 1]),
                np.array([0, 1, 0]),
                np.array([0, 0.5, 1, 0.5, 0])]
labels = ['n=1', 'n=2', 'n=3', 'n=5']
colors = ['C0', 'C1', 'C2', 'C3']

for X_i, y_i, label, color in zip(X_sequential, y_sequential, labels, colors):
    _, std_i, _ = gp_predict_full(X_i, y_i, X_test_local, noise_var=0.01)
    ax.plot(X_test_local, std_i, linewidth=2, color=color, label=label)
    ax.scatter(X_i, np.zeros_like(X_i), c=color, s=30, zorder=5, marker='|')

ax.set_title('Uncertainty Decreases with More Data')
ax.set_xlabel('x')
ax.set_ylabel('Posterior Std Dev')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Understanding GP Uncertainty', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Visualization: Epistemic vs Aleatoric
fig, ax = plt.subplots(figsize=(10, 6))
noise_vars = [0.01, 0.1, 0.5]

for nv, style in zip(noise_vars, ['-', '--', ':']):
    mean, eps_std, tot_std = gp_predict_full(X_train, y_train, X_test, noise_var=nv)
    ax.plot(X_test, tot_std, style, linewidth=2, label=f'σₙ²={nv}')

ax.set_xlabel('x')
ax.set_ylabel('Total Predictive Std Dev')
ax.set_title('Effect of Noise Variance on Predictive Uncertainty')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```

Given the posterior distribution, we can construct intervals that contain the true function value with specified probability.
Bayesian Credible Intervals:
For a GP, the posterior at any point is Gaussian: $$f(x_*) \mid \mathcal{D} \sim \mathcal{N}(\mu_*, \sigma_*^2)$$
A credible interval is simply a probability interval under this posterior:
Interpretation:
"There is a 95% posterior probability that the true function value lies within this interval."
This is a Bayesian statement—it quantifies our belief given the data and prior. It's distinct from frequentist confidence intervals, which make statements about procedure coverage across repeated sampling.
Frequentist confidence intervals say: "If we repeated this experiment infinitely, 95% of intervals constructed this way would contain the true value." Bayesian credible intervals say: "Given our data and prior, there's a 95% probability the true value is here." For GP regression, credible intervals are natural and more intuitive for decision-making.
| Probability | Multiplier (z) | Interval |
|---|---|---|
| 68.27% | 1.0 | $\mu \pm \sigma$ |
| 90% | 1.645 | $\mu \pm 1.645\sigma$ |
| 95% | 1.96 | $\mu \pm 1.96\sigma$ |
| 99% | 2.576 | $\mu \pm 2.576\sigma$ |
| 99.7% | 3.0 | $\mu \pm 3\sigma$ |
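The multipliers in the table come from the inverse CDF of the standard normal, $z = \Phi^{-1}\!\big(\tfrac{1+p}{2}\big)$ for central probability $p$; a quick check with scipy reproduces them:

```python
from scipy import stats

# Two-sided multiplier z for a central credible probability p
for p in [0.6827, 0.90, 0.95, 0.99, 0.9973]:
    z = stats.norm.ppf((1 + p) / 2)
    print(f"p = {p:<6}  ->  z = {z:.3f}")
```

(Note the 3σ row: $\mu \pm 3\sigma$ covers 99.73% exactly, which the table rounds to 99.7%.)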
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular
from scipy import stats

def gp_predict_with_intervals(X_train, y_train, X_test, length_scale=1.0,
                              signal_var=1.0, noise_var=0.1,
                              confidence_levels=[0.68, 0.95, 0.99]):
    """GP prediction with multiple credible intervals."""
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    std = np.sqrt(np.maximum(var, 0))

    # Compute intervals for each confidence level
    intervals = {}
    for level in confidence_levels:
        z = stats.norm.ppf(0.5 + level / 2)  # Two-sided z-score
        intervals[level] = {
            'lower': mean - z * std,
            'upper': mean + z * std,
            'z': z
        }

    return mean, std, intervals

np.random.seed(42)

# Generate synthetic data
X_train = np.linspace(-3, 3, 8).reshape(-1, 1)
y_train = np.sin(X_train.ravel()) + 0.1 * np.random.randn(len(X_train))
X_test = np.linspace(-4, 4, 200).reshape(-1, 1)

mean, std, intervals = gp_predict_with_intervals(
    X_train, y_train, X_test, length_scale=1.0, noise_var=0.01)

# Visualization: Nested credible intervals
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Nested bands
ax = axes[0]
colors = ['#1f77b4', '#1f77b4', '#1f77b4']
alphas = [0.1, 0.2, 0.35]

for level, color, alpha in zip([0.99, 0.95, 0.68], colors, alphas):
    interval = intervals[level]
    ax.fill_between(X_test.ravel(), interval['lower'], interval['upper'],
                    alpha=alpha, color=color,
                    label=f'{int(level*100)}% CI (±{interval["z"]:.2f}σ)')

ax.plot(X_test, mean, 'C0-', linewidth=2, label='Mean')
ax.plot(X_test, np.sin(X_test.ravel()), 'k--', linewidth=1.5,
        label='True function', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=60, zorder=5, label='Data')
ax.set_title('Nested Credible Intervals')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 2: Coverage analysis
ax = axes[1]

# Generate many test points and check coverage
n_test = 100
X_test_coverage = np.linspace(-3, 3, n_test).reshape(-1, 1)
f_true = np.sin(X_test_coverage.ravel())  # True function values

mean_cov, std_cov, _ = gp_predict_with_intervals(
    X_train, y_train, X_test_coverage, noise_var=0.01)

# Check what percentage of true values fall within different intervals
theory_probs = []
empirical_probs = []

for z_mult in np.linspace(0.1, 3.5, 50):
    lower = mean_cov - z_mult * std_cov
    upper = mean_cov + z_mult * std_cov

    # Empirical coverage
    inside = np.sum((f_true >= lower) & (f_true <= upper))
    empirical_probs.append(inside / n_test)

    # Theoretical coverage
    theory_probs.append(2 * stats.norm.cdf(z_mult) - 1)

ax.plot(theory_probs, empirical_probs, 'o-', markersize=3, linewidth=1.5)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1.5, label='Perfect calibration')
ax.set_xlabel('Theoretical Coverage')
ax.set_ylabel('Empirical Coverage')
ax.set_title('Calibration Plot: Are Credible Intervals Calibrated?')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')

plt.suptitle('Credible Intervals and Calibration', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

While mean and variance summaries are useful, the full predictive distribution contains richer information.
For the Latent Function $f(x_*)$:
$$p(f_* \mid \mathcal{D}, x_*) = \mathcal{N}(\mu_*, \sigma_*^2)$$
This is the epistemic predictive distribution—uncertainty about the true function.
For a New Observation $y_*$:
If we'll observe $y_* = f(x_*) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$:
$$p(y_* \mid \mathcal{D}, x_*) = \mathcal{N}(\mu_*, \sigma_*^2 + \sigma_n^2)$$
This is the total predictive distribution, accounting for both model and observation uncertainty.
Uses of the Full Distribution:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))

np.random.seed(42)

# Setup
X_train = np.array([[-2], [0], [1]])
y_train = np.array([0.5, 0, 1])
X_test = np.linspace(-3, 3, 150).reshape(-1, 1)

mean, std = gp_predict(X_train, y_train, X_test, noise_var=0.01)

# Focus on a specific test point
x_query = np.array([[1.5]])
mean_q, std_q = gp_predict(X_train, y_train, x_query, noise_var=0.01)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Full GP with highlighted query point
ax = axes[0]
ax.fill_between(X_test.ravel(), mean - 2*std, mean + 2*std, alpha=0.2, color='C0')
ax.plot(X_test, mean, 'C0-', linewidth=2)
ax.scatter(X_train, y_train, c='red', s=80, zorder=5)
ax.axvline(x_query.item(), color='green', linewidth=2, linestyle='--')
ax.scatter([x_query.item()], [mean_q.item()], c='green', s=100, zorder=6, marker='D')
ax.set_title('GP Prediction with Query Point')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.grid(True, alpha=0.3)

# Plot 2: Predictive distribution at query point
ax = axes[1]
f_vals = np.linspace(mean_q.item() - 4*std_q.item(), mean_q.item() + 4*std_q.item(), 100)
pdf = stats.norm.pdf(f_vals, mean_q.item(), std_q.item())

ax.plot(f_vals, pdf, 'C0-', linewidth=2)
ax.fill_between(f_vals, 0, pdf, alpha=0.3, color='C0')
ax.axvline(mean_q.item(), color='red', linewidth=2, linestyle='--', label='Mean')
ax.axvline(0, color='gray', linewidth=1, linestyle=':', label='Threshold at 0')

# Shade probability f > 0
f_positive = f_vals[f_vals >= 0]
pdf_positive = stats.norm.pdf(f_positive, mean_q.item(), std_q.item())
ax.fill_between(f_positive, 0, pdf_positive, alpha=0.5, color='C1',
                label=f'P(f > 0) = {1 - stats.norm.cdf(0, mean_q.item(), std_q.item()):.3f}')

ax.set_title(f'Predictive Distribution at x = {x_query.item():.1f}')
ax.set_xlabel('f(x)')
ax.set_ylabel('Density')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 3: Probability queries across input space
ax = axes[2]
threshold = 0.5
prob_above = 1 - stats.norm.cdf(threshold, mean, std)

ax.fill_between(X_test.ravel(), 0, prob_above, alpha=0.3, color='C2')
ax.plot(X_test, prob_above, 'C2-', linewidth=2)
ax.axhline(0.5, color='gray', linestyle='--', alpha=0.5)
ax.scatter(X_train, np.zeros_like(y_train) + 0.02, c='red', s=60, marker='^')

ax.set_title(f'P(f(x) > {threshold}) Across Input Space')
ax.set_xlabel('x')
ax.set_ylabel('Probability')
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)

plt.suptitle('Using the Full Predictive Distribution', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

Effective visualization of GP predictions is crucial for interpretation and communication. Here are best practices:
Standard Elements:
Color and Transparency:
Avoid Common Mistakes:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    K_ss = kernel(X_test, X_test)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    cov = K_ss - v.T @ v + 1e-8 * np.eye(len(X_test))
    return mean, cov

def excellent_gp_plot(X_train, y_train, X_test, mean, std, samples=None,
                      true_func=None, title='GP Regression', noise_var=0.1):
    """Create a publication-quality GP visualization."""
    fig, ax = plt.subplots(figsize=(10, 6))

    # Style settings
    plt.style.use('default')
    ax.set_facecolor('#fafafa')

    # Outer band: 99% CI
    ax.fill_between(X_test.ravel(), mean - 2.58*std, mean + 2.58*std,
                    alpha=0.1, color='#1f77b4', edgecolor='none',
                    label='99% credible interval')

    # Middle band: 95% CI
    ax.fill_between(X_test.ravel(), mean - 1.96*std, mean + 1.96*std,
                    alpha=0.2, color='#1f77b4', edgecolor='none',
                    label='95% credible interval')

    # Inner band: 68% CI
    ax.fill_between(X_test.ravel(), mean - std, mean + std,
                    alpha=0.35, color='#1f77b4', edgecolor='none',
                    label='68% credible interval')

    # Posterior samples (if provided)
    if samples is not None:
        for i, sample in enumerate(samples.T[:5]):
            ax.plot(X_test, sample, color='#1f77b4', alpha=0.3, linewidth=1,
                    zorder=2, label='Posterior samples' if i == 0 else None)

    # Posterior mean
    ax.plot(X_test, mean, color='#1f77b4', linewidth=2.5,
            label='Posterior mean', zorder=3)

    # True function (if provided)
    if true_func is not None:
        ax.plot(X_test, true_func, 'k--', linewidth=2, alpha=0.7,
                label='True function', zorder=4)

    # Training data
    ax.scatter(X_train, y_train, c='#d62728', s=80, zorder=5,
               edgecolors='white', linewidths=1.5, label='Training observations')

    # Formatting
    ax.set_xlabel('Input, x', fontsize=12)
    ax.set_ylabel('Output, f(x)', fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.legend(loc='upper left', framealpha=0.9)
    ax.grid(True, alpha=0.3, linestyle='--')

    # Add annotation about noise
    ax.text(0.98, 0.02, f'Observation noise: σₙ² = {noise_var}',
            transform=ax.transAxes, ha='right', va='bottom',
            fontsize=9, color='gray')

    plt.tight_layout()
    return fig, ax

# Generate example
np.random.seed(42)

# True function
def true_func(x):
    return np.sin(x) + 0.3 * np.sin(3*x)

# Generate training data
X_train = np.random.uniform(-4, 4, 12).reshape(-1, 1)
X_train = np.sort(X_train, axis=0)
y_train = true_func(X_train.ravel()) + 0.15 * np.random.randn(len(X_train))

X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

mean, cov = gp_predict(X_train, y_train, X_test, length_scale=1.0, noise_var=0.02)
std = np.sqrt(np.diag(cov))

# Generate posterior samples
L = cholesky(cov, lower=True)
samples = mean[:, np.newaxis] + L @ np.random.randn(len(X_test), 10)

# Create excellent visualization
fig, ax = excellent_gp_plot(
    X_train.ravel(), y_train, X_test.ravel(), mean, std, samples,
    true_func=true_func(X_test.ravel()),
    title='Gaussian Process Regression with Uncertainty',
    noise_var=0.02)
plt.show()
```

The ultimate purpose of predictions is often to support decisions. GP predictions, with their uncertainty quantification, are particularly well-suited for decision-making under uncertainty.
Key Quantities for Decisions:
Probability of Improvement (PI) $$PI(x) = P(f(x) > f^+) = \Phi\left(\frac{\mu(x) - f^+}{\sigma(x)}\right)$$ where $f^+$ is the best observed value and $\Phi$ is the standard normal CDF.
Expected Improvement (EI) $$EI(x) = \mathbb{E}[\max(f(x) - f^+, 0)]$$ $$= (\mu(x) - f^+)\Phi(Z) + \sigma(x)\phi(Z)$$ where $Z = \frac{\mu(x) - f^+}{\sigma(x)}$ and $\phi$ is the standard normal PDF.
Upper Confidence Bound (UCB) $$UCB(x) = \mu(x) + \kappa \cdot \sigma(x)$$ Balances exploitation (high $\mu$) vs exploration (high $\sigma$).
These form the basis of Bayesian Optimization—sequential decision-making for expensive black-box function optimization.
Should we exploit what we know (query where the mean is high) or explore where we're uncertain (query where the variance is high)? GP uncertainty naturally enables this tradeoff. Acquisition functions like EI and UCB automatically balance both considerations.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.linalg import cholesky, solve_triangular

def gp_predict(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2 * length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)
    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))

def acquisition_functions(mean, std, y_best, kappa=2.0):
    """Compute common acquisition functions for Bayesian optimization."""
    # Standardized improvement
    with np.errstate(divide='ignore', invalid='ignore'):
        Z = (mean - y_best) / std
    Z = np.where(std > 0, Z, 0)

    # Probability of Improvement
    PI = stats.norm.cdf(Z)

    # Expected Improvement
    EI = (mean - y_best) * stats.norm.cdf(Z) + std * stats.norm.pdf(Z)
    EI = np.where(std > 0, EI, 0)

    # Upper Confidence Bound
    UCB = mean + kappa * std

    return PI, EI, UCB

np.random.seed(42)

# Simulate optimization scenario
# True function (unknown to optimizer)
def objective(x):
    return -(x - 0.5)**2 * np.sin(3*x) + 0.5

# Current observations
X_train = np.array([[-3], [-1], [1], [3]])
y_train = objective(X_train.ravel()) + 0.05 * np.random.randn(len(X_train))

# Candidate points
X_test = np.linspace(-4, 4, 200).reshape(-1, 1)

# GP prediction
mean, std = gp_predict(X_train, y_train, X_test, length_scale=1.0, noise_var=0.0025)

# Best observed value
y_best = np.max(y_train)

# Compute acquisitions
PI, EI, UCB = acquisition_functions(mean, std, y_best)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: GP prediction
ax = axes[0, 0]
ax.fill_between(X_test.ravel(), mean - 2*std, mean + 2*std, alpha=0.2, color='C0')
ax.plot(X_test, mean, 'C0-', linewidth=2, label='GP Mean')
ax.plot(X_test, objective(X_test.ravel()), 'k--', linewidth=1.5,
        label='True objective', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=80, zorder=5, label='Observations')
ax.axhline(y_best, color='green', linestyle=':', label=f'Best so far: {y_best:.2f}')
ax.set_title('GP Model of Objective Function')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Probability of Improvement
ax = axes[0, 1]
ax.fill_between(X_test.ravel(), 0, PI, alpha=0.3, color='C1')
ax.plot(X_test, PI, 'C1-', linewidth=2)
best_PI_idx = np.argmax(PI)
ax.axvline(X_test[best_PI_idx], color='C1', linestyle='--')
ax.scatter([X_test[best_PI_idx]], [PI[best_PI_idx]], c='C1', s=100, zorder=5, marker='*')
ax.set_title('Probability of Improvement (PI)')
ax.set_xlabel('x')
ax.set_ylabel('PI(x)')
ax.grid(True, alpha=0.3)

# Plot 3: Expected Improvement
ax = axes[1, 0]
ax.fill_between(X_test.ravel(), 0, EI, alpha=0.3, color='C2')
ax.plot(X_test, EI, 'C2-', linewidth=2)
best_EI_idx = np.argmax(EI)
ax.axvline(X_test[best_EI_idx], color='C2', linestyle='--')
ax.scatter([X_test[best_EI_idx]], [EI[best_EI_idx]], c='C2', s=100, zorder=5, marker='*')
ax.set_title('Expected Improvement (EI)')
ax.set_xlabel('x')
ax.set_ylabel('EI(x)')
ax.grid(True, alpha=0.3)

# Plot 4: Upper Confidence Bound
ax = axes[1, 1]
ax.plot(X_test, mean, 'C0-', linewidth=1.5, label='Mean', alpha=0.7)
ax.plot(X_test, UCB, 'C3-', linewidth=2, label='UCB (κ=2)')
best_UCB_idx = np.argmax(UCB)
ax.axvline(X_test[best_UCB_idx], color='C3', linestyle='--')
ax.scatter([X_test[best_UCB_idx]], [UCB[best_UCB_idx]], c='C3', s=100, zorder=5, marker='*')
ax.fill_between(X_test.ravel(), mean, UCB, alpha=0.2, color='C3')
ax.set_title('Upper Confidence Bound (UCB)')
ax.set_xlabel('x')
ax.set_ylabel('UCB(x)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Acquisition Functions for Decision-Making', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Summary of recommendations
print("Next point to sample:")
print(f"  According to PI:  x = {X_test[best_PI_idx].item():.3f}")
print(f"  According to EI:  x = {X_test[best_EI_idx].item():.3f}")
print(f"  According to UCB: x = {X_test[best_UCB_idx].item():.3f}")
```

We've explored how to interpret and use GP predictions effectively, moving beyond point estimates to full probabilistic inference.
What's Next:
We'll conclude this module by comparing Gaussian Processes with Kernel Ridge Regression, understanding their connections and differences, and when to choose one over the other.
You now understand how to interpret GP predictions, visualize uncertainty, and use probabilistic outputs for decision-making. This practical knowledge transforms GPs from mathematical abstractions into powerful tools for real-world prediction and optimization.