In the previous page, we derived the Gaussian Process posterior—the distribution over functions conditioned on observed data. But knowing the posterior is only the first step. The ultimate goal of any regression model is prediction: given a new input $\mathbf{x}_*$, what is the corresponding output?
For point-estimate regression methods (linear regression, neural networks, etc.), prediction means computing a single number $\hat{y}_*$. But Gaussian Processes offer something far richer: a predictive distribution that fully characterizes our uncertainty about $y_*$. This distribution tells us not just what to expect, but how confident we should be—and when we should trust our predictions.
The predictive distribution is what makes GPs uniquely valuable for applications where uncertainty matters: Bayesian optimization, active learning, model-based reinforcement learning, and safety-critical systems. In this page, we develop the complete theory of GP prediction.
By the end of this page, you will: (1) Derive the predictive distribution at a single test point, (2) Understand the joint predictive distribution over multiple test points, (3) Distinguish between marginal and joint predictions, (4) Sample functions from the predictive distribution, (5) Interpret prediction intervals correctly.
Consider a single test input $\mathbf{x}_*$. We want the predictive distribution $p(f_* | \mathbf{x}_*, \mathbf{X}, \mathbf{y})$—the probability distribution over function values at this new point, conditioned on our training data.
From the posterior derivation, this distribution is Gaussian:
$$f_* | \mathbf{x}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\mu_*, \sigma_*^2)$$
where (assuming zero prior mean):
$$\mu_* = \mathbf{k}_*^\top K^{-1} \mathbf{y}$$
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*$$
Here $\mathbf{k}_* = [k(\mathbf{x}_*, \mathbf{x}_1), \ldots, k(\mathbf{x}_*, \mathbf{x}_n)]^\top$ is the vector of covariances between the test point and all training points, and $K = K(\mathbf{X}, \mathbf{X})$ is the $n \times n$ training covariance matrix.
For prediction at a single point, you need exactly two numbers: the predictive mean μ* (your best guess) and the predictive variance σ*² (your uncertainty). Together, these fully characterize the Gaussian predictive distribution.
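The two formulas above can be computed in a few lines. Below is a minimal sketch with made-up data and an assumed RBF kernel (the function `k`, the inputs, and the hyperparameters are illustrative, not from the text):

```python
import numpy as np

# Assumed RBF kernel with lengthscale l (illustrative choice)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0])      # training inputs (made up)
y = np.array([0.5, 1.0, -0.3])      # training outputs (noise-free)

K = k(X[:, None], X[None, :])       # (3, 3) training covariance
x_star = 0.5                        # single test input
k_star = k(X, x_star)               # covariances with the test point

alpha = np.linalg.solve(K, y)
mu_star = k_star @ alpha            # predictive mean  k_*^T K^{-1} y
var_star = k(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)  # predictive variance

print(f"mu* = {mu_star:.3f}, sigma*^2 = {var_star:.4f}")
```

These two numbers are exactly the "best guess" and "uncertainty" described above; everything later in the page builds on this pair.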
Decomposing the Predictive Mean
The predictive mean has a beautiful interpretation as a weighted average of training outputs:
$$\mu_* = \mathbf{k}_*^\top K^{-1} \mathbf{y} = \sum_{i=1}^{n} w_i y_i$$
where the weights $\mathbf{w} = K^{-1} \mathbf{k}_*$ depend on the kernel, its hyperparameters, and the geometry of the training inputs relative to $\mathbf{x}_*$.
The weights $w_i$ can be positive or negative, and they sum to a value that depends on the kernel and point configuration. For the RBF kernel, if $\mathbf{x}_*$ is far from all training points, the weights become negligible and $\mu_* \to 0$ (the prior mean).
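This decay of the weights can be checked numerically. The sketch below uses made-up data and an assumed RBF kernel, and prints the total weight magnitude as the test point moves away from the training set:

```python
import numpy as np

# Assumed RBF kernel (illustrative)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0])          # made-up training inputs
K = k(X[:, None], X[None, :])

totals = {}
for x_star in [0.5, 5.0, 10.0]:
    w = np.linalg.solve(K, k(X, x_star))    # weights w = K^{-1} k_*
    totals[x_star] = np.abs(w).sum()
    print(f"x* = {x_star:4.1f}  ->  sum |w_i| = {totals[x_star]:.6f}")
# The total weight magnitude decays rapidly once x* is a few lengthscales
# from the data, which is why mu* reverts to the prior mean of 0.
```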
Decomposing the Predictive Variance
The predictive variance reveals how uncertainty decreases with proximity to training data:
$$\sigma_*^2 = \underbrace{k(\mathbf{x}_*, \mathbf{x}_*)}_{\text{prior variance}} - \underbrace{\mathbf{k}_*^\top K^{-1} \mathbf{k}_*}_{\text{variance reduction}}$$
| Scenario | Predictive Mean μ* | Predictive Variance σ*² | Interpretation |
|---|---|---|---|
| x* = training point xᵢ | Exactly yᵢ | 0 | Perfect interpolation |
| x* very close to xᵢ | ≈ yᵢ | ≈ 0 (very small) | High-confidence prediction |
| x* between training points | Weighted interpolation | Moderate | Reasonable uncertainty |
| x* far from all training | → 0 (prior mean) | → k(x*,x*) (prior) | Maximum uncertainty |
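The first row of the table (exact interpolation in the noise-free case) can be verified directly. This check uses made-up data and an assumed RBF kernel:

```python
import numpy as np

# Assumed RBF kernel (illustrative)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0])      # made-up training inputs
y = np.array([0.5, 1.0, -0.3])
K = k(X[:, None], X[None, :])

x_star = X[1]                       # reuse a training input as the test point
k_star = k(X, x_star)               # this is just a column of K
mu = k_star @ np.linalg.solve(K, y)
var = k(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)

print(f"mu* = {mu:.6f} (y_1 = {y[1]}), sigma*^2 = {var:.2e}")
# mu* matches y_1 and the variance is zero up to floating-point error,
# because K^{-1} k_* is the unit vector e_1 when k_* is a column of K.
```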
Often we want to make predictions at multiple test points simultaneously. The joint predictive distribution captures not just individual prediction uncertainties, but correlations between predictions at different test points.
For test inputs $\mathbf{X}_* = [\mathbf{x}_*^{(1)}, \ldots, \mathbf{x}_*^{(m)}]$, the joint predictive distribution is:
$$\mathbf{f}_* | \mathbf{X}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$$
where:
$$\boldsymbol{\mu}_* = K_*^\top K^{-1} \mathbf{y} \in \mathbb{R}^m$$
$$\Sigma_* = K_{**} - K_*^\top K^{-1} K_* \in \mathbb{R}^{m \times m}$$
Here $K_* = K(\mathbf{X}, \mathbf{X}_*)$ is the $n \times m$ cross-covariance matrix, and $K_{**} = K(\mathbf{X}_*, \mathbf{X}_*)$ is the $m \times m$ test-point covariance matrix.
The diagonal of Σ* gives marginal variances at each test point. But the off-diagonal entries capture prediction correlations. Two nearby test points will have correlated predictions—if you're wrong at one, you're likely wrong at the other in the same direction. This joint structure matters for sampling coherent functions and for sequential decision-making.
Why Joint Predictions Matter
The joint predictive distribution is essential for several tasks:
Function Sampling: To draw a smooth function sample from the posterior, you must sample from the joint distribution, not independent marginals. Independent samples from marginals would yield jagged, discontinuous 'functions'.
Optimization: In Bayesian optimization, the joint distribution over multiple potential next points enables batch acquisition.
Uncertainty Bounds on Functionals: If you want a confidence interval for $\int f(x) dx$, you need the joint distribution to account for prediction correlations.
Sequential Prediction: In time series or spatial prediction, prediction errors are correlated across nearby points.
Computational Note: Given the Cholesky factor of $K$, computing the full $m \times m$ posterior covariance costs $\mathcal{O}(n^2 m + n m^2)$ for the triangular solve and matrix product, on top of the one-time $\mathcal{O}(n^3)$ factorization of $K$. For very large test sets, we often compute only the diagonal (pointwise variances), which skips the $\mathcal{O}(nm^2)$ product entirely.
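The diagonal-only shortcut works because $\operatorname{diag}(V^\top V)$ is just the column-wise sum of squares of $V = L^{-1} K_*$, so the $m \times m$ product never needs to be formed. A small sketch with made-up data and an assumed RBF kernel:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

# Assumed RBF kernel (illustrative); k(x, x) = 1 for this parameterization
def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-2, 2, (20, 1))               # made-up training inputs
X_test = np.linspace(-3, 3, 500).reshape(-1, 1)

K = rbf(X_train, X_train) + 1e-8 * np.eye(20)
K_star = rbf(X_train, X_test)
L = cholesky(K, lower=True)
V = solve_triangular(L, K_star, lower=True)          # (n, m) triangular solve

# Diagonal only: O(nm) after the solve, no (m, m) matrix formed
var = 1.0 - np.sum(V**2, axis=0)

# Full covariance, for comparison: forms the (m, m) product
full = rbf(X_test, X_test) - V.T @ V
print("max |diag(full) - var| =", np.max(np.abs(np.diag(full) - var)))
```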
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict_joint(X_train, y_train, X_test, kernel_fn, jitter=1e-8):
    """
    Compute the full joint predictive distribution at test points.

    Returns:
        mu:  Predictive mean (m,)
        cov: Full predictive covariance (m, m)
        var: Predictive variances (diagonal) (m,)
    """
    n = X_train.shape[0]
    m = X_test.shape[0]

    # Kernel matrices
    K = kernel_fn(X_train, X_train) + jitter * np.eye(n)
    K_star = kernel_fn(X_train, X_test)       # (n, m)
    K_star_star = kernel_fn(X_test, X_test)   # (m, m)

    # Cholesky decomposition
    L = cholesky(K, lower=True)

    # Predictive mean: K_*^T @ K^{-1} @ y
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)
    mu = K_star.T @ alpha

    # Predictive covariance: K_** - K_*^T @ K^{-1} @ K_*
    V = solve_triangular(L, K_star, lower=True)   # (n, m)
    cov = K_star_star - V.T @ V

    # Ensure symmetry (for numerical stability)
    cov = 0.5 * (cov + cov.T)
    var = np.diag(cov)

    return mu, cov, var


def sample_posterior(mu, cov, n_samples=5, jitter=1e-8):
    """
    Sample functions from the joint predictive distribution.
    """
    m = len(mu)
    # Add jitter for numerical stability in Cholesky
    L = cholesky(cov + jitter * np.eye(m), lower=True)
    # Sample: f = mu + L @ z, where z ~ N(0, I)
    z = np.random.randn(m, n_samples)
    samples = mu[:, None] + L @ z
    return samples  # (m, n_samples)


# Example: Demonstrate difference between marginal and joint sampling
np.random.seed(42)

# Training data
X_train = np.array([[-2], [0], [3]])
y_train = np.array([1.5, -0.2, 2.0])

# Dense test grid
X_test = np.linspace(-4, 5, 100).reshape(-1, 1)

# RBF kernel
def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# Get joint predictive distribution
mu, cov, var = gp_predict_joint(X_train, y_train, X_test, rbf)

# Sample from joint distribution (smooth functions)
joint_samples = sample_posterior(mu, cov, n_samples=5)

# Compare with independent marginal samples (NOT smooth); clamp tiny negative
# variances that can arise from floating-point error near training points
marginal_samples = mu[:, None] + np.sqrt(np.maximum(var, 0.0))[:, None] * np.random.randn(100, 5)

print("Joint samples shape:", joint_samples.shape)
print("Marginal samples shape:", marginal_samples.shape)
print("Joint samples are smooth --- they're proper function samples")
print("Marginal samples are jagged --- they ignore point correlations")
```

One of the most powerful capabilities of GP regression is the ability to sample entire functions from the posterior distribution. Each sample represents a plausible function that is consistent with the observed data.
The Sampling Procedure
To sample a function at test points $\mathbf{X}_*$: (1) compute the joint predictive mean $\boldsymbol{\mu}_*$ and covariance $\Sigma_*$; (2) compute the Cholesky factor $\mathbf{L}$ with $\Sigma_* = \mathbf{L}\mathbf{L}^\top$ (adding a small jitter to the diagonal for numerical stability); (3) draw $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$; (4) set $\mathbf{f}_* = \boldsymbol{\mu}_* + \mathbf{L}\mathbf{z}$.
This procedure generates a sample from $\mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$ using the reparameterization trick. The Cholesky factor $\mathbf{L}$ transforms independent standard normals into samples with the correct covariance structure.
Function samples are not just for visualization. They enable: (1) Monte Carlo estimation of posterior expectations E[g(f)], (2) Thompson sampling for Bayesian optimization, (3) Uncertainty propagation through nonlinear transformations, (4) Model checking and posterior predictive checks.
Properties of Posterior Samples
Samples from the GP posterior have important properties:
Exact interpolation: In the noise-free case, all samples pass through the training points exactly. This is because the posterior variance at training points is zero.
Smoothness: The smoothness of samples matches the smoothness encoded by the kernel. Samples from an RBF kernel are infinitely differentiable; samples from a Matérn-1/2 kernel are continuous but not differentiable.
Consistency: As more training data is observed, posterior samples concentrate around the true underlying function (if the kernel is well-specified).
Prior reversion: In regions far from training data, samples revert to prior behavior—they become draws from the prior GP.
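The smoothness property above can be checked numerically. This sketch (made-up grid, kernels with illustrative lengthscale 1) draws one prior sample from an RBF kernel and one from a Matérn-1/2 (exponential) kernel using the same random seed, and compares the average absolute increment between neighboring grid points as a crude roughness proxy:

```python
import numpy as np
from scipy.linalg import cholesky

x = np.linspace(0, 5, 400)                      # made-up evaluation grid
d = np.abs(x[:, None] - x[None, :])

K_rbf = np.exp(-0.5 * d**2)                     # RBF kernel, lengthscale 1
K_m12 = np.exp(-d)                              # Matern-1/2 kernel, lengthscale 1

rng = np.random.default_rng(0)
z = rng.standard_normal(400)                    # shared noise for a fair comparison
f_rbf = cholesky(K_rbf + 1e-6 * np.eye(400), lower=True) @ z
f_m12 = cholesky(K_m12 + 1e-6 * np.eye(400), lower=True) @ z

rough_rbf = np.mean(np.abs(np.diff(f_rbf)))
rough_m12 = np.mean(np.abs(np.diff(f_m12)))
print("mean |increment|, RBF:       ", rough_rbf)
print("mean |increment|, Matern-1/2:", rough_m12)
# The Matern-1/2 sample has much larger increments at this resolution:
# its paths are continuous but nowhere differentiable.
```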
Efficient Sampling for Many Points
Sampling at a large number of test points is expensive because the Cholesky decomposition of $\Sigma_*$ costs $\mathcal{O}(m^3)$. Alternatives include pathwise sampling via Matheron's rule (demonstrated below), random-feature approximations of the prior, and sampling on a coarser grid followed by interpolation.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_gp_posterior_matheron(X_train, y_train, X_test, kernel_fn,
                                 n_samples=10, jitter=1e-8):
    """
    Posterior sampling using Matheron's rule (pathwise conditioning).

    Key insight: Instead of sampling from the posterior directly,
    we can sample from the prior and then update using conditioning:

        f* | y = f*_prior + K_*^T K^{-1} (y - f_prior)

    where f_prior ~ GP(0, k) is a sample from the prior evaluated at
    both training and test points. (This naive version still factorizes
    the full (n+m) x (n+m) kernel; the efficiency gains appear when the
    prior sample is replaced by a cheap approximation such as random
    Fourier features.)
    """
    n = X_train.shape[0]
    m = X_test.shape[0]

    # All points (training + test)
    X_all = np.vstack([X_train, X_test])

    # Compute kernel matrices
    K_all = kernel_fn(X_all, X_all) + jitter * np.eye(n + m)
    K = K_all[:n, :n]
    K_star = K_all[:n, n:]  # (n, m)

    # Cholesky of full kernel for prior sampling
    L_all = cholesky(K_all, lower=True)

    # Cholesky of training kernel for conditioning
    L = cholesky(K + jitter * np.eye(n), lower=True)

    # Pre-compute V = K^{-1} @ K_* once, reused for every sample
    V = solve_triangular(L.T, solve_triangular(L, K_star, lower=True), lower=False)

    samples = np.zeros((m, n_samples))
    for i in range(n_samples):
        # Step 1: Sample from the prior at all points
        z = np.random.randn(n + m)
        f_prior_all = L_all @ z
        f_prior_train = f_prior_all[:n]
        f_prior_test = f_prior_all[n:]

        # Step 2: Apply Matheron's update rule
        # f*|y = f*_prior + K_*^T @ K^{-1} @ (y - f_train_prior)
        residual = y_train - f_prior_train
        update = V.T @ residual  # V already contains K^{-1} @ K_*
        samples[:, i] = f_prior_test + update

    return samples


# Demonstrate Matheron's rule on a simple 1D example
np.random.seed(123)

X_train = np.array([[-1.0], [1.0], [2.5]])
y_train = np.array([1.0, -0.5, 0.8])
X_test = np.linspace(-3, 4, 200).reshape(-1, 1)

def rbf_kernel(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

samples_matheron = sample_gp_posterior_matheron(
    X_train, y_train, X_test, rbf_kernel, n_samples=5)

print("Generated", samples_matheron.shape[1], "posterior function samples")
print("Each sample is evaluated at", samples_matheron.shape[0], "test points")

# Verify samples pass through training points (approximately)
for i in range(len(X_train)):
    train_idx = np.argmin(np.abs(X_test.flatten() - X_train[i, 0]))
    sample_at_train = samples_matheron[train_idx, 0]
    print(f"Sample value at x={X_train[i,0]}: {sample_at_train:.3f}, true y: {y_train[i]:.3f}")
```

Since the predictive distribution is Gaussian, we can construct exact prediction intervals. For a target coverage probability $1 - \alpha$, the prediction interval at test point $\mathbf{x}_*$ is:
$$\left[ \mu_* - z_{1-\alpha/2} \sigma_*,\; \mu_* + z_{1-\alpha/2} \sigma_* \right]$$
where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution.
Common Intervals:
| Confidence Level | z-value | Interval Width |
|---|---|---|
| 68% | 1.00 | ±1.0σ |
| 90% | 1.645 | ±1.645σ |
| 95% | 1.96 | ±1.96σ |
| 99% | 2.576 | ±2.576σ |
The intervals above are POINTWISE: at each individual test point, there's a 95% probability the true function value lies within the interval. But across many test points, the SIMULTANEOUS probability that ALL predictions lie within their intervals is lower. For truly simultaneous bounds, you'd need wider intervals (Bonferroni correction) or sample-based methods.
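The gap between pointwise and simultaneous coverage is easy to quantify with the Bonferroni correction, which splits $\alpha$ across the $m$ test points. A small sketch (the mean and standard deviation here are made up; only the z-value logic is the point):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
m = 100                                          # number of test points

z_pointwise = norm.ppf(1 - alpha / 2)            # standard 95% z-value
z_simultaneous = norm.ppf(1 - alpha / (2 * m))   # Bonferroni: alpha split over m points

mu, sigma = 0.3, 0.5                             # illustrative prediction
print(f"pointwise 95% interval:   {mu:.2f} +/- {z_pointwise * sigma:.3f}")
print(f"simultaneous (m={m}):     {mu:.2f} +/- {z_simultaneous * sigma:.3f}")
# The simultaneous interval is noticeably wider, since each of the m
# points is only allowed alpha/m of the total error budget.
```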
Interpretation of Uncertainty Bands
When visualizing GP predictions, we typically show the predictive mean curve surrounded by a shaded band of $\pm 2$ standard deviations (approximately a 95% pointwise interval).
These bands have important properties: they pinch to zero width at noise-free training points, widen smoothly between observations, and saturate at the prior standard deviation far from the data.
What the Bands Mean
The uncertainty bands are epistemic uncertainty bounds—they represent our uncertainty about the true function due to limited data. They are NOT confidence intervals for future noisy observations (those would be wider, as we'll see when we add observation noise).
Crucially, these bounds assume the model is correct: the true function was drawn from a GP with our assumed kernel. Model misspecification can make these bounds unreliable.
So far we've assumed noise-free observations $y_i = f(\mathbf{x}_i)$. In practice, observations are noisy:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
where $\sigma_n^2$ is the observation noise variance. This changes the posterior equations.
Modified Posterior (for the latent function):
The posterior distribution of $f_*$ (the noise-free function value) becomes:
$$f_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}(\bar{\mu}_*, \bar{\sigma}_*^2)$$
where: $$\bar{\mu}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$$ $$\bar{\sigma}_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$
The only change is $K \to K + \sigma_n^2 I$: we add noise variance to the diagonal of the training covariance matrix.
The noise term σₙ²I inflates the variance of observed training outputs. Intuitively, noisy observations are less informative than noise-free ones, so we shouldn't trust them as much. The GP no longer interpolates training points exactly—it smooths through them, trading off data fit against function smoothness.
Predictive Distribution for New Observations
If we want to predict a noisy observation $y_* = f_* + \epsilon_*$ at a new point, the predictive distribution is:
$$y_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}(\bar{\mu}_*, \bar{\sigma}_*^2 + \sigma_n^2)$$
The predictive mean is unchanged, but the variance increases by $\sigma_n^2$ to account for observation noise.
Two Types of Uncertainty:
| Quantity | Variance | Represents |
|---|---|---|
| $f_*$ (latent function) | $\bar{\sigma}_*^2$ | Epistemic uncertainty (reducible with more data) |
| $y_*$ (noisy observation) | $\bar{\sigma}_*^2 + \sigma_n^2$ | Epistemic + Aleatoric uncertainty |
The noise variance $\sigma_n^2$ is aleatoric (irreducible)—even with infinite data, future observations will still be noisy. The function uncertainty $\bar{\sigma}_*^2$ is epistemic (reducible)—more data will shrink it.
Regularization Interpretation
The noise term also provides numerical regularization. The matrix $K + \sigma_n^2 I$ is guaranteed to be positive definite even when $K$ itself is nearly singular (due to duplicate or very close training points). This makes the noisy GP more robust computationally.
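This robustness is easy to demonstrate. In the sketch below (made-up data, assumed RBF kernel), a duplicated training input makes $K$ exactly singular, so its Cholesky factorization fails, while $K + \sigma_n^2 I$ factorizes without trouble:

```python
import numpy as np
from scipy.linalg import cholesky

# Assumed RBF kernel (illustrative)
def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

X = np.array([[0.0], [0.0], [1.0]])   # duplicate point makes K rank-deficient
K = rbf(X, X)

try:
    cholesky(K, lower=True)
    print("Cholesky of K succeeded (unexpected)")
except np.linalg.LinAlgError:
    print("Cholesky of K failed: K is singular")

# Adding sigma_n^2 = 0.01 to the diagonal restores positive definiteness
L = cholesky(K + 0.01 * np.eye(3), lower=True)
print("Cholesky of K + sigma_n^2 I succeeded")
```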
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict_noisy(X_train, y_train, X_test, kernel_fn, noise_var=0.1, jitter=1e-8):
    """
    GP prediction with observation noise.

    Parameters:
        noise_var: Observation noise variance σ_n²

    Returns:
        mu_f:  Predictive mean for latent function f*
        var_f: Predictive variance for latent function f*
        var_y: Predictive variance for noisy observation y*
    """
    n = X_train.shape[0]

    # Kernel matrices
    K = kernel_fn(X_train, X_train)
    K_star = kernel_fn(X_train, X_test)
    K_star_star = kernel_fn(X_test, X_test)

    # Add noise to training covariance
    K_noisy = K + noise_var * np.eye(n) + jitter * np.eye(n)

    # Cholesky decomposition of noisy covariance
    L = cholesky(K_noisy, lower=True)

    # Predictive mean
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)
    mu_f = K_star.T @ alpha

    # Predictive variance for f*
    V = solve_triangular(L, K_star, lower=True)
    var_f = np.diag(K_star_star) - np.sum(V**2, axis=0)

    # Predictive variance for y* = f* + noise
    var_y = var_f + noise_var

    return mu_f, var_f, var_y


# Example: Compare noise-free vs noisy GP
np.random.seed(42)

# Noisy training data
X_train = np.linspace(-2, 2, 8).reshape(-1, 1)
f_true = np.sin(2 * X_train.flatten())           # True function
y_train = f_true + 0.3 * np.random.randn(8)      # Noisy observations

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)

def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# Noisy predictions
mu, var_f, var_y = gp_predict_noisy(X_train, y_train, X_test, rbf, noise_var=0.09)

print("Latent function f* uncertainty (epistemic):")
print(f"  Min variance: {var_f.min():.4f}")
print(f"  Max variance: {var_f.max():.4f}")

print("Noisy observation y* uncertainty (epistemic + aleatoric):")
print(f"  Min variance: {var_y.min():.4f}")
print(f"  Max variance: {var_y.max():.4f}")

print("Note: var_y = var_f + noise_var, so var_y is always at least 0.09")
```

GP predictions satisfy important consistency properties that distinguish principled probabilistic models from ad-hoc uncertainty estimates.
Marginalization Consistency
If we compute the joint predictive distribution over multiple test points and then marginalize, we get the same result as computing marginal predictions directly:
$$p(f_*^{(1)} | \mathcal{D}) = \int p(f_*^{(1)}, f_*^{(2)} | \mathcal{D}) \, df_*^{(2)}$$
This might seem obvious, but many uncertainty estimation methods fail this test. For example, neural network ensembles often produce inconsistent marginal and joint uncertainties.
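Marginalization consistency can be verified numerically: the marginal at $\mathbf{x}_*^{(1)}$ extracted from a two-point joint prediction matches the direct single-point prediction. The data and RBF kernel below are illustrative:

```python
import numpy as np

# Assumed RBF kernel (illustrative)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0])          # made-up training data
y = np.array([0.5, 1.0, -0.3])
K = k(X[:, None], X[None, :]) + 1e-10 * np.eye(3)
Kinv_y = np.linalg.solve(K, y)

# Joint prediction at two test points
Xs = np.array([0.5, 1.5])
Ks = k(X[:, None], Xs[None, :])                  # (3, 2) cross-covariance
mu_joint = Ks.T @ Kinv_y
Sigma = k(Xs[:, None], Xs[None, :]) - Ks.T @ np.linalg.solve(K, Ks)

# Direct single-point prediction at x^(1) = 0.5
k1 = k(X, 0.5)
mu_1 = k1 @ Kinv_y
var_1 = k(0.5, 0.5) - k1 @ np.linalg.solve(K, k1)

# Marginalizing a Gaussian = reading off the corresponding mean/variance entry
print(np.isclose(mu_joint[0], mu_1), np.isclose(Sigma[0, 0], var_1))
```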
Conditioning Consistency
If we condition on a subset of test points as if they were additional training data, the remaining predictions update correctly:
$$p(f_*^{(2)} | \mathcal{D}, f_*^{(1)}) = \frac{p(f_*^{(1)}, f_*^{(2)} | \mathcal{D})}{p(f_*^{(1)} | \mathcal{D})}$$
This property is essential for sequential decision-making and active learning.
These consistency properties ensure the GP predictions are 'coherent' in the Bayesian sense—you cannot construct a Dutch book (a set of bets that guarantees loss) against a GP-based decision maker. This rationality comes from GPs being proper probabilistic models, not just algorithms that output numbers.
Order Invariance
The GP posterior is invariant to the order in which training points are processed. Adding points one at a time (online) versus all at once (batch) yields identical final posteriors.
This property follows from Bayes' rule: $p(f | D_1, D_2) = p(f | D_1 \cup D_2)$, regardless of whether we condition on $D_1$ first then $D_2$, or vice versa.
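A quick numerical check of this invariance (using permutation of the dataset as a proxy for different processing orders; the data and RBF kernel are made up):

```python
import numpy as np

# Assumed RBF kernel (illustrative)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0, 0.7])     # made-up training data
y = np.array([0.5, 1.0, -0.3, 0.2])

def predict_mean(X, y, x_star):
    K = k(X[:, None], X[None, :]) + 1e-10 * np.eye(len(X))
    return k(X, x_star) @ np.linalg.solve(K, y)

perm = [2, 0, 3, 1]                     # same points, different order
mu_a = predict_mean(X, y, 0.5)
mu_b = predict_mean(X[perm], y[perm], 0.5)
print(np.isclose(mu_a, mu_b))           # the posterior does not depend on order
```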
Implications for Applications
These consistency properties make GPs ideal for sequential decision-making, active learning, Bayesian optimization, and online updating—settings where predictions must remain coherent as new information arrives.
Non-probabilistic methods (random forests, gradient boosting, etc.) can produce uncertainty estimates, but these estimates typically lack formal consistency guarantees.
Understanding the computational cost of GP prediction is essential for practical applications.
Training Phase (one-time cost): the Cholesky factorization of $K$ costs $\mathcal{O}(n^3)$ time and $\mathcal{O}(n^2)$ memory, and solving for $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$ costs $\mathcal{O}(n^2)$.
Prediction Phase (per test point): the mean $\mathbf{k}_*^\top \boldsymbol{\alpha}$ costs $\mathcal{O}(n)$; the variance requires a triangular solve costing $\mathcal{O}(n^2)$.
Multiple Test Points ($m$ points): the costs scale as summarized in the table below.
| Quantity | Time Complexity | Memory | Practical Limit |
|---|---|---|---|
| Training (Cholesky) | O(n³) | O(n²) | n ≈ 10,000-50,000 |
| Predict mean (m points) | O(nm) | O(nm) | m: millions feasible |
| Predict variance (m points) | O(n²m) | O(nm) | Same as mean in memory |
| Full covariance (m points) | O(n²m + nm²) | O(m²) | m ≈ 10,000 |
| Sample functions (m points) | O(m³) + sampling | O(m²) | m ≈ 10,000 |
The O(n³) training cost and O(n²) memory requirement are the primary limitations of exact GP regression. For n > 10,000-50,000 (depending on hardware), you need sparse/approximate methods. Prediction is cheaper—O(n) per point for the mean and O(n²) per point for the variance—but still requires storing the Cholesky factor L.
We have developed the complete theory of GP predictive distributions. Let us consolidate the key results: (1) the predictive distribution at any test point is Gaussian, with mean $\mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{y}$ and variance $k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{k}_*$; (2) joint predictions over multiple test points carry a full covariance matrix whose off-diagonal structure enables coherent function sampling; (3) observation noise enters simply as $K \to K + \sigma_n^2 I$ and splits predictive uncertainty into epistemic and aleatoric parts; (4) GP predictions satisfy marginalization, conditioning, and order-invariance consistency.
What's Next
We have derived the predictive distribution for both noise-free and noisy cases. In the next page, we dive deeper into mean and variance prediction—the practical aspects of computing point predictions and uncertainty estimates. We will examine how the kernel choice affects predictions, explore edge cases, and develop intuition for interpreting GP outputs in real applications.
You now understand the complete structure of GP predictive distributions. The ability to make probabilistic predictions with principled uncertainty quantification is what distinguishes GPs from most other regression methods—and forms the foundation for their use in Bayesian optimization, active learning, and uncertainty-critical applications.