In the previous page, we derived the Gaussian Process posterior—the distribution over functions conditioned on observed data. But knowing the posterior is only the first step. The ultimate goal of any regression model is prediction: given a new input $\mathbf{x}_*$, what is the corresponding output?
For point-estimate regression methods (linear regression, neural networks, etc.), prediction means computing a single number $\hat{y}_*$. But Gaussian Processes offer something far richer: a predictive distribution that fully characterizes our uncertainty about $y_*$. This distribution tells us not just what to expect, but how confident we should be—and when we should trust our predictions.
The predictive distribution is what makes GPs uniquely valuable for applications where uncertainty matters: Bayesian optimization, active learning, model-based reinforcement learning, and safety-critical systems. In this page, we develop the complete theory of GP prediction.
By the end of this page, you will: (1) Derive the predictive distribution at a single test point, (2) Understand the joint predictive distribution over multiple test points, (3) Distinguish between marginal and joint predictions, (4) Sample functions from the predictive distribution, (5) Interpret prediction intervals correctly.
Consider a single test input $\mathbf{x}_*$. We want the predictive distribution $p(f_* | \mathbf{x}_*, \mathbf{X}, \mathbf{y})$—the probability distribution over function values at this new point, conditioned on our training data.
From the posterior derivation, this distribution is Gaussian:
$$f_* | \mathbf{x}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\mu_*, \sigma_*^2)$$
where (assuming zero prior mean):
$$\mu_* = \mathbf{k}_*^\top K^{-1} \mathbf{y}$$
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*$$
Here $\mathbf{k}_* = [k(\mathbf{x}_*, \mathbf{x}_1), \ldots, k(\mathbf{x}_*, \mathbf{x}_n)]^\top$ is the vector of covariances between the test point and all training points, and $K = K(\mathbf{X}, \mathbf{X})$ is the $n \times n$ training covariance matrix.
For prediction at a single point, you need exactly two numbers: the predictive mean μ* (your best guess) and the predictive variance σ*² (your uncertainty). Together, these fully characterize the Gaussian predictive distribution.
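The two formulas above can be computed in a few lines. Below is a minimal sketch with made-up data and an assumed RBF kernel (the function `k`, the inputs, and the hyperparameters are illustrative, not from the text):

```python
import numpy as np

# Assumed RBF kernel with lengthscale l (illustrative choice)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0])      # training inputs (made up)
y = np.array([0.5, 1.0, -0.3])      # training outputs (noise-free)

K = k(X[:, None], X[None, :])       # (3, 3) training covariance
x_star = 0.5                        # single test input
k_star = k(X, x_star)               # covariances with the test point

alpha = np.linalg.solve(K, y)
mu_star = k_star @ alpha            # predictive mean  k_*^T K^{-1} y
var_star = k(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)  # predictive variance

print(f"mu* = {mu_star:.3f}, sigma*^2 = {var_star:.4f}")
```

These two numbers are exactly the "best guess" and "uncertainty" described above; everything later in the page builds on this pair.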
Decomposing the Predictive Mean
The predictive mean has a beautiful interpretation as a weighted average of training outputs:
$$\mu_* = \mathbf{k}_*^\top K^{-1} \mathbf{y} = \sum_{i=1}^{n} w_i y_i$$
where the weights $\mathbf{w} = K^{-1} \mathbf{k}_*$ depend on the kernel, its hyperparameters, and the geometry of the training inputs relative to $\mathbf{x}_*$.
The weights $w_i$ can be positive or negative, and they sum to a value that depends on the kernel and point configuration. For the RBF kernel, if $\mathbf{x}_*$ is far from all training points, the weights become negligible and $\mu_* \to 0$ (the prior mean).
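This decay of the weights can be checked numerically. The sketch below uses made-up data and an assumed RBF kernel, and prints the total weight magnitude as the test point moves away from the training set:

```python
import numpy as np

# Assumed RBF kernel (illustrative)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0])          # made-up training inputs
K = k(X[:, None], X[None, :])

totals = {}
for x_star in [0.5, 5.0, 10.0]:
    w = np.linalg.solve(K, k(X, x_star))    # weights w = K^{-1} k_*
    totals[x_star] = np.abs(w).sum()
    print(f"x* = {x_star:4.1f}  ->  sum |w_i| = {totals[x_star]:.6f}")
# The total weight magnitude decays rapidly once x* is a few lengthscales
# from the data, which is why mu* reverts to the prior mean of 0.
```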
Decomposing the Predictive Variance
The predictive variance reveals how uncertainty decreases with proximity to training data:
$$\sigma_*^2 = \underbrace{k(\mathbf{x}_*, \mathbf{x}_*)}_{\text{prior variance}} - \underbrace{\mathbf{k}_*^\top K^{-1} \mathbf{k}_*}_{\text{variance reduction}}$$
| Scenario | Predictive Mean μ* | Predictive Variance σ*² | Interpretation |
|---|---|---|---|
| x* = training point xᵢ | Exactly yᵢ | 0 | Perfect interpolation |
| x* very close to xᵢ | ≈ yᵢ | ≈ 0 (very small) | High-confidence prediction |
| x* between training points | Weighted interpolation | Moderate | Reasonable uncertainty |
| x* far from all training | → 0 (prior mean) | → k(x*,x*) (prior) | Maximum uncertainty |
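The first row of the table (exact interpolation in the noise-free case) can be verified directly. This check uses made-up data and an assumed RBF kernel:

```python
import numpy as np

# Assumed RBF kernel (illustrative)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0])      # made-up training inputs
y = np.array([0.5, 1.0, -0.3])
K = k(X[:, None], X[None, :])

x_star = X[1]                       # reuse a training input as the test point
k_star = k(X, x_star)               # this is just a column of K
mu = k_star @ np.linalg.solve(K, y)
var = k(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)

print(f"mu* = {mu:.6f} (y_1 = {y[1]}), sigma*^2 = {var:.2e}")
# mu* matches y_1 and the variance is zero up to floating-point error,
# because K^{-1} k_* is the unit vector e_1 when k_* is a column of K.
```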
Often we want to make predictions at multiple test points simultaneously. The joint predictive distribution captures not just individual prediction uncertainties, but correlations between predictions at different test points.
For test inputs $\mathbf{X}_* = [\mathbf{x}_*^{(1)}, \ldots, \mathbf{x}_*^{(m)}]$, the joint predictive distribution is:
$$\mathbf{f}_* | \mathbf{X}_*, \mathbf{X}, \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$$
where:
$$\boldsymbol{\mu}_* = K_*^\top K^{-1} \mathbf{y} \in \mathbb{R}^m$$
$$\Sigma_* = K_{**} - K_*^\top K^{-1} K_* \in \mathbb{R}^{m \times m}$$
Here $K_* = K(\mathbf{X}, \mathbf{X}_*)$ is the $n \times m$ cross-covariance matrix, and $K_{**} = K(\mathbf{X}_*, \mathbf{X}_*)$ is the $m \times m$ test-point covariance matrix.
The diagonal of Σ* gives marginal variances at each test point. But the off-diagonal entries capture prediction correlations. Two nearby test points will have correlated predictions—if you're wrong at one, you're likely wrong at the other in the same direction. This joint structure matters for sampling coherent functions and for sequential decision-making.
Why Joint Predictions Matter
The joint predictive distribution is essential for several tasks:
Function Sampling: To draw a smooth function sample from the posterior, you must sample from the joint distribution, not independent marginals. Independent samples from marginals would yield jagged, discontinuous 'functions'.
Optimization: In Bayesian optimization, the joint distribution over multiple potential next points enables batch acquisition.
Uncertainty Bounds on Functionals: If you want a confidence interval for $\int f(x) dx$, you need the joint distribution to account for prediction correlations.
Sequential Prediction: In time series or spatial prediction, prediction errors are correlated across nearby points.
Computational Note: Given the Cholesky factor of $K$, computing the full $m \times m$ posterior covariance costs $\mathcal{O}(n^2 m + n m^2)$ for the triangular solve and matrix product, on top of the one-time $\mathcal{O}(n^3)$ factorization of $K$. For very large test sets, we often compute only the diagonal (pointwise variances), which skips the $\mathcal{O}(nm^2)$ product entirely.
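The diagonal-only shortcut works because $\operatorname{diag}(V^\top V)$ is just the column-wise sum of squares of $V = L^{-1} K_*$, so the $m \times m$ product never needs to be formed. A small sketch with made-up data and an assumed RBF kernel:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

# Assumed RBF kernel (illustrative); k(x, x) = 1 for this parameterization
def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-2, 2, (20, 1))               # made-up training inputs
X_test = np.linspace(-3, 3, 500).reshape(-1, 1)

K = rbf(X_train, X_train) + 1e-8 * np.eye(20)
K_star = rbf(X_train, X_test)
L = cholesky(K, lower=True)
V = solve_triangular(L, K_star, lower=True)          # (n, m) triangular solve

# Diagonal only: O(nm) after the solve, no (m, m) matrix formed
var = 1.0 - np.sum(V**2, axis=0)

# Full covariance, for comparison: forms the (m, m) product
full = rbf(X_test, X_test) - V.T @ V
print("max |diag(full) - var| =", np.max(np.abs(np.diag(full) - var)))
```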
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict_joint(X_train, y_train, X_test, kernel_fn, jitter=1e-8):
    """
    Compute the full joint predictive distribution at test points.

    Returns:
        mu:  Predictive mean (m,)
        cov: Full predictive covariance (m, m)
        var: Predictive variances (diagonal) (m,)
    """
    n = X_train.shape[0]
    m = X_test.shape[0]

    # Kernel matrices
    K = kernel_fn(X_train, X_train) + jitter * np.eye(n)
    K_star = kernel_fn(X_train, X_test)       # (n, m)
    K_star_star = kernel_fn(X_test, X_test)   # (m, m)

    # Cholesky decomposition
    L = cholesky(K, lower=True)

    # Predictive mean: K_*^T @ K^{-1} @ y
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)
    mu = K_star.T @ alpha

    # Predictive covariance: K_** - K_*^T @ K^{-1} @ K_*
    V = solve_triangular(L, K_star, lower=True)   # (n, m)
    cov = K_star_star - V.T @ V

    # Ensure symmetry (for numerical stability)
    cov = 0.5 * (cov + cov.T)
    var = np.diag(cov)

    return mu, cov, var


def sample_posterior(mu, cov, n_samples=5, jitter=1e-8):
    """
    Sample functions from the joint predictive distribution.
    """
    m = len(mu)
    # Add jitter for numerical stability in Cholesky
    L = cholesky(cov + jitter * np.eye(m), lower=True)
    # Sample: f = mu + L @ z, where z ~ N(0, I)
    z = np.random.randn(m, n_samples)
    samples = mu[:, None] + L @ z
    return samples  # (m, n_samples)


# Example: Demonstrate difference between marginal and joint sampling
np.random.seed(42)

# Training data
X_train = np.array([[-2], [0], [3]])
y_train = np.array([1.5, -0.2, 2.0])

# Dense test grid
X_test = np.linspace(-4, 5, 100).reshape(-1, 1)

# RBF kernel
def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# Get joint predictive distribution
mu, cov, var = gp_predict_joint(X_train, y_train, X_test, rbf)

# Sample from joint distribution (smooth functions)
joint_samples = sample_posterior(mu, cov, n_samples=5)

# Compare with independent marginal samples (NOT smooth); clamp tiny negative
# variances that can arise from floating-point error near training points
marginal_samples = mu[:, None] + np.sqrt(np.maximum(var, 0.0))[:, None] * np.random.randn(100, 5)

print("Joint samples shape:", joint_samples.shape)
print("Marginal samples shape:", marginal_samples.shape)
print("Joint samples are smooth --- they're proper function samples")
print("Marginal samples are jagged --- they ignore point correlations")
```

One of the most powerful capabilities of GP regression is the ability to sample entire functions from the posterior distribution. Each sample represents a plausible function that is consistent with the observed data.
The Sampling Procedure
To sample a function at test points $\mathbf{X}_*$: (1) compute the joint predictive mean $\boldsymbol{\mu}_*$ and covariance $\Sigma_*$; (2) compute the Cholesky factor $\mathbf{L}$ with $\Sigma_* = \mathbf{L}\mathbf{L}^\top$ (adding a small jitter to the diagonal for numerical stability); (3) draw $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$; (4) set $\mathbf{f}_* = \boldsymbol{\mu}_* + \mathbf{L}\mathbf{z}$.
This procedure generates a sample from $\mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$ using the reparameterization trick. The Cholesky factor $\mathbf{L}$ transforms independent standard normals into samples with the correct covariance structure.
Function samples are not just for visualization. They enable: (1) Monte Carlo estimation of posterior expectations E[g(f)], (2) Thompson sampling for Bayesian optimization, (3) Uncertainty propagation through nonlinear transformations, (4) Model checking and posterior predictive checks.
Properties of Posterior Samples
Samples from the GP posterior have important properties:
Exact interpolation: In the noise-free case, all samples pass through the training points exactly. This is because the posterior variance at training points is zero.
Smoothness: The smoothness of samples matches the smoothness encoded by the kernel. Samples from an RBF kernel are infinitely differentiable; samples from a Matérn-1/2 kernel are continuous but not differentiable.
Consistency: As more training data is observed, posterior samples concentrate around the true underlying function (if the kernel is well-specified).
Prior reversion: In regions far from training data, samples revert to prior behavior—they become draws from the prior GP.
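The smoothness property above can be checked numerically. This sketch (made-up grid, kernels with illustrative lengthscale 1) draws one prior sample from an RBF kernel and one from a Matérn-1/2 (exponential) kernel using the same random seed, and compares the average absolute increment between neighboring grid points as a crude roughness proxy:

```python
import numpy as np
from scipy.linalg import cholesky

x = np.linspace(0, 5, 400)                      # made-up evaluation grid
d = np.abs(x[:, None] - x[None, :])

K_rbf = np.exp(-0.5 * d**2)                     # RBF kernel, lengthscale 1
K_m12 = np.exp(-d)                              # Matern-1/2 kernel, lengthscale 1

rng = np.random.default_rng(0)
z = rng.standard_normal(400)                    # shared noise for a fair comparison
f_rbf = cholesky(K_rbf + 1e-6 * np.eye(400), lower=True) @ z
f_m12 = cholesky(K_m12 + 1e-6 * np.eye(400), lower=True) @ z

rough_rbf = np.mean(np.abs(np.diff(f_rbf)))
rough_m12 = np.mean(np.abs(np.diff(f_m12)))
print("mean |increment|, RBF:       ", rough_rbf)
print("mean |increment|, Matern-1/2:", rough_m12)
# The Matern-1/2 sample has much larger increments at this resolution:
# its paths are continuous but nowhere differentiable.
```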
Efficient Sampling for Many Points
Sampling at a large number of test points is expensive because the Cholesky decomposition of $\Sigma_*$ costs $\mathcal{O}(m^3)$. Alternatives include pathwise sampling via Matheron's rule (demonstrated below), random-feature approximations of the prior, and sampling on a coarser grid followed by interpolation.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_gp_posterior_matheron(X_train, y_train, X_test, kernel_fn,
                                 n_samples=10, jitter=1e-8):
    """
    Posterior sampling using Matheron's rule (pathwise conditioning).

    Key insight: Instead of sampling from the posterior directly,
    we can sample from the prior and then update using conditioning:

        f* | y = f*_prior + K_*^T K^{-1} (y - f_prior)

    where f_prior ~ GP(0, k) is a sample from the prior evaluated at
    both training and test points. (This naive version still factorizes
    the full (n+m) x (n+m) kernel; the efficiency gains appear when the
    prior sample is replaced by a cheap approximation such as random
    Fourier features.)
    """
    n = X_train.shape[0]
    m = X_test.shape[0]

    # All points (training + test)
    X_all = np.vstack([X_train, X_test])

    # Compute kernel matrices
    K_all = kernel_fn(X_all, X_all) + jitter * np.eye(n + m)
    K = K_all[:n, :n]
    K_star = K_all[:n, n:]  # (n, m)

    # Cholesky of full kernel for prior sampling
    L_all = cholesky(K_all, lower=True)

    # Cholesky of training kernel for conditioning
    L = cholesky(K + jitter * np.eye(n), lower=True)

    # Pre-compute V = K^{-1} @ K_* once, reused for every sample
    V = solve_triangular(L.T, solve_triangular(L, K_star, lower=True), lower=False)

    samples = np.zeros((m, n_samples))
    for i in range(n_samples):
        # Step 1: Sample from the prior at all points
        z = np.random.randn(n + m)
        f_prior_all = L_all @ z
        f_prior_train = f_prior_all[:n]
        f_prior_test = f_prior_all[n:]

        # Step 2: Apply Matheron's update rule
        # f*|y = f*_prior + K_*^T @ K^{-1} @ (y - f_train_prior)
        residual = y_train - f_prior_train
        update = V.T @ residual  # V already contains K^{-1} @ K_*
        samples[:, i] = f_prior_test + update

    return samples


# Demonstrate Matheron's rule on a simple 1D example
np.random.seed(123)

X_train = np.array([[-1.0], [1.0], [2.5]])
y_train = np.array([1.0, -0.5, 0.8])
X_test = np.linspace(-3, 4, 200).reshape(-1, 1)

def rbf_kernel(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

samples_matheron = sample_gp_posterior_matheron(
    X_train, y_train, X_test, rbf_kernel, n_samples=5)

print("Generated", samples_matheron.shape[1], "posterior function samples")
print("Each sample is evaluated at", samples_matheron.shape[0], "test points")

# Verify samples pass through training points (approximately)
for i in range(len(X_train)):
    train_idx = np.argmin(np.abs(X_test.flatten() - X_train[i, 0]))
    sample_at_train = samples_matheron[train_idx, 0]
    print(f"Sample value at x={X_train[i,0]}: {sample_at_train:.3f}, true y: {y_train[i]:.3f}")
```

Since the predictive distribution is Gaussian, we can construct exact prediction intervals. For a target coverage probability $1 - \alpha$, the prediction interval at test point $\mathbf{x}_*$ is:
$$\left[ \mu_* - z_{1-\alpha/2} \sigma_*,\; \mu_* + z_{1-\alpha/2} \sigma_* \right]$$
where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution.
Common Intervals:
| Confidence Level | z-value | Interval Width |
|---|---|---|
| 68% | 1.00 | ±1.0σ |
| 90% | 1.645 | ±1.645σ |
| 95% | 1.96 | ±1.96σ |
| 99% | 2.576 | ±2.576σ |
The intervals above are POINTWISE: at each individual test point, there's a 95% probability the true function value lies within the interval. But across many test points, the SIMULTANEOUS probability that ALL predictions lie within their intervals is lower. For truly simultaneous bounds, you'd need wider intervals (Bonferroni correction) or sample-based methods.
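The gap between pointwise and simultaneous coverage is easy to quantify with the Bonferroni correction, which splits $\alpha$ across the $m$ test points. A small sketch (the mean and standard deviation here are made up; only the z-value logic is the point):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
m = 100                                          # number of test points

z_pointwise = norm.ppf(1 - alpha / 2)            # standard 95% z-value
z_simultaneous = norm.ppf(1 - alpha / (2 * m))   # Bonferroni: alpha split over m points

mu, sigma = 0.3, 0.5                             # illustrative prediction
print(f"pointwise 95% interval:   {mu:.2f} +/- {z_pointwise * sigma:.3f}")
print(f"simultaneous (m={m}):     {mu:.2f} +/- {z_simultaneous * sigma:.3f}")
# The simultaneous interval is noticeably wider, since each of the m
# points is only allowed alpha/m of the total error budget.
```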
Interpretation of Uncertainty Bands
When visualizing GP predictions, we typically show the predictive mean curve surrounded by a shaded band of $\pm 2$ standard deviations (approximately a 95% pointwise interval).
These bands have important properties: they pinch to zero width at noise-free training points, widen smoothly between observations, and saturate at the prior standard deviation far from the data.
What the Bands Mean
The uncertainty bands are epistemic uncertainty bounds—they represent our uncertainty about the true function due to limited data. They are NOT confidence intervals for future noisy observations (those would be wider, as we'll see when we add observation noise).
Crucially, these bounds assume the model is correct: the true function was drawn from a GP with our assumed kernel. Model misspecification can make these bounds unreliable.
So far we've assumed noise-free observations $y_i = f(\mathbf{x}_i)$. In practice, observations are noisy:
$$y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$
where $\sigma_n^2$ is the observation noise variance. This changes the posterior equations.
Modified Posterior (for the latent function):
The posterior distribution of $f_*$ (the noise-free function value) becomes:
$$f_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}(\bar{\mu}_*, \bar{\sigma}_*^2)$$
where: $$\bar{\mu}_* = \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$$ $$\bar{\sigma}_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$$
The only change is $K \to K + \sigma_n^2 I$: we add noise variance to the diagonal of the training covariance matrix.
The noise term σₙ²I inflates the variance of observed training outputs. Intuitively, noisy observations are less informative than noise-free ones, so we shouldn't trust them as much. The GP no longer interpolates training points exactly—it smooths through them, trading off data fit against function smoothness.
Predictive Distribution for New Observations
If we want to predict a noisy observation $y_* = f_* + \epsilon_*$ at a new point, the predictive distribution is:
$$y_* | \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}(\bar{\mu}_*, \bar{\sigma}_*^2 + \sigma_n^2)$$
The predictive mean is unchanged, but the variance increases by $\sigma_n^2$ to account for observation noise.
Two Types of Uncertainty:
| Quantity | Variance | Represents |
|---|---|---|
| $f_*$ (latent function) | $\bar{\sigma}_*^2$ | Epistemic uncertainty (reducible with more data) |
| $y_*$ (noisy observation) | $\bar{\sigma}_*^2 + \sigma_n^2$ | Epistemic + Aleatoric uncertainty |
The noise variance $\sigma_n^2$ is aleatoric (irreducible)—even with infinite data, future observations will still be noisy. The function uncertainty $\bar{\sigma}_*^2$ is epistemic (reducible)—more data will shrink it.
Regularization Interpretation
The noise term also provides numerical regularization. The matrix $K + \sigma_n^2 I$ is guaranteed to be positive definite even when $K$ itself is nearly singular (due to duplicate or very close training points). This makes the noisy GP more robust computationally.
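This robustness is easy to demonstrate. In the sketch below (made-up data, assumed RBF kernel), a duplicated training input makes $K$ exactly singular, so its Cholesky factorization fails, while $K + \sigma_n^2 I$ factorizes without trouble:

```python
import numpy as np
from scipy.linalg import cholesky

# Assumed RBF kernel (illustrative)
def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

X = np.array([[0.0], [0.0], [1.0]])   # duplicate point makes K rank-deficient
K = rbf(X, X)

try:
    cholesky(K, lower=True)
    print("Cholesky of K succeeded (unexpected)")
except np.linalg.LinAlgError:
    print("Cholesky of K failed: K is singular")

# Adding sigma_n^2 = 0.01 to the diagonal restores positive definiteness
L = cholesky(K + 0.01 * np.eye(3), lower=True)
print("Cholesky of K + sigma_n^2 I succeeded")
```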
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_predict_noisy(X_train, y_train, X_test, kernel_fn, noise_var=0.1, jitter=1e-8):
    """
    GP prediction with observation noise.

    Parameters:
        noise_var: Observation noise variance σ_n²

    Returns:
        mu_f:  Predictive mean for latent function f*
        var_f: Predictive variance for latent function f*
        var_y: Predictive variance for noisy observation y*
    """
    n = X_train.shape[0]

    # Kernel matrices
    K = kernel_fn(X_train, X_train)
    K_star = kernel_fn(X_train, X_test)
    K_star_star = kernel_fn(X_test, X_test)

    # Add noise to training covariance
    K_noisy = K + noise_var * np.eye(n) + jitter * np.eye(n)

    # Cholesky decomposition of noisy covariance
    L = cholesky(K_noisy, lower=True)

    # Predictive mean
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)
    mu_f = K_star.T @ alpha

    # Predictive variance for f*
    V = solve_triangular(L, K_star, lower=True)
    var_f = np.diag(K_star_star) - np.sum(V**2, axis=0)

    # Predictive variance for y* = f* + noise
    var_y = var_f + noise_var

    return mu_f, var_f, var_y


# Example: Compare noise-free vs noisy GP
np.random.seed(42)

# Noisy training data
X_train = np.linspace(-2, 2, 8).reshape(-1, 1)
f_true = np.sin(2 * X_train.flatten())           # True function
y_train = f_true + 0.3 * np.random.randn(8)      # Noisy observations

X_test = np.linspace(-3, 3, 100).reshape(-1, 1)

def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# Noisy predictions
mu, var_f, var_y = gp_predict_noisy(X_train, y_train, X_test, rbf, noise_var=0.09)

print("Latent function f* uncertainty (epistemic):")
print(f"  Min variance: {var_f.min():.4f}")
print(f"  Max variance: {var_f.max():.4f}")

print("Noisy observation y* uncertainty (epistemic + aleatoric):")
print(f"  Min variance: {var_y.min():.4f}")
print(f"  Max variance: {var_y.max():.4f}")

print("Note: var_y = var_f + noise_var, so var_y is always at least 0.09")
```

GP predictions satisfy important consistency properties that distinguish principled probabilistic models from ad-hoc uncertainty estimates.
Marginalization Consistency
If we compute the joint predictive distribution over multiple test points and then marginalize, we get the same result as computing marginal predictions directly:
$$p(f_*^{(1)} | \mathcal{D}) = \int p(f_*^{(1)}, f_*^{(2)} | \mathcal{D}) \, df_*^{(2)}$$
This might seem obvious, but many uncertainty estimation methods fail this test. For example, neural network ensembles often produce inconsistent marginal and joint uncertainties.
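Marginalization consistency can be verified numerically: the marginal at $\mathbf{x}_*^{(1)}$ extracted from a two-point joint prediction matches the direct single-point prediction. The data and RBF kernel below are illustrative:

```python
import numpy as np

# Assumed RBF kernel (illustrative)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0])          # made-up training data
y = np.array([0.5, 1.0, -0.3])
K = k(X[:, None], X[None, :]) + 1e-10 * np.eye(3)
Kinv_y = np.linalg.solve(K, y)

# Joint prediction at two test points
Xs = np.array([0.5, 1.5])
Ks = k(X[:, None], Xs[None, :])                  # (3, 2) cross-covariance
mu_joint = Ks.T @ Kinv_y
Sigma = k(Xs[:, None], Xs[None, :]) - Ks.T @ np.linalg.solve(K, Ks)

# Direct single-point prediction at x^(1) = 0.5
k1 = k(X, 0.5)
mu_1 = k1 @ Kinv_y
var_1 = k(0.5, 0.5) - k1 @ np.linalg.solve(K, k1)

# Marginalizing a Gaussian = reading off the corresponding mean/variance entry
print(np.isclose(mu_joint[0], mu_1), np.isclose(Sigma[0, 0], var_1))
```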
Conditioning Consistency
If we condition on a subset of test points as if they were additional training data, the remaining predictions update correctly:
$$p(f_*^{(2)} | \mathcal{D}, f_*^{(1)}) = \frac{p(f_*^{(1)}, f_*^{(2)} | \mathcal{D})}{p(f_*^{(1)} | \mathcal{D})}$$
This property is essential for sequential decision-making and active learning.
These consistency properties ensure the GP predictions are 'coherent' in the Bayesian sense—you cannot construct a Dutch book (a set of bets that guarantees loss) against a GP-based decision maker. This rationality comes from GPs being proper probabilistic models, not just algorithms that output numbers.
Order Invariance
The GP posterior is invariant to the order in which training points are processed. Adding points one at a time (online) versus all at once (batch) yields identical final posteriors.
This property follows from Bayes' rule: $p(f | D_1, D_2) = p(f | D_1 \cup D_2)$, regardless of whether we condition on $D_1$ first then $D_2$, or vice versa.
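A quick numerical check of this invariance (using permutation of the dataset as a proxy for different processing orders; the data and RBF kernel are made up):

```python
import numpy as np

# Assumed RBF kernel (illustrative)
def k(a, b, l=1.0):
    return np.exp(-0.5 * (a - b) ** 2 / l ** 2)

X = np.array([-1.0, 0.0, 2.0, 0.7])     # made-up training data
y = np.array([0.5, 1.0, -0.3, 0.2])

def predict_mean(X, y, x_star):
    K = k(X[:, None], X[None, :]) + 1e-10 * np.eye(len(X))
    return k(X, x_star) @ np.linalg.solve(K, y)

perm = [2, 0, 3, 1]                     # same points, different order
mu_a = predict_mean(X, y, 0.5)
mu_b = predict_mean(X[perm], y[perm], 0.5)
print(np.isclose(mu_a, mu_b))           # the posterior does not depend on order
```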
Implications for Applications
These consistency properties make GPs ideal for sequential decision-making, active learning, Bayesian optimization, and online updating—settings where predictions must remain coherent as new information arrives.
Non-probabilistic methods (random forests, gradient boosting, etc.) can produce uncertainty estimates, but these estimates typically lack formal consistency guarantees.
Understanding the computational cost of GP prediction is essential for practical applications.
Training Phase (one-time cost): the Cholesky factorization of $K$ costs $\mathcal{O}(n^3)$ time and $\mathcal{O}(n^2)$ memory, and solving for $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$ costs $\mathcal{O}(n^2)$.
Prediction Phase (per test point): the mean $\mathbf{k}_*^\top \boldsymbol{\alpha}$ costs $\mathcal{O}(n)$; the variance requires a triangular solve costing $\mathcal{O}(n^2)$.
Multiple Test Points ($m$ points): the costs scale as summarized in the table below.
| Quantity | Time Complexity | Memory | Practical Limit |
|---|---|---|---|
| Training (Cholesky) | O(n³) | O(n²) | n ≈ 10,000-50,000 |
| Predict mean (m points) | O(nm) | O(nm) | m: millions feasible |
| Predict variance (m points) | O(n²m) | O(nm) | Same as mean in memory |
| Full covariance (m points) | O(n²m + nm²) | O(m²) | m ≈ 10,000 |
| Sample functions (m points) | O(m³) + sampling | O(m²) | m ≈ 10,000 |
The O(n³) training cost and O(n²) memory requirement are the primary limitations of exact GP regression. For n > 10,000-50,000 (depending on hardware), you need sparse/approximate methods. Prediction is cheaper—O(n) per point for the mean and O(n²) per point for the variance—but still requires storing the Cholesky factor L.
We have developed the complete theory of GP predictive distributions. Let us consolidate the key results: (1) the predictive distribution at any test point is Gaussian, with mean $\mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{y}$ and variance $k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1}\mathbf{k}_*$; (2) joint predictions over multiple test points carry a full covariance matrix whose off-diagonal structure enables coherent function sampling; (3) observation noise enters simply as $K \to K + \sigma_n^2 I$ and splits predictive uncertainty into epistemic and aleatoric parts; (4) GP predictions satisfy marginalization, conditioning, and order-invariance consistency.
What's Next
We have derived the predictive distribution for both noise-free and noisy cases. In the next page, we dive deeper into mean and variance prediction—the practical aspects of computing point predictions and uncertainty estimates. We will examine how the kernel choice affects predictions, explore edge cases, and develop intuition for interpreting GP outputs in real applications.
You now understand the complete structure of GP predictive distributions. The ability to make probabilistic predictions with principled uncertainty quantification is what distinguishes GPs from most other regression methods—and forms the foundation for their use in Bayesian optimization, active learning, and uncertainty-critical applications.