At the heart of every GP prediction are two numbers: the predictive mean and the predictive variance. These aren't just mathematical abstractions—they encode everything you need to make decisions under uncertainty.
The mean tells you what to expect. The variance tells you how much to trust that expectation.
While the formulas for these quantities are simple, understanding their behavior requires deep intuition. How does the mean interpolate between training points? Why does variance vary across input space? How do kernel parameters affect both? This page develops that intuition through careful analysis and concrete examples.
For practitioners, these quantities aren't academic—they drive real decisions. A Bayesian optimization acquisition function trades off mean (exploitation) against variance (exploration). A safety-critical system uses variance to know when to abstain from automated decisions. A scientific model uses variance to propagate uncertainty through downstream analyses.
By the end of this page, you will: (1) Deeply understand how kernel parameters affect mean and variance predictions, (2) Recognize the spatial structure of GP uncertainty, (3) Interpret mean predictions as kernel-weighted interpolation, (4) Understand edge cases where standard intuitions fail, (5) Apply this knowledge to practical prediction scenarios.
The predictive mean at a test point $\mathbf{x}_*$ is:
$$\mu_* = \mathbf{k}_*^\top K^{-1} \mathbf{y} = \mathbf{k}_*^\top \boldsymbol{\alpha}$$
where $\boldsymbol{\alpha} = K^{-1} \mathbf{y}$ are the 'dual coefficients'. This formula admits multiple interpretations, each illuminating a different aspect of GP prediction.
Interpretation 1: Weighted Sum of Training Outputs
We can write: $$\mu_* = \sum_{i=1}^{n} w_i y_i$$
where the weights $w_i = [K^{-1} \mathbf{k}_*]_i$ depend on both the kernel similarities and the correlation structure among training points. Unlike simple kernel smoothing where weights are just $k(\mathbf{x}_*, \mathbf{x}_i)$, GP weights account for redundancy among training points.
Example: If two training points $\mathbf{x}_1$ and $\mathbf{x}_2$ are very close (highly correlated), their combined influence on $\mu_*$ is roughly what a single point would contribute. The GP doesn't double-count nearby evidence.
The matrix K⁻¹ performs decorrelation. If training points were uncorrelated (K = I), then w = k*, and weights would simply be kernel similarities. When training points are correlated, K⁻¹ adjusts weights to avoid over-counting redundant information.
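A small sketch of this decorrelation (toy two-point dataset; the small noise term is my choice, added for numerical conditioning): duplicating a training observation leaves the GP mean essentially unchanged, because $K^{-1}$ splits the weight between the copies.

```python
import numpy as np
from scipy.linalg import solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

def gp_mean(X, y, x_star, noise=1e-4):
    K = rbf(X, X) + noise * np.eye(len(X))
    k_star = rbf(X, x_star.reshape(1, -1)).ravel()
    return float(k_star @ solve(K, y, assume_a='pos'))

X = np.array([[0.0], [1.0]])
y = np.array([1.0, 0.5])
x_star = np.array([0.5])

mu_base = gp_mean(X, y, x_star)

# Duplicate the observation at x = 0 (same input, same output)
X_dup = np.vstack([X, [[0.0]]])
y_dup = np.append(y, 1.0)
mu_dup = gp_mean(X_dup, y_dup, x_star)

print(mu_base, mu_dup)  # nearly identical: K^{-1} splits weight between the copies
```

A naive smoother using the raw similarities $k(\mathbf{x}_*, \mathbf{x}_i)$ as weights would instead double the duplicated point's contribution.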
Interpretation 2: Representer Theorem
The mean function lies in the reproducing kernel Hilbert space (RKHS) associated with the kernel:
$$\mu_*(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$
This is a sum of kernel functions centered at training points, each weighted by $\alpha_i$. The GP mean is the unique function in this RKHS that minimizes a regularized loss.
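The two views of the mean — as $\mathbf{k}_*^\top \boldsymbol{\alpha}$ and as a sum of kernel basis functions centered at the training points — are algebraically identical, which a few lines can confirm (sketch with an illustrative random dataset):

```python
import numpy as np
from scipy.linalg import solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (6, 1))
y = np.sin(X).ravel()

K = rbf(X, X) + 1e-8 * np.eye(6)
alpha = solve(K, y, assume_a='pos')  # dual coefficients

# Two equivalent evaluations of the posterior mean at x = 0.3
x = np.array([[0.3]])
mu_matrix = float(rbf(X, x).ravel() @ alpha)  # k_*^T alpha
mu_basis = float(sum(alpha[i] * np.exp(-0.5 * (X[i, 0] - 0.3)**2)  # sum_i alpha_i k(x, x_i)
                     for i in range(6)))

print(mu_matrix, mu_basis)  # identical up to rounding
```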
Interpretation 3: Linear Predictor in Feature Space
For kernels with explicit feature maps $\phi$, the mean can be written as:
$$\mu_*(\mathbf{x}) = \boldsymbol{\beta}^\top \phi(\mathbf{x})$$
where $\boldsymbol{\beta}$ is the posterior mean of the weight vector in Bayesian linear regression with features $\phi$. GPs provide a kernelized, infinite-dimensional version of Bayesian linear regression.
Weight Properties
The weights $w_i$ have interesting properties:
| Property | Explanation |
|---|---|
| Can be negative | Nearby opposite-signed $y_i$ values can induce negative weights |
| Don't sum to 1 | Unlike Nadaraya–Watson kernel smoothing, GP weights aren't normalized |
| Depend on all training points | Each weight depends on the full training configuration |
| Decay with distance | Weights for distant training points are typically small |
```python
import numpy as np
from scipy.linalg import solve

def analyze_mean_weights(X_train, y_train, x_test, kernel_fn, jitter=1e-8):
    """
    Analyze the weights used in GP mean prediction.
    """
    n = X_train.shape[0]

    # Compute kernel matrices
    K = kernel_fn(X_train, X_train) + jitter * np.eye(n)
    k_star = kernel_fn(X_train, x_test.reshape(1, -1)).flatten()

    # Dual coefficients alpha = K^{-1} y
    alpha = solve(K, y_train, assume_a='pos')

    # Weights: w = K^{-1} k_* (same as solving K @ w = k_*)
    weights = solve(K, k_star, assume_a='pos')

    # Mean prediction
    mu = np.dot(k_star, alpha)
    mu_from_weights = np.dot(weights, y_train)

    return {
        'alpha': alpha,
        'weights': weights,
        'kernel_similarities': k_star,
        'mu': mu,
        'weight_sum': np.sum(weights),
        'has_negative_weights': np.any(weights < 0),
    }

# Example: Analyze weight distribution
np.random.seed(42)

# Training data
X_train = np.array([[-2], [-1], [0], [1], [2]])
y_train = np.array([1.0, -0.5, 0.2, 1.5, 0.8])

def rbf(X1, X2, l=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq_dist / l**2)

# Analyze at different test points
test_points = [-3.0, -1.5, 0.5, 3.0]

for x_test in test_points:
    result = analyze_mean_weights(X_train, y_train, np.array([x_test]), rbf)
    print(f"Test point x* = {x_test}")
    print(f"  Kernel similarities k(x*, xᵢ): {np.round(result['kernel_similarities'], 3)}")
    print(f"  Prediction weights wᵢ: {np.round(result['weights'], 3)}")
    print(f"  Sum of weights: {result['weight_sum']:.3f}")
    print(f"  Has negative weights: {result['has_negative_weights']}")
    print(f"  Prediction μ*: {result['mu']:.3f}")
```

The predictive variance at test point $\mathbf{x}_*$ is:
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*$$
This expression has a compelling intuitive interpretation:
$$\sigma_*^2 = \underbrace{k(\mathbf{x}_*, \mathbf{x}_*)}_{\text{Prior variance}} - \underbrace{\mathbf{k}_*^\top K^{-1} \mathbf{k}_*}_{\text{Variance reduction from data}}$$
The Variance Reduction Term
The term $\mathbf{k}_*^\top K^{-1} \mathbf{k}_*$ is a quadratic form that measures how much information about $f_*$ is contained in the training data. It is always nonnegative and never exceeds the prior variance $k(\mathbf{x}_*, \mathbf{x}_*)$, so the posterior variance lies between zero and the prior variance.
Notice that σ*² depends only on input locations (through k and K), not on the output values y. This means you can compute prediction uncertainty before collecting any output data—useful for experimental design and active learning.
Geometric Interpretation
The variance reduction term has a geometric interpretation. Define $\mathbf{v} = K^{-1/2} \mathbf{k}_*$, where $K^{1/2}$ is the matrix square root of $K$. Then:
$$\mathbf{k}_*^\top K^{-1} \mathbf{k}_* = \|\mathbf{v}\|^2$$
This is the squared norm of the projection of $\mathbf{k}_*$ onto the space spanned by the training covariances. The variance reduction depends on how well $\mathbf{k}_*$ can be 'explained' by the training point covariance structure.
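A quick numerical check of this identity (a sketch: I use the Cholesky factor $L$ in place of the symmetric square root $K^{1/2}$ — any factorization with $LL^\top = K$ gives the same squared norm):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (5, 1))
K = rbf(X, X) + 1e-8 * np.eye(5)
k_star = rbf(X, np.array([[0.7]])).ravel()

# Quadratic form computed directly ...
quad = float(k_star @ solve(K, k_star, assume_a='pos'))

# ... equals the squared norm of v, where L v = k_* and L L^T = K
L = cholesky(K, lower=True)
v = solve_triangular(L, k_star, lower=True)
print(quad, float(v @ v))  # identical; both below the prior variance k(x*, x*) = 1
```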
Spatial Structure of Variance
The variance function $\sigma_*^2(\mathbf{x})$ has characteristic spatial structure:
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def variance_landscape(X_train, kernel_fn, resolution=100, x_range=(-5, 5)):
    """
    Compute the predictive variance landscape (before observing any y values).
    """
    n = X_train.shape[0]

    # Dense grid of test points
    X_test = np.linspace(x_range[0], x_range[1], resolution).reshape(-1, 1)

    # Kernel matrices
    K = kernel_fn(X_train, X_train) + 1e-8 * np.eye(n)
    K_star = kernel_fn(X_train, X_test)  # (n, resolution)
    k_diag = kernel_fn(X_test, X_test).diagonal()  # Prior variances

    # Efficient variance computation
    L = cholesky(K, lower=True)
    V = solve_triangular(L, K_star, lower=True)  # (n, resolution)

    # Variance reduction at each test point
    var_reduction = np.sum(V**2, axis=0)  # (resolution,)

    # Predictive variance
    var = k_diag - var_reduction

    return X_test.flatten(), var

# Compare variance landscapes for different training configurations
np.random.seed(42)

def rbf(X1, X2, l=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq_dist / l**2)

# Configuration 1: Evenly spaced training points
X_uniform = np.linspace(-3, 3, 5).reshape(-1, 1)

# Configuration 2: Clustered training points
X_clustered = np.array([[-2.5], [-2.0], [0], [2.0], [2.5]])

# Configuration 3: One-sided training points
X_onesided = np.linspace(-3, 0, 5).reshape(-1, 1)

for name, X_train in [('Uniform', X_uniform), ('Clustered', X_clustered), ('One-sided', X_onesided)]:
    x_grid, var = variance_landscape(X_train, rbf)
    print(f"{name} configuration:")
    print(f"  Min variance: {var.min():.6f} (should be ≈0 at training pts)")
    print(f"  Max variance: {var.max():.4f} (prior variance = 1.0)")
    print(f"  Mean variance: {var.mean():.4f}")

    # Find where variance exceeds 90% of prior
    high_unc_regions = x_grid[var > 0.9]
    if len(high_unc_regions) > 0:
        print(f"  High uncertainty (>90% prior) regions: x ∈ [{high_unc_regions.min():.1f}, {high_unc_regions.max():.1f}]")
```

Kernel hyperparameters profoundly influence both mean and variance predictions. Understanding these effects is crucial for practitioners.
RBF Kernel Parameters
The RBF (Squared Exponential) kernel $k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$ has two key parameters: the length scale $\ell$, which sets how far correlations extend, and the signal variance $\sigma_f^2$, which sets the prior amplitude of the function.
Let's analyze their effects systematically.
| Aspect | Small ℓ (e.g., 0.1) | Large ℓ (e.g., 10) |
|---|---|---|
| Correlation range | Only very close points correlated | Distant points still correlated |
| Posterior mean | Wiggly, follows training data closely | Smooth, averages over data |
| Variance between training pts | Rises quickly, high uncertainty gaps | Stays low, interpolation confident |
| Extrapolation behavior | Prediction reverts to prior quickly | Prediction reverts to prior slowly |
| Function sample smoothness | Rough, rapidly varying | Smooth, slowly varying |
Very small ℓ: Training points become uncorrelated, each prediction depends only on the nearest training point (approaching nearest-neighbor interpolation). Very large ℓ: All points become highly correlated, predictions approach a constant (the mean of training outputs).
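These two limits can be checked numerically. The sketch below reuses a small five-point dataset; the noise level 0.5 is my choice, picked so that the large-$\ell$ kernel matrix stays well conditioned:

```python
import numpy as np
from scipy.linalg import solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

def gp_mean(X, y, Xs, l, noise=0.5):
    K = rbf(X, X, l) + noise * np.eye(len(X))
    return rbf(X, Xs, l).T @ solve(K, y, assume_a='pos')

X = np.array([[-2.], [-1.], [0.], [1.], [2.]])
y = np.array([1.0, -0.5, 0.2, 1.5, 0.8])
grid = np.linspace(-2, 2, 81).reshape(-1, 1)

mu_small = gp_mean(X, y, grid, l=0.05)  # tiny length scale: spiky, reverts to 0 between points
mu_large = gp_mean(X, y, grid, l=50.0)  # huge length scale: nearly constant

print(f"small-l range of mean: {np.ptp(mu_small):.3f}")
print(f"large-l range of mean: {np.ptp(mu_large):.4f}, level ≈ {mu_large.mean():.3f}, mean(y) = {y.mean():.3f}")
```

With $\ell = 0.05$ the mean spikes at each training point and collapses to the prior mean in between; with $\ell = 50$ it is nearly flat, close to the average of the outputs.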
| Aspect | Small σ_f² (e.g., 0.1) | Large σ_f² (e.g., 10) |
|---|---|---|
| Prior uncertainty | Low—prior believes functions are small | High—prior allows large function values |
| Posterior variance scale | Variance bounded by σ_f² | Variance can be large far from data |
| Effect on mean | Pulls predictions toward zero | Allows larger deviations from zero |
| Noise interpretation | Relative noise is higher | Relative noise is lower |
Interaction Between Length Scale and Noise
In the noisy observation case, the ratio of observation noise $\sigma_n^2$ to signal variance $\sigma_f^2$ matters more than their absolute values: a high ratio makes the GP smooth through the observations, while a low ratio makes it interpolate them closely.
The length scale $\ell$ also interacts with training point spacing: when the spacing is small relative to $\ell$, interpolation between points is confident; when it is large relative to $\ell$, the GP is effectively extrapolating even between training points.
Multi-Dimensional Length Scales (ARD)
With Automatic Relevance Determination (ARD), each input dimension has its own length scale $\ell_j$:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{j=1}^{d} \frac{(x_j - x'_j)^2}{\ell_j^2}\right)$$
Dimensions with small $\ell_j$ are relevant (function varies rapidly along them). Dimensions with large $\ell_j$ are less relevant (function nearly constant along them).
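A minimal sketch of an ARD kernel (the helper name `ard_rbf` and the specific length scales are illustrative): with $\ell_1 = 0.5$ and $\ell_2 = 50$, moving along dimension 2 barely changes the correlation, while moving the same distance along dimension 1 collapses it.

```python
import numpy as np

def ard_rbf(X1, X2, lengthscales, s=1.0):
    # Scale each dimension by its own length scale before computing distances
    Z1 = X1 / lengthscales
    Z2 = X2 / lengthscales
    sq = np.sum(Z1**2, 1, keepdims=True) - 2*Z1 @ Z2.T + np.sum(Z2**2, 1)
    return s * np.exp(-0.5 * sq)

ls = np.array([0.5, 50.0])      # dim 0 relevant, dim 1 nearly irrelevant
a = np.array([[0.0, 0.0]])
b = np.array([[0.0, 3.0]])      # differs only along the irrelevant dimension
c = np.array([[1.0, 0.0]])      # differs only along the relevant dimension

print(ard_rbf(a, b, ls)[0, 0])  # ≈ 1: still highly correlated
print(ard_rbf(a, c, ls)[0, 0])  # = exp(-2) ≈ 0.135: correlation drops quickly
```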
GP prediction behavior differs fundamentally between interpolation (predicting within the training data range) and extrapolation (predicting outside it).
Interpolation: Between Training Points
In regions surrounded by training data, GP predictions exhibit:
Smooth interpolation: The mean function smoothly connects training points, with smoothness determined by the kernel.
Low variance: Surrounded by data, prediction uncertainty is low. The exact level depends on training point density relative to length scale.
Local influence: Predictions depend primarily on nearby training points, with distant points contributing little.
Behavior at midpoints: At the midpoint between two training points, the mean is approximately (but not exactly) the average of their outputs, modulated by the global training configuration.
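The midpoint claim can be verified exactly in a tiny two-point example: by symmetry both prediction weights equal $e^{-1/2}/(1 + e^{-2}) \approx 0.534$, so the mean at the midpoint is close to, but not exactly, the average of the two outputs. A sketch with unit length scale:

```python
import numpy as np
from scipy.linalg import solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

X = np.array([[-1.0], [1.0]])
y = np.array([1.0, 0.0])

K = rbf(X, X) + 1e-10 * np.eye(2)
k_star = rbf(X, np.array([[0.0]])).ravel()   # the midpoint
mu = float(k_star @ solve(K, y, assume_a='pos'))

# Both weights equal exp(-1/2)/(1 + exp(-2)) ≈ 0.534, so
# mu ≈ 0.534 * (y1 + y2): near the average 0.5, but not equal to it
print(mu)
```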
Extrapolation: Beyond Training Data
Moving outside the training data convex hull:
Reversion to prior mean: The predictive mean approaches the prior mean (typically 0) as we move further from data.
Reversion to prior variance: The predictive variance approaches $k(\mathbf{x}_*, \mathbf{x}_*) = \sigma_f^2$.
Rate of reversion: Determined by length scale. Small $\ell$ causes rapid reversion; large $\ell$ causes gradual reversion.
GPs do NOT extrapolate trends. Unlike polynomial regression that continues its trend forever, a GP with stationary kernel (RBF, Matérn, etc.) always reverts to the prior mean. If your task requires extrapolation, you need: (1) domain knowledge encoded in the mean function, (2) non-stationary kernels, or (3) explicit trend modeling.
Why Extrapolation Fails with Stationary Kernels
Stationary kernels depend only on $\|\mathbf{x} - \mathbf{x}'\|$, not on absolute position. This means the model carries no notion of trend: once a test point is far from every training point, it is essentially uncorrelated with all observations, and the posterior collapses back to the prior.
Quantifying Extrapolation Distance
For the RBF kernel, the correlation $k(\mathbf{x}_*, \mathbf{x}_i)$ between a test point and the nearest training point decays as:
$$k(\mathbf{x}_*, \mathbf{x}_i) = \sigma_f^2 \exp\left(-\frac{d_{\min}^2}{2\ell^2}\right)$$
where $d_{\min} = \min_i |\mathbf{x}_* - \mathbf{x}_i|$.
| Distance $d_{\min}/\ell$ | Correlation $k/\sigma_f^2$ | Interpolation Quality |
|---|---|---|
| 0.0 | 1.0 | At training point |
| 1.0 | 0.606 | Reasonable interpolation |
| 2.0 | 0.135 | Weak influence |
| 3.0 | 0.011 | Negligible influence |
| 4.0 | 0.0003 | Effectively extrapolating |
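The correlation column above is just the decay formula evaluated at round distances; a one-liner reproduces it:

```python
import numpy as np

# Correlation k/sigma_f^2 = exp(-(d_min/l)^2 / 2) at the distances in the table
for ratio in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"d_min/l = {ratio:.1f}: correlation = {np.exp(-0.5 * ratio**2):.4f}")
```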
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def analyze_extrapolation(X_train, y_train, extrapolation_distance, kernel_fn, jitter=1e-8):
    """
    Analyze GP behavior at different extrapolation distances.
    """
    n = X_train.shape[0]

    # Rightmost training point
    x_max = X_train.max()

    # Kernel matrices
    K = kernel_fn(X_train, X_train) + jitter * np.eye(n)
    L = cholesky(K, lower=True)

    # Dual coefficients alpha = K^{-1} y (computed once, reused for every test point)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)

    # Test at various extrapolation distances
    results = []
    for dist in extrapolation_distance:
        x_test = np.array([[x_max + dist]])
        k_star = kernel_fn(X_train, x_test).flatten()
        k_ss = kernel_fn(x_test, x_test)[0, 0]

        # Mean prediction
        mu = np.dot(k_star, alpha)

        # Variance
        v = solve_triangular(L, k_star, lower=True)
        var = k_ss - np.dot(v, v)

        # Max kernel similarity to any training point
        max_corr = k_star.max()

        results.append({
            'distance': dist,
            'mean': mu,
            'std': np.sqrt(var),
            'max_correlation': max_corr,
            'prior_fraction': var / k_ss  # Fraction of prior variance remaining
        })

    return results

# Example analysis
np.random.seed(42)

# Training data: observations of a linear trend
X_train = np.linspace(0, 5, 6).reshape(-1, 1)
y_train = 0.5 * X_train.flatten() + np.random.randn(6) * 0.1

# RBF kernel with length scale 2.0
def rbf(X1, X2, l=2.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

distances = [0.5, 1.0, 2.0, 4.0, 6.0, 10.0]
results = analyze_extrapolation(X_train, y_train, distances, rbf)

print("Extrapolation analysis (training x ∈ [0, 5], ℓ = 2.0)")
print("-" * 65)
print(f"{'Distance':>10} {'Mean':>10} {'Std':>10} {'Max Corr':>12} {'Prior %':>10}")
print("-" * 65)

for r in results:
    print(f"{r['distance']:>10.1f} {r['mean']:>10.3f} {r['std']:>10.3f} "
          f"{r['max_correlation']:>12.4f} {r['prior_fraction']*100:>9.1f}%")

print("Note: As distance increases, mean → 0 (prior), std → 1 (prior variance)")
print("The 'trend' in training data is NOT extrapolated!")
```

Understanding when standard GP intuition fails is as important as understanding when it works. Here we examine edge cases and potential pathologies.
Watch for: (1) Negative predictive variances (numerically impossible, indicates instability), (2) Cholesky decomposition failures, (3) Predictions wildly different from training outputs near training points, (4) Condition number of K exceeding 10¹² or so. Solutions: increase jitter, standardize inputs/outputs, or reconsider hyperparameters.
The Overconfidence Problem
A common pathology is overconfident predictions where the GP reports low variance but predictions are poor. This typically occurs due to:
Model misspecification: The true function doesn't match the kernel's assumptions (e.g., using RBF for a non-smooth function)
Poor hyperparameters: Length scale or noise variance badly wrong (often due to local optima in marginal likelihood)
Insufficient training data: Sparse data combined with small length scale leads to confident but wrong interpolations
The Underconfidence Problem
Conversely, underconfident predictions waste information:
Length scale too small: Nearby points don't inform each other, variance stays high
Noise variance too high: Model doesn't trust observations enough
Diagnostic: Calibration Check
A well-calibrated GP should have ~95% of true values within the ±2σ prediction band. If significantly more or fewer values fall within this band, the model is miscalibrated.
```python
import numpy as np
from scipy.stats import norm

def calibration_check(y_true, mu_pred, std_pred, quantiles=[0.5, 0.9, 0.95, 0.99]):
    """
    Check GP calibration by verifying prediction interval coverage.

    For a well-calibrated GP, X% of true values should fall within
    the X% prediction interval.
    """
    results = {}
    for q in quantiles:
        # Compute prediction interval
        z = norm.ppf(0.5 + q/2)  # e.g., 1.96 for q=0.95
        lower = mu_pred - z * std_pred
        upper = mu_pred + z * std_pred

        # Count coverage
        within = np.mean((y_true >= lower) & (y_true <= upper))

        results[q] = {
            'expected': q,
            'observed': within,
            'ratio': within / q  # <1 overconfident, >1 underconfident
        }
    return results

def normalized_residuals(y_true, mu_pred, std_pred):
    """
    Compute standardized residuals.

    For a well-calibrated GP, these should be approximately standard normal.
    """
    z = (y_true - mu_pred) / std_pred
    return z

# Example with synthetic data
np.random.seed(42)

# True function
def f_true(x):
    return np.sin(2 * x) + 0.1 * x**2

# Training data
X_train = np.random.uniform(-3, 3, 20)
y_train = f_true(X_train) + 0.1 * np.random.randn(20)

# Test data
X_test = np.random.uniform(-3, 3, 200)
y_test = f_true(X_test) + 0.1 * np.random.randn(200)

# Simulated GP predictions (replace with an actual GP in practice)
# For demo purposes, using radial basis function interpolation
from scipy.interpolate import Rbf
rbf_interp = Rbf(X_train, y_train, function='gaussian', epsilon=0.5)
mu_pred = rbf_interp(X_test)
std_pred = np.ones_like(mu_pred) * 0.15  # Simulated uncertainty

# Calibration check
cal = calibration_check(y_test, mu_pred, std_pred)

print("Calibration analysis:")
print("-" * 50)
for q, result in cal.items():
    status = "✓ OK" if 0.8 < result['ratio'] < 1.2 else "⚠ Check"
    print(f"  {int(q*100):2d}% interval: expected {result['expected']:.2f}, "
          f"observed {result['observed']:.2f} ({status})")

# Check normalized residuals
z = normalized_residuals(y_test, mu_pred, std_pred)
print("Normalized residual statistics (should be ≈ N(0,1)):")
print(f"  Mean: {z.mean():.3f} (should be ≈ 0)")
print(f"  Std:  {z.std():.3f} (should be ≈ 1)")
```

Efficient implementation of mean and variance prediction requires careful attention to numerical linear algebra.
Precomputation for Fast Prediction
Many quantities can be precomputed once during training and reused for all predictions: the Cholesky factor $L$ of $K + \sigma_n^2 I$ (an $\mathcal{O}(n^3)$ cost paid once) and the dual coefficients $\boldsymbol{\alpha} = K^{-1} \mathbf{y}$ (two triangular solves, $\mathcal{O}(n^2)$).
With these precomputed, prediction at a single test point costs $\mathcal{O}(n)$ for the mean (a dot product $\mathbf{k}_*^\top \boldsymbol{\alpha}$) and $\mathcal{O}(n^2)$ for the variance (one triangular solve against $L$).
Never explicitly compute K⁻¹. Instead, solve linear systems: K @ x = b is solved as x = solve(K, b). With the Cholesky factor L, this becomes two triangular solves: L @ z = b, then Lᵀ @ x = z. Each is O(n²) but with small constants.
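In SciPy, these two triangular solves are packaged as the `cho_factor`/`cho_solve` pair. A sketch comparing this route against the explicit inverse (the random SPD matrix here is just a stand-in for a kernel matrix):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
K = A @ A.T / 8 + 1e-2 * np.eye(50)  # generic SPD stand-in for a kernel matrix
b = rng.standard_normal(50)

# Recommended: factorize once, then two O(n^2) triangular solves per right-hand side
c, low = cho_factor(K, lower=True)
x_solve = cho_solve((c, low), b)

# Anti-pattern: explicit inverse (slower and less numerically stable)
x_inv = np.linalg.inv(K) @ b

print(np.max(np.abs(x_solve - x_inv)))  # agree here, but solve is the robust route
```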
Variance Computation Variants
There are several ways to compute predictive variance:
Method 1: Direct computation $$\sigma_*^2 = k_{**} - \mathbf{k}_*^\top K^{-1} \mathbf{k}_* = k_{**} - \mathbf{v}^\top \mathbf{v}$$ where $L \mathbf{v} = \mathbf{k}_*$.
Method 2: Cholesky of the posterior covariance. If you need correlated samples at a set of test points, compute the posterior covariance $\Sigma_*$ once, factor it as $\Sigma_* = L_* L_*^\top$, then sample as $\boldsymbol{\mu}_* + L_* \mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$.
Method 3: Incremental update When adding one training point at a time, the posterior can be updated in $\mathcal{O}(n^2)$ using rank-1 Cholesky updates.
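A sketch of Method 3 for the case of adding one training point: the Cholesky factor grows by one row, at the cost of a single $\mathcal{O}(n^2)$ triangular solve. The helper `chol_append` is illustrative, not a library routine:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

def chol_append(L, k_new, k_nn):
    """Extend the Cholesky factor of K to that of [[K, k], [k^T, k_nn]] in O(n^2)."""
    b = solve_triangular(L, k_new, lower=True)  # the only O(n^2) step
    d = np.sqrt(k_nn - b @ b)                   # positive: it's the posterior variance
    n = L.shape[0]
    L_new = np.zeros((n + 1, n + 1))
    L_new[:n, :n] = L
    L_new[n, :n] = b
    L_new[n, n] = d
    return L_new

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (6, 1))
jitter = 1e-8
L = cholesky(rbf(X, X) + jitter * np.eye(6), lower=True)

# Add one training point without refactorizing from scratch
x_new = np.array([[0.3]])
k_new = rbf(X, x_new).ravel()
L_inc = chol_append(L, k_new, rbf(x_new, x_new)[0, 0] + jitter)

# Agrees with a full O(n^3) refactorization
X_all = np.vstack([X, x_new])
L_full = cholesky(rbf(X_all, X_all) + jitter * np.eye(7), lower=True)
print(np.max(np.abs(L_inc - L_full)))  # essentially zero
```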
Batch Prediction Efficiency
For $m$ test points, naive point-by-point computation costs $\mathcal{O}(m \cdot n^2)$ for variance. Batching the triangular solves turns this into a single matrix operation:

```python
K_star = kernel(X_train, X_test)                    # (n, m)
V = solve_triangular(L, K_star, lower=True)         # (n, m), solved column by column
var = np.diag(K_test_test) - np.sum(V**2, axis=0)   # (m,)
```

This is still $\mathcal{O}(n^2 m)$ but runs as efficient BLAS operations.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

class EfficientGP:
    """
    An efficient GP implementation with precomputation for fast prediction.
    """

    def __init__(self, kernel_fn, noise_var=0.0, jitter=1e-8):
        self.kernel_fn = kernel_fn
        self.noise_var = noise_var
        self.jitter = jitter
        self.is_fitted = False

    def fit(self, X_train, y_train):
        """
        Fit the GP by precomputing expensive quantities.

        Cost: O(n³) for Cholesky
        """
        self.X_train = X_train
        self.y_train = y_train
        self.n = X_train.shape[0]

        # Compute and decompose kernel matrix
        K = self.kernel_fn(X_train, X_train)
        K += (self.noise_var + self.jitter) * np.eye(self.n)

        # Cholesky decomposition
        self.L = cholesky(K, lower=True)

        # Precompute alpha = K^{-1} @ y
        z = solve_triangular(self.L, y_train, lower=True)
        self.alpha = solve_triangular(self.L.T, z, lower=False)

        self.is_fitted = True
        return self

    def predict(self, X_test, return_cov=False):
        """
        Predict at test points.

        Cost: O(nm) for mean, O(n²m) for variance
        """
        if not self.is_fitted:
            raise RuntimeError("Must call fit() before predict()")

        m = X_test.shape[0]

        # Cross-covariance
        K_star = self.kernel_fn(self.X_train, X_test)  # (n, m)

        # Mean: K_*^T @ alpha
        mu = K_star.T @ self.alpha  # (m,)

        if return_cov:
            # Solve L @ V = K_star
            V = solve_triangular(self.L, K_star, lower=True)  # (n, m)

            # Prior covariance
            K_ss = self.kernel_fn(X_test, X_test)  # (m, m)

            # Posterior covariance
            cov = K_ss - V.T @ V  # (m, m)
            var = np.diag(cov)
            return mu, var, cov
        else:
            # Just variance (cheaper)
            V = solve_triangular(self.L, K_star, lower=True)
            k_ss = self.kernel_fn(X_test, X_test)
            var = np.diag(k_ss) - np.sum(V**2, axis=0)
            return mu, var

# Usage example
np.random.seed(42)

def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# Training
X_train = np.random.uniform(-3, 3, 100).reshape(-1, 1)
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(100)

# Create and fit GP
gp = EfficientGP(rbf, noise_var=0.01)
gp.fit(X_train, y_train)

# Predict at many points efficiently
X_test = np.linspace(-4, 4, 1000).reshape(-1, 1)

import time
start = time.time()
mu, var = gp.predict(X_test)
elapsed = time.time() - start

print(f"Predicted at {len(X_test)} points in {elapsed*1000:.1f} ms")
print(f"Mean prediction range: [{mu.min():.3f}, {mu.max():.3f}]")
print(f"Variance range: [{var.min():.6f}, {var.max():.3f}]")
```

Here we consolidate practical wisdom for interpreting GP predictions in real applications.
Think of the GP variance as describing a 'tube of plausible functions'. The mean is the center of the tube. Near training data, the tube is narrow (we know the function must pass nearby). Far from data, the tube expands toward prior uncertainty. The kernel determines how quickly the tube expands.
We have developed deep intuition for GP mean and variance predictions: the mean is a kernel-weighted combination of training outputs, the variance measures how well the data pin down the function at each input, and the kernel hyperparameters govern both.
What's Next
We have mastered mean and variance prediction for the noise-free and noisy cases. In the next page, we examine noise handling in depth—how observation noise affects the GP model, how to estimate noise variance, and the distinction between homoscedastic and heteroscedastic noise models.
You now have deep intuition for how GP mean and variance predictions behave across different scenarios. This understanding is essential for practical GP applications—knowing when to trust predictions, how to diagnose problems, and how to interpret uncertainty in decision-making contexts.