At the heart of every GP prediction are two numbers: the predictive mean and the predictive variance. These aren't just mathematical abstractions—they encode everything you need to make decisions under uncertainty.
The mean tells you what to expect. The variance tells you how much to trust that expectation.
While the formulas for these quantities are simple, understanding their behavior requires deep intuition. How does the mean interpolate between training points? Why does variance vary across input space? How do kernel parameters affect both? This page develops that intuition through careful analysis and concrete examples.
For practitioners, these quantities aren't academic—they drive real decisions. A Bayesian optimization acquisition function trades off mean (exploitation) against variance (exploration). A safety-critical system uses variance to know when to abstain from automated decisions. A scientific model uses variance to propagate uncertainty through downstream analyses.
By the end of this page, you will: (1) Deeply understand how kernel parameters affect mean and variance predictions, (2) Recognize the spatial structure of GP uncertainty, (3) Interpret mean predictions as kernel-weighted interpolation, (4) Understand edge cases where standard intuitions fail, (5) Apply this knowledge to practical prediction scenarios.
The predictive mean at a test point $\mathbf{x}_*$ is:
$$\mu_* = \mathbf{k}_*^\top K^{-1} \mathbf{y} = \mathbf{k}_*^\top \boldsymbol{\alpha}$$
where $\boldsymbol{\alpha} = K^{-1} \mathbf{y}$ are the 'dual coefficients'. This formula admits multiple interpretations, each illuminating a different aspect of GP prediction.
Interpretation 1: Weighted Sum of Training Outputs
We can write: $$\mu_* = \sum_{i=1}^{n} w_i y_i$$
where the weights $w_i = [K^{-1} \mathbf{k}_*]_i$ depend on both the kernel similarities and the correlation structure among training points. Unlike simple kernel smoothing where weights are just $k(\mathbf{x}_*, \mathbf{x}_i)$, GP weights account for redundancy among training points.
Example: If two training points $\mathbf{x}_1$ and $\mathbf{x}_2$ are very close (highly correlated), their combined influence on $\mu_*$ is roughly what a single point would contribute. The GP doesn't double-count nearby evidence.
The matrix K⁻¹ performs decorrelation. If training points were uncorrelated (K = I), then w = k*, and weights would simply be kernel similarities. When training points are correlated, K⁻¹ adjusts weights to avoid over-counting redundant information.
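A small sketch of this decorrelation (toy two-point dataset; the small noise term is my choice, added for numerical conditioning): duplicating a training observation leaves the GP mean essentially unchanged, because $K^{-1}$ splits the weight between the copies.

```python
import numpy as np
from scipy.linalg import solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

def gp_mean(X, y, x_star, noise=1e-4):
    K = rbf(X, X) + noise * np.eye(len(X))
    k_star = rbf(X, x_star.reshape(1, -1)).ravel()
    return float(k_star @ solve(K, y, assume_a='pos'))

X = np.array([[0.0], [1.0]])
y = np.array([1.0, 0.5])
x_star = np.array([0.5])

mu_base = gp_mean(X, y, x_star)

# Duplicate the observation at x = 0 (same input, same output)
X_dup = np.vstack([X, [[0.0]]])
y_dup = np.append(y, 1.0)
mu_dup = gp_mean(X_dup, y_dup, x_star)

print(mu_base, mu_dup)  # nearly identical: K^{-1} splits weight between the copies
```

A naive smoother using the raw similarities $k(\mathbf{x}_*, \mathbf{x}_i)$ as weights would instead double the duplicated point's contribution.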
Interpretation 2: Representer Theorem
The mean function lies in the reproducing kernel Hilbert space (RKHS) associated with the kernel:
$$\mu_*(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$
This is a sum of kernel functions centered at training points, each weighted by $\alpha_i$. The GP mean is the unique function in this RKHS that minimizes a regularized loss.
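The two views of the mean — as $\mathbf{k}_*^\top \boldsymbol{\alpha}$ and as a sum of kernel basis functions centered at the training points — are algebraically identical, which a few lines can confirm (sketch with an illustrative random dataset):

```python
import numpy as np
from scipy.linalg import solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (6, 1))
y = np.sin(X).ravel()

K = rbf(X, X) + 1e-8 * np.eye(6)
alpha = solve(K, y, assume_a='pos')  # dual coefficients

# Two equivalent evaluations of the posterior mean at x = 0.3
x = np.array([[0.3]])
mu_matrix = float(rbf(X, x).ravel() @ alpha)  # k_*^T alpha
mu_basis = float(sum(alpha[i] * np.exp(-0.5 * (X[i, 0] - 0.3)**2)  # sum_i alpha_i k(x, x_i)
                     for i in range(6)))

print(mu_matrix, mu_basis)  # identical up to rounding
```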
Interpretation 3: Linear Predictor in Feature Space
For kernels with explicit feature maps $\phi$, the mean can be written as:
$$\mu_*(\mathbf{x}) = \boldsymbol{\beta}^\top \phi(\mathbf{x})$$
where $\boldsymbol{\beta}$ is the posterior mean of the weight vector in Bayesian linear regression with features $\phi$. GPs provide a kernelized, infinite-dimensional version of Bayesian linear regression.
Weight Properties
The weights $w_i$ have interesting properties:
| Property | Explanation |
|---|---|
| Can be negative | Nearby opposite-signed $y_i$ values can induce negative weights |
| Don't sum to 1 | Unlike Nadaraya–Watson kernel smoothing, GP weights aren't normalized |
| Depend on all training points | Each weight depends on the full training configuration |
| Decay with distance | Weights for distant training points are typically small |
```python
import numpy as np
from scipy.linalg import solve

def analyze_mean_weights(X_train, y_train, x_test, kernel_fn, jitter=1e-8):
    """
    Analyze the weights used in GP mean prediction.
    """
    n = X_train.shape[0]

    # Compute kernel matrices
    K = kernel_fn(X_train, X_train) + jitter * np.eye(n)
    k_star = kernel_fn(X_train, x_test.reshape(1, -1)).flatten()

    # Dual coefficients alpha = K^{-1} y
    alpha = solve(K, y_train, assume_a='pos')

    # Weights: w = K^{-1} k_* (same as solving K @ w = k_*)
    weights = solve(K, k_star, assume_a='pos')

    # Mean prediction
    mu = np.dot(k_star, alpha)
    mu_from_weights = np.dot(weights, y_train)

    return {
        'alpha': alpha,
        'weights': weights,
        'kernel_similarities': k_star,
        'mu': mu,
        'weight_sum': np.sum(weights),
        'has_negative_weights': np.any(weights < 0),
    }

# Example: Analyze weight distribution
np.random.seed(42)

# Training data
X_train = np.array([[-2], [-1], [0], [1], [2]])
y_train = np.array([1.0, -0.5, 0.2, 1.5, 0.8])

def rbf(X1, X2, l=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq_dist / l**2)

# Analyze at different test points
test_points = [-3.0, -1.5, 0.5, 3.0]

for x_test in test_points:
    result = analyze_mean_weights(X_train, y_train, np.array([x_test]), rbf)
    print(f"Test point x* = {x_test}")
    print(f"  Kernel similarities k(x*, xᵢ): {np.round(result['kernel_similarities'], 3)}")
    print(f"  Prediction weights wᵢ: {np.round(result['weights'], 3)}")
    print(f"  Sum of weights: {result['weight_sum']:.3f}")
    print(f"  Has negative weights: {result['has_negative_weights']}")
    print(f"  Prediction μ*: {result['mu']:.3f}")
```

The predictive variance at test point $\mathbf{x}_*$ is:
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*$$
This expression has a compelling intuitive interpretation:
$$\sigma_*^2 = \underbrace{k(\mathbf{x}_*, \mathbf{x}_*)}_{\text{Prior variance}} - \underbrace{\mathbf{k}_*^\top K^{-1} \mathbf{k}_*}_{\text{Variance reduction from data}}$$
The Variance Reduction Term
The term $\mathbf{k}_*^\top K^{-1} \mathbf{k}_*$ is a quadratic form that measures how much information about $f_*$ is contained in the training data. It is always nonnegative and never exceeds the prior variance $k(\mathbf{x}_*, \mathbf{x}_*)$, so the posterior variance lies between zero and the prior variance.
Notice that σ*² depends only on input locations (through k and K), not on the output values y. This means you can compute prediction uncertainty before collecting any output data—useful for experimental design and active learning.
Geometric Interpretation
The variance reduction term has a geometric interpretation. Define $\mathbf{v} = K^{-1/2} \mathbf{k}_*$, where $K^{1/2}$ is the matrix square root of $K$. Then:
$$\mathbf{k}_*^\top K^{-1} \mathbf{k}_* = \|\mathbf{v}\|^2$$
This is the squared norm of the projection of $\mathbf{k}_*$ onto the space spanned by the training covariances. The variance reduction depends on how well $\mathbf{k}_*$ can be 'explained' by the training point covariance structure.
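A quick numerical check of this identity (a sketch: I use the Cholesky factor $L$ in place of the symmetric square root $K^{1/2}$ — any factorization with $LL^\top = K$ gives the same squared norm):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (5, 1))
K = rbf(X, X) + 1e-8 * np.eye(5)
k_star = rbf(X, np.array([[0.7]])).ravel()

# Quadratic form computed directly ...
quad = float(k_star @ solve(K, k_star, assume_a='pos'))

# ... equals the squared norm of v, where L v = k_* and L L^T = K
L = cholesky(K, lower=True)
v = solve_triangular(L, k_star, lower=True)
print(quad, float(v @ v))  # identical; both below the prior variance k(x*, x*) = 1
```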
Spatial Structure of Variance
The variance function $\sigma_*^2(\mathbf{x})$ has characteristic spatial structure:
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def variance_landscape(X_train, kernel_fn, resolution=100, x_range=(-5, 5)):
    """
    Compute the predictive variance landscape (before observing any y values).
    """
    n = X_train.shape[0]

    # Dense grid of test points
    X_test = np.linspace(x_range[0], x_range[1], resolution).reshape(-1, 1)

    # Kernel matrices
    K = kernel_fn(X_train, X_train) + 1e-8 * np.eye(n)
    K_star = kernel_fn(X_train, X_test)  # (n, resolution)
    k_diag = kernel_fn(X_test, X_test).diagonal()  # Prior variances

    # Efficient variance computation
    L = cholesky(K, lower=True)
    V = solve_triangular(L, K_star, lower=True)  # (n, resolution)

    # Variance reduction at each test point
    var_reduction = np.sum(V**2, axis=0)  # (resolution,)

    # Predictive variance
    var = k_diag - var_reduction

    return X_test.flatten(), var

# Compare variance landscapes for different training configurations
np.random.seed(42)

def rbf(X1, X2, l=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq_dist / l**2)

# Configuration 1: Evenly spaced training points
X_uniform = np.linspace(-3, 3, 5).reshape(-1, 1)

# Configuration 2: Clustered training points
X_clustered = np.array([[-2.5], [-2.0], [0], [2.0], [2.5]])

# Configuration 3: One-sided training points
X_onesided = np.linspace(-3, 0, 5).reshape(-1, 1)

for name, X_train in [('Uniform', X_uniform), ('Clustered', X_clustered), ('One-sided', X_onesided)]:
    x_grid, var = variance_landscape(X_train, rbf)
    print(f"{name} configuration:")
    print(f"  Min variance: {var.min():.6f} (should be ≈0 at training pts)")
    print(f"  Max variance: {var.max():.4f} (prior variance = 1.0)")
    print(f"  Mean variance: {var.mean():.4f}")

    # Find where variance exceeds 90% of prior
    high_unc_regions = x_grid[var > 0.9]
    if len(high_unc_regions) > 0:
        print(f"  High uncertainty (>90% prior) regions: x ∈ [{high_unc_regions.min():.1f}, {high_unc_regions.max():.1f}]")
```

Kernel hyperparameters profoundly influence both mean and variance predictions. Understanding these effects is crucial for practitioners.
RBF Kernel Parameters
The RBF (Squared Exponential) kernel $k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$ has two key parameters: the length scale $\ell$, which sets how far correlations extend, and the signal variance $\sigma_f^2$, which sets the prior amplitude of the function.
Let's analyze their effects systematically.
| Aspect | Small ℓ (e.g., 0.1) | Large ℓ (e.g., 10) |
|---|---|---|
| Correlation range | Only very close points correlated | Distant points still correlated |
| Posterior mean | Wiggly, follows training data closely | Smooth, averages over data |
| Variance between training pts | Rises quickly, high uncertainty gaps | Stays low, interpolation confident |
| Extrapolation behavior | Prediction reverts to prior quickly | Prediction reverts to prior slowly |
| Function sample smoothness | Rough, rapidly varying | Smooth, slowly varying |
Very small ℓ: Training points become uncorrelated, each prediction depends only on the nearest training point (approaching nearest-neighbor interpolation). Very large ℓ: All points become highly correlated, predictions approach a constant (the mean of training outputs).
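These two limits can be checked numerically. The sketch below reuses a small five-point dataset; the noise level 0.5 is my choice, picked so that the large-$\ell$ kernel matrix stays well conditioned:

```python
import numpy as np
from scipy.linalg import solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

def gp_mean(X, y, Xs, l, noise=0.5):
    K = rbf(X, X, l) + noise * np.eye(len(X))
    return rbf(X, Xs, l).T @ solve(K, y, assume_a='pos')

X = np.array([[-2.], [-1.], [0.], [1.], [2.]])
y = np.array([1.0, -0.5, 0.2, 1.5, 0.8])
grid = np.linspace(-2, 2, 81).reshape(-1, 1)

mu_small = gp_mean(X, y, grid, l=0.05)  # tiny length scale: spiky, reverts to 0 between points
mu_large = gp_mean(X, y, grid, l=50.0)  # huge length scale: nearly constant

print(f"small-l range of mean: {np.ptp(mu_small):.3f}")
print(f"large-l range of mean: {np.ptp(mu_large):.4f}, level ≈ {mu_large.mean():.3f}, mean(y) = {y.mean():.3f}")
```

With $\ell = 0.05$ the mean spikes at each training point and collapses to the prior mean in between; with $\ell = 50$ it is nearly flat, close to the average of the outputs.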
| Aspect | Small σ_f² (e.g., 0.1) | Large σ_f² (e.g., 10) |
|---|---|---|
| Prior uncertainty | Low—prior believes functions are small | High—prior allows large function values |
| Posterior variance scale | Variance bounded by σ_f² | Variance can be large far from data |
| Effect on mean | Pulls predictions toward zero | Allows larger deviations from zero |
| Noise interpretation | Relative noise is higher | Relative noise is lower |
Interaction Between Length Scale and Noise
In the noisy observation case, the ratio of observation noise $\sigma_n^2$ to signal variance $\sigma_f^2$ matters more than their absolute values: a high ratio makes the GP smooth through the observations, while a low ratio makes it interpolate them closely.
The length scale $\ell$ also interacts with training point spacing: when the spacing is small relative to $\ell$, interpolation between points is confident; when it is large relative to $\ell$, the GP is effectively extrapolating even between training points.
Multi-Dimensional Length Scales (ARD)
With Automatic Relevance Determination (ARD), each input dimension has its own length scale $\ell_j$:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{j=1}^{d} \frac{(x_j - x'_j)^2}{\ell_j^2}\right)$$
Dimensions with small $\ell_j$ are relevant (function varies rapidly along them). Dimensions with large $\ell_j$ are less relevant (function nearly constant along them).
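A minimal sketch of an ARD kernel (the helper name `ard_rbf` and the specific length scales are illustrative): with $\ell_1 = 0.5$ and $\ell_2 = 50$, moving along dimension 2 barely changes the correlation, while moving the same distance along dimension 1 collapses it.

```python
import numpy as np

def ard_rbf(X1, X2, lengthscales, s=1.0):
    # Scale each dimension by its own length scale before computing distances
    Z1 = X1 / lengthscales
    Z2 = X2 / lengthscales
    sq = np.sum(Z1**2, 1, keepdims=True) - 2*Z1 @ Z2.T + np.sum(Z2**2, 1)
    return s * np.exp(-0.5 * sq)

ls = np.array([0.5, 50.0])      # dim 0 relevant, dim 1 nearly irrelevant
a = np.array([[0.0, 0.0]])
b = np.array([[0.0, 3.0]])      # differs only along the irrelevant dimension
c = np.array([[1.0, 0.0]])      # differs only along the relevant dimension

print(ard_rbf(a, b, ls)[0, 0])  # ≈ 1: still highly correlated
print(ard_rbf(a, c, ls)[0, 0])  # = exp(-2) ≈ 0.135: correlation drops quickly
```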
GP prediction behavior differs fundamentally between interpolation (predicting within the training data range) and extrapolation (predicting outside it).
Interpolation: Between Training Points
In regions surrounded by training data, GP predictions exhibit:
Smooth interpolation: The mean function smoothly connects training points, with smoothness determined by the kernel.
Low variance: Surrounded by data, prediction uncertainty is low. The exact level depends on training point density relative to length scale.
Local influence: Predictions depend primarily on nearby training points, with distant points contributing little.
Behavior at midpoints: At the midpoint between two training points, the mean is approximately (but not exactly) the average of their outputs, modulated by the global training configuration.
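The midpoint claim can be verified exactly in a tiny two-point example: by symmetry both prediction weights equal $e^{-1/2}/(1 + e^{-2}) \approx 0.534$, so the mean at the midpoint is close to, but not exactly, the average of the two outputs. A sketch with unit length scale:

```python
import numpy as np
from scipy.linalg import solve

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

X = np.array([[-1.0], [1.0]])
y = np.array([1.0, 0.0])

K = rbf(X, X) + 1e-10 * np.eye(2)
k_star = rbf(X, np.array([[0.0]])).ravel()   # the midpoint
mu = float(k_star @ solve(K, y, assume_a='pos'))

# Both weights equal exp(-1/2)/(1 + exp(-2)) ≈ 0.534, so
# mu ≈ 0.534 * (y1 + y2): near the average 0.5, but not equal to it
print(mu)
```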
Extrapolation: Beyond Training Data
Moving outside the training data convex hull:
Reversion to prior mean: The predictive mean approaches the prior mean (typically 0) as we move further from data.
Reversion to prior variance: The predictive variance approaches $k(\mathbf{x}_*, \mathbf{x}_*) = \sigma_f^2$.
Rate of reversion: Determined by length scale. Small $\ell$ causes rapid reversion; large $\ell$ causes gradual reversion.
GPs do NOT extrapolate trends. Unlike polynomial regression that continues its trend forever, a GP with stationary kernel (RBF, Matérn, etc.) always reverts to the prior mean. If your task requires extrapolation, you need: (1) domain knowledge encoded in the mean function, (2) non-stationary kernels, or (3) explicit trend modeling.
Why Extrapolation Fails with Stationary Kernels
Stationary kernels depend only on $\|\mathbf{x} - \mathbf{x}'\|$, not on absolute position. This means the model carries no notion of trend: once a test point is far from every training point, it is essentially uncorrelated with all observations, and the posterior collapses back to the prior.
Quantifying Extrapolation Distance
For the RBF kernel, the correlation $k(\mathbf{x}_*, \mathbf{x}_i)$ between a test point and the nearest training point decays as:
$$k(\mathbf{x}_*, \mathbf{x}_i) = \sigma_f^2 \exp\left(-\frac{d_{\min}^2}{2\ell^2}\right)$$
where $d_{\min} = \min_i |\mathbf{x}_* - \mathbf{x}_i|$.
| Distance $d_{\min}/\ell$ | Correlation $k/\sigma_f^2$ | Interpolation Quality |
|---|---|---|
| 0.0 | 1.0 | At training point |
| 1.0 | 0.606 | Reasonable interpolation |
| 2.0 | 0.135 | Weak influence |
| 3.0 | 0.011 | Negligible influence |
| 4.0 | 0.0003 | Effectively extrapolating |
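The correlation column above is just the decay formula evaluated at round distances; a one-liner reproduces it:

```python
import numpy as np

# Correlation k/sigma_f^2 = exp(-(d_min/l)^2 / 2) at the distances in the table
for ratio in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"d_min/l = {ratio:.1f}: correlation = {np.exp(-0.5 * ratio**2):.4f}")
```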
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def analyze_extrapolation(X_train, y_train, extrapolation_distance, kernel_fn, jitter=1e-8):
    """
    Analyze GP behavior at different extrapolation distances.
    """
    n = X_train.shape[0]

    # Rightmost training point
    x_max = X_train.max()

    # Kernel matrices
    K = kernel_fn(X_train, X_train) + jitter * np.eye(n)
    L = cholesky(K, lower=True)

    # Dual coefficients alpha = K^{-1} y (computed once, reused for every test point)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)

    # Test at various extrapolation distances
    results = []
    for dist in extrapolation_distance:
        x_test = np.array([[x_max + dist]])
        k_star = kernel_fn(X_train, x_test).flatten()
        k_ss = kernel_fn(x_test, x_test)[0, 0]

        # Mean prediction
        mu = np.dot(k_star, alpha)

        # Variance
        v = solve_triangular(L, k_star, lower=True)
        var = k_ss - np.dot(v, v)

        # Max kernel similarity to any training point
        max_corr = k_star.max()

        results.append({
            'distance': dist,
            'mean': mu,
            'std': np.sqrt(var),
            'max_correlation': max_corr,
            'prior_fraction': var / k_ss  # Fraction of prior variance remaining
        })

    return results

# Example analysis
np.random.seed(42)

# Training data: observations of a linear trend
X_train = np.linspace(0, 5, 6).reshape(-1, 1)
y_train = 0.5 * X_train.flatten() + np.random.randn(6) * 0.1

# RBF kernel with length scale 2.0
def rbf(X1, X2, l=2.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

distances = [0.5, 1.0, 2.0, 4.0, 6.0, 10.0]
results = analyze_extrapolation(X_train, y_train, distances, rbf)

print("Extrapolation analysis (training x ∈ [0, 5], ℓ = 2.0)")
print("-" * 65)
print(f"{'Distance':>10} {'Mean':>10} {'Std':>10} {'Max Corr':>12} {'Prior %':>10}")
print("-" * 65)

for r in results:
    print(f"{r['distance']:>10.1f} {r['mean']:>10.3f} {r['std']:>10.3f} "
          f"{r['max_correlation']:>12.4f} {r['prior_fraction']*100:>9.1f}%")

print("Note: As distance increases, mean → 0 (prior), std → 1 (prior variance)")
print("The 'trend' in training data is NOT extrapolated!")
```

Understanding when standard GP intuition fails is as important as understanding when it works. Here we examine edge cases and potential pathologies.
Watch for: (1) Negative predictive variances (numerically impossible, indicates instability), (2) Cholesky decomposition failures, (3) Predictions wildly different from training outputs near training points, (4) Condition number of K exceeding 10¹² or so. Solutions: increase jitter, standardize inputs/outputs, or reconsider hyperparameters.
The Overconfidence Problem
A common pathology is overconfident predictions where the GP reports low variance but predictions are poor. This typically occurs due to:
Model misspecification: The true function doesn't match the kernel's assumptions (e.g., using RBF for a non-smooth function)
Poor hyperparameters: Length scale or noise variance badly wrong (often due to local optima in marginal likelihood)
Insufficient training data: Sparse data combined with small length scale leads to confident but wrong interpolations
The Underconfidence Problem
Conversely, underconfident predictions waste information:
Length scale too small: Nearby points don't inform each other, variance stays high
Noise variance too high: Model doesn't trust observations enough
Diagnostic: Calibration Check
A well-calibrated GP should have ~95% of true values within the ±2σ prediction band. If significantly more or fewer values fall within this band, the model is miscalibrated.
```python
import numpy as np
from scipy.stats import norm

def calibration_check(y_true, mu_pred, std_pred, quantiles=[0.5, 0.9, 0.95, 0.99]):
    """
    Check GP calibration by verifying prediction interval coverage.

    For a well-calibrated GP, X% of true values should fall within
    the X% prediction interval.
    """
    results = {}
    for q in quantiles:
        # Compute prediction interval
        z = norm.ppf(0.5 + q/2)  # e.g., 1.96 for q=0.95
        lower = mu_pred - z * std_pred
        upper = mu_pred + z * std_pred

        # Count coverage
        within = np.mean((y_true >= lower) & (y_true <= upper))

        results[q] = {
            'expected': q,
            'observed': within,
            'ratio': within / q  # <1 overconfident, >1 underconfident
        }
    return results

def normalized_residuals(y_true, mu_pred, std_pred):
    """
    Compute standardized residuals.

    For a well-calibrated GP, these should be approximately standard normal.
    """
    z = (y_true - mu_pred) / std_pred
    return z

# Example with synthetic data
np.random.seed(42)

# True function
def f_true(x):
    return np.sin(2 * x) + 0.1 * x**2

# Training data
X_train = np.random.uniform(-3, 3, 20)
y_train = f_true(X_train) + 0.1 * np.random.randn(20)

# Test data
X_test = np.random.uniform(-3, 3, 200)
y_test = f_true(X_test) + 0.1 * np.random.randn(200)

# Simulated GP predictions (replace with an actual GP in practice)
# For demo purposes, using radial basis function interpolation
from scipy.interpolate import Rbf
rbf_interp = Rbf(X_train, y_train, function='gaussian', epsilon=0.5)
mu_pred = rbf_interp(X_test)
std_pred = np.ones_like(mu_pred) * 0.15  # Simulated uncertainty

# Calibration check
cal = calibration_check(y_test, mu_pred, std_pred)

print("Calibration analysis:")
print("-" * 50)
for q, result in cal.items():
    status = "✓ OK" if 0.8 < result['ratio'] < 1.2 else "⚠ Check"
    print(f"  {int(q*100):2d}% interval: expected {result['expected']:.2f}, "
          f"observed {result['observed']:.2f} ({status})")

# Check normalized residuals
z = normalized_residuals(y_test, mu_pred, std_pred)
print("Normalized residual statistics (should be ≈ N(0,1)):")
print(f"  Mean: {z.mean():.3f} (should be ≈ 0)")
print(f"  Std:  {z.std():.3f} (should be ≈ 1)")
```

Efficient implementation of mean and variance prediction requires careful attention to numerical linear algebra.
Precomputation for Fast Prediction
Many quantities can be precomputed once during training and reused for all predictions: the Cholesky factor $L$ of $K + \sigma_n^2 I$ (an $\mathcal{O}(n^3)$ cost paid once) and the dual coefficients $\boldsymbol{\alpha} = K^{-1} \mathbf{y}$ (two triangular solves, $\mathcal{O}(n^2)$).
With these precomputed, prediction at a single test point costs $\mathcal{O}(n)$ for the mean (a dot product $\mathbf{k}_*^\top \boldsymbol{\alpha}$) and $\mathcal{O}(n^2)$ for the variance (one triangular solve against $L$).
Never explicitly compute K⁻¹. Instead, solve linear systems: K @ x = b is solved as x = solve(K, b). With the Cholesky factor L, this becomes two triangular solves: L @ z = b, then Lᵀ @ x = z. Each is O(n²) but with small constants.
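In SciPy, these two triangular solves are packaged as the `cho_factor`/`cho_solve` pair. A sketch comparing this route against the explicit inverse (the random SPD matrix here is just a stand-in for a kernel matrix):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
K = A @ A.T / 8 + 1e-2 * np.eye(50)  # generic SPD stand-in for a kernel matrix
b = rng.standard_normal(50)

# Recommended: factorize once, then two O(n^2) triangular solves per right-hand side
c, low = cho_factor(K, lower=True)
x_solve = cho_solve((c, low), b)

# Anti-pattern: explicit inverse (slower and less numerically stable)
x_inv = np.linalg.inv(K) @ b

print(np.max(np.abs(x_solve - x_inv)))  # agree here, but solve is the robust route
```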
Variance Computation Variants
There are several ways to compute predictive variance:
Method 1: Direct computation $$\sigma_*^2 = k_{**} - \mathbf{k}_*^\top K^{-1} \mathbf{k}_* = k_{**} - \mathbf{v}^\top \mathbf{v}$$ where $L \mathbf{v} = \mathbf{k}_*$.
Method 2: Cholesky of the posterior covariance. If you need correlated samples at a set of test points, compute the posterior covariance $\Sigma_*$ once, factor it as $\Sigma_* = L_* L_*^\top$, then sample as $\boldsymbol{\mu}_* + L_* \mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$.
Method 3: Incremental update When adding one training point at a time, the posterior can be updated in $\mathcal{O}(n^2)$ using rank-1 Cholesky updates.
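A sketch of Method 3 for the case of adding one training point: the Cholesky factor grows by one row, at the cost of a single $\mathcal{O}(n^2)$ triangular solve. The helper `chol_append` is illustrative, not a library routine:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf(X1, X2, l=1.0):
    sq = np.sum(X1**2, 1, keepdims=True) - 2*X1 @ X2.T + np.sum(X2**2, 1)
    return np.exp(-0.5 * sq / l**2)

def chol_append(L, k_new, k_nn):
    """Extend the Cholesky factor of K to that of [[K, k], [k^T, k_nn]] in O(n^2)."""
    b = solve_triangular(L, k_new, lower=True)  # the only O(n^2) step
    d = np.sqrt(k_nn - b @ b)                   # positive: it's the posterior variance
    n = L.shape[0]
    L_new = np.zeros((n + 1, n + 1))
    L_new[:n, :n] = L
    L_new[n, :n] = b
    L_new[n, n] = d
    return L_new

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (6, 1))
jitter = 1e-8
L = cholesky(rbf(X, X) + jitter * np.eye(6), lower=True)

# Add one training point without refactorizing from scratch
x_new = np.array([[0.3]])
k_new = rbf(X, x_new).ravel()
L_inc = chol_append(L, k_new, rbf(x_new, x_new)[0, 0] + jitter)

# Agrees with a full O(n^3) refactorization
X_all = np.vstack([X, x_new])
L_full = cholesky(rbf(X_all, X_all) + jitter * np.eye(7), lower=True)
print(np.max(np.abs(L_inc - L_full)))  # essentially zero
```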
Batch Prediction Efficiency
For $m$ test points, naive point-by-point computation costs $\mathcal{O}(m \cdot n^2)$ for variance. Batching the triangular solves turns this into a single matrix operation:

```python
K_star = kernel(X_train, X_test)                    # (n, m)
V = solve_triangular(L, K_star, lower=True)         # (n, m), solved column by column
var = np.diag(K_test_test) - np.sum(V**2, axis=0)   # (m,)
```

This is still $\mathcal{O}(n^2 m)$ but runs as efficient BLAS operations.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

class EfficientGP:
    """
    An efficient GP implementation with precomputation for fast prediction.
    """

    def __init__(self, kernel_fn, noise_var=0.0, jitter=1e-8):
        self.kernel_fn = kernel_fn
        self.noise_var = noise_var
        self.jitter = jitter
        self.is_fitted = False

    def fit(self, X_train, y_train):
        """
        Fit the GP by precomputing expensive quantities.

        Cost: O(n³) for Cholesky
        """
        self.X_train = X_train
        self.y_train = y_train
        self.n = X_train.shape[0]

        # Compute and decompose kernel matrix
        K = self.kernel_fn(X_train, X_train)
        K += (self.noise_var + self.jitter) * np.eye(self.n)

        # Cholesky decomposition
        self.L = cholesky(K, lower=True)

        # Precompute alpha = K^{-1} @ y
        z = solve_triangular(self.L, y_train, lower=True)
        self.alpha = solve_triangular(self.L.T, z, lower=False)

        self.is_fitted = True
        return self

    def predict(self, X_test, return_cov=False):
        """
        Predict at test points.

        Cost: O(nm) for mean, O(n²m) for variance
        """
        if not self.is_fitted:
            raise RuntimeError("Must call fit() before predict()")

        m = X_test.shape[0]

        # Cross-covariance
        K_star = self.kernel_fn(self.X_train, X_test)  # (n, m)

        # Mean: K_*^T @ alpha
        mu = K_star.T @ self.alpha  # (m,)

        if return_cov:
            # Solve L @ V = K_star
            V = solve_triangular(self.L, K_star, lower=True)  # (n, m)

            # Prior covariance
            K_ss = self.kernel_fn(X_test, X_test)  # (m, m)

            # Posterior covariance
            cov = K_ss - V.T @ V  # (m, m)
            var = np.diag(cov)
            return mu, var, cov
        else:
            # Just variance (cheaper)
            V = solve_triangular(self.L, K_star, lower=True)
            k_ss = self.kernel_fn(X_test, X_test)
            var = np.diag(k_ss) - np.sum(V**2, axis=0)
            return mu, var

# Usage example
np.random.seed(42)

def rbf(X1, X2, l=1.0, s=1.0):
    sq_dist = np.sum(X1**2, 1, keepdims=True) - 2*X1@X2.T + np.sum(X2**2, 1)
    return s * np.exp(-0.5 * sq_dist / l**2)

# Training
X_train = np.random.uniform(-3, 3, 100).reshape(-1, 1)
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(100)

# Create and fit GP
gp = EfficientGP(rbf, noise_var=0.01)
gp.fit(X_train, y_train)

# Predict at many points efficiently
X_test = np.linspace(-4, 4, 1000).reshape(-1, 1)

import time
start = time.time()
mu, var = gp.predict(X_test)
elapsed = time.time() - start

print(f"Predicted at {len(X_test)} points in {elapsed*1000:.1f} ms")
print(f"Mean prediction range: [{mu.min():.3f}, {mu.max():.3f}]")
print(f"Variance range: [{var.min():.6f}, {var.max():.3f}]")
```

Here we consolidate practical wisdom for interpreting GP predictions in real applications.
Think of the GP variance as describing a 'tube of plausible functions'. The mean is the center of the tube. Near training data, the tube is narrow (we know the function must pass nearby). Far from data, the tube expands toward prior uncertainty. The kernel determines how quickly the tube expands.
We have developed deep intuition for GP mean and variance predictions: the mean is a kernel-weighted combination of training outputs, the variance measures how well the data pin down the function at each input, and the kernel hyperparameters govern both.
What's Next
We have mastered mean and variance prediction for the noise-free and noisy cases. In the next page, we examine noise handling in depth—how observation noise affects the GP model, how to estimate noise variance, and the distinction between homoscedastic and heteroscedastic noise models.
You now have deep intuition for how GP mean and variance predictions behave across different scenarios. This understanding is essential for practical GP applications—knowing when to trust predictions, how to diagnose problems, and how to interpret uncertainty in decision-making contexts.