We've established that a Gaussian Process defines a prior distribution over functions, specified through a mean function and kernel. But priors alone don't make predictions—we need to condition on observed data to update our beliefs.
This is where the magic of Gaussian Processes truly shines. Unlike most machine learning methods that produce point predictions, GP posterior inference yields complete probability distributions over function values. At every new input, we obtain not just a prediction but a full quantification of our uncertainty.
The remarkable fact is that this posterior computation has a closed-form solution. No sampling, no variational approximations, no iterative optimization—just matrix algebra. This analytical tractability is a direct consequence of the Gaussian distribution's closure under conditioning.
By the end of this page, you will understand how to derive the GP posterior distribution through conditioning, the closed-form expressions for posterior mean and variance, how observations "pin down" the function at data points while uncertainty remains elsewhere, and the computational costs and numerical considerations for GP inference.
Suppose we have observed data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ where $y_i = f(x_i) + \epsilon_i$ with observation noise $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$. We want to predict $f(x_*)$ at a new test point $x_*$.
The Bayesian Framework:
The posterior is also a GP, thanks to the Gaussian's closure properties. Our task is to derive the posterior mean and covariance functions.
Setting Up the Joint Distribution:
Consider the joint distribution of the training observations $\mathbf{y} = [y_1, \ldots, y_n]^T$ and the test function value $f_* = f(x_*)$. Under our GP prior:
$$\begin{pmatrix} \mathbf{y} \\ f_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mathbf{m} \\ m_* \end{pmatrix}, \begin{pmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^T & k_{**} \end{pmatrix} \right)$$
where:

- $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$ is the kernel matrix over training inputs
- $\mathbf{k}_* \in \mathbb{R}^n$ with $[\mathbf{k}_*]_i = k(x_i, x_*)$ is the cross-covariance between training inputs and the test point
- $k_{**} = k(x_*, x_*)$ is the prior variance at the test point
- $\mathbf{m} = [m(x_1), \ldots, m(x_n)]^T$ and $m_* = m(x_*)$ are the prior means
The diagonal $\sigma_n^2 I$ is added to $K$ because we observe noisy measurements $y$, not the true function values $f(x)$. This noise "inflates" the covariance of observations. If observations were noise-free, we would condition directly on $f(X)$, not $\mathbf{y}$.
Now we apply the standard formula for conditioning a multivariate Gaussian. Given the joint distribution:
$$\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \right)$$
The conditional distribution of $\mathbf{b}$ given $\mathbf{a}$ is:
$$\mathbf{b} \mid \mathbf{a} \sim \mathcal{N}\left( \boldsymbol{\mu}_b + \Sigma_{ba}\Sigma_{aa}^{-1}(\mathbf{a} - \boldsymbol{\mu}_a),\; \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \right)$$
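As a sanity check, the conditioning formula can be evaluated by hand on a two-dimensional Gaussian (the numbers below are arbitrary illustrations, not from the text):

```python
import numpy as np

mu = np.array([1.0, 2.0])        # [mu_a, mu_b]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # [[S_aa, S_ab], [S_ba, S_bb]]
a_obs = 1.5                      # observed value of a

# b | a ~ N(mu_b + S_ba S_aa^{-1} (a - mu_a),  S_bb - S_ba S_aa^{-1} S_ab)
cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (a_obs - mu[0])
cond_var = Sigma[1, 1] - Sigma[1, 0] / Sigma[0, 0] * Sigma[0, 1]

print(cond_mean)  # 2.2
print(cond_var)   # 0.68
```

Observing $a$ above its mean pulls the conditional mean of $b$ upward (because the covariance is positive), and the conditional variance shrinks from 1.0 to 0.68 regardless of the observed value.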
Applying to Our GP:
Identifying $\mathbf{a} = \mathbf{y}$ (observed) and $\mathbf{b} = f_*$ (to predict):
$$f_* \mid \mathbf{y}, X, x_* \sim \mathcal{N}(\mu_*, \sigma_*^2)$$
where:
$$\boxed{\mu_* = m(x_*) + \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} (\mathbf{y} - \mathbf{m})}$$
$$\boxed{\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*}$$
These are the fundamental equations of GP regression.
The posterior mean is a linear combination of the observations: $\mu_* = m_* + \mathbf{k}_*^T \boldsymbol{\alpha}$ where $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}(\mathbf{y} - \mathbf{m})$. The weights depend on how correlated the test point is with each training point ($\mathbf{k}_*$), modulated by the inverse covariance structure. Points far from training data get weights close to zero, reverting the prediction toward the prior mean.
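The boxed equations can be evaluated directly in a few lines. The sketch below assumes a zero prior mean and an RBF kernel, with made-up training data purely for illustration:

```python
import numpy as np

def rbf(A, B, ell=1.0, sig2=1.0):
    """Squared exponential kernel between two 1-D input arrays."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return sig2 * np.exp(-d2 / (2 * ell**2))

X = np.array([-1.0, 0.0, 1.0])   # training inputs (made up)
y = np.array([0.5, 1.0, 0.5])    # noisy observations (made up)
x_star = 0.25                    # test input
noise = 0.1                      # sigma_n^2

K = rbf(X, X) + noise * np.eye(len(X))
k_star = rbf(np.array([x_star]), X).ravel()

# mu_* = k_*^T (K + sigma_n^2 I)^{-1} y        (prior mean m = 0)
alpha = np.linalg.solve(K, y)
mu_star = k_star @ alpha

# sigma_*^2 = k(x_*, x_*) - k_*^T (K + sigma_n^2 I)^{-1} k_*
var_star = rbf(np.array([x_star]), np.array([x_star]))[0, 0] \
           - k_star @ np.linalg.solve(K, k_star)

print(mu_star, var_star)
```

Evaluating the same formulas at a faraway point (say $x = 10$) gives kernel weights that are essentially zero, so the mean reverts to the prior mean of 0 and the variance to the prior variance.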
Understanding the Posterior Variance:
The posterior variance has an elegant interpretation:
$$\sigma_*^2 = \underbrace{k(x_*, x_*)}_{\text{prior variance}} - \underbrace{\mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*}_{\text{variance reduction}}$$
The variance reduction is always non-negative (since $K + \sigma_n^2 I$ is positive definite), so the posterior variance is always ≤ prior variance. Observations reduce uncertainty—never increase it.
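A quick numerical check of this claim, using an arbitrary random positive definite matrix in place of $K + \sigma_n^2 I$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
K_noisy = A @ A.T + 0.1 * np.eye(n)   # positive definite, like K + sigma_n^2 I
k_star = rng.standard_normal(n)

# A quadratic form in the inverse of a positive definite matrix is >= 0,
# so observations can only shrink the prior variance.
reduction = k_star @ np.linalg.solve(K_noisy, k_star)
print(reduction >= 0)  # True
```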
Key properties of the posterior mean and variance:
| Property | Posterior Mean | Posterior Variance |
|---|---|---|
| Depends on observed $\mathbf{y}$? | Yes — linear in $\mathbf{y}$ | No — only depends on $X, x_*$ |
| At training point | Interpolates data (if $\sigma_n = 0$) | Zero if $\sigma_n = 0$; otherwise small but nonzero |
| Far from data | Reverts to prior mean | Reverts to prior variance |
| Complexity | $O(n)$ per test point (after precomputing $\boldsymbol{\alpha}$) | $O(n^2)$ per test point |
Let's implement GP regression from scratch, carefully handling numerical stability and providing clear visualization.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular


class GaussianProcessRegressor:
    """
    Gaussian Process Regressor with squared exponential kernel.

    Provides posterior mean and variance predictions with
    numerically stable implementations.
    """

    def __init__(self, length_scale=1.0, signal_variance=1.0, noise_variance=0.1):
        """
        Initialize GP hyperparameters.

        Parameters:
        - length_scale: Controls function smoothness
        - signal_variance: Controls function magnitude
        - noise_variance: Observation noise variance
        """
        self.length_scale = length_scale
        self.signal_variance = signal_variance
        self.noise_variance = noise_variance

        # Will be set during fit
        self.X_train = None
        self.y_train = None
        self.L = None      # Cholesky factor
        self.alpha = None  # Precomputed weights

    def kernel(self, X1, X2):
        """
        Squared Exponential kernel.
        k(x, x') = σ² exp(-||x - x'||² / (2ℓ²))
        """
        # Compute squared distances efficiently
        X1 = np.atleast_2d(X1)
        X2 = np.atleast_2d(X2)
        # ||x - x'||² = ||x||² + ||x'||² - 2 x·x'
        X1_sq = np.sum(X1**2, axis=1).reshape(-1, 1)
        X2_sq = np.sum(X2**2, axis=1).reshape(1, -1)
        sq_dist = X1_sq + X2_sq - 2 * X1 @ X2.T
        # Ensure non-negative (numerical issues can cause tiny negatives)
        sq_dist = np.maximum(sq_dist, 0)
        return self.signal_variance * np.exp(-sq_dist / (2 * self.length_scale**2))

    def fit(self, X, y):
        """
        Fit the GP to training data.
        Precomputes quantities needed for prediction.
        """
        self.X_train = np.atleast_2d(X)
        self.y_train = y.ravel()
        n = len(self.y_train)

        # Compute kernel matrix
        K = self.kernel(self.X_train, self.X_train)

        # Add noise variance to diagonal
        K_noisy = K + self.noise_variance * np.eye(n)

        # Cholesky decomposition for numerical stability
        # K_noisy = L @ L.T
        self.L = cholesky(K_noisy, lower=True)

        # Solve L @ z = y for z, then L.T @ alpha = z for alpha
        # Equivalent to alpha = K_noisy^{-1} @ y
        self.alpha = solve_triangular(
            self.L.T,
            solve_triangular(self.L, self.y_train, lower=True),
            lower=False
        )
        return self

    def predict(self, X_test, return_std=True, return_cov=False):
        """
        Predict at test points.

        Returns:
        - mean: Posterior mean at test points
        - std (optional): Posterior standard deviation
        - cov (optional): Full posterior covariance matrix
        """
        X_test = np.atleast_2d(X_test)
        n_test = len(X_test)

        # Cross-covariance between test and training points
        K_star = self.kernel(X_test, self.X_train)

        # Posterior mean: μ* = K* @ (K + σₙ²I)⁻¹ @ y = K* @ α
        mean = K_star @ self.alpha

        if not return_std and not return_cov:
            return mean

        # For variance, we need L⁻¹ @ K*ᵀ
        # Solve L @ v = K*ᵀ for v
        v = solve_triangular(self.L, K_star.T, lower=True)

        if return_cov:
            # Full covariance: K** - K* @ (K + σₙ²I)⁻¹ @ K*ᵀ
            K_test = self.kernel(X_test, X_test)
            cov = K_test - v.T @ v
            return mean, cov

        if return_std:
            # Diagonal only (variance at each point)
            K_diag = self.signal_variance * np.ones(n_test)  # k(x*, x*) = σ²
            var = K_diag - np.sum(v**2, axis=0)
            # Ensure non-negative (numerical stability)
            var = np.maximum(var, 0)
            return mean, np.sqrt(var)

        return mean

    def sample_posterior(self, X_test, n_samples=5):
        """
        Draw samples from the posterior distribution.
        """
        mean, cov = self.predict(X_test, return_std=False, return_cov=True)
        # Add small jitter for numerical stability
        cov += 1e-8 * np.eye(len(X_test))
        # Sample from multivariate normal
        L = cholesky(cov, lower=True)
        samples = mean[:, np.newaxis] + L @ np.random.randn(len(X_test), n_samples)
        return samples


# Demonstration
np.random.seed(42)

# True function (unknown to the model)
def true_function(x):
    return np.sin(x) + 0.5 * np.sin(3*x)

# Generate training data with noise
n_train = 10
X_train = np.random.uniform(-5, 5, n_train).reshape(-1, 1)
y_train = true_function(X_train.ravel()) + 0.2 * np.random.randn(n_train)

# Fit GP
gp = GaussianProcessRegressor(length_scale=1.0, signal_variance=1.0, noise_variance=0.04)
gp.fit(X_train, y_train)

# Predict on dense grid
X_test = np.linspace(-6, 6, 200).reshape(-1, 1)
mean, std = gp.predict(X_test)
samples = gp.sample_posterior(X_test, n_samples=5)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Mean and uncertainty
ax = axes[0]
ax.fill_between(X_test.ravel(), mean - 2*std, mean + 2*std,
                alpha=0.2, color='C0', label='95% CI')
ax.fill_between(X_test.ravel(), mean - std, mean + std,
                alpha=0.3, color='C0', label='68% CI')
ax.plot(X_test, mean, 'C0-', linewidth=2, label='Posterior mean')
ax.plot(X_test, true_function(X_test.ravel()), 'k--', linewidth=1.5,
        label='True function', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=50, zorder=5, label='Training data')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('GP Regression: Posterior Mean and Uncertainty')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 2: Posterior samples
ax = axes[1]
for i in range(5):
    ax.plot(X_test, samples[:, i], alpha=0.7, linewidth=1.5,
            label=f'Sample {i+1}' if i < 3 else None)
ax.plot(X_test, true_function(X_test.ravel()), 'k--', linewidth=2,
        label='True function', alpha=0.7)
ax.scatter(X_train, y_train, c='red', s=50, zorder=5, label='Training data')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('Posterior Samples\n(consistent with data)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

Examining the posterior distribution reveals deep insights about how GPs combine prior knowledge with data.
Insight 1: Interpolation vs. Smoothing
The noise variance $\sigma_n^2$ controls whether the GP interpolates training points (passes exactly through them) or smooths over them:

- As $\sigma_n^2 \to 0$, the posterior mean passes through every observation (interpolation)
- As $\sigma_n^2$ grows, observations are treated as increasingly noisy, and the mean smooths over them rather than fitting each one exactly
Insight 2: Data Density Effects
Where training data is dense, the posterior mean is well-constrained and uncertainty is low. In regions with sparse or no data, predictions revert toward the prior with increased uncertainty.
This is exactly what we want! The GP "knows what it doesn't know"—uncertainty automatically grows where data is lacking.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular


def gp_posterior(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Compute GP posterior mean and std."""
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2*X1@X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2*length_scale**2))

    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)

    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))
    v = solve_triangular(L, K_star.T, lower=True)

    mean = K_star @ alpha
    var = signal_var - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0))


# Demonstration of key insights
np.random.seed(42)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Demo 1: Interpolation vs Smoothing
X_train = np.array([[-2], [0], [2]])
y_train = np.array([1, 0, 1])
X_test = np.linspace(-4, 4, 200).reshape(-1, 1)

for noise_var, label, ax in [(0.001, 'Interpolation (σₙ²≈0)', axes[0, 0]),
                             (0.5, 'Smoothing (σₙ²=0.5)', axes[0, 1])]:
    mean, std = gp_posterior(X_train, y_train, X_test, noise_var=noise_var)
    ax.fill_between(X_test.ravel(), mean-2*std, mean+2*std, alpha=0.2, color='C0')
    ax.plot(X_test, mean, 'C0-', linewidth=2)
    ax.scatter(X_train, y_train, c='red', s=100, zorder=5)
    ax.set_title(label)
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.grid(True, alpha=0.3)

# Demo 2: Data density effects
X_train_sparse = np.array([[-4], [4]])  # Only at edges
y_train_sparse = np.array([1, -1])
mean_sparse, std_sparse = gp_posterior(X_train_sparse, y_train_sparse, X_test,
                                       noise_var=0.01, length_scale=2.0)

ax = axes[1, 0]
ax.fill_between(X_test.ravel(), mean_sparse-2*std_sparse, mean_sparse+2*std_sparse,
                alpha=0.2, color='C0', label='95% CI')
ax.plot(X_test, mean_sparse, 'C0-', linewidth=2, label='Mean')
ax.scatter(X_train_sparse, y_train_sparse, c='red', s=100, zorder=5, label='Data')
ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
ax.set_title('Sparse Data: High Uncertainty in Middle')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.legend()
ax.grid(True, alpha=0.3)

# Demo 3: Uncertainty only depends on input locations
X_train_same = np.array([[-2], [0], [2]])
y_train_1 = np.array([0, 0, 0])   # Flat
y_train_2 = np.array([2, -1, 2])  # Curved

ax = axes[1, 1]
_, std_1 = gp_posterior(X_train_same, y_train_1, X_test, noise_var=0.01)
_, std_2 = gp_posterior(X_train_same, y_train_2, X_test, noise_var=0.01)

ax.plot(X_test, std_1, 'C0-', linewidth=2, label='y = [0, 0, 0]')
ax.plot(X_test, std_2, 'C1--', linewidth=2, label='y = [2, -1, 2]')
for x in X_train_same:
    ax.axvline(x.item(), color='red', linestyle=':', alpha=0.5)
ax.set_title('Std Dev: Same for Different y-values!')
ax.set_xlabel('x')
ax.set_ylabel('Posterior Std Dev')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Key Posterior Insights', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```

A remarkable property: the posterior variance depends only on the input locations $X$ and $x_*$, not on the observed values $\mathbf{y}$. This means we can pre-compute uncertainty before seeing target values! This is useful for experimental design—we can identify where to sample next based on uncertainty, without needing to run experiments first.
GP inference involves matrix operations that can be numerically unstable. Understanding these issues is crucial for robust implementations.
The Core Challenge: Matrix Inversion
The posterior equations require $(K + \sigma_n^2 I)^{-1}$. Directly computing matrix inverses is:

- Numerically unstable when the matrix is ill-conditioned, amplifying round-off error
- Wasteful, since we only ever need the inverse applied to specific vectors, never the inverse itself
Solution: Cholesky Decomposition
Instead of inverting, we decompose the matrix: $$K + \sigma_n^2 I = LL^T$$
where $L$ is lower triangular. Then:

- Solve $L\mathbf{z} = \mathbf{y}$ by forward substitution ($O(n^2)$)
- Solve $L^T \boldsymbol{\alpha} = \mathbf{z}$ by back substitution ($O(n^2)$), giving $\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1}\mathbf{y}$
This is faster and more numerically stable than explicit inversion.
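A minimal sketch of the two-triangular-solve procedure, using a small random symmetric positive definite matrix as a stand-in for $K + \sigma_n^2 I$, and checking it against a direct solve:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
K_noisy = A @ A.T + 0.5 * np.eye(n)   # stands in for K + sigma_n^2 I
y = rng.standard_normal(n)

L = cholesky(K_noisy, lower=True)              # K_noisy = L @ L.T
z = solve_triangular(L, y, lower=True)         # forward substitution: L z = y
alpha = solve_triangular(L.T, z, lower=False)  # back substitution: L.T alpha = z

# The two triangular solves reproduce (K + sigma_n^2 I)^{-1} y
print(np.allclose(alpha, np.linalg.solve(K_noisy, y)))  # True
```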
Even with Cholesky, the kernel matrix can become numerically singular when points are very close. Adding small "jitter" to the diagonal (e.g., 1e-8 × I) ensures positive definiteness. This is standard practice and has negligible impact on predictions. Libraries like GPyTorch add this automatically.
For the same reason, predicted variances can come out marginally negative due to round-off; in code they are clamped with `np.maximum(var, 0)` before taking square roots.
```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, LinAlgError


def stable_gp_inference(K, y, noise_var, jitter=1e-8, max_jitter=1e-2):
    """
    Numerically stable GP inference with automatic jitter adjustment.

    Parameters:
    - K: Kernel matrix (n × n)
    - y: Observations (n,)
    - noise_var: Observation noise variance
    - jitter: Initial jitter value
    - max_jitter: Maximum jitter before giving up

    Returns:
    - L: Cholesky factor
    - alpha: (K + noise_var*I)^{-1} @ y
    """
    n = len(y)
    K_noisy = K + noise_var * np.eye(n)

    current_jitter = jitter
    while current_jitter < max_jitter:
        try:
            L = cholesky(K_noisy + current_jitter * np.eye(n), lower=True)
            alpha = cho_solve((L, True), y)
            if current_jitter > jitter:
                print(f"Warning: Required jitter = {current_jitter:.2e}")
            return L, alpha
        except LinAlgError:
            current_jitter *= 10

    raise LinAlgError(f"Cholesky failed even with jitter = {current_jitter}")


def compute_log_marginal_likelihood(L, y, alpha):
    """
    Compute log marginal likelihood (for hyperparameter optimization).

    log p(y|X,θ) = -½ yᵀ(K+σₙ²I)⁻¹y - ½ log|K+σₙ²I| - n/2 log(2π)

    Using Cholesky:
    - yᵀα gives yᵀ(K+σₙ²I)⁻¹y
    - 2 Σ log(L_ii) gives log|K+σₙ²I|
    """
    n = len(y)
    # Data fit term: -½ yᵀ α
    data_fit = -0.5 * y.T @ alpha
    # Complexity term: -½ log|K+σₙ²I| = -Σ log(L_ii)
    complexity = -np.sum(np.log(np.diag(L)))
    # Constant term
    constant = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + constant


# Demonstration
np.random.seed(42)

# Generate data
n = 50
X = np.linspace(0, 10, n).reshape(-1, 1)
y = np.sin(X.ravel()) + 0.1 * np.random.randn(n)

# Build kernel matrix with potential numerical issues
length_scale = 0.1  # Very short length scale can create issues
signal_var = 1.0
sq_dist = (X - X.T)**2
K = signal_var * np.exp(-sq_dist / (2 * length_scale**2))

# Stable inference
L, alpha = stable_gp_inference(K, y, noise_var=0.01)
log_ml = compute_log_marginal_likelihood(L, y, alpha)

print(f"Log marginal likelihood: {log_ml:.2f}")
print(f"Cholesky factor condition number: {np.linalg.cond(L):.2e}")

# Verify positive definiteness
eigenvalues = np.linalg.eigvalsh(K + 0.01 * np.eye(n))
print(f"Smallest eigenvalue: {eigenvalues.min():.2e}")
print(f"Largest eigenvalue: {eigenvalues.max():.2e}")
print(f"Condition number: {eigenvalues.max() / eigenvalues.min():.2e}")
```

So far we've considered predicting at a single test point $x_*$. When predicting at multiple test points $X_* = [x_*^{(1)}, \ldots, x_*^{(m)}]$, we can compute the joint posterior distribution:
$$\mathbf{f}_* \mid \mathbf{y}, X, X_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$$
where:
$$\boldsymbol{\mu}_* = K_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}$$
$$\Sigma_* = K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*$$
Here:

- $K_* \in \mathbb{R}^{n \times m}$ holds the cross-covariances between training and test inputs
- $K_{**} \in \mathbb{R}^{m \times m}$ holds the covariances among the test inputs
- The mean expression above assumes a zero prior mean for brevity
Why Joint Predictions Matter:
The joint posterior captures correlations between test predictions. If two test points are nearby, their function values are correlated—knowing $f(x_*^{(1)})$ tells us something about $f(x_*^{(2)})$.
Individual marginal variances don't capture this information, but the joint covariance $\Sigma_*$ does. This is crucial for:

- Drawing coherent function samples from the posterior
- Any downstream computation that combines predictions at several inputs, where prediction errors are correlated
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import cholesky, solve_triangular


def gp_joint_posterior(X_train, y_train, X_test, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """
    Compute joint posterior distribution over test points.
    Returns mean vector and full covariance matrix.
    """
    def kernel(X1, X2):
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sq_dist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2*X1@X2.T
        return signal_var * np.exp(-np.maximum(sq_dist, 0) / (2*length_scale**2))

    # Compute all kernel matrices
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)      # m × n
    K_starstar = kernel(X_test, X_test)   # m × m

    # Cholesky decomposition
    L = cholesky(K, lower=True)

    # alpha = K^{-1} y
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))

    # v = L^{-1} K_star^T
    v = solve_triangular(L, K_star.T, lower=True)

    # Posterior mean and covariance
    mean = K_star @ alpha
    cov = K_starstar - v.T @ v

    # Ensure symmetric (numerical fix)
    cov = 0.5 * (cov + cov.T)
    return mean, cov


# Demonstration
np.random.seed(42)

# Training data
X_train = np.array([[-3], [0], [3]])
y_train = np.array([1, 0, 1])

# Test points
X_test = np.linspace(-5, 5, 100).reshape(-1, 1)

# Compute joint posterior
mean, cov = gp_joint_posterior(X_train, y_train, X_test, length_scale=1.5, noise_var=0.01)

# Sample from joint distribution
n_samples = 50
L_post = cholesky(cov + 1e-8 * np.eye(len(X_test)), lower=True)
samples = mean[:, np.newaxis] + L_post @ np.random.randn(len(X_test), n_samples)

# Also generate "independent" samples (ignoring correlation)
std = np.sqrt(np.diag(cov))
independent_samples = mean[:, np.newaxis] + std[:, np.newaxis] * np.random.randn(len(X_test), n_samples)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Joint samples (smooth, correlated)
ax = axes[0]
for i in range(10):
    ax.plot(X_test, samples[:, i], alpha=0.5, color='C0')
ax.scatter(X_train, y_train, c='red', s=100, zorder=5)
ax.set_title('Joint Posterior Samples\n(Correlated, smooth functions)')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.grid(True, alpha=0.3)

# Independent samples (jagged, not functions)
ax = axes[1]
for i in range(10):
    ax.plot(X_test, independent_samples[:, i], alpha=0.5, color='C1')
ax.scatter(X_train, y_train, c='red', s=100, zorder=5)
ax.set_title('Independent Marginal Samples\n(Uncorrelated, not proper functions!)')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.grid(True, alpha=0.3)

plt.suptitle('Joint vs Marginal Sampling', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Show correlation structure
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(cov, cmap='RdBu_r', vmin=-1, vmax=1)
plt.colorbar(im, ax=ax, label='Covariance')
ax.set_title('Posterior Covariance Matrix')
ax.set_xlabel('Test Point Index')
ax.set_ylabel('Test Point Index')
plt.show()
```

When visualizing posterior uncertainty, always sample from the joint distribution, not independent marginals. Independent marginal samples ignore correlations and produce jagged, non-function-like curves. Joint samples respect the GP's correlation structure and produce smooth functions consistent with the kernel.
Understanding GP computational complexity is essential for practical applications.
Training (Fitting):
The dominant cost is Cholesky decomposition of the $n \times n$ kernel matrix:

- Time: $O(n^3)$ for the factorization
- Memory: $O(n^2)$ to store the kernel matrix and its Cholesky factor
This cubic scaling is the primary limitation of exact GPs. For $n = 10,000$ points, you're decomposing a $10^8$-element matrix—feasible but slow. For $n = 100,000$, standard exact GPs become impractical.
Prediction (at $m$ test points):
Once Cholesky is computed:

- Posterior mean at one test point: $O(n)$ (a dot product with the precomputed $\boldsymbol{\alpha}$)
- Posterior variance at one test point: $O(n^2)$ (one triangular solve against $\mathbf{k}_*$)
| Operation | Time Complexity | Memory | Bottleneck |
|---|---|---|---|
| Build kernel matrix | $O(n^2)$ | $O(n^2)$ | Pairwise distances |
| Cholesky decomposition | $O(n^3)$ | $O(n^2)$ | Matrix factorization |
| Solve for $\alpha$ | $O(n^2)$ | $O(n)$ | Triangular solves |
| Predict mean (m points) | $O(mn)$ | $O(mn)$ | Cross-covariance |
| Predict variance (m points) | $O(mn^2)$ | $O(mn)$ | Triangular solves |
| Full posterior covariance | $O(mn^2 + m^2n)$ | $O(m^2)$ | Dense products |
| Sample from posterior | $O(m^3)$ | $O(m^2)$ | Test Cholesky |
The $O(n^3)$ barrier has motivated extensive research into scalable GP approximations: inducing point methods (FITC, VFE), random Fourier features, structured kernels, and GPU acceleration. We'll cover these in the 'Computational Aspects' module. For now, understand that exact GPs work well for thousands of points; beyond that requires approximations.
We've derived and implemented the core equations of Gaussian Process regression, understanding how conditioning on data transforms prior beliefs into predictions with calibrated uncertainty.
What's Next:
With posterior inference mastered, we'll focus on mean and variance prediction—the practical outputs of GP regression. We'll examine how to interpret predictions, visualize uncertainty, and use GP outputs for downstream tasks like decision-making and active learning.
You now understand the mathematical heart of Gaussian Process regression: conditioning on data to obtain posterior distributions. The closed-form solutions make GPs remarkably elegant and computationally tractable for moderate-sized datasets. Next, we'll explore how to interpret and use these predictions effectively.