We have developed all the theoretical components of Gaussian Process regression across the preceding pages. Now we bring everything together into a complete, closed-form algorithm that you can implement and apply to real problems.
The closed-form solution is what makes GP regression uniquely elegant among flexible regression methods. Unlike neural networks requiring iterative optimization, or tree ensembles requiring greedy splitting, the GP solution can be written in a few matrix equations. Given training data and hyperparameters, there's one correct answer—no randomness, no convergence issues, no local optima in the inference procedure itself.
This page presents the complete algorithm, addresses all computational details, provides a production-quality reference implementation, and discusses the practical considerations that distinguish textbook GPs from real-world deployments.
By the end of this page, you will: (1) Have the complete GP regression algorithm ready to implement, (2) Understand all computational details and efficiency considerations, (3) Know how to handle edge cases and numerical issues, (4) Have a production-quality reference implementation, (5) Be prepared to apply GPs to real-world problems.
Here is the complete GP regression algorithm, from inputs to predictions:
Inputs: training inputs $\mathbf{X} \in \mathbb{R}^{n \times d}$, training targets $\mathbf{y} \in \mathbb{R}^n$, test inputs $\mathbf{X}_* \in \mathbb{R}^{m \times d}$, a kernel function $k(\cdot, \cdot)$ with its hyperparameters, noise variance $\sigma_n^2$, and a small jitter $\epsilon_{\text{jitter}}$.
Algorithm:
Compute training covariance: $K = k(\mathbf{X}, \mathbf{X}) \in \mathbb{R}^{n \times n}$
Add noise and jitter: $K_y = K + (\sigma_n^2 + \epsilon_{\text{jitter}}) \mathbf{I}$
Cholesky decomposition: $K_y = \mathbf{L} \mathbf{L}^\top$ where $\mathbf{L}$ is lower triangular
Compute dual coefficients: Solve $\mathbf{L} \mathbf{z} = \mathbf{y}$, then $\mathbf{L}^\top \boldsymbol{\alpha} = \mathbf{z}$. Result: $\boldsymbol{\alpha} = K_y^{-1} \mathbf{y}$
Compute cross-covariance: $K_* = k(\mathbf{X}, \mathbf{X}_*) \in \mathbb{R}^{n \times m}$
Predictive mean: $\boldsymbol{\mu}_* = K_*^\top \boldsymbol{\alpha}$
Solve for variance computation: $\mathbf{L} \mathbf{V} = K_*$ → $\mathbf{V} \in \mathbb{R}^{n \times m}$
Prior variance at test points: $\mathbf{k}_{**} = \text{diag}(k(\mathbf{X}_*, \mathbf{X}_*)) \in \mathbb{R}^m$
Predictive variance of latent function: $\boldsymbol{\sigma}_*^2 = \mathbf{k}_{**} - \text{colsum}(\mathbf{V}^2)$
Predictive variance of observations: $\boldsymbol{\sigma}_{y*}^2 = \boldsymbol{\sigma}_*^2 + \sigma_n^2$
Outputs: predictive mean $\boldsymbol{\mu}_* \in \mathbb{R}^m$, latent-function variance $\boldsymbol{\sigma}_*^2 \in \mathbb{R}^m$, and observation variance $\boldsymbol{\sigma}_{y*}^2 \in \mathbb{R}^m$ at the test points.
The core results, boxed for reference:
$$\boldsymbol{\mu}_* = K_*^\top (K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} = K_*^\top \boldsymbol{\alpha}$$
$$\sigma_*^2(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$$
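The ten steps above map directly onto a few NumPy calls. The sketch below is a minimal illustration under assumed hyperparameters (an RBF kernel with unit length scale and signal variance, noise variance 0.01); the production-quality implementation appears later on this page.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(A, B, length_scale=1.0, signal_variance=1.0):
    """Illustrative RBF kernel: k(a, b) = sigma_f^2 * exp(-||a - b||^2 / (2 l^2))."""
    sq_dist = np.sum(A**2, axis=1)[:, None] - 2 * A @ B.T + np.sum(B**2, axis=1)
    return signal_variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(X, y, X_star, noise_variance=1e-2, jitter=1e-8):
    """Steps 1-10 of the algorithm above, for fixed (assumed) hyperparameters."""
    n = len(y)
    K = rbf_kernel(X, X)                                      # 1: training covariance
    K_y = K + (noise_variance + jitter) * np.eye(n)           # 2: add noise and jitter
    L = cholesky(K_y, lower=True)                             # 3: Cholesky factor
    z = solve_triangular(L, y, lower=True)                    # 4: L z = y
    alpha = solve_triangular(L.T, z, lower=False)             #    L^T alpha = z
    K_star = rbf_kernel(X, X_star)                            # 5: cross-covariance
    mu = K_star.T @ alpha                                     # 6: predictive mean
    V = solve_triangular(L, K_star, lower=True)               # 7: L V = K_star
    k_ss = np.diag(rbf_kernel(X_star, X_star))                # 8: prior variances at X_star
    var_f = np.maximum(k_ss - np.sum(V**2, axis=0), 0.0)      # 9: latent variance (clamped)
    var_y = var_f + noise_variance                            # 10: observation variance
    return mu, var_f, var_y

# Toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3, 3, 50).reshape(-1, 1)
mu, var_f, var_y = gp_predict(X, y, X_star)
```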
Understanding the computational costs enables informed decisions about problem scale and approximation methods.
Training Phase (steps 1-4):
| Step | Operation | Time | Space |
|---|---|---|---|
| 1 | Kernel matrix $K$ | $\mathcal{O}(n^2 d)$ | $\mathcal{O}(n^2)$ |
| 2 | Add noise (in-place) | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ |
| 3 | Cholesky decomposition | $\mathcal{O}(n^3/3)$ | $\mathcal{O}(n^2)$ |
| 4 | Triangular solves | $\mathcal{O}(n^2)$ | $\mathcal{O}(n)$ |
| 1-4 | **Total training** | $\mathcal{O}(n^3)$ | $\mathcal{O}(n^2)$ |
Prediction Phase (steps 5-10):
| Step | Operation | Time | Space |
|---|---|---|---|
| 5 | Cross-covariance $K_*$ | $\mathcal{O}(nmd)$ | $\mathcal{O}(nm)$ |
| 6 | Mean prediction | $\mathcal{O}(nm)$ | $\mathcal{O}(m)$ |
| 7 | Triangular solve for $V$ | $\mathcal{O}(n^2 m)$ | $\mathcal{O}(nm)$ |
| 8 | Prior variances | $\mathcal{O}(md)$ | $\mathcal{O}(m)$ |
| 9 | Posterior variances | $\mathcal{O}(nm)$ | $\mathcal{O}(m)$ |
| 10 | Add noise to variance | $\mathcal{O}(m)$ | $\mathcal{O}(m)$ |
| 5-10 | **Total prediction** | $\mathcal{O}(n^2 m)$ | $\mathcal{O}(nm)$ |
The O(n³) Cholesky decomposition and O(n²) memory for K are the fundamental limits of exact GP regression. For n > 10,000-50,000 (depending on hardware and memory), you need sparse/approximate methods like inducing points, random features, or structured approximations.
Practical Scaling Guidelines
| Training Size $n$ | Feasibility | Notes |
|---|---|---|
| $n < 1,000$ | Easy | Runs in seconds on any modern hardware |
| $1,000 < n < 10,000$ | Moderate | Seconds to minutes; memory becomes relevant |
| $10,000 < n < 50,000$ | Challenging | Minutes; requires careful memory management |
| $n > 50,000$ | Requires approximations | Use sparse GPs or variational methods |
Amortization Through Precomputation
Once $\mathbf{L}$ and $\boldsymbol{\alpha}$ are computed (training), predictions at new points only require the cross-covariance $K_*$ ($\mathcal{O}(nmd)$), one matrix-vector product for the mean ($\mathcal{O}(nm)$), and one triangular solve for the variance ($\mathcal{O}(n^2 m)$).
For many test points, you can make predictions efficiently after the one-time $\mathcal{O}(n^3)$ training cost.
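A sketch of this amortization pattern, assuming a toy RBF setup (the kernel, data sizes, and hyperparameter values are placeholders for illustration):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(A, B, length_scale=1.0, signal_variance=1.0):
    sq = np.sum(A**2, axis=1)[:, None] - 2 * A @ B.T + np.sum(B**2, axis=1)
    return signal_variance * np.exp(-0.5 * sq / length_scale**2)

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, (500, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(500)

# --- Training phase: O(n^3), paid once ---
K_y = rbf_kernel(X_train, X_train) + (0.01 + 1e-8) * np.eye(500)
L = cholesky(K_y, lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)

# --- Prediction phase: every batch of test points reuses L and alpha ---
for _ in range(3):
    X_star = rng.uniform(-3, 3, (100, 1))
    K_star = rbf_kernel(X_train, X_star)            # O(nmd)
    mu = K_star.T @ alpha                           # O(nm)
    V = solve_triangular(L, K_star, lower=True)     # O(n^2 m)
    var_f = 1.0 - np.sum(V**2, axis=0)              # prior variance = signal_variance = 1.0
```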
Numerical stability is critical for reliable GP implementations. Here are the key considerations and solutions.
Never Compute $K^{-1}$ Explicitly
The expression $(K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$ appears throughout the GP equations, but you should never form this inverse as an explicit matrix. Instead, factor $K_y = \mathbf{L}\mathbf{L}^\top$ once and obtain $\boldsymbol{\alpha}$ via two triangular solves: $\mathbf{L}\mathbf{z} = \mathbf{y}$, then $\mathbf{L}^\top \boldsymbol{\alpha} = \mathbf{z}$.
This is more stable and often faster.
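A minimal comparison on a synthetic positive-definite matrix (the matrix and sizes are arbitrary; the point is the `cho_factor`/`cho_solve` pattern rather than the specific numbers):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5))
K_y = 0.01 * A @ A.T + np.eye(500)      # SPD matrix standing in for K + sigma_n^2 I
y = rng.standard_normal(500)

# Avoid: explicit inverse (slower and less numerically stable)
alpha_bad = np.linalg.inv(K_y) @ y

# Prefer: factorize once, then solve
c, low = cho_factor(K_y, lower=True)
alpha_good = cho_solve((c, low), y)

print(np.max(np.abs(alpha_bad - alpha_good)))  # small here, but the solve route is the stable one
```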
Positive Definiteness Guarantees
The Cholesky decomposition fails if $K_y$ is not positive definite. Ensure this by using a valid (positive semi-definite) kernel, adding the noise variance $\sigma_n^2$ to the diagonal, and adding a small jitter $\epsilon_{\text{jitter}}$ (e.g., $10^{-8}$) when the noise is very small.
Monitor the condition number κ(K) = λ_max/λ_min. High condition numbers (>10¹²) indicate numerical problems. Remedies: increase jitter, adjust length scale, standardize inputs.
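One common remedy is to grow the jitter until the matrix is well conditioned. A heuristic sketch (the threshold, growth factor, and retry count are assumptions, and `np.linalg.cond` itself costs roughly $\mathcal{O}(n^3)$, so treat this as a diagnostic rather than something to run on every fit):

```python
import numpy as np

def add_jitter_until_stable(K_y, max_condition=1e12, jitter=1e-8, max_tries=8):
    """Add increasing jitter to the diagonal until cond(K_y) drops below the threshold."""
    K_stab = K_y.copy()
    for _ in range(max_tries):
        if np.linalg.cond(K_stab) < max_condition:
            break
        K_stab += jitter * np.eye(K_stab.shape[0])
        jitter *= 10.0
    return K_stab
```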
Log-Likelihood Computation
The log marginal likelihood is: $$\log p(\mathbf{y}) = -\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{n}{2} \log(2\pi)$$
Compute this stably as: $$\log p(\mathbf{y}) = -\frac{1}{2} \mathbf{y}^\top \boldsymbol{\alpha} - \sum_{i=1}^{n} \log L_{ii} - \frac{n}{2} \log(2\pi)$$
where $\mathbf{L}$ is the Cholesky factor. The log-determinant $\log |K_y| = 2 \sum_i \log L_{ii}$ avoids overflow.
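As a standalone sketch of this computation (the function name and argument convention are illustrative; the reference implementation below folds the same arithmetic into its `fit` and `log_marginal_likelihood` methods):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_marginal_likelihood(K_y, y):
    """Stable log p(y) via the Cholesky factor: no explicit inverse or determinant."""
    L = cholesky(K_y, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
    return (-0.5 * y @ alpha                       # data-fit term
            - np.sum(np.log(np.diag(L)))           # = -0.5 * log|K_y|
            - 0.5 * len(y) * np.log(2 * np.pi))    # normalization constant
```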
Here is a complete, production-quality GP regression implementation that incorporates all the considerations we've discussed. This code is designed to be correct, numerically stable, and reasonably efficient.
"""Complete Gaussian Process Regression Implementation===================================================A production-quality implementation covering all aspects of GP regression.""" import numpy as npfrom scipy.linalg import cholesky, solve_triangular, cho_solvefrom scipy.optimize import minimizefrom typing import Tuple, Optional, Callable, Dict, Any class GaussianProcessRegressor: """ Gaussian Process Regressor with full functionality. Features: - RBF (Squared Exponential) kernel with customizable parameters - Automatic hyperparameter optimization via marginal likelihood - Numerically stable Cholesky-based inference - Both pointwise and full covariance predictions - Proper handling of noise and jitter """ def __init__( self, kernel: str = 'rbf', length_scale: float = 1.0, signal_variance: float = 1.0, noise_variance: float = 1e-6, jitter: float = 1e-8, optimize_hyperparameters: bool = True, n_restarts: int = 5, ): """ Initialize the GP regressor. Parameters: ----------- kernel : str Kernel type. Currently supports 'rbf'. length_scale : float Initial length scale for the kernel. signal_variance : float Initial signal variance (kernel amplitude squared). noise_variance : float Observation noise variance. Set to small value for near-interpolation. jitter : float Small value added to diagonal for numerical stability. optimize_hyperparameters : bool Whether to optimize hyperparameters during fit. n_restarts : int Number of random restarts for optimization. """ self.kernel_type = kernel self.length_scale = length_scale self.signal_variance = signal_variance self.noise_variance = noise_variance self.jitter = jitter self.optimize_hyperparameters = optimize_hyperparameters self.n_restarts = n_restarts # Fitted state self._is_fitted = False self._X_train = None self._y_train = None self._L = None # Cholesky factor self._alpha = None # Dual coefficients def _kernel(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray: """ Compute the kernel matrix between X1 and X2. """ if self.kernel_type == 'rbf': # Squared distances: ||x1 - x2||^2 sq_dist = ( np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1) ) return self.signal_variance * np.exp(-0.5 * sq_dist / self.length_scale**2) else: raise ValueError(f"Unknown kernel: {self.kernel_type}") def _compute_kernel_with_params( self, X1: np.ndarray, X2: np.ndarray, length_scale: float, signal_variance: float ) -> np.ndarray: """Compute kernel with specific hyperparameters.""" sq_dist = ( np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1) ) return signal_variance * np.exp(-0.5 * sq_dist / length_scale**2) def _neg_log_marginal_likelihood( self, log_theta: np.ndarray, X: np.ndarray, y: np.ndarray ) -> float: """ Compute negative log marginal likelihood for optimization. 
Parameters: ----------- log_theta : array [log(length_scale), log(signal_variance), log(noise_variance)] """ length_scale = np.exp(log_theta[0]) signal_variance = np.exp(log_theta[1]) noise_variance = np.exp(log_theta[2]) n = len(y) # Compute kernel matrix K = self._compute_kernel_with_params(X, X, length_scale, signal_variance) K_y = K + (noise_variance + self.jitter) * np.eye(n) # Cholesky decomposition try: L = cholesky(K_y, lower=True) except np.linalg.LinAlgError: return 1e10 # Return large value if not positive definite # Solve for alpha alpha = cho_solve((L, True), y) # Log marginal likelihood components data_fit = -0.5 * y @ alpha complexity = -np.sum(np.log(np.diag(L))) # -0.5 * log|K_y| constant = -0.5 * n * np.log(2 * np.pi) return -(data_fit + complexity + constant) def fit(self, X: np.ndarray, y: np.ndarray) -> 'GaussianProcessRegressor': """ Fit the GP to training data. Parameters: ----------- X : array of shape (n_samples, n_features) Training inputs. y : array of shape (n_samples,) Training targets. Returns: -------- self : GaussianProcessRegressor Fitted estimator. """ X = np.atleast_2d(X) y = np.asarray(y).ravel() if X.shape[0] != len(y): raise ValueError("X and y must have same number of samples") # Store training data self._X_train = X.copy() self._y_train = y.copy() n = len(y) # Optimize hyperparameters if requested if self.optimize_hyperparameters and n > 1: best_nlml = np.inf best_theta = None for _ in range(self.n_restarts): # Random initialization in log-space theta0 = np.array([ np.log(self.length_scale) + 0.5 * np.random.randn(), np.log(self.signal_variance) + 0.5 * np.random.randn(), np.log(self.noise_variance) + 0.5 * np.random.randn(), ]) result = minimize( self._neg_log_marginal_likelihood, theta0, args=(X, y), method='L-BFGS-B', bounds=[(-10, 10), (-10, 10), (-15, 5)], ) if result.fun < best_nlml: best_nlml = result.fun best_theta = result.x if best_theta is not None: self.length_scale = np.exp(best_theta[0]) self.signal_variance = np.exp(best_theta[1]) self.noise_variance = np.exp(best_theta[2]) # Compute kernel matrix and decomposition K = self._kernel(X, X) K_y = K + (self.noise_variance + self.jitter) * np.eye(n) # Cholesky decomposition self._L = cholesky(K_y, lower=True) # Compute dual coefficients self._alpha = cho_solve((self._L, True), y) self._is_fitted = True return self def predict( self, X: np.ndarray, return_std: bool = False, return_cov: bool = False ) -> Tuple[np.ndarray, ...]: """ Make predictions at test points. Parameters: ----------- X : array of shape (n_samples, n_features) Test inputs. return_std : bool Whether to return standard deviation. return_cov : bool Whether to return full covariance matrix. Returns: -------- y_mean : array of shape (n_samples,) Predictive means. y_std : array of shape (n_samples,) (if return_std=True) Predictive standard deviations for latent function. y_cov : array of shape (n_samples, n_samples) (if return_cov=True) Predictive covariance for latent function. 
""" if not self._is_fitted: raise RuntimeError("Must call fit() before predict()") X = np.atleast_2d(X) # Cross-covariance K_star = self._kernel(self._X_train, X) # Predictive mean y_mean = K_star.T @ self._alpha result = [y_mean] if return_std or return_cov: # Solve L @ V = K_star V = solve_triangular(self._L, K_star, lower=True) if return_cov: # Full covariance K_ss = self._kernel(X, X) y_cov = K_ss - V.T @ V # Ensure symmetry y_cov = 0.5 * (y_cov + y_cov.T) result.append(y_cov) if return_std: # Pointwise variance K_ss_diag = self.signal_variance * np.ones(X.shape[0]) y_var = K_ss_diag - np.sum(V**2, axis=0) # Clamp to avoid numerical issues y_var = np.maximum(y_var, 0) result.append(np.sqrt(y_var)) return tuple(result) if len(result) > 1 else result[0] def sample_y(self, X: np.ndarray, n_samples: int = 1) -> np.ndarray: """ Sample functions from the posterior distribution. Parameters: ----------- X : array of shape (n_points, n_features) Points at which to sample. n_samples : int Number of function samples. Returns: -------- samples : array of shape (n_points, n_samples) Function samples. """ y_mean, y_cov = self.predict(X, return_cov=True) # Cholesky of posterior covariance L_cov = cholesky(y_cov + self.jitter * np.eye(len(y_mean)), lower=True) # Sample z = np.random.randn(len(y_mean), n_samples) samples = y_mean[:, None] + L_cov @ z return samples def log_marginal_likelihood(self) -> float: """ Compute log marginal likelihood of the fitted model. """ if not self._is_fitted: raise RuntimeError("Must call fit() before log_marginal_likelihood()") n = len(self._y_train) data_fit = -0.5 * self._y_train @ self._alpha complexity = -np.sum(np.log(np.diag(self._L))) constant = -0.5 * n * np.log(2 * np.pi) return data_fit + complexity + constant def get_params(self) -> Dict[str, Any]: """Get current hyperparameters.""" return { 'length_scale': self.length_scale, 'signal_variance': self.signal_variance, 'noise_variance': self.noise_variance, }Let's demonstrate the GP implementation with several practical examples.
"""Gaussian Process Regression - Usage Examples=============================================""" import numpy as np # Assuming GaussianProcessRegressor is imported from the previous code np.random.seed(42) # =============================================================================# Example 1: Basic Regression# =============================================================================print("Example 1: Basic Regression")print("=" * 50) # Generate synthetic datadef true_function(x): return np.sin(2 * x) + 0.5 * np.cos(4 * x) X_train = np.sort(np.random.uniform(-3, 3, 15)).reshape(-1, 1)y_train = true_function(X_train.flatten()) + 0.2 * np.random.randn(15) # Fit GPgp = GaussianProcessRegressor(optimize_hyperparameters=True, n_restarts=3)gp.fit(X_train, y_train) print(f"Optimized hyperparameters:")params = gp.get_params()for k, v in params.items(): print(f" {k}: {v:.4f}") # PredictX_test = np.linspace(-4, 4, 100).reshape(-1, 1)y_mean, y_std = gp.predict(X_test, return_std=True) print(f"\nPrediction range: [{y_mean.min():.3f}, {y_mean.max():.3f}]")print(f"Uncertainty range: [{y_std.min():.4f}, {y_std.max():.3f}]")print(f"Log marginal likelihood: {gp.log_marginal_likelihood():.2f}") # =============================================================================# Example 2: Posterior Sampling# =============================================================================print("\nExample 2: Posterior Sampling")print("=" * 50) # Sample functions from posteriorsamples = gp.sample_y(X_test, n_samples=5) print(f"Generated {samples.shape[1]} posterior function samples")print(f"Sample 1 range: [{samples[:, 0].min():.3f}, {samples[:, 0].max():.3f}]") # =============================================================================# Example 3: Interpolation (Low Noise)# =============================================================================print("\nExample 3: Interpolation (Low Noise)")print("=" * 50) # Create GP with very low noise for interpolationgp_interp = GaussianProcessRegressor( noise_variance=1e-10, optimize_hyperparameters=False, length_scale=1.0, signal_variance=1.0,)gp_interp.fit(X_train, y_train) y_at_train = gp_interp.predict(X_train)residuals = np.abs(y_at_train - y_train) print(f"Max residual at training points: {residuals.max():.2e}")print("(Should be near machine precision for interpolation)") # =============================================================================# Example 4: Multi-Dimensional Input# =============================================================================print("\nExample 4: Multi-Dimensional Input")print("=" * 50) # 2D Branin function (classic optimization benchmark)def branin(X): x1, x2 = X[:, 0], X[:, 1] a, b, c = 1, 5.1/(4*np.pi**2), 5/np.pi r, s, t = 6, 10, 1/(8*np.pi) return a*(x2 - b*x1**2 + c*x1 - r)**2 + s*(1-t)*np.cos(x1) + s # Generate training dataX_train_2d = np.random.uniform(low=[-5, 0], high=[10, 15], size=(30, 2))y_train_2d = branin(X_train_2d) + 10 * np.random.randn(30) # Fit GPgp_2d = GaussianProcessRegressor(optimize_hyperparameters=True, n_restarts=3)gp_2d.fit(X_train_2d, y_train_2d) print(f"2D GP hyperparameters:")for k, v in gp_2d.get_params().items(): print(f" {k}: {v:.4f}") # Predict on gridxx, yy = np.meshgrid(np.linspace(-5, 10, 20), np.linspace(0, 15, 20))X_grid = np.column_stack([xx.ravel(), yy.ravel()])y_pred, y_std_2d = gp_2d.predict(X_grid, return_std=True) print(f"2D prediction range: [{y_pred.min():.1f}, {y_pred.max():.1f}]")print(f"2D uncertainty range: [{y_std_2d.min():.2f}, {y_std_2d.max():.2f}]") # 
=============================================================================# Example 5: Extrapolation Behavior # =============================================================================print("\nExample 5: Extrapolation Behavior")print("=" * 50) # Fit GP on limited rangeX_limited = np.linspace(-2, 2, 10).reshape(-1, 1)y_limited = np.sin(X_limited.flatten()) gp_extrap = GaussianProcessRegressor( optimize_hyperparameters=False, length_scale=1.0, signal_variance=1.0, noise_variance=0.01,)gp_extrap.fit(X_limited, y_limited) # Predict in and out of training rangeX_in = np.array([[0.0]]) # In rangeX_out = np.array([[5.0]]) # Out of range y_in_mean, y_in_std = gp_extrap.predict(X_in, return_std=True)y_out_mean, y_out_std = gp_extrap.predict(X_out, return_std=True) print(f"In-range (x=0): mean={y_in_mean[0]:.3f}, std={y_in_std[0]:.4f}")print(f"Out-of-range (x=5): mean={y_out_mean[0]:.3f}, std={y_out_std[0]:.3f}")print("Note: Extrapolation reverts to prior (mean→0, std→signal_variance)")For many applications, pointwise variances are insufficient—you need the full posterior covariance matrix. This section details when and how to work with $\Sigma_*$.
When You Need Full Covariance
Sampling functions: To draw coherent function samples, you must sample from the $m$-dimensional Gaussian $\mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$.
Functional uncertainty: Uncertainty about integrals, maxima, or other functionals of $f$ depends on correlations.
Batch Bayesian optimization: Acquiring multiple points simultaneously requires the joint distribution.
Optimal experimental design: Choosing informative observation locations uses the predictive covariance.
Computing Full Covariance
$$\Sigma_* = K(\mathbf{X}_*, \mathbf{X}_*) - K_*^\top K_y^{-1} K_* = K_{**} - \mathbf{V}^\top \mathbf{V}$$
where $\mathbf{L} \mathbf{V} = K_*$ (triangular solve).
The full covariance Σ* is an m×m matrix. For m=10,000 test points, this is 100 million entries (~800 MB in float64). For large test sets, consider: (1) computing covariance only for subsets, (2) using low-rank approximations, (3) avoiding explicit storage via matrix-free methods.
Ensuring Covariance Validity
Due to numerical error, the computed $\Sigma_*$ may not be exactly symmetric or positive semi-definite. Common fixes:
Symmetrize: $\Sigma_* \leftarrow \frac{1}{2}(\Sigma_* + \Sigma_*^\top)$
Add jitter for sampling: When computing Cholesky of $\Sigma_*$ for sampling, add small jitter
Eigenvalue clipping: Set negative eigenvalues to zero (projection onto PSD cone)
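A sketch combining these fixes (the function name is assumed; the eigendecomposition costs $\mathcal{O}(m^3)$, so reserve clipping for cases where symmetrization and jitter are not enough):

```python
import numpy as np

def repair_covariance(Sigma, jitter=0.0):
    """Symmetrize, clip negative eigenvalues, and optionally add jitter for sampling."""
    Sigma = 0.5 * (Sigma + Sigma.T)                       # fix 1: symmetrize
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    eigvals = np.clip(eigvals, 0.0, None)                 # fix 3: project onto the PSD cone
    Sigma_psd = eigvecs @ np.diag(eigvals) @ eigvecs.T
    return Sigma_psd + jitter * np.eye(Sigma.shape[0])    # fix 2: jitter before a Cholesky
```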
Efficient Sampling from Full Covariance
To sample $\mathbf{f}_*$ from $\mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$: compute the Cholesky factor $\mathbf{L}_*$ of $\Sigma_* + \epsilon \mathbf{I}$, draw $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and set $\mathbf{f}_* = \boldsymbol{\mu}_* + \mathbf{L}_* \mathbf{z}$.
For many samples, reuse $L_*$ and generate multiple $\mathbf{z}$ vectors.
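A sketch of that recipe, reusing one Cholesky factor for all draws (function name, seed, and jitter value are illustrative):

```python
import numpy as np
from scipy.linalg import cholesky

def sample_posterior(mu_star, Sigma_star, n_samples=10, jitter=1e-8, seed=0):
    """Draw coherent function samples f = mu + L_* z with z ~ N(0, I)."""
    m = len(mu_star)
    L_star = cholesky(Sigma_star + jitter * np.eye(m), lower=True)  # O(m^3), paid once
    z = np.random.default_rng(seed).standard_normal((m, n_samples))
    return mu_star[:, None] + L_star @ z                            # shape (m, n_samples)
```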
The closed-form solution assumes hyperparameters are known. In practice, they must be learned from data.
The Marginal Likelihood Objective
$$\log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{n}{2} \log(2\pi)$$
This objective balances a data-fit term, $-\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y}$, which rewards explaining the observations, against a complexity penalty, $-\frac{1}{2} \log |K_y|$, which penalizes overly flexible kernels; the constant $-\frac{n}{2} \log(2\pi)$ does not depend on the hyperparameters.
Gradient-Based Optimization
The gradients with respect to hyperparameters are:
$$\frac{\partial \log p}{\partial \theta_j} = \frac{1}{2} \text{tr}\left( (\boldsymbol{\alpha} \boldsymbol{\alpha}^\top - K_y^{-1}) \frac{\partial K_y}{\partial \theta_j} \right)$$
where $\boldsymbol{\alpha} = K_y^{-1} \mathbf{y}$. Computing the trace requires $\mathcal{O}(n^2)$ operations per hyperparameter.
If the full K_y⁻¹ is needed for the gradients, obtain it from the Cholesky factor (e.g., cho_solve against the identity) rather than an explicit matrix inverse, and cache it: the same K_y⁻¹ is reused for every hyperparameter's gradient within one optimization step. The trace tr((ααᵀ - K_y⁻¹) ∂K_y/∂θ) can then be computed as Σᵢⱼ (αᵢαⱼ - [K_y⁻¹]ᵢⱼ) [∂K_y/∂θ]ᵢⱼ using element-wise operations.
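As an illustration, here is a sketch of the gradient with respect to the RBF length scale, using the element-wise form of the trace (the kernel parameterization matches the reference implementation above; the helper names are assumptions):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel_and_sqdist(X, ell, sf2):
    """Return the RBF kernel matrix and the matrix of squared distances."""
    sq = np.sum(X**2, axis=1)[:, None] - 2 * X @ X.T + np.sum(X**2, axis=1)
    return sf2 * np.exp(-0.5 * sq / ell**2), sq

def lml_grad_length_scale(X, y, ell, sf2, sn2, jitter=1e-8):
    """d log p(y) / d ell via 0.5 * tr((alpha alpha^T - K_y^{-1}) dK_y/d ell)."""
    n = len(y)
    K, sq = rbf_kernel_and_sqdist(X, ell, sf2)
    K_y = K + (sn2 + jitter) * np.eye(n)
    c, low = cho_factor(K_y, lower=True)
    alpha = cho_solve((c, low), y)
    K_inv = cho_solve((c, low), np.eye(n))   # one solve against I, reusable across hyperparameters
    dK_dell = K * sq / ell**3                # for the RBF kernel: dk/d ell = k * r^2 / ell^3
    return 0.5 * np.sum((np.outer(alpha, alpha) - K_inv) * dK_dell)
```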
Practical Optimization Strategy
Initialize reasonably: Use data-derived initial guesses (e.g., $\ell$ = median pairwise distance, $\sigma_f^2$ = variance of $y$, $\sigma_n^2$ = 0.1 × $\sigma_f^2$); see the sketch after this list
Work in log-space: Optimize $\log \theta$ rather than $\theta$ for positive parameters
Use bounds: Prevent extreme values (e.g., $\ell \in [0.01c, 100c]$ where $c$ is characteristic input scale)
Multiple restarts: Run optimization from 3-10 random starting points, keep best
Check convergence: Verify the optimizer converged and didn't hit bounds
Validate on held-out data: If possible, check predictions on validation set
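A sketch of data-derived initialization and log-space bounds following steps 1-3 (the function names, fractions, and ranges are heuristics for illustration, not fixed rules):

```python
import numpy as np
from scipy.spatial.distance import pdist

def initial_hyperparameters(X, y):
    """Heuristic starting values for (length_scale, signal_variance, noise_variance)."""
    length_scale = np.median(pdist(X))        # median pairwise distance between inputs
    signal_variance = np.var(y)               # prior amplitude ~ variance of the targets
    noise_variance = 0.1 * signal_variance    # assume ~10% of the signal variance is noise
    return length_scale, signal_variance, noise_variance

def log_space_bounds(length_scale, scale_range=100.0):
    """Bounds on log(theta): length scale within [l/100, 100*l] of its initial value."""
    lo, hi = np.log(length_scale / scale_range), np.log(length_scale * scale_range)
    return [(lo, hi), (-10.0, 10.0), (-15.0, 5.0)]  # log ell, log sigma_f^2, log sigma_n^2
```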
We have now covered the complete theory and practice of Gaussian Process regression. Let us consolidate the key points:
| Quantity | Formula | Purpose |
|---|---|---|
| Predictive mean | $\boldsymbol{\mu}_* = K_*^\top (K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$ | Best prediction |
| Predictive variance | $\sigma_*^2 = k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$ | Uncertainty (latent $f$) |
| Observation variance | $\sigma_{y*}^2 = \sigma_*^2 + \sigma_n^2$ | Uncertainty (noisy $y$) |
| Log marginal likelihood | $-\tfrac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \tfrac{1}{2} \log\det K_y - \tfrac{n}{2} \log(2\pi)$ | Hyperparameter optimization |
Module Complete!
You have now worked through the fundamentals of GP regression: the closed-form predictive equations, their computational costs and scaling limits, numerically stable Cholesky-based inference, hyperparameter learning via the marginal likelihood, and a complete reference implementation.
With this foundation, you are ready to apply GP regression to real problems and to explore advanced topics: kernel design, scalable approximations, and GP classification.
Congratulations! You now have complete mastery of Gaussian Process regression—from theoretical foundations through practical implementation. The closed-form solution you've learned is the foundation for advanced GP methods and a powerful tool for probabilistic regression with principled uncertainty quantification.