We have developed all the theoretical components of Gaussian Process regression across the preceding pages. Now we bring everything together into a complete, closed-form algorithm that you can implement and apply to real problems.
The closed-form solution is what makes GP regression uniquely elegant among flexible regression methods. Unlike neural networks requiring iterative optimization, or tree ensembles requiring greedy splitting, the GP solution can be written in a few matrix equations. Given training data and hyperparameters, there's one correct answer—no randomness, no convergence issues, no local optima in the inference procedure itself.
This page presents the complete algorithm, addresses all computational details, provides a production-quality reference implementation, and discusses the practical considerations that distinguish textbook GPs from real-world deployments.
By the end of this page, you will: (1) Have the complete GP regression algorithm ready to implement, (2) Understand all computational details and efficiency considerations, (3) Know how to handle edge cases and numerical issues, (4) Have a production-quality reference implementation, (5) Be prepared to apply GPs to real-world problems.
Here is the complete GP regression algorithm, from inputs to predictions:
Inputs: training inputs $\mathbf{X} \in \mathbb{R}^{n \times d}$, training targets $\mathbf{y} \in \mathbb{R}^n$, test inputs $\mathbf{X}_* \in \mathbb{R}^{m \times d}$, a kernel function $k(\cdot, \cdot)$ with its hyperparameters, noise variance $\sigma_n^2$, and a small jitter $\epsilon_{\text{jitter}}$.
Algorithm:
Compute training covariance: $K = k(\mathbf{X}, \mathbf{X}) \in \mathbb{R}^{n \times n}$
Add noise and jitter: $K_y = K + (\sigma_n^2 + \epsilon_{\text{jitter}}) \mathbf{I}$
Cholesky decomposition: $K_y = \mathbf{L} \mathbf{L}^\top$ where $\mathbf{L}$ is lower triangular
Compute dual coefficients: Solve $\mathbf{L} \mathbf{z} = \mathbf{y}$, then $\mathbf{L}^\top \boldsymbol{\alpha} = \mathbf{z}$. Result: $\boldsymbol{\alpha} = K_y^{-1} \mathbf{y}$
Compute cross-covariance: $K_* = k(\mathbf{X}, \mathbf{X}_*) \in \mathbb{R}^{n \times m}$
Predictive mean: $\boldsymbol{\mu}_* = K_*^\top \boldsymbol{\alpha}$
Solve for variance computation: $\mathbf{L} \mathbf{V} = K_*$ → $\mathbf{V} \in \mathbb{R}^{n \times m}$
Prior variance at test points: $\mathbf{k}_{**} = \text{diag}(k(\mathbf{X}_*, \mathbf{X}_*)) \in \mathbb{R}^m$
Predictive variance of latent function: $\boldsymbol{\sigma}_*^2 = \mathbf{k}_{**} - \text{colsum}(\mathbf{V}^2)$
Predictive variance of observations: $\boldsymbol{\sigma}_{y*}^2 = \boldsymbol{\sigma}_*^2 + \sigma_n^2$
Outputs: predictive mean $\boldsymbol{\mu}_* \in \mathbb{R}^m$, latent-function variance $\boldsymbol{\sigma}_*^2 \in \mathbb{R}^m$, and observation variance $\boldsymbol{\sigma}_{y*}^2 \in \mathbb{R}^m$ at the test points.
The core results, boxed for reference:
$$\boldsymbol{\mu}_* = K_*^\top (K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} = K_*^\top \boldsymbol{\alpha}$$
$$\sigma_*^2(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^\top (K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$$
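The ten steps above map directly onto a few NumPy calls. The sketch below is a minimal illustration under assumed hyperparameters (an RBF kernel with unit length scale and signal variance, noise variance 0.01); the production-quality implementation appears later on this page.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(A, B, length_scale=1.0, signal_variance=1.0):
    """Illustrative RBF kernel: k(a, b) = sigma_f^2 * exp(-||a - b||^2 / (2 l^2))."""
    sq_dist = np.sum(A**2, axis=1)[:, None] - 2 * A @ B.T + np.sum(B**2, axis=1)
    return signal_variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(X, y, X_star, noise_variance=1e-2, jitter=1e-8):
    """Steps 1-10 of the algorithm above, for fixed (assumed) hyperparameters."""
    n = len(y)
    K = rbf_kernel(X, X)                                      # 1: training covariance
    K_y = K + (noise_variance + jitter) * np.eye(n)           # 2: add noise and jitter
    L = cholesky(K_y, lower=True)                             # 3: Cholesky factor
    z = solve_triangular(L, y, lower=True)                    # 4: L z = y
    alpha = solve_triangular(L.T, z, lower=False)             #    L^T alpha = z
    K_star = rbf_kernel(X, X_star)                            # 5: cross-covariance
    mu = K_star.T @ alpha                                     # 6: predictive mean
    V = solve_triangular(L, K_star, lower=True)               # 7: L V = K_star
    k_ss = np.diag(rbf_kernel(X_star, X_star))                # 8: prior variances at X_star
    var_f = np.maximum(k_ss - np.sum(V**2, axis=0), 0.0)      # 9: latent variance (clamped)
    var_y = var_f + noise_variance                            # 10: observation variance
    return mu, var_f, var_y

# Toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3, 3, 50).reshape(-1, 1)
mu, var_f, var_y = gp_predict(X, y, X_star)
```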
Understanding the computational costs enables informed decisions about problem scale and approximation methods.
Training Phase (steps 1-4):
| Step | Operation | Time | Space |
|---|---|---|---|
| 1 | Kernel matrix $K$ | $\mathcal{O}(n^2 d)$ | $\mathcal{O}(n^2)$ |
| 2 | Add noise (in-place) | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ |
| 3 | Cholesky decomposition | $\mathcal{O}(n^3/3)$ | $\mathcal{O}(n^2)$ |
| 4 | Triangular solves | $\mathcal{O}(n^2)$ | $\mathcal{O}(n)$ |
| 1-4 | **Total training** | $\mathcal{O}(n^3)$ | $\mathcal{O}(n^2)$ |
Prediction Phase (steps 5-10):
| Step | Operation | Time | Space |
|---|---|---|---|
| 5 | Cross-covariance $K_*$ | $\mathcal{O}(nmd)$ | $\mathcal{O}(nm)$ |
| 6 | Mean prediction | $\mathcal{O}(nm)$ | $\mathcal{O}(m)$ |
| 7 | Triangular solve for $V$ | $\mathcal{O}(n^2 m)$ | $\mathcal{O}(nm)$ |
| 8 | Prior variances | $\mathcal{O}(md)$ | $\mathcal{O}(m)$ |
| 9 | Posterior variances | $\mathcal{O}(nm)$ | $\mathcal{O}(m)$ |
| 10 | Add noise to variance | $\mathcal{O}(m)$ | $\mathcal{O}(m)$ |
| 5-10 | **Total prediction** | $\mathcal{O}(n^2 m)$ | $\mathcal{O}(nm)$ |
The O(n³) Cholesky decomposition and O(n²) memory for K are the fundamental limits of exact GP regression. For n > 10,000-50,000 (depending on hardware and memory), you need sparse/approximate methods like inducing points, random features, or structured approximations.
Practical Scaling Guidelines
| Training Size $n$ | Feasibility | Notes |
|---|---|---|
| $n < 1,000$ | Easy | Runs in seconds on any modern hardware |
| $1,000 < n < 10,000$ | Moderate | Seconds to minutes; memory becomes relevant |
| $10,000 < n < 50,000$ | Challenging | Minutes; requires careful memory management |
| $n > 50,000$ | Requires approximations | Use sparse GPs or variational methods |
Amortization Through Precomputation
Once $\mathbf{L}$ and $\boldsymbol{\alpha}$ are computed (training), predictions at new points only require the cross-covariance $K_*$ ($\mathcal{O}(nmd)$), one matrix-vector product for the mean ($\mathcal{O}(nm)$), and one triangular solve for the variance ($\mathcal{O}(n^2 m)$).
For many test points, you can make predictions efficiently after the one-time $\mathcal{O}(n^3)$ training cost.
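A sketch of this amortization pattern, assuming a toy RBF setup (the kernel, data sizes, and hyperparameter values are placeholders for illustration):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def rbf_kernel(A, B, length_scale=1.0, signal_variance=1.0):
    sq = np.sum(A**2, axis=1)[:, None] - 2 * A @ B.T + np.sum(B**2, axis=1)
    return signal_variance * np.exp(-0.5 * sq / length_scale**2)

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, (500, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(500)

# --- Training phase: O(n^3), paid once ---
K_y = rbf_kernel(X_train, X_train) + (0.01 + 1e-8) * np.eye(500)
L = cholesky(K_y, lower=True)
alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True), lower=False)

# --- Prediction phase: every batch of test points reuses L and alpha ---
for _ in range(3):
    X_star = rng.uniform(-3, 3, (100, 1))
    K_star = rbf_kernel(X_train, X_star)            # O(nmd)
    mu = K_star.T @ alpha                           # O(nm)
    V = solve_triangular(L, K_star, lower=True)     # O(n^2 m)
    var_f = 1.0 - np.sum(V**2, axis=0)              # prior variance = signal_variance = 1.0
```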
Numerical stability is critical for reliable GP implementations. Here are the key considerations and solutions.
Never Compute $K^{-1}$ Explicitly
The expression $(K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$ appears throughout the GP equations, but you should never form this inverse as an explicit matrix. Instead, factor $K_y = \mathbf{L}\mathbf{L}^\top$ once and obtain $\boldsymbol{\alpha}$ via two triangular solves: $\mathbf{L}\mathbf{z} = \mathbf{y}$, then $\mathbf{L}^\top \boldsymbol{\alpha} = \mathbf{z}$.
This is more stable and often faster.
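A minimal comparison on a synthetic positive-definite matrix (the matrix and sizes are arbitrary; the point is the `cho_factor`/`cho_solve` pattern rather than the specific numbers):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5))
K_y = 0.01 * A @ A.T + np.eye(500)      # SPD matrix standing in for K + sigma_n^2 I
y = rng.standard_normal(500)

# Avoid: explicit inverse (slower and less numerically stable)
alpha_bad = np.linalg.inv(K_y) @ y

# Prefer: factorize once, then solve
c, low = cho_factor(K_y, lower=True)
alpha_good = cho_solve((c, low), y)

print(np.max(np.abs(alpha_bad - alpha_good)))  # small here, but the solve route is the stable one
```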
Positive Definiteness Guarantees
The Cholesky decomposition fails if $K_y$ is not positive definite. Ensure this by using a valid (positive semi-definite) kernel, adding the noise variance $\sigma_n^2$ to the diagonal, and adding a small jitter $\epsilon_{\text{jitter}}$ (e.g., $10^{-8}$) when the noise is very small.
Monitor the condition number κ(K) = λ_max/λ_min. High condition numbers (>10¹²) indicate numerical problems. Remedies: increase jitter, adjust length scale, standardize inputs.
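One common remedy is to grow the jitter until the matrix is well conditioned. A heuristic sketch (the threshold, growth factor, and retry count are assumptions, and `np.linalg.cond` itself costs roughly $\mathcal{O}(n^3)$, so treat this as a diagnostic rather than something to run on every fit):

```python
import numpy as np

def add_jitter_until_stable(K_y, max_condition=1e12, jitter=1e-8, max_tries=8):
    """Add increasing jitter to the diagonal until cond(K_y) drops below the threshold."""
    K_stab = K_y.copy()
    for _ in range(max_tries):
        if np.linalg.cond(K_stab) < max_condition:
            break
        K_stab += jitter * np.eye(K_stab.shape[0])
        jitter *= 10.0
    return K_stab
```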
Log-Likelihood Computation
The log marginal likelihood is: $$\log p(\mathbf{y}) = -\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{n}{2} \log(2\pi)$$
Compute this stably as: $$\log p(\mathbf{y}) = -\frac{1}{2} \mathbf{y}^\top \boldsymbol{\alpha} - \sum_{i=1}^{n} \log L_{ii} - \frac{n}{2} \log(2\pi)$$
where $\mathbf{L}$ is the Cholesky factor. The log-determinant $\log |K_y| = 2 \sum_i \log L_{ii}$ avoids overflow.
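As a standalone sketch of this computation (the function name and argument convention are illustrative; the reference implementation below folds the same arithmetic into its `fit` and `log_marginal_likelihood` methods):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_marginal_likelihood(K_y, y):
    """Stable log p(y) via the Cholesky factor: no explicit inverse or determinant."""
    L = cholesky(K_y, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)
    return (-0.5 * y @ alpha                       # data-fit term
            - np.sum(np.log(np.diag(L)))           # = -0.5 * log|K_y|
            - 0.5 * len(y) * np.log(2 * np.pi))    # normalization constant
```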
Here is a complete, production-quality GP regression implementation that incorporates all the considerations we've discussed. This code is designed to be correct, numerically stable, and reasonably efficient.
"""Complete Gaussian Process Regression Implementation===================================================A production-quality implementation covering all aspects of GP regression.""" import numpy as npfrom scipy.linalg import cholesky, solve_triangular, cho_solvefrom scipy.optimize import minimizefrom typing import Tuple, Optional, Callable, Dict, Any class GaussianProcessRegressor: """ Gaussian Process Regressor with full functionality. Features: - RBF (Squared Exponential) kernel with customizable parameters - Automatic hyperparameter optimization via marginal likelihood - Numerically stable Cholesky-based inference - Both pointwise and full covariance predictions - Proper handling of noise and jitter """ def __init__( self, kernel: str = 'rbf', length_scale: float = 1.0, signal_variance: float = 1.0, noise_variance: float = 1e-6, jitter: float = 1e-8, optimize_hyperparameters: bool = True, n_restarts: int = 5, ): """ Initialize the GP regressor. Parameters: ----------- kernel : str Kernel type. Currently supports 'rbf'. length_scale : float Initial length scale for the kernel. signal_variance : float Initial signal variance (kernel amplitude squared). noise_variance : float Observation noise variance. Set to small value for near-interpolation. jitter : float Small value added to diagonal for numerical stability. optimize_hyperparameters : bool Whether to optimize hyperparameters during fit. n_restarts : int Number of random restarts for optimization. """ self.kernel_type = kernel self.length_scale = length_scale self.signal_variance = signal_variance self.noise_variance = noise_variance self.jitter = jitter self.optimize_hyperparameters = optimize_hyperparameters self.n_restarts = n_restarts # Fitted state self._is_fitted = False self._X_train = None self._y_train = None self._L = None # Cholesky factor self._alpha = None # Dual coefficients def _kernel(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray: """ Compute the kernel matrix between X1 and X2. """ if self.kernel_type == 'rbf': # Squared distances: ||x1 - x2||^2 sq_dist = ( np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1) ) return self.signal_variance * np.exp(-0.5 * sq_dist / self.length_scale**2) else: raise ValueError(f"Unknown kernel: {self.kernel_type}") def _compute_kernel_with_params( self, X1: np.ndarray, X2: np.ndarray, length_scale: float, signal_variance: float ) -> np.ndarray: """Compute kernel with specific hyperparameters.""" sq_dist = ( np.sum(X1**2, axis=1, keepdims=True) - 2 * X1 @ X2.T + np.sum(X2**2, axis=1) ) return signal_variance * np.exp(-0.5 * sq_dist / length_scale**2) def _neg_log_marginal_likelihood( self, log_theta: np.ndarray, X: np.ndarray, y: np.ndarray ) -> float: """ Compute negative log marginal likelihood for optimization. 
Parameters: ----------- log_theta : array [log(length_scale), log(signal_variance), log(noise_variance)] """ length_scale = np.exp(log_theta[0]) signal_variance = np.exp(log_theta[1]) noise_variance = np.exp(log_theta[2]) n = len(y) # Compute kernel matrix K = self._compute_kernel_with_params(X, X, length_scale, signal_variance) K_y = K + (noise_variance + self.jitter) * np.eye(n) # Cholesky decomposition try: L = cholesky(K_y, lower=True) except np.linalg.LinAlgError: return 1e10 # Return large value if not positive definite # Solve for alpha alpha = cho_solve((L, True), y) # Log marginal likelihood components data_fit = -0.5 * y @ alpha complexity = -np.sum(np.log(np.diag(L))) # -0.5 * log|K_y| constant = -0.5 * n * np.log(2 * np.pi) return -(data_fit + complexity + constant) def fit(self, X: np.ndarray, y: np.ndarray) -> 'GaussianProcessRegressor': """ Fit the GP to training data. Parameters: ----------- X : array of shape (n_samples, n_features) Training inputs. y : array of shape (n_samples,) Training targets. Returns: -------- self : GaussianProcessRegressor Fitted estimator. """ X = np.atleast_2d(X) y = np.asarray(y).ravel() if X.shape[0] != len(y): raise ValueError("X and y must have same number of samples") # Store training data self._X_train = X.copy() self._y_train = y.copy() n = len(y) # Optimize hyperparameters if requested if self.optimize_hyperparameters and n > 1: best_nlml = np.inf best_theta = None for _ in range(self.n_restarts): # Random initialization in log-space theta0 = np.array([ np.log(self.length_scale) + 0.5 * np.random.randn(), np.log(self.signal_variance) + 0.5 * np.random.randn(), np.log(self.noise_variance) + 0.5 * np.random.randn(), ]) result = minimize( self._neg_log_marginal_likelihood, theta0, args=(X, y), method='L-BFGS-B', bounds=[(-10, 10), (-10, 10), (-15, 5)], ) if result.fun < best_nlml: best_nlml = result.fun best_theta = result.x if best_theta is not None: self.length_scale = np.exp(best_theta[0]) self.signal_variance = np.exp(best_theta[1]) self.noise_variance = np.exp(best_theta[2]) # Compute kernel matrix and decomposition K = self._kernel(X, X) K_y = K + (self.noise_variance + self.jitter) * np.eye(n) # Cholesky decomposition self._L = cholesky(K_y, lower=True) # Compute dual coefficients self._alpha = cho_solve((self._L, True), y) self._is_fitted = True return self def predict( self, X: np.ndarray, return_std: bool = False, return_cov: bool = False ) -> Tuple[np.ndarray, ...]: """ Make predictions at test points. Parameters: ----------- X : array of shape (n_samples, n_features) Test inputs. return_std : bool Whether to return standard deviation. return_cov : bool Whether to return full covariance matrix. Returns: -------- y_mean : array of shape (n_samples,) Predictive means. y_std : array of shape (n_samples,) (if return_std=True) Predictive standard deviations for latent function. y_cov : array of shape (n_samples, n_samples) (if return_cov=True) Predictive covariance for latent function. 
""" if not self._is_fitted: raise RuntimeError("Must call fit() before predict()") X = np.atleast_2d(X) # Cross-covariance K_star = self._kernel(self._X_train, X) # Predictive mean y_mean = K_star.T @ self._alpha result = [y_mean] if return_std or return_cov: # Solve L @ V = K_star V = solve_triangular(self._L, K_star, lower=True) if return_cov: # Full covariance K_ss = self._kernel(X, X) y_cov = K_ss - V.T @ V # Ensure symmetry y_cov = 0.5 * (y_cov + y_cov.T) result.append(y_cov) if return_std: # Pointwise variance K_ss_diag = self.signal_variance * np.ones(X.shape[0]) y_var = K_ss_diag - np.sum(V**2, axis=0) # Clamp to avoid numerical issues y_var = np.maximum(y_var, 0) result.append(np.sqrt(y_var)) return tuple(result) if len(result) > 1 else result[0] def sample_y(self, X: np.ndarray, n_samples: int = 1) -> np.ndarray: """ Sample functions from the posterior distribution. Parameters: ----------- X : array of shape (n_points, n_features) Points at which to sample. n_samples : int Number of function samples. Returns: -------- samples : array of shape (n_points, n_samples) Function samples. """ y_mean, y_cov = self.predict(X, return_cov=True) # Cholesky of posterior covariance L_cov = cholesky(y_cov + self.jitter * np.eye(len(y_mean)), lower=True) # Sample z = np.random.randn(len(y_mean), n_samples) samples = y_mean[:, None] + L_cov @ z return samples def log_marginal_likelihood(self) -> float: """ Compute log marginal likelihood of the fitted model. """ if not self._is_fitted: raise RuntimeError("Must call fit() before log_marginal_likelihood()") n = len(self._y_train) data_fit = -0.5 * self._y_train @ self._alpha complexity = -np.sum(np.log(np.diag(self._L))) constant = -0.5 * n * np.log(2 * np.pi) return data_fit + complexity + constant def get_params(self) -> Dict[str, Any]: """Get current hyperparameters.""" return { 'length_scale': self.length_scale, 'signal_variance': self.signal_variance, 'noise_variance': self.noise_variance, }Let's demonstrate the GP implementation with several practical examples.
"""Gaussian Process Regression - Usage Examples=============================================""" import numpy as np # Assuming GaussianProcessRegressor is imported from the previous code np.random.seed(42) # =============================================================================# Example 1: Basic Regression# =============================================================================print("Example 1: Basic Regression")print("=" * 50) # Generate synthetic datadef true_function(x): return np.sin(2 * x) + 0.5 * np.cos(4 * x) X_train = np.sort(np.random.uniform(-3, 3, 15)).reshape(-1, 1)y_train = true_function(X_train.flatten()) + 0.2 * np.random.randn(15) # Fit GPgp = GaussianProcessRegressor(optimize_hyperparameters=True, n_restarts=3)gp.fit(X_train, y_train) print(f"Optimized hyperparameters:")params = gp.get_params()for k, v in params.items(): print(f" {k}: {v:.4f}") # PredictX_test = np.linspace(-4, 4, 100).reshape(-1, 1)y_mean, y_std = gp.predict(X_test, return_std=True) print(f"\nPrediction range: [{y_mean.min():.3f}, {y_mean.max():.3f}]")print(f"Uncertainty range: [{y_std.min():.4f}, {y_std.max():.3f}]")print(f"Log marginal likelihood: {gp.log_marginal_likelihood():.2f}") # =============================================================================# Example 2: Posterior Sampling# =============================================================================print("\nExample 2: Posterior Sampling")print("=" * 50) # Sample functions from posteriorsamples = gp.sample_y(X_test, n_samples=5) print(f"Generated {samples.shape[1]} posterior function samples")print(f"Sample 1 range: [{samples[:, 0].min():.3f}, {samples[:, 0].max():.3f}]") # =============================================================================# Example 3: Interpolation (Low Noise)# =============================================================================print("\nExample 3: Interpolation (Low Noise)")print("=" * 50) # Create GP with very low noise for interpolationgp_interp = GaussianProcessRegressor( noise_variance=1e-10, optimize_hyperparameters=False, length_scale=1.0, signal_variance=1.0,)gp_interp.fit(X_train, y_train) y_at_train = gp_interp.predict(X_train)residuals = np.abs(y_at_train - y_train) print(f"Max residual at training points: {residuals.max():.2e}")print("(Should be near machine precision for interpolation)") # =============================================================================# Example 4: Multi-Dimensional Input# =============================================================================print("\nExample 4: Multi-Dimensional Input")print("=" * 50) # 2D Branin function (classic optimization benchmark)def branin(X): x1, x2 = X[:, 0], X[:, 1] a, b, c = 1, 5.1/(4*np.pi**2), 5/np.pi r, s, t = 6, 10, 1/(8*np.pi) return a*(x2 - b*x1**2 + c*x1 - r)**2 + s*(1-t)*np.cos(x1) + s # Generate training dataX_train_2d = np.random.uniform(low=[-5, 0], high=[10, 15], size=(30, 2))y_train_2d = branin(X_train_2d) + 10 * np.random.randn(30) # Fit GPgp_2d = GaussianProcessRegressor(optimize_hyperparameters=True, n_restarts=3)gp_2d.fit(X_train_2d, y_train_2d) print(f"2D GP hyperparameters:")for k, v in gp_2d.get_params().items(): print(f" {k}: {v:.4f}") # Predict on gridxx, yy = np.meshgrid(np.linspace(-5, 10, 20), np.linspace(0, 15, 20))X_grid = np.column_stack([xx.ravel(), yy.ravel()])y_pred, y_std_2d = gp_2d.predict(X_grid, return_std=True) print(f"2D prediction range: [{y_pred.min():.1f}, {y_pred.max():.1f}]")print(f"2D uncertainty range: [{y_std_2d.min():.2f}, {y_std_2d.max():.2f}]") # 
=============================================================================# Example 5: Extrapolation Behavior # =============================================================================print("\nExample 5: Extrapolation Behavior")print("=" * 50) # Fit GP on limited rangeX_limited = np.linspace(-2, 2, 10).reshape(-1, 1)y_limited = np.sin(X_limited.flatten()) gp_extrap = GaussianProcessRegressor( optimize_hyperparameters=False, length_scale=1.0, signal_variance=1.0, noise_variance=0.01,)gp_extrap.fit(X_limited, y_limited) # Predict in and out of training rangeX_in = np.array([[0.0]]) # In rangeX_out = np.array([[5.0]]) # Out of range y_in_mean, y_in_std = gp_extrap.predict(X_in, return_std=True)y_out_mean, y_out_std = gp_extrap.predict(X_out, return_std=True) print(f"In-range (x=0): mean={y_in_mean[0]:.3f}, std={y_in_std[0]:.4f}")print(f"Out-of-range (x=5): mean={y_out_mean[0]:.3f}, std={y_out_std[0]:.3f}")print("Note: Extrapolation reverts to prior (mean→0, std→signal_variance)")For many applications, pointwise variances are insufficient—you need the full posterior covariance matrix. This section details when and how to work with $\Sigma_*$.
When You Need Full Covariance
Sampling functions: To draw coherent function samples, you must sample from the $m$-dimensional Gaussian $\mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$.
Functional uncertainty: Uncertainty about integrals, maxima, or other functionals of $f$ depends on correlations.
Batch Bayesian optimization: Acquiring multiple points simultaneously requires the joint distribution.
Optimal experimental design: Choosing informative observation locations uses the predictive covariance.
Computing Full Covariance
$$\Sigma_* = K(\mathbf{X}_*, \mathbf{X}_*) - K_*^\top K_y^{-1} K_* = K_{**} - \mathbf{V}^\top \mathbf{V}$$
where $\mathbf{L} \mathbf{V} = K_*$ (triangular solve).
The full covariance Σ* is an m×m matrix. For m=10,000 test points, this is 100 million entries (~800 MB in float64). For large test sets, consider: (1) computing covariance only for subsets, (2) using low-rank approximations, (3) avoiding explicit storage via matrix-free methods.
Ensuring Covariance Validity
Due to numerical error, the computed $\Sigma_*$ may not be exactly symmetric or positive semi-definite. Common fixes:
Symmetrize: $\Sigma_* \leftarrow \frac{1}{2}(\Sigma_* + \Sigma_*^\top)$
Add jitter for sampling: When computing Cholesky of $\Sigma_*$ for sampling, add small jitter
Eigenvalue clipping: Set negative eigenvalues to zero (projection onto PSD cone)
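A sketch combining these fixes (the function name is assumed; the eigendecomposition costs $\mathcal{O}(m^3)$, so reserve clipping for cases where symmetrization and jitter are not enough):

```python
import numpy as np

def repair_covariance(Sigma, jitter=0.0):
    """Symmetrize, clip negative eigenvalues, and optionally add jitter for sampling."""
    Sigma = 0.5 * (Sigma + Sigma.T)                       # fix 1: symmetrize
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    eigvals = np.clip(eigvals, 0.0, None)                 # fix 3: project onto the PSD cone
    Sigma_psd = eigvecs @ np.diag(eigvals) @ eigvecs.T
    return Sigma_psd + jitter * np.eye(Sigma.shape[0])    # fix 2: jitter before a Cholesky
```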
Efficient Sampling from Full Covariance
To sample $\mathbf{f}_*$ from $\mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)$: compute the Cholesky factor $\mathbf{L}_*$ of $\Sigma_* + \epsilon \mathbf{I}$, draw $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and set $\mathbf{f}_* = \boldsymbol{\mu}_* + \mathbf{L}_* \mathbf{z}$.
For many samples, reuse $L_*$ and generate multiple $\mathbf{z}$ vectors.
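A sketch of that recipe, reusing one Cholesky factor for all draws (function name, seed, and jitter value are illustrative):

```python
import numpy as np
from scipy.linalg import cholesky

def sample_posterior(mu_star, Sigma_star, n_samples=10, jitter=1e-8, seed=0):
    """Draw coherent function samples f = mu + L_* z with z ~ N(0, I)."""
    m = len(mu_star)
    L_star = cholesky(Sigma_star + jitter * np.eye(m), lower=True)  # O(m^3), paid once
    z = np.random.default_rng(seed).standard_normal((m, n_samples))
    return mu_star[:, None] + L_star @ z                            # shape (m, n_samples)
```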
The closed-form solution assumes hyperparameters are known. In practice, they must be learned from data.
The Marginal Likelihood Objective
$$\log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{n}{2} \log(2\pi)$$
This objective balances a data-fit term, $-\frac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y}$, which rewards explaining the observations, against a complexity penalty, $-\frac{1}{2} \log |K_y|$, which penalizes overly flexible kernels; the constant $-\frac{n}{2} \log(2\pi)$ does not depend on the hyperparameters.
Gradient-Based Optimization
The gradients with respect to hyperparameters are:
$$\frac{\partial \log p}{\partial \theta_j} = \frac{1}{2} \text{tr}\left( (\boldsymbol{\alpha} \boldsymbol{\alpha}^\top - K_y^{-1}) \frac{\partial K_y}{\partial \theta_j} \right)$$
where $\boldsymbol{\alpha} = K_y^{-1} \mathbf{y}$. Computing the trace requires $\mathcal{O}(n^2)$ operations per hyperparameter.
If the full K_y⁻¹ is needed for the gradients, obtain it from the Cholesky factor (e.g., cho_solve against the identity) rather than an explicit matrix inverse, and cache it: the same K_y⁻¹ is reused for every hyperparameter's gradient within one optimization step. The trace tr((ααᵀ - K_y⁻¹) ∂K_y/∂θ) can then be computed as Σᵢⱼ (αᵢαⱼ - [K_y⁻¹]ᵢⱼ) [∂K_y/∂θ]ᵢⱼ using element-wise operations.
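As an illustration, here is a sketch of the gradient with respect to the RBF length scale, using the element-wise form of the trace (the kernel parameterization matches the reference implementation above; the helper names are assumptions):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel_and_sqdist(X, ell, sf2):
    """Return the RBF kernel matrix and the matrix of squared distances."""
    sq = np.sum(X**2, axis=1)[:, None] - 2 * X @ X.T + np.sum(X**2, axis=1)
    return sf2 * np.exp(-0.5 * sq / ell**2), sq

def lml_grad_length_scale(X, y, ell, sf2, sn2, jitter=1e-8):
    """d log p(y) / d ell via 0.5 * tr((alpha alpha^T - K_y^{-1}) dK_y/d ell)."""
    n = len(y)
    K, sq = rbf_kernel_and_sqdist(X, ell, sf2)
    K_y = K + (sn2 + jitter) * np.eye(n)
    c, low = cho_factor(K_y, lower=True)
    alpha = cho_solve((c, low), y)
    K_inv = cho_solve((c, low), np.eye(n))   # one solve against I, reusable across hyperparameters
    dK_dell = K * sq / ell**3                # for the RBF kernel: dk/d ell = k * r^2 / ell^3
    return 0.5 * np.sum((np.outer(alpha, alpha) - K_inv) * dK_dell)
```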
Practical Optimization Strategy
Initialize reasonably: Use data-derived initial guesses (e.g., $\ell$ = median pairwise distance, $\sigma_f^2$ = variance of $y$, $\sigma_n^2$ = 0.1 × $\sigma_f^2$); see the sketch after this list
Work in log-space: Optimize $\log \theta$ rather than $\theta$ for positive parameters
Use bounds: Prevent extreme values (e.g., $\ell \in [0.01c, 100c]$ where $c$ is characteristic input scale)
Multiple restarts: Run optimization from 3-10 random starting points, keep best
Check convergence: Verify the optimizer converged and didn't hit bounds
Validate on held-out data: If possible, check predictions on validation set
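A sketch of data-derived initialization and log-space bounds following steps 1-3 (the function names, fractions, and ranges are heuristics for illustration, not fixed rules):

```python
import numpy as np
from scipy.spatial.distance import pdist

def initial_hyperparameters(X, y):
    """Heuristic starting values for (length_scale, signal_variance, noise_variance)."""
    length_scale = np.median(pdist(X))        # median pairwise distance between inputs
    signal_variance = np.var(y)               # prior amplitude ~ variance of the targets
    noise_variance = 0.1 * signal_variance    # assume ~10% of the signal variance is noise
    return length_scale, signal_variance, noise_variance

def log_space_bounds(length_scale, scale_range=100.0):
    """Bounds on log(theta): length scale within [l/100, 100*l] of its initial value."""
    lo, hi = np.log(length_scale / scale_range), np.log(length_scale * scale_range)
    return [(lo, hi), (-10.0, 10.0), (-15.0, 5.0)]  # log ell, log sigma_f^2, log sigma_n^2
```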
We have now covered the complete theory and practice of Gaussian Process regression. Let us consolidate the key points:
| Quantity | Formula | Purpose |
|---|---|---|
| Predictive mean | $\boldsymbol{\mu}_* = K_*^\top (K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$ | Best prediction |
| Predictive variance | $\sigma_*^2 = k_{**} - \mathbf{k}_*^\top (K + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$ | Uncertainty (latent $f$) |
| Observation variance | $\sigma_{y*}^2 = \sigma_*^2 + \sigma_n^2$ | Uncertainty (noisy $y$) |
| Log marginal likelihood | $-\tfrac{1}{2} \mathbf{y}^\top K_y^{-1} \mathbf{y} - \tfrac{1}{2} \log\det K_y - \tfrac{n}{2} \log(2\pi)$ | Hyperparameter optimization |
Module Complete!
You have now worked through the fundamentals of GP regression: the closed-form predictive equations, their computational costs and scaling limits, numerically stable Cholesky-based inference, hyperparameter learning via the marginal likelihood, and a complete reference implementation.
With this foundation, you are ready to apply GP regression to real problems and to explore advanced topics: kernel design, scalable approximations, and GP classification.
Congratulations! You now have complete mastery of Gaussian Process regression—from theoretical foundations through practical implementation. The closed-form solution you've learned is the foundation for advanced GP methods and a powerful tool for probabilistic regression with principled uncertainty quantification.