While the marginal likelihood provides a principled Bayesian approach to hyperparameter selection, cross-validation (CV) offers a complementary perspective rooted in predictive performance. In some situations, CV may be preferable:
Model misspecification: When the true data-generating process doesn't match any GP prior, CV directly optimizes prediction error rather than model fit.
Non-probabilistic objectives: If you care about point predictions (RMSE, MAE) rather than probabilistic forecasts, CV can target these directly.
Validation of marginal likelihood: CV provides an independent check—if marginal likelihood and CV select very different hyperparameters, something warrants investigation.
Robustness: CV makes fewer assumptions about the noise distribution and prior correctness.
This page develops cross-validation theory for GP hyperparameter selection, with emphasis on efficient computation.
By the end of this page, you will understand leave-one-out cross-validation, its closed-form expression for GPs, k-fold cross-validation, efficient computation strategies, and when to prefer CV over marginal likelihood.
Leave-One-Out Cross-Validation (LOO-CV) evaluates model performance by training on $n-1$ points and predicting the held-out point, repeated for each observation.
The LOO Objective:
For hyperparameters $\boldsymbol{\theta}$, the LOO log-likelihood is:
$$\text{LOO}(\boldsymbol{\theta}) = \sum_{i=1}^n \log p(y_i | \mathbf{X}, \mathbf{y}_{-i}, \boldsymbol{\theta})$$
where $\mathbf{y}_{-i}$ denotes all observations except $y_i$. This measures the total predictive probability of held-out data across all leave-one-out splits.
Connection to Information Theory:
The negative LOO is related to the predictive information criterion:
$$\text{PIC} = -2 \sum_{i=1}^n \log p(y_i | \mathbf{y}_{-i}, \boldsymbol{\theta})$$
Minimizing PIC (or equivalently maximizing LOO) selects hyperparameters with best average predictive performance.
Naive Computation:
A direct approach would: for each $i = 1, \ldots, n$, remove observation $i$, refit the GP on the remaining $n-1$ points (an $O(n^3)$ factorization), and evaluate the predictive log-probability of the held-out point.
Total cost: $O(n \cdot n^3) = O(n^4)$—prohibitively expensive.
The GP Miracle:
Remarkably, for Gaussian Processes, LOO-CV has a closed-form expression that can be computed in $O(n^3)$—the same complexity as evaluating the marginal likelihood once!
Let's derive the closed-form leave-one-out expressions.
Setup:
With covariance matrix $\mathbf{K}_y = \mathbf{K} + \sigma_n^2 \mathbf{I}$ and its inverse $\mathbf{K}_y^{-1}$, define:
$$\boldsymbol{\alpha} = \mathbf{K}_y^{-1} \mathbf{y}$$
The Leave-One-Out Predictive Distribution:
For the $i$-th observation, training on all others gives predictive:
$$y_i \mid \mathbf{y}_{-i} \sim \mathcal{N}(\mu_{-i}, \sigma_{-i}^2)$$
Using the matrix inversion lemma and partitioned matrix identities:
$$\mu_{-i} = y_i - \frac{\alpha_i}{(\mathbf{K}_y^{-1})_{ii}}$$
$$\sigma_{-i}^2 = \frac{1}{(\mathbf{K}_y^{-1})_{ii}}$$
where $\alpha_i$ is the $i$-th component of $\boldsymbol{\alpha}$ and $(\mathbf{K}_y^{-1})_{ii}$ is the $i$-th diagonal element of the precision matrix.
Derivation Intuition:
The key insight is that removing one observation from a Gaussian is equivalent to 'conditioning away' its influence. The matrix inversion lemma allows expressing this removal without recomputing the entire inverse.
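Concretely, the key step can be sketched as follows (permuting the $i$-th point to the last position). Let $\mathbf{K}_{-i}$ denote $\mathbf{K}_y$ with row and column $i$ removed, $\mathbf{k}_i$ the $i$-th column of $\mathbf{K}_y$ with its $i$-th entry removed, and $k_{ii}$ the $i$-th diagonal entry. The usual GP conditional gives $\mu_{-i} = \mathbf{k}_i^\top \mathbf{K}_{-i}^{-1} \mathbf{y}_{-i}$ and $\sigma_{-i}^2 = k_{ii} - \mathbf{k}_i^\top \mathbf{K}_{-i}^{-1} \mathbf{k}_i$, while the block-inverse (Schur complement) identity gives
$$(\mathbf{K}_y^{-1})_{ii} = \frac{1}{k_{ii} - \mathbf{k}_i^\top \mathbf{K}_{-i}^{-1} \mathbf{k}_i} = \frac{1}{\sigma_{-i}^2}, \qquad \alpha_i = (\mathbf{K}_y^{-1}\mathbf{y})_i = \frac{y_i - \mathbf{k}_i^\top \mathbf{K}_{-i}^{-1} \mathbf{y}_{-i}}{\sigma_{-i}^2} = \frac{y_i - \mu_{-i}}{\sigma_{-i}^2}.$$
Rearranging these two identities yields the closed-form expressions for $\mu_{-i}$ and $\sigma_{-i}^2$ stated above.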
LOO Error:
The leave-one-out error for point $i$ is: $$e_i = y_i - \mu_{-i} = \frac{\alpha_i}{(\mathbf{K}_y^{-1})_{ii}}$$
This is the residual when predicting $y_i$ from all other points.
LOO Variance:
The predicted variance is simply the inverse of the precision: $$\sigma_{-i}^2 = \frac{1}{(\mathbf{K}_y^{-1})_{ii}}$$
LOO Log-Probability:
The log-probability of observing $y_i$ under the LOO predictive:
$$\log p(y_i | \mathbf{y}_{-i}) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma_{-i}^2) - \frac{e_i^2}{2\sigma_{-i}^2}$$
$$= -\frac{1}{2}\log(2\pi) + \frac{1}{2}\log\left((\mathbf{K}_y^{-1})_{ii}\right) - \frac{\alpha_i^2}{2(\mathbf{K}_y^{-1})_{ii}}$$
Once we have $\mathbf{K}_y^{-1}$ and $\boldsymbol{\alpha}$ (which we compute anyway for the marginal likelihood), all LOO quantities are available with just $O(n)$ additional operations. LOO-CV is essentially free given the matrix inverse!
Let's implement LOO-CV with attention to numerical stability and efficiency.
Complete LOO Algorithm: factorize $\mathbf{K}_y$ via Cholesky ($O(n^3)$), solve for $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$ ($O(n^2)$), obtain the diagonal of $\mathbf{K}_y^{-1}$ from the factor ($O(n^3)$), then form the LOO errors, variances, and log-likelihood ($O(n)$).
Total complexity: $O(n^3)$ — same as marginal likelihood!
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def leave_one_out_cv(
    y: np.ndarray,
    K_y: np.ndarray,
    return_details: bool = False
) -> dict:
    """
    Compute leave-one-out cross-validation metrics for GP.

    Parameters
    ----------
    y : (n,) array - Observations
    K_y : (n, n) array - Covariance matrix K + sigma_n^2 * I
    return_details : bool - Return per-point errors and variances

    Returns
    -------
    dict with LOO metrics
    """
    n = len(y)

    # Cholesky decomposition
    L, lower = cho_factor(K_y, lower=True)

    # Compute alpha = K_y^{-1} @ y
    alpha = cho_solve((L, lower), y)

    # Compute K_y^{-1} (need diagonal)
    K_y_inv = cho_solve((L, lower), np.eye(n))
    diag_K_y_inv = np.diag(K_y_inv)

    # LOO predictions
    loo_errors = alpha / diag_K_y_inv      # e_i = alpha_i / (K_y^{-1})_{ii}
    loo_variances = 1.0 / diag_K_y_inv     # sigma^2_{-i} = 1 / (K_y^{-1})_{ii}
    loo_means = y - loo_errors             # mu_{-i} = y_i - e_i

    # LOO log-likelihood (sum over all points)
    loo_log_likelihood = -0.5 * (
        n * np.log(2 * np.pi)
        - np.sum(np.log(diag_K_y_inv))     # = sum of log variances
        + np.sum(alpha * loo_errors)       # = sum e_i^2 / var_i
    )

    # Alternative objectives
    loo_mse = np.mean(loo_errors ** 2)
    loo_mae = np.mean(np.abs(loo_errors))
    loo_rmse = np.sqrt(loo_mse)

    # Standardized LOO (for calibration check)
    standardized_errors = loo_errors / np.sqrt(loo_variances)

    result = {
        'loo_log_likelihood': loo_log_likelihood,
        'loo_mse': loo_mse,
        'loo_rmse': loo_rmse,
        'loo_mae': loo_mae,
        'standardized_error_mean': np.mean(standardized_errors),
        'standardized_error_std': np.std(standardized_errors),
    }

    if return_details:
        result['loo_errors'] = loo_errors
        result['loo_variances'] = loo_variances
        result['loo_means'] = loo_means
        result['standardized_errors'] = standardized_errors

    return result
```

Interpreting LOO Metrics:
| Metric | Description | Good Values |
|---|---|---|
| LOO Log-Likelihood | Total log probability of held-out points | Higher is better |
| LOO MSE | Mean squared LOO error | Lower is better |
| LOO RMSE | Root mean squared LOO error | In units of $y$ |
| Standardized Error Mean | Should be near 0 if well-calibrated | $\approx 0$ |
| Standardized Error Std | Should be near 1 if well-calibrated | $\approx 1$ |
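As a sanity check, the closed-form identities can be verified against brute-force LOO, which refits on $n-1$ points for every observation. The sketch below assumes the `leave_one_out_cv` function above is in scope; the `rbf_kernel` helper, toy data, and hyperparameter values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
n, noise_var = 40, 0.05
X = np.sort(rng.uniform(0, 5, n))
y = np.sin(X) + np.sqrt(noise_var) * rng.standard_normal(n)

K_y = rbf_kernel(X, X, lengthscale=0.8, variance=1.0) + noise_var * np.eye(n)

# Closed-form LOO: one O(n^3) factorization in total
fast = leave_one_out_cv(y, K_y, return_details=True)

# Brute-force LOO: refit on n-1 points for every i (O(n^4) in total)
slow_means, slow_vars = np.zeros(n), np.zeros(n)
for i in range(n):
    mask = np.arange(n) != i
    K_mi = K_y[np.ix_(mask, mask)]   # covariance of the remaining points
    k_i = K_y[mask, i]               # cross-covariances to the held-out point
    sol = cho_solve(cho_factor(K_mi, lower=True), np.column_stack([y[mask], k_i]))
    slow_means[i] = k_i @ sol[:, 0]                 # mu_{-i}
    slow_vars[i] = K_y[i, i] - k_i @ sol[:, 1]      # sigma^2_{-i} (includes noise)

print(np.max(np.abs(fast['loo_means'] - slow_means)))       # should be near machine precision
print(np.max(np.abs(fast['loo_variances'] - slow_vars)))    # should be near machine precision
```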
To optimize hyperparameters using LOO, we need its gradients. The derivation is more involved than for marginal likelihood but follows similar matrix calculus principles.
The LOO Objective:
$$\text{LOO} = \sum_{i=1}^n \left[ \frac{1}{2}\log(\mathbf{K}_y^{-1})_{ii} - \frac{\alpha_i^2}{2(\mathbf{K}_y^{-1})_{ii}} \right] - \frac{n}{2}\log(2\pi)$$
Key Intermediate Results:
For hyperparameter $\theta_j$, we need:
$\frac{\partial \boldsymbol{\alpha}}{\partial \theta_j} = -\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha}$
$\frac{\partial \mathbf{K}_y^{-1}}{\partial \theta_j} = -\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1}$
$\frac{\partial (\mathbf{K}_y^{-1})_{ii}}{\partial \theta_j}$ = the $i$-th diagonal element of the expression above
Gradient Derivation:
Let $p_i = (\mathbf{K}_y^{-1})_{ii}$ (precision) and $\mathbf{P} = \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1}$.
Then: $$\frac{\partial p_i}{\partial \theta_j} = -P_{ii}$$
$$\frac{\partial \alpha_i}{\partial \theta_j} = -(\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha})_i$$
The full LOO gradient is:
$$\frac{\partial \text{LOO}}{\partial \theta_j} = \sum_{i=1}^n \left[ \frac{1}{2p_i} \cdot (-P_{ii}) - \frac{\alpha_i^2}{2p_i^2} \cdot P_{ii} - \frac{\alpha_i}{p_i} \cdot \frac{\partial \alpha_i}{\partial \theta_j} \right]$$
This can be simplified and computed efficiently using matrix operations.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def loo_cv_and_gradients(
    y: np.ndarray,
    K_y: np.ndarray,
    dK_dtheta: list[np.ndarray]
) -> tuple[float, np.ndarray]:
    """
    Compute LOO log-likelihood and gradients w.r.t. hyperparameters.

    Returns
    -------
    loo_ll : float - LOO log-likelihood (to maximize)
    grads : (p,) array - Gradients w.r.t. each hyperparameter
    """
    n = len(y)

    # Standard computations
    L, lower = cho_factor(K_y, lower=True)
    alpha = cho_solve((L, lower), y)
    K_y_inv = cho_solve((L, lower), np.eye(n))
    diag_prec = np.diag(K_y_inv)          # p_i = (K_y^{-1})_{ii}

    # LOO errors and log-likelihood
    loo_errors = alpha / diag_prec
    loo_ll = (
        -0.5 * n * np.log(2 * np.pi)
        + 0.5 * np.sum(np.log(diag_prec))
        - 0.5 * np.sum(alpha * loo_errors)
    )

    # Compute gradients
    grads = []
    for dK in dK_dtheta:
        # P = K_y^{-1} @ dK @ K_y^{-1}
        P = K_y_inv @ dK @ K_y_inv
        P_diag = np.diag(P)

        # d_alpha/d_theta = -K_y^{-1} @ dK @ alpha
        d_alpha = -K_y_inv @ dK @ alpha

        # Gradient components
        # term1: sum_i (1/(2*p_i)) * (-P_ii)
        term1 = -0.5 * np.sum(P_diag / diag_prec)

        # term2: sum_i -(alpha_i^2 / (2*p_i^2)) * P_ii
        term2 = -0.5 * np.sum((alpha ** 2 / diag_prec ** 2) * P_diag)

        # term3: sum_i -(alpha_i / p_i) * d_alpha_i/d_theta
        term3 = -np.sum((alpha / diag_prec) * d_alpha)

        grads.append(term1 + term2 + term3)

    return loo_ll, np.array(grads)
```
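Analytic gradients like these are easy to get subtly wrong, so it is worth validating them numerically. Below is a minimal finite-difference check, assuming `loo_cv_and_gradients` from above is in scope; the `rbf_kernel_and_grad` helper, the lengthscale derivative, and the toy data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_and_grad(X, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix and its derivative w.r.t. the lengthscale."""
    d2 = (X[:, None] - X[None, :]) ** 2
    K = variance * np.exp(-0.5 * d2 / lengthscale ** 2)
    dK_dl = K * d2 / lengthscale ** 3     # elementwise chain rule
    return K, dK_dl

rng = np.random.default_rng(1)
n, noise_var, ell = 30, 0.1, 1.2
X = rng.uniform(0, 4, n)
y = np.cos(X) + np.sqrt(noise_var) * rng.standard_normal(n)

K, dK_dl = rbf_kernel_and_grad(X, lengthscale=ell)
loo_ll, grads = loo_cv_and_gradients(y, K + noise_var * np.eye(n), [dK_dl])

# Finite-difference check of the lengthscale gradient
eps = 1e-5
K_plus, _ = rbf_kernel_and_grad(X, lengthscale=ell + eps)
K_minus, _ = rbf_kernel_and_grad(X, lengthscale=ell - eps)
ll_plus, _ = loo_cv_and_gradients(y, K_plus + noise_var * np.eye(n), [dK_dl])
ll_minus, _ = loo_cv_and_gradients(y, K_minus + noise_var * np.eye(n), [dK_dl])
fd_grad = (ll_plus - ll_minus) / (2 * eps)

print(grads[0], fd_grad)   # the two values should agree to several decimal places
```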
While LOO-CV has a beautiful closed form for GPs, K-fold CV offers an alternative with different trade-offs.
K-Fold Procedure:
Partition data into $K$ approximately equal folds: $\mathcal{D} = \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_K$
For $k = 1, \ldots, K$: train the GP on all folds except $\mathcal{D}_k$ and evaluate the log predictive likelihood of the held-out fold $\mathcal{D}_k$
Sum log-likelihoods across folds
Computational Complexity:
With $K$ folds of size $n/K$ each: every fold trains on $n - n/K$ points, so each fold requires its own $O\big(((K-1)n/K)^3\big)$ factorization, and there are $K$ such fits.
For $K = 5$ or $K = 10$, this works out to roughly 2.5-7× the cost of a single LOO or marginal-likelihood evaluation.
When to Prefer K-Fold:
Time series data: LOO doesn't respect temporal ordering. Use forward-chaining or blocked folds.
Grouped data: When observations are naturally grouped (e.g., patients with multiple measurements), hold out entire groups.
Non-Gaussian likelihoods: K-fold extends directly to GP classification and other variants where LOO doesn't have a closed form.
Computational constraints: If memory for $n \times n$ matrix is limiting, K-fold can work with smaller matrices per fold.
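Before comparing the two approaches, here is a minimal K-fold sketch for GP regression. The `rbf_kernel` helper, the manual fold splitting, and the toy data are illustrative assumptions, not a library API.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def kfold_log_likelihood(X, y, lengthscale, variance, noise_var, K_folds=5, seed=0):
    """Sum of held-out Gaussian log predictive densities over K folds."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K_folds)
    total = 0.0
    for test in folds:
        train = np.setdiff1d(idx, test)
        K_tr = (rbf_kernel(X[train], X[train], lengthscale, variance)
                + noise_var * np.eye(len(train)))
        K_cross = rbf_kernel(X[test], X[train], lengthscale, variance)
        chol = cho_factor(K_tr, lower=True)
        alpha = cho_solve(chol, y[train])
        mu = K_cross @ alpha                                  # predictive mean
        v = cho_solve(chol, K_cross.T)
        var = (variance + noise_var) - np.sum(K_cross * v.T, axis=1)  # predictive variance incl. noise
        total += np.sum(-0.5 * np.log(2 * np.pi * var)
                        - 0.5 * (y[test] - mu) ** 2 / var)
    return total

# Example with hypothetical toy data
rng = np.random.default_rng(3)
X = rng.uniform(0, 5, 60)
y = np.sin(X) + 0.2 * rng.standard_normal(60)
print(kfold_log_likelihood(X, y, lengthscale=0.8, variance=1.0, noise_var=0.04))
```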
Let's systematically compare these two hyperparameter selection approaches.
Theoretical Relationship:
For well-specified models (true data comes from GP prior), marginal likelihood and LOO-CV select asymptotically equivalent hyperparameters. The marginal likelihood can be approximated as:
$$\log p(\mathbf{y} | \boldsymbol{\theta}) \approx \text{LOO}(\boldsymbol{\theta}) + \text{penalty terms}$$
The penalty terms capture model complexity; LOO accounts for complexity implicitly through held-out evaluation.
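One way to see the connection: by the chain rule of probability, the log marginal likelihood decomposes exactly into a sum of sequential predictive densities,
$$\log p(\mathbf{y} \mid \boldsymbol{\theta}) = \sum_{i=1}^n \log p(y_i \mid y_1, \ldots, y_{i-1}, \boldsymbol{\theta}),$$
whereas LOO conditions each term on all other observations. The early terms in the marginal likelihood are predictions made from very little data, which is one way its complexity penalty enters.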
| Aspect | Marginal Likelihood | Cross-Validation (LOO) |
|---|---|---|
| Theoretical Basis | Bayesian: probability of data under prior | Frequentist: predictive performance |
| Complexity Handling | Automatic via determinant term | Implicit via held-out evaluation |
| Model Misspecification | Can be misleading if prior is wrong | Robust: measures actual predictions |
| Computational Cost | $O(n^3)$ | $O(n^3)$ (closed-form for GP) |
| Gradient Computation | Simple, closed-form | More complex, but doable |
| Uncertainty Calibration | Optimizes probabilistic fit | Can check but doesn't optimize directly |
| Non-Gaussian Extensions | Requires approximations | Straightforward with more compute |
| Literature Support | Dominant in GP community | Common in statistics/ML more broadly |
When They Disagree:
If marginal likelihood and LOO select substantially different hyperparameters, investigate:
Check for model misspecification: The GP assumptions (Gaussian noise, stationary kernel) may not hold.
Examine outliers: Outliers affect ML and LOO differently—ML is more sensitive to fitting outliers.
Look at data density: In sparsely sampled regions, ML and LOO may weight differently.
Consider noise model: If noise is heteroscedastic, both methods may struggle.
Practical Recommendation: Use marginal likelihood as primary objective (better gradient properties, principled), but check LOO as validation. If they agree, you're likely in good shape. If they disagree strongly, investigate further.
Beyond hyperparameter selection, LOO provides insight into calibration—whether the GP's uncertainty estimates are accurate.
Standardized LOO Residuals:
Define standardized residuals: $$z_i = \frac{y_i - \mu_{-i}}{\sigma_{-i}} = \frac{e_i}{\sqrt{\sigma_{-i}^2}} = \frac{\alpha_i}{\sqrt{(\mathbf{K}_y^{-1})_{ii}}}$$
For a well-calibrated GP, these should follow $\mathcal{N}(0, 1)$.
Calibration Checks:
```python
import numpy as np
from scipy import stats


def calibration_diagnostics(
    y: np.ndarray,
    loo_means: np.ndarray,
    loo_variances: np.ndarray
) -> dict:
    """
    Compute calibration diagnostics from LOO predictions.

    Returns
    -------
    dict with calibration metrics and test results
    """
    n = len(y)

    # Standardized residuals
    residuals = y - loo_means
    std_residuals = residuals / np.sqrt(loo_variances)

    # Basic statistics
    mean_std_residual = np.mean(std_residuals)
    std_std_residual = np.std(std_residuals, ddof=1)

    # Coverage: fraction of y within central prediction intervals
    coverage_68 = np.mean(np.abs(std_residuals) <= 1)      # within ±1 sigma: should be ~68%
    coverage_95 = np.mean(np.abs(std_residuals) <= 1.96)   # within ±1.96 sigma: should be ~95%
    coverage_99 = np.mean(np.abs(std_residuals) <= 2.576)  # within ±2.576 sigma: should be ~99%

    # Normality test on standardized residuals
    _, p_value_shapiro = stats.shapiro(std_residuals[:min(n, 5000)])  # Shapiro limited to 5000 samples

    # Kolmogorov-Smirnov test against N(0, 1)
    ks_stat, p_value_ks = stats.kstest(std_residuals, 'norm')

    return {
        'mean': mean_std_residual,
        'std': std_std_residual,
        'coverage_68': coverage_68,
        'coverage_95': coverage_95,
        'coverage_99': coverage_99,
        'shapiro_p_value': p_value_shapiro,
        'ks_statistic': ks_stat,
        'ks_p_value': p_value_ks,
        'calibrated': (0.9 < std_std_residual < 1.1
                       and abs(mean_std_residual) < 0.1)
    }
```

Interpreting Calibration Results:
| Observation | Interpretation | Action |
|---|---|---|
| $\text{Std}(z) < 1$ | Overestimating uncertainty | May have too much noise variance |
| $\text{Std}(z) > 1$ | Underestimating uncertainty | May have too little noise; model too confident |
| $\text{Mean}(z) \neq 0$ | Biased predictions | Mean function may not be appropriate |
| Coverage $< $ expected | Prediction intervals too narrow | Uncertainty underestimated |
| Non-Gaussian $z$ | Distributional assumptions violated | Consider transforming data or non-Gaussian likelihood |
Good calibration (standardized residuals close to N(0,1)) indicates not just good hyperparameters but appropriate model assumptions—Gaussian noise, correct kernel family, appropriate mean function. Poor calibration despite good marginal likelihood suggests model misspecification.
In practice, combining marginal likelihood and cross-validation often works best.
Strategy 1: ML Optimization with LOO Validation
Optimize hyperparameters by maximizing the marginal likelihood, then compute the LOO metrics and calibration diagnostics at the optimum (see the sketch below); if they look poor, revisit the kernel and noise model. This leverages ML's gradient efficiency while using LOO as a sanity check.
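A minimal sketch of this strategy follows, assuming the `leave_one_out_cv` function from earlier on this page is in scope; the `rbf_kernel` helper, the log-parameterization, and the toy data are illustrative assumptions, and no jitter or other numerical safeguards are included.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def neg_log_marginal_likelihood(log_params, X, y):
    """Negative log marginal likelihood; parameters are log(lengthscale, variance, noise)."""
    ell, var, noise = np.exp(log_params)
    K_y = rbf_kernel(X, X, ell, var) + noise * np.eye(len(y))
    chol = cho_factor(K_y, lower=True)
    alpha = cho_solve(chol, y)
    log_det = 2 * np.sum(np.log(np.diag(chol[0])))
    return 0.5 * (y @ alpha + log_det + len(y) * np.log(2 * np.pi))

# Hypothetical toy data
rng = np.random.default_rng(2)
X = rng.uniform(0, 5, 50)
y = np.sin(X) + 0.2 * rng.standard_normal(50)

# Step 1: maximize the marginal likelihood (finite-difference gradients for simplicity)
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method='L-BFGS-B')
ell, var, noise = np.exp(res.x)

# Step 2: LOO sanity check at the ML optimum
K_y = rbf_kernel(X, X, ell, var) + noise * np.eye(len(y))
loo = leave_one_out_cv(y, K_y)
print(loo['loo_log_likelihood'], loo['standardized_error_std'])  # std should be near 1 if calibrated
```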
Strategy 2: Dual Objective Optimization
Optimize a combination: $$\mathcal{J}(\boldsymbol{\theta}) = \lambda \cdot \mathcal{L}_{\text{ML}}(\boldsymbol{\theta}) + (1-\lambda) \cdot \mathcal{L}_{\text{LOO}}(\boldsymbol{\theta})$$
with $\lambda \in [0, 1]$. This can be useful when ML and LOO disagree, finding a compromise solution.
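As a sketch, the blended objective can be a simple weighted sum of the two scores. The code below assumes the `rbf_kernel` and `neg_log_marginal_likelihood` helpers from the Strategy 1 sketch and the `leave_one_out_cv` function from earlier are in scope; all names and the fixed parameterization are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def dual_objective(log_params, X, y, lam=0.5):
    """Weighted combination of log marginal likelihood and LOO log-likelihood (to maximize)."""
    ell, var, noise = np.exp(log_params)
    K_y = rbf_kernel(X, X, ell, var) + noise * np.eye(len(y))
    log_ml = -neg_log_marginal_likelihood(log_params, X, y)
    loo_ll = leave_one_out_cv(y, K_y)['loo_log_likelihood']
    return lam * log_ml + (1 - lam) * loo_ll

# Maximize by minimizing the negative, e.g.:
# res = minimize(lambda p: -dual_objective(p, X, y, lam=0.5), x0=np.log([1.0, 1.0, 0.1]))
```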
Strategy 3: Nested Cross-Validation
For rigorous model evaluation:
Split the data into outer folds; for each outer fold, tune hyperparameters (by ML or an inner CV loop) using only the outer training portion, then evaluate on the held-out outer fold. This gives an unbiased estimate of out-of-sample performance with properly tuned hyperparameters.
Strategy 4: Early Stopping with LOO
During optimization: monitor the LOO score alongside the marginal likelihood and stop when LOO stops improving, even if the marginal likelihood is still increasing.
This is particularly useful for complex models with many hyperparameters.
Given that LOO is essentially free once you have the matrix inverse, always compute both ML and LOO at the final solution. The small additional cost provides valuable validation. A perfect match isn't expected, but wildly different rankings of hyperparameter settings are a red flag.
Beyond LOO, several other predictive performance measures exist.
Continuous Ranked Probability Score (CRPS):
Measures distance between predicted CDF and true observation:
$$\text{CRPS}(F, y) = \int_{-\infty}^{\infty} (F(z) - \mathbf{1}_{z \geq y})^2 dz$$
For Gaussian predictive $\mathcal{N}(\mu, \sigma^2)$:
$$\text{CRPS} = \sigma \left[ \frac{y - \mu}{\sigma} \left( 2\Phi\left(\frac{y-\mu}{\sigma}\right) - 1 \right) + 2\phi\left(\frac{y-\mu}{\sigma}\right) - \frac{1}{\sqrt{\pi}} \right]$$
where $\Phi$ and $\phi$ are standard Gaussian CDF and PDF.
Properties:
CRPS is a proper scoring rule: it is minimized in expectation by the true predictive distribution.
It reduces to the mean absolute error for deterministic (point) predictions.
It is reported in the same units as $y$, and lower values are better.
Interval Score:
For a $(1-\alpha)$ prediction interval $[l, u]$:
$$S_\alpha(l, u, y) = (u - l) + \frac{2}{\alpha}(l - y)\mathbf{1}_{y < l} + \frac{2}{\alpha}(y - u)\mathbf{1}_{y > u}$$
Penalizes:
Wide intervals (through the $u - l$ term).
Observations that fall outside the interval, in proportion to how far they miss.
Predictive Variance:
Simply measure average predictive variance: $$\bar{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \sigma_{-i}^2$$
Useful for comparing models: prefer models with tighter (but calibrated) uncertainty.
LOO RMSE:
Ignore uncertainty entirely: $$\text{LOO-RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n e_i^2}$$
Useful when only point predictions matter, but loses probabilistic benefits.
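For reference, here is a short sketch implementing the Gaussian CRPS and interval-score formulas above; the function names are illustrative, and the commented usage assumes LOO predictions obtained from `leave_one_out_cv` with `return_details=True`.

```python
import numpy as np
from scipy.stats import norm

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive N(mu, sigma^2); lower is better."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def interval_score(y, mu, sigma, alpha=0.05):
    """Interval score for the central (1 - alpha) Gaussian prediction interval."""
    half_width = norm.ppf(1 - alpha / 2) * sigma
    l, u = mu - half_width, mu + half_width
    return (u - l) + (2 / alpha) * (l - y) * (y < l) + (2 / alpha) * (y - u) * (y > u)

# Example usage with LOO predictions:
# crps = np.mean(gaussian_crps(y, loo['loo_means'], np.sqrt(loo['loo_variances'])))
# iscore = np.mean(interval_score(y, loo['loo_means'], np.sqrt(loo['loo_variances'])))
```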
We've developed cross-validation as a complementary tool for GP hyperparameter selection. Let's consolidate:
LOO-CV has a closed form for GPs: given $\boldsymbol{\alpha}$ and the diagonal of $\mathbf{K}_y^{-1}$, all LOO means, variances, and log-probabilities cost only $O(n)$ extra, keeping the total at $O(n^3)$.
LOO gradients are also available in closed form, enabling direct optimization of predictive performance.
K-fold CV is the better tool for time series, grouped data, and non-Gaussian likelihoods.
Marginal likelihood and LOO usually agree for well-specified models; strong disagreement signals misspecification worth investigating.
Standardized LOO residuals provide a calibration check that goes beyond hyperparameter selection.
Alternative scores (CRPS, interval score, LOO-RMSE) target specific aspects of predictive performance.
Looking Ahead:
The final page in this module covers practical considerations for GP hyperparameter learning—implementation details, common pitfalls, software libraries, and real-world application guidance. This completes our comprehensive treatment of hyperparameter learning for Gaussian Processes.
You now understand cross-validation as an alternative to marginal likelihood for GP hyperparameter selection. The closed-form LOO expressions make it computationally practical, and calibration diagnostics provide valuable model checking beyond hyperparameter selection.