While the marginal likelihood provides a principled Bayesian approach to hyperparameter selection, cross-validation (CV) offers a complementary perspective rooted in predictive performance. In some situations, CV may be preferable:
Model misspecification: When the true data-generating process doesn't match any GP prior, CV directly optimizes prediction error rather than model fit.
Non-probabilistic objectives: If you care about point predictions (RMSE, MAE) rather than probabilistic forecasts, CV can target these directly.
Validation of marginal likelihood: CV provides an independent check—if marginal likelihood and CV select very different hyperparameters, something warrants investigation.
Robustness: CV makes fewer assumptions about the noise distribution and prior correctness.
This page develops cross-validation theory for GP hyperparameter selection, with emphasis on efficient computation.
By the end of this page, you will understand leave-one-out cross-validation, its closed-form expression for GPs, k-fold cross-validation, efficient computation strategies, and when to prefer CV over marginal likelihood.
Leave-One-Out Cross-Validation (LOO-CV) evaluates model performance by training on $n-1$ points and predicting the held-out point, repeated for each observation.
The LOO Objective:
For hyperparameters $\boldsymbol{\theta}$, the LOO log-likelihood is:
$$\text{LOO}(\boldsymbol{\theta}) = \sum_{i=1}^n \log p(y_i | \mathbf{X}, \mathbf{y}_{-i}, \boldsymbol{\theta})$$
where $\mathbf{y}_{-i}$ denotes all observations except $y_i$. This measures the total predictive probability of held-out data across all leave-one-out splits.
Connection to Information Theory:
The negative LOO is related to the predictive information criterion:
$$\text{PIC} = -2 \sum_{i=1}^n \log p(y_i | \mathbf{y}_{-i}, \boldsymbol{\theta})$$
Minimizing PIC (or equivalently maximizing LOO) selects hyperparameters with best average predictive performance.
Naive Computation:
A direct approach would: for each $i = 1, \ldots, n$, remove observation $i$, refit the GP on the remaining $n-1$ points (an $O(n^3)$ factorization), and evaluate the predictive log-probability of the held-out point.
Total cost: $O(n \cdot n^3) = O(n^4)$—prohibitively expensive.
The GP Miracle:
Remarkably, for Gaussian Processes, LOO-CV has a closed-form expression that can be computed in $O(n^3)$—the same complexity as evaluating the marginal likelihood once!
Let's derive the closed-form leave-one-out expressions.
Setup:
With covariance matrix $\mathbf{K}_y = \mathbf{K} + \sigma_n^2 \mathbf{I}$ and its inverse $\mathbf{K}_y^{-1}$, define:
$$\boldsymbol{\alpha} = \mathbf{K}_y^{-1} \mathbf{y}$$
The Leave-One-Out Predictive Distribution:
For the $i$-th observation, training on all others gives predictive:
$$y_i \mid \mathbf{y}_{-i} \sim \mathcal{N}(\mu_{-i}, \sigma_{-i}^2)$$
Using the matrix inversion lemma and partitioned matrix identities:
$$\mu_{-i} = y_i - \frac{\alpha_i}{(\mathbf{K}_y^{-1})_{ii}}$$
$$\sigma_{-i}^2 = \frac{1}{(\mathbf{K}_y^{-1})_{ii}}$$
where $\alpha_i$ is the $i$-th component of $\boldsymbol{\alpha}$ and $(\mathbf{K}_y^{-1})_{ii}$ is the $i$-th diagonal element of the precision matrix.
Derivation Intuition:
The key insight is that removing one observation from a Gaussian is equivalent to 'conditioning away' its influence. The matrix inversion lemma allows expressing this removal without recomputing the entire inverse.
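Concretely, the key step can be sketched as follows (permuting the $i$-th point to the last position). Let $\mathbf{K}_{-i}$ denote $\mathbf{K}_y$ with row and column $i$ removed, $\mathbf{k}_i$ the $i$-th column of $\mathbf{K}_y$ with its $i$-th entry removed, and $k_{ii}$ the $i$-th diagonal entry. The usual GP conditional gives $\mu_{-i} = \mathbf{k}_i^\top \mathbf{K}_{-i}^{-1} \mathbf{y}_{-i}$ and $\sigma_{-i}^2 = k_{ii} - \mathbf{k}_i^\top \mathbf{K}_{-i}^{-1} \mathbf{k}_i$, while the block-inverse (Schur complement) identity gives
$$(\mathbf{K}_y^{-1})_{ii} = \frac{1}{k_{ii} - \mathbf{k}_i^\top \mathbf{K}_{-i}^{-1} \mathbf{k}_i} = \frac{1}{\sigma_{-i}^2}, \qquad \alpha_i = (\mathbf{K}_y^{-1}\mathbf{y})_i = \frac{y_i - \mathbf{k}_i^\top \mathbf{K}_{-i}^{-1} \mathbf{y}_{-i}}{\sigma_{-i}^2} = \frac{y_i - \mu_{-i}}{\sigma_{-i}^2}.$$
Rearranging these two identities yields the closed-form expressions for $\mu_{-i}$ and $\sigma_{-i}^2$ stated above.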
LOO Error:
The leave-one-out error for point $i$ is: $$e_i = y_i - \mu_{-i} = \frac{\alpha_i}{(\mathbf{K}_y^{-1})_{ii}}$$
This is the residual when predicting $y_i$ from all other points.
LOO Variance:
The predicted variance is simply the inverse of the precision: $$\sigma_{-i}^2 = \frac{1}{(\mathbf{K}_y^{-1})_{ii}}$$
LOO Log-Probability:
The log-probability of observing $y_i$ under the LOO predictive:
$$\log p(y_i | \mathbf{y}_{-i}) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma_{-i}^2) - \frac{e_i^2}{2\sigma_{-i}^2}$$
$$= -\frac{1}{2}\log(2\pi) + \frac{1}{2}\log\left((\mathbf{K}_y^{-1})_{ii}\right) - \frac{\alpha_i^2}{2(\mathbf{K}_y^{-1})_{ii}}$$
Once we have $\mathbf{K}_y^{-1}$ and $\boldsymbol{\alpha}$ (which we compute anyway for the marginal likelihood), all LOO quantities are available with just $O(n)$ additional operations. LOO-CV is essentially free given the matrix inverse!
Let's implement LOO-CV with attention to numerical stability and efficiency.
Complete LOO Algorithm: factorize $\mathbf{K}_y$ via Cholesky ($O(n^3)$), solve for $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$ ($O(n^2)$), obtain the diagonal of $\mathbf{K}_y^{-1}$ from the factor ($O(n^3)$), then form the LOO errors, variances, and log-likelihood ($O(n)$).
Total complexity: $O(n^3)$ — same as marginal likelihood!
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def leave_one_out_cv(
    y: np.ndarray,
    K_y: np.ndarray,
    return_details: bool = False
) -> dict:
    """
    Compute leave-one-out cross-validation metrics for GP.

    Parameters
    ----------
    y : (n,) array - Observations
    K_y : (n, n) array - Covariance matrix K + sigma_n^2 * I
    return_details : bool - Return per-point errors and variances

    Returns
    -------
    dict with LOO metrics
    """
    n = len(y)

    # Cholesky decomposition
    L, lower = cho_factor(K_y, lower=True)

    # Compute alpha = K_y^{-1} @ y
    alpha = cho_solve((L, lower), y)

    # Compute K_y^{-1} (need diagonal)
    K_y_inv = cho_solve((L, lower), np.eye(n))
    diag_K_y_inv = np.diag(K_y_inv)

    # LOO predictions
    loo_errors = alpha / diag_K_y_inv      # e_i = alpha_i / (K_y^{-1})_{ii}
    loo_variances = 1.0 / diag_K_y_inv     # sigma^2_{-i} = 1 / (K_y^{-1})_{ii}
    loo_means = y - loo_errors             # mu_{-i} = y_i - e_i

    # LOO log-likelihood (sum over all points)
    loo_log_likelihood = -0.5 * (
        n * np.log(2 * np.pi)
        - np.sum(np.log(diag_K_y_inv))     # = sum of log variances
        + np.sum(alpha * loo_errors)       # = sum e_i^2 / var_i
    )

    # Alternative objectives
    loo_mse = np.mean(loo_errors ** 2)
    loo_mae = np.mean(np.abs(loo_errors))
    loo_rmse = np.sqrt(loo_mse)

    # Standardized LOO (for calibration check)
    standardized_errors = loo_errors / np.sqrt(loo_variances)

    result = {
        'loo_log_likelihood': loo_log_likelihood,
        'loo_mse': loo_mse,
        'loo_rmse': loo_rmse,
        'loo_mae': loo_mae,
        'standardized_error_mean': np.mean(standardized_errors),
        'standardized_error_std': np.std(standardized_errors),
    }

    if return_details:
        result['loo_errors'] = loo_errors
        result['loo_variances'] = loo_variances
        result['loo_means'] = loo_means
        result['standardized_errors'] = standardized_errors

    return result
```

Interpreting LOO Metrics:
| Metric | Description | Good Values |
|---|---|---|
| LOO Log-Likelihood | Total log probability of held-out points | Higher is better |
| LOO MSE | Mean squared LOO error | Lower is better |
| LOO RMSE | Root mean squared LOO error | In units of $y$ |
| Standardized Error Mean | Should be near 0 if well-calibrated | $\approx 0$ |
| Standardized Error Std | Should be near 1 if well-calibrated | $\approx 1$ |
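As a sanity check, the closed-form identities can be verified against brute-force LOO, which refits on $n-1$ points for every observation. The sketch below assumes the `leave_one_out_cv` function above is in scope; the `rbf_kernel` helper, toy data, and hyperparameter values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
n, noise_var = 40, 0.05
X = np.sort(rng.uniform(0, 5, n))
y = np.sin(X) + np.sqrt(noise_var) * rng.standard_normal(n)

K_y = rbf_kernel(X, X, lengthscale=0.8, variance=1.0) + noise_var * np.eye(n)

# Closed-form LOO: one O(n^3) factorization in total
fast = leave_one_out_cv(y, K_y, return_details=True)

# Brute-force LOO: refit on n-1 points for every i (O(n^4) in total)
slow_means, slow_vars = np.zeros(n), np.zeros(n)
for i in range(n):
    mask = np.arange(n) != i
    K_mi = K_y[np.ix_(mask, mask)]   # covariance of the remaining points
    k_i = K_y[mask, i]               # cross-covariances to the held-out point
    sol = cho_solve(cho_factor(K_mi, lower=True), np.column_stack([y[mask], k_i]))
    slow_means[i] = k_i @ sol[:, 0]                 # mu_{-i}
    slow_vars[i] = K_y[i, i] - k_i @ sol[:, 1]      # sigma^2_{-i} (includes noise)

print(np.max(np.abs(fast['loo_means'] - slow_means)))       # should be near machine precision
print(np.max(np.abs(fast['loo_variances'] - slow_vars)))    # should be near machine precision
```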
To optimize hyperparameters using LOO, we need its gradients. The derivation is more involved than for marginal likelihood but follows similar matrix calculus principles.
The LOO Objective:
$$\text{LOO} = \sum_{i=1}^n \left[ \frac{1}{2}\log(\mathbf{K}_y^{-1})_{ii} - \frac{\alpha_i^2}{2(\mathbf{K}_y^{-1})_{ii}} \right] - \frac{n}{2}\log(2\pi)$$
Key Intermediate Results:
For hyperparameter $\theta_j$, we need:
$\frac{\partial \boldsymbol{\alpha}}{\partial \theta_j} = -\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha}$
$\frac{\partial \mathbf{K}_y^{-1}}{\partial \theta_j} = -\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1}$
$\frac{\partial (\mathbf{K}_y^{-1})_{ii}}{\partial \theta_j}$ = the $i$-th diagonal element of the expression above
Gradient Derivation:
Let $p_i = (\mathbf{K}_y^{-1})_{ii}$ (precision) and $\mathbf{P} = \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1}$.
Then: $$\frac{\partial p_i}{\partial \theta_j} = -P_{ii}$$
$$\frac{\partial \alpha_i}{\partial \theta_j} = -(\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha})_i$$
The full LOO gradient is:
$$\frac{\partial \text{LOO}}{\partial \theta_j} = \sum_{i=1}^n \left[ \frac{1}{2p_i} \cdot (-P_{ii}) - \frac{\alpha_i^2}{2p_i^2} \cdot P_{ii} - \frac{\alpha_i}{p_i} \cdot \frac{\partial \alpha_i}{\partial \theta_j} \right]$$
This can be simplified and computed efficiently using matrix operations.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def loo_cv_and_gradients(
    y: np.ndarray,
    K_y: np.ndarray,
    dK_dtheta: list[np.ndarray]
) -> tuple[float, np.ndarray]:
    """
    Compute LOO log-likelihood and gradients w.r.t. hyperparameters.

    Returns
    -------
    loo_ll : float - LOO log-likelihood (to maximize)
    grads : (p,) array - Gradients w.r.t. each hyperparameter
    """
    n = len(y)

    # Standard computations
    L, lower = cho_factor(K_y, lower=True)
    alpha = cho_solve((L, lower), y)
    K_y_inv = cho_solve((L, lower), np.eye(n))
    diag_prec = np.diag(K_y_inv)          # p_i = (K_y^{-1})_{ii}

    # LOO errors and log-likelihood
    loo_errors = alpha / diag_prec
    loo_ll = (
        -0.5 * n * np.log(2 * np.pi)
        + 0.5 * np.sum(np.log(diag_prec))
        - 0.5 * np.sum(alpha * loo_errors)
    )

    # Compute gradients
    grads = []
    for dK in dK_dtheta:
        # P = K_y^{-1} @ dK @ K_y^{-1}
        P = K_y_inv @ dK @ K_y_inv
        P_diag = np.diag(P)

        # d_alpha/d_theta = -K_y^{-1} @ dK @ alpha
        d_alpha = -K_y_inv @ dK @ alpha

        # Gradient components
        # term1: sum_i (1/(2*p_i)) * (-P_ii)
        term1 = -0.5 * np.sum(P_diag / diag_prec)

        # term2: sum_i -(alpha_i^2 / (2*p_i^2)) * P_ii
        term2 = -0.5 * np.sum((alpha ** 2 / diag_prec ** 2) * P_diag)

        # term3: sum_i -(alpha_i / p_i) * d_alpha_i/d_theta
        term3 = -np.sum((alpha / diag_prec) * d_alpha)

        grads.append(term1 + term2 + term3)

    return loo_ll, np.array(grads)
```
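Analytic gradients like these are easy to get subtly wrong, so it is worth validating them numerically. Below is a minimal finite-difference check, assuming `loo_cv_and_gradients` from above is in scope; the `rbf_kernel_and_grad` helper, the lengthscale derivative, and the toy data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_and_grad(X, lengthscale=1.0, variance=1.0):
    """RBF kernel matrix and its derivative w.r.t. the lengthscale."""
    d2 = (X[:, None] - X[None, :]) ** 2
    K = variance * np.exp(-0.5 * d2 / lengthscale ** 2)
    dK_dl = K * d2 / lengthscale ** 3     # elementwise chain rule
    return K, dK_dl

rng = np.random.default_rng(1)
n, noise_var, ell = 30, 0.1, 1.2
X = rng.uniform(0, 4, n)
y = np.cos(X) + np.sqrt(noise_var) * rng.standard_normal(n)

K, dK_dl = rbf_kernel_and_grad(X, lengthscale=ell)
loo_ll, grads = loo_cv_and_gradients(y, K + noise_var * np.eye(n), [dK_dl])

# Finite-difference check of the lengthscale gradient
eps = 1e-5
K_plus, _ = rbf_kernel_and_grad(X, lengthscale=ell + eps)
K_minus, _ = rbf_kernel_and_grad(X, lengthscale=ell - eps)
ll_plus, _ = loo_cv_and_gradients(y, K_plus + noise_var * np.eye(n), [dK_dl])
ll_minus, _ = loo_cv_and_gradients(y, K_minus + noise_var * np.eye(n), [dK_dl])
fd_grad = (ll_plus - ll_minus) / (2 * eps)

print(grads[0], fd_grad)   # the two values should agree to several decimal places
```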
While LOO-CV has a beautiful closed form for GPs, K-fold CV offers an alternative with different trade-offs.
K-Fold Procedure:
Partition data into $K$ approximately equal folds: $\mathcal{D} = \mathcal{D}_1 \cup \cdots \cup \mathcal{D}_K$
For $k = 1, \ldots, K$: train the GP on all folds except $\mathcal{D}_k$ and evaluate the log predictive likelihood of the held-out fold $\mathcal{D}_k$
Sum log-likelihoods across folds
Computational Complexity:
With $K$ folds of size $n/K$ each: every fold trains on $n - n/K$ points, so each fold requires its own $O\big(((K-1)n/K)^3\big)$ factorization, and there are $K$ such fits.
For $K = 5$ or $K = 10$, this works out to roughly 2.5-7× the cost of a single LOO or marginal-likelihood evaluation.
When to Prefer K-Fold:
Time series data: LOO doesn't respect temporal ordering. Use forward-chaining or blocked folds.
Grouped data: When observations are naturally grouped (e.g., patients with multiple measurements), hold out entire groups.
Non-Gaussian likelihoods: K-fold extends directly to GP classification and other variants where LOO doesn't have a closed form.
Computational constraints: If memory for $n \times n$ matrix is limiting, K-fold can work with smaller matrices per fold.
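Before comparing the two approaches, here is a minimal K-fold sketch for GP regression. The `rbf_kernel` helper, the manual fold splitting, and the toy data are illustrative assumptions, not a library API.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def kfold_log_likelihood(X, y, lengthscale, variance, noise_var, K_folds=5, seed=0):
    """Sum of held-out Gaussian log predictive densities over K folds."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K_folds)
    total = 0.0
    for test in folds:
        train = np.setdiff1d(idx, test)
        K_tr = (rbf_kernel(X[train], X[train], lengthscale, variance)
                + noise_var * np.eye(len(train)))
        K_cross = rbf_kernel(X[test], X[train], lengthscale, variance)
        chol = cho_factor(K_tr, lower=True)
        alpha = cho_solve(chol, y[train])
        mu = K_cross @ alpha                                  # predictive mean
        v = cho_solve(chol, K_cross.T)
        var = (variance + noise_var) - np.sum(K_cross * v.T, axis=1)  # predictive variance incl. noise
        total += np.sum(-0.5 * np.log(2 * np.pi * var)
                        - 0.5 * (y[test] - mu) ** 2 / var)
    return total

# Example with hypothetical toy data
rng = np.random.default_rng(3)
X = rng.uniform(0, 5, 60)
y = np.sin(X) + 0.2 * rng.standard_normal(60)
print(kfold_log_likelihood(X, y, lengthscale=0.8, variance=1.0, noise_var=0.04))
```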
Let's systematically compare these two hyperparameter selection approaches.
Theoretical Relationship:
For well-specified models (true data comes from GP prior), marginal likelihood and LOO-CV select asymptotically equivalent hyperparameters. The marginal likelihood can be approximated as:
$$\log p(\mathbf{y} | \boldsymbol{\theta}) \approx \text{LOO}(\boldsymbol{\theta}) + \text{penalty terms}$$
The penalty terms capture model complexity; LOO accounts for complexity implicitly through held-out evaluation.
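One way to see the connection: by the chain rule of probability, the log marginal likelihood decomposes exactly into a sum of sequential predictive densities,
$$\log p(\mathbf{y} \mid \boldsymbol{\theta}) = \sum_{i=1}^n \log p(y_i \mid y_1, \ldots, y_{i-1}, \boldsymbol{\theta}),$$
whereas LOO conditions each term on all other observations. The early terms in the marginal likelihood are predictions made from very little data, which is one way its complexity penalty enters.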
| Aspect | Marginal Likelihood | Cross-Validation (LOO) |
|---|---|---|
| Theoretical Basis | Bayesian: probability of data under prior | Frequentist: predictive performance |
| Complexity Handling | Automatic via determinant term | Implicit via held-out evaluation |
| Model Misspecification | Can be misleading if prior is wrong | Robust: measures actual predictions |
| Computational Cost | $O(n^3)$ | $O(n^3)$ (closed-form for GP) |
| Gradient Computation | Simple, closed-form | More complex, but doable |
| Uncertainty Calibration | Optimizes probabilistic fit | Can check but doesn't optimize directly |
| Non-Gaussian Extensions | Requires approximations | Straightforward with more compute |
| Literature Support | Dominant in GP community | Common in statistics/ML more broadly |
When They Disagree:
If marginal likelihood and LOO select substantially different hyperparameters, investigate:
Check for model misspecification: The GP assumptions (Gaussian noise, stationary kernel) may not hold.
Examine outliers: Outliers affect ML and LOO differently—ML is more sensitive to fitting outliers.
Look at data density: In sparsely sampled regions, ML and LOO may weight differently.
Consider noise model: If noise is heteroscedastic, both methods may struggle.
Practical Recommendation: Use marginal likelihood as primary objective (better gradient properties, principled), but check LOO as validation. If they agree, you're likely in good shape. If they disagree strongly, investigate further.
Beyond hyperparameter selection, LOO provides insight into calibration—whether the GP's uncertainty estimates are accurate.
Standardized LOO Residuals:
Define standardized residuals: $$z_i = \frac{y_i - \mu_{-i}}{\sigma_{-i}} = \frac{e_i}{\sqrt{\sigma_{-i}^2}} = \frac{\alpha_i}{\sqrt{(\mathbf{K}_y^{-1})_{ii}}}$$
For a well-calibrated GP, these should follow $\mathcal{N}(0, 1)$.
Calibration Checks:
```python
import numpy as np
from scipy import stats


def calibration_diagnostics(
    y: np.ndarray,
    loo_means: np.ndarray,
    loo_variances: np.ndarray
) -> dict:
    """
    Compute calibration diagnostics from LOO predictions.

    Returns
    -------
    dict with calibration metrics and test results
    """
    n = len(y)

    # Standardized residuals
    residuals = y - loo_means
    std_residuals = residuals / np.sqrt(loo_variances)

    # Basic statistics
    mean_std_residual = np.mean(std_residuals)
    std_std_residual = np.std(std_residuals, ddof=1)

    # Coverage: fraction of y within central prediction intervals
    coverage_68 = np.mean(np.abs(std_residuals) <= 1)      # within ±1 sigma: should be ~68%
    coverage_95 = np.mean(np.abs(std_residuals) <= 1.96)   # within ±1.96 sigma: should be ~95%
    coverage_99 = np.mean(np.abs(std_residuals) <= 2.576)  # within ±2.576 sigma: should be ~99%

    # Normality test on standardized residuals
    _, p_value_shapiro = stats.shapiro(std_residuals[:min(n, 5000)])  # Shapiro limited to 5000 samples

    # Kolmogorov-Smirnov test against N(0, 1)
    ks_stat, p_value_ks = stats.kstest(std_residuals, 'norm')

    return {
        'mean': mean_std_residual,
        'std': std_std_residual,
        'coverage_68': coverage_68,
        'coverage_95': coverage_95,
        'coverage_99': coverage_99,
        'shapiro_p_value': p_value_shapiro,
        'ks_statistic': ks_stat,
        'ks_p_value': p_value_ks,
        'calibrated': (0.9 < std_std_residual < 1.1
                       and abs(mean_std_residual) < 0.1)
    }
```

Interpreting Calibration Results:
| Observation | Interpretation | Action |
|---|---|---|
| $\text{Std}(z) < 1$ | Overestimating uncertainty | May have too much noise variance |
| $\text{Std}(z) > 1$ | Underestimating uncertainty | May have too little noise; model too confident |
| $\text{Mean}(z) \neq 0$ | Biased predictions | Mean function may not be appropriate |
| Coverage $< $ expected | Prediction intervals too narrow | Uncertainty underestimated |
| Non-Gaussian $z$ | Distributional assumptions violated | Consider transforming data or non-Gaussian likelihood |
Good calibration (standardized residuals close to N(0,1)) indicates not just good hyperparameters but appropriate model assumptions—Gaussian noise, correct kernel family, appropriate mean function. Poor calibration despite good marginal likelihood suggests model misspecification.
In practice, combining marginal likelihood and cross-validation often works best.
Strategy 1: ML Optimization with LOO Validation
Optimize hyperparameters by maximizing the marginal likelihood, then compute the LOO metrics and calibration diagnostics at the optimum (see the sketch below); if they look poor, revisit the kernel and noise model. This leverages ML's gradient efficiency while using LOO as a sanity check.
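A minimal sketch of this strategy follows, assuming the `leave_one_out_cv` function from earlier on this page is in scope; the `rbf_kernel` helper, the log-parameterization, and the toy data are illustrative assumptions, and no jitter or other numerical safeguards are included.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def neg_log_marginal_likelihood(log_params, X, y):
    """Negative log marginal likelihood; parameters are log(lengthscale, variance, noise)."""
    ell, var, noise = np.exp(log_params)
    K_y = rbf_kernel(X, X, ell, var) + noise * np.eye(len(y))
    chol = cho_factor(K_y, lower=True)
    alpha = cho_solve(chol, y)
    log_det = 2 * np.sum(np.log(np.diag(chol[0])))
    return 0.5 * (y @ alpha + log_det + len(y) * np.log(2 * np.pi))

# Hypothetical toy data
rng = np.random.default_rng(2)
X = rng.uniform(0, 5, 50)
y = np.sin(X) + 0.2 * rng.standard_normal(50)

# Step 1: maximize the marginal likelihood (finite-difference gradients for simplicity)
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method='L-BFGS-B')
ell, var, noise = np.exp(res.x)

# Step 2: LOO sanity check at the ML optimum
K_y = rbf_kernel(X, X, ell, var) + noise * np.eye(len(y))
loo = leave_one_out_cv(y, K_y)
print(loo['loo_log_likelihood'], loo['standardized_error_std'])  # std should be near 1 if calibrated
```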
Strategy 2: Dual Objective Optimization
Optimize a combination: $$\mathcal{J}(\boldsymbol{\theta}) = \lambda \cdot \mathcal{L}_{\text{ML}}(\boldsymbol{\theta}) + (1-\lambda) \cdot \mathcal{L}_{\text{LOO}}(\boldsymbol{\theta})$$
with $\lambda \in [0, 1]$. This can be useful when ML and LOO disagree, finding a compromise solution.
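As a sketch, the blended objective can be a simple weighted sum of the two scores. The code below assumes the `rbf_kernel` and `neg_log_marginal_likelihood` helpers from the Strategy 1 sketch and the `leave_one_out_cv` function from earlier are in scope; all names and the fixed parameterization are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def dual_objective(log_params, X, y, lam=0.5):
    """Weighted combination of log marginal likelihood and LOO log-likelihood (to maximize)."""
    ell, var, noise = np.exp(log_params)
    K_y = rbf_kernel(X, X, ell, var) + noise * np.eye(len(y))
    log_ml = -neg_log_marginal_likelihood(log_params, X, y)
    loo_ll = leave_one_out_cv(y, K_y)['loo_log_likelihood']
    return lam * log_ml + (1 - lam) * loo_ll

# Maximize by minimizing the negative, e.g.:
# res = minimize(lambda p: -dual_objective(p, X, y, lam=0.5), x0=np.log([1.0, 1.0, 0.1]))
```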
Strategy 3: Nested Cross-Validation
For rigorous model evaluation:
Split the data into outer folds; for each outer fold, tune hyperparameters (by ML or an inner CV loop) using only the outer training portion, then evaluate on the held-out outer fold. This gives an unbiased estimate of out-of-sample performance with properly tuned hyperparameters.
Strategy 4: Early Stopping with LOO
During optimization: monitor the LOO score alongside the marginal likelihood and stop when LOO stops improving, even if the marginal likelihood is still increasing.
This is particularly useful for complex models with many hyperparameters.
Given that LOO is essentially free once you have the matrix inverse, always compute both ML and LOO at the final solution. The small additional cost provides valuable validation. A perfect match isn't expected, but wildly different rankings of hyperparameter settings are a red flag.
Beyond LOO, several other predictive performance measures exist.
Continuous Ranked Probability Score (CRPS):
Measures distance between predicted CDF and true observation:
$$\text{CRPS}(F, y) = \int_{-\infty}^{\infty} (F(z) - \mathbf{1}_{z \geq y})^2 dz$$
For Gaussian predictive $\mathcal{N}(\mu, \sigma^2)$:
$$\text{CRPS} = \sigma \left[ \frac{y - \mu}{\sigma} \left( 2\Phi\left(\frac{y-\mu}{\sigma}\right) - 1 \right) + 2\phi\left(\frac{y-\mu}{\sigma}\right) - \frac{1}{\sqrt{\pi}} \right]$$
where $\Phi$ and $\phi$ are standard Gaussian CDF and PDF.
Properties:
CRPS is a proper scoring rule: it is minimized in expectation by the true predictive distribution.
It reduces to the mean absolute error for deterministic (point) predictions.
It is reported in the same units as $y$, and lower values are better.
Interval Score:
For a $(1-\alpha)$ prediction interval $[l, u]$:
$$S_\alpha(l, u, y) = (u - l) + \frac{2}{\alpha}(l - y)\mathbf{1}_{y < l} + \frac{2}{\alpha}(y - u)\mathbf{1}_{y > u}$$
Penalizes:
Wide intervals (through the $u - l$ term).
Observations that fall outside the interval, in proportion to how far they miss.
Predictive Variance:
Simply measure average predictive variance: $$\bar{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \sigma_{-i}^2$$
Useful for comparing models: prefer models with tighter (but calibrated) uncertainty.
LOO RMSE:
Ignore uncertainty entirely: $$\text{LOO-RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n e_i^2}$$
Useful when only point predictions matter, but loses probabilistic benefits.
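For reference, here is a short sketch implementing the Gaussian CRPS and interval-score formulas above; the function names are illustrative, and the commented usage assumes LOO predictions obtained from `leave_one_out_cv` with `return_details=True`.

```python
import numpy as np
from scipy.stats import norm

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive N(mu, sigma^2); lower is better."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def interval_score(y, mu, sigma, alpha=0.05):
    """Interval score for the central (1 - alpha) Gaussian prediction interval."""
    half_width = norm.ppf(1 - alpha / 2) * sigma
    l, u = mu - half_width, mu + half_width
    return (u - l) + (2 / alpha) * (l - y) * (y < l) + (2 / alpha) * (y - u) * (y > u)

# Example usage with LOO predictions:
# crps = np.mean(gaussian_crps(y, loo['loo_means'], np.sqrt(loo['loo_variances'])))
# iscore = np.mean(interval_score(y, loo['loo_means'], np.sqrt(loo['loo_variances'])))
```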
We've developed cross-validation as a complementary tool for GP hyperparameter selection. Let's consolidate:
LOO-CV has a closed form for GPs: given $\boldsymbol{\alpha}$ and the diagonal of $\mathbf{K}_y^{-1}$, all LOO means, variances, and log-probabilities cost only $O(n)$ extra, keeping the total at $O(n^3)$.
LOO gradients are also available in closed form, enabling direct optimization of predictive performance.
K-fold CV is the better tool for time series, grouped data, and non-Gaussian likelihoods.
Marginal likelihood and LOO usually agree for well-specified models; strong disagreement signals misspecification worth investigating.
Standardized LOO residuals provide a calibration check that goes beyond hyperparameter selection.
Alternative scores (CRPS, interval score, LOO-RMSE) target specific aspects of predictive performance.
Looking Ahead:
The final page in this module covers practical considerations for GP hyperparameter learning—implementation details, common pitfalls, software libraries, and real-world application guidance. This completes our comprehensive treatment of hyperparameter learning for Gaussian Processes.
You now understand cross-validation as an alternative to marginal likelihood for GP hyperparameter selection. The closed-form LOO expressions make it computationally practical, and calibration diagnostics provide valuable model checking beyond hyperparameter selection.