Every kernel we've discussed has hyperparameters: length scales, signal variances, periods, smoothness parameters. These hyperparameters profoundly affect the GP's predictions and uncertainty estimates.
Setting them incorrectly leads to:
- Overfitting: length scales that are too short (or noise variance set too low) make the GP chase noise.
- Underfitting: length scales that are too long smooth away real structure in the data.
- Miscalibrated uncertainty: a mis-specified noise variance makes predictive intervals too narrow or too wide.
This page covers how to learn optimal hyperparameters from data.
By the end of this page, you will understand the Type-II Maximum Likelihood (empirical Bayes) approach to hyperparameter optimization, implement gradient-based optimization of the log marginal likelihood, and know practical strategies for avoiding local optima.
Let $\theta$ denote the vector of all hyperparameters. For an RBF kernel with noise:
$$\theta = (\ell, \sigma_f^2, \sigma_n^2)$$
Our goal is to find:
$$\theta^* = \arg\max_\theta p(\mathbf{y} | X, \theta)$$
This is the marginal likelihood (or evidence): the probability of the observed data given the hyperparameters, with the function values $f$ integrated out.
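Because both the prior over $f$ and the likelihood are Gaussian, this integral is available in closed form. Writing $K = K_\theta + \sigma_n^2 I$ for the kernel matrix plus noise:

$$\log p(\mathbf{y} | X, \theta) = -\frac{1}{2}\mathbf{y}^T K^{-1}\mathbf{y} - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi$$

The first term measures data fit, the second penalizes model complexity, and the third is a normalization constant.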
We could maximize how well the GP fits training data, but this leads to overfitting. The marginal likelihood automatically balances fit and complexity through Bayesian Occam's razor. It penalizes overly complex models that 'waste' probability mass on functions that could explain many datasets.
The Optimization Landscape is Non-Convex
The log marginal likelihood is generally non-convex in the hyperparameters. This means:
- Multiple local optima can exist, often corresponding to different explanations of the data (e.g., a short length scale with low noise vs. a long length scale with high noise).
- The optimum found depends on the initialization.
- There is no guarantee of reaching the global maximum.
Despite this, gradient-based optimization works well in practice when combined with multiple restarts or careful initialization.
The standard approach is gradient ascent on the log marginal likelihood. We can compute the gradient analytically:
$$\frac{\partial \log p(\mathbf{y}|X,\theta)}{\partial \theta_j} = \frac{1}{2}\text{tr}\left((\boldsymbol{\alpha}\boldsymbol{\alpha}^T - K^{-1})\frac{\partial K}{\partial \theta_j}\right)$$
where $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$ and $K = K_\theta + \sigma_n^2 I$.
This gradient has two interpretable terms: $\boldsymbol{\alpha}\boldsymbol{\alpha}^T$ is a data-fit term that rewards kernels which explain the observations, while $-K^{-1}$ is a complexity penalty that discourages overly flexible kernels. At an optimum, the two balance.
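As a sanity check, the analytic gradient for the RBF length scale can be compared against a central finite difference. This is a minimal sketch; the kernel, data, and function names here are illustrative, not part of the implementation below.

```python
import numpy as np

def rbf(X, ls, sv):
    """RBF kernel matrix for inputs X with length scale ls, signal variance sv."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return sv * np.exp(-d2 / (2 * ls**2))

def log_ml(X, y, ls, sv, nv):
    """Log marginal likelihood via a Cholesky factorization."""
    K = rbf(X, ls, sv) + nv * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2 * np.pi))

def grad_ls(X, y, ls, sv, nv):
    """Analytic gradient w.r.t. the length scale using the trace formula."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = rbf(X, ls, sv) + nv * np.eye(len(X))
    dK = rbf(X, ls, sv) * d2 / ls**3  # dK/d(ls); the noise term has zero derivative
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ y
    return 0.5 * np.trace((np.outer(alpha, alpha) - Kinv) @ dK)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

analytic = grad_ls(X, y, 1.0, 1.0, 0.01)
eps = 1e-6
numeric = (log_ml(X, y, 1.0 + eps, 1.0, 0.01)
           - log_ml(X, y, 1.0 - eps, 1.0, 0.01)) / (2 * eps)
# The two estimates should agree to many decimal places
print(analytic, numeric)
```

Agreement between the two confirms the derivative $\partial K / \partial \ell$ was derived correctly; this is a worthwhile check whenever you hand-code kernel gradients.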
```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist


class GPWithOptimization:
    """GP with gradient-based hyperparameter optimization."""

    def __init__(self, noise_variance=1e-2):
        self.noise_variance = noise_variance
        self.length_scale = 1.0
        self.signal_variance = 1.0

    def _kernel(self, X1, X2, ls, sv):
        """RBF kernel with given hyperparameters."""
        sq_dist = cdist(X1, X2, 'sqeuclidean')
        return sv * np.exp(-sq_dist / (2 * ls**2))

    def _neg_log_marginal_likelihood(self, theta, X, y):
        """Negative log marginal likelihood (for minimization)."""
        ls, sv, nv = np.exp(theta)  # Log-transform for unconstrained opt
        K = self._kernel(X, X, ls, sv) + nv * np.eye(len(X))
        try:
            L = np.linalg.cholesky(K)
        except np.linalg.LinAlgError:
            return 1e10  # Return large value if not PSD
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        # Log marginal likelihood
        log_ml = -0.5 * y.T @ alpha
        log_ml -= np.sum(np.log(np.diag(L)))
        log_ml -= 0.5 * len(y) * np.log(2 * np.pi)
        return -log_ml  # Negative for minimization

    def optimize(self, X, y, n_restarts=5):
        """Optimize hyperparameters with multiple restarts."""
        best_theta = None
        best_nll = np.inf
        for _ in range(n_restarts):
            # Random initialization in log space
            theta0 = np.random.randn(3)
            result = minimize(
                self._neg_log_marginal_likelihood,
                theta0,
                args=(X, y),
                method='L-BFGS-B',
            )
            if result.fun < best_nll:
                best_nll = result.fun
                best_theta = result.x
        # Set optimized hyperparameters
        self.length_scale, self.signal_variance, self.noise_variance = \
            np.exp(best_theta)
        return {
            'length_scale': self.length_scale,
            'signal_variance': self.signal_variance,
            'noise_variance': self.noise_variance,
            'log_marginal_likelihood': -best_nll,
        }
```

Several strategies improve optimization reliability:
- Optimize in log space, so hyperparameters stay positive without explicit constraints.
- Use multiple random restarts to escape poor local optima.
- Initialize from data-driven heuristics rather than arbitrary defaults.
- Add a small jitter to the kernel diagonal for numerical stability.
Very short length scales with low noise variance often indicate overfitting. Very long length scales may indicate the model is reducing to a constant. Always visualize predictions and check that hyperparameters are physically reasonable.
While gradient-based MLE is standard, alternatives exist:
| Method | Approach | Pros/Cons |
|---|---|---|
| MLE (Type-II ML) | Maximize marginal likelihood | Standard; efficient but prone to overfitting with little data |
| Cross-Validation | Maximize held-out prediction | More robust but computationally expensive; O(n³) per fold |
| Full Bayesian | MCMC over hyperparameters | Uncertainty quantified but very slow; often impractical |
| Grid Search | Evaluate on grid of values | Simple but scales poorly with dimensionality |
| Bayesian Optimization | GP over the hyperparameter space | Efficient for expensive objectives; can use for GP tuning |
Cross-Validation Alternative
Instead of marginal likelihood, we can use leave-one-out cross-validation (LOO-CV). The LOO predictive distribution can be computed efficiently:
$$\mu_{-i} = y_i - \frac{\alpha_i}{[K^{-1}]_{ii}}, \quad \sigma^2_{-i} = \frac{1}{[K^{-1}]_{ii}}$$
where $\boldsymbol{\alpha} = K^{-1}\mathbf{y}$. This requires only a single matrix inversion, yet yields the predictions the GP would make at each point if that point were left out of the training set.
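These formulas are direct to implement once $K^{-1}$ is available. Below is a minimal sketch (the kernel and data are assumptions for illustration); all $n$ leave-one-out predictions come from one inverse rather than $n$ refits.

```python
import numpy as np

def loo_cv(K, y):
    """LOO predictive means and variances from a single inverse of K.

    K must include the noise variance on its diagonal.
    """
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ y
    diag = np.diag(Kinv)
    mu = y - alpha / diag   # leave-one-out predictive means
    var = 1.0 / diag        # leave-one-out predictive variances
    return mu, var

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(10, 1))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2) + 0.1 * np.eye(10)  # RBF kernel plus noise
y = np.sin(X[:, 0])

mu, var = loo_cv(K, y)
print(mu[0], var[0])
```

A useful check is to verify one entry against the brute-force alternative: delete point $i$, condition the GP on the remaining points, and predict at $x_i$; the results match to numerical precision.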
Good initialization dramatically improves optimization. Data-driven heuristics:
Length Scale Initialization: use the median pairwise distance between inputs (the "median heuristic").
Signal Variance Initialization: use the empirical variance of the outputs.
Noise Variance Initialization: use a small fraction (e.g., 10%) of the signal variance.
Period Initialization (for periodic kernels): use the dominant frequency from an FFT of the outputs, or the first peak of the autocorrelation.
```python
import numpy as np


def initialize_hyperparameters(X, y):
    """Data-driven hyperparameter initialization."""
    from scipy.spatial.distance import pdist

    # Length scale: median pairwise distance (assumes 1-D inputs here)
    pairwise_dists = pdist(X.reshape(-1, 1), 'euclidean')
    length_scale = np.median(pairwise_dists)

    # Signal variance: variance of outputs
    signal_variance = np.var(y)

    # Noise variance: 10% of signal variance
    noise_variance = 0.1 * signal_variance

    return {
        'length_scale': max(length_scale, 1e-3),
        'signal_variance': max(signal_variance, 1e-3),
        'noise_variance': max(noise_variance, 1e-6),
    }
```

You now understand how to optimize GP hyperparameters. Next, we'll dive deeper into the marginal likelihood: what it means and why it works as an objective function.