With the marginal likelihood objective and its gradients established, we now tackle the optimization itself. While the mathematical machinery looks clean, practical optimization of GP hyperparameters presents significant challenges:
Non-convexity: The log marginal likelihood surface has multiple local optima corresponding to qualitatively different models.
Scale sensitivity: Hyperparameters span many orders of magnitude—length-scales from 0.001 to 1000, variances from $10^{-6}$ to $10^6$.
Ill-conditioning: The Hessian can have eigenvalues differing by factors of $10^{10}$, causing gradient descent to oscillate wildly.
Parameter interactions: Signal variance and noise variance trade off; multiple length-scales interact in complex ways.
Computational cost: Each function evaluation requires $O(n^3)$ operations, limiting how many optimization iterations are practical.
This page develops strategies to navigate these challenges reliably.
By the end of this page, you will understand which optimization algorithms work best for GP hyperparameters, how to handle multiple local optima, strategies for initialization, and practical techniques for ensuring robust convergence.
Several optimization algorithms are suitable for GP hyperparameter learning. Each has trade-offs between per-iteration cost, convergence speed, and robustness.
First-Order Methods (Gradient-Based):
These methods use only gradient information $\nabla_\theta \mathcal{L}$:
| Method | Update Rule | Strengths | Weaknesses |
|---|---|---|---|
| Gradient Descent | $\theta \leftarrow \theta + \eta \nabla\mathcal{L}$ | Simple, low memory | Slow, requires LR tuning |
| Momentum | $v \leftarrow \beta v + \nabla\mathcal{L}$; $\theta \leftarrow \theta + \eta v$ | Accelerates in consistent gradients | Still requires LR tuning |
| Adam | Adaptive per-parameter LR | Robust to LR choice | May not converge precisely |
| RMSprop | Adaptive LR scaling | Good for non-stationary | Hyperparameters to tune |
Quasi-Newton and Related Methods:
These methods approximate second-order information from gradient history:
| Method | Key Idea | Strengths | Weaknesses |
|---|---|---|---|
| L-BFGS | Limited-memory Hessian approx. | Near-Newton convergence, memory efficient | Line search can fail |
| BFGS | Full Hessian approximation | Fast convergence | $O(p^2)$ memory |
| Conjugate Gradient | Conjugate search directions | Good for large-scale | Sensitive to restarts |
Second-Order Methods:
These use explicit Hessian information:
| Method | Key Idea | Strengths | Weaknesses |
|---|---|---|---|
| Newton's Method | $\theta \leftarrow \theta - H^{-1}\nabla\mathcal{L}$ | Quadratic convergence near optimum | $O(p^3)$ per iteration |
| Trust Region | Constrained Newton step | Robust to indefinite Hessians | Complex implementation |
L-BFGS is the gold standard for GP hyperparameter optimization. It achieves near-Newton convergence rates with $O(p)$ memory, handles moderate ill-conditioning well, and requires minimal hyperparameter tuning. Use L-BFGS as the default; switch to Adam for very high-dimensional hyperparameter spaces or when L-BFGS fails to converge.
Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) is a quasi-Newton method that approximates the inverse Hessian using information from the last $m$ gradient evaluations (typically $m = 5$ to $20$).
Algorithm Overview:
Compute gradient of the minimization objective $f(\theta) = -\mathcal{L}(\theta)$: $\mathbf{g}_k = -\nabla_\theta \mathcal{L}(\theta_k)$ (negated because we maximize the LML while the optimizer minimizes).
Compute search direction: $\mathbf{d}_k = -\mathbf{H}_k \mathbf{g}_k$, where $\mathbf{H}_k$ approximates the inverse Hessian of the objective.
Line search: Find a step size $\alpha$ satisfying the Wolfe conditions (sufficient decrease and a curvature condition).
Update: $\theta_{k+1} = \theta_k + \alpha \mathbf{d}_k$
Update inverse Hessian approximation: Store the pair $(\mathbf{s}_k, \mathbf{y}_k)$, where $\mathbf{s}_k = \theta_{k+1} - \theta_k$ and $\mathbf{y}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$, discarding the oldest pair once $m$ are stored. A sketch of the direction computation follows this list.
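To make the direction step concrete, here is a minimal sketch of the standard two-loop recursion that turns the stored $(\mathbf{s}, \mathbf{y})$ pairs into a search direction. It is illustrative only; in practice you would rely on a library implementation such as `scipy.optimize.minimize(method='L-BFGS-B')`, used later on this page, and the function name `lbfgs_direction` is ours.

```python
import numpy as np

def lbfgs_direction(grad: np.ndarray, s_list: list, y_list: list) -> np.ndarray:
    """Two-loop recursion: approximate -H @ grad from the last m (s, y) pairs.

    grad is the gradient of the minimization objective (the negative LML);
    s_list and y_list hold the most recent pairs, oldest first.
    """
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []

    # First loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        q = q - a * y
        alphas.append(a)

    # Scale by an estimate of the initial inverse Hessian
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q

    # Second loop: oldest pair to newest
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s

    return -r  # descent direction for the minimization objective
```

With no stored pairs the recursion reduces to plain steepest descent, which is why the first L-BFGS iteration behaves like gradient descent.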
Why L-BFGS Excels for GPs:
Handles scale differences: The inverse Hessian approximation automatically adapts to different parameter scales.
Memory efficiency: Only stores $m$ vector pairs, regardless of problem dimensionality.
Superlinear convergence: Achieves better-than-linear convergence near the optimum.
Robust line search: The Wolfe conditions ensure progress even with approximate Hessians.
Key Parameters:
| Parameter | Typical Value | Effect |
|---|---|---|
| Memory $m$ | 10-20 | More memory → better Hessian approximation |
| $c_1$ (sufficient decrease) | $10^{-4}$ | Armijo condition strength; larger values demand more decrease per step |
| $c_2$ (curvature) | 0.9 | Curvature condition strength; smaller values force a more accurate line search |
| Max iterations | 100-1000 | Sufficient for most problems |
```python
from scipy.optimize import minimize
import numpy as np

def optimize_gp_hyperparameters(
    objective_and_gradient: callable,
    initial_theta: np.ndarray,
    bounds: list[tuple] = None,
    max_iterations: int = 100
) -> dict:
    """
    Optimize GP hyperparameters using L-BFGS-B.

    Parameters
    ----------
    objective_and_gradient : callable
        Function that takes theta and returns (negative_lml, gradient).
        Note: scipy minimizes, so return the negative LML.
    initial_theta : np.ndarray
        Initial hyperparameter values (in log-space typically).
    bounds : list of (min, max) tuples
        Bounds for each hyperparameter (None for unbounded).
    max_iterations : int
        Maximum number of L-BFGS iterations.

    Returns
    -------
    dict with 'theta', 'lml', 'success', 'message'
    """
    # L-BFGS-B with bounds (variant of L-BFGS)
    result = minimize(
        objective_and_gradient,
        initial_theta,
        method='L-BFGS-B',
        jac=True,  # objective_and_gradient returns (value, gradient)
        bounds=bounds,
        options={
            'maxiter': max_iterations,
            'disp': False,
            'gtol': 1e-5,  # gradient tolerance
            'ftol': 1e-9,  # function tolerance
        }
    )

    return {
        'theta': result.x,
        'lml': -result.fun,  # negate back to get the actual LML
        'success': result.success,
        'message': result.message,
        'n_iterations': result.nit,
        'n_function_evals': result.nfev
    }
```

The marginal likelihood surface is non-convex. Different local optima can correspond to genuinely different explanations of the data:
Example: Two Valid Models
Consider noisy observations of a moderately varying function:
Model A: Smooth function ($\ell = 1.0$) + substantial noise ($\sigma_n^2 = 0.5$)
Model B: Wiggly function ($\ell = 0.1$) + minimal noise ($\sigma_n^2 = 0.01$)
Both can achieve similar marginal likelihood values, yet they make very different predictions for new points (especially in uncertainty estimates).
Strategy: Multiple Random Restarts
The most reliable approach to finding the global optimum:
Generate diverse initial points: Sample initial hyperparameters from a broad prior.
Run optimization from each: Apply L-BFGS (or chosen optimizer) independently.
Select best: Choose the result with highest marginal likelihood.
Check consistency: If runs converge to very different solutions with similar LML, investigate further—the data may genuinely support multiple explanations.
How Many Restarts?
| Problem Complexity | Recommended Restarts |
|---|---|
| Simple (1-3 hyperparameters) | 5-10 |
| Moderate (4-10 hyperparameters) | 10-30 |
| Complex (ARD with many dims) | 30-100 |
| Critical applications | Run until convergence pattern stabilizes |
```python
import numpy as np
from typing import Callable

def multi_restart_optimization(
    objective_and_gradient: Callable,
    n_hyperparameters: int,
    n_restarts: int = 20,
    bounds: list[tuple] = None,
    random_state: int = None
) -> dict:
    """
    Optimize with multiple random restarts.

    Returns the best result across all restarts.
    """
    rng = np.random.RandomState(random_state)

    best_result = None
    best_lml = -np.inf
    all_results = []

    for restart in range(n_restarts):
        # Generate random initial point
        if bounds is not None:
            # Sample within bounds (uniform in log-space if log-parameterized)
            initial = np.array([
                rng.uniform(low, high) for (low, high) in bounds
            ])
        else:
            # Default: sample from N(0, 1) for log-parameters
            initial = rng.randn(n_hyperparameters)

        # Run optimization
        try:
            result = optimize_gp_hyperparameters(
                objective_and_gradient, initial, bounds=bounds
            )
            all_results.append(result)

            if result['lml'] > best_lml:
                best_lml = result['lml']
                best_result = result
        except Exception as e:
            # Handle optimization failures gracefully
            print(f"Restart {restart} failed: {e}")
            continue

    if best_result is None:
        # Every restart failed; surface the problem instead of returning None
        raise RuntimeError("All restarts failed; check the objective and bounds.")

    # Return best result with additional statistics
    lmls = [r['lml'] for r in all_results]
    return {
        **best_result,
        'n_successful_restarts': len(all_results),
        'lml_mean': np.mean(lmls),
        'lml_std': np.std(lmls),
        'lml_max': best_lml,
        'all_results': all_results
    }
```

Not all local optima are equally valid. Pathological optima include: (1) $\sigma_n^2 \to 0$ (overfitting noise), (2) $\ell \to 0$ (interpolating through every point), (3) $\sigma_f^2 \to 0$ (ignoring data). Add bounds or priors to exclude such solutions.
Good initialization accelerates convergence and increases the probability of finding the global optimum.
Data-Informed Initialization:
Rather than random sampling, use data statistics to guide initialization:
| Hyperparameter | Recommended Initialization |
|---|---|
| Length-scale $\ell$ | Median pairwise distance in $\mathbf{X}$ |
| Signal variance $\sigma_f^2$ | Variance of $\mathbf{y}$ |
| Noise variance $\sigma_n^2$ | 10% of $\text{Var}(\mathbf{y})$ initially |
| ARD length-scales $\ell_d$ | Range or IQR of dimension $d$ |
```python
import numpy as np
from scipy.spatial.distance import pdist

def data_informed_initial_hyperparameters(
    X: np.ndarray,  # (n, d) inputs
    y: np.ndarray,  # (n,) outputs
    kernel_type: str = 'rbf',
    ard: bool = False
) -> dict:
    """
    Compute data-informed initial hyperparameters.

    Returns log-transformed hyperparameters for unconstrained optimization.
    """
    n, d = X.shape

    # Signal variance: start near output variance
    y_var = np.var(y)
    log_signal_var = np.log(y_var) if y_var > 0 else 0.0

    # Noise variance: start at fraction of signal
    log_noise_var = np.log(0.1 * y_var) if y_var > 0 else -2.0

    if ard:
        # ARD: length-scale per dimension
        log_lengthscales = []
        for dim in range(d):
            # Use IQR as robust scale estimate
            q75, q25 = np.percentile(X[:, dim], [75, 25])
            iqr = q75 - q25
            if iqr > 0:
                log_lengthscales.append(np.log(iqr))
            else:
                # Fallback for constant dimensions
                log_lengthscales.append(0.0)
        log_lengthscales = np.array(log_lengthscales)
    else:
        # Isotropic: single length-scale based on median distance
        if n < 5000:
            pairwise_dists = pdist(X)
            median_dist = np.median(pairwise_dists)
        else:
            # Subsample for large datasets
            idx = np.random.choice(n, 500, replace=False)
            pairwise_dists = pdist(X[idx])
            median_dist = np.median(pairwise_dists)
        log_lengthscales = np.log(median_dist) if median_dist > 0 else 0.0

    return {
        'log_lengthscales': log_lengthscales,
        'log_signal_var': log_signal_var,
        'log_noise_var': log_noise_var
    }
```

Initialization for Random Restarts:
When using multiple restarts, diversify initial points (a sampling sketch follows this list):
One data-informed point: Use the strategy above for at least one restart.
Latin Hypercube Sampling: Generate well-spread points in hyperparameter space.
Log-uniform sampling: Sample $\log \theta \sim \text{Uniform}(\log \theta_{\min}, \log \theta_{\max})$ to cover multiple orders of magnitude.
Perturbations of data-informed: Add Gaussian noise to the data-informed initialization.
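A minimal sketch of how these strategies might be combined into a set of starting points; the function name `sample_initial_points`, the bounds format, and the even split between log-uniform draws and perturbations are illustrative assumptions rather than a standard recipe.

```python
import numpy as np

def sample_initial_points(
    data_informed: np.ndarray,   # data-informed initialization (log-space)
    log_bounds: list,            # [(low, high), ...] in log-space, one per parameter
    n_restarts: int = 20,
    perturbation_scale: float = 0.5,
    seed: int = 0
) -> np.ndarray:
    """Return an (n_restarts, p) array of diverse starting points in log-space."""
    rng = np.random.default_rng(seed)
    data_informed = np.asarray(data_informed, dtype=float)
    p = data_informed.shape[0]
    lows = np.array([b[0] for b in log_bounds])
    highs = np.array([b[1] for b in log_bounds])

    # Always include the data-informed point itself
    points = [data_informed]

    # Roughly half the remaining restarts: log-uniform over the bounded box
    n_uniform = (n_restarts - 1) // 2
    points.extend(rng.uniform(lows, highs, size=(n_uniform, p)))

    # The rest: Gaussian perturbations of the data-informed point
    n_perturb = n_restarts - 1 - n_uniform
    points.extend(
        data_informed + perturbation_scale * rng.standard_normal((n_perturb, p))
    )

    # Clip everything back inside the bounds
    return np.clip(np.vstack(points), lows, highs)
```

If well-spread points matter, the uniform draw could be replaced by a quasi-random design such as `scipy.stats.qmc.LatinHypercube`.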
How we represent hyperparameters matters as much as how we optimize them. Poor parameterization causes ill-conditioning, constraint violations, and slow convergence.
The Natural (Constrained) Parameterization:
Hyperparameters have natural constraints: length-scales and variances must be strictly positive ($\ell > 0$, $\sigma_f^2 > 0$, $\sigma_n^2 > 0$).
Optimizing directly with constrained optimization is possible but adds complexity.
The Log Parameterization (Unconstrained):
Map constrained parameters to unconstrained via: $$\bar{\theta} = \log \theta \quad \Rightarrow \quad \theta = \exp(\bar{\theta})$$
Advantages: the search becomes unconstrained (every real $\bar{\theta}$ maps to a valid positive $\theta$), hyperparameters spanning many orders of magnitude become comparable in scale, and additive steps in $\bar{\theta}$ correspond to multiplicative changes in $\theta$, which matches how length-scales and variances behave.
Gradient Transformation:
With $\theta = \exp(\bar{\theta})$, the chain rule gives:
$$\frac{\partial \mathcal{L}}{\partial \bar{\theta}} = \frac{\partial \mathcal{L}}{\partial \theta} \cdot \frac{\partial \theta}{\partial \bar{\theta}} = \frac{\partial \mathcal{L}}{\partial \theta} \cdot \theta$$
This multiplication by $\theta$ naturally scales gradients—large parameters have proportionally larger gradients, appropriate for their scale.
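As a sketch, a thin wrapper can apply this chain rule so the optimizer only ever sees unconstrained log-parameters; `lml_and_grad_constrained` is a hypothetical stand-in for your marginal-likelihood code returning $(\mathcal{L}, \partial\mathcal{L}/\partial\theta)$.

```python
import numpy as np

def make_logspace_objective(lml_and_grad_constrained):
    """Wrap a constrained-space objective (theta > 0) for unconstrained optimization.

    lml_and_grad_constrained(theta) is assumed to return (LML, dLML/dtheta).
    The returned function maps log_theta to (negative LML, gradient w.r.t. log_theta),
    the form expected by scipy's L-BFGS-B with jac=True.
    """
    def objective(log_theta: np.ndarray):
        theta = np.exp(log_theta)                  # back to the constrained space
        lml, grad_theta = lml_and_grad_constrained(theta)
        grad_log_theta = grad_theta * theta        # chain rule: dL/dlog_theta = dL/dtheta * theta
        return -lml, -grad_log_theta               # negated because scipy minimizes
    return objective
```

The returned function produces exactly the (value, gradient) pair that `optimize_gp_hyperparameters` above expects.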
Alternative Parameterizations:
| Parameterization | Transform ($\theta$ in terms of $\bar\theta$) | Use Case |
|---|---|---|
| Log | $\theta = \exp(\bar\theta)$ | Standard for variances and length-scales |
| Softplus | $\theta = \log(1 + \exp(\bar\theta))$ | Smoother behavior near zero |
| Squared | $\theta = \bar\theta^2$ | Simple, but the gradient vanishes at $\bar\theta = 0$ |
| Inverse | $\theta = 1/\bar\theta$ | When working with precisions |
Some implementations parameterize by log standard deviation ($\log \sigma$ rather than $\log \sigma^2$). This changes gradient magnitudes by a factor of 2 but doesn't fundamentally affect optimization. Be consistent within your implementation.
Knowing when to stop optimization is crucial—stopping too early leaves performance on the table; stopping too late wastes computation.
Standard Convergence Criteria:
Gradient norm: $\|\nabla_\theta\mathcal{L}\| < \epsilon_g$
Function change: $|\mathcal{L}^{(k)} - \mathcal{L}^{(k-1)}| < \epsilon_f$
Parameter change: $\|\theta^{(k)} - \theta^{(k-1)}\| < \epsilon_x$
Maximum iterations: Fallback to prevent infinite loops
Practical Considerations:
| Criterion | When Important | Potential Issues |
|---|---|---|
| Gradient norm | Precise optima needed | May be sensitive to scaling |
| Function change | Diminishing returns | Insensitive to flat regions |
| Parameter change | Stable hyperparameters for deployment | May stop in saddle points |
| Max iterations | Computational budget constraints | May miss optimum |
Recommended Composite Criterion:
Stop when ANY of the following holds: the gradient norm falls below $\epsilon_g$; the relative change in $\mathcal{L}$ stays below $\epsilon_f$ for several consecutive iterations; or the iteration budget is exhausted.
The 'consecutive iterations' requirement prevents premature stopping in flat regions. A sketch of such a check follows.
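A sketch of what this composite check might look like inside a hand-rolled optimization loop; the name `should_stop`, the default thresholds, and the `patience` parameter are illustrative (library optimizers such as scipy's L-BFGS-B expose the analogous thresholds as `gtol`, `ftol`, and `maxiter`).

```python
import numpy as np

def should_stop(history: list, grad: np.ndarray,
                eps_g: float = 1e-5, eps_f: float = 1e-9,
                patience: int = 3, max_iter: int = 200) -> bool:
    """history is the list of objective values seen so far (most recent last)."""
    if len(history) >= max_iter:
        return True                                  # budget exhausted
    if np.linalg.norm(grad) < eps_g:
        return True                                  # gradient-norm criterion
    if len(history) > patience:
        # Relative function change over the last `patience` iterations
        recent_changes = np.abs(np.diff(history[-(patience + 1):]))
        if np.all(recent_changes < eps_f * (1.0 + abs(history[-1]))):
            return True                              # flat for several consecutive iterations
    return False
```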
For most applications, approximate optima are sufficient—the marginal likelihood is typically flat near the optimum. Invest more in finding the right local optimum (via restarts) rather than converging to extreme precision in a potentially suboptimal basin.
The marginal likelihood alone can sometimes drive hyperparameters to pathological values. Adding bounds or regularization prevents such behavior.
Hyperparameter Bounds:
| Hyperparameter | Reasonable Lower Bound | Reasonable Upper Bound |
|---|---|---|
| $\log \ell$ | $\log(\text{min spacing})$ | $\log(10 \times \text{data range})$ |
| $\log \sigma_f^2$ | $\log(10^{-6} \cdot \text{Var}(y))$ | $\log(10^3 \cdot \text{Var}(y))$ |
| $\log \sigma_n^2$ | $\log(10^{-8})$ or jitter | $\log(\text{Var}(y))$ |
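A sketch of how such bounds might be assembled from the data, assuming a single isotropic length-scale and the parameter ordering (length-scale, signal variance, noise variance); the function name `default_log_bounds` is ours and the constants simply mirror the table above.

```python
import numpy as np
from scipy.spatial.distance import pdist

def default_log_bounds(X: np.ndarray, y: np.ndarray) -> list:
    """Return [(low, high)] log-space bounds for (length-scale, signal var, noise var)."""
    dists = pdist(X)
    min_spacing = np.min(dists[dists > 0])   # smallest nonzero pairwise distance
    data_range = np.max(dists)               # largest pairwise distance
    y_var = np.var(y)

    bounds_lengthscale = (np.log(min_spacing), np.log(10.0 * data_range))
    bounds_signal_var = (np.log(1e-6 * y_var), np.log(1e3 * y_var))
    bounds_noise_var = (np.log(1e-8), np.log(y_var))
    return [bounds_lengthscale, bounds_signal_var, bounds_noise_var]
```

The resulting list plugs directly into the `bounds` argument of `optimize_gp_hyperparameters`.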
Jitter for Numerical Stability:
Always add a small 'jitter' term to the noise variance: $$\mathbf{K}_y = \mathbf{K} + (\sigma_n^2 + \epsilon) \mathbf{I}$$
with $\epsilon \approx 10^{-6}$ to $10^{-8}$. This ensures $\mathbf{K}_y$ is positive definite even if $\sigma_n^2$ becomes very small.
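In code, the jitter simply joins the noise variance on the diagonal before the Cholesky factorization. A minimal sketch, with the common trick of growing the jitter on failure (the helper name `stable_cholesky` is ours):

```python
import numpy as np

def stable_cholesky(K: np.ndarray, noise_var: float, jitter: float = 1e-6) -> np.ndarray:
    """Cholesky factor of K + (noise_var + jitter) * I, growing the jitter if needed."""
    n = K.shape[0]
    for attempt in range(5):
        try:
            return np.linalg.cholesky(K + (noise_var + jitter * 10**attempt) * np.eye(n))
        except np.linalg.LinAlgError:
            continue  # not positive definite yet; increase the jitter and retry
    raise np.linalg.LinAlgError("Kernel matrix is not positive definite even with jitter.")
```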
Hyperprior Regularization:
An alternative to hard bounds is adding a log-prior to the objective:
$$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \log p(\theta)$$
This corresponds to MAP estimation of hyperparameters rather than maximum likelihood.
Common Hyperpriors:
| Distribution | Form | Effect |
|---|---|---|
| Gamma | $p(\sigma^2) \propto (\sigma^2)^{\alpha-1} e^{-\beta \sigma^2}$ | Soft upper bound on variance |
| Inverse Gamma | $p(\sigma^2) \propto (\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2}$ | Soft lower bound on variance |
| Log-Normal | $\log \theta \sim \mathcal{N}(\mu, \tau^2)$ | Centered around expected scale |
| Horseshoe | Hierarchical shrinkage | Strong regularization for ARD |
The gradient of the log-prior simply adds to the marginal likelihood gradient.
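A sketch of how a log-normal hyperprior might be folded into the objective when working in log-space (where the prior on $\log\theta$ is simply Gaussian); `neg_lml_and_grad` is a stand-in for the existing objective, and the independent-prior assumption is ours.

```python
import numpy as np

def neg_map_objective(log_theta: np.ndarray, neg_lml_and_grad,
                      prior_mean: np.ndarray, prior_std: np.ndarray) -> tuple:
    """Negative (log marginal likelihood + log prior) and its gradient in log-space.

    Assumes an independent Gaussian prior on each log-hyperparameter,
    i.e. a log-normal prior on the original positive hyperparameters.
    """
    neg_lml, neg_grad = neg_lml_and_grad(log_theta)

    # Gaussian log-prior on log_theta (additive constants dropped)
    z = (log_theta - prior_mean) / prior_std
    log_prior = -0.5 * np.sum(z ** 2)
    grad_log_prior = -z / prior_std

    # Minimization objective: negative LML minus log-prior
    return neg_lml - log_prior, neg_grad - grad_log_prior
```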
Regularization is most valuable when: (1) data is scarce relative to hyperparameter count, (2) some hyperparameters are poorly determined by the data, (3) prior knowledge about appropriate scales exists. With abundant data, the marginal likelihood typically provides sufficient regularization.
Each marginal likelihood evaluation costs $O(n^3)$ for $n$ training points. For large datasets, this dominates total computation.
Budget Allocation:
| Dataset Size | LML Cost | Optimization Strategy |
|---|---|---|
| $n < 1000$ | < 1 second | Full L-BFGS, many restarts |
| $1000 < n < 5000$ | 1-60 seconds | L-BFGS with moderate restarts |
| $5000 < n < 10000$ | 1-10 minutes | Few restarts, careful initialization |
| $n > 10000$ | Impractical | Use sparse GP methods |
Strategies for Large Datasets:
Subsampling for initialization: Optimize on a random subset (e.g., 1000 points) to find good starting hyperparameters, then fine-tune on full data (see the sketch after this list).
Warm starting: If training multiple related models, use previous optimum as starting point.
Early stopping in restarts: If a restart's LML is below the current best after a few iterations, abort early.
Parallel restarts: Run multiple restarts simultaneously on different cores.
Sparse approximations: Use inducing point methods (covered in Module 5) where the cost is $O(nm^2)$ with $m \ll n$ inducing points.
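As promised above, a sketch of the subsampling-for-initialization strategy, reusing `optimize_gp_hyperparameters` from earlier; `make_objective` is a hypothetical factory that builds a (negative LML, gradient) function for a given dataset, and the subset size and iteration cap are illustrative.

```python
import numpy as np

def subsample_then_finetune(X: np.ndarray, y: np.ndarray, make_objective,
                            initial_theta: np.ndarray, bounds=None,
                            subset_size: int = 1000, seed: int = 0) -> dict:
    """Optimize hyperparameters on a random subset, then refine on the full data."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Stage 1: cheap optimization on a subset to locate a good basin
    idx = rng.choice(n, size=min(subset_size, n), replace=False)
    coarse = optimize_gp_hyperparameters(
        make_objective(X[idx], y[idx]), initial_theta, bounds=bounds
    )

    # Stage 2: warm-started refinement on the full dataset (few expensive iterations)
    return optimize_gp_hyperparameters(
        make_objective(X, y), coarse['theta'], bounds=bounds, max_iterations=30
    )
```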
Approximating Gradients:
For very large $n$, consider: stochastic trace estimation, which replaces the exact trace terms in the gradient with randomized estimates; iterative linear solvers such as conjugate gradients in place of the Cholesky factorization; and computing the objective and gradient on a subset of the data. A sketch of the first option follows.
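For instance, the trace term $\operatorname{tr}\!\left(\mathbf{K}_y^{-1}\,\partial\mathbf{K}/\partial\theta_j\right)$ that appears in the gradient can be estimated with random probe vectors (Hutchinson's estimator). A minimal sketch, using a dense solve for clarity; at the scales where this matters, the solve itself would be replaced by conjugate gradients, and the helper name is ours.

```python
import numpy as np

def hutchinson_trace_term(K_y: np.ndarray, dK: np.ndarray,
                          n_probes: int = 10, seed: int = 0) -> float:
    """Estimate tr(K_y^{-1} @ dK) with Rademacher probe vectors.

    Exact computation needs the full inverse; the estimator only needs
    one linear solve per probe vector.
    """
    rng = np.random.default_rng(seed)
    n = K_y.shape[0]
    estimate = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)      # Rademacher probe vector
        w = np.linalg.solve(K_y, dK @ z)         # K_y^{-1} dK z (use CG for large n)
        estimate += z @ w
    return estimate / n_probes
```

The estimate is unbiased, and its variance shrinks as the number of probe vectors grows.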
For routine optimization, allocate roughly 80% of compute budget to a thorough search (many restarts with data-informed initialization) and 20% to final refinement. It's better to find the right basin than to precisely converge in the wrong one.
When optimization doesn't converge or produces unreasonable hyperparameters, systematic debugging is essential.
Common Failure Modes and Solutions:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Cholesky fails (not positive definite) | Noise variance too small or numerical issues | Add jitter; bound $\log \sigma_n^2$ from below |
| LML is NaN or Inf | Overflow/underflow in exp or log | Use log-parameterization; check for extreme values |
| Gradient is zero everywhere | Hyperparameters at boundary | Check bounds; verify gradient computation |
| Oscillation without progress | Ill-conditioning | Use L-BFGS with larger memory; reparameterize |
| $\ell \to 0$ (exploding complexity) | Overfitting to noise | Add lower bound on length-scale |
| $\ell \to \infty$ (constant model) | No signal in data, or wrong kernel | Verify data has variation; try different kernel |
| All restarts find same (bad) solution | Initialization too narrow | Broaden initial sampling range |
| Restarts find very different solutions | Multiple valid explanations | Increase restarts; compare predictions |
Debugging Checklist:
Verify analytical gradients against finite differences before optimizing.
Confirm the log-parameterization is applied consistently, including the chain-rule factor on gradients.
Check that jitter is added to the kernel diagonal before every Cholesky factorization.
Inspect whether any hyperparameter has converged onto a bound.
Compare the hyperparameters and predictions found by different restarts.
We've developed a comprehensive toolkit for optimizing GP hyperparameters. The key strategies: optimize in log-space with L-BFGS as the default algorithm; initialize from data statistics and run multiple, diverse restarts; add bounds, jitter, and (when needed) hyperpriors to exclude pathological optima; stop with a composite convergence criterion rather than a single threshold; and spend the computational budget on finding the right basin rather than on converging to extreme precision within it.
Looking Ahead:
The next page examines cross-validation as an alternative to marginal likelihood for hyperparameter selection. Understanding both approaches allows you to choose the right tool for each situation and use cross-validation as a sanity check when marginal likelihood behavior is unexpected.
You now have practical strategies for optimizing GP hyperparameters: which algorithms to use, how to handle non-convexity, initialization techniques, and debugging approaches. This completes the optimization machinery for GP hyperparameter learning.