With the marginal likelihood objective and its gradients established, we now tackle the optimization itself. While the mathematical machinery looks clean, practical optimization of GP hyperparameters presents significant challenges:
Non-convexity: The log marginal likelihood surface has multiple local optima corresponding to qualitatively different models.
Scale sensitivity: Hyperparameters span many orders of magnitude—length-scales from 0.001 to 1000, variances from $10^{-6}$ to $10^6$.
Ill-conditioning: The Hessian can have eigenvalues differing by factors of $10^{10}$, causing gradient descent to oscillate wildly.
Parameter interactions: Signal variance and noise variance trade off; multiple length-scales interact in complex ways.
Computational cost: Each function evaluation requires $O(n^3)$ operations, limiting how many optimization iterations are practical.
This page develops strategies to navigate these challenges reliably.
By the end of this page, you will understand which optimization algorithms work best for GP hyperparameters, how to handle multiple local optima, strategies for initialization, and practical techniques for ensuring robust convergence.
Several optimization algorithms are suitable for GP hyperparameter learning. Each has trade-offs between per-iteration cost, convergence speed, and robustness.
First-Order Methods (Gradient-Based):
These methods use only gradient information $\nabla_\theta \mathcal{L}$:
| Method | Update Rule | Strengths | Weaknesses |
|---|---|---|---|
| Gradient Descent | $\theta \leftarrow \theta + \eta \nabla\mathcal{L}$ | Simple, low memory | Slow, requires LR tuning |
| Momentum | $v \leftarrow \beta v + \nabla\mathcal{L}$; $\theta \leftarrow \theta + \eta v$ | Accelerates in consistent gradients | Still requires LR tuning |
| Adam | Adaptive per-parameter LR | Robust to LR choice | May not converge precisely |
| RMSprop | Adaptive LR scaling | Good for non-stationary | Hyperparameters to tune |
Quasi-Newton and Related Methods:
These methods approximate second-order information from gradient history:
| Method | Key Idea | Strengths | Weaknesses |
|---|---|---|---|
| L-BFGS | Limited-memory Hessian approx. | Near-Newton convergence, memory efficient | Line search can fail |
| BFGS | Full Hessian approximation | Fast convergence | $O(p^2)$ memory |
| Conjugate Gradient | Conjugate search directions | Good for large-scale | Sensitive to restarts |
Second-Order Methods:
These use explicit Hessian information:
| Method | Key Idea | Strengths | Weaknesses |
|---|---|---|---|
| Newton's Method | $\theta \leftarrow \theta - H^{-1}\nabla\mathcal{L}$ | Quadratic convergence near optimum | $O(p^3)$ per iteration |
| Trust Region | Constrained Newton step | Robust to indefinite Hessians | Complex implementation |
L-BFGS is the gold standard for GP hyperparameter optimization. It achieves near-Newton convergence rates with $O(p)$ memory, handles moderate ill-conditioning well, and requires minimal hyperparameter tuning. Use L-BFGS as the default; switch to Adam for very high-dimensional hyperparameter spaces or when L-BFGS fails to converge.
Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) is a quasi-Newton method that approximates the inverse Hessian using information from the last $m$ gradient evaluations (typically $m = 5$ to $20$).
Algorithm Overview:
Compute gradient of the minimization objective $f(\theta) = -\mathcal{L}(\theta)$: $\mathbf{g}_k = -\nabla_\theta \mathcal{L}(\theta_k)$ (negated because we maximize the LML while the optimizer minimizes).
Compute search direction: $\mathbf{d}_k = -\mathbf{H}_k \mathbf{g}_k$, where $\mathbf{H}_k$ approximates the inverse Hessian of the objective.
Line search: Find a step size $\alpha$ satisfying the Wolfe conditions (sufficient decrease and a curvature condition).
Update: $\theta_{k+1} = \theta_k + \alpha \mathbf{d}_k$
Update inverse Hessian approximation: Store the pair $(\mathbf{s}_k, \mathbf{y}_k)$, where $\mathbf{s}_k = \theta_{k+1} - \theta_k$ and $\mathbf{y}_k = \mathbf{g}_{k+1} - \mathbf{g}_k$, discarding the oldest pair once $m$ are stored. A sketch of the direction computation follows this list.
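To make the direction step concrete, here is a minimal sketch of the standard two-loop recursion that turns the stored $(\mathbf{s}, \mathbf{y})$ pairs into a search direction. It is illustrative only; in practice you would rely on a library implementation such as `scipy.optimize.minimize(method='L-BFGS-B')`, used later on this page, and the function name `lbfgs_direction` is ours.

```python
import numpy as np

def lbfgs_direction(grad: np.ndarray, s_list: list, y_list: list) -> np.ndarray:
    """Two-loop recursion: approximate -H @ grad from the last m (s, y) pairs.

    grad is the gradient of the minimization objective (the negative LML);
    s_list and y_list hold the most recent pairs, oldest first.
    """
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []

    # First loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        q = q - a * y
        alphas.append(a)

    # Scale by an estimate of the initial inverse Hessian
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q

    # Second loop: oldest pair to newest
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s

    return -r  # descent direction for the minimization objective
```

With no stored pairs the recursion reduces to plain steepest descent, which is why the first L-BFGS iteration behaves like gradient descent.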
Why L-BFGS Excels for GPs:
Handles scale differences: The inverse Hessian approximation automatically adapts to different parameter scales.
Memory efficiency: Only stores $m$ vector pairs, regardless of problem dimensionality.
Superlinear convergence: Achieves better-than-linear convergence near the optimum.
Robust line search: The Wolfe conditions ensure progress even with approximate Hessians.
Key Parameters:
| Parameter | Typical Value | Effect |
|---|---|---|
| Memory $m$ | 10-20 | More memory → better Hessian approximation |
| $c_1$ (sufficient decrease) | $10^{-4}$ | Armijo condition strength; larger values demand more decrease per step |
| $c_2$ (curvature) | 0.9 | Curvature condition strength; smaller values force a more accurate line search |
| Max iterations | 100-1000 | Sufficient for most problems |
```python
from scipy.optimize import minimize
import numpy as np

def optimize_gp_hyperparameters(
    objective_and_gradient: callable,
    initial_theta: np.ndarray,
    bounds: list[tuple] = None,
    max_iterations: int = 100
) -> dict:
    """
    Optimize GP hyperparameters using L-BFGS-B.

    Parameters
    ----------
    objective_and_gradient : callable
        Function that takes theta and returns (negative_lml, gradient).
        Note: scipy minimizes, so return the negative LML.
    initial_theta : np.ndarray
        Initial hyperparameter values (in log-space typically).
    bounds : list of (min, max) tuples
        Bounds for each hyperparameter (None for unbounded).
    max_iterations : int
        Maximum number of L-BFGS iterations.

    Returns
    -------
    dict with 'theta', 'lml', 'success', 'message'
    """
    # L-BFGS-B with bounds (variant of L-BFGS)
    result = minimize(
        objective_and_gradient,
        initial_theta,
        method='L-BFGS-B',
        jac=True,  # objective_and_gradient returns (value, gradient)
        bounds=bounds,
        options={
            'maxiter': max_iterations,
            'disp': False,
            'gtol': 1e-5,  # gradient tolerance
            'ftol': 1e-9,  # function tolerance
        }
    )

    return {
        'theta': result.x,
        'lml': -result.fun,  # negate back to get the actual LML
        'success': result.success,
        'message': result.message,
        'n_iterations': result.nit,
        'n_function_evals': result.nfev
    }
```

The marginal likelihood surface is non-convex. Different local optima can correspond to genuinely different explanations of the data:
Example: Two Valid Models
Consider noisy observations of a moderately varying function:
Model A: Smooth function ($\ell = 1.0$) + substantial noise ($\sigma_n^2 = 0.5$)
Model B: Wiggly function ($\ell = 0.1$) + minimal noise ($\sigma_n^2 = 0.01$)
Both can achieve similar marginal likelihood values, yet they make very different predictions for new points (especially in uncertainty estimates).
Strategy: Multiple Random Restarts
The most reliable approach to finding the global optimum:
Generate diverse initial points: Sample initial hyperparameters from a broad prior.
Run optimization from each: Apply L-BFGS (or chosen optimizer) independently.
Select best: Choose the result with highest marginal likelihood.
Check consistency: If runs converge to very different solutions with similar LML, investigate further—the data may genuinely support multiple explanations.
How Many Restarts?
| Problem Complexity | Recommended Restarts |
|---|---|
| Simple (1-3 hyperparameters) | 5-10 |
| Moderate (4-10 hyperparameters) | 10-30 |
| Complex (ARD with many dims) | 30-100 |
| Critical applications | Run until convergence pattern stabilizes |
```python
import numpy as np
from typing import Callable

def multi_restart_optimization(
    objective_and_gradient: Callable,
    n_hyperparameters: int,
    n_restarts: int = 20,
    bounds: list[tuple] = None,
    random_state: int = None
) -> dict:
    """
    Optimize with multiple random restarts.

    Returns the best result across all restarts.
    """
    rng = np.random.RandomState(random_state)

    best_result = None
    best_lml = -np.inf
    all_results = []

    for restart in range(n_restarts):
        # Generate random initial point
        if bounds is not None:
            # Sample within bounds (uniform in log-space if log-parameterized)
            initial = np.array([
                rng.uniform(low, high) for (low, high) in bounds
            ])
        else:
            # Default: sample from N(0, 1) for log-parameters
            initial = rng.randn(n_hyperparameters)

        # Run optimization
        try:
            result = optimize_gp_hyperparameters(
                objective_and_gradient, initial, bounds=bounds
            )
            all_results.append(result)

            if result['lml'] > best_lml:
                best_lml = result['lml']
                best_result = result
        except Exception as e:
            # Handle optimization failures gracefully
            print(f"Restart {restart} failed: {e}")
            continue

    if best_result is None:
        # Every restart failed; surface the problem instead of returning None
        raise RuntimeError("All restarts failed; check the objective and bounds.")

    # Return best result with additional statistics
    lmls = [r['lml'] for r in all_results]
    return {
        **best_result,
        'n_successful_restarts': len(all_results),
        'lml_mean': np.mean(lmls),
        'lml_std': np.std(lmls),
        'lml_max': best_lml,
        'all_results': all_results
    }
```

Not all local optima are equally valid. Pathological optima include: (1) $\sigma_n^2 \to 0$ (overfitting noise), (2) $\ell \to 0$ (interpolating through every point), (3) $\sigma_f^2 \to 0$ (ignoring data). Add bounds or priors to exclude such solutions.
Good initialization accelerates convergence and increases the probability of finding the global optimum.
Data-Informed Initialization:
Rather than random sampling, use data statistics to guide initialization:
| Hyperparameter | Recommended Initialization |
|---|---|
| Length-scale $\ell$ | Median pairwise distance in $\mathbf{X}$ |
| Signal variance $\sigma_f^2$ | Variance of $\mathbf{y}$ |
| Noise variance $\sigma_n^2$ | 10% of $\text{Var}(\mathbf{y})$ initially |
| ARD length-scales $\ell_d$ | Range or IQR of dimension $d$ |
```python
import numpy as np
from scipy.spatial.distance import pdist

def data_informed_initial_hyperparameters(
    X: np.ndarray,  # (n, d) inputs
    y: np.ndarray,  # (n,) outputs
    kernel_type: str = 'rbf',
    ard: bool = False
) -> dict:
    """
    Compute data-informed initial hyperparameters.

    Returns log-transformed hyperparameters for unconstrained optimization.
    """
    n, d = X.shape

    # Signal variance: start near output variance
    y_var = np.var(y)
    log_signal_var = np.log(y_var) if y_var > 0 else 0.0

    # Noise variance: start at fraction of signal
    log_noise_var = np.log(0.1 * y_var) if y_var > 0 else -2.0

    if ard:
        # ARD: length-scale per dimension
        log_lengthscales = []
        for dim in range(d):
            # Use IQR as robust scale estimate
            q75, q25 = np.percentile(X[:, dim], [75, 25])
            iqr = q75 - q25
            if iqr > 0:
                log_lengthscales.append(np.log(iqr))
            else:
                # Fallback for constant dimensions
                log_lengthscales.append(0.0)
        log_lengthscales = np.array(log_lengthscales)
    else:
        # Isotropic: single length-scale based on median distance
        if n < 5000:
            pairwise_dists = pdist(X)
            median_dist = np.median(pairwise_dists)
        else:
            # Subsample for large datasets
            idx = np.random.choice(n, 500, replace=False)
            pairwise_dists = pdist(X[idx])
            median_dist = np.median(pairwise_dists)
        log_lengthscales = np.log(median_dist) if median_dist > 0 else 0.0

    return {
        'log_lengthscales': log_lengthscales,
        'log_signal_var': log_signal_var,
        'log_noise_var': log_noise_var
    }
```

Initialization for Random Restarts:
When using multiple restarts, diversify initial points (a sampling sketch follows this list):
One data-informed point: Use the strategy above for at least one restart.
Latin Hypercube Sampling: Generate well-spread points in hyperparameter space.
Log-uniform sampling: Sample $\log \theta \sim \text{Uniform}(\log \theta_{\min}, \log \theta_{\max})$ to cover multiple orders of magnitude.
Perturbations of data-informed: Add Gaussian noise to the data-informed initialization.
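A minimal sketch of how these strategies might be combined into a set of starting points; the function name `sample_initial_points`, the bounds format, and the even split between log-uniform draws and perturbations are illustrative assumptions rather than a standard recipe.

```python
import numpy as np

def sample_initial_points(
    data_informed: np.ndarray,   # data-informed initialization (log-space)
    log_bounds: list,            # [(low, high), ...] in log-space, one per parameter
    n_restarts: int = 20,
    perturbation_scale: float = 0.5,
    seed: int = 0
) -> np.ndarray:
    """Return an (n_restarts, p) array of diverse starting points in log-space."""
    rng = np.random.default_rng(seed)
    data_informed = np.asarray(data_informed, dtype=float)
    p = data_informed.shape[0]
    lows = np.array([b[0] for b in log_bounds])
    highs = np.array([b[1] for b in log_bounds])

    # Always include the data-informed point itself
    points = [data_informed]

    # Roughly half the remaining restarts: log-uniform over the bounded box
    n_uniform = (n_restarts - 1) // 2
    points.extend(rng.uniform(lows, highs, size=(n_uniform, p)))

    # The rest: Gaussian perturbations of the data-informed point
    n_perturb = n_restarts - 1 - n_uniform
    points.extend(
        data_informed + perturbation_scale * rng.standard_normal((n_perturb, p))
    )

    # Clip everything back inside the bounds
    return np.clip(np.vstack(points), lows, highs)
```

If well-spread points matter, the uniform draw could be replaced by a quasi-random design such as `scipy.stats.qmc.LatinHypercube`.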
How we represent hyperparameters matters as much as how we optimize them. Poor parameterization causes ill-conditioning, constraint violations, and slow convergence.
The Natural (Constrained) Parameterization:
Hyperparameters have natural constraints: length-scales and variances must be strictly positive ($\ell > 0$, $\sigma_f^2 > 0$, $\sigma_n^2 > 0$).
Optimizing directly with constrained optimization is possible but adds complexity.
The Log Parameterization (Unconstrained):
Map constrained parameters to unconstrained via: $$\bar{\theta} = \log \theta \quad \Rightarrow \quad \theta = \exp(\bar{\theta})$$
Advantages: the search becomes unconstrained (every real $\bar{\theta}$ maps to a valid positive $\theta$), hyperparameters spanning many orders of magnitude become comparable in scale, and additive steps in $\bar{\theta}$ correspond to multiplicative changes in $\theta$, which matches how length-scales and variances behave.
Gradient Transformation:
With $\theta = \exp(\bar{\theta})$, the chain rule gives:
$$\frac{\partial \mathcal{L}}{\partial \bar{\theta}} = \frac{\partial \mathcal{L}}{\partial \theta} \cdot \frac{\partial \theta}{\partial \bar{\theta}} = \frac{\partial \mathcal{L}}{\partial \theta} \cdot \theta$$
This multiplication by $\theta$ naturally scales gradients—large parameters have proportionally larger gradients, appropriate for their scale.
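As a sketch, a thin wrapper can apply this chain rule so the optimizer only ever sees unconstrained log-parameters; `lml_and_grad_constrained` is a hypothetical stand-in for your marginal-likelihood code returning $(\mathcal{L}, \partial\mathcal{L}/\partial\theta)$.

```python
import numpy as np

def make_logspace_objective(lml_and_grad_constrained):
    """Wrap a constrained-space objective (theta > 0) for unconstrained optimization.

    lml_and_grad_constrained(theta) is assumed to return (LML, dLML/dtheta).
    The returned function maps log_theta to (negative LML, gradient w.r.t. log_theta),
    the form expected by scipy's L-BFGS-B with jac=True.
    """
    def objective(log_theta: np.ndarray):
        theta = np.exp(log_theta)                  # back to the constrained space
        lml, grad_theta = lml_and_grad_constrained(theta)
        grad_log_theta = grad_theta * theta        # chain rule: dL/dlog_theta = dL/dtheta * theta
        return -lml, -grad_log_theta               # negated because scipy minimizes
    return objective
```

The returned function produces exactly the (value, gradient) pair that `optimize_gp_hyperparameters` above expects.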
Alternative Parameterizations:
| Parameterization | Transform ($\theta$ in terms of $\bar\theta$) | Use Case |
|---|---|---|
| Log | $\theta = \exp(\bar\theta)$ | Standard for variances and length-scales |
| Softplus | $\theta = \log(1 + \exp(\bar\theta))$ | Smoother behavior near zero |
| Squared | $\theta = \bar\theta^2$ | Simple, but the gradient vanishes at $\bar\theta = 0$ |
| Inverse | $\theta = 1/\bar\theta$ | When working with precisions |
Some implementations parameterize by log standard deviation ($\log \sigma$ rather than $\log \sigma^2$). This changes gradient magnitudes by a factor of 2 but doesn't fundamentally affect optimization. Be consistent within your implementation.
Knowing when to stop optimization is crucial—stopping too early leaves performance on the table; stopping too late wastes computation.
Standard Convergence Criteria:
Gradient norm: $\|\nabla_\theta\mathcal{L}\| < \epsilon_g$
Function change: $|\mathcal{L}^{(k)} - \mathcal{L}^{(k-1)}| < \epsilon_f$
Parameter change: $\|\theta^{(k)} - \theta^{(k-1)}\| < \epsilon_x$
Maximum iterations: Fallback to prevent infinite loops
Practical Considerations:
| Criterion | When Important | Potential Issues |
|---|---|---|
| Gradient norm | Precise optima needed | May be sensitive to scaling |
| Function change | Diminishing returns | Insensitive to flat regions |
| Parameter change | Stable hyperparameters for deployment | May stop in saddle points |
| Max iterations | Computational budget constraints | May miss optimum |
Recommended Composite Criterion:
Stop when ANY of the following holds: the gradient norm falls below $\epsilon_g$; the relative change in $\mathcal{L}$ stays below $\epsilon_f$ for several consecutive iterations; or the iteration budget is exhausted.
The 'consecutive iterations' requirement prevents premature stopping in flat regions. A sketch of such a check follows.
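A sketch of what this composite check might look like inside a hand-rolled optimization loop; the name `should_stop`, the default thresholds, and the `patience` parameter are illustrative (library optimizers such as scipy's L-BFGS-B expose the analogous thresholds as `gtol`, `ftol`, and `maxiter`).

```python
import numpy as np

def should_stop(history: list, grad: np.ndarray,
                eps_g: float = 1e-5, eps_f: float = 1e-9,
                patience: int = 3, max_iter: int = 200) -> bool:
    """history is the list of objective values seen so far (most recent last)."""
    if len(history) >= max_iter:
        return True                                  # budget exhausted
    if np.linalg.norm(grad) < eps_g:
        return True                                  # gradient-norm criterion
    if len(history) > patience:
        # Relative function change over the last `patience` iterations
        recent_changes = np.abs(np.diff(history[-(patience + 1):]))
        if np.all(recent_changes < eps_f * (1.0 + abs(history[-1]))):
            return True                              # flat for several consecutive iterations
    return False
```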
For most applications, approximate optima are sufficient—the marginal likelihood is typically flat near the optimum. Invest more in finding the right local optimum (via restarts) rather than converging to extreme precision in a potentially suboptimal basin.
The marginal likelihood alone can sometimes drive hyperparameters to pathological values. Adding bounds or regularization prevents such behavior.
Hyperparameter Bounds:
| Hyperparameter | Reasonable Lower Bound | Reasonable Upper Bound |
|---|---|---|
| $\log \ell$ | $\log(\text{min spacing})$ | $\log(10 \times \text{data range})$ |
| $\log \sigma_f^2$ | $\log(10^{-6} \cdot \text{Var}(y))$ | $\log(10^3 \cdot \text{Var}(y))$ |
| $\log \sigma_n^2$ | $\log(10^{-8})$ or jitter | $\log(\text{Var}(y))$ |
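A sketch of how such bounds might be assembled from the data, assuming a single isotropic length-scale and the parameter ordering (length-scale, signal variance, noise variance); the function name `default_log_bounds` is ours and the constants simply mirror the table above.

```python
import numpy as np
from scipy.spatial.distance import pdist

def default_log_bounds(X: np.ndarray, y: np.ndarray) -> list:
    """Return [(low, high)] log-space bounds for (length-scale, signal var, noise var)."""
    dists = pdist(X)
    min_spacing = np.min(dists[dists > 0])   # smallest nonzero pairwise distance
    data_range = np.max(dists)               # largest pairwise distance
    y_var = np.var(y)

    bounds_lengthscale = (np.log(min_spacing), np.log(10.0 * data_range))
    bounds_signal_var = (np.log(1e-6 * y_var), np.log(1e3 * y_var))
    bounds_noise_var = (np.log(1e-8), np.log(y_var))
    return [bounds_lengthscale, bounds_signal_var, bounds_noise_var]
```

The resulting list plugs directly into the `bounds` argument of `optimize_gp_hyperparameters`.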
Jitter for Numerical Stability:
Always add a small 'jitter' term to the noise variance: $$\mathbf{K}_y = \mathbf{K} + (\sigma_n^2 + \epsilon) \mathbf{I}$$
with $\epsilon \approx 10^{-6}$ to $10^{-8}$. This ensures $\mathbf{K}_y$ is positive definite even if $\sigma_n^2$ becomes very small.
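In code, the jitter simply joins the noise variance on the diagonal before the Cholesky factorization. A minimal sketch, with the common trick of growing the jitter on failure (the helper name `stable_cholesky` is ours):

```python
import numpy as np

def stable_cholesky(K: np.ndarray, noise_var: float, jitter: float = 1e-6) -> np.ndarray:
    """Cholesky factor of K + (noise_var + jitter) * I, growing the jitter if needed."""
    n = K.shape[0]
    for attempt in range(5):
        try:
            return np.linalg.cholesky(K + (noise_var + jitter * 10**attempt) * np.eye(n))
        except np.linalg.LinAlgError:
            continue  # not positive definite yet; increase the jitter and retry
    raise np.linalg.LinAlgError("Kernel matrix is not positive definite even with jitter.")
```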
Hyperprior Regularization:
An alternative to hard bounds is adding a log-prior to the objective:
$$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \log p(\theta)$$
This corresponds to MAP estimation of hyperparameters rather than maximum likelihood.
Common Hyperpriors:
| Distribution | Form | Effect |
|---|---|---|
| Gamma | $p(\sigma^2) \propto (\sigma^2)^{\alpha-1} e^{-\beta \sigma^2}$ | Soft upper bound on variance |
| Inverse Gamma | $p(\sigma^2) \propto (\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2}$ | Soft lower bound on variance |
| Log-Normal | $\log \theta \sim \mathcal{N}(\mu, \tau^2)$ | Centered around expected scale |
| Horseshoe | Hierarchical shrinkage | Strong regularization for ARD |
The gradient of the log-prior simply adds to the marginal likelihood gradient.
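A sketch of how a log-normal hyperprior might be folded into the objective when working in log-space (where the prior on $\log\theta$ is simply Gaussian); `neg_lml_and_grad` is a stand-in for the existing objective, and the independent-prior assumption is ours.

```python
import numpy as np

def neg_map_objective(log_theta: np.ndarray, neg_lml_and_grad,
                      prior_mean: np.ndarray, prior_std: np.ndarray) -> tuple:
    """Negative (log marginal likelihood + log prior) and its gradient in log-space.

    Assumes an independent Gaussian prior on each log-hyperparameter,
    i.e. a log-normal prior on the original positive hyperparameters.
    """
    neg_lml, neg_grad = neg_lml_and_grad(log_theta)

    # Gaussian log-prior on log_theta (additive constants dropped)
    z = (log_theta - prior_mean) / prior_std
    log_prior = -0.5 * np.sum(z ** 2)
    grad_log_prior = -z / prior_std

    # Minimization objective: negative LML minus log-prior
    return neg_lml - log_prior, neg_grad - grad_log_prior
```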
Regularization is most valuable when: (1) data is scarce relative to hyperparameter count, (2) some hyperparameters are poorly determined by the data, (3) prior knowledge about appropriate scales exists. With abundant data, the marginal likelihood typically provides sufficient regularization.
Each marginal likelihood evaluation costs $O(n^3)$ for $n$ training points. For large datasets, this dominates total computation.
Budget Allocation:
| Dataset Size | LML Cost | Optimization Strategy |
|---|---|---|
| $n < 1000$ | < 1 second | Full L-BFGS, many restarts |
| $1000 < n < 5000$ | 1-60 seconds | L-BFGS with moderate restarts |
| $5000 < n < 10000$ | 1-10 minutes | Few restarts, careful initialization |
| $n > 10000$ | Impractical | Use sparse GP methods |
Strategies for Large Datasets:
Subsampling for initialization: Optimize on a random subset (e.g., 1000 points) to find good starting hyperparameters, then fine-tune on full data (see the sketch after this list).
Warm starting: If training multiple related models, use previous optimum as starting point.
Early stopping in restarts: If a restart's LML is below the current best after a few iterations, abort early.
Parallel restarts: Run multiple restarts simultaneously on different cores.
Sparse approximations: Use inducing point methods (covered in Module 5) where the cost is $O(nm^2)$ with $m \ll n$ inducing points.
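As promised above, a sketch of the subsampling-for-initialization strategy, reusing `optimize_gp_hyperparameters` from earlier; `make_objective` is a hypothetical factory that builds a (negative LML, gradient) function for a given dataset, and the subset size and iteration cap are illustrative.

```python
import numpy as np

def subsample_then_finetune(X: np.ndarray, y: np.ndarray, make_objective,
                            initial_theta: np.ndarray, bounds=None,
                            subset_size: int = 1000, seed: int = 0) -> dict:
    """Optimize hyperparameters on a random subset, then refine on the full data."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Stage 1: cheap optimization on a subset to locate a good basin
    idx = rng.choice(n, size=min(subset_size, n), replace=False)
    coarse = optimize_gp_hyperparameters(
        make_objective(X[idx], y[idx]), initial_theta, bounds=bounds
    )

    # Stage 2: warm-started refinement on the full dataset (few expensive iterations)
    return optimize_gp_hyperparameters(
        make_objective(X, y), coarse['theta'], bounds=bounds, max_iterations=30
    )
```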
Approximating Gradients:
For very large $n$, consider: stochastic trace estimation, which replaces the exact trace terms in the gradient with randomized estimates; iterative linear solvers such as conjugate gradients in place of the Cholesky factorization; and computing the objective and gradient on a subset of the data. A sketch of the first option follows.
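For instance, the trace term $\operatorname{tr}\!\left(\mathbf{K}_y^{-1}\,\partial\mathbf{K}/\partial\theta_j\right)$ that appears in the gradient can be estimated with random probe vectors (Hutchinson's estimator). A minimal sketch, using a dense solve for clarity; at the scales where this matters, the solve itself would be replaced by conjugate gradients, and the helper name is ours.

```python
import numpy as np

def hutchinson_trace_term(K_y: np.ndarray, dK: np.ndarray,
                          n_probes: int = 10, seed: int = 0) -> float:
    """Estimate tr(K_y^{-1} @ dK) with Rademacher probe vectors.

    Exact computation needs the full inverse; the estimator only needs
    one linear solve per probe vector.
    """
    rng = np.random.default_rng(seed)
    n = K_y.shape[0]
    estimate = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)      # Rademacher probe vector
        w = np.linalg.solve(K_y, dK @ z)         # K_y^{-1} dK z (use CG for large n)
        estimate += z @ w
    return estimate / n_probes
```

The estimate is unbiased, and its variance shrinks as the number of probe vectors grows.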
For routine optimization, allocate roughly 80% of compute budget to a thorough search (many restarts with data-informed initialization) and 20% to final refinement. It's better to find the right basin than to precisely converge in the wrong one.
When optimization doesn't converge or produces unreasonable hyperparameters, systematic debugging is essential.
Common Failure Modes and Solutions:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Cholesky fails (not positive definite) | Noise variance too small or numerical issues | Add jitter; bound $\log \sigma_n^2$ from below |
| LML is NaN or Inf | Overflow/underflow in exp or log | Use log-parameterization; check for extreme values |
| Gradient is zero everywhere | Hyperparameters at boundary | Check bounds; verify gradient computation |
| Oscillation without progress | Ill-conditioning | Use L-BFGS with larger memory; reparameterize |
| $\ell \to 0$ (exploding complexity) | Overfitting to noise | Add lower bound on length-scale |
| $\ell \to \infty$ (constant model) | No signal in data, or wrong kernel | Verify data has variation; try different kernel |
| All restarts find same (bad) solution | Initialization too narrow | Broaden initial sampling range |
| Restarts find very different solutions | Multiple valid explanations | Increase restarts; compare predictions |
Debugging Checklist:
Verify analytical gradients against finite differences before optimizing.
Confirm the log-parameterization is applied consistently, including the chain-rule factor on gradients.
Check that jitter is added to the kernel diagonal before every Cholesky factorization.
Inspect whether any hyperparameter has converged onto a bound.
Compare the hyperparameters and predictions found by different restarts.
We've developed a comprehensive toolkit for optimizing GP hyperparameters. The key strategies: optimize in log-space with L-BFGS as the default algorithm; initialize from data statistics and run multiple, diverse restarts; add bounds, jitter, and (when needed) hyperpriors to exclude pathological optima; stop with a composite convergence criterion rather than a single threshold; and spend the computational budget on finding the right basin rather than on converging to extreme precision within it.
Looking Ahead:
The next page examines cross-validation as an alternative to marginal likelihood for hyperparameter selection. Understanding both approaches allows you to choose the right tool for each situation and use cross-validation as a sanity check when marginal likelihood behavior is unexpected.
You now have practical strategies for optimizing GP hyperparameters: which algorithms to use, how to handle non-convexity, initialization techniques, and debugging approaches. This completes the optimization machinery for GP hyperparameter learning.