Having established the marginal likelihood as our objective for hyperparameter learning, we now face a computational challenge: how do we efficiently optimize it?
The marginal likelihood landscape is typically non-convex with multiple local optima. Gradient-free methods (grid search, random search, evolutionary algorithms) can work for low-dimensional hyperparameter spaces but become prohibitively expensive as dimensionality increases. With multiple length-scales (one per input dimension), periodic kernels, or composite kernel structures, the hyperparameter space easily reaches 10-50 dimensions.
Gradient-based optimization is the solution. Methods like L-BFGS, conjugate gradient, or Adam can efficiently navigate high-dimensional spaces when gradients are available. For Gaussian Processes, we're fortunate: the log marginal likelihood has closed-form gradients.
By the end of this page, you will understand how to derive and compute gradients of the log marginal likelihood with respect to arbitrary hyperparameters. You'll master the chain rule structure, trace identities, and efficient computational strategies that make gradient-based GP optimization practical.
Let's establish the notation we'll use throughout this derivation.
The Log Marginal Likelihood (LML):
$$\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2}\log |\mathbf{K}_y| - \frac{n}{2}\log(2\pi)$$
where $\mathbf{K}_y = \mathbf{K} + \sigma_n^2 \mathbf{I}$ is the noisy covariance matrix built from the kernel matrix $\mathbf{K}$ and the noise variance $\sigma_n^2$, $\mathbf{y}$ is the vector of $n$ observed targets, and $\boldsymbol{\theta}$ collects all kernel and noise hyperparameters.
Our Goal:
Compute $\frac{\partial \mathcal{L}}{\partial \theta_j}$ for each hyperparameter $\theta_j$. This enables gradient-based optimization:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
Key Quantities:
We'll repeatedly use the dual vector $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$, the kernel-matrix derivatives $\frac{\partial \mathbf{K}_y}{\partial \theta_j}$, and the Cholesky factorization $\mathbf{K}_y = \mathbf{L}\mathbf{L}^\top$ that makes both cheap to compute.
The structure of the gradient will depend on how each hyperparameter enters the covariance matrix. Fortunately, all common kernels have simple, differentiable forms.
Before diving into the gradient derivation, we need several matrix calculus identities. These are essential tools for differentiating expressions involving matrices.
Identity 1: Derivative of Matrix Inverse
For a matrix $\mathbf{A}$ that depends on a scalar $\theta$:
$$\frac{\partial \mathbf{A}^{-1}}{\partial \theta} = -\mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \theta} \mathbf{A}^{-1}$$
Proof sketch: Differentiate $\mathbf{A}^{-1} \mathbf{A} = \mathbf{I}$ using the product rule.
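Writing the product rule out makes the identity immediate:
$$\frac{\partial}{\partial \theta}\left(\mathbf{A}^{-1}\mathbf{A}\right) = \frac{\partial \mathbf{A}^{-1}}{\partial \theta}\mathbf{A} + \mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial \theta} = \mathbf{0} \quad\Longrightarrow\quad \frac{\partial \mathbf{A}^{-1}}{\partial \theta} = -\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial \theta}\mathbf{A}^{-1}$$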
Identity 2: Derivative of Log Determinant
For a positive definite matrix $\mathbf{A}$:
$$\frac{\partial \log |\mathbf{A}|}{\partial \theta} = \text{tr}\left( \mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \theta} \right)$$
Proof sketch: Use $\log|\mathbf{A}| = \sum_i \log \lambda_i$ and the sensitivity of eigenvalues, or the identity $\frac{\partial |\mathbf{A}|}{\partial A_{ij}} = |\mathbf{A}| (\mathbf{A}^{-1})_{ji}$.
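Equivalently, Jacobi's formula $\frac{\partial |\mathbf{A}|}{\partial \theta} = |\mathbf{A}|\,\text{tr}\!\left(\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial \theta}\right)$ gives the result in one line:
$$\frac{\partial \log|\mathbf{A}|}{\partial \theta} = \frac{1}{|\mathbf{A}|}\frac{\partial |\mathbf{A}|}{\partial \theta} = \text{tr}\!\left(\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial \theta}\right)$$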
Identity 3: Derivative of Quadratic Form
For vectors $\mathbf{a}, \mathbf{b}$ independent of $\theta$:
$$\frac{\partial}{\partial \theta} \left( \mathbf{a}^\top \mathbf{A}^{-1} \mathbf{b} \right) = -\mathbf{a}^\top \mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \theta} \mathbf{A}^{-1} \mathbf{b}$$
This follows directly from Identity 1.
Identity 4: Trace Cyclic Property
For matrices of compatible dimensions:
$$\text{tr}(\mathbf{ABC}) = \text{tr}(\mathbf{CAB}) = \text{tr}(\mathbf{BCA})$$
This allows rearranging terms inside traces, which is crucial for efficient computation.
Identity 5: Trace and Outer Products
For vectors $\mathbf{a}, \mathbf{b}$ and matrix $\mathbf{M}$:
$$\mathbf{a}^\top \mathbf{M} \mathbf{b} = \text{tr}(\mathbf{M} \mathbf{b} \mathbf{a}^\top) = \text{tr}(\mathbf{a}^\top \mathbf{M} \mathbf{b})$$
This allows converting quadratic forms to traces when convenient.
These five identities are the 'atoms' of matrix calculus for multivariate Gaussian distributions. Nearly every derivative in GP theory—gradients, Hessians, Fisher information—reduces to combinations of these rules. Mastering them pays dividends across probabilistic machine learning.
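As a quick sanity check, Identities 2 and 3 can be verified numerically on a small parameterized matrix. The sketch below uses the arbitrary positive definite example $\mathbf{A}(\theta) = \mathbf{M}\mathbf{M}^\top + \theta\mathbf{I}$, so $\partial\mathbf{A}/\partial\theta = \mathbf{I}$; the specific matrices and step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, theta = 5, 1e-6, 0.7

# A(theta) = M M^T + theta * I is positive definite; dA/dtheta = I
M = rng.standard_normal((n, n))
A = lambda t: M @ M.T + t * np.eye(n)
dA = np.eye(n)

# Identity 2: d log|A| / d theta = tr(A^{-1} dA)
numeric = (np.linalg.slogdet(A(theta + eps))[1]
           - np.linalg.slogdet(A(theta - eps))[1]) / (2 * eps)
analytic = np.trace(np.linalg.solve(A(theta), dA))
print(abs(numeric - analytic))  # should be tiny

# Identity 3: d(a^T A^{-1} b)/d theta = -a^T A^{-1} dA A^{-1} b
a, b = rng.standard_normal(n), rng.standard_normal(n)
quad = lambda t: a @ np.linalg.solve(A(t), b)
numeric = (quad(theta + eps) - quad(theta - eps)) / (2 * eps)
analytic = -a @ np.linalg.solve(A(theta), dA @ np.linalg.solve(A(theta), b))
print(abs(numeric - analytic))  # should be tiny
```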
We now derive the gradient $\frac{\partial \mathcal{L}}{\partial \theta_j}$ for an arbitrary hyperparameter $\theta_j$.
Starting Point:
$$\mathcal{L} = -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2}\log |\mathbf{K}_y| - \frac{n}{2}\log(2\pi)$$
The third term is constant with respect to hyperparameters. Let's differentiate the first two.
Gradient of the Data Fit Term:
$$\frac{\partial}{\partial \theta_j}\left( -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y} \right)$$
Using Identity 3:
$$= -\frac{1}{2} \cdot \left( -\mathbf{y}^\top \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1} \mathbf{y} \right)$$
$$= \frac{1}{2} \mathbf{y}^\top \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1} \mathbf{y}$$
Defining $\boldsymbol{\alpha} = \mathbf{K}_y^{-1} \mathbf{y}$:
$$= \frac{1}{2} \boldsymbol{\alpha}^\top \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha}$$
Gradient of the Complexity Term:
$$\frac{\partial}{\partial \theta_j}\left( -\frac{1}{2}\log |\mathbf{K}_y| \right)$$
Using Identity 2:
$$= -\frac{1}{2} \text{tr}\left( \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)$$
Combining the Terms:
$$\boxed{\frac{\partial \mathcal{L}}{\partial \theta_j} = \frac{1}{2} \boldsymbol{\alpha}^\top \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha} - \frac{1}{2} \text{tr}\left( \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)}$$
Or equivalently, using the trace-quadratic identity ($\boldsymbol{\alpha}^\top \mathbf{A} \boldsymbol{\alpha} = \text{tr}(\boldsymbol{\alpha} \boldsymbol{\alpha}^\top \mathbf{A})$):
$$\boxed{\frac{\partial \mathcal{L}}{\partial \theta_j} = \frac{1}{2} \text{tr}\left( \left( \boldsymbol{\alpha} \boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1} \right) \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)}$$
The trace form is particularly elegant: it shows the gradient is determined by a single matrix $\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1}$ contracted with the kernel derivative. This matrix encodes the 'discrepancy' between the outer product of the dual solution and the precision matrix.
Let's develop intuition for what each gradient term represents.
Term 1: Data Fit Contribution
$$\frac{1}{2} \boldsymbol{\alpha}^\top \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha}$$
Recall $\boldsymbol{\alpha} = \mathbf{K}_y^{-1} \mathbf{y}$ represents the 'residuals in the dual space.' This term measures how the data fit changes when we perturb $\theta_j$: a positive value means increasing $\theta_j$ shrinks the Mahalanobis distance $\mathbf{y}^\top \mathbf{K}_y^{-1}\mathbf{y}$, so the data become better explained and the gradient pushes $\theta_j$ upward.
Term 2: Complexity Contribution
$$-\frac{1}{2} \text{tr}\left( \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)$$
This term measures how the model complexity changes with $\theta_j$: a large trace means increasing $\theta_j$ inflates $\log|\mathbf{K}_y|$, spreading prior probability over a larger volume of possible datasets, so the term penalizes the increase.
The Balance:
At the optimum, these two forces are in equilibrium:
$$\boldsymbol{\alpha}^\top \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha} = \text{tr}\left( \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)$$
For each hyperparameter, the marginal gain in data fit from increasing it equals the marginal cost in complexity. This is the differential form of Occam's Razor.
Connection to Fisher Information:
The complexity term $\text{tr}(\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j})$ is related to the Fisher information of the Gaussian distribution with covariance $\mathbf{K}_y$. Hyperparameters with large Fisher information (the model is sensitive to their values) are more heavily penalized—the model 'pays more' for parameters that strongly affect predictions.
We're computing gradients for maximizing the log marginal likelihood. If using a minimization framework (as many optimizers expect), negate the objective and gradients. The convention here follows the original GP literature.
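For instance, a thin wrapper that negates both outputs is enough to hand the problem to `scipy.optimize.minimize`. The sketch below uses a toy stand-in objective, since the GP routines appear later on this page; `toy_lml_and_grads` and the starting point are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(theta, lml_and_grads):
    """Negate the LML and its gradient so a minimizer performs maximization."""
    lml, grads = lml_and_grads(theta)
    return -lml, -grads

def toy_lml_and_grads(theta):
    # Concave toy surrogate: -(theta - 1)^2 summed; gradient is -2(theta - 1)
    return -np.sum((theta - 1.0) ** 2), -2.0 * (theta - 1.0)

result = minimize(neg_lml, x0=np.zeros(3), args=(toy_lml_and_grads,),
                  jac=True, method="L-BFGS-B")
print(result.x)  # converges to [1, 1, 1] for the toy objective
```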
The gradient formula requires $\frac{\partial \mathbf{K}_y}{\partial \theta_j}$—the derivative of the covariance matrix with respect to each hyperparameter. This decomposes as:
$$\frac{\partial \mathbf{K}_y}{\partial \theta_j} = \frac{\partial \mathbf{K}}{\partial \theta_j} + \frac{\partial (\sigma_n^2 \mathbf{I})}{\partial \theta_j}$$
The noise derivative is simple: $\frac{\partial (\sigma_n^2 \mathbf{I})}{\partial \sigma_n^2} = \mathbf{I}$ if $\theta_j = \sigma_n^2$, and zero otherwise.
The kernel derivative $\frac{\partial \mathbf{K}}{\partial \theta_j}$ depends on the specific kernel form. Let's work through the most common cases.
Squared Exponential (RBF) Kernel:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left( -\frac{r^2}{2\ell^2} \right), \quad r = |\mathbf{x} - \mathbf{x}'|$$
Derivative w.r.t. signal variance $\sigma_f^2$: $$\frac{\partial k}{\partial \sigma_f^2} = \frac{k(\mathbf{x}, \mathbf{x}')}{\sigma_f^2}$$
Derivative w.r.t. length-scale $\ell$: $$\frac{\partial k}{\partial \ell} = k(\mathbf{x}, \mathbf{x}') \cdot \frac{r^2}{\ell^3}$$
For log-parameterization (common in practice), with $\bar{\ell} = \log \ell$: $$\frac{\partial k}{\partial \bar{\ell}} = \frac{\partial k}{\partial \ell} \cdot \ell = k(\mathbf{x}, \mathbf{x}') \cdot \frac{r^2}{\ell^2}$$
Matérn Kernels:
$$k_\nu(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu} r}{\ell} \right)^\nu K_\nu\left( \frac{\sqrt{2\nu} r}{\ell} \right)$$
The derivatives are more complex due to the Bessel function. For the commonly used special cases:
Matérn-1/2 (Exponential): $k(r) = \sigma_f^2 \exp(-r/\ell)$ $$\frac{\partial k}{\partial \ell} = k(r) \cdot \frac{r}{\ell^2}$$
Matérn-3/2: $k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{3}r}{\ell}\right) \exp\left(-\frac{\sqrt{3}r}{\ell}\right)$ $$\frac{\partial k}{\partial \ell} = \sigma_f^2 \frac{3r^2}{\ell^3} \exp\left(-\frac{\sqrt{3}r}{\ell}\right)$$
Matérn-5/2: $k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}r}{\ell}\right)$ $$\frac{\partial k}{\partial \ell} = \sigma_f^2 \frac{5r^2}{3\ell^3} \left(1 + \frac{\sqrt{5}r}{\ell}\right) \exp\left(-\frac{\sqrt{5}r}{\ell}\right)$$
| Kernel | Parameter | $\frac{\partial k}{\partial \theta}$ |
|---|---|---|
| All kernels | $\sigma_f^2$ | $k(\mathbf{x}, \mathbf{x}')/\sigma_f^2$ |
| RBF | $\ell$ | $k(\mathbf{x}, \mathbf{x}') \cdot r^2/\ell^3$ |
| RBF (log-param) | $\log \ell$ | $k(\mathbf{x}, \mathbf{x}') \cdot r^2/\ell^2$ |
| Matérn-1/2 | $\ell$ | $k(r) \cdot r/\ell^2$ |
| Matérn-3/2 | $\ell$ | $\sigma_f^2 \frac{3r^2}{\ell^3} \exp(-\sqrt{3}r/\ell)$ |
| Periodic | period $p$ | Involves $\sin$ and $\cos$ terms |
| Noise | $\sigma_n^2$ | $\mathbf{I}$ (identity matrix) |
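To make a couple of the rows above concrete, here is a sketch of matrix-valued derivative builders for the RBF length-scale (log-parameterized) and the Matérn-3/2 length-scale. The function names and the precomputed pairwise-distance matrix `R` are illustrative assumptions.

```python
import numpy as np

def rbf_dK_dlog_lengthscale(R, signal_var, lengthscale):
    """dK/d(log ell) for the RBF kernel, given pairwise distances R."""
    K = signal_var * np.exp(-0.5 * R**2 / lengthscale**2)
    return K * R**2 / lengthscale**2

def matern32_dK_dlengthscale(R, signal_var, lengthscale):
    """dK/d(ell) for the Matern-3/2 kernel, given pairwise distances R."""
    return signal_var * (3 * R**2 / lengthscale**3) * np.exp(-np.sqrt(3) * R / lengthscale)

# Usage with 1-D inputs x of shape (n,):
x = np.linspace(0, 1, 5)
R = np.abs(x[:, None] - x[None, :])
dK = rbf_dK_dlog_lengthscale(R, signal_var=1.0, lengthscale=0.3)
```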
Automatic Relevance Determination (ARD) uses a separate length-scale for each input dimension, enabling automatic feature selection:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left( -\frac{1}{2} \sum_{d=1}^D \frac{(x_d - x'_d)^2}{\ell_d^2} \right)$$
This introduces $D$ length-scale parameters $\{\ell_1, \ldots, \ell_D\}$, and we need gradients for each.
Derivation:
Define the per-dimension squared distance: $$r_d = (x_d - x'_d)^2$$
Since $\ell_d$ enters the kernel only through the exponent term $-\frac{r_d}{2\ell_d^2}$, the chain rule gives: $$\frac{\partial k}{\partial \ell_d} = k(\mathbf{x}, \mathbf{x}') \cdot \frac{\partial}{\partial \ell_d}\left(-\frac{r_d}{2\ell_d^2}\right) = k(\mathbf{x}, \mathbf{x}') \cdot \frac{r_d}{\ell_d^3}$$
With log-parameterization ($\bar{\ell}_d = \log \ell_d$): $$\frac{\partial k}{\partial \bar{\ell}_d} = k(\mathbf{x}, \mathbf{x}') \cdot \frac{r_d}{\ell_d^2}$$
Interpreting ARD Gradients:
After optimization, the learned length-scales reveal feature importance:
Large $\ell_d$: Input dimension $d$ has little effect on predictions; the function varies slowly along this axis. This dimension is effectively 'turned off.'
Small $\ell_d$: Input dimension $d$ strongly influences predictions; the function varies rapidly along this axis. This dimension is highly relevant.
The gradient structure reveals why ARD works: for an irrelevant feature (constant along dimension $d$), the squared distance term $r_d$ is small for all pairs, making the gradient negligible. The length-scale then grows large with little marginal likelihood change. For relevant features, $r_d$ varies substantially, and the gradient drives $\ell_d$ to an appropriate value.
ARD provides a principled, embedded approach to feature selection. After training, features with $\ell_d > 10 \times \max(\text{data range in } d)$ can often be safely removed—their influence on predictions is negligible, and model complexity is reduced.
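A rough post-training check along these lines might look as follows; this is a sketch, with the factor of 10 mirroring the heuristic above and `lengthscales` assumed to hold the learned ARD values.

```python
import numpy as np

def prune_irrelevant_dims(X, lengthscales, factor=10.0):
    """Flag dimensions whose learned length-scale exceeds factor * data range."""
    data_range = X.max(axis=0) - X.min(axis=0)
    keep = lengthscales <= factor * data_range
    return np.where(keep)[0]  # indices of dimensions considered relevant
```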
Computing gradients naively is expensive. Each hyperparameter's gradient requires:
$$\frac{\partial \mathcal{L}}{\partial \theta_j} = \frac{1}{2} \text{tr}\left( \left( \boldsymbol{\alpha} \boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1} \right) \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)$$
Computing $\mathbf{K}_y^{-1}$ explicitly costs $O(n^3)$ and produces an $n \times n$ matrix. With $p$ hyperparameters, naive computation costs $O(pn^3)$, which is prohibitive for large datasets.
Efficient Strategy:
Cholesky decomposition once: Factor $\mathbf{K}_y = \mathbf{L}\mathbf{L}^\top$ in $O(n^3)$ time.
Solve for $\boldsymbol{\alpha}$ once: Compute $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$ via back-substitution in $O(n^2)$ time.
Solve for $\mathbf{K}_y^{-1}$ column-by-column: If needed, compute each column of $\mathbf{K}_y^{-1}$ by solving $\mathbf{L}\mathbf{L}^\top \mathbf{b}_i = \mathbf{e}_i$. Total: $O(n^3)$.
Compute each gradient via trace: Use the relation: $$\text{tr}(\mathbf{A}\mathbf{B}) = \sum_{i,j} A_{ij} B_{ji}$$
This is element-wise multiply and sum: $O(n^2)$ per hyperparameter.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def compute_lml_and_gradients(
    y: np.ndarray,
    K: np.ndarray,
    noise_var: float,
    dK_dtheta: list[np.ndarray]
) -> tuple[float, np.ndarray]:
    """
    Efficiently compute LML and gradients w.r.t. hyperparameters.

    Parameters
    ----------
    y : (n,) array - Observations
    K : (n, n) array - Kernel matrix (without noise)
    noise_var : float - Noise variance sigma_n^2
    dK_dtheta : list of (n, n) arrays - Kernel derivatives

    Returns
    -------
    lml : float - Log marginal likelihood
    grads : (p,) array - Gradient w.r.t. each hyperparameter
    """
    n = len(y)
    K_y = K + noise_var * np.eye(n)

    # Step 1: Cholesky decomposition (O(n^3), done once)
    L, lower = cho_factor(K_y, lower=True)

    # Step 2: Compute alpha = K_y^{-1} @ y (O(n^2))
    alpha = cho_solve((L, lower), y)

    # Step 3: Compute log|K_y| (O(n))
    log_det = 2 * np.sum(np.log(np.diag(L)))

    # Step 4: Compute LML
    data_fit = -0.5 * np.dot(y, alpha)
    complexity = -0.5 * log_det
    lml = data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

    # Step 5: Compute K_y^{-1} (O(n^3), can be optimized)
    K_y_inv = cho_solve((L, lower), np.eye(n))

    # Step 6: Compute gradient matrix (done once, reused)
    # Q = alpha @ alpha.T - K_y^{-1}
    Q = np.outer(alpha, alpha) - K_y_inv

    # Step 7: Compute gradient for each hyperparameter (O(n^2) each)
    grads = []
    for dK in dK_dtheta:
        # grad = 0.5 * tr(Q @ dK)
        # Efficient: element-wise multiply and sum
        grad = 0.5 * np.sum(Q * dK.T)  # Q * dK.T for tr(Q @ dK)
        grads.append(grad)

    return lml, np.array(grads)
```

Complexity Analysis:
| Operation | Complexity | Notes |
|---|---|---|
| Cholesky decomposition | $O(n^3)$ | Done once |
| Solve for $\boldsymbol{\alpha}$ | $O(n^2)$ | Back-substitution |
| Compute $\log\lvert\mathbf{K}_y\rvert$ | $O(n)$ | Sum of log-diagonal of $\mathbf{L}$ |
| Compute $\mathbf{K}_y^{-1}$ | $O(n^3)$ | $n$ back-substitutions |
| Each gradient (trace) | $O(n^2)$ | Element-wise product |
| Total for $p$ hyperparameters | $O(n^3 + pn^2)$ | Dominated by the Cholesky factorization |
The $O(n^3)$ Cholesky and inverse computations dominate, but they're done once per evaluation. Additional hyperparameters add only $O(n^2)$ cost each.
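As a minimal end-to-end usage sketch, assuming the `compute_lml_and_gradients` routine above and a 1-D RBF kernel whose derivatives with respect to $\sigma_f^2$ and $\ell$ follow the earlier table; the data and hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)

signal_var, lengthscale, noise_var = 1.0, 0.8, 0.01
R2 = (X[:, None] - X[None, :]) ** 2
K = signal_var * np.exp(-0.5 * R2 / lengthscale**2)

# Kernel derivatives matching the table: w.r.t. sigma_f^2 and ell
dK_dsignal = K / signal_var
dK_dlength = K * R2 / lengthscale**3

lml, grads = compute_lml_and_gradients(y, K, noise_var, [dK_dsignal, dK_dlength])
print(lml, grads)
```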
Modern numerical libraries achieve peak performance through vectorization. Here we present an optimized implementation that minimizes memory allocations and maximizes cache efficiency.
Key Optimizations:
Avoid explicit $\mathbf{K}_y^{-1}$: In many cases, we can restructure computations to avoid forming the full inverse.
Hadamard product for traces: Use $\text{tr}(\mathbf{A}\mathbf{B}) = \mathbf{1}^\top (\mathbf{A} \odot \mathbf{B}^\top) \mathbf{1}$ where $\odot$ is element-wise multiplication.
Pre-compute shared structures: Matrices like $\boldsymbol{\alpha}\boldsymbol{\alpha}^\top$ are computed once.
Log-parameterization: Working with $\log \theta$ instead of $\theta$ is numerically stable and eliminates positivity constraints.
```python
import numpy as np
from scipy.linalg import solve_triangular

def efficient_gp_gradients(
    X: np.ndarray,                  # (n, d) input points
    y: np.ndarray,                  # (n,) observations
    log_lengthscales: np.ndarray,   # (d,) log length-scales (ARD)
    log_signal_var: float,          # log signal variance
    log_noise_var: float,           # log noise variance
    jitter: float = 1e-8
) -> tuple[float, np.ndarray]:
    """
    Compute LML and gradients for ARD RBF kernel.
    All hyperparameters in log-space for unconstrained optimization.
    """
    n, d = X.shape

    # Transform from log-space
    lengthscales = np.exp(log_lengthscales)
    signal_var = np.exp(log_signal_var)
    noise_var = np.exp(log_noise_var)

    # Compute scaled inputs: X_scaled[i, j] = X[i, j] / lengthscales[j]
    X_scaled = X / lengthscales

    # Compute squared distances matrix
    # ||x_i - x_j||^2 weighted: sum_d (x_id - x_jd)^2 / ell_d^2
    sq_dists = (
        np.sum(X_scaled**2, axis=1, keepdims=True)
        + np.sum(X_scaled**2, axis=1)
        - 2 * X_scaled @ X_scaled.T
    )

    # Kernel matrix
    K = signal_var * np.exp(-0.5 * sq_dists)
    K_y = K + (noise_var + jitter) * np.eye(n)

    # Cholesky decomposition
    L = np.linalg.cholesky(K_y)

    # Solve for alpha = K_y^{-1} y via two triangular solves
    alpha = solve_triangular(L, y, lower=True)
    alpha = solve_triangular(L.T, alpha, lower=False)

    # Log marginal likelihood
    log_det = 2 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * np.dot(y, alpha) - 0.5 * log_det - 0.5 * n * np.log(2 * np.pi)

    # Gradient computation: construct Q = alpha @ alpha.T - K_y^{-1}
    # Solving L @ L.T @ W = I gives W = K_y^{-1}
    V = solve_triangular(L, np.eye(n), lower=True)
    K_y_inv = V.T @ V
    Q = np.outer(alpha, alpha) - K_y_inv

    # Gradient w.r.t. log_signal_var: dK/d(log σ²) = K
    grad_log_signal_var = 0.5 * np.sum(Q * K.T)

    # Gradient w.r.t. log_noise_var: dK/d(log σ²_n) = σ²_n * I
    grad_log_noise_var = 0.5 * noise_var * np.trace(Q)

    # Gradients w.r.t. log_lengthscales (ARD)
    grad_log_lengthscales = np.zeros(d)
    for dim in range(d):
        # Compute dimension-specific squared distance
        X_dim = X[:, dim:dim+1]
        sq_dist_dim = (X_dim - X_dim.T)**2
        # dK/d(log ell_d) = K * sq_dist_dim / ell_d^2
        dK_d_log_ell = K * sq_dist_dim / lengthscales[dim]**2
        grad_log_lengthscales[dim] = 0.5 * np.sum(Q * dK_d_log_ell.T)

    # Pack gradients: [log_lengthscales, log_signal_var, log_noise_var]
    grads = np.concatenate([
        grad_log_lengthscales,
        [grad_log_signal_var],
        [grad_log_noise_var]
    ])

    return lml, grads
```

When implementing GP gradients, numerical verification is essential. Analytic gradient errors can cause silent failures—the optimization may converge to a suboptimal point without any obvious error message.
Finite Difference Approximation:
The simplest approach uses centered finite differences:
$$\frac{\partial \mathcal{L}}{\partial \theta_j} \approx \frac{\mathcal{L}(\theta_j + \epsilon) - \mathcal{L}(\theta_j - \epsilon)}{2\epsilon}$$
with $\epsilon \approx 10^{-5}$ to $10^{-7}$.
Relative Error Check:
$$\text{relative error} = \frac{|\text{analytic} - \text{numerical}|}{\max(|\text{analytic}|, |\text{numerical}|, 10^{-8})}$$
A relative error below $10^{-4}$ is typically acceptable; below $10^{-6}$ is excellent.
```python
import numpy as np

def numerical_gradient(
    f: callable,            # Function to differentiate
    x: np.ndarray,          # Point at which to evaluate
    epsilon: float = 1e-5
) -> np.ndarray:
    """
    Compute numerical gradient via centered differences.
    """
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += epsilon
        x_minus[i] -= epsilon
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    return grad

def check_gradients(
    f_and_grad: callable,   # Returns (f_value, gradient)
    x: np.ndarray,
    epsilon: float = 1e-5,
    rtol: float = 1e-4
) -> bool:
    """
    Verify analytic gradients against numerical approximation.
    Returns True if all gradients pass relative tolerance check.
    """
    _, analytic = f_and_grad(x)
    numerical = numerical_gradient(lambda t: f_and_grad(t)[0], x, epsilon)

    # Compute relative errors
    denom = np.maximum(np.abs(analytic), np.abs(numerical))
    denom = np.maximum(denom, 1e-8)  # Avoid division by zero
    rel_error = np.abs(analytic - numerical) / denom

    print("Gradient Check Results:")
    print(f"{'Index':<6} {'Analytic':<15} {'Numerical':<15} {'Rel Error':<12} {'Status'}")
    print("-" * 60)

    all_passed = True
    for i in range(len(x)):
        status = "✓" if rel_error[i] < rtol else "✗"
        if rel_error[i] >= rtol:
            all_passed = False
        print(f"{i:<6} {analytic[i]:<15.6e} {numerical[i]:<15.6e} {rel_error[i]:<12.6e} {status}")

    return all_passed
```

Common sources of gradient bugs include:

- Sign errors: confusing maximize vs. minimize conventions.
- Index errors: loop indices off by one.
- Transpose errors: using a row where a column is needed.
- Chain rule omissions: forgetting Jacobians when using transformed parameters.

Always verify gradients on a small random dataset before trusting implementations.
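Putting the pieces together, the checker can be pointed at the ARD implementation from earlier on this page; the sketch below uses synthetic data, and the shapes and starting values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(15)

def f_and_grad(theta):
    # theta = [log_lengthscales (3), log_signal_var, log_noise_var]
    return efficient_gp_gradients(X, y, theta[:3], theta[3], theta[4])

theta0 = np.zeros(5)
check_gradients(f_and_grad, theta0)
```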
We've developed the complete machinery for computing gradients of the log marginal likelihood. The key results: the gradient for any hyperparameter is $\frac{\partial \mathcal{L}}{\partial \theta_j} = \frac{1}{2}\text{tr}\left(\left(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1}\right)\frac{\partial \mathbf{K}_y}{\partial \theta_j}\right)$; the kernel derivatives $\frac{\partial \mathbf{K}_y}{\partial \theta_j}$ have simple closed forms for the common kernels; a single Cholesky factorization brings the total cost to $O(n^3 + pn^2)$ for $p$ hyperparameters; and finite-difference checks guard against silent gradient bugs.
Looking Ahead:
With gradients in hand, we can apply gradient-based optimization algorithms to find hyperparameters that maximize the marginal likelihood. The next page explores optimization strategies—which algorithms to use, how to handle multiple optima, and practical considerations for robust hyperparameter learning.
You now have a complete understanding of gradient computation for GP hyperparameter learning. This mathematical machinery enables efficient gradient-based optimization, transforming hyperparameter learning from an expensive search problem into a smooth optimization problem.