Having established the marginal likelihood as our objective for hyperparameter learning, we now face a computational challenge: how do we efficiently optimize it?
The marginal likelihood landscape is typically non-convex with multiple local optima. Gradient-free methods (grid search, random search, evolutionary algorithms) can work for low-dimensional hyperparameter spaces but become prohibitively expensive as dimensionality increases. With multiple length-scales (one per input dimension), periodic kernels, or composite kernel structures, the hyperparameter space easily reaches 10-50 dimensions.
Gradient-based optimization is the solution. Methods like L-BFGS, conjugate gradient, or Adam can efficiently navigate high-dimensional spaces when gradients are available. For Gaussian Processes, we're fortunate: the log marginal likelihood has closed-form gradients.
By the end of this page, you will understand how to derive and compute gradients of the log marginal likelihood with respect to arbitrary hyperparameters. You'll master the chain rule structure, trace identities, and efficient computational strategies that make gradient-based GP optimization practical.
Let's establish the notation we'll use throughout this derivation.
The Log Marginal Likelihood (LML):
$$\mathcal{L}(\boldsymbol{\theta}) = \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2}\log |\mathbf{K}_y| - \frac{n}{2}\log(2\pi)$$
where $\mathbf{K}_y = \mathbf{K} + \sigma_n^2 \mathbf{I}$ is the noisy covariance matrix built from the kernel matrix $\mathbf{K}$ and the noise variance $\sigma_n^2$, $\mathbf{y}$ is the vector of $n$ observed targets, and $\boldsymbol{\theta}$ collects all kernel and noise hyperparameters.
Our Goal:
Compute $\frac{\partial \mathcal{L}}{\partial \theta_j}$ for each hyperparameter $\theta_j$. This enables gradient-based optimization:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
Key Quantities:
We'll repeatedly use the dual vector $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$, the kernel-matrix derivatives $\frac{\partial \mathbf{K}_y}{\partial \theta_j}$, and the Cholesky factorization $\mathbf{K}_y = \mathbf{L}\mathbf{L}^\top$ that makes both cheap to compute.
The structure of the gradient will depend on how each hyperparameter enters the covariance matrix. Fortunately, all common kernels have simple, differentiable forms.
Before diving into the gradient derivation, we need several matrix calculus identities. These are essential tools for differentiating expressions involving matrices.
Identity 1: Derivative of Matrix Inverse
For a matrix $\mathbf{A}$ that depends on a scalar $\theta$:
$$\frac{\partial \mathbf{A}^{-1}}{\partial \theta} = -\mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \theta} \mathbf{A}^{-1}$$
Proof sketch: Differentiate $\mathbf{A}^{-1} \mathbf{A} = \mathbf{I}$ using the product rule.
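Writing the product rule out makes the identity immediate:
$$\frac{\partial}{\partial \theta}\left(\mathbf{A}^{-1}\mathbf{A}\right) = \frac{\partial \mathbf{A}^{-1}}{\partial \theta}\mathbf{A} + \mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial \theta} = \mathbf{0} \quad\Longrightarrow\quad \frac{\partial \mathbf{A}^{-1}}{\partial \theta} = -\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial \theta}\mathbf{A}^{-1}$$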
Identity 2: Derivative of Log Determinant
For a positive definite matrix $\mathbf{A}$:
$$\frac{\partial \log |\mathbf{A}|}{\partial \theta} = \text{tr}\left( \mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \theta} \right)$$
Proof sketch: Use $\log|\mathbf{A}| = \sum_i \log \lambda_i$ and the sensitivity of eigenvalues, or the identity $\frac{\partial |\mathbf{A}|}{\partial A_{ij}} = |\mathbf{A}| (\mathbf{A}^{-1})_{ji}$.
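Equivalently, Jacobi's formula $\frac{\partial |\mathbf{A}|}{\partial \theta} = |\mathbf{A}|\,\text{tr}\!\left(\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial \theta}\right)$ gives the result in one line:
$$\frac{\partial \log|\mathbf{A}|}{\partial \theta} = \frac{1}{|\mathbf{A}|}\frac{\partial |\mathbf{A}|}{\partial \theta} = \text{tr}\!\left(\mathbf{A}^{-1}\frac{\partial \mathbf{A}}{\partial \theta}\right)$$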
Identity 3: Derivative of Quadratic Form
For vectors $\mathbf{a}, \mathbf{b}$ independent of $\theta$:
$$\frac{\partial}{\partial \theta} \left( \mathbf{a}^\top \mathbf{A}^{-1} \mathbf{b} \right) = -\mathbf{a}^\top \mathbf{A}^{-1} \frac{\partial \mathbf{A}}{\partial \theta} \mathbf{A}^{-1} \mathbf{b}$$
This follows directly from Identity 1.
Identity 4: Trace Cyclic Property
For matrices of compatible dimensions:
$$\text{tr}(\mathbf{ABC}) = \text{tr}(\mathbf{CAB}) = \text{tr}(\mathbf{BCA})$$
This allows rearranging terms inside traces, which is crucial for efficient computation.
Identity 5: Trace and Outer Products
For vectors $\mathbf{a}, \mathbf{b}$ and matrix $\mathbf{M}$:
$$\mathbf{a}^\top \mathbf{M} \mathbf{b} = \text{tr}(\mathbf{M} \mathbf{b} \mathbf{a}^\top) = \text{tr}(\mathbf{a}^\top \mathbf{M} \mathbf{b})$$
This allows converting quadratic forms to traces when convenient.
These five identities are the 'atoms' of matrix calculus for multivariate Gaussian distributions. Nearly every derivative in GP theory—gradients, Hessians, Fisher information—reduces to combinations of these rules. Mastering them pays dividends across probabilistic machine learning.
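As a quick sanity check, Identities 2 and 3 can be verified numerically on a small parameterized matrix. The sketch below uses the arbitrary positive definite example $\mathbf{A}(\theta) = \mathbf{M}\mathbf{M}^\top + \theta\mathbf{I}$, so $\partial\mathbf{A}/\partial\theta = \mathbf{I}$; the specific matrices and step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, theta = 5, 1e-6, 0.7

# A(theta) = M M^T + theta * I is positive definite; dA/dtheta = I
M = rng.standard_normal((n, n))
A = lambda t: M @ M.T + t * np.eye(n)
dA = np.eye(n)

# Identity 2: d log|A| / d theta = tr(A^{-1} dA)
numeric = (np.linalg.slogdet(A(theta + eps))[1]
           - np.linalg.slogdet(A(theta - eps))[1]) / (2 * eps)
analytic = np.trace(np.linalg.solve(A(theta), dA))
print(abs(numeric - analytic))  # should be tiny

# Identity 3: d(a^T A^{-1} b)/d theta = -a^T A^{-1} dA A^{-1} b
a, b = rng.standard_normal(n), rng.standard_normal(n)
quad = lambda t: a @ np.linalg.solve(A(t), b)
numeric = (quad(theta + eps) - quad(theta - eps)) / (2 * eps)
analytic = -a @ np.linalg.solve(A(theta), dA @ np.linalg.solve(A(theta), b))
print(abs(numeric - analytic))  # should be tiny
```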
We now derive the gradient $\frac{\partial \mathcal{L}}{\partial \theta_j}$ for an arbitrary hyperparameter $\theta_j$.
Starting Point:
$$\mathcal{L} = -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y} - \frac{1}{2}\log |\mathbf{K}_y| - \frac{n}{2}\log(2\pi)$$
The third term is constant with respect to hyperparameters. Let's differentiate the first two.
Gradient of the Data Fit Term:
$$\frac{\partial}{\partial \theta_j}\left( -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y} \right)$$
Using Identity 3:
$$= -\frac{1}{2} \cdot \left( -\mathbf{y}^\top \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1} \mathbf{y} \right)$$
$$= \frac{1}{2} \mathbf{y}^\top \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \mathbf{K}_y^{-1} \mathbf{y}$$
Defining $\boldsymbol{\alpha} = \mathbf{K}_y^{-1} \mathbf{y}$:
$$= \frac{1}{2} \boldsymbol{\alpha}^\top \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha}$$
Gradient of the Complexity Term:
$$\frac{\partial}{\partial \theta_j}\left( -\frac{1}{2}\log |\mathbf{K}_y| \right)$$
Using Identity 2:
$$= -\frac{1}{2} \text{tr}\left( \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)$$
Combining the Terms:
$$\boxed{\frac{\partial \mathcal{L}}{\partial \theta_j} = \frac{1}{2} \boldsymbol{\alpha}^\top \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha} - \frac{1}{2} \text{tr}\left( \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)}$$
Or equivalently, using the trace-quadratic identity ($\boldsymbol{\alpha}^\top \mathbf{A} \boldsymbol{\alpha} = \text{tr}(\boldsymbol{\alpha} \boldsymbol{\alpha}^\top \mathbf{A})$):
$$\boxed{\frac{\partial \mathcal{L}}{\partial \theta_j} = \frac{1}{2} \text{tr}\left( \left( \boldsymbol{\alpha} \boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1} \right) \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)}$$
The trace form is particularly elegant: it shows the gradient is determined by a single matrix $\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1}$ contracted with the kernel derivative. This matrix encodes the 'discrepancy' between the outer product of the dual solution and the precision matrix.
Let's develop intuition for what each gradient term represents.
Term 1: Data Fit Contribution
$$\frac{1}{2} \boldsymbol{\alpha}^\top \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha}$$
Recall $\boldsymbol{\alpha} = \mathbf{K}_y^{-1} \mathbf{y}$ represents the 'residuals in the dual space.' This term measures how the data fit changes when we perturb $\theta_j$: a positive value means increasing $\theta_j$ shrinks the Mahalanobis distance $\mathbf{y}^\top \mathbf{K}_y^{-1}\mathbf{y}$, so the data become better explained and the gradient pushes $\theta_j$ upward.
Term 2: Complexity Contribution
$$-\frac{1}{2} \text{tr}\left( \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)$$
This term measures how the model complexity changes with $\theta_j$: a large trace means increasing $\theta_j$ inflates $\log|\mathbf{K}_y|$, spreading prior probability over a larger volume of possible datasets, so the term penalizes the increase.
The Balance:
At the optimum, these two forces are in equilibrium:
$$\boldsymbol{\alpha}^\top \frac{\partial \mathbf{K}_y}{\partial \theta_j} \boldsymbol{\alpha} = \text{tr}\left( \mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)$$
For each hyperparameter, the marginal gain in data fit from increasing it equals the marginal cost in complexity. This is the differential form of Occam's Razor.
Connection to Fisher Information:
The complexity term $\text{tr}(\mathbf{K}_y^{-1} \frac{\partial \mathbf{K}_y}{\partial \theta_j})$ is related to the Fisher information of the Gaussian distribution with covariance $\mathbf{K}_y$. Hyperparameters with large Fisher information (the model is sensitive to their values) are more heavily penalized—the model 'pays more' for parameters that strongly affect predictions.
We're computing gradients for maximizing the log marginal likelihood. If using a minimization framework (as many optimizers expect), negate the objective and gradients. The convention here follows the original GP literature.
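For instance, a thin wrapper that negates both outputs is enough to hand the problem to `scipy.optimize.minimize`. The sketch below uses a toy stand-in objective, since the GP routines appear later on this page; `toy_lml_and_grads` and the starting point are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(theta, lml_and_grads):
    """Negate the LML and its gradient so a minimizer performs maximization."""
    lml, grads = lml_and_grads(theta)
    return -lml, -grads

def toy_lml_and_grads(theta):
    # Concave toy surrogate: -(theta - 1)^2 summed; gradient is -2(theta - 1)
    return -np.sum((theta - 1.0) ** 2), -2.0 * (theta - 1.0)

result = minimize(neg_lml, x0=np.zeros(3), args=(toy_lml_and_grads,),
                  jac=True, method="L-BFGS-B")
print(result.x)  # converges to [1, 1, 1] for the toy objective
```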
The gradient formula requires $\frac{\partial \mathbf{K}_y}{\partial \theta_j}$—the derivative of the covariance matrix with respect to each hyperparameter. This decomposes as:
$$\frac{\partial \mathbf{K}_y}{\partial \theta_j} = \frac{\partial \mathbf{K}}{\partial \theta_j} + \frac{\partial (\sigma_n^2 \mathbf{I})}{\partial \theta_j}$$
The noise derivative is simple: $\frac{\partial (\sigma_n^2 \mathbf{I})}{\partial \sigma_n^2} = \mathbf{I}$ if $\theta_j = \sigma_n^2$, and zero otherwise.
The kernel derivative $\frac{\partial \mathbf{K}}{\partial \theta_j}$ depends on the specific kernel form. Let's work through the most common cases.
Squared Exponential (RBF) Kernel:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left( -\frac{r^2}{2\ell^2} \right), \quad r = |\mathbf{x} - \mathbf{x}'|$$
Derivative w.r.t. signal variance $\sigma_f^2$: $$\frac{\partial k}{\partial \sigma_f^2} = \frac{k(\mathbf{x}, \mathbf{x}')}{\sigma_f^2}$$
Derivative w.r.t. length-scale $\ell$: $$\frac{\partial k}{\partial \ell} = k(\mathbf{x}, \mathbf{x}') \cdot \frac{r^2}{\ell^3}$$
For log-parameterization (common in practice), with $\bar{\ell} = \log \ell$: $$\frac{\partial k}{\partial \bar{\ell}} = \frac{\partial k}{\partial \ell} \cdot \ell = k(\mathbf{x}, \mathbf{x}') \cdot \frac{r^2}{\ell^2}$$
Matérn Kernels:
$$k_\nu(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu} r}{\ell} \right)^\nu K_\nu\left( \frac{\sqrt{2\nu} r}{\ell} \right)$$
The derivatives are more complex due to the Bessel function. For the commonly used special cases:
Matérn-1/2 (Exponential): $k(r) = \sigma_f^2 \exp(-r/\ell)$ $$\frac{\partial k}{\partial \ell} = k(r) \cdot \frac{r}{\ell^2}$$
Matérn-3/2: $k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{3}r}{\ell}\right) \exp\left(-\frac{\sqrt{3}r}{\ell}\right)$ $$\frac{\partial k}{\partial \ell} = \sigma_f^2 \frac{3r^2}{\ell^3} \exp\left(-\frac{\sqrt{3}r}{\ell}\right)$$
Matérn-5/2: $k(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}r}{\ell}\right)$ $$\frac{\partial k}{\partial \ell} = \sigma_f^2 \frac{5r^2}{3\ell^3} \left(1 + \frac{\sqrt{5}r}{\ell}\right) \exp\left(-\frac{\sqrt{5}r}{\ell}\right)$$
| Kernel | Parameter | $\frac{\partial k}{\partial \theta}$ |
|---|---|---|
| All kernels | $\sigma_f^2$ | $k(\mathbf{x}, \mathbf{x}')/\sigma_f^2$ |
| RBF | $\ell$ | $k(\mathbf{x}, \mathbf{x}') \cdot r^2/\ell^3$ |
| RBF (log-param) | $\log \ell$ | $k(\mathbf{x}, \mathbf{x}') \cdot r^2/\ell^2$ |
| Matérn-1/2 | $\ell$ | $k(r) \cdot r/\ell^2$ |
| Matérn-3/2 | $\ell$ | $\sigma_f^2 \frac{3r^2}{\ell^3} \exp(-\sqrt{3}r/\ell)$ |
| Periodic | period $p$ | Involves $\sin$ and $\cos$ terms |
| Noise | $\sigma_n^2$ | $\mathbf{I}$ (identity matrix) |
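To make a couple of the rows above concrete, here is a sketch of matrix-valued derivative builders for the RBF length-scale (log-parameterized) and the Matérn-3/2 length-scale. The function names and the precomputed pairwise-distance matrix `R` are illustrative assumptions.

```python
import numpy as np

def rbf_dK_dlog_lengthscale(R, signal_var, lengthscale):
    """dK/d(log ell) for the RBF kernel, given pairwise distances R."""
    K = signal_var * np.exp(-0.5 * R**2 / lengthscale**2)
    return K * R**2 / lengthscale**2

def matern32_dK_dlengthscale(R, signal_var, lengthscale):
    """dK/d(ell) for the Matern-3/2 kernel, given pairwise distances R."""
    return signal_var * (3 * R**2 / lengthscale**3) * np.exp(-np.sqrt(3) * R / lengthscale)

# Usage with 1-D inputs x of shape (n,):
x = np.linspace(0, 1, 5)
R = np.abs(x[:, None] - x[None, :])
dK = rbf_dK_dlog_lengthscale(R, signal_var=1.0, lengthscale=0.3)
```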
Automatic Relevance Determination (ARD) uses a separate length-scale for each input dimension, enabling automatic feature selection:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left( -\frac{1}{2} \sum_{d=1}^D \frac{(x_d - x'_d)^2}{\ell_d^2} \right)$$
This introduces $D$ length-scale parameters $\{\ell_1, \ldots, \ell_D\}$, and we need gradients for each.
Derivation:
Define the per-dimension squared distance: $$r_d = (x_d - x'_d)^2$$
Since $\ell_d$ enters the kernel only through the exponent term $-\frac{r_d}{2\ell_d^2}$, the chain rule gives: $$\frac{\partial k}{\partial \ell_d} = k(\mathbf{x}, \mathbf{x}') \cdot \frac{\partial}{\partial \ell_d}\left(-\frac{r_d}{2\ell_d^2}\right) = k(\mathbf{x}, \mathbf{x}') \cdot \frac{r_d}{\ell_d^3}$$
With log-parameterization ($\bar{\ell}_d = \log \ell_d$): $$\frac{\partial k}{\partial \bar{\ell}_d} = k(\mathbf{x}, \mathbf{x}') \cdot \frac{r_d}{\ell_d^2}$$
Interpreting ARD Gradients:
After optimization, the learned length-scales reveal feature importance:
Large $\ell_d$: Input dimension $d$ has little effect on predictions; the function varies slowly along this axis. This dimension is effectively 'turned off.'
Small $\ell_d$: Input dimension $d$ strongly influences predictions; the function varies rapidly along this axis. This dimension is highly relevant.
The gradient structure reveals why ARD works: for an irrelevant feature (constant along dimension $d$), the squared distance term $r_d$ is small for all pairs, making the gradient negligible. The length-scale then grows large with little marginal likelihood change. For relevant features, $r_d$ varies substantially, and the gradient drives $\ell_d$ to an appropriate value.
ARD provides a principled, embedded approach to feature selection. After training, features with $\ell_d > 10 \times \max(\text{data range in } d)$ can often be safely removed—their influence on predictions is negligible, and model complexity is reduced.
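A rough post-training check along these lines might look as follows; this is a sketch, with the factor of 10 mirroring the heuristic above and `lengthscales` assumed to hold the learned ARD values.

```python
import numpy as np

def prune_irrelevant_dims(X, lengthscales, factor=10.0):
    """Flag dimensions whose learned length-scale exceeds factor * data range."""
    data_range = X.max(axis=0) - X.min(axis=0)
    keep = lengthscales <= factor * data_range
    return np.where(keep)[0]  # indices of dimensions considered relevant
```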
Computing gradients naively is expensive. Each hyperparameter's gradient requires:
$$\frac{\partial \mathcal{L}}{\partial \theta_j} = \frac{1}{2} \text{tr}\left( \left( \boldsymbol{\alpha} \boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1} \right) \frac{\partial \mathbf{K}_y}{\partial \theta_j} \right)$$
Computing $\mathbf{K}_y^{-1}$ explicitly costs $O(n^3)$ and produces an $n \times n$ matrix. With $p$ hyperparameters, naive computation costs $O(pn^3)$, which is prohibitive for large datasets.
Efficient Strategy:
Cholesky decomposition once: Factor $\mathbf{K}_y = \mathbf{L}\mathbf{L}^\top$ in $O(n^3)$ time.
Solve for $\boldsymbol{\alpha}$ once: Compute $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$ via back-substitution in $O(n^2)$ time.
Solve for $\mathbf{K}_y^{-1}$ column-by-column: If needed, compute each column of $\mathbf{K}_y^{-1}$ by solving $\mathbf{L}\mathbf{L}^\top \mathbf{b}_i = \mathbf{e}_i$. Total: $O(n^3)$.
Compute each gradient via trace: Use the relation: $$\text{tr}(\mathbf{A}\mathbf{B}) = \sum_{i,j} A_{ij} B_{ji}$$
This is element-wise multiply and sum: $O(n^2)$ per hyperparameter.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def compute_lml_and_gradients(
    y: np.ndarray,
    K: np.ndarray,
    noise_var: float,
    dK_dtheta: list[np.ndarray]
) -> tuple[float, np.ndarray]:
    """
    Efficiently compute LML and gradients w.r.t. hyperparameters.

    Parameters
    ----------
    y : (n,) array - Observations
    K : (n, n) array - Kernel matrix (without noise)
    noise_var : float - Noise variance sigma_n^2
    dK_dtheta : list of (n, n) arrays - Kernel derivatives

    Returns
    -------
    lml : float - Log marginal likelihood
    grads : (p,) array - Gradient w.r.t. each hyperparameter
    """
    n = len(y)
    K_y = K + noise_var * np.eye(n)

    # Step 1: Cholesky decomposition (O(n^3), done once)
    L, lower = cho_factor(K_y, lower=True)

    # Step 2: Compute alpha = K_y^{-1} @ y (O(n^2))
    alpha = cho_solve((L, lower), y)

    # Step 3: Compute log|K_y| (O(n))
    log_det = 2 * np.sum(np.log(np.diag(L)))

    # Step 4: Compute LML
    data_fit = -0.5 * np.dot(y, alpha)
    complexity = -0.5 * log_det
    lml = data_fit + complexity - 0.5 * n * np.log(2 * np.pi)

    # Step 5: Compute K_y^{-1} (O(n^3), can be optimized)
    K_y_inv = cho_solve((L, lower), np.eye(n))

    # Step 6: Compute gradient matrix (done once, reused)
    # Q = alpha @ alpha.T - K_y^{-1}
    Q = np.outer(alpha, alpha) - K_y_inv

    # Step 7: Compute gradient for each hyperparameter (O(n^2) each)
    grads = []
    for dK in dK_dtheta:
        # grad = 0.5 * tr(Q @ dK)
        # Efficient: element-wise multiply and sum
        grad = 0.5 * np.sum(Q * dK.T)  # Q * dK.T for tr(Q @ dK)
        grads.append(grad)

    return lml, np.array(grads)
```

Complexity Analysis:
| Operation | Complexity | Notes |
|---|---|---|
| Cholesky decomposition | $O(n^3)$ | Done once |
| Solve for $\boldsymbol{\alpha}$ | $O(n^2)$ | Back-substitution |
| Compute $\log\lvert\mathbf{K}_y\rvert$ | $O(n)$ | Sum of log-diagonal of $\mathbf{L}$ |
| Compute $\mathbf{K}_y^{-1}$ | $O(n^3)$ | $n$ back-substitutions |
| Each gradient (trace) | $O(n^2)$ | Element-wise product |
| Total for $p$ hyperparameters | $O(n^3 + pn^2)$ | Dominated by the Cholesky factorization |
The $O(n^3)$ Cholesky and inverse computations dominate, but they're done once per evaluation. Additional hyperparameters add only $O(n^2)$ cost each.
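As a minimal end-to-end usage sketch, assuming the `compute_lml_and_gradients` routine above and a 1-D RBF kernel whose derivatives with respect to $\sigma_f^2$ and $\ell$ follow the earlier table; the data and hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)

signal_var, lengthscale, noise_var = 1.0, 0.8, 0.01
R2 = (X[:, None] - X[None, :]) ** 2
K = signal_var * np.exp(-0.5 * R2 / lengthscale**2)

# Kernel derivatives matching the table: w.r.t. sigma_f^2 and ell
dK_dsignal = K / signal_var
dK_dlength = K * R2 / lengthscale**3

lml, grads = compute_lml_and_gradients(y, K, noise_var, [dK_dsignal, dK_dlength])
print(lml, grads)
```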
Modern numerical libraries achieve peak performance through vectorization. Here we present an optimized implementation that minimizes memory allocations and maximizes cache efficiency.
Key Optimizations:
Avoid explicit $\mathbf{K}_y^{-1}$: In many cases, we can restructure computations to avoid forming the full inverse.
Hadamard product for traces: Use $\text{tr}(\mathbf{A}\mathbf{B}) = \mathbf{1}^\top (\mathbf{A} \odot \mathbf{B}^\top) \mathbf{1}$ where $\odot$ is element-wise multiplication.
Pre-compute shared structures: Matrices like $\boldsymbol{\alpha}\boldsymbol{\alpha}^\top$ are computed once.
Log-parameterization: Working with $\log \theta$ instead of $\theta$ is numerically stable and eliminates positivity constraints.
```python
import numpy as np
from scipy.linalg import solve_triangular

def efficient_gp_gradients(
    X: np.ndarray,                  # (n, d) input points
    y: np.ndarray,                  # (n,) observations
    log_lengthscales: np.ndarray,   # (d,) log length-scales (ARD)
    log_signal_var: float,          # log signal variance
    log_noise_var: float,           # log noise variance
    jitter: float = 1e-8
) -> tuple[float, np.ndarray]:
    """
    Compute LML and gradients for ARD RBF kernel.
    All hyperparameters in log-space for unconstrained optimization.
    """
    n, d = X.shape

    # Transform from log-space
    lengthscales = np.exp(log_lengthscales)
    signal_var = np.exp(log_signal_var)
    noise_var = np.exp(log_noise_var)

    # Compute scaled inputs: X_scaled[i, j] = X[i, j] / lengthscales[j]
    X_scaled = X / lengthscales

    # Compute squared distances matrix
    # ||x_i - x_j||^2 weighted: sum_d (x_id - x_jd)^2 / ell_d^2
    sq_dists = (
        np.sum(X_scaled**2, axis=1, keepdims=True)
        + np.sum(X_scaled**2, axis=1)
        - 2 * X_scaled @ X_scaled.T
    )

    # Kernel matrix
    K = signal_var * np.exp(-0.5 * sq_dists)
    K_y = K + (noise_var + jitter) * np.eye(n)

    # Cholesky decomposition
    L = np.linalg.cholesky(K_y)

    # Solve for alpha = K_y^{-1} y via two triangular solves
    alpha = solve_triangular(L, y, lower=True)
    alpha = solve_triangular(L.T, alpha, lower=False)

    # Log marginal likelihood
    log_det = 2 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * np.dot(y, alpha) - 0.5 * log_det - 0.5 * n * np.log(2 * np.pi)

    # Gradient computation: construct Q = alpha @ alpha.T - K_y^{-1}
    # Solving L @ L.T @ W = I gives W = K_y^{-1}
    V = solve_triangular(L, np.eye(n), lower=True)
    K_y_inv = V.T @ V
    Q = np.outer(alpha, alpha) - K_y_inv

    # Gradient w.r.t. log_signal_var: dK/d(log σ²) = K
    grad_log_signal_var = 0.5 * np.sum(Q * K.T)

    # Gradient w.r.t. log_noise_var: dK/d(log σ²_n) = σ²_n * I
    grad_log_noise_var = 0.5 * noise_var * np.trace(Q)

    # Gradients w.r.t. log_lengthscales (ARD)
    grad_log_lengthscales = np.zeros(d)
    for dim in range(d):
        # Compute dimension-specific squared distance
        X_dim = X[:, dim:dim+1]
        sq_dist_dim = (X_dim - X_dim.T)**2
        # dK/d(log ell_d) = K * sq_dist_dim / ell_d^2
        dK_d_log_ell = K * sq_dist_dim / lengthscales[dim]**2
        grad_log_lengthscales[dim] = 0.5 * np.sum(Q * dK_d_log_ell.T)

    # Pack gradients: [log_lengthscales, log_signal_var, log_noise_var]
    grads = np.concatenate([
        grad_log_lengthscales,
        [grad_log_signal_var],
        [grad_log_noise_var]
    ])

    return lml, grads
```

When implementing GP gradients, numerical verification is essential. Analytic gradient errors can cause silent failures—the optimization may converge to a suboptimal point without any obvious error message.
Finite Difference Approximation:
The simplest approach uses centered finite differences:
$$\frac{\partial \mathcal{L}}{\partial \theta_j} \approx \frac{\mathcal{L}(\theta_j + \epsilon) - \mathcal{L}(\theta_j - \epsilon)}{2\epsilon}$$
with $\epsilon \approx 10^{-5}$ to $10^{-7}$.
Relative Error Check:
$$\text{relative error} = \frac{|\text{analytic} - \text{numerical}|}{\max(|\text{analytic}|, |\text{numerical}|, 10^{-8})}$$
A relative error below $10^{-4}$ is typically acceptable; below $10^{-6}$ is excellent.
```python
import numpy as np

def numerical_gradient(
    f: callable,            # Function to differentiate
    x: np.ndarray,          # Point at which to evaluate
    epsilon: float = 1e-5
) -> np.ndarray:
    """
    Compute numerical gradient via centered differences.
    """
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += epsilon
        x_minus[i] -= epsilon
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * epsilon)
    return grad

def check_gradients(
    f_and_grad: callable,   # Returns (f_value, gradient)
    x: np.ndarray,
    epsilon: float = 1e-5,
    rtol: float = 1e-4
) -> bool:
    """
    Verify analytic gradients against numerical approximation.
    Returns True if all gradients pass relative tolerance check.
    """
    _, analytic = f_and_grad(x)
    numerical = numerical_gradient(lambda t: f_and_grad(t)[0], x, epsilon)

    # Compute relative errors
    denom = np.maximum(np.abs(analytic), np.abs(numerical))
    denom = np.maximum(denom, 1e-8)  # Avoid division by zero
    rel_error = np.abs(analytic - numerical) / denom

    print("Gradient Check Results:")
    print(f"{'Index':<6} {'Analytic':<15} {'Numerical':<15} {'Rel Error':<12} {'Status'}")
    print("-" * 60)

    all_passed = True
    for i in range(len(x)):
        status = "✓" if rel_error[i] < rtol else "✗"
        if rel_error[i] >= rtol:
            all_passed = False
        print(f"{i:<6} {analytic[i]:<15.6e} {numerical[i]:<15.6e} {rel_error[i]:<12.6e} {status}")

    return all_passed
```

Common sources of gradient bugs include:

- Sign errors: confusing maximize vs. minimize conventions.
- Index errors: loop indices off by one.
- Transpose errors: using a row where a column is needed.
- Chain rule omissions: forgetting Jacobians when using transformed parameters.

Always verify gradients on a small random dataset before trusting implementations.
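Putting the pieces together, the checker can be pointed at the ARD implementation from earlier on this page; the sketch below uses synthetic data, and the shapes and starting values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((15, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(15)

def f_and_grad(theta):
    # theta = [log_lengthscales (3), log_signal_var, log_noise_var]
    return efficient_gp_gradients(X, y, theta[:3], theta[3], theta[4])

theta0 = np.zeros(5)
check_gradients(f_and_grad, theta0)
```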
We've developed the complete machinery for computing gradients of the log marginal likelihood. The key results: the gradient for any hyperparameter is $\frac{\partial \mathcal{L}}{\partial \theta_j} = \frac{1}{2}\text{tr}\left(\left(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1}\right)\frac{\partial \mathbf{K}_y}{\partial \theta_j}\right)$; the kernel derivatives $\frac{\partial \mathbf{K}_y}{\partial \theta_j}$ have simple closed forms for the common kernels; a single Cholesky factorization brings the total cost to $O(n^3 + pn^2)$ for $p$ hyperparameters; and finite-difference checks guard against silent gradient bugs.
Looking Ahead:
With gradients in hand, we can apply gradient-based optimization algorithms to find hyperparameters that maximize the marginal likelihood. The next page explores optimization strategies—which algorithms to use, how to handle multiple optima, and practical considerations for robust hyperparameter learning.
You now have a complete understanding of gradient computation for GP hyperparameter learning. This mathematical machinery enables efficient gradient-based optimization, transforming hyperparameter learning from an expensive search problem into a smooth optimization problem.