In many machine learning problems, not all input features are equally important. Some features strongly influence the output, while others are irrelevant noise. Automatic Relevance Determination (ARD) is a powerful technique that lets the Gaussian Process learn which features matter directly from data.
The key idea: instead of using a single length scale $\ell$ for all dimensions, we use a separate length scale $\ell_d$ for each input dimension $d$. Features with large $\ell_d$ are effectively 'turned off'—the function is insensitive to them.
By the end of this page, you will understand how ARD kernels work, interpret learned length scales as feature importances, implement ARD for automatic feature selection, and recognize when ARD helps and when it can cause problems.
The ARD Squared Exponential kernel replaces the isotropic distance with a weighted distance:
$$k_{ARD}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - x'_d)^2}{\ell_d^2}\right)$$
This is equivalent to: $$k_{ARD}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^T \Lambda^{-1} (\mathbf{x} - \mathbf{x}')\right)$$
where $\Lambda = \text{diag}(\ell_1^2, \ell_2^2, \ldots, \ell_D^2)$ is the diagonal matrix of squared length scales.
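To make the equivalence of the two forms concrete, here is a small standalone check (a sketch, separate from the implementation below) that evaluates both expressions for one pair of points:

```python
import numpy as np

# Check that the per-dimension sum and the Λ⁻¹ quadratic form agree
rng = np.random.default_rng(0)
D = 4
x, x_prime = rng.normal(size=D), rng.normal(size=D)
ell = np.array([0.5, 2.0, 1.0, 3.0])    # per-dimension length scales (illustrative)
sigma_f2 = 1.5                          # signal variance σ_f² (illustrative)

# Form 1: weighted sum over dimensions
k1 = sigma_f2 * np.exp(-0.5 * np.sum((x - x_prime)**2 / ell**2))

# Form 2: quadratic form with Λ = diag(ℓ_1², ..., ℓ_D²)
Lambda_inv = np.diag(1.0 / ell**2)
diff = x - x_prime
k2 = sigma_f2 * np.exp(-0.5 * diff @ Lambda_inv @ diff)

assert np.isclose(k1, k2)
```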
Interpretation of Length Scales: a small $\ell_d$ means the function varies quickly along dimension $d$, so that feature strongly influences predictions; a large $\ell_d$ means the function is nearly constant along dimension $d$, so that feature is effectively ignored.
```python
import numpy as np

class ARDKernel:
    """
    Automatic Relevance Determination (ARD) kernel.

    Each input dimension has its own length scale, enabling automatic
    feature importance determination.
    """

    def __init__(self, length_scales: np.ndarray, signal_variance: float = 1.0):
        """
        Parameters
        ----------
        length_scales : np.ndarray of shape (D,)
            Per-dimension length scales
        signal_variance : float
            Signal variance σ_f²
        """
        self.length_scales = np.asarray(length_scales)
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """
        Compute the ARD kernel matrix.

        X1: (n1, D), X2: (n2, D)
        Returns: (n1, n2) kernel matrix
        """
        # Scale each dimension by its length scale
        X1_scaled = X1 / self.length_scales
        X2_scaled = X2 / self.length_scales

        # Compute squared distances in the scaled space
        sq_dist = np.sum(X1_scaled**2, axis=1, keepdims=True) + \
                  np.sum(X2_scaled**2, axis=1) - 2 * X1_scaled @ X2_scaled.T

        return self.signal_variance * np.exp(-0.5 * sq_dist)

    def relevance_scores(self) -> np.ndarray:
        """
        Compute relevance scores from length scales.

        Higher scores = more relevant features.
        Uses the inverse squared length scale as relevance.
        """
        return 1.0 / self.length_scales**2

    def feature_ranking(self) -> np.ndarray:
        """Return indices of features sorted by relevance (most relevant first)."""
        return np.argsort(self.length_scales)  # Smallest length scale = most relevant


# Example usage
D = 5  # 5 input dimensions
length_scales = np.array([0.5, 5.0, 0.3, 10.0, 1.0])  # Dims 2, 0 are most relevant

kernel = ARDKernel(length_scales)
print("Feature ranking:", kernel.feature_ranking())        # [2, 0, 4, 1, 3]
print("Relevance scores:", kernel.relevance_scores().round(2))
```

ARD length scales are learned by maximizing the marginal likelihood, just like other hyperparameters. The gradient with respect to each length scale is:
$$\frac{\partial \log p(\mathbf{y})}{\partial \ell_d} = \frac{1}{2}\text{tr}\left((\boldsymbol{\alpha}\boldsymbol{\alpha}^T - K^{-1})\frac{\partial K}{\partial \ell_d}\right)$$
where: $$\frac{\partial K_{ij}}{\partial \ell_d} = K_{ij} \cdot \frac{(x_{id} - x_{jd})^2}{\ell_d^3}$$
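This derivative can be turned directly into code. The sketch below computes the analytic gradient with respect to each length scale; it forms $K^{-1}$ explicitly for brevity, whereas a production implementation would reuse a Cholesky factor instead:

```python
import numpy as np

def ard_kernel_matrix(X, ls, sv):
    """ARD squared-exponential kernel matrix, as defined above."""
    Xs = X / ls
    sq = np.sum(Xs**2, axis=1, keepdims=True) + np.sum(Xs**2, axis=1) - 2 * Xs @ Xs.T
    return sv * np.exp(-0.5 * np.maximum(sq, 0.0))

def grad_log_ml_lengthscales(X, y, ls, sv, nv):
    """Gradient of the log marginal likelihood w.r.t. each length scale ℓ_d."""
    n, D = X.shape
    Kf = ard_kernel_matrix(X, ls, sv)
    K = Kf + nv * np.eye(n)
    K_inv = np.linalg.inv(K)                 # fine for a small demo
    alpha = K_inv @ y
    A = np.outer(alpha, alpha) - K_inv       # (αα^T − K⁻¹)
    grads = np.empty(D)
    for d in range(D):
        diff_sq = (X[:, d][:, None] - X[:, d][None, :])**2   # (x_id − x_jd)²
        dK = Kf * diff_sq / ls[d]**3                          # ∂K/∂ℓ_d (noise term drops out)
        grads[d] = 0.5 * np.trace(A @ dK)
    return grads
```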
During optimization, the marginal likelihood naturally pushes irrelevant length scales to large values.
Length scales of irrelevant features can drift toward infinity, which causes numerical issues. Optimize log(ℓ_d) instead of ℓ_d to keep the parameters positive and well scaled, and consider adding a weak prior that penalizes very large length scales.
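As a minimal sketch of the second suggestion, a weak Gaussian prior on the log length scales can be added as a penalty to the negative log marginal likelihood (the prior width of 3.0 below is an arbitrary illustrative choice, not a recommendation):

```python
import numpy as np

def log_lengthscale_penalty(log_ls, prior_std=3.0):
    """Weak Gaussian prior on log(ℓ_d): penalizes length scales far from 1 on a log scale."""
    return 0.5 * np.sum((log_ls / prior_std) ** 2)

# Inside a negative-log-marginal-likelihood function (such as neg_log_ml below),
# one would return  nll + log_lengthscale_penalty(log_ls)  instead of nll.
```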
```python
import numpy as np
from scipy.optimize import minimize

def optimize_ard_gp(X, y, n_restarts=5):
    """
    Optimize ARD GP hyperparameters by maximizing the marginal likelihood.

    Returns optimized length scales, signal variance, and noise variance.
    """
    n, D = X.shape

    def neg_log_ml(log_theta):
        # Unpack log-transformed parameters
        log_ls = log_theta[:D]     # D length scales
        log_sv = log_theta[D]      # signal variance
        log_nv = log_theta[D + 1]  # noise variance
        ls = np.exp(log_ls)
        sv = np.exp(log_sv)
        nv = np.exp(log_nv)

        # Build ARD kernel matrix
        X_scaled = X / ls
        sq_dist = np.sum(X_scaled**2, axis=1, keepdims=True) + \
                  np.sum(X_scaled**2, axis=1) - 2 * X_scaled @ X_scaled.T
        K = sv * np.exp(-0.5 * sq_dist) + nv * np.eye(n)

        try:
            L = np.linalg.cholesky(K)
        except np.linalg.LinAlgError:
            return 1e10  # Penalize numerically unstable hyperparameters

        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        # Negative log marginal likelihood
        # (constant n/2·log 2π omitted; it does not affect the optimum)
        nll = 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
        return nll

    best_result = None
    best_nll = np.inf

    for _ in range(n_restarts):
        # Initialize: length scales from data spread, variances from y
        log_theta0 = np.concatenate([
            np.log(np.std(X, axis=0) + 1e-6),  # length scales ~ data spread
            [np.log(np.var(y))],               # signal variance ~ y variance
            [np.log(0.1 * np.var(y))]          # noise = 10% of signal
        ])
        log_theta0 += 0.5 * np.random.randn(D + 2)  # Add randomness

        result = minimize(neg_log_ml, log_theta0, method='L-BFGS-B')
        if result.fun < best_nll:
            best_nll = result.fun
            best_result = result.x

    # Extract optimized hyperparameters
    length_scales = np.exp(best_result[:D])
    signal_var = np.exp(best_result[D])
    noise_var = np.exp(best_result[D + 1])

    return {
        'length_scales': length_scales,
        'signal_variance': signal_var,
        'noise_variance': noise_var,
        'relevance_ranking': np.argsort(length_scales)
    }
```

After optimization, the length scales reveal feature importance:
Relevance Score: A common metric is the inverse squared length scale: $$r_d = \frac{1}{\ell_d^2}$$
Higher $r_d$ means dimension $d$ is more relevant.
Normalized Relevance: To compare across features: $$\bar{r}_d = \frac{r_d}{\sum_{d'} r_{d'}}$$
| Length Scale Ratio ($\ell_d / \ell_{min}$) | Interpretation |
|---|---|
| 1 - 2 | Highly relevant feature |
| 2 - 5 | Moderately relevant |
| 5 - 10 | Marginally relevant |
| > 10 | Likely irrelevant (candidate for removal) |
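The sketch below applies these formulas and the rule-of-thumb table to the example length scales used earlier (the >10 cutoff is the table's heuristic, not a hard rule):

```python
import numpy as np

ls = np.array([0.5, 5.0, 0.3, 10.0, 1.0])   # e.g., learned length scales

r = 1.0 / ls**2              # relevance scores r_d = 1/ℓ_d²
r_norm = r / r.sum()         # normalized relevance
ratio = ls / ls.min()        # length-scale ratio ℓ_d / ℓ_min

for d in range(len(ls)):
    flag = "likely irrelevant" if ratio[d] > 10 else ""
    print(f"dim {d}: ℓ={ls[d]:5.2f}  r={r[d]:6.2f}  "
          f"normalized r={r_norm[d]:.3f}  ratio={ratio[d]:5.1f}  {flag}")
```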
With limited data or correlated features, ARD can give misleading relevance estimates. If two features are highly correlated, ARD may assign high relevance to one and low to the other arbitrarily. Always validate with domain knowledge.
ARD enables automatic feature selection within the GP framework:
Soft Selection: Use all features but let ARD weight them by relevance. Features with large $\ell_d$ contribute minimally to predictions.
Hard Selection: After ARD optimization, remove features where $\ell_d > \tau$ for some threshold $\tau$. Then retrain with reduced feature set.
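Here is a sketch of the hard-selection workflow, using the optimize_ard_gp function above on synthetic data; the threshold of 10× the smallest length scale follows the table's heuristic and is a user choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 2] + 0.05 * rng.normal(size=100)  # dims 0, 2 matter

result = optimize_ard_gp(X, y)
ls = result['length_scales']

tau = 10 * ls.min()                    # threshold: 10x the smallest length scale
keep = np.where(ls <= tau)[0]          # features judged relevant
print("Keeping features:", keep)

X_reduced = X[:, keep]                 # retrain with the reduced feature set
result_reduced = optimize_ard_gp(X_reduced, y)
```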
Benefits over explicit feature selection: relevance is learned jointly with the other hyperparameters, there is no combinatorial search over feature subsets, and the weighting is soft, so weakly informative features can still contribute.
Computational Cost: ARD adds $D$ parameters (one per dimension). For high-D inputs, this increases optimization difficulty.
Overfitting Risk: With many hyperparameters and little data, ARD can overfit. The number of effective parameters is roughly $D + 2$ (D length scales + signal + noise variance).
Scaling: Always normalize inputs before using ARD. Otherwise, features on different scales will have incomparable length scales.
Initialization: Initialize all length scales similarly (e.g., to median pairwise distance in that dimension). Random initialization can lead to poor local optima.
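A sketch of both suggestions (standardizing inputs and median-distance initialization); the two helpers are illustrative, not part of a library API:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling so length scales are comparable across dimensions."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

def median_lengthscale_init(X):
    """Initialize each ℓ_d to the median pairwise distance in that dimension."""
    n, D = X.shape
    ls0 = np.empty(D)
    for d in range(D):
        diffs = np.abs(X[:, d][:, None] - X[:, d][None, :])
        ls0[d] = np.median(diffs[np.triu_indices(n, k=1)])   # distinct pairs only
    return ls0

# Usage: X_std = standardize(X); ls0 = median_lengthscale_init(X_std)
```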
Reliable ARD typically requires n >> D (number of observations much greater than dimensions). With n ≈ D or n < D, consider using a single shared length scale or dimensionality reduction before GP modeling.
Congratulations! You've completed the module on GP Kernels and Hyperparameters. You now understand kernel design principles, how to combine kernels for expressive models, hyperparameter optimization via marginal likelihood, and automatic relevance determination for feature selection.