In many machine learning problems, not all input features are equally important. Some features strongly influence the output, while others are irrelevant noise. Automatic Relevance Determination (ARD) is a powerful technique that lets the Gaussian Process learn which features matter directly from data.
The key idea: instead of using a single length scale $\ell$ for all dimensions, we use a separate length scale $\ell_d$ for each input dimension $d$. Features with large $\ell_d$ are effectively 'turned off'—the function is insensitive to them.
By the end of this page, you will understand how ARD kernels work, interpret learned length scales as feature importances, implement ARD for automatic feature selection, and recognize when ARD helps and when it can cause problems.
The ARD Squared Exponential kernel replaces the isotropic distance with a weighted distance:
$$k_{ARD}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - x'_d)^2}{\ell_d^2}\right)$$
This is equivalent to: $$k_{ARD}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^T \Lambda^{-1} (\mathbf{x} - \mathbf{x}')\right)$$
where $\Lambda = \text{diag}(\ell_1^2, \ell_2^2, \ldots, \ell_D^2)$ is the diagonal matrix of squared length scales.
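To make the equivalence of the two forms concrete, here is a small standalone check (a sketch, separate from the implementation below) that evaluates both expressions for one pair of points:

```python
import numpy as np

# Check that the per-dimension sum and the Λ⁻¹ quadratic form agree
rng = np.random.default_rng(0)
D = 4
x, x_prime = rng.normal(size=D), rng.normal(size=D)
ell = np.array([0.5, 2.0, 1.0, 3.0])    # per-dimension length scales (illustrative)
sigma_f2 = 1.5                          # signal variance σ_f² (illustrative)

# Form 1: weighted sum over dimensions
k1 = sigma_f2 * np.exp(-0.5 * np.sum((x - x_prime)**2 / ell**2))

# Form 2: quadratic form with Λ = diag(ℓ_1², ..., ℓ_D²)
Lambda_inv = np.diag(1.0 / ell**2)
diff = x - x_prime
k2 = sigma_f2 * np.exp(-0.5 * diff @ Lambda_inv @ diff)

assert np.isclose(k1, k2)
```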
Interpretation of Length Scales: a small $\ell_d$ means the function varies quickly along dimension $d$, so that feature strongly influences predictions; a large $\ell_d$ means the function is nearly constant along dimension $d$, so that feature is effectively ignored.
```python
import numpy as np

class ARDKernel:
    """
    Automatic Relevance Determination (ARD) kernel.

    Each input dimension has its own length scale, enabling automatic
    feature importance determination.
    """

    def __init__(self, length_scales: np.ndarray, signal_variance: float = 1.0):
        """
        Parameters
        ----------
        length_scales : np.ndarray of shape (D,)
            Per-dimension length scales
        signal_variance : float
            Signal variance σ_f²
        """
        self.length_scales = np.asarray(length_scales)
        self.signal_variance = signal_variance

    def __call__(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """
        Compute the ARD kernel matrix.

        X1: (n1, D), X2: (n2, D)
        Returns: (n1, n2) kernel matrix
        """
        # Scale each dimension by its length scale
        X1_scaled = X1 / self.length_scales
        X2_scaled = X2 / self.length_scales

        # Compute squared distances in the scaled space
        sq_dist = np.sum(X1_scaled**2, axis=1, keepdims=True) + \
                  np.sum(X2_scaled**2, axis=1) - 2 * X1_scaled @ X2_scaled.T

        return self.signal_variance * np.exp(-0.5 * sq_dist)

    def relevance_scores(self) -> np.ndarray:
        """
        Compute relevance scores from length scales.

        Higher scores = more relevant features.
        Uses the inverse squared length scale as relevance.
        """
        return 1.0 / self.length_scales**2

    def feature_ranking(self) -> np.ndarray:
        """Return indices of features sorted by relevance (most relevant first)."""
        return np.argsort(self.length_scales)  # Smallest length scale = most relevant


# Example usage
D = 5  # 5 input dimensions
length_scales = np.array([0.5, 5.0, 0.3, 10.0, 1.0])  # Dims 2, 0 are most relevant

kernel = ARDKernel(length_scales)
print("Feature ranking:", kernel.feature_ranking())        # [2, 0, 4, 1, 3]
print("Relevance scores:", kernel.relevance_scores().round(2))
```

ARD length scales are learned by maximizing the marginal likelihood, just like other hyperparameters. The gradient with respect to each length scale is:
$$\frac{\partial \log p(\mathbf{y})}{\partial \ell_d} = \frac{1}{2}\text{tr}\left((\boldsymbol{\alpha}\boldsymbol{\alpha}^T - K^{-1})\frac{\partial K}{\partial \ell_d}\right)$$
where: $$\frac{\partial K_{ij}}{\partial \ell_d} = K_{ij} \cdot \frac{(x_{id} - x_{jd})^2}{\ell_d^3}$$
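This derivative can be turned directly into code. The sketch below computes the analytic gradient with respect to each length scale; it forms $K^{-1}$ explicitly for brevity, whereas a production implementation would reuse a Cholesky factor instead:

```python
import numpy as np

def ard_kernel_matrix(X, ls, sv):
    """ARD squared-exponential kernel matrix, as defined above."""
    Xs = X / ls
    sq = np.sum(Xs**2, axis=1, keepdims=True) + np.sum(Xs**2, axis=1) - 2 * Xs @ Xs.T
    return sv * np.exp(-0.5 * np.maximum(sq, 0.0))

def grad_log_ml_lengthscales(X, y, ls, sv, nv):
    """Gradient of the log marginal likelihood w.r.t. each length scale ℓ_d."""
    n, D = X.shape
    Kf = ard_kernel_matrix(X, ls, sv)
    K = Kf + nv * np.eye(n)
    K_inv = np.linalg.inv(K)                 # fine for a small demo
    alpha = K_inv @ y
    A = np.outer(alpha, alpha) - K_inv       # (αα^T − K⁻¹)
    grads = np.empty(D)
    for d in range(D):
        diff_sq = (X[:, d][:, None] - X[:, d][None, :])**2   # (x_id − x_jd)²
        dK = Kf * diff_sq / ls[d]**3                          # ∂K/∂ℓ_d (noise term drops out)
        grads[d] = 0.5 * np.trace(A @ dK)
    return grads
```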
During optimization, the marginal likelihood naturally pushes irrelevant length scales to large values.
Length scales of irrelevant features can drift toward infinity, which causes numerical issues. Optimize log(ℓ_d) instead of ℓ_d to keep the parameters positive and well scaled, and consider adding a weak prior that penalizes very large length scales.
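As a minimal sketch of the second suggestion, a weak Gaussian prior on the log length scales can be added as a penalty to the negative log marginal likelihood (the prior width of 3.0 below is an arbitrary illustrative choice, not a recommendation):

```python
import numpy as np

def log_lengthscale_penalty(log_ls, prior_std=3.0):
    """Weak Gaussian prior on log(ℓ_d): penalizes length scales far from 1 on a log scale."""
    return 0.5 * np.sum((log_ls / prior_std) ** 2)

# Inside a negative-log-marginal-likelihood function (such as neg_log_ml below),
# one would return  nll + log_lengthscale_penalty(log_ls)  instead of nll.
```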
```python
import numpy as np
from scipy.optimize import minimize

def optimize_ard_gp(X, y, n_restarts=5):
    """
    Optimize ARD GP hyperparameters by maximizing the marginal likelihood.

    Returns optimized length scales, signal variance, and noise variance.
    """
    n, D = X.shape

    def neg_log_ml(log_theta):
        # Unpack log-transformed parameters
        log_ls = log_theta[:D]     # D length scales
        log_sv = log_theta[D]      # signal variance
        log_nv = log_theta[D + 1]  # noise variance
        ls = np.exp(log_ls)
        sv = np.exp(log_sv)
        nv = np.exp(log_nv)

        # Build ARD kernel matrix
        X_scaled = X / ls
        sq_dist = np.sum(X_scaled**2, axis=1, keepdims=True) + \
                  np.sum(X_scaled**2, axis=1) - 2 * X_scaled @ X_scaled.T
        K = sv * np.exp(-0.5 * sq_dist) + nv * np.eye(n)

        try:
            L = np.linalg.cholesky(K)
        except np.linalg.LinAlgError:
            return 1e10  # Penalize numerically unstable hyperparameters

        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        # Negative log marginal likelihood
        # (constant n/2·log 2π omitted; it does not affect the optimum)
        nll = 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
        return nll

    best_result = None
    best_nll = np.inf

    for _ in range(n_restarts):
        # Initialize: length scales from data spread, variances from y
        log_theta0 = np.concatenate([
            np.log(np.std(X, axis=0) + 1e-6),  # length scales ~ data spread
            [np.log(np.var(y))],               # signal variance ~ y variance
            [np.log(0.1 * np.var(y))]          # noise = 10% of signal
        ])
        log_theta0 += 0.5 * np.random.randn(D + 2)  # Add randomness

        result = minimize(neg_log_ml, log_theta0, method='L-BFGS-B')
        if result.fun < best_nll:
            best_nll = result.fun
            best_result = result.x

    # Extract optimized hyperparameters
    length_scales = np.exp(best_result[:D])
    signal_var = np.exp(best_result[D])
    noise_var = np.exp(best_result[D + 1])

    return {
        'length_scales': length_scales,
        'signal_variance': signal_var,
        'noise_variance': noise_var,
        'relevance_ranking': np.argsort(length_scales)
    }
```

After optimization, the length scales reveal feature importance:
Relevance Score: A common metric is the inverse squared length scale: $$r_d = \frac{1}{\ell_d^2}$$
Higher $r_d$ means dimension $d$ is more relevant.
Normalized Relevance: To compare across features: $$\bar{r}_d = \frac{r_d}{\sum_{d'} r_{d'}}$$
| Length Scale Ratio ($\ell_d / \ell_{min}$) | Interpretation |
|---|---|
| 1 - 2 | Highly relevant feature |
| 2 - 5 | Moderately relevant |
| 5 - 10 | Marginally relevant |
| > 10 | Likely irrelevant (candidate for removal) |
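The sketch below applies these formulas and the rule-of-thumb table to the example length scales used earlier (the >10 cutoff is the table's heuristic, not a hard rule):

```python
import numpy as np

ls = np.array([0.5, 5.0, 0.3, 10.0, 1.0])   # e.g., learned length scales

r = 1.0 / ls**2              # relevance scores r_d = 1/ℓ_d²
r_norm = r / r.sum()         # normalized relevance
ratio = ls / ls.min()        # length-scale ratio ℓ_d / ℓ_min

for d in range(len(ls)):
    flag = "likely irrelevant" if ratio[d] > 10 else ""
    print(f"dim {d}: ℓ={ls[d]:5.2f}  r={r[d]:6.2f}  "
          f"normalized r={r_norm[d]:.3f}  ratio={ratio[d]:5.1f}  {flag}")
```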
With limited data or correlated features, ARD can give misleading relevance estimates. If two features are highly correlated, ARD may assign high relevance to one and low to the other arbitrarily. Always validate with domain knowledge.
ARD enables automatic feature selection within the GP framework:
Soft Selection: Use all features but let ARD weight them by relevance. Features with large $\ell_d$ contribute minimally to predictions.
Hard Selection: After ARD optimization, remove features where $\ell_d > \tau$ for some threshold $\tau$. Then retrain with reduced feature set.
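Here is a sketch of the hard-selection workflow, using the optimize_ard_gp function above on synthetic data; the threshold of 10× the smallest length scale follows the table's heuristic and is a user choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 2] + 0.05 * rng.normal(size=100)  # dims 0, 2 matter

result = optimize_ard_gp(X, y)
ls = result['length_scales']

tau = 10 * ls.min()                    # threshold: 10x the smallest length scale
keep = np.where(ls <= tau)[0]          # features judged relevant
print("Keeping features:", keep)

X_reduced = X[:, keep]                 # retrain with the reduced feature set
result_reduced = optimize_ard_gp(X_reduced, y)
```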
Benefits over explicit feature selection: relevance is learned jointly with the other hyperparameters, there is no combinatorial search over feature subsets, and the weighting is soft, so weakly informative features can still contribute.
Computational Cost: ARD adds $D$ parameters (one per dimension). For high-D inputs, this increases optimization difficulty.
Overfitting Risk: With many hyperparameters and little data, ARD can overfit. The number of effective parameters is roughly $D + 2$ (D length scales + signal + noise variance).
Scaling: Always normalize inputs before using ARD. Otherwise, features on different scales will have incomparable length scales.
Initialization: Initialize all length scales similarly (e.g., to median pairwise distance in that dimension). Random initialization can lead to poor local optima.
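A sketch of both suggestions (standardizing inputs and median-distance initialization); the two helpers are illustrative, not part of a library API:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling so length scales are comparable across dimensions."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

def median_lengthscale_init(X):
    """Initialize each ℓ_d to the median pairwise distance in that dimension."""
    n, D = X.shape
    ls0 = np.empty(D)
    for d in range(D):
        diffs = np.abs(X[:, d][:, None] - X[:, d][None, :])
        ls0[d] = np.median(diffs[np.triu_indices(n, k=1)])   # distinct pairs only
    return ls0

# Usage: X_std = standardize(X); ls0 = median_lengthscale_init(X_std)
```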
Reliable ARD typically requires n >> D (number of observations much greater than dimensions). With n ≈ D or n < D, consider using a single shared length scale or dimensionality reduction before GP modeling.
Congratulations! You've completed the module on GP Kernels and Hyperparameters. You now understand kernel design principles, how to combine kernels for expressive models, hyperparameter optimization via marginal likelihood, and automatic relevance determination for feature selection.