In high-dimensional problems, not all input features matter equally. Some dimensions may be highly predictive, others completely irrelevant, and many somewhere in between. Automatic Relevance Determination (ARD) addresses this by assigning a separate length-scale to each input dimension, allowing the GP to automatically discover which features are important.
ARD is more than a convenience—it's a principled Bayesian approach to feature selection that quantifies uncertainty about feature relevance and can dramatically improve both predictive performance and interpretability in high-dimensional settings.
By the end of this page, you will understand: (1) the ARD kernel formulation, (2) how length-scales encode feature relevance, (3) optimization and regularization of ARD parameters, (4) connections to sparsity and Bayesian feature selection, (5) practical guidelines for high-dimensional GP modeling.
Standard Isotropic Kernel (Single Length-Scale)
The standard RBF kernel uses one length-scale for all dimensions:
$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{|\mathbf{x} - \mathbf{x}'|^2}{2\ell^2}\right) = \sigma^2 \exp\left(-\frac{1}{2\ell^2}\sum_{d=1}^{D}(x_d - x'_d)^2\right)$$
This treats all dimensions identically—a strong assumption that's often wrong.
ARD Kernel (Per-Dimension Length-Scales)
$$k_{\text{ARD}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - x'_d)^2}{\ell_d^2}\right)$$
Equivalently, with diagonal matrix $\mathbf{\Lambda} = \text{diag}(\ell_1^{-2}, ..., \ell_D^{-2})$:
$$k_{\text{ARD}}(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^\top \mathbf{\Lambda} (\mathbf{x} - \mathbf{x}')\right)$$
- Small ℓ_d: the function is sensitive to changes in dimension d → dimension d is relevant.
- Large ℓ_d: the function barely changes with dimension d → dimension d is irrelevant.
- As ℓ_d → ∞, dimension d is ignored completely.

This gives a continuous measure of feature importance.
| Length-Scale Value | Interpretation | Effect on Kernel | Feature Status |
|---|---|---|---|
| ℓ_d ≈ 0.1 | Function varies rapidly with x_d | Strong sensitivity | Highly relevant |
| ℓ_d ≈ 1.0 | Moderate variation with x_d | Moderate sensitivity | Somewhat relevant |
| ℓ_d ≈ 10 | Function barely varies with x_d | Low sensitivity | Marginally relevant |
| ℓ_d → ∞ | No dependence on x_d | Zero sensitivity | Irrelevant (pruned) |
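To make the table concrete, here is a quick numeric check (a small illustrative sketch) of the per-dimension kernel factor exp(−Δ²/(2ℓ_d²)) for a unit difference Δ = 1 in dimension d:

```python
import numpy as np

# Per-dimension factor of the ARD kernel for a unit change in x_d
for ell in [0.1, 1.0, 10.0]:
    factor = np.exp(-0.5 / ell**2)
    print(f"ell_d = {ell:5.1f}: factor for |x_d - x'_d| = 1 is {factor:.3f}")

# ell_d = 0.1 -> ~0.000 (a unit change kills the correlation: highly relevant)
# ell_d = 1.0 -> ~0.607 (moderate sensitivity)
# ell_d = 10  -> ~0.995 (the dimension is nearly ignored)
```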
```python
import numpy as np
import matplotlib.pyplot as plt

def ard_rbf_kernel(X1, X2, sigma_sq=1.0, length_scales=None):
    """
    ARD RBF kernel with per-dimension length-scales.

    Parameters
    ----------
    X1 : ndarray of shape (n1, D)
    X2 : ndarray of shape (n2, D)
    sigma_sq : float
        Signal variance
    length_scales : ndarray of shape (D,)
        Per-dimension length-scales
    """
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    D = X1.shape[1]
    if length_scales is None:
        length_scales = np.ones(D)
    length_scales = np.asarray(length_scales, dtype=float)

    # Scale inputs by inverse length-scales
    X1_scaled = X1 / length_scales
    X2_scaled = X2 / length_scales

    # Compute squared distances in the scaled space
    sq_dists = np.sum((X1_scaled[:, np.newaxis, :] - X2_scaled[np.newaxis, :, :])**2, axis=2)

    return sigma_sq * np.exp(-0.5 * sq_dists)


def visualize_ard_effect():
    """Show how different dimension-wise length-scales affect the kernel."""
    np.random.seed(42)
    n = 100

    # Generate 2D data
    X = np.random.randn(n, 2)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Case 1: Both dimensions equally relevant
    K1 = ard_rbf_kernel(X, X, length_scales=[1.0, 1.0])
    axes[0].imshow(K1, cmap='viridis')
    axes[0].set_title('ℓ = [1.0, 1.0]\nBoth dimensions equally relevant')

    # Case 2: Dimension 1 more relevant
    K2 = ard_rbf_kernel(X, X, length_scales=[0.3, 3.0])
    axes[1].imshow(K2, cmap='viridis')
    axes[1].set_title('ℓ = [0.3, 3.0]\nDimension 1 dominant')

    # Case 3: Dimension 2 irrelevant
    K3 = ard_rbf_kernel(X, X, length_scales=[1.0, 100.0])
    axes[2].imshow(K3, cmap='viridis')
    axes[2].set_title('ℓ = [1.0, 100]\nDimension 2 irrelevant')

    plt.suptitle('ARD Kernel: Same Data, Different Relevance Structures', fontsize=14)
    plt.tight_layout()
    return fig

visualize_ard_effect()
```

Like other kernel hyperparameters, ARD length-scales are typically learned by maximizing the log marginal likelihood (LML):
$$\log p(\mathbf{y}|\mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top\mathbf{K}_y^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K}_y| - \frac{n}{2}\log(2\pi)$$
where $\boldsymbol{\theta} = (\sigma^2, \ell_1, ..., \ell_D, \sigma_n^2)$ are all hyperparameters.
Gradient-Based Optimization
The gradient of LML with respect to each length-scale $\ell_d$ can be computed:
$$\frac{\partial \log p(\mathbf{y}|\mathbf{X})}{\partial \ell_d} = \frac{1}{2}\text{tr}\left((\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \mathbf{K}_y^{-1})\frac{\partial \mathbf{K}_y}{\partial \ell_d}\right)$$
where $\boldsymbol{\alpha} = \mathbf{K}_y^{-1}\mathbf{y}$.
Modern GP libraries (GPyTorch, GPflow) compute these gradients automatically.
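To make the trace formula concrete, here is a minimal NumPy sketch (an illustration, not library code) of ∂LML/∂ℓ_d for the ARD-RBF kernel. It reuses the `ard_rbf_kernel` helper defined above; for the RBF form, ∂K/∂ℓ_d is K multiplied elementwise by (x_d − x'_d)²/ℓ_d³.

```python
import numpy as np

def lml_gradient_lengthscale(X, y, sigma_sq, length_scales, sigma_n_sq, d):
    """Gradient of the log marginal likelihood w.r.t. length-scale ell_d
    (a sketch of the trace formula above; uses a direct inverse for clarity)."""
    n = X.shape[0]
    K = ard_rbf_kernel(X, X, sigma_sq, length_scales)   # defined earlier
    K_y = K + sigma_n_sq * np.eye(n)

    K_y_inv = np.linalg.inv(K_y)
    alpha = K_y_inv @ y

    # dK/d(ell_d): elementwise K_ij * (x_id - x_jd)^2 / ell_d^3
    diffs_sq = (X[:, d:d+1] - X[:, d:d+1].T)**2
    dK = K * diffs_sq / length_scales[d]**3

    # (1/2) tr((alpha alpha^T - K_y^{-1}) dK/d(ell_d))
    return 0.5 * np.trace((np.outer(alpha, alpha) - K_y_inv) @ dK)
```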
With D input dimensions, ARD introduces D length-scale parameters. For high-dimensional data (D > 100), this can lead to: (1) optimization challenges from many local optima, (2) a risk of overfitting when n/D is small, and (3) increased computational cost per gradient step. Regularization and careful initialization become essential.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cho_solve, cho_factor

def log_marginal_likelihood(params, X, y):
    """
    Compute negative log marginal likelihood for ARD-RBF GP.

    params: [log_sigma, log_ell_1, ..., log_ell_D, log_sigma_n]
    """
    n, D = X.shape

    # Extract parameters (work in log space for positivity)
    log_sigma = params[0]
    log_ells = params[1:D+1]
    log_sigma_n = params[D+1]

    sigma_sq = np.exp(2 * log_sigma)
    length_scales = np.exp(log_ells)
    sigma_n_sq = np.exp(2 * log_sigma_n)

    # Compute ARD kernel
    K = ard_rbf_kernel(X, X, sigma_sq, length_scales)
    K_y = K + sigma_n_sq * np.eye(n)

    try:
        L = cho_factor(K_y)
        alpha = cho_solve(L, y)

        # Log marginal likelihood
        lml = -0.5 * y @ alpha
        lml -= np.sum(np.log(np.diag(L[0])))  # log|K_y|^0.5
        lml -= 0.5 * n * np.log(2 * np.pi)

        return -lml  # Negative for minimization
    except np.linalg.LinAlgError:
        return np.inf  # Return large value if Cholesky fails


def fit_ard_gp(X, y, n_restarts=5):
    """
    Fit ARD-GP with multiple restarts.
    """
    n, D = X.shape
    n_params = D + 2  # sigma, D length-scales, sigma_n

    best_result = None
    best_nlml = np.inf

    for restart in range(n_restarts):
        # Random initialization
        init_params = np.random.randn(n_params) * 0.5

        result = minimize(
            log_marginal_likelihood,
            init_params,
            args=(X, y),
            method='L-BFGS-B',
            options={'maxiter': 200}
        )

        if result.fun < best_nlml:
            best_nlml = result.fun
            best_result = result

    # Extract learned parameters
    params = best_result.x
    sigma = np.exp(params[0])
    length_scales = np.exp(params[1:D+1])
    sigma_n = np.exp(params[D+1])

    return {
        'sigma': sigma,
        'length_scales': length_scales,
        'sigma_n': sigma_n,
        'nlml': best_nlml
    }


def demonstrate_ard_feature_selection():
    """
    Show ARD discovering relevant features.
    """
    np.random.seed(42)
    n = 200
    D = 5  # 5 dimensions, but only 2 are relevant

    X = np.random.randn(n, D)

    # True function only depends on dimensions 0 and 2
    y_true = np.sin(3 * X[:, 0]) + X[:, 2]**2
    y = y_true + 0.3 * np.random.randn(n)

    result = fit_ard_gp(X, y, n_restarts=3)

    print("ARD Feature Selection Demo")
    print("=" * 50)
    print("True relevant dimensions: 0, 2")
    print("\nLearned length-scales:")
    for d in range(D):
        relevance = "RELEVANT" if result['length_scales'][d] < 2.0 else "irrelevant"
        print(f"  Dim {d}: ℓ = {result['length_scales'][d]:.3f} ({relevance})")

    return result

# Run demo
result = demonstrate_ard_feature_selection()
```

With many ARD parameters, regularization prevents overfitting and encourages sparsity (pushing irrelevant length-scales to infinity).
Prior Distributions on Length-Scales
In a fully Bayesian treatment, we place priors on the length-scales:
$$p(\ell_d) \sim \text{InverseGamma}(\alpha, \beta) \quad \text{or} \quad p(\log \ell_d) \sim \mathcal{N}(\mu, \sigma^2)$$
Sparsity-Inducing Priors
To encourage feature selection (large length-scales for irrelevant features):
A simple, effective approach is to add a log-prior penalty to the LML objective. For example, a Gamma(α, β) prior on each ℓ_d contributes (α − 1)·log ℓ_d − β·ℓ_d to the objective: L_regularized = LML + Σ_d [(α − 1)·log ℓ_d − β·ℓ_d]. For α > 1 this discourages both very small length-scales (overfitting) and very large ones (complete feature removal).
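As an illustration of this penalized objective, here is a minimal sketch of a MAP version of the earlier NLML function. The Gamma hyperparameters `alpha` and `beta` are illustrative choices, and the Jacobian term from optimizing in log ℓ space is omitted for simplicity.

```python
import numpy as np

def neg_log_posterior(params, X, y, alpha=2.0, beta=1.0):
    """Negative LML plus negative log Gamma prior on each length-scale
    (a sketch; alpha and beta are illustrative, not prescribed above)."""
    n, D = X.shape
    nlml = log_marginal_likelihood(params, X, y)   # defined earlier
    log_ells = params[1:D+1]
    ells = np.exp(log_ells)
    # log Gamma(ell; alpha, beta) up to a constant: (alpha - 1)*log(ell) - beta*ell
    log_prior = np.sum((alpha - 1) * log_ells - beta * ells)
    return nlml - log_prior

# Drop-in replacement for the objective passed to scipy.optimize.minimize
# in fit_ard_gp, e.g. minimize(neg_log_posterior, init_params, args=(X, y), ...).
```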
Preventing Overfitting with ARD
| Strategy | Description | When to Use |
|---|---|---|
| Priors on ℓ_d | Regularize toward sensible values | Always with D > 10 |
| Cross-validation | Select hyperparameters on held-out data | Limited data |
| Feature grouping | Share length-scales across related features | Domain knowledge available |
| Dimensionality reduction | PCA/random projection before GP | D >> n |
| Sparse GP | Combine with inducing points | Large n and D |
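The feature-grouping strategy from the table above can be sketched as a kernel in which dimensions in the same group share one length-scale, so only one parameter per group is learned. The `groups` mapping below is a hypothetical helper, not something defined in the text.

```python
import numpy as np

def grouped_ard_kernel(X1, X2, sigma_sq, group_length_scales, groups):
    """ARD-RBF kernel with one shared length-scale per feature group.
    groups[d] gives the group index of dimension d (illustrative)."""
    per_dim_ells = np.asarray(group_length_scales, dtype=float)[np.asarray(groups)]
    return ard_rbf_kernel(X1, X2, sigma_sq, per_dim_ells)  # defined earlier

# Example: 6 features in 2 groups -> only 2 length-scales to learn
groups = [0, 0, 0, 1, 1, 1]
X = np.random.randn(5, 6)
K = grouped_ard_kernel(X, X, sigma_sq=1.0, group_length_scales=[0.5, 3.0], groups=groups)
```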
Initialization Strategies
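One widely used heuristic (an assumption here, not prescribed by the text) is to initialize each ℓ_d from the spread of the data in that dimension, for example its median absolute pairwise difference, so that every dimension starts out moderately relevant rather than at a random scale. A sketch:

```python
import numpy as np

def init_length_scales(X):
    """Heuristic initialization: set each ell_d to the median absolute
    pairwise difference in dimension d (an illustrative choice)."""
    n, D = X.shape
    ells = np.zeros(D)
    for d in range(D):
        diffs = np.abs(X[:, d, None] - X[None, :, d])
        ells[d] = np.median(diffs[diffs > 0])
    return ells

# With the earlier optimizer, pass np.log(init_length_scales(X)) as the
# starting values for the length-scale entries instead of random draws.
```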
High-dimensional inputs (D > 50) present unique challenges for ARD:
The Curse of Dimensionality
Strategies for High-Dimensional ARD
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def ard_with_pca(X, y, n_components=10):
    """
    Two-stage approach for high-dimensional ARD:
    1. Reduce dimensions with PCA
    2. Apply ARD-GP in reduced space
    """
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # PCA reduction
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X_scaled)

    # Fit ARD-GP in reduced space
    result = fit_ard_gp(X_reduced, y, n_restarts=3)

    # Transform ARD relevance back to original space:
    # components with low length-scale in PC space ->
    # features with high loadings on those PCs are relevant
    relevance_pcs = 1.0 / result['length_scales']
    relevance_features = np.abs(pca.components_.T) @ relevance_pcs

    return {
        'result': result,
        'pca': pca,
        'feature_relevance': relevance_features
    }


def additive_gp_ard(X, y):
    """
    Additive GP: f(x) = Σ_d f_d(x_d)

    Each marginal GP has its own length-scale.
    Much more scalable than full ARD for high D.
    """
    n, D = X.shape

    # Fit separate univariate GPs for each dimension
    length_scales = np.zeros(D)
    for d in range(D):
        X_d = X[:, d:d+1]
        # Fit simple GP to get length-scale
        # (In practice, optimize LML for each dimension)
        # Here we just estimate from data spread
        length_scales[d] = np.std(X_d) * 2

    print("Additive GP: Independent length-scale per dimension")
    print("Computational cost: O(D × n³) vs O(n³) for full GP")
    print("But more scalable: each subproblem is 1D")

    return length_scales


def demo_high_dimensional():
    """Demonstrate high-dimensional ARD strategies."""
    np.random.seed(42)
    n = 200
    D_original = 100  # High-dimensional input
    D_relevant = 5    # Only 5 features matter

    # Generate sparse relevance structure
    X = np.random.randn(n, D_original)
    relevant_dims = np.random.choice(D_original, D_relevant, replace=False)

    # True function depends only on relevant dimensions
    y = np.sum(np.sin(2 * X[:, relevant_dims]), axis=1) + 0.3 * np.random.randn(n)

    print("High-Dimensional ARD Demo")
    print("=" * 50)
    print(f"Original dimensions: {D_original}")
    print(f"Truly relevant dimensions: {sorted(relevant_dims)}")

    # Strategy 1: PCA + ARD
    result = ard_with_pca(X, y, n_components=10)

    # Find top features by learned relevance
    top_features = np.argsort(result['feature_relevance'])[-10:]
    print(f"\nTop 10 features by PCA+ARD: {sorted(top_features)}")

    # Check overlap with true relevant features
    overlap = len(set(top_features) & set(relevant_dims))
    print(f"Correctly identified: {overlap}/{D_relevant}")

    return result

demo_high_dimensional()
```

ARD provides a natural feature importance ranking, but interpretation requires care.
What ARD length-scales can tell you:

✓ Relative importance of features
✓ Approximate scale of variation in each dimension
✓ Which features can be safely removed
✓ Potential feature interactions (if a full ARD matrix is used)

What they cannot tell you:

✗ Causal importance
✗ Performance with a feature removed (test that explicitly)
✗ Nonlinear redundancies between features
✗ Importance for out-of-distribution data
Visualization and Reporting
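A minimal sketch of one way to report learned relevances: take relevance as the inverse length-scale 1/ℓ_d (a common convention, assumed here), normalize it, and plot a bar chart.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ard_relevance(length_scales, feature_names=None):
    """Bar chart of per-dimension relevance, defined as normalized 1/ell_d
    (an illustrative convention, not the only possible one)."""
    ells = np.asarray(length_scales, dtype=float)
    relevance = 1.0 / ells
    relevance /= relevance.sum()
    names = feature_names if feature_names is not None else [f"x{d}" for d in range(len(ells))]

    fig, ax = plt.subplots(figsize=(6, 3))
    ax.bar(names, relevance)
    ax.set_ylabel("Normalized relevance (1/ℓ)")
    ax.set_title("ARD feature relevance")
    fig.tight_layout()
    return fig

# Example with the length-scales learned earlier:
# plot_ard_relevance(result['length_scales'])
```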
Comparison with Other Feature Selection Methods
| Method | Type | Advantages | Disadvantages |
|---|---|---|---|
| ARD | Embedded/Wrapper | Principled, includes uncertainty | Computationally expensive |
| LASSO | Embedded | Fast, sparse solutions | Linear assumption |
| Random Forest | Wrapper | Handles nonlinearity | No uncertainty quantification |
| Mutual Information | Filter | Fast, model-free | Pairwise only |
| Forward Selection | Wrapper | Intuitive | Greedy, misses interactions |
Module Complete: Kernel Design
You've now completed the Kernel Design module, covering base kernel selection, kernel composition, spectral interpretation, and Automatic Relevance Determination.
This toolkit enables you to design GP models that encode sophisticated prior knowledge about your data's structure, improving both predictions and interpretability.
Congratulations! You now have a comprehensive understanding of kernel design for Gaussian Processes. You can select appropriate base kernels, compose them for complex patterns, interpret them spectrally, and use ARD for high-dimensional feature selection. These skills are fundamental for effective GP modeling in practice.