The kernel function is the heart of Kernel PCA. It defines the implicit feature space, determines what nonlinear structures can be discovered, and ultimately controls whether KPCA succeeds or fails on a given dataset. Yet kernel selection is often treated as an afterthought—practitioners default to the RBF kernel with a hastily chosen bandwidth, hoping for the best.
This approach is inadequate for serious applications. Different kernels encode fundamentally different assumptions about data structure. The polynomial kernel captures polynomial relationships; the RBF kernel measures local similarity; string and graph kernels handle structured data. Choosing wisely requires understanding what each kernel does, how its parameters affect behavior, and how to evaluate whether a kernel is appropriate for your data.
This page provides a comprehensive guide to kernel selection for KPCA. We'll examine common kernel families, understand their properties and parameter sensitivities, explore methods for kernel parameter tuning, and develop practical strategies for kernel selection in real applications.
By the end of this page, you will understand the properties and use cases of major kernel families, how kernel parameters affect KPCA behavior, methods for tuning kernel parameters, diagnostic techniques for evaluating kernel appropriateness, and practical strategies for kernel selection in real applications.
Let's survey the most commonly used kernels for KPCA, understanding what each one does and when it's appropriate.
1. Linear Kernel $$k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y}$$
The simplest kernel—equivalent to standard PCA. The feature space equals the input space: $\phi(\mathbf{x}) = \mathbf{x}$.
Use when: Data relationships are approximately linear, or as a baseline for comparison.
Parameters: None.
2. Polynomial Kernel $$k(\mathbf{x}, \mathbf{y}) = (\gamma \mathbf{x}^T \mathbf{y} + c)^d$$
Captures polynomial interactions up to degree $d$. The feature space includes all monomials up to degree $d$.
Use when: Polynomial relationships are expected (e.g., physics-based models, polynomial regression settings).
Parameters: degree $d$, scale $\gamma$, offset $c$ (see the short degree-2 check below).
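To make the "monomials up to degree $d$" claim concrete, here is a small verification sketch (illustrative only, using a hypothetical helper `phi_poly2` and the specific choices $\gamma = 1$, $c = 1$, $d = 2$ with 2D inputs): the kernel value equals an ordinary inner product of explicit degree-2 feature vectors.

```python
import numpy as np

def phi_poly2(x):
    """Explicit feature map for (x.y + 1)^2 with 2D input (hypothetical helper)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

x = np.array([0.7, -1.2])
y = np.array([2.0, 0.5])

kernel_value = (x @ y + 1) ** 2               # polynomial kernel with gamma=1, c=1, d=2
feature_value = phi_poly2(x) @ phi_poly2(y)   # inner product in the explicit feature space

print(kernel_value, feature_value)            # the two values agree
```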
3. Radial Basis Function (RBF) / Gaussian Kernel $$k(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{y}\|^2\right) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right)$$
Measures local similarity—kernel value decays with distance. Corresponds to an infinite-dimensional feature space.
Use when: No prior knowledge about relationship form; general-purpose nonlinear DR; data has local structure.
Parameters: bandwidth $\sigma$, or equivalently $\gamma = 1/(2\sigma^2)$.
The RBF kernel is often the default choice because it is a universal kernel: its infinite-dimensional feature space can approximate any continuous function on a compact domain arbitrarily well. However, this flexibility comes with sensitivity to the bandwidth parameter, which must be tuned carefully.
4. Laplacian Kernel $$k(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{y}\|_1\right)$$
Similar to RBF but uses L1 distance. More robust to outliers, produces "spikier" similarity functions.
Use when: Data may contain outliers; sparse features are relevant.
Parameters: $\gamma$ (bandwidth)
5. Sigmoid Kernel $$k(\mathbf{x}, \mathbf{y}) = \tanh(\gamma \mathbf{x}^T \mathbf{y} + c)$$
Originally motivated by neural network connections. Not always positive semi-definite (only for certain parameter ranges).
Use when: Neural network analogy is relevant (rare in KPCA).
Parameters: $\gamma$ (scale), $c$ (offset). Caution: May not be a valid kernel for all parameter values.
6. Cosine Kernel $$k(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}$$
Measures angular similarity, ignoring magnitude. Equivalent to linear kernel on normalized data.
Use when: Only direction matters (text, proportions, directional data).
Parameters: None.
7. Specialized Kernels
For structured data that doesn't live in a fixed-length vector space, specialized kernels exist: string kernels compare sequences and graph kernels compare networks, and they plug into KPCA exactly like the vector kernels above. The table below summarizes the main kernel families.
| Kernel | Feature Space Dim | Parameters | Best For |
|---|---|---|---|
| Linear | $d$ (input dim) | None | Linear data, baseline |
| Polynomial (degree $p$) | $O(d^p)$ | degree $p$, $\gamma$, $c$ | Polynomial relationships |
| RBF | $\infty$ | $\gamma$ (bandwidth) | General nonlinear, local structure |
| Laplacian | $\infty$ | $\gamma$ | Outlier robustness, sparse features |
| Cosine | $d$ (normalized) | None | Directional data, text |
| Sigmoid | Not well-defined (not always PSD) | $\gamma$, $c$ | Neural network analogy (rarely used) |
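You rarely need to implement these kernels by hand. As a rough comparison sketch (assuming scikit-learn is available and using arbitrary toy data), `sklearn.metrics.pairwise.pairwise_kernels` can compute most of the kernel matrices summarized above:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data; any (n_samples, n_features) array works

# Kernel matrices for several of the families summarized above
kernels = {
    'linear':     pairwise_kernels(X, metric='linear'),
    'polynomial': pairwise_kernels(X, metric='poly', degree=3, gamma=1.0, coef0=1.0),
    'rbf':        pairwise_kernels(X, metric='rbf', gamma=0.5),
    'laplacian':  pairwise_kernels(X, metric='laplacian', gamma=0.5),
    'cosine':     pairwise_kernels(X, metric='cosine'),
}

for name, K in kernels.items():
    off_diag = K[~np.eye(len(X), dtype=bool)]  # exclude self-similarities
    print(f"{name:>10}: shape={K.shape}, off-diag mean={off_diag.mean():.3f}, "
          f"min={off_diag.min():.3f}, max={off_diag.max():.3f}")
```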
The bandwidth parameter $\sigma$ (or equivalently $\gamma = 1/(2\sigma^2)$) in the RBF kernel is arguably the most important hyperparameter in Kernel PCA. Getting it wrong can completely destroy the algorithm's effectiveness.
What Bandwidth Controls
The bandwidth determines the "scale of locality" in the kernel:
Small $\sigma$ (large $\gamma$): Only very nearby points have high kernel similarity. The feature space becomes extremely localized. Each point is almost orthogonal to distant points.
Large $\sigma$ (small $\gamma$): Even distant points have significant kernel similarity. The kernel approaches a constant, and the feature space approaches the linear case.
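A quick numeric illustration (arbitrary numbers, not tied to any particular dataset) shows how sharply the RBF kernel value for two points a fixed distance apart swings between these regimes as the bandwidth changes:

```python
import numpy as np

d = 1.0  # distance between two points (arbitrary units)

# Bandwidths much smaller than, comparable to, and much larger than d
for sigma in [0.1 * d, d, 10.0 * d]:
    k = np.exp(-d**2 / (2 * sigma**2))
    print(f"sigma = {sigma:5.2f}  ->  k(x, y) = {k:.4f}")

# sigma = 0.10: k ~ 2e-22 (points look unrelated; kernel matrix near identity)
# sigma = 1.00: k ~ 0.61  (informative similarity)
# sigma = 10.0: k ~ 0.995 (points look identical; kernel matrix near constant)
```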
Failure Modes
1. Bandwidth Too Small ($\gamma$ too large): each point is similar only to itself, the kernel matrix approaches the identity, and the leading components capture individual points and noise rather than shared structure.
2. Bandwidth Too Large ($\gamma$ too small): all kernel values approach 1, the kernel matrix approaches a constant, and KPCA degenerates toward the linear case, losing the nonlinear structure it was meant to capture.
There's a "Goldilocks zone" for bandwidth where KPCA works well—large enough to connect related points, small enough to distinguish unrelated ones. Finding this zone is essential but non-trivial, and depends on the data's scale and structure.
Heuristic Methods for Initial $\sigma$
1. Median Heuristic
Set $\sigma$ to the median pairwise distance: $$\sigma = \text{median}_{i < j} \|\mathbf{x}_i - \mathbf{x}_j\|$$
This ensures that the "typical" pairwise kernel value is $e^{-0.5} \approx 0.61$—neither too close to 0 nor 1.
2. Percentile-Based
Use a percentile of pairwise distances (e.g., 10th-90th percentile range) depending on expected locality.
3. Mean Distance
Set $\sigma$ proportional to the mean distance: $$\sigma = c \cdot \frac{1}{n(n-1)}\sum_{i \neq j} \|\mathbf{x}_i - \mathbf{x}_j\|$$
with $c \in [0.5, 2]$ as a tuning parameter.
4. Silverman's Rule (adapted from KDE)
For 1D data (or feature-by-feature): $$\sigma \approx 1.06 \cdot s \cdot n^{-1/5}$$
where $s$ is the standard deviation. Less applicable for KPCA but provides intuition.
```python
import numpy as np
from scipy.spatial.distance import pdist


def bandwidth_heuristics(X: np.ndarray) -> tuple:
    """
    Compute various bandwidth heuristics for the RBF kernel.

    Parameters:
        X: Data matrix of shape (n_samples, n_features)

    Returns:
        Tuple of two dictionaries: bandwidth estimates (sigma values)
        and the corresponding gamma values.
    """
    # Compute pairwise distances
    distances = pdist(X, metric='euclidean')

    heuristics = {}

    # Median heuristic (most commonly recommended)
    heuristics['median'] = np.median(distances)

    # Percentile-based
    heuristics['p10'] = np.percentile(distances, 10)
    heuristics['p25'] = np.percentile(distances, 25)
    heuristics['p75'] = np.percentile(distances, 75)
    heuristics['p90'] = np.percentile(distances, 90)

    # Mean distance
    heuristics['mean'] = np.mean(distances)

    # Standard deviation of distances
    heuristics['std'] = np.std(distances)

    # Convert to gamma (gamma = 1 / (2 * sigma^2))
    gammas = {k: 1 / (2 * v**2) if v > 0 else np.inf for k, v in heuristics.items()}

    return heuristics, gammas


def diagnose_bandwidth(sigma: float, X: np.ndarray) -> dict:
    """
    Diagnose whether a bandwidth is appropriate for the data.
    Returns diagnostic metrics.
    """
    n = X.shape[0]
    gamma = 1 / (2 * sigma**2)

    # Compute kernel matrix
    sq_dist = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    K = np.exp(-gamma * sq_dist)

    diagnostics = {}

    # Off-diagonal statistics (exclude self-similarities)
    off_diag = K[np.triu_indices(n, k=1)]
    diagnostics['mean_kernel'] = off_diag.mean()
    diagnostics['min_kernel'] = off_diag.min()
    diagnostics['max_kernel'] = off_diag.max()
    diagnostics['std_kernel'] = off_diag.std()

    # Check for degenerate cases
    diagnostics['near_identity'] = diagnostics['mean_kernel'] < 0.1
    diagnostics['near_constant'] = diagnostics['std_kernel'] < 0.05

    # Effective rank (based on eigenvalue decay)
    K_centered = K - K.mean(axis=1, keepdims=True) - K.mean(axis=0, keepdims=True) + K.mean()
    eigenvalues = np.linalg.eigvalsh(K_centered)
    eigenvalues = eigenvalues[eigenvalues > 1e-10]
    eigenvalues = eigenvalues / eigenvalues.sum()  # Normalize
    entropy = -np.sum(eigenvalues * np.log(eigenvalues + 1e-10))
    diagnostics['effective_rank'] = np.exp(entropy)

    return diagnostics


# Example: compare bandwidth choices
np.random.seed(42)
n = 200
X = np.random.randn(n, 10)

sigmas, gammas = bandwidth_heuristics(X)
print("Bandwidth Heuristics (sigma values):")
for name, sigma in sigmas.items():
    print(f"  {name}: {sigma:.4f}")

print("Diagnostics for different bandwidths:")
for name in ['p10', 'median', 'p90']:
    sigma = sigmas[name]
    diag = diagnose_bandwidth(sigma, X)
    status = "✓ Good" if not (diag['near_identity'] or diag['near_constant']) else "✗ Bad"
    print(f"  σ = {sigma:.4f} ({name}):")
    print(f"    Mean kernel: {diag['mean_kernel']:.3f}, Eff. rank: {diag['effective_rank']:.1f} {status}")
```

Heuristics provide starting points, but systematic parameter selection requires objective criteria and validation.
The Challenge: No Labels
Unlike supervised learning, KPCA is unsupervised—there are no labels to evaluate predictions against. This makes cross-validation less straightforward. We need proxy objectives that correlate with good dimensionality reduction.
Approach 1: Reconstruction Error
Measure how well the low-dimensional representation reconstructs the original data. For KPCA this requires pre-image estimation: project each point onto the leading components, estimate its pre-image in input space, and measure the reconstruction error $\sum_i \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2$.
Problem: Pre-image estimation is itself imperfect, confounding evaluation.
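As a minimal sketch of this idea (assuming scikit-learn, whose `KernelPCA` learns an approximate inverse mapping when `fit_inverse_transform=True`), one can compare reconstruction error across candidate bandwidths:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)

for gamma in [0.1, 1.0, 10.0, 100.0]:
    # fit_inverse_transform=True fits an approximate pre-image map during training
    kpca = KernelPCA(n_components=2, kernel='rbf', gamma=gamma,
                     fit_inverse_transform=True)
    Z = kpca.fit_transform(X)
    X_rec = kpca.inverse_transform(Z)

    # Mean squared reconstruction error in input space
    mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))
    print(f"gamma = {gamma:6.1f}: reconstruction MSE = {mse:.4f}")
```

Because the learned inverse map is itself approximate, treat low reconstruction error as supporting evidence rather than proof of a good kernel choice.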
Approach 2: Supervised Proxy Task
If labels are available (even for a held-out validation set), use classification/regression accuracy on the reduced representation: reduce the data with KPCA, fit a simple model (e.g., k-NN) on the components, and compare cross-validated scores across kernel parameters.
This directly optimizes for a practical goal but requires labels.
A fully unsupervised approach: select parameters that produce stable projections. If repeatedly subsampling data and computing KPCA gives highly variable results, the parameters may be poorly chosen. Stable parameters produce consistent low-dimensional representations.
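One rough way to quantify this stability (a sketch under assumed choices: a fixed "core" subset shared across runs, an RBF kernel, and agreement measured through the core points' pairwise embedded distances, which sidesteps sign and rotation ambiguities) is:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, _ = make_moons(n_samples=400, noise=0.1, random_state=0)
rng = np.random.default_rng(0)

core = np.arange(60)            # points embedded in every run
pool = np.arange(60, len(X))    # points that vary between runs

def stability_score(gamma, n_runs=5, n_extra=150):
    """Mean correlation of the core points' embedded pairwise distances across subsamples."""
    dist_profiles = []
    for _ in range(n_runs):
        extra = rng.choice(pool, size=n_extra, replace=False)
        subset = np.concatenate([core, extra])
        Z = KernelPCA(n_components=2, kernel='rbf', gamma=gamma).fit_transform(X[subset])
        dist_profiles.append(pdist(Z[:len(core)]))  # distances among the core points only
    corrs = [np.corrcoef(dist_profiles[i], dist_profiles[j])[0, 1]
             for i in range(n_runs) for j in range(i + 1, n_runs)]
    return np.mean(corrs)

for gamma in [0.1, 1.0, 10.0, 100.0]:
    print(f"gamma = {gamma:6.1f}: stability = {stability_score(gamma):.3f}")
```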
Approach 3: Eigenvalue Spectrum Analysis
Examine the eigenvalue spectrum of the centered kernel matrix: a well-chosen kernel typically produces a few dominant eigenvalues followed by rapid decay, whereas a near-identity kernel spreads variance almost uniformly across components and a near-constant kernel concentrates nearly everything in a single one.
Quantify via the effective rank (the exponential of the entropy of the normalized eigenvalues), the number of components needed to reach 90% of the variance, and the spectral gap between leading eigenvalues.
Approach 4: Kernel Alignment
If a "target" kernel is available (e.g., based on known labels or domain knowledge), measure alignment between candidate kernel and target:
$$A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\|K_1\|_F \|K_2\|_F}$$
where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product.
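A minimal sketch of this computation (assuming labels are available to build an ideal target kernel $K_2 = \mathbf{y}\mathbf{y}^T$ with $y_i \in \{-1, +1\}$, a common choice of target):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

def kernel_alignment(K1, K2):
    """Frobenius-normalized inner product between two kernel matrices."""
    inner = np.sum(K1 * K2)  # <K1, K2>_F
    return inner / (np.linalg.norm(K1) * np.linalg.norm(K2))

X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)
y_pm = np.where(y == 1, 1.0, -1.0)
K_target = np.outer(y_pm, y_pm)  # ideal kernel: +1 within class, -1 across classes

for gamma in [0.1, 1.0, 10.0, 100.0]:
    K = rbf_kernel(X, gamma=gamma)
    print(f"gamma = {gamma:6.1f}: alignment = {kernel_alignment(K, K_target):.3f}")
```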
Grid Search with Validation
For systematic selection, define a grid of candidate kernels and parameter values, compute the chosen criterion for each (downstream accuracy, stability, or spectrum metrics), and keep the setting with the best validated score, as in the code below.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score


def evaluate_kpca_downstream(
    X: np.ndarray,
    y: np.ndarray,
    gamma_values: list,
    n_components: int = 10,
    cv: int = 5
) -> dict:
    """
    Evaluate KPCA parameters using downstream classification accuracy.

    Parameters:
        X: Data matrix (n, d)
        y: Labels (n,)
        gamma_values: List of gamma values to evaluate
        n_components: Number of KPCA components
        cv: Number of cross-validation folds

    Returns:
        Dictionary of gamma -> evaluation metrics (mean CV accuracy, etc.)
    """
    results = {}

    for gamma in gamma_values:
        # Compute kernel matrix
        sq_dist = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
        K = np.exp(-gamma * sq_dist)

        # Center kernel
        row_means = K.mean(axis=1, keepdims=True)
        col_means = K.mean(axis=0, keepdims=True)
        grand_mean = K.mean()
        K_centered = K - row_means - col_means + grand_mean

        # Eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx][:n_components]
        eigenvectors = eigenvectors[:, idx][:, :n_components]

        # Handle zero/negative eigenvalues
        valid = eigenvalues > 1e-10
        if valid.sum() == 0:
            results[gamma] = {'accuracy': 0.0, 'accuracy_std': 0.0, 'variance_explained': 0.0}
            continue

        eigenvalues_valid = eigenvalues[valid]
        eigenvectors_valid = eigenvectors[:, valid]

        # Project data
        alphas = eigenvectors_valid / np.sqrt(eigenvalues_valid)
        Z = K_centered @ alphas

        # Evaluate with k-NN classifier
        clf = KNeighborsClassifier(n_neighbors=5)
        scores = cross_val_score(clf, Z, y, cv=cv, scoring='accuracy')

        # Variance explained by kept components
        total_var = np.abs(np.linalg.eigvalsh(K_centered)).sum()
        var_explained = eigenvalues_valid.sum() / total_var if total_var > 0 else 0

        results[gamma] = {
            'accuracy': scores.mean(),
            'accuracy_std': scores.std(),
            'variance_explained': var_explained,
            'effective_components': valid.sum()
        }

    return results


def evaluate_eigenspectrum(X: np.ndarray, gamma_values: list) -> dict:
    """
    Evaluate kernel parameters using eigenvalue spectrum analysis.
    Returns metrics for each gamma value.
    """
    results = {}

    for gamma in gamma_values:
        # Compute and center kernel
        sq_dist = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
        K = np.exp(-gamma * sq_dist)
        K_centered = K - K.mean(axis=1, keepdims=True) - K.mean(axis=0, keepdims=True) + K.mean()

        # Eigenvalues
        eigenvalues = np.linalg.eigvalsh(K_centered)
        eigenvalues = np.sort(eigenvalues)[::-1]  # Descending

        # Total variance
        pos_eigenvalues = eigenvalues[eigenvalues > 1e-10]
        total_var = pos_eigenvalues.sum()

        if total_var > 0:
            # Normalized eigenvalues
            normalized = pos_eigenvalues / total_var

            # Effective rank (exponential of entropy)
            entropy = -np.sum(normalized * np.log(normalized + 1e-10))
            eff_rank = np.exp(entropy)

            # Variance explained by top k
            cumsum = np.cumsum(pos_eigenvalues) / total_var
            var_90 = np.searchsorted(cumsum, 0.9) + 1  # Components for 90% variance

            # Spectral gap (ratio of 1st to 2nd eigenvalue)
            if len(pos_eigenvalues) > 1:
                spectral_gap = pos_eigenvalues[0] / pos_eigenvalues[1]
            else:
                spectral_gap = np.inf
        else:
            eff_rank = 0
            var_90 = X.shape[0]
            spectral_gap = 0

        results[gamma] = {
            'effective_rank': eff_rank,
            'components_for_90': var_90,
            'spectral_gap': spectral_gap,
            'total_variance': total_var
        }

    return results


# Example: find optimal gamma
np.random.seed(42)
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, noise=0.05, factor=0.3)

gamma_grid = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]

print("Downstream Accuracy Evaluation:")
acc_results = evaluate_kpca_downstream(X, y, gamma_grid, n_components=5)
for gamma, metrics in sorted(acc_results.items()):
    print(f"  γ = {gamma:5.2f}: Acc = {metrics['accuracy']:.3f} ± {metrics['accuracy_std']:.3f}, "
          f"VarExp = {metrics['variance_explained']:.2%}")

print("Eigenspectrum Analysis:")
spec_results = evaluate_eigenspectrum(X, gamma_grid)
for gamma, metrics in sorted(spec_results.items()):
    print(f"  γ = {gamma:5.2f}: EffRank = {metrics['effective_rank']:5.1f}, "
          f"Comp@90% = {metrics['components_for_90']}")
```

When no single kernel is clearly appropriate, or when different aspects of the data suggest different kernels, multi-kernel methods offer a principled solution.
Multiple Kernel Learning (MKL)
Instead of selecting a single kernel, combine multiple kernels: $$k_{\text{combined}}(\mathbf{x}, \mathbf{y}) = \sum_{m=1}^{M} \mu_m k_m(\mathbf{x}, \mathbf{y})$$
where $\{k_m\}$ are base kernels and $\mu_m \geq 0$ are combination weights (often constrained to sum to 1).
This is a valid kernel as long as base kernels are valid (kernels are closed under non-negative linear combination).
Approaches to Learning Weights: weights can be fixed uniformly as a baseline, tuned by grid search against a validation criterion (supervised accuracy or an unsupervised spectrum measure, as in the code below), or chosen to maximize alignment with a target kernel when one is available (Approach 4 above).
Benefits of Multi-Kernel Approaches: the risk of committing to a single wrong kernel is reduced, structure at several scales can be captured simultaneously, and heterogeneous feature types can each be handled by a kernel suited to them.
A powerful pattern: apply different kernels to different feature subsets. For example, RBF on continuous features, cosine on text features, polynomial on interaction features. Combine these for a comprehensive data representation.
Kernel Combinations
Beyond weighted sums, other valid combination operations:
Product Kernel: $$k(\mathbf{x}, \mathbf{y}) = k_1(\mathbf{x}, \mathbf{y}) \cdot k_2(\mathbf{x}, \mathbf{y})$$
Captures features that are high in both kernels.
Polynomial Kernel on Kernels: $$k(\mathbf{x}, \mathbf{y}) = (k_1(\mathbf{x}, \mathbf{y}) + c)^d$$
Applies polynomial transformation to kernel similarities.
Kernel on Subsets: $$k(\mathbf{x}, \mathbf{y}) = k_1(\mathbf{x}_{[1:d_1]}, \mathbf{y}_{[1:d_1]}) + k_2(\mathbf{x}_{[d_1:d]}, \mathbf{y}_{[d_1:d]})$$
Applies different kernels to disjoint feature subsets.
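The sketch below (an illustrative split: an RBF kernel on the first block of features and a cosine kernel on the rest, with arbitrary toy data) assembles a product kernel and a feature-subset kernel; both remain valid (positive semi-definite) kernel matrices that KPCA can consume.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, cosine_similarity

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
d1 = 5  # arbitrary split: first 5 features vs. remaining 3

# Base kernels on the full feature set
K_rbf = rbf_kernel(X, gamma=0.5)
K_cos = cosine_similarity(X)

# Product kernel: high only where both base kernels are high
K_product = K_rbf * K_cos

# Subset kernel: different kernels on disjoint feature blocks, summed
K_subset = rbf_kernel(X[:, :d1], gamma=0.5) + cosine_similarity(X[:, d1:])

for name, K in [('product', K_product), ('subset sum', K_subset)]:
    eigs = np.linalg.eigvalsh(K)
    print(f"{name:>10}: min eigenvalue = {eigs.min():.2e} (non-negative up to numerical error)")
```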
Bandwidth Mixture
Combine RBF kernels at multiple scales: $$k(\mathbf{x}, \mathbf{y}) = \sum_i \mu_i \exp(-\gamma_i \|\mathbf{x} - \mathbf{y}\|^2)$$
This captures both local and global structure simultaneously.
Kernel PCA with Combined Kernels
The combined kernel matrix is simply: $$\mathbf{K}_{\text{combined}} = \sum_m \mu_m \mathbf{K}_m$$
KPCA proceeds identically using this combined matrix.
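One convenient way to do this (a sketch assuming scikit-learn, whose `KernelPCA` accepts `kernel='precomputed'`) is to build the combined matrix yourself and hand it to an off-the-shelf implementation; the longer example below performs the same steps by hand.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=200, noise=0.1, random_state=0)

# Uniformly weighted multi-scale RBF kernel
gammas, weights = [0.1, 1.0, 10.0], [1/3, 1/3, 1/3]
K_combined = sum(w * rbf_kernel(X, gamma=g) for w, g in zip(weights, gammas))

# KPCA on the precomputed combined kernel (centering is handled internally)
kpca = KernelPCA(n_components=2, kernel='precomputed')
Z = kpca.fit_transform(K_combined)
print(Z.shape)  # (200, 2)
```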
```python
import numpy as np


def combine_kernels(kernel_matrices: list, weights: np.ndarray = None) -> np.ndarray:
    """
    Combine multiple kernel matrices with given weights.

    Parameters:
        kernel_matrices: List of (n, n) kernel matrices
        weights: Combination weights (default: uniform)

    Returns:
        Combined kernel matrix
    """
    M = len(kernel_matrices)
    if weights is None:
        weights = np.ones(M) / M

    assert len(weights) == M
    assert np.all(weights >= 0)

    K_combined = sum(w * K for w, K in zip(weights, kernel_matrices))
    return K_combined


def multi_scale_rbf_kernel(X: np.ndarray, gamma_values: list) -> np.ndarray:
    """
    Combine RBF kernels at multiple scales with uniform weights.
    """
    sq_dist = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)

    kernels = []
    for gamma in gamma_values:
        K = np.exp(-gamma * sq_dist)
        kernels.append(K)

    return combine_kernels(kernels)


def grid_search_kernel_weights(
    kernel_matrices: list,
    X: np.ndarray,
    y: np.ndarray = None,
    n_grid: int = 5
) -> tuple:
    """
    Grid search over kernel combination weights.

    If y is provided, optimizes classification accuracy.
    Otherwise, optimizes for eigenvalue spread (unsupervised).
    """
    M = len(kernel_matrices)

    # Generate weight combinations on the simplex
    if M == 2:
        weight_candidates = [[w, 1 - w] for w in np.linspace(0, 1, n_grid)]
    else:
        # For M > 2, sample uniformly on the simplex
        weight_candidates = []
        for _ in range(n_grid ** (M - 1)):
            w = np.random.dirichlet(np.ones(M))
            weight_candidates.append(w.tolist())

    best_score = -np.inf
    best_weights = None

    for weights in weight_candidates:
        K_combined = combine_kernels(kernel_matrices, np.array(weights))

        # Center
        K_cent = (K_combined - K_combined.mean(axis=1, keepdims=True)
                  - K_combined.mean(axis=0, keepdims=True) + K_combined.mean())

        if y is not None:
            # Supervised: use classification accuracy
            eigenvalues, eigenvectors = np.linalg.eigh(K_cent)
            idx = np.argsort(eigenvalues)[::-1]
            eigenvalues = eigenvalues[idx][:10]
            eigenvectors = eigenvectors[:, idx][:, :10]

            valid = eigenvalues > 1e-10
            if valid.sum() < 2:
                continue

            alphas = eigenvectors[:, valid] / np.sqrt(eigenvalues[valid])
            Z = K_cent @ alphas

            from sklearn.neighbors import KNeighborsClassifier
            from sklearn.model_selection import cross_val_score
            clf = KNeighborsClassifier(n_neighbors=5)
            scores = cross_val_score(clf, Z, y, cv=3)
            score = scores.mean()
        else:
            # Unsupervised: use effective rank (prefer diverse eigenvalues)
            eigenvalues = np.linalg.eigvalsh(K_cent)
            pos_eig = eigenvalues[eigenvalues > 1e-10]
            if len(pos_eig) == 0:
                continue
            normalized = pos_eig / pos_eig.sum()
            entropy = -np.sum(normalized * np.log(normalized + 1e-10))
            score = np.exp(entropy)  # Effective rank

        if score > best_score:
            best_score = score
            best_weights = weights

    return np.array(best_weights), best_score


# Example: multi-scale RBF
np.random.seed(42)
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.1)

# Single kernels
gamma_values = [0.1, 1.0, 10.0]
kernels = []
for gamma in gamma_values:
    sq_dist = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    K = np.exp(-gamma * sq_dist)
    kernels.append(K)

# Find optimal weights
weights, score = grid_search_kernel_weights(kernels, X, y)
print(f"Optimal weights for γ = {gamma_values}: {weights.round(3)}")
print(f"CV Accuracy: {score:.3f}")

# Compare with individual kernels
for i, gamma in enumerate(gamma_values):
    K_cent = (kernels[i] - kernels[i].mean(axis=1, keepdims=True)
              - kernels[i].mean(axis=0, keepdims=True) + kernels[i].mean())
    eig = np.linalg.eigvalsh(K_cent)
    eig = np.sort(eig)[::-1][:10]
    alphas = np.linalg.eigh(K_cent)[1][:, -10:][:, ::-1]
    valid = eig > 1e-10
    Z = K_cent @ (alphas[:, valid] / np.sqrt(eig[valid]))

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(KNeighborsClassifier(5), Z, y, cv=3)
    print(f"Single kernel γ = {gamma}: Accuracy = {scores.mean():.3f}")
```

Drawing together the concepts from this page, here's a practical strategy for kernel selection in KPCA applications.
Step 1: Understand Your Data and Goal
Before touching kernels, clarify what kind of structure you expect (local clusters, polynomial trends, purely directional information), whether the features are homogeneous or of mixed types, and what the reduced representation will be used for (visualization, denoising, or features for a downstream model).
Step 2: Start Simple
Begin with the linear kernel (i.e., standard PCA) as a baseline and an RBF kernel with a median-heuristic bandwidth; a short comparison sketch follows below.
Visualize projections, check eigenvalue spectra. This gives intuition before systematic search.
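A minimal version of this starting point (assuming scikit-learn and a toy dataset; the bandwidth comes from the median heuristic described earlier):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)

# Baseline: standard (linear) PCA
Z_lin = PCA(n_components=2).fit_transform(X)

# RBF KPCA with the median-heuristic bandwidth
sigma = np.median(pdist(X))
gamma = 1.0 / (2.0 * sigma**2)
Z_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=gamma).fit_transform(X)

# Compare how variance spreads across the two components
for name, Z in [('linear PCA', Z_lin), ('RBF KPCA', Z_rbf)]:
    var = np.var(Z, axis=0)
    print(f"{name:>10}: component variance fractions = {np.round(var / var.sum(), 3)}")

# From here, scatter-plot Z_lin and Z_rbf and inspect the eigenvalue spectra.
```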
Step 3: Parameter Sweep for Primary Kernel
For the most promising kernel family, sweep the key parameter (for RBF, the bandwidth) over a logarithmic grid centered on a heuristic value, score each setting with the criteria from the previous section, and inspect projections and eigenvalue spectra for the best few candidates.
In practice, the RBF kernel with a carefully tuned bandwidth works well for most problems. Invest time tuning the bandwidth rather than exploring exotic kernels. Only move to specialized kernels if you have domain reasons or the RBF clearly fails.
Step 4: Validate and Diagnose
For the selected kernel and parameters, check that the kernel matrix is neither near-identity nor near-constant, confirm that projections are stable under subsampling, and, if labels exist, verify that downstream performance actually improves on the linear baseline.
Step 5: Consider Combinations
If single kernels are inadequate, combine RBF kernels at several bandwidths, apply different kernels to different feature subsets, or learn combination weights by grid search as described above.
Common Pitfalls to Avoid: defaulting to the RBF kernel with an untuned bandwidth, forgetting to scale features before applying distance-based kernels, using the sigmoid kernel in parameter ranges where it is not positive semi-definite, and ignoring kernel matrix diagnostics (near-identity or near-constant matrices).
This completes our deep dive into Kernel PCA. Let's consolidate what we've learned across the entire module.
When to Use Kernel PCA
✓ Data has nonlinear structure that linear PCA misses
✓ You need a principled, well-understood method
✓ Moderate sample sizes (n < 10,000 typically)
✓ Downstream task benefits from nonlinear features
When Not to Use Kernel PCA
✗ Linear relationships suffice (use standard PCA)
✗ Very large datasets without approximations
✗ Need interpretable components (feature-space directions are abstract)
✗ Generative applications (KPCA is discriminative/descriptive)
You now possess a comprehensive understanding of Kernel PCA—from the mathematical foundations of the kernel trick and dual formulation, through the practical details of centering and pre-image estimation, to the critical considerations of kernel selection. This knowledge equips you to apply KPCA effectively and to understand its place in the broader landscape of dimensionality reduction techniques.
| Method | Type | Strengths | Compared to KPCA |
|---|---|---|---|
| PCA | Linear, global | Simple, fast, interpretable | KPCA is nonlinear extension |
| Isomap | Nonlinear, geodesic | Preserves manifold distances | Better for manifold unfolding |
| LLE | Nonlinear, local | Preserves local neighborhoods | Better for local structure |
| t-SNE | Nonlinear, probabilistic | Excellent visualization | Better for visualization only |
| UMAP | Nonlinear, topological | Fast, preserves structure | Often preferred for large data |
| Autoencoders | Neural, learned | Flexible, scalable | Better for very large data |
Continuing Your Learning
Kernel PCA is part of a rich ecosystem of kernel methods and dimensionality reduction techniques. Consider exploring the manifold-learning methods compared in the table above (Isomap, LLE, t-SNE, UMAP), autoencoder-based dimensionality reduction, and other kernel methods that reuse the same kernel machinery.
Each has its niche; understanding the full landscape helps you choose the right tool for each problem.