How do we know if a kernel is "good" for a particular task without training a full model? Kernel alignment provides a principled answer by measuring the similarity between a kernel matrix and the "ideal" kernel defined by the labels.
This enables fast, training-free comparison of candidate kernels.
This page covers kernel alignment definitions (empirical and centered), the relationship between alignment and generalization, alignment-based kernel selection, and algorithms for learning kernels by maximizing alignment.
The Ideal Kernel:
For classification with labels $y_i \in \{-1, +1\}$, the ideal kernel (or target kernel) assigns similarity 1 to same-class pairs and -1 (or 0) to different-class pairs:
$$K^*_{ij} = y_i y_j$$
This kernel matrix is $K^* = \mathbf{y}\mathbf{y}^\top$ where $\mathbf{y} = (y_1, \ldots, y_n)^\top$.
Intuition: A kernel that perfectly separates classes will have high inner product with $K^*$—training points with the same label will have high kernel values, and different labels will have low values.
Empirical Kernel Alignment (Cristianini et al., 2001):
The alignment between kernel matrix $K$ and target matrix $K^*$ is the cosine of the angle between them (viewed as vectors):
$$A(K, K^*) = \frac{\langle K, K^* \rangle_F}{\|K\|_F \|K^*\|_F} = \frac{\sum_{i,j} K_{ij} K^*_{ij}}{\sqrt{\sum_{i,j} K_{ij}^2} \sqrt{\sum_{i,j} (K^*_{ij})^2}}$$
where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product.
For binary classification with labels $y_i$:
$$A(K, \mathbf{y}) = \frac{\mathbf{y}^\top K \mathbf{y}}{n \|K\|_F}$$
since $\|K^*\|_F = \|\mathbf{y}\mathbf{y}^\top\|_F = \|\mathbf{y}\|^2 = n$ whenever $y_i \in \{-1, +1\}$.
Properties of Kernel Alignment:
Alignment values lie in $[-1, 1]$. Positive alignment indicates that the kernel tends to assign higher values to same-class pairs. Perfect alignment ($A = 1$) means the kernel matrix is proportional to $\mathbf{y}\mathbf{y}^\top$, and random kernels typically have near-zero alignment.
Kernel-Target Alignment vs. Kernel-Kernel Alignment:
Alignment can also measure similarity between two kernels:
$$A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\|K_1\|_F \|K_2\|_F}$$
This is useful for analyzing kernel diversity in multiple kernel learning (MKL).
```python
import numpy as np

def kernel_target_alignment(K, y):
    """
    Compute kernel-target alignment.

    A(K, y) = y^T K y / (n * ||K||_F)

    Parameters:
    -----------
    K : kernel matrix (n, n)
    y : labels (n,), assumed in {-1, +1}

    Returns:
    --------
    alignment : float in [-1, 1]
    """
    n = len(y)
    y = y.reshape(-1)

    # Numerator: y^T K y
    numerator = y @ K @ y

    # Denominator: n * ||K||_F
    K_frobenius = np.linalg.norm(K, 'fro')
    denominator = n * K_frobenius

    return numerator / denominator

def kernel_kernel_alignment(K1, K2):
    """
    Compute alignment between two kernel matrices.

    A(K1, K2) = <K1, K2>_F / (||K1||_F * ||K2||_F)
    """
    numerator = np.sum(K1 * K2)  # Frobenius inner product
    denominator = np.linalg.norm(K1, 'fro') * np.linalg.norm(K2, 'fro')
    return numerator / denominator

def ideal_kernel(y):
    """Compute the ideal (target) kernel matrix: K* = yy^T"""
    y = y.reshape(-1, 1)
    return y @ y.T

# Example
np.random.seed(42)
n = 100
y = np.sign(np.random.randn(n))  # Random labels

# Good kernel: similar to ideal
K_good = ideal_kernel(y) + 0.1 * np.random.randn(n, n)
K_good = (K_good + K_good.T) / 2  # Symmetrize

# Random kernel
K_random = np.random.randn(n, n)
K_random = K_random @ K_random.T

print(f"Alignment (good kernel): {kernel_target_alignment(K_good, y):.4f}")
print(f"Alignment (random kernel): {kernel_target_alignment(K_random, y):.4f}")
```
The basic alignment measure can be improved by centering the kernel matrices, which removes the effect of the mean and focuses on the actual structure.
Centered Kernel Matrix:
A kernel matrix $K$ is centered if every row and column sums to zero: $\sum_i K_{ij} = 0$ for all $j$ and $\sum_j K_{ij} = 0$ for all $i$. The centering operation is:
$$\tilde{K} = HKH$$
where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix.
Expanded entry-wise:
$$\tilde{K}_{ij} = K_{ij} - \frac{1}{n}\sum_k K_{ik} - \frac{1}{n}\sum_k K_{kj} + \frac{1}{n^2}\sum_{k,l} K_{kl}$$
Centering corresponds to centering the feature maps: $\tilde{\varphi}(x) = \varphi(x) - \mathbb{E}[\varphi(X)]$. Without centering, a kernel that assigns high values uniformly (e.g., a constant kernel) could spuriously align with any target. Centering removes this bias.
Centered Kernel Alignment (CKA):
$$\text{CKA}(K, L) = \frac{\langle \tilde{K}, \tilde{L} \rangle_F}{\|\tilde{K}\|_F \|\tilde{L}\|_F}$$
For kernel-target alignment:
$$\text{CKA}(K, \mathbf{y}) = \frac{\text{tr}(\tilde{K} \tilde{K}^*)}{\|\tilde{K}\|_F \|\tilde{K}^*\|_F}$$
where $\tilde{K}^*$ is the centered ideal kernel $K^* = \mathbf{y}\mathbf{y}^\top$.
Advantages of CKA: centering removes the spurious alignment that a constant or offset-dominated kernel would have with any target, and the normalized measure is invariant to isotropic scaling of either kernel, making comparisons across kernels meaningful.
Relationship to HSIC:
Centered kernel alignment is closely related to the Hilbert-Schmidt Independence Criterion (HSIC), a kernel-based measure of statistical dependence:
$$\text{HSIC}(X, Y) = \frac{1}{(n-1)^2} \text{tr}(\tilde{K}_X \tilde{K}_Y)$$
where $K_X$ and $K_Y$ are kernel matrices computed on the samples of $X$ and $Y$, and the tildes denote centering.
CKA normalizes HSIC to remove sensitivity to marginal kernel norms:
$$\text{CKA}(K_X, K_Y) = \frac{\text{HSIC}(K_X, K_Y)}{\sqrt{\text{HSIC}(K_X, K_X) \cdot \text{HSIC}(K_Y, K_Y)}}$$
```python
import numpy as np

def center_kernel(K):
    """
    Center a kernel matrix: K̃ = HKH where H = I - 11^T/n

    Efficient implementation without forming H explicitly.
    """
    row_mean = K.mean(axis=1, keepdims=True)
    col_mean = K.mean(axis=0, keepdims=True)
    total_mean = K.mean()
    return K - row_mean - col_mean + total_mean

def centered_kernel_alignment(K1, K2):
    """
    Compute Centered Kernel Alignment (CKA).

    CKA(K1, K2) = <K̃1, K̃2>_F / (||K̃1||_F * ||K̃2||_F)
    """
    K1_centered = center_kernel(K1)
    K2_centered = center_kernel(K2)

    numerator = np.sum(K1_centered * K2_centered)
    norm1 = np.linalg.norm(K1_centered, 'fro')
    norm2 = np.linalg.norm(K2_centered, 'fro')

    if norm1 < 1e-10 or norm2 < 1e-10:
        return 0.0

    return numerator / (norm1 * norm2)

def hsic(K1, K2):
    """
    Compute the Hilbert-Schmidt Independence Criterion.

    HSIC = (1/(n-1)^2) * tr(K̃1 @ K̃2)
    """
    n = K1.shape[0]
    K1_centered = center_kernel(K1)
    K2_centered = center_kernel(K2)
    return np.trace(K1_centered @ K2_centered) / (n - 1)**2

def cka_from_features(X, Y, kernel='linear'):
    """
    Compute CKA directly from feature matrices.

    For linear kernel, avoids explicit kernel matrix computation.
    """
    if kernel == 'linear':
        # Linear CKA: uses Gram matrices X^TX and Y^TY
        # Center the features
        X_centered = X - X.mean(axis=0)
        Y_centered = Y - Y.mean(axis=0)

        # Compute via feature dot products (memory efficient)
        XTX = X_centered.T @ X_centered
        YTY = Y_centered.T @ Y_centered
        XTY = X_centered.T @ Y_centered

        numerator = np.trace(XTY @ XTY.T)
        denom1 = np.trace(XTX @ XTX)
        denom2 = np.trace(YTY @ YTY)

        return numerator / np.sqrt(denom1 * denom2)
    else:
        raise NotImplementedError("Only linear kernel implemented")

# Example
np.random.seed(42)
n = 100
X = np.random.randn(n, 50)  # Feature representation 1
Y = X @ np.random.randn(50, 30) + 0.1 * np.random.randn(n, 30)  # Correlated

# Compute kernels
K_X = X @ X.T
K_Y = Y @ Y.T

print(f"CKA(X, Y): {centered_kernel_alignment(K_X, K_Y):.4f}")
print(f"CKA from features: {cka_from_features(X, Y):.4f}")
```
Kernel alignment is not just a heuristic; it has theoretical connections to generalization. High alignment implies the kernel's geometry is well-suited for the learning task.
Generalization Bound via Alignment (Cristianini et al., 2001):
For kernel SVM with margin $\gamma$ and alignment $A$:
$$R(f) \leq R_n(f) + O\left(\frac{1}{\gamma\sqrt{n}} \cdot \frac{1}{\sqrt{A}}\right)$$
where $R(f)$ is true risk and $R_n(f)$ is empirical risk.
Interpretation: Higher alignment means tighter generalization bounds. A well-aligned kernel allows the SVM to achieve the same generalization with fewer samples.
While high alignment suggests a good kernel, maximizing empirical alignment can lead to overfitting. The ideal kernel $\mathbf{y}\mathbf{y}^\top$ has perfect alignment but corresponds to memorizing labels. Regularization and proper kernel design remain essential.
Alignment vs. Cross-Validation Performance:
Empirical studies show that alignment correlates with cross-validation accuracy but is not a perfect predictor: it ignores the effect of regularization and offers only limited protection against overfitting (see the comparison table below).
Spectral Perspective:
Alignment has a spectral interpretation. Define the kernel's spectral representation:
$$K = \sum_{i=1}^{n} \lambda_i \mathbf{u}_i \mathbf{u}_i^\top$$
Then alignment measures how well the top eigenvectors align with the label vector:
$$A(K, \mathbf{y}) = \frac{\sum_i \lambda_i (\mathbf{y}^\top \mathbf{u}_i)^2}{\|\mathbf{y}\|^2 \|K\|_F}$$
High alignment means $\mathbf{y}$ lies primarily in the subspace spanned by top kernel eigenvectors.
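As a quick sanity check on this identity, here is a short illustrative snippet (the helper names `direct_alignment` and `spectral_alignment` are introduced here, not taken from the text) comparing the eigendecomposition form against the direct formula $\mathbf{y}^\top K \mathbf{y}/(n\|K\|_F)$:

```python
import numpy as np

np.random.seed(0)
n = 50
y = np.sign(np.random.randn(n))

# Build a symmetric PSD kernel matrix for the check
A_ = np.random.randn(n, n)
K = A_ @ A_.T

def direct_alignment(K, y):
    """A(K, y) = y^T K y / (n * ||K||_F)."""
    return (y @ K @ y) / (len(y) * np.linalg.norm(K, 'fro'))

def spectral_alignment(K, y):
    """Same quantity via K = sum_i lambda_i u_i u_i^T."""
    lam, U = np.linalg.eigh(K)
    # sum_i lambda_i (y^T u_i)^2 = y^T K y, and ||y||^2 = n for +/-1 labels
    numerator = np.sum(lam * (U.T @ y) ** 2)
    return numerator / (np.dot(y, y) * np.linalg.norm(K, 'fro'))

print(direct_alignment(K, y))    # the two values agree up to floating-point error
print(spectral_alignment(K, y))
```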
| Criterion | Alignment | Cross-Validation |
|---|---|---|
| Computation | O(n²) | O(k·n³) for k folds |
| Direct optimization | Yes (convex) | No (discrete hyperparams) |
| Overfitting detection | Limited | Good |
| Regularization effect | Ignores | Captures |
| Sample splitting | No | Yes |
| Best use | Quick screening | Final selection |
Kernel alignment provides a fast method for comparing and selecting kernels without training models.
Simple Alignment-Based Selection: compute the centered alignment of each candidate kernel with the labels and keep the kernel with the highest score (see `alignment_based_kernel_selection` in the code below).
Alignment for MKL Weights:
In MKL, alignment provides initialization or even final weights:
$$\mu_m \propto A(K_m, \mathbf{y})^+$$
where $(\cdot)^+$ denotes the positive part. This simple clip-and-normalize heuristic is closely related to alignment-based MKL methods such as EasyMKL, and is sketched below.
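A minimal sketch of this weighting heuristic, assuming the candidate kernel matrices are precomputed; the helper name `alignment_mkl_weights` is hypothetical, and this is only the clip-and-normalize rule, not a full MKL solver:

```python
import numpy as np

def kernel_target_alignment(K, y):
    """A(K, y) = y^T K y / (n * ||K||_F), as defined above."""
    return (y @ K @ y) / (len(y) * np.linalg.norm(K, 'fro'))

def alignment_mkl_weights(K_list, y):
    """Heuristic MKL weights: mu_m proportional to the positive part of A(K_m, y)."""
    alignments = np.array([kernel_target_alignment(K, y) for K in K_list])
    mu = np.maximum(alignments, 0.0)          # positive part
    total = mu.sum()
    # Fall back to uniform weights if no kernel has positive alignment
    return mu / total if total > 0 else np.full(len(K_list), 1.0 / len(K_list))

# Usage sketch with two toy kernels
np.random.seed(1)
n = 80
y = np.sign(np.random.randn(n))
K_informative = np.outer(y, y) + 0.5 * np.random.randn(n, n)
K_informative = (K_informative + K_informative.T) / 2
K_noise = np.random.randn(n, n)
K_noise = K_noise @ K_noise.T

print(alignment_mkl_weights([K_informative, K_noise], y))  # most weight on the informative kernel
```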
Kernel Learning via Alignment Maximization:
For parameterized kernels $K_\theta$, we can optimize parameters via alignment:
$$\max_\theta A(K_\theta, \mathbf{y})$$
Gradient-Based Optimization:
The gradient of alignment w.r.t. kernel parameters:
$$\frac{\partial A}{\partial \theta} = \frac{1}{n\|K\|_F} \left[ \frac{\partial (\mathbf{y}^\top K \mathbf{y})}{\partial \theta} - nA \cdot \frac{\partial \|K\|_F}{\partial \theta} \right]$$
For the RBF kernel $K_{ij} = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$ with inverse length-scale parameter $\gamma$, the entry-wise derivative is:
$$\frac{\partial K_{ij}}{\partial \gamma} = -\|\mathbf{x}_i - \mathbf{x}_j\|^2 K_{ij}$$
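To make the chain rule concrete, the sketch below (an illustration; the helpers `alignment` and `alignment_grad` are introduced here) plugs this entry-wise derivative into the gradient formula for the RBF parameter $\gamma$ and checks it against a finite-difference estimate:

```python
import numpy as np
from scipy.spatial.distance import cdist

np.random.seed(2)
n, d = 60, 5
X = np.random.randn(n, d)
y = np.sign(X[:, 0])
D = cdist(X, X, 'sqeuclidean')   # pairwise squared distances

def alignment(gamma):
    """A(K_gamma, y) with K_gamma = exp(-gamma * ||x_i - x_j||^2)."""
    K = np.exp(-gamma * D)
    return (y @ K @ y) / (n * np.linalg.norm(K, 'fro'))

def alignment_grad(gamma):
    """Analytic dA/dgamma using dK_ij/dgamma = -||x_i - x_j||^2 K_ij."""
    K = np.exp(-gamma * D)
    dK = -D * K                          # entry-wise derivative of K w.r.t. gamma
    K_norm = np.linalg.norm(K, 'fro')
    A = (y @ K @ y) / (n * K_norm)
    d_num = y @ dK @ y                   # d(y^T K y)/dgamma
    d_norm = np.sum(K * dK) / K_norm     # d||K||_F/dgamma
    return (d_num - n * A * d_norm) / (n * K_norm)

gamma, eps = 0.5, 1e-6
fd = (alignment(gamma + eps) - alignment(gamma - eps)) / (2 * eps)
print(alignment_grad(gamma), fd)  # the analytic and numerical gradients should agree closely
```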
Regularized Alignment:
To prevent overfitting, add regularization:
$$A_{\text{reg}}(K, \mathbf{y}) = A(K, \mathbf{y}) - \lambda \cdot \text{complexity}(K)$$
where complexity could be the nuclear norm or rank of $K$.
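A minimal sketch of such a penalized score, assuming the nuclear norm as the complexity term (for a PSD kernel matrix this equals the trace); the per-sample normalization and the value of $\lambda$ are illustrative choices, not prescribed by the text:

```python
import numpy as np

def regularized_alignment(K, y, lam=0.01):
    """A_reg(K, y) = A(K, y) - lam * complexity(K), using the nuclear norm
    (scaled by n) as an illustrative complexity measure."""
    n = len(y)
    A = (y @ K @ y) / (n * np.linalg.norm(K, 'fro'))
    complexity = np.linalg.norm(K, 'nuc') / n   # nuclear norm; equals trace(K)/n for PSD K
    return A - lam * complexity

# Usage sketch
rng = np.random.default_rng(0)
y = np.sign(rng.standard_normal(50))
K = np.outer(y, y) + 0.2 * rng.standard_normal((50, 50))
K = (K + K.T) / 2
print(regularized_alignment(K, y, lam=0.01))
```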
```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import cdist

def alignment_based_kernel_selection(X, y, kernel_configs):
    """
    Select best kernel from candidates using alignment.

    Parameters:
    -----------
    X : features (n, d)
    y : labels (n,)
    kernel_configs : list of (kernel_fn, name) tuples

    Returns:
    --------
    best_kernel : kernel matrix with highest alignment
    results : alignment scores for all kernels
    """
    results = []
    best_alignment = -np.inf
    best_kernel = None
    best_name = None

    for kernel_fn, name in kernel_configs:
        K = kernel_fn(X)
        K_centered = center_kernel(K)
        y_centered = y - y.mean()

        # Centered alignment
        alignment = cka_target(K_centered, y_centered)
        results.append((name, alignment))

        if alignment > best_alignment:
            best_alignment = alignment
            best_kernel = K
            best_name = name

    return best_kernel, best_name, sorted(results, key=lambda x: -x[1])

def center_kernel(K):
    """Center kernel matrix."""
    return K - K.mean(0, keepdims=True) - K.mean(1, keepdims=True) + K.mean()

def cka_target(K_centered, y_centered):
    """CKA with target vector."""
    yy = np.outer(y_centered, y_centered)
    yy_centered = center_kernel(yy)
    numerator = np.sum(K_centered * yy_centered)
    denom = np.linalg.norm(K_centered, 'fro') * np.linalg.norm(yy_centered, 'fro')
    return numerator / denom if denom > 0 else 0

def optimize_rbf_gamma_alignment(X, y, gamma_range=(1e-4, 1e2)):
    """
    Find optimal RBF gamma by maximizing alignment.
    """
    sq_dists = cdist(X, X, 'sqeuclidean')

    def neg_alignment(log_gamma):
        gamma = np.exp(log_gamma)
        K = np.exp(-gamma * sq_dists)
        K_centered = center_kernel(K)
        y_centered = y - y.mean()
        return -cka_target(K_centered, y_centered)

    result = minimize_scalar(
        neg_alignment,
        bounds=(np.log(gamma_range[0]), np.log(gamma_range[1])),
        method='bounded'
    )

    optimal_gamma = np.exp(result.x)
    optimal_alignment = -result.fun
    return optimal_gamma, optimal_alignment

# Example
np.random.seed(42)
n = 100
X = np.random.randn(n, 10)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])  # Labels depend on first 2 features

optimal_gamma, opt_alignment = optimize_rbf_gamma_alignment(X, y)
print(f"Optimal RBF γ: {optimal_gamma:.4f}")
print(f"Optimal alignment: {opt_alignment:.4f}")
```
Centered Kernel Alignment has found important applications in analyzing deep neural networks, particularly for comparing learned representations.
Representation Similarity Analysis:
Given two neural network layers with activations $X \in \mathbb{R}^{n \times p_1}$ and $Y \in \mathbb{R}^{n \times p_2}$ on $n$ examples, CKA measures how similar the representations are:
$$\text{CKA}(K_X, K_Y)$$
where $K_X = XX^\top$ and $K_Y = YY^\top$ are linear kernel matrices.
Key Findings (Kornblith et al., 2019):
CKA is invariant to orthogonal transformations and isotropic scaling, making it robust to irrelevant differences between network representations. Two layers that encode the same information in different bases will have high CKA.
Applications in Deep Learning: CKA is used to compare learned representations across layers and across networks, as in the layer-similarity example below.
Mini-batch CKA:
For large networks, computing full $n \times n$ kernel matrices is expensive. Mini-batch approximation:
$$\text{CKA}_{\text{approx}} = \frac{1}{B} \sum_{b=1}^{B} \text{CKA}(K_X^{(b)}, K_Y^{(b)})$$
where each $K^{(b)}$ is computed on a mini-batch.
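A minimal sketch of this batching scheme for linear CKA (the function name `minibatch_cka`, the batch size, and the random batching are illustrative assumptions, not a prescribed implementation):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shapes (n, p1) and (n, p2)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic_xy = np.sum((X.T @ Y) ** 2)
    hsic_xx = np.sum((X.T @ X) ** 2)
    hsic_yy = np.sum((Y.T @ Y) ** 2)
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)

def minibatch_cka(X, Y, batch_size=64, seed=0):
    """Average per-batch CKA over random mini-batches.

    This is an approximation of full-batch CKA; it becomes exact in the
    full-batch limit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    scores = []
    for start in range(0, len(X) - batch_size + 1, batch_size):
        batch = idx[start:start + batch_size]
        scores.append(linear_cka(X[batch], Y[batch]))
    return float(np.mean(scores))

# Usage sketch: compare the mini-batch estimate with the full-batch value
rng = np.random.default_rng(42)
X = rng.standard_normal((512, 128))
Y = X @ rng.standard_normal((128, 64)) + 0.1 * rng.standard_normal((512, 64))
print(minibatch_cka(X, Y), linear_cka(X, Y))
```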
```python
import numpy as np

def linear_cka(X, Y):
    """
    Compute linear CKA between two sets of activations.

    Parameters:
    -----------
    X : activations from layer 1 (n_samples, n_features_1)
    Y : activations from layer 2 (n_samples, n_features_2)

    Returns:
    --------
    cka : centered kernel alignment value
    """
    # Center the activations
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Compute HSIC numerator and denominators efficiently
    # Using: tr(XX^T @ YY^T) = ||X.T @ Y||_F^2
    XtY = X.T @ Y
    XtX = X.T @ X
    YtY = Y.T @ Y

    hsic_xy = np.sum(XtY ** 2)
    hsic_xx = np.sum(XtX ** 2)
    hsic_yy = np.sum(YtY ** 2)

    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)

def rbf_cka(X, Y, sigma=None):
    """
    Compute RBF CKA (nonlinear similarity measure).
    """
    def rbf_kernel(Z, sigma):
        sq_dists = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=-1)
        if sigma is None:
            sigma = np.median(sq_dists[sq_dists > 0]) ** 0.5
        return np.exp(-sq_dists / (2 * sigma ** 2))

    K_X = rbf_kernel(X, sigma)
    K_Y = rbf_kernel(Y, sigma)

    # Center
    K_X = K_X - K_X.mean(0) - K_X.mean(1, keepdims=True) + K_X.mean()
    K_Y = K_Y - K_Y.mean(0) - K_Y.mean(1, keepdims=True) + K_Y.mean()

    numerator = np.sum(K_X * K_Y)
    denominator = np.sqrt(np.sum(K_X ** 2) * np.sum(K_Y ** 2))
    return numerator / denominator

def layer_similarity_matrix(activations_list):
    """
    Compute CKA similarity matrix between all pairs of layers.

    Parameters:
    -----------
    activations_list : list of activation matrices

    Returns:
    --------
    sim_matrix : (n_layers, n_layers) CKA similarity matrix
    """
    n_layers = len(activations_list)
    sim_matrix = np.zeros((n_layers, n_layers))

    for i in range(n_layers):
        for j in range(i, n_layers):
            cka = linear_cka(activations_list[i], activations_list[j])
            sim_matrix[i, j] = cka
            sim_matrix[j, i] = cka

    return sim_matrix

# Example: Compare representations
np.random.seed(42)
n_samples = 200

# Simulate two networks with similar representations
layer1_net1 = np.random.randn(n_samples, 256)
layer2_net1 = layer1_net1 @ np.random.randn(256, 128) + 0.1 * np.random.randn(n_samples, 128)

# Network 2: similar to network 1 but rotated
rotation = np.linalg.qr(np.random.randn(128, 128))[0]
layer2_net2 = layer2_net1 @ rotation  # Should have high CKA

# Network 3: independent
layer2_net3 = np.random.randn(n_samples, 128)  # Should have low CKA

print(f"CKA(net1, net2): {linear_cka(layer2_net1, layer2_net2):.4f}")  # High
print(f"CKA(net1, net3): {linear_cka(layer2_net1, layer2_net3):.4f}")  # Low
```
You now understand kernel alignment theory and applications. Next, we'll explore universal kernels: kernels with the theoretical power to approximate any continuous function.