How do we know if a kernel is "good" for a particular task without training a full model? Kernel alignment provides a principled answer by measuring the similarity between a kernel matrix and the "ideal" kernel defined by the labels.
This enables fast, training-free comparison of candidate kernels.
This page covers kernel alignment definitions (empirical and centered), the relationship between alignment and generalization, alignment-based kernel selection, and algorithms for learning kernels by maximizing alignment.
The Ideal Kernel:
For classification with labels $y_i \in \{-1, +1\}$, the ideal kernel (or target kernel) assigns similarity 1 to same-class pairs and -1 (or 0) to different-class pairs:
$$K^*_{ij} = y_i y_j$$
This kernel matrix is $K^* = \mathbf{y}\mathbf{y}^\top$ where $\mathbf{y} = (y_1, \ldots, y_n)^\top$.
Intuition: A kernel that perfectly separates classes will have high inner product with $K^*$—training points with the same label will have high kernel values, and different labels will have low values.
Empirical Kernel Alignment (Cristianini et al., 2001):
The alignment between kernel matrix $K$ and target matrix $K^*$ is the cosine of the angle between them (viewed as vectors):
$$A(K, K^*) = \frac{\langle K, K^* \rangle_F}{\|K\|_F \|K^*\|_F} = \frac{\sum_{i,j} K_{ij} K^*_{ij}}{\sqrt{\sum_{i,j} K_{ij}^2} \sqrt{\sum_{i,j} (K^*_{ij})^2}}$$
where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product.
For binary classification with labels $y_i$:
$$A(K, \mathbf{y}) = \frac{\mathbf{y}^\top K \mathbf{y}}{n \|K\|_F}$$
since $\|K^*\|_F = \|\mathbf{y}\mathbf{y}^\top\|_F = \|\mathbf{y}\|^2 = n$ whenever $y_i \in \{-1, +1\}$.
Properties of Kernel Alignment:
Alignment values lie in $[-1, 1]$. Positive alignment indicates that the kernel tends to assign higher values to same-class pairs. Perfect alignment ($A = 1$) means the kernel matrix is proportional to $\mathbf{y}\mathbf{y}^\top$, and random kernels typically have near-zero alignment.
Kernel-Target Alignment vs. Kernel-Kernel Alignment:
Alignment can also measure similarity between two kernels:
$$A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\|K_1\|_F \|K_2\|_F}$$
This is useful for analyzing kernel diversity in multiple kernel learning (MKL).
```python
import numpy as np

def kernel_target_alignment(K, y):
    """
    Compute kernel-target alignment.

    A(K, y) = y^T K y / (n * ||K||_F)

    Parameters:
    -----------
    K : kernel matrix (n, n)
    y : labels (n,), assumed in {-1, +1}

    Returns:
    --------
    alignment : float in [-1, 1]
    """
    n = len(y)
    y = y.reshape(-1)

    # Numerator: y^T K y
    numerator = y @ K @ y

    # Denominator: n * ||K||_F
    K_frobenius = np.linalg.norm(K, 'fro')
    denominator = n * K_frobenius

    return numerator / denominator

def kernel_kernel_alignment(K1, K2):
    """
    Compute alignment between two kernel matrices.

    A(K1, K2) = <K1, K2>_F / (||K1||_F * ||K2||_F)
    """
    numerator = np.sum(K1 * K2)  # Frobenius inner product
    denominator = np.linalg.norm(K1, 'fro') * np.linalg.norm(K2, 'fro')
    return numerator / denominator

def ideal_kernel(y):
    """Compute the ideal (target) kernel matrix: K* = yy^T"""
    y = y.reshape(-1, 1)
    return y @ y.T

# Example
np.random.seed(42)
n = 100
y = np.sign(np.random.randn(n))  # Random labels

# Good kernel: similar to ideal
K_good = ideal_kernel(y) + 0.1 * np.random.randn(n, n)
K_good = (K_good + K_good.T) / 2  # Symmetrize

# Random kernel
K_random = np.random.randn(n, n)
K_random = K_random @ K_random.T

print(f"Alignment (good kernel): {kernel_target_alignment(K_good, y):.4f}")
print(f"Alignment (random kernel): {kernel_target_alignment(K_random, y):.4f}")
```
The basic alignment measure can be improved by centering the kernel matrices, which removes the effect of the mean and focuses on the actual structure.
Centered Kernel Matrix:
A kernel matrix $K$ is centered if every row and column sums to zero: $\sum_i K_{ij} = 0$ for all $j$ and $\sum_j K_{ij} = 0$ for all $i$. The centering operation is:
$$\tilde{K} = HKH$$
where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the centering matrix.
Expanded entry-wise:
$$\tilde{K}_{ij} = K_{ij} - \frac{1}{n}\sum_k K_{ik} - \frac{1}{n}\sum_k K_{kj} + \frac{1}{n^2}\sum_{k,l} K_{kl}$$
Centering corresponds to centering the feature maps: $\tilde{\varphi}(x) = \varphi(x) - \mathbb{E}[\varphi(X)]$. Without centering, a kernel that assigns high values uniformly (e.g., a constant kernel) could spuriously align with any target. Centering removes this bias.
Centered Kernel Alignment (CKA):
$$\text{CKA}(K, L) = \frac{\langle \tilde{K}, \tilde{L} \rangle_F}{\|\tilde{K}\|_F \|\tilde{L}\|_F}$$
For kernel-target alignment:
$$\text{CKA}(K, \mathbf{y}) = \frac{\text{tr}(\tilde{K} \tilde{K}^*)}{\|\tilde{K}\|_F \|\tilde{K}^*\|_F}$$
where $\tilde{K}^*$ is the centered ideal kernel $K^* = \mathbf{y}\mathbf{y}^\top$.
Advantages of CKA: centering removes the spurious alignment that a constant or offset-dominated kernel would have with any target, and the normalized measure is invariant to isotropic scaling of either kernel, making comparisons across kernels meaningful.
Relationship to HSIC:
Centered kernel alignment is closely related to the Hilbert-Schmidt Independence Criterion (HSIC), a kernel-based measure of statistical dependence:
$$\text{HSIC}(X, Y) = \frac{1}{(n-1)^2} \text{tr}(\tilde{K}_X \tilde{K}_Y)$$
where $K_X$ and $K_Y$ are kernel matrices computed on the samples of $X$ and $Y$, and the tildes denote centering.
CKA normalizes HSIC to remove sensitivity to marginal kernel norms:
$$\text{CKA}(K_X, K_Y) = \frac{\text{HSIC}(K_X, K_Y)}{\sqrt{\text{HSIC}(K_X, K_X) \cdot \text{HSIC}(K_Y, K_Y)}}$$
```python
import numpy as np

def center_kernel(K):
    """
    Center a kernel matrix: K̃ = HKH where H = I - 11^T/n

    Efficient implementation without forming H explicitly.
    """
    row_mean = K.mean(axis=1, keepdims=True)
    col_mean = K.mean(axis=0, keepdims=True)
    total_mean = K.mean()
    return K - row_mean - col_mean + total_mean

def centered_kernel_alignment(K1, K2):
    """
    Compute Centered Kernel Alignment (CKA).

    CKA(K1, K2) = <K̃1, K̃2>_F / (||K̃1||_F * ||K̃2||_F)
    """
    K1_centered = center_kernel(K1)
    K2_centered = center_kernel(K2)

    numerator = np.sum(K1_centered * K2_centered)
    norm1 = np.linalg.norm(K1_centered, 'fro')
    norm2 = np.linalg.norm(K2_centered, 'fro')

    if norm1 < 1e-10 or norm2 < 1e-10:
        return 0.0

    return numerator / (norm1 * norm2)

def hsic(K1, K2):
    """
    Compute the Hilbert-Schmidt Independence Criterion.

    HSIC = (1/(n-1)^2) * tr(K̃1 @ K̃2)
    """
    n = K1.shape[0]
    K1_centered = center_kernel(K1)
    K2_centered = center_kernel(K2)
    return np.trace(K1_centered @ K2_centered) / (n - 1)**2

def cka_from_features(X, Y, kernel='linear'):
    """
    Compute CKA directly from feature matrices.

    For linear kernel, avoids explicit kernel matrix computation.
    """
    if kernel == 'linear':
        # Linear CKA: uses Gram matrices X^TX and Y^TY
        # Center the features
        X_centered = X - X.mean(axis=0)
        Y_centered = Y - Y.mean(axis=0)

        # Compute via feature dot products (memory efficient)
        XTX = X_centered.T @ X_centered
        YTY = Y_centered.T @ Y_centered
        XTY = X_centered.T @ Y_centered

        numerator = np.trace(XTY @ XTY.T)
        denom1 = np.trace(XTX @ XTX)
        denom2 = np.trace(YTY @ YTY)

        return numerator / np.sqrt(denom1 * denom2)
    else:
        raise NotImplementedError("Only linear kernel implemented")

# Example
np.random.seed(42)
n = 100
X = np.random.randn(n, 50)  # Feature representation 1
Y = X @ np.random.randn(50, 30) + 0.1 * np.random.randn(n, 30)  # Correlated

# Compute kernels
K_X = X @ X.T
K_Y = Y @ Y.T

print(f"CKA(X, Y): {centered_kernel_alignment(K_X, K_Y):.4f}")
print(f"CKA from features: {cka_from_features(X, Y):.4f}")
```
Kernel alignment is not just a heuristic; it has theoretical connections to generalization. High alignment implies the kernel's geometry is well-suited for the learning task.
Generalization Bound via Alignment (Cristianini et al., 2001):
For kernel SVM with margin $\gamma$ and alignment $A$:
$$R(f) \leq R_n(f) + O\left(\frac{1}{\gamma\sqrt{n}} \cdot \frac{1}{\sqrt{A}}\right)$$
where $R(f)$ is true risk and $R_n(f)$ is empirical risk.
Interpretation: Higher alignment means tighter generalization bounds. A well-aligned kernel allows the SVM to achieve the same generalization with fewer samples.
While high alignment suggests a good kernel, maximizing empirical alignment can lead to overfitting. The ideal kernel $\mathbf{y}\mathbf{y}^\top$ has perfect alignment but corresponds to memorizing labels. Regularization and proper kernel design remain essential.
Alignment vs. Cross-Validation Performance:
Empirical studies show that alignment correlates with cross-validation accuracy but is not a perfect predictor: it ignores the effect of regularization and offers only limited protection against overfitting (see the comparison table below).
Spectral Perspective:
Alignment has a spectral interpretation. Define the kernel's spectral representation:
$$K = \sum_{i=1}^{n} \lambda_i \mathbf{u}_i \mathbf{u}_i^\top$$
Then alignment measures how well the top eigenvectors align with the label vector:
$$A(K, \mathbf{y}) = \frac{\sum_i \lambda_i (\mathbf{y}^\top \mathbf{u}_i)^2}{\|\mathbf{y}\|^2 \|K\|_F}$$
High alignment means $\mathbf{y}$ lies primarily in the subspace spanned by top kernel eigenvectors.
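As a quick sanity check on this identity, here is a short illustrative snippet (the helper names `direct_alignment` and `spectral_alignment` are introduced here, not taken from the text) comparing the eigendecomposition form against the direct formula $\mathbf{y}^\top K \mathbf{y}/(n\|K\|_F)$:

```python
import numpy as np

np.random.seed(0)
n = 50
y = np.sign(np.random.randn(n))

# Build a symmetric PSD kernel matrix for the check
A_ = np.random.randn(n, n)
K = A_ @ A_.T

def direct_alignment(K, y):
    """A(K, y) = y^T K y / (n * ||K||_F)."""
    return (y @ K @ y) / (len(y) * np.linalg.norm(K, 'fro'))

def spectral_alignment(K, y):
    """Same quantity via K = sum_i lambda_i u_i u_i^T."""
    lam, U = np.linalg.eigh(K)
    # sum_i lambda_i (y^T u_i)^2 = y^T K y, and ||y||^2 = n for +/-1 labels
    numerator = np.sum(lam * (U.T @ y) ** 2)
    return numerator / (np.dot(y, y) * np.linalg.norm(K, 'fro'))

print(direct_alignment(K, y))    # the two values agree up to floating-point error
print(spectral_alignment(K, y))
```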
| Criterion | Alignment | Cross-Validation |
|---|---|---|
| Computation | O(n²) | O(k·n³) for k folds |
| Direct optimization | Yes (convex) | No (discrete hyperparams) |
| Overfitting detection | Limited | Good |
| Regularization effect | Ignores | Captures |
| Sample splitting | No | Yes |
| Best use | Quick screening | Final selection |
Kernel alignment provides a fast method for comparing and selecting kernels without training models.
Simple Alignment-Based Selection: compute the centered alignment of each candidate kernel with the labels and keep the kernel with the highest score (see `alignment_based_kernel_selection` in the code below).
Alignment for MKL Weights:
In MKL, alignment provides initialization or even final weights:
$$\mu_m \propto A(K_m, \mathbf{y})^+$$
where $(\cdot)^+$ denotes the positive part. This simple clip-and-normalize heuristic is closely related to alignment-based MKL methods such as EasyMKL, and is sketched below.
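A minimal sketch of this weighting heuristic, assuming the candidate kernel matrices are precomputed; the helper name `alignment_mkl_weights` is hypothetical, and this is only the clip-and-normalize rule, not a full MKL solver:

```python
import numpy as np

def kernel_target_alignment(K, y):
    """A(K, y) = y^T K y / (n * ||K||_F), as defined above."""
    return (y @ K @ y) / (len(y) * np.linalg.norm(K, 'fro'))

def alignment_mkl_weights(K_list, y):
    """Heuristic MKL weights: mu_m proportional to the positive part of A(K_m, y)."""
    alignments = np.array([kernel_target_alignment(K, y) for K in K_list])
    mu = np.maximum(alignments, 0.0)          # positive part
    total = mu.sum()
    # Fall back to uniform weights if no kernel has positive alignment
    return mu / total if total > 0 else np.full(len(K_list), 1.0 / len(K_list))

# Usage sketch with two toy kernels
np.random.seed(1)
n = 80
y = np.sign(np.random.randn(n))
K_informative = np.outer(y, y) + 0.5 * np.random.randn(n, n)
K_informative = (K_informative + K_informative.T) / 2
K_noise = np.random.randn(n, n)
K_noise = K_noise @ K_noise.T

print(alignment_mkl_weights([K_informative, K_noise], y))  # most weight on the informative kernel
```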
Kernel Learning via Alignment Maximization:
For parameterized kernels $K_\theta$, we can optimize parameters via alignment:
$$\max_\theta A(K_\theta, \mathbf{y})$$
Gradient-Based Optimization:
The gradient of alignment w.r.t. kernel parameters:
$$\frac{\partial A}{\partial \theta} = \frac{1}{n\|K\|_F} \left[ \frac{\partial (\mathbf{y}^\top K \mathbf{y})}{\partial \theta} - nA \cdot \frac{\partial \|K\|_F}{\partial \theta} \right]$$
For the RBF kernel $K_{ij} = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$ with inverse length-scale parameter $\gamma$, the entry-wise derivative is:
$$\frac{\partial K_{ij}}{\partial \gamma} = -\|\mathbf{x}_i - \mathbf{x}_j\|^2 K_{ij}$$
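To make the chain rule concrete, the sketch below (an illustration; the helpers `alignment` and `alignment_grad` are introduced here) plugs this entry-wise derivative into the gradient formula for the RBF parameter $\gamma$ and checks it against a finite-difference estimate:

```python
import numpy as np
from scipy.spatial.distance import cdist

np.random.seed(2)
n, d = 60, 5
X = np.random.randn(n, d)
y = np.sign(X[:, 0])
D = cdist(X, X, 'sqeuclidean')   # pairwise squared distances

def alignment(gamma):
    """A(K_gamma, y) with K_gamma = exp(-gamma * ||x_i - x_j||^2)."""
    K = np.exp(-gamma * D)
    return (y @ K @ y) / (n * np.linalg.norm(K, 'fro'))

def alignment_grad(gamma):
    """Analytic dA/dgamma using dK_ij/dgamma = -||x_i - x_j||^2 K_ij."""
    K = np.exp(-gamma * D)
    dK = -D * K                          # entry-wise derivative of K w.r.t. gamma
    K_norm = np.linalg.norm(K, 'fro')
    A = (y @ K @ y) / (n * K_norm)
    d_num = y @ dK @ y                   # d(y^T K y)/dgamma
    d_norm = np.sum(K * dK) / K_norm     # d||K||_F/dgamma
    return (d_num - n * A * d_norm) / (n * K_norm)

gamma, eps = 0.5, 1e-6
fd = (alignment(gamma + eps) - alignment(gamma - eps)) / (2 * eps)
print(alignment_grad(gamma), fd)  # the analytic and numerical gradients should agree closely
```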
Regularized Alignment:
To prevent overfitting, add regularization:
$$A_{\text{reg}}(K, \mathbf{y}) = A(K, \mathbf{y}) - \lambda \cdot \text{complexity}(K)$$
where complexity could be the nuclear norm or rank of $K$.
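A minimal sketch of such a penalized score, assuming the nuclear norm as the complexity term (for a PSD kernel matrix this equals the trace); the per-sample normalization and the value of $\lambda$ are illustrative choices, not prescribed by the text:

```python
import numpy as np

def regularized_alignment(K, y, lam=0.01):
    """A_reg(K, y) = A(K, y) - lam * complexity(K), using the nuclear norm
    (scaled by n) as an illustrative complexity measure."""
    n = len(y)
    A = (y @ K @ y) / (n * np.linalg.norm(K, 'fro'))
    complexity = np.linalg.norm(K, 'nuc') / n   # nuclear norm; equals trace(K)/n for PSD K
    return A - lam * complexity

# Usage sketch
rng = np.random.default_rng(0)
y = np.sign(rng.standard_normal(50))
K = np.outer(y, y) + 0.2 * rng.standard_normal((50, 50))
K = (K + K.T) / 2
print(regularized_alignment(K, y, lam=0.01))
```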
```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial.distance import cdist

def alignment_based_kernel_selection(X, y, kernel_configs):
    """
    Select best kernel from candidates using alignment.

    Parameters:
    -----------
    X : features (n, d)
    y : labels (n,)
    kernel_configs : list of (kernel_fn, name) tuples

    Returns:
    --------
    best_kernel : kernel matrix with highest alignment
    results : alignment scores for all kernels
    """
    results = []
    best_alignment = -np.inf
    best_kernel = None
    best_name = None

    for kernel_fn, name in kernel_configs:
        K = kernel_fn(X)
        K_centered = center_kernel(K)
        y_centered = y - y.mean()

        # Centered alignment
        alignment = cka_target(K_centered, y_centered)
        results.append((name, alignment))

        if alignment > best_alignment:
            best_alignment = alignment
            best_kernel = K
            best_name = name

    return best_kernel, best_name, sorted(results, key=lambda x: -x[1])

def center_kernel(K):
    """Center kernel matrix."""
    return K - K.mean(0, keepdims=True) - K.mean(1, keepdims=True) + K.mean()

def cka_target(K_centered, y_centered):
    """CKA with target vector."""
    yy = np.outer(y_centered, y_centered)
    yy_centered = center_kernel(yy)
    numerator = np.sum(K_centered * yy_centered)
    denom = np.linalg.norm(K_centered, 'fro') * np.linalg.norm(yy_centered, 'fro')
    return numerator / denom if denom > 0 else 0

def optimize_rbf_gamma_alignment(X, y, gamma_range=(1e-4, 1e2)):
    """
    Find optimal RBF gamma by maximizing alignment.
    """
    sq_dists = cdist(X, X, 'sqeuclidean')

    def neg_alignment(log_gamma):
        gamma = np.exp(log_gamma)
        K = np.exp(-gamma * sq_dists)
        K_centered = center_kernel(K)
        y_centered = y - y.mean()
        return -cka_target(K_centered, y_centered)

    result = minimize_scalar(
        neg_alignment,
        bounds=(np.log(gamma_range[0]), np.log(gamma_range[1])),
        method='bounded'
    )

    optimal_gamma = np.exp(result.x)
    optimal_alignment = -result.fun
    return optimal_gamma, optimal_alignment

# Example
np.random.seed(42)
n = 100
X = np.random.randn(n, 10)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])  # Labels depend on first 2 features

optimal_gamma, opt_alignment = optimize_rbf_gamma_alignment(X, y)
print(f"Optimal RBF γ: {optimal_gamma:.4f}")
print(f"Optimal alignment: {opt_alignment:.4f}")
```
Centered Kernel Alignment has found important applications in analyzing deep neural networks, particularly for comparing learned representations.
Representation Similarity Analysis:
Given two neural network layers with activations $X \in \mathbb{R}^{n \times p_1}$ and $Y \in \mathbb{R}^{n \times p_2}$ on $n$ examples, CKA measures how similar the representations are:
$$\text{CKA}(K_X, K_Y)$$
where $K_X = XX^\top$ and $K_Y = YY^\top$ are linear kernel matrices.
Key Findings (Kornblith et al., 2019):
CKA is invariant to orthogonal transformations and isotropic scaling, making it robust to irrelevant differences between network representations. Two layers that encode the same information in different bases will have high CKA.
Applications in Deep Learning: CKA is used to compare learned representations across layers and across networks, as in the layer-similarity example below.
Mini-batch CKA:
For large networks, computing full $n \times n$ kernel matrices is expensive. Mini-batch approximation:
$$\text{CKA}_{\text{approx}} = \frac{1}{B} \sum_{b=1}^{B} \text{CKA}(K_X^{(b)}, K_Y^{(b)})$$
where each $K^{(b)}$ is computed on a mini-batch.
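A minimal sketch of this batching scheme for linear CKA (the function name `minibatch_cka`, the batch size, and the random batching are illustrative assumptions, not a prescribed implementation):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shapes (n, p1) and (n, p2)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic_xy = np.sum((X.T @ Y) ** 2)
    hsic_xx = np.sum((X.T @ X) ** 2)
    hsic_yy = np.sum((Y.T @ Y) ** 2)
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)

def minibatch_cka(X, Y, batch_size=64, seed=0):
    """Average per-batch CKA over random mini-batches.

    This is an approximation of full-batch CKA; it becomes exact in the
    full-batch limit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    scores = []
    for start in range(0, len(X) - batch_size + 1, batch_size):
        batch = idx[start:start + batch_size]
        scores.append(linear_cka(X[batch], Y[batch]))
    return float(np.mean(scores))

# Usage sketch: compare the mini-batch estimate with the full-batch value
rng = np.random.default_rng(42)
X = rng.standard_normal((512, 128))
Y = X @ rng.standard_normal((128, 64)) + 0.1 * rng.standard_normal((512, 64))
print(minibatch_cka(X, Y), linear_cka(X, Y))
```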
```python
import numpy as np

def linear_cka(X, Y):
    """
    Compute linear CKA between two sets of activations.

    Parameters:
    -----------
    X : activations from layer 1 (n_samples, n_features_1)
    Y : activations from layer 2 (n_samples, n_features_2)

    Returns:
    --------
    cka : centered kernel alignment value
    """
    # Center the activations
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Compute HSIC numerator and denominators efficiently
    # Using: tr(XX^T @ YY^T) = ||X.T @ Y||_F^2
    XtY = X.T @ Y
    XtX = X.T @ X
    YtY = Y.T @ Y

    hsic_xy = np.sum(XtY ** 2)
    hsic_xx = np.sum(XtX ** 2)
    hsic_yy = np.sum(YtY ** 2)

    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)

def rbf_cka(X, Y, sigma=None):
    """
    Compute RBF CKA (nonlinear similarity measure).
    """
    def rbf_kernel(Z, sigma):
        sq_dists = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=-1)
        if sigma is None:
            sigma = np.median(sq_dists[sq_dists > 0]) ** 0.5
        return np.exp(-sq_dists / (2 * sigma ** 2))

    K_X = rbf_kernel(X, sigma)
    K_Y = rbf_kernel(Y, sigma)

    # Center
    K_X = K_X - K_X.mean(0) - K_X.mean(1, keepdims=True) + K_X.mean()
    K_Y = K_Y - K_Y.mean(0) - K_Y.mean(1, keepdims=True) + K_Y.mean()

    numerator = np.sum(K_X * K_Y)
    denominator = np.sqrt(np.sum(K_X ** 2) * np.sum(K_Y ** 2))
    return numerator / denominator

def layer_similarity_matrix(activations_list):
    """
    Compute CKA similarity matrix between all pairs of layers.

    Parameters:
    -----------
    activations_list : list of activation matrices

    Returns:
    --------
    sim_matrix : (n_layers, n_layers) CKA similarity matrix
    """
    n_layers = len(activations_list)
    sim_matrix = np.zeros((n_layers, n_layers))

    for i in range(n_layers):
        for j in range(i, n_layers):
            cka = linear_cka(activations_list[i], activations_list[j])
            sim_matrix[i, j] = cka
            sim_matrix[j, i] = cka

    return sim_matrix

# Example: Compare representations
np.random.seed(42)
n_samples = 200

# Simulate two networks with similar representations
layer1_net1 = np.random.randn(n_samples, 256)
layer2_net1 = layer1_net1 @ np.random.randn(256, 128) + 0.1 * np.random.randn(n_samples, 128)

# Network 2: similar to network 1 but rotated
rotation = np.linalg.qr(np.random.randn(128, 128))[0]
layer2_net2 = layer2_net1 @ rotation  # Should have high CKA

# Network 3: independent
layer2_net3 = np.random.randn(n_samples, 128)  # Should have low CKA

print(f"CKA(net1, net2): {linear_cka(layer2_net1, layer2_net2):.4f}")  # High
print(f"CKA(net1, net3): {linear_cka(layer2_net1, layer2_net3):.4f}")  # Low
```
You now understand kernel alignment theory and applications. Next, we'll explore universal kernels: kernels with the theoretical power to approximate any continuous function.