Centering is often treated as a trivial preprocessing step in standard PCA: subtract the mean from each feature, and proceed. In Kernel PCA, centering becomes far more subtle. We cannot compute the mean in feature space explicitly—we don't have coordinates in that space. Yet centering is absolutely essential: without it, the principal components we compute would be directions from the origin of feature space rather than from the centroid of the data.
Why does this matter? Because the origin in feature space is an arbitrary reference point with no intrinsic meaning, while the centroid represents the "center" of the data distribution. Principal components are fundamentally about directions of maximum variance—deviation from the mean—not about directions maximizing distance from the origin.
This page explores centering in depth. We'll develop geometric intuition for what centering accomplishes, derive the double-centering formula carefully, understand what goes wrong without proper centering, and learn diagnostic techniques to verify correct implementation. Mastering these details is essential for robust Kernel PCA applications.
By the end of this page, you will understand geometrically what centering means in feature space, why the centroid cannot be computed explicitly, how double centering handles this implicitly, what failure modes arise without centering, and how to diagnose centering issues in practice.
Before examining feature space centering, let's establish precisely why centering is essential for PCA in general.
PCA and the Covariance Matrix
PCA finds eigenvectors of the covariance matrix: $$\mathbf{C} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
This is fundamentally a second-moment matrix about the mean. If we instead used the matrix: $$\mathbf{M} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^T$$
(second moment about the origin), we would get different results—potentially very different.
The Geometric Interpretation
Consider a 2D dataset: points clustered around $(100, 0)$ with small variance in both directions. The true principal components should capture the directions of variance within the cluster (small scale). Computing moments about the origin instead tells a very different story.
Without centering, PCA finds directions maximizing the second moment about zero, which mixes data location with data spread. This is almost never what we want. The effect is most severe when the data is far from the origin: the first "principal component" then simply points at the data centroid.
Mathematical Relationship
The second moment matrix $\mathbf{M}$ relates to the covariance matrix $\mathbf{C}$ by: $$\mathbf{M} = \mathbf{C} + \bar{\mathbf{x}}\bar{\mathbf{x}}^T$$
The term $\bar{\mathbf{x}}\bar{\mathbf{x}}^T$ is a rank-1 matrix whose only nonzero eigendirection points from the origin to the mean, with eigenvalue $\|\bar{\mathbf{x}}\|^2$. If $\|\bar{\mathbf{x}}\|^2$ is large compared to the eigenvalues of $\mathbf{C}$, this rank-1 term dominates and the first eigenvector of $\mathbf{M}$ is approximately $\bar{\mathbf{x}}/\|\bar{\mathbf{x}}\|$.
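The following minimal sketch (the cluster spread and sample size are illustrative choices) makes this dominance concrete: for the $(100, 0)$ cluster described above, the top eigenvector of $\mathbf{C}$ follows the spread while the top eigenvector of $\mathbf{M}$ points at the centroid.

```python
import numpy as np

np.random.seed(0)
# Cluster centered at (100, 0) with small, anisotropic spread
X = np.array([100.0, 0.0]) + np.random.randn(500, 2) * np.array([0.5, 2.0])

C = np.cov(X, rowvar=False, bias=True)   # second moment about the mean
M = (X.T @ X) / len(X)                   # second moment about the origin

# np.linalg.eigh returns eigenvalues in ascending order; take the last column
top_C = np.linalg.eigh(C)[1][:, -1]
top_M = np.linalg.eigh(M)[1][:, -1]

print("Top eigenvector of C:", top_C.round(3))  # aligned with (0, 1): the spread direction
print("Top eigenvector of M:", top_M.round(3))  # aligned with (1, 0): points at the centroid
```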
In Feature Space, the Problem is Worse
In the feature space $\mathcal{F}$ induced by a kernel, the origin has even less meaning. For the RBF kernel, the feature space is infinite-dimensional, and the origin is just a mathematical abstraction. The data will almost certainly not be centered at this origin. Without explicit centering in feature space, Kernel PCA degrades into a method that mixes location effects with variance effects.
| Scenario | With Centering | Without Centering |
|---|---|---|
| Data centered at origin | Correct variance directions | Correct (identical results) |
| Data at $(100, 0)$, small variance | Finds small-scale variance directions | PC1 ≈ (1, 0) toward data location |
| Clustered data, off-center | Captures cluster structure | Dominated by mean location |
| Feature space (RBF kernel) | Proper nonlinear structure | Corrupted by arbitrary origin |
| Multiple clusters, all off-center | Captures inter/intra-cluster variance | PC1 points at grand mean |
The mean (centroid) in feature space is: $$\bar{\phi} = \frac{1}{n}\sum_{i=1}^{n}\phi(\mathbf{x}_i)$$
This is a well-defined element of $\mathcal{F}$—it's the average of all mapped training points. But there's a fundamental problem: we can't compute or store this vector explicitly when $\mathcal{F}$ is high or infinite-dimensional.
The Implicit Representation
Although we cannot represent $\bar{\phi}$ as an explicit vector of coordinates, we can work with it implicitly through inner products. Any quantity involving $\bar{\phi}$ can be rewritten in terms of kernel evaluations:
$$\langle \bar{\phi}, \phi(\mathbf{x}) \rangle = \left\langle \frac{1}{n}\sum_j \phi(\mathbf{x}_j), \phi(\mathbf{x}) \right\rangle = \frac{1}{n}\sum_j \langle \phi(\mathbf{x}_j), \phi(\mathbf{x}) \rangle = \frac{1}{n}\sum_j k(\mathbf{x}_j, \mathbf{x})$$
Similarly: $$|\bar{\phi}|^2 = \langle \bar{\phi}, \bar{\phi} \rangle = \frac{1}{n^2}\sum_i \sum_j k(\mathbf{x}_i, \mathbf{x}_j)$$
Every computation involving the centroid reduces to sums of kernel values.
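As a small illustration (the RBF kernel and random data here are arbitrary choices), the two quantities above are simply a row mean and the grand mean of the kernel matrix:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(50, 3)

# RBF kernel matrix (gamma chosen arbitrarily for illustration)
gamma = 0.5
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-gamma * sq_dists)

# <phi_bar, phi(x_i)> for every training point: the mean of row i of K
inner_with_centroid = K.mean(axis=1)

# ||phi_bar||^2: the grand mean of K
centroid_sq_norm = K.mean()

print("First few <phi_bar, phi(x_i)>:", inner_with_centroid[:3].round(4))
print("||phi_bar||^2:", round(centroid_sq_norm, 4))
```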
We never need explicit coordinates of $\bar{\phi}$. We only need inner products involving $\bar{\phi}$, and these can always be computed from the kernel matrix. This is the essence of the kernel trick applied to centering.
Geometric Picture
Imagine the feature space $\mathcal{F}$ as a high-dimensional vector space. The mapped training points ${\phi(\mathbf{x}_1), \ldots, \phi(\mathbf{x}_n)}$ form a cloud in this space. The centroid $\bar{\phi}$ is the center of mass of this cloud.
Centering means computing: $$\tilde{\phi}(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \bar{\phi}$$
Geometrically, this translates the entire point cloud so that its center of mass sits at the origin of $\mathcal{F}$. After centering, the mapped data has zero mean, and inner products between centered points measure deviations from the centroid rather than positions relative to an arbitrary origin.
What We're Really Computing
The centered point $\tilde{\phi}(\mathbf{x}_i)$ is the "residual" after subtracting the average. It captures how $\phi(\mathbf{x}_i)$ differs from the typical mapped point. PCA on these residuals finds the principal directions of variation—exactly what we want.
For Test Points
When projecting a new point $\mathbf{x}^*$, we compute: $$\tilde{\phi}(\mathbf{x}^*) = \phi(\mathbf{x}^*) - \bar{\phi}$$
Critically, we subtract the training centroid $\bar{\phi}$, not a new centroid involving the test point. The training set defines the coordinate system; test points are transformed using that same system.
We need to compute the centered kernel matrix: $$\tilde{K}_{ij} = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}_j) \rangle = \langle \phi(\mathbf{x}_i) - \bar{\phi}, \phi(\mathbf{x}_j) - \bar{\phi} \rangle$$
Let's expand this inner product carefully:
$$\tilde{K}_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle - \langle \phi(\mathbf{x}_i), \bar{\phi} \rangle - \langle \bar{\phi}, \phi(\mathbf{x}_j) \rangle + \langle \bar{\phi}, \bar{\phi} \rangle$$
Term 1: $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K_{ij}$
Term 2: $\langle \phi(\mathbf{x}_i), \bar{\phi} \rangle = \frac{1}{n}\sum_m \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_m) \rangle = \frac{1}{n}\sum_m K_{im}$
This is the mean of the $i$-th row of $\mathbf{K}$.
Term 3: $\langle \bar{\phi}, \phi(\mathbf{x}_j) \rangle = \frac{1}{n}\sum_m K_{mj}$
This is the mean of the $j$-th column of $\mathbf{K}$.
Term 4: $\langle \bar{\phi}, \bar{\phi} \rangle = \frac{1}{n^2}\sum_m \sum_\ell K_{m\ell}$
This is the grand mean of $\mathbf{K}$.
$$\tilde{K}_{ij} = K_{ij} - \frac{1}{n}\sum_m K_{im} - \frac{1}{n}\sum_m K_{mj} + \frac{1}{n^2}\sum_m \sum_\ell K_{m\ell}$$
This subtracts the row means and the column means, then adds back the grand mean (which would otherwise be subtracted twice).
Matrix Formulation
Define the centering matrix: $$\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T$$
where $\mathbf{1}_n$ is the $n$-vector of ones. The matrix $\mathbf{H}$ is symmetric and idempotent ($\mathbf{H}^2 = \mathbf{H}$): it is the orthogonal projection onto the subspace orthogonal to $\mathbf{1}_n$, and applying it to a vector subtracts that vector's mean from every entry.
The centered kernel matrix is: $$\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H}$$
Verification: expanding gives $\mathbf{H}\mathbf{K}\mathbf{H} = \mathbf{K} - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\mathbf{K} - \frac{1}{n}\mathbf{K}\mathbf{1}_n\mathbf{1}_n^T + \frac{1}{n^2}\mathbf{1}_n\mathbf{1}_n^T\mathbf{K}\mathbf{1}_n\mathbf{1}_n^T$, whose $(i,j)$ entry is $K_{ij} - \frac{1}{n}\sum_m K_{mj} - \frac{1}{n}\sum_m K_{im} + \frac{1}{n^2}\sum_{m,\ell} K_{m\ell}$, exactly the element-wise formula above.
Why "Double" Centering?
We multiply by $\mathbf{H}$ on both sides, hence "double" centering. In standard PCA we center the data matrix once, subtracting the mean vector from every sample (equivalently, centering each feature column). Here the kernel matrix contains inner products rather than raw features, so both its rows and its columns must be centered, which requires the two applications of $\mathbf{H}$.
Connection to Classical MDS
Double centering also appears in classical Multidimensional Scaling (MDS). If $\mathbf{D}$ is a squared distance matrix and we apply $-\frac{1}{2}\mathbf{H}\mathbf{D}\mathbf{H}$, we get a matrix whose eigendecomposition gives coordinates in Euclidean space. This connection is deep: kernel matrices and distance matrices are closely related, and centering plays analogous roles in both.
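A quick numerical check of this connection (a sketch with arbitrary random data): for squared Euclidean distances, $-\frac{1}{2}\mathbf{H}\mathbf{D}\mathbf{H}$ coincides with the double-centered linear kernel matrix $\mathbf{H}\mathbf{X}\mathbf{X}^T\mathbf{H}$.

```python
import numpy as np

np.random.seed(1)
n = 40
X = np.random.randn(n, 4)

# Squared Euclidean distance matrix and linear-kernel (Gram) matrix
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = X @ X.T

H = np.eye(n) - np.ones((n, n)) / n      # centering matrix

B = -0.5 * H @ D @ H                     # classical MDS double centering of distances
K_centered = H @ K @ H                   # double-centered linear kernel

print("Max difference:", np.abs(B - K_centered).max())  # numerically ~0
```

The same double-centering operator $\mathbf{H}$ thus underlies both classical MDS and Kernel PCA.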
```python
import numpy as np

def center_kernel_matrix(K: np.ndarray) -> np.ndarray:
    """Center kernel matrix using double centering."""
    n = K.shape[0]
    # Compute means
    row_means = K.mean(axis=1, keepdims=True)
    col_means = K.mean(axis=0, keepdims=True)
    grand_mean = K.mean()
    # Apply centering formula
    K_centered = K - row_means - col_means + grand_mean
    return K_centered

def center_kernel_matrix_explicit(K: np.ndarray) -> np.ndarray:
    """Center using explicit H @ K @ H formula."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def verify_centering_properties(K_centered: np.ndarray, tol: float = 1e-10):
    """Verify that the centered matrix has expected properties."""
    n = K_centered.shape[0]

    # Row sums should be zero
    row_sums = K_centered.sum(axis=1)
    row_sum_check = np.abs(row_sums).max() < tol

    # Column sums should be zero
    col_sums = K_centered.sum(axis=0)
    col_sum_check = np.abs(col_sums).max() < tol

    # Matrix should be symmetric
    symmetry_check = np.allclose(K_centered, K_centered.T, atol=tol)

    # Matrix should be positive semi-definite (all eigenvalues >= 0)
    eigenvalues = np.linalg.eigvalsh(K_centered)
    psd_check = eigenvalues.min() >= -tol

    print(f"Row sums zero: {row_sum_check}")
    print(f"Column sums zero: {col_sum_check}")
    print(f"Symmetric: {symmetry_check}")
    print(f"PSD: {psd_check} (min eigenvalue: {eigenvalues.min():.2e})")

    return all([row_sum_check, col_sum_check, symmetry_check, psd_check])

# Example with RBF kernel
np.random.seed(42)
n = 100
X = np.random.randn(n, 5)

# Compute RBF kernel
gamma = 0.5
dist_sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-gamma * dist_sq)

print("Original kernel matrix:")
print(f"  Shape: {K.shape}")
print(f"  Row sums range: [{K.sum(axis=1).min():.2f}, {K.sum(axis=1).max():.2f}]")
print()

K_centered = center_kernel_matrix(K)

print("Centered kernel matrix:")
verify_centering_properties(K_centered)
print()

# Verify both methods give same result
K_centered_explicit = center_kernel_matrix_explicit(K)
print(f"Methods equivalent: {np.allclose(K_centered, K_centered_explicit)}")
```

Omitting the centering step is a common implementation error. Let's understand precisely what happens and why it's problematic.
The Uncentered Eigenvalue Problem
Without centering, we solve: $$\mathbf{K}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$$
rather than: $$\tilde{\mathbf{K}}\boldsymbol{\alpha} = \mu\boldsymbol{\alpha}$$
The eigenvectors we find correspond to principal components in a space where the data is not centered—meaning they're not true principal components at all.
Mathematical Analysis
Recall the relationship between centered and uncentered second moments: $$\mathbf{C}_{\mathcal{F}} = \mathbf{M}_{\mathcal{F}} - \bar{\phi}\bar{\phi}^T$$
where $\mathbf{M}_{\mathcal{F}} = \frac{1}{n}\sum_i \phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T$ is the uncentered second moment.
The uncentered kernel matrix $\mathbf{K}$ corresponds to the uncentered second moment in feature space: $\mathbf{K}$ and $n\mathbf{M}_{\mathcal{F}}$ share the same nonzero eigenvalues. Its leading eigenvector is therefore pulled toward $\bar{\phi}$: the component mixes variance with the "mass" of the centroid rather than capturing pure variance.
Severity Depends on Data Location
The effect is most severe when the centroid is far from the feature-space origin relative to the data's spread, that is, when $\|\bar{\phi}\|^2$ is large compared to the eigenvalues of the feature-space covariance.
With the RBF kernel, every mapped point has unit norm ($k(\mathbf{x}, \mathbf{x}) = 1$), so the data lies on the unit sphere in $\mathcal{F}$, at distance 1 from the origin. Moreover, all kernel values are positive, so $\|\bar{\phi}\|^2 = \frac{1}{n^2}\sum_{i,j} K_{ij} > 0$ and the centroid never coincides with the origin. Centering is therefore always necessary.
If your first kernel principal component appears to capture little useful structure—representing a "global offset" rather than meaningful variation—suspect a centering problem. The first component may simply point at the centroid in feature space rather than capturing variance.
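A quick way to spot this symptom (a sketch on arbitrary random data with an illustrative RBF bandwidth): the leading eigenvector of an uncentered RBF kernel matrix has entries that all share one sign and vary little, so the first "component" behaves like a global offset.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(80, 2)

# Uncentered RBF kernel matrix (gamma chosen arbitrarily)
gamma = 0.05
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-gamma * sq)

# Leading eigenvector of the UNCENTERED kernel matrix
eigenvalues, eigenvectors = np.linalg.eigh(K)
v_top = eigenvectors[:, -1]

print("All entries share one sign:", bool(np.all(v_top > 0) or np.all(v_top < 0)))
print("Relative spread std/|mean|:", float(np.std(v_top) / abs(v_top.mean())))
# A same-sign, low-spread eigenvector is the "global offset" symptom described above
```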
A Concrete Example
Consider data forming two clusters in 2D, both far from the origin. With an RBF kernel and proper centering, the leading kernel principal components capture the genuine structure: the first component typically separates the two clusters, and later components describe variation within them. Without centering, the leading component instead reflects each point's overall similarity to the feature-space centroid, a nearly constant "offset" direction, and the cluster structure is pushed into later, weaker components.
Partial Centering Errors
Another failure mode: centering in input space but not in feature space. If you subtract $\bar{\mathbf{x}}$ from each $\mathbf{x}_i$ and then apply the kernel, you have not centered in feature space, because the feature map is nonlinear: $$\phi(\mathbf{x} - \bar{\mathbf{x}}) \neq \phi(\mathbf{x}) - \phi(\bar{\mathbf{x}})$$
The mean in feature space is $\bar{\phi} = \frac{1}{n}\sum_i \phi(\mathbf{x}_i)$, which is generally not equal to $\phi(\bar{\mathbf{x}})$.
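A small sketch of this failure mode (arbitrary data, illustrative RBF bandwidth): centering the inputs and then building the kernel still leaves the feature-space data uncentered, which shows up as row sums that are far from zero.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(60, 3) + 5.0                 # off-center data

def rbf(X, gamma=0.5):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-gamma * sq)

# "Centering" in input space only, then building the kernel
K_input_centered = rbf(X - X.mean(axis=0))

# Proper double centering of the kernel matrix itself
K = rbf(X)
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n
K_feature_centered = H @ K @ H

# For the RBF kernel, input-space centering changes nothing at all
# (the kernel is translation-invariant), so the rows still do not sum to zero.
print("Max |row sum|, input-space centering:  ", np.abs(K_input_centered.sum(axis=1)).max())
print("Max |row sum|, feature-space centering:", np.abs(K_feature_centered.sum(axis=1)).max())
```

The longer example below runs Kernel PCA end to end with and without kernel-matrix centering to show how much the resulting components differ.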
```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(X, Y, gamma=0.1):
    """RBF kernel function."""
    X_sq = np.sum(X**2, axis=1, keepdims=True)
    Y_sq = np.sum(Y**2, axis=1, keepdims=True)
    dist_sq = X_sq + Y_sq.T - 2 * X @ Y.T
    return np.exp(-gamma * dist_sq)

def kpca_with_centering(K, n_components):
    """Kernel PCA with proper centering."""
    # Center the kernel matrix
    n = K.shape[0]
    row_means = K.mean(axis=1, keepdims=True)
    col_means = K.mean(axis=0, keepdims=True)
    grand_mean = K.mean()
    K_centered = K - row_means - col_means + grand_mean

    # Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx][:n_components]
    eigenvectors = eigenvectors[:, idx][:, :n_components]

    # Normalize and project
    valid = eigenvalues > 1e-10
    alphas = eigenvectors[:, valid] / np.sqrt(eigenvalues[valid])
    Z = K_centered @ alphas
    return Z, eigenvalues

def kpca_without_centering(K, n_components):
    """INCORRECT: Kernel PCA without centering."""
    # Skip centering (this is wrong!)
    eigenvalues, eigenvectors = np.linalg.eigh(K)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx][:n_components]
    eigenvectors = eigenvectors[:, idx][:, :n_components]

    valid = eigenvalues > 1e-10
    alphas = eigenvectors[:, valid] / np.sqrt(eigenvalues[valid])
    Z = K @ alphas
    return Z, eigenvalues

# Create two circles FAR from origin to highlight centering importance
np.random.seed(42)
n_per_class = 100
theta = np.linspace(0, 2*np.pi, n_per_class)

# Inner circle at (100, 100)
r1 = 1.0
X1 = np.column_stack([100 + r1*np.cos(theta) + 0.1*np.random.randn(n_per_class),
                      100 + r1*np.sin(theta) + 0.1*np.random.randn(n_per_class)])

# Outer circle at (100, 100)
r2 = 2.0
X2 = np.column_stack([100 + r2*np.cos(theta) + 0.1*np.random.randn(n_per_class),
                      100 + r2*np.sin(theta) + 0.1*np.random.randn(n_per_class)])

X = np.vstack([X1, X2])
y = np.array([0]*n_per_class + [1]*n_per_class)

# Compute kernel
K = rbf_kernel(X, X, gamma=0.5)

# Compare with/without centering
Z_centered, eig_centered = kpca_with_centering(K, 2)
Z_uncentered, eig_uncentered = kpca_without_centering(K, 2)

print("Top eigenvalues with centering:   ", eig_centered.round(2))
print("Top eigenvalues without centering:", eig_uncentered.round(2))

# The difference is dramatic - a plot would show proper separation
# only with centering
```

Projecting new test points requires careful centering using the training statistics. This distinction is crucial and a common source of bugs.
The Problem Setup
We have: a trained Kernel PCA model (the training kernel matrix $\mathbf{K}$, its centered eigenvectors $\boldsymbol{\alpha}^{(k)}$ with eigenvalues $\lambda_k$, and the training points $\mathbf{x}_1, \ldots, \mathbf{x}_n$) and a new point $\mathbf{x}^*$ that we want to project onto the learned components.
We need to compute: $$z_k(\mathbf{x}^*) = \langle \mathbf{v}_k, \tilde{\phi}(\mathbf{x}^*) \rangle = \sum_i \alpha_i^{(k)} \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}^*) \rangle$$
where $\tilde{\phi}(\mathbf{x}^*) = \phi(\mathbf{x}^*) - \bar{\phi}_{\text{train}}$.
The Centered Test Kernel Formula
For the test point, we need: $$\tilde{k}_i^* = \langle \tilde{\phi}(\mathbf{x}_i), \tilde{\phi}(\mathbf{x}^*) \rangle$$
Expanding: $$\tilde{k}_i^* = \langle \phi(\mathbf{x}_i) - \bar{\phi}, \phi(\mathbf{x}^*) - \bar{\phi} \rangle = k(\mathbf{x}_i, \mathbf{x}^*) - \frac{1}{n}\sum_m k(\mathbf{x}_m, \mathbf{x}^*) - \frac{1}{n}\sum_m k(\mathbf{x}_i, \mathbf{x}_m) + \frac{1}{n^2}\sum_m\sum_\ell k(\mathbf{x}_m, \mathbf{x}_\ell)$$
Let $\mathbf{k}^* = [k(\mathbf{x}_1, \mathbf{x}^*), \ldots, k(\mathbf{x}_n, \mathbf{x}^*)]^T$. The centered version is:
$$\tilde{\mathbf{k}}^* = \mathbf{k}^* - \mathbf{K}\mathbf{1}/n - (\mathbf{1}^T\mathbf{k}^*/n)\mathbf{1} + (\mathbf{1}^T\mathbf{K}\mathbf{1}/n^2)\mathbf{1}$$
Interpreting Each Term
In the formula above, $\mathbf{k}^*$ holds the raw kernel values between the test point and each training point; $\mathbf{K}\mathbf{1}/n$ is the vector of training row means (the inner products $\langle \phi(\mathbf{x}_i), \bar{\phi} \rangle$); $(\mathbf{1}^T\mathbf{k}^*/n)\mathbf{1}$ removes the test point's average similarity to the training set ($\langle \bar{\phi}, \phi(\mathbf{x}^*) \rangle$); and $(\mathbf{1}^T\mathbf{K}\mathbf{1}/n^2)\mathbf{1}$ adds back the grand mean, $\|\bar{\phi}\|^2$. A minimal sketch of the formula follows below.
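Here is a minimal sketch of the vector formula for a single test point (the helper name `center_test_kernel_vector` is illustrative, not a library function); the class-based implementation further below applies the same formula to a whole batch of test points:

```python
import numpy as np

def center_test_kernel_vector(K, k_star):
    """Center the kernel vector of one test point against the training set.

    K:      (n, n) uncentered training kernel matrix
    k_star: (n,)   kernel values k(x_i, x*) for all training points x_i
    """
    n = K.shape[0]
    ones = np.ones(n)
    return (k_star
            - K @ ones / n                       # training row means:  K 1 / n
            - (ones @ k_star / n) * ones         # test point's mean:   (1^T k* / n) 1
            + (ones @ K @ ones / n**2) * ones)   # grand mean of K:     (1^T K 1 / n^2) 1

# Tiny usage example with a linear kernel
np.random.seed(0)
X_train = np.random.randn(30, 4)
x_star = np.random.randn(4)
K = X_train @ X_train.T
k_star = X_train @ x_star
print(center_test_kernel_vector(K, k_star)[:3].round(3))
```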
What Must Be Stored
To project test points, we must store during training: the training data (or enough of it to evaluate $k(\mathbf{x}_i, \mathbf{x}^*)$), the eigenvectors $\boldsymbol{\alpha}^{(k)}$ and eigenvalues $\lambda_k$ of the centered kernel matrix, the vector of training row means $\mathbf{K}\mathbf{1}/n$, and the grand mean $\mathbf{1}^T\mathbf{K}\mathbf{1}/n^2$.
Common Errors
Typical mistakes include centering the test kernel vector with statistics recomputed from the test data (the training set alone defines the centering), forgetting to center $\mathbf{k}^*$ at all while projecting onto eigenvectors of the centered training matrix, and centering the inputs instead of the kernel, as discussed above.
```python
import numpy as np

class CenteringStatistics:
    """Stores centering statistics from training for use at test time."""

    def __init__(self, K_train: np.ndarray):
        """
        Compute and store centering statistics from training kernel matrix.

        Parameters:
            K_train: Training kernel matrix of shape (n, n)
        """
        self.n_train = K_train.shape[0]
        # Row means (same as column means since K is symmetric)
        self.row_means = K_train.mean(axis=0)  # Shape: (n,)
        # Grand mean
        self.grand_mean = K_train.mean()

    def center_test_kernel(self, K_test: np.ndarray) -> np.ndarray:
        """
        Center kernel values for test points using training statistics.

        Parameters:
            K_test: Kernel matrix between test and training points
                    Shape: (n_test, n_train)

        Returns:
            K_test_centered: Centered kernel matrix, shape (n_test, n_train)
        """
        n_test = K_test.shape[0]
        # Mean of test-train kernel values for each test point
        test_means = K_test.mean(axis=1, keepdims=True)  # (n_test, 1)

        # Apply centering formula
        K_test_centered = (K_test
                           - self.row_means.reshape(1, -1)  # Subtract training row means
                           - test_means                     # Subtract test row means
                           + self.grand_mean)               # Add back grand mean
        return K_test_centered

def verify_test_centering(X_train: np.ndarray, X_test: np.ndarray, kernel: callable):
    """
    Verify test centering by comparing implicit and explicit centering.
    """
    # Training kernel and centering
    K_train = kernel(X_train, X_train)
    stats = CenteringStatistics(K_train)

    # Test-train kernel
    K_test = kernel(X_test, X_train)  # (n_test, n_train)
    K_test_centered = stats.center_test_kernel(K_test)

    # Verify: after centering, projections should be consistent
    # The rows of K_test_centered should have mean close to
    # the corresponding terms from the exact centering formula
    print("Test centering verification:")
    print(f"  K_test shape: {K_test.shape}")
    print(f"  K_test_centered row sums: min={K_test_centered.sum(axis=1).min():.4f}, "
          f"max={K_test_centered.sum(axis=1).max():.4f}")

    # These should NOT be exactly zero (unlike training matrix centering)
    # because test points are not included in defining the centering
    return K_test_centered

# Example
np.random.seed(42)
n_train, n_test = 100, 20
X_train = np.random.randn(n_train, 5)
X_test = np.random.randn(n_test, 5)

def rbf_kernel(X, Y, gamma=0.1):
    X_sq = np.sum(X**2, axis=1, keepdims=True)
    Y_sq = np.sum(Y**2, axis=1, keepdims=True)
    dist_sq = X_sq + Y_sq.T - 2 * X @ Y.T
    return np.exp(-gamma * dist_sq)

K_centered = verify_test_centering(X_train, X_test, rbf_kernel)
```

How can you verify that centering is implemented correctly? Several diagnostic checks can reveal problems.
Check 1: Row and Column Sums
For a properly centered training kernel matrix $\tilde{\mathbf{K}}$: $$\sum_j \tilde{K}_{ij} = 0 \quad \text{for all } i, \qquad \sum_i \tilde{K}_{ij} = 0 \quad \text{for all } j$$
Row and column sums should be zero (up to numerical precision). If they're not, centering is wrong.
Check 2: Symmetry Preservation
The centered kernel matrix should remain symmetric: $$\tilde{\mathbf{K}} = \tilde{\mathbf{K}}^T$$
Some centering implementations can accidentally break symmetry through numerical errors.
Check 3: Positive Semi-Definiteness
The centered kernel matrix should be positive semi-definite (all eigenvalues ≥ 0). Negative eigenvalues beyond numerical precision indicate problems.
Check 4: Eigenvalue Spectrum
The top eigenvalue of an uncentered kernel matrix is often much larger than subsequent eigenvalues (it captures the "constant" direction toward the centroid). After centering, eigenvalue decay should reflect genuine variance structure.
If your first kernel principal component seems to capture a 'constant offset' rather than meaningful variation—with all samples having similar projections of the same sign—suspect a centering issue. Properly centered KPCA components have zero mean projection on training data.
Check 5: Zero-Mean Projections
After projecting the training data onto kernel principal components, the projections should have zero mean: $$\frac{1}{n}\sum_i z_k^{(i)} = 0 \quad \text{for all } k$$
This follows because $\mathbf{1}^T\tilde{\mathbf{K}} = \mathbf{0}$, so for any eigenvector with $\lambda_k \neq 0$ we have $\lambda_k \mathbf{1}^T\boldsymbol{\alpha}^{(k)} = \mathbf{1}^T\tilde{\mathbf{K}}\boldsymbol{\alpha}^{(k)} = 0$; the training projections are proportional to the entries of $\boldsymbol{\alpha}^{(k)}$ and therefore sum to zero.
Check 6: Agreement with Linear PCA (Linear Kernel)
For the linear kernel $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$, Kernel PCA should reproduce standard PCA (up to sign flips of individual components). Compare results with scikit-learn's PCA or your own linear PCA implementation. Any discrepancy beyond sign indicates an implementation bug.
Check 7: Consistency Across Reformulations
The element-wise formula and matrix formula ($\tilde{\mathbf{K}} = \mathbf{H}\mathbf{K}\mathbf{H}$) should give identical results. If they differ, one implementation has a bug.
```python
import numpy as np
from sklearn.decomposition import PCA

def run_centering_diagnostics(K_centered: np.ndarray,
                              K_original: np.ndarray,
                              X: np.ndarray = None) -> dict:
    """
    Run comprehensive diagnostics on centered kernel matrix.

    Parameters:
        K_centered: Centered kernel matrix
        K_original: Original uncentered kernel matrix
        X: Original data (optional, for linear kernel comparison)

    Returns:
        Dictionary of diagnostic results
    """
    results = {}
    tol = 1e-10

    # Check 1: Row sums
    row_sums = K_centered.sum(axis=1)
    results['row_sums_zero'] = np.abs(row_sums).max() < tol
    results['max_row_sum'] = np.abs(row_sums).max()

    # Check 2: Column sums
    col_sums = K_centered.sum(axis=0)
    results['col_sums_zero'] = np.abs(col_sums).max() < tol
    results['max_col_sum'] = np.abs(col_sums).max()

    # Check 3: Symmetry
    results['symmetric'] = np.allclose(K_centered, K_centered.T, atol=tol)
    results['max_asymmetry'] = np.abs(K_centered - K_centered.T).max()

    # Check 4: Positive semi-definiteness
    eigenvalues = np.linalg.eigvalsh(K_centered)
    results['psd'] = eigenvalues.min() >= -tol
    results['min_eigenvalue'] = eigenvalues.min()
    results['negative_eigenvalues'] = (eigenvalues < -tol).sum()

    # Check 5: Eigenvalue spectrum comparison
    eig_orig = np.linalg.eigvalsh(K_original)
    results['top_eigenvalue_ratio'] = (np.max(eig_orig) / np.max(eigenvalues)
                                       if np.max(eigenvalues) > 0 else np.inf)

    # Check 6: Zero-mean projections (on training data)
    valid = eigenvalues > tol
    n = K_centered.shape[0]
    eigenvectors = np.linalg.eigh(K_centered)[1]
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Project using first few components
    n_components = min(5, (eigenvalues > tol).sum())
    if n_components > 0:
        alphas = eigenvectors[:, :n_components] / np.sqrt(eigenvalues[:n_components])
        Z = K_centered @ alphas
        projection_means = Z.mean(axis=0)
        results['projections_zero_mean'] = np.abs(projection_means).max() < tol
        results['max_projection_mean'] = np.abs(projection_means).max()

    return results

def compare_with_linear_pca(X: np.ndarray, n_components: int = 5):
    """
    Compare Kernel PCA (linear kernel) with standard PCA.
    Should give identical results up to sign.
    """
    n = X.shape[0]

    # Standard PCA
    pca = PCA(n_components=n_components)
    Z_pca = pca.fit_transform(X)

    # Kernel PCA with linear kernel
    K = X @ X.T  # Linear kernel

    # Center the kernel
    row_means = K.mean(axis=1, keepdims=True)
    col_means = K.mean(axis=0, keepdims=True)
    grand_mean = K.mean()
    K_centered = K - row_means - col_means + grand_mean

    # Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx][:n_components]
    eigenvectors = eigenvectors[:, idx][:, :n_components]

    # Normalize and project
    alphas = eigenvectors / np.sqrt(np.maximum(eigenvalues, 1e-10))
    Z_kpca = K_centered @ alphas

    # Compare (accounts for sign ambiguity)
    correlations = []
    for k in range(n_components):
        corr = np.abs(np.corrcoef(Z_pca[:, k], Z_kpca[:, k])[0, 1])
        correlations.append(corr)

    print("Correlation between PCA and KPCA (linear kernel) components:")
    for k, corr in enumerate(correlations):
        status = "✓" if corr > 0.99 else "✗"
        print(f"  Component {k+1}: {corr:.6f} {status}")

    return np.mean(correlations) > 0.99

# Run diagnostics
np.random.seed(42)
n = 100
X = np.random.randn(n, 10)
K = X @ X.T  # Linear kernel

# Center
row_means = K.mean(axis=1, keepdims=True)
col_means = K.mean(axis=0, keepdims=True)
grand_mean = K.mean()
K_centered = K - row_means - col_means + grand_mean

print("Centering Diagnostics:")
results = run_centering_diagnostics(K_centered, K)
for key, value in results.items():
    print(f"  {key}: {value}")

print()
compare_with_linear_pca(X)
```

Centering is a crucial step in Kernel PCA that is easy to get wrong. We've developed a deep understanding of what centering means, why it's necessary, how to implement it correctly, and how to verify correct implementation.
You now have a thorough understanding of centering in Kernel PCA—both the mathematical foundation and practical implementation. The next page tackles the pre-image problem: given a point in the reduced kernel PCA space, how do we find a corresponding point in the original input space?
What's Next:
The next page addresses the pre-image problem—one of the most challenging aspects of Kernel PCA. Given a projection in the kernel principal component space, can we find a point in the original input space that projects to that location? This problem has no closed-form solution for nonlinear kernels, requiring iterative approximation methods. Understanding the pre-image problem is essential for applications like denoising and reconstruction.